
Convex Optimization Algorithms

Dimitri P. Bertsekas
Massachusetts Institute of Technology

WWW site for book information and orders

http://www.athenasc.com

Athena Scientific, Belmont, Massachusetts
Athena Scientific
Post Office Box 805
Nashua, NH 03061-0805
U.S.A.

Email: [email protected]
WWW: http://www.athenasc.com

© 2015 Dimitri P. Bertsekas


All rights reserved. No part of this book may be reproduced in any form
by any electronic or mechanical means (including photocopying, recording,
or information storage and retrieval) without permission in writing from
the publisher.

Publisher's Cataloging-in-Publication Data


Bertsekas, Dimitri P.
Convex Optimization Algorithms
Includes bibliographical references and index
1. Nonlinear Programming 2. Mathematical Optimization. I. Title.
T57.8.B475 2015 519.703
Library of Congress Control Number: 2002092168

ISBN-10: 1-886529-28-0, ISBN-13: 978-1-886529-28-1


Contents
1. Convex Optimization Models: An Overview p. 1
1.1. Lagrange Duality . . . . . . . . . . . p. 2
1.1.1. Separable Problems - Decomposition p. 7
1.1.2. Partitioning . . . . . . . . . . . p. 9
1.2. Fenchel Duality and Conic Programming . p. 10
1.2.1. Linear Conic Problems . . . . p. 15
1.2.2. Second Order Cone Programming p. 17
1.2.3. Semidefinite Programming p. 22
1.3. Additive Cost Problems . . p. 25
1.4. Large Number of Constraints . p. 34
1.5. Exact Penalty Functions p. 39
1.6. Notes, Sources, and Exercises p. 47

2. Optimization Algorithms: An Overview . p. 53


2.1. Iterative Descent Algorithms . . . . . . . . . . . . p. 55
2.1.1. Differentiable Cost Function Descent - Unconstrained
Problems p. 58
2.1.2. Constrained Problems - Feasible Direction Methods p. 71
2.1.3. Nondifferentiable Problems - Subgradient Methods p. 78
2.1.4. Alternative Descent Methods . . . . . . . . p. 80
2.1.5. Incremental Algorithms . . . . . . . . . . p. 83
2.1.6. Distributed Asynchronous Iterative Algorithms p. 104
2.2. Approximation Methods . . . . . . . . . . p. 106
2.2.1. Polyhedral Approximation . . . . . . . . p. 107
2.2.2. Penalty, Augmented Lagrangian, and Interior
Point Methods . . . . . . . . . . . . p. 108
2.2.3. Proximal Algorithm, Bundle Methods, and
Tikhonov Regularization . . . . . . . . . p. 110
2.2.4. Alternating Direction Method of Multipliers p. 111
2.2.5. Smoothing of Nondifferentiable Problems p. 113
2.3. Notes, Sources, and Exercises . . . . . . . p. 119

3. Subgradient Methods p. 135


3.1. Subgradients of Convex Real-Valued Functions p. 136


3.1.1. Characterization of the Subdifferential . . p. 146


3.2. Convergence Analysis of Subgradient Methods p. 148
3.3. ε-Subgradient Methods . . . . . . . . . . p. 162
3.3.1. Connection with Incremental Subgradient Methods p. 166
3.4. Notes, Sources, and Exercises . . . . . . . . . . . . p. 167

4. Polyhedral Approximation Methods . . . . p. 181


4.1. Outer Linearization - Cutting Plane Methods p. 182
4.2. Inner Linearization - Simplicial Decomposition p. 188
4.3. Duality of Outer and Inner Linearization . p. 194
4.4. Generalized Polyhedral Approximation p. 196
4.5. Generalized Simplicial Decomposition . . p. 209
4.5.1. Differentiable Cost Case . . . . . . p. 213
4.5.2. Nondifferentiable Cost and Side Constraints p. 213
4.6. Polyhedral Approximation for Conic Programming p. 217
4.7. Notes, Sources, and Exercises . . . . . . . . . . p. 228

5. Proximal Algorithms p. 233


5.1. Basic Theory of Proximal Algorithms p. 234
5.1.1. Convergence . . . . . p. 235
5.1.2. Rate of Convergence . . . . . . p. 239
5.1.3. Gradient Interpretation . . . . p. 246
5.1.4. Fixed Point Interpretation, Overrelaxation,
and Generalization . . . . . . p. 248
5.2. Dual Proximal Algorithms . . . . . . p. 256
5.2.1. Augmented Lagrangian Methods p. 259
5.3. Proximal Algorithms with Linearization p. 268
5.3.1. Proximal Cutting Plane Methods . p. 270
5.3.2. Bundle Methods . . . . . . . . p. 272
5.3.3. Proximal Inner Linearization Methods p. 276
5.4. Alternating Direction Methods of Multipliers p. 280
5.4.1. Applications in Machine Learning . . . p. 286
5.4.2. ADMM Applied to Separable Problems p. 289
5.5. Notes, Sources, and Exercises . p. 293

6. Additional Algorithmic Topics p. 301


6.1. Gradient Projection Methods . . . . . . p. 302
6.2. Gradient Projection with Extrapolation . p. 322
6.2.1. An Algorithm with Optimal Iteration Complexity p. 323
6.2.2. Nondifferentiable Cost - Smoothing . . p. 326
6.3. Proximal Gradient Methods . . . . . . . p. 330
6.4. Incremental Subgradient Proximal Methods p. 340
6.4.1. Convergence for Methods with Cyclic Order p. 344

6.4.2. Convergence for Methods with Randomized Order p. 353


6.4.3. Application in Specially Structured Problems p. 361
6.4.4. Incremental Constraint Projection Methods p. 365
6.5. Coordinate Descent Methods . . . . . . . . . p. 369
6.5.1. Variants of Coordinate Descent . . . . . . p. 373
6.5.2. Distributed Asynchronous Coordinate Descent p. 376
6.6. Generalized Proximal Methods . . . . . . . . . p. 382
6.7. ε-Descent and Extended Monotropic Programming p. 396
6.7.1. ε-Subgradients . . . . . . . . . . . . . p. 397
6.7.2. ε-Descent Method . . . . . . . . . . . . p. 400
6.7.3. Extended Monotropic Programming Duality p. 406
6.7.4. Special Cases of Strong Duality . . . . . . p. 408
6.8. Interior Point Methods . . . . . . . . . . . p. 412
6.8.1. Primal-Dual Methods for Linear Programming p. 416
6.8.2. Interior Point Methods for Conic Programming p. 423
6.8.3. Central Cutting Plane Methods p. 425
6.9. Notes, Sources, and Exercises . . . . p. 426

Appendix A: Mathematical Background p. 443


A.1. Linear Algebra . . . . p. 445
A.2. Topological Properties p. 450
A.3. Derivatives . . . . . p. 456
A.4. Convergence Theorems p. 458

Appendix B: Convex Optimization Theory: A Summary p. 467


B.1. Basic Concepts of Convex Analysis . . p. 467
B.2. Basic Concepts of Polyhedral Convexity p. 489
B.3. Basic Concepts of Convex Optimization p. 494
B.4. Geometric Duality Framework p. 498
B.5. Duality and Optimization . . . . . . p. 505

References p. 519

Index . . . p. 557
ATHENA SCIENTIFIC
OPTIMIZATION AND COMPUTATION SERIES

1. Convex Optimization Algorithms, by Dimitri P. Bertsekas, 2015,


ISBN 978-1-886529-28-1, 576 pages
2. Abstract Dynamic Programming, by Dimitri P. Bertsekas, 2013,
ISBN 978-1-886529-42-7, 256 pages
3. Dynamic Programming and Optimal Control, Two-Volume Set,
by Dimitri P. Bertsekas, 2012, ISBN 1-886529-08-6, 1020 pages
4. Convex Optimization Theory, by Dimitri P. Bertsekas, 2009,
ISBN 978-1-886529-31-1, 256 pages
5. Introduction to Probability, 2nd Edition, by Dimitri P. Bertsekas
and John N. Tsitsiklis, 2008, ISBN 978-1-886529-23-6, 544 pages
6. Convex Analysis and Optimization, by Dimitri P. Bertsekas, An-
gelia Nedic, and Asuman E. Ozdaglar, 2003, ISBN 1-886529-45-0,
560 pages
7. Nonlinear Programming, 2nd Edition, by Dimitri P. Bertsekas,
1999, ISBN 1-886529-00-0, 791 pages
8. Network Optimization: Continuous and Discrete Models, by Dim-
itri P. Bertsekas, 1998, ISBN 1-886529-02-7, 608 pages
9. Network Flows and Monotropic Optimization, by R. Tyrrell Rock-
afellar, 1998, ISBN 1-886529-06-X, 634 pages
10. Introduction to Linear Optimization, by Dimitris Bertsimas and
John N. Tsitsiklis, 1997, ISBN 1-886529-19-1, 608 pages
11. Parallel and Distributed Computation: Numerical Methods, by
Dimitri P. Bertsekas and John N. Tsitsiklis, 1997, ISBN 1-886529-
01-9, 718 pages
12. Neuro-Dynamic Programming, by Dimitri P. Bertsekas and John
N. Tsitsiklis, 1996, ISBN 1-886529-10-8, 512 pages
13. Constrained Optimization and Lagrange Multiplier Methods, by
Dimitri P. Bertsekas, 1996, ISBN 1-886529-04-3, 410 pages
14. Stochastic Optimal Control: The Discrete-Time Case, by Dimitri
P. Bertsekas and Steven E. Shreve, 1996, ISBN 1-886529-03-5,
330 pages

ABOUT THE AUTHOR

Dimitri Bertsekas studied Mechanical and Electrical Engineering at the


National Technical University of Athens, Greece, and obtained his Ph.D.
in system science from the Massachusetts Institute of Technology. He has
held faculty positions with the Engineering-Economic Systems Department,
Stanford University, and the Electrical Engineering Department of the Uni-
versity of Illinois, Urbana. Since 1979 he has been teaching at the Electrical
Engineering and Computer Science Department of the Massachusetts In-
stitute of Technology (M.I.T.), where he is currently the McAfee Professor
of Engineering.
His teaching and research spans several fields, including deterministic
optimization, dynamic programming and stochastic control, large-scale and
distributed computation, and data communication networks. He has au-
thored or coauthored numerous research papers and sixteen books, several
of which are currently used as textbooks in MIT classes, including "Nonlin-
ear Programming," "Dynamic Programming and Optimal Control," "Data
Networks," "Introduction to Probability," "Convex Optimization Theory,"
as well as the present book. He often consults with private industry and
has held editorial positions in several journals.
Professor Bertsekas was awarded the INFORMS 1997 Prize for Re-
search Excellence in the Interface Between Operations Research and Com-
puter Science for his book "Neuro-Dynamic Programming" (co-authored
with John Tsitsiklis), the 2001 AACC John R. Ragazzini Education Award,
the 2009 INFORMS Expository Writing Award, the 2014 AACC Richard
Bellman Heritage Award for "contributions to the foundations of determin-
istic and stochastic optimization-based methods in systems and control,"
and the 2014 Khachiyan Prize for "life-time accomplishments in optimiza-
tion." In 2001, he was elected to the United States National Academy of
Engineering for "pioneering contributions to fundamental research, practice
and education of optimization/control theory, and especially its application
to data communication networks."

Preface

There is no royal way to geometry


(Euclid to king Ptolemy of Alexandria)

Interest in convex optimization has become intense due to widespread ap-


plications in fields such as large-scale resource allocation, signal processing,
and machine learning. This book aims at an up-to-date and accessible de-
velopment of algorithms for solving convex optimization problems.
The book complements the author's 2009 "Convex Optimization The-
ory" book, but can be read independently. The latter book focuses on
convexity theory and optimization duality, while the present book focuses
on algorithmic issues. The two books share mathematical prerequisites,
notation, and style, and together cover the entire finite-dimensional convex
optimization field. Both books rely on rigorous mathematical analysis, but
also aim at an intuitive exposition that makes use of visualization where
possible. This is facilitated by the extensive use of analytical and algorith-
mic concepts of duality, which by nature lend themselves to geometrical
interpretation.
To enhance readability, the statements of definitions and results of
the "theory book" are reproduced without proofs in Appendix B. Moreover,
some of the theory needed for the present book has been replicated and/or
adapted to its algorithmic nature. For example, the theory of subgradients
for real-valued convex functions is fully developed in Chapter 3. Thus the
reader who is already familiar with the analytical foundations of convex
optimization need not consult the "theory book" except for the purpose of
studying the proofs of some specific results.
The book covers almost all the major classes of convex optimization
algorithms. Principal among these are gradient, subgradient, polyhedral
approximation, proximal, and interior point methods. Most of these meth-
ods rely on convexity (but not necessarily differentiability) in the cost and
constraint functions, and are often connected in various ways to duality. I
have provided numerous examples describing in detail applications to spe-
cially structured problems. The reader may also find a wealth of analysis
and discussion of applications in books on large-scale convex optimization,
network optimization, parallel and distributed computation, signal process-
ing, and machine learning.
The chapter-by-chapter description of the book follows:
Chapter 1: Here we provide a broad overview of some important classes of
convex optimization problems, and their principal characteristics. Several


problem structures are discussed, often arising from Lagrange duality the-
ory and Fenchel duality theory, together with its special case, conic duality.
Some additional structures involving a large number of additive terms in
the cost, or a large number of constraints are also discussed, together with
their applications in machine learning and large-scale resource allocation.
Chapter 2: Here we provide an overview of algorithmic approaches, focus-
ing primarily on algorithms for differentiable optimization, and we discuss
their differences from their nondifferentiable convex optimization counter-
parts. We also highlight the main ideas of the two principal algorithmic
approaches of this book, iterative descent and approximation, and we illus-
trate their application with specific algorithms, reserving detailed analysis
for subsequent chapters.
Chapter 3: Here we discuss subgradient methods for minimizing a con-
vex cost function over a convex constraint set. The cost function may be
nondifferentiable, as is often the case in the context of duality and machine
learning applications. These methods are based on the idea of reduction
of distance to the optimal set, and include variations aimed at algorithmic
efficiency, such as ε-subgradient and incremental subgradient methods.
Chapter 4: Here we discuss polyhedral approximation methods for min-
imizing a convex function over a convex constraint set. The two main
approaches here are outer linearization (also called the cutting plane ap-
proach) and inner linearization (also called the simplicial decomposition
approach). We show how these two approaches are intimately connected
by conjugacy and duality, and we generalize our framework for polyhedral
approximation to the case where the cost function is a sum of two or more
convex component functions.
Chapter 5: Here we focus on proximal algorithms for minimizing a convex
function over a convex constraint set. At each iteration of the basic proxi-
mal method, we solve an approximation to the original problem. However,
unlike the preceding chapter, the approximation is not polyhedral, but
rather it is based on quadratic regularization, i.e., adding a quadratic term
to the cost function, which is appropriately adjusted at each iteration. We
discuss several variations of the basic algorithm. Some of these include
combinations with the polyhedral approximation methods of the preced-
ing chapter, yielding the class of bundle methods. Others are obtained
via duality from the basic proximal algorithm, including the augmented
Lagrangian method (also called the method of multipliers) for constrained op-
timization. Finally, we discuss extensions of the proximal algorithm for
finding a zero of a maximal monotone operator, and a major special case:
the alternating direction method of multipliers, which is well suited for
taking advantage of the structure of several types of large-scale problems.
Chapter 6: Here we discuss a variety of algorithmic topics that sup-
plement our discussion of the descent and approximation methods of the

preceding chapters. We first discuss gradient projection methods and vari-


ations with extrapolation that have good complexity properties, including
Nesterov's optimal complexity algorithm. These were developed for differ-
entiable problems, and can be extended to the nondifferentiable case by
means of a smoothing scheme. Then we discuss a number of combinations
of gradient, subgradient, and proximal methods that are well suited for
specially structured problems. We pay special attention to incremental
versions for the case where the cost function consists of the sum of a large
number of component terms. We also describe additional methods, such
as the classical block coordinate descent approach, the proximal algorithm
with a nonquadratic regularization term, and the ε-descent method. We
close the chapter with a discussion of interior point methods.
Our lines of analysis are largely based on differential calculus-type
ideas, which are central in nonlinear programming, and on concepts of hy-
perplane separation, conjugacy, and duality, which are central in convex
analysis. A traditional use of duality is to establish the equivalence and
the connections between a pair of primal and dual problems, which may in
turn enhance insight and enlarge the set of options for analysis and compu-
tation. The book makes heavy use of this type of problem duality, but also
emphasizes a qualitatively different, algorithm-oriented type of duality that
is largely based on conjugacy. In particular, some fundamental algorithmic
operations turn out to be dual to each other, and whenever they arise in
various algorithms they admit dual implementations, often with significant
gains in insight and computational convenience. Some important examples
are the duality between the subdifferentials of a convex function and its
conjugate, the duality of a proximal operation using a convex function and
an augmented Lagrangian minimization using its conjugate, and the dual-
ity between outer linearization of a convex function and inner linearization
of its conjugate. Several interesting algorithms in Chapters 4-6 admit dual
implementations based on these pairs of operations.
The book contains a fair number of exercises, many of them supplementing
the algorithmic development and analysis. In addition, a large number
of theoretical exercises (with carefully written solutions) for the "theory
book," together with other related material, can be obtained from the
book's web page http://www.athenasc.com/convexalgorithms.html, and
the author's web page http://web.mit.edu/dimitrib/www/home.html. The
MIT OpenCourseWare site http://ocw.mit.edu/index.htm also provides
lecture slides and other relevant material.
The mathematical prerequisites for the book are a first course in
linear algebra and a first course in real analysis. A summary of the relevant
material is provided in Appendix A. Prior exposure to linear and nonlinear
optimization algorithms is not assumed, although it will undoubtedly be
helpful in providing context and perspective. Other than this background,
the development is self-contained, with proofs provided throughout.

The present book, in conjunction with its "theory" counterpart, may


be used as a text for a one-semester or two-quarter convex optimization
course; I have taught several variants of such a course at MIT and else-
where over the last fifteen years. Still the book may not provide all of the
convex optimization material an instructor may wish for, and it may need
to be supplemented by works that aim primarily at specific types of con-
vex optimization models, or address more comprehensively computational
complexity issues. I have added representative citations for such works,
which, however, are far from complete in view of the explosive growth of
the literature on the subject.
The book may also be used as a supplementary source for nonlinear
programming classes that are primarily focused on classical differentiable
nonconvex optimization material (Kuhn-Tucker theory, Newton-like and
conjugate direction methods, interior point, penalty, and augmented La-
grangian methods). For such courses, it may provide a nondifferentiable
convex optimization component.
I was fortunate to have several outstanding collaborators in my re-
search on various aspects of convex optimization: Vivek Borkar, Jon Eck-
stein, Eli Gafni, Xavier Luque, Angelia Nedic, Asuman Ozdaglar, John
Tsitsiklis, Mengdi Wang, and Huizhen (Janey) Yu. Substantial portions of
our joint research have found their way into the book. In addition, I am
grateful for interactions and suggestions I received from several colleagues,
including Leon Bottou, Steve Boyd, Tom Luo, Steve Wright, and particu-
larly Mark Schmidt and Lin Xiao who read with care major portions of the
book. I am also very thankful for the valuable proofreading of parts of the
book by Mengdi Wang and Huizhen (Janey) Yu, and particularly by Ivan
Pejcic who went through most of the book with a keen eye. I developed
the book through convex optimization classes at MIT over a fifteen-year
period, and I want to express appreciation for my students who provided
continuing motivation and inspiration.
Finally, I would like to mention Paul Tseng, a major contributor
to numerous topics in this book, who was my close friend and research
collaborator on optimization algorithms for many years, and whom we
unfortunately lost while he was still at his prime. I am dedicating the book
to his memory.

Dimitri P. Bertsekas
[email protected]
January 2015
1
Convex Optimization Models:
An Overview

Contents

1.1. Lagrange Duality . . . . . . . . p. 2


1.1.1. Separable Problems - Decomposition p. 7
1.1.2. Partitioning . . . . . . . . . . . p. 9
1.2. Fenchel Duality and Conic Programming . p. 10
1.2.1. Linear Conic Problems . . . . p. 15
1.2.2. Second Order Cone Programming p. 17
1.2.3. Semidefinite Programming p. 22
1.3. Additive Cost Problems . . p. 25
1.4. Large Number of Constraints . p. 34
1.5. Exact Penalty Functions p. 39
1.6. Notes, Sources, and Exercises p. 47


In this chapter we provide an overview of some broad classes of convex


optimization models. Our primary focus will be on large challenging prob-
lems, often connected in some way to duality. We will consider two types
of duality. The first is Lagrange duality for constrained optimization, which
is obtained by assigning dual variables to the constraints. The second is
Fenchel duality together with its special case, conic duality, which involves
a cost function that is the sum of two convex function components. Both
of these duality structures arise often in applications, and in Sections 1.1
and 1.2 we provide an overview, and discuss some examples.†
In Sections 1.3 and 1.4, we discuss additional model structures in-
volving a large number of additive terms in the cost, or a large number
of constraints. These types of problems also arise often in the context of
duality, as well as in other contexts such as machine learning and signal
processing with large amounts of data. In Section 1.5, we discuss the exact
penalty function technique, whereby we can transform a convex constrained
optimization problem to an equivalent unconstrained problem.

1.1 LAGRANGE DUALITY

We start our overview of Lagrange duality with the basic case of nonlin-
ear inequality constraints, and then consider extensions involving linear
inequality and equality constraints. Consider the problem‡

    minimize   $f(x)$
    subject to $x \in X$, $g(x) \le 0$,        (1.1)

where $X$ is a nonempty set,

$$g(x) = \big(g_1(x), \ldots, g_r(x)\big)',$$

and $f : X \mapsto \Re$ and $g_j : X \mapsto \Re$, $j = 1, \ldots, r$, are given functions. We refer
to this as the primal problem, and we denote its optimal value by $f^*$. A
vector $x$ satisfying the constraints of the problem is referred to as feasible.
The dual of problem (1.1) is given by

    maximize   $q(\mu)$
    subject to $\mu \in \Re^r$,        (1.2)

† Consistent with its overview character, this chapter contains few proofs,
and refers frequently to the literature, and to Appendix B, which contains a full
list of definitions and propositions (without proofs) relating to nonalgorithmic
aspects of convex optimization. This list reflects and summarizes the content
of the author's "Convex Optimization Theory" book [Ber09]. The proposition
numbers of [Ber09] have been preserved, so all omitted proofs of propositions in
Appendix B can be readily accessed from [Ber09].
‡ Appendix A contains an overview of the mathematical notation, terminol-
ogy, and results from linear algebra and real analysis that we will be using.

where the dual function $q$ is

$$q(\mu) = \begin{cases} \inf_{x\in X} L(x,\mu) & \text{if } \mu \ge 0,\\ -\infty & \text{otherwise}, \end{cases}$$

and $L$ is the Lagrangian function defined by

$$L(x,\mu) = f(x) + \mu'g(x), \qquad x \in X,\ \mu\in\Re^r;$$

(cf. Section 5.3 of Appendix B).


Note that the dual function is extended real-valued, and that the
effective constraint set of the dual problem is

$$\big\{ \mu \ge 0 \mid q(\mu) > -\infty \big\}.$$

The optimal value of the dual problem is denoted by $q^*$.


The weak duality relation, $q^* \le f^*$, always holds. It is easily shown
by writing for all $\mu \ge 0$, and $x \in X$ with $g(x) \le 0$,

$$q(\mu) = \inf_{z\in X} L(z,\mu) \le L(x,\mu) = f(x) + \sum_{j=1}^{r} \mu_j g_j(x) \le f(x),$$

so that

$$q^* = \sup_{\mu\in\Re^r} q(\mu) = \sup_{\mu\ge 0} q(\mu) \le \inf_{x\in X,\, g(x)\le 0} f(x) = f^*.$$

We state this formally as follows (cf. Prop. 4.1.2 in Appendix B).

Proposition 1.1.1: (Weak Duality Theorem) Consider problem
(1.1). For any feasible solution $x$ and any $\mu\in\Re^r$, we have $q(\mu) \le f(x)$.
Moreover, $q^* \le f^*$.

When $q^* = f^*$, we say that strong duality holds. The following proposition
gives necessary and sufficient conditions for strong duality, and primal
and dual optimality (see Prop. 5.3.2 in Appendix B).

Proposition 1.1.2: (Optimality Conditions) Consider problem
(1.1). There holds $q^* = f^*$, and $(x^*,\mu^*)$ are a primal and dual optimal
solution pair if and only if $x^*$ is feasible, $\mu^* \ge 0$, and

$$x^* \in \arg\min_{x\in X} L(x,\mu^*), \qquad \mu_j^* g_j(x^*) = 0, \quad j = 1,\ldots,r.$$

Neither of the preceding propositions requires any convexity assumptions
on $f$, $g$, and $X$. However, generally the analytical and algorithmic solution
process is simplified when strong duality ($q^* = f^*$) holds. This typically
requires convexity assumptions, and in some cases conditions on $\mathrm{ri}(X)$,
the relative interior of $X$, as exemplified by the following result, given in
Prop. 5.3.1 in Appendix B. The result delineates the two principal cases
where there is no duality gap in an inequality-constrained problem.

Proposition 1.1.3: (Strong Duality - Existence of Dual Optimal
Solutions) Consider problem (1.1) under the assumption that
the set $X$ is convex, and the functions $f$, and $g_1,\ldots,g_r$ are convex.
Assume further that $f^*$ is finite, and that one of the following two
conditions holds:
(1) There exists $x\in X$ such that $g_j(x) < 0$ for all $j = 1,\ldots,r$.
(2) The functions $g_j$, $j = 1,\ldots,r$, are affine, and there exists $x\in\mathrm{ri}(X)$ such that $g(x) \le 0$.
Then $q^* = f^*$ and there exists at least one dual optimal solution.
Under condition (1) the set of dual optimal solutions is also compact.
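As a simple illustration of these results (a numerical sketch with made-up data, not part of the text), consider minimizing $f(x) = x^2$ over $X = \Re$ subject to $g(x) = 1 - x \le 0$. Here $L(x,\mu) = x^2 + \mu(1-x)$ and $q(\mu) = \mu - \mu^2/4$ for $\mu \ge 0$, so $q^* = 1 = f^*$ (attained at $\mu^* = 2$, $x^* = 1$), consistent with Prop. 1.1.3 since condition (1) holds. The Python fragment below checks this on a grid.

import numpy as np

# Primal: minimize f(x) = x**2 subject to g(x) = 1 - x <= 0, with X = R.
f = lambda x: x**2
g = lambda x: 1.0 - x

x_grid = np.linspace(-5.0, 5.0, 200001)
f_star = f(x_grid[g(x_grid) <= 0]).min()            # optimal primal value (about 1)

# Dual function q(mu) = inf_x { f(x) + mu * g(x) }, evaluated on the same grid.
mu_grid = np.linspace(0.0, 5.0, 501)
q = np.array([(f(x_grid) + mu * g(x_grid)).min() for mu in mu_grid])

print(f_star, q.max())                              # strong duality: both are about 1
assert np.all(q <= f_star + 1e-8)                   # weak duality (Prop. 1.1.1)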

Convex Programming with Inequality and Equality Constraints

Let us consider an extension of problem (1.1), with additional linear equality
constraints. It is our principal constrained optimization model under
convexity assumptions, and it will be referred to as the convex programming
problem. It is given by

    minimize   $f(x)$
    subject to $x \in X$, $g(x) \le 0$, $Ax = b$,        (1.3)

where $X$ is a convex set, $g(x) = \big(g_1(x),\ldots,g_r(x)\big)'$, $f : X\mapsto\Re$ and
$g_j : X\mapsto\Re$, $j = 1,\ldots,r$, are given convex functions, $A$ is an $m\times n$ matrix,
and $b\in\Re^m$.

The preceding duality framework may be applied to this problem by
converting the constraint $Ax = b$ to the equivalent set of linear inequality
constraints

$$Ax \le b, \qquad -Ax \le -b,$$

with corresponding dual variables $\lambda^+ \ge 0$ and $\lambda^- \ge 0$. The Lagrangian
function is

$$f(x) + \mu'g(x) + (\lambda^+ - \lambda^-)'(Ax - b),$$

and by introducing a dual variable $\lambda = \lambda^+ - \lambda^-$ with no sign restriction, it can be written as

$$L(x,\mu,\lambda) = f(x) + \mu'g(x) + \lambda'(Ax - b).$$

The dual problem is

    maximize   $\inf_{x\in X} L(x,\mu,\lambda)$
    subject to $\mu \ge 0$, $\lambda\in\Re^m$.

In this manner, Prop. 1.1.3 under condition (2), together with Prop. 1.1.2,
yield the following for the case where all constraint functions are linear.

Proposition 1.1.4: (Convex Programming - Linear Equality
and Inequality Constraints) Consider problem (1.3).
(a) Assume that $f^*$ is finite, that the functions $g_j$ are affine, and
that there exists $x\in\mathrm{ri}(X)$ such that $Ax = b$ and $g(x) \le 0$. Then
$q^* = f^*$ and there exists at least one dual optimal solution.
(b) There holds $f^* = q^*$, and $(x^*,\mu^*,\lambda^*)$ are a primal and dual
optimal solution pair if and only if $x^*$ is feasible, $\mu^* \ge 0$, and

$$x^* \in \arg\min_{x\in X} L(x,\mu^*,\lambda^*), \qquad \mu_j^* g_j(x^*) = 0, \quad j = 1,\ldots,r.$$

In the special case where there are no inequality constraints:

    minimize   $f(x)$
    subject to $x\in X$, $Ax = b$,        (1.4)

the Lagrangian function is

$$L(x,\lambda) = f(x) + \lambda'(Ax - b),$$

and the dual problem is

    maximize   $\inf_{x\in X} L(x,\lambda)$
    subject to $\lambda\in\Re^m$.

The corresponding result, a simpler special case of Prop. 1.1.4, is given in
the following proposition.

Proposition 1.1.5: (Convex Programming - Linear Equality
Constraints) Consider problem (1.4).
(a) Assume that $f^*$ is finite and that there exists $x\in\mathrm{ri}(X)$ such
that $Ax = b$. Then $f^* = q^*$ and there exists at least one dual
optimal solution.
(b) There holds $f^* = q^*$, and $(x^*,\lambda^*)$ are a primal and dual optimal
solution pair if and only if $x^*$ is feasible and

$$x^* \in \arg\min_{x\in X} L(x,\lambda^*).$$

The following is an extension of Prop. 1.1.4(a) to the case where the
inequality constraints may be nonlinear. It is the most general convex
programming result relating to duality in this section (see Prop. 5.3.5 in
Appendix B).

Proposition 1.1.6: (Convex Programming - Linear Equality
and Nonlinear Inequality Constraints) Consider problem (1.3).
Assume that $f^*$ is finite, that there exists $x\in X$ such that $Ax = b$
and $g(x) < 0$, and that there exists $x\in\mathrm{ri}(X)$ such that $Ax = b$. Then
$q^* = f^*$ and there exists at least one dual optimal solution.

Aside from the preceding results, there are alternative optimality conditions
for convex and nonconvex optimization problems, which are based
on extended versions of the Fritz John theorem; see [BeO02] and [BOT06],
and the textbooks [Ber99] and [BNO03]. These conditions are derived using
a somewhat different line of analysis and supplement the ones given
here, but we will not have occasion to use them in this book.

Discrete Optimization and Lower Bounds

The preceding propositions deal mostly with situations where strong duality
holds ($q^* = f^*$). However, duality can be useful even when there is a
duality gap, as often occurs in problems that have a finite constraint set
$X$. An example is integer programming, where the components of $x$ must
be integers from a bounded range (usually 0 or 1). An important special
case is the linear 0-1 integer programming problem

    minimize   $c'x$
    subject to $Ax \le b$, $x_i = 0$ or $1$, $i = 1,\ldots,n$,

where $x = (x_1,\ldots,x_n)$.

A principal approach for solving discrete optimization problems with
a finite constraint set is the branch-and-bound method, which is described
in many sources; see e.g., one of the original works [LaD60], the survey
[BaT85], and the book [NeW88]. The general idea of the method is that
bounds on the cost function can be used to exclude from consideration
portions of the feasible set. To illustrate, consider minimizing $F(x)$ over
$x\in X$, and let $Y_1, Y_2$ be two subsets of $X$. Suppose that we have bounds

$$F_1 \le \min_{x\in Y_1} F(x), \qquad F_2 \ge \min_{x\in Y_2} F(x).$$

Then, if $F_2 \le F_1$, the solutions in $Y_1$ may be disregarded since their cost
cannot be smaller than the cost of the best solution in $Y_2$. The lower bound
$F_1$ can often be conveniently obtained by minimizing $F$ over a suitably
enlarged version of $Y_1$, while for the upper bound $F_2$, a value $F(x)$, where
$x\in Y_2$, may be used.
Branch-and-bound is often based on weak duality (cf. Prop. 1.1.1) to
obtain lower bounds to the optimal cost of restricted problems of the form

    minimize   $f(x)$
    subject to $x\in\bar X$, $g(x) \le 0$,        (1.5)

where $\bar X$ is a subset of $X$; for example in the 0-1 integer case where $X$
specifies that all $x_i$ should be 0 or 1, $\bar X$ may be the set of all 0-1 vectors
$x$ such that one or more components $x_i$ are fixed at either 0 or 1 (i.e., are
restricted to satisfy $x_i = 0$ for all $x\in\bar X$ or $x_i = 1$ for all $x\in\bar X$). These
lower bounds can often be obtained by finding a dual-feasible (possibly
dual-optimal) solution $\mu \ge 0$ of this problem and the corresponding dual
value

$$q(\mu) = \inf_{x\in\bar X}\big\{ f(x) + \mu'g(x)\big\},\tag{1.6}$$

which by weak duality, is a lower bound to the optimal value of the restricted
problem (1.5). In a strengthened version of this approach, the
given inequality constraints $g(x) \le 0$ may be augmented by additional
inequalities that are known to be satisfied by optimal solutions of the
original problem.

An important point here is that when $\bar X$ is finite, the dual function
$q$ of Eq. (1.6) is concave and polyhedral. Thus solving the dual problem
amounts to minimizing the polyhedral function $-q$ over the nonnegative
orthant. This is a major context within which polyhedral functions arise
in convex optimization.
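As a small numerical sketch of such bounds (hypothetical data, not from the text), the fragment below enumerates a restricted 0-1 set $\bar X$ with the first component fixed at 1, and evaluates the dual value of Eq. (1.6) for a few multipliers $\mu \ge 0$; by weak duality each value is a valid lower bound for the restricted problem, which is the kind of bound a branch-and-bound scheme would use.

import itertools
import numpy as np

c = np.array([3.0, -2.0, 4.0, -1.0])
A = np.array([[2.0, 1.0, 3.0, 1.0]])
b = np.array([3.0])

# Restricted set X_bar: 0-1 vectors whose first component is fixed at 1.
X_bar = [np.array((1,) + t, dtype=float) for t in itertools.product((0, 1), repeat=3)]

feasible = [x for x in X_bar if np.all(A @ x <= b)]
f_restricted = min(c @ x for x in feasible)          # optimal cost of problem (1.5)

def q(mu):
    # Dual value of Eq. (1.6): an unconstrained minimization over X_bar.
    return min(c @ x + mu @ (A @ x - b) for x in X_bar)

for mu in ([0.0], [0.5], [1.0], [2.0]):
    mu = np.array(mu)
    assert q(mu) <= f_restricted + 1e-12             # weak duality lower bound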

1.1.1 Separable Problems - Decomposition

Let us now discuss an important problem structure that involves Lagrange
duality and arises frequently in applications. Here $x$ has $m$ components,
$x = (x_1,\ldots,x_m)$, with each $x_i$ being a vector of dimension $n_i$ (often $n_i = 1$).
The problem has the form

    minimize   $\sum_{i=1}^m f_i(x_i)$
    subject to $\sum_{i=1}^m g_{ij}(x_i) \le 0$, $x_i\in X_i$, $i = 1,\ldots,m$, $j = 1,\ldots,r$,        (1.7)

where $f_i : \Re^{n_i}\mapsto\Re$ and $g_{ij} : \Re^{n_i}\mapsto\Re$ are given functions, and $X_i$ are
given subsets of $\Re^{n_i}$. By assigning a dual variable $\mu_j$ to the $j$th constraint,
we obtain the dual problem [cf. Eq. (1.2)]

    maximize   $\sum_{i=1}^m q_i(\mu)$
    subject to $\mu \ge 0$,        (1.8)

where

$$q_i(\mu) = \inf_{x_i\in X_i}\Big\{ f_i(x_i) + \sum_{j=1}^{r} \mu_j g_{ij}(x_i)\Big\},$$

and $\mu = (\mu_1,\ldots,\mu_r)$.


Note that the minimization involved in the calculation of the dual
function has been decomposed into $m$ simpler minimizations. These minimizations
are often conveniently done either analytically or computationally,
in which case the dual function can be easily evaluated. This is the key
advantageous structure of separable problems: it facilitates computation of
dual function values (as well as subgradients, as we will see in Section 3.1),
and it is amenable to decomposition and distributed computation.

Let us also note that in the special case where the components $x_i$
are one-dimensional, and the functions $f_i$ and sets $X_i$ are convex, there
is a particularly favorable duality result for the separable problem (1.7):
essentially, strong duality holds without any qualifications such as the
linearity of the constraint functions, or the Slater condition of Prop. 1.1.3; see
[Tse09].
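The decomposition is easy to exercise numerically. The sketch below (illustrative data only, assuming quadratic $f_i$, the interval $X_i = [0,2]$ for every $i$, and a single coupling constraint $\sum_i x_i - B \le 0$ split as $g_{i1}(x_i) = x_i - B/m$) evaluates the dual function of (1.8) as a sum of $m$ independent one-dimensional minimizations.

import numpy as np

m, B = 5, 4.0
t = np.array([1.5, 0.5, 2.0, 1.0, 0.8])        # f_i(x_i) = (x_i - t_i)**2
grid = np.linspace(0.0, 2.0, 2001)             # X_i = [0, 2] for every i

def q(mu):
    # Dual function: each q_i(mu) is an independent minimization over X_i.
    return sum(np.min((grid - ti)**2 + mu * (grid - B / m)) for ti in t)

# q is concave; a crude search over mu >= 0 gives the best lower bound on f*.
mus = np.linspace(0.0, 5.0, 501)
values = np.array([q(mu) for mu in mus])
print(mus[values.argmax()], values.max())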

Duality Gap Estimates for Nonconvex Separable Problems

The separable structure is additionally helpful when the cost and/ or the
constraints are not convex, and there is a duality gap. In particular, in this
case the duality gap turns out to be relatively small and can often be shown
to diminish to zero relative to the optimal primal value as the number m of
separable terms increases. As a result, one can often obtain a near-optimal
primal solution, starting from a dual-optimal solution, without resorting
to costly branch-and-bound procedures.

The small duality gap size is a consequence of the structure of the set
$S$ of constraint-cost pairs of problem (1.7), which in the case of a separable
problem, can be written as a vector sum of $m$ sets, one for each separable
term, i.e.,

$$S = S_1 + \cdots + S_m,$$

where

$$S_i = \big\{ \big(g_i(x_i), f_i(x_i)\big) \mid x_i\in X_i\big\},$$

and $g_i : \Re^{n_i}\mapsto\Re^r$ is the function $g_i(x_i) = \big(g_{i1}(x_i),\ldots,g_{ir}(x_i)\big)$. It can
be shown that the duality gap is related to how much $S$ "differs" from
its convex hull (a geometric explanation is given in [Ber99], Section 5.1.6,
and [Ber09], Section 5.7). Generally, a set that is the vector sum of a
large number of possibly nonconvex but roughly similar sets "tends to
be convex" in the sense that any vector in its convex hull can be closely
approximated by a vector in the set. As a result, the duality gap tends to
be relatively small. The analytical substantiation is based on a theorem
by Shapley and Folkman (see [Ber99], Section 5.1, or [Ber09], Prop. 5.7.1,
for a statement and proof of this theorem). In particular, it is shown in
[AuE76], and also [BeS82], [Ber82a], Section 5.6.1, under various reasonable
assumptions, that the duality gap satisfies

$$f^* - q^* \le (r+1)\cdot\max_{i=1,\ldots,m} p_i,$$

where for each $i$, $p_i$ is a nonnegative scalar that depends on the structure of
the functions $f_i$, $g_{ij}$, $j = 1,\ldots,r$, and the set $X_i$ (the paper [AuE76] focuses
on the case where the problem is nonconvex but continuous, while [BeS82]
and [Ber82a] focus on an important class of mixed integer programming
problems). This estimate suggests that as $m\to\infty$ and $|f^*|\to\infty$, the
duality gap is bounded, while the "relative" duality gap $(f^* - q^*)/|f^*|$
diminishes to 0 as $m\to\infty$.

The duality gap has also been investigated in the author's book
[Ber09] within the more general min common-max crossing framework
(Section 4.1 of Appendix B). This framework includes as special cases
minimax and zero-sum game problems. In particular, consider a function
$\phi : X\times Z\mapsto\Re$ defined over nonempty subsets $X\subset\Re^n$ and $Z\subset\Re^m$. Then
it can be shown that the gap between "infsup" and "supinf" of $\phi$ can be
decomposed into the sum of two terms that can be computed separately:
one term can be attributed to the lack of convexity and/or closure of $\phi$
with respect to $x$, and the other can be attributed to the lack of concavity
and/or upper semicontinuity of $\phi$ with respect to $z$. We refer to [Ber09],
Section 5.7.2, for the analysis.

1.1.2 Partitioning

It is important to note that there are several different ways to introduce
duality in the solution of large-scale optimization problems. For example, a
strategy, often called partitioning, is to divide the variables into two subsets,
and minimize first with respect to one subset while taking advantage of
whatever simplification may arise by fixing the variables in the other subset.
As an example, the problem

    minimize   $F(x) + G(y)$
    subject to $Ax + By = c$, $x\in X$, $y\in Y$,

can be written as

    minimize   $F(x) + \inf_{By = c - Ax,\, y\in Y} G(y)$
    subject to $x\in X$,

or

    minimize   $F(x) + p(c - Ax)$
    subject to $x\in X$,

where $p$ is given by

$$p(u) = \inf_{By = u,\, y\in Y} G(y).$$

In favorable cases, $p$ can be dealt with conveniently (see e.g., the book
[Las70] and the paper [Geo72]).
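A minimal numerical sketch of this strategy (hypothetical data; the inner problem is solved by brute force, with $y_2$ eliminated through the equality constraint) is the following.

import numpy as np

# minimize F(x) + G(y)  subject to  x + y1 + y2 = 3,  x in [0, 3],  y in [0, 2]^2.
F = lambda x: (x - 2.0)**2
G = lambda y1, y2: y1**2 + 2.0 * y2**2

def p(u):
    # p(u) = inf { G(y) : y1 + y2 = u, y in [0, 2]^2 }, by a grid over y1.
    y1 = np.linspace(0.0, 2.0, 2001)
    y2 = u - y1
    ok = (y2 >= 0.0) & (y2 <= 2.0)
    return np.min(G(y1[ok], y2[ok])) if np.any(ok) else np.inf

# Outer minimization over x alone, with the y variables "partitioned out".
x_grid = np.linspace(0.0, 3.0, 3001)
vals = np.array([F(x) + p(3.0 - x) for x in x_grid])
print(x_grid[vals.argmin()], vals.min())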
Strategies of splitting or transforming the variables to facilitate al-
gorithmic solution will be frequently encountered in what follows, and in
a variety of contexts, including duality. The next section describes some
significant contexts of this type.

1.2 FENCHEL DUALITY AND CONIC PROGRAMMING

Let us consider the Fenchel duality framework (see Section 5.3.5 of Appendix B).
It involves the problem

    minimize   $f_1(x) + f_2(Ax)$
    subject to $x\in\Re^n$,        (1.9)

where $A$ is an $m\times n$ matrix, $f_1 : \Re^n\mapsto(-\infty,\infty]$ and $f_2 : \Re^m\mapsto(-\infty,\infty]$
are closed proper convex functions, and we assume that there exists a
feasible solution, i.e., an $x\in\Re^n$ such that $x\in\mathrm{dom}(f_1)$ and $Ax\in\mathrm{dom}(f_2)$.†

The problem is equivalent to the following constrained optimization
problem in the variables $x_1\in\Re^n$ and $x_2\in\Re^m$:

    minimize   $f_1(x_1) + f_2(x_2)$
    subject to $x_2 = Ax_1$, $x_1\in\mathrm{dom}(f_1)$, $x_2\in\mathrm{dom}(f_2)$.        (1.10)

† We remind the reader that our convex analysis notation, terminology, and
nonalgorithmic theory are summarized in Appendix B.

Viewing this as a convex programming problem with the linear equality
constraint $x_2 = Ax_1$, we obtain the dual function as

$$q(\lambda) = \inf_{x_1\in\mathrm{dom}(f_1),\, x_2\in\mathrm{dom}(f_2)}\big\{ f_1(x_1) + f_2(x_2) + \lambda'(x_2 - Ax_1)\big\}
= \inf_{x_1\in\Re^n}\big\{ f_1(x_1) - \lambda'Ax_1\big\} + \inf_{x_2\in\Re^m}\big\{ f_2(x_2) + \lambda'x_2\big\}.$$

The dual problem of maximizing $q$ over $\lambda\in\Re^m$, after a sign change to
convert it to a minimization problem, takes the form

    minimize   $f_1^*(A'\lambda) + f_2^*(-\lambda)$
    subject to $\lambda\in\Re^m$,        (1.11)

where $f_1^*$ and $f_2^*$ are the conjugate functions of $f_1$ and $f_2$. We denote by
$f^*$ and $q^*$ the corresponding optimal primal and dual values.

The following Fenchel duality result is given as Prop. 5.3.8 in Appendix B.
Parts (a) and (b) are obtained by applying Prop. 1.1.5(a) to
problem (1.10), viewed as a problem with $x_2 = Ax_1$ as the only linear
equality constraint. The first equation of part (c) is a consequence of Prop.
1.1.5(b). Its equivalence with the last two equations is a consequence of
the Conjugate Subgradient Theorem (Prop. 5.4.3, App. B), which states
that for a closed proper convex function $f$, its conjugate $f^*$, and any pair
of vectors $(x,y)$, we have

$$x \in \arg\min_{z\in\Re^n}\big\{ f(z) - z'y\big\} \quad\text{iff}\quad y\in\partial f(x) \quad\text{iff}\quad x\in\partial f^*(y),$$

with all of these three relations being equivalent to $x'y = f(x) + f^*(y)$.
Here $\partial f(x)$ denotes the subdifferential of $f$ at $x$ (the set of all subgradients
of $f$ at $x$); see Section 5.4 of Appendix B.

Proposition 1.2.1: (Fenchel Duality) Consider problem (1.9).
(a) If $f^*$ is finite and $\big(A\cdot\mathrm{ri}(\mathrm{dom}(f_1))\big)\cap\mathrm{ri}\big(\mathrm{dom}(f_2)\big) \ne \emptyset$, then
$f^* = q^*$ and there exists at least one dual optimal solution.
(b) If $q^*$ is finite and $\mathrm{ri}\big(\mathrm{dom}(f_1^*)\big)\cap\big(A'\cdot\mathrm{ri}(-\mathrm{dom}(f_2^*))\big) \ne \emptyset$, then
$f^* = q^*$ and there exists at least one primal optimal solution.
(c) There holds $f^* = q^*$, and $(x^*,\lambda^*)$ is a primal and dual optimal
solution pair if and only if any one of the following three
equivalent conditions holds:

$$x^* \in \arg\min_{x\in\Re^n}\big\{ f_1(x) - x'A'\lambda^*\big\} \quad\text{and}\quad Ax^* \in \arg\min_{z\in\Re^m}\big\{ f_2(z) + z'\lambda^*\big\},\tag{1.12}$$

$$A'\lambda^* \in \partial f_1(x^*) \quad\text{and}\quad -\lambda^* \in \partial f_2(Ax^*),\tag{1.13}$$

$$x^* \in \partial f_1^*(A'\lambda^*) \quad\text{and}\quad Ax^* \in \partial f_2^*(-\lambda^*).\tag{1.14}$$
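To make the framework concrete, here is a small numerical sketch (illustrative data; conjugates are computed by brute-force maximization over a grid, so the agreement is only approximate) for the scalar case $f_1(x) = \tfrac{1}{2}(x-1)^2$, $f_2(y) = |y|$, and $A = 1$. The primal value of (1.9) and the dual value $q^*$ obtained from the minimization form (1.11) should both come out to about $0.5$.

import numpy as np

A = 1.0
f1 = lambda x: 0.5 * (x - 1.0)**2
f2 = lambda y: np.abs(y)

x = np.linspace(-10.0, 10.0, 40001)

# Primal optimal value of (1.9): inf_x { f1(x) + f2(A x) }.
f_star = np.min(f1(x) + f2(A * x))

# Conjugate by brute force: f*(y) = sup_x { x*y - f(x) } over the grid.
conj = lambda f, y: np.max(x * y - f(x))

# Dual (1.11): minimize f1*(A' lam) + f2*(-lam); restrict to |lam| <= 1,
# since f2*(-lam) = +infinity outside that interval.
lam = np.linspace(-1.0, 1.0, 2001)
dual_min = min(conj(f1, A * l) + conj(f2, -l) for l in lam)
q_star = -dual_min                                   # q* = -(optimal value of (1.11))

print(f_star, q_star)                                # both approximately 0.5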

Minimax Problems

Minimax problems involve minimization over a set $X$ of a function $F$ of
the form

$$F(x) = \sup_{z\in Z}\phi(x,z),$$

where $X$ and $Z$ are subsets of $\Re^n$ and $\Re^m$, respectively, and $\phi : \Re^n\times\Re^m\mapsto\Re$
is a given function. Some (but not all) problems of this type are related
to constrained optimization and Fenchel duality.

Example 1.2.1: (Connection with Constrained Optimization)

Let $\phi$ and $Z$ have the form

$$\phi(x,z) = f(x) + z'g(x), \qquad Z = \{ z \mid z \ge 0\},$$

where $f : \Re^n\mapsto\Re$ and $g : \Re^n\mapsto\Re^m$ are given functions. Then it is seen that

$$F(x) = \sup_{z\in Z}\phi(x,z) = \begin{cases} f(x) & \text{if } g(x) \le 0,\\ \infty & \text{otherwise}.\end{cases}$$

Thus minimization of $F$ over $x\in X$ is equivalent to solving the constrained
optimization problem

    minimize   $f(x)$
    subject to $x\in X$, $g(x) \le 0$.        (1.15)

The dual problem is to maximize over $z \ge 0$ the function

$$q(z) = \inf_{x\in X}\big\{ f(x) + z'g(x)\big\} = \inf_{x\in X}\phi(x,z),$$

and the minimax equality

$$\inf_{x\in X}\sup_{z\in Z}\phi(x,z) = \sup_{z\in Z}\inf_{x\in X}\phi(x,z)\tag{1.16}$$

is equivalent to problem (1.15) having no duality gap.

Example 1.2.2: (Connection with Fenchel Duality)

Let $\phi$ have the special form

$$\phi(x,z) = f(x) + z'Ax - g(z),$$

where $f : \Re^n\mapsto\Re$ and $g : \Re^m\mapsto\Re$ are given functions, and $A$ is a given
$m\times n$ matrix. Then we have

$$F(x) = \sup_{z\in Z}\phi(x,z) = f(x) + \sup_{z\in Z}\big\{ (Ax)'z - g(z)\big\} = f(x) + \hat g^*(Ax),$$

where $\hat g^*$ is the conjugate of the function

$$\hat g(z) = \begin{cases} g(z) & \text{if } z\in Z,\\ \infty & \text{otherwise}.\end{cases}$$

Thus the minimax problem of minimizing $F$ over $x\in X$ comes under the
Fenchel framework (1.9) with $f_2 = \hat g^*$ and $f_1$ given by

$$f_1(x) = \begin{cases} f(x) & \text{if } x\in X,\\ \infty & \text{if } x\notin X.\end{cases}$$

It can also be verified that the Fenchel dual problem (1.11) is equivalent to
maximizing over $z\in Z$ the function $q(z) = \inf_{x\in X}\phi(x,z)$. Again having no
duality gap is equivalent to the minimax equality (1.16) holding.

Finally note that strong duality theory is connected with minimax
problems primarily when $X$ and $Z$ are convex sets, and $\phi$ is convex in $x$
and concave in $z$. When $Z$ is a finite set, there is a different connection
with constrained optimization that does not involve Fenchel duality and
applies without any convexity conditions. In particular, the problem

    minimize   $\max\big\{ g_1(x),\ldots,g_r(x)\big\}$
    subject to $x\in X$,

where $g_j : \Re^n\mapsto\Re$ are any real-valued functions, is equivalent to the
constrained optimization problem

    minimize   $y$
    subject to $x\in X$, $g_j(x) \le y$, $j = 1,\ldots,r$,

where $y$ is an additional scalar optimization variable. Minimax problems
will be discussed further later, in Section 1.4, as an example of problems
that may involve a large number of constraints.
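As a quick sketch of this reformulation (with hypothetical affine functions $g_j(x) = a_j'x + b_j$ and $X = \Re^2$, so that the equivalent problem is a linear program), one can pass the pair $(x, y)$ to scipy's linprog:

import numpy as np
from scipy.optimize import linprog

# minimize over x in R^2 the function max_j { a_j' x + b_j }, j = 1, 2, 3.
A = np.array([[1.0, 2.0],
              [-1.0, 0.5],
              [0.0, -1.0]])
b = np.array([0.0, 1.0, 2.0])

# Epigraph form: minimize y subject to a_j' x - y <= -b_j for all j.
n = A.shape[1]
c = np.concatenate([np.zeros(n), [1.0]])               # objective is y alone
A_ub = np.hstack([A, -np.ones((len(b), 1))])
res = linprog(c, A_ub=A_ub, b_ub=-b,
              bounds=[(None, None)] * (n + 1))         # x and y are free variables
print(res.x[:n], res.x[n])                             # minimizer and min-max value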

Conic Programming

An important problem structure, which can be analyzed as a special case of
the Fenchel duality framework, is conic programming. This is the problem

    minimize   $f(x)$
    subject to $x\in C$,        (1.17)

where $f : \Re^n\mapsto(-\infty,\infty]$ is a closed proper convex function and $C$ is a
closed convex cone in $\Re^n$.

Indeed, let us apply Fenchel duality with $A$ equal to the identity and
the definitions

$$f_1(x) = f(x), \qquad f_2(x) = \begin{cases} 0 & \text{if } x\in C,\\ \infty & \text{if } x\notin C.\end{cases}$$

The corresponding conjugates are

$$f_1^*(\lambda) = \sup_{x\in\Re^n}\big\{ \lambda'x - f(x)\big\}, \qquad f_2^*(\lambda) = \sup_{x\in C}\lambda'x = \begin{cases} 0 & \text{if } \lambda\in C^*,\\ \infty & \text{if } \lambda\notin C^*,\end{cases}$$

where

$$C^* = \{ \lambda \mid \lambda'x \le 0,\ \forall\, x\in C\}$$

is the polar cone of $C$ (note that $f_2^*$ is the support function of $C$; cf. Section
1.6 of Appendix B). The dual problem is

    minimize   $f^*(\lambda)$
    subject to $\lambda\in\hat C$,        (1.18)

where $f^*$ is the conjugate of $f$ and $\hat C$ is the negative polar cone (also called
the dual cone of $C$):

$$\hat C = -C^* = \{ \lambda \mid \lambda'x \ge 0,\ \forall\, x\in C\}.$$

Note the symmetry between primal and dual problems. The strong duality
relation $f^* = q^*$ can be written as

$$\inf_{x\in C} f(x) = -\inf_{\lambda\in\hat C} f^*(\lambda).$$

The following proposition translates the conditions of Prop. 1.2.1(a),
which guarantee that there is no duality gap and that the dual problem
has an optimal solution.

Proposition 1.2.2: (Conic Duality Theorem) Assume that the
primal conic problem (1.17) has finite optimal value, and moreover
$\mathrm{ri}\big(\mathrm{dom}(f)\big)\cap\mathrm{ri}(C) \ne \emptyset$. Then, there is no duality gap and the dual
problem (1.18) has an optimal solution.

Using the symmetry of the primal and dual problems, we also obtain
that there is no duality gap and the primal problem (1.17) has an optimal
solution if the optimal value of the dual conic problem (1.18) is finite and
$\mathrm{ri}\big(\mathrm{dom}(f^*)\big)\cap\mathrm{ri}(\hat C) \ne \emptyset$. It is also possible to derive primal and dual
optimality conditions by translating the optimality conditions of the Fenchel
duality framework [Prop. 1.2.1(c)].

Figure 1.2.1. Illustration of a linear-conic problem: minimizing a linear function
$c'x$ over the intersection of an affine set $b + S$ and a convex cone $C$.

1.2.1 Linear-Conic Problems

An important special case of conic programming, called the linear-conic
problem, arises when $\mathrm{dom}(f)$ is an affine set and $f$ is linear over $\mathrm{dom}(f)$, i.e.,

$$f(x) = \begin{cases} c'x & \text{if } x\in b + S,\\ \infty & \text{if } x\notin b + S,\end{cases}$$

where $b$ and $c$ are given vectors, and $S$ is a subspace. Then the primal
problem can be written as

    minimize   $c'x$
    subject to $x - b\in S$, $x\in C$;        (1.19)

see Fig. 1.2.1.


To derive the dual problem, we note that

$$f^*(\lambda) = \sup_{x - b\in S}(\lambda - c)'x = \sup_{y\in S}(\lambda - c)'(y + b) = \begin{cases} (\lambda - c)'b & \text{if } \lambda - c\in S^\perp,\\ \infty & \text{if } \lambda - c\notin S^\perp.\end{cases}$$

It can be seen that the dual problem $\min_{\lambda\in\hat C} f^*(\lambda)$ [cf. Eq. (1.18)], after
discarding the superfluous term $c'b$ from the cost, can be written as

    minimize   $b'\lambda$
    subject to $\lambda - c\in S^\perp$, $\lambda\in\hat C$,        (1.20)

where $\hat C$ is the dual cone:

$$\hat C = \{ \lambda \mid \lambda'x \ge 0,\ \forall\, x\in C\}.$$

By specializing the conditions of the Conic Duality Theorem (Prop. 1.2.2)
to the linear-conic duality context, we obtain the following.

Proposition 1.2.3: (Linear-Conic Duality Theorem) Assume
that the primal problem (1.19) has finite optimal value, and moreover
$(b + S)\cap\mathrm{ri}(C) \ne \emptyset$. Then, there is no duality gap and the dual problem
has an optimal solution.

Special Forms of Linear-Conic Problems

The primal and dual linear-conic problems (1.19) and (1.20) have been
placed in an elegant symmetric form. There are also other useful formats
that parallel and generalize similar formats in linear programming. For
example, we have the following dual problem pairs:

$$\min_{Ax = b,\ x\in C} c'x \quad\Longleftrightarrow\quad \max_{c - A'\lambda\in\hat C} b'\lambda,\tag{1.21}$$

$$\min_{Ax - b\in C} c'x \quad\Longleftrightarrow\quad \max_{A'\lambda = c,\ \lambda\in\hat C} b'\lambda,\tag{1.22}$$

where $A$ is an $m\times n$ matrix, and $x\in\Re^n$, $\lambda\in\Re^m$, $c\in\Re^n$, $b\in\Re^m$.

To verify the duality relation (1.21), let $\bar x$ be any vector such that
$A\bar x = b$, and let us write the primal problem on the left in the primal conic
form (1.19) as

    minimize   $c'x$
    subject to $x - \bar x\in N(A)$, $x\in C$,

where $N(A)$ is the nullspace of $A$. The corresponding dual conic problem
(1.20) is to solve for $\mu$ the problem

    minimize   $\bar x'\mu$
    subject to $\mu - c\in N(A)^\perp$, $\mu\in\hat C$.        (1.23)

Since $N(A)^\perp$ is equal to $\mathrm{Ra}(A')$, the range of $A'$, the constraints of problem
(1.23) can be equivalently written as $c - \mu\in -\mathrm{Ra}(A') = \mathrm{Ra}(A')$, $\mu\in\hat C$, or

$$c - \mu = A'\lambda, \qquad \mu\in\hat C,$$

for some $\lambda\in\Re^m$. Making the change of variables $\mu = c - A'\lambda$, the dual
problem (1.23) can be written as

    minimize   $\bar x'(c - A'\lambda)$
    subject to $c - A'\lambda\in\hat C$.

By discarding the constant $\bar x'c$ from the cost function, using the fact $A\bar x = b$,
and changing from minimization to maximization, we see that this dual
problem is equivalent to the one in the right-hand side of the duality pair
(1.21). The duality relation (1.22) is proved similarly.
We next discuss two important special cases of conic programming:
second order cone programming and semidefinite programming. These pro-
blems involve two different special cones, and an explicit definition of the
affine set constraint. They arise in a variety of applications, and their
computational difficulty in practice tends to lie between that of linear and
quadratic programming on one hand, and general convex programming on
the other hand.
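For instance (a small numerical sketch, taking $C$ to be the nonnegative orthant, which is self-dual under the above definition of the dual cone), the pair (1.21) reduces to ordinary linear programming duality, and both sides can be checked with scipy's linprog:

import numpy as np
from scipy.optimize import linprog

# Primal of (1.21) with C the nonnegative orthant: min c'x s.t. Ax = b, x >= 0.
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 0.0]])
b = np.array([4.0, 5.0])
c = np.array([2.0, 3.0, 1.0])
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3)

# Dual of (1.21): max b'lam s.t. c - A'lam in C_hat = C, i.e., A'lam <= c.
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2)

print(primal.fun, -dual.fun)          # equal optimal values: no duality gap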

1.2.2 Second Order Cone Programming

In this section we consider the linear-conic problem (1.22), with the cone

$$C = \Big\{ (x_1,\ldots,x_n) \ \Big|\ x_n \ge \sqrt{x_1^2 + \cdots + x_{n-1}^2}\, \Big\},$$

which is known as the second order cone (see Fig. 1.2.2). The dual cone is

$$\hat C = \{ y \mid 0 \le y'x,\ \forall\, x\in C\} = \Big\{ y \ \Big|\ 0 \le \inf_{\|(x_1,\ldots,x_{n-1})\| \le x_n} y'x\Big\},$$

and it can be shown that $\hat C = C$. This property is referred to as self-duality
of the second order cone, and is fairly evident from Fig. 1.2.2. For a proof,
we write

$$\inf_{\|(x_1,\ldots,x_{n-1})\|\le x_n} y'x = \inf_{x_n\ge 0}\Big\{ y_nx_n + \inf_{\|(x_1,\ldots,x_{n-1})\|\le x_n}\sum_{i=1}^{n-1} y_ix_i\Big\} = \inf_{x_n\ge 0}\big\{ y_nx_n - \|(y_1,\ldots,y_{n-1})\|\,x_n\big\} = \begin{cases} 0 & \text{if } \|(y_1,\ldots,y_{n-1})\| \le y_n,\\ -\infty & \text{otherwise},\end{cases}$$

where the second equality follows because the minimum of the inner product
of a vector $z\in\Re^{n-1}$ with vectors in the unit ball of $\Re^{n-1}$ is $-\|z\|$.
Combining the preceding two relations, we have

$$y\in\hat C \quad\text{if and only if}\quad 0 \le y_n - \|(y_1,\ldots,y_{n-1})\|,$$



Figure 1.2.2. The second order cone in $\Re^3$.

so $\hat C = C$.
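A quick numerical sanity check of this self-duality (a sketch with randomly generated vectors, not from the text) is to force two vectors into the cone and verify that their inner product is nonnegative, consistent with $\hat C = C$:

import numpy as np

def in_soc(v):
    # Membership in the second order cone of R^n: v_n >= ||(v_1, ..., v_{n-1})||.
    return v[-1] >= np.linalg.norm(v[:-1])

rng = np.random.default_rng(0)
for _ in range(10000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    x[-1] = np.linalg.norm(x[:-1]) + abs(x[-1])    # force x into C
    y[-1] = np.linalg.norm(y[:-1]) + abs(y[-1])    # force y into C
    assert in_soc(x) and in_soc(y)
    assert y @ x >= -1e-12                         # y'x >= 0, as self-duality requires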
The second order cone programming problem (SOCP for short) is

    minimize   $c'x$
    subject to $A_ix - b_i\in C_i$, $i = 1,\ldots,m$,        (1.24)

where $x\in\Re^n$, $c$ is a vector in $\Re^n$, and for $i = 1,\ldots,m$, $A_i$ is an $n_i\times n$
matrix, $b_i$ is a vector in $\Re^{n_i}$, and $C_i$ is the second order cone of $\Re^{n_i}$. It is
seen to be a special case of the primal problem in the left-hand side of the
duality relation (1.22), where

$$C = C_1\times\cdots\times C_m.$$

Note that a linear inequality constraint of the form $a_i'x - b_i \ge 0$ can be
written as

$$(0,\ a_i'x - b_i) \in C_i,$$

where $C_i$ is the second order cone of $\Re^2$. As a result, linear-conic problems
involving second order cones contain as special cases linear programming
problems.

We now observe that from the right-hand side of the duality relation
(1.22), and the self-duality relation $C = \hat C$, the corresponding dual
linear-conic problem has the form

    maximize   $\sum_{i=1}^m b_i'\lambda_i$
    subject to $\sum_{i=1}^m A_i'\lambda_i = c$, $\lambda_i\in C_i$, $i = 1,\ldots,m$,        (1.25)

where $\lambda = (\lambda_1,\ldots,\lambda_m)$. By applying the Linear-Conic Duality Theorem
(Prop. 1.2.3), we have the following.

Proposition 1.2.4: (Second Order Cone Duality Theorem)
Consider the primal SOCP (1.24), and its dual problem (1.25).
(a) If the optimal value of the primal problem is finite and there
exists a feasible solution $\bar x$ such that

$$A_i\bar x - b_i\in\mathrm{int}(C_i), \qquad i = 1,\ldots,m,$$

then there is no duality gap, and the dual problem has an optimal
solution.
(b) If the optimal value of the dual problem is finite and there exists
a feasible solution $\bar\lambda = (\bar\lambda_1,\ldots,\bar\lambda_m)$ such that

$$\bar\lambda_i\in\mathrm{int}(C_i), \qquad i = 1,\ldots,m,$$

then there is no duality gap, and the primal problem has an
optimal solution.

Note that while the Linear-Conic Duality Theorem requires a relative
interior point condition, the preceding proposition requires an interior point
condition. The reason is that the second order cone has nonempty interior,
so its relative interior coincides with its interior.

The SOCP arises in many application contexts, and significantly, it
can be solved numerically with powerful specialized algorithms that belong
to the class of interior point methods, which will be discussed in Section
6.8. We refer to the literature for a more detailed description and analysis
(see e.g., the books [BeN01], [BoV04]).
Generally, SOCPs can be recognized from the presence of convex
quadratic functions in the cost or the constraint functions. The following
are illustrative examples. The first example relates to the field of robust
optimization, which involves optimization under uncertainty described by
set membership.

Example 1.2.3: (Robust Linear Programming)

Frequently, there is uncertainty about the data of an optimization problem,
so one would like to have a solution that is adequate for a whole range of
the uncertainty. A popular formulation of this type is to assume that the
constraints contain parameters that take values in a given set, and require
that the constraints are satisfied for all values in that set. This approach is
also known as a set membership description of the uncertainty and has been
used in fields other than optimization, such as set membership estimation,
and minimax control (see the textbook [Ber07], which also surveys earlier
work).

As an example, consider the problem

    minimize   $c'x$
    subject to $a_j'x \le b_j$, $\forall\, (a_j, b_j)\in T_j$, $j = 1,\ldots,r$,        (1.26)

where $c\in\Re^n$ is a given vector, and $T_j$ is a given subset of $\Re^{n+1}$ to which
the constraint parameter vectors $(a_j, b_j)$ must belong. The vector $x$ must
be chosen so that the constraint $a_j'x \le b_j$ is satisfied for all $(a_j, b_j)\in T_j$,
$j = 1,\ldots,r$.

Generally, when $T_j$ contains an infinite number of elements, this problem
involves a correspondingly infinite number of constraints. To convert the
problem to one involving a finite number of constraints, we note that $a_j'x \le b_j$
for all $(a_j, b_j)\in T_j$ if and only if $g_j(x) \le 0$, where

$$g_j(x) = \sup_{(a_j,b_j)\in T_j}\big\{ a_j'x - b_j\big\}.\tag{1.27}$$

Thus, the robust linear programming problem (1.26) is equivalent to

    minimize   $c'x$
    subject to $g_j(x) \le 0$, $j = 1,\ldots,r$.

For special choices of the set $T_j$, the function $g_j$ can be expressed in
closed form, and in the case where $T_j$ is an ellipsoid, it turns out that the
constraint $g_j(x) \le 0$ can be expressed in terms of a second order cone. To see
this, let

$$T_j = \big\{ (a_j + P_ju_j,\ b_j + q_j'u_j) \mid \|u_j\| \le 1,\ u_j\in\Re^{n_j}\big\},\tag{1.28}$$

where $P_j$ is a given $n\times n_j$ matrix, $a_j\in\Re^n$ and $q_j\in\Re^{n_j}$ are given vectors,
and $b_j$ is a given scalar. Then, from Eqs. (1.27) and (1.28),

$$g_j(x) = \sup_{\|u_j\|\le 1}\big\{ (a_j + P_ju_j)'x - (b_j + q_j'u_j)\big\} = \sup_{\|u_j\|\le 1}(P_j'x - q_j)'u_j + a_j'x - b_j,$$

and finally

$$g_j(x) = \|P_j'x - q_j\| + a_j'x - b_j.$$

Thus, $g_j(x) \le 0$ if and only if

$$(P_j'x - q_j,\ b_j - a_j'x)\in C_j,$$

where $C_j$ is the second order cone of $\Re^{n_j+1}$; i.e., the "robust" constraint
$g_j(x) \le 0$ is equivalent to a second order cone constraint. It follows that in
the case of ellipsoidal uncertainty, the robust linear programming problem
(1.26) is a SOCP of the form (1.24).
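The closed form just derived is easy to check numerically. The sketch below (made-up data for $a_j$, $b_j$, $P_j$, $q_j$, and $x$) compares $\|P_j'x - q_j\| + a_j'x - b_j$ with a sampled approximation of the supremum over the ellipsoid (1.28); the sampled value approaches the closed form from below.

import numpy as np

rng = np.random.default_rng(1)
n, nj = 3, 2
a, P = rng.normal(size=n), rng.normal(size=(n, nj))
q, bj = rng.normal(size=nj), 1.5
x = rng.normal(size=n)

# Closed form: g_j(x) = ||P_j' x - q_j|| + a_j' x - b_j.
g_closed = np.linalg.norm(P.T @ x - q) + a @ x - bj

# Sampled supremum over the ellipsoid (1.28): u ranges over the unit ball.
U = rng.normal(size=(200000, nj))
U /= np.maximum(1.0, np.linalg.norm(U, axis=1))[:, None]
g_sampled = np.max((a + U @ P.T) @ x - (bj + U @ q))

print(g_closed, g_sampled)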

Example 1.2.4: (Quadratically Constrained Quadratic Problems)

Consider the quadratically constrained quadratic problem

minimize x' Qox + 2qbx + po


subject to x' QJX + 2q;x + PJ ::::; 0, j = 1, ... , r,

where Qo, ... , Qr are symmetric n x n positive definite matrices, qo, ... , qr
are vectors in Rn, and po, ... ,Pr are scalars. We show that the problem can
be converted to the second order cone format. A similar conversion is also
possible for the quadratic programming problem where Qo is positive definite
and QJ = 0, j = 1, ... , r.
Indeed, since each QJ is symmetric and positive definite, we have

XI QjX + 2qjX
I
+ PJ -_ (
Qj1/2 X) I Qj1/2 X + 2 ( Qj-1/2 qj ) I Qj1/2 X + PJ

= IIQ jl/2 X + Q-1/2


j qj 112 + PJ - qj'Q-1
j qj'

for j = 0, 1, ... , r. Thus, the problem can be written as

minimize IIQ~ 12 x + Q;;- 112 qoll 2 + Po - qbQo 1qo


subject to IIQ} 12 x + Q_;-1 12 qJll 2 + PJ - q;Q/qj::::; 0, j = 1, ... ,r,
or, by neglecting the constant Po - qbQ 0 1qo,

minimize IIQ~ 12 x + Q;;- 112 qoll


subject to IIQ} 12 x+Q_;-1 12 qJII::::; (qJQ 31qJ -pj)1 12 , j = l, ... ,r.

    By introducing an auxiliary variable x_{n+1}, the problem can be written as

    minimize   x_{n+1}
    subject to ‖Q_0^{1/2} x + Q_0^{-1/2} q_0‖ ≤ x_{n+1},
               ‖Q_j^{1/2} x + Q_j^{-1/2} q_j‖ ≤ (q_j'Q_j^{-1} q_j − p_j)^{1/2},   j = 1, ..., r.

It can be seen that this problem has the second order cone form (1.24). In
particular, the first constraint is of the form A_0 x − b_0 ∈ C, where C is the
second order cone of ℜ^{n+1} and the (n+1)st component of A_0 x − b_0 is x_{n+1}.
The remaining r constraints are of the form A_j x − b_j ∈ C, where the (n+1)st
component of A_j x − b_j is the scalar (q_j'Q_j^{-1} q_j − p_j)^{1/2}.
    We finally note that the problem of this example is special in that it
has no duality gap, assuming its optimal value is finite, i.e., there is no need
for the interior point conditions of Prop. 1.2.4. This can be traced to the fact
that linear transformations preserve the closure of sets defined by quadratic
constraints (see e.g., [BNO03], Section 1.5.2).

1.2.3 Semidefinite Programming

In this section we consider the linear-conic problem (1.21) with C being the
cone of matrices that are positive semidefinite.† This is called the positive
semidefinite cone. To define the problem, we view the space of symmetric
n × n matrices as the space ℜ^{n²} with the inner product

    < X, Y > = trace(XY) = Σ_{i=1}^n Σ_{j=1}^n x_{ij} y_{ij}.

The interior of C is the set of positive definite matrices.
    The dual cone is

    Ĉ = { Y | trace(XY) ≥ 0, ∀ X ∈ C },

and it can be shown that C = Ĉ, i.e., C is self-dual. Indeed, if Y ∉ C,
there exists a vector v ∈ ℜ^n such that

    0 > v'Yv = trace(vv'Y).

Hence the positive semidefinite matrix X = vv' satisfies 0 > trace(XY),
so Y ∉ Ĉ and it follows that Ĉ ⊂ C. Conversely, let Y ∈ C, and let X be
any positive semidefinite matrix. We can express X as

    X = Σ_{i=1}^n λ_i e_i e_i',

where λ_i are the nonnegative eigenvalues of X, and e_i are corresponding
orthonormal eigenvectors. Then,

    trace(XY) = trace( Y Σ_{i=1}^n λ_i e_i e_i' ) = Σ_{i=1}^n λ_i e_i'Y e_i ≥ 0.

It follows that Y ∈ Ĉ and C ⊂ Ĉ. Thus C is self-dual, C = Ĉ.

    † As noted in Appendix A, throughout this book a positive semidefinite ma-
trix is implicitly assumed to be symmetric.


    The semidefinite programming problem (SDP for short) is to mini-
mize a linear function of a symmetric matrix over the intersection of an
affine set with the positive semidefinite cone. It has the form

    minimize   < D, X >
    subject to < A_i, X > = b_i,  i = 1, ..., m,   X ∈ C,                 (1.29)

where D, A_1, ..., A_m are given n × n symmetric matrices, and b_1, ..., b_m
are given scalars. It is seen to be a special case of the primal problem in
the left-hand side of the duality relation (1.21).
    We can view the SDP as a problem with linear cost, linear constraints,
and a convex set constraint. Then, similar to the case of SOCP, it can be
verified that the dual problem (1.20), as given by the right-hand side of the
duality relation (1.21), takes the form

    maximize   b'λ
    subject to D − (λ_1 A_1 + · · · + λ_m A_m) ∈ C,                        (1.30)

where b = (b_1, ..., b_m) and the maximization is over the vector λ =
(λ_1, ..., λ_m). By applying the Linear-Conic Duality Theorem (Prop. 1.2.3),
we have the following proposition.

Proposition 1.2.5: (Semidefinite Duality Theorem) Consider
the primal SDP (1.29), and its dual problem (1.30).

    (a) If the optimal value of the primal problem is finite and there
        exists a primal-feasible solution, which is positive definite, then
        there is no duality gap, and the dual problem has an optimal
        solution.

    (b) If the optimal value of the dual problem is finite and there exist
        scalars λ̄_1, ..., λ̄_m such that D − (λ̄_1 A_1 + · · · + λ̄_m A_m) is positive
        definite, then there is no duality gap, and the primal problem
        has an optimal solution.

The SDP is a fairly general problem. In particular, it can be shown


that a SOCP can be cast as a SDP. Thus SDP involves a more general
structure than SOCP. This is consistent with the practical observation that
the latter problem is generally more amenable to computational solution.
We provide some examples of problem formulation as an SDP.

Example 1.2.5: (Minimizing the Maximum Eigenvalue)

Given a symmetric n × n matrix M(λ), which depends on a parameter vector
λ = (λ_1, ..., λ_m), we want to choose λ so as to minimize the maximum
eigenvalue of M(λ). We pose this problem as

    minimize   z
    subject to maximum eigenvalue of M(λ) ≤ z,

or equivalently

    minimize   z
    subject to zI − M(λ) ∈ C,

where I is the n × n identity matrix, and C is the semidefinite cone. If M(λ)
is an affine function of λ,

    M(λ) = M_0 + λ_1 M_1 + · · · + λ_m M_m,

the problem has the form of the dual problem (1.30), with the optimization
variables being (z, λ_1, ..., λ_m).
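    The SDP formulation above would normally be passed to a semidefinite
programming solver. As a purely illustrative alternative (an assumption of this
sketch, not the method of the text), the maximum eigenvalue can also be minimized
directly by the subgradient method of Section 2.1.3, using the fact that if v is a
unit eigenvector for the largest eigenvalue of M(λ), then the vector with components
v'M_i v is a subgradient of λ_max(M(·)) at λ.

import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 3
M0 = rng.standard_normal((n, n)); M0 = (M0 + M0.T) / 2
Ms = []
for _ in range(m):
    A = rng.standard_normal((n, n))
    Ms.append((A + A.T) / 2)                  # the matrices M_1, ..., M_m

def max_eig_and_vec(lam):
    M = M0 + sum(l * Mi for l, Mi in zip(lam, Ms))
    w, V = np.linalg.eigh(M)
    return w[-1], V[:, -1]                    # largest eigenvalue and a unit eigenvector

lam = np.zeros(m)
for k in range(1, 2001):
    f, v = max_eig_and_vec(lam)
    g = np.array([v @ Mi @ v for Mi in Ms])   # a subgradient of lambda_max(M(.)) at lam
    lam -= (1.0 / k) * g                      # diminishing stepsize
print(max_eig_and_vec(lam)[0])                # an approximation of the minimal max eigenvalue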

Example 1.2.6: (Semidefinite Relaxation - Lower Bounds for Discrete Optimization Problems)

Semidefinite programming provides a means for deriving lower bounds to
the optimal value of several types of discrete optimization problems. As an
example, consider the following quadratic problem with quadratic equality
constraints

    minimize   x'Q_0 x + a_0'x + b_0
    subject to x'Q_i x + a_i'x + b_i = 0,   i = 1, ..., m,                (1.31)

where Q_0, ..., Q_m are symmetric n × n matrices, a_0, ..., a_m are vectors in
ℜ^n, and b_0, ..., b_m are scalars.
    This problem can be used to model broad classes of discrete optimiza-
tion problems. To see this, consider an integer constraint that a variable x_i
must be either 0 or 1. Such a constraint can be expressed by the quadratic
equality x_i² − x_i = 0. Furthermore, a linear inequality constraint a_j'x ≤ b_j can
be expressed as the quadratic equality constraint y_j² + a_j'x − b_j = 0, where y_j
is an additional variable.
    Introducing a multiplier vector λ = (λ_1, ..., λ_m), the dual function is
given by

    q(λ) = inf_{x∈ℜ^n} { x'Q(λ)x + a(λ)'x + b(λ) },

where

    Q(λ) = Q_0 + Σ_{i=1}^m λ_i Q_i,   a(λ) = a_0 + Σ_{i=1}^m λ_i a_i,   b(λ) = b_0 + Σ_{i=1}^m λ_i b_i.

    Let f* and q* be the optimal values of problem (1.31) and its dual,
and note that by weak duality, we have f* ≥ q*. By introducing an auxiliary
scalar variable ξ, we see that the dual problem is to find a pair (ξ, λ) that
solves the problem

    maximize   ξ
    subject to q(λ) ≥ ξ.

The constraint q(λ) ≥ ξ of this problem can be written as

    inf_{x∈ℜ^n} { x'Q(λ)x + a(λ)'x + b(λ) − ξ } ≥ 0,

or equivalently, introducing a scalar variable t and multiplying with t²,

    inf_{x∈ℜ^n, t∈ℜ} { (tx)'Q(λ)(tx) + a(λ)'(tx)t + (b(λ) − ξ)t² } ≥ 0.

Writing y = tx, this relation takes the form of a quadratic in (y, t),

    inf_{y∈ℜ^n, t∈ℜ} { y'Q(λ)y + a(λ)'y t + (b(λ) − ξ)t² } ≥ 0,

or

    ( Q(λ)         (1/2)a(λ) )
    ( (1/2)a(λ)'   b(λ) − ξ  )  ∈ C,                                      (1.32)

where C is the positive semidefinite cone. Thus the dual problem is equivalent
to the SDP of maximizing ξ over all (ξ, λ) satisfying the constraint (1.32), and
its optimal value q* is a lower bound to f*.
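    As a numerical sanity check (illustrative only, with randomly generated data),
the next numpy sketch verifies that, for a multiplier vector λ with Q(λ) positive
definite, the largest ξ for which the matrix in Eq. (1.32) is positive semidefinite
coincides with q(λ) = b(λ) − (1/4)a(λ)'Q(λ)^{−1}a(λ); the bisection over ξ is only a
device for the check, not a solution method.

import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 2
def sym(A): return (A + A.T) / 2
Q = [sym(rng.standard_normal((n, n))) + 6 * np.eye(n) for _ in range(m + 1)]
a = [rng.standard_normal(n) for _ in range(m + 1)]
b = rng.standard_normal(m + 1)

lam = np.abs(rng.standard_normal(m))           # a multiplier vector with Q(lam) > 0
Qlam = Q[0] + sum(l * Qi for l, Qi in zip(lam, Q[1:]))
alam = a[0] + sum(l * ai for l, ai in zip(lam, a[1:]))
blam = b[0] + lam @ b[1:]

# When Q(lam) is positive definite, the unconstrained quadratic has the value
#   q(lam) = b(lam) - (1/4) a(lam)' Q(lam)^{-1} a(lam).
q_lam = blam - 0.25 * alam @ np.linalg.solve(Qlam, alam)

# The largest xi for which the block matrix (1.32) is PSD should equal q(lam).
def min_eig(xi):
    K = np.block([[Qlam, 0.5 * alam[:, None]],
                  [0.5 * alam[None, :], np.array([[blam - xi]])]])
    return np.linalg.eigvalsh(K)[0]

lo, hi = q_lam - 10.0, q_lam + 10.0            # bisection on xi
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if min_eig(mid) >= 0 else (lo, mid)
print(q_lam, lo)                               # the two values should nearly agree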

1.3 ADDITIVE COST PROBLEMS

In this section we focus on a structural characteristic that arises in several


important contexts: a cost function f that is the sum of a large number of
components f_i : ℜ^n ↦ ℜ,

    f(x) = Σ_{i=1}^m f_i(x).                                              (1.33)

Such cost functions can be minimized with specialized methods, called in-
cremental, which exploit their additive structure, by updating x using one
component function f_i at a time (see Section 2.1.5). Problems with ad-
ditive cost functions can also be treated with specialized outer and inner
linearization methods that approximate the component functions f_i indi-
vidually (rather than approximating f); see Section 4.4.
An important special case is the cost function of the dual of a sepa-
rable problem
    maximize   Σ_{i=1}^m q_i(µ)
    subject to µ ≥ 0,

where

    q_i(µ) = inf_{x_i ∈ X_i} { f_i(x_i) + Σ_{j=1}^r µ_j g_{ij}(x_i) },

and µ = (µ_1, ..., µ_r) [cf. Eq. (1.8)]. After a sign change to convert to
minimization it takes the form (1.33) with f_i(µ) = −q_i(µ). This is a major
class of additive cost problems.
We will next describe some applications from a variety of fields. The
following five examples arise in many machine learning contexts.

Example 1.3.1: (Regularized Regression)

This is a broad class of applications that relate to parameter estimation. The


cost function involves a sum of terms Ji(x), each corresponding to the er-
ror between some data and the output of a parametric model, with x being
the vector of parameters. An example is linear least squares problems, also
referred to as linear regression problems, where f; has quadratic structure.
Often a convex regularization function R( x) is added to the least squares ob-
jective, to induce desirable properties of the solution and/or the corresponding
algorithms. This gives rise to problems of the form

    minimize   R(x) + (1/2) Σ_{i=1}^m (c_i'x − b_i)²
    subject to x ∈ ℜ^n,

where Ci and bi are given vectors and scalars, respectively. The regularization
function R is often taken to be differentiable, and particularly quadratic.
However, there are practically important examples of nondifferentiable choices
(see the next example).
In statistical applications, such a problem arises when constructing a
linear model for an unknown input-output relation. The model involves a
vector of parameters x, to be determined, which weigh input data (the com-
ponents of the vectors c_i). The inner products c_i'x produced by the model are
matched against the scalars b_i, which are observed output data, corresponding
to inputs c_i from the true input-output relation that we try to represent. The
optimal vector of parameters x* provides the model that (in the absence of a
regularization function) minimizes the sum of the squared errors (c_i'x* − b_i)².
    In a more general version of the problem, a nonlinear parametric model
is constructed, giving rise to a nonlinear least squares problem of the form

    minimize   R(x) + Σ_{i=1}^m |g_i(x)|²
    subject to x ∈ ℜ^n,

where g_i : ℜ^n ↦ ℜ are given nonlinear functions that depend on the data.
This is also a common problem, referred to as nonlinear regression, which,
however, is often nonconvex [it is convex if the functions g_i are convex and
also nonnegative, i.e., g_i(x) ≥ 0 for all x ∈ ℜ^n].

    It is also possible to use a nonquadratic function of the error between
some data and the output of a linear parametric model. Thus in place of the
squared error (1/2)(c_i'x − b_i)², we may use h_i(c_i'x − b_i), where h_i : ℜ ↦ ℜ is
a convex function, leading to the problem

    minimize   R(x) + Σ_{i=1}^m h_i(c_i'x − b_i)
    subject to x ∈ ℜ^n.

Generally the choice of the function h_i is dictated by statistical modeling
considerations, for which the reader may consult the relevant literature. An
example is the absolute value function h_i(y) = |y|,
which tends to result in a more robust estimate than least squares in the
presence of large outliers in the data. This is known as the least absolute
deviations method.
There are also constrained variants of the problems just discussed,
where the parameter vector x is required to belong to some subset of Rn,
such as the nonnegative orthant or a "box" formed by given upper and lower
bounds on the components of x. Such constraints may be used to encode into
the model some prior knowledge about the nature of the solution.
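    For the special case of quadratic regularization R(x) = (γ/2)‖x‖² (an assumption
made only for this sketch), the regularized least squares problem has the closed-form
solution of the normal equations, as the following numpy fragment with synthetic data
illustrates.

import numpy as np

rng = np.random.default_rng(4)
m, n = 50, 8
C = rng.standard_normal((m, n))        # rows are the data vectors c_i'
b = C @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)
gamma = 0.5                            # regularization parameter (illustrative)

# Quadratic (ridge) regularization R(x) = (gamma/2)||x||^2:
# minimize (gamma/2)||x||^2 + (1/2) sum_i (c_i'x - b_i)^2,
# whose minimizer solves the normal equations (C'C + gamma I) x = C'b.
x_ridge = np.linalg.solve(C.T @ C + gamma * np.eye(n), C.T @ b)
print(x_ridge)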

Example 1.3.2: (ℓ1-Regularization)

A popular approach to regularized regression involves ℓ1-regularization, where

    R(x) = γ‖x‖_1 = γ Σ_{j=1}^n |x^j|,

γ is a positive scalar and x^j is the jth coordinate of x. The reason for the
popularity of the ℓ1 norm ‖x‖_1 is that it tends to produce optimal solutions
where a greater number of components x^j are zero, relative to the case of
quadratic regularization (see Fig. 1.3.1). This is considered desirable in many
statistical applications, where the number of parameters to include in a model
may not be known a priori; see e.g., [Tib96], [DoE03], [BJM12]. The special
case where a linear least squares model is used,

    minimize   γ‖x‖_1 + (1/2) Σ_{i=1}^m (c_i'x − b_i)²
    subject to x ∈ ℜ^n,

is known as the lasso problem.


    In a generalization of the lasso problem, the ℓ1 regularization function
‖x‖_1 is replaced by a scaled version ‖Sx‖_1, where S is some scaling matrix.

Figure 1.3.1. Illustration of the effect of ℓ1-regularization for cost functions
of the form γ‖x‖_1 + F(x), where γ > 0 and F : ℜ^n ↦ ℜ is differentiable (figure
on the left-hand side). The optimal solution x* tends to have more zero com-
ponents than in the corresponding quadratic regularization case, illustrated
on the right-hand side.

The term ‖Sx‖_1 then induces a penalty on some undesirable characteristic of
the solution. For example the problem

    minimize   γ Σ_{i=1}^{n−1} |x_{i+1} − x_i| + (1/2) Σ_{i=1}^m (c_i'x − b_i)²
    subject to x ∈ ℜ^n,

is known as the total variation denoising problem; see e.g., [ROF92], [Cha04],
[BeT09a]. The regularization term here encourages consecutive variables to
take similar values, and tends to produce more smoothly varying solutions.
    Another related example is matrix completion with nuclear norm regu-
larization; see e.g., [CaR09], [CaT10], [RFP10], [Rec11], [ReR13]. Here the
minimization is over all m × n matrices X, with components denoted X_{ij}. We
have a set of entries M_{ij}, (i, j) ∈ Ω, where Ω is a subset of index pairs, and
we want to find X whose entries X_{ij} are close to M_{ij} for (i, j) ∈ Ω, and has as
small rank as possible, a property that is desirable on the basis of statistical
considerations. The following more tractable version of the problem is solved
instead:

    minimize   γ‖X‖_* + (1/2) Σ_{(i,j)∈Ω} (X_{ij} − M_{ij})²
    subject to X ∈ ℜ^{m×n},

where ‖X‖_* is the nuclear norm of X, defined as the sum of the singular
values of X. There is substantial theory that justifies this approximation,
for which we refer to the literature. It turns out that the nuclear norm is a
convex function with some nice properties. In particular, its subdifferential
at any X can be conveniently characterized for use in algorithms.

    Let us finally note that sometimes additional regularization functions
are used in conjunction with ℓ1-type terms. An example is the sum of a
quadratic and an ℓ1-type term.
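    A simple method for the lasso problem is the proximal gradient (iterative
shrinkage) iteration, of the type discussed later in the book; the following numpy
sketch applies it to synthetic data (the stepsize rule and data are illustrative
assumptions, not prescriptions of the text).

import numpy as np

rng = np.random.default_rng(5)
m, n = 40, 100
C = rng.standard_normal((m, n))
x_true = np.zeros(n); x_true[:5] = rng.standard_normal(5)   # sparse ground truth
b = C @ x_true + 0.01 * rng.standard_normal(m)
gamma = 0.1

# Proximal gradient (ISTA) iteration for the lasso:
#   x^{k+1} = shrink(x^k - alpha * C'(Cx^k - b), alpha * gamma),
# where shrink is soft-thresholding, the proximal operator of gamma*||.||_1.
alpha = 1.0 / np.linalg.norm(C, 2) ** 2       # stepsize 1/L, L = largest eigenvalue of C'C
x = np.zeros(n)
for _ in range(500):
    z = x - alpha * C.T @ (C @ x - b)
    x = np.sign(z) * np.maximum(np.abs(z) - alpha * gamma, 0.0)
print(np.count_nonzero(np.abs(x) > 1e-8))     # typically far fewer than n nonzeros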

Example 1.3.3: (Classification)

In the regression problems of the preceding examples we aim to construct a


parametric model that matches well an input-output relationship based on
given data. Similar problems arise in a classification context, where we try to
construct a parametric model for predicting whether an object with certain
characteristics (also called features) belongs to a given category or not.
    We assume that each object is characterized by a feature vector c that
belongs to ℜ^n and a label b that takes the values +1 or −1, if the object
belongs to the category or not, respectively. As illustration consider a credit
card company that wishes to classify applicants as "low risk" (+1) or "high
risk" (−1), with each customer characterized by n scalar features of financial
and personal type.
    We are given data, which is a set of feature-label pairs (c_i, b_i), i =
1, ..., m. Based on this data, we want to find a parameter vector x ∈ ℜ^n and
a scalar y ∈ ℜ such that the sign of c'x + y is a good predictor of the label
of an object with feature vector c. Thus, loosely speaking, x and y should be
such that for "most" of the given feature-label data (c_i, b_i) we have

    c_i'x + y > 0,   if b_i = +1,
    c_i'x + y < 0,   if b_i = −1.


In the statistical literature, c'x + y is often called the discriminant function,
and the value of

    b_i(c_i'x + y),

for a given object i provides a measure of "margin" to misclassification of
the object. In particular, a classification error is made for object i when
b_i(c_i'x + y) < 0.
    Thus it makes sense to formulate classification as an optimization prob-
lem where negative values of b_i(c_i'x + y) are penalized. This leads to the
problem

    minimize   R(x) + Σ_{i=1}^m h(b_i(c_i'x + y))
    subject to x ∈ ℜ^n,  y ∈ ℜ,
where R is a suitable regularization function, and h : ℜ ↦ ℜ is a convex
function that penalizes negative values of its argument. It would make some
sense to use a penalty of one unit for misclassification, i.e.,

    h(z) = { 0   if z ≥ 0,
           { 1   if z < 0,

but such a penalty function is discontinuous. To obtain a continuous cost
function, we allow a continuous transition of h from negative to positive
values, leading to a variety of nonincreasing functions h. The choice of h
depends on the given application and other theoretical considerations for
which we refer to the literature. Some common examples are

    h(z) = e^{−z},               (exponential loss),
    h(z) = log(1 + e^{−z}),      (logistic loss),
    h(z) = max{0, 1 − z},        (hinge loss).
For the case of logistic loss the method comes under the methodology of lo-
gistic regression, and for the case of hinge loss the method comes under the
methodology of support vector machines. As in the case of regression, the reg-
ularization function R could be quadratic, the ℓ1 norm, or some scaled version
or combination thereof. There is extensive literature on these methodologies
and their applications, to which we refer for further discussion.
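    As an illustration of the logistic loss case with quadratic regularization, the
following numpy sketch fits (x, y) by plain gradient descent on synthetic feature-label
data; the data, stepsize, and regularization weight are arbitrary choices made only
for the example.

import numpy as np

rng = np.random.default_rng(6)
m, n = 200, 5
C = rng.standard_normal((m, n))                      # feature vectors c_i as rows
b = np.sign(C @ np.array([1., -2., 0.5, 0., 1.]) + 0.3 + 0.2 * rng.standard_normal(m))
gamma = 0.1                                          # quadratic regularization weight

def loss_grad(x, y):
    margins = b * (C @ x + y)                        # b_i (c_i'x + y)
    s = 1.0 / (1.0 + np.exp(margins))                # derivative factor of log(1 + e^{-margin})
    gx = gamma * x - C.T @ (b * s)
    gy = -np.sum(b * s)
    return gx, gy

x, y = np.zeros(n), 0.0
for _ in range(2000):                                # plain gradient descent
    gx, gy = loss_grad(x, y)
    x -= 0.01 * gx
    y -= 0.01 * gy
print(np.mean(np.sign(C @ x + y) == b))              # training classification accuracy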

Example 1.3.4: (Nonnegative Matrix Factorization)

The nonnegative matrix factorization problem is to approximately factor a
given nonnegative matrix B as CX, where C and X are nonnegative matrices
to be determined via the optimization

    minimize   ‖CX − B‖_F²
    subject to C ≥ 0,  X ≥ 0.
Here ‖·‖_F denotes the Frobenius norm of a matrix (‖M‖_F² is the sum of the
squares of the scalar components of M). The matrices B, C, and X must have
compatible dimensions, with the column dimension of C usually being much
smaller than its row dimension, so that CX is a low-rank approximation of
B. In some versions of the problem some of the nonnegativity constraints on
the components of C and X may be relaxed. Moreover, regularization terms
may be added to the cost function to induce sparsity or some other effect,
similar to earlier examples in this section.
This problem, formulated in the 90s, [PaT94], [Paa97], [LeS99], has
become a popular model for regression-type applications such as the ones
of Example 1.3.1, but with the vectors c_i in the least squares objective
Σ_{i=1}^m (c_i'x − b_i)² being unknown and subject to optimization. In the regres-
sion context of Example 1.3.1, we aim to (approximately) represent the data
in the range space of the matrix C whose rows are the vectors c;, and we
may view C as a matrix of known basis functions. In the matrix factorization
context of the present example, we aim to discover a "good" matrix C of basis
functions that represents well the given data, i.e., the matrix B.
An important characteristic of the problem is that its cost function is
not convex jointly in (C,X). However, it is convex in each of the matrices C
and X individually, when the other matrix is held fixed. This facilitates the
application of algorithms that involve alternate minimizations with respect
to C and with respect to X; see Section 6.5. We refer to the literature, e.g.,
the papers [BBL07], [Lin07], [GoZ12], for a discussion of related algorithmic
issues.
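    One concrete instance of such an alternating scheme is the multiplicative
update rule of [LeS99] for the Frobenius-norm problem above; the following numpy
sketch (synthetic data, illustrative iteration count) alternates the two updates,
each of which does not increase the cost.

import numpy as np

rng = np.random.default_rng(7)
m, n, k = 30, 20, 4
B = rng.random((m, k)) @ rng.random((k, n))          # a nonnegative matrix of rank <= k

C = rng.random((m, k))
X = rng.random((k, n))
eps = 1e-12                                          # guards against division by zero
for _ in range(500):
    # Multiplicative updates; each step does not increase ||CX - B||_F^2.
    X *= (C.T @ B) / (C.T @ C @ X + eps)
    C *= (B @ X.T) / (C @ X @ X.T + eps)
print(np.linalg.norm(C @ X - B, 'fro'))              # residual should be small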

Example 1.3.5: (Maximum Likelihood Estimation)

The maximum likelihood approach is a major statistical inference methodol-


ogy for parameter estimation, which is described in many sources (see e.g., the
textbooks [Was04], [HTF09]). In fact in many cases, a maximum likelihood
formulation is used to provide a probabilistic justification of the regression
and classification models of the preceding examples.
    Here we observe a sample of a random vector Z whose distribution
P_Z(·; x) depends on an unknown parameter vector x ∈ ℜ^n. For simplicity
we assume that Z can take only a finite set of values, so that P_Z(z; x) is the
probability that Z takes the value z when the parameter vector has the value
x. We estimate x based on the given sample value z, by solving the problem

    maximize   P_Z(z; x)
    subject to x ∈ ℜ^n.                                                   (1.34)

The cost function Pz(z; ·) of this problem may either have an additive
structure or may be equivalent to a problem that has an additive structure.
For example the event that Z = z may be the union of a large number of
disjoint events, so Pz(z; x) is the sum of the probabilities of these events. For
another important context, suppose that the data z consists of m independent
samples z 1 , ... , Zm drawn from a distribution P(·; x), in which case

    P_Z(z; x) = P(z_1; x) · · · P(z_m; x).

Then the maximization (1.34) is equivalent to the additive cost minimization

    minimize   Σ_{i=1}^m f_i(x)
    subject to x ∈ ℜ^n,

where

    f_i(x) = − log P(z_i; x).
In many applications the number of samples m is very large, in which case
special methods that exploit the additive structure of the cost are recom-
mended. Often a suitable regularization term is added to the cost function,
similar to the preceding examples.
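    As a toy illustration of how the maximization (1.34) turns into an additive
cost minimization, consider m independent Bernoulli samples with unknown success
probability (an assumption made only for this sketch); the following numpy fragment
minimizes the sum of negative log-likelihood terms over a grid and recovers, as
expected, the sample mean.

import numpy as np

rng = np.random.default_rng(8)
p_true = 0.3
z = (rng.random(1000) < p_true).astype(float)     # m independent Bernoulli samples

# f_i(p) = -log P(z_i; p), with P(1; p) = p and P(0; p) = 1 - p.
def neg_log_lik(p):
    return -np.sum(z * np.log(p) + (1 - z) * np.log(1 - p))

# Crude minimization over a grid; the minimizer should be close to the sample mean.
grid = np.linspace(0.01, 0.99, 981)
p_hat = grid[np.argmin([neg_log_lik(p) for p in grid])]
print(p_hat, z.mean())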

Example 1.3.6: (Minimization of an Expected Value - Stochastic Programming)

An important context where additive cost functions arise is the minimization


of an expected value

minimize E{ F(x, w)}


subject to x E X,

where w is a random variable taking a finite but very large number of values
w_i, i = 1, ..., m, with corresponding probabilities π_i. Then the cost function
consists of the sum of the m functions π_i F(x, w_i).
    For example, in stochastic programming, a classical model of two-stage
optimization under uncertainty, a vector x ∈ X is selected, a random event
occurs that has m possible outcomes w_1, ..., w_m, and another vector y ∈ Y
is selected with knowledge of the outcome that occurred (see e.g., the books
[BiL97], [KaW94], [Pre95], [SDR09]). Then for optimization purposes, we
need to specify a different vector y_i ∈ Y for each outcome w_i. The problem
is to minimize the expected cost

    F(x) + Σ_{i=1}^m π_i G_i(y_i),

where G_i(y_i) is the cost associated with the choice y_i and the occurrence
of w_i, and π_i is the corresponding probability. This is a problem with an
additive cost function.
Additive cost functions also arise when the expected value cost function
E{F(x, w)} is approximated by an m-sample average

    f(x) = (1/m) Σ_{i=1}^m F(x, w_i),

where w_i are independent samples of the random variable w. The minimum
of the sample average f(x) is then taken as an approximation of the minimum
of E{F(x, w)}.
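    The following numpy sketch illustrates the sample average approximation with
a newsvendor-type cost F(x, w) = cx − p min(x, w); this particular cost and the data
are assumptions made for the example, not part of the text, and the sample average
is minimized by a simple search over candidate order quantities.

import numpy as np

rng = np.random.default_rng(9)
demand = rng.poisson(20, size=1000)      # samples w_i of the random demand w
c, p = 1.0, 1.5                          # unit ordering cost and selling price (illustrative)

# Newsvendor-type cost F(x, w) = c*x - p*min(x, w); its sample average is convex in x.
def saa_cost(x):
    return np.mean(c * x - p * np.minimum(x, demand))

xs = np.arange(0, 60)
x_best = xs[np.argmin([saa_cost(x) for x in xs])]
print(x_best)                            # minimizer of the sample average approximation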

Generally additive cost problems arise when we want to strike a bal-


ance between several types of costs by lumping them into a single cost
function. The following is an example of a different character than the
preceding ones.

Example 1.3.7: (Weber Problem in Location Theory)

A basic problem in location theory is to find a point x in the plane whose


sum of weighted distances from a given set of points y1, ... , Ym is minimized.
Mathematically, the problem is

    minimize   Σ_{i=1}^m w_i ‖x − y_i‖
    subject to x ∈ ℜ²,
where w1, ... , Wm are given positive scalars. This problem has many varia-
tions, including constrained versions, and descends from the famous Fermat-
Torricelli-Viviani problem (see [BMS99] for an account of the history of this
problem). We refer to the book [DrH04] for a survey of recent research, and
to the paper [BeT10] for a discussion that is relevant to our context.
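    A classical method for the Weber problem is the Weiszfeld fixed point
iteration, mentioned here only as an illustration (it is not developed in this
section). A minimal numpy sketch with synthetic data follows; the iteration must be
safeguarded when an iterate coincides with one of the points y_i.

import numpy as np

rng = np.random.default_rng(10)
m = 7
Y = rng.standard_normal((m, 2))          # the given points y_1, ..., y_m in the plane
w = rng.random(m) + 0.5                  # positive weights w_1, ..., w_m

x = Y.mean(axis=0)                       # start at the centroid
for _ in range(200):
    d = np.linalg.norm(Y - x, axis=1)
    if np.any(d < 1e-12):                # x coincides with a data point; stop
        break
    coef = w / d
    x = (coef[:, None] * Y).sum(axis=0) / coef.sum()   # Weiszfeld update

print(x, np.sum(w * np.linalg.norm(Y - x, axis=1)))    # point and attained cost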

The structure of the additive cost function (1.33) often facilitates the
use of a distributed computing system that is well-suited for the incremental
approach. The following is an illustrative example.

Example 1.3.8: (Distributed Incremental Optimization - Sensor Networks)

Consider a network of m sensors where data are collected and are used to solve
some inference problem involving a parameter vector x. If f;(x) represents an
error penalty for the data collected by the ith sensor, the inference problem
involves an additive cost function Σ_{i=1}^m f_i. While it is possible to collect all
the data at a fusion center where the problem will be solved in centralized
manner, it may be preferable to adopt a distributed approach in order to
save in data communication overhead and/or take advantage of parallelism
in computation. In such an approach the current iterate Xk is passed on from
one sensor to another, with each sensor i performing an incremental iteration
involving just its local component f;. The entire cost function need not be
known at any one location. For further discussion we refer to representative
sources such as [RaN04], [RaN05], [BHG08], [MRS10], [GSW12], and [Say14].
The approach of computing incrementally the values and subgradients
of the components f; in a distributed manner can be substantially extended
to apply to general systems of asynchronous distributed computation, where
the components are processed at the nodes of a computing network, and the
results are suitably combined [NBB01] (see our discussion in Sections 2.1.5
and 2.1.6).

Let us finally note a constrained version of additive cost problems


where the functions f_i are extended real-valued. This is essentially equiv-
alent to constraining x to lie in the intersection of the domains

    X_i = dom(f_i),

resulting in a problem of the form

    minimize   Σ_{i=1}^m f_i(x)
    subject to x ∈ ∩_{i=1}^m X_i,

where each f_i is real-valued over the set X_i. Methods that are well-suited
for the unconstrained version of the problem where Xi = Rn can often be
modified to apply to the constrained version, as we will see in Chapter 6,
where we will discuss incremental constraint projection methods. However,
the case of constraint sets with many components arises independently of
whether the cost function is additive or not, and has its own character, as
we discuss in the next section.

1.4 LARGE NUMBER OF CONSTRAINTS

In this section we consider problems of the form

    minimize   f(x)
    subject to x ∈ X,  g_j(x) ≤ 0,  j = 1, ..., r,                        (1.35)

where the number r of constraints is very large. Problems of this type occur
often in practice, either directly or via reformulation from other problems.
A similar type of problem arises when the abstract constraint set X consists
of the intersection of many simpler sets:

    X = ∩_{ℓ∈L} X_ℓ,

where L is a finite or infinite index set. There may or may not be additional
inequality constraints gj(x) ::; 0 like the ones in problem (1.35). We provide
a few examples.

Example 1.4.1: (Feasibility and Minimum Distance Problems)

A simple but important problem, which arises in many contexts and embodies
important algorithmic ideas, is a classical feasibility problem, where the ob-
jective is to find a common point within a collection of sets X_ℓ, ℓ ∈ L, where
each X_ℓ is a closed convex set. In the feasibility problem the cost function
is zero. A somewhat more complex problem with a similar structure arises
when there is a cost function, i.e., a problem of the form

    minimize   f(x)
    subject to x ∈ ∩_{ℓ∈L} X_ℓ,

where f : ℜ^n ↦ ℜ. An important example is the minimum distance problem,
where

    f(x) = ‖x − z‖,

for a given vector z and some norm ‖·‖. The following example is a special
case.

Example 1.4.2: (Basis Pursuit)

Consider the problem


    minimize   ‖x‖_1
    subject to Ax = b,                                                    (1.36)

where ‖·‖_1 is the ℓ1 norm in ℜ^n, A is a given m × n matrix, and b is
a vector in Rm that consists of m given measurements. We are trying to
construct a linear model of the form Ax = b, where x is a vector of n scalar

weights for a large number n of basis functions (m < n). We want to satisfy
exactly the measurement equations Ax = b, while using only a few of the
basis functions in our model. Consequently, we introduce the ℓ1 norm in the
cost function of problem (1.36), aiming to delineate a small subset of basis
functions, corresponding to nonzero coordinates of x at the optimal solution.
This is called the basis pursuit problem (see, e.g., [CDS01], [VaF08]), and its
underlying idea is similar to the one of ℓ1-regularization (cf. Example 1.3.2).
    It is also possible to consider a norm other than ℓ1 in Eq. (1.36). An
example is the atomic norm ‖·‖_A induced by a subset A that is centrally
symmetric around the origin (a ∈ A if and only if −a ∈ A):

    ‖x‖_A = inf{ t > 0 | x ∈ t · conv(A) }.

This problem, and other related problems involving atomic norms, have many
applications; see for example [CRP12], [SBT12], [RSW13].
    A related problem is

    minimize   ‖X‖_*
    subject to AX = B,

where the optimization is over all m × n matrices X. The matrices A, B
are given and have dimensions ℓ × m and ℓ × n, respectively, and ‖X‖_* is
the nuclear norm of X. This problem aims to produce a low-rank matrix X
that satisfies an underdetermined set of linear equations AX = B (see e.g.,
[CaR09], [RFP10], [RXB11]). When these equations specify that a subset of
entries X_{ij}, (i, j) ∈ Ω, are fixed at given values M_{ij},

    X_{ij} = M_{ij},   (i, j) ∈ Ω,

we obtain an alternative formulation of the matrix completion problem dis-
cussed in Example 1.3.2.

Example 1.4.3: (Minimax Problems)

In a minimax problem the cost function has the form


    f(x) = sup_{z∈Z} φ(x, z),

where Z is a subset of some space and φ(·, z) is a real-valued function for each
z ∈ Z. We want to minimize f subject to x ∈ X, where X is a given constraint
set. By introducing an artificial scalar variable y, we may transform such a
problem to the general form

    minimize   y
    subject to x ∈ X,  φ(x, z) ≤ y,  ∀ z ∈ Z,

which involves a large number of constraints (one constraint for each z in the
set Z, which could be infinite). Of course in this problem the set X may also
be of the form X = ∩_{ℓ∈L} X_ℓ as in earlier examples.

Example 1.4.4: (Basis Function Approximation for Separable Problems - Approximate Dynamic Programming)

Let us consider a large-scale separable problem of the form

    minimize   Σ_{i=1}^m f_i(y_i)
    subject to Σ_{i=1}^m g_{ij}(y_i) ≤ 0,  ∀ j = 1, ..., r,   y ≥ 0,       (1.37)

where f_i : ℜ ↦ ℜ are scalar functions, and the dimension m of the vector
y = (y_1, ..., y_m) is very large. One possible way to address this problem is to
approximate y with a vector of the form Φx, where Φ is an m × n matrix. The
columns of Φ may be relatively few, and may be viewed as basis functions
for a low-dimensional approximation subspace {Φx | x ∈ ℜ^n}. We replace
problem (1.37) with the approximate version

    minimize   Σ_{i=1}^m f_i(φ_i'x)
    subject to Σ_{i=1}^m g_{ij}(φ_i'x) ≤ 0,  ∀ j = 1, ..., r,              (1.38)
               φ_i'x ≥ 0,  i = 1, ..., m,

where φ_i' denotes the ith row of Φ, and φ_i'x is viewed as an approximation of
y_i. Thus the dimension of the problem is reduced from m to n. However, the
constraint set of the problem became more complicated, because the simple
constraints y_i ≥ 0 take the more complex form φ_i'x ≥ 0. Moreover the number
m of additive components in the cost function, as well as the number of its
constraints is still large. Thus the problem has the additive cost structure of
the preceding section, as well as a large number of constraints.
    An important application of this approach is in approximate dynamic
programming (see e.g., [BeT96], [SuB98], [Pow11], [Ber12]), where the func-
tions f_i and g_{ij} are linear. The corresponding problem (1.37) relates to the
solution of the optimality condition (Bellman equation) of an infinite horizon
Markovian decision problem (the constraint y ≥ 0 may not be present in this
context). Here the numbers m and r are often astronomical (in fact r can be
much larger than m), in which case an exact solution cannot be obtained. For
such problems, approximation based on problem (1.38) has been one of the
major algorithmic approaches (see [Ber12] for a textbook presentation and
references). For very large m, it may be impossible to calculate the cost func-
tion value Σ_{i=1}^m f_i(φ_i'x) for a given x, and one may at most be able to sample
individual cost components f_i. For this reason optimization by stochastic
simulation is one of the most prominent approaches in large scale dynamic
programming.
    Let us also mention that related approaches based on randomization
and simulation have been proposed for the solution of large scale instances of
classical linear algebra problems; see [BeY09], [Ber12] (Section 7.3), [DMM06],
[StV09], [HMT10], [Nee10], [DMM11], [WaB13a], [WaB13b].

A large number of constraints also arises often in problems involving


a graph, and may be handled with algorithms that take into account the
graph structure. The following example is typical.

Example 1.4.5: (Optimal Routing in a Network - Multicommodity Flows)

Consider a directed graph that is used to transfer "commodities" from given


supply points to given demand points. We are given a set W of ordered node
pairs w = (i, j). The nodes i and j are referred to as the origin and the
destination of w, respectively, and w is referred to as an OD pair. For each
w, we are given a scalar rw referred to as the input of w. For example, in the
context of routing of data in a communication network, rw (measured in data
units/second) is the arrival rate of traffic entering and exiting the network at
the origin and the destination of w, respectively. The objective is to divide
each rw among the many paths from origin to destination in a way that the
resulting total arc flow pattern minimizes a suitable cost function.
We denote:
Pw: A given set of paths that start at the origin and end at the destination
of w. All arcs on each of these paths are oriented in the direction from
the origin to the destination.
xp: The portion of rw assigned to path p, also called the flow of path p.

The collection of all path flows {x_p | p ∈ P_w, w ∈ W} must satisfy the
constraints

    Σ_{p∈P_w} x_p = r_w,   ∀ w ∈ W,                                        (1.39)

    x_p ≥ 0,   ∀ p ∈ P_w,  w ∈ W.                                          (1.40)


The total flow F_{ij} of arc (i, j) is the sum of all path flows traversing the arc:

    F_{ij} = Σ_{all paths p containing (i,j)} x_p.                         (1.41)

Consider a cost function of the form

    Σ_{(i,j)} D_{ij}(F_{ij}).                                              (1.42)

The problem is to find a set of path flows { Xp} that minimize this cost function
subject to the constraints of Eqs. (1.39)-(1.41). It is typically assumed that
D;j is a convex function of F;j. In data routing applications, the form of
D;j is often based on a queueing model of average delay, in which case D;j is
continuously differentiable within its domain (see e.g., [BeG92)). In a related
context, arising in optical networks, the problem involves additional integer
constraints on Xp, but may be addressed as a problem with continuous flow
variables (see [OzB03)).

The preceding problem is known as a multicommodity network flow


problem. The terminology reflects the fact that the arc flows consist of several
different commodities; in the present example the different commodities are
the data of the distinct OD pairs. This problem also arises in essentially iden-
tical form in traffic network equilibrium problems (see e.g., [FIH95], [Ber98],
[Ber99], [Pat99], [Pat04]). The special case where all OD pairs have the same
end node, or all OD pairs have the same start node, is known as the single
commodity network flow problem, a much easier type of problem, for which
there are efficient specialized algorithms that tend to be much faster than
their multicommodity counterparts (see textbooks such as [Ber91], [Ber98]).
    By expressing the total flows F_{ij} in terms of the path flows in the cost
function (1.42) [using Eq. (1.41)], the problem can be formulated in terms of
the path flow variables {x_p | p ∈ P_w, w ∈ W} as

    minimize   D(x)
    subject to Σ_{p∈P_w} x_p = r_w,   ∀ w ∈ W,
               x_p ≥ 0,   ∀ p ∈ P_w,  w ∈ W,

where

    D(x) = Σ_{(i,j)} D_{ij}( Σ_{all paths p containing (i,j)} x_p )

and x is the vector of path flows x_p. There is a potentially huge number
of variables as well as constraints in this problem. However, by judiciously
taking into account the special structure of the problem, the constraint set
can be simplified and approximated by the convex hull of a small number of
vectors x, and the number of variables and constraints can be reduced to a
manageable size (see e.g., [BeG83], [FlH95], [OMV00], and our discussion in
Section 4.2).

There are several approaches to handle a large number of constraints.


One possibility, which points the way to some major classes of algorithms,
is to initially discard some of the constraints, solve the corresponding less
constrained problem, and later selectively reintroduce constraints that seem
to be violated at the optimum. In Chapters 4-6, we will discuss methods
of this type in some detail.
Another possibility is to replace constraints with penalties that assign
high cost for their violation. In particular, we may replace problem (1.35)
with
    minimize   f(x) + c Σ_{j=1}^r P(g_j(x))
    subject to x ∈ X,

where P(·) is a scalar penalty function satisfying P(u) = 0 if u ≤ 0, and
P(u) > 0 if u > 0, and c is a positive penalty parameter. We discuss this
possibility in the next section.
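    As a small illustration of this penalty approach with the squared-violation
penalty P(u) = (max{0, u})², the following numpy sketch (synthetic data; the penalty
parameter, stepsize, and iteration count are arbitrary choices) minimizes the
penalized cost by plain gradient descent and reports the largest residual constraint
violation.

import numpy as np

rng = np.random.default_rng(11)
n, r = 5, 200
A = rng.standard_normal((r, n))           # constraints g_j(x) = a_j'x - 1 <= 0
x0 = 3.0 * rng.standard_normal(n)         # cost f(x) = ||x - x0||^2

c = 10.0                                  # penalty parameter (illustrative)
def grad(x):
    viol = np.maximum(A @ x - 1.0, 0.0)   # max{0, g_j(x)}
    # gradient of ||x - x0||^2 + c * sum_j (max{0, g_j(x)})^2
    return 2.0 * (x - x0) + 2.0 * c * (A.T @ viol)

x = np.zeros(n)
for _ in range(20_000):                   # plain gradient descent, small stepsize
    x -= 1e-4 * grad(x)
print(np.max(A @ x - 1.0))                # small residual constraint violation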

1.5 EXACT PENALTY FUNCTIONS

In this section we discuss a transformation that is often useful in the context


of constrained optimization algorithms. We will derive a form of equiva-
lence between a constrained convex optimization problem, and a penalized
problem that is less constrained or is entirely unconstrained. The motiva-
tion is that some convex optimization algorithms do not have constrained
counterparts, but can be applied to a penalized unconstrained problem.
Furthermore, in some analytical contexts, it is useful to be able to work
with an equivalent problem that is less constrained.
    We consider the convex programming problem

    minimize   f(x)
    subject to x ∈ X,  g_j(x) ≤ 0,  j = 1, ..., r,                         (1.43)

where X is a convex subset of ℜ^n, and f : X ↦ ℜ and g_j : X ↦ ℜ are
given convex functions. We denote by f* the primal optimal value, and by
q* the dual optimal value, i.e.,

    q* = sup_{µ≥0} q(µ),

where

    q(µ) = inf_{x∈X} { f(x) + µ'g(x) },   ∀ µ ≥ 0,

with g(x) = (g_1(x), ..., g_r(x))'. We assume that −∞ < q* = f* < ∞.
    We introduce a convex penalty function P : ℜ^r ↦ ℜ, which satisfies

    P(u) = 0,   ∀ u ≤ 0,                                                   (1.44)

    P(u) > 0,   if u_j > 0 for some j = 1, ..., r.                         (1.45)


We consider solving in place of the original problem (1.43), the "penalized"
problem

    minimize   f(x) + P(g(x))
    subject to x ∈ X,                                                      (1.46)

where the inequality constraints have been replaced by the extra cost
P(g(x)) for their violation. Some interesting examples of penalty functions
are based on the squared or the absolute value of constraint violation:

    P(u) = (c/2) Σ_{j=1}^r (max{0, u_j})²,

and

    P(u) = c Σ_{j=1}^r max{0, u_j},

[Figure 1.5.1 appears here, showing three penalty functions P(u) and their
conjugates Q(µ): P(u) = c max{0, u}, with Q(µ) = 0 for 0 ≤ µ ≤ c and ∞
otherwise; P(u) = max{0, au + u²}; and P(u) = (c/2)(max{0, u})², with
Q(µ) = µ²/(2c) for µ ≥ 0 and ∞ for µ < 0.]

Figure 1.5.1. Illustration of various penalty functions P and their conjugate
functions, denoted by Q. Because P(u) = 0 for u ≤ 0, we have Q(µ) = ∞ for µ
outside the nonnegative orthant.

where c is a positive penalty parameter. However, there are other possibil-


ities that may be well-matched with the problem at hand.
    The conjugate function of P is given by

    Q(µ) = sup_{u∈ℜ^r} { u'µ − P(u) },

and it can be seen that

    Q(µ) ≥ 0,   ∀ µ ∈ ℜ^r,

    Q(µ) = ∞,   if µ_j < 0 for some j = 1, ..., r.
Figure 1.5.1 shows some examples of one-dimensional penalty functions P,
together with their conjugates.

    Consider the primal function of the original constrained problem,

    p(u) = inf_{x∈X, g(x)≤u} f(x),   u ∈ ℜ^r.

We have

    inf_{x∈X} { f(x) + P(g(x)) } = inf_{x∈X}  inf_{u∈ℜ^r, g(x)≤u} { f(x) + P(g(x)) }
                                 = inf_{x∈X}  inf_{u∈ℜ^r, g(x)≤u} { f(x) + P(u) }
                                 = inf_{x∈X, u∈ℜ^r, g(x)≤u} { f(x) + P(u) }
                                 = inf_{u∈ℜ^r}  inf_{x∈X, g(x)≤u} { f(x) + P(u) }
                                 = inf_{u∈ℜ^r} { p(u) + P(u) },

where for the second equality, we use the monotonicity relation†

    u ≤ v   ⟹   P(u) ≤ P(v).

Moreover, −∞ < q* and f* < ∞ by assumption, and since for any µ with
q(µ) > −∞, we have

    p(u) ≥ q(µ) − µ'u > −∞,   ∀ u ∈ ℜ^r,

it follows that p(0) < ∞ and p(u) > −∞ for all u ∈ ℜ^r, so p is proper.
    We can now apply the Fenchel Duality Theorem (Prop. 1.2.1) with
the identifications f_1 = p, f_2 = P, and A = I. We use the conjugacy
relation between the primal function p and the dual function q to write

    inf_{u∈ℜ^r} { p(u) + P(u) } = sup_{µ≥0} { q(µ) − Q(µ) },              (1.47)

so that

    inf_{x∈X} { f(x) + P(g(x)) } = sup_{µ≥0} { q(µ) − Q(µ) };             (1.48)

see Fig. 1.5.2. Note that the conditions for application of the theorem are
satisfied since the penalty function P is real-valued, so that the relative

    † To show this relation, we argue by contradiction. If there exist u and v with
u ≤ v and P(u) > P(v), then by continuity of P, there must exist ū close enough
to u such that ū < v and P(ū) > P(v). Since P is convex, it is monotonically
increasing along the halfline {ū + α(ū − v) | α ≥ 0}, and since P(ū) > P(v) ≥ 0,
P takes positive values along this halfline. However, since ū < v, this halfline
eventually enters the negative orthant, where P takes the value 0 by Eq. (1.44),
a contradiction.

Figure 1.5.2. Illustration of the duality relation (1.48), and the optimal
values of the penalized and the dual problem. Here f* is the optimal value
of the original problem, which is assumed to be equal to the optimal dual
value q*, while f̂ is the optimal value of the penalized problem,

    f̂ = inf_{x∈X} { f(x) + P(g(x)) }.

The point of contact of the graphs of the functions f̂ + Q(µ) and q(µ)
corresponds to the vector µ that attains the maximum in the relation

    f̂ = max_{µ≥0} { q(µ) − Q(µ) }.

interiors of dom(p) and dom(P) have nonempty intersection. Furthermore,


as part of the conclusions of part (a) of the Fenchel Duality Theorem, it
follows that the supremum overµ ::::: 0 in Eq. (1.48) is attained.
Figure 1.5.2 suggests that in order for the penalized problem (1.46)
to have the same optimal value as the original constrained problem (1.43),
the conjugate Q must be "sufficiently flat" so that it is minimized by some
dual optimal solution µ*. This can be interpreted in terms of properties of
subgradients, which are stated in Appendix B, Section 5.4: we must have
0 E oQ(µ*) for some dual optimal solution µ•, which by Prop. 5.4.3 in
Appendix B, is equivalent to µ* E 8P(O). This is part (a) of the following
proposition, which was given in [Ber75a]. Parts (b) and (c) of the propo-
sition deal with issues of equality of corresponding optimal solutions. The
proposition assumes the convexity and other assumptions made in the early
part in this section regarding problem (1.43) and the penalty function P.

Proposition 1.5.1: Consider problem (1.43), where we assume that
−∞ < q* = f* < ∞.

    (a) The penalized problem (1.46) and the original constrained prob-
        lem (1.43) have equal optimal values if and only if there exists a
        dual optimal solution µ* such that µ* ∈ ∂P(0).

    (b) In order for some optimal solution of the penalized problem (1.46)
        to be an optimal solution of the constrained problem (1.43), it is
        necessary that there exists a dual optimal solution µ* such that

            u'µ* ≤ P(u),   ∀ u ∈ ℜ^r.                                      (1.49)

    (c) In order for the penalized problem (1.46) and the constrained
        problem (1.43) to have the same set of optimal solutions, it is
        sufficient that there exists a dual optimal solution µ* such that

            u'µ* < P(u),   ∀ u ∈ ℜ^r with u_j > 0 for some j.              (1.50)

Proof: (a) We have using Eqs. (1.47) and (1.48),

    p(0) ≥ inf_{u∈ℜ^r} { p(u) + P(u) } = sup_{µ≥0} { q(µ) − Q(µ) } = inf_{x∈X} { f(x) + P(g(x)) }.
                                                                           (1.51)
Since f* = p(0), we have

    f* = inf_{x∈X} { f(x) + P(g(x)) }

if and only if equality holds in Eq. (1.51). This is true if and only if

    0 ∈ arg min_{u∈ℜ^r} { p(u) + P(u) },

which by Prop. 5.4.7 in Appendix B, is true if and only if there exists some
µ* ∈ −∂p(0) with µ* ∈ ∂P(0) (in view of the fact that P is real-valued).
Since the set of dual optimal solutions is −∂p(0) (under our assumption
−∞ < q* = f* < ∞; see Example 5.4.2, [Ber09]), the result follows.

(b) If x* is an optimal solution of both problems (1.43) and (1.46), then by
feasibility of x*, we have P(g(x*)) = 0, so these two problems have equal
optimal values. From part (a), there must exist a dual optimal solution
µ* ∈ ∂P(0), which is equivalent to Eq. (1.49), by the subgradient inequality.

(c) If x* is an optimal solution of the constrained problem (1.43), then
P(g(x*)) = 0, so we have

    f* = f(x*) = f(x*) + P(g(x*)) ≥ inf_{x∈X} { f(x) + P(g(x)) }.

The condition (1.50) implies the condition (1.49), so that by part (a),
equality holds throughout in the above relation, showing that x* is also
an optimal solution of the penalized problem (1.46).
    Conversely, let x* ∈ X be an optimal solution of the penalized prob-
lem (1.46). If x* is feasible [i.e., satisfies in addition g(x*) ≤ 0], then it is an
optimal solution of the constrained problem (1.43) [since P(g(x)) = 0 for
all feasible vectors x], and we are done. Otherwise x* is infeasible in which
case g_j(x*) > 0 for some j. Then, by using the given condition (1.50), it
follows that there exists a dual optimal solution µ* and an ε > 0 such that

    µ*'g(x*) + ε < P(g(x*)).

Let x̄ be a feasible vector such that f(x̄) ≤ f* + ε. Since P(g(x̄)) = 0 and
f* = min_{x∈X} { f(x) + µ*'g(x) }, we obtain

    f(x̄) + P(g(x̄)) = f(x̄) ≤ f* + ε ≤ f(x*) + µ*'g(x*) + ε.

By combining the last two relations, it follows that

    f(x̄) + P(g(x̄)) < f(x*) + P(g(x*)),

which contradicts the hypothesis that x* is an optimal solution of the
penalized problem (1.46). This completes the proof. Q.E.D.

    As an illustration, consider the minimization of f(x) = −x over all
x ∈ X = {x | x ≥ 0} with g(x) = x ≤ 0. The dual function is

    q(µ) = inf_{x≥0} (µ − 1)x,   µ ≥ 0,

so q(µ) = 0 for µ ∈ [1, ∞) and q(µ) = −∞ otherwise. Let P(u) =
c max{0, u}, so the penalized problem is min_{x≥0} { −x + c max{0, x} }. Then
parts (a) and (b) of the proposition apply if c ≥ 1. However, part (c) ap-
plies only if c > 1. In terms of Fig. 1.5.2, the conjugate of P is Q(µ) = 0 if
µ ∈ [0, c] and Q(µ) = ∞ otherwise, so when c = 1, Q is "flat" over an area
not including an interior point of the dual optimal solution set [1, ∞).
    To elaborate on the idea of the preceding example, let

    P(u) = c Σ_{j=1}^r max{0, u_j},

where c > 0. The condition µ* ∈ ∂P(0), or equivalently,

    u'µ* ≤ P(u),   ∀ u ∈ ℜ^r

[cf. Eq. (1.49)], is equivalent to

    µ_j* ≤ c,   ∀ j = 1, ..., r.

    Similarly, the condition u'µ* < P(u) for all u ∈ ℜ^r with u_j > 0 for some j
[cf. Eq. (1.50)], is equivalent to

    µ_j* < c,   ∀ j = 1, ..., r.

The reader may consult the literature for other results on exact penalty
functions, starting with their first proposal in the book [Zan69]. The pre-
ceding development is based on [Ber75], and focuses on convex program-
ming problems. For additional representative references, some of which
also discuss nonconvex problems, see [HaM79], [Ber82a], [Bur91], [FeM91],
[BN003], [FrT07]. In what follows we develop an exact penalty function
result for the case of an abstract constraint set, which will be used in the
context of incremental constraint projection algorithms in Section 6.4.4.

A Distance-Based Exact Penalty Function

Let us discuss the case of a general Lipschitz continuous (not necessarily


convex) cost function and an abstract constraint set X ⊂ ℜ^n. The idea is
to use a penalty that is proportional to the distance from X:

    dist(x; X) = inf_{y∈X} ‖x − y‖.

The next proposition from [Ber11] provides the basic result (see Fig. 1.5.3).

Proposition 1.5.2: Let f : ℜ^n ↦ ℜ be a function that is Lipschitz
continuous with constant L over a set Y ⊂ ℜ^n, i.e.,

    |f(x) − f(y)| ≤ L‖x − y‖,   ∀ x, y ∈ Y.                                (1.52)

Let also X be a nonempty closed subset of Y, and let c be a scalar
with c > L. Then x* minimizes f over X if and only if x* minimizes

    F_c(x) = f(x) + c dist(x; X)

over Y.

Proof: For any x ∈ Y, let x̄ denote a vector of X that is at minimum
distance from x (such a vector exists by the closure of X and Weierstrass'
Theorem). By using the Lipschitz assumption (1.52) and the fact c > L,
we have

    F_c(x) = f(x) + c‖x − x̄‖ = f(x̄) + ( f(x) − f(x̄) ) + c‖x − x̄‖ ≥ f(x̄) = F_c(x̄),

with strict inequality if x ≠ x̄. Thus all minima of F_c over Y must lie in X,
and also minimize f over X (since F_c = f on X). Conversely, all minima
of f over X are also minima of F_c over X (since F_c = f on X), and by the
preceding inequality, they are also minima of F_c over Y. Q.E.D.

Figure 1.5.3. Illustration of Prop. 1.5.2. For c greater than the Lipschitz constant
of f, the "slope" of the penalty function counteracts the "slope" of f at the optimal
solution x*.
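    A one-dimensional numerical illustration of the proposition (with data chosen
only for this purpose): take f(x) = x, which is Lipschitz continuous with constant
L = 1, X = [1, 2], and Y = ℜ. For c > L the minimum of F_c over Y is the constrained
minimum x* = 1, while for c < L the penalty is too weak and the minimum of F_c
escapes X, as the following numpy sketch shows.

import numpy as np

# f(x) = x has Lipschitz constant L = 1; X = [1, 2], Y = R.
f = lambda x: x
dist = lambda x: np.maximum(0.0, np.maximum(1.0 - x, x - 2.0))   # dist(x; [1, 2])
xs = np.linspace(-3, 4, 70001)
for c in (0.5, 2.0):
    Fc = f(xs) + c * dist(xs)
    print(c, xs[np.argmin(Fc)])   # roughly -3 (the grid boundary) for c = 0.5, and 1 for c = 2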

The following proposition provides a generalization for constraints


that involve the intersection of several sets.

Proposition 1.5.3: Let f : ℜ^n ↦ ℜ be a function, and let X_i,
i = 0, 1, ..., m, be closed subsets of ℜ^n with nonempty intersection.
Assume that f is Lipschitz continuous over X_0. Then there is a scalar
c̄ > 0 such that for all c ≥ c̄, the set of minima of f over ∩_{i=0}^m X_i
coincides with the set of minima of

    f(x) + c Σ_{i=1}^m dist(x; X_i)

over X_0.

Proof: Let L be the Lipschitz constant for f, and let c_1, ..., c_m be scalars
satisfying

    c_k > L + c_1 + · · · + c_{k−1},   ∀ k = 1, ..., m,

where c_0 = 0. Define

    H_k(x) = f(x) + c_1 dist(x; X_1) + · · · + c_k dist(x; X_k),   k = 1, ..., m,

and for k = 0, denote H_0(x) = f(x), c_0 = 0. By applying Prop. 1.5.2, the
set of minima of H_m over X_0 coincides with the set of minima of H_{m−1}
over X_m ∩ X_0, since c_m is greater than L + c_1 + · · · + c_{m−1}, the Lipschitz
constant for H_{m−1}. Similarly, for all k = 1, ..., m, the set of minima of
H_k over ( ∩_{i=k+1}^m X_i ) ∩ X_0 coincides with the set of minima of H_{k−1} over
( ∩_{i=k}^m X_i ) ∩ X_0. Thus, for k = 1, we obtain that the set of minima of H_m
over X_0 coincides with the set of minima of H_0, which is f, over ∩_{i=0}^m X_i.
    Let

    X* ⊂ ∩_{i=0}^m X_i

be this set of minima. For c ≥ c_m, we have F_c ≥ H_m, while F_c coincides
with H_m on X*. Hence X* is the set of minima of F_c over X_0. Q.E.D.

We finally note that exact penalty functions, and particularly the dis-
tance function dist(x; Xi), are often relatively convenient in various con-
texts where difficult constraints complicate the algorithmic solution. As an
example, see Section 6.4.4, where incremental proximal methods for highly
constrained problems are discussed.

1.6 NOTES, SOURCES, AND EXERCISES

There is a very extensive literature on convex optimization, and in this sec-


tion we will restrict ourselves to noting some books, research monographs,
and surveys. In subsequent chapters, we will discuss in greater detail the
literature that relates to the specialized content of these chapters.
Books relating primarily to duality theory are Rockafellar [Roc70],
Stoer and Witzgall [StW70], Ekeland and Temam [EkT76], Bonnans and
Shapiro [BoSOO], Zalinescu [Zal02], Auslender and Teboulle [AuT03], and
Bertsekas [Ber09].
The books by Rockafellar and Wets [RoW98], Borwein and Lewis
[BoLOO], and Bertsekas, Nedic, and Ozdaglar [BN003] straddle the bound-
ary between convex and variational analysis, a broad spectrum of topics
that integrate classical analysis, convexity, and optimization of both convex
and nonconvex (possibly nonsmooth) functions.
The book by Hiriart-Urruty and Lemarechal [HiL93] focuses on con-
vex optimization algorithms. The books by Rockafellar [Roc84] and Bert-
sekas [Ber98] have a more specialized focus on network optimization algo-
rithms and monotropic programming problems, which will be discussed in
Chapters 4 and 6. The book by Ben-Tal and Nemirovski [BeNOl] focuses
on conic and semidefinite programming [see also the 2005 class notes by
Nemirovski (on line), and the representative survey papers by Alizadeh and
Goldfarb [AlG03], and Todd [Tod01]]. The book by Wolkowicz, Saigal, and
Vandenberghe [WSV00] contains a collection of survey articles on semidefi-
nite programming. The book by Boyd and Vandenberghe [BoV04] describes
many applications, and contains a lot of related material and references.
The book by Ben-Tal, El Ghaoui, and Nemirovski [BGN09] focuses on
robust optimization; see also the survey by Bertsimas, Brown, and Cara-
manis [BBC11]. The book by Bauschke and Combettes [BaC11] develops
the connection of convex analysis and monotone operator theory in infinite

dimensional spaces. The book by Rockafellar and Wets [RoW98] also has
a substantial finite-dimensional treatment of this subject. The books by
Cottle, Pang, and Stone [CPS92], and Facchinei and Pang [FaP03] focus on
complementarity and variational inequality problems. The books by Palo-
mar and Eldar [PaElO], and Vetterli, Kovacevic, and Goyal [VKG14], and
the surveys in the May 2010 issue of the IEEE Signal Processing Magazine
describe applications of convex optimization in communications and sig-
nal processing. The books by Hastie, Tibshirani, and Friedman [HTF09],
and Sra, Nowozin, and Wright [SNW12] describe applications of convex
optimization in machine learning.

EXERCISES

1.1 (Support Vector Machines and Duality)

Consider the classification problem associated with a support vector machine,

    minimize   (1/2)‖x‖² + β Σ_{i=1}^m max{ 0, 1 − b_i(c_i'x + y) }
    subject to x ∈ ℜ^n,  y ∈ ℜ,
with quadratic regularization, where β is a positive regularization parameter (cf.
Example 1.3.3).
(a) Write the problem in the equivalent form

        minimize   (1/2)‖x‖² + β Σ_{i=1}^m ξ_i
        subject to x ∈ ℜ^n,  y ∈ ℜ,
                   0 ≤ ξ_i,   1 − b_i(c_i'x + y) ≤ ξ_i,   i = 1, ..., m.

    Associate dual variables µ_i ≥ 0 with the constraints 1 − b_i(c_i'x + y) ≤ ξ_i,
    and show that the dual function is given by

        q(µ) = { q̄(µ)   if Σ_{j=1}^m µ_j b_j = 0,  0 ≤ µ_i ≤ β,  i = 1, ..., m,
               { −∞     otherwise,

    where

        q̄(µ) = Σ_{i=1}^m µ_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m b_i b_j c_i'c_j µ_i µ_j.

    Does the dual problem, viewed as the equivalent quadratic program

        maximize   q̄(µ)
        subject to Σ_{j=1}^m µ_j b_j = 0,  0 ≤ µ_i ≤ β,  i = 1, ..., m,

always have a solution? Is the solution unique? Note: The dual problem
may have high dimension, but it has a generally more favorable structure
than the primal. The reason is the simplicity of its constraint set, which
makes it suitable for special types of quadratic programming methods, and
the two-metric projection and coordinate descent methods of Section 2.1.2.
(b) Consider an alternative formulation where the variable y is set to 0, leading
    to the problem

        minimize   (1/2)‖x‖² + β Σ_{i=1}^m max{ 0, 1 − b_i c_i'x }
        subject to x ∈ ℜ^n.

    Show that the dual problem should be modified so that the constraint
    Σ_{j=1}^m µ_j b_j = 0 is not present, thus leading to a bound-constrained quadratic
    dual problem.
Note: The literature of the support vector machine field is extensive. Many of the
nondifferentiable optimization methods to be discussed in subsequent chapters
have been applied in connection to this field; see e.g., [MaM01], [FeM02], [SmS04],
[Bot05], [Joa06], [JFY09], [JoY09], [SSS07], [LeW11].

1.2 (Minimizing the Sum or the Maximum of Norms [LVB98])

Consider the problems

    minimize   Σ_{i=1}^p ‖F_i x + g_i‖
    subject to x ∈ ℜ^n,                                                    (1.53)

and

    minimize   max_{i=1,...,p} ‖F_i x + g_i‖
    subject to x ∈ ℜ^n,

where F_i and g_i are given matrices and vectors, respectively. Convert these
problems to second order cone form and derive the corresponding dual problems.

1.3 (Complex ℓ1 and ℓ∞ Approximation [LVB98])

Consider the complex ℓ1 approximation problem

    minimize   ‖Ax − b‖_1
    subject to x ∈ C^n,

where C^n is the set of n-dimensional vectors whose components are complex
numbers, and A and b are given matrix and vector with complex components.
Show that it is a special case of problem (1.53) and derive the corresponding dual
problem. Repeat for the complex ℓ∞ approximation problem

    minimize   ‖Ax − b‖_∞
    subject to x ∈ C^n.

1.4

The purpose of this exercise is to show that the SOCP can be viewed as a special
case of SDP.
(a) Show that a vector x = (x_1, ..., x_n) ∈ ℜ^n belongs to the second order cone
    if and only if the matrix

        ( x_n I    x̄  )
        ( x̄'      x_n ),

    where I is the (n−1) × (n−1) identity matrix and x̄ = (x_1, ..., x_{n−1})', is
    positive semidefinite. Hint: We have that for any positive definite sym-
    metric n × n matrix A, vector b ∈ ℜ^n, and scalar c, the matrix

        ( A    b )
        ( b'   c )

    is positive definite if and only if c − b'A^{−1}b > 0.

(b) Use part (a) to show that the primal SOCP can be written in the form of
    the dual SDP.

1.5 (Explicit Form of a Second Order Cone Problem)

Consider the SOCP (1.24).

(a) Partition the $n_i \times (n + 1)$ matrices $(A_i \ \ b_i)$ as

        $(A_i \ \ b_i) = \begin{pmatrix} D_i & d_i \\ p_i' & q_i \end{pmatrix}, \qquad i = 1, \ldots, m,$

    where $D_i$ is an $(n_i - 1) \times n$ matrix, $d_i \in \Re^{n_i - 1}$, $p_i \in \Re^n$, and $q_i \in \Re$. Show
    that

        $A_i x - b_i \in C_i$ \quad if and only if \quad $\|D_i x - d_i\| \le p_i'x - q_i$,

    so we can write the SOCP (1.24) as

        minimize    $c'x$
        subject to  $\|D_i x - d_i\| \le p_i'x - q_i$, \quad $i = 1, \ldots, m$.

(b) Similarly partition $\lambda_i$ as

        $\lambda_i = \begin{pmatrix} \mu_i \\ \nu_i \end{pmatrix}, \qquad i = 1, \ldots, m,$

    where $\mu_i \in \Re^{n_i - 1}$ and $\nu_i \in \Re$. Show that the dual problem (1.25) can be
    written in the form

        maximize    $\sum_{i=1}^m (d_i'\mu_i + q_i \nu_i)$
        subject to  $\sum_{i=1}^m (D_i'\mu_i + \nu_i p_i) = c$, \quad $\|\mu_i\| \le \nu_i$, \quad $i = 1, \ldots, m$.

(c) Show that the primal and dual interior point conditions for strong duality
    (Prop. 1.2.4) hold if there exist primal and dual feasible solutions $x$ and
    $(\mu_i, \nu_i)$ such that

        $\|D_i x - d_i\| < p_i'x - q_i, \qquad i = 1, \ldots, m,$

    and

        $\|\mu_i\| < \nu_i, \qquad i = 1, \ldots, m,$

    respectively.

1.6 (Separable Conic Problems)

Consider the problem

    minimize    $\sum_{i=1}^m f_i(x_i)$
    subject to  $x \in S \cap C$,

where $x = (x_1, \ldots, x_m)$ with $x_i \in \Re^{n_i}$, $i = 1, \ldots, m$, and $f_i : \Re^{n_i} \mapsto (-\infty, \infty]$ is
a proper convex function for each $i$, and $S$ and $C$ are a subspace and a cone of
$\Re^{n_1 + \cdots + n_m}$, respectively. Show that a dual problem is

    maximize    $\sum_{i=1}^m q_i(\lambda_i)$
    subject to  $\lambda \in \hat C + S^\perp$,

where $\lambda = (\lambda_1, \ldots, \lambda_m)$, $\hat C$ is the dual cone of $C$, and

    $q_i(\lambda_i) = \inf_{x_i \in \Re^{n_i}} \bigl\{ f_i(x_i) - \lambda_i' x_i \bigr\}, \qquad i = 1, \ldots, m.$

1.7 (Weber Points)

Consider the problem of finding a circle of minimum radius that contains $r$ points
$y_1, \ldots, y_r$ in the plane, i.e., find $x$ and $z$ that minimize $z$ subject to $\|x - y_j\| \le z$
for all $j = 1, \ldots, r$, where $x$ is the center of the circle under optimization.

(a) Introduce multipliers $\mu_j$, $j = 1, \ldots, r$, for the constraints, and show that
    the dual problem has an optimal solution and there is no duality gap.

(b) Show that calculating the dual function at some $\mu \ge 0$ involves the com-
    putation of a Weber point of $y_1, \ldots, y_r$ with weights $\mu_1, \ldots, \mu_r$, i.e., the
    solution of the problem

        $\min_{x \in \Re^2} \ \sum_{j=1}^r \mu_j \|x - y_j\|$

    (see Example 1.3.7).
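
For concreteness, the following Python sketch computes such a weighted Weber point
with Weiszfeld's classical fixed point iteration; the algorithm and the simplified
handling of iterates that land on a data point are assumptions of the sketch, not
part of the exercise.

    import numpy as np

    def weber_point(Y, mu, num_iters=200, tol=1e-9):
        # Weighted Weber point of the rows of Y with weights mu, computed with
        # Weiszfeld's fixed point iteration (a standard method, used here only
        # as an illustration).
        x = Y.mean(axis=0)
        for _ in range(num_iters):
            d = np.linalg.norm(Y - x, axis=1)
            if np.any(d < tol):
                return Y[np.argmin(d)]          # simplified handling of this corner case
            w = mu / d
            x_new = (w[:, None] * Y).sum(axis=0) / w.sum()
            if np.linalg.norm(x_new - x) < tol:
                return x_new
            x = x_new
        return x

    Y = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])   # hypothetical points y_j
    print(weber_point(Y, mu=np.array([1.0, 2.0, 1.0])))
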

1.8 (Inconsistent Convex Systems of Inequalities)

Let $g_j : \Re^n \mapsto \Re$, $j = 1, \ldots, r$, be convex functions over the nonempty convex set
$X \subset \Re^n$. Show that the system

    $g_j(x) < 0, \qquad j = 1, \ldots, r,$

has no solution within $X$ if and only if there exists a vector $\mu \in \Re^r$ such that

    $\mu \ge 0, \qquad \mu \ne 0,$

    $\mu' g(x) \ge 0, \qquad \forall\ x \in X.$

Note: This is an example of what is known as a theorem of the alternative.
There are many results of this type, with a long history, such as the Farkas
Lemma, and the theorems of Gordan, Motzkin, and Stiemke, which address the
feasibility (possibly strict) of linear inequalities. They can be found in many
sources, including Section 5.6 of [Ber09]. Hint: Consider the convex program

    minimize    $y$
    subject to  $x \in X$, \ $y \in \Re$, \quad $g_j(x) \le y$, \quad $j = 1, \ldots, r.$
2
Optimization Algorithms:
An Overview

Contents

2.1. Iterative Descent Algorithms . . . . . . . . . . . . . p. 55


2.1.1. Differentiable Cost Function Descent - Unconstrained
Problems . . . . . . . . . . . . . . . . . . p. 58
2.1.2. Constrained Problems - Feasible Direction Methods p. 71
2.1.3. Nondifferentiable Problems - Subgradient Methods p. 78
2.1.4. Alternative Descent Methods . . . . . . . . p. 80
2.1.5. Incremental Algorithms . . . . . . . . . . p. 83
2.1.6. Distributed Asynchronous Iterative Algorithms p. 104
2.2. Approximation Methods . . . . . . . . . . . p. 106
2.2.1. Polyhedral Approximation . . . . . . . . . p. 107
2.2.2. Penalty, Augmented Lagrangian, and Interior
Point Methods . . . . . . . . . . . . p. 108
2.2.3. Proximal Algorithm, Bundle Methods, and .
Tikhonov Regularization . . . . . . . . . . p. 110
2.2.4. Alternating Direction Method of Multipliers p . 111
2.2.5. Smoothing of Nondifferentiable Problems p. 113
2.3. Notes, Sources, and Exercises . . . . . . . . p . 119


In this book we are primarily interested in optimization algorithms, as op-


posed to "modeling," i.e., the formulation of real-world problems as math-
ematical optimization problems, or "theory," i.e., conditions for strong du-
ality, optimality conditions, etc. In our treatment, we will mostly focus on
guaranteeing convergence of algorithms to desired solutions, and the asso-
ciated rate of convergence and complexity analysis. We will also discuss
special characteristics of algorithms that make them suitable for particular
types of large scale problem structures, and distributed (possibly asyn-
chronous) computation. In this chapter we provide an overview of some
broad classes of optimization algorithms, their underlying ideas, and their
performance characteristics.
Iterative algorithms for minimizing a function $f : \Re^n \mapsto \Re$ over a set
X generate a sequence {xk}, which will hopefully converge to an optimal
solution. In this book we focus on iterative algorithms for the case where X
is convex, and f is either convex or is nonconvex but differentiable. Most
of these algorithms involve one or both of the following two ideas, which
will be discussed in Sections 2.1 and 2.2, respectively:
(a) Iterative descent, whereby the generated sequence $\{x_k\}$ is feasible,
i.e., $\{x_k\} \subset X$, and satisfies

    $\phi(x_{k+1}) < \phi(x_k)$ \quad if and only if $x_k$ is not optimal,

where $\phi$ is a merit function, that measures the progress of the algo-
rithm towards optimality, and is minimized only at optimal points,
i.e.,
    $\arg\min_{x \in X} \phi(x) = \arg\min_{x \in X} f(x).$

Examples are $\phi(x) = f(x)$ and $\phi(x) = \inf_{x^* \in X^*} \|x - x^*\|$, where $X^*$ is


the set of optimal points, assumed nonempty. In some cases, iterative
descent may be the primary idea, but modifications or approximations
are introduced for a variety of reasons. For example one may modify
an iterative descent method to make it suitable for distributed asyn-
chronous computation, or to deal with random or nonrandom errors,
but in the process lose the iterative descent property. In this case,
the analysis is appropriately modified, but often maintains important
aspects of its original descent-based character.
(b) Approximation, whereby the generated sequence $\{x_k\}$ need not be
feasible, and is obtained by solving at each $k$ an approximation to the
original optimization problem, i.e.,

    $x_{k+1} \in \arg\min_{x \in X_k} F_k(x),$

where $F_k$ is a function that approximates $f$ and $X_k$ is a set that


approximates X. These may depend on the prior iterates xo, ... , x k,

as well as other parameters. Key ideas here are that minimization


of Fk over X k should be easier than minimization of f over X, and
that Xk should be a good starting point for obtaining Xk+i via some
(possibly special purpose) method. Of course, the approximation of
f by Fk and/or X by Xk should improve as k increases, and there
should be some convergence guarantees as k -+ oo. We will summarize
the main approximation ideas of this book in Section 2.2.
A major class of problems that we aim to solve is dual problems,
which by their nature involve nondifferentiable optimization. The funda-
mental reason is that the negative of a dual function is typically a conjugate
function, which is closed and convex, but need not be differentiable. More-
over nondifferentiable cost functions naturally arise in other contexts, such
as exact penalty functions, and machine learning with $\ell_1$ regularization.
Accordingly many of the algorithms that we discuss in this book do not
require cost function differentiability for their application.
Still, however, differentiability plays a major role in problem formula-
tions and algorithms, so it is important to maintain a close connection be-
tween differentiable and nondifferentiable optimization approaches. More-
over, nondifferentiable problems can often be converted to differentiable
ones by using a smoothing scheme (see Section 2.2.5). We consequently
summarize in Section 2.1 some of the main ideas of iterative algorithms
that rely on differentiability, such as gradient and Newton methods, and
their incremental variants. We return to some of these ideas in Sections
6.1-6.3, but for most of the remainder of the book we focus primarily on
convex possibly nondifferentiable cost functions.
Since the present chapter has an overview character, our discussion
will not be supplemented by complete proofs; in many cases we will provide
just intuitive explanations and refer to the literature for a more detailed
analysis. In subsequent chapters we will treat various types of algorithms in
greater detail. In particular, in Chapter 3, we discuss descent-type iterative
methods that use subgradients. In Chapters 4 and 5, we discuss primarily
the approximation approach, focusing on two types of algorithms and their
combinations: polyhedral approximation and proximal, respectively. In
Chapter 6, we discuss a number of additional methods, which extend and
combine the ideas of the preceding chapters.

2.1 ITERATIVE DESCENT ALGORITHMS

Iterative algorithms generate sequences $\{x_k\}$ according to

    $x_{k+1} = G_k(x_k),$

where $G_k : \Re^n \mapsto \Re^n$ is some function that may depend on $k$, and $x_0$


is some starting point. In a more general context, Gk may depend on

some preceding iterates $x_{k-1}, x_{k-2}, \ldots$. We are typically interested in the


convergence of the generated sequence { xk} to some desirable point. We
are also interested in questions of rate of convergence, such as for example
the number of iterations needed to bring a measure of error to within a
given tolerance, or asymptotic bounds on some measure of error as the
number of iterations increases.
A stationary iterative algorithm is obtained when $G_k$ does not depend
on $k$, i.e.,
    $x_{k+1} = G(x_k).$
This algorithm aims to solve a fixed point problem: finding a solution of
the equation $x = G(x)$. A classical optimization example is the gradient
iteration

    $x_{k+1} = x_k - \alpha \nabla f(x_k),$                                        (2.1)

which aims at satisfying the optimality condition $\nabla f(x) = 0$ for an uncon-
strained minimum of a differentiable function $f : \Re^n \mapsto \Re$. Here $\alpha$ is a
positive stepsize parameter that is used to ensure that the iteration makes
progress towards the solution set of the corresponding problem. Another
example is the iteration

    $x_{k+1} = x_k - \alpha(Qx_k - b) = (I - \alpha Q)x_k + \alpha b,$               (2.2)

which aims at solution of the linear system $Qx = b$, where $Q$ is a matrix
that has eigenvalues with positive real parts (so that the matrix $I - \alpha Q$
has eigenvalues within the unit circle for sufficiently small $\alpha > 0$, and the
iteration is convergent to the unique solution). If $f$ is the quadratic function
$f(x) = \tfrac{1}{2}x'Qx - b'x$, where $Q$ is positive definite symmetric, then we have
$\nabla f(x_k) = Qx_k - b$ and the gradient iteration (2.1) can be written in the
form (2.2).
     Convergence of the stationary iteration $x_{k+1} = G(x_k)$ can be ascer-
tained in a number of ways. The most common is to verify that $G$ is a
contraction mapping with respect to some norm, i.e., for some $\rho < 1$, and
some norm $\|\cdot\|$ (not necessarily the Euclidean norm), we have

    $\|G(x) - G(y)\| \le \rho \|x - y\|, \qquad \forall\ x, y \in \Re^n.$

Then it can be shown that $G$ has a unique fixed point $x^*$, and $x_k \to x^*$,
starting from any $x_0 \in \Re^n$; this is the well-known Banach Fixed Point The-
orem (see Prop. A.4.1 in Section A.4 of Appendix A, where the contraction
and other approaches for convergence analysis are discussed). An example
is the mapping
    $G(x) = (I - \alpha Q)x + \alpha b$
of the linear iteration (2.2), when the eigenvalues of $I - \alpha Q$ lie strictly
within the unit circle.

     The case where $G$ is a contraction mapping provides an example of
convergence analysis based on a descent approach: at each iteration we
have

    $\|G(x) - x^*\| \le \rho \|x - x^*\|, \qquad \forall\ x \in \Re^n,$               (2.3)

so the distance $\|x - x^*\|$ is decreased with each iteration at a nonsolution
point $x$. Moreover, in this case we obtain an estimate of the convergence
rate: $\|x_k - x^*\|$ is decreased at least as fast as the geometric progression
$\{\rho^k \|x_0 - x^*\|\}$; this is called linear or geometric convergence.†
     Many optimization algorithms involve a contraction mapping as de-
scribed above. There are also other types of convergent fixed point itera-
tions, which do not require that $G$ is a contraction mapping. In particular,
there are cases where $G$ is a nonexpansive mapping [$\rho = 1$ in Eq. (2.3)],
and there is sufficient structure in $G$ to ensure a form of improvement of an
appropriate figure of merit at each iteration; the proximal algorithm, intro-
duced in Section 2.2.3 and discussed in detail in Chapter 5, is an important
example of this type.
     There are also many cases of nonstationary iterations of the form

    $x_{k+1} = G_k(x_k),$

whose convergence analysis is difficult or impossible with a contraction or
nonexpansive mapping approach. An example is unconstrained minimiza-
tion of a differentiable function $f$ with a gradient method of the form

    $x_{k+1} = x_k - \alpha_k \nabla f(x_k),$                                      (2.4)

where the stepsize $\alpha_k$ is not constant. Still many of these algorithms admit
a convergence analysis based on a descent approach, whereby we introduce
a function $\phi$ that measures the progress of the algorithm towards optimality,
and show that

    $\phi(x_{k+1}) < \phi(x_k)$ \quad if and only if $x_k$ is not optimal.

Two common cases are when $\phi(x) = f(x)$ or $\phi(x) = \mathrm{dist}(x, X^*)$, the Eu-
clidean minimum distance of $x$ from the set $X^*$ of minima of $f$. For example
convergence of the gradient algorithm (2.4) is often analyzed by showing
that for all $k$,

    $f(x_{k+1}) \le f(x_k) - \gamma_k \|\nabla f(x_k)\|^2,$

where $\gamma_k$ is a positive scalar that depends on $\alpha_k$ and some characteristics
of $f$, and is such that $\sum_{k=0}^\infty \gamma_k = \infty$; this brings to bear the convergence

     † Generally, we say that a nonnegative scalar sequence $\{\beta_k\}$ converges (at
least) linearly or geometrically if there exist scalars $\gamma > 0$ and $\rho \in (0, 1)$ such
that $\beta_k \le \gamma \rho^k$ for all $k$. For a discussion of different definitions of linear and
other types of convergence rate, see [OrR70], [Ber82a], and [Ber99].

methodology of Section A.4 in Appendix A and guarantees that either
$\nabla f(x_k) \to 0$ or $f(x_k) \to -\infty$.
In what follows in this section we will provide an overview of iterative
optimization algorithms that rely on some form of descent for their validity,
we discuss some of their underlying motivation, and we raise various issues
that will be discussed later. We will also provide in the exercises a sampling
of some related convergence analysis, while deferring to subsequent chapters
a more detailed theoretical development. Moreover, in the present section
we focus in greater detail on the differentiable cost function case and the
potential benefits of differentiability. Our focus in subsequent chapters will
be primarily on nondifferentiable problems.

2.1.1 Differentiable Cost Function Descent - Unconstrained


Problems

A natural iterative descent approach to minimizing a real-valued function


$f : \Re^n \mapsto \Re$ over a set $X$ is based on cost improvement: starting with a
point $x_0 \in X$, construct a sequence $\{x_k\} \subset X$ such that

    $f(x_{k+1}) < f(x_k), \qquad k = 0, 1, \ldots,$

unless $x_k$ is optimal for some $k$, at which time the method stops.


In this context it is useful to consider the directional derivative of f
at a point x in a direction d. For a differentiable f, it is given by

    $f'(x; d) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha d) - f(x)}{\alpha} = \nabla f(x)'d,$           (2.5)

(cf. Section A.3 of Appendix A). From this formula it follows that if $d_k$ is
a descent direction at $x_k$, in the sense that

    $f'(x_k; d_k) = \nabla f(x_k)'d_k < 0,$

we may reduce the cost by moving from $x_k$ along $d_k$ with a small enough
positive stepsize $\alpha$. In the unconstrained case where $X = \Re^n$, this leads to
an algorithm of the form

    $x_{k+1} = x_k + \alpha_k d_k,$                                              (2.6)

where $d_k$ is a descent direction at $x_k$ and $\alpha_k$ is a positive scalar stepsize. If
no descent direction can be found at $x_k$, i.e., $f'(x_k; d) \ge 0$ for all $d \in \Re^n$,
from Eq. (2.5) it follows that $x_k$ must satisfy the necessary condition for
optimality
    $\nabla f(x_k) = 0.$

Gradient Methods for Differentiable Unconstrained Minimization

For the case where $f$ is differentiable and $X = \Re^n$, there are many popular
descent algorithms of the form (2.6). An important example is the classical
gradient method, where we use $d_k = -\nabla f(x_k)$ in Eq. (2.6):

    $x_{k+1} = x_k - \alpha_k \nabla f(x_k).$

Since for differentiable $f$ we have
    $f'(x_k; d) = \nabla f(x_k)'d,$
it follows that

    $-\frac{\nabla f(x_k)}{\|\nabla f(x_k)\|} \in \arg\min_{\|d\| \le 1} f'(x_k; d)$

[assuming $\nabla f(x_k) \ne 0$]. Thus the gradient method is the descent algorithm
of the form (2.6) that uses the direction that yields the greatest rate of
cost improvement. For this reason it is also called the method of steepest
descent.
     Let us now discuss the convergence rate of the steepest descent method,
assuming that $f$ is twice continuously differentiable. With proper step-
size choice, it can be shown that the method has a linear rate, assuming
that it generates a sequence $\{x_k\}$ that converges to a vector $x^*$ such that
$\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite. For example, if $\alpha_k$ is a
sufficiently small constant $\alpha > 0$, the corresponding iteration

    $x_{k+1} = x_k - \alpha \nabla f(x_k),$                                         (2.7)

can be shown to be contractive within a sphere centered at $x^*$, so it con-
verges linearly.
     To get a sense of this, assume for convenience that $f$ is quadratic,†
so by adding a suitable constant to $f$, we have

    $f(x) = \tfrac{1}{2}(x - x^*)'Q(x - x^*), \qquad \nabla f(x) = Q(x - x^*),$
t Convergence analysis using a quadratic model is commonly used in nonlin-
ear programming. The rationale is that behavior of an algorithm for a positive
definite quadratic cost function is typically a correct predictor of its behavior for
a twice differentiable cost function in the neighborhood of a minimum where the
Hessian matrix is positive definite. Since the gradient is zero at that minimum,
the positive definite quadratic term dominates the other terms in the Taylor se-
ries expansion, and the asymptotic behavior of the method does not depend on
terms of order higher than two.
This time-honored line of analysis underlies some of the most widely used
unconstrained optimization methods, such as Newton, quasi-Newton, and conju-
gate direction methods, which will be briefly discussed later. However, the ratio-
nale for these methods is weakened when the Hessian is singular at the minimum,
since in this case third order terms may become significant. For this reason, when
considering algorithmic options for a given differentiable optimization problem,
it is important to consider (in addition to its cost function structure) whether
the problem is "singular" or "nonsingular."

where $Q$ is the positive definite symmetric Hessian of $f$. Then for a constant
stepsize $\alpha$, the steepest descent iteration (2.7) can be written as

    $x_{k+1} - x^* = (I - \alpha Q)(x_k - x^*).$

For $\alpha < 2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of $Q$, the matrix
$I - \alpha Q$ has eigenvalues strictly within the unit circle, and is a contraction
with respect to the Euclidean norm. It can be shown (cf. Exercise 2.1)
that the optimal modulus of contraction can be achieved with the stepsize
choice
    $\alpha^* = \frac{2}{M + m},$
where $M$ and $m$ are the largest and smallest eigenvalues of $Q$, respectively.
With this stepsize, we obtain the linear convergence rate estimate

    $\|x_{k+1} - x^*\| \le \left( \frac{M/m - 1}{M/m + 1} \right) \|x_k - x^*\|.$           (2.8)

Thus the convergence rate of steepest descent may be estimated in terms of
the condition number of $Q$, the ratio $M/m$ of largest to smallest eigenvalue.
As the condition number increases to oo (i.e., the problem is increasingly
"ill-conditioned") the modulus of contraction approaches 1, and the conver-
gence can be very slow. This is the dominant characteristic of the behavior
of gradient methods for the class of twice differentiable problems with pos-
itive definite Hessian. This class of problems is very broad, so condition
number issues often become the principal consideration when implementing
gradient methods in practice.
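
To see the effect of the condition number numerically, here is a minimal Python
sketch of steepest descent with the constant stepsize $2/(M + m)$ on a diagonal
quadratic; the test matrices and iteration count are illustrative assumptions.

    import numpy as np

    def steepest_descent_quadratic(Q, x0, num_iters=50):
        # Minimize f(x) = (1/2) x'Qx with the constant stepsize 2/(M + m), where
        # M and m are the largest and smallest eigenvalues of Q; the minimizer is 0.
        eigs = np.linalg.eigvalsh(Q)
        alpha = 2.0 / (eigs[-1] + eigs[0])
        x = x0.astype(float)
        for _ in range(num_iters):
            x = x - alpha * (Q @ x)             # gradient of f is Qx
        return x

    # With M/m = 100 the contraction factor (M/m - 1)/(M/m + 1) is about 0.98,
    # so 50 iterations make little progress; with M/m = 2 the factor is 1/3.
    for Q in (np.diag([1.0, 100.0]), np.diag([1.0, 2.0])):
        print(np.linalg.norm(steepest_descent_quadratic(Q, np.array([1.0, 1.0]))))
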
     Choosing an appropriate constant stepsize may require some prelim-
inary experimentation. Another possibility is the line minimization rule,
which uses some specialized line search algorithm to determine

    $\alpha_k \in \arg\min_{\alpha \ge 0} f\bigl(x_k - \alpha \nabla f(x_k)\bigr).$

With this rule, when the steepest descent method converges to a vector $x^*$
such that $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite, its convergence rate
is also linear, but not faster than the one of Eq. (2.8), which is associated
with an optimally chosen constant stepsize (see [Ber99], Section 1.3).
     If the method converges to an optimal point $x^*$ where the Hessian
matrix $\nabla^2 f(x^*)$ is singular or does not exist, the convergence rate that we
can guarantee is typically slower than linear. For example, with a properly
chosen constant stepsize, and under some reasonable conditions (Lipschitz
continuity of $\nabla f$), we can show that

    $f(x_k) - f^* \le \frac{c(x_0)}{k}, \qquad k = 1, 2, \ldots,$                    (2.9)



where $f^*$ is the optimal value of $f$ and $c(x_0)$ is a constant that depends on
the initial point $x_0$ (see Section 6.1).
     For problems where $\nabla f$ is continuous but cannot be assumed Lips-
chitz continuous at or near the minimum, it is necessary to use a stepsize
rule that can produce time-varying stepsizes. For example in the scalar case
where $f(x) = |x|^{3/2}$, the steepest descent method with any constant step-
size oscillates around the minimum x* = 0, because the gradient grows too
fast around x*. However, the line minimization rule as well as other rules,
such as the Armijo rule to be discussed shortly, guarantee a satisfactory
form of convergence (see the end-of-chapter exercises and the discussion of
Section 6.1).
     On the other hand, with additional assumptions on the structure of
$f$, we can obtain a faster convergence than the $O(1/k)$ estimate on the
cost function error of Eq. (2.9). In particular, the rate of convergence to
a singular minimum depends on the order of growth of the cost function
near that minimum; see [Dun81], which shows that if $f$ is convex, has a
unique minimum $x^*$, and satisfies the growth condition

    $\beta \|x - x^*\|^\gamma \le f(x) - f(x^*), \qquad \forall\ x \text{ such that } f(x) \le f(x_0),$

for some scalars $\beta > 0$ and $\gamma > 2$, then for the method of steepest descent
with the Armijo rule and other related rules we have

    $f(x_k) - f(x^*) = O\left( \frac{1}{k^{\gamma/(\gamma - 2)}} \right).$                   (2.10)

Thus for example, with a quartic order of growth of $f$ ($\gamma = 4$), an $O(1/k^2)$
estimate is obtained for the cost function error after $k$ iterations. The paper
[Dun81] provides a more comprehensive analysis of the convergence rate of
gradient-type methods based on order of growth conditions, including cases
where the convergence rate is linear and faster than linear.

Scaling

To improve the convergence rate of the steepest descent method one may
"scale" the gradient $\nabla f(x_k)$ by multiplication with a positive definite sym-
metric matrix $D_k$, i.e., use a direction $d_k = -D_k \nabla f(x_k)$, leading to the
algorithm
    $x_{k+1} = x_k - \alpha_k D_k \nabla f(x_k);$                                  (2.11)
cf. Fig. 2.1.1. Since for $\nabla f(x_k) \ne 0$ we have

    $d_k' \nabla f(x_k) = -\nabla f(x_k)' D_k \nabla f(x_k) < 0,$

it follows that we still have a cost descent method, as long as the positive
stepsize $\alpha_k$ is sufficiently small so that $f(x_{k+1}) < f(x_k)$.

Figure 2.1.1. Illustration of descent directions. Any direction of the form

    $d_k = -D_k \nabla f(x_k),$

where $D_k$ is a positive definite matrix, is a descent direction because
$d_k' \nabla f(x_k) = -\nabla f(x_k)' D_k \nabla f(x_k) < 0$. In this case $d_k$ makes an angle less
than $\pi/2$ with $-\nabla f(x_k)$.

     Scaling is a major concept in the algorithmic theory of nonlinear pro-
gramming. It is motivated by the idea of modifying the "effective condition
number" of the problem through a linear change of variables of the form
$x = D_k^{1/2} y$. In particular, the iteration (2.11) may be viewed as a steepest
descent iteration

    $y_{k+1} = y_k - \alpha_k \nabla h_k(y_k)$

for the equivalent problem of minimizing the function $h_k(y) = f(D_k^{1/2} y)$.
For a quadratic problem, where $f(x) = \tfrac{1}{2}x'Qx - b'x$, the condition number
of $h_k$ is the ratio of largest to smallest eigenvalue of the matrix $D_k^{1/2} Q D_k^{1/2}$
(rather than $Q$).
     Much of unconstrained nonlinear programming methodology deals
with ways to compute "good" scaling matrices $D_k$, i.e., matrices that result
in fast convergence rate. The "best" scaling in this sense is attained with

    $D_k = \bigl( \nabla^2 f(x_k) \bigr)^{-1},$

assuming that the inverse above exists and is positive definite, which asymp-
totically leads to an "effective condition number" of 1. This is Newton's
method, which will be discussed shortly. A simpler alternative is to use a
diagonal approximation to the Hessian matrix $\nabla^2 f(x_k)$, i.e., the diagonal

matrix $D_k$ that has the inverse second partial derivatives

    $\left( \frac{\partial^2 f(x_k)}{(\partial x^i)^2} \right)^{-1}, \qquad i = 1, \ldots, n,$

along the diagonal. This often improves the performance of the classical
gradient method dramatically, by providing automatic scaling of the units
in which the components $x^i$ of $x$ are measured, and also facilitates the choice
of stepsize - good values of $\alpha_k$ are often close to 1 (see the subsequent
discussion of Newton's method and sources such as [Ber99], Section 1.3).
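
As a concrete illustration, the following Python sketch applies this diagonal scaling
to a badly scaled quadratic; the test function and the stepsize $\alpha_k = 1$ are
assumptions chosen for the example.

    import numpy as np

    def diagonally_scaled_gradient(grad, diag_hess, x0, alpha=1.0, num_iters=100):
        # Scaled gradient iteration x_{k+1} = x_k - alpha * D_k grad f(x_k), with D_k
        # diagonal and equal to the inverse second partial derivatives.
        x = x0.astype(float)
        for _ in range(num_iters):
            x = x - alpha * grad(x) / diag_hess(x)
        return x

    # Hypothetical test function f(x) = (1/2)(x_1^2 + 1000 x_2^2), whose components
    # are measured in badly mismatched units.
    grad = lambda x: np.array([x[0], 1000.0 * x[1]])
    diag_hess = lambda x: np.array([1.0, 1000.0])
    print(diagonally_scaled_gradient(grad, diag_hess, np.array([1.0, 1.0])))
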
The nonlinear programming methodology also prominently includes
quasi-Newton methods, which construct scaling matrices iteratively, using
gradient information collected during the algorithmic process (see nonlin-
ear programming textbooks such as [Pol71], [GMW81], [Lue84], [DeS96],
[Ber99], [Fle00], [NoW06], [LuY08]). Some of these methods approximate
the full inverse Hessian of f, and eventually attain the fast convergence
rate of Newton's method. Other methods use a limited number of gradient
vectors from previous iterations (have "limited memory") to construct a
relatively crude but still effective approximation to the Hessian of $f$, and
attain a convergence rate that is considerably faster than the one of the
unscaled gradient method; see [Noc80], [NoW06].

Gradient Methods with Extrapolation

A variant of the gradient method, known as gradient method with mo-


mentum, involves extrapolation along the direction of the difference of the
preceding two iterates:

    $x_{k+1} = x_k - \alpha_k \nabla f(x_k) + \beta_k(x_k - x_{k-1}),$               (2.12)

where $\beta_k$ is a scalar in $[0, 1)$, and we define $x_{-1} = x_0$. When $\alpha_k$ and $\beta_k$ are
chosen to be constant scalars $\alpha$ and $\beta$, respectively, the method is known as
the heavy ball method [Pol64]; see Fig. 2.1.2. This is a sound method with
guaranteed convergence under a Lipschitz continuity assumption on $\nabla f$. It
can be shown to have faster convergence rate than the corresponding gradi-
ent method where $\alpha_k$ is constant and $\beta_k = 0$ (see [Pol87], Section 3.2.1, or
[Ber99], Section 1.3). In particular, for a positive definite quadratic prob-
lem, and with optimal choices of the constants $\alpha$ and $\beta$, the convergence
rate of the heavy ball method is linear, and is governed by the formula (2.8)
but with $\sqrt{M/m}$ in place of $M/m$. This is a substantial improvement over
the steepest descent method, although the method can still be very slow.
Simple examples also suggest that with a momentum term, the steepest
descent method is less prone to getting trapped at "shallow" local minima,
and deals better with cost functions that are alternately very flat and very
steep along the path of the algorithm.

Figure 2.1.2. Illustration of the heavy ball method (2.12), where $\alpha_k = \alpha$ and
$\beta_k = \beta$. The iteration consists of a gradient step from $x_k$, followed by an
extrapolation step along the direction $x_k - x_{k-1}$.
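
A minimal Python sketch of the heavy ball iteration (2.12) with constant parameters
is given below; the quadratic test problem and the parameter formulas used (the
commonly cited optimal constants for the quadratic case) are illustrative
assumptions.

    import numpy as np

    def heavy_ball(grad, x0, alpha, beta, num_iters=200):
        # x_{k+1} = x_k - alpha grad f(x_k) + beta (x_k - x_{k-1}), with x_{-1} = x_0.
        x_prev = x0.astype(float)
        x = x0.astype(float)
        for _ in range(num_iters):
            x_new = x - alpha * grad(x) + beta * (x - x_prev)
            x_prev, x = x, x_new
        return x

    # Quadratic test problem with eigenvalues m = 1, M = 100; the constants below are
    # the commonly cited optimal choices for this case.
    Q = np.diag([1.0, 100.0])
    grad = lambda x: Q @ x
    alpha = 4.0 / (np.sqrt(100.0) + 1.0) ** 2
    beta = ((np.sqrt(100.0) - 1.0) / (np.sqrt(100.0) + 1.0)) ** 2
    print(np.linalg.norm(heavy_ball(grad, np.array([1.0, 1.0]), alpha, beta)))
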

A method with similar structure as (2.12), proposed in [Nes83], has


received a lot of attention because it has optimal iteration complexity prop-
erties under certain conditions, including Lipschitz continuity of $\nabla f$. As
we will see in Section 6.2, it improves on the $O(1/k)$ error estimate (2.9)
of the gradient method by a factor of $1/k$. The iteration of this method,
when applied to unconstrained minimization of a differentiable function $f$,
is commonly described in two steps: first an extrapolation step, to compute

    $y_k = x_k + \beta_k(x_k - x_{k-1}),$

with $\beta_k$ chosen in a special way so that $\beta_k \to 1$, and then a gradient step
with constant stepsize $\alpha$, and gradient calculated at $y_k$,

    $x_{k+1} = y_k - \alpha \nabla f(y_k).$

Compared to the method (2.12), it reverses the order of gradient calculation
and extrapolation, and uses $\nabla f(y_k)$ in place of $\nabla f(x_k)$.
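
A minimal Python sketch of this extrapolation-then-gradient scheme follows; since
the text only requires $\beta_k \to 1$, the particular sequence $\beta_k = (k-1)/(k+2)$ and
the quadratic test problem are assumptions of the sketch.

    import numpy as np

    def accelerated_gradient(grad, x0, alpha, num_iters=200):
        # Extrapolation step y_k = x_k + beta_k (x_k - x_{k-1}), followed by the
        # gradient step x_{k+1} = y_k - alpha grad f(y_k), with beta_k -> 1.
        x_prev = x0.astype(float)
        x = x0.astype(float)
        for k in range(num_iters):
            beta_k = (k - 1.0) / (k + 2.0) if k >= 1 else 0.0
            y = x + beta_k * (x - x_prev)
            x_prev, x = x, y - alpha * grad(y)
        return x

    Q = np.diag([1.0, 100.0])                   # hypothetical quadratic test problem
    grad = lambda x: Q @ x
    print(np.linalg.norm(accelerated_gradient(grad, np.array([1.0, 1.0]), alpha=0.01)))
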

Conjugate Gradient Methods

There is an interesting connection between the extrapolation method (2.12)


and the conjugate gradient method for unconstrained differentiable opti-
mization. This is a classical method, with an extensive theory, and the dis-
tinctive property that it minimizes an n-dimensional convex quadratic cost
function in at most n iterations, each involving a single line minimization.
Fast progress is often obtained in much less than n iterations, depending
on the eigenvalue structure of the quadratic cost [see e.g., [Ber82a] (Section
1.3.4), or [Lue84] (Chapter 8)]. The method can be implemented in several
different ways, for which we refer to textbooks such as [Lue84], [Ber99].
It is a member of the more general class of conjugate direction methods,
which involve a sequence of exact line searches along directions that are
orthogonal with respect to some generalized inner product.

     It turns out that if the parameters $\alpha_k$ and $\beta_k$ in iteration (2.12) are
chosen optimally for each $k$ so that

    $(\alpha_k, \beta_k) \in \arg\min_{\alpha, \beta \in \Re} f\bigl(x_k - \alpha \nabla f(x_k) + \beta(x_k - x_{k-1})\bigr), \qquad k = 0, 1, \ldots,$     (2.13)

with $x_{-1} = x_0$, the resulting method is an implementation of the conjugate
gradient method (see e.g., [Ber99], Section 1.6). By this we mean that if $f$
is a convex quadratic function, the method (2.12) with the stepsize choice
(2.13) generates exactly the same iterates as the conjugate gradient method,
and hence minimizes $f$ in at most $n$ iterations. Finding the optimal pa-
rameters according to Eq. (2.13) requires solution of a two-dimensional
optimization problem in $\alpha$ and $\beta$, which may be impractical in the absence
of special structure. However, this optimization is facilitated in some im-
portant special cases, which also favor the use of other types of conjugate
direction methods.†
There are several other ways to implement the conjugate gradient
method, all of which generate identical iterates for quadratic cost functions,
but may differ substantially in their behavior for nonquadratic ones. One of
them, which resembles the preceding extrapolation methods, is the method
of parallel tangents or PARTAN, first proposed in the paper [SBK64]. In
particular, each iteration of PARTAN involves extrapolation and two one-
dimensional line minimizations. At the typical iteration, given Xk, we
obtain Xk+i as follows:
(1) We find a vector $y_k$ that minimizes $f$ over the line

        $\bigl\{ x_k - \gamma \nabla f(x_k) \mid \gamma \ge 0 \bigr\}.$

(2) We generate $x_{k+1}$ by minimizing $f$ over the line that passes through
    $x_{k-1}$ and $y_k$.

     † Examples of favorably structured problems for conjugate direction methods
include cost functions of the form $f(x) = h(Ax)$, where $A$ is a matrix such that
the calculation of the vector $y = Ax$ for a given $x$ is far more expensive than the
calculation of $h(y)$ and its gradient and Hessian (assuming it exists). Several of
the applications described in Sections 1.3 and 1.4 are of this type; see also the
papers [NaZ05] and [GoS10], where the application of the subspace minimization
method (2.13) and PARTAN are discussed. For such problems, calculation of a
stepsize by line minimization along a direction $d$, as in various types of conjugate
direction methods, is relatively inexpensive. In particular, calculation of values,
first, and second derivatives of the function $g(\alpha) = f(x + \alpha d) = h(Ax + \alpha Ad)$
requires just two expensive operations: the one-time calculation of the matrix-
vector products $Ax$ and $Ad$. Similarly, minimization over a subspace that passes
through $x$ and is spanned by $m$ directions $d_1, \ldots, d_m$, requires the one-time cal-
culation of the matrix-vector products $Ax$ and $Ad_1, \ldots, Ad_m$.

Figure 2.1.3. Illustration of the two-step method

    $y_k = x_k - \gamma_k \nabla f(x_k), \qquad x_{k+1} = y_k + \beta_k(y_k - x_{k-1}).$

By writing the method equivalently as

    $x_{k+1} = x_k - \gamma_k(1 + \beta_k)\nabla f(x_k) + \beta_k(x_k - x_{k-1}),$

we see that the heavy ball method (2.12) with constant parameters $\alpha$ and $\beta$ is
obtained when $\gamma_k = \alpha/(1 + \beta)$ and $\beta_k = \beta$. The PARTAN method is obtained
when $\gamma_k$ and $\beta_k$ are chosen by line minimization, in which case the corresponding
parameter $\alpha_k$ of iteration (2.12) is $\alpha_k = \gamma_k(1 + \beta_k)$.

     This iteration is a special case of the gradient method with momentum
(2.12), corresponding to special choices of $\alpha_k$ and $\beta_k$. To see this, observe
that we can write iteration (2.12) as a two-step method:

    $y_k = x_k - \gamma_k \nabla f(x_k), \qquad x_{k+1} = y_k + \beta_k(y_k - x_{k-1}),$

where
    $\gamma_k = \frac{\alpha_k}{1 + \beta_k}.$

Thus starting from $x_k$, the parameter $\beta_k$ is determined by the second
line search of PARTAN as the optimal stepsize along the line that passes
through $x_{k-1}$ and $y_k$, and then $\alpha_k$ is determined as $\gamma_k(1 + \beta_k)$, where $\gamma_k$
is the optimal stepsize along the line

    $\bigl\{ x_k - \gamma \nabla f(x_k) \mid \gamma \ge 0 \bigr\}$

(cf. Fig. 2.1.3).


The salient property of PARTAN is that when f is convex quadratic
it is mathematically equivalent to the conjugate gradient method (it gen-
erates exactly the same iterates and terminates in at most n iterations).
For this it is essential that the line minimizations are exact, which may

be difficult to guarantee in practice. However, PARTAN seems to be quite


resilient to line minimization errors relative to other conjugate gradient
implementations. Note that PARTAN ensures at least as large cost re-
duction at each iteration, as the steepest descent method, since the latter
method omits the second line minimization. Thus even for nonquadratic
cost functions it tends to perform faster than steepest descent, and often
considerably so. We refer to [Lue84], [Pol87], and [Ber99], Section 1.6, for
further discussion. These books also address additional issues for the con-
jugate gradient and other conjugate direction methods, such as alternative
implementations, scaling (also called preconditioning), one-dimensional line
search algorithms, and rate of convergence.

Newton's Method

In Newton's method the descent direction is

    $d_k = -\bigl(\nabla^2 f(x_k)\bigr)^{-1} \nabla f(x_k),$

provided $\nabla^2 f(x_k)$ exists and is positive definite, so the iteration takes the
form
    $x_{k+1} = x_k - \alpha_k \bigl(\nabla^2 f(x_k)\bigr)^{-1} \nabla f(x_k).$
If $\nabla^2 f(x_k)$ is not positive definite, some modification is necessary. There
are several possible modifications of this type, for which the reader may
consult nonlinear programming textbooks. The simplest one is to add to
$\nabla^2 f(x_k)$ a small positive multiple of the identity. Generally, when $f$ is
convex, $\nabla^2 f(x_k)$ is positive semidefinite (Prop. 1.1.10 in Appendix B), and
this facilitates the implementation of reliable Newton-type algorithms.
     The idea in Newton's method is to minimize at each iteration the
quadratic approximation of $f$ around the current point $x_k$ given by

    $\tilde f_k(x) = f(x_k) + \nabla f(x_k)'(x - x_k) + \tfrac{1}{2}(x - x_k)'\nabla^2 f(x_k)(x - x_k).$

By setting the gradient of $\tilde f_k(x)$ to zero,

    $\nabla f(x_k) + \nabla^2 f(x_k)(x - x_k) = 0,$

and solving for $x$, we obtain as next iterate the minimizing point

    $x_{k+1} = x_k - \bigl(\nabla^2 f(x_k)\bigr)^{-1} \nabla f(x_k).$                 (2.14)

This is the Newton iteration corresponding to a stepsize $\alpha_k = 1$. It follows
that, assuming $\alpha_k = 1$, Newton's method finds the global minimum of a
positive definite quadratic function in a single iteration.
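
A bare-bones Python sketch of the pure Newton iteration (2.14) is given below; the
convex test function and the use of a linear solve in place of an explicit matrix
inversion are choices made for the example.

    import numpy as np

    def newton_method(grad, hess, x0, num_iters=20, tol=1e-10):
        # Pure Newton iteration with stepsize 1: solve hess(x_k) d = -grad(x_k) and
        # set x_{k+1} = x_k + d; the Hessian is assumed positive definite.
        x = x0.astype(float)
        for _ in range(num_iters):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            x = x + np.linalg.solve(hess(x), -g)
        return x

    # Hypothetical convex test function f(x) = x_1^4 + x_2^2.
    grad = lambda x: np.array([4.0 * x[0] ** 3, 2.0 * x[1]])
    hess = lambda x: np.array([[12.0 * x[0] ** 2, 0.0], [0.0, 2.0]])
    print(newton_method(grad, hess, np.array([1.0, 1.0])))
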
     Newton's method typically converges very fast asymptotically, assum-
ing that it converges to a vector $x^*$ such that $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$
is positive definite, and that a stepsize $\alpha_k = 1$ is used, at least after some
iteration. For a simple argument, we may use Taylor's theorem to write

    $\nabla f(x_k) = \nabla^2 f(x^*)(x_k - x^*) + o\bigl(\|x_k - x^*\|\bigr).$

By multiplying this relation with $\bigl(\nabla^2 f(x_k)\bigr)^{-1}$ we have

    $x_k - x^* - \bigl(\nabla^2 f(x_k)\bigr)^{-1} \nabla f(x_k) = o\bigl(\|x_k - x^*\|\bigr),$

so for the Newton iteration with stepsize $\alpha_k = 1$ we obtain

    $x_{k+1} - x^* = o\bigl(\|x_k - x^*\|\bigr),$

or, for $x_k \ne x^*$,

    $\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = \lim_{k \to \infty} \frac{o\bigl(\|x_k - x^*\|\bigr)}{\|x_k - x^*\|} = 0,$

implying convergence that is faster than linear (also called superlinear).
This argument can also be used to show local convergence to $x^*$ with $\alpha_k = 1$,
that is, convergence assuming that $x_0$ is sufficiently close to $x^*$.
In implementations of Newton's method, some stepsize rule is often
used to ensure cost reduction, but the rule is typically designed so that
near convergence we have °'k = 1, to ensure that a superlinear convergence
rate is attained [assuming $\nabla^2 f(x^*)$ is positive definite at the limit $x^*$].
Methods that approximate Newton's method also use a stepsize close to
1, and modify the stepsize based on the results of the computation (see
sources on nonlinear programming, such as [Ber99], Section 1.4).
     The price for the fast convergence of Newton's method is the overhead
required to calculate the Hessian matrix, and to solve the linear system of
equations
    $\nabla^2 f(x_k)\, d_k = -\nabla f(x_k)$
in order to find the Newton direction. There are many iterative algorithms
that are patterned after Newton's method, and aim to strike a balance be-
tween fast convergence and high overhead (e.g., quasi-Newton, conjugate
direction, and others, extensive discussions of which may be found in non-
linear programming textbooks such as [GMW81], [DeS96], [Ber99], [Fle00],
[BSS06], [NoW06], [LuY08]).
We finally note that for some problems the special structure of the
Hessian matrix can be exploited to facilitate the implementation of New-
ton's method. For example the Hessian matrix of the dual function of the
separable convex programming problem of Section 1.1, when it exists, has
particularly favorable structure; see [Ber99], Section 6.1. The same is true
for optimal control problems that involve a discrete-time dynamic system
and a cost function that is additive over time; see [Ber99], Section 1.9.

Stepsize Rules

There are several methods to choose the stepsize $\alpha_k$ in the scaled gradient
iteration (2.11). For example, $\alpha_k$ may be chosen by line minimization:

    $\alpha_k \in \arg\min_{\alpha \ge 0} f\bigl(x_k - \alpha D_k \nabla f(x_k)\bigr).$

This can typically be implemented only approximately, with some iterative
one-dimensional optimization algorithm; there are several such algorithms
(see nonlinear programming textbooks such as [GMW81], [Ber99], [BSS06],
[NoW06], [LuY08]).
     Our analysis in subsequent chapters of this book will mostly focus on
two cases: when $\alpha_k$ is chosen to be constant,

    $\alpha_k = \alpha, \qquad k = 0, 1, \ldots,$

and when $\alpha_k$ is chosen to be diminishing to 0, while satisfying the condi-
tions†

    $\sum_{k=0}^\infty \alpha_k = \infty, \qquad \sum_{k=0}^\infty \alpha_k^2 < \infty.$              (2.15)
A convergence analysis for these two stepsize rules is given in the end-
of-chapter exercises, and also in Chapter 3, in the context of subgradient
methods, as well as in Section 6.1.
We emphasize the constant and diminishing stepsize rules because
they are the ones that most readily generalize to nondifferentiable cost
functions. However, other stepsize rules, briefly discussed in this chap-
ter, are also important, particularly for differentiable problems, and are
used widely. One possibility is the line minimization rule discussed earlier.
There are also other rules, which are simple and are based on successive
reduction of $\alpha_k$, until a form of descent is achieved that guarantees con-
vergence. One of these, the Armijo rule (first proposed in [Arm66], and
sometimes called backtracking rule), is popular in unconstrained minimiza-
tion algorithm implementations. It is given by

    $\alpha_k = \beta^{m_k} s_k,$

where $m_k$ is the first nonnegative integer $m$ for which

    $f(x_k) - f\bigl(x_k - \beta^m s_k D_k \nabla f(x_k)\bigr) \ge \sigma \beta^m s_k \nabla f(x_k)' D_k \nabla f(x_k),$
     † The condition $\sum_{k=0}^\infty \alpha_k = \infty$ is needed so that the method can approach
the minimum from arbitrarily far, and the condition $\sum_{k=0}^\infty \alpha_k^2 < \infty$ is needed so
that $\alpha_k \to 0$ and also for technical reasons relating to the convergence analysis
(see Section 3.2). If $f$ is a positive definite quadratic, the steepest descent method
with a diminishing stepsize $\alpha_k$ satisfying $\sum_{k=0}^\infty \alpha_k = \infty$ can be shown to converge
to the optimal solution, but at a rate that is slower than linear.

[Figure: plot of $f\bigl(x_k - \alpha D_k \nabla f(x_k)\bigr) - f(x_k)$ versus $\alpha$, with slope
$-\nabla f(x_k)' D_k \nabla f(x_k)$ at $\alpha = 0$, the test line $-\sigma \alpha \nabla f(x_k)' D_k \nabla f(x_k)$,
and the resulting set of acceptable stepsizes.]

Figure 2.1.4. Illustration of the successive points tested by the Armijo rule along
the descent direction $d_k = -D_k \nabla f(x_k)$. In this figure, $\alpha_k$ is obtained as $\beta^2 s_k$
after two unsuccessful trials. Because $\sigma \in (0, 1)$, the set of acceptable stepsizes
begins with a nontrivial interval when $d_k \ne 0$. This implies that if $d_k \ne 0$,
the Armijo rule will find an acceptable stepsize with a finite number of stepsize
reductions.

where $\beta \in (0, 1)$ and $\sigma \in (0, 1)$ are some constants, and $s_k > 0$ is a positive
initial stepsize, chosen to be either constant or through some simplified
search or polynomial interpolation. In other words, starting with an initial
trial $s_k$, the stepsizes $\beta^m s_k$, $m = 0, 1, \ldots$, are tried successively until the
above inequality is satisfied for $m = m_k$; see Fig. 2.1.4. We will explore
the convergence properties of this rule in the exercises.
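
A minimal Python sketch of the Armijo rule, specialized to the direction
$d_k = -\nabla f(x_k)$ (i.e., $D_k$ equal to the identity), follows; the parameter values
$s$, $\beta$, $\sigma$ and the test function are illustrative assumptions.

    import numpy as np

    def armijo_stepsize(f, grad_x, x, d, s=1.0, beta=0.5, sigma=1e-4):
        # Try stepsizes s, beta*s, beta^2*s, ... until the sufficient descent test
        #     f(x) - f(x + alpha d) >= -sigma * alpha * grad_x' d
        # holds; d is assumed to be a descent direction (grad_x' d < 0).
        alpha = s
        while f(x) - f(x + alpha * d) < -sigma * alpha * grad_x.dot(d):
            alpha *= beta
        return alpha

    # Hypothetical test function f(x) = ||x||^4, whose gradient is not Lipschitz
    # continuous over the whole space.
    f = lambda x: np.linalg.norm(x) ** 4
    grad = lambda x: 4.0 * np.linalg.norm(x) ** 2 * x
    x = np.array([2.0, 0.0])
    g = grad(x)
    alpha = armijo_stepsize(f, g, x, -g)
    print(alpha, f(x - alpha * g))
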
Aside from guaranteeing cost function descent, successive reduction
rules have the additional benefit of adapting the size of the stepsize $\alpha_k$ to
the search direction $-D_k \nabla f(x_k)$, particularly when the initial stepsize $s_k$
is chosen by some simplified search process. We refer to nonlinear program-
ming sources for detailed discussions.
Note that the diminishing stepsize rule does not guarantee cost func-
tion descent at each iteration, although it reduces the cost function value
once the stepsize becomes sufficiently small. There are also some other
rules, often called nonmonotonic, which do not explicitly try to enforce
cost function descent and have achieved some success, but are based on
ideas that we will not discuss in this book; see [GLL86], [BaB88], [Ray93],
[Ray97], [BMR00], [DHS06]. An alternative approach to enforce descent
without explicitly using stepsizes is based on the trust region methodol-
ogy for which we refer to book sources such as [Ber99], [CGT00], [Fle00],
[NoW06].

2.1.2 Constrained Problems - Feasible Direction Methods

Let us now consider minimizing a differentiable cost function $f$ over a closed
convex subset $X$ of $\Re^n$. In a natural form of the cost function descent
approach, we may consider generating a feasible sequence $\{x_k\} \subset X$ with
an iteration of the form

    $x_{k+1} = x_k + \alpha_k d_k,$                                              (2.16)

while enforcing cost improvement. However, this is now more complicated
because it is not enough for $d_k$ to be a descent direction at $x_k$. It must
also be a feasible direction in the sense that $x_k + \alpha d_k$ must belong to $X$ for
small enough $\alpha > 0$, in order for the new iterate $x_{k+1}$ to belong to $X$ with
suitably small choice of $\alpha_k$. By multiplying $d_k$ with a positive constant if
necessary, this essentially restricts $d_k$ to be of the form $\bar x_k - x_k$ for some
$\bar x_k \in X$ with $\bar x_k \ne x_k$. Thus, if $f$ is differentiable, for a feasible descent
direction, it is sufficient that

    $d_k = \bar x_k - x_k,$

for some $\bar x_k \in X$ with $\nabla f(x_k)'(\bar x_k - x_k) < 0$.

Methods of the form (2.16), where dk is a feasible descent direction


were introduced in the 60s (see e.g., the books [Zou60], [Zan69], [Pol71],
[Zou76]), and have been used extensively in applications. We refer to them
as feasible direction methods, and we give examples of some of the most
popular ones.

Conditional Gradient Method

The simplest feasible direction method is to find at iteration $k$,

    $\bar x_k \in \arg\min_{x \in X} \nabla f(x_k)'(x - x_k),$                       (2.17)

and set
    $d_k = \bar x_k - x_k$
in Eq. (2.16); see Fig. 2.1.5. Clearly $\nabla f(x_k)'(\bar x_k - x_k) \le 0$, with equality
holding only if $\nabla f(x_k)'(x - x_k) \ge 0$ for all $x \in X$, which is a necessary
condition for optimality of $x_k$.
     This is the conditional gradient method (also known as the Frank-
Wolfe algorithm) proposed in [FrW56] for convex programming problems
with linear constraints, and for more general problems in [LeP65]. The
method has been used widely in many contexts, as it is theoretically sound,
quite simple, and often convenient. In particular, when $X$ is a polyhedral
set, computation of $\bar x_k$ requires the solution of a linear program. In some
important cases, this linear program has special structure, which results in
great simplifications, e.g., in the multicommodity flow problem of Example

Figure 2.1.5. Illustration of the conditional gradient iteration at $x_k$. We find
$\bar x_k$, a point of $X$ that lies farthest along the negative gradient direction
$-\nabla f(x_k)$. We then set

    $x_{k+1} = x_k + \alpha_k(\bar x_k - x_k),$

where $\alpha_k$ is a stepsize from $(0, 1]$ (the figure illustrates the case where $\alpha_k$
is chosen by line minimization). [Figure shows the level sets of $f$.]

1.4.5 (see the book [BeG92], or the surveys [FlH95], [Pat01]). There has
been intensified interest in the conditional gradient method, thanks to ap-
plications in machine learning; see e.g., [Cla10], [Jag13], [LuT13], [RSW13],
[FrG14], [HJN14], and the references quoted there.
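
As an illustration, here is a minimal Python sketch of the conditional gradient
iteration over the unit simplex, where the linear subproblem (2.17) is solved by
inspection; the least squares test problem and the stepsize $\alpha_k = 2/(k+2)$ are
assumptions of the sketch.

    import numpy as np

    def conditional_gradient_simplex(grad, x0, num_iters=200):
        # Frank-Wolfe over the unit simplex {x >= 0, sum(x) = 1}: the subproblem
        # (2.17) is attained at the vertex e_i with i minimizing the gradient
        # component, and x_{k+1} = x_k + alpha_k (e_i - x_k).
        x = x0.astype(float)
        for k in range(num_iters):
            g = grad(x)
            x_bar = np.zeros_like(x)
            x_bar[int(np.argmin(g))] = 1.0
            x = x + (2.0 / (k + 2.0)) * (x_bar - x)
        return x

    # Hypothetical test problem: minimize ||Ax - b||^2 over the simplex.
    A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
    b = np.array([0.5, 0.5, 0.7])
    grad = lambda x: 2.0 * A.T @ (A @ x - b)
    print(conditional_gradient_simplex(grad, np.array([0.5, 0.5])))
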
However, the conditional gradient method often tends to converge
very slowly relative to its competitors (its asymptotic convergence rate
can be slower than linear even for positive definite quadratic programming
problems); see [CaC68], [Dun79], [Dun80]. For this reason, other methods
with better practical convergence rate properties are often preferred.
One of these methods is the simplicial decomposition algorithm (first
proposed independently in [CaG74] and [Hol74]), which will be discussed
in detail in Chapter 4. This method is not a feasible direction method of
the form (2.16), but instead it is based on multidimensional optimizations
over approximations of the constraint set by convex hulls of finite numbers
of points. When X is a polyhedral set, it converges in a finite number of
iterations, and while this number can potentially be very large, the method
often attains practical convergence in very few iterations. Generally, simpli-
cial decomposition can provide an attractive alternative to the conditional
gradient method because it tends to be well-suited for the same type of
problems [it also requires solution of linear cost subproblems of the form
(2.17); see the discussion of Section 4.2].
Somewhat peculiarly, the practical performance of the conditional
gradient method tends to improve in highly constrained problems. An
explanation for this is given in the papers [Dun79], [DuS83], where it is
shown among others that the convergence rate of the method is linear when
the cost function is positive definite quadratic, and the constraint set is not
polyhedral but rather has a "positive curvature" property (for example it is
a sphere). When there are many linear constraints, the constraint set tends
to have very many closely spaced extreme points, and has this "positive
curvature" property in an approximate sense.

Figure 2.1.6. Illustration of the gradient projection iteration at $x_k$. We move
from $x_k$ along the direction $-\nabla f(x_k)$ and project $x_k - \alpha_k \nabla f(x_k)$ onto $X$
to obtain $x_{k+1}$. We have

    $\nabla f(x_k)'(x_{k+1} - x_k) \le 0,$

and unless $x_{k+1} = x_k$, in which case $x_k$ minimizes $f$ over $X$, the angle be-
tween $\nabla f(x_k)$ and $(x_{k+1} - x_k)$ is strictly greater than 90 degrees, and we have

    $\nabla f(x_k)'(x_{k+1} - x_k) < 0.$

[Figure shows the level sets of $f$.]

Gradient Projection Method

Another major feasible direction method, which generally achieves a faster
convergence rate than the conditional gradient method, is the gradient
projection method (originally proposed in [Gol64], [LeP65]), which has the
form
    $x_{k+1} = P_X\bigl(x_k - \alpha_k \nabla f(x_k)\bigr),$                         (2.18)

where $\alpha_k > 0$ is a stepsize and $P_X(\cdot)$ denotes projection on $X$ (the projec-
tion is well defined since $X$ is closed and convex; see Fig. 2.1.6).
     To get a sense of the validity of the method, note that from the
Projection Theorem (Prop. 1.1.9 in Appendix B), we have

    $\nabla f(x_k)'(x_{k+1} - x_k) \le 0,$

and by the optimality condition for convex functions (cf. Prop. 1.1.8 in
Appendix B), the inequality is strict unless $x_k$ is optimal. Thus $x_{k+1} - x_k$
defines a feasible descent direction at $x_k$, and based on this fact, we can
show the descent property $f(x_{k+1}) < f(x_k)$ when $\alpha_k$ is sufficiently small.
     The stepsize $\alpha_k$ is chosen similar to the unconstrained gradient me-
thod, i.e., constant, diminishing, or through some kind of reduction rule
to ensure cost function descent and guarantee convergence to the opti-
mum; see the convergence analysis of Section 6.1, and [Ber99], Section 2.3,
for a detailed discussion and references. Moreover the convergence rate
estimates given earlier for unconstrained steepest descent in the positive
definite quadratic cost case [cf. Eq. (2.8)] and in the singular case [cf. Eqs.
(2.9) and (2.10)] generalize to the gradient projection method under vari-
ous stepsize rules (see Exercise 2.1 for the former case and [Dun81] for the
latter case).
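
For the bound-constrained case, the projection in (2.18) reduces to componentwise
clipping, as in the following Python sketch; the quadratic test problem and the
constant stepsize are illustrative assumptions.

    import numpy as np

    def gradient_projection_box(grad, lo, hi, x0, alpha, num_iters=200):
        # x_{k+1} = P_X(x_k - alpha grad f(x_k)) for X = {x : lo <= x <= hi};
        # the projection is a componentwise clip.
        x = np.clip(x0.astype(float), lo, hi)
        for _ in range(num_iters):
            x = np.clip(x - alpha * grad(x), lo, hi)
        return x

    # Hypothetical test problem: minimize (1/2)||x - c||^2 with c outside [0, 1]^2.
    c = np.array([2.0, -1.0])
    grad = lambda x: x - c
    print(gradient_projection_box(grad, 0.0, 1.0, np.zeros(2), alpha=0.5))
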

Two-Metric Projection Methods

Despite its simplicity, the gradient projection method has some significant
drawbacks:
(a) Its rate of convergence is similar to the one of steepest descent, and
    is often slow. It is possible to overcome this potential drawback by
    a form of scaling. This can be accomplished with an iteration of the
    form

        $x_{k+1} \in \arg\min_{x \in X} \left\{ \nabla f(x_k)'(x - x_k) + \frac{1}{2\alpha_k}(x - x_k)' H_k (x - x_k) \right\},$     (2.19)

    where $H_k$ is a positive definite symmetric matrix and $\alpha_k$ is a positive
    stepsize. When $H_k$ is the identity, it can be seen that this itera-
    tion gives the same iterate $x_{k+1}$ as the unscaled gradient projection
    iteration (2.18). When $H_k = \nabla^2 f(x_k)$ and $\alpha_k = 1$, we obtain a
    constrained form of Newton's method (see nonlinear programming
    sources for analysis; e.g., [Ber99]).
(b) Depending on the nature of $X$, the projection operation may involve
    substantial overhead. The projection is simple when $H_k$ is the identity
    (or more generally, is diagonal), and $X$ consists of simple lower and/or
    upper bounds on the components of $x$:

        $X = \bigl\{ (x^1, \ldots, x^n) \mid \underline b^i \le x^i \le \bar b^i,\ i = 1, \ldots, n \bigr\}.$          (2.20)

    This is an important special case where the use of gradient projection
    is convenient. Then the projection decomposes to $n$ scalar projec-
    tions, one for each $i = 1, \ldots, n$: the $i$th component of $x_{k+1}$ is obtained
    by projection of the $i$th component of $x_k - \alpha_k \nabla f(x_k)$,

        $\bigl(x_k - \alpha_k \nabla f(x_k)\bigr)^i,$

    onto the interval of corresponding bounds $[\underline b^i, \bar b^i]$, and is very simple.
    However, for general nondiagonal scaling the overhead for solving the
    quadratic programming problem (2.19) is substantial even if $X$ has a
    simple bound structure of Eq. (2.20).
     To overcome the difficulty with the projection overhead, a scaled pro-
jection method known as two-metric projection method has been proposed
for the case of the bound constraints (2.20) in [Ber82a], [Ber82b]. It has a
similar form to the scaled gradient method (2.11), and it is given by

    $x_{k+1} = P_X\bigl(x_k - \alpha_k D_k \nabla f(x_k)\bigr).$                     (2.21)

It is thus a natural and simple adaptation of unconstrained Newton-like
methods to bound-constrained optimization, including quasi-Newton meth-
ods. The main difficulty here is that an arbitrary positive definite matrix

Dk will not necessarily yield a descent direction. However, it turns out that
if some of the off-diagonal terms of Dk that correspond to components of
Xk that are at their boundary are set to zero, one can obtain descent (see
Exercise 2.8). Furthermore, one can select Dk as the inverse of a partially
diagonalized version of the Hessian matrix $\nabla^2 f(x_k)$ and attain the fast
convergence rate of Newton's method (see [Ber82a], [Ber82b], [GaB84]).
The idea of simple two-metric projection with partial diagonaliza-
tion may be generalized to more complex constraint sets, and it has been
adapted in [Ber82b], and subsequent papers such as [GaB84], [Dun91],
[LuT93b], to problems of the form

    minimize    $f(x)$
    subject to  $\underline b \le x \le \bar b$, \quad $Ax = c$,

where $A$ is an $m \times n$ matrix, and $\underline b, \bar b \in \Re^n$ and $c \in \Re^m$ are given vectors. For
example the algorithm (2.21) can be easily modified when the constraint
set involves bounds on the components of $x$ together with a few linear
constraints, e.g., problems involving a simplex constraint such as

    minimize    $f(x)$
    subject to  $0 \le x$, \quad $a'x = c$,

where $a \in \Re^n$ and $c \in \Re$, or a Cartesian product of simplexes. For an
example of a Newton algorithm of this type, applied to the multicommodity
flow problem of Example 1.4.5, see [BeG83]. For representative applications
in related large-scale contexts we refer to the papers [Dun91], [LuT93b],
[FJS98], [Pyt98], [GeM05], [OJW05], [TaP13], [WSK14].
The advantage that the two-metric projection approach can offer is to
identify quickly the constraints that are active at an optimal solution. After
this happens, the method reduces essentially to an unconstrained scaled
gradient method (possibly Newton method, if Dk is a partially diagonalized
Hessian matrix), and attains a fast convergence rate. This property has
also motivated variants of the two-metric projection method for problems
involving $\ell_1$-regularization, such as the ones of Example 1.3.2; see [SFR09],
[Sch10], [GKX10], [SKS12], [Lan14].

Block Coordinate Descent

The preceding methods require the computation of the gradient and possi-
bly the Hessian of the cost function at each iterate. An alternative descent
approach that does not require derivatives or other direction calculations
is the classical block coordinate descent method, which we will briefly de-
scribe here and consider further in Section 6.5. The method applies to the
problem
minimize f (x)
subject to x E X,

where $f : \Re^n \mapsto \Re$ is a differentiable function, and $X$ is a Cartesian product
of closed convex sets $X_1, \ldots, X_m$:

    $X = X_1 \times X_2 \times \cdots \times X_m.$

The vector $x$ is partitioned as

    $x = (x^1, x^2, \ldots, x^m),$

where each $x^i$ belongs to $\Re^{n_i}$, so the constraint $x \in X$ is equivalent to

    $x^i \in X_i, \qquad i = 1, \ldots, m.$

The most common case is when $n_i = 1$ for all $i$, so the components $x^i$
are scalars. The method involves minimization with respect to a single
component $x^i$ at each iteration, with all other components kept fixed.
     In an example of such a method, given the current iterate $x_k =
(x^1_k, \ldots, x^m_k)$, we generate the next iterate $x_{k+1} = (x^1_{k+1}, \ldots, x^m_{k+1})$, ac-
cording to the "cyclic" iteration

    $x^i_{k+1} \in \arg\min_{\xi \in X_i} f\bigl(x^1_{k+1}, \ldots, x^{i-1}_{k+1},\, \xi,\, x^{i+1}_k, \ldots, x^m_k\bigr), \qquad i = 1, \ldots, m.$       (2.22)

Thus, at each iteration, the cost is minimized with respect to each of the
"block coordinate" vectors xi, taken one-at-a-time in cyclic order.
Naturally, the method makes practical sense only if it is possible to
perform this minimization fairly easily. This is frequently so when each xi
is a scalar, but there are also other cases of interest, where xi is a multi-
dimensional vector. Moreover, the method can take advantage of special
structure off; an example of such structure is a form of "sparsity," where
f is the sum of component functions, and for each i, only a relatively small
number of the component functions depend on xi, thereby simplifying the
minimization (2.22). The following is an example of a classical algorithm
that can be viewed as a special case of block coordinate descent.

Example 2.1.1 (Parallel Projections Algorithm)

We are given $m$ closed convex sets $X_1, \ldots, X_m$ in $\Re^r$, and we want to find a
point in their intersection. This problem can equivalently be written as

    minimize    $\sum_{i=1}^m \|y^i - x\|^2$
    subject to  $x \in \Re^r$, \quad $y^i \in X_i$, \quad $i = 1, \ldots, m$,

where the variables of the optimization are $x, y^1, \ldots, y^m$ (the optimal solu-
tions of this problem are the points in the intersection $\cap_{i=1}^m X_i$, if this inter-
section is nonempty). A block coordinate descent algorithm iterates on each
of the vectors $y^1, \ldots, y^m$ in parallel according to

    $y^i_{k+1} = P_{X_i}(x_k), \qquad i = 1, \ldots, m,$

and then iterates with respect to $x$ according to

    $x_{k+1} = \frac{y^1_{k+1} + \cdots + y^m_{k+1}}{m},$

which minimizes the cost function with respect to $x$ when each $y^i$ is fixed at
$y^i_{k+1}$.
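
A minimal Python sketch of this algorithm for two sets in the plane (a box and a
halfspace, chosen only for illustration) is given below; the individual projection
formulas are assumptions of the example.

    import numpy as np

    def parallel_projections(projections, x0, num_iters=100):
        # Project the current x onto every set X_i in parallel, then average, as in
        # Example 2.1.1 (x_{k+1} is the mean of the projections).
        x = x0.astype(float)
        for _ in range(num_iters):
            x = sum(P(x) for P in projections) / len(projections)
        return x

    def project_box(x):                          # X_1 = [0, 1]^2
        return np.clip(x, 0.0, 1.0)

    def project_halfspace(x):                    # X_2 = {x : x_1 + x_2 >= 1}
        a, c = np.array([1.0, 1.0]), 1.0
        slack = a.dot(x) - c
        return x if slack >= 0 else x - (slack / a.dot(a)) * a

    print(parallel_projections([project_box, project_halfspace], np.array([2.0, -1.0])))
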

Here is another example where the coordinate descent method


takes advantage of decomposable structure.

Example 2.1.2 (Hierarchical Decomposition)

Consider an optimization problem of the form

    minimize    $\sum_{i=1}^m \bigl( h_i(y^i) + f_i(x, y^i) \bigr)$
    subject to  $x \in X$, \quad $y^i \in Y_i$, \quad $i = 1, \ldots, m$,

where $X$ and $Y_i$, $i = 1, \ldots, m$, are closed, convex subsets of corresponding
Euclidean spaces, and $h_i$, $f_i$ are given functions, assumed differentiable. This
problem is associated with a paradigm of optimization of a system consisting
of $m$ subsystems, with a cost function $h_i + f_i$ associated with the operations
of the $i$th subsystem. Here $y^i$ is viewed as a vector of local decision variables
that influences the cost of the $i$th subsystem only, and $x$ is viewed as a vector
of global or coordinating decision variables that affects the operation of all
the subsystems.
     The block coordinate descent method has the form

    $y^i_{k+1} \in \arg\min_{y^i \in Y_i} \bigl( h_i(y^i) + f_i(x_k, y^i) \bigr), \qquad i = 1, \ldots, m,$

    $x_{k+1} \in \arg\min_{x \in X} \sum_{i=1}^m f_i(x, y^i_{k+1}).$

The method has a natural real-life interpretation: at each iteration, each


subsystem optimizes its own cost, viewing the global variables as fixed at
their current values, and then the coordinator optimizes the overall cost for
the current values of the local variables (without having to know the "local"
cost functions hi of the subsystems).

In the absence of special structure of f, differentiability is essential


for the validity of the coordinate descent method; this can be verified with
simple examples. In our convergence analysis of Chapter 6, we will also
require a form of strict convexity of $f$ along each block component, as first
suggested in the book [Zan69] (subtle examples of nonconvergence have
been constructed in the absence of a property of this kind [Pow73]). We

note, however, there are some interesting cases where nondifferentiabilities


with special structure can be dealt with; see Section 6.5.
There are several variants of the method, which incorporate various
descent algorithms in the solution of the block minimizations (2.22). An-
other type of variant is one where the block components are iterated in an
irregular order instead of a fixed cyclic order. In fact there is a substan-
tial theory of asynchronous distributed versions of coordinate descent, for
which we refer to the parallel and distributed algorithms book [BeT89a],
and the sources quoted there; see also the discussion in Sections 2.1.6 and
6.5.2.

2.1.3 Nondifferentiable Problems - Subgradient Methods

We will now briefly consider the minimization of a convex nondifferentiable
cost function $f : \Re^n \mapsto \Re$ (optimization of a nonconvex and nondifferen-
tiable function is a far more complicated subject, which we will not address
in this book). It is possible to generalize the steepest descent approach so
that when $f$ is nondifferentiable at $x_k$, we use a direction $d_k$ that minimizes
the directional derivative $f'(x_k; d)$ subject to $\|d\| \le 1$,

    $d_k \in \arg\min_{\|d\| \le 1} f'(x_k; d).$

Unfortunately, this minimization (or more generally finding a descent direction) may involve a nontrivial computation. Moreover, there is a worrisome theoretical difficulty: the method may get stuck far from the optimum, depending on the stepsize rule. An example is given in Fig. 2.1.7, where the stepsize is chosen using the minimization rule

    α_k ∈ arg min_{α ≥ 0} f(x_k + α d_k).

In this example, the algorithm fails even though it never encounters a point
where f is nondifferentiable, which suggests that convergence questions in
convex optimization are delicate and should not be treated lightly. The
problem here is a lack of continuity: the steepest descent direction may
undergo a large/discontinuous change close to the convergence limit. By
contrast, this would not happen if f were continuously differentiable at
the limit, and in fact the steepest descent method with the minimization
stepsize rule has sound convergence properties when used for differentiable
functions.
Because the implementation of cost function descent has the limita-
tions outlined above, a different kind of descent approach, based on the
notion of subgradient, is often used when f is nondifferentiable. The the-
ory of subgradients of extended real-valued functions is outlined in Section
5.4 of Appendix B, as developed in the textbook [Ber09]. The properties of


Figure 2.1.7. An example of failure of the steepest descent method with the line minimization stepsize rule for a convex nondifferentiable cost function [Wol75]. Here we have the two-dimensional cost function

    f(x_1, x_2) = 5 (9x_1² + 16x_2²)^{1/2}    if x_1 > |x_2|,
    f(x_1, x_2) = 9x_1 + 16|x_2|              if x_1 ≤ |x_2|,

shown in the figure. Consider the method that moves in the direction of steepest descent from the current point, with the stepsize determined by cost minimization along that direction (this can be done analytically). Suppose that the algorithm starts anywhere within the set

    { (x_1, x_2) | x_1 > |x_2| > (9/16)² |x_1| }.

The generated iterates are shown in the figure, and it can be verified that they converge to the nonoptimal point (0, 0).

subgradients of real-valued convex functions will also be discussed in detail


in Section 3.1.
In the most common subgradient method (first proposed and analyzed in the mid 60s by Shor in a series of papers, and later in the books [Sho85], [Sho98]), an arbitrary subgradient g_k of f at x_k is used in an iteration of the form

    x_{k+1} = x_k − α_k g_k,                                   (2.23)

where α_k is a positive stepsize. The method, together with its many variations, will be discussed extensively in this book, starting with Chapter 3. We will see that while it may not yield a cost reduction for any value of α_k, it has another descent property, which enhances the convergence process: at any nonoptimal point x_k, it satisfies

    dist(x_{k+1}, X*) < dist(x_k, X*)

for a sufficiently small stepsize α_k, where dist(x, X*) denotes the Euclidean minimum distance of x from the optimal solution set X*.
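As a concrete illustration of iteration (2.23), here is a minimal Python sketch for a small hypothetical one-dimensional problem, minimization of f(x) = ∑_i |x − b_i| (whose minima are the medians of the b_i), using a diminishing stepsize of the kind discussed below; the data, starting point, and iteration count are illustrative assumptions.

    import numpy as np

    # Hypothetical data: f(x) = sum_i |x - b_i|, minimized at a median of the b_i.
    b = np.array([-2.0, 0.0, 1.0, 3.0, 7.0])

    def subgrad(x):
        # One subgradient of f at x (at kinks, 0 is a valid choice for that term).
        return np.sum(np.sign(x - b))

    x = 10.0
    for k in range(500):
        alpha = 1.0 / (k + 1)          # diminishing stepsize
        x = x - alpha * subgrad(x)     # subgradient iteration (2.23), with X = R

    print(x)   # approaches the median of the b_i, here x* = 1.0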

Stepsize Rules and Convergence Rate

There are several methods to choose the stepsize ak in the subgradient


iteration (2.23), some of which will be discussed in more detail in Chapter
3. Among these are:
(a) α_k is chosen to be a positive constant,

        α_k = α,   k = 0, 1, ....

    In this case only approximate convergence can be guaranteed, i.e., convergence to a neighborhood of the optimum whose size depends on α. Moreover the convergence rate may be slow. However, there is an important case where some favorable results can be shown. This is when a so-called sharp minimum condition holds, i.e., for some β > 0,

        f* + β min_{x* ∈ X*} ‖x − x*‖ ≤ f(x),   ∀ x ∈ X,        (2.24)

where f* is the optimal value (see Exercise 3.10). We will prove in


Prop. 5.1.6 that this condition holds when f and X are polyhedral,
as for example in dual problems arising in integer programming.
(b) α_k is chosen to be diminishing to 0, while satisfying the conditions

        ∑_{k=0}^∞ α_k = ∞,     ∑_{k=0}^∞ α_k² < ∞.

Then exact convergence can be guaranteed, but the convergence rate


is sublinear, even for polyhedral problems, and typically very slow.
There are also more sophisticated stepsize rules, which are based on estimation of f* (see Section 3.2, and [BNO03] for a detailed account).
Still, unless the condition (2.24) holds, the convergence rate can be very
slow relative to other methods. On the other hand in the presence of
special structure, such as in additive cost problems, incremental versions
of subgradient methods (see Section 2.1.5) may perform satisfactorily.

2.1.4 Alternative Descent Methods

Aside from methods that are based on gradients or subgradients, like the
ones of the preceding sections, there are some other approaches to effect
cost function descent. A major approach, which applies to any convex cost
function is the proximal algorithm, to be discussed in detail in Chapter 5.
This algorithm embodies both the cost improvement and the approximation
ideas. In its basic form, it approximates the minimization of a closed
proper convex function f : a:?n .-+ (-oo, oo] with another minimization that
involves a quadratic term. It is given by

Xk+I E arg min {f(x) 1 llx - Xkll 2 } ,


+ -2Ck (2.25)
xE~n


Figure 2.1.8. Illustration of the proximal algorithm (2.25) and its descent property. The minimum of f(x) + (1/(2c_k))‖x − x_k‖² is attained at the unique point x_{k+1} at which the graph of the quadratic function −(1/(2c_k))‖x − x_k‖², raised by the amount γ_k, just touches the graph of f. Since γ_k < f(x_k), it follows that f(x_{k+1}) < f(x_k), unless x_k minimizes f, which happens if and only if x_{k+1} = x_k.

where x_0 is an arbitrary starting point and c_k is a positive scalar parameter (see Fig. 2.1.8). One of the motivations for the algorithm is that it "regularizes" the minimization of f: the quadratic term in Eq. (2.25), when added to f, makes it strictly convex with compact level sets, so it has a unique minimum (cf. Prop. 3.1.1 and Prop. 3.2.1 in Appendix B).
The algorithm has an inherent descent character, which facilitates its combination with other algorithmic schemes. To see this, note that since x = x_{k+1} gives a lower value of f(x) + (1/(2c_k))‖x − x_k‖² than x = x_k, we have

    f(x_{k+1}) + (1/(2c_k)) ‖x_{k+1} − x_k‖² ≤ f(x_k).
It follows that {f (xk) } is monotonically nonincreasing; see also Fig. 2.1.8.
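As an illustration, the following Python sketch applies the proximal iteration (2.25) to a hypothetical convex quadratic f(x) = ½x'Qx − b'x, for which the proximal minimization has the closed form solution x_{k+1} = (Q + I/c_k)^{−1}(b + x_k/c_k); the data and the (constant) parameter c_k are illustrative assumptions.

    import numpy as np

    # Hypothetical quadratic cost f(x) = 0.5 x'Qx - b'x (Q positive definite).
    Q = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    b = np.array([1.0, -1.0])

    c = 0.5                   # proximal parameter c_k, kept constant here
    I = np.eye(2)
    x = np.array([10.0, -10.0])
    for k in range(50):
        # Proximal iteration (2.25): minimize f(x) + (1/(2c)) ||x - x_k||^2.
        x = np.linalg.solve(Q + I / c, b + x / c)

    print(x)                        # approaches the minimizer of f
    print(np.linalg.solve(Q, b))    # exact minimizer Q^{-1} b, for comparison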
There are several variations of the proximal algorithm, which will be discussed in Chapters 5 and 6. Some of these variations involve modification of the proximal minimization problem of Eq. (2.25), motivated by the need for a convenient solution of this problem. Here are some examples:
(a) The use of a nonquadratic proximal term D_k(x; x_k) in Eq. (2.25), in place of (1/(2c_k))‖x − x_k‖², i.e., the iteration

        x_{k+1} ∈ arg min_{x ∈ ℜ^n} { f(x) + D_k(x; x_k) }.        (2.26)

    This approach may be useful when D_k has a special form that matches the structure of f.
(b) Linear approximation of f using its gradient at x_k,

        ℓ_k(x; x_k) = f(x_k) + ∇f(x_k)'(x − x_k),

    assuming that f is differentiable. Then, in place of Eq. (2.26), we obtain the iteration

        x_{k+1} ∈ arg min_{x ∈ X} { ℓ_k(x; x_k) + D_k(x; x_k) }.

    When the proximal term D_k(x; x_k) is the quadratic (1/(2c_k))‖x − x_k‖², this iteration can be seen to be equivalent to the gradient projection iteration (2.18),

        x_{k+1} = P_X( x_k − c_k ∇f(x_k) ),

    but there are other choices of D_k that lead to interesting methods, known as mirror descent algorithms.
(c) The proximal gradient algorithm, which applies to the problem

        minimize   f(x) + h(x)
        subject to x ∈ ℜ^n,

    where f : ℜ^n ↦ ℜ is a differentiable convex function, and h : ℜ^n ↦ (−∞, ∞] is a closed proper convex function. This algorithm combines ideas from the gradient projection method and the proximal method. It replaces f with a linear approximation in the proximal minimization, i.e.,

        x_{k+1} ∈ arg min_{x ∈ ℜ^n} { ∇f(x_k)'(x − x_k) + h(x) + (1/(2α_k)) ‖x − x_k‖² },        (2.27)
    where α_k > 0 is a parameter. Thus when f is a linear function, we obtain the proximal algorithm for minimizing f + h. When h is the indicator function of a closed convex set, we obtain the gradient projection method. Note that there is an alternative/equivalent way to write the algorithm (2.27):

        x_{k+1} ∈ arg min_{x ∈ ℜ^n} { h(x) + (1/(2α_k)) ‖x − z_k‖² },   where z_k = x_k − α_k ∇f(x_k),        (2.28)

    as can be verified by expanding the quadratic ‖x − z_k‖².

    Thus the method alternates gradient steps on f with proximal steps on h. The advantage that this method may have over the proximal algorithm is that the proximal step in Eq. (2.28) is executed with h rather than with f + h, and this may be significant if h has simple/favorable structure (e.g., h is the ℓ_1 norm or a distance function to a simple constraint set), while f has unfavorable structure. Under relatively mild assumptions, it can be shown that the method has a cost function descent property, provided the stepsize α_k is sufficiently small (see Section 6.3).
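For concreteness, here is a minimal Python sketch of the proximal gradient iteration in the form (2.28) for a hypothetical ℓ_1-regularized least squares problem, f(x) = ½‖Ax − b‖² and h(x) = λ‖x‖_1, where the proximal step on h reduces to componentwise soft thresholding; the data, the value of λ, and the stepsize choice are illustrative assumptions.

    import numpy as np

    # Hypothetical data for f(x) = 0.5 ||Ax - b||^2 and h(x) = lam * ||x||_1.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 5))
    b = rng.standard_normal(20)
    lam = 0.1

    def soft_threshold(z, t):
        # Proximal step on h: componentwise shrinkage of z toward zero by t.
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # stepsize <= 1/L, L = largest eigenvalue of A'A
    x = np.zeros(5)
    for k in range(500):
        z = x - alpha * A.T @ (A @ x - b)     # gradient step on f (the vector z_k)
        x = soft_threshold(z, alpha * lam)    # proximal step on h, cf. Eq. (2.28)

    print(x)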
In Section 6.7, we will also discuss another descent approach, called ε-descent, which aims to avoid the difficulties due to the discontinuity of the steepest descent direction (cf. Fig. 2.1.7). This is done by obtaining a descent direction via projection of the origin on an ε-subdifferential, an enlarged version of the subdifferential. The method is theoretically interesting and will be used to establish conditions for strong duality in extended monotropic programming, an important class of problems with partially separable structure, to be discussed in Section 4.4.
Finally, we note that there are a few types of descent methods that
we will not discuss at all, either because they are based on ideas that do
not connect well with convexity, or because they are not well suited for the
type of large-scale problems that we emphasize in this book. Included are
direct search methods that do not use derivatives, such as the Nelder-Mead
simplex algorithm [DeT91], [Tse95], [LRW98], [NaT02], feasible direction
methods such as reduced gradient and gradient projection methods based
on manifold suboptimization [GiM74], [GMW81], [MoT89], and sequen-
tial quadratic programming methods [Ber82a], [Ber99], [NoW06]. Some of
these methods have extensive literature and applications, but are beyond
our scope.

2.1.5 Incremental Algorithms

An interesting form of approximate gradient, or more generally subgradient, method is an incremental variant, which applies to minimization over a closed convex set X of an additive cost function of the form

    f(x) = ∑_{i=1}^m f_i(x),

where the functions f_i : ℜ^n ↦ ℜ are either differentiable, or convex and nondifferentiable. We mentioned several contexts where cost functions of this type arise in Section 1.3. The idea of the incremental approach is to sequentially take steps along the subgradients of the component functions f_i, with intermediate adjustment of x after processing each f_i.
Incremental methods are interesting when m is very large, so a full
subgradient step is very costly. For such problems one hopes to make

progress with approximate but much cheaper incremental steps. Incremental methods are also well-suited for problems where m is large and the component functions f_i become known sequentially, over time. Then one may be able to operate on each component as it reveals itself, without waiting for the other components to become known, i.e., in an on-line fashion.
In a common type of incremental subgradient method, an iteration is viewed as a cycle of m subiterations. If x_k is the vector obtained after k cycles, the vector x_{k+1} obtained after one more cycle is

    x_{k+1} = ψ_{m,k},                                            (2.29)

where starting with

    ψ_{0,k} = x_k,

we obtain ψ_{m,k} after the m steps

    ψ_{i,k} = P_X( ψ_{i−1,k} − α_k g_{i,k} ),   i = 1, ..., m,      (2.30)

with g_{i,k} being a subgradient of f_i at ψ_{i−1,k} [or the gradient ∇f_i(ψ_{i−1,k}) in the differentiable case].
In a randomized version of the method, given x_k at iteration k, an index i_k is chosen from the set {1, ..., m} randomly, and the next iterate x_{k+1} is generated by

    x_{k+1} = P_X( x_k − α_k g_{i_k} ),                             (2.31)

where g_{i_k} is a subgradient of f_{i_k} at x_k. Here it is important that all indexes are chosen with equal probability. It turns out that there is a rate of convergence advantage for this and other types of randomization, as we will discuss in Section 6.4.2. We will ignore for the moment the possibility of randomizing the component selection, and assume cyclic selection as in Eqs. (2.29)-(2.30).
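The following is a minimal Python sketch of one cycle of the incremental subgradient method (2.29)-(2.30), here with X = ℜ² so that the projection is the identity, applied to a hypothetical sum of absolute-value components; the randomized variant (2.31) is indicated in a comment, and the data and stepsizes are illustrative assumptions.

    import numpy as np

    # Hypothetical additive cost: f(x) = sum_i |c_i'x - b_i|, x in R^2, X = R^2.
    C = np.array([[1.0, 2.0], [2.0, -1.0], [0.5, 1.0], [-1.0, 1.0]])
    b = np.array([1.0, 0.0, 2.0, -1.0])
    m = len(b)

    def subgrad_component(i, x):
        # A subgradient of f_i(x) = |c_i'x - b_i| at x.
        return np.sign(C[i] @ x - b[i]) * C[i]

    x = np.zeros(2)
    for k in range(300):
        alpha = 1.0 / (k + 1)
        psi = x.copy()
        for i in range(m):                       # one cycle of m subiterations, Eq. (2.30)
            psi = psi - alpha * subgrad_component(i, psi)
        x = psi                                  # Eq. (2.29)
        # Randomized variant (2.31): pick a single random index i per iteration instead.

    print(x)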
In the present section we will explain the ideas underlying incremental methods by focusing primarily on the case where the component functions f_i are differentiable. We will thus consider methods that compute at each step a component gradient ∇f_i and possibly the Hessian matrix ∇²f_i. We will discuss the case where f_i may be nondifferentiable in Section 6.4, after the analysis of nonincremental subgradient methods to be given in Section 3.2.

Incremental Gradient Method

Assume that the component functions f_i are differentiable. We refer to the method

    x_{k+1} = ψ_{m,k},                                            (2.32)

where starting with ψ_{0,k} = x_k, we generate ψ_{m,k} after the m steps

    ψ_{i,k} = P_X( ψ_{i−1,k} − α_k ∇f_i(ψ_{i−1,k}) ),   i = 1, ..., m,      (2.33)

[cf. (2.29)-(2.30)], as the incremental gradient method. A well known and


important example of such a method is the following. Together with its
many variations, it is widely used in computerized imaging; see e.g., the
book [Her09].

Example 2.1.3: (Kaczmarz Method)

Let

    f_i(x) = (1/(2‖c_i‖²)) (c_i'x − b_i)²,   i = 1, ..., m,

where c_i are given nonzero vectors in ℜ^n and b_i are given scalars, so we have a linear least squares problem. The constant term 1/(2‖c_i‖²) multiplying each of the squared functions (c_i'x − b_i)² serves a scaling purpose: with its inclusion, all the components f_i have a Hessian matrix

    ∇²f_i(x) = (1/‖c_i‖²) c_i c_i'

with trace equal to 1. This type of scaling is often used in least squares problems (see [Ber99] for explanations). The incremental gradient method (2.32)-(2.33) takes the form x_{k+1} = ψ_{m,k}, where ψ_{m,k} is obtained after the m steps

    ψ_{i,k} = ψ_{i−1,k} − (α_k/‖c_i‖²) (c_i'ψ_{i−1,k} − b_i) c_i,   i = 1, ..., m,      (2.34)

starting with ψ_{0,k} = x_k (see Fig. 2.1.9).


The stepsize ak may be chosen in a number of different ways, but if ak is
chosen identically equal to 1, ak = l, we obtain the Kaczmarz method, which
dates to 1937 [Kac37); see Fig. 2.1.9(a). The interpretation of the iteration
(2.34) in this case is very simple: '1/J;,k is obtained by projecting '1/Ji,k-I onto
the hyperplane defined by the single equation c'.x = b;. Indeed from Eq. (2.34)
with ak = l, it is easily verified that c'.'1/J;,k = b; and that '1/J;,k - '1/Ji,k-I is
orthogonal to the hyperplane, since it is proportional to its normal c;. (There
are also other related methods involving alternating projections on subspaces
or other convex sets, one of them attributed to von Neumann from 1933; see
Section 6.4.4.)
If the system of equations c'.x = b;, i = 1, ... , m, is consistent, i.e.,
has a unique solution x*, then the unique minimum of 1 f;(x) is x*. In I:::
this case it turns out that for a constant stepsize ak a, with O < a < 2, =
the method converges to x*. The convergence process is illustrated in Fig.
2.1.9(b) for the case ak =
1: the distance ll'I/Ji,k - x*II is guaranteed not to
increase for any i within cycle k, and to strictly decrease for at least one i, so
Xk+I will be closer to x* than Xk (assuming Xk =I= x*). Generally, the order
in which the equations are taken up for iteration can affect significantly the


Figure 2.1.9. Illustration of the Kaczmarz method (2.34) with unit stepsize α_k = 1: (a) ψ_{i,k} is obtained by projecting ψ_{i−1,k} onto the hyperplane defined by the single equation c_i'x = b_i. (b) The convergence process for the case where the system of equations c_i'x = b_i, i = 1, ..., m, is consistent and has a unique solution x*. Here m = 3, and x_k is the vector obtained after k cycles through the equations. Each incremental iteration decreases the distance to x*, unless the current iterate lies on the hyperplane defined by the corresponding equation.

performance. In particular, faster convergence can be shown if the order is


randomized in a special way; see [StV09].
If the system of equations c_i'x = b_i, i = 1, ..., m, is inconsistent, the method does not converge with a constant stepsize; see Fig. 2.1.10. In this case a diminishing stepsize α_k is necessary for convergence to an optimal solution. These convergence properties will be discussed further later in this section, and in Chapters 3 and 6.
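The following is a minimal Python sketch of the Kaczmarz iteration (2.34) with unit stepsize for a hypothetical consistent system c_i'x = b_i; the data are generated so that the system has a known unique solution.

    import numpy as np

    # Hypothetical consistent system c_i'x = b_i with known solution x_star.
    x_star = np.array([1.0, -2.0, 0.5])
    C = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, 3.0],
                  [2.0, -1.0, 1.0],
                  [1.0, 1.0, 1.0]])
    b = C @ x_star
    m = len(b)

    x = np.zeros(3)
    for k in range(50):
        psi = x.copy()
        for i in range(m):
            c = C[i]
            # Iteration (2.34) with alpha_k = 1: project psi onto the hyperplane c'x = b_i.
            psi = psi - ((c @ psi - b[i]) / (c @ c)) * c
        x = psi

    print(x)   # converges to x_star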

Convergence Properties of Incremental Methods

The motivation for the incremental approach is faster convergence. In par-


ticular, we hope that far from the solution, a single cycle of the incremental
gradient method will be as effective as several (as many as m) iterations
of the ordinary gradient method (think of the case where the components
fi are similar in structure). Near a solution, however, the incremental
method may not be as effective. Still, the frequent superiority of the incre-
mental method when far from convergence can be a decisive advantage for
problems where solution accuracy is not of paramount importance.
To be more specific, we note that there are two complementary per-
formance issues to consider in comparing incremental and nonincremental
methods:

Figure 2.1.10. Illustration of the Kaczmarz method (2.34) with α_k = 1 for the case where the system of equations c_i'x = b_i, i = 1, ..., m, is inconsistent. In this figure there are three equations with corresponding hyperplanes as shown. The method approaches a neighborhood of the optimal solution, and then oscillates. A similar behavior would occur if the stepsize α_k were a constant α ∈ (0, 1), except that the size of the oscillation would diminish with α.

(a) Progress when far from convergence. Here the incremental method can be much faster. For an extreme case let X = ℜ^n (no constraints), and take m large and all components f_i identical to each other. Then an incremental iteration requires m times less computation than a classical gradient iteration, but gives exactly the same result, when the stepsize is appropriately scaled to be m times larger. While this is an extreme example, it reflects the essential mechanism by which incremental methods can be much superior: far from the minimum a single component gradient will point to "more or less" the right direction, at least most of the time.
(b) Progress when close to convergence. Here the incremental method can be inferior. As a case in point, assume that all components f_i are differentiable functions. Then the nonincremental gradient projection method can be shown to converge with a constant stepsize under reasonable assumptions, as we will see in Section 6.1. However, the incremental method requires a diminishing stepsize, and its ultimate rate of convergence can be much slower. When the component functions f_i are nondifferentiable, both the nonincremental and the incremental subgradient methods require a diminishing stepsize. The nonincremental method tends to require a smaller number of iterations, but each of the iterations involves all the components f_i and thus larger computation overhead, so that on balance, in terms of computation time, the incremental method tends to perform better.
As an illustration consider the following example.

Example 2.1.4:

Consider a scalar linear least squares problem where the components f_i have the form

    f_i(x) = ½ (c_i x − b_i)²,   x ∈ ℜ,

where c_i and b_i are given scalars with c_i ≠ 0 for all i. The minimum of each of the components f_i is

    x_i* = b_i / c_i,

while the minimum of the least squares cost function f = ∑_{i=1}^m f_i is

    x* = ( ∑_{i=1}^m c_i b_i ) / ( ∑_{i=1}^m c_i² ).

It can be seen that x* lies within the range of the component minima

    R = [ min_i x_i*, max_i x_i* ],

and that for all x outside the region R, the gradient

    ∇f_i(x) = (c_i x − b_i) c_i

has the same sign as ∇f(x) (see Fig. 2.1.11). As a result, when outside the region R, the incremental gradient method

    ψ_i = ψ_{i−1} − α_k (c_i ψ_{i−1} − b_i) c_i

approaches x* at each step, provided the stepsize α_k is small enough. In fact it is sufficient that

    α_k ≤ min_i (1/c_i²).

However, for x inside the region R, the ith step of a cycle of the incremental gradient method need not make progress. It will approach x* (for small enough stepsize α_k) only if the current point ψ_{i−1} does not lie in the interval connecting x_i* and x*. This induces an oscillatory behavior within R, and as a result, the incremental gradient method will typically not converge to x* unless α_k → 0.
Let us now compare the incremental gradient method with the nonincremental version, which takes the form

    x_{k+1} = x_k − α_k ∑_{i=1}^m (c_i x_k − b_i) c_i.

It can be shown that this method converges to x* for any constant stepsize α_k = α satisfying

    0 < α ≤ 1 / ∑_{i=1}^m c_i².
On the other hand, for x outside the region R, an iteration of the nonin-
cremental method need not make more progress towards the solution than
a single step of the incremental method. In other words, with comparably
intelligent stepsize choices, far from the solution ( outside R), a single cy-
cle through the entire set of component functions by the incremental method
is roughly as effective as m iterations by the nonincremental method, which
require m times as many component gradient calculations.


Figure 2.1.11. Illustrating the advantage of incrementalism when far from the optimal solution. The region of component minima

    R = [ min_i x_i*, max_i x_i* ]

is labeled as the "region of confusion." It is the region where the method does not have a clear direction towards the optimum. The ith step in an incremental gradient cycle is a gradient step for minimizing (c_i x − b_i)², so if x lies outside the region of component minima R (i.e., in the region labeled as the "farout region") and the stepsize is small enough, progress towards the solution x* is made.

Example 2.1.5:

The preceding example assumes that each component function f_i has a minimum, so that the range of component minima is defined. In cases where the components f_i have no minima, a similar phenomenon may occur. As an example consider the case where f is the sum of increasing and decreasing convex exponentials, i.e.,

    f_i(x) = a_i e^{b_i x},   x ∈ ℜ,

where a_i and b_i are scalars with a_i > 0 and b_i ≠ 0. Let

    I+ = { i | b_i > 0 },     I− = { i | b_i < 0 },

and assume that I+ and I− have roughly equal numbers of components. Let also x* be the minimum of ∑_{i=1}^m f_i.
Consider the incremental gradient method that, given the current point, call it x_k, chooses some component f_{i_k} and iterates according to the incremental iteration

    x_{k+1} = x_k − α_k ∇f_{i_k}(x_k).

Then it can be seen that if x_k >> x*, x_{k+1} will be substantially closer to x* if i_k ∈ I+, and negligibly further away from x* if i_k ∈ I−. The net effect, averaged over many incremental iterations, is that if x_k >> x*, an incremental gradient iteration makes roughly one half the progress of a full gradient iteration, with m times less overhead for calculating gradients. The same is true if x_k << x*. On the other hand, as x_k gets closer to x* the advantage of incrementalism is reduced, similar to the preceding example. In fact in order for the incremental method to converge, a diminishing stepsize is necessary, which will ultimately make the convergence slower than the one of the nonincremental gradient method with a constant stepsize.

The preceding examples rely on x being one-dimensional, but in many multidimensional problems the same qualitative behavior can be observed. In particular, the incremental gradient method, by processing the ith component f_i, can make progress towards the solution in the region where the component function gradient ∇f_i(ψ_{i−1}) makes an angle less than 90 degrees with the full cost function gradient ∇f(ψ_{i−1}). If the components f_i are not "too dissimilar," this is likely to happen in a region of points that are not too close to the optimal solution set.

Stepsize Selection

The choice of the stepsize α_k plays an important role in the performance of incremental gradient methods. On close examination, it turns out that the iterate differential x_k − x_{k+1} corresponding to a full cycle of the incremental gradient method, and the corresponding vector α_k ∇f(x_k) of its nonincremental counterpart, differ by an error that is proportional to the stepsize (see the discussion in Exercises 2.6 and 2.10). For this reason a diminishing stepsize is essential for convergence to a minimizing point of f. However, it turns out that a peculiar form of convergence also typically occurs for the incremental gradient method if the stepsize α_k is a constant but sufficiently small α. In this case, the iterates converge to a "limit cycle," whereby the ith iterates ψ_i within the cycles converge to a different limit than the jth iterates ψ_j for i ≠ j. The sequence {x_k} of the iterates obtained at the end of cycles converges, except that the limit obtained need not be optimal even if f is convex. The limit tends to be close to an optimal point when the constant stepsize is small [for analysis of the case where the components f_i are quadratic, see Exercise 2.13(a), [BeT96] (Section 3.2), and [Ber99] (Section 1.5), where a linear convergence rate is also shown].
In practice, it is common to use a constant stepsize for a (possibly
prespecified) number of iterations, then decrease the stepsize by a certain
factor, and repeat, up to the point where the stepsize reaches a prespecified
minimum. An alternative possibility is to use a stepsize α_k that diminishes to 0 at an appropriate rate [cf. Eq. (2.15)]. In this case convergence can be
shown under reasonable conditions; see Exercise 2.10.

Still another possibility is to use an adaptive stepsize rule, whereby


the stepsize is reduced (or increased) when the progress of the method indi-
cates that the algorithm is oscillating because it operates within (or outside,
respectively) the region of confusion. There are formal ways to implement
such stepsize rules with sound convergence properties (see [Gri94], [Tse98],
[MYF03]). One of the ideas is to look at a batch of incremental updates
'lpi, ... , 'l/Ji+M, for some relatively large M ::; m, and compare ll'l/Ji - 'l/Ji+M II
with 'I:~ 1 11'¢1+£-1 - '¢Hell- If the ratio of these two numbers is "small"
this suggests that the method is oscillating.
Incremental gradient and subgradient methods have a rich theory,
which includes convergence and rate of convergence analysis, optimization
and randomization issues of the component order selection, and distributed
computation aspects. Moreover they admit interesting combinations with
other methods, such as the proximal algorithm. We will more fully discuss
their properties and extensions in Chapter 6, Section 6.4.

Aggregated Gradient Methods

Another variant of incremental gradient is the incremental aggregated gradient method, which has the form

    x_{k+1} = P_X ( x_k − α_k ∑_{ℓ=0}^{m−1} ∇f_{i_{k−ℓ}}(x_{k−ℓ}) ),        (2.35)

where f_{i_k} is the new component function selected for iteration k. Here, the component indexes i_k may either be selected in a cyclic order [i_k = (k modulo m) + 1], or according to some randomization scheme, consistently with Eq. (2.31). Also for k < m, the summation should go up to ℓ = k, and α should be replaced by a corresponding larger value, such as α_k = mα/(k + 1). This method, first proposed in [BHG08], computes the gradient incrementally, one component per iteration, but in place of the single component gradient, it uses an approximation to the total cost gradient ∇f(x_k), which is the sum of the component gradients computed in the past m iterations.
There is analytical and experimental evidence that by aggregating
the component gradients one may be able to attain a faster asymptotic
convergence rate, by ameliorating the effect of approximating the full gra-
dient with component gradients; see the original paper [BHG08], which
provides an analysis for quadratic problems, the paper [SLB13], which pro-
vides a more general convergence and convergence rate analysis, and ex-
tensive computational results, and the papers [Mai13], [Mai14], [DCD14],
which describe related methods. The expectation of faster convergence
should be tempered, however, because in order for the effect of aggregat-
ing the component gradients to fully manifest itself, at least one pass (and
possibly quite a few more) through the components must be made, which
may be too long if m is very large.

A drawback of this aggregated gradient method is that it requires that


the most recent component gradients be kept in memory, so that when a
component gradient is reevaluated at a new point, the preceding gradient
of the same component is discarded from the sum of gradients of Eq. (2.35).
There have been alternative implementations of the incremental aggregated
gradient method idea that ameliorate this memory issue, by recalculating
the full gradient periodically and replacing an old component gradient by
a new one, once it becomes available; see [JoZ13], [ZMJ13], [XiZ14]. More
specifically, instead of the gradient sum

    s_k = ∑_{ℓ=0}^{m−1} ∇f_{i_{k−ℓ}}(x_{k−ℓ})

in Eq. (2.35), these methods use

    s̃_k = ∇f_{i_k}(x_k) − ∇f_{i_k}(x̃_k) + ∑_{ℓ=0}^{m−1} ∇f_{i_{k−ℓ}}(x̃_k),

where x̃_k is the most recent point where the full gradient has been calculated. To calculate s̃_k one only needs to compute the difference of the two gradients

    ∇f_{i_k}(x_k) − ∇f_{i_k}(x̃_k)

and add it to the full gradient ∑_{ℓ=0}^{m−1} ∇f_{i_{k−ℓ}}(x̃_k). This bypasses the need for extensive memory storage, and with proper implementation, typically leads to small degradation in performance. In particular, convergence with a sufficiently small constant stepsize, with an attendant superior convergence rate over the incremental gradient method, has been shown.
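As an illustration of this snapshot-based variant, here is a minimal Python sketch for a hypothetical linear least squares cost; the stepsize, the epoch length, and the data are illustrative assumptions and are not tuned.

    import numpy as np

    # Hypothetical components f_i(x) = 0.5 * (c_i'x - b_i)^2.
    rng = np.random.default_rng(1)
    C = rng.standard_normal((100, 5))
    b = rng.standard_normal(100)
    m = len(b)

    def grad_i(i, x):
        return (C[i] @ x - b[i]) * C[i]

    x = np.zeros(5)
    alpha = 0.005                        # small constant stepsize (assumed, not tuned)
    for epoch in range(50):
        x_snap = x.copy()                # point at which the full gradient is recomputed
        g_full = C.T @ (C @ x_snap - b)  # full gradient at the snapshot point
        for t in range(m):
            i = rng.integers(m)
            # Aggregated direction: current component gradient, corrected by the stored
            # component gradient at the snapshot and the full snapshot gradient.
            d = grad_i(i, x) - grad_i(i, x_snap) + g_full
            x = x - alpha * d

    print(x)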

Incremental Gradient Method with Momentum

There is an incremental version of the gradient method with momentum or heavy ball method, discussed in Section 2.1.1 [cf. Eq. (2.12)]. It is given by

    x_{k+1} = x_k − α_k ∇f_{i_k}(x_k) + β_k (x_k − x_{k−1}),        (2.36)

where f_{i_k} is the component function selected for iteration k, β_k is a scalar in [0, 1), and we define x_{−1} = x_0; see e.g., [MaS94], [Tse98]. As noted
earlier, special nonincremental methods with similarities to the one above
have optimal iteration complexity properties under certain conditions; cf.
Section 6.2. However, there have been no proposals of incremental versions
of these optimal complexity methods.
The heavy ball method (2.36) is related to the aggregated gradient method (2.35) when β_k ≈ 1. In particular, when α_k = α and β_k = β, the sequence generated by Eq. (2.36) satisfies

    x_{k+1} = x_k − α ∑_{ℓ=0}^{k} β^ℓ ∇f_{i_{k−ℓ}}(x_{k−ℓ})        (2.37)

[both iterations (2.35) and (2.37) involve different types of diminishing de-
pendence on past gradient components]. Thus, the heavy ball iteration
(2.36) provides an approximate implementation of the incremental aggre-
gated gradient method (2.35), while it does not have the memory storage
issue of the latter.
A further way to intertwine the ideas of the aggregated gradient method (2.35) and the heavy ball method (2.36) for the unconstrained case (X = ℜ^n) is to form an infinite sequence of components

    f_1, f_2, ..., f_m, f_1, f_2, ..., f_m, f_1, f_2, ...,        (2.38)

and group together blocks of successive components into batches. One way to implement this idea is to add p preceding gradients (with 1 < p < m) to the current component gradient in iteration (2.36), thus iterating according to

    x_{k+1} = x_k − α_k ∑_{ℓ=0}^{p} ∇f_{i_{k−ℓ}}(x_{k−ℓ}) + β_k (x_k − x_{k−1}).        (2.39)

Here f_{i_k} is the component function selected for iteration k using the order of
the sequence (2.38). This essentially amounts to reformulating the problem
by redefining the components as sums of p + 1 successive components and
applying an approximation of the incremental heavy ball method (2.36).
The advantage of the method (2.39) over the aggregated gradient method
is that it requires keeping in memory only p previous component gradients,
and p can be chosen according to the memory limitations of the given
computational environment. Generally in incremental methods, grouping
together several components f_i, a process sometimes called batching, tends
to reduce the size of the region of confusion (cf. Fig. 2.1.11), and with
a small region of confusion, the incentive for aggregating the component
gradients diminishes (see [Ber97] and [FrS12], for different implementations
and analysis of this idea). The process of batching can also be implemented
adaptively, based on some form of heuristic detection that the method has
entered the region of confusion.

Stochastic Subgradient Methods

Incremental subgradient methods are related to methods that aim to min-


imize an expected value

    f(x) = E{ F(x, w) },

where w is a random variable, and F(·, w) : ℜ^n ↦ ℜ is a convex function for each possible value of w. The stochastic subgradient method for minimizing f over a closed convex set X is given by

    x_{k+1} = P_X( x_k − α_k g(x_k, w_k) ),        (2.40)

where Wk is a sample of w and g(xk, wk) is a subgradient of F(·, wk) at


Xk. This method has a rich theory and a long history, particularly for
the case where F(·, w) is differentiable for each value of w (for representa-
tive references, see [PoT73], [Lju77], [KuC78], [TBA86], [Pol87], [BeT89a],
[BeT96], [Pfl96], [LBB98], [BeT00], [KuY03], [Bot05], [Be107], [Mey07], [Bor08], [BBG09], [Ben09], [NJL09], [Bot10], [BaM11], [DHS11], [ShZ12], [FrG13], [NSW14]). It is strongly related to the classical algorithmic field of stochastic approximation; see the books [KuC78], [BeT96], [KuY03], [Spa03], [Mey07], [Bor08], [BPP13].
If we view the expected value cost E{ F(x, w) } as a weighted sum of cost function components, we see that the stochastic subgradient method (2.40) is related to the incremental subgradient method

    x_{k+1} = P_X( x_k − α_k g_{i_k} )        (2.41)

for minimizing a finite sum ∑_{i=1}^m f_i, when randomization is used for component selection [cf. Eq. (2.31)]. An important difference is that the former method involves sequential sampling of cost components F(x, w) from an infinite population under some statistical assumptions, while in the latter the set of cost components f_i is predetermined and finite. However, it is possible to view the incremental subgradient method (2.41), with uniform randomized selection of the component function f_i (i.e., with i_k chosen to be any one of the indexes 1, ..., m, with equal probability 1/m, and independently of preceding choices), as a stochastic subgradient method.
Despite the apparent similarity of the incremental and the stochastic subgradient methods, the view that the problem

    minimize   f(x) = ∑_{i=1}^m f_i(x)        (2.42)
    subject to x ∈ X,

can simply be treated as a special case of the problem

    minimize   f(x) = E{ F(x, w) }
    subject to x ∈ X,
is questionable.
One reason is that once we convert the finite sum problem to a
stochastic problem, we preclude the use of methods that exploit the finite
sum structure, such as the aggregated gradient methods we discussed ear-
lier. Under certain conditions, these methods offer more attractive conver-
gence rate guarantees than incremental and stochastic gradient methods,
and can be very effective for many problems, as we have noted.
Another reason is that the finite-component problem (2.42) is often
genuinely deterministic, and to view it as a stochastic problem at the outset

may mask some of its important characteristics, such as the number m of


cost components, or the sequence in which the components are ordered
and processed. These characteristics may potentially be algorithmically
exploited. For example, with insight into the problem's structure, one
may be able to discover a special deterministic or partially randomized
order of processing the component functions that is superior to a uniform
randomized order.

Example 2.1.6:

Consider the one-dimensional problem

    minimize   f(x) = ½ ∑_{i=1}^m (x − w_i)²
    subject to x ∈ ℜ,

where the scalars w_i are given by

    w_i = 1 if i is odd,     w_i = −1 if i is even.

Assuming that m is an even number, the optimal solution is x* = 0.


An incremental gradient method with the commonly used diminishing stepsize α_k = 1/(k + 1) chooses a component index i_k at iteration k, and updates x_k according to

    x_{k+1} = x_k − α_k (x_k − w_{i_k}),

starting with some initial iterate x_0. It is then easily verified by induction that

    x_k = x_0/k + (w_{i_0} + ··· + w_{i_{k−1}})/k,   k = 1, 2, ....
Thus the iteration error, which is x_k (since x* = 0), consists of two terms. The first is the error term x_0/k, which is independent of the method of selecting i_k, and the second is the error term

    e_k = (w_{i_0} + ··· + w_{i_{k−1}})/k,

which depends on the selection method for i_k.


If ik is chosen by independently randomizing with equal probability 1/2
over the odd and even cost components, then ek will be a random variable
whose variance can be calculated to be 1/2k. Thus the standard deviation of
the error x_k will be of order O(1/√k). If on the other hand i_k is chosen by the deterministic order, which alternates between the odd and even components, we will have e_k = 1/k for the odd iterations and e_k = 0 for the even iterations, so the error x_k will be of order O(1/k), much smaller than the one for the randomized order. Of course, this is a favorable deterministic order, and we

may obtain much worse results with an unfavorable deterministic order (such
as selecting first all the odd components and then all the even components).
However, the point here is that if we take the view that we are minimizing
an expected value, we are disregarding at the outset information about the
problem's structure that could be algorithmically useful.
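A small Python simulation along the lines of this example is sketched below; the starting point and iteration counts are arbitrary assumptions, and a run merely illustrates the O(1/k) versus O(1/√k) error behavior of the two orders just discussed.

    import numpy as np

    # Simulation sketch for the example above: components (1/2)(x - w_i)^2 with
    # w_i = +1 for odd i and -1 for even i, x* = 0, and stepsize alpha_k = 1/(k+1).
    rng = np.random.default_rng(0)
    num_iters = 100000

    def run(choose_w):
        x = 1.0                                  # initial iterate x_0
        for k in range(num_iters):
            w = choose_w(k)
            x = x - (1.0 / (k + 1)) * (x - w)    # incremental gradient step
        return abs(x)                            # error |x_k - x*|

    err_alternating = run(lambda k: 1.0 if k % 2 == 0 else -1.0)   # odd/even alternation
    err_randomized = run(lambda k: rng.choice([1.0, -1.0]))        # uniform random order

    print(err_alternating, err_randomized)   # typically of order 1/k versus 1/sqrt(k)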

A related experimental observation is that by suitably mixing the


deterministic and the stochastic order selection methods we may produce
better practical results. As an example, a popular technique for incremental
methods, called random reshuffling, is to process the component functions
Ji in cycles, with each component selected once in each cycle, and to re-
order randomly the components after each cycle. This alternative order
selection scheme has the nice property of allocating exactly one computa-
tion slot to each component in an m-slot cycle (m incremental iterations).
By comparison, choosing components by uniform sampling allocates one
computation slot to each component on the average, but some components
may not get a slot while others may get more than one. A nonzero variance
in the number of slots that any fixed component gets within a cycle, may
be detrimental to performance, and suggests that reshuffling randomly the
order of the component functions after each cycle works better. While it
seems difficult to establish this fact analytically, a justification is suggested
by the view of the incremental gradient method as a gradient method with
error in the computation of the gradient (see Exercise 2.10). The error has
apparently greater variance in the uniform sampling method than in the
random reshuffling method. Heuristically, if the variance of the error is
larger, the direction of descent deteriorates, suggesting slower convergence.
For some experimental evidence, see [Bot09], [ReR13].
Let us also note that in Section 6.4 we will compare more formally
various component selection orders in incremental methods. Our analysis
will indicate that in the absence of problem-specific knowledge that can
be exploited to select a favorable deterministic order, a uniform random-
ized order (each component f_i chosen with equal probability 1/m at each
iteration, independently of preceding choices) has superior worst-case com-
plexity.
Our conclusion is that in incremental methods, it may be beneficial
to search for a favorable order for processing the component functions f_i,
exploiting whatever problem-specific information may be available, rather
than ignore all prior information and apply a uniform randomized order
of the type commonly used in stochastic gradient methods. However, if
a favorable order cannot be found, a randomized order is usually better
than a fixed deterministic order, although there is no guarantee that this
will be so for a given practical problem; for example a fixed deterministic
order has been reported to be considerably faster on some benchmark test
problems without any attempt to order the components favorably [Bot09].

Incremental Newton Methods

We will now consider an incremental version of Newton's method for unconstrained minimization of an additive cost function of the form

    f(x) = ∑_{i=1}^m f_i(x),

where the functions f_i : ℜ^n ↦ ℜ are convex and twice continuously differentiable. Consider the quadratic approximation f̃_i of a function f_i at a vector ψ ∈ ℜ^n, i.e., the second order Taylor expansion of f_i at ψ:

    f̃_i(x; ψ) = ∇f_i(ψ)'(x − ψ) + ½ (x − ψ)' ∇²f_i(ψ) (x − ψ),   ∀ x, ψ ∈ ℜ^n.


Similar to Newton's method, which minimizes a quadratic approxima-
tion at the current point of the cost function [cf. Eq. (2.14)], the incremental
form of Newton's method minimizes a sum of quadratic approximations of
components. Similar to the incremental gradient method, we view an it-
eration as a cycle of m subiterations, each involving a single additional
component Ji, and its gradient and Hessian at the current point within the
cycle. In particular, if Xk is the vector obtained after k cycles, the vector
Xk+1 obtained after one more cycle is

where starting with '1/;o,k = Xk, we obtain '1/;m,k after the m steps

'lj;i k E arg min ~ ]c(x; 'l/;c-1 k), i = l, ... ,m. (2.43)


' xEWnL..i '
£=1

If all the functions f_i are quadratic, it can be seen that the method finds the solution in a single cycle.† The reason is that when f_i is quadratic, each f_i(x) differs from f̃_i(x; ψ) by a constant, which does not depend on x. Thus the difference

    ∑_{i=1}^m f_i(x) − ∑_{i=1}^m f̃_i(x; ψ_{i−1,k})

† Here we assume that the m quadratic minimizations (2.43) to generate ψ_{m,k} have a solution. For this it is sufficient that the first Hessian matrix ∇²f_1(x_0) be positive definite, in which case there is a unique solution at every iteration. A simple possibility to deal with this requirement is to add to f_1 a small positive definite quadratic term, such as ½‖x − x_0‖². Another possibility is to lump together several of the component functions (enough to ensure that the sum of their quadratic approximations at x_0 is positive definite), and to use them in place of f_1. This is generally a good idea and leads to smoother initialization, as it ensures a relatively stable behavior of the algorithm for the initial iterations.


Figure 2.1.12. Illustration of the incremental Newton method for the case of a
two-dimensional linear least squares problem with m = 3 cost function compo-
nents (compare with the Kaczmarz method, cf. Fig. 2.1.10).

is a constant that is independent of x, and minimization of either sum in


the above expression gives the same result.
As an example, consider a linear least squares problem, where

    f_i(x) = ½ (a_i'x − b_i)²,   i = 1, ..., m.

Then the ith subiteration within a cycle minimizes

    ∑_{ℓ=1}^{i} f_ℓ(x),

and when i = m, the solution of the problem is obtained (see Fig. 2.1.12).
This convergence behavior should be compared with the one for the Kacz-
marz method (cf. Fig. 2.1.10).
It is important to note that the quadratic minimizations of Eq. (2.43) can be carried out efficiently. For simplicity, let us assume that f̃_1(x; ψ) is a positive definite quadratic, so that for all i, ψ_{i,k} is well defined as the unique solution of the minimization problem in Eq. (2.43). We will show that the incremental Newton method (2.43) can be implemented in terms of the incremental update formula

    ψ_{i,k} = ψ_{i−1,k} − D_{i,k} ∇f_i(ψ_{i−1,k}),        (2.44)

where D_{i,k} is given by

    D_{i,k} = ( ∑_{ℓ=1}^{i} ∇²f_ℓ(ψ_{ℓ−1,k}) )^{−1},        (2.45)

and is generated iteratively as

    D_{i,k}^{−1} = D_{i−1,k}^{−1} + ∇²f_i(ψ_{i−1,k}).        (2.46)

Indeed, from the definition of the method (2.43), the quadratic function ∑_{ℓ=1}^{i−1} f̃_ℓ(x; ψ_{ℓ−1,k}) is minimized by ψ_{i−1,k} and its Hessian matrix is D_{i−1,k}^{−1}, so we have

    ∑_{ℓ=1}^{i−1} f̃_ℓ(x; ψ_{ℓ−1,k}) = ½ (x − ψ_{i−1,k})' D_{i−1,k}^{−1} (x − ψ_{i−1,k}) + constant.

Thus, by adding f̃_i(x; ψ_{i−1,k}) to both sides of this expression, we obtain

    ∑_{ℓ=1}^{i} f̃_ℓ(x; ψ_{ℓ−1,k}) = ½ (x − ψ_{i−1,k})' D_{i−1,k}^{−1} (x − ψ_{i−1,k}) + constant
        + ½ (x − ψ_{i−1,k})' ∇²f_i(ψ_{i−1,k}) (x − ψ_{i−1,k}) + ∇f_i(ψ_{i−1,k})'(x − ψ_{i−1,k}).

Since by definition ψ_{i,k} minimizes this function, we obtain Eqs. (2.44)-(2.46).
The update formula (2.46) for the matrix D_{i,k} can often be efficiently implemented by using convenient formulas for the inverse of the sum of two matrices. In particular, if f_i is given by

    f_i(x) = h_i(a_i'x − b_i),

for some twice differentiable convex function h_i : ℜ ↦ ℜ, vector a_i, and scalar b_i, we have

    ∇²f_i(x) = ∇²h_i(a_i'x − b_i) a_i a_i',

and the update formula (2.46) can be written as

    D_{i,k} = D_{i−1,k} − ( D_{i−1,k} a_i a_i' D_{i−1,k} ) / ( (∇²h_i(a_i'ψ_{i−1,k} − b_i))^{−1} + a_i' D_{i−1,k} a_i );

this is the well-known Sherman-Morrison formula for the inverse of the sum of an invertible matrix and a rank-one matrix (see the matrix inversion formula in Section A.1 of Appendix A).
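For illustration, the following Python sketch carries out one cycle of the incremental Newton method, via the update (2.44) and the Sherman-Morrison form of (2.46), for hypothetical linear least squares components f_i(x) = ½(a_i'x − b_i)²; a small regularizing term ½δ‖x − x_0‖² is added up front (an assumption in the spirit of the footnote above) so that every D_i is well defined. The data and the value of δ are illustrative.

    import numpy as np

    # Hypothetical data for f_i(x) = 0.5 * (a_i'x - b_i)^2, i = 1, ..., m.
    rng = np.random.default_rng(2)
    n, m = 3, 200
    A = rng.standard_normal((m, n))
    x_true = np.array([1.0, -0.5, 2.0])
    b = A @ x_true + 0.01 * rng.standard_normal(m)

    delta = 1e-3
    psi = np.zeros(n)            # psi_0 = x_0 = 0
    D = np.eye(n) / delta        # D_0 = (delta * I)^{-1}, from the regularizing term

    for i in range(m):
        a = A[i]
        Da = D @ a
        # Sherman-Morrison update of D_i (here the scalar second derivative of h_i is 1).
        D = D - np.outer(Da, Da) / (1.0 + a @ Da)
        # Incremental Newton step, cf. Eq. (2.44).
        psi = psi - D @ a * (a @ psi - b[i])

    print(psi)   # after one cycle, psi is the (slightly regularized) least squares solution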
We have considered so far a single cycle of the incremental Newton method. One algorithmic possibility for cycling through the component functions multiple times is to simply create a larger set of components by concatenating multiple copies of the original set, that is, by forming what we refer to as the extended set of components

    f_1, f_2, ..., f_m, f_1, f_2, ..., f_m, f_1, f_2, ....

The incremental Newton method, when applied to the extended set, asymptotically resembles a scaled incremental gradient method with diminishing stepsize of the type described earlier. Indeed, from Eq. (2.45), the matrix

D_{i,k} diminishes roughly in proportion to 1/k. From this it follows that the asymptotic convergence properties of the incremental Newton method are similar to those of an incremental gradient method with diminishing stepsize of order O(1/k). Thus its convergence rate is slower than linear.
To accelerate the convergence of the method one may employ a form of restart, so that D_{i,k} does not converge to 0. For example D_{i,k} may be reinitialized and increased in size at the beginning of each cycle. For problems where f has a unique nonsingular minimum x* [one for which ∇²f(x*) is nonsingular], one may design incremental Newton schemes with restart that converge linearly to within a neighborhood of x* (and even superlinearly if x* is also a minimum of all the functions f_i, so there is no region of confusion). Alternatively, the update formula (2.46) may be modified to

    D_{i,k}^{−1} = λ_k D_{i−1,k}^{−1} + ∇²f_i(ψ_{i−1,k}),        (2.47)

by introducing a fading factor λ_k ∈ (0, 1), which can be used to accelerate the practical convergence rate of the method (see [Ber96] for an analysis of schemes where λ_k → 1; in cases where λ_k is some constant λ < 1, linear convergence to within a neighborhood of the optimum may be shown).
The following example provides some insight regarding the behavior
of the method when the cost function f has a very large number of cost
components, as is the case when f is defined as the average of a very large
number of random samples.

Example 2.1.7: (Infinite Number of Cost Components)

Consider the problem

    minimize   f(x) ≝ lim_{m→∞} (1/m) ∑_{i=1}^m F(x, w_i)
    subject to x ∈ ℜ^n,

where {w_k} is a given sequence from some set, and each function F(·, w_i) : ℜ^n ↦ ℜ is positive semidefinite quadratic. We assume that f is well-defined (i.e., the limit above exists for each x ∈ ℜ^n), and is a positive definite quadratic. This type of problem arises in linear regression models (cf. Example 1.3.1) involving an infinite amount of data that is obtained through random sampling.
The natural extension of the incremental Newton method, applied to the infinite set of components F(·, w_1), F(·, w_2), ..., generates the sequence {x̃_k} where

    x̃_k ∈ arg min_{x ∈ ℜ^n} f_k(x) ≝ (1/k) ∑_{i=1}^k F(x, w_i).

Since f is positive definite and the same is true for f_k, when k is large enough, we have x̃_k → x*, where x* is the minimum of f. The rate of convergence

is determined strictly by the rate at which the vectors x̃_k approach x*, or equivalently by the rate at which f_k approaches f. It is impossible to achieve a faster rate of convergence with an algorithm that is nonanticipative in the sense that it uses just the first k cost components in the first k iterations. By contrast, if we were to apply the natural extension of the incremental gradient method to this problem, the convergence rate could be much worse. There would be an error due to the difference (x̃_k − x*), but also an additional error due to the difference (x̃_k − x_k) between x̃_k and the kth iterate x_k of the incremental gradient method, which is generally diminishing quite slowly, possibly more slowly than (x̃_k − x*). The same is true for other gradient-type methods based on incremental computations, including the aggregated gradient methods discussed earlier.

Incremental Newton Method with Diagonal Approximation

Generally, with proper implementation, the incremental Newton method is


often substantially faster than the incremental gradient method, in terms
of numbers of iterations (there are theoretical results suggesting this prop-
erty for stochastic versions of the two methods; see the end-of-chapter ref-
erences). However, in addition to computation of second derivatives, the
incremental Newton method involves greater overhead per iteration due to
matrix-vector calculations in Eqs. (2.44), (2.46), and (2.47), so it is suitable
only for problems where n, the dimension of x, is relatively small.
One way to remedy in part this difficulty is to approximate ∇²f_i(ψ_{i,k}) by a diagonal matrix, and recursively update a diagonal approximation of D_{i,k} using Eqs. (2.46) or (2.47). One possibility, inspired by similar diagonal scaling schemes for nonincremental gradient methods, is to set to 0 the off-diagonal components of ∇²f_i(ψ_{i,k}). In this case, the iteration (2.44)
becomes a diagonally scaled version of the incremental gradient method,
and involves comparable overhead per iteration (assuming the required
diagonal second derivatives are easily computed). As an additional option,
one may multiply the diagonal components with a stepsize parameter that
is close to 1 and add a small positive constant (to bound them away from 0).
Ordinarily, for the convex problems considered here, this method should
require little experimentation with stepsize selection.

Incremental Newton Methods with Constraints

The incremental Newton method can also be adapted to constrained problems of the form

    minimize   ∑_{i=1}^m f_i(x)
    subject to x ∈ X,

where f_i : ℜ^n ↦ ℜ are convex, twice continuously differentiable functions. If X has a relatively simple form, such as upper and lower

bounds on the variables, one may use a two-metric implementation, such as


the ones discussed earlier, whereby the matrix Di,k is partially diagonalized
before it is applied to the iteration
'l/Ji,k = Px('l/Ji-I,k - Di,k'\lfi('l/Ji-1,k)),
[cf. Eqs. (2.21) and (2.44)].
For more complicated constraint sets of the form
X = n~ 1 Xi,
where each Xi is a relatively simple component constraint set (such as a
halfspace), there is another possibility. This is to apply an incremental pro-
jected Newton iteration, with projection on a single individual component
Xi, i.e., an iteration of the form
'l/Ji,k E arg min {VJ;('l/Ji-1,k)'('lj;-'l/Ji-1,k)+½('l/J-'l/Ji-1,k)'Hi,k('l/J-'l/Ji-I,k)},
,j;EXi

where

Hi,k = L '\7 h('l/Je-1,k)-


2
R=I
Note that each component Xi can be relatively simple, in which case the
quadratic optimization problem above may be simple despite the fact that
Hi,k is nondiagonal. Depending on the problem's special structure, one
may also use efficient methods that pass information from the solution of
one quadratic subproblem to the next.
A similar method may also be used for problems of the form

    minimize   R(x) + ∑_{i=1}^m f_i(x)
    subject to x ∈ X = ∩_{i=1}^m X_i,

where R(x) is a regularization function that is a multiple of either the ℓ_1 or the ℓ_2 norm. Then the incremental projected Newton iteration takes the form

    ψ_{i,k} ∈ arg min_{ψ ∈ X_i} { R(ψ) + ∇f_i(ψ_{i−1,k})'(ψ − ψ_{i−1,k}) + ½ (ψ − ψ_{i−1,k})' H_{i,k} (ψ − ψ_{i−1,k}) }.


When Xi is a polyhedral set, this problem is a quadratic program.
The idea of incremental projection on constraint set components is
complementary to the idea of using gradient and possibly Hessian infor-
mation from single cost function components at each iteration, and will
be discussed in more detail in Section 6.4.4, in the context of incremental
subgradient and incremental proximal methods. Several variations are pos-
sible, whereby the cost function component and constraint set component
selected at each iteration may be chosen according to special deterministic
or randomized rules. We refer to the papers [Ned11], [Ber11], [WaB13a]
for a discussion of these incremental methods, their variations, and their
convergence analysis.

Incremental Gauss-Newton Method - The Extended Kalman Filter

We will next consider an algorithm that operates similarly to the incremental Newton method, but is specialized for the nonlinear least squares problem

    minimize   ∑_{i=1}^m ‖g_i(x)‖²
    subject to x ∈ ℜ^n,

where g_i : ℜ^n ↦ ℜ^{n_i} are some possibly nonlinear functions (cf. Example 1.3.1). As noted in Section 1.3, this is a common problem in practice.
We introduce a function g̃_i that represents a linear approximation of g_i at a vector ψ ∈ ℜ^n:

    g̃_i(x; ψ) = ∇g_i(ψ)'(x − ψ) + g_i(ψ),

where ∇g_i(ψ) is the n × n_i gradient matrix of g_i evaluated at ψ. Similar to the incremental gradient and Newton methods, we view an iteration as a cycle of m subiterations, each requiring linearization of a single additional component at the current point within the cycle. In particular, if x_k is the vector obtained after k cycles, the vector x_{k+1} obtained after one more cycle is

    x_{k+1} = ψ_{m,k},        (2.48)

where starting with ψ_{0,k} = x_k, we obtain ψ_{m,k} after the m steps

    ψ_{i,k} ∈ arg min_{x ∈ ℜ^n} ∑_{ℓ=1}^{i} ‖g̃_ℓ(x; ψ_{ℓ−1,k})‖²,   i = 1, ..., m.        (2.49)

If all the functions g_i are linear, we have g̃_i(x; ψ) = g_i(x), and the method solves the problem exactly in a single cycle. It then becomes identical to the incremental Newton method.
incremental Newton method because it does not involve second deriva-
tives of gi. It may be viewed instead as an incremental version of the
Gauss-Newton method, a classical nonincremental scaled gradient method
for solving nonlinear least squares problems (see e.g., [Ber99], Section 1.5).
It is also known as the extended Kalman filter, and has found extensive ap-
plication in state estimation and control of dynamic systems, where it was
introduced in the mid-60s (it was also independently proposed in [Dav76]).
The implementation issues of the extended Kalman filter are simi-
lar to the ones of the incremental Newton method. This is because both
methods solve similar linear least squares problems at each iteration [cf.
Eqs. (2.43) and (2.49)]. The convergence behaviors of the two methods are

also similar: they asymptotically operate as scaled forms of incremental


gradient methods with diminishing stepsize. Both methods are primarily
well-suited for problems where the dimension of x is much smaller than the
number of components in the additive cost function, so that the associated
matrix-vector operations are not overly costly. Moreover their practical
convergence rate can be accelerated by introducing a fading factor [cf. Eq.
(2.47)]. We refer to [Ber96], [MYF03] for convergence analysis, variations,
and computational experimentation.

2.1.6 Distributed Asynchronous Iterative Algorithms

We will now consider briefly distributed asynchronous counterparts of some


of the algorithms discussed earlier in this section. We have in mind a
situation where an iterative algorithm, such as a gradient method or a
coordinate descent method, is parallelized by separating it into several
local algorithms operating concurrently at different processors. The main
characteristic of an asynchronous algorithm is that the local algorithms do
not have to wait at predetermined points for predetermined information to
become available. We thus allow some processors to execute more iterations
than others, we allow some processors to communicate more frequently
than others, and we allow the communication delays to be substantial and
unpredictable.
Let us consider for simplicity the problem of unconstrained mini-
mization of a differentiable function $f : \Re^n \mapsto \Re$. Out of the iterative
algorithms of Sections 2.1.1-2.1.3, there are three types that are suitable
for asynchronous distributed computation. Their asynchronous versions
are as follows:
(a) Gradient methods, where we assume that the ith coordinate $x^i$ is
updated at a subset of times $R_i \subset \{0, 1, \ldots\}$, according to
$$x^i_{k+1} = \begin{cases} x^i_k & \text{if } k \notin R_i, \\ x^i_k - \alpha_k \dfrac{\partial f}{\partial x^i}\big(x^1_{\tau_{i1}(k)}, \ldots, x^n_{\tau_{in}(k)}\big) & \text{if } k \in R_i, \end{cases} \qquad i = 1, \ldots, n,$$
where $\alpha_k$ is a positive stepsize. Here $\tau_{ij}(k)$ is the time at which the
jth coordinate used in this update was computed, and the difference
$k - \tau_{ij}(k)$ is commonly called the communication delay from j to i
at time k. In a distributed setting, each coordinate $x^i$ (or block of
coordinates) may be updated by a separate processor, on the basis of
values of coordinates made available by other processors, with some
delay (a small simulation of such delayed updates is sketched after
this list of methods).
(b) Coordinate descent methods, where for simplicity we consider a block
size of one; cf. Eq. (2.22). We assume that the ith scalar coordinate
is updated at a subset of times $R_i \subset \{0, 1, \ldots\}$, according to
$$x^i_{k+1} \in \arg\min_{\xi \in \Re} f\big(x^1_{\tau_{i1}(k)}, \ldots, x^{i-1}_{\tau_{i,i-1}(k)},\, \xi,\, x^{i+1}_{\tau_{i,i+1}(k)}, \ldots, x^n_{\tau_{in}(k)}\big), \qquad k \in R_i,$$
and is left unchanged ($x^i_{k+1} = x^i_k$) if $k \notin R_i$. The meanings of the
subsets of updating times Ri and indexes Tij ( k) are the same as in the
case of gradient methods. Also the distributed environment where the
method can be applied is similar to the case of the gradient method.
Another practical setting that may be modeled well by this iteration
is when all computation takes place at a single computer, but any
number of coordinates may be simultaneously updated at a time,
with the order of coordinate selection possibly being random.
(c) Incremental gradient methods for the case where
$$f(x) = \sum_{i=1}^m f_i(x).$$
Here the ith component is used in an update of x at a subset of times
$R_i$, according to
$$x_{k+1} = x_k - \alpha_k \nabla f_i\big(x^1_{\tau_{i1}(k)}, \ldots, x^n_{\tau_{in}(k)}\big), \qquad k \in R_i,$$

and we assume that a single component gradient v' f i is used at each


time (i.e., Ri nRj = 0 for i =/- j). The meaning of Tij(k) is the same
as in the preceding cases, and the gradient v' fi can be replaced by a
subgradient in the case of nondifferentiable f;. Here the entire vector
xis updated at a central computer, based on component gradients v' fi
that are computed at other computers and are communicated with
some delay to the central computer. For validity of these methods, it
is essential that all the components fi are used in the iteration with
the same asymptotic frequency, 1/ m ( see [NBBO 1]). For this type of
asynchronous implementation to make sense, the computation of v' Ji
must be substantially more time-consuming than the update of Xk
using the preceding incremental iteration.
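
As a small illustration of the delayed-information updates in case (a), the following sketch simulates a partially asynchronous gradient iteration on an assumed quadratic cost, with randomly generated bounded delays (all data and parameters below are illustrative choices, not from the text).

```python
import numpy as np

rng = np.random.default_rng(1)
n, B, iters = 5, 3, 2000                   # B bounds the delays k - tau_ij(k)

# Assumed problem data: f(x) = 1/2 x'Qx - b'x with a positive definite Q
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)
b = rng.standard_normal(n)
alpha = 0.5 / (np.linalg.eigvalsh(Q).max() * (1 + B))   # small constant stepsize

x_hist = [rng.standard_normal(n)]          # x_hist[k] = x_k
for k in range(iters):
    x_new = x_hist[-1].copy()
    for i in range(n):
        if rng.random() < 0.5:             # k is in R_i with probability 1/2
            # coordinate j is read with its own delay, k - tau_ij(k) <= B
            stale = np.array([x_hist[max(0, k - rng.integers(0, B + 1))][j]
                              for j in range(n)])
            x_new[i] -= alpha * (Q[i] @ stale - b[i])   # i-th partial derivative
    x_hist.append(x_new)

x_star = np.linalg.solve(Q, b)
print(np.linalg.norm(x_hist[-1] - x_star))  # should be near zero
```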
An interesting fact is that some asynchronous algorithms, called to-
tally asynchronous, can tolerate arbitrarily large delays k - Tij ( k), while
other algorithms, called partially asynchronous, are not guaranteed to work
unless there is an upper bound on these delays. The convergence mecha-
nisms at work in each of these two cases are genuinely different and so are
their analyses; see [BeT89a], where totally and partially asynchronous algo-
rithms, and various special cases including gradient and coordinate descent
methods, are discussed in Chapters 6 and 7, respectively.
The totally asynchronous algorithms are valid only under special con-
ditions, which guarantee that any progress in the computation of the indi-
vidual processors, is consistent with progress in the collective computation.

For example, to show convergence of a (synchronous) stationary iteration
of the form
$$x_{k+1} = G(x_k)$$
it is sufficient to show that G is a contraction mapping with respect to
some norm (see Section A.4 of Appendix A), but for asynchronous con-
vergence it turns out that one needs the contraction to be with respect to
the sup-norm $\|x\|_\infty = \max_{i=1,\ldots,n} |x^i|$ or a weighted sup-norm (see Sec-
tion 6.5.2). To guarantee totally asynchronous convergence of a gradient
method with a constant and sufficiently small stepsize $\alpha_k = \alpha$, a diagonal
dominance condition is required; see the paper [Ber83]. In the special case
of a quadratic cost function
$$f(x) = \tfrac{1}{2} x'Qx + b'x,$$
this condition is that the Hessian matrix Q is diagonally dominant, i.e.,
has components $q_{ij}$ such that
$$q_{ii} \geq \sum_{j \neq i} |q_{ij}|, \qquad i = 1, \ldots, n.$$
Without this diagonal dominance condition, totally asynchronous conver-
gence is unlikely to be guaranteed (for examples see [BeT89a], Section
6.3.2).
The partially asynchronous algorithms do not need a weighted sup-
norm contraction structure, but typically require a diminishing stepsize
parameter. The idea is that when the delays are bounded and the step-
size becomes small enough, the asynchronous algorithm resembles its syn-
chronous counterpart sufficiently closely, so that the convergence properties
of the latter are maintained (cf. the convergence analysis of gradient meth-
ods with errors in Exercise 2.6). Note that this mechanism for convergence
is very similar to the one for incremental methods. For this reason, incre-
mental gradient methods are natural candidates for distributed partially
asynchronous implementation; see [NBB01].
For further discussion of the implementation and convergence anal-
ysis of partially asynchronous algorithms, we refer to the paper [TBA86],
to the books [BeT89a] (Chapters 6 and 7) and [Bor08] for deterministic
and stochastic gradient, and coordinate descent methods, and to the paper
[NBB01] for incremental gradient and subgradient methods. For recent
related work on distributed partially asynchronous algorithms of the gradi-
ent and coordinate descent type, see [NeO09a], [RRW11], and for partially
asynchronous implementations of the Kaczmarz method, see [LWS14].

2.2 APPROXIMATION METHODS

Approximation methods for minimizing a convex function $f : \Re^n \mapsto \Re$ over
a convex set X are based on replacing f and X with approximations $F_k$

and $X_k$, respectively, at each iteration k, and finding
$$x_{k+1} \in \arg\min_{x \in X_k} F_k(x).$$

At the next iteration, Fk+l and Xk+l are generated by refining the approx-
imation, based on the new point Xk+l, and possibly on the earlier points
Xk, ... , xo. Of course such a method makes sense only if the approximat-
ing problems are simpler than the original. There is a great variety of
approximation methods, with different aims, and suitable for different cir-
cumstances. The present section provides a brief overview and orientation,
while Chapters 4-6 provide a detailed analysis.

2.2.1 Polyhedral Approximation

In polyhedral approximation methods, Fk is a polyhedral function that


approximates f and Xk is a polyhedral set that approximates X. The
idea is that the approximate problem is polyhedral, so it may be easier
to solve than the original problem. The methods include mechanisms for
progressively refining the approximation, thereby obtaining a solution of
the original problem in the limit. In some cases, only one of f and X is
polyhedrally approximated.
In Chapter 4, we will discuss the two main approaches for polyhedral
approximation: outer linearization (also called the cutting plane approach)
and inner linearization (also called the simplicial decomposition approach).
As the name suggests, outer linearization approximates epi(f) and X from
without, $F_k(x) \leq f(x)$ for all x, and $X_k \supset X$, using intersections of finite
numbers of halfspaces. By contrast, inner linearization approximates epi(f)
and X from within, $F_k(x) \geq f(x)$ for all x, and $X_k \subset X$, using convex hulls
of finite numbers of halflines or points. Figure 2.2.1 illustrates outer and
inner linearization of convex sets and functions.
We will show in Sections 4.3 and 4.4 that these two approaches are
intimately connected by conjugacy and duality: the dual of an outer ap-
proximating problem is an inner approximating problem involving the con-
jugates of Fk and the indicator function of Xk, and reversely. In fact, using
this duality, outer and inner approximations may be combined in the same
algorithm.
One of the major applications of the cutting plane approach is in
Dantzig- Wolfe decomposition, an important method for solving large scale
problems with special structure, including the separable problems of Sec-
tion 1.1.1 (see e.g., [BeT97], [Ber99]). Simplicial decomposition also finds
many important applications in problems with special structure; e.g., in
high-dimensional problems with a constraint set X such that minimization
of a linear function over X is relatively simple. This is exactly the same
structure that favors the use of the conditional gradient method discussed
in Section 2.1.2 (see Chapter 4). A prominent example is the multicom-
modity flow problem of Example 1.4.5.

Figure 2.2.1. Illustration of outer and inner linearization of a convex function f
and a convex set X using hyperplanes and convex hulls.

2.2.2 Penalty, Augmented Lagrangian, and Interior Point Methods

Generally in optimization problems, the presence of constraints complicates


the algorithmic solution, and limits the range of available algorithms. For
this reason it is natural to try to eliminate constraints by using approxima-
tion of the corresponding indicator functions. In particular, we may replace
constraints by penalty functions that prescribe a high cost for their vio-
lation. We discussed in Section 1.5 such an approximation scheme, which
uses exact nondifferentiable penalty functions. In this section we focus on
differentiable penalty functions that are not necessarily exact.
To illustrate this approach, let us consider the equality constrained
problem
$$\text{minimize } f(x) \quad \text{subject to } x \in X,\ a_i'x = b_i,\ i = 1, \ldots, m. \tag{2.50}$$
We replace this problem with a penalized version
$$\text{minimize } f(x) + c_k \sum_{i=1}^m P(a_i'x - b_i) \quad \text{subject to } x \in X, \tag{2.51}$$

where $P(\cdot)$ is a scalar penalty function satisfying
$$P(u) = 0 \text{ if } u = 0, \qquad \text{and} \qquad P(u) > 0 \text{ if } u \neq 0.$$
The scalar Ck is a positive penalty parameter, so by increasing Ck to oo,
the solution Xk of the penalized problem tends to decrease the constraint
violation, thereby providing an increasingly accurate approximation to the
original problem. An important practical point here is that Ck should
be increased gradually, using the optimal solution of each approximating
problem to start the algorithm that solves the next approximating problem.
Otherwise serious numerical problems occur due to "ill-conditioning."
A common choice for P is the quadratic penalty function
$$P(u) = \tfrac{1}{2} u^2,$$
in which case the penalized problem (2.51) takes the form
$$\text{minimize } f(x) + \frac{c_k}{2} \|Ax - b\|^2 \quad \text{subject to } x \in X, \tag{2.52}$$
where $Ax = b$ is a vector representation of the system of equations $a_i'x = b_i$,
$i = 1, \ldots, m$.
An important enhancement of the penalty function approach is the
augmented Lagrangian methodology, where we add a linear term to P(u),
involving a multiplier vector $\lambda_k \in \Re^m$. Then in place of problem (2.52),
we solve the problem
$$\text{minimize } f(x) + \lambda_k'(Ax - b) + \frac{c_k}{2} \|Ax - b\|^2 \quad \text{subject to } x \in X. \tag{2.53}$$
After a minimizing vector $x_k$ is obtained, the multiplier vector $\lambda_k$ is up-
dated by some formula that aims to approximate an optimal dual solution.
A common choice that we will discuss in Chapter 5 is
$$\lambda_{k+1} = \lambda_k + c_k (A x_k - b). \tag{2.54}$$

This is also known as the first order augmented Lagrangian method (also
called first order method of multipliers). It is a major general purpose,
highly reliable, constrained optimization method, which applies to non-
convex problems as well. It has a rich theory, with a strong connection
to duality, and many variations that are aimed at increased efficiency, in-
volving for example second order multiplier updates and inexact minimiza-
tion of the augmented Lagrangian. In the convex programming setting of

this book, augmented Lagrangian methods embody additional favorable


structure. Among others, convergence is guaranteed for any nondecreasing
sequence { ck} (for nonconvex problems, Ck must exceed a certain positive
threshold). Moreover there is no requirement that Ck ---+ oo, which is needed
for penalty methods that do not involve multiplier updates, and is often
the cause of numerical problems.
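The following minimal sketch applies the iterations (2.53)-(2.54) to an assumed equality constrained problem with $X = \Re^n$ and quadratic f, for which the augmented Lagrangian minimization reduces to a linear system; the data and the choice of a constant penalty parameter are illustrative.

```python
import numpy as np

# Assumed example (not from the text): minimize 1/2 ||x - x0||^2
# subject to Ax = b, with X = R^n, using iterations (2.53)-(2.54).
rng = np.random.default_rng(0)
n, m = 5, 2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x0 = rng.standard_normal(n)

lam = np.zeros(m)          # multiplier vector lambda_k
c = 1.0                    # penalty parameter c_k (kept constant here)
for k in range(30):
    # Minimize the augmented Lagrangian (2.53); for this quadratic f the
    # minimizer solves the linear system (I + c A'A) x = x0 - A'lam + c A'b.
    x = np.linalg.solve(np.eye(n) + c * A.T @ A, x0 - A.T @ lam + c * A.T @ b)
    lam = lam + c * (A @ x - b)          # first order multiplier update (2.54)

print(np.linalg.norm(A @ x - b))         # constraint violation, should be tiny
```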
Generally, penalty and augmented Lagrangian methods can be used
for inequality as well as equality constraints. The penalty function is mod-
ified to reflect penalization for violation of inequalities. For example, the
inequality constraint analog of the quadratic penalty $P(u) = \tfrac{1}{2}u^2$ is
$$P(u) = \tfrac{1}{2}\big(\max\{0, u\}\big)^2.$$

We will consider these possibilities in greater detail in Section 5.2.1.


The penalty methods just discussed are known as exterior penalty
methods: they approximate the indicator function of the constraint set from
without. Another type of algorithm involves approximation from within,
which leads to the so called interior point methods. These are important
methods that find application in a broad variety of problems, including
linear programming. They will be discussed in Section 6.8.

2.2.3 Proximal Algorithm, Bundle Methods, and Tikhonov Regularization

The proximal algorithm, briefly discussed in Section 2.1.4, aims to minimize
a closed proper convex function $f : \Re^n \mapsto (-\infty, \infty]$, and is given by
$$x_{k+1} \in \arg\min_{x \in \Re^n} \left\{ f(x) + \frac{1}{2c_k} \|x - x_k\|^2 \right\}, \tag{2.55}$$
[cf. Eq. (2.25)], where $x_0$ is an arbitrary starting point and $c_k$ is a positive
scalar parameter. As the parameter $c_k$ tends to $\infty$, the quadratic regular-
ization term becomes insignificant and the proximal minimization (2.55)
approximates more closely the minimization of f, hence the connection of
the proximal algorithm with the approximation approach.
We will discuss the proximal algorithm in much more detail in Chap-
ter 5, including dual and polyhedral approximation versions. Among oth-
ers, we will show that when f is the dual function of the constrained opti-
mization problem (2.50), the proximal algorithm, via Fenchel duality, be-
comes equivalent to the multiplier iteration of the augmented Lagrangian
method [cf. Eq. (2.54)]. Since any closed proper convex function can be
viewed as the dual function of an appropriate convex constrained optimiza-
tion problem, it follows that the proximal algorithm (2.55) is essentially
equivalent to the augmented Lagrangian method: the two algorithms are
dual sides of the same coin.

There are also variants of the proximal algorithm where f in Eq.


(2.55) is approximated by a polyhedral or other function. One possibil-
ity is bundle methods, which involve a combination of the proximal and
polyhedral approximation ideas. The motivation here is to simplify the
proximal minimization subproblem (2.25), replacing it for example with a
quadratic programming problem. Some of these methods may be viewed
as regularized versions of Dantzig-Wolfe decomposition (see Section 4.3).
Another approximation approach that bears similarity to the prox-
imal algorithm is Tikhonov regularization, which approximates the mini-
mization of f with the minimization

$$x_{k+1} \in \arg\min_{x \in \Re^n} \left\{ f(x) + \frac{1}{2c_k} \|x\|^2 \right\}. \tag{2.56}$$

The quadratic regularization term makes the cost function of the preced-
ing problem strictly convex, and guarantees that it has a unique minimum.
Sometimes the quadratic term in Eq. (2.56) is scaled and a term $\|Sx\|^2$
is used instead, where S is a suitable scaling matrix. The difference with
the proximal algorithm (2.55) is that Xk does not enter directly the min-
imization to determine Xk+l, so the method relies for its convergence on
increasing ck to oo. By contrast this is not necessary for the proximal al-
gorithm, which is generally convergent even when Ck is left constant (as
we will see in Section 5.1), and is typically much faster. Similar to the
proximal algorithm, there is a dual and essentially equivalent algorithm
to Tikhonov regularization. This is the penalty method that consists of
sequential minimization of the quadratically penalized cost function (2.52)
for a sequence {ck} with Ck ---+ oo.

2.2.4 Alternating Direction Method of Multipliers

The proximal algorithm embodies fundamental ideas that lead to a vari-


ety of other interesting methods. In particular, when properly generalized
(see Section 5.1.4), it contains as a special case the alternating direction
method of multipliers (ADMM for short), a method that resembles the aug-
mented Lagrangian method, but is well-suited for some important classes
of problems with special structure.
The starting point for the ADMM is the minimization problem of the
Fenchel duality context:
$$\text{minimize } f_1(x) + f_2(Ax) \quad \text{subject to } x \in \Re^n, \tag{2.57}$$
where A is an $m \times n$ matrix, and $f_1 : \Re^n \mapsto (-\infty, \infty]$ and $f_2 : \Re^m \mapsto (-\infty, \infty]$
are closed proper convex functions. We convert this problem to the equiv-
alent constrained minimization problem
$$\text{minimize } f_1(x) + f_2(z) \quad \text{subject to } x \in \Re^n,\ z \in \Re^m,\ Ax = z, \tag{2.58}$$

and we introduce its augmented Lagrangian function
$$L_c(x, z, \lambda) = f_1(x) + f_2(z) + \lambda'(Ax - z) + \frac{c}{2} \|Ax - z\|^2,$$
where c is a positive parameter.


The ADMM, given the current iterates $(x_k, z_k, \lambda_k) \in \Re^n \times \Re^m \times \Re^m$,
generates a new iterate $(x_{k+1}, z_{k+1}, \lambda_{k+1})$ by first minimizing the aug-
mented Lagrangian with respect to x, then with respect to z, and finally
performing a multiplier update:
$$x_{k+1} \in \arg\min_{x \in \Re^n} L_c(x, z_k, \lambda_k), \tag{2.59}$$
$$z_{k+1} \in \arg\min_{z \in \Re^m} L_c(x_{k+1}, z, \lambda_k), \tag{2.60}$$
$$\lambda_{k+1} = \lambda_k + c\,(A x_{k+1} - z_{k+1}). \tag{2.61}$$
The important advantage that the ADMM may offer over the aug-
mented Lagrangian method, is that it does not involve a joint minimization
with respect to x and z. Thus the complications resulting from the coupling
of x and z in the penalty term IIAx - zll 2 of the augmented Lagrangian
are eliminated. This property can be exploited in special applications, for
which the ADMM is structurally well suited, as we will discuss in Sec-
tion 5.4. On the other hand the ADMM may converge more slowly than
the augmented Lagrangian method, so the flexibility it provides must be
weighed against this potential drawback.
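
As an illustration of the updates (2.59)-(2.61), here is a minimal sketch for an assumed instance with $f_1(x) = \tfrac{1}{2}\|Bx - d\|^2$, $f_2(z) = \gamma \|z\|_1$, and A equal to the identity; the closed forms used for the two minimizations (a linear solve and a soft-thresholding step) follow from this particular choice and are not part of the general method description above.

```python
import numpy as np

def shrink(v, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Assumed instance: f1(x) = 1/2 ||Bx - d||^2, f2(z) = gamma ||z||_1, A = I,
# so the constraint in (2.58) is simply x = z.
rng = np.random.default_rng(0)
m_rows, n = 20, 10
B = rng.standard_normal((m_rows, n))
d = rng.standard_normal(m_rows)
gamma, c = 1.0, 1.0

x = np.zeros(n); z = np.zeros(n); lam = np.zeros(n)
for k in range(200):
    # (2.59): minimize L_c over x  ->  (B'B + cI) x = B'd - lam + c z
    x = np.linalg.solve(B.T @ B + c * np.eye(n), B.T @ d - lam + c * z)
    # (2.60): minimize L_c over z  ->  soft-thresholding of x + lam/c
    z = shrink(x + lam / c, gamma / c)
    # (2.61): multiplier update
    lam = lam + c * (x - z)

print(np.linalg.norm(x - z))   # primal residual, should be near zero
```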
In Chapter 5, we will see that the proximal algorithm for minimiza-
tion can be viewed as a special case of a generalized proximal algorithm for
finding a solution of an equation involving a multivalued monotone opera-
tor. While we will not fully develop the range of algorithms that are based
on this generalization, we will show that both the augmented Lagrangian
method and the ADMM are special cases of the generalized proximal al-
gorithm, corresponding to two different multivalued monotone operators.
Because of the differences of these two operators, some of the properties of
the two methods are quite different. For example, contrary to the case of
the augmented Lagrangian method (where Ck is often taken to be increasing
with k in order to accelerate convergence), there seems to be no generally
good way to adjust c in ADMM from one iteration to the next. Moreover,
even when both methods have a linear convergence rate, the performance of
the two methods may differ markedly in practice. Still there is more than a
superficial connection between the two methods, which can be understood
within the context of their common proximal algorithm ancestry.
In Section 6.3, we will also see another connection of ADMM with
proximal-related methods, and particularly the proximal gradient method,
which we briefly discussed in Section 2.1.4 [cf. Eq. (2.27)]. It turns out
that both the ADMM and the proximal gradient method can be viewed

as instances of splitting algorithms for finding a zero of the sum of two


monotone operators. The idea is to decouple the two operators within an
iteration: one operator is treated by a proximal-type algorithm, while the
other is treated by a proximal-type or a gradient algorithm. In so doing, the
complications that arise from coupling of the two operators are mitigated.

2.2.5 Smoothing of Nondifferentiable Problems

Generally speaking, differentiable cost functions are preferable to nondiffer-


entiable ones, because algorithms for the former are better developed and
are more effective than algorithms for the latter. Thus there is an incentive
to eliminate nondifferentiabilities by "smoothing" their corners. It turns
out that penalty functions and smoothing are closely related, reflecting the
fact that constraints and nondifferentiabilities are also closely related. As
an example of this connection, the unconstrained minimax problem

$$\text{minimize } \max\big\{f_1(x), \ldots, f_m(x)\big\} \quad \text{subject to } x \in \Re^n, \tag{2.62}$$
where $f_1, \ldots, f_m$ are differentiable functions, can be converted to the dif-
ferentiable constrained problem

$$\text{minimize } z \quad \text{subject to } f_j(x) \leq z,\ j = 1, \ldots, m, \tag{2.63}$$

where z is an artificial scalar variable. When a penalty or augmented


Lagrangian is applied to the constrained problem (2.63), we will show that
a smoothing method is obtained for the minimax problem (2.62).
We will now describe a technique (first given in [Ber75b], and gen-
eralized in [Ber77]) to obtain smoothing approximations. Let $f : \Re^n \mapsto
(-\infty, \infty]$ be a closed proper convex function with conjugate denoted by
$f^*$. For fixed $c > 0$ and $\lambda \in \Re^n$, define
$$f_{c,\lambda}(x) = \inf_{u \in \Re^n} \Big\{ f(x - u) + \lambda'u + \frac{c}{2} \|u\|^2 \Big\}. \tag{2.64}$$
The conjugates of $\phi_1(u) = f(x - u)$ and $\phi_2(u) = \lambda'u + \frac{c}{2}\|u\|^2$ are $\phi_1^*(y) = f^*(-y) + y'x$ and $\phi_2^*(y) = \frac{1}{2c}\|y - \lambda\|^2$, so by using the Fenchel duality
formula $\inf_{u \in \Re^n}\{\phi_1(u) + \phi_2(u)\} = \sup_{y \in \Re^n}\{-\phi_1^*(-y) - \phi_2^*(y)\}$, we have
$$f_{c,\lambda}(x) = \sup_{y \in \Re^n} \Big\{ y'x - f^*(y) - \frac{1}{2c} \|y - \lambda\|^2 \Big\}. \tag{2.65}$$

It can be seen that $f_{c,\lambda}$ approximates f in the sense that
$$\lim_{c \to \infty} f_{c,\lambda}(x) = f^{**}(x) = f(x), \qquad \forall\ x \in \Re^n,\ \lambda \in \Re^n;$$

Figure 2.2.2. Illustration of smoothing of the function $f(x) = \max\{0, x\}$. Note
that as $c \to \infty$, we have $f_{c,\lambda}(x) \to f(x)$ for all $x \in \Re$, regardless of the value of $\lambda$.

the double conjugate $f^{**}$ is equal to f by the Conjugacy Theorem (Prop.
1.6.1 in Appendix B). Furthermore, it can be shown using the optimality
conditions of the Fenchel Duality Theorem, Prop. 1.2.1(c) (see also [Ber77]),
that $f_{c,\lambda}$ is convex and differentiable as a function of x for fixed c and $\lambda$,
and that the gradient $\nabla f_{c,\lambda}(x)$ at any $x \in \Re^n$ can be obtained in two ways:
(i) As the vector $\lambda + cu$, where u is the unique vector attaining the
infimum in Eq. (2.64).
(ii) As the unique vector y that attains the supremum in Eq. (2.65).
The smoothing approach consists of replacing unsmoothed functions f
with their smooth approximations fc,>., wherever they occur within the cost
and constraint functions of a given problem. Note that there may be several
functions f that are being smoothed simultaneously, and each occurrence of
f may use different >. and c. In this way we obtain a differentiable problem
that approximates the original.
As an example, consider a common source of nondifferentiability:
$$f(x) = \max\{0, x\}, \qquad x \in \Re.$$

It can be verified using Eqs. (2.64) and (2.65) that
$$f_{c,\lambda}(x) = \begin{cases} x - \dfrac{(1-\lambda)^2}{2c} & \text{if } \dfrac{1-\lambda}{c} \leq x, \\[2mm] \lambda x + \dfrac{c}{2} x^2 & \text{if } -\dfrac{\lambda}{c} \leq x \leq \dfrac{1-\lambda}{c}, \\[2mm] -\dfrac{\lambda^2}{2c} & \text{if } x \leq -\dfrac{\lambda}{c}; \end{cases}$$

see Fig. 2.2.2. The function $f(x) = \max\{0, x\}$ may also be used as a
building block to construct more complicated nondifferentiable functions,
such as for example
$$\max\{x_1, x_2\} = x_2 + \max\{0, x_1 - x_2\};$$
see [Ber82a], Ch. 3.
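
The piecewise expression above is easy to check numerically. The following short sketch (with an assumed evaluation grid) evaluates the closed form and confirms that the maximum deviation from $\max\{0, x\}$ shrinks at the rate O(1/c).

```python
import numpy as np

def smoothed_relu(x, c, lam):
    """Closed form of f_{c,lambda} for f(x) = max{0, x}, cf. Eqs. (2.64)-(2.65)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    lo, hi = -lam / c, (1.0 - lam) / c
    out[x >= hi] = x[x >= hi] - (1.0 - lam) ** 2 / (2.0 * c)
    mid = (x >= lo) & (x < hi)
    out[mid] = lam * x[mid] + 0.5 * c * x[mid] ** 2
    out[x < lo] = -lam ** 2 / (2.0 * c)
    return out

xs = np.linspace(-2.0, 2.0, 401)        # assumed evaluation grid
for c in [1.0, 10.0, 100.0]:
    err = np.max(np.abs(smoothed_relu(xs, c, lam=0.5) - np.maximum(0.0, xs)))
    print(c, err)                        # the maximum gap shrinks like O(1/c)
```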

Smoothing and Augmented Lagrangians

The smoothing technique just described can also be combined with the
augmented Lagrangian method. As an example, let $f : \Re^n \mapsto (-\infty, \infty]$
be a closed proper convex function with conjugate denoted by $f^*$. Let
$F : \Re^n \mapsto \Re$ be another convex function, and let X be a closed convex set.
Consider the problem
$$\text{minimize } F(x) + f(x) \quad \text{subject to } x \in X,$$
and the equivalent problem
$$\text{minimize } F(x) + f(x - u) \quad \text{subject to } x \in X,\ u = 0.$$
Applying the augmented Lagrangian method (2.53)-(2.54) to the latter
problem leads to minimizations of the form
$$(x_{k+1}, u_{k+1}) \in \arg\min_{x \in X,\, u \in \Re^n} \Big\{ F(x) + f(x - u) + \lambda_k'u + \frac{c_k}{2} \|u\|^2 \Big\}.$$
By first minimizing over $u \in \Re^n$, these minimizations yield
$$x_{k+1} \in \arg\min_{x \in X} \big\{ F(x) + f_{c_k,\lambda_k}(x) \big\},$$
where $f_{c_k,\lambda_k}$ is the smoothed function
$$f_{c_k,\lambda_k}(x) = \inf_{u \in \Re^n} \Big\{ f(x - u) + \lambda_k'u + \frac{c_k}{2} \|u\|^2 \Big\}$$
[cf. Eq. (2.64)]. The corresponding multiplier update (2.54) is
$$\lambda_{k+1} = \lambda_k + c_k u_{k+1},$$
where
$$u_{k+1} \in \arg\min_{u \in \Re^n} \Big\{ f(x_{k+1} - u) + \lambda_k'u + \frac{c_k}{2} \|u\|^2 \Big\}.$$

The preceding technique can be extended so that it applies to general
convex/concave minimax problems. Let Z be a nonempty convex subset
of $\Re^m$, and let $\phi : \Re^n \times Z \mapsto \Re$ be a function such that $\phi(\cdot, z) :
\Re^n \mapsto \Re$ is convex for each $z \in Z$, and $-\phi(x, \cdot) : Z \mapsto \Re$ is convex and
closed for each $x \in \Re^n$. Consider the problem
$$\text{minimize } \sup_{z \in Z} \phi(x, z) \quad \text{subject to } x \in X,$$

where X is a nonempty closed convex subset of $\Re^n$. Consider also the
equivalent problem
$$\text{minimize } f(x, y) \quad \text{subject to } x \in X,\ y = 0,$$
where f is the function
$$f(x, y) = \sup_{z \in Z} \big\{ \phi(x, z) - y'z \big\},$$
which is closed and convex, being the supremum of closed and convex
functions. The augmented Lagrangian minimization (2.53) for this problem
takes the form
$$x_{k+1} \in \arg\min_{x \in X} f_{c_k,\lambda_k}(x),$$
where $f_{c,\lambda} : \Re^n \mapsto \Re$ is the differentiable function given by
$$f_{c,\lambda}(x) = \min_{y \in \Re^m} \Big\{ f(x, y) + \lambda'y + \frac{c}{2} \|y\|^2 \Big\}.$$
The corresponding multiplier update (2.54) is
$$\lambda_{k+1} = \lambda_k + c_k y_{k+1},$$
where
$$y_{k+1} \in \arg\min_{y \in \Re^m} \Big\{ f(x_{k+1}, y) + \lambda_k'y + \frac{c_k}{2} \|y\|^2 \Big\}.$$

This method of course makes sense only in the case where the function f
has a convenient form that facilitates the preceding minimization.
For further discussion of the relations and combination of smoothing
with the augmented Lagrangian method, see [Ber75b], [Ber77], [Pap81],
and for a detailed textbook analysis, [Ber82a], Ch. 3. There have also been
many variations of smoothing ideas and applications in different contexts;
see [Ber73], [Geo77], [Pol79], [Pol88], [BeT89b], [PiZ94], [Nes05], [Che07],
[OvG14]. In Section 6.2, we will also see an application of smoothing as an
analytical device, in the context of complexity analysis.

Exponential Smoothing

We have used so far a quadratic penalty function as the basis for smooth-
ing. It is also possible to use other types of penalty functions. A simple
penalty function, which often leads to convenient formulas is the exponen-
tial, which will also be discussed further in Section 6.6. The advantage of
the exponential function over the quadratic is that it produces twice differ-
entiable approximating functions. This may be significant when Newton's
method is used to solve the smoothed problem.

As an example, a smooth approximation of the function
$$f(x) = \max\big\{f_1(x), \ldots, f_m(x)\big\}$$
is given by
$$f_{c,\lambda}(x) = \frac{1}{c} \ln \left\{ \sum_{i=1}^m \lambda^i e^{c f_i(x)} \right\}, \tag{2.66}$$
where $c > 0$, and $\lambda = (\lambda^1, \ldots, \lambda^m)$ is a vector with
$$\sum_{i=1}^m \lambda^i = 1, \qquad \lambda^i > 0, \quad \forall\ i = 1, \ldots, m.$$

There is an augmented Lagrangian method associated with the ap-
proximation (2.66). It involves minimizing over $x \in \Re^n$ the function
$f_{c_k,\lambda_k}(x)$ for a given $c_k$ and $\lambda_k$ to obtain an approximation $x_k$ to the min-
imum of f. This approximation is refined by setting $c_{k+1} \geq c_k$ and
$$\lambda^i_{k+1} = \frac{\lambda^i_k\, e^{c_k f_i(x_k)}}{\sum_{j=1}^m \lambda^j_k\, e^{c_k f_j(x_k)}}, \qquad i = 1, \ldots, m, \tag{2.67}$$

and by repeating the process. t The generated sequence { Xk} can be shown
to converge to the minimum of f under mild assumptions, based on gen-
eral convergence properties of augmented Lagrangian methods that use
nonquadratic penalty functions; see [Ber82a], Ch. 5, for a detailed devel-
opment.
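
As a small numerical illustration of Eqs. (2.66)-(2.67), the following sketch applies the method to an assumed minimax instance with three affine components, alternating a crude gradient minimization of the smoothed function with the multiplier update, while c is kept constant.

```python
import numpy as np

# Assumed instance (not from the text): minimize max{x, -x, 0.5 x - 1} over
# the reals; its minimum value is 0, attained at x = 0.
a = np.array([1.0, -1.0, 0.5])
b = np.array([0.0, 0.0, -1.0])
f_i = lambda x: a * x + b                       # the m = 3 affine components

def smoothed_grad(x, c, lam):
    w = lam * np.exp(c * f_i(x))
    return np.sum((w / w.sum()) * a)            # gradient of (2.66) for affine f_i

c = 5.0                                         # kept constant; only lambda updates
lam = np.full(3, 1.0 / 3.0)
x = 2.0
for k in range(20):
    for _ in range(200):                        # crude gradient descent on f_{c,lam}
        x -= 0.05 * smoothed_grad(x, c, lam)
    w = lam * np.exp(c * f_i(x))
    lam = w / w.sum()                           # multiplier update (2.67)

print(x, np.max(f_i(x)))                        # x near 0, minimax value near 0
```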

Example 2.2.1: (Smoothed $\ell_1$ Regularization)

Consider the $\ell_1$-regularized least squares problem
$$\text{minimize } \gamma \sum_{j=1}^n |x^j| + \frac{1}{2} \sum_{i=1}^m (a_i'x - b_i)^2 \quad \text{subject to } x \in \Re^n,$$

t Sometimes the use of the exponential in Eq. (2.67) and other related formu-
las, such as (2.66), may lead to very large numbers and computer overflow. In this
case one may use a translation device to avoid such numbers, e.g., multiplying
numerator and denominator in Eq. (2.67) by $e^{-\beta_k}$, where $\beta_k = c_k \max_{i=1,\ldots,m} f_i(x_k)$.

A similar idea works for Eq. (2.66).



Figure 2.2.3. Illustration of the exponentially smoothed version
$$R_{c,\lambda}(x) = \frac{1}{c} \ln \big\{ \lambda e^{cx} + (1 - \lambda) e^{-cx} \big\}$$
of the absolute value function $|x|$. The approximation becomes asymptotically
exact as $c \to \infty$ for any fixed value of the multiplier $\lambda \in (0, 1)$. Also by adjust-
ing the multiplier $\lambda$ within the range (0, 1), we can attain better approximation
for x positive or negative. As $\lambda \to 1$ (or $\lambda \to 0$) the approximation becomes
asymptotically exact for $x \geq 0$ (or $x \leq 0$, respectively).

where $a_i$ and $b_i$ are given vectors and scalars, respectively (cf. Example 1.3.1).
The nondifferentiable $\ell_1$ penalty may be smoothed by writing each term $|x^j|$
as $\max\{x^j, -x^j\}$ and by smoothing it using Eq. (2.66), i.e., replacing it by
$$R_{c,\lambda^j}(x^j) = \frac{1}{c} \ln \big\{ \lambda^j e^{c x^j} + (1 - \lambda^j) e^{-c x^j} \big\},$$
where c and $\lambda^j$ are scalars satisfying $c > 0$ and $\lambda^j \in (0, 1)$ (see Fig. 2.2.3).
We may then consider an exponential type of augmented Lagrangian method,
whereby we minimize over $\Re^n$ the twice differentiable function
$$\gamma \sum_{j=1}^n R_{c_k,\lambda^j_k}(x^j) + \frac{1}{2} \sum_{i=1}^m (a_i'x - b_i)^2, \tag{2.68}$$

to obtain an approximation $x_k$ to the optimal solution. This approximation
is refined by setting $c_{k+1} \geq c_k$ and
$$\lambda^j_{k+1} = \frac{\lambda^j_k\, e^{c_k x^j_k}}{\lambda^j_k\, e^{c_k x^j_k} + (1 - \lambda^j_k)\, e^{-c_k x^j_k}}, \qquad j = 1, \ldots, n, \tag{2.69}$$

[cf. Eq. (2.67)], and by repeating the process. Note that the minimization of
the exponentially smoothed cost function (2.68) can be carried out efficiently
by incremental methods, such as the incremental gradient and Newton meth-
ods of Section 2.1.5.
As Fig. 2.2.3 suggests, the adjustment of the multiplier $\lambda^j$ can selec-
tively reduce the error $\big| R_{c,\lambda^j}(x^j) - |x^j| \big|$,

depending on whether good approximation for positive or negative $x^j$ is de-
sired. For this reason it is not necessary to increase $c_k$ to infinity; the multi-
plier iteration (2.69) is sufficient for convergence even with $c_k$ kept constant
at some positive value (see [Ber82a], Ch. 5).

2.3 NOTES, SOURCES, AND EXERCISES

Section 2.1: Textbooks on nonlinear programming with a substan-


tial algorithmic content are good sources for the overview material on
unconstrained and constrained differentiable optimization in this chap-
ter, e.g., Zangwill [Zan69], Polak [Pol71], Hestenes [Hes75], Zoutendijk
[Zou76], Shapiro [Sha79], Gill, Murray, and Wright [GMW81], Luen-
berger [Lue84], Poljak [Pol87], Dennis and Schnabel [DeS96], Bertsekas
[Ber99], Kelley [Kel99], Fletcher [Fle00], Nesterov [Nes04], Bazaraa,
Shetty, and Sherali [BSS06], Nocedal and Wright [NoW06], Ruszczynski
[Rus06], Griva, Nash, and Sofer [GNS08], Luenberger and Ye [LuY08].
References for specific algorithmic nondifferentiable optimization topics
will be given in subsequent chapters.
Incremental gradient methods have a long history, particularly for
the unconstrained case (X = ~n), starting with the Widrow-Hoff least
mean squares (LMS) method [WiH60], which stimulated much subse-
quent research. They have also been used widely, under the generic
name "backpropagation methods," for training of neural networks, which
involves nonquadratic/nonconvex differentiable cost components. There
is an extensive literature on this subject, and for some representative
works, we refer to the papers by Rumelhart, Hinton, and Williams
[RHW86], [RHW88], Becker and LeCun [BeL88], Vogl et al. [VMR88],
and Saarinen, Bramley, and Cybenko [SBC91], and the books by Bishop
[Bis95], Bertsekas and Tsitsiklis [BeT96], and Haykin [Hay08]. Some of
this literature overlaps with the literature on stochastic gradient meth-
ods, which we noted in Section 2.1.5.
Deterministic convergence analyses of several variants of incremen-
tal gradient methods were given in the 90s under various assumptions
and for a variety of stepsize rules; see Luo [Luo91], Grippo [Gri94],
[Gri00], Luo and Tseng [LuT94a], Mangasarian and Solodov [MaS94],
Bertsekas [Ber97], Solodov [Sol98], Tseng [Tse98], and Bertsekas and
Tsitsiklis [BeT00]. Recent theoretical work on incremental gradient
methods has focused, among others, on the aggregated gradient meth-
ods discussed in Section 2.1.5, on extensions to nondifferentiable and
constrained problems, on combinations with the proximal algorithm,
and on constrained versions where the constraints are treated by incre-
mental projection (see Section 6.4 and the references cited there).
The incremental Newton and related methods have been consid-
ered by several authors in a stochastic approximation framework; e.g.,
by Sakrison [Sak66], Venter [Ven67], Fabian [Fab73], Poljak and Tsyp-

kin [PoT80], [PoT81], and more recently by Bottou and LeCun [BoL05],
and Bhatnagar, Prasad, and Prashanth [BPP13]. Among others, these
references quantify the convergence rate advantage that stochastic New-
ton methods have over stochastic gradient methods. Deterministic
incremental Newton methods have received little attention (for a re-
cent work see Gurbuzbalaban, Ozdaglar, and Parrilo [GOP14]). How-
ever, they admit an analysis that is similar to a deterministic analysis
of the extended Kalman filter, the incremental version of the Gauss-
Newton method (see Bertsekas [Ber96], and Moriyama, Yamashita, and
Fukushima [MYF03]). There are also many stochastic analyses of the
extended Kalman filter in the literature of estimation and control of
dynamic systems.
Let us also note another approach to accelerate the theoretical
convergence rate of incremental gradient methods, which involves using
a larger than O(1/k) stepsize and averaging the iterates (for analysis of
the corresponding stochastic gradient methods, see Ruppert [Rup85],
and Poljak and Juditsky [PoJ92], and for a textbook account, Kushner
and Yin [KuY03]).
Section 2.2: The nonlinear programming textbooks cited earlier con-
tain a lot of material on approximation methods. In particular, the lit-
erature on polyhedral approximation is extensive. It dates to the early
days of nonlinear and convex programming, and it involves applications
in data communication and transportation networks, and large-scale
resource allocation. This literature will be reviewed in Chapter 4.
The research monographs by Fiacco and MacCormick [FiM68],
and Bertsekas [Ber82a] focus on penalty and augmented Lagrangian
methods, respectively. The latter book also contains a lot of mate-
rial on smoothing methods and the proximal algorithm, including cases
where nonquadratic regularization is involved, leading in turn to non-
quadratic penalty terms in the augmented Lagrangian (e.g., logarithmic
regularization and exponential penalty).
The proximal algorithm was proposed in the early 70s by Martinet
[Mar70], [Mar72]. The literature on the algorithm and its extensions,
spurred by the influential paper by Rockafellar [Roc76a], is voluminous,
and reflects the central importance of proximal ideas in convex optimiza-
tion and other problems. The ADMM, an important special case of the
proximal context, was proposed by Glowinskii and Morocco [GIM75],
and Gabay and Mercier [GaM76], and was further developed by Gabay
[Gab79], [Gab83]. We refer to Section 5.4 for a detailed discussion of this
algorithm, its applications, and its connections to more general operator
splitting methods. Recent work involving proximal ideas has focused on
combinations with other algorithms, such as gradient, subgradient, and
coordinate descent methods. Some of these combined methods will be
discussed in detail in Chapters 5 and 6.

EXERCISES

2.1 (Convergence Rate of Steepest Descent and Gradient Projection for a Quadratic Cost Function)

Let f be the quadratic cost function,
$$f(x) = \tfrac{1}{2} x'Qx - b'x,$$
where Q is a symmetric positive definite matrix, and let m and M be the
minimum and maximum eigenvalues of Q, respectively. Consider the mini-
mization of f over a closed convex set X and the gradient projection mapping
$$G(x) = P_X\big(x - \alpha \nabla f(x)\big)$$
with constant stepsize $\alpha < 2/M$.
(a) Show that G is a contraction mapping and we have
$$\|G(x) - G(y)\| \leq \max\big\{|1 - \alpha m|,\, |1 - \alpha M|\big\}\, \|x - y\|, \qquad \forall\ x, y \in \Re^n,$$


and its unique fixed point is the unique minimum $x^*$ of f over X.
Solution: First note the nonexpansive property of the projection
$$\|P_X(x) - P_X(y)\| \leq \|x - y\|, \qquad \forall\ x, y \in \Re^n$$
(use a Euclidean geometric argument, or see Section 3.2 for a proof).
Use this property and the gradient formula $\nabla f(x) = Qx - b$ to write
$$\begin{aligned} \|G(x) - G(y)\| &= \big\|P_X\big(x - \alpha \nabla f(x)\big) - P_X\big(y - \alpha \nabla f(y)\big)\big\| \\ &\leq \big\|\big(x - \alpha \nabla f(x)\big) - \big(y - \alpha \nabla f(y)\big)\big\| \\ &= \big\|(I - \alpha Q)(x - y)\big\| \\ &\leq \max\big\{|1 - \alpha m|,\, |1 - \alpha M|\big\}\, \|x - y\|, \end{aligned}$$
where m and M are the minimum and maximum eigenvalues of Q.
Clearly $x^*$ is a fixed point of G if and only if $x^* = P_X\big(x^* - \alpha \nabla f(x^*)\big)$,
which by the projection theorem is true if and only if the necessary and
sufficient condition for optimality $\nabla f(x^*)'(x - x^*) \geq 0$ for all $x \in X$
is satisfied. Note: In a generalization of this convergence rate estimate
to the case of a nonquadratic strongly convex differentiable function f,
the maximum eigenvalue M is replaced by the Lipschitz constant of $\nabla f$
and the minimum eigenvalue m is replaced by the modulus of strong
convexity of f; see Section 6.1.

(b) Show that the value of $\alpha$ that minimizes the bound of part (a) is
$$\alpha^* = \frac{2}{M + m},$$
in which case
$$\|G(x) - G(y)\| \leq \left(\frac{M - m}{M + m}\right) \|x - y\|.$$
Note: The linear convergence rate estimate,
$$\|x_{k+1} - x^*\| \leq \left(\frac{M - m}{M + m}\right) \|x_k - x^*\|,$$
that this contraction property implies for steepest descent with con-
stant stepsize is sharp, in the sense that there exist starting points $x_0$
for which the preceding inequality holds as an equation for all k (see
[Ber99], Section 2.3).

2.2 (Descent Inequality)

This exercise deals with an inequality that is fundamental for the convergence
analysis of gradient methods. Let X be a convex set, and let $f : \Re^n \mapsto \Re$ be
a differentiable function such that for some constant $L > 0$, we have
$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \qquad \forall\ x, y \in X.$$
Show that
$$f(y) \leq f(x) + \nabla f(x)'(y - x) + \frac{L}{2} \|y - x\|^2, \qquad \forall\ x, y \in X. \tag{2.70}$$

Proof: Let t be a scalar parameter and let $g(t) = f\big(x + t(y - x)\big)$. The chain
rule yields $(dg/dt)(t) = \nabla f\big(x + t(y - x)\big)'(y - x)$. Thus, we have
$$\begin{aligned} f(y) - f(x) &= g(1) - g(0) = \int_0^1 \frac{dg}{dt}(t)\, dt = \int_0^1 (y - x)' \nabla f\big(x + t(y - x)\big)\, dt \\ &\leq \int_0^1 (y - x)' \nabla f(x)\, dt + \left| \int_0^1 (y - x)' \Big(\nabla f\big(x + t(y - x)\big) - \nabla f(x)\Big)\, dt \right| \\ &\leq \int_0^1 (y - x)' \nabla f(x)\, dt + \int_0^1 \|y - x\| \cdot \big\|\nabla f\big(x + t(y - x)\big) - \nabla f(x)\big\|\, dt \\ &\leq (y - x)' \nabla f(x) + \|y - x\| \int_0^1 L t \|y - x\|\, dt \\ &= (y - x)' \nabla f(x) + \frac{L}{2} \|y - x\|^2. \end{aligned}$$



2.3 (Convergence of Steepest Descent with Constant Stepsize)

Let $f : \Re^n \mapsto \Re$ be a differentiable function such that for some constant
$L > 0$, we have
$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \qquad \forall\ x, y \in \Re^n. \tag{2.71}$$
Consider the sequence $\{x_k\}$ generated by the steepest descent iteration
$$x_{k+1} = x_k - \alpha \nabla f(x_k),$$
where $0 < \alpha < \frac{2}{L}$. Show that if $\{x_k\}$ has a limit point, then $\nabla f(x_k) \to 0$, and
every limit point $\bar x$ of $\{x_k\}$ satisfies $\nabla f(\bar x) = 0$. Proof: We use the descent
inequality (2.70) to show that the cost function is reduced at each iteration
according to
$$\begin{aligned} f(x_{k+1}) = f\big(x_k - \alpha \nabla f(x_k)\big) &\leq f(x_k) + \nabla f(x_k)'\big(-\alpha \nabla f(x_k)\big) + \frac{\alpha^2 L}{2} \|\nabla f(x_k)\|^2 \\ &= f(x_k) - \alpha \Big(1 - \frac{\alpha L}{2}\Big) \|\nabla f(x_k)\|^2. \end{aligned}$$

Thus if there exists a limit point $\bar x$ of $\{x_k\}$, we have $f(x_k) \to f(\bar x)$ and
$\nabla f(x_k) \to 0$. This implies that $\nabla f(\bar x) = 0$, since $\nabla f(\cdot)$ is continuous by Eq.
(2.71).

2.4 (Armijo/Backtracking Stepsize Rule)

Consider minimization of a continuously differentiable function $f : \Re^n \mapsto \Re$,
using the iteration
$$x_{k+1} = x_k + \alpha_k d_k,$$
where $d_k$ is a descent direction. Given fixed scalars $\beta$ and $\sigma$, with $0 < \beta < 1$,
$0 < \sigma < 1$, and $s_k$ with $\inf_{k \geq 0} s_k > 0$, the stepsize $\alpha_k$ is determined as follows:
we set $\alpha_k = \beta^{m_k} s_k$, where $m_k$ is the first nonnegative integer m for which
$$f(x_k) - f(x_k + \beta^m s_k d_k) \geq -\sigma \beta^m s_k \nabla f(x_k)'d_k.$$
Assume that there exist positive scalars $c_1$, $c_2$ such that for all k we have
$$c_1 \|\nabla f(x_k)\|^2 \leq -\nabla f(x_k)'d_k, \qquad \|d_k\| \leq c_2 \|\nabla f(x_k)\|. \tag{2.72}$$

(a) Show that the stepsize $\alpha_k$ is well-defined, i.e., that it will be determined
after a finite number of reductions if $\nabla f(x_k) \neq 0$. Proof: We have for
all $s > 0$
$$f(x_k + s d_k) - f(x_k) = s \nabla f(x_k)'d_k + o(s).$$
Thus the test for acceptance of a stepsize $s > 0$ is written as
$$-s \nabla f(x_k)'d_k - o(s) \geq -\sigma s \nabla f(x_k)'d_k,$$
or using Eq. (2.72),
$$\frac{o(s)}{s} \leq (1 - \sigma) c_1 \|\nabla f(x_k)\|^2,$$
which is satisfied for s in some interval $(0, \bar s_k]$. Thus the test will be
passed for all m for which $\beta^m s_k \leq \bar s_k$.
(b) Show that every limit point $\bar x$ of the generated sequence $\{x_k\}$ satisfies
$\nabla f(\bar x) = 0$. Proof: Assume, to arrive at a contradiction, that there
is a subsequence $\{x_k\}_{\mathcal K}$ that converges to some $\bar x$ with $\nabla f(\bar x) \neq 0$.
Since $\{f(x_k)\}$ is monotonically nonincreasing, $\{f(x_k)\}$ either converges
to a finite value or diverges to $-\infty$. Since f is continuous, $f(\bar x)$ is a
limit point of $\{f(x_k)\}$, so it follows that the entire sequence $\{f(x_k)\}$
converges to $f(\bar x)$. Hence,
$$f(x_k) - f(x_{k+1}) \to 0.$$
By the definition of the Armijo rule and the descent property $\nabla f(x_k)'d_k \leq
0$ of the direction $d_k$, we have
$$f(x_k) - f(x_{k+1}) \geq -\sigma \alpha_k \nabla f(x_k)'d_k \geq 0,$$
so by combining the preceding two relations,
$$\alpha_k \nabla f(x_k)'d_k \to 0. \tag{2.73}$$
From the left side of Eq. (2.72) and the hypothesis $\nabla f(\bar x) \neq 0$, it follows
that
$$\limsup_{k \to \infty,\ k \in \mathcal K} \nabla f(x_k)'d_k < 0, \tag{2.74}$$
which together with Eq. (2.73) implies that
$$\{\alpha_k\}_{\mathcal K} \to 0.$$
Since $s_k$, the initial trial value for $\alpha_k$, is bounded away from 0, $s_k$ will be
reduced at least once for all $k \in \mathcal K$ that are greater than some iteration
index $\bar k$. Thus we must have for all $k \in \mathcal K$ with $k > \bar k$,
$$f(x_k) - f\big(x_k + (\alpha_k/\beta) d_k\big) < -\sigma (\alpha_k/\beta) \nabla f(x_k)'d_k. \tag{2.75}$$

From the right side of Eq. (2.72), $\{d_k\}_{\mathcal K}$ is bounded, and it follows that
there exists a subsequence $\{d_k\}_{\bar{\mathcal K}}$ of $\{d_k\}_{\mathcal K}$ such that
$$\{d_k\}_{\bar{\mathcal K}} \to \bar d,$$
where $\bar d$ is some vector. From Eq. (2.75), we have
$$\frac{f(x_k) - f(x_k + \bar\alpha_k d_k)}{\bar\alpha_k} < -\sigma \nabla f(x_k)'d_k, \qquad \forall\ k \in \bar{\mathcal K},\ k \geq \bar k,$$
where $\bar\alpha_k = \alpha_k/\beta$. By using the mean value theorem, this relation is
written as
$$-\nabla f(x_k + \tilde\alpha_k d_k)'d_k < -\sigma \nabla f(x_k)'d_k, \qquad \forall\ k \in \bar{\mathcal K},\ k \geq \bar k,$$
where $\tilde\alpha_k$ is a scalar in the interval $[0, \bar\alpha_k]$. Taking limits in the preceding
relation we obtain
$$-\nabla f(\bar x)'\bar d \leq -\sigma \nabla f(\bar x)'\bar d,$$
or
$$0 \leq (1 - \sigma) \nabla f(\bar x)'\bar d.$$
Since $\sigma < 1$, it follows that
$$0 \leq \nabla f(\bar x)'\bar d,$$
a contradiction of Eq. (2.74).

2.5 (Convergence of Steepest Descent to a Single Limit [BGI95])

Let $f : \Re^n \mapsto \Re$ be a differentiable convex function, and assume that for some
$L > 0$, we have
$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \qquad \forall\ x, y \in \Re^n.$$
Let $X^*$ be the set of minima of f, and assume that $X^*$ is nonempty. Consider
the steepest descent method
$$x_{k+1} = x_k - \alpha_k \nabla f(x_k).$$
Show that $\{x_k\}$ converges to a minimizing point of f under each of the fol-
lowing two stepsize rule conditions:
(i) For some $\epsilon > 0$, we have
$$\epsilon \leq \alpha_k \leq \frac{2(1 - \epsilon)}{L}, \qquad \forall\ k.$$
(ii) $\alpha_k \to 0$ and $\sum_{k=0}^\infty \alpha_k = \infty$.


Notes: The original source [BGI95] also shows convergence to a single limit
for a variant of the Armijo rule. This should be contrasted with a result of
[Gon00], which shows that the steepest descent method with the exact line

minimization rule may produce a sequence with multiple limit points (all of
which are of course optimal), even for a convex cost function. There is also
a "local capture" theorem that applies to gradient methods for nonconvex
continuously differentiable cost functions f and an isolated local minimum of
f (a local minimum x* that is unique within a neighborhood of x*). Under
mild conditions it asserts that there is an open sphere Bx• centered at x*
such that once the generated sequence { Xk} enters Sx*, it converges to x*
(see [Ber82a], Prop. 1.12, or [Ber99], Prop. 1.2.5 and the references given
there). Abbreviated Proof: Consider the stepsize rule (i). From the descent
inequality (Exercise 2.2), we have for all k

so {f(xk)} is monotonically nonincreasing and converges. Adding the pre-


ceding relation for all values of k and taking the limit as k -+ oo, we obtain
for all x* E X*,
00

1(x*) s 1(xo) -c2 I: 11v11(xk)r


k=O

It follows that I:~=O llv'f(xk)ll 2< oo and v'f(xk)-+ 0, and also


$$\sum_{k=0}^\infty \|x_{k+1} - x_k\|^2 < \infty, \tag{2.76}$$
since $\nabla f(x_k) = (x_k - x_{k+1})/\alpha_k$. Moreover any limit point of $\{x_k\}$ belongs to
$X^*$, since $\nabla f(x_k) \to 0$ and f is convex.
Using the convexity of f, we have for all $x^* \in X^*$,
$$\begin{aligned} \|x_{k+1} - x^*\|^2 - \|x_k - x^*\|^2 - \|x_{k+1} - x_k\|^2 &= -2(x^* - x_k)'(x_{k+1} - x_k) \\ &= 2\alpha_k (x^* - x_k)'\nabla f(x_k) \\ &\leq 2\alpha_k \big(f(x^*) - f(x_k)\big) \\ &\leq 0, \end{aligned}$$
so that
$$\|x_{k+1} - x^*\|^2 \leq \|x_k - x^*\|^2 + \|x_{k+1} - x_k\|^2, \qquad \forall\ x^* \in X^*. \tag{2.77}$$

We now use Eqs. (2.76) and (2.77), and the Fejer Convergence Theorem
(Prop. A.4.6 in Appendix A). From part (a) of that theorem it follows that
{xk} is bounded, and hence it has a limit point x, which must belong to X*
as shown earlier. Using this fact and part (b) of the theorem, it follows that
{ xk} converges to x.
The proof for the case of the stepsize rule (ii) is similar. Using the
assumptions $\alpha_k \to 0$ and $\sum_{k=0}^\infty \alpha_k = \infty$, and the descent inequality, we show
that $\nabla f(x_k) \to 0$, that $\{f(x_k)\}$ converges, and that Eq. (2.76) holds. From
that v' f(xk) -+ 0, that {f(xk)} converges, and that Eq. (2.76) holds. From
this point, the preceding proof applies.

2.6 (Convergence of Gradient Method with Errors [BeT00])

Consider the problem of unconstrained minimization of a differentiable func-
tion $f : \Re^n \mapsto \Re$. Let $\{x_k\}$ be a sequence generated by the method
$$x_{k+1} = x_k - \alpha_k \big(\nabla f(x_k) + w_k\big),$$
where $\alpha_k$ is a positive stepsize, and $w_k$ is an error vector satisfying for some
positive scalars p and q,
$$\|w_k\| \leq \alpha_k \big(q + p \|\nabla f(x_k)\|\big), \qquad k = 0, 1, \ldots. \tag{2.78}$$
Assume that for some constant $L > 0$, we have
$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \qquad \forall\ x, y \in \Re^n,$$
and that
$$\sum_{k=0}^\infty \alpha_k = \infty, \qquad \sum_{k=0}^\infty \alpha_k^2 < \infty. \tag{2.79}$$
Show that either $f(x_k) \to -\infty$ or else $f(x_k)$ converges to a finite value
and $\lim_{k\to\infty} \nabla f(x_k) = 0$. Furthermore, every limit point $\bar x$ of $\{x_k\}$ satis-
fies $\nabla f(\bar x) = 0$. Abbreviated Proof: The descent inequality (2.70) yields
$$f(x_{k+1}) \leq f(x_k) - \alpha_k \nabla f(x_k)'\big(\nabla f(x_k) + w_k\big) + \frac{\alpha_k^2 L}{2} \big\|\nabla f(x_k) + w_k\big\|^2.$$
Using Eq. (2.78), we have
$$\begin{aligned} -\nabla f(x_k)'\big(\nabla f(x_k) + w_k\big) &\leq -\|\nabla f(x_k)\|^2 + \|\nabla f(x_k)\|\, \|w_k\| \\ &\leq -\|\nabla f(x_k)\|^2 + \alpha_k q \|\nabla f(x_k)\| + \alpha_k p \|\nabla f(x_k)\|^2, \end{aligned}$$
and
$$\begin{aligned} \tfrac{1}{2} \big\|\nabla f(x_k) + w_k\big\|^2 &\leq \|\nabla f(x_k)\|^2 + \|w_k\|^2 \\ &\leq \|\nabla f(x_k)\|^2 + \alpha_k^2 \big(q^2 + 2pq \|\nabla f(x_k)\| + p^2 \|\nabla f(x_k)\|^2\big). \end{aligned}$$
Combining the preceding three relations and collecting terms, it follows that
$$f(x_{k+1}) \leq f(x_k) - \alpha_k\big(1 - \alpha_k L - \alpha_k p - \alpha_k^3 p^2 L\big) \|\nabla f(x_k)\|^2 + \alpha_k^2\big(q + 2\alpha_k^2 pqL\big) \|\nabla f(x_k)\| + \alpha_k^4 q^2 L.$$

Since $\alpha_k \to 0$, we have for some positive constants c and d, and all k suffi-
ciently large,
$$f(x_{k+1}) \leq f(x_k) - \frac{\alpha_k}{2} \|\nabla f(x_k)\|^2 + c\,\alpha_k^2 \|\nabla f(x_k)\| + d\,\alpha_k^2.$$

Using the inequality $\|\nabla f(x_k)\| \leq 1 + \|\nabla f(x_k)\|^2$, the above relation yields
for all k sufficiently large
$$f(x_{k+1}) \leq f(x_k) - \Big(\frac{\alpha_k}{2} - c\,\alpha_k^2\Big) \|\nabla f(x_k)\|^2 + (c + d)\,\alpha_k^2.$$
By applying the Supermartingale Convergence Theorem (Prop. A.4.4 in Ap-
pendix A), using also the assumption (2.79), it follows that either $f(x_k) \to
-\infty$ or else $f(x_k)$ converges to a finite value and $\sum_{k=0}^\infty \alpha_k \|\nabla f(x_k)\|^2 < \infty$.
In the latter case, in view of the assumption $\sum_{k=0}^\infty \alpha_k = \infty$, we must have
$\liminf_{k\to\infty} \|\nabla f(x_k)\| = 0$. This implies that $\nabla f(x_k) \to 0$; for a detailed proof
of this last step see [BeT00]. This reference also provides a stochastic version
of the result of this exercise. This result, however, requires a different line of
proof, which does not rely on supermartingale convergence arguments.

2.7 (Steepest Descent Direction for Nondifferentiable Cost Functions [BeM71])

Let $f : \Re^n \mapsto \Re$ be a convex function, and let us view the steepest descent
direction at x as the solution of the problem
$$\text{minimize } f'(x; d) \quad \text{subject to } \|d\| \leq 1. \tag{2.80}$$

Show that this direction is $-g^*$, where $g^*$ is the vector of minimum norm in
$\partial f(x)$. Abbreviated Solution: From Prop. 5.4.8 in Appendix B, $f'(x; \cdot)$ is the
support function of the nonempty and compact subdifferential $\partial f(x)$, i.e.,
$$f'(x; d) = \max_{g \in \partial f(x)} d'g, \qquad \forall\ x, d \in \Re^n.$$
Since the sets $\{d \mid \|d\| \leq 1\}$ and $\partial f(x)$ are convex and compact, and the
function $d'g$ is linear in each variable when the other variable is fixed, by the
Saddle Point Theorem of Prop. 5.5.3 in Appendix B, it follows that
$$\min_{\|d\| \leq 1}\ \max_{g \in \partial f(x)} d'g = \max_{g \in \partial f(x)}\ \min_{\|d\| \leq 1} d'g,$$

and that a saddle point exists. For any saddle point $(d^*, g^*)$, $g^*$ maximizes
the function $\min_{\|d\| \leq 1} d'g = -\|g\|$ over $\partial f(x)$, so $g^*$ is the unique vector of
minimum norm in $\partial f(x)$. Moreover, $d^*$ minimizes $\max_{g \in \partial f(x)} d'g$, or equiva-
lently $f'(x; d)$ [by Eq. (2.80)], subject to $\|d\| \leq 1$ (so it is a direction of steepest
descent), and minimizes $d'g^*$ subject to $\|d\| \leq 1$, so it has the form
$$d^* = -\frac{g^*}{\|g^*\|}$$
[except if $0 \in \partial f(x)$, in which case $d^* = 0$].



2.8 (Two-Metric Projection Methods for Bound Constraints [Ber82a], [Ber82b])
Consider the minimization of a continuously differentiable function $f : \Re^n \mapsto
\Re$ over the set
$$X = \big\{ (x^1, \ldots, x^n) \mid \underline b^i \leq x^i \leq \bar b^i,\ i = 1, \ldots, n \big\},$$
where $\underline b^i$ and $\bar b^i$, $i = 1, \ldots, n$, are given scalars. The two-metric projection
method for this problem has the form
$$x_{k+1} = P_X\big(x_k - \alpha_k D_k \nabla f(x_k)\big),$$
where Dk is a positive definite symmetric matrix.
(a) Construct an example of f and $D_k$, where $x_k$ does not minimize f over
X and $f(x_{k+1}) > f(x_k)$ for all $\alpha_k > 0$.
(b) For given $x_k \in X$, let $I_k = \{i \mid x_k^i = \underline b^i$ with $\partial f(x_k)/\partial x^i > 0$, or $x_k^i =
\bar b^i$ with $\partial f(x_k)/\partial x^i < 0\}$. Assume that $D_k$ is diagonal with respect to
$I_k$ in the sense that $(D_k)_{ij} = (D_k)_{ji} = 0$ for all $i \in I_k$ and $j = 1, \ldots, n$,
with $j \neq i$. Show that if $x_k$ is not optimal, there exists $\bar\alpha_k > 0$ such
that $f(x_{k+1}) < f(x_k)$ for all $\alpha_k \in (0, \bar\alpha_k]$.
(c) Assume that the nondiagonal portion of $D_k$ is the inverse of the corre-
sponding portion of $\nabla^2 f(x_k)$. Argue informally that the method can
be reasonably expected to have superlinear convergence rate.

2.9 (Incremental Methods - Computational Exercise)

This exercise deals with the (perhaps approximate) solution of a system of
linear inequalities $c_i'x \leq b_i$, $i = 1, \ldots, m$, where $c_i \in \Re^n$ and $b_i \in \Re$ are given.
(a) Consider a variation of the Kaczmarz algorithm that operates in cycles
as follows. At the end of cycle k, we set $x_{k+1} = \psi_{m,k}$, where $\psi_{m,k}$ is
obtained after the m steps
$$\psi_{i,k} = \psi_{i-1,k} - \frac{\alpha_k}{\|c_i\|^2} \max\big\{0,\ c_i'\psi_{i-1,k} - b_i\big\}\, c_i, \qquad i = 1, \ldots, m,$$
starting with $\psi_{0,k} = x_k$. Show that the algorithm can be viewed as an
incremental gradient method for a suitable differentiable cost function.
(b) Implement the algorithm of (a) for two examples where n = 2 and
m = 100. In the first example, the vectors $c_i$ have the form $c_i = (\xi_i, \zeta_i)$,
where $\xi_i$, $\zeta_i$, as well as $b_i$, are chosen randomly and independently
from $[-100, 100]$ according to a uniform distribution. In the second
example, the vectors $c_i$ have the form $c_i = (\xi_i, \zeta_i)$, where $\xi_i$, $\zeta_i$ are
chosen randomly and independently within $[-10, 10]$ according to a
uniform distribution, while $b_i$ is chosen randomly and independently
within $[0, 1000]$ according to a uniform distribution.
different starting points and stepsize choices, and deterministic and
randomized orders of selection of the indexes i for iteration. Explain
your experimental results in terms of the theoretical behavior described
in Section 2.1.

2.10 (Convergence of the Incremental Gradient Method)

Consider the minimization of a cost function
$$f(x) = \sum_{i=1}^m f_i(x),$$
where $f_i : \Re^n \mapsto \Re$ are continuously differentiable, and let $\{x_k\}$ be a se-
quence generated by the incremental gradient method. Assume that for some
constants L, C, D, and all $i = 1, \ldots, m$, we have
$$\|\nabla f_i(x) - \nabla f_i(y)\| \leq L \|x - y\|, \qquad \forall\ x, y \in \Re^n,$$
and
$$\|\nabla f_i(x)\| \leq C + D \|\nabla f(x)\|, \qquad \forall\ x \in \Re^n.$$
Assume also that
$$\sum_{k=0}^\infty \alpha_k = \infty, \qquad \sum_{k=0}^\infty \alpha_k^2 < \infty.$$

Show that either f(xk) -+ -oo or else f (xk) converges to a finite value
and limk-+oo v'f(xk) = 0. Furthermore, every limit point x of {xk} satis-
fies $\nabla f(\bar x) = 0$. Abbreviated Solution: The idea is to view the incremental
gradient method as a gradient method with errors, so that the result of Ex-
ercise 2.6 can be used. For simplicity we assume that m = 2. The proof is
similar when m > 2. We have
$$\psi_k = x_k - \alpha_k \nabla f_1(x_k), \qquad x_{k+1} = \psi_k - \alpha_k \nabla f_2(\psi_k).$$
By adding these two relations, we obtain
$$x_{k+1} = x_k - \alpha_k \big(\nabla f(x_k) + w_k\big),$$
where
$$w_k = \nabla f_2(\psi_k) - \nabla f_2(x_k).$$
We have
$$\|w_k\| \leq L \|\psi_k - x_k\| = \alpha_k L \|\nabla f_1(x_k)\| \leq \alpha_k L \big(C + D \|\nabla f(x_k)\|\big).$$
Thus Exercise 2.6 applies and the result follows.



2.11 (Convergence Rate of the Kaczmarz Algorithm with Random Projection [StV09])

Consider a consistent system of linear equations $c_i'x = b_i$, $i = 1, \ldots, m$, and
assume for convenience that the vectors $c_i$ have been scaled so that $\|c_i\| = 1$
for all i. A randomized version of the Kaczmarz method is given by
$$x_{k+1} = x_k - (c_{i_k}'x_k - b_{i_k})\, c_{i_k},$$
where $i_k$ is an index randomly chosen from the set $\{1, \ldots, m\}$ with equal
probabilities 1/m, independently of previous choices. Let $P(x)$ denote the
Euclidean projection of a vector $x \in \Re^n$ onto the set of solutions of the
system, and let C be the matrix whose rows are $c_1, \ldots, c_m$. Show that
$$E\big\{\|x_k - P(x_k)\|^2\big\} \leq \Big(1 - \frac{\lambda_{\min}}{m}\Big)^{\!k} \|x_0 - P(x_0)\|^2,$$
where $\lambda_{\min}$ is the minimum eigenvalue of the matrix $C'C$. Hint: Show that
$$\|x_{k+1} - P(x_{k+1})\|^2 \leq \|x_{k+1} - P(x_k)\|^2 = \|x_k - P(x_k)\|^2 - (c_{i_k}'x_k - b_{i_k})^2,$$
and take conditional expectation of both sides to show that
$$\begin{aligned} E\big\{\|x_{k+1} - P(x_{k+1})\|^2 \mid x_k\big\} &\leq \|x_k - P(x_k)\|^2 - \frac{1}{m} \|Cx_k - b\|^2 \\ &\leq \Big(1 - \frac{\lambda_{\min}}{m}\Big) \|x_k - P(x_k)\|^2. \end{aligned}$$

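A short simulation of the randomized iteration above (with assumed random data) can be used to compare the observed squared distance to the solution set with the $(1 - \lambda_{\min}/m)^k$ bound; note that the bound applies to the expected value, so a single run only tracks it approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 10                                    # assumed dimensions
C = rng.standard_normal((m, n))
C /= np.linalg.norm(C, axis=1, keepdims=True)    # scale rows so ||c_i|| = 1
x_true = rng.standard_normal(n)
b = C @ x_true                                   # consistent system; unique solution

lam_min = np.linalg.eigvalsh(C.T @ C).min()
num_iters = 500
x = np.zeros(n)
for k in range(num_iters):
    i = rng.integers(m)                          # i_k uniform over {1,...,m}
    x = x - (C[i] @ x - b[i]) * C[i]             # randomized Kaczmarz step

# The solution set here is the single point x_true, so P(x) = x_true.
print(np.linalg.norm(x - x_true) ** 2)
print((1 - lam_min / m) ** num_iters * np.linalg.norm(x_true) ** 2)
```
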
2.12 (Limit Cycle of Incremental Gradient Method [Luo91])

Consider the scalar least squares problem
$$\text{minimize } \tfrac{1}{2}\big((b_1 - x)^2 + (b_2 - x)^2\big) \quad \text{subject to } x \in \Re,$$
where $b_1$ and $b_2$ are given scalars, and the incremental gradient algorithm
that generates $x_{k+1}$ from $x_k$ according to
$$x_{k+1} = \psi_k - \alpha(\psi_k - b_2),$$
where
$$\psi_k = x_k - \alpha(x_k - b_1),$$
and $\alpha$ is a positive stepsize. Assuming that $\alpha < 1$, show that $\{x_k\}$ and $\{\psi_k\}$
converge to limits $x(\alpha)$ and $\psi(\alpha)$, respectively. However, unless $b_1 = b_2$,
$x(\alpha)$ and $\psi(\alpha)$ are neither equal to each other, nor equal to the least squares
solution $x^* = (b_1 + b_2)/2$. Verify that
$$\lim_{\alpha \to 0} x(\alpha) = \lim_{\alpha \to 0} \psi(\alpha) = x^*.$$
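
A few lines of code make the limit cycle visible; the values of $b_1$, $b_2$, and $\alpha$ below are arbitrary illustrative choices.

```python
import numpy as np

b1, b2, alpha = 0.0, 1.0, 0.1      # arbitrary illustrative values
x = 5.0
for k in range(1000):
    psi = x - alpha * (x - b1)     # step using the first component
    x = psi - alpha * (psi - b2)   # step using the second component

print(x, psi, (b1 + b2) / 2)       # x(alpha) and psi(alpha) differ from x* = 0.5
```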

2.13 (Convergence of Incremental Gradient Method for Linear Least Squares Problems)

Consider the linear least squares problem of minimizing
$$f(x) = \frac{1}{2} \sum_{i=1}^m \|z_i - C_i x\|^2$$
over $x \in \Re^n$, where the vectors $z_i$ and the matrices $C_i$ are given. Let $x_k$ be
the vector at the start of cycle k of the incremental gradient method that
operates in cycles where components are selected according to a fixed order.
Thus we have
$$x_{k+1} = x_k + \alpha_k \sum_{i=1}^m C_i'(z_i - C_i \psi_{i-1}),$$
where $\psi_0 = x_k$ and
$$\psi_i = \psi_{i-1} + \alpha_k C_i'(z_i - C_i \psi_{i-1}), \qquad i = 1, \ldots, m.$$
Assume that $\sum_{i=1}^m C_i'C_i$ is a positive definite matrix and let $x^*$ be the optimal
solution. Then:
(a) There exists $\bar\alpha > 0$ such that if $\alpha_k$ is equal to some constant $\alpha \in (0, \bar\alpha]$
for all k, $\{x_k\}$ converges to some vector $x(\alpha)$. Furthermore, the error
$\|x_k - x(\alpha)\|$ converges to 0 linearly. In addition, we have $\lim_{\alpha \to 0} x(\alpha) =
x^*$. Hint: Show that the mapping that produces $x_{k+1}$ starting from $x_k$
is a contraction mapping for $\alpha$ sufficiently small.
(b) If $\alpha_k > 0$ for all k, and
$$\alpha_k \to 0, \qquad \sum_{k=0}^\infty \alpha_k = \infty,$$
then $\{x_k\}$ converges to $x^*$. Hint: Use Prop. A.4.3 of Appendix A.


Note: The ideas of this exercise are due to [Luo91]. For a complete solution,
see [BeT96], Section 3.2, or [Ber99], Section 1.5.

2.14 (Linear Convergence Rate of Incremental Gradient Method [Ber99], [NeB00])

This exercise quantifies the rate of convergence of the incremental gradient
method to the "region of confusion" (cf. Fig. 2.1.11), for any order of process-
ing the additive cost components, assuming these components are positive
definite quadratic. Consider the incremental gradient method
$$x_{k+1} = x_k - \alpha \nabla f_k(x_k), \qquad k = 0, 1, \ldots,$$
where $f_0, f_1, \ldots$, are quadratic functions with eigenvalues lying within some
interval $[\gamma, \Gamma]$, where $\gamma > 0$. Suppose that for a given $\epsilon > 0$, there is a vector
$x^*$ such that
$$\|\nabla f_k(x^*)\| \leq \epsilon, \qquad \forall\ k = 0, 1, \ldots.$$

Show that for all $\alpha$ with $0 < \alpha \leq 2/(\gamma + \Gamma)$, the generated sequence $\{x_k\}$
converges to a $2\epsilon/\gamma$-neighborhood of $x^*$, i.e.,
$$\limsup_{k \to \infty} \|x_k - x^*\| \leq \frac{2\epsilon}{\gamma}.$$
Moreover the rate of convergence to this neighborhood is linear, in the sense
that
$$\|x_k - x^*\| > \frac{2\epsilon}{\gamma} \quad \Longrightarrow \quad \|x_{k+1} - x^*\| < \Big(1 - \frac{\alpha\gamma}{2}\Big) \|x_k - x^*\|,$$
while
$$\|x_k - x^*\| \leq \frac{2\epsilon}{\gamma} \quad \Longrightarrow \quad \|x_{k+1} - x^*\| \leq \frac{2\epsilon}{\gamma}.$$
Hint: Let $f_k(x) = \tfrac{1}{2}x'Q_kx - b_k'x$, where $Q_k$ is positive definite symmetric,
and write
$$x_{k+1} - x^* = (I - \alpha Q_k)(x_k - x^*) - \alpha \nabla f_k(x^*).$$
For other related convergence rate results, see [NeB00] and [Sch14a].

2.15 (Proximal Gradient Method, $\ell_1$-Regularization, and the Shrinkage Operation)

The proximal gradient iteration (2.27) is well suited for problems involving a
nondifferentiable function component that is convenient for a proximal iter-
ation. This exercise considers the important case of the $\ell_1$ norm. Consider
the problem
$$\text{minimize } f(x) + \gamma \|x\|_1 \quad \text{subject to } x \in \Re^n,$$
where $f : \Re^n \mapsto \Re$ is a differentiable convex function, $\|\cdot\|_1$ is the $\ell_1$ norm,
and $\gamma > 0$. The proximal gradient iteration is given by the gradient step
$$z_k = x_k - \alpha \nabla f(x_k),$$
followed by the proximal step
$$x_{k+1} \in \arg\min_{x \in \Re^n} \Big\{ \gamma \|x\|_1 + \frac{1}{2\alpha} \|x - z_k\|^2 \Big\},$$
[cf. Eq. (2.28)]. Show that the proximal step can be performed separately for
each coordinate $x^i$ of x, and is given by the so-called shrinkage operation:
$$x^i_{k+1} = \begin{cases} z^i_k - \gamma\alpha & \text{if } z^i_k > \gamma\alpha, \\ 0 & \text{if } |z^i_k| \leq \gamma\alpha, \\ z^i_k + \gamma\alpha & \text{if } z^i_k < -\gamma\alpha, \end{cases} \qquad i = 1, \ldots, n.$$
Note: Since the shrinkage operation tends to set many coordinates $x^i_{k+1}$ to
0, it tends to produce "sparse" iterates.
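
A minimal sketch of the proximal gradient iteration with the shrinkage step, for an assumed quadratic smooth part (the data and stepsize choice are illustrative, not part of the exercise statement):

```python
import numpy as np

def shrink(z, t):
    """Coordinatewise shrinkage: proximal step for t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Assumed smooth part: f(x) = 1/2 ||Bx - d||^2
rng = np.random.default_rng(0)
B = rng.standard_normal((30, 10))
d = rng.standard_normal(30)
gamma = 2.0
alpha = 1.0 / np.linalg.eigvalsh(B.T @ B).max()   # stepsize of order 1/L

x = np.zeros(10)
for k in range(500):
    z = x - alpha * (B.T @ (B @ x - d))           # gradient step
    x = shrink(z, gamma * alpha)                  # shrinkage (proximal) step

print(x)   # many coordinates are typically exactly zero
```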

2.16 (Determining Feasibility of Nonlinear Inequalities by Exponential Smoothing, [Ber82a], p. 314, [Sch82])

Consider the problem of finding a solution of a system of inequality constraints
$$g_i(x) \leq 0, \qquad i = 1, \ldots, m,$$
where $g_i : \Re^n \mapsto \Re$ are convex functions. A smoothing method based on the
exponential penalty function is to minimize instead
$$f_{c,\lambda}(x) = \frac{1}{c} \ln \sum_{i=1}^m \lambda^i e^{c g_i(x)},$$
where $c > 0$ is some scalar, and the scalars $\lambda^i$, $i = 1, \ldots, m$, are such that
$$\sum_{i=1}^m \lambda^i = 1, \qquad \lambda^i > 0, \quad i = 1, \ldots, m.$$

(a) Show that if the system is feasible (or strictly feasible) the optimal
value is nonpositive (or strictly negative, respectively). If the system is
infeasible, then
$$\lim_{c \to \infty}\ \inf_{x \in \Re^n} f_{c,\lambda}(x) = \inf_{x \in \Re^n} \max\big\{g_1(x), \ldots, g_m(x)\big\}.$$

(b) (Computational Exercise) Apply the incremental gradient method and the incremental Newton method for minimizing $\sum_{i=1}^m e^{c\, g_i(x)}$ [which is equivalent to minimizing $f_{c,\lambda}(x)$ with $\lambda_i = 1/m$], for the case
$g_i(x) = c_i'x - b_i, \qquad i = 1, \ldots, m,$
where $(c_i, b_i)$ are randomly generated as in the two problems of Exercise 2.9(b). Experiment with different starting points and stepsize choices, and deterministic and randomized orders of selection of the indexes $i$ for iteration.
(c) Repeat part (b) where the problem is instead to minimize $f(x) = \max_{i=1,\ldots,m} g_i(x)$ and the exponential smoothing method of Section 2.2.5 is used, possibly with the augmented Lagrangian update (2.67) of $\lambda$. Compare three methods of operation: (1) $c$ is kept constant and $\lambda$ is updated, (2) $c$ is increased to $\infty$ and $\lambda$ is kept constant, and (3) $c$ is increased to $\infty$ and $\lambda$ is updated.
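For experimentation along these lines, the following is a minimal Python sketch that evaluates the exponential penalty function and its gradient (via a numerically stable log-sum-exp) for affine constraints, and runs plain gradient descent on it; the data, stepsize, and constants are illustrative assumptions, not values prescribed by the exercise.

```python
import numpy as np

def smoothed_max(x, C, b, c, lam):
    # f_{c,lambda}(x) = (1/c) * log( sum_i lam_i * exp(c * g_i(x)) )
    # for affine constraints g_i(x) = c_i'x - b_i, with the usual
    # log-sum-exp shift for numerical stability.
    t = c * (C @ x - b) + np.log(lam)
    tmax = t.max()
    return (tmax + np.log(np.exp(t - tmax).sum())) / c

def smoothed_max_grad(x, C, b, c, lam):
    # Gradient is a convex combination of the constraint gradients c_i,
    # weighted by softmax(c * g_i(x) + log lam_i).
    t = c * (C @ x - b) + np.log(lam)
    w = np.exp(t - t.max())
    w /= w.sum()
    return C.T @ w

# Tiny illustrative instance.
rng = np.random.default_rng(1)
C, b = rng.standard_normal((100, 2)), rng.uniform(0, 10, 100)
lam = np.full(100, 1.0 / 100)

x = np.zeros(2)
for _ in range(500):                      # plain gradient descent on f_{c,lam}
    x -= 0.05 * smoothed_max_grad(x, C, b, c=10.0, lam=lam)
print(smoothed_max(x, C, b, c=10.0, lam=lam), np.max(C @ x - b))
```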
3

Subgradient Methods

Contents

3.1. Subgradients of Convex Real-Valued Functions p. 136


3.1.1. Characterization of the Subdifferential . . p. 146
3.2. Convergence Analysis of Subgradient Methods p. 148
3.3. ε-Subgradient Methods . . . . . . . . . . p. 162
3.3.1. Connection with Incremental Subgradient Methods p. 166
3.4. Notes, Sources, and Exercises . . . . . . . . . . . p. 167


In this chapter we discuss subgradient methods for minimizing a real-valued convex function $f : \Re^n \mapsto \Re$ over a closed convex set $X$. The simplest form of a subgradient method is given by
$x_{k+1} = P_X(x_k - \alpha_k g_k),$
where $g_k$ is any subgradient of $f$ at $x_k$, $\alpha_k$ is a positive stepsize, and $P_X(\cdot)$ denotes Euclidean projection on the set $X$. Note the similarity with the gradient projection iteration of Section 2.1.2: at points $x_k$ where $f$ is differentiable, $\nabla f(x_k)$ is the unique subgradient, and the two iterations are identical.†
We first review in Section 3.1 the theory of subgradients of real-
valued convex functions, with an emphasis on properties of algorithmic
significance. We also discuss the computation of a subgradient for func-
tions arising in duality and minimax contexts. In Section 3.2 we provide a
convergence analysis of the principal form of subgradient method, with a
variety of stepsize rules. In Section 3.3, we discuss variants of subgradient
methods involving approximations of various kinds.

3.1 SUBGRADIENTS OF CONVEX REAL-VALUED FUNCTIONS

Given a proper convex function f : ~n r-+ (-oo, oo], we say that a vector
g E ~n is a subgradient of f at a point x E dom(f) if

$f(z) \ge f(x) + g'(z - x), \qquad \forall\, z \in \Re^n;$   (3.1)

see Fig. 3.1.1. The set of all subgradients of f at x E ~n is called the


subdifferential of f at x, and is denoted by af(x). For x (j. dom(f) we
use the convention af(x) = 0. Figure 3.1.2 provides some examples of
subdifferentials. Note that af ( x) is a closed convex set, since based on Eq.
(3.1), it is the intersection of a collection of closed halfspaces (one for each
z E ~n).
It is generally true that af(x) is nonempty for all x E ri(dom(f)),
the relative interior of the domain off, but it is possible that af(x) = 0
at some points in the relative boundary of dom(f). The properties of
subgradients of extended real-valued functions are summarized in Section

t Since the gradient projection method can be viewed as the subgradient


method applied to a differentiable cost function, the analysis of the present section
also applies to gradient projection. However, when f is differentiable, the stepsize
ak is often chosen so that the cost function value is reduced at each iteration,

thereby allowing a more powerful analysis, based on a cost function descent


approach, and improved convergence and rate of convergence results. We will
postpone this analysis to Section 6.1.


Figure 3.1.1. Illustration of the definition of a subgradient. The subgradient


inequality (3.1) can be written as

$f(z) - g'z \ge f(x) - g'x, \qquad \forall\, z \in \Re^n.$

Thus, $g$ is a subgradient of $f$ at $x$ if and only if the hyperplane in $\Re^{n+1}$ that has normal $(-g, 1)$ and passes through $\big(x, f(x)\big)$ supports the epigraph of $f$, as shown in the figure.

5.4 of Appendix B. When f is real-valued, however, stronger results can


be shown: 8f(x) is not only closed and convex, but also nonempty and
compact for all x E ~n. Moreover the proofs of this and other related
results are generally simpler than for the extended real-valued case. For
this reason, we will provide an independent development of the results that
we need for the case where f is real-valued (which is the primary case of
interest in algorithms).
To this end, we recall the definition of the directional derivative of f
at a point x in a direction d:

$f'(x; d) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha d) - f(x)}{\alpha}$   (3.2)

(cf. Section 5.4.4 of Appendix B). The ratio on the right-hand side is mono-
tonically nonincreasing to f'(x; d), as shown in Section 5.4.4 of Appendix
B; also see Fig. 3.1.3.
Our first result shows some basic properties, and provides the con-
nection between 8f(x) and f'(x; d) for real-valued f. A related and more
refined result is given in Prop. 5.4.8 in Appendix B for extended real-valued
f. Its proof, however, is more intricate and includes some conditions that
are unnecessary for the case where f is real-valued.

[Graphs of $f(x) = |x|$ and $f(x) = \max\{0, (1/2)(x^2 - 1)\}$, with their subdifferentials $\partial f(x)$ shown below each graph.]

Figure 3.1.2. The subdifferentials of some real-valued scalar convex functions as


a function of the argument x.

Proposition 3.1.1: (Subdifferential and Directional Derivative) Let $f : \Re^n \mapsto \Re$ be a convex function. For every $x \in \Re^n$, the following hold:
(a) The subdifferential $\partial f(x)$ is a nonempty, convex, and compact set, and we have
$f'(x; d) = \max_{g \in \partial f(x)} g'd, \qquad \forall\, d \in \Re^n,$   (3.3)
i.e., $f'(x; \cdot)$ is the support function of $\partial f(x)$. In particular, the directional derivative $f'(x; \cdot)$ is a real-valued convex function.
(b) If $f$ is differentiable at $x$ with gradient $\nabla f(x)$, then $\nabla f(x)$ is its unique subgradient at $x$, and we have $f'(x; d) = \nabla f(x)'d$.

Proof: (a) We first provide a characterization of subgradients. The subgradient inequality (3.1) is equivalent to
$\frac{f(x + \alpha d) - f(x)}{\alpha} \ge g'd, \qquad \forall\, d \in \Re^n,\ \alpha > 0.$

Figure 3.1.3. Illustration of the directional derivative of a convex function $f$. The ratio
$\frac{f(x + \alpha d) - f(x)}{\alpha}$
is monotonically nonincreasing and converges to $f'(x; d)$ as $\alpha \downarrow 0$.

Since the quotient on the left above decreases monotonically to f'(x; d)


as o: .J,. 0, we conclude that the subgradient inequality is equivalent to
f'(x; d) 2': g'd for all d E )Rn:

$g \in \partial f(x) \quad\Longleftrightarrow\quad f'(x; d) \ge g'd, \quad \forall\, d \in \Re^n.$   (3.4)

We next show that $f'(x; \cdot)$ is a real-valued function. For a fixed $d \in \Re^n$, consider the scalar convex function $\phi(\alpha) = f(x + \alpha d)$. From the convexity of $\phi$, we have for all $\alpha > 0$ and $\beta > 0$
$\phi(0) \le \frac{\alpha}{\alpha + \beta}\,\phi(-\beta) + \frac{\beta}{\alpha + \beta}\,\phi(\alpha),$
or equivalently
$f(x) \le \frac{\alpha}{\alpha + \beta}\, f(x - \beta d) + \frac{\beta}{\alpha + \beta}\, f(x + \alpha d).$
This relation can be written as
$\frac{f(x) - f(x - \beta d)}{\beta} \le \frac{f(x + \alpha d) - f(x)}{\alpha}.$
Noting that the quotient in the right-hand side is monotonically nonincreasing with $\alpha$ (cf. Fig. 3.1.3), and taking the limit as $\alpha \downarrow 0$, we see that
$\frac{f(x) - f(x - \beta d)}{\beta} \le f'(x; d) \le \frac{f(x + \alpha d) - f(x)}{\alpha}, \qquad \forall\, \alpha, \beta > 0.$
Thus $f'(x; d)$ is a real number for all $x, d \in \Re^n$.


Next we show that of(x) is a convex and compact set. From Eq.
(3.4), we see that of(x) is the intersection of the closed halfspaces

$\big\{g \mid f'(x; d) \ge g'd\big\},$

where d ranges over the nonzero vectors of ~n. It follows that 8J(x) is
closed and convex. Also it cannot be unbounded, since otherwise, for some
d E ~n, g' d could be made unbounded from above by proper choice of
g E 8J(x), contradicting Eq. (3.4) [since f'(x; ·) was proved to be real-
valued].
Next we show that 8f(x) is nonempty. Take any x and din ~n, and
consider the convex subset of ~n+l

C1 = {(z,w) I w > f(z)},


and the half-line

$C_2 = \big\{(z, w) \mid w = f(x) + \alpha f'(x; d),\ z = x + \alpha d,\ \alpha \ge 0\big\};$

see Fig. 3.1.4. Since the quotient on the right in Eq. (3.2) is monotonically nonincreasing and converges to $f'(x; d)$ as $\alpha \downarrow 0$ (cf. Fig. 3.1.3), we have
$f(x) + \alpha f'(x; d) \le f(x + \alpha d), \qquad \forall\, \alpha \ge 0.$
It follows that the convex sets $C_1$ and $C_2$ are disjoint. By applying the Separating Hyperplane Theorem (Prop. 1.5.2 in Appendix B), we see that there exists a nonzero vector $(\mu, \gamma) \in \Re^{n+1}$ such that
$\gamma w + \mu'z \le \gamma\big(f(x) + \alpha f'(x; d)\big) + \mu'(x + \alpha d), \qquad \forall\, \alpha \ge 0,\ z \in \Re^n,\ w > f(z).$   (3.5)
We cannot have $\gamma > 0$ since then the left-hand side above could be made arbitrarily large by choosing $w$ sufficiently large. Also if $\gamma = 0$, then Eq. (3.5) implies that $\mu = 0$, which is a contradiction, since $(\mu, \gamma)$ must be nonzero. Therefore, $\gamma < 0$ and by dividing with $\gamma$ in Eq. (3.5), we obtain
$w + (z - x)'(\mu/\gamma) \ge f(x) + \alpha f'(x; d) + \alpha(\mu/\gamma)'d, \qquad \forall\, \alpha \ge 0,\ z \in \Re^n,\ w > f(z).$   (3.6)
By taking the limit in the above relation as $\alpha \downarrow 0$ and $w \downarrow f(z)$, we obtain
$f(z) \ge f(x) + (-\mu/\gamma)'(z - x), \qquad \forall\, z \in \Re^n,$
implying that $(-\mu/\gamma) \in \partial f(x)$, so $\partial f(x)$ is nonempty.
Finally, we show that Eq. (3.3) holds. We take $z = x$ and $\alpha = 1$ in Eq. (3.6), and taking the limit as $w \downarrow f(x)$, we obtain that the subgradient $(-\mu/\gamma)$ satisfies
$(-\mu/\gamma)'d \ge f'(x; d).$
Together with Eq. (3.4), this shows Eq. (3.3). The latter equation also implies that $f'(x; \cdot)$ is convex, being the supremum of a collection of linear functions.
(b) From the definition of directional derivative and gradient, we see that
if f is differentiable at x with gradient "v f (x), its directional derivative

is $f'(x; d) = \nabla f(x)'d$. Thus, from Eq. (3.3), $f$ has $\nabla f(x)$ as its unique subgradient at $x$. Q.E.D.

Figure 3.1.4. Illustration of the sets $C_1$ and $C_2$ used in the hyperplane separation argument of the proof of Prop. 3.1.1(a).

Part (b) of the preceding proposition can be used to show that if f


is convex and differentiable, it is continuously differentiable (see Exercise
3.4). The following proposition extends the boundedness property of the
subdifferential and establishes a connection with Lipschitz continuity. It is
given as Prop. 5.4.2 in Appendix B, but in view of its significance for our
algorithmic purposes, we restate it here and we provide a proof.

Proposition 3.1.2: (Subdifferential Boundedness and Lipschitz Continuity) Let $f : \Re^n \mapsto \Re$ be a real-valued convex function, and let $X$ be a nonempty bounded subset of $\Re^n$.
(a) The set $\cup_{x\in X}\,\partial f(x)$ is nonempty and bounded.
(b) The function $f$ is Lipschitz continuous over $X$, i.e.,
$|f(x) - f(z)| \le L\,\|x - z\|, \qquad \forall\, x, z \in X,$
where
$L = \sup_{g \in \cup_{x\in X}\partial f(x)} \|g\|.$

Proof: (a) Nonemptiness follows from Prop. 3.1.l(a). To prove bounded-


ness, assume the contrary, so that there exists a sequence { Xk} c X, and
an unbounded sequence {gk} with
$0 < \|g_k\| < \|g_{k+1}\|, \qquad k = 0, 1, \ldots.$

We denote $d_k = g_k/\|g_k\|$. Since $g_k \in \partial f(x_k)$, we have
$f(x_k + d_k) - f(x_k) \ge g_k'd_k = \|g_k\|.$
Since both {x k} and { dk} are bounded, they contain convergent subse-
quences. We assume without loss of generality that {xk} and {dk} con-
verge to some vectors. Therefore, by the continuity off (cf. Prop. 1.3.11
in Appendix B), the left-hand side of the preceding relation is bounded.
Hence the right-hand side is also bounded, thereby contradicting the un-
boundedness of {gk}.
(b) Let x and z be any two points in X. By the subgradient inequality
(3.1), we have for all g E 8f(x),

f(x) + g'(z - x) :S: f(z),

so that
f(x) - f(z) :S: 11911 · llx - zll :S: Lllx - zll-
By exchanging the roles of x and z, we similarly obtain

f(z) - f(x) :S: L llx - zll,


and by combining the preceding two relations,

IJ(x) - f(z)I :S: L llx - zll-


Q.E.D.

The next proposition provides the analog of the chain rule for sub-
differentials of real-valued convex functions. The proposition is a special
case of more general results that apply to extended real-valued functions
(Props. 5.4.5 and 5.4.6 of Appendix B), but admits a simpler proof.

Proposition 3.1.3:
(a) (Chain Rule): Let F be the composition of a convex function
h : Rm 1-t R and an m x n matrix A,

F(x) = h(Ax), XE Rn.

Then

$\partial F(x) = A'\partial h(Ax) = \big\{A'g \mid g \in \partial h(Ax)\big\}, \qquad x \in \Re^n.$



(b) (Subdifferential of a Sum): Let $F$ be the sum of convex functions $f_i : \Re^n \mapsto \Re$, $i = 1, \ldots, m$,
$F(x) = f_1(x) + \cdots + f_m(x), \qquad x \in \Re^n.$
Then
$\partial F(x) = \partial f_1(x) + \cdots + \partial f_m(x), \qquad x \in \Re^n.$

Proof: (a) Let g E oh(Ax). By Prop. 3.1.l(a), we have

$g'Ad \le h'(Ax; Ad) = F'(x; d),$


where the equality holds using the definition of directional derivative. Hence

(A'g)'d :S: F'(x;d),

and by Prop. 3.1.l(a), A 1g E oF(x), so that A'oh(Ax) c oF(x).


To prove the reverse inclusion, suppose to come to a contradiction,
that there exists g E oF(x) such that g <j. A'oh(Ax). By Prop. 3.1.l(a),
the set oh(Ax) is compact. Since linear transformations preserve convexity
and compactness, the set A'oh(Ax) is also convex and compact, and by
Prop. 1.5.3 in Appendix B, there exists a hyperplane strictly separating
the singleton set {g} from A'oh(Ax), i.e., a vector d and a scalar c such
that
(A'y)'d < C < g'd, Vy E oh(Ax).
From this we obtain
max (Ad)'y < g'd,
yEoh(Ax)

so by using the equation h'(Ax; Ad) = F'(x; d) and Prop. 3.1.l(a),

F'(x;d) = h 1 (Ax;Ad) < g'd.

In view of Prop. 3.1.l(a) this is a contradiction.


(b) We write $F$ as
$F(x) = h(Ax),$
where $h : \Re^{mn} \mapsto \Re$ is the function
$h(x_1, \ldots, x_m) = f_1(x_1) + \cdots + f_m(x_m),$
and $A$ is the $mn \times n$ matrix defined by the equation $Ax = (x, \ldots, x)$. The subdifferential sum formula then follows from part (a). Q.E.D.

The following proposition generalizes the classical optimality condi-


tion for optimization of a differentiable convex function f over a convex set
X:
$\nabla f(x^*)'(x - x^*) \ge 0, \qquad \forall\, x \in X,$

(cf. Prop. 1.1.8 in Appendix B). The proposition can be generalized in turn
to the case where f can be extended real-valued. In this case the optimality
condition requires the additional assumption that ri( dom(f)) n ri(X) i= 0,
or some polyhedral assumption on f and/or X; see Prop. 5.4.7 in Appendix
B, whose proof is simple but requires a more sophisticated version of the
chain rule that applies to extended real-valued functions .

Proposition 3.1.4: (Optimality Condition) A vector $x$ minimizes a convex function $f : \Re^n \mapsto \Re$ over a convex set $X \subset \Re^n$ if and only if there exists a subgradient $g \in \partial f(x)$ such that
$g'(z - x) \ge 0, \qquad \forall\, z \in X.$

Proof: Suppose that g'(z -x) ~ 0 for some g E af(x) and all z EX. Then
from the subgradient inequality (3.1), we have f(z) - f( x ) ~ g'(z - x) for
all z E X, so f( z ) - f( x ) ~ 0 for all z EX, and x minimizes f over X.
Conversely, suppose that $x$ minimizes $f$ over $X$. Consider the set of feasible directions of $X$ at $x$, i.e., the cone
$W = \big\{w \ne 0 \mid x + \alpha w \in X \text{ for some } \alpha > 0\big\},$
and the dual cone
$\hat W = \big\{y \mid y'w \ge 0,\ \forall\, w \in W\big\}$
(this is equal to $-W^*$, the set of all $y$ such that $-y$ belongs to the polar cone $W^*$). If $\partial f(x)$ and $\hat W$ have a point in common, we are done, so to arrive at a contradiction, assume the opposite, i.e., $\partial f(x) \cap \hat W = \emptyset$. Since $\partial f(x)$ is compact [because $f$ is real-valued and Prop. 3.1.1(a) applies] and $\hat W$ is closed, there exists a hyperplane strictly separating $\partial f(x)$ and $\hat W$ (cf. Prop. 1.5.3 in Appendix B), i.e., a vector $d \ne 0$ and a scalar $c$ such that
$g'd < c < y'd, \qquad \forall\, g \in \partial f(x),\ y \in \hat W.$
Since $\hat W$ is a cone, $\inf_{y\in\hat W} y'd$ is either $0$ or $-\infty$. The latter case is impossible, as it would violate the preceding relation. Therefore we have
$c < 0 \le y'd, \qquad \forall\, y \in \hat W,$   (3.7)

Figure 3.1.5. Illustration of the optimality condition of Prop. 3.1.4. In the figure on the left, $f$ is differentiable and the optimality condition is
$-\nabla f(x^*) \in N_X(x^*),$
where $N_X(x^*)$ is the normal cone of $X$ at $x^*$, which is equivalent to
$\nabla f(x^*)'(x - x^*) \ge 0, \qquad \forall\, x \in X.$
In the figure on the right, $f$ is nondifferentiable, and the optimality condition is
$-g \in N_X(x^*)$ for some $g \in \partial f(x^*)$.

which when combined with the preceding inequality, also yields
$\max_{g \in \partial f(x)} g'd < c < 0.$
Thus, using Prop. 3.1.1(a), we have $f'(x; d) < 0$, while from Eq. (3.7), we see that $d$ belongs to the polar cone of $W^*$, which by the Polar Cone Theorem [Prop. 2.2.1(b) in Appendix B] is the closure of $W$. Hence there is a sequence $\{y_k\} \subset W$ that converges to $d$. Since $f'(x; \cdot)$ is a continuous function [being convex and real-valued by Prop. 3.1.1(a)] and $f'(x; d) < 0$, we have $f'(x; y_k) < 0$ for all $k$ after some index, which contradicts the optimality of $x$. Q.E.D.

Figure 3.1.5 illustrates how the optimality condition of Prop. 3.1.4 is


related to the normal cone of X at x, which is denoted by Nx(x ) and is
defined by

$N_X(x) = \big\{g \mid g'(z - x) \le 0,\ \forall\, z \in X\big\}, \qquad x \in X,$



and Nx(x) = 0 if x ff:. X. In particular, the condition states that x


minimizes f over X if and only if there exists g such that
g E 8f(x), -g E Nx(x). (3.8)
Note that the normal cone Nx(x) is the subdifferential of the indicator
function 8x of X at a point x EX. It can thus be seen that the form (3.8)
of the optimality condition for minimization of f over X is the same as the
optimality condition of the Fenchel duality theorem [Prop. 1.2.l(c)], applied
to the function f + 8x. Indeed, if X were assumed closed, we could have
proved Prop. 3.1.4 by appealing to the Fenchel duality theorem. However,
as in the case of earlier results in this section, we have given instead a more
elementary proof that is based on a convex set separation argument.

3.1.1 Characterization of the Subdifferential

The characterization and computation of af (x) may not be convenient in


general. It is, however, possible in some special cases. Principal among
these is when
$f(x) = \sup_{z \in Z} \phi(x, z),$   (3.9)
where x E Rn, z E Rm, ¢ : Rn x Rm H R is a function, Z is a compact
subset of Rm, ¢( ·, z) is convex and differentiable for each z E Z, and
v' x ¢( x, ·) is continuous on Z for each x. Then the form of af (x) is given
by Danskin's Theorem [Dan67], which states that
8f(x) =conv{v'x¢(x,z) I z E Z(x)}, XE Rn, (3.10)
where Z(x) is the set of maximizing points in Eq. (3.9),

$Z(x) = \big\{\bar z \mid \phi(x, \bar z) = \max_{z \in Z} \phi(x, z)\big\}.$


The proof is somewhat long, so it is relegated to the exercises.
An important special case of Eq. (3.10) is when Z is a finite set, so f
is the maximum of m differentiable convex functions ¢1, ... , ¢m:
f (X) = max { ¢1 (x), ... , <Pm (x) } ,
Then we have
8f (x) = conv{v' <Pi(x) I i E I(x)}, (3.11)
where I(x) is the set of indexes i for which the maximum is attained, i.e.,
<Pi(x) = f(x). Another important special case is when ¢(·, z) is differen-
tiable for all z E Z, and the supremum in Eq. (3.9) is attained at a unique
point, so Z(x) consists of a single point z(x). Then f is differentiable at x
and
v'f(x) = v'cp(x,z(x)).
For various related results, see [Ber09], Examples 5.4.3, 5.4.4, 5.4.5.

Computation of Subgradients

Generally, obtaining the entire subdifferential 8f(x) at points x where f is


nondifferentiable may be complicated, as indicated by Eq. (3.10): it may
be difficult to determine all the maximizing points in Eq. (3.9). However,
if we are interested in obtaining a single subgradient, as in the subgradient
algorithms of the next section, the calculation is often much simpler. We
illustrate this with some examples.

Example 3.1.1: (Subgradient Calculation in Minimax Problems)

Let
$f(x) = \sup_{z \in Z} \phi(x, z),$   (3.12)
where $x \in \Re^n$, $z \in \Re^m$, $\phi : \Re^n \times \Re^m \mapsto (-\infty, \infty]$ is a function, and $Z$ is a subset of $\Re^m$. We assume that $\phi(\cdot, z)$ is convex for each $z \in Z$, so $f$ is also convex, being the supremum of convex functions. For a fixed $x \in \text{dom}(f)$, let us assume that $z_x \in Z$ attains the supremum in Eq. (3.12), and that $g_x$ is some subgradient of the function $\phi(\cdot, z_x)$ at $x$, i.e., $g_x \in \partial_x\phi(x, z_x)$. Then by using the subgradient inequality, we have for all $y \in \Re^n$,
$f(y) = \sup_{z \in Z} \phi(y, z) \ge \phi(y, z_x) \ge \phi(x, z_x) + g_x'(y - x) = f(x) + g_x'(y - x),$
i.e., $g_x$ is a subgradient of $f$ at $x$. Thus
$g_x \in \partial f(x).$
This relation provides a convenient method for calculating a single subgradient of $f$ at $x$ with little extra computation, once a maximizer $z_x \in Z$ of $\phi(x, \cdot)$ has been found: we simply use any subgradient in $\partial_x\phi(x, z_x)$.
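For instance, when $Z$ is finite the recipe amounts to evaluating all $\phi(x, z)$, picking a maximizing $z$, and differentiating $\phi(\cdot, z)$ there. A minimal Python sketch for $f(x) = \max_i\{a_i'x + b_i\}$ follows; the data are arbitrary illustrative values.

```python
import numpy as np

def max_affine_value_and_subgradient(x, A, b):
    # f(x) = max_i { a_i'x + b_i }.  A maximizing index i gives the
    # subgradient a_i, cf. Eq. (3.11) with grad phi_i(x) = a_i.
    values = A @ x + b
    i = int(np.argmax(values))
    return values[i], A[i]

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])
fval, g = max_affine_value_and_subgradient(np.array([0.2, -0.3]), A, b)
print(fval, g)
```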

The next example is particularly important in the context of solving


dual problems by subgradient methods. It shows that when calculating
the dual function value at some point, we also obtain a subgradient with
essentially no additional computation.

Example 3.1.2: (Subgradient Calculation in Dual Problems)

Consider the problem

minimize f (x)
subject to x EX, g(x) ::::; 0,
where f : Rn f--t R, g : Rn f--t Rr are given functions, X is a subset of Rn.
Consider the dual problem

maximize q(µ)
subject to µ 2 0,

where $q$ is the concave function
$q(\mu) = \inf_{x \in X}\big\{f(x) + \mu'g(x)\big\}.$
Thus the dual problem involves minimization of the convex function $-q$ over $\mu \ge 0$. Note that in many cases, $q$ is real-valued (for example when $f$ and $g$ are continuous, and $X$ is compact).
For a convenient way to obtain a subgradient of $-q$ at $\mu \in \Re^r$, suppose that $x_\mu$ minimizes the Lagrangian over $x \in X$,
$x_\mu \in \arg\min_{x \in X}\big\{f(x) + \mu'g(x)\big\}.$
Then we claim that $-g(x_\mu)$ is a subgradient of $-q$ at $\mu$, i.e.,
$q(\nu) \le q(\mu) + (\nu - \mu)'g(x_\mu), \qquad \forall\, \nu \in \Re^r.$
This is essentially a special case of the preceding example, and can also be verified directly by writing for all $\nu \in \Re^r$,
$q(\nu) = \inf_{x \in X}\big\{f(x) + \nu'g(x)\big\} \le f(x_\mu) + \nu'g(x_\mu) = f(x_\mu) + \mu'g(x_\mu) + (\nu - \mu)'g(x_\mu) = q(\mu) + (\nu - \mu)'g(x_\mu).$
Thus after computing the function value $q(\mu)$, obtaining a single subgradient typically requires little extra calculation.
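To make this concrete, here is a minimal Python sketch that evaluates $q(\mu)$ and the vector $g(x_\mu)$ for a toy problem in which the Lagrangian minimization over $X$ can be done by enumeration over a finite grid; the problem data are placeholders chosen only for illustration.

```python
import numpy as np

def dual_value_and_constraint(mu, X_points, f, g):
    # q(mu) = min_{x in X} { f(x) + mu' g(x) } over a finite set X.
    # The minimizer x_mu yields g(x_mu); -g(x_mu) is a subgradient of -q at mu.
    lagrangian = [f(x) + mu @ g(x) for x in X_points]
    i = int(np.argmin(lagrangian))
    return lagrangian[i], g(X_points[i])

# Toy instance: X is a finite grid, one inequality constraint g(x) <= 0.
X_points = [np.array([a, b]) for a in np.linspace(-1, 1, 21)
                             for b in np.linspace(-1, 1, 21)]
f = lambda x: (x[0] - 1.0)**2 + (x[1] - 1.0)**2
g = lambda x: np.array([x[0] + x[1] - 1.0])

mu = np.array([0.5])
q_mu, g_x_mu = dual_value_and_constraint(mu, X_points, f, g)
print(q_mu, g_x_mu)   # -g(x_mu) can be fed to a subgradient method for -q
```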

Let us finally mention another important differentiation formula that


is based on an optimization operation. For any closed proper convex f :
Rn f--t ( -oo, oo] and its conjugate f*, we have

$\partial f(x) = \arg\max_{y \in \Re^n}\big\{y'x - f^*(y)\big\}, \qquad \forall\, x \in \Re^n,$
$\partial f^*(y) = \arg\max_{x \in \Re^n}\big\{x'y - f(x)\big\}, \qquad \forall\, y \in \Re^n.$

This follows from the Conjugate Subgradient Theorem (see Props. 5.4.3
and 5.4.4 of Appendix B). Thus a subgradient off at a given x can be
obtained by finding a solution to a maximization problem that involves f*.

3.2 CONVERGENCE ANALYSIS OF SUBGRADIENT METHODS

In this section we consider subgradient methods for minimizing a real-


valued convex function f : Rn f--t R over a closed convex set X. In partic-
ular, we focus on methods of the form

$x_{k+1} = P_X(x_k - \alpha_k g_k),$   (3.13)

where 9k is any subgradient off at Xk, ak is a positive stepsize, and Px(·)


denotes projection on the set X (with respect to the standard Euclidean
norm). An important fact here is that by projecting (xk - O'.k9k) on X, we
do not increase the distance to any feasible point, and hence also to any
optimal solution, i.e.,
$\|x_{k+1} - x\| \le \|x_k - \alpha_k g_k - x\|, \qquad \forall\, x \in X.$

This is a consequence of the following basic property of the projection.

Proposition 3.2.1: (Nonexpansiveness of the Projection) Let


X be a nonempty closed convex set. We have

$\|P_X(x) - P_X(y)\| \le \|x - y\|, \qquad \forall\, x, y \in \Re^n.$   (3.14)

Proof: From the Projection Theorem (Prop. 1.1.9 in Appendix B),

(z - Px(x))' (x - Px(x)) ::::; 0, \/zEX.

Letting z = Px (y) in this relation, we obtain

(Px(y) - Px(x))' (x - Px(x)) ::::; 0.

Similarly,
(Px(x) - Px(y))' (y - Px(y)) ::::; 0.
By adding these two inequalities, we see that

(Px(y) - Px(x))' (x - Px(x) - y + Px(y)) ::::; 0.

By rearranging and by using the Schwarz inequality, we have

11Px(y)-Px(x)ll 2 ::::; (Px(y)-Px(x))' (y-x) ::::; IIPx(y)-Px(x)ll · IIY-xll,

from which the result follows. Q.E.D.

Another important characteristic of the subgradient method (3.13) is


that the new iterate may not improve the cost for any value of the stepsize;
i.e., for some $k$, we may have
$f\big(P_X(x_k - \alpha g_k)\big) > f(x_k), \qquad \forall\, \alpha > 0$
(see Fig. 3.2.1). However, if the stepsize is small enough, the distance of
the current iterate to the optimal solution set is reduced (this is illustrated

Figure 3.2.1. Illustration of how the subgradient method iterate may not improve the cost function with a particular choice of subgradient $g_k$, regardless of the value of the stepsize $\alpha_k$.

in Fig. 3.2.2). Part (b) of the following proposition provides a formal


proof of the distance reduction property and an estimate for the range of
appropriate stepsizes.

Proposition 3.2.2: Let $\{x_k\}$ be the sequence generated by the subgradient method (3.13). Then, for all $y \in X$ and $k \ge 0$:
(a) We have
$\|x_{k+1} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\big(f(x_k) - f(y)\big) + \alpha_k^2\|g_k\|^2.$
(b) If $f(y) < f(x_k)$, we have
$\|x_{k+1} - y\| < \|x_k - y\|,$
for all stepsizes $\alpha_k$ such that
$0 < \alpha_k < \frac{2\big(f(x_k) - f(y)\big)}{\|g_k\|^2}.$
Proof: (a) Using the nonexpansion property of the projection [cf. Eq.
(3.14)], we obtain for ally E X and k,

$\|x_{k+1} - y\|^2 = \|P_X(x_k - \alpha_k g_k) - y\|^2 \le \|x_k - \alpha_k g_k - y\|^2 = \|x_k - y\|^2 - 2\alpha_k g_k'(x_k - y) + \alpha_k^2\|g_k\|^2 \le \|x_k - y\|^2 - 2\alpha_k\big(f(x_k) - f(y)\big) + \alpha_k^2\|g_k\|^2,$
where the last step follows from the subgradient inequality.

Figure 3.2.2. Illustration of how, given a nonoptimal $x_k$, the distance to any optimal solution $x^*$ is reduced using a subgradient iteration with a sufficiently small stepsize. The critical fact, which follows from the definition of a subgradient, is that the angle between the negative subgradient $-g_k$ and the vector $x^* - x_k$ is less than $\pi/2$. As a result, if $\alpha_k$ is small enough, the vector $x_k - \alpha_k g_k$ is closer to $x^*$ than $x_k$ is. Through the projection on $X$, $P_X(x_k - \alpha_k g_k)$ gets even closer to $x^*$.

(b) Follows from part (a). Q.E.D.

Part (b) of the preceding proposition suggests the stepsize rule
$\alpha_k = \frac{f(x_k) - f^*}{\|g_k\|^2},$   (3.15)
where $f^*$ is the optimal value (assuming $g_k \ne 0$, otherwise $x_k$ is optimal). This rule selects $\alpha_k$ to be in the middle of the range given by Prop. 3.2.2(b),
$\left(0,\ \frac{2\big(f(x_k) - f(x^*)\big)}{\|g_k\|^2}\right),$
where $x^*$ is any optimal solution, and reduces the distance of the current iterate to $x^*$.
Unfortunately, however, the stepsize (3.15) requires that we know $f^*$, which is rare. In practice, one must either estimate $f^*$ or use some simpler scheme for selecting a stepsize. The simplest possibility is to select $\alpha_k$ to be the same for all $k$, i.e., $\alpha_k = \alpha$ for some $\alpha > 0$. Then, if the subgradients $g_k$ are bounded, i.e., $\|g_k\| \le c$ for some constant $c$ and all $k$, Prop. 3.2.2(a) shows that for all optimal solutions $x^*$, we have
$\|x_{k+1} - x^*\|^2 \le \|x_k - x^*\|^2 - 2\alpha\big(f(x_k) - f^*\big) + \alpha^2 c^2,$
and implies that the distance to $x^*$ decreases if
$0 < \alpha < \frac{2\big(f(x_k) - f^*\big)}{c^2},$
or equivalently, if $x_k$ is outside the level set
$\left\{x \in X \ \Big|\ f(x) \le f^* + \frac{\alpha c^2}{2}\right\};$

Figure 3.2.3. Illustration of a principal convergence property of the subgradient method with a constant stepsize $\alpha$, assuming a bound $c$ on the subgradient norms $\|g_k\|$. When the current iterate is outside the level set
$\left\{x \in X \ \Big|\ f(x) \le f^* + \frac{\alpha c^2}{2}\right\},$
the distance to any optimal solution is reduced at the next iteration. As a result the method gets arbitrarily close to (or inside) this level set.

(see Fig. 3.2.3). Thus, if a is taken to be small enough, the conver-


gence properties of the method are satisfactory. Since a small stepsize
may result in slow initial progress, it is common to use a variant of this
approach whereby we start with moderate stepsize values ak, which are
progressively reduced up to a small positive value a, using some heuristic
scheme. Other possibilities for stepsize choice include a diminishing step-
size, whereby ak-+ 0, and schemes that replace the unknown optimal value
f* in Eq. (3.15) with an estimate. We will consider these stepsize rules in
what follows in the present section.
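The following Python sketch illustrates the basic iteration $x_{k+1} = P_X(x_k - \alpha_k g_k)$ for the simple case where $X$ is a Euclidean ball, with either a constant or a diminishing stepsize; the objective, the set $X$, and all parameter values are placeholders chosen only for illustration.

```python
import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection on X = { x : ||x|| <= radius }.
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def subgradient_method(f, subgrad, x0, num_iters=2000, alpha=None):
    # x_{k+1} = P_X(x_k - alpha_k g_k): constant stepsize if alpha is given,
    # otherwise the diminishing rule alpha_k = 1 / (k + 1).
    x, best = x0.copy(), x0.copy()
    for k in range(num_iters):
        step = alpha if alpha is not None else 1.0 / (k + 1)
        x = project_ball(x - step * subgrad(x))
        if f(x) < f(best):      # track the best point, since f(x_k) need not decrease
            best = x.copy()
    return best

# Illustrative problem: minimize ||x - y||_1 over the unit ball, with y outside the ball.
y = np.array([2.0, -1.5])
f = lambda x: np.abs(x - y).sum()
subgrad = lambda x: np.sign(x - y)      # a subgradient of f at x
print(subgradient_method(f, subgrad, x0=np.zeros(2), alpha=0.01))
```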

Convergence Analysis

We will now discuss the convergence of the subgradient method (3.13).


Throughout our analysis, we denote by { xk} the corresponding generated
sequence, and we denote by f* and X* the optimal value and optimal
solution set, respectively:

$f^* = \inf_{x \in X} f(x), \qquad X^* = \big\{x \in X \mid f(x) = f^*\big\}.$
We will consider the subgradient method
$x_{k+1} = P_X(x_k - \alpha_k g_k),$
and three different types of rules for selecting the stepsize $\alpha_k$:

(a) A constant stepsize.


(b) A diminishing stepsize.
(c) A dynamically chosen stepsize based on the optimal value f* [cf. Prop.
3.2.2(b)] or a suitable estimate.
Additional stepsize rules, and related convergence and rate of convergence
results can be found in [NeB00], [NeB01], and [BNO03]. For the first two
stepsize rules we will assume the following:

Assumption 3.2.1: (Subgradient Boundedness) For some scalar


c, we have
$c \ge \sup\big\{\|g_k\| \mid k = 0, 1, \ldots\big\}.$

We note that Assumption 3.2.1 is satisfied if f is polyhedral,

$f(x) = \max_{i=1,\ldots,m}\{a_i'x + b_i\},$

an important special case in practice. This is because the subdifferential of


such a function at any point x is the convex hull of a subset of { a 1, ... , am}
[cf. Eq. (3.11)], so in this case we may use c = maxi=l, ... ,m llaill in Assump-
tion 3.2.1. Another important case where Assumption 3.2.1 is satisfied is
when Xis compact [see Prop. 3.1.2(a)]. More generally, Assumption 3.2.1
will hold if it can be ascertained somehow that {xk} is bounded.
From the point of view of analysis, the main consequence of Assump-
tion 3.2.1 is the inequality
$\|x_{k+1} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\big(f(x_k) - f(y)\big) + \alpha_k^2 c^2, \qquad \forall\, y \in X,$   (3.16)

which follows from Prop. 3.2.2(a). This type of inequality allows the use of
supermartingale convergence arguments (see Section A.4 in Appendix A),
and lies at the heart of the convergence proofs of this section, as well as
the convergence proofs of other subgradient-like methods given in the next
section and Section 6.4.

Constant Stepsize

When the stepsize is constant in the subgradient method (i.e., O:k = o:), we
cannot expect to prove convergence, in the absence of additional assump-
tions. As indicated in Fig. 3.2.3, we may only guarantee that asymptot-
ically we will approach a neighborhood of the set of minima, whose size
will depend on o:. The following proposition quantifies the size of this
neighborhood and provides an estimate on the difference between the cost
value
$f_\infty = \liminf_{k \to \infty} f(x_k)$

that the method achieves, and the optimal cost value $f^*$.

Proposition 3.2.3: (Convergence within a Neighborhood) Let Assumption 3.2.1 hold, and assume that $\alpha_k$ is constant, $\alpha_k = \alpha$.
(a) If $f^* = -\infty$, then $f_\infty = f^*$.
(b) If $f^* > -\infty$, then
$f_\infty \le f^* + \frac{\alpha c^2}{2}.$

Proof: We prove (a) and (b) simultaneously by contradiction. If the result does not hold, there must exist an $\epsilon > 0$ such that
$f_\infty > f^* + \frac{\alpha c^2}{2} + 2\epsilon.$
Let $\hat y \in X$ be such that
$f_\infty \ge f(\hat y) + \frac{\alpha c^2}{2} + 2\epsilon,$
and let $\bar k$ be large enough so that for all $k \ge \bar k$ we have
$f(x_k) \ge f_\infty - \epsilon.$
By adding the preceding two relations, we obtain for all $k \ge \bar k$,
$f(x_k) - f(\hat y) \ge \frac{\alpha c^2}{2} + \epsilon.$
Using Eq. (3.16) with $y = \hat y$, together with the above relation, we obtain for all $k \ge \bar k$,
$\|x_{k+1} - \hat y\|^2 \le \|x_k - \hat y\|^2 - 2\alpha\big(f(x_k) - f(\hat y)\big) + \alpha^2 c^2 \le \|x_k - \hat y\|^2 - 2\alpha\left(\frac{\alpha c^2}{2} + \epsilon\right) + \alpha^2 c^2 = \|x_k - \hat y\|^2 - 2\alpha\epsilon.$
Thus we have
$\|x_{k+1} - \hat y\|^2 \le \|x_k - \hat y\|^2 - 2\alpha\epsilon \le \|x_{k-1} - \hat y\|^2 - 4\alpha\epsilon \le \cdots \le \|x_{\bar k} - \hat y\|^2 - 2(k + 1 - \bar k)\alpha\epsilon,$



which cannot hold for $k$ sufficiently large, a contradiction. Q.E.D.

The next proposition gives an estimate of the number of iterations


needed to guarantee a level of optimality up to the threshold tolerance
oc2 /2 given in the preceding proposition. As can be expected, the number
of necessary iterations depends on the distance of the initial point xo to the
optimal solution set X*. In the following proposition and the subsequent
discussion we denote

$d(x) = \min_{x^* \in X^*} \|x - x^*\|, \qquad x \in \Re^n.$

Proposition 3.2.4: (Convergence Rate) Let Assumption 3.2.1 hold. Assume further that $\alpha_k$ is constant, $\alpha_k = \alpha$, and that $X^*$ is nonempty. Then for any positive scalar $\epsilon$, we have
$\min_{0 \le k \le K} f(x_k) \le f^* + \frac{\alpha c^2 + \epsilon}{2},$
where
$K = \left\lfloor \frac{\big(d(x_0)\big)^2}{\alpha\epsilon} \right\rfloor.$
Proof: Assume the contrary, i.e., that for all $k$ with $0 \le k \le K$, we have
$f(x_k) > f^* + \frac{\alpha c^2 + \epsilon}{2}.$
From this relation, and Eq. (3.16) with $y = x^* \in X^*$ and $\alpha_k = \alpha$, we obtain for all $x^* \in X^*$ and $k$ with $0 \le k \le K$,
$\|x_{k+1} - x^*\|^2 \le \|x_k - x^*\|^2 - 2\alpha\big(f(x_k) - f^*\big) + \alpha^2 c^2 \le \|x_k - x^*\|^2 - (\alpha^2 c^2 + \alpha\epsilon) + \alpha^2 c^2 = \|x_k - x^*\|^2 - \alpha\epsilon.$
Adding the above inequalities over $k$ for $k = 0, \ldots, K$, yields
$0 \le \|x_{K+1} - x^*\|^2 \le \|x_0 - x^*\|^2 - (K + 1)\alpha\epsilon, \qquad \forall\, x^* \in X^*.$
Taking the minimum over $x^* \in X^*$, we obtain
$0 \le \big(d(x_0)\big)^2 - (K + 1)\alpha\epsilon,$
which contradicts the definition of $K$. Q.E.D.



By letting $\alpha = \epsilon/c^2$, we see from the preceding proposition that we can obtain an $\epsilon$-optimal solution in $O(1/\epsilon^2)$ iterations of the subgradient method. Equivalently, with $k$ iterations, we can attain an optimal solution to within a $O(1/\sqrt{k})$ cost function error. Note that the number of required iterations is independent of the dimension $n$ of the problem.
Another interesting result is that the rate of convergence (to the ap-
propriate neighborhood) is linear under a strong convexity type of assump-
tion, as shown in the following proposition. Several additional convergence
rate results that apply to incremental subgradient methods and other step-
size rules may be found in [NeB00].

Proposition 3.2.5: (Linear Convergence Rate) Let the assumptions of Prop. 3.2.4 hold, and assume further that for some $\gamma > 0$,
$f(x) - f^* \ge \gamma\big(d(x)\big)^2, \qquad \forall\, x \in X,$   (3.17)
and that $\alpha \le \frac{1}{2\gamma}$. Then for all $k$,
$\big(d(x_{k+1})\big)^2 \le (1 - 2\alpha\gamma)^{k+1}\big(d(x_0)\big)^2 + \frac{\alpha c^2}{2\gamma}.$

Proof: We let $y$ be the projection of $x_k$ on $X^*$ and $\alpha_k = \alpha$ in Eq. (3.16), and strengthen the left side, and also use Eq. (3.17) to obtain
$\big(d(x_{k+1})\big)^2 \le \big(d(x_k)\big)^2 - 2\alpha\big(f(x_k) - f^*\big) + \alpha^2 c^2 \le (1 - 2\alpha\gamma)\big(d(x_k)\big)^2 + \alpha^2 c^2.$
From this we can show by induction that for all $k$
$\big(d(x_{k+1})\big)^2 \le (1 - 2\alpha\gamma)^{k+1}\big(d(x_0)\big)^2 + \alpha^2 c^2\sum_{j=0}^k (1 - 2\alpha\gamma)^j,$
and the result follows using the fact $\sum_{j=0}^\infty (1 - 2\alpha\gamma)^j \le \frac{1}{2\alpha\gamma}$. Q.E.D.

The preceding proposition shows that the method converges linearly


to the set of all $x \in X$ with
$\big(d(x)\big)^2 \le \frac{\alpha c^2}{2\gamma}.$
The assumption (3.17) is implied by strong convexity of f, as defined in
Appendix B. Moreover, it is satisfied if f is polyhedral, and Xis polyhedral

and compact. To see this, note that for polyhedral $f$ and $X$, there exists $\beta > 0$ such that
$f(x) - f^* \ge \beta\, d(x), \qquad \forall\, x \in X;$
a proof of this is given in Prop. 5.1.6 of Chapter 5. For $X$ compact, we have
$\beta\, d(x) \ge \gamma\big(d(x)\big)^2, \qquad \forall\, x \in X,$
for some $\gamma > 0$, and Eq. (3.17) holds.
for some 'Y > 0, and Eq. (3.17) holds.

Diminishing Stepsize

We next consider the case where the stepsize $\alpha_k$ diminishes to zero, but satisfies $\sum_{k=0}^\infty \alpha_k = \infty$. This condition is needed so that the method can "travel" infinitely far if necessary to attain convergence; otherwise, convergence to $X^*$ may be impossible from starting points $x_0$ that are far from $X^*$, as for example in the case where $X = \Re^n$ and
$d(x_0) > c\sum_{k=0}^\infty \alpha_k,$
with $c$ being the constant in Assumption 3.2.1. A common choice that satisfies $\alpha_k \to 0$ and $\sum_{k=0}^\infty \alpha_k = \infty$ is
$\alpha_k = \frac{\beta}{k + \gamma},$
where $\beta$ and $\gamma$ are some positive scalars, often determined by some preliminary experimentation.†

Proposition 3.2.6: (Convergence) Let Assumption 3.2.1 hold. If $\alpha_k$ satisfies
$\lim_{k \to \infty} \alpha_k = 0, \qquad \sum_{k=0}^\infty \alpha_k = \infty,$
then $f_\infty = f^*$. Moreover if
$\sum_{k=0}^\infty \alpha_k^2 < \infty$
and $X^*$ is nonempty, then $\{x_k\}$ converges to some optimal solution.

† Larger stepsizes may also be used, provided we employ a device known as iterate averaging, whereby the running average $\bar x_k = \sum_{\ell=0}^k \alpha_\ell x_\ell \big/ \sum_{\ell=0}^k \alpha_\ell$ of the past iterates is maintained; see Exercise 3.8.

Proof: Assume, to arrive at a contradiction, that there exists an $\epsilon > 0$ such that
$f_\infty - 2\epsilon > f^*.$
Then there exists a point $\bar y \in X$ such that
$f_\infty - 2\epsilon > f(\bar y).$
Let $k_0$ be large enough so that for all $k \ge k_0$, we have
$f(x_k) \ge f_\infty - \epsilon.$
By adding the preceding two relations, we obtain for all $k \ge k_0$,
$f(x_k) - f(\bar y) > \epsilon.$
By setting $y = \bar y$ in Eq. (3.16), and by using the above relation, we have for all $k \ge k_0$,
$\|x_{k+1} - \bar y\|^2 \le \|x_k - \bar y\|^2 - 2\alpha_k\epsilon + \alpha_k^2 c^2.$
Since $\alpha_k \to 0$, without loss of generality, we may assume that $k_0$ is large enough so that
$\alpha_k c^2 \le \epsilon, \qquad \forall\, k \ge k_0.$
Therefore for all $k \ge k_0$ we have
$\|x_{k+1} - \bar y\|^2 \le \|x_k - \bar y\|^2 - \alpha_k\epsilon \le \cdots \le \|x_{k_0} - \bar y\|^2 - \epsilon\sum_{j=k_0}^k \alpha_j,$
which cannot hold for $k$ sufficiently large, a contradiction showing that $f_\infty = f^*$.
Assume that $X^*$ is nonempty. By Eq. (3.16) we have
$\|x_{k+1} - x^*\|^2 \le \|x_k - x^*\|^2 - 2\alpha_k\big(f(x_k) - f(x^*)\big) + \alpha_k^2 c^2, \qquad \forall\, x^* \in X^*.$   (3.18)
From the convergence result of Prop. A.4.4 of Appendix A, we have that for each $x^* \in X^*$, $\|x_k - x^*\|$ converges to some real number, and hence $\{x_k\}$ is bounded. Consider a subsequence $\{x_k\}_{\cal K}$ such that $\lim_{k\to\infty, k\in{\cal K}} f(x_k) = f^*$, and let $\bar x$ be a limit point of $\{x_k\}_{\cal K}$. Since $f$ is continuous, we must have $f(\bar x) = f^*$, so $\bar x \in X^*$. To prove convergence of the entire sequence to $\bar x$, we use $x^* = \bar x$ in Eq. (3.18). It then follows that $\|x_k - \bar x\|$ converges to a real number, which must be equal to 0 since $\bar x$ is a limit point of $\{x_k\}$. Thus $\bar x$ is the unique limit point of $\{x_k\}$.† Q.E.D.

t Note that this argument is essentially the same as the one we used to prove
the Fejer Convergence Theorem (Prop. A.4.6 in Appendix A). Indeed we could
have invoked that theorem for the last part of the proof.

Note that the proposition shows that {xk} converges to an optimal


solution (assuming one exists), which is stronger than all limit points of
{ xk} being optimal. The later assertion is the type of result one typically
shows for gradient methods for differentiable, possibly nonconvex problems
(see e.g., nonlinear programming texts such as [Ber99], [Lue84], [NoW06]).
The stronger assertion of convergence to a unique limit is possible because
of the convexity of f, based on which the proof rests.
The preceding proposition can be strengthened, assuming that $X^*$ is nonempty and $\alpha_k$ satisfies the slightly stronger conditions
$\sum_{k=0}^\infty \alpha_k = \infty, \qquad \sum_{k=0}^\infty \alpha_k^2 < \infty.$

Then convergence of $\{x_k\}$ to some optimal solution can be proved if for some scalar $c$, we have
$\|g_k\| \le c\big(1 + d(x_k)\big), \qquad \forall\, k \ge 0,$   (3.19)

in place of the stronger Assumption 3.2.1 (thus covering for example the
case where f is positive definite quadratic and X = 1Rn, which is not covered
by Assumption 3.2.1). This is shown in Exercise 3.6, with essentially the
same proof, after replacing Eq. (3.18) with another inequality that relies
on the assumption (3.19).

Dynamic Stepsize Rules

We now discuss the stepsize rule
$\alpha_k = \frac{f(x_k) - f^*}{\|g_k\|^2}, \qquad \forall\, k \ge 0,$   (3.20)
assuming $g_k \ne 0$. This rule is motivated by Prop. 3.2.2(b) [cf. Eq. (3.15)].


Of course knowing f * is typically unrealistic, but we will later modify the
stepsize, so that f * can be replaced by a dynamically updated estimate.

Proposition 3.2.7: (Convergence) Assume that X* is nonempty.


Then, if O:k is determined by the dynamic stepsize rule (3.20), {xk}
converges to some optimal solution.

Proof: From Prop. 3.2.2(a) with $y = x^* \in X^*$, we have
$\|x_{k+1} - x^*\|^2 \le \|x_k - x^*\|^2 - 2\alpha_k\big(f(x_k) - f^*\big) + \alpha_k^2\|g_k\|^2, \qquad \forall\, x^* \in X^*,\ k \ge 0.$
By using the definition of $\alpha_k$ [cf. Eq. (3.20)], we obtain
$\|x_{k+1} - x^*\|^2 \le \|x_k - x^*\|^2 - \frac{\big(f(x_k) - f^*\big)^2}{\|g_k\|^2}, \qquad \forall\, x^* \in X^*,\ k \ge 0.$

This implies that {xk} is bounded. Furthermore, f(xk) ---+ f*, since other-
wise we would have JJxkH - x* II :S Jlxk - x* JI - E for some suitably small
E > 0 and infinitely many k. Hence for any limit point x of {xk}, we have

x EX*, and since the sequence {Jjxk - x*JI} is nonincreasing, it converges


to Jjx - x* JI for every x* E X*. If there are two distinct limit points i; and
x of {xk}, we must have i; EX*, x EX*, and Jlx - x*JJ = Jlx - x*JI for all
x* E X*, which is possible only if i; = x. Q.E.D.

For most practical problems the optimal value $f^*$ is not known. In this case we may modify the dynamic stepsize (3.20) by replacing $f^*$ with an approximation. This leads to the stepsize rule
$\alpha_k = \frac{f(x_k) - f_k}{\|g_k\|^2}, \qquad \forall\, k \ge 0,$   (3.21)
where $f_k$ is an estimate of $f^*$. One possibility is to estimate $f^*$ by using the cost function values obtained so far, setting $f_k$ below $\min_{0 \le j \le k} f(x_j)$, and adjusting $f_k$ upwards if the algorithm appears not to be making progress. In a simple scheme proposed in [NeB01] and [BNO03], $f_k$ is given by
$f_k = \min_{0 \le j \le k} f(x_j) - \delta_k,$   (3.22)
and $\delta_k$ is updated according to
$\delta_{k+1} = \begin{cases} \theta\,\delta_k & \text{if } f(x_{k+1}) \le f_k,\\ \max\{\beta\,\delta_k,\ \delta\} & \text{if } f(x_{k+1}) > f_k,\end{cases}$   (3.23)
where $\delta$, $\beta$, and $\theta$ are fixed positive constants with $\beta < 1$ and $\theta \ge 1$. Thus in this scheme, we essentially "aspire" to reach a target level $f_k$ that is smaller by $\delta_k$ over the best value achieved thus far [cf. Eq. (3.22)]. Whenever the target level is achieved, we increase $\delta_k$ (if $\theta > 1$) or we keep it at the same value (if $\theta = 1$). If the target level is not attained at a given iteration, $\delta_k$ is reduced up to a threshold $\delta$. If the subgradient boundedness Assumption 3.2.1 holds, this threshold guarantees that the stepsize $\alpha_k$ of Eq. (3.21) is bounded away from zero, since from Eq. (3.22), we have $f(x_k) - f_k \ge \delta$ and hence $\alpha_k \ge \delta/\|g_k\|^2 \ge \delta/c^2$. As a result, the method behaves somewhat similar to the one with a constant stepsize (cf. Prop. 3.2.3), as indicated by the following proposition.

Proposition 3.2.8: (Convergence within a Neighborhood) Assume that $\alpha_k$ is determined by the dynamic stepsize rule (3.21) with the adjustment procedure (3.22)-(3.23). If $f^* = -\infty$, then
$\inf_{0 \le j} f(x_j) = f^*,$
while if $f^* > -\infty$, then
$\inf_{0 \le j} f(x_j) \le f^* + \delta.$

Proof: Assume, to arrive at a contradiction, that
$f^* + \delta < \inf_{0 \le j} f(x_j).$   (3.24)
Each time the target level is attained [i.e., $f(x_k) \le f_{k-1}$], the current best function value $\min_{0 \le j \le k} f(x_j)$ decreases by at least $\delta$ [cf. Eqs. (3.22) and (3.23)], so in view of Eq. (3.24), the target level can be attained only a finite number of times. From Eq. (3.23) it follows that after finitely many iterations, $\delta_k$ is decreased to the threshold value $\delta$, and remains at that value for all subsequent iterations, i.e., there is an index $\bar k$ such that
$\delta_k = \delta, \qquad \forall\, k \ge \bar k.$
Let us select $\hat y \in X$ such that $f^* \le f(\hat y) \le \inf_{0 \le j} f(x_j) - \delta$; this is possible in view of Eq. (3.24). Then using Eq. (3.22), we have
$f(\hat y) \le f_k \le f(x_k) - \delta, \qquad \forall\, k \ge \bar k.$   (3.25)
Applying Prop. 3.2.2(a) with $y = \hat y$, together with the preceding relation, we have
$\|x_{k+1} - \hat y\|^2 \le \|x_k - \hat y\|^2 - 2\alpha_k\big(f(x_k) - f(\hat y)\big) + \alpha_k^2\|g_k\|^2 \le \|x_k - \hat y\|^2 - 2\alpha_k\big(f(x_k) - f_k\big) + \alpha_k^2\|g_k\|^2, \qquad \forall\, k \ge \bar k.$
By using the definition of $\alpha_k$ [cf. Eq. (3.21)] and Eq. (3.25), we obtain
$\|x_{k+1} - \hat y\|^2 \le \|x_k - \hat y\|^2 - 2\,\frac{\big(f(x_k) - f_k\big)^2}{\|g_k\|^2} + \frac{\big(f(x_k) - f_k\big)^2}{\|g_k\|^2} = \|x_k - \hat y\|^2 - \frac{\big(f(x_k) - f_k\big)^2}{\|g_k\|^2} \le \|x_k - \hat y\|^2 - \frac{\delta^2}{\|g_k\|^2}, \qquad \forall\, k \ge \bar k,$   (3.26)
where the last inequality follows from the right side of Eq. (3.25). Hence $\{x_k\}$ is bounded, which implies that $\{g_k\}$ is also bounded (cf. Prop. 3.1.2). Letting $c$ be such that $\|g_k\| \le c$ for all $k$ and adding Eq. (3.26) over $k$, we have
$\|x_{k+1} - \hat y\|^2 \le \|x_{\bar k} - \hat y\|^2 - (k + 1 - \bar k)\frac{\delta^2}{c^2}, \qquad \forall\, k \ge \bar k,$
which cannot hold for sufficiently large $k$, a contradiction. Q.E.D.

In a variation of the preceding scheme, we may use in place of the


stepsize rule (3.21), one of the two rules

Vk 2: 0,

where 'Y is a fixed positive scalar and fk is given by the same adjustment
procedure (3.22)-(3.23). This will guard against the potential practical
difficulty of ak becoming too large due to very small values of llgkll- The
result of the preceding proposition still holds with this modification (see
Exercise 3.9).
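A minimal Python sketch of the target-level adjustment (3.22)-(3.23), wrapped around an unconstrained subgradient step, might look as follows; the constants and the test objective are placeholders chosen only for illustration.

```python
import numpy as np

def target_level_subgradient(f, subgrad, x0, delta0=1.0, delta_min=1e-3,
                             beta=0.5, theta=1.0, num_iters=500):
    # Dynamic stepsize alpha_k = (f(x_k) - f_k) / ||g_k||^2 with target level
    # f_k = min_{j<=k} f(x_j) - delta_k, cf. Eqs. (3.21)-(3.23); X = R^n here.
    x, delta = x0.copy(), delta0
    f_best = f(x)
    for _ in range(num_iters):
        g = subgrad(x)
        if g @ g == 0.0:                           # x is already optimal
            break
        f_level = f_best - delta
        alpha = (f(x) - f_level) / (g @ g)
        x = x - alpha * g
        if f(x) <= f_level:
            delta = theta * delta                  # target attained: keep or increase delta
        else:
            delta = max(beta * delta, delta_min)   # target missed: shrink toward threshold
        f_best = min(f_best, f(x))
    return x

# Illustrative objective: f(x) = ||x - y||_1.
y = np.array([3.0, -2.0, 1.0])
f = lambda x: np.abs(x - y).sum()
subgrad = lambda x: np.sign(x - y)
print(target_level_subgradient(f, subgrad, x0=np.zeros(3)))
```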
We finally note that the line of convergence analysis of this section can
be applied with small modifications to related methods that are based on
subgradients, most notably to the E-subgradient methods of the next sec-
tion, and the incremental subgradient and incremental proximal methods
of Section 6.4.

3.3 ε-SUBGRADIENT METHODS

In this section we briefly discuss subgradient-like methods that use approx-


imate subgradients in place of subgradients. There may be several different
motivations for such methods; for example, computational savings in the
subgradient calculation, or exploitation of special problem structure.
Given a proper convex function $f : \Re^n \mapsto (-\infty, \infty]$ and a scalar $\epsilon > 0$, we say that a vector $g$ is an $\epsilon$-subgradient of $f$ at a point $x \in \text{dom}(f)$ if
$f(z) \ge f(x) + (z - x)'g - \epsilon, \qquad \forall\, z \in \Re^n.$   (3.27)
The $\epsilon$-subdifferential $\partial_\epsilon f(x)$ is the set of all $\epsilon$-subgradients of $f$ at $x$, and by convention, $\partial_\epsilon f(x) = \emptyset$ for $x \notin \text{dom}(f)$. It can be seen that
$\partial_{\epsilon_1} f(x) \subset \partial_{\epsilon_2} f(x) \qquad \text{if } 0 < \epsilon_1 < \epsilon_2,$
and that
$\bigcap_{\epsilon \downarrow 0} \partial_\epsilon f(x) = \partial f(x).$

Figure 3.3.1. Illustration of an $\epsilon$-subgradient of a convex function $f$. A vector $g$ is an $\epsilon$-subgradient at $x \in \text{dom}(f)$ if and only if there is a hyperplane with normal $(-g, 1)$, which passes through the point $\big(x, f(x) - \epsilon\big)$, and separates this point from the epigraph of $f$.

To interpret geometrically an $\epsilon$-subgradient, note that the defining relation (3.27) can be written as
$f(z) - z'g \ge \big(f(x) - \epsilon\big) - x'g, \qquad \forall\, z \in \Re^n.$
Thus $g$ is an $\epsilon$-subgradient at $x$ if and only if the epigraph of $f$ is contained in the positive halfspace corresponding to the hyperplane in $\Re^{n+1}$ that has normal $(-g, 1)$ and passes through $\big(x, f(x) - \epsilon\big)$, as illustrated in Fig. 3.3.1.
Figure 3.3.2 illustrates the definition of the $\epsilon$-subdifferential $\partial_\epsilon f(x)$ for the case of a one-dimensional function $f$. The figure indicates that if $f$ is closed, then [in contrast with $\partial f(x)$] $\partial_\epsilon f(x)$ is nonempty at all points of $\text{dom}(f)$, including the relative boundary points of $\text{dom}(f)$. This follows by the Nonvertical Hyperplane Theorem (Prop. 1.5.8 in Appendix B). As an illustration, consider the scalar function $f(x) = |x|$. Then it is straightforward to verify that for $x \in \Re$ and $\epsilon > 0$, we have
$\partial_\epsilon f(x) = \begin{cases} \big[-1,\ -1 - \tfrac{\epsilon}{x}\big] & \text{for } x < -\tfrac{\epsilon}{2},\\ [-1,\ 1] & \text{for } x \in \big[-\tfrac{\epsilon}{2},\ \tfrac{\epsilon}{2}\big],\\ \big[1 - \tfrac{\epsilon}{x},\ 1\big] & \text{for } x > \tfrac{\epsilon}{2}.\end{cases}$
Given the problem of minimizing a real-valued convex function $f : \Re^n \mapsto \Re$ over a closed convex set $X$, the $\epsilon$-subgradient method is given by
$x_{k+1} = P_X(x_k - \alpha_k g_k),$   (3.28)
where $g_k$ is an $\epsilon_k$-subgradient of $f$ at $x_k$, with $\epsilon_k$ a positive scalar, $\alpha_k$ is a positive stepsize, and $P_X(\cdot)$ denotes projection on $X$. Thus the method is the same as the subgradient method, except that $\epsilon$-subgradients are used in place of subgradients.
The following example motivates the use of $\epsilon$-subgradients in the context of duality and minimax problems. It shows that $\epsilon$-subgradients may be computed more economically than subgradients, through an approximate minimization.

Figure 3.3.2. Illustration of the $\epsilon$-subdifferential $\partial_\epsilon f(x)$ of a one-dimensional function $f : \Re \mapsto (-\infty, \infty]$, which is closed and convex, and has as effective domain an interval $D$. The $\epsilon$-subdifferential is a nonempty interval with endpoints corresponding to the slopes indicated in the figure. At boundary points of $\text{dom}(f)$, these endpoints can be $\infty$ or $-\infty$ (as in the figure on the right).

Example 3.3.1: (ε-Subgradient Calculation in Minimax and Dual Problems)

As in Example 3.1.1, let us consider the minimization of
$f(x) = \sup_{z \in Z} \phi(x, z),$   (3.29)
where $x \in \Re^n$, $z \in \Re^m$, $Z$ is a subset of $\Re^m$, and $\phi : \Re^n \times \Re^m \mapsto (-\infty, \infty]$ is a function such that $\phi(\cdot, z)$ is convex for each $z \in Z$. We showed in Example 3.1.1 that if we carry out exactly the maximization over $z$ in Eq. (3.29), we can then obtain a subgradient at $x$. We will show with a similar argument, that if we carry out the maximization over $z$ approximately, within $\epsilon$, we can then obtain an $\epsilon$-subgradient at $x$, which we can use in turn within an $\epsilon$-subgradient method.
Indeed, for a fixed $x \in \text{dom}(f)$, let us assume that $z_x \in Z$ attains the supremum within $\epsilon > 0$ in Eq. (3.29), i.e.,
$\phi(x, z_x) \ge \sup_{z \in Z} \phi(x, z) - \epsilon = f(x) - \epsilon,$
and that $g_x$ is some subgradient of the convex function $\phi(\cdot, z_x)$ at $x$, i.e., $g_x \in \partial\phi(x, z_x)$. Then, for all $y \in \Re^n$, we have using the subgradient inequality,
$f(y) = \sup_{z \in Z} \phi(y, z) \ge \phi(y, z_x) \ge \phi(x, z_x) + g_x'(y - x) \ge f(x) - \epsilon + g_x'(y - x),$
i.e., $g_x$ is an $\epsilon$-subgradient of $f$ at $x$. In conclusion,
$\phi(x, z_x) \ge \sup_{z \in Z} \phi(x, z) - \epsilon \ \text{ and } \ g_x \in \partial\phi(x, z_x) \quad\Longrightarrow\quad g_x \in \partial_\epsilon f(x).$
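For a finite set $Z$ the conclusion is easy to check numerically: any $z$ whose value is within $\epsilon$ of the maximum yields an $\epsilon$-subgradient. A small Python sketch follows, with arbitrary illustrative data.

```python
import numpy as np

# f(x) = max_i { z_i'x - h_i }, i.e., phi(x, z_i) = z_i'x - h_i with Z finite.
rng = np.random.default_rng(3)
Z, h = rng.standard_normal((30, 4)), rng.standard_normal(30)
f = lambda x: np.max(Z @ x - h)

def eps_subgradient(x, eps):
    # Pick any index whose value is within eps of the maximum; the gradient
    # of phi(., z_i) there, namely z_i, is then an eps-subgradient of f at x.
    values = Z @ x - h
    i = int(np.argmax(values >= values.max() - eps))   # first eps-maximizer
    return Z[i]

x, y, eps = rng.standard_normal(4), rng.standard_normal(4), 0.1
g = eps_subgradient(x, eps)
print(f(y) >= f(x) + g @ (y - x) - eps)   # the eps-subgradient inequality (3.27)
```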

The behavior and analysis of $\epsilon$-subgradient methods are similar to those of subgradient methods, except that $\epsilon$-subgradient methods generally aim to converge to the $\epsilon$-optimal set, where $\epsilon = \lim_{k\to\infty}\epsilon_k$, rather than the optimal set, as subgradient methods do. To get a sense of the convergence mechanism, note that there is a simple modification of the fundamental inequality of Prop. 3.2.2(a). In particular, if $\{x_k\}$ is the sequence generated by the $\epsilon$-subgradient method, we have for all $y \in X$ and $k \ge 0$
$\|x_{k+1} - y\|^2 \le \|x_k - y\|^2 - 2\alpha_k\big(f(x_k) - f(y) - \epsilon_k\big) + \alpha_k^2\|g_k\|^2.$
Using this inequality, one can essentially replicate the convergence analysis of Section 3.2, while carrying along the $\epsilon$ parameter.
As an example, consider the case where $\alpha_k$ and $\epsilon_k$ are constant: $\alpha_k = \alpha$ for some $\alpha > 0$ and $\epsilon_k = \epsilon$ for some $\epsilon > 0$. Then, if the $\epsilon$-subgradients $g_k$ are bounded, with $\|g_k\| \le c$ for some constant $c$ and all $k$, we obtain for all optimal solutions $x^*$,
$\|x_{k+1} - x^*\|^2 \le \|x_k - x^*\|^2 - 2\alpha\big(f(x_k) - f^* - \epsilon\big) + \alpha^2 c^2,$
where $f^* = \inf_{x\in X} f(x)$ is the optimal value [cf. Eq. (3.16)]. This implies that the distance to all optimal $x^*$ decreases if
$0 < \alpha < \frac{2\big(f(x_k) - f^* - \epsilon\big)}{c^2},$
or equivalently, if $x_k$ is outside the level set
$\left\{x \in X \ \Big|\ f(x) \le f^* + \epsilon + \frac{\alpha c^2}{2}\right\}$
(cf. Fig. 3.2.3). With analysis similar to the one for the subgradient case, we can also show that if
$\alpha_k \to 0, \qquad \sum_{k=0}^\infty \alpha_k = \infty,$
we have
$f_\infty \le f^* + \epsilon$
(cf. Prop. 3.2.6). There is also a related convergence result for an analog of the dynamic stepsize rule and other rules (see [NeB10]). If we have $\epsilon_k \to 0$ instead of $\epsilon_k = \epsilon$, the convergence properties of the $\epsilon$-subgradient method (3.28) are essentially the same as the ones of the ordinary subgradient method, both for a constant and for a diminishing stepsize.

3.3.1 Connection with Incremental Subgradient Methods

We discussed in Section 2.1.5 incremental variants of gradient methods,


which apply to minimization over a closed convex set $X$ of an additive cost function of the form
$f(x) = \sum_{i=1}^m f_i(x),$

where the functions Ji : Rn H R are differentiable. Incremental variants of


the subgradient method are also possible in the case where the Ji are non-
differentiable but convex. The idea is to sequentially take steps along the
subgradients of the component functions Ji, with intermediate adjustment
of x after processing each Ji- We simply use an arbitrary subgradient of Ji
at a point where Ji is nondifferentiable, in place of the gradient that would
be used if Ji were differentiable at that point.
Incremental methods are particularly interesting when the number of
cost terms m is very large. Then a full subgradient step is very costly,
and one hopes to make progress with approximate but much cheaper incre-
mental steps. We will discuss in detail incremental subgradient methods
and their combinations with other methods, such as incremental proximal
methods, in Section 6.4. In this section we will discuss the most common
type of incremental subgradient method, and highlight its connection with
the E-subgradient method.
Let us consider the minimization of $\sum_{i=1}^m f_i$ over $x \in X$, for the case where each $f_i$ is a convex real-valued function. Similar to the incremental gradient methods of Section 2.1.5, we view an iteration as a cycle of $m$ subiterations. If $x_k$ is the vector obtained after $k$ cycles, the vector $x_{k+1}$ obtained after one more cycle is
$x_{k+1} = \psi_{m,k},$
where starting with $\psi_{0,k} = x_k$, we obtain $\psi_{m,k}$ after the $m$ steps
$\psi_{i,k} = P_X\big(\psi_{i-1,k} - \alpha_k g_{i,k}\big), \qquad i = 1, \ldots, m,$   (3.30)
with $g_{i,k}$ being an arbitrary subgradient of $f_i$ at $\psi_{i-1,k}$.
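A minimal Python sketch of one such cycle, here for the unconstrained case $X = \Re^n$ so that the projection in (3.30) is the identity, could look as follows; the additive cost and its data are placeholders chosen only for illustration.

```python
import numpy as np

def incremental_subgradient_cycle(x, subgrads, alpha):
    # One cycle of the incremental subgradient method: process the
    # components f_1, ..., f_m in order, taking a subgradient step after
    # each one (X = R^n here, so no projection is needed).
    psi = x.copy()
    for subgrad_i in subgrads:
        psi = psi - alpha * subgrad_i(psi)
    return psi

# Illustrative additive cost: f(x) = sum_i |c_i'x - b_i|.
rng = np.random.default_rng(2)
C, b = rng.standard_normal((50, 3)), rng.standard_normal(50)
subgrads = [lambda x, c=C[i], bi=b[i]: np.sign(c @ x - bi) * c for i in range(50)]

x = np.zeros(3)
for k in range(200):
    x = incremental_subgradient_cycle(x, subgrads, alpha=1.0 / (k + 1))
print(np.abs(C @ x - b).sum())
```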


To see the connection with $\epsilon$-subgradients, we first note that if two vectors $x$ and $\bar x$ are "near" each other, then subgradients at $\bar x$ can be viewed as $\epsilon$-subgradients at $x$, with $\epsilon$ "small." In particular, if $g \in \partial f(\bar x)$, we have for all $z \in \Re^n$,
$f(z) \ge f(\bar x) + g'(z - \bar x) = f(x) + g'(z - x) + f(\bar x) - f(x) + g'(x - \bar x) \ge f(x) + g'(z - x) - \epsilon,$
where
$\epsilon = \max\big\{0,\ f(x) - f(\bar x)\big\} + \|g\| \cdot \|x - \bar x\|.$

Thus, $g \in \partial f(\bar x)$ implies that $g \in \partial_\epsilon f(x)$, with $\epsilon$ small when $\bar x$ is near $x$.
We now observe from Eq. (3.30) that the $i$th step within a cycle of the incremental subgradient method involves the direction $g_{i,k}$, which is a subgradient of $f_i$ at the corresponding vector $\psi_{i-1,k}$. If the stepsize $\alpha_k$ is small, then $\psi_{i-1,k}$ is close to the vector $x_k$ available at the start of the cycle, and hence $g_{i,k}$ is an $\epsilon_i$-subgradient of $f_i$ at $x_k$, where $\epsilon_i$ is small. In particular, assuming for simplicity that $X = \Re^n$, we have
$x_{k+1} = x_k - \alpha_k\sum_{i=1}^m g_{i,k},$   (3.31)
where $g_{i,k}$ is a subgradient of $f_i$ at $\psi_{i-1,k}$, and hence an $\epsilon_i$-subgradient of $f_i$ at $x_k$, for an $\epsilon_i$ that is small (proportional to $\alpha_k$). Thus, using also the definition of $\epsilon$-subgradient, we have
$\sum_{i=1}^m g_{i,k} \in \partial_{\epsilon_1} f_1(x_k) + \cdots + \partial_{\epsilon_m} f_m(x_k) \subset \partial_\epsilon f(x_k),$
where $\epsilon = \epsilon_1 + \cdots + \epsilon_m$.
From this analysis it follows that the incremental subgradient itera-
tion (3.31) can be viewed as an E-subgradient iteration at Xk, the starting
point of the cycle. The size of E depends on the size of the stepsize ak, as
well as the function J, and we have E -+ 0 as ak -+ 0. As a result, when
$\sum_{k=0}^\infty \alpha_k = \infty,$

the incremental subgradient method embodies a convergence mechanism


similar to the one of the ordinary subgradient method, and has similar
convergence properties. If ak is kept constant, convergence to a neighbor-
hood of the solution can be expected. These results will be established
in detail later, with a somewhat different but related line of reasoning; see
Section 6.4 where we will also consider methods that select the components
Ji for iteration by using a randomized rather than cyclic order.

3.4 NOTES, SOURCES, AND EXERCISES

Section 3.1: Subgradients are central in the work of Fenchel [Fen51]. The
original theorem by Danskin [Dan67] provides a formula for the directional
derivative of the maximum of a (not necessarily convex) directionally dif-
ferentiable function. When adapted to a convex function J, this formula
yields Eq. (3.10) for the subdifferential of J; see Exercise 3.5.
Another important subdifferential formula relates to the subgradients
of an expected value function

J(x) = E{ F(x,w) },

where $w$ is a random variable taking values in a set $\Omega$, and $F(\cdot, w) : \Re^n \mapsto \Re$


is a real-valued convex function such that f is real-valued (note that f is
easily verified to be convex). If w takes a finite number of values with
probabilities p(w), then the formulas

f'(x;d) = E{F'(x,w;d)}, 8f(x) = E{8F(x,w)}, (3.32)

hold because they can be written in terms of finite sums as

$f'(x; d) = \sum_{w\in\Omega} p(w)\,F'(x, w; d), \qquad \partial f(x) = \sum_{w\in\Omega} p(w)\,\partial F(x, w),$

so Prop. 3.l.3(b) applies. However, the formulas (3.32) hold even in the
case where n is uncountably infinite, with appropriate mathematical inter-
pretation of the integral of set-valued functions E{ 8F(x, w)} as the set of
integrals

$\int_{w\in\Omega} g(x, w)\, dP(w),$   (3.33)

where g(x, w) E 8F(x, w), w E n (measurability issues must be addressed


in this context). For a formal proof and analysis, see the author's papers
[Ber72], [Ber73], which also provide a necessary and sufficient condition for
f to be differentiable, even when F(·, w) is not. In this connection, it is
important to note that the integration over w in Eq. (3.33) may smooth
out the nondifferentiabilities of F( ·, w) if w is a "continuous" random vari-
able. This property can be used in turn in algorithms, including schemes
that bring to bear the methodology of differentiable optimization; see e.g.,
Yousefian, Nedic, and Shanbhag [YNS10], [YNS12], Agarwal and Duchi [AgD11], Duchi, Bartlett, and Wainwright [DBW12], Brown and Smith
[BrS13], Abernethy et al. [ALS14], and Jiang and Zhang [JiZ14].
Section 3.2: Subgradient methods were first introduced in the middle 60s
by Shor; the works of Ermoliev and Poljak were also particularly influen-
tial. Description of these works can be found in many sources, including
the books by Ermoliev [Erm76], Shor [Sho85], and Poljak [Pol87]. An ex-
tensive bibliography for the early period of the subject is given in the edited
volume by Balinski and Wolfe [BaW75]. Some of the first papers in the
Western literature on nondifferentiable optimization appeared in this vol-
ume. There are many works dealing with analysis of subgradient methods.
There are also several variations of subgradient methods that aim to ac-
celerate the convergence of the basic method (see e.g., [CFM75], [Sho85],
[Min86], [Str97], [LPS98], [Sho98], [ZLW99], [BLY14]).
The line of analysis given here is based on the joint work of the
author with A. Nedic [NeB00], [NeB01], and has been used in several subsequent works, [NBB01], [NeO09a], [NeO09b], [NeB10], [Ber11], [Ned11], [WaB13a]. The book by Bertsekas, Nedic, and Ozdaglar [BNO03] contains
a more extensive convergence analysis, which relates to a greater variety of

stepsize rules, including dynamic rules for nonincremental and incremental


subgradient methods.
Section 3.3: Methods using $\epsilon$-subgradients (cf. Section 3.3) have been investigated by several authors, including Robinson [Rob99], Auslender and Teboulle [AuT04], and Nedic and Bertsekas [NeB10]. Subgradient methods are often implemented in approximate form, with errors in the calculation of the subgradient or the cost function value. For analysis related to such implementations, see Nedic and Bertsekas [NeB10], and Hu, Yang, and Sim [HYS15]. For some nondifferentiable problems involving sharp minima, exact convergence may be obtained despite persistent errors in the calculation of the subgradient (see Exercise 3.10 and [NeB10]). Moreover, additional algorithms based on the $\epsilon$-subdifferential, called $\epsilon$-descent methods, will be discussed in Section 6.7.
Problems of minimization of additive cost functions f(x) = Σ_{i=1}^m f_i(x)
(cf. Section 3.3.1) arise in many applications, as noted in Section 1.3. In-
cremental subgradient methods for such problems will be discussed in Sec-
tion 6.4, and detailed references will be given in Chapter 6. Extensions of
incremental subgradient methods, called incremental constraint projection
methods, will also be considered for constraint sets of the form X = ∩_{i=1}^m X_i,
where the component sets Xi are more suitable for the projection operation
than X itself, and the constraint projections are done incrementally (see
Section 6.4.4).

EXERCISES

3.1 (Computational Exercise)

Consider the unconstrained minimization of


    f(x) = Σ_{i=1}^m max{ 0, c_i'x − b_i },

where c_i are given vectors in ℜn and b_i are given scalars.


(a) Verify that a subgradient method has the form

    x_{k+1} = x_k − α_k Σ_{i=1}^m g_{i,k},

where

    g_{i,k} = { c_i   if c_i'x_k > b_i,
                0     otherwise.

(b) Consider an incremental subgradient method that operates in cycles as
follows. At the end of cycle k, we set x_{k+1} = ψ_{m,k}, where ψ_{m,k} is obtained
after the m steps

    ψ_{i,k} = { ψ_{i−1,k} − α_k c_i   if c_i'ψ_{i−1,k} > b_i,
                ψ_{i−1,k}             otherwise,
    i = 1, ..., m,

starting with ψ_{0,k} = x_k. Compare this method with the algorithm of (a)
computationally with two examples where n = 2 and m = 100; a code sketch
of both methods is given after this exercise. In the first
example, the vectors c_i have the form c_i = (ξ_i, ζ_i), where ξ_i, ζ_i, as well
as b_i, are chosen randomly and independently from [−100, 100] according
to a uniform distribution. In the second example, the vectors c_i have the
form c_i = (ξ_i, ζ_i), where ξ_i, ζ_i are chosen randomly and independently
within [−10, 10] according to a uniform distribution, while b_i is chosen
randomly and independently within [0, 1000] according to a uniform dis-
tribution. Experiment with different starting points and stepsize choices,
and deterministic and randomized orders of selection of the indexes i for
iteration. In the case of the second example, under what circumstances
does the method stop after a finite number of iterations?
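The following is a small computational sketch of the two methods of this exercise in
Python/NumPy. It is not part of the book; the constant stepsize, the iteration counts,
and the use of the first data-generation scheme are illustrative assumptions only.

```python
# A minimal sketch of Exercise 3.1: batch vs. incremental subgradient method
# for f(x) = sum_i max{0, c_i'x - b_i}.  Data follows the first example above.
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 100
C = rng.uniform(-100, 100, size=(m, n))   # rows are the vectors c_i
b = rng.uniform(-100, 100, size=m)

def f(x):
    return np.sum(np.maximum(0.0, C @ x - b))

def batch_subgradient(x0, alpha=1e-3, iters=200):
    # Method of part (a): sum the component subgradients over all i, then step.
    x = x0.copy()
    for _ in range(iters):
        g = C.T @ (C @ x - b > 0).astype(float)   # sum of c_i over violated terms
        x = x - alpha * g
    return x

def incremental_subgradient(x0, alpha=1e-3, cycles=200):
    # Method of part (b): one cycle processes the terms i = 1, ..., m in sequence.
    x = x0.copy()
    for _ in range(cycles):
        for i in range(m):
            if C[i] @ x > b[i]:
                x = x - alpha * C[i]
    return x

x0 = np.zeros(n)
print("batch:      ", f(batch_subgradient(x0)))
print("incremental:", f(incremental_subgradient(x0)))
```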

3.2 (Optimality Condition with Directional Derivatives)

The purpose of this exercise is to express the necessary and sufficient condition
for optimality of Prop. 3.1.4 in terms of the directional derivative of the cost
function. Consider the minimization of a convex function f : ℜn ↦ ℜ over a
convex set X ⊂ ℜn. For any x ∈ X, the set of feasible directions of f at x is
defined to be the convex cone

    D(x) = { α(z − x) | z ∈ X, α > 0 }.

Show that a vector x* minimizes f over X if and only if x* ∈ X and

    f'(x*; d) ≥ 0,     ∀ d ∈ D(x*).     (3.34)

Note: In words, this condition says that x* is optimal if and only if there is no
feasible descent direction of f at x*. Solution: Let cl(D(x*)) denote the closure of
D(x*). By Prop. 3.1.4, x* minimizes f over X if and only if there exists g ∈ ∂f(x*)
such that

    g'd ≥ 0,     ∀ d ∈ D(x*),

which is equivalent to

    g'd ≥ 0,     ∀ d ∈ cl(D(x*)).

Thus, x* minimizes f over X if and only if

    max_{g∈∂f(x*)}  min_{||d||≤1, d∈cl(D(x*))}  g'd ≥ 0.

Since the minimization and maximization above are over convex and compact
sets, by the Saddle Point Theorem of Prop. 5.5.3 in Appendix B, this is equivalent
to
    min_{||d||≤1, d∈cl(D(x*))}  max_{g∈∂f(x*)}  g'd ≥ 0,

or by Prop. 3.1.1(a),

    min_{||d||≤1, d∈cl(D(x*))}  f'(x*; d) ≥ 0.

This is in turn equivalent to the desired condition (3.34), since f'(x*; ·) is contin-
uous, being convex and real-valued.

3.3 (Subdifferential of an Extended Real-Valued Function)

Extended real-valued convex functions arising in algorithmic practice are often
of the form

    f(x) = { h(x)   if x ∈ X,
             ∞      if x ∉ X,     (3.35)

where h : ℜn ↦ ℜ is a real-valued convex function and X is a nonempty convex
set. The purpose of this exercise is to show that the subdifferential of such
functions admits a more favorable characterization compared to the case where
h is extended real-valued.
(a) Use Props. 3.1.3 and 3.1.4 to show that the subdifferential of such a function
is nonempty for all x E X, and has the form

    ∂f(x) = ∂h(x) + N_X(x),     ∀ x ∈ X,

where N_X(x) is the normal cone of X at x ∈ X. Note: If h is convex but
extended real-valued, this formula requires the assumption ri(dom(h)) ∩
ri(X) ≠ ∅ or some polyhedral conditions on h and X; see Prop. 5.4.6 of
Appendix B. Proof: By the subgradient inequality (3.1), we have g ∈ ∂f(x)
if and only if x minimizes p(z) = h(z) − g'z over z ∈ X, or equivalently,
some subgradient of p at x [i.e., a vector in ∂h(x) − {g}, by Prop. 3.1.3]
belongs to −N_X(x) (cf. Prop. 3.1.4).
(b) Let f(x) = −√x if x ≥ 0 and f(x) = ∞ if x < 0. Verify that f is a closed
convex function that cannot be written in the form (3.35) and does not
have a subgradient at x = 0.
(c) Show the following formula for the subdifferential of the sum of functions
f_i that have the form (3.35) for some h_i and X_i:

    ∂(f_1 + ··· + f_m)(x) = ∂h_1(x) + ··· + ∂h_m(x) + N_{X_1∩···∩X_m}(x),

for all x ∈ X_1 ∩ ··· ∩ X_m. Demonstrate by example that in this formula we
cannot replace N_{X_1∩···∩X_m}(x) by N_{X_1}(x) + ··· + N_{X_m}(x). Proof: Write
f_1 + ··· + f_m = h + δ_X, where h = h_1 + ··· + h_m, X = X_1 ∩ ··· ∩ X_m,
and δ_X is the indicator function of X, and apply part (a).
For a counterexample, let m = 2, and let X_1 and X_2 be unit spheres in the
plane with centers at (−1, 0) and (1, 0), respectively.

3.4 ( Continuity of Gradient and Directional Derivative)

The following exercise provides a basic continuity property of directional deriva-


tives and gradients of convex functions. Let f : ℜn ↦ ℜ be a convex function,
and let {f_k} be a sequence of convex functions f_k : ℜn ↦ ℜ with the property
that lim_{k→∞} f_k(x_k) = f(x) for every x ∈ ℜn and every sequence {x_k} that con-
verges to x. Show that for any x ∈ ℜn and y ∈ ℜn, and any sequences {x_k} and
{y_k} converging to x and y, respectively, we have

    limsup_{k→∞} f_k'(x_k; y_k) ≤ f'(x; y).

Furthermore, if f is differentiable over ℜn, then it is continuously differentiable
over ℜn. Solution: From the definition of directional derivative, it follows that
for any ε > 0, there exists an ᾱ > 0 such that

    ( f(x + ᾱy) − f(x) ) / ᾱ  <  f'(x; y) + ε.

Hence, using also the equation

    f'(x; y) = inf_{α>0} ( f(x + αy) − f(x) ) / α,

we have for all sufficiently large k,

    f_k'(x_k; y_k) ≤ ( f_k(x_k + ᾱ y_k) − f_k(x_k) ) / ᾱ  <  f'(x; y) + ε,

so by taking the limit as k → ∞,

    limsup_{k→∞} f_k'(x_k; y_k) ≤ f'(x; y) + ε.

Since this is true for all ε > 0, we obtain limsup_{k→∞} f_k'(x_k; y_k) ≤ f'(x; y).
If f is differentiable at all x ∈ ℜn, then by Prop. 3.1.1(b), we have f'(x; y) =
∇f(x)'y for all x, y ∈ ℜn, and by using the result just proved,
it follows that for every sequence {x_k} converging to x and every y ∈ ℜn,

    limsup_{k→∞} ∇f(x_k)'y = limsup_{k→∞} f'(x_k; y) ≤ f'(x; y) = ∇f(x)'y.

By replacing y with −y in the preceding argument, we obtain

    −liminf_{k→∞} ∇f(x_k)'y = limsup_{k→∞} ( −∇f(x_k)'y ) ≤ −∇f(x)'y.

Combining the preceding two relations, we have ∇f(x_k)'y → ∇f(x)'y for every
y, which implies that ∇f(x_k) → ∇f(x). Hence, ∇f(·) is continuous.

3.5 (Danskin's Theorem)

Let Z be a compact subset of ℜm, and let φ : ℜn × Z ↦ ℜ be continuous and
such that φ(·, z) : ℜn ↦ ℜ is convex for each z ∈ Z.
(a) Show that the function f : ℜn ↦ ℜ given by

    f(x) = max_{z∈Z} φ(x, z)     (3.36)

is convex and has directional derivative given by

    f'(x; y) = max_{z∈Z(x)} φ'(x, z; y),     (3.37)

where φ'(x, z; y) is the directional derivative of the function φ(·, z) at x in
the direction y, and Z(x) is the set of maximizing points in Eq. (3.36),

    Z(x) = { z̄ | φ(x, z̄) = max_{z∈Z} φ(x, z) }.

Furthermore, the maximum in Eq. (3.37) is attained. In particular, if Z(x)
consists of a unique point z̄ and φ(·, z̄) is differentiable at x, then f is
differentiable at x, and ∇f(x) = ∇_x φ(x, z̄), where ∇_x φ(x, z̄) is the vector
with components

    ∂φ(x, z̄)/∂x_i,     i = 1, ..., n.

(b) Show that if φ(·, z) is differentiable for all z ∈ Z and ∇_x φ(x, ·) is continuous
on Z for each x, then

    ∂f(x) = conv{ ∇_x φ(x, z) | z ∈ Z(x) },     ∀ x ∈ ℜn.

Solution: (a) We note that since φ is continuous and Z is compact, the set Z(x)
is nonempty by Weierstrass' Theorem and f is real-valued. For any z ∈ Z(x),
y ∈ ℜn, and α > 0, we use the definition of f to obtain

    ( f(x + αy) − f(x) ) / α  ≥  ( φ(x + αy, z) − φ(x, z) ) / α.

Taking the limit as α decreases to zero, we obtain f'(x; y) ≥ φ'(x, z; y). Since
this is true for every z ∈ Z(x), we conclude that

    f'(x; y) ≥ sup_{z∈Z(x)} φ'(x, z; y).     (3.38)

We will next prove the reverse inequality and that the supremum in the
right-hand side of the above inequality is attained. To this end, we fix x, we
consider a sequence { ak} of positive scalars that converges to zero, and we let
x_k = x + α_k y. For each k, let z_k be a vector in Z(x_k). Since {z_k} belongs to the
compact set Z, it has a subsequence converging to some z̄ ∈ Z. Without loss of
generality, we assume that the entire sequence {z_k} converges to z̄. We have

    φ(x_k, z_k) ≥ φ(x_k, z),     ∀ z ∈ Z,

so by taking the limit as k → ∞ and by using the continuity of φ, we obtain

    φ(x, z̄) ≥ φ(x, z),     ∀ z ∈ Z.

Therefore, z̄ ∈ Z(x). We now have

    f'(x; y) ≤ ( f(x + α_k y) − f(x) ) / α_k
             = ( φ(x + α_k y, z_k) − φ(x, z̄) ) / α_k
             ≤ ( φ(x + α_k y, z_k) − φ(x, z_k) ) / α_k              (3.39)
             ≤ −φ'(x + α_k y, z_k; −y)
             ≤ φ'(x + α_k y, z_k; y),

where the last inequality follows from the fact −f'(x; −y) ≤ f'(x; y). We apply
the result of Exercise 3.4 to the functions f_k defined by f_k(·) = φ(·, z_k), and with
x_k = x + α_k y, to obtain

    limsup_{k→∞} φ'(x + α_k y, z_k; y) ≤ φ'(x, z̄; y).     (3.40)

We take the limit in inequality (3.39) as k → ∞, and we use inequality (3.40) to
conclude that

    f'(x; y) ≤ φ'(x, z̄; y).
This relation together with inequality (3.38) proves Eq. (3.37).
For the last statement of part (a), if Z(x) consists of the unique point z̄,
the differentiability assumption on φ and Eq. (3.37) yield

    f'(x; y) = φ'(x, z̄; y) = y'∇_x φ(x, z̄),     ∀ y ∈ ℜn,

which implies that ∇f(x) = ∇_x φ(x, z̄).
(b) By part (a), we have

    f'(x; y) = max_{z∈Z(x)} ∇_x φ(x, z)'y,

while by Prop. 3.1.1(a),

    f'(x; y) = max_{d∈∂f(x)} d'y.

For all z ∈ Z(x) and y ∈ ℜn, we have

    f(y) = max_{z̄∈Z} φ(y, z̄)
         ≥ φ(y, z)
         ≥ φ(x, z) + ∇_x φ(x, z)'(y − x)
         = f(x) + ∇_x φ(x, z)'(y − x).

Therefore, ∇_x φ(x, z) is a subgradient of f at x, implying that

    conv{ ∇_x φ(x, z) | z ∈ Z(x) } ⊂ ∂f(x).

To prove the reverse inclusion, we use a hyperplane separation argument. By the
continuity of ∇_x φ(x, ·) and the compactness of Z, we see that Z(x) is compact,
and therefore also the set { ∇_x φ(x, z) | z ∈ Z(x) } is compact. By Prop. 1.2.2 in
Appendix B, it follows that conv{ ∇_x φ(x, z) | z ∈ Z(x) } is compact. If d ∈ ∂f(x)
while d ∉ conv{ ∇_x φ(x, z) | z ∈ Z(x) }, by the Strict Separation Theorem (Prop.
1.5.3 in Appendix B), there exists y ≠ 0 and γ ∈ ℜ such that

    d'y > γ > ∇_x φ(x, z)'y,     ∀ z ∈ Z(x).

Therefore, we have

    d'y > max_{z∈Z(x)} ∇_x φ(x, z)'y = f'(x; y),

contradicting Prop. 3.1.1(a). Therefore,

    ∂f(x) ⊂ conv{ ∇_x φ(x, z) | z ∈ Z(x) },

and the proof is complete.
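As a quick numerical illustration of formula (3.37), the following Python snippet
(not from the book) takes φ(x, z) = zx with the finite set Z = {−1, 1}, so that
f(x) = |x|, and compares the directional derivative computed from the maximizing
set Z(x) with a one-sided difference quotient; the tolerance and test points are
arbitrary choices.

```python
# Numerical check of Danskin's formula f'(x; y) = max over Z(x) of φ'(x, z; y),
# for φ(x, z) = z*x and Z = {-1, 1}, so that f(x) = |x|.
import numpy as np

Z = np.array([-1.0, 1.0])

def phi(x, z):
    return z * x

def f(x):
    return np.max(phi(x, Z))

def danskin_dir_deriv(x, y, tol=1e-12):
    vals = phi(x, Z)
    Zx = Z[vals >= vals.max() - tol]          # maximizing set Z(x)
    return np.max(Zx * y)                     # max over Z(x) of φ'(x, z; y) = z*y

def one_sided_quotient(x, y, a=1e-8):
    return (f(x + a * y) - f(x)) / a          # approximates f'(x; y)

for x, y in [(0.0, 1.0), (0.0, -2.0), (0.5, -1.0)]:
    print(x, y, danskin_dir_deriv(x, y), one_sided_quotient(x, y))
```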

3.6 (Convergence of Subgradient Method with Diminishing


Stepsize Under Weaker Conditions)

This exercise shows an enhanced version of Prop. 3.2.6, whereby we assume that
for some scalar c, we have

    ∀ k,     (3.41)

in place of the stronger Assumption 3.2.1. Assume also that X* is nonempty and
that

    Σ_{k=0}^∞ α_k = ∞,     Σ_{k=0}^∞ α_k² < ∞.     (3.42)

Show that {x_k} converges to some optimal solution. Abbreviated proof: Similar
to the proof of Prop. 3.2.6 [cf. Eq. (3.18)], we apply Prop. 3.2.2(a) with y equal
to any x* ∈ X*, and then use the assumption (3.41) to obtain

    (3.43)

In view of the assumption (3.42), the convergence result of Prop. A.4.4 of Ap-
pendix A applies, and shows that {x_k} is bounded and that liminf_{k→∞} f(x_k) =
f*. From this point the proof follows the one of Prop. 3.2.6.

3.7 (Convergence Rate of Subgradient Method with Dynamic


Stepsize)

Consider the subgradient method x_{k+1} = P_X(x_k − α_k g_k) with the dynamic step-
size rule

    α_k = ( f(x_k) − f* ) / ||g_k||²,     (3.44)

and assume that the optimal solution set X* is nonempty. Show that:
(a) {x_k} and {g_k} are bounded sequences. Proof: Let x* be an optimal solu-
tion. From Prop. 3.2.2(a), we have

    ||x_{k+1} − x*||² ≤ ||x_k − x*||² − 2α_k ( f(x_k) − f* ) + α_k² ||g_k||².

Using the stepsize form (3.44) in this relation, we obtain

    ||x_{k+1} − x*||² ≤ ||x_k − x*||² − ( f(x_k) − f* )² / ||g_k||².     (3.45)

Therefore

    ||x_{k+1} − x*|| ≤ ||x_k − x*|| ≤ ··· ≤ ||x_0 − x*||,

implying that {x_k} is bounded, and by Prop. 3.1.2, that {g_k} is bounded.
(b) (Sublinear Convergence) We have

    liminf_{k→∞} √k ( f(x_k) − f* ) = 0.

Proof: Assume to obtain a contradiction that there is an ε > 0 and an index K
such that √k ( f(x_k) − f* ) ≥ ε for all k ≥ K. Then

    ( f(x_k) − f* )² ≥ ε²/k,     ∀ k ≥ K,

implying that

    Σ_{k=K}^∞ ( f(x_k) − f* )² ≥ ε² Σ_{k=K}^∞ 1/k = ∞.

On the other hand, by adding Eq. (3.45) over all k, and using the bound-
edness of {g_k}, shown in part (a), we have

    Σ_{k=0}^∞ ( f(x_k) − f* )² < ∞,

a contradiction.
(c) (Linear Convergence) Assume that there exists a scalar β > 0 such that

    f* + β d(x) ≤ f(x),     ∀ x ∈ X,     (3.46)

where we denote d(x) = min_{x*∈X*} ||x − x*||; this assumption, known as a
sharp minimum condition, is satisfied in particular if f and X are polyhe-
dral (see Prop. 5.1.6); problems where this condition holds have especially
favorable properties in several convex optimization algorithmic contexts
(see Exercise 3.10 and the exercises of Chapter 6). Show that for all k,

    d(x_{k+1}) ≤ ρ d(x_k),

where ρ = √(1 − β²/γ²) and γ is any upper bound to ||g_k|| with γ > β [cf.
part (a)]. Proof: From Eqs. (3.45), (3.46), we have for all k

    d(x_{k+1})² ≤ d(x_k)² − β² d(x_k)² / ||g_k||².

Using the fact sup_{k≥0} ||g_k|| ≤ γ, the desired relation follows.

3.8 (Subgradient Methods with Iterate Averaging [NJL09])

If the stepsize α_k in the subgradient method

    x_{k+1} = P_X(x_k − α_k g_k)

is chosen to be large (such as constant, or such that the condition Σ_{k=0}^∞ α_k² < ∞
is violated) the method may not converge. This exercise shows that by averaging
the iterates of the method, we may obtain convergence with larger stepsizes. Let
the optimal solution set X* be nonempty, and assume that for some scalar c, we
have

    c ≥ sup{ ||g_k|| | k = 0, 1, ... }

(cf. Assumption 3.2.1). Assume further that α_k is chosen according to

    α_k = θ / ( c √(k+1) ),     k = 0, 1, ...,

where θ is a positive constant. Show that

    f(x̄_k) − f* ≤ ( δ_0 + θ² ln(k + 2) ) / ( (θ/c) √(k+1) ),     k = 0, 1, ...,

where δ_0 = ½ min_{x*∈X*} ||x_0 − x*||², and x̄_k is the averaged iterate, generated
according to

    x̄_k = ( Σ_{ℓ=0}^k α_ℓ x_ℓ ) / ( Σ_{ℓ=0}^k α_ℓ ).

Note: The averaging approach seems to be less sensitive to the choice of step-
size parameters. Practical variants include restarting the method with the most
recent averaged iterate, and averaging over just a subset of recent iterates. A
similar analysis applies to incremental and to stochastic subgradient methods.
Abbreviated proof: Denote

    δ_k = ½ min_{x*∈X*} ||x_k − x*||².

Applying Prop. 3.2.2(a) with y equal to the projection of x_ℓ onto X*, we obtain

    δ_{ℓ+1} ≤ δ_ℓ − α_ℓ ( f(x_ℓ) − f* ) + ½ α_ℓ² c².

Adding this inequality from 0 to k, and using the fact δ_{k+1} ≥ 0,

    Σ_{ℓ=0}^k α_ℓ ( f(x_ℓ) − f* ) ≤ δ_0 + ½ c² Σ_{ℓ=0}^k α_ℓ²,

so by dividing with Σ_{ℓ=0}^k α_ℓ,

    ( Σ_{ℓ=0}^k α_ℓ f(x_ℓ) ) / ( Σ_{ℓ=0}^k α_ℓ ) − f* ≤ ( δ_0 + ½ c² Σ_{ℓ=0}^k α_ℓ² ) / ( Σ_{ℓ=0}^k α_ℓ ).

The convexity of f implies that

    f(x̄_k) ≤ ( Σ_{ℓ=0}^k α_ℓ f(x_ℓ) ) / ( Σ_{ℓ=0}^k α_ℓ ).

Combining the preceding two relations, we obtain

    f(x̄_k) − f* ≤ ( δ_0 + ½ c² Σ_{ℓ=0}^k α_ℓ² ) / ( Σ_{ℓ=0}^k α_ℓ ).

Substituting the stepsize expression α_ℓ = θ/(c√(ℓ+1)), we have

    f(x̄_k) − f* ≤ ( δ_0 + ½ θ² Σ_{ℓ=0}^k 1/(ℓ+1) ) / ( (θ/c) Σ_{ℓ=0}^k 1/√(ℓ+1) )
               ≤ ( δ_0 + θ² ln(k + 2) ) / ( (θ/c) √(k+1) ),

which implies the result.
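The following Python/NumPy sketch (not from the book) illustrates the averaging
scheme of this exercise on the simple example f(x) = ||x||_1 with X = ℜn, using
the stepsize α_k = θ/(c√(k+1)); the choice of f, of θ, and of the iteration count
are illustrative assumptions only.

```python
# Subgradient method with iterate averaging for f(x) = sum(|x_i|) over X = R^n.
# x_bar is the alpha-weighted average of x_0, ..., x_k.
import numpy as np

n, iters, theta = 5, 5000, 1.0
c = np.sqrt(n)                      # bound on ||g_k|| for f(x) = sum(|x_i|)
x = np.ones(n)                      # starting point x_0
weighted_sum, alpha_sum = np.zeros(n), 0.0

for k in range(iters):
    alpha = theta / (c * np.sqrt(k + 1))
    weighted_sum += alpha * x       # accumulate alpha_k * x_k before stepping
    alpha_sum += alpha
    g = np.sign(x)                  # a subgradient of the l1 norm at x
    x = x - alpha * g               # no projection needed since X = R^n

x_bar = weighted_sum / alpha_sum
print("f(x_k)   =", np.abs(x).sum())
print("f(x_bar) =", np.abs(x_bar).sum())
```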

3.9 (Modified Dynamic Stepsize Rules)

Consider the subgradient method

    x_{k+1} = P_X(x_k − α_k g_k),

with the stepsize chosen according to one of the two rules

    ∀ k ≥ 0,     (3.47)

where γ is a fixed positive scalar and f_k is given by the dynamic adjustment
procedure (3.22)-(3.23). Show that the convergence result of Prop. 3.2.8 still
holds. Abbreviated Proof: We proceed by contradiction, as in the proof of Prop.
3.2.8. From Prop. 3.2.2(a) with y = ȳ, we have for all k ≥ k̄,

    ||x_{k+1} − ȳ||² ≤ ||x_k − ȳ||² − 2α_k ( f(x_k) − f(ȳ) ) + α_k² ||g_k||²
                    ≤ ||x_k − ȳ||² − 2α_k ( f(x_k) − f(ȳ) ) + α_k ( f(x_k) − f_k )
                    = ||x_k − ȳ||² − α_k ( f(x_k) − f_k ) − 2α_k ( f_k − f(ȳ) )
                    ≤ ||x_k − ȳ||² − α_k ( f(x_k) − f_k ).

Hence {x_k} is bounded, which implies that {g_k} is also bounded (cf. Prop. 3.1.2).
Let c be such that ||g_k|| ≤ c for all k. Assume that α_k is chosen according to the
first rule in Eq. (3.47). Then from the preceding relation we have for all k ≥ k̄,

As in the proof of Prop. 3.2.8, this leads to a contradiction and the result follows.
The proof is similar if α_k is chosen according to the second rule in Eq. (3.47).

3.10 (Subgradient Methods with Low Level Errors for Sharp


Minima [NeB10])

Consider the problem of minimizing a convex function f : ℜn ↦ ℜ over a closed
convex set X, and assume that the optimal solution set X* is nonempty. The
purpose of this exercise is to show that under certain conditions, which are satis-
fied if f and X are polyhedral, the subgradient method is convergent even with
"small" errors in the calculation of the subgradient. Assume that for some β > 0,
we have

    f* + β d(x) ≤ f(x),     ∀ x ∈ X,     (3.48)

where f* = min_{x∈X} f(x) and d(x) = min_{x*∈X*} ||x − x*|| [a sharp minimum
condition; cf. Exercise 3.7(c)]. Consider the iteration

    x_{k+1} = P_X( x_k − α_k (g_k + e_k) ),

where for all k, g_k is a subgradient of f at x_k, and e_k is an error such that

    ||e_k|| ≤ ε,     k = 0, 1, ...,

where ε is some positive scalar with ε < β. Assume further that for some c > 0,
we have

    ||g_k|| ≤ c,     ∀ k ≥ 0,

cf. the subgradient boundedness Assumption 3.2.1.
(a) Show that if α_k is equal to some constant α for all k, then

    liminf_{k→∞} f(x_k) ≤ f* + α β (c + ε)² / ( 2(β − ε) ),     (3.49)

while if

    Σ_{k=0}^∞ α_k = ∞,

then liminf_{k→∞} f(x_k) = f*. Hint: Show that

and hence

(b) Use the scalar function f(x) = |x| to show that the estimate (3.49) is tight.

3.11 (ε-Complementary Slackness and ε-Subgradients)

The purpose of this exercise (based on unpublished joint work with P. Tseng)
is to show how to calculate ε-subgradients of the dual function of the separable
problem

    minimize Σ_{i=1}^n f_i(x_i)
    subject to Σ_{i=1}^n g_ij(x_i) ≤ 0,  j = 1, ..., r,     α_i ≤ x_i ≤ β_i,  i = 1, ..., n,

where f_i : ℜ ↦ ℜ, g_ij : ℜ ↦ ℜ are convex functions. For an ε > 0, we say that a
pair (x, μ) satisfies ε-complementary slackness if μ ≥ 0, x_i ∈ [α_i, β_i] for all i, and

    0 ≤ f_i^+(x_i) + Σ_{j=1}^r μ_j g_ij^+(x_i) + ε,  ∀ i ∈ Γ⁻,
    f_i^−(x_i) + Σ_{j=1}^r μ_j g_ij^−(x_i) − ε ≤ 0,  ∀ i ∈ Γ⁺,

where Γ⁻ = { i | x_i < β_i }, Γ⁺ = { i | α_i < x_i }, and f_i^−, g_ij^− and f_i^+, g_ij^+
denote the left and right derivatives of f_i, g_ij, respectively. Show that if (x, μ) sat-
isfies ε-complementary slackness, the r-dimensional vector with jth component
Σ_{i=1}^n g_ij(x_i) is an ε̄-subgradient of the dual function q at μ, where

    ε̄ = ε Σ_{i=1}^n (β_i − α_i).

Note: The notion of ε-complementary slackness, sometimes also referred to as
ε-optimality, is important, among others, in network optimization algorithms,
dating to the auction algorithm of [Ber79], and the related ε-relaxation and pre-
flow push methods; see the books [BeT89a], [Ber98], and the references given
there.
4

Polyhedral Approximation
Methods

Contents

4.1. Outer Linearization - Cutting Plane Methods p. 182


4.2. Inner Linearization - Simplicial Decomposition p. 188
4.3. Duality of Outer and Inner Linearization . p. 194
4.4. Generalized Polyhedral Approximation p. 196
4.5. Generalized Simplicial Decomposition . . p. 209
4.5.1. Differentiable Cost Case . . . . . . p. 213
4.5.2. Nondifferentiable Cost and Side Constraints p. 213
4.6. Polyhedral Approximation for Conic Programming p. 217
4.7. Notes, Sources, and Exercises . . . . . . . . . . p. 228


In this chapter, we discuss polyhedral approximation methods for minimiz-


ing a real-valued convex function f over a closed convex set X. Here we
generate a sequence {x_k} by solving at each k the approximate problem

    minimize F_k(x)
    subject to x ∈ X_k,

where F_k is a polyhedral function that approximates f and X_k is a poly-
hedral set that approximates X (in some variants only one of f or X is
approximated). The idea is that the approximate problem, thanks to its
polyhedral structure, may be easier to solve than the original. The methods
include mechanisms for progressively refining the approximation, thereby
obtaining a solution of the original problem in the limit.
We first discuss in Sections 4.1 and 4.2 the two main approaches
for polyhedral approximation: outer linearization (also called the cutting
plane approach) and inner linearization (also called the simplicial decom-
position approach). In Section 4.3, we show how these two approaches are
intimately connected by conjugacy and duality. In Section 4.4, we gener-
alize our framework for polyhedral approximation when the cost function
is a sum of two or more convex (possibly nondifferentiable) component
functions. Duality plays a central role here: each generalized polyhedral
approximation algorithm has a dual where the roles of outer and inner
linearization are exchanged.
Our generalized polyhedral approximation approach of Section 4.4 not
only connects outer and inner linearization, but also gives rise to a diverse
class of methods that can exploit a broad variety of special structures.
There are two characteristics that are important in this respect:
(a) Multiple component functions can be linearized individually, based
on the solution of the approximate problem. This speeds up the
approximation process, and can exploit better the special structure
of the cost function components.
(b) Outer and inner linearization may be simultaneously applied to differ-
ent component functions. This allows additional flexibility to exploit
special features of the problem at hand.
In Section 4.5, we consider various special cases of the framework of Section
4.4, some involving large-scale network flow problems, while in Section
4.6, we develop algorithmic variants that are useful when there are conic
constraints. In these sections we focus on inner linearization, although
there are similar (dual) algorithms based on outer linearization.

4.1 OUTER LINEARIZATION - CUTTING PLANE METHODS

Cutting plane methods are rooted in the representation of a closed con-


vex set as the intersection of its supporting halfspaces, cf. Prop. 1.5.4 in

Figure 4.1.1. Illustration of the cutting plane method. With each new iterate x_k,
a new hyperplane f(x_k) + (x − x_k)'g_k is added to the polyhedral approximation
of the cost function, where g_k is a subgradient of f at x_k.

Appendix B. The idea is to approximate either the constraint set or the


epigraph of the cost function by the intersection of a limited number of half-
spaces, and to gradually refine the approximation by generating additional
halfspaces through the use of subgradients.
Throughout this section, we consider the problem of minimizing a
convex function f : ℜn ↦ ℜ over a closed convex set X. In the simplest
cutting plane method, we start with a point x0 ∈ X and a subgradient
g0 ∈ ∂f(x0). At the typical iteration we solve the approximate problem

minimize Fk (x)
subject to x E X,

where f is replaced by a polyhedral approximation Fk, constructed us-


ing the points xo, ... , Xk generated so far, and associated subgradients
g0, ..., g_k, with g_i ∈ ∂f(x_i) for all i ≤ k. In particular, for k = 0, 1, ..., we
define

    F_k(x) = max{ f(x0) + (x − x0)'g0, ..., f(x_k) + (x − x_k)'g_k },     (4.1)

and compute x_{k+1} that minimizes F_k(x) over x ∈ X,

    x_{k+1} ∈ arg min_{x∈X} F_k(x);     (4.2)

see Fig. 4.1.1. We assume that the minimum of Fk(x) above is attained
for all k. For those k for which this is not guaranteed (as may happen in
the early iterations if X is unbounded), artificial bounds may be placed on
the components of x, so that the minimization will be carried out over a
compact set and consequently the minimum will be attained by Weierstrass'
Theorem.
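As a concrete illustration, the following Python sketch (not from the book) implements
the iteration (4.1)-(4.2) for a simple quadratic cost over a box, solving each polyhedral
subproblem as a linear program in (x, ξ) with scipy.optimize.linprog; the test function,
the box X, and the iteration count are illustrative assumptions.

```python
# Minimal cutting plane sketch:  x_{k+1} minimizes F_k over a box X.
import numpy as np
from scipy.optimize import linprog

a = np.array([1.0, 0.5])
lo, hi = -2.0, 2.0                      # X is the box [lo, hi]^2

def f(x):
    return float((x - a) @ (x - a))

def subgrad(x):
    return 2.0 * (x - a)                # gradient (a fortiori a subgradient)

n = 2
points, grads = [], []
x = np.array([lo, lo])                  # x_0 in X

for k in range(15):
    points.append(x.copy())
    grads.append(subgrad(x))
    # Solve  min_{x in X} F_k(x)  as an LP in (x, xi):
    #   min xi  s.t.  g_j'x - xi <= g_j'x_j - f(x_j)  for all j.
    m = len(points)
    A = np.hstack([np.array(grads), -np.ones((m, 1))])
    b = np.array([g @ p - f(p) for p, g in zip(points, grads)])
    c = np.r_[np.zeros(n), 1.0]
    res = linprog(c, A_ub=A, b_ub=b,
                  bounds=[(lo, hi)] * n + [(None, None)])
    x = res.x[:n]
    lower = res.x[n]                    # F_k(x_{k+1}), a lower bound on f*
    upper = min(f(p) for p in points)   # best cost so far, an upper bound
    print(k, f(x), lower, upper)
```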

The following proposition establishes the associated convergence prop-


erties.

Proposition 4.1.1: Every limit point of a sequence {xk} generated


by the cutting plane method is an optimal solution.

Proof: Since for all j, g_j is a subgradient of f at x_j, we have

    f(x_j) + (x − x_j)'g_j ≤ f(x),     ∀ x ∈ X,

so from the definitions (4.1) and (4.2) of F_k and x_k, it follows that

    f(x_j) + (x_k − x_j)'g_j ≤ F_{k−1}(x_k) ≤ F_{k−1}(x) ≤ f(x),     ∀ x ∈ X,  j < k.     (4.3)

Suppose that a subsequence {x_k}_K converges to x̄. Then, since X is closed,
we have x̄ ∈ X, and by using Eq. (4.3), we obtain for all k and all j < k,

    f(x_j) + (x_k − x_j)'g_j ≤ F_{k−1}(x_k) ≤ f(x̄).

By taking the upper limit above as j → ∞, k → ∞, j < k, j ∈ K, k ∈ K,
we obtain

    limsup_{j→∞, k→∞, j<k, j∈K, k∈K} { f(x_j) + (x_k − x_j)'g_j }
        ≤ limsup_{k→∞, k∈K} F_{k−1}(x_k) ≤ f(x̄).     (4.4)

Since the subsequence {x_k}_K is bounded and the union of the subdif-
ferentials of a real-valued convex function over a bounded set is bounded (cf.
Prop. 3.1.2), it follows that the subgradient subsequence {g_j}_K is bounded.
Moreover, we have

    lim_{j→∞, k→∞, j<k, j∈K, k∈K} (x_k − x_j) = 0,

so that

    lim_{j→∞, k→∞, j<k, j∈K, k∈K} (x_k − x_j)'g_j = 0.     (4.5)

Also by the continuity of f, we have

    lim_{j→∞, j∈K} f(x_j) = f(x̄).     (4.6)

Combining Eqs. (4.4)-(4.6), we obtain

    limsup_{k→∞, k∈K} F_{k−1}(x_k) = f(x̄).

This relation together with Eq. (4.3) yields

    f(x̄) ≤ f(x),     ∀ x ∈ X,

showing that x̄ is an optimal solution. Q.E.D.

In practice, it is common to use the inequalities

    F_k(x_{k+1}) ≤ f* ≤ min_{0≤i≤k} f(x_i),     k = 0, 1, ...,

to bound the optimal value f* of the problem. In such a scheme, the
iterations are stopped when the upper and lower bound difference

    min_{0≤i≤k} f(x_i) − F_k(x_{k+1})

comes within some small tolerance.
An important special case arises when f is polyhedral of the form

    f(x) = max_{i∈I} { a_i'x + b_i },     (4.7)

where I is a finite index set, and a_i and b_i are given vectors and scalars,
respectively. Then, any vector a_{i_k} that maximizes a_i'x_k + b_i over {a_i | i ∈ I}
is a subgradient of f at x_k (cf. Example 3.1.1). We assume that the cutting
plane method selects such a vector at iteration k, call it a_{i_k}. We also assume
that the method terminates when

    F_{k−1}(x_k) = f(x_k).

Then, since F_{k−1}(x) ≤ f(x) for all x ∈ X and x_k minimizes F_{k−1} over X,
we see that, upon termination, Xk minimizes f over X and is therefore op-
timal. The following proposition shows that the method converges finitely;
see also Fig. 4.1.2.

Proposition 4.1.2: Assume that the cost function f is polyhedral of


the form (4.7). Then the cutting plane method, with the subgradi-
ent selection and termination rules just described, obtains an optimal
solution in a finite number of iterations.

Figure 4.1.2. Illustration of the finite convergence property of the cutting plane
method in the case where f is polyhedral. What happens here is that if x_k is not
optimal, a new cutting plane will be added at the corresponding iteration, and
there can be only a finite number of cutting planes.

Proof: If (a_{i_k}, b_{i_k}) is equal to some pair (a_{i_j}, b_{i_j}) generated at some earlier
iteration j < k, then

    f(x_k) = a_{i_k}'x_k + b_{i_k} = a_{i_j}'x_k + b_{i_j} ≤ F_{k−1}(x_k) ≤ f(x_k),

where the first inequality follows since a_{i_j}'x_k + b_{i_j} corresponds to one of the
hyperplanes defining F_{k−1}, and the last inequality follows from the fact
F_{k−1}(x) ≤ f(x) for all x ∈ X. Hence equality holds throughout in the
preceding relation, and it follows that the method terminates if the pair
(a_{i_k}, b_{i_k}) has been generated at some earlier iteration. Since the number of
pairs (a_i, b_i), i ∈ I, is finite, the method must terminate finitely. Q.E.D.

Despite the finite convergence property shown in Prop. 4.1.2, the


cutting plane method has several drawbacks:
(a) It can take large steps away from the optimum, resulting in large
cost increases, even when it is close to (or even at) the optimum.
For example, in Fig. 4.1.2, f (x1) is much larger than f (xo). This
phenomenon is referred to as instability, and has another undesirable
effect, namely that the current point Xk may not be a good starting
point for the algorithm that minimizes the new approximate cost
function Fk(x) over X.
(b) The number of subgradients used in the cutting plane approximation
F_k increases without bound as k → ∞, leading to a potentially large
and difficult optimization problem to find x_k. To remedy this, one
may occasionally discard some of the cutting planes. To guarantee
convergence, it is essential to do so only at times when improvement
in the cost is recorded, e.g., f(x_k) ≤ min_{j<k} f(x_j) − δ for some small
positive δ. Still one has to be judicious about discarding cutting
planes, as some of them may reappear later.

(c) The convergence is often slow. Indeed, for challenging problems, even
when f is polyhedral, one should base termination on the upper and
lower bounds

    F_k(x_{k+1}) ≤ f* ≤ min_{0≤i≤k} f(x_i),

rather than wait for finite termination to occur.


To overcome some of the limitations of the cutting plane method, a
number of variants have been proposed, some of which are discussed in the
present section. In Chapter 5, we will discuss additional methods, including
bundle methods, which are aimed at limiting the effects of instability though
combinations with the proximal algorithm.

Partial Cutting Plane Methods

In some cases the cost function has the form

f(x) + c(x),
where f : X ↦ ℜ and c : X ↦ ℜ are convex functions, but one of them,
say c, is convenient for optimization, e.g., is quadratic. It may then be
preferable to use a piecewise linear approximation of f only, while leaving
c unchanged. This leads to a partial cutting plane algorithm, involving
solution of the problems

minimize Fk(x) + c(x)


subject to x E X,

where as before

    F_k(x) = max{ f(x0) + (x − x0)'g0, ..., f(x_k) + (x − x_k)'g_k },

with g_j ∈ ∂f(x_j) for all j, and x_{k+1} minimizes F_k(x) + c(x) over x ∈ X,

    x_{k+1} ∈ arg min_{x∈X} { F_k(x) + c(x) }.

The convergence properties of this algorithm are similar to the ones


shown earlier. In particular, if f is polyhedral, the method terminates
finitely, cf. Prop. 4.1.2. The idea of partial piecewise approximation can
be generalized to the case of more than two cost function components and
arises also in a few other contexts to be discussed later in Sections 4.4-4.6.

Linearly Constrained Versions

Consider the case where the constraint set X is polyhedral of the form

    X = { x | c_i'x + d_i ≤ 0, i ∈ I },

where I is a finite set, and c_i and d_i are given vectors and scalars, respec-
tively. Let

    p(x) = max_{i∈I} { c_i'x + d_i },

so the problem is to minimize f(x) subject to p(x) ≤ 0. It is then possible
to consider a variation of the cutting plane method, where both functions f
and p are replaced by polyhedral approximations F_k and P_k, respectively:

    minimize F_k(x)
    subject to P_k(x) ≤ 0.

As earlier,

    F_k(x) = max{ f(x0) + (x − x0)'g0, ..., f(x_k) + (x − x_k)'g_k },


with g_j being a subgradient of f at x_j. The polyhedral approximation P_k
is given by

    P_k(x) = max_{i∈I_k} { c_i'x + d_i },

where I_k is a subset of I generated as follows: I_0 is an arbitrary subset of
I, and I_k is obtained from I_{k−1} by setting I_k = I_{k−1} if p(x_k) ≤ 0, and by
adding to I_{k−1} one or more of the indices i ∉ I_{k−1} such that c_i'x_k + d_i > 0
otherwise.
Note that this method applies even when f is a linear function. In
this case there is no cost function approximation, i.e., Fk = f, just outer
approximation of the constraint set, i.e., X ⊂ { x | P_k(x) ≤ 0 }.
The convergence properties of the method are very similar to the
ones of the earlier method. In particular, propositions analogous to Props.
4.1.1 and 4.1.2 can be formulated and proved. There are also versions of
this method where X is a general closed convex set, which is iteratively
approximated by a polyhedral set. Variants of this type will be discussed
later in Sections 4.4 and 4.5.

4.2 INNER LINEARIZATION - SIMPLICIAL DECOMPOSITION

In this section we consider an inner approximation approach for the prob-


lem of minimizing a convex function f : ℜn ↦ ℜ over a closed convex
set X. In particular, we approximate X with the convex hull of an ever
expanding finite set X_k ⊂ X that consists of extreme points of X plus

an arbitrary starting point x0 ∈ X.† The addition of new extreme points
to X_k is done in a way that guarantees a cost improvement each time we
minimize f over conv(X_k) (unless we are already at the optimum).
In this section we assume a differentiable convex cost function f :
ℜn ↦ ℜ and a bounded polyhedral constraint set X. The method is then
appealing under two conditions:
(1) Minimizing a linear function over X is much simpler than minimizing
f over X. (The method makes sense only if f is nonlinear.)
(2) Minimizing f over the convex hull of a relatively small number of ex-
treme points is much simpler than minimizing f over X. The method
makes sense only if X has a large number of extreme points.
Several classes of important large-scale problems, arising for example in
communication and transportation networks, have structure that satisfies
these conditions (see the discussion on multicommodity flows later in this
section, and the end-of-chapter references).
Note that the minimization of f over the convex hull of m points
x1, ... , Xm is a differentiable m-dimensional optimization problem over a
simplex:

    minimize φ(α_1, ..., α_m) = f( Σ_{j=1}^m α_j x_j )     (4.8)
    subject to Σ_{j=1}^m α_j = 1,  α_j ≥ 0,  j = 1, ..., m.

The function φ inherits its smoothness properties from f, and in particular,
if f is twice differentiable, so is φ, and the problem above can be solved
with Newton-like versions of the two-metric projection method.
Note also that the solution of the above problem may be simplified
by exploiting special structure that may be present in f. For example, if
f(x) = h(Ax) and computation of Ax is by far the most expensive part
of computing f(x), then one may compute ȳ_j = Ax_j just once for each j,
and write the cost function of problem (4.8) in the inexpensively computed
form

    φ(α_1, ..., α_m) = h( Σ_{j=1}^m α_j ȳ_j ).

This simplification applies to other inner linearization algorithms of this


chapter as well, and depending on the available structure, can be exploited
to solve problems where x has extraordinarily large dimension (see the
subsequent discussion on multicommodity flows in this section).

† Extreme points and related notions of polyhedral convexity are discussed


in Sections 2.1-2.4 of Appendix B.

Figure 4.2.1. Successive iterates of the simplicial decomposition method. For


example, the figure shows how given the initial point x0, and the calculated
extreme points x̃0, x̃1, we determine the next iterate x2 as a minimizing point of
f over the convex hull of {x0, x̃0, x̃1}. At each iteration, a new extreme point of
X is added, and after four iterations, the optimal solution is obtained.

At the typical iteration of the simplest type of inner linearization


algorithm (also called simplicial decomposition) we have the current iterate
x_k, and the finite set X_k that consists of the starting point x0 together with
a finite collection of extreme points of X (initially X_0 = {x0}). We first
generate x̃_k as an extreme point of X that solves the linear program

    minimize ∇f(x_k)'(x − x_k)     (4.9)
    subject to x ∈ X.

We then add x̃_k to X_k,

    X_{k+1} = {x̃_k} ∪ X_k,

and we generate x_{k+1} as an optimal solution of the problem

    minimize f(x)     (4.10)
    subject to x ∈ conv(X_{k+1}).

Note that this is a problem of the form (4.8). The process is illustrated in
Fig. 4.2.1.
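The following Python/SciPy sketch (not from the book) illustrates the iteration
(4.9)-(4.10) for a small quadratic cost over the unit box; problem (4.9) is solved
with scipy.optimize.linprog, and the simplex subproblem (4.8) with a general-purpose
solver (SLSQP). The problem data and the iteration limit are illustrative assumptions.

```python
# Minimal simplicial decomposition sketch for f(x) = ||x - a||^2 over X = [0,1]^2.
import numpy as np
from scipy.optimize import linprog, minimize

a = np.array([0.6, 0.8])
lo, hi = 0.0, 1.0                        # X = [0, 1]^2 (a bounded polyhedron)

def f(x):
    return float((x - a) @ (x - a))

def grad_f(x):
    return 2.0 * (x - a)

x = np.array([1.0, 0.0])                 # x_0 in X
X_pts = [x.copy()]                       # the set X_k (starting point plus extreme points)

for k in range(6):
    # Step (4.9): extreme point minimizing the linearized cost over X.
    res = linprog(grad_f(x), bounds=[(lo, hi)] * 2)
    x_tilde = res.x
    if grad_f(x) @ (x_tilde - x) >= -1e-10:
        break                            # optimality condition holds; x is optimal
    X_pts.append(x_tilde)
    # Step (4.10): minimize f over conv(X_pts), i.e., over the simplex of weights
    # alpha (problem (4.8)).
    P = np.array(X_pts)                  # rows are the points of X_{k+1}
    m = len(X_pts)
    phi = lambda alpha: f(alpha @ P)
    cons = {'type': 'eq', 'fun': lambda alpha: alpha.sum() - 1.0}
    out = minimize(phi, np.ones(m) / m, method='SLSQP',
                   bounds=[(0.0, 1.0)] * m, constraints=cons)
    x = out.x @ P
    print(k, x, f(x))
```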

The following proposition shows finite convergence of the method.

Proposition 4.2.1: Assume that the cost function f is convex and


differentiable and the constraint set X is bounded and polyhedral.
Then the simplicial decomposition method obtains an optimal solution
in a finite number of iterations.

Proof: There are two possibilities for the extreme point x̃_k that minimizes
∇f(x_k)'(x − x_k) over x ∈ X [cf. problem (4.9)]:
(a) We have

    0 ≤ ∇f(x_k)'(x̃_k − x_k),

in which case x_k minimizes f over X, since it satisfies the necessary
and sufficient optimality condition of Prop. 1.1.8 in Appendix B.
(b) We have

    ∇f(x_k)'(x̃_k − x_k) < 0,     (4.11)

in which case x̃_k ∉ conv(X_k), since x_k minimizes f over x ∈ conv(X_k),
so that ∇f(x_k)'(x − x_k) ≥ 0 for all x ∈ conv(X_k).
Since case (b) cannot occur an infinite number of times (x̃_k ∉ X_k and X
has finitely many extreme points, cf. Prop. 2.3.3 in Appendix B), case (a)
must eventually occur, so the method will find a minimizer of f over X in
a finite number of iterations. Q.E.D.

The simplicial decomposition method has been applied to several


types of problems that have a suitable structure. Experience has generally
been favorable and suggests that the method requires a lot fewer iterations
than the cutting plane method that uses an outer approximation of the
constraint set. As an indication of this, we note that if f is linear, the
simplicial decomposition method terminates in a single iteration, whereas
the cutting plane method may require a very large number of iterations to
attain the required solution accuracy. Moreover simplicial decomposition
does not exhibit the kind of instability phenomenon that is typical of the
cutting plane method. In particular, once an optimal solution belongs to
Xk, the method will terminate at the next iteration. By contrast, the cut-
ting plane method, even after generating an optimal solution, it may move
away from that solution.
The method is also asymptotically much faster than the conditional
gradient method, which is similar and can exploit similar problem structure.
Indeed the simplicial decomposition and the conditional gradient methods
require solution of the same linear cost problem (4.9) to obtain x̃_k at each
iteration. They differ only in that the former requires minimization of f
over the convex hull of a finite number of points [cf. problem (4.8)], while
the latter requires a search over the line segment [x_k, x̃_k].

Variants of the Simplicial Decomposition Method

We will now discuss some variations and extensions of the simplicial de-
composition method. The essence of the convergence proof of Prop. 4.2.1
is that the extreme point x̃_k does not belong to X_k, unless the optimal so-
lution has been reached. Thus it is not necessary that x̃_k solves exactly the
linearized problem (4.9). Instead it is sufficient that x̃_k is an extreme point
and that the inner product ∇f(x_k)'(x̃_k − x_k) is negative [cf. Eq. (4.11)].
This idea may be used in variants of the simplicial decomposition method
whereby ∇f(x_k)'(x − x_k) is minimized inexactly over x ∈ X. Moreover, one
may add multiple extreme points x̃_k, as long as they satisfy the condition
∇f(x_k)'(x̃_k − x_k) < 0.
There are a few other variants of the method. For example to address
the case where X is an unbounded polyhedral set, one may augment X with
additional constraints to make it bounded (an alternative for the case where
X is a cone is discussed in Section 4.6). There are extensions that allow
for a nonpolyhedral constraint set, which is approximated by the convex
hull of some of its extreme points in the course of the algorithm; see the
discussion in Sections 4.4-4.6. Finally, one may use variants, known as
restricted simplicial decomposition methods, which allow discarding some
of the extreme points generated so far. In particular, given the minimum
x_{k+1} of f over conv(X_{k+1}) [cf. problem (4.10)], we may discard from X_{k+1} all
points x̃ such that

    ∇f(x_{k+1})'(x̃ − x_{k+1}) > 0,

while possibly augmenting the constraint set with the additional constraint

    f(x) ≤ f(x_{k+1}).     (4.12)

The idea is that the costs of the subsequent points x_{k+2}, x_{k+3}, ..., generated
by the method will all be no greater than the cost of x_{k+1}, so they will
satisfy the constraint (4.12).
In fact a stronger result can be shown: any number of extreme points
may be discarded, as long as conv(X_{k+1}) contains x_k and x̃_k. The proof is
based on the theory of feasible direction methods (cf. Section 2.1.2), and
the fact that x̃_k − x_k is a descent direction for f, since if x_k is not optimal,
we have

    ∇f(x_k)'(x̃_k − x_k) < 0,

so a point with improved cost can be found along the line segment con-
necting x_k and x̃_k. Indeed, the method that discards all previous points
x0, x̃0, ..., x̃_{k−1}, replacing them with just x̃_k, is essentially the same as the
conditional gradient method that was discussed in Section 2.1.2.

In Section 4.5, we will discuss additional variations and extensions of


the simplicial decomposition method, where among others, we will allow f
to be nondifferentiable and X to be nonpolyhedral. We will also allow the
presence of additional inequality constraints, which are not approximated
by linearization. Moreover, in Section 4.6, we will discuss specialized simpli-
cial decomposition methods for conical constraints, a case of an unbounded
constraint set.

Simplicial Decomposition and Multicommodity Flows

Let us now discuss an important application in network optimization, de-


scribed in Example 1.4.5. As noted earlier, simplicial decomposition is
well suited to problems where (a) minimizing a linear function over X is
much simpler than minimizing f over X, and (b) minimizing f over the
convex hull of a relative small number of extreme points is much simpler
than minimizing f over X. These two conditions are eminently satisfied in
multicommodity network flow problems such as the one of Example 1.4.5.
Here we have a set W of origin-destination pairs, where the origin
and destination are some distinct nodes of a given directed graph. Traffic
of some kind (cars, material, or packets of information, for example) enters
the origins and must be routed to the corresponding destinations, while
minimizing a certain cost. We denote by r_w the input traffic of w ∈ W
(a given positive scalar), and we are given a set P_w of paths (possibly
all acyclic paths) that start at the origin and end at the destination of
w. We wish to divide each r_w into path flows x_p ≥ 0, p ∈ P_w, such
that Σ_{p∈P_w} x_p = r_w. The optimization variables are the path flows x_p,
p ∈ P_w, w ∈ W, and we denote by x the generic vector of path flows,
x = {x_p | p ∈ P_w, w ∈ W}. The problem is

    minimize D(x) = Σ_{(i,j)} D_ij(F_ij)
    subject to Σ_{p∈P_w} x_p = r_w,  ∀ w ∈ W,
               x_p ≥ 0,  ∀ p ∈ P_w, w ∈ W,

where F_ij is the total flow that passes through arc (i, j):

    F_ij = Σ_{all paths p containing (i,j)} x_p,     (4.13)

and Dij is a differentiable monotonically increasing convex one-dimensional


function for each arc (i,j).
Sometimes D has a more complicated form, and there may be ad-
ditional constraints on the total flows Fij, but we restrict ourselves to

the problem above, which is the "standard" formulation. Later in Section


4.5.2, we will discuss more general versions of simplicial decomposition,
which may apply to more complex multicommodity flow problems. Note
that the cost function is of the form h(Ax) where x is the vector of path
flows x_p, and A is the matrix that maps x into the vector of arc flows F_ij.
Calculating F = Ax is far more complicated than calculating h(F), so an
important favorable structure noted earlier for the application of simplicial
decomposition is present in multicommodity flow problems.
Another major structural characteristic of the problem relates to the
linear approximation of the cost function

    ∇D(x_k)'(x − x_k) = Σ_{w∈W} Σ_{p∈P_w} Σ_{(i,j)∈p} ∇D_ij( Σ_{{p' | (i,j)∈p'}} x_{p',k} ) (x_p − x_{p,k})

at the kth iterate x_k of the simplicial decomposition method. Here x_{p,k} de-
notes the path flow component of x_k that goes through path p and "(i,j) ∈
p" means that (i,j) is part of path p [we use Eq. (4.13) in the preceding
expression]. The key fact is that minimizing this linear approximation over
the constraint set is a shortest path problem, which can be solved with
very fast algorithms: the length of arc (i,j) is ∇D_ij( Σ_{{p | (i,j)∈p}} x_{p,k} ),
the length of path p is the sum of the lengths of the arcs on the path, and
the computation of the path of minimum length over paths p ∈ P_w can
be done separately for each w ∈ W. Once the shortest path for each w is
determined, the input flow r_w is placed on that shortest path, and the new
extreme point x̃_k is the flow vector formed by these shortest path flows.
We also note that the minimization of D over the convex hull of the
extreme points forming Xk+ 1 [cf. Eq. (4.10)] is a low-dimensional problem
that can be conveniently solved by two-metric Newton-like methods (in
practice, few extreme points are typically required). In conclusion, the mul-
ticommodity flow problem combines all the important structural elements
that are necessary for the effective application of simplicial decomposition.
We refer to the end-of-chapter references for further discussion, including
the application of alternative algorithms.
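To make the shortest-path linearization concrete, here is a small Python sketch
(not from the book) that, given current path flows on a toy network, computes the
arc flows (4.13), prices each arc with the derivative of D_ij, and forms the new
extreme point by placing each r_w on its shortest path within the given set P_w.
The network, the path sets, and the function D_ij are illustrative assumptions.

```python
# New extreme point generation for a toy multicommodity flow problem.
from collections import defaultdict

# Paths for each OD pair w, given as lists of arcs (i, j).
paths = {
    'w1': [[(1, 2), (2, 4)], [(1, 3), (3, 4)]],
    'w2': [[(2, 4)], [(2, 3), (3, 4)]],
}
r = {'w1': 10.0, 'w2': 5.0}

# Current path flows x_{p,k} (indexed by OD pair and path number).
x = {'w1': [6.0, 4.0], 'w2': [5.0, 0.0]}

def d_prime(F):
    return 1.0 + 0.2 * F              # derivative of D_ij(F) = F + 0.1 F^2

# Total arc flows F_ij, cf. Eq. (4.13).
F = defaultdict(float)
for w, plist in paths.items():
    for p, flow in zip(plist, x[w]):
        for arc in p:
            F[arc] += flow

# Arc lengths and the all-or-nothing assignment on shortest paths.
length = {arc: d_prime(Fa) for arc, Fa in F.items()}
x_tilde = {}
for w, plist in paths.items():
    path_len = [sum(length.get(arc, d_prime(0.0)) for arc in p) for p in plist]
    best = min(range(len(plist)), key=path_len.__getitem__)
    x_tilde[w] = [r[w] if j == best else 0.0 for j in range(len(plist))]

print(x_tilde)   # the new extreme point (shortest-path flow vector)
```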

4.3 DUALITY OF INNER AND OUTER LINEARIZATION

We have considered so far cutting plane and simplicial decomposition meth-


ods, and we will now aim to connect them via duality. To this end, we define
in this section outer and inner linearizations, and we formalize their con-
jugacy relation and other related properties. An outer linearization of a
closed proper convex function f : ℜn ↦ (−∞, ∞] is defined by a finite set
of vectors {y_1, ..., y_ℓ} such that for every j = 1, ..., ℓ, we have y_j ∈ ∂f(x_j)
for some x_j ∈ ℜn. It is given by

    F(x) = max_{j=1,...,ℓ} { f(x_j) + (x − x_j)'y_j },     x ∈ ℜn,     (4.14)

Figure 4.3.1. Illustration of the conjugate F* of an outer linearization F of a
convex function f defined by a finite set of "slopes" y_1, ..., y_ℓ and corresponding
points x_1, ..., x_ℓ such that y_j ∈ ∂f(x_j) for all j = 1, ..., ℓ. It is an inner lin-
earization of the conjugate f* of f, a piecewise linear function whose break points
are y_1, ..., y_ℓ.

and it is illustrated in the left side of Fig. 4.3.1. The choices of x_j such
that y_j ∈ ∂f(x_j) may not be unique, but result in the same function F(x):
the epigraph of F is determined by the supporting hyperplanes to the
epigraph of f with normals defined by y_j, and the points of support x_j are
immaterial. In particular, the definition (4.14) can be equivalently written
in terms of the conjugate f* of f as

    F(x) = max_{j=1,...,ℓ} { x'y_j − f*(y_j) },     (4.15)

using the relation x_j'y_j = f(x_j) + f*(y_j), which is implied by y_j ∈ ∂f(x_j)


(the Conjugate Subgradient Theorem, Prop. 5.4.3 in Appendix B).
Note that F(x) ≤ f(x) for all x, so as is true for any outer approxi-
mation of f, the conjugate F* satisfies F*(y) ≥ f*(y) for all y. Moreover,
it can be shown that F* is an inner linearization of the conjugate f*, as
illustrated in the right side of Fig. 4.3.1. Indeed we have, using Eq. (4.15),

    F*(y) = sup_{x∈ℜn} { y'x − F(x) }
          = sup_{x∈ℜn} { y'x − max_{j=1,...,ℓ} { x'y_j − f*(y_j) } }
          = sup_{ x∈ℜn, ξ∈ℜ, x'y_j − f*(y_j) ≤ ξ, j=1,...,ℓ } { y'x − ξ }.

By linear programming duality, the optimal value of the linear program in
(x, ξ) of the preceding equation can be replaced by the dual optimal value,
and we have with a straightforward calculation

    F*(y) = { inf_{ Σ_{j=1}^ℓ α_j y_j = y, Σ_{j=1}^ℓ α_j = 1, α_j ≥ 0 } Σ_{j=1}^ℓ α_j f*(y_j)
                                                 if y ∈ conv({y_1, ..., y_ℓ}),
              ∞                                  otherwise,
                                                                     (4.16)

where α_j is the dual variable of the constraint x'y_j − f*(y_j) ≤ ξ.
From this formula, it can be seen that F* is a piecewise linear ap-
proximation of f* with domain

    dom(F*) = conv({y_1, ..., y_ℓ}),

and "break points" at y_1, ..., y_ℓ with values equal to the corresponding
values of f*. In particular, as indicated in Fig. 4.3.1, the epigraph of F*
is the convex hull of the union of the vertical halflines corresponding to
y_1, ..., y_ℓ:

    epi(F*) = conv( ∪_{j=1,...,ℓ} { (y_j, w) | f*(y_j) ≤ w } ).     (4.17)

In what follows, by an inner linearization of a closed proper convex


function f* defined by a finite set {y_1, ..., y_ℓ} we will mean the function
F* given by Eq. (4.16). Note that not all sets {y_1, ..., y_ℓ} define conjugate
pairs of outer and inner linearizations via Eqs. (4.15) and (4.16), respec-
tively, within our framework: it is necessary that for every y_j there exists
x_j such that y_j ∈ ∂f(x_j), or equivalently that ∂f*(y_j) ≠ ∅ for all j [which
implies in particular that y_j ∈ dom(f*)]. By exchanging the roles of f and
f*, we also obtain a dual statement, namely that for a set {x_1, ..., x_ℓ} to
define an inner linearization of a closed proper convex function f as well as
an outer linearization of its conjugate f*, it is necessary that ∂f(x_j) ≠ ∅
for all j.

4.4 GENERALIZED POLYHEDRAL APPROXIMATION

We will now consider a unified framework for polyhedral approximation,


which combines the cutting plane and simplicial decomposition methods.
We consider the problem
    minimize Σ_{i=1}^m f_i(x_i)     (4.18)
    subject to x ∈ S,

where

    x = (x_1, ..., x_m)

is a vector in ℜ^{n_1+···+n_m}, with components x_i ∈ ℜ^{n_i}, i = 1, ..., m, and
f_i : ℜ^{n_i} ↦ (−∞, ∞] is a closed proper convex function for each i, and
S is a subspace of ℜ^{n_1+···+n_m}.
We refer to this as an extended monotropic program (EMP for short).†


A classical example of EMP is a single commodity network optimiza-
tion problem, where x_i represents the (scalar) flow of an arc of a directed
graph and S is the circulation subspace of the graph (see e.g., [Ber98]). Also
problems involving general linear constraints and an additive extended real-
valued convex cost function can be converted to EMP. In particular, the
problem

    minimize Σ_{i=1}^m f_i(x_i)     (4.19)
    subject to Ax = b,

where A is a given matrix and b is a given vector, is equivalent to

    minimize Σ_{i=1}^m f_i(x_i) + δ_Z(z)
    subject to Ax − z = 0,

where z is a vector of artificial variables, and δ_Z is the indicator function
of the set Z = {z | z = b}. This is an EMP with constraint subspace

    S = { (x, z) | Ax − z = 0 }.

When all components x_i are one-dimensional and the functions f_i are lin-
ear within dom(f_i), problem (4.19) reduces to a linear program. When
the functions f_i are positive semidefinite quadratic within dom(f_i), and
dom(f_i) are polyhedral, problem (4.19) reduces to a convex quadratic pro-
gram.
Note that while the vectors x 1 , ... , Xm appear independently in the
cost function
    Σ_{i=1}^m f_i(x_i),

they are coupled through the subspace constraint. This allows a variety of
transformations to the EMP format. For example, consider a cost function
of the form

    f(x) = F(x_1, ..., x_m) + Σ_{i=1}^m f_i(x_i),
† Monotropic programming, a class of problems introduced and extensively
analyzed in the book [Roc84], is the special case of problem (4.18) where each
component x_i is one-dimensional (i.e., n_i = 1). The name "monotropic" means
"turning in a single direction" in Greek, and captures the characteristic mono-
tonicity property of convex functions of a single variable such as f_i.

where F is a closed proper convex function of all the components x_i. Then,
by introducing an auxiliary vector z ∈ ℜ^{n_1+···+n_m}, the problem of mini-
mizing f over a subspace X can be transformed to the problem

    minimize F(z) + Σ_{i=1}^m f_i(x_i)
    subject to (x, z) ∈ S,

where S is the subspace of ℜ^{2(n_1+···+n_m)}

    S = { (x, x) | x ∈ X }.

This problem is of the form (4.18).
Another problem that can be converted to the EMP format (4.18) is
    minimize Σ_{i=1}^m f_i(x)
    subject to x ∈ X,

where f_i : ℜn ↦ (−∞, ∞] are closed proper convex functions, and X is
a subspace of ℜn. This can be done by introducing m copies of x, i.e.,
auxiliary vectors z_i ∈ ℜn that are constrained to be equal, and write the
problem as

    minimize Σ_{i=1}^m f_i(z_i)
    subject to (z_1, ..., z_m) ∈ S,

where S is the subspace

    S = { (x, ..., x) | x ∈ X }.


It can thus be seen that convex problems with linear constraints can
generally be formulated as EMP. We will see that these problems share a
powerful and symmetric duality theory, which is similar to Fenchel duality,
and forms the basis for a symmetric and general framework for polyhedral
approximation.

The Dual Problem

To derive the appropriate dual problem, we introduce auxiliary vectors


z_i ∈ ℜ^{n_i} and we convert the EMP (4.18) to the equivalent form

    minimize Σ_{i=1}^m f_i(z_i)     (4.20)
    subject to z_i = x_i,  i = 1, ..., m,  (x_1, ..., x_m) ∈ S.



We then assign a multiplier vector λ_i ∈ ℜ^{n_i} to the constraint z_i = x_i,
thereby obtaining the Lagrangian function

    L(x_1, ..., x_m, z_1, ..., z_m, λ_1, ..., λ_m) = Σ_{i=1}^m ( f_i(z_i) + λ_i'(x_i − z_i) ).     (4.21)

The dual function is

    q(λ) = inf_{(x_1,...,x_m)∈S, z_i∈ℜ^{n_i}} L(x_1, ..., x_m, z_1, ..., z_m, λ_1, ..., λ_m)
         = inf_{(x_1,...,x_m)∈S} Σ_{i=1}^m λ_i'x_i + Σ_{i=1}^m inf_{z_i∈ℜ^{n_i}} { f_i(z_i) − λ_i'z_i }
         = { Σ_{i=1}^m q_i(λ_i)   if (λ_1, ..., λ_m) ∈ S⊥,
             −∞                   otherwise,

where

    q_i(λ_i) = inf_{z_i∈ℜ^{n_i}} { f_i(z_i) − λ_i'z_i },     i = 1, ..., m,

and S⊥ is the orthogonal subspace of S.


Note that since q_i can be written as

    q_i(λ_i) = −sup_{z_i∈ℜ^{n_i}} { λ_i'z_i − f_i(z_i) },

it follows that −q_i is the conjugate of f_i, so by Prop. 1.6.1 of Appendix B,
−q_i is a closed proper convex function. The dual problem is

    maximize Σ_{i=1}^m q_i(λ_i)     (4.22)
    subject to (λ_1, ..., λ_m) ∈ S⊥,

or with a change of sign to convert maximization to minimization,

    minimize Σ_{i=1}^m f_i*(λ_i)     (4.23)
    subject to (λ_1, ..., λ_m) ∈ S⊥,

where f_i* is the conjugate of f_i. Thus the dual problem has the same
form as the primal. Moreover, assuming that the functions f_i are closed,
when the dual problem is dualized, it yields the primal, so the duality has
a symmetric character, like Fenchel duality. We will discuss further the
duality theory for EMP in Sections 6.7.3 and 6.7.4, using algorithmic ideas
(the ε-descent method of Section 6.7.2).

Throughout our analysis of this section, we denote by f_opt and q_opt
the optimal values of the primal and dual problems (4.18) and (4.22), re-
spectively, and in addition to the convexity assumption on f_i made earlier,
we will assume that appropriate conditions hold that guarantee the strong
duality relation f_opt = q_opt. In Section 6.7, we will revisit EMP and we will
develop conditions under which strong duality holds, using the algorithmic
ideas developed there.
Since EMP in the form (4.20) can be viewed as a special case of the
convex programming problem with equality constraints of Section 1.1, it
is possible to obtain optimality conditions as special cases of the corre-
sponding conditions in that section (cf. Prop. 1.1.5). In particular, it can
be seen that a pair (x, λ) satisfies the Lagrangian optimality condition of
Prop. 1.1.5(b), applied to the Lagrangian (4.21), if and only if x_i attains
the infimum in the equation

    q_i(λ_i) = inf_{x_i∈ℜ^{n_i}} { f_i(x_i) − λ_i'x_i },     i = 1, ..., m.

Thus, by applying Prop. 1.1.5(b), we obtain the following.

Proposition 4.4.1: (EMP Optimality Conditions) There holds
−∞ < q_opt = f_opt < ∞ and (x_1^opt, ..., x_m^opt, λ_1^opt, ..., λ_m^opt) are an op-
timal primal and dual solution pair of the EMP problem if and only
if

    (x_1^opt, ..., x_m^opt) ∈ S,     (λ_1^opt, ..., λ_m^opt) ∈ S⊥,

and

    x_i^opt ∈ arg min_{x_i∈ℜ^{n_i}} { f_i(x_i) − x_i'λ_i^opt },     i = 1, ..., m.     (4.24)

Note that by the Conjugate Subgradient Theorem (Prop. 5.4.3 in


Appendix B), the condition (4.24) of the preceding proposition is equivalent
to either one of the following two subgradient conditions:

    λ_i^opt ∈ ∂f_i(x_i^opt),     x_i^opt ∈ ∂f_i*(λ_i^opt),     i = 1, ..., m.

These conditions are significant, because they show that once (x_1^opt, ..., x_m^opt)
or (λ_1^opt, ..., λ_m^opt) is obtained, its dual counterpart can be computed by
"differentiation" of the functions f_i or f_i*, respectively. We will often use
the equivalences of the preceding formulas in the remainder of this chapter,
so we depict them in Fig. 4.4.1.

    λ ∈ ∂f(x)      ⟺      x ∈ arg min_{z∈ℜn} { f(z) − z'λ }
    x ∈ ∂f*(λ)     ⟺      λ ∈ arg min_{z∈ℜn} { f*(z) − z'x }

Figure 4.4.1. Equivalent "differentiation" formulas for a closed proper convex
function f and its conjugate f*. All of these four relations are also equivalent to

    λ'x = f(x) + f*(λ);

cf. the Conjugate Subgradient Theorem (Prop. 5.4.3 in Appendix B).

General Polyhedral Approximation Scheme

The EMP formalism allows a broad and elegant algorithmic framework


that combines elements of the cutting plane and simplicial decomposition
methods of the preceding sections. In particular, the primal and dual EMP
problems (4.18) and (4.23) will be approximated, by using inner or outer
linearization of some of the functions Ji and ft. The optimal solution of
the dual approximate problem will then be used to construct more refined
inner and outer linearizations.
We introduce an algorithm, referred to as the generalized polyhedral
approximation or GPA algorithm. It uses a fixed partition of the index set
{1, ... ,m}:
{1, ... , m} = I u I u I.
This partition determines which of the functions Ji are outer approximated
(set I) and which are inner approximated (set J).
For i EI, given a finite set Ai c dom(ft) such that oft(>.) =/- 0 for
all >. E Ai, we consider the outer linearization of fi corresponding to Ai:

J. A (xi)= rnax{>.'xi - ft(>.)},


-i, ' >.EA;

or equivalently, as mentioned in Section 4.3,

where for each>. E Ai, x5. is such that >. E 8fi(x5.).


202 Polyhedral Approximation Methods Chap. 4

For i E I, given a finite set Xi c dom(fi) such that 8fi(x) :/- 0 for
all x E Xi, we consider the inner linearization of Ji corresponding to Xi by

if Xi E conv(Xi),
otherwise,

where C(xi, Xi) is the set of all vectors with components ax, x E Xi,
satisfying

L axx = Xi, L O:x = 1, ax ~ 0, "I x E Xi,


xEX; xEX;

[cf. Eq. (4.16)]. As noted in Section 4.3, this is the function whose epigraph
is the convex hull of the halflines { (xi, w) I fi(xi) S: w }, Xi E Xi (cf. Fig.
4.3.1).
We assume that at least one of the sets I and I is nonempty. At
the start of the typical iteration, we have for each i E I, a finite subset
Ai C dom(ft), and for each i E I, a finite subset Xi C dom(fi). The
iteration is as follows:

Typical Iteration of GPA Algorithm:


Step 1: (Approximate Problem Solution) Find a primal-dual
optimal solution pair (x, >.) = (x1,>.1, ... , xm, >-m) of the EMP

minimize Lfi(xi) + LL,A/xi) + Lh.x;(xi)


iEJ iE!_ iEf (4.25)
subject to (x1, ... , Xm) E S,

where L,A;
and h,x; are the outer and inner linearizations of Ji cor-
responding to Ai and Xi, respectively.
Step 2: (Test for Termination and Enlargement) Enlarge the
sets Ai and Xi as follows (see Fig. 4.4.2):
(a) For i EI, we add any subgradient >.i E 8fi(xi) to Ai.
(b) For i EI, we add any subgradient Xi E 8ft(>.i) to Xi,
If there is no strict enlargement, i.e., for all i EL we have Ai E Ai, and
for all i E I we have Xi E Xi, the algorithm terminates. Otherwise,
we proceed to the next iteration, using the enlarged sets Ai and Xi.

We will show in a subsequent proposition that if the algorithm ter-


minates, the current vector (x, >.) is a primal and dual optimal solution
pair. If there is strict enlargement and the algorithm does not terminate,
we proceed to the next iteration, using the enlarged sets Ai and Xi.
Sec. 4.4 Generalized Polyhedral Approximation 203

ew slope X,
.,. . . J. . .
,, ... "'

Slope Ai
"' A

New break point Xi x,

Figure 4.4.2. Illustration of the enlargement step in the GPA algorithm, after
we obtain a primal-dual optimal solution pair

Note that in the figure on the right, we use the fact

(cf. Fig. 4.4.1- the Conjugate Subgradient Theorem, Prop. 5.4.3 in Appendix B).
The enlargement step on the left (finding .\i) is also equivalent to .\i satisfying
Xi E o/;(.\i), or equivalently, solving the optimization problem

maximize { >.~xi - J;(>.i)}


subject to Ai E !Rn;.

The enlargement step on the right (finding :i;.) is also equivalent to solving the
optimization problem

maximize { >.~xi - /i(xi )}


subject to Xi E !Rn;;

cf. Fig. 4.4.1.

Note that we implicitly assume that at each iteration, there exists a


primal and dual optimal solution pair of problem (4.25). Furthermore, we
assume that the enlargement step can be carried out, i.e., that 8/i(xi) =/. 0
for all i E I and oft(>.i) =/. 0 for all i E I. Sufficient assumptions may
need to be imposed on the problem to guarantee that this is so.
There are two potential advantages of the GPA algorithm over the
earlier cutting plane and simplicial decomposition methods of Sections 4.1
and 4.2, depending on the problem's structure:
204 Polyhedral Approximation Methods Chap. 4

(a) The refinement process may be faster, because at each iteration, mul-
tiple cutting planes and break points are added (as many as one per
function Ji). As a result, in a single iteration, a more refined approx-
imation may be obtained, compared with the methods of Sections 4.1
and 4.2, where a single cutting plane or extreme point is added. More-
over, when the component functions Ji are scalar, adding a cutting
plane/break point to the polyhedral approximation of Ji can be very
simple, as it requires a one-dimensional differentiation or minimiza-
tion for each Ji. Of course if the number m of component functions
is large, maintaining these multiple cutting planes and break points
may add significant overhead to the method, in which case a scheme
for discarding some old cutting planes and break points may be used,
similar to the case of the restricted simplicial decomposition scheme.
(b) The approximation process may preserve some of the special struc-
ture of the cost function and/or the constraint set. For example if
the component functions Ji are scalar, or have partially overlapping
dependences, such as for example,
f(x1, ... ,xm) = fi(x1,x2) + h(x2,x3) + · · ·
+ fm-1(Xm-1,Xm) + fm(Xm),
the minimization off by the cutting plane method of Section 4.1 leads
to general/unstructured linear programming problems. By contrast,
using separate outer approximation of the components functions leads
to linear programs with special structure, which can be solved effi-
ciently by specialized methods, such as network flow algorithms, or
interior point algorithms that can exploit the sparsity structure of the
problem.
Generally, in specially structured problems, the preceding two advantages
can be of decisive importance.
Note two prerequisites for the GPA algorithm to be effective:
(1) The (partially) linearized problem (4.25) must be easier to solve than
the original problem (4.18). For example, problem (4.25) may be a
linear program, while the original may be nonlinear (cf. the cutting
plane method of Section 4.1); or it may effectively have much smaller
dimension than the original (cf. the simplicial decomposition method
of Section 4.2).
(2) Finding the enlargement vectors (\ for i E I, and Xi for i E I)
must not be too difficult. This can be done by the differentiation
5:.i E afi(i:i) for i EI, and Xi E aft(>.i) or i El. Alternatively, if this
is not convenient for some of the functions (e.g., because some of the
Ji or the ft are not available in closed form), one may calculate Ai
and/ or Xi via the relations
Xi E aft(5:.i), >.i E afi(xi);
Sec. 4.4 Generalized Polyhedral Approximation 205

(cf. Fig. 4.4.1 - the Conjugate Subgradient Theorem, Prop. 5.4.3 in


Appendix B). This involves solving optimization problems. For ex-
ample, finding Xi such that 5..i E afi(xi) for i E I is equivalent to
solving the problem

maximize { 5..:xi - Ji(Xi)}


subject to Xi E ~ni,

and may be nontrivial (cf. Fig. 4.4.2).


Generally, the facility of solving the linearized problem (4.25) and carrying
out the subsequent enlargement step may guide the choice of functions that
are inner or outer linearized. If Xi is one-dimensional, as is often true in
separable-type problems, the enlargement step is typically quite easy.
We finally note that the symmetric duality of the EMP can be ex-
ploited in the implementation of the GPA algorithm. In particular, the
algorithm may be applied to the dual problem of problem (4.18):

minimize L ft (Ai) (4.26)


i=l
subject to (A1, .. ,,Am) E 51-,

where ft is the conjugate of f;. Then the inner (or outer) linearized index
set I of the primal becomes the outer (or inner, respectively) linearized in-
dex set of the dual. At each iteration, the algorithm solves the approximate
dual EMP,

minimize L ft (Ai) + L ft A/ Ai) + L D, xi (Ai)


iEJ iEl_ iEf (4.27)
subject to (A1, ... , Am) E 51-,

which is simply the dual of the approximate primal EMP (4.25) [since the
outer (or inner) linearization of ft is the conjugate of the inner (or respec-
tively, outer) linearization of Ji]. Thus the algorithm produces mathemat-
ically identical results when applied to the primal or the dual EMP. The
choice of whether to apply the algorithm in its primal or its dual form is
simply a matter of whether calculations with Ji or with their conjugates
ft are more or less convenient. In fact, when the algorithm makes use of
both the primal solution x and the dual solution 5.. in the enlargement step,
the question of whether the starting point is the primal or the dual EMP
becomes moot: it is best to view the algorithm as applied to the pair of
primal and dual EMP, without designating which is primal and which is
dual.
206 Polyhedral Approximation Methods Chap. 4

Termination and Convergence

Now let us discuss the validity of the GPA algorithm. To this end, we will
use two basic properties of outer approximations. The first is that for any
closed proper convex functions f and [_, and vector x E dom(f), we have

L Sc f, [_(x) = f(x) a[_(x) C af(x). (4.28)

To see this, use the subgradient inequality to write for any g E a[_(x),

f(z) ~ [_(z) ~ L(x) + g'(z - x) = f(x) + g'(z - x), zE ~n,

which implies that g E af(x). The second property is that for any outer
linearization LA of f, we have

.XE A, .XE af(x) (4.29)

To see this, consider vectors x.x such that >. E af(x.x), >.EA, and write
LA (x) = T/t {f (x.x) + N(x - x.x)} ~ f(x5,) + .X'(x - x5,) ~ f(x),

where the second inequality follows from .X E af(x5,). Since we also have
f ~ LA' we obtain f A(x) = f(x). We first show the optimality of the
primal and dual solution pair obtained upon termination of the algorithm.

Proposition 4.4.2: (Optimality at Termination) If the GPA


algorithm terminates at some iteration, the corresponding primal and
dual solutions, (x1, ... , xm) and (.X1, ... , >.m), form a primal and dual
optimal solution pair of the EMP problem.

Proof: From Prop. 4.4.1 and the definition of (x1, ... , Xm) and (>.1, ... , >.m)
as a primal and dual optimal solution pair of the approximate problem
(4.25), we have

(£1, ... ,xm) E 8,

We will show that upon termination, we have for all i

(4.30)

which by Prop. 4.4.1 implies the desired conclusion.


Since (x1, ... , Xm) and (>.1, ... , >.m) are a primal and dual optimal
solution pair of problem (4.25), Eq. (4.30) holds for all i fj. I U I (cf. Prop.
Sec. 4.4 Generalized Polyhedral Approximation 207

4.4.1). We will complete the proof by showing that it holds for all i E I
(the proof for i E J follows by a dual argument).
Indeed, let us fix i E I and let 5..i E 8 Ji (Xi) be the vector generated by
the enlargement step upon termination. We must have 5..i E Ai, since there
is no strict enlargement when termination occurs. Since L,Ai is an outer
linearization of Ii, by Eq. (4.29), the fact 5..i E Ai, 5..i E ofi(xi) implies that

f.
-i,

i
(xi) = fi(xi),

which in turn implies by Eq. (4.28) that

oL,Ai (xi) c ofi(xi)-

By Prop. 4.4.1, we also have .5..i E BL,Ai (xi), so .5..i E ofi(Xi)- Q.E.D.

As in Sections 4.1 and 4.2, convergence can be easily established in


the case where the functions Ji, i E I U I, are polyhedral, assuming that
care is taken to ensure that the corresponding enlargement vectors 5..i are
chosen from a finite set of extreme points. In particular, assume that:
(a) All outer linearized functions Ji are real-valued and polyhedral, and
for all inner linearized functions Ji, the conjugates ft are real-valued
and polyhedral.
(b) The vectors 5..i and Xi added to the polyhedral approximations are
elements of the finite representations of the corresponding ft and Ji.
Then at each iteration there are two possibilities: either (x, >..) is an optimal
primal-dual pair for the original problem and the algorithm terminates, or
the approximation of one of the Ji, i E I U I, will be refined/improved.
Since there can be only a finite number of refinements, convergence in a
finite number of iterations follows.
Other convergence results are possible, extending some of the analysis
of Sections 4.1 and 4.2. In particular, let (xk, >,.k) be the primal and dual
pair generated at iteration k, and let

>..7 E Oji(i:7), i EI, x} E oft(>..7), EI,

be the vectors used for the corresponding enlargements. If the set I is empty
(no inner approximation) and the sequence {>..7} is bounded for every i EI,
then we can easily show that every limit point of {xk} is primal optimal.
To see this, note that for all k, C:S k -1, and (x1, ... ,xm) ES, we have

L fi(x7) +L (fi(xf) + (xt - xf)'>..f) ::::: L fi(xt) + LL,Ak-1(x7)


i(/:J_ iEJ. i(/.J. iEJ. '
m

::::: Lfi(Xi)-
i=l
208 Polyhedral Approximation Methods Chap. 4

Let {xk}K be a subsequence converging to a vector x. By taking the limit


as£-+ oo, k EK,£ EK,£< k, and using the closedness of fi, we obtain
m

liminf I::ticxn + liminf I::ticxD::::: I::!icxi)


I:ti(xi)::::: k---too, kEK €-too, €EK
i=l irf.l iEl i=l

for all (x 1 , ... , Xm) E S. It follows that xis primal optimal, i.e., every limit
point of {i;k} is optimal. The preceding convergence argument also goes
through even if the sequences { >.}} are not assumed bounded, as long as the
limit points Xi belong to the relative interior of the corresponding functions
fi (this follows from the subgradient decomposition result of Prop. 5.4.1 in
Appendix B).
Exchanging the roles of primal and dual, we similarly obtain a conver-
gence result for the case where I is empty (no outer linearization): assuming
that the sequence { x7} is bounded for every i E l, every limit point of { ).k}
is dual optimal.
We finally state a more general convergence result from [BeYll],
which applies to the mixed case where we simultaneously use outer and
inner approximation (both l and I are nonempty). The proof is more
complicated than the preceding ones, and we refer to [BeYl 1] for the cor-
responding analysis.

Proposition 4.4.3: (Convergence of the GPA Algorithm) Con-


sider the GPA algorithm for the EMP problem, .assuming the strong
duality relation -oo < qopt = fopt < oo. Let (xk, >.k) be a primal
and dual optimal solution pair of the approximate problem at the kth
iteration, and let >.}, i E I and xf, i E l be the vectors generated at
the corresponding enlargement step. Suppose that there exist conver-
gent subsequences { xfl K' i EI, { >.fl
K' i E l, such that the sequences
{>.fl K' i EI, { xfl K' i E l, are bounded. Then:
(a) Any limit point of the sequence { (xk, ).k) }x: is a prim~l and dual '
optimal solution pair of the original problem. ·
(b) The sequence of optimal values of the approximate problems con-
verges to the optimal value fopt,

Application to Network Optimization and Monotropic


Programming
Let us consider a directed graph with set of nodes N and set of arcs A. A
classical network optimization problem is to minimize a cost function
L fa(xa),
aEA
Sec. 4.5 Generalized Simplicial Decomposition 209

where fa is a scalar closed proper convex function, and Xa is the flow of


arc a E A. The minimization is over all flow vectors x = {Xa I a E A} that
belong to the circulation subspace S of the graph (at each node, the sum
of all incoming arc flows is equal to the sum of all outgoing arc flows).
The GPA algorithm that uses inner linearization of all the functions
fa that are nonlinear is particularly attractive for this problem, because of
the favorable structure of the corresponding approximate EMP:
minimize L fa,Xa (xa)
aE.A
subject to x E S,
where for each arc a, fa,Xa is the inner approximation of fa, corresponding
to a finite set of break points Xa C dom(fa), By suitably introducing
multiple arcs in place of each arc, we can recast this problem as a linear
minimum cost network flow problem that can be solved using very fast
polynomial algorithms. These algorithms, simultaneously with an optimal
primal (flow) vector, yield a dual optimal (price differential) vector (see
e.g., [Ber98], Chapters 5-7). Furthermore, because the functions fa are
scalar, the enlargement step is very simple.
Some of the preceding advantages of the GPA algorithm with inner
linearization carry over to monotropic programming problems (ni = 1 for
all i), the key idea being the simplicity of the enlargement step. Further-
more, there are effective algorithms for solving the associated approximate
primal and dual EMP, such as out-of-kilter methods [Roc84], [TseOlb], and
E-relaxation methods [Ber98], [TsBOO].

4.5 GENERALIZED SIMPLICIAL DECOMPOSITION

In this section we will aim to highlight some of the applications and the
fine points of the general algorithm of the preceding section. As vehicle we
will use the simplicial decomposition approach, and the problem
minimize f(x) + c(x)
(4.31)
subject to x E ~n,
where f: ~n H (-oo, oo] and c: ~n H (-oo, oo] are closed proper convex
functions. This is the Fenchel duality context, and it contains as a special
case the problem to which the ordinary simplicial decomposition method of
Section 4.2 applies (where f is differentiable, and c is the indicator function
of a bounded polyhedral set). Here we will mainly focus on the case where
f is nondifferentiable and possibly extended real-valued.
We apply the polyhedral approximation scheme of the preceding sec-
tion to the equivalent EMP
minimize fi(x1) + h(x2)
subject to (x1,x2) ES,
210 Polyhedral Approximation M ethods Chap. 4

where

Note that the orthogonal subspace has the form


51. = {(.X1 , A2) I Al= -.X2} = {(.X, -,X) I A E ~n }.
Optimal primal and dual solutions of this EMP problem are of the form
(xovt , xovt) and (,Xovt, -,Xovt), with
,Xopt E of(xovt), - ,Xopt E oc(x ovt) ,
consistently with the optimality conditions of Prop. 4.4.1. A pair of such op-
timal solutions (x 0 vt , ,Xovt) satisfies the necessary and sufficient optimality
conditions of the Fenchel Duality Theorem [Prop. 1.2.l(c)] for the original
problem.
In one possible polyhedral approximation method, at the typical it-
eration h is replaced by an inner linearization involving a set of break
points X, while Ji is left unchanged. At the end of the iteration, if (5.., -5..)
is a dual optimal solution, X is enlarged to include a vector x such that
-5.. E oh(x) (cf. Section 4.4). We will now transcribe this method in the
notation of problem (4.31).
We st art with some finite set Xo c dom(c). At the typical iteration,
given a finite set X k C dom(c), we use the following three steps t o compute
vect ors Xk , Xk, and the enlarged set X k+l = Xk U {xk } to start the next
iteration:

Typical Iteration of Generalized Simplicial Decomposition


Algorithm to Minimize f + c

(1) We obtain
Xk E arg min {J(x)
xE!Rn
+ Ck(x)}, (4.32)

where Ck is the polyhedral/inner linearization function whose


epigraph is the convex hull of the finite collection of halflines
{(x,w) I c(x)::; w}, x E Xk.
(2) We obtain Ak such that

(4.33)

(3) We obtain Xk such that

(4.34) '

and form
Sec. 4.5 Generalized Simplicial Decomposition 211

As in the case of the GPA algorithm, we assume that f and care such
that the steps (1)-(3) above can be carried out. In particular, the existence
of the subgradient Ak in step (2) is guaranteed by the optimality conditions
of Prop. 5.4.7 in Appendix B, applied to the minimization in Eq. (4.32),
under the appropriate relative interior conditions.
Note that step (3) is equivalent to finding

Xk E arg min { >..~x + c(x) }, (4.35)


xE~n

and that this is a linear or quadratic program in important special cases


where c is polyhedral or quadratic, respectively. Note also that the approx-
imate problem (4.32) is a linearized version of the original problem (4.31),
where c is replaced by Ck(x), which is an inner linearization of c. More
specifically, if Xk = {x 1 I j E Jk}, where Jk is a finite index set, Ck is given
by

if x E conv(Xk),

if x tf_ conv(Xk),

so the minimization (4.32) involves in effect the variables a 1 , j E Jk, and


is equivalent to

minimize f ( L a 1x1) + L a 1 c(x1 )


jEJk jEJk (4.36)
subject to L a1 = 1, a 1 ~ 0, j E Jk.
jEJk

The dimension of this problem is the cardinality of Jk, which can be quite
small relative to the dimension of the original problem.

Dual/Cutting Plane Implementation

Let us also provide a dual implementation, which is an equivalent outer


linearization/cutting plane-type of method. The Fenchel dual of the mini-
mization off+ c [cf. Eq. (4.31)] is

minimize f*(>..) + c*(->..)


subject to >.. E ~n,

where f* and c* are the conjugates of f and c, respectively. According to


the theory of the preceding section, the generalized simplicial decomposi-
tion algorithm (4.32)-( 4.34) can alternatively be implemented by replacing
212 Polyhedral Approximation Methods Chap. 4

'
c*(->.) /,

~~/"( x,,
~
\

Slope: Xi, i < k-----' \


<k
/ ''
Slope: i
. ) ~ - - - -c,.,,_-,.
Ck ( - >.) --\___
\
' V ,, Slope: Xk
,'

0
Constant - f*(>.)-

Figure 4.5.1. Illustration of the cutting plane implementation of the generalized


simplicial decomposition method for minimizing the sum of two closed proper
convex functions f and c [cf. Eq. (4.31)]. The ordinary cutting plane method,
described in the beginning of Section 4.1, is obtained as the special case where
J*(x) = 0. In this case, f is the indicator function of the set consisting of just
the origin, and the primal problem is to evaluate c(O) (the optimal value of c*).

c* by a piecewise linear/cutting plane outer linearization, while leaving f*


unchanged, i.e., by solving at iteration k the problem
minimize f*(>.) + Ck(->.)
(4.37)
subject to ). E ~n,

where Ck is an outer linearization of c* (the conjugate of Ck)- This problem


is the (Fenchel) dual of problem (4.32) [or equivalently, the low-dimensional
problem (4.36)].
Note that solutions of problem (4.37) are the subgradients Ak satis-
fying Ak E 8f(xk) and -Ak E 8Ck(xk), where Xk is the solution of the
problem (4.32) [cf. Eq. (4.33)], while the associated subgradient of c* at
-Ak is the vector Xk generated by Eq. (4.34), as shown in Fig. 4.5.1. In
fact, the function ck has the form

where Aj and Xj are vectors that can be obtained either by using the
generalized simplicial decomposition method (4.32)-( 4.34), or by using its
dual, the cutting plane method based on solving the outer approximation
problems (4.37). The ordinary cutting plane method, described in the
beginning of Section 4.1, is obtained as the special case where f*(>.) = 0
[or equivalently, f(x) = oo if x-/= 0, and f(O) = OJ.
Sec. 4.5 Generalized Simplicial Decomposition 213

Whether the primal or the dual implementation is preferable depends


on the structure of the functions f and c. When f (and hence also f*) is
not polyhedral, the dual implementation may not be attractive, because it
requires the n-dimensional nonlinear optimization (4.37) at each iteration,
as opposed to the typically low-dimensional optimization (4.32) or (4.36).
When f is polyhedral, both methods require the solution of linear programs
for the enlargement step, and the choice between them may depend on
whether it is more convenient to work with c rather than c*.

4.5.1 Differentiable Cost Case

Let us first consider the favorable case of the generalized simplicial decom-
position algorithm (4.32)-(4.35), where f is differentiable and c is poly-
hedral with bounded effective domain. Then the method is essentially
equivalent to the simple version of the simplicial decomposition method of
Section 4.2. In particular:
(a) When c is the indicator function of a bounded polyhedral set X, and
Xo = {xo}, the method reduces to the earlier simplicial decomposi-
tion method (4.9)-(4.10). Indeed, step (1) corresponds to the min-
imization (4.10), step (2) simply yields Ak = v' f(xk), and step (3),
as implemented by Eq. (4.35), corresponds to solution of the linear
program (4.9) that generates a new extreme point.
(b) When c is a general polyhedral function, the method can be viewed
as essentially the special case of the earlier simplicial decomposition
method (4.9)-(4.10) applied to the problem of minimizing f(x) + w
subject to x EX and (x, w) E epi(c) [the only difference is that epi(c)
is not bounded, but this is inconsequential if we assume that dom(c)
is bounded, or more generally that the problem (4.32) has a solution].
In this case, the method terminates finitely, assuming that the vectors
(xk, c(h)) obtained by solving the linear program (4.35) are extreme
points of epi(c) (cf. Prop. 4.2.1).
For the more general case where f is differentiable and c is a (non-
polyhedral) convex function, the method is illustrated in Fig. 4.5.2. The
existence of a solution Xk to problem (4.32) [or equivalently (4.36)] is guar-
anteed by the compactness of conv(Xk) and Weierstrass' Theorem, while
step (2) yields Ak = v' f(xk), The existence of a solution to problem (4.35)
must be guaranteed by some assumption such as for example compactness
of the effective domain of c.

4.5.2 Nondifferentiable Cost and Side Constraints

Let us now consider the problem of minimizing f + c [cf. Eq. (4.31)] for
the more complicated case where f is extended real-valued and nondiffer-
214 Polyhedral Approximation Methods Chap. 4

0 X

Figure 4.5.2. Illustration of successive iterates of the generalized simplicial de-


composition method in the case where / is differentiable and c is a general convex
function. Given the inner linearization Ck of c, we minimize / + Ck to obtain Xk
(graphically, we move the graph of - / vertically until it touches the graph of Ck)-
We then compute Xk as a point at which - 'v f(xk) is a subgradient of c, and we
use it to form the improved inner linearization Ck+l of c. Finally, we minimize
f + Ck+l to obtain Xk+l (graphically, we move the graph of - / vertically until
it touches the graph of Ck+1 )-

entiable. Then assuming that


ri(dom(J)) n conv(Xo) ::J 0,
the existence of the subgradient )..k of Eq. (4.33) is guaranteed by the
optimality conditions of Prop. 5.4.7 in Appendix B, and the existence of a
solution Xk to problem (4.32) is guaranteed by Weierstrass' Theorem, since
the effective domain of Ck is compact.
When c is the indicator function of a polyhedral set X, the condition
of step (2) becomes
V x E conv(Xk) , (4.38)
i.e., - >.k is in the normal cone of conv(Xk) at Xk. The method is illustrated
for this case in Fig. 4.5.3. It terminates finitely, assuming that the vector
Xk obtained by solving the linear program (4.35) is an extreme point of X.
The reason is that in view of Eq. (4.38) , the vector Xk does not belong to
Xk (unless Xk is optimal), so Xk+l is a strict enlargement of Xk.
Let us now address the calculation of a subgradient )..k E 8f(xk)
such that ->.k E 8Ck(xk) [cf. Eq. (4.33)]. This may be a difficult problem
as it may require knowledge of 8f(xk) as well as 8Ck(xk)- However, in
important special cases, )..k may be obtained simply as a byproduct of the
minimization
Xk E arg min {J(x ) + Ck(x) }, (4.39)
x E ~n
Sec. 4.5 Generalized Simplicial Decomposition 215

XO

Figure 4.5.3. Illustration of the generalized simplicial decomposition method for


the case where f is nondifferentiable a nd c is the indicator funct ion of a polyhedral
set X. For each k, we compute a subgradient Ak E BJ(xk) such that - Ak lies
in the normal cone of conv(Xk) at Xk, and we use it to generate a new extreme
point Xk of X using the relation

cf. Step (3) of the a lgorithm.

[cf. Eq. (4.32)]. This is illustrated in the following examples.

Example 4.5.1 (Minimax Problems)

Consider the minimization of f + c for the case where c is t he indicator of a


closed convex set X, and
J(x) = max{J1 (x), . .. , fr(x) },
where Ji, ... , fr : ~n ~~are differentiable functions. Then the minimization
(4.39) talces the form

minimize z
(4.40)
subject to Ji (x) :S z , j = 1, . .. , r, x E conv(Xk),
where conv(Xk) is a polyhedral inner approximation to X . According to the
optimality conditions of Prop. 1.1.3, the optimal solution (xk, z*) together
with dual optimal variables µ;,
satisfy
216 Polyhedral Approximation Methods Chap. 4

and the primal feasibility


Xk E conv(Xk), Ji (xk) :S z*, j = 1, ... , r,
Lagrangian optimality

(4.41)

and dual feasibility and complementary slackness conditions


µ; 2 0, µ; = 0 if fi(xk) < z* = f(xk), j = 1, ... ,r. (4.42)
It follows that
r

I:µ;= 1, (4.43)
j=l

[since otherwise the minimum in Eq. (4.41) would be -oo), and

(t, µj'v /;(x,)), (x - x,)?: 0, V x E conv(Xk), (4.44)

[which is the optimality condition for the optimization in Eq. (4.41)). Using
Eqs. (4.42) and (4.43), it can be shown that the vector
r

Ak = Lµ;v'Jj(Xk) (4.45)
j=l

is a subgradient off at Xk (cf. Danskin's Theorem, Section 3.1). Furthermore,


using Eq. (4.44), it follows that ->.k is in the normal cone of conv(Xk) at Xk,
In conclusion, Ak as given by Eq. (4.45), is a suitable subgradient for
determining a new extreme point Xk to refine the inner approximation of the
constraint set X via solution of the problem
Xk E arg min >.~x; (4.46)
:z:EX

cf. Eq. (4.35). To obtain Ak, we need to:


(a) Compute the primal approximation Xk as the solution of problem (4.40).
(b) Simultaneously compute the dual variables/multipliersµ; of this prob-
lem.
(c) Use Eq. (4.45).
Note an important advantage of this method over potential competitors: it
involves solution of often simple programs (linear programs if X is polyhe-
dral) of the form (4.46) to generate new extreme points of X, and solution
of nonlinear programs of the form (4.40), which are low-dimensional [their
dimension is equal to the number of extreme points of Xk; cf. Eq. (4.36)).
When each Ji is twice differentiable, the latter programs can be solved by
fast Newton-like methods, such as sequential quadratic programming (see
e.g., [Ber82a], [Ber99), [NoW06]).
Sec. 4.6 Polyhedral Approximation for Conic Programming 217

Example 4.5.2 (Minimax Problems with Side Constraints)

Consider a more general version of the preceding example, where there are
additional inequality constraints defining the domain off. This is the problem
of minimizing f + c where c is the indicator of a closed convex set X, and f
is of the form

f(x) = { :ax {fi(x), ... , fr(x)} if g;(x) '.S 0, i = 1, ... ,P, (4.47)
otherwise,
with fJ and g; being convex differentiable functions. Applications of this type
include multicommodity flow problems with "side constraints" (the inequal-
ities g;(x) :S 0, which are separate from the network flow constraints that
comprise the set X; cf. the discussion of Section 4.2].
Similarly, to calculate Ak, we introduce dual variables v;
2'. 0 for the
constraints g;(x) :S 0, and we write the Lagrangian optimality and comple-
mentary slackness conditions. Then Eq. (4.44) takes the form

V x E conv(Xk)-

Similar to Eq. (4.45), it can be shown that the vector


r p

>..k = I>jv'Ji(xk) + I:>;Vg;(xk)


j=l i=l

is a subgradient of f at Xk, while we have ->..k E ack(Xk) as required by


Eq. (4.33). Once Ak is computed simultaneously with Xk, then as in the
preceding example, the enlargement of the inner approximation of X involves
the addition of Xk, which is obtained by solving the problem (4.46).

The preceding two examples involve the simple nondifferentiable func-


tion max{fi(x), ... , fr(x) }, where Ji are differentiable. However, the ideas
for the calculation of extreme points of X and the associated process
of enlargement via dual variables apply more broadly. In particular, we
may treat similarly problems involving other types of nondifferentiable or
extended-real valued cost function f, by converting them to constrained
minimization problems. Then by using the dual variables of the approx-
imate inner-linearized problems, we may compute a vector Ak such that
>..k E 8J(xk) and ->..k E 8Ck(xk), which can in turn be used for enlarge-
ment of xk.

4.6 POLYHEDRAL APPROXIMATION FOR CONIC


PROGRAMMING
In this section we will aim to extend the range of applications of the gen-
eralized polyhedral approximation framework for EMP by introducing ad-
ditional conic constraint sets. Our motivation is that the framework of the
218 Polyhedral Approximation Methods Chap. 4

Figure 4.6. 1. Illustration of cone( X),


the cone generated by a subset X of
a cone C, as an inner linearization of
C. The polar cone(X)* is an outer
linearization of the polar cone

C* = {y I y' X ::; o, 'v' X E C}.

preceding two sections is not well-suited for the case where some of the
component functions of the cost are indicator functions of unbounded sets
such as cones. There are two main reasons for this:
(1) The enlargement procedure of the GPA algorithm may not be imple-
mentable by optimization, as in Fig. 4.4.2, because this optimization
may not have a solution. This may be true in particular if the function
involved is the indicator function of an unbounded set.
(2) The inner linearization procedure of the GPA algorithm approximates
an unbounded set by the convex hull of a finite number of points,
which is a compact set. It would appear that an unbounded polyhe-
dral set may provide a more effective approximation.
Motivated by these concerns, we extend the generalized polyhedral
approximation approach of Section 4.4 so that it applies to the problem of
minimizing the sum I::::;,1 fi(Xi) of convex extended real-valued functions
Ji, subject to (x1, ... , Xm) being in the intersection of a given subspace and
the Cartesian product of closed convex cones. To this end we first discuss
an alternative method for linearization of a cone, which allows enlargements
using directions of recession rather than points.
In particular, given a closed convex cone C and a finite subset X CC,
we view cone(X), the cone generated by X (see Section 1.2 in Appendix
B), as an inner linearization of C. Its polar, denoted cone(X)*, is an outer
linearization of the polar C* (see Fig. 4.6.1). This type of linearization has
a twofold advantage: a cone is approximated by a cone (rather than by a
compact set), and outer and inner linearizations yield convex functions of
the same type as the original (indicator functions of cones).
As a first step in our analysis, we introduce some duality concepts
relating to cones. We say that (x, >.) is a dual pair with respect to the
Sec. 4.6 Polyhedral Approximation for Conic Programming 219

closed convex cones C and C* if

x = Pc(x + >.) and

where Pc (y) and Pc* (y) denote projection of a vector y onto C and C*,
respectively. We also say that (x, >.) is a dual pair representation of a vector
y if y = x +). and (x, >.) is a dual pair with respect to C and C*. The
following proposition shows that ( Pc (y), Pc* (y)) is the unique dual pair
representation of y, and provides a related characterization; see Fig. 4.6.1.

Proposition 4.6.1: (Cone Decomposition Theorem) Let C be


a nonempty closed convex cone in Rn and C* be its polar cone.
(a) Any vector y E Rn, has a unique dual pair representation, the '
pair (Pc(y),Pc*(Y)).
(b) The following conditions are equivalent:
(i) (x, >.) is a dual pair with respect to C and C*.
(ii) x E C, ). E C*, and x ..l >..

Proof: (a) We denote~= y - Pc(y), and we will show that~= Pc*(y).


This will prove that (Pc(y), Pc* (y)) is a dual pair representation of y,
which must be unique, since by the definition of dual pair, a vector y
can have at most one dual pair representation, the pair (Pc(y), Pc* (y)).
Indeed, by the Projection Theorem (Prop. 1.1.9 in Appendix B), we have

e(z - Pc(y)) :S: 0, V z EC. (4.48)

Since C is a cone, we have (1/2)Pc(y) E C and 2Pc(y) E C, so by taking


z = (1/2)Pc(y) and z = 2Pc(y) in Eq. (4.48), it follows that

e Pc(y) = 0. (4.49)

e
By combining Eqs. (4.48) and (4.49), we obtain z < 0 for all z E C,
implying that~ EC*. Moreover, since Pc(y) EC, we have

(y - ~)'(z - ~) = Pc(y)'(z - ~) = Pc(y)'z :s; 0, V z EC*,

where the second equality follows from Eq. (4.49). Thus ~ satisfies the
necessary and sufficient condition for being the projection Pc* (y).
(b) Suppose that property (i) holds, i.e., x and ). are the projections of
x + ). on C and C*, respectively. Then we have, using also the Projection
Theorem,
xEC, >.EC*, ((x + >.) - x))'x = 0,
220 Polyhedral Approximation Methods Chap. 4

or
X EC, A EC*, Nx=O,
which is property (ii).
Conversely, suppose that property (ii) holds. Then, since A E C*, we
have Nz SO for all z EC, and hence

((x + A) - x)' (z - x) = N(z - x) = Nz S 0, \/ z EC,

where the second equality follows from the fact x 1- A. Thus x satisfies
the necessary and sufficient condition for being the projection Pc(x + A).
By a symmetric argument, it follows that A is the projection Pc* (x + A).
Q.E.D.

Duality and Optimality Conditions

We now introduce a version of the EMP problem of Section 4.4, generalized


to include cone constraints. It is given by
m r

minimize L /i(Xi) + L 8(xi I Ci)


(4.50)
i=l i=m+l
subject to (x1, ... , Xr) E S,

where (x1, ... , Xr) is a vector in ~n1 +··+nr, with components Xi E ~n;,
i = 1, ... , r, and
Ji : ~n; r--+ (-oo, oo] is a closed proper convex function for each i,
S is a subspace of ~n1 +·+nr,

Ci C ~n;, i = m + 1, ... , r, is a closed convex cone, and 8(xi I Ci)


denotes the indicator function of Ci.
Interesting special cases are the conic programming problems of Section
1.2, as well as several other problems described in the exercises. Included,
in particular, are problems where the cost function involves some posi-
tively homogeneous additive components, whose epigraphs are cones, such
as norms and support functions of sets. Such cost function components
may be expressed in terms of conical constraints.
Note that the conjugate of 8(- I Ci) is 8(- I en,
the indicator function
of the polar cone Cf. Thus, according to the EMP duality theory of Section
4.4, the dual problem is
m r
minimize L ft( Ai) + L 8 (Ai I en
i=l i=m+l (4.51)

subject to (A1, .. ,,Ar) E SJ_,


Sec. 4.6 Polyhedral Approximation for Conic Programming 221

and has the same form as the primal problem (4.50). Furthermore, since Ji
is assumed closed proper and convex, and Ci is assumed closed convex, the
conjugate of ft is ft* = Ji and the polar cone of Cf is (Ct)* = C. Thus
when the dual problem is dualized, it yields the primal problem, similar to
the EMP problem of Section 4.4.
Let us denote by fopt and Qopt the optimal primal and dual values.
According to Prop. 1.1.5, (x 0 Pt, >,opt) form an optimal primal and dual
solution pair if and only if they satisfy the standard primal feasibility, dual
feasibility, and Lagrangian optimality conditions. By working out these
conditions similar to Section 4.4, we obtain the following proposition, which
parallels Prop. 4.4.1.

Proposition 4.6.2: (Optimality Conditions) We have -oo <


Qopt = f opt < 00 1 an d X O Pt = ( Xlopt , ... , Xropt) and /\
\opt = ('opt ,opt)
"1 , ... , Ar
are optimal primal and dual solutions, respectively, of problems (4.50)
and (4.51) if and only if

( X1opt , ... , Xropt) E S, , opt , opt ) S.L


( "l , ... , "r E , (4.52)

opt
xi E .
arg mm {f()
i Xi -
,,opt} ,
xi"i i = 1, ... ,m, (4.53)
XiE!J?n

(x?t, >-?t) is a dual pair with respect to Ci and Cf, i = m + 1, ... , r.


(4.54)

Note that by the Conjugate Subgradient Theorem (Prop. 5.4.3 in


Appendix B), the condition (4.53) of the preceding proposition is equivalent
to either one of the following two subgradient conditions

>,~pt E
z
of·( xiopt) '
i
i = l, ... ,m;

(cf. Fig. 4.4.1). Thus the optimality conditions are fully symmetric, con-
sistently with the symmetric form of the primal and dual problems (4.50)
and (4.51).

Generalized Simplicial Decomposition for Conical Constraints

We will now describe an algorithm, whereby problem (4.50) is approxi-


mated by using inner linearization of some of the functions Ji and of all the
cones Ci. The optimal primal and dual solution pair of the approximate
problem is then used to construct more refined inner linearizations. For
simplicity, we are focusing on the pure simplicial decomposition approach
(and by duality on the pure cutting plane approach). It is straightforward
to extend our algorithms to the mixed case, where some of the component
222 Polyhedral Approximation Methods Chap. 4

functions Ji and/ or cones Ci are inner linearized while others are outer
linearized.
We introduce a fixed subset I C { 1, ... , m}, which corresponds to
functions Ji that are inner linearized. For notational convenience, we de-
note by I the complement of I in { 1, ... , m}:
{1, ... , m} = I u I,
and we also denotet
Ic={m+l, ... ,r}.
At the typical iteration of the algorithm, we have for each i E I, a
finite set Xi such that 8 fi(xi) =/- 0 for all Xi E Xi, and for each i E Ic a
finite set Xi c Ci. The iteration is as follows.

Typical Iteration of Simplicial Decomposition for Conic


Constraints
Step 1: (Approximate Problem Solution) Find a primal and
dual optimal solution pair

of the problem

minimize LJi(xi) + Lf\,xJxi) + L8(xi J cone(Xi))


iEJ iEf iElc (4.55)
subject to (x1, ... , Xr) E S,

where Ji
,

'
are the inner linearizations of Ji corresponding to Xi, i E I.
Step 2: (Test for Termination and Enlargement) Enlarge the
sets Xi as follows (see Fig. 4.6.2): '
(a) For i EI, we add any subgradient Xi E aJt(>.i) to Xi,
(b) For i E Ic, we add the projection Xi = Pei (>.i) to Xi.
If there is no strict enlargement for all i EI, i.e., we have Xi E Xi, and
moreover Xi = 0 for all i E Ic, the algorithm terminates. Otherwise,
we proceed to the next iteration, using the enlarged sets xi. .

tWe allow I to be empty, in which case none of the functions Ji is inner


linearized. Then the portions of the subsequent algorithmic descriptions and
analysis that refer to the functions Ji with i E I should be simply omitted. Also,
there is no loss of generality in using I c = {m + 1, ... , r}, since the indicator
functions of the cones that are not linearized, may be included within the set of
functions Ji, i E J.
Sec. 4.6 Polyhedral Approximation for Conic Programming 223

Xi +
New break point Xi

Figure 4.6.2. Illustration of the enlargement step of the algorithm, aft er we ob-
tain a primal and dual optimal solution pair (:ri, ... , Xr, 5.1, ... , 5.r ). The enlarge-
ment step on the left [finding Xi with Xi E aJ:(5.i) for i E l] is also equivalent
to finding Xi satisfying ),i E 8/i(xi), or equivalently, solving the optimization
problem
maximize { 5.~xi - fi(xi)}
subject to Xi E lRni.

The enlargement step on the right, for i E le, is to add to Xi the vector Xi =
Pei (>,i), the projection on Ci of 5.i.

The enlargement process in the preceding iteration is illustrated in


Fig. 4.6.2. Note that we implicitly assume that at each iteration, there
exists a primal and dual optimal solution pair of problem (4.55). The al-
gorithm for finding such a pair is left unspecified. Furthermore, we assume
that the enlargement step can be carried out, i.e., that aft(>..i) -/- 0 for all
i E I. Sufficient assumptions may need to be imposed on the problem to
guarantee that this is so.
The enlargement steps for Ji (left side of Fig. 4.6.2) and for Ci (right
side of Fig. 4.6.2) are quite related. Indeed, it can be verified that the pro-
jection Xi = Pc;(>..i) can be obtained as a positive multiple of the solution
of the problem
maximize { 5..~xi - o(xi I Ci)}
subject to llxill ~ 1 ,
where I is any positive scalar, and II· II denotes the Euclidean norm.t This

t To see this, write the problem as

minimize
subject to Xi E Ci, llxill 2 ~ ,2,
224 Polyhedral Approximation Methods Chap. 4

problem (except for the normalization condition jjxijj ~ ,, which ensures


the attainment of the maximum) is quite similar to the maximization

maximize { >.~xi - fi(Xi)}


subject to Xi E ~ni,

which is used for the enlargement of the set of break points Xi for the
functions Ji (cf. Fig. 4.6.2).
Note that the projection on a cone that is needed for the enlargement
process can be done conveniently in some important special cases. For
example when Ci is a polyhedral cone (in which case the projection is a
quadratic program), or when Ci is the second order cone (see Exercise 4.3
and [FLT02], [SchlO]), or in other cases, including when Ci is the semidefi-
nite cone (see [BoV04], Section 8.1.1, or [HeMll], [HeM12]). The following
is an illustration of the algorithm for a simple special case.

Example 4.6.1 (Minimization Over a Cone)

Consider the problem


minimize f (x)
(4.56)
subject to x E C,
where f: !Jr
f-t (-oo, oo] is a closed proper convex function and C is a closed
convex cone. We reformulate this problem into our basic form (4.50) as

minimize j(x1) + o(x2 I C)


(4.57)
subject to (x1,x2) E S~r {(x1,x2) I x1 = x2}.
Primal and dual optimal solutions have the form (x*, x*) and (>. *, ->. *), re-
spectively, since

By transcribing our algorithm to this special case, we see that (xk, xk)
and (>.\ ->.k) are optimal primal and dual solutions of the corresponding
approximate problem of the algorithm if and only if

xk E arg min f(x),


xECOne(Xk)

and
>.k E af(xk), (4.58)

(cf. Prop. 4.6.2). Once 5.k is found, Xk is enlarged by adding xk, the projec-
tion of -5.k onto C. This construction illustrated in Fig. 4.6.3.

introduce a dual variable µ for the constraint llx;ll 2 ::; "'y2, and show that if
>.; <f_c;,
then the optimal solution is x; = (1/2µ)Pc;(\).
Sec. 4.6 Polyhedral Approximation for Conic Programming 225

Figure 4.6.3. Illustration of the generalized simplicial decomposition method for


minimizing a closed proper convex function f over a cone C (cf. Example 4.6.1).
For each k, given the subset Xk C C, we find a minimum 5;k off over cone(Xk),
we compute a subgradient j_k E 8/(xk) such that _j_k lies in the normal cone
of cone(Xk) at xk [cf. Eq. (4.58)], and we enlarge Xk with xk, the projection of
_j_k onto C.

When C is a cone generated by a finite set of directions X, there is


an interesting variant of the algorithm: we may represent xk as a positive
combination of vectors in X, and simultaneously add all of these vectors to
Xk in place of xk. Because X is finite, it can be seen that this version of
the algorithm terminates finitely. An example of such a possibility arises in
£1-regularization, where C is the epigraph of the £1 norm (see Exercise 4.6).

Convergence Analysis

We will now discuss the convergence properties of the algorithm. We first


show that if it terminates, it does so at an optimal solution.

Proposition 4.6.3: (Optimality at Termination) If the algo-


rithm of this section terminates at some iteration, the corresponding
primal and dual solutions, (±1, ... , Xr) and (A1, ... , Ar), form a primal
and dual optimal solution pair of problem (4.50).

Proof: We will verify that upon termination, the three conditions of Prop.
4.6.2 are satisfied for the original problem (4.50). From the definition of
(x1 , ... , Xr) and ( A1, ... , Ar) as a primal and dual optimal solution pair of
226 Polyhedral Approximation Methods Chap. 4

the approximate problem (4.55), we obtain


(x1, ... ,xr) ES, (>.1, ... )r) E SJ_,
thereby satisfying the first condition (4.52). Upon termination we have
Pei (>.i) = 0, and hence 5.i E c; for all i E le. Also from the optimality
conditions of Prop. 4.6.2, applied to the approximate problem (4.55), we
have that for all i E le, (xi, >.i) is a dual pair with respect to cone(Xi)
and cone(Xi)*, so that by Prop. 4.6.l(b), Xi ..l >.i and Xi E Ci. Thus by
Prop. 4.6.l(b), (xi)i) is a dual pair with respect to Ci and c;, and the
optimality condition (4.54) is satisfied.
Finally, we will show that upon termination, we have
\/ i E JU f, (4.59)
which by Prop. 4.6.2 will imply the desired conclusion. Since (x 1, ... , Xr)
and (>.1, ... , >-r) are a primal and dual optimal solution pair of problem
(4.55), Eq. (4.59) holds for all i E J (cf. Prop. 4.6.2). We will complete the
proof by showing that it holds for all i E J.
Indeed, let us fix i E f and let :i\ E 8ft(>.i) be the vector generated
by the enlargement step upon termination, so that Xi E Xi. Since Ji X· is
an inner linearization of Ji, it follows that T; X· is an outer linearizatio~ of
ft of the form ' "
T; x(>.) = xEXi
' i
max{Jt(>.x) + (>. - >.x)'x}. (4.60)

where the vectors Ax can be any vectors such that x E 8ft(>-x)- Therefore,
the relations Xi E Xi and Xi E 8Jt(>.i) imply that
T:,xi (>.i) = ft(>.i),
which by Eq. (4.28), shows that
87':,xi (>.i) C 8ft(>.i)-
By Eq. (4.53), we also have Xi E aT:,xJAi), so Xi E 8ft(>.i)- Thus Eq.
(4.59) is shown for i E f, and all the optimality conditions of Prop. 4.6.2
are satisfied for the original problem (4.50). Q.E.D.

The next proposition is a convergence result that is similar to the one


we showed in Section 4.4 for the case of pure outer linearization.

Proposition 4.6.4: (Convergence) Consider the algorithm of this


section, under the strong duality condition -oo < Qopt = f opt < oo.
Let (xk, )..k) be the primal and dual optimal solution pair of the ap-
proximate problem (4.55), generated at the kth iteration, and let xt,
i E f, be the vectors generated at the corresponding enlargenient step.
Consider a subsequence {>.k}K that converges to a vector 5... Then:
Sec. 4.6 Polyhedral Approximation for Conic Programming 227

(a) .Xi E Ci for all i E le.


(b) If the subsequences {x7}K:, i EI, are bounded, 5. is dual optimal,
and the optimal value of the inner approximation problem (4.55)
converges monotonically from above to f 0Pt, while the optimal
value of the dual problem of (4.55) co11'verges monotonically fro~n
below to - f 0Pt.

Proof: (a) Let us fix i E le. Since x7 = Pc.(>.7), the subsequence {x7}JC
converges to Xi = Pei (.Xi). We will show that Xi = 0, which implies that
.Xi EC;_
Denote Xf° = Uf= 0 Xf. Since 5.7 E cone(Xf)*, we have x~>.7 ::=; 0
for all Xi E Xik, so that x: .Xi ::=; 0 for all Xi E Xf°. Since Xi belongs to
the closure of Xf°, it follows that x:5.i ::=; 0. On the other hand, since
Xi= PcJ>.i), from Prop. 4.6.l(b) we have x~(>.i - xi) = 0, which together
with x~>.i ::=; 0, implies that llxill 2 :S: 0, or Xi= 0.
(b) From the definition of 7; xk [cf. Eq. (4.60)], we have for all i EI and
k, £ E K, with £ < k, ' i

f.t* (A·)+

't
'k
(A·
't
'£ ,-£
- A-)
't
x.'t <
-
-=* 'k
f i,· xk(A·
i 'L
).

Using this relation and the optimality of ).k for the kth approximate dual
problem to write for all k, £ E K, with £ < k

::::: I:1t(Ai) + I:1:,xt(Ai),


iEI iEJ

for all (A1, ... , Arn) such that there exist Ai E cone(Xf)*, i E le, with
(A1, ... , Ar) ES. Since c; c cone(Xf)*, it follows that

L ft(>.f> + I:Ut(>.D + (>.7 - >.f)'xfl ::::: L ft(Ai) + I:Y:.xt(Ai)


iEJ iEJ iEI iEJ
rn

i=l
(4.61)
for all (A1, ... , Arn) such that there exist Ai E c;, i E le, with (A1, ... , Ar) E
S, where the last in€quality holds since J:,xk is an outer linearization of
ft. I
228 Polyhedral Approximation Methods Chap. 4

By taking limit inferior in Eq. (4.61), ask,£-+ oo with k,£ EK, and
by using the lower semicontinuity of ft, which implies that

ft(5..i) :::; £----,oo,


lim inf ft(5..1),
£EK
i E le,

we obtain
m m

(4.62)
i=l i=l

for all (A1, ... , Am) such that there exist Ai E c;, i E le, with (A1, ... , Ar) E
S. We have 5.. E S and 5..; E c; for all i E le, from part (a). Thus Eq.
(4.62) implies that 5.. is dual optimal. The sequence of optimal values of the
dual approximation problem [the dual of problem (4.55)] is monotonically
nondecreasing (since the outer approximation is monotonically refined) and
converges to - f 0 Pt since 5.. is dual optimal. This sequence is the opposite
of the sequence of optimal values of the primal approximation problem
(4.55), so the latter sequence is monotonically nonincreasing and converges
to f 0 Pt. Q.E.D.

As in Prop. 4.4.3 (cf. the GPA algorithm), the preceding proposition


leaves open the question whether there exists a convergent subsequence
{ )..k} K, and whether the corresponding subsequences { xn
K, i E f, are
bounded. This must be verified separately, for the problem at hand.

4. 7 NOTES, SOURCES, AND EXERCISES

Section 4.1: Cutting plane methods were introduced by Cheney and Gold-
stein [ChG59], and by Kelley [Kel60]. For analysis of related methods, see
Ruszczynski [Rus86], Mifflin [Mif96], Burke and Qian [BuQ98], Mifflin,
Sun, and Qi [MSQ98], and Bonnans et al. [BGL09].
Section 4.2: The simplicial decomposition method was introduced by Hol-
loway [Hol74]; see also Hohenbalken [Hoh77], Pang and Yu [PaY84], Hearn,
Lawphongpanich, and Ventura [HLV87], Ventura and Hearn [VeH93], and
Patriksson [PatOl]. The method was also independently proposed in the
context of multicommodity flow problems by Cantor and Gerla [CaG74].
Some of these references describe applications to communication and trans-
portation networks; see also the surveys by Florian and Hearn [FlH95], Pa-
triksson [Pat04J, the nonlinear programming textbook [Ber99] (Examples
2.1.3 and 2.1.4), and the discussion of the application of gradient projection
methods in [BeG83], [BeG92]. Simplicial decomposition in a dual setting
for problems with a large number of constraints (Exercise 4.4), was pro-
posed by Huizhen Yu, and was developed in the context of some large-scale
parameter estimation/machine learning problems in the papers [YuR07]
and [YBR08].
Sec. 4.7 Notes, Sources, and Exercises 229

Section 4.3: The duality relation between outer and inner linearization
has been known for a long time, particularly in the context of the Dantzig-
Wolfe decomposition algorithm [DaW60], which is a cutting plane/simpli-
cial decomposition algorithm applied to separable problems (see textbooks
such as [Las70], [BeT97], [Ber99] for descriptions and analysis). Our de-
velopment of the conjugacy-based form of this duality follows the paper by
Bertsekas and Yu [BeYll].
Section 4.4: The generalized polyhedral approximation algorithm is due
to Bertsekas and Yu [BeYll], which contains a detailed convergence analy-
sis. Extended monotropic programming and its duality theory were devel-
oped in the author's paper [BerlOa], and will be discussed in greater detail
in Section 6.7.
Section 4.5: The generalized simplicial decomposition material of this
section follows the paper [Be Yl l]. A different simplicial decomposition
method for minimizing a nondifferentiable convex function over a poly-
hedral set, based on concepts of ergodic sequences of subgradients and
a conditional subgradient method, is given by Larsson, Patriksson, and
Stromberg (see [Str97], [LPS98]).
Section 4.6: The simplicial decomposition algorithm with conical approx-
imations is new and was developed as the book was being written.

EXERCISES

4.1 ( Computational Exercise)

Consider using the cutting plane method for finding a solution of a system of
inequality constraints g;(x) :S 0, i = 1, ... , m, where g; : ar >---+ Rare convex
functions. Formulate this as a problem of unconstrained minimization of the
convex function
f(x) = . max g;(x).
i=l, ... ,m

(a) State the cutting plane method, making sure that the method is well-
defined.
(b) Implement the method of part (a) for the case where g;(x) = c;x - b;,
n = 2, and m = 100. The vectors c; have the form c; = (l;, (i), where
l;, (; are chosen randomly and independently within [-1, 1] according to a
uniform distribution, while b; is chosen randomly and independently within
[O, 1] according to a uniform distribution. Does the method converge in a
finite number of iterations? Is the problem solved after a finite number
230 Polyhedral Approximation Methods Chap. 4

of iterations? How can you monitor the progress of the method towards
optimality using upper and lower bounds?

4.2

Consider the conic programming problem (4.50) of Section 4.6,


m r

minimize L f;(x;) + L 8(x; IC;)


i=l i=m+l

subject to (x1, ... , Xr) E S.

Verify that an appropriate dual problem is the one of Eq. (4.51):


m

minimize L ft()..;)+ L 8()..i I en


i=l i=m+l

subject to ()..1, .. ,,)..r) E SJ_.

Verify also the optimality conditions of Prop. 4.6.2.

4.3 (Projection on the Second Order Cone)

Consider the second order cone in Rn:

and the problem of Euclidean projection of a given vector i: = (i:1, ... , i:n) onto
C. Let z E Rn-l be the vector z = (i:1, .. ,,i:n-1). Show that the projection,
denoted x, is given by

if llzll ::; Xn,


if llzll > i:n, llzll + Xn > 0,
if llzll > Xn, llzll + Xn ::; 0.

Note: For a derivation of this formula, together with derivations of projection


formulas for other cones, see [SchlO].

4.4 (Dual Conic Simplicial Decomposition and Primal Constraint


Aggregation)

In this exercise the simplicial decomposition approach is applied to the dual of a


constrained optimization problem, using the conic approximation framework of
Section 4.6. Consider the problem

minimize f (x)
subject to Ax ::; 0, X EX,
Sec. 4.7 Notes, Sources, and Exercises 231

where f : ar >--+ R is a convex function, X is a convex set, and A is an m x n


matrix.
(a) Derive a dual problem of the form

maximize h(e)
subject to eE C,
where
h(e) = inf {f(x)
xEX
+ (x },
and C is the cone {A'µ I µ 2: O}, the polar of the cone { x I Ax ~ O}.
(b) Suppose that the cone C of the dual problem of (a) is approximated by a
polyhedral cone of the form

e
where 1 , ... , em
are m vectors from C. Show that the resulting approxi-
mate problem is dual to the problem

minimize f (x)
subject to µ'.Ax ~ 0, i = 1, ... , m, X EX,

where /Li satisfies /Li 2: 0 and ei


= A'µi. Show also that the constraint
set of this approximate problem is an outer linearization of the original,
and interpret the constraints µ'.Ax ~ 0 as aggregate inequality constraints,
(i.e., nonnegative combinations of constraints).
(c) Explain why the duality of parts (a) and (b) is a special case of the conic
approximation duality framework of Section 4.6.
(d) Generalize the analysis of parts (a) and (b) for the case where the constraint
Ax ~ 0 is replaced by Ax ~ b. Hint: Derive a dual problem of the form

maximize h(e) - (
subject to (e, () EC,

where h(e) = infxEX {f(x) + ( x} and C is the cone {(A'µ, b' µ) I µ 2: 0 }.

4.5 (Conic Simplicial Decomposition with Vector Sum


Constraints)

The algorithms and analysis of Section 4.6 apply to cases where the constraint
set involves the intersection of compact sets and cones, which can be inner lin-
earized separately (the compact set constraints can be represented as indicator
functions via the functions Ji)- This exercise deals with the related case where
the constraints are vector sums of compact sets and cones, which again can be
232 Polyhedral Approximation Methods Chap. 4

linearized separately. Describe how the algorithm of Section 4.6 can be applied
to the problem
minimize J (x)
subject to x E X + C,
where X is a compact set and C is a closed convex cone. Hint: Write the problem
as
minimize f(x1) + 8(x2IX) + 8(x3JC)
subject to x1 = x2 + x3,
which is of the form (4.50) with

4.6 (Conic Polyhedral Approximations of Positively


Homogeneous Functions - £1 Regularization)

We recall that an extended real-valued function is said to be positively homoge-


neous if its epigraph is a cone (see Section 1.6 of Appendix B); examples of such
functions include norms and, more generally, support functions of sets. Consider
the minimization of a sum f + h of closed proper convex functions such that h is
positively homogeneous.
(a) Show that this problem is equivalent to

minimize f(x) +w
subject to (x, w) E epi(h),

and describe how the conic polyhedral approximation algorithm of Section


4.6 (cf. Example 4.6.1) can be applied.
(b) (Conical Approximation of the £1 Norm) Consider the problem

minimize J(x) + llxlli


subject to x E 3r,

where f : Rn >--+ R is a convex function, and the equivalent problem

minimize J(x) +w
subject to (x, w) E C,

where CC Rn+l is the cone C = {(x,w) I llxlli :S w}. Describe how


the algorithm of Example 4.6.1 can be applied. Discuss a finitely ter-
minating variant, where the cone C is approximated with a cone gener-
ated exclusively by coordinate directions of the form (e;, 1), where e; =
(0, ... , 0, 1, 0, ... , 0), with the 1 in the ith position.
5

Proximal Algorithms

Contents

5.1. Basic Theory of Proximal Algorithms p. 234


5.1.1. Convergence . . . . . p. 235
5.1.2. Rate of Convergence. . . . . . p. 239
5.1.3. Gradient Interpretation . . . . p. 246
5.1.4. Fixed Point Interpretation, Overrelaxation, .
and Generalization . . . . . . p. 248
5.2. Dual Proximal Algorithms . . . . . . p. 256
5.2.1. Augmented Lagrangian Methods p. 259
5.3. Proximal Algorithms with Linearization p. 268
5.3.1. Proximal Cutting Plane Methods . p. 270
5.3.2. Bundle Methods . . . . . . . . p. 272
5.3.3. Proximal Inner Linearization Methods ,p. 276
5.4. Alternating Direction Methods of Multipliers p. 280
5.4.1. Applications in Machine Learning . . . p. 286
5.4.2. ADMM Applied to Separable Problems p. 289
5.5. Notes, Sources, and Exercises . . . . . . . p. 293

233
234 Proximal Algorithms Chap. 5

In this chapter, we continue our discussion of iterative approximation meth-


ods for minimizing a convex function f. In particular the generated se-
quence {Xk} is obtained by solving at each k an approximate problem,

where Fk is a function that approximates f. However, unlike the pre-


ceding chapter, Fk is not polyhedral. Instead Fk is obtained by adding
to f a quadratic regularization term centered at the current iterate Xk,
and weighted by a positive scalar parameter ck. This is a fundamental
algorithm, with broad extensions, which can also be combined with the
subgradient and polyhedral approximation approaches of Chapters 3 and
4, as well as the incremental approach of Section 3.3.
We develop the basic theory of the method in Section 5.1, setting the
stage for related methods, which are discussed in subsequent sections in this
chapter and in Chapter 6. In Section 5.2, we consider a dual version of the
algorithm, which among others yields the popular augmented Lagrangian
method for constrained optimization. In Section 5.3, we discuss variations
of the algorithm, which include combinations with the polyhedral approx-
imation methods of Chapter 4. In Section 5.4, we develop another type of
augmented Lagrangian method, the alternating direction method of mul-
tipliers, which is well suited for the special structure of several types of
large-scale problems. In Chapter 6, we will revisit the proximal algorithm
in the context of combinations with gradient and subgradient methods, and
develop generalizations where the regularization term is not quadratic.

5.1 BASIC THEORY OF PROXIMAL ALGORITHMS

In this section we consider the minimization of a closed proper convex


function f: ~n i---+ (-oo, oo] using an approximation approach whereby we
modify f by adding a regularization term. In particular, we consider the
algorithm
Xk+i E arg min 1-llx - xkll 2 } ,
{f(x) + -2Ck (5.1)
xE~n

where xo is an arbitrary starting point and ck is a positive scalar parameter;


see Fig. 5.1.l. This is the proximal algorithm (also known as the proximal
minimization algorithm or the proximal point algorithm).
The degree of regularization is controlled by the parameter Ck. For
small values of ck, Xk+l tends to stay close to Xk, albeit at the expense
of slower convergence. The convergence mechanism is illustrated in Fig.
5.1.2. Note that the quadratic term llx - xkll 2 makes the function that is
minimized at each iteration strictly convex with compact level sets. This
guarantees, among others, that Xk+l is well-defined as the unique minimum
Sec. 5.1 Basic Theory of Proximal Algorithms 235

x* X

Figure 5.1.1. Geometric view of the proximal algorithm (5.1). The minimum of

1 2
f(x) + -llx - xkll
2ck

is attained at a unique point Xk+l as shown. In this figure, 'Yk is the scalar by
which the graph of the quadratic -2- 1 llx - Xkll 2 must be raised so that it just
Ck
touches the graph of f. The slope shown in the figure,

is the common subgradient of f(x) and -2- 1 llx - Xkll 2 at the minimizing point
Ck
Xk+l, cf. the Fenchel Duality Theorem (Prop. 1.2.1).

in Eq. (5.1) [cf. Prop. 3.1.1 and Prop. 3.2.1 in Appendix B; also the broader
discussion of existence of minima in Chapter 3 of [Ber09l].
Evidently, the algorithm is useful only for problems that can benefit
from regularization. It turns out, however, that many interesting problems
fall in this category, and often in unexpected and diverse ways. In particu-
lar, as we will see in this and the next chapter, the creative application of
the proximal algorithm and its variations, together with duality ideas, can
allow the elimination of constraints and nondifferentiabilities, the stabiliza-
tion of the linear approximation methods of Chapter 4, and the effective
exploitation of special problem structures.

5.1.1 Convergence

The proximal algorithm has excellent convergence properties, which we


develop in this section. We first derive some preliminary results in the
following two propositions.
236 Proximal Algorithms Chap. 5

Figure 5.1.2. Illustration of the role of the parameter ck in the convergence


process of the proximal algorithm. In the figure on the left, Ck is large, the graph
of the quadratic term is "blunt," and the method makes fast progress toward the
optimal solution. In the figure on the right, ck is small, the graph of the quadratic
term is "pointed," and the method makes slow progress.

Proposition 5.1.1: If Xk and Xk+l are two successive iterates of the


proximal algorithm (5.1), we have

(5.2)

Proof: Since the function


1
f(x) + -llx
2ck
- xkll 2
is minimized at Xk+l, the origin must belong to its subdifferential at Xk+l,
which is equal to
8j(Xk+l) + Xk+l - Xk,
Ck
(cf. Prop. 5.4.6 in Appendix B, which applies because its relative interior
condition is satisfied since the quadratic term is real-valued). The de-
sired relation (5.2) holds if and only if the origin belongs to the above set.
Q.E.D.

The preceding proposition may be visualized from Fig. 5.1.1. An


interesting observation is that the move from Xk to Xk+l is "nearly" a sub-
gradient step [it would be a subgradient step if 8f(xk+i) were replaced by
8f(xk) in Eq. (5.2)]. This fact will provide motivation later for combina-
tions of the proximal algorithm with the subgradient method (see Section
6.4).
Sec. 5.1 Basic Theory of Proximal Algorithms 237

Generally, starting from any nonoptimal point Xk, the cost function
value is reduced at each iteration, since from the minimization in the algo-
rithm's definition [cf. Eq. (5.1)], by setting x = Xk, we have
1
f(xk+1) + -llxk+1
2ck
- xkll 2 :<::::: f(xk).
The following proposition provides an inequality, which among others shows
that the iterate distance to any optimal solution is also reduced. This in-
equality resembles (but is more favorable than) the fundamental inequality
of Prop. 3.2.2(a) for the subgradient method.

Proposition 5.1.2: (Three-Term Inequality) Consider a closed


proper convex function f : Rn r-+ (-oo, oo], and for any Xk E Rn and
ck > 0, the proximal algorithm (5.1). Then for ally E Rn, we have

Proof: We have
llxk - Yll 2 = llxk - Xk+1 + Xk+l - Yll 2
= llxk - Xk+1ll 2 + 2(xk - Xk+1)'(xk+1
- y) + llxk+l -yll 2 -
Using Eq. (5.2) and the definition of subgradient, we obtain
1
-(xk - Xk+1)'(xk+1 - y) ~ f(xk+1) - f(y).
Ck
By multiplying this relation with 2ck and adding it to the preceding rela-
tion, the result follows. Q.E.D.

Let us denote by f* the optimal value


f* = xEWn
inf f(x),

(which may be -oo) and by X* the set of minima off (which may be
empty),
X* = arg min f(x).
xEWn
The following is the basic convergence result for the proximal algorithm.

Proposition 5.1.3: (Convergence) Let {xk} be a sequence gen-


erated by the proximal algorithm (5.1). Then, if I:%:o Ck = oo, we
have

and if X* is nonempty, {xk} converges to some point in X*.


238 Proximal Algorithms Chap. 5

Proof: We first note that since Xk+i minimizes f(x) + 2!k llx - xkll 2 , we
have by setting x = Xk,
1
f(xk+1) + -2 llxk+l - Xkll 2 S f(xk), V k.
Ck
It follows that {f (xk)} is monotonically nonincreasing. Hence f(xk) .j,. f 00,
where / 00 is either a scalar or -oo, and satisfies Joo?: f*.
From Eq. (5.3), we have for ally E 3?n,
llxk+l -yll 2 S llxk -yll 2 - 2ck(f(xk+1) - f(y)). (5.4)
By adding this inequality over k = 0, ... , N, we obtain
N
llxN+l - Yll 2 + 2 L Ck (f (xk+i) - f(y)) S llxo - Yll 2 , VY E 3?n, N?: 0,
k=O
so that
N
2 L ck(f(xk+1) - f(y)) S llxo - Yll 2 , \;/ y E 3?n, N?: 0.
k=O
Taking the limit as N -t oo, we have
00

(5.5)
k=O
Assume to arrive at a contradiction that f 00 > f *, and let i) be such
that
Joo> f(i}) > f*.
Since {f (xk)} is monotonically nonincreasing, we have
f(xk+1) - f(i))?: Joo - f(i)) > 0.
Then in view of the assumption I:%°=o Ck = oo, Eq. (5.5), with y = i), leads
to a contradiction. Thus f oo = f *.
Consider now the case where X* is nonempty, and let x* be any point
in X*. Applying Eq. (5.4) with y = x*, we have
llxk+1-x*ll 2 S llxk-x*ll 2 -2ck(f(xk+i)-f(x*)), k=O,l, .... (5.6)
From this relation it follows that llxk -x* 11 2 is monotonically nonincreasing,
so {Xk} is bounded. If x is a limit point of {xk}, we have
f(x) s k-+oo,
liminf f(xk) =
kEK
f*

for any subsequence {xk}x:, -t x, since {f(xk)} monotonically decreases to


f* and f is closed. Hence x must belong to X*. Finally, by Eq. (5.6), the
distance of Xk to every x* E X* is monotonically nonincreasing, so {xk}
must converge to a unique point in X*. Q.E.D.

Note some remarkable properties from the preceding proposition.


Convergence to the optimal value is obtained even if X* is empty or
f* = -oo. Moreover, when X* is nonempty, convergence to a single point
of X* occurs.
Sec. 5.1 Basic Theory of Proximal Algorithms 239

5.1.2 Rate of Convergence

The following proposition describes how the convergence rate of the proxi-
mal algorithm depends on the magnitude of Ck and on the order of growth
off near the optimal solution set (see also Fig. 5.1.3) .

f(x)
\

Figure 5.1.3. Illustration of the convergence rate of the proximal algorithm and
the effect of the growth properties of / near the optimal solution set. In the figure
on the left, / grows slowly and the convergence is slow. In the figure on the right,
/ grows fast and the convergence is fast.

Proposition 5.1.4: (Rate of Convergence) Assume that X* is


nonempty and that for some scalars /3 > 0, 8 > 0, and 1 2': 1, we have

f* + f3(d(x)f ~ f(x), \;/ x E ~n with d(x) ~ 8, (5.7)

where
d(x) = x*min
E X*
llx-x*ll- 1

Let also CX)

Z:ck = oo,
k=O

so that the sequence {xk} generated by the proximal algorithm (5.1)


converges to some point in X* by Prop. 5.1.3. Then:
(a) For all k sufficiently large, we have

(5.8)

if 1 > 1, and
240 Proximal Algorithms Chap. 5

d(xk+i) + f3ck S d(xk), (5.9)


if,= 1 and Xk+1 (/; X*.
(b) (Superlinear Convergence) Let 1 <, < 2 and xk (/. X* for all k.
Then if infk~O Ck > 0,

(c) (Linear Convergence) Let,= 2 and Xk (/; X• for all k. Then if


limk--tcxi Ck= c with c E (0, oo),

..
11msup d(Xk+1) 1
~---'- < - - ,
k--+oo d(xk) - l + pc

while if limk--+= Ck = oo,

(d) (Sublinear Convergence) Let,> 2. Then

. d(xk+1)
hm sup ( )2 / < oo.
k--+oc d Xk "I

Proof: (a) The proof uses an argument that can be visualized from Fig.
5.1.4. Since the conclusion clearly holds when Xk+l EX* , we assume that
Xk+l (/. X* and we denote by Xk+i and Xk the projections of Xk+l and Xk
on X*, respectively. From the subgradient relation (5.2), we have

Using the hypothesis, {xk} converges to some point in X*, so it follows


from Eq. (5.7) that

for k sufficiently large. Adding the preceding two relations, we obtain

(5.10)
Sec. 5.1 Basic Theory of Proximal Algorithms 241

Slope

f(x)-----
/
r_______.. X

Figure 5.1.4. Visualization of the estimate d(xk+1)+,Bck ( d(xk+1) )'- 1 ~ d(xk),


cf. Eq. (5.8), in one dimension. Using the hypothesis (5.7), and the triangle
geometry indicated, we have

.B(d(xk+d)' ~ f(xk+l) -r
Xk - Xk+l
= - - ~ - · (xk+l - 8k+1)
Ck

< d(xk) - d(xk+1) d( )


_ . Xk+l ,
Ck

where 8k+l is the scalar shown in the figure. Canceling d(xk+i) from both sides,
we obtain Eq. (5.8) .

for k sufficiently large. We also write the identity

IJ xk+l -i:k+1ll 2 -(xk+1 -xk+1)'(xk+1 -xk) = (xk+l - xk+1)'(xk -xk+1),


and note that since Xk+1 is the projection of Xk+l on X* and Xk EX*, by
the Projection Theorem the above expression is nonpositive. We thus have

ll xk+l - Xk+1 11 2 ::::; (xk+l - Xk+1)'(xk+1 - Xk),


which by adding to Eq. (5.10) and using the Schwarz inequality, yields

llxk+1 - Xk+111 2 + /3ck(d(xk+1))"!::::; (xk+l - Xk+1)'(xk - xk)


::::; l!xk+l - Xk+1 ll llxk - Xk ll·
Dividing with ll xk+l - Xk+ i II (which is nonzero since we assumed that
Xk+1 .J_ X•), Eqs. (5.8) and (5.9) follow .
(b) From Eq. (5.8) and the fact 1 < 2, we obtain the desired relation.
(c) For 1 = 2, Eq. (5.8) becomes

(1 + /3ck)d(xk+1) ::::; d(xk),


242 Proximal Algorithms Chap. 5

from which the result follows.


(d) We have for all sufficiently large k,
~ d(xk) 2
f3(d(xk+1)) :S f(xk+1) - f* '.S - 2- ,
Ck
where the inequalities follow from the hypothesis (5.7), and Prop. 5.1.2
with y equal to the projection of Xk onto X*. Q.E.D.

Proposition 5.1.4 shows that as the growth order 'Yin Eq. (5.7) in-
creases, the rate of convergence becomes slower. An important threshold
value is 'Y = 2; in this case the distance of the iterates to X* decreases
at a rate that is at least linear if ck remains bounded, and decreases even
faster (superlinearly) if Ck -+ oo. Generally, the convergence is accelerated
if ck is increased with k, rather than kept constant; this is illustrated most
clearly when 'Y = 2 [cf. Prop. 5.1.4(c)]. When 1 < 'Y < 2, the convergence
rate is faster than linear (superlinear) [cf. Prop. 5.1.4(b)]. When 'Y > 2, the
convergence rate is generally slower than when 'Y = 2, and examples show
that d(xk) may converge to O sublinearly, i.e., slower than any geometric
progression [cf. Prop. 5.1.4(d)].
The threshold value of 'Y = 2 for linear convergence is related to the
quadratic growth property of the regularization term. A generalized version
of the proposition, with similar proof, is possible for proximal algorithms
that use nonquadratic regularization functions (see [KoB76], and [Ber82a],
Section 3.5, and also Example 6.6.5 in Section 6.6). In this context, the
threshold value for linear convergence is related to the order of growth of
the regularization function.
When 'Y = 1, f is said to have a sharp minimum, a favorable condition
that we encountered in Chapter 3. Then the proximal algorithm converges
finitely. This is shown in the following proposition (see also Fig. 5.1.5).

Proposition 5.1.5: (Finite Convergence) Assume that the set of


minima X* off is nonempty and that there exists a scalar /3 > 0 such
that
f* + f3d(x) :S f(x), \:/xElRn, (5.11)
where d(x) = minx*EX* llx -x*II- Then if L~o Ck= oo, the proximal
algorithm (5.1) converges to X* finitely (i.e., there exists k > 0 such
that Xk E X* for all k ~ k). Furthermore, if co ~ d(xo)//3, the
algorithm converges in a single iteration (i.e., x 1 EX*).

Proof: The assumption (5. 7) of Prop. 5.1.4 holds with 'Y = 1 and all 8 > 0,
so Eq. (5.9) yields
d(xk+1) + f3ck '.S d(xk), if Xk+l (/;. X*. (5.12)
Sec. 5.1 Basic Theory of Proximal Algorithms 243

xo xi x2 = x* X xo X

Figure 5.1.5. Finite convergence of the proximal algorithm for the case of a
sharp minimum, when J(x) grows at a linear rate near the optimal solution set
(e.g., when f is polyhedral). In the figure on the right, convergence occurs in a
single iteration for sufficiently large co.

If L~o Ck = oo and Xk (j. X* for all k, by adding Eq. (5.12) over all k,
we obtain a contradiction. Hence we must have Xk EX* for k sufficiently
large. Also if co ~ d(xo)/ /3, Eq. (5.12) cannot hold with k = 0, so we must
have x1 E X*. Q.E.D.

It is also possible to prove the one-step convergence property of Prop.


5.1.5 with a simpler argument that does not rely on Prop. 5.1.4 and Eq.
(5.9). Indeed, assume that xo (j. X*, let x 0 be the projection of x 0 on X*,
and consider the function

- 1
f(x) = f* + f3d(x) + -llx
2co - xoll -
2 (5.13)

Its subdifferential at x0 is given by the sum formula of Prop. 3.1.3(b):

- { xo - xo
8J(xo) = /3~ llxo _ ±oil + col (xo - xo)
I ~ E [O, 1] }

= { ( di:) - : 0 ) (xo - xo) I ~ E[O, 1]} ·

J(
Therefore, if Co ~ d( xo) / /3, then O E 8 xo), so that xo minimizes J(x).
Since from Eqs. (5.11) and (5.13), we have

- 1
f(x) ::; f(x) + 2Co llx - xoll 2 , 't/ XE ~n,

with equality when x = xo, it follows that xo minimizes

1
f(x) + -2Co llx - xoll 2
244 Proximal Algorithms Chap. 5

Figure 5.1.6. Illustration of the con


Slo dit ion

r + /3d(x) :c; f(x),


for a sharp minimum [cf. Eq. (5.11)].

,__..,
I I
X
X*

over x E X. Thus xo is equal to the first iterate x1 of the proximal algo-


rithm.
The growth condition (5.11) is illustrated in Fig. 5.1.6. The following
proposition shows that the condition holds when f is a polyhedral function
and X* is nonempty.

Proposition 5.1.6: (Sharp Minimum Condition for Polyhedral


Functions) Let f : lRn t-+ ( -oo, oo] be a polyhedral function, and
assume that X*, the set of minima off, is nonempty. Then there
exists a scalar (3 > 0 such that

f* + (3d(x) :S f(x), V X ¢:. X*,

where d(x) = minx*EX* llx - x*II-

Proof: We assume first that f is linear within dom(f), and then general-
ize. Then, there exists a E lRn such that for all x, x E dom(f), we have
f(x) - f(x) = a'(x - x).
For any x E X*, let Bx be the cone of vectors d that are in the normal
cone N x * (x) of X * at x, and are also feasible directions in the sense that
x + ad E dom(f) for a small enough a > 0. Since X* and dom(f) are
polyhedral set s, there exist only a finite number of possible cones Bx as x
ranges over X*. Thus, there is a finite set of nonzero vectors {c1 I j E J},
such that for any x E X *, Bx is either equal to {0}, or is the cone generated
by a subset {c1 I j E Jx}, where J = U xE X*Jx , In addition, for all x EX *
and d E Bx with lldll = 1, we have

d= L 1jCj,
jEJx
Sec. 5.1 Basic Theory of Proximal Algorithms 245

for some scalars "/j ~ 0 with LjEJ,, "(j ~"'?,where "'y = 1/maxjEJ licjll-
Also we can show that for all j E J, we have a'cj > 0, by using the fact
Cj E Bx for some x EX*.
For x E dom(f) with x <f- X*, let x be the projection of x on X*.
Then the vector x - x belongs to S,i, and we have

where (3 = "'yminjEJ a'cj. Since J is finite, we have /3 > 0, and this implies
the desired result for the case where f is linear within dom(f).
Assume now that f is of the form

f(x) = max{a~x
iEJ "
+ bi}, \:/ x E dom(f),

where I is a finite set, and ai and bi are some vectors and scalars, respec-
tively. Let
Y = {(x,z) I z ~ f(x), x E dom(f)},
and consider the function

g(x,z) = { z00 if (x,z) E Y,


otherwise.
Note that g is polyhedral and linear within dom(g). Moreover, its set of
minima is
Y* = {(x,z) Ix EX*, z = f*},
and its minimal value is f *.
Applying the result already shown to the function g, we have for some
/3 > 0
f* + (3d(x, z) :S g(x, z), \:/(x,z)<f_Y*,
where

d(x,z) = min (llx-x*ll 2 +lz-z*l 2 )1 12 = min (llx-x*ll 2 +lz-f*l 2 )1 12 .


(x*,z*)EY* x*EX*

Since
d(x, z) ~ min llx - x* 11 = d(x),
xEX*

we have
f* + (3d(x) :S g(x, z), \:/(x,z)<f-Y*,
and by taking the infimum of the right-hand side over z for any fixed x,

f* + /3d(x) :S f(x), \:/ x <f_ X*.

Q.E.D.
246 Proximal Algorithms Chap. 5

Figure 5.1. 7. Illustration of the func-


tion

J(z) -----
<Pc(z) = xE!Rn
inf {f(x) + ~llx - zll 2 } .
2c
J(x)
\ We have <Pc(z) ~ f(z) for all z E
~n, and at the set of minima of f,
: Slope V</Jc(z)
1 \ i ,7 <Pc coincides with f . We also have
¢c(z)- 2 11x-zll2 :
....,
v</Jc (Z ) = -
z - Xc(z)
I

z Xc(z) X* X --;
C

cf. Prop. 5.1.7.

From the preceding discussion and graphical illustrations, it can be


seen that the rate of convergence of the proximal algorithm is improved by
choosing large values of c. However, the corresponding regularization ef-
fect is reduced as c is increased, and this may adversely affect the proximal
minimizations. In practice, it is often suggested to start with a moderate
value of c, and gradually increase this value in subsequent proximal mini-
mizations. How fast c can increase depends on the method used to solve
the corresponding proximal minimization problems. If a fast Newton-like
method is used, a fast rate of increase of c (say by a factor 5-10) may be
possible, resulting in very few proximal minimizations. If instead a rela-
tively slow first order method is used, it may be best to keep c constant at
a moderate value, which is usually determined by trial and error.

5.1.3 Gradient Interpretation

An interesting interpretation of the proximal iteration is obtained by con-


sidering the function

<Pc(z) = inf {f(x)


xE~n
+~
2c
llx - zll 2 } (5.14)

for a fixed positive value of c. It can be seen that

inf
xE ~n
f (x) ::; <Pc(z) ::; f (z ), V z E ?Rn,

from which it follows that the set of minima off and <Pc coincide (this is also
evident from the geometric view of the proximal minimization given in Fig.
5.1.7) . The following proposition shows that <Pc is a convex differentiable
function, and derives its gradient.
Sec. 5.1 Basic Theory of Proximal Algorithms 247

Proposition 5.1. 7: The function <Pc of Eq. ,(5.14) is convex and dif-
ferentiable, and we have

"v</ic(z) = z-:- Xc(z) \:/ z E lRn, (5.15)


, C

where Xc(z) is the unique minimizer in Eq. (5.14). Moreo~er

"v<./ic(z) E of(xc(z)),

Proof: We first note that <Pc is convex, since it is obtained by partial


minimization of f(x) + 2 1c llx - zll 2, which is convex as a function of (x, z)

(cf. Prop. 3.3.1 in Appendix B). Furthermore, <Pc is real-valued, since the
infimum in Eq. (5.14) is attained.
Let us fix z, and for notational simplicity, denote z = xc(z). To show
that <Pc is differentiable with the given form of gradient, we note that by
the optimality condition of Prop. 3.1.4, we have v E 8¢c(z), or equivalently
0 E 8</>c(z) - v, if and only if z attains the minimum over y E lRn of

</>c(Y) - v'y = inf {t(x)


xElRn
+ _!_llx
2c
- Yll 2}- v'y.
Equivalently, v E 8</>c(z) if and only if (z, z) attains the minimum over
(x, y) E lR2 n of the function
1
F(x, y) = f(x) + 2c llx - Yll2 - v'y,
which is equivalent to (0,0) E 8F(z,z), or
z-z z-z
0 E 8/(z) + --,
C
- --.
V -
C
(5.16)

[This last step is obtained by viewing F as the sum of the function f and
the differentiable function
1
2cllx - Yll 2- v'y,

and by writing

x-y y-x }
8F(x,y) = {(g,0) I g E 8/(x)} + { -c-, -c- -v ;

cf. Prop. 5.4.6 in Appendix B.] The right side ofEq. (5.16) uniquely defines
v, so that v is the unique subgradient of <Pc at z, and it has the form
248 Proximal Algorithms Chap. 5

v = (z - z)/c, as required by Eq. (5.15). From the left side of Eq. (5.16),
we also see that v = 'V</>c(z) E of (xc(z)). Q.E.D.

Using the gradient formula (5.15), we see that the proximal iteration
can be written as
(5.17)
so it is a gradient iteration for minimizing </>ck with stepsize equal to Ck.
This interpretation provides insight into the working mechanism of the al-
gorithm and has formed the basis for various acceleration schemes, based
on gradient and Newton-like schemes, particularly in connection with the
augmented Lagrangian method, to be discussed in Section 5.2.1. In this
connection, we will show in the next subsection that a stepsize as large as
2ck can be used in place of Ck in Eq. (5.17). Moreover, the use of extrapo-
lation schemes to modify the stepsize Ck has been shown to be beneficial in
the constrained optimization context of the augmented Lagrangian method
(see [Ber82a], Section 2.3.1).

5.1.4 Fixed Point Interpretation, Overrelaxation, and


Generalization
We will now discuss the connection of the proximal algorithm with iter-
ations for finding fixed points of mappings that are nonexpansive with
respect to the Euclidean norm. As a first step, we will view the problem of
minimizing f as a fixed point problem involving a special type of mapping.
For a scalar c > 0 and a closed proper convex function f : ~n i--+
(-oo, oo], let us consider the (single-valued) mapping Pc,t : ~n i--+ ~n
given by

Pc 1(z) = arg min {f(x)


' xE~n
+ 21C llx - zll 2 } , z E ~n, (5.18)

which is known as the proximal operator corresponding to c and f. The


set of fixed points of Pc,t coincides with the set of minima of f, and the
proximal algorithm, written as

may be viewed as a fixed point iteration. This alternative view leads to


useful insights and some important generalizations.
The key idea is based on the mapping Nc,t : ~n i--+ ~n given by

z E ~n. (5.19)

We can visualize this mapping by writing

P.c,f (Z )-
-
Nc,!(z) +z ,
2
Sec. 5.1 Basic Theory of Proximal Algorithms 249

so Pc,J(z) is the midpoint of the line segment connecting Nc,J(z) and z. For
this reason, Nc,f is called the reflection operator. Some interesting facts
here are that:
(a) The set of fixed points of Nc,f is equal to the set of fixed points of
Pc,f and hence the set of minima off. Moreover, as we will show
shortly, the mapping Nc,f is nonexpansive, i.e.,

Thus for any x, Nc,t(x) is at least as close to the set of minima off
as x.
(b) The interpolated iteration

(5.20)

where the interpolation parameter a,k satisfies O.k E [E, 1- E] for some
scalar E > 0 and all k, converges to a fixed point of Nc,f, provided
Nc,t has at least one fixed point (this is a consequence of a classical
result on the convergence of interpolated nonexpansive iterations, to
be stated shortly).
(c) The preceding interpolated iteration (5.20), in view of the definition
of Nc,t [cf. Eq. (5.19)], can be written as

(5.21)

and as a special case, for a,k = 1/2, yields the proximal algorithm
Xk+i = Pc,J(xk)- We thus obtain a generalized form of the proximal
algorithm, which depending on the parameter ak, provides for extrap-
olation (when 1/2 < a,k < 1) or interpolation (when O < a,k < 1/2).
We will now prove the facts just stated in the following two propo-
sitions. To this end, we note that for any z E ~n, the proximal iterate
Pc,t (z) is uniquely defined, and we have

z = Pc,t(z) =} z = z + CV for some VE af(z), (5.22)

since the right side above is the necessary condition for optimality of z
in the proximal minimization (5.18) that defines Pc,J(z). Moreover the
converse also holds,

z = z + CV for some v E af(z) and z E ~n =} z = Pc,J(z), (5.23)

since the left side above is the sufficiency condition for z to be (uniquely)
optimal in the proximal minimization. An equivalent way to state the two
250 Proximal Algorithms Chap. 5

Figure 5.1.8. The figure on the left provides a graphical interpretation of the
proximal iterat ion at a vector z for a one-dimensional problem. The line that
passes through z a nd has slope -1/c intercepts the graph of the (monotone)
subdifferential mapping Bf(x ) at a unique point v, which corresponds to z, the
unique vector Pc,f(z) produced by the proximal iteration [cf. Eqs. (5.22)-(5.24)] .
The figure on the left also illustrates the reflection operator Nc,J(z) = 2Pc,J(z)-z.
The iterate Pc, J(z) lies at the midpoint between z and Nc,J(z) [cf. Eq. (5.25)] .
Note that all points between z and Nc,J(z) are at least as close to x * as z. The
figure on the right illustrates the proximal iteration xk+l = Pc,J(xk) -

relations (5.22) and (5.23) is that any vector z E lRn can be written in
exactly one way as
where z E lRn , VE af(z), (5.24)
and moreover the vector z is equal to Pc,J(z),
z = Pc,J(z).
Using Eq. (5.19), we also obtain a corresponding formula for Nc,1:
Nc,J(z) = 2Pc,t(z ) - z = 2z - (z + cv) = z - CV. (5.25)
Figure 5.1.8 illustrates the preceding relations and provides a graphical
interpretation of the proximal algorithm. The following proposition verifies
the nonexpansiveness property of Nc,t·

Proposition 5.1.8: For any c > 0 and closed proper convex function
f : lRn t-+ ( -oo, oo] , the mapping

Nc,t(z) = 2Pc,J(z) - z

[cf. Eqs. (5.18) and (5.19)] is nonexpansive, i.e.,

Moreover, any interpolated mapping (1 - o:)z + o:Nc,t(z), a: E (0, 1]


(including the proximal operator Pc,t, which corresponds to a: =,1/2)
is nonexpansive.
Sec. 5.1 Basic Theory of Proximal Algorithms 251

Proof: Consider any z 1 , z2 E ~n, and express them as

with

cf. Eq. (5.24). Then we have

llz1 - z2ll 2 = ll(z1 + cv1) - (z2 + cv2)ll 2


(5.26)
= llz1 - z2ll 2 + 2c(z1 - z2)'(v1 - v2) + c2llv1 - v2ll 2-

Also, from Eq. (5.25),

and it follows that

IINc,1(z1) - Nc,1(z2)ll 2 = ll(z1 - cvi) - (z2 - cv2)ll 2


= llz1 - z2ll 2 - 2c(z1 - z2)'(v1 - v2) + c2llv1 - v2ll 2-
(5.27)
By subtracting Eq. (5.26) from Eq. (5.27), we obtain

The nonexpansiveness of Ne,! will follow if we can show that the inner
product in the right-hand side is nonnegative. Indeed this is obtained by
using the definition of subgradients to write

so by adding these two relations, we have

(5.28)

and the result follows. Finally the nonexpansiveness of Nc,f clearly implies
the nonexpansiveness of the interpolated mapping. Q.E.D.

We will now use the Krasnosel'skii-Mann Theorem, which shows that


fixed points of nonexpansive mappings can be found by an interpolated
iteration. The theorem is proved and intuitively explained in Appendix A
(Prop. A.4.2). For convenience, we reproduce its statement here.
252 Proximal Algorithms Chap. 5

Proposition 5.1.9: (Krasnosel'skii-Mann Theorem for Non-


expansive Iterations) Consider a mapping T : ~n 1--t ~n that is
nonexpansive with respect to the Euclidean norm 11 · 11, i.e.,

IIT(x) -T(y)II S llx -yll, 'v x,y E ~n,

and has at least one fixed point. Then the iteration

(5.29)

where a,k E [O, 1] for all k and I:;~0 ak(l - a,k) = oo, converges to a
fixed point of T, starting from any xo E ~n.

By applying the preceding theorem with T = N e,!, we obtain the


following.

Proposition 5.1.10: (Stepsize Relaxation in the Proximal Al-


gorithm) The iteration

(5.30)

where "/k E [E, 2 - Ej for some scalar E > 0 and all k, converges to a
minimum of f, assuming at least one minimum exists.

Proof: Using the definition

the iteration (5.30) is equivalent to

with 'Yk = 2ak . Since the fixed points of Ne,! are the minima off, the
result follows from Prop. 5.1.9 with T = N e,f· Q.E.D.

In iteration (5.30) the parameter c is constant, but an extension is


possible, whereby convergence can be shown for the case of the version with
variable Ck:
(5.31)
provided that infk2:o Ck > 0 [see [Ber75d] for the present minimization
context, and [EcB92J for a more general context]. This is based on the fact
Sec. 5.1 Basic Theory of Proximal Algorithms 253

that the set of fixed points of Nc,t does not depend on c as long as c > 0.
Note that for "Yk = 1, we obtain the proximal algorithm Xk+l = Pck,t(xk),
Another interesting fact is that the iteration (5.31) can also be written
as a gradient iteration

where
c/>c(z) = inf {f(x)
xE~n
+ _!_llx
2c
- zll 2 } ,
[cf. Eq. (5.14)], based on the fact

r1,1,. ( ) _ Xk - Pck,t(xk)
V'f'ck Xk - ,
Ck

[cf. Eq. (5.17)]. Since the performance of gradient methods is often im-
proved by intelligent stepsize choice, this motivates stepsize selection sche-
mes that are aimed at acceleration of convergence.
Indeed, it turns out that with extrapolation along the interval con-
necting Xk and Pc,t(xk), we can always obtain points that are closer to the
set of optimal solutions X* than Pc,t(xk)- By this we mean that for each
x with Pc,t(x) rf. X*, there exists "YE (1, 2) such that

min
x*EX*
llx+7(Pc1(x)-x)-x*II<
'
min IIPcJ(x)-x*II·
x*EX* '
(5.32)

This can be seen with a simple geometrical argument (cf. Fig. 5.1.9). Thus
the proximal algorithm can always benefit from overrelaxation, i.e., "'/k E
(1, 2), if only we knew how to do it effectively. One may consider a trial and
error scheme to determine a constant value of "Yk E (1, 2) that accelerates
convergence relative to "Yk = l; this may work well when Ck is kept constant.
More systematic procedures for variable values Ck have been suggested in
[Ber75d] and [Ber82a], Section 2.3.1. In the procedure of [Ber82a], the
overrelaxation parameter is chosen within (1, 2) as

where f3 is a positive scalar that is determined experimentally for a given


problem.

Generalization of the Proximal Algorithm

The preceding analysis can be generalized further to address the problem


of finding a zero of a multivalued mapping M : Rn .-+ 2~n, which maps
vectors x E Rn into subsets M(x) c Rn. By a zero of M we mean a vector
254 Proximal Algorithms Chap. 5

· Pc,J(x) - x*

Figure 5.1.9. Geometric proof that overrelaxation in the proximal algorithm


can produce iterates that are closer to any point in X* than Pc,1(x), assuming
Pc,1(x) is not optimal. Let x* be the point of X* that is at minimum distance
from Pc,1(x). The iterate Pc,J(x) lies at the midpoint between x and Nc,J(x) [cf.
Eq. (5.19)]. By Prop. 5.1.1, the vector ½(x - Pc,J(x)) is a subgradient off at
Pc,J(x), so

.!.(x-Pc,J(x)) 1 (x*-Pc,1(x)) ~f(x*)-f(Pc,1(x)) <0.


C

Thus the angle between Pc,f (x) - x and Pc,1 (x) - x* is strictly greater than 7r /2.
From triangle geometry, it follows that there exist points in the interval connecting
Pc,J(x) and Nc,J(x), which are closer to x* than Pc,1(x), so there exists 'YE (1, 2)
such that Eq. (5.32) holds.

x• such that O E M(x*). As an example, the problem of minimizing a


convex function f : lRn 1-t (-oo, oo] is equivalent to finding a zero of the
multivalued mapping M(x) = &f(x). However, there are other important
applications where M is not the subdifferential mapping of a convex func-
tion, such as for example the solution of monotone variational inequalities
(see the end-of-chapter references).
Looking back into the preceding analysis, we see that it generalizes,
essentially verbatim, from the case M(x) = &f(x) to the case of a general
multivalued mapping M : lRn 1-t 2wn, provided M has the following two
properties:
(a) Any vector x E lRn can be written in exactly one way as

x=x+cv where x E lRn, v E M(x), (5.33)

[cf. Eq. (5.24)]. This was necessary in order for the mapping Pc,t that
maps x to x, and the corresponding mapping

Nc,t(x) = 2Pc,t(x) - x, XE lRn,

[cf. Eq. (5.19)] to be well-defined as a single-valued mapping.


Sec. 5.1 Basic Theory of Proximal Algorithms 255

(b) We have
(x1 - x2)'(v1 - v2) 2: 0, V x1,x2 E dom(M)
(5.34)
and V1 E M(x1), v2 E M(x2),
where
dom(M) = {x I M(x) # 0}
(assumed nonempty). This property, known as monotonicity of M,
was used to prove that the mapping Nc,f is nonexpansive in Prop.
5.1.8 [cf. Eq. (5.28)].
It can be shown that both of the preceding two properties hold if and
only if M is maximal monotone, i.e., it is monotone in the sense of Eq.
(5.34), and its graph {(x,v) Iv E M(x)}, is not strictly contained in the
graph of any other monotone mapping on ?Rn t (the subdifferential mapping
can be shown to be maximal monotone; this is shown in several sources,
[Roc66], [Roc70], [RoW98], [BaCll]). Maximal monotone mappings, the
associated proximal algorithms, and related subjects have been extensively
treated in the literature, to which we refer for further discussion; see the
end-of-chapter references.
In summary, the proximal algorithm in its full generality applies to
the problem of finding a zero of a maximal monotone multivalued mapping
M : ?Rn f-t 21Rn [a vector x* such that O E M (x*)]. It takes the form
Xk+l = Xk - CVk,

where Vk is the unique point v such that v E M(xk+i); cf. Fig. 5.1.10. If
Mis a single-valued mapping, we have Xk+l = Xk - cM(xk+1), or
Xk+l = (I+ cM)- 1 (xk),
where I is the identity mapping and (I + cM)- 1 is the inverse of the
mapping I + cM. Moreover, a more general version of the algorithm is
valid, allowing for a stepsize 'Yk E (0, 2),
Xk+l = Xk - "fkCVk,

with the possibility of reduction of the distance to all zeroes of M using


an appropriate overrelaxation scheme with 'Yk > I. The analysis of the
present section readily extends to this more general context. There is only
one difficult point in this analysis, which we have not addressed and have
referred instead to the literature: the equivalence of the properties (a) and
(b) above with the maximal monotonicity of the mapping M.

t Note that the monotonicity property (5.34) and the existence of the rep-
resentation (5.33) for some x E 3r implies the uniqueness of this representation
[if X = X1 + CV1 = X2 + CV2, then Q ::::: (x1 - x2)'(v1 - v2) = -llx1 - x211 2, so
x1 = x2]. Thus maximal monotonicity of M is equivalent to monotonicity and
existence of a representation of the form (5.33) for every x E 3r, something that
can be easily visualized (cf. Fig. 5.1.10) but quite hard to prove (see the original
work [Min62], or subsequent sources such as [Bre73], [RoW98], [BaCll]).
256 Proximal Algorithms Chap. 5

M(x)
Vk - - - - - \ - - -

Xk+3 Xk+2 X
X k+I = Xk - CVk

Figure 5.1.10. One-dimensional illustration of the proximal algorithm for find-


ing a zero of a maximal monotone multivalued mapping M : ~n t---+ 2!Rn. The
important fact here is that the maximal monotonicity of M implies that every
Xk E ~n can be uniquely represented as Xk = Xk+l + cvk, where vk E M(xk+1)-

5.2 DUAL PROXIMAL ALGORITHMS

In this section we will develop an equivalent dual implementation of the


proximal algorithm, based on Fenchel duality. We will then apply it in a
special way to obtain a popular constrained optimization algorithm, the
augmented Lagrangian method.
We recall the proximal algorithm of Section 5.1:

Xk+l E arg min {J(x) 1-llx - xkll 2 } ,


+ -2ck (5.35)
xE1J?n

where J : lRn H (-oo, oo] is a closed proper convex function, xo is an


arbitrary starting point, and {ck} is a positive scalar parameter sequence.
We note that the minimization above is in a form suitable for application
of the Fenchel duality theory of Section 1.2, with the identifications

fi(x) = J(x),

We can write the Fenchel dual problem as

minimize Ji(>.)+ J2(- >.)


(5.36)
subject to ).. E lRn,

where Ji and J2 are the conjugate functions of Ji and h, respectively. We


have
Sec. 5.2 Dual Proximal Algorithms 257

where the last equality follows by noting that the supremum over x is
attained at x = Xk + Ck A- Introducing f*, the conjugate of f,

Ji(>..)= f*(>..) = sup { x' >.. - f(x) },


xE~n

and substituting into Eq. (5.36), we see that the dual problem (5.36) can
be written as
minimize f*(>..) - xk>.. + ii>..11 2 c; (5.37)
subject to >.. E lRn.
We also note that there is no duality gap, since h and f 2 are real-
valued, so the relative interior conditions of the Fenchel Duality Theorem
[Prop. 1.2.l(a),(b)] are satisfied. In fact there exist unique primal and dual
optimal solutions, since both primal and dual problems involve a strictly
convex cost function with compact level sets.
Let Ak+l be the unique solution of the minimization (5.37). Then
Ak+l together with Xk+l satisfy the necessary and sufficient optimality
conditions of Prop. 1.2.l(c),

(5.38)

Using the form of h, the second relation above yields

, _ Xk - Xk+l
"k+l - , (5.39)
Ck

see Fig. 5.2.1. This equation can be used to find the primal proximal iterate
Xk+1 of Eq. (5.35), once Ak+l is known,

(5.40)

We thus obtain a dual implementation of the proximal algorithm. In


this algorithm, instead of solving the Fenchel primal problem involved in the
proximal iteration (5.35), we first solve the Fenchel dual problem (5.37) to
obtain the optimal dual solution Ak+l, and then obtain the optimal primal
Fenchel solution Xk+l using Eq. (5.40).

Dual Proximal Algorithm:


Find
Ak+l E arg min
AE~n
{!*(>..) - xk>.. + Ck2 ll>..11 2 } , (5.41)

and then set


(5.42)
258 Proximal Algorithms Chap. 5

Slope Ak+I
1 ~ Optimal dual
proximal solution

Xk Xk+l X* X

+
Optimal primal
proximal solution

Figure 5.2.1. Illustration of the optimality condition

cf. Eq. (5.39), and the relation between the primal and dual proximal solutions.

The dual algorithm is illustrated in Fig. 5.2.2. Note that as Xk con-


verges to a minimum x* off, Ak converges to 0. Thus the dual iteration
(5.41) does not aim to minimize f*, but rather to find a subgradient off*
at 0, which minimizes f [cf. Prop. 5.4.4(b) in Appendix BJ. In particular,
we have

v' k 2 0,
[cf. Eq. (5.38) for the left side, and the Conjugate Subgradient Theorem
(Prop. 5.4.3 in Appendix B) for the right side], and as Ak converges to 0
and Xk converges to a minimum x* off, we have

0E of(x*), x* E of*(O).
The primal and dual implementations of the proximal algorithm are
mathematically equivalent and generate identical sequences {xk}, assuming
the same starting point xo and penalty parameter sequence {Ck}. Whether
one is preferable over the other depends on which of the minimizations
(5.35) and (5.41) is easier, i.e., whether f or its conjugate f* has more
convenient structure. In the next section we will discuss a case where
the dual proximal algorithm is more convenient and yields the augmented
Lagrangian method.
Sec. 5.2 Dual Proximal Algorithms 259

------ 1k

I
Optimal Slope= xk-
Slope= 0
Primal Proximal Iteration Dual Proximal Iteration

Figure 5.2.2. Illustration of primal and dual proximal algorithms. The primal
algorithm aims to find x*, a minimum off. The dual algorithm aims to find x*
as a subgradient of J* at O [cf. Prop. 5.4.4(b) in Appendix BJ.

5.2.1 Augmented Lagrangian Methods

We will now apply the proximal algorithm to the dual problem of a con-
strained optimization problem. We will show how the corresponding dual
proximal algorithm leads to the class of augmented Lagrangian methods.
These methods are popular because they allow the solution of constrained
optimization problems, through a sequence of easier unconstrained (or less
constrained) optimizations, which can be performed with fast and reliable
algorithms, such as Newton, quasi-Newton, and conjugate gradient meth-
ods. Augmented Lagrangian methods can also be used for smoothing of
nondifferentiable cost functions, as described in Section 2.2.5; see nonlinear
programming textbooks, and the monograph [Ber82a], which is a compre-
hensive reference on augmented Lagrangian, and related smoothing and
sequential quadratic programming methods.
Consider the constrained minimization problem

minimize f (x)
(5.43)
subject to x E X, Ax = b,
where f: ~n H (-oo, oo] is a convex function, X is a convex set, A is an
m x n matrix, and b E ~m. t

t We focus on linear equality constraints for convenience, but the analysis


can be extended to convex inequality constraints as well (see the subsequent
discussion). In particular, a linear inequality constraint of the form ajx :S bj can
be converted to an equality constraint ajx+zi = bj by using an artificial variable
zj, and the constraint zi 2: 0, which can be absorbed into the set X.
260 Proximal Algorithms Chap. 5

Consider also the corresponding primal and dual functions


p(u) = inf
xEX, Ax-b=u
f(x), q(>.) = inf
xEX
{f (x) + N(Ax - b) }.
We assume that pis closed and proper, and that the optimal value p(O) is
finite, so that, except for sign changes, q and p are conjugates of each other
[i.e., -q(->.) is the conjugate convex function of p(u); cf. the discussion in
Section 4.2 in Appendix BJ and there is no duality gap.
Let us apply the proximal algorithm to the dual problem of maximiz-
ing q. It has the form t

Ak+l E arg max {q(>.) - - 1-11>- - Akll 2 } .


>-E1Rm 2ck

In view of the conjugacy relation between q and p, it can be seen that the
dual proximal algorithm (5.41 )-(5.42) has the form+

Uk+l E arg min {p(u)


uE1Rm
+ >./u + c2k llull 2 } , (5.44)

which is Eq. (5.41), and


(5.45)
which is Eq. (5.42); see Fig. 5.2.3.
To implement this algorithm, we introduce for any c > 0, the aug-
mented Lagrangian function
C
Lc(x, >.) = J(x) + N(Ax - b) + 2IIAx - bll 2 ,
and we use the definition of p to write the minimization (5.44) as

inf {
uE1Rm
inf
xEX, Ax-b=u
{f(x)} + Ak 1U + Ck2 llull 2 }
= uE1Rm
inf inf {f(x) + >.k(Ax - b) + Ck IIAx - bll 2 }
xEX, Ax-b=u 2

= inf {f(x) + >.k(Ax - b) +


xEX
Ck
2
IIAx - bll 2 }
= xEX
inf Lck(x, Ak).

tThere is an unfortunate (but hard to avoid) reversal of notation in this


section, because the primal proximal algorithm is applied to the dual problem
max>, q(>.) (i.e., to minimize the negative dual function -q), while the dual prox-
imal algorithm involves the conjugate of -q whose argument is the perturbation
vector u. Thus the dual variable >. corresponds to the primal vector x in the
preceding section, while the perturbation vector u corresponds to the dual vector
>. of the preceding section.
+ This takes into account the required sign and symbol changes, so that
f"" -q, x"" ->., Xk "" -Ak, f* ""p, and u is the argument of p.
Sec. 5.2 Dual Proximal Algorithms 261

Slope= - >.k
/

"
Slope= -Ak+l
u

Figure 5.2.3. Geometric interpretation of the dual proximal minimization

uk+I E arg min {p(u)


uE~m
+ >../u+ Ck
2
llull 2 } , (5.46)

and the update


Ak+l = Ak + CkUk+l
in the augmented Lagrangian algorithm. From the minimization (5.46) we have

so the vector uk+I is the one for which ->..k is a subgradient of p(u) + ~ llull 2 at
u = uk+I, as shown in the figure. By combining the last two relations, we obtain
->..k+l E 8p(uk+I), as shown in the figure. The optimal value in the minimization
(5.46) is equal to infxEX Lek (x, >..k), and can be geometrically interpreted as in
the figure.

The minimizing u and x in this equation are related, and we have

where Xk+l is any vector that minimizes Lck(x, Ak) over X (we assume
that such a vector exists - while the existence of the minimizing Uk+i is
guaranteed, since the minimization (5.44) has a solution, the existence of
the minimizing Xk+I is not guaranteed, and must be either assumed or
verified independently).
Using the preceding expression for uk+l, we see that the dual proximal
algorithm (5.44)-(5.45), applied to the maximization of the dual function
q, starts with an arbitrary initial vector >.o, and iterates according to

where Xk+l is any vector that minimizes Lck(x, Ak) over X. This method is
known as the augmented Lagrangian algorithm or the method of multipliers.
262 Proximal Algorithms Chap. 5

Augmented Lagrangian Algorithm:


Find
(5.47)

and then set


(5.48)

The convergence properties of the augmented Lagrangian algorithm


are derived from the corresponding properties of the proximal algorithm (cf.
Section 5.1). The sequence { q(>.k)} converges to the optimal dual value,
and { >.k} converges to an optimal dual solution, provided such a solution
exists (cf. Prop. 5.1.3). Furthermore, convergence in a finite number of
iterations is obtained when f and X are polyhedral (cf. Prop. 5.1.5).
Assuming that there exists a dual optimal solution, so that {>.k} con-
verges to such a solution, we also claim that every limit point of the gen-
erated sequence {xk} is an optimal solution of the primal problem (5.43).
To see this, note that from the update formula (5.48) we obtain
Axk+l - b --t 0, ck(Axk+l - b) --t 0.
Furthermore, we have
Lek (xk+l, >.k) = min {f(x)
xEX
+ >.~(Ax - b) + Ck2 I Ax - bll 2 } .
The preceding relations yield
limsupf(xk+1) = limsupLck(xk+l,>.k) :S J(x), V x EX with Ax= b,
k-+oo k-+oo
so if x* E X is a limit point of { Xk}, we obtain
J(x*) :S J(x), V x EX with Ax= b,
as well as Ax* = b (in view of Axk+I - b --t 0). Therefore any limit point
x* of the generated sequence { xk} is an optimal solution of the primal
problem (5.43). We summarize the preceding discussion in the following
proposition.

Proposition 5.2.1: (Convergence Properties of Augmented


Lagrangian Algorithm) Consider a sequence { (xk, >.k)} generated
by the augmented Lagrangian algorithm (5.47), (5.48), applied to
problem (5.43), assuming that L~o Ck = oo. Assume further that
the primal function pis closed and proper, and that the optimal value
p(O) is finite. Then the dual function sequence { q(>.k)} converges to
the common primal and dual optimal value. Moreover, if the dual
problem has at least one optimal solution, the following hold:
Sec. 5.2 Dual Proximal Algorithms 263

(a) The sequence {>.k} converges to an optimal dual solution. ·Fur-


thermore, convergence in a finite number of iterations is obtained
if f and X are polyhedral.
(b) Every limit point of {xk} is an optimal solution of the primal
problem (5.43).

Note that there is no guarantee that { Xk} has a limit point, and indeed
the dual sequence { >.k} will converge to a dual optimal solution, if one
exists, even if the primal problem (5.43) does not have an optimal solution.
As an example, the reader may verify that for the two-dimensional/single
constraint problem where f(x) = ex 1 , x 1 +x 2 = 0, x 1 ER, x 2 2:: 0, the dual
optimal solution is >.* = 0, but there is no primal optimal solution. For
this problem, the augmented Lagrangian algorithm will generate sequences
{.Xk} and {xk} such that Ak-+ 0 and xl-+ -oo, while J(xk )-+ f* = 0.

Linear and Nonlinear Inequality Constraints

The simplest way to treat inequality constraints in the context of t he aug-


mented Lagrangian methodology, is to convert them to equality constraints
by using additional nonnegative variables. In particular, consider the ver-
sion of the problem with linear inequality constraints Ax :s; b, which we
write as
minimize f (x)
(5.49)
subject to x E X, a~ x :S: b1, . . . , a~x ::'.: br,
where f: Rn H (-oo, oo] is a convex function and X is a convex set. We
can convert this problem to the equality constrained problem

minimize f (x)
(5.50)
subjectto xEX, z2::0, a~x+z 1 = b1, . .. ,a~x+ zr = br,

where z = (z 1 , . .. , zr ) is a vector of additional artificial variables.


The augmented Lagrangian method for this problem involves mini-
mizations of the form

for a sequence of values ofµ = (µ 1 , .. . ,µr) and c > 0. This type of mini-
mization can be done by first minimizing L c(x, z, µ) over z 2:: 0, obtaining

L c(x, µ) = minLc(x, z, µ),


z2'.0
264 Proximal Algorithms Chap. 5

and then by minimizing Lc(x, µ) over x EX. A key observation is that the
first minimization with respect to z can be carried out in closed form for
each fixed x, thereby yielding a closed form expression for Lc(x, µ).
Indeed, we have

r
minLc(x, z, µ) = f(x)
z2'.0
+L
.
min{µ) (ajx -
zJ >O
bj + zJ) + -2c lajx - bj + zJl 2 } .
J=l -
(5.51)
The function in braces above is quadratic in zj. Its constrained minimum
is zJ = max{O, zJ}, where zJ is the unconstrained minimum at which the
derivative is zero. The derivative is µJ + c( ajx - bj + zJ), so we obtain

z) =max{O,zJ} =max{o,-(µ: +ajX-bj) }·

Denoting
gj+( x,µJ,c
. ) =max { ajx-bj,--z
I µJ} , (5.52)

we have ajx - bj + zJ = 9j(x, µ), c). Substituting in Eq. (5.51), we obtain


a closed form expression for Lc(x, µ) = minz::::o Lc(x, z, µ):

r
Lc(x,µ) = f(x) + L {µJgj(x,µJ,c) +
j=l
i (gt(x,µJ,c)) 2
}. (5.53)

After some calculation, left for the reader, we can also write this expression
as

Lc(x, µ) = f(x) +;ct {J=l


(max{O, µJ + c(ajx - bj)} )2 - (µJ)2}, (5.54)

and we can view it as the augmented Lagrangian function for the inequality
constrained problem (5.49).
It follows from the preceding transcription that the augmented La-
grangian method for the inequality constrained problem (5.49) consists of
a sequence of minimizations of the form

minimize Lek (x, µk)


subject to x E X,

followed by the multiplier iterations

j = 1, ... ,r,
Sec. 5.2 Dual Proximal Algorithms 265

dc {(max{O,µ+ cg}) 2 - µ2 }

"' Slope=µ -
Constraint Level g
_i:
2c

Figure 5.2.4. Form of the quadratic penalty term for a single inequality con-
straint g(x) :S 0.

which can be equivalently written as

µ{+I = max {o, µ{ + ck(aJxk - bj) }, j = 1, .. . ,r.


Note that the penalty term

corresponding to the jth inequality constraint in Eq. (5.54) is convex and


continuously differentiable in x (see Fig. 5.2.4). However, its Hessian matrix
is discontinuous for all x such that ajx - bj = -µJ /c; this may cause some
difficulties in the minimization of Lc(x, µ), particularly when Newton-like
methods are used, and motivates alternative twice differentiable augmented
Lagrangian methods for inequality constraints (see Section 6.6).
To summarize, the augmented Lagrangian method for the inequality
constrained problem (5.49) consists of a sequence of minimizations

minimize Lck(x,µk)
subject to x E X,

where Lck(x, µk) is given by Eq. (5.53) or Eq. (5.54), {µk} is a sequence
updated as above, and {Ck} is a positive penalty parameter sequence with
L%':o Ck = oo. Since this method is equivalent to the equality-constrained
method applied to the corresponding equality-constrained problem (5.50),
our earlier convergence results (cf. Prop. 5.2.1) apply with the obvious
modifications.
We finally note that there is a similar augmented Lagrangian method
for problems with nonlinear inequality constraints
266 Proximal Algorithms Chap. 5

in place of the linear ones in problem (5.49). This method has identical
form to the one developed above, with the functions 9j defined by

cf. Eq. (5.52) (the derivation is very similar; see [Ber82a]). In particular, the
method consists of successive minimizations of the augmented Lagrangian

Lek(x,µk) = f(x) + t {µfot(x,µ{,ck) + c; (gt(x,µ{,ck)f},


to obtain Xk, followed by multiplier updates of the form

j = 1, ... ,r;

see the end-of-chapter references. Note that Lek(·, µk) is continuously dif-
ferentiable in x if f and 9j are, and is also convex in x if f and gj are.

Variants of the Augmented Lagrangian Algorithm

The augmented Lagrangian algorithm is an excellent general purpose con-


strained minimization method, and applies to considerably more general
problems than the ones treated here. For example, it can be used for dif-
ferentiable problems with nonconvex cost functions and constraints. It may
also be used in smoothing approaches (cf. Section 2.2.5), for both convex
and nonconvex optimization.
These properties and their connections to duality are due to the con-
vexification effect of the quadratic penalty, even in the context of a non-
convex problem. Further discussion is beyond our scope, and we refer to
nonlinear programming textbooks and the monograph [Ber82a], which fo-
cuses on augmented Lagrangian and other Lagrange multiplier methods.
The algorithm also embodies a rich structure, which lends itself to
many variations. In particular, let us consider the "penalized" dual func-
tion $Q_c$, given by
$$Q_c(\lambda) = \max_{y\in\Re^m}\Big\{ q(y) - \frac{1}{2c}\|y - \lambda\|^2\Big\}. \qquad (5.55)$$

Then, according to Prop. 5.1.7, $Q_c$ is differentiable, and we have
$$\nabla Q_c(\lambda) = \frac{y_c(\lambda) - \lambda}{c}, \qquad (5.56)$$

where $y_c(\lambda)$ is the unique vector attaining the maximum in Eq. (5.55). Since $y_{c_k}(\lambda_k) = \lambda_{k+1}$, we have, using Eqs. (5.48) and (5.56),
$$\nabla Q_{c_k}(\lambda_k) = \frac{\lambda_{k+1} - \lambda_k}{c_k} = Ax_{k+1} - b,$$

so the multiplier iteration (5.48) can be written as a gradient iteration:

$$\lambda_{k+1} = \lambda_k + c_k\,\nabla Q_{c_k}(\lambda_k),$$


[cf. Eq. (5.17)].
This interpretation motivates variations based on faster Newton or quasi-Newton methods for maximizing $Q_c$, whose maxima coincide with those of $q$, for any $c > 0$. (The algorithm described so far is also known as the first order method of multipliers, to distinguish it from Newton-like methods, which are also known as second order methods of multipliers.)
There are many algorithms along this line, some of which involve inexact
minimization of the augmented Lagrangian to enhance computational ef-
ficiency. We refer to [Ber82a] and other literature cited at the end of the
chapter for analysis of such methods.
We finally note that because the proximal algorithm can be general-
ized to the case where a nonquadratic regularization function is used, the
dual proximal algorithm and hence also the augmented Lagrangian method
can be accordingly generalized (see Section 6.6 and [Ber82a], Chapter 5).
One difficulty with the method is that even if the cost function is separable, the augmented Lagrangian $L_c(\cdot,\lambda)$ is typically nonseparable, because it involves the quadratic term $\|Ax - b\|^2$. With the use of a block coordinate descent method, however, it may be possible to deal to some extent with the loss of separability, as shown in the following example from [BeT89a].

Example 5.2.1: (Additive Cost Problems)

Consider the problem
$$\text{minimize}\ \ \sum_{i=1}^m f_i(x) \qquad \text{subject to}\ \ x \in \bigcap_{i=1}^m X_i,$$
where $f_i : \Re^n \mapsto \Re$ are convex functions and $X_i$ are closed, convex sets with nonempty intersection. This problem contains as special cases several of the examples given in Section 1.3, such as regularized regression, classification, and maximum likelihood.
We introduce additional artificial variables $z^i$, $i = 1,\ldots,m$, and we consider the equivalent problem
$$\begin{aligned} &\text{minimize}\ \ \sum_{i=1}^m f_i(z^i)\\ &\text{subject to}\ \ x = z^i,\ \ z^i \in X_i,\ \ i = 1,\ldots,m, \end{aligned} \qquad (5.57)$$
and we apply the augmented Lagrangian method to eliminate the constraints $z^i = x$, using corresponding multiplier vectors $\lambda^i$. The method takes the form
$$\lambda_{k+1}^i = \lambda_k^i + c_k\big(x_{k+1} - z_{k+1}^i\big), \qquad i = 1,\ldots,m, \qquad (5.58)$$



where $x_{k+1}$ and $z_{k+1}^i$, $i = 1,\ldots,m$, solve the problem
$$\begin{aligned} &\text{minimize}\ \ \sum_{i=1}^m \Big(f_i(z^i) + \lambda_k^{i\prime}(x - z^i) + \frac{c_k}{2}\|x - z^i\|^2\Big)\\ &\text{subject to}\ \ x \in \Re^n,\ \ z^i \in X_i,\ \ i = 1,\ldots,m. \end{aligned}$$

Note that there is coupling between $x$ and the vectors $z^i$, so this problem cannot be decomposed into separate minimizations with respect to some of the variables. On the other hand, the problem has a Cartesian product constraint set, and a structure that is suitable for the application of block coordinate descent methods that cyclically minimize the cost function, one component at a time. In particular, we can consider a method that minimizes the augmented Lagrangian with respect to $x \in \Re^n$ with the iteration
$$x = \frac{1}{m}\sum_{i=1}^m \Big(z^i - \frac{\lambda_k^i}{c_k}\Big), \qquad (5.59)$$
then minimizes the augmented Lagrangian with respect to $z^i \in X_i$ with the iteration
$$z^i \in \arg\min_{z^i\in X_i}\Big\{ f_i(z^i) - \lambda_k^{i\prime} z^i + \frac{c_k}{2}\|x - z^i\|^2\Big\}, \qquad i = 1,\ldots,m, \qquad (5.60)$$

and repeats until convergence to a minimum of the augmented Lagrangian,


which is then followed by a multiplier update of the form (5.58).

In the preceding example the minimization of the augmented Lagrangian exploits the problem structure, yet requires an infinite number of cyclic minimization iterations of the form (5.59)-(5.60) before the multiplier update (5.58) can be performed. Actually, exact convergence to a minimum of the augmented Lagrangian is not necessary; only a limited number of minimization cycles in $x$ and $z^i$ may be performed prior to a multiplier update. In particular, there are modified versions of augmented Lagrangian methods, with sound convergence properties, which allow for inexact minimization of the augmented Lagrangian, subject to certain termination criteria (see [Ber82a], and subsequent sources such as [EcB92], [Eck03], [EcS13]). In Section 5.4, we will discuss the alternating direction method of multipliers, a somewhat different type of method, which is based on augmented Lagrangian ideas and performs only one minimization cycle in $x$ and $z^i$ before updating the multiplier vectors.
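The following Python sketch illustrates the structure of Example 5.2.1: several $x$/$z$ block coordinate cycles of the form (5.59)-(5.60) are performed before each multiplier update (5.58). The quadratic choice $f_i(z) = \tfrac{1}{2}\|z - d_i\|^2$ with $X_i = \Re^n$ is an assumption made so that both block minimizations have closed form; it is not taken from the text.

```python
# Augmented Lagrangian method with block coordinate descent on the
# augmented Lagrangian (Example 5.2.1), for f_i(z) = 0.5*||z - d_i||^2.
import numpy as np

np.random.seed(0)
m, n = 5, 3
d = [np.random.randn(n) for _ in range(m)]       # data defining f_i
lam = [np.zeros(n) for _ in range(m)]            # multipliers lambda^i
z = [np.zeros(n) for _ in range(m)]
x = np.zeros(n)
c = 1.0

for k in range(30):                              # multiplier updates
    for _ in range(20):                          # inner x/z cycles, cf. (5.59)-(5.60)
        x = sum(z[i] - lam[i] / c for i in range(m)) / m
        for i in range(m):                       # closed-form z^i-minimization
            z[i] = (d[i] + lam[i] + c * x) / (1.0 + c)
    for i in range(m):                           # multiplier update (5.58)
        lam[i] = lam[i] + c * (x - z[i])
# x approaches the minimizer (1/m) * sum_i d_i of sum_i f_i
```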

5.3 PROXIMAL ALGORITHMS WITH LINEARIZATION

In this section we will consider the minimization of a real-valued convex function $f : \Re^n \mapsto \Re$ over a closed convex set $X$, and we will combine the proximal algorithm and the polyhedral approximation approaches of Chapter 4.

Figure 5.3.1. Illustration of the proximal algorithm with outer linearization. The point $x_{k+1}$ is the one at which the graph of the negative proximal term, raised by some amount, first touches the graph of $F_k$. A new cutting plane is added, based on a subgradient $g_{k+1}$ of $f$ at $x_{k+1}$. Note that the proximal term reduces the effect of instability: $x_{k+1}$ tends to be closer to $x_k$, with the distance $\|x_{k+1} - x_k\|$ depending on the size of the proximal term, i.e., the penalty parameter $c_k$.

As discussed in Section 4.1, one of the drawbacks of the cutting


plane method is the instability phenomenon, whereby the method can take
large steps away from the current point, with significant deterioration of
the cost function value. A way to limit the effects of this is to add to the
polyhedral function approximation a regularization term pk(x) that penal-
izes large deviations from some reference point Yk, similar to the proximal
algorithm of Section 5.1. Thus in this method, $x_{k+1}$ is obtained as
$$x_{k+1} \in \arg\min_{x\in X}\big\{ F_k(x) + p_k(x)\big\}, \qquad (5.61)$$
where, similar to the cutting plane method,
$$F_k(x) = \max\big\{f(x_0) + (x - x_0)'g_0,\ \ldots,\ f(x_k) + (x - x_k)'g_k\big\},$$
and, similar to the proximal algorithm,
$$p_k(x) = \frac{1}{2c_k}\|x - y_k\|^2,$$
where $c_k$ is a positive scalar parameter; see Fig. 5.3.1 for the case where $y_k = x_k$. We refer to $p_k(x)$ as the proximal term, and to its center $y_k$ as the proximal center (which may be different from $x_k$, for reasons to be explained later).
The idea of the method is to provide a measure of stability to the
cutting plane method at the expense of solving a more difficult subproblem
at each iteration (e.g., a quadratic versus a linear program, in the case
where X is polyhedral). We can view iteration (5.61) as a combination of
the cutting plane method and the proximal method. We first discuss the
case where the proximal center is the current iterate (Yk = xk) in the next
section. Then in Section 5.3.2, we discuss alternative choices of Yk, which
are aimed at further improvements in stability.

5.3.1 Proximal Cutting Plane Methods

We consider minimization of a real-valued convex function $f : \Re^n \mapsto \Re$ over a closed convex set $X$, by using a proximal-like algorithm where $f$ is replaced by a cutting plane approximation $F_k$, thereby simplifying the corresponding proximal minimization. At the typical iteration, we perform a proximal iteration, aimed at minimizing the current polyhedral approximation to $f$ given by
$$F_k(x) = \max\big\{f(x_0) + (x - x_0)'g_0,\ \ldots,\ f(x_k) + (x - x_k)'g_k\big\}. \qquad (5.62)$$
A subgradient $g_{k+1}$ of $f$ at $x_{k+1}$ is then computed, $F_k$ is accordingly updated, and the process is repeated. We call this the proximal cutting plane method. This is the method (5.61) where the proximal center is reset at every iteration to the current point ($y_k = x_k$ for all $k$); see Fig. 5.3.1.

Proximal Cutting Plane Method:

Find
$$x_{k+1} \in \arg\min_{x\in X}\Big\{ F_k(x) + \frac{1}{2c_k}\|x - x_k\|^2\Big\}, \qquad (5.63)$$
where $F_k$ is the outer approximation function of Eq. (5.62) and $c_k$ is a positive scalar parameter, and then refine the approximation by introducing a new cutting plane based on $f(x_{k+1})$ and a subgradient $g_{k+1} \in \partial f(x_{k+1})$.
The method terminates if $x_{k+1} = x_k$; in this case, Eqs. (5.62) and (5.63) imply that
$$f(x_k) \le F_k(x) + \frac{1}{2c_k}\|x - x_k\|^2 \le f(x) + \frac{1}{2c_k}\|x - x_k\|^2, \qquad \forall\, x \in X,$$
so $x_k$ is a point where the proximal algorithm terminates, and it must therefore be optimal by Prop. 5.1.3. Note, however, that unless $f$ and $X$ are polyhedral, finite termination is unlikely.
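The following Python sketch carries out the iteration (5.63) for an illustrative unconstrained polyhedral example ($X = \Re^n$). The particular data $a_i$, $b_i$, the starting point, and the use of scipy's SLSQP solver on the epigraph reformulation of the subproblem are assumptions made for the example only.

```python
# Proximal cutting plane iteration (5.63) for f(x) = max_i {a_i'x + b_i}.
# The subproblem min_x F_k(x) + (1/2c)||x - x_k||^2 is solved in epigraph
# form: minimize t + (1/2c)||x - x_k||^2 s.t. f(x_i) + g_i'(x - x_i) <= t.
import numpy as np
from scipy.optimize import minimize

a = np.array([[1.0, 0.0], [-1.0, 1.0], [0.0, -1.0]])
b = np.array([0.0, 1.0, 0.5])

def f(x):
    return np.max(a @ x + b)

def subgrad(x):
    return a[np.argmax(a @ x + b)]               # a subgradient of f at x

ck = 1.0
xk = np.array([2.0, 2.0])
points, grads = [xk.copy()], [subgrad(xk)]

for k in range(15):
    def obj(v):                                  # v = (x, t)
        x, t = v[:2], v[2]
        return t + np.dot(x - xk, x - xk) / (2 * ck)
    cons = [{'type': 'ineq',
             'fun': (lambda v, xi=xi, gi=gi: v[2] - f(xi) - gi @ (v[:2] - xi))}
            for xi, gi in zip(points, grads)]
    v0 = np.append(xk, f(xk))
    xk = minimize(obj, v0, constraints=cons, method='SLSQP').x[:2]
    points.append(xk.copy()); grads.append(subgrad(xk))   # new cutting plane
```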
The convergence properties of the method are easy to obtain, based
on what we already know. The idea is that Fk asymptotically "converges"
to f, at least near the generated iterates, so asymptotically, the algorithm
essentially becomes the proximal algorithm, and inherits the corresponding
convergence properties. In particular, we can show that if the optimal
solution set X* is nonempty, the sequence { xk} generated by the proximal
cutting plane method (5.63) converges to some point in X*. The proof is

based on a combination of the convergence arguments of the cutting plane


method (cf. Prop. 4.1.1) and the proximal algorithm (cf. Prop. 5.1.3), and
will not be given.
In the case where f and X are polyhedral, convergence to an optimal
solution occurs in a finite number of iterations, as shown in the following
proposition. This is a consequence of the finite convergence property of
both the cutting plane and the proximal methods.

Proposition 5.3.1: (Finite Termination of the Proximal Cutting Plane Method) Consider the proximal cutting plane method for the case where $f$ and $X$ are polyhedral, with
$$f(x) = \max_{i\in I}\big\{a_i'x + b_i\big\},$$
where $I$ is a finite index set, and $a_i$ and $b_i$ are given vectors and scalars, respectively. Assume that the optimal solution set is nonempty and that the subgradient added to the cutting plane approximation at each iteration is one of the vectors $a_i$, $i \in I$. Then the method terminates finitely with an optimal solution.

Proof: Since there are only finitely many vectors $a_i$ to add, eventually the polyhedral approximation $F_k$ will not change, i.e., $F_k = F_{\bar k}$ for all $k \ge \bar k$. Thus, for $k \ge \bar k$, the method will become the proximal algorithm for minimizing $F_{\bar k}$, so by Prop. 5.1.5, it will terminate finitely with a point that minimizes $F_{\bar k}$ over $X$. But then, we will have concluded an iteration of the cutting plane method for minimizing $f$ over $X$, with no new vector added to the approximation $F_{\bar k}$. This implies termination of the cutting plane method, necessarily at a minimum of $f$ over $X$. Q.E.D.

The proximal cutting plane method aims at increased stability over


the ordinary cutting plane method, but it has some drawbacks:
(a) There is a potentially difficult tradeoff in the choice of the parameter $c_k$. In particular, stability is achieved only by choosing $c_k$ small, since for large values of $c_k$ the changes $x_{k+1} - x_k$ may be substantial. Indeed, for large enough $c_k$, the method finds the exact minimum of $F_k$ over $X$ in a single minimization (cf. Prop. 5.1.5), so it is identical to the ordinary cutting plane method and fails to provide any stabilization! On the other hand, small values of $c_k$ lead to a slow rate of convergence, even when $f$ is polyhedral, or even linear (cf. Fig. 5.1.2).
(b) The number of subgradients used in the approximation $F_k$ may grow to be very large, in which case the quadratic program solved in Eq. (5.63) may become very time-consuming.

These drawbacks motivate algorithmic variants, called bundle methods,


which we will discuss next. The main difference is that in order to ensure
a certain measure of stability, the proximal center is updated selectively,
only after making enough progress in minimizing f.

5.3.2 Bundle Methods

In the basic form of a bundle method, the iterate $x_{k+1}$ is obtained by minimizing over $X$ the sum of $F_k$, a cutting plane approximation to $f$, and a quadratic proximal term $p_k(x)$:
$$x_{k+1} \in \arg\min_{x\in X}\big\{F_k(x) + p_k(x)\big\}. \qquad (5.64)$$
The proximal center of $p_k$ need not be $x_k$ (as in the proximal cutting plane method), but is rather one of the past iterates $x_i$, $i \le k$.
In one version of the method, $F_k$ is given by
$$F_k(x) = \max\big\{f(x_0) + (x - x_0)'g_0,\ \ldots,\ f(x_k) + (x - x_k)'g_k\big\}, \qquad (5.65)$$
while $p_k(x)$ is of the form
$$p_k(x) = \frac{1}{2c_k}\|x - y_k\|^2,$$
where $y_k \in \{x_i \mid i \le k\}$. Following the computation of $x_{k+1}$, the new proximal center $y_{k+1}$ is set to $x_{k+1}$, or is left unchanged ($y_{k+1} = y_k$), depending on whether, according to a certain test, "sufficient progress" has been made or not. An example of such a test is
$$f(y_k) - f(x_{k+1}) \ge \beta\delta_k,$$
where $\beta$ is a fixed scalar with $\beta \in (0,1)$, and
$$\delta_k = f(y_k) - \big(F_k(x_{k+1}) + p_k(x_{k+1})\big).$$
Thus,
$$y_{k+1} = \begin{cases} x_{k+1} & \text{if } f(y_k) - f(x_{k+1}) \ge \beta\delta_k,\\ y_k & \text{if } f(y_k) - f(x_{k+1}) < \beta\delta_k, \end{cases} \qquad (5.66)$$

and initially $y_0 = x_0$. In the parlance of bundle methods, iterations where $y_{k+1}$ is updated to $x_{k+1}$ are called serious steps, while iterations where $y_{k+1} = y_k$ are called null steps.
The scalar $\delta_k$ is illustrated in Fig. 5.3.2. Since $f(y_k) = F_k(y_k)$ [cf. Eq. (5.65)], $\delta_k$ represents the reduction in the proximal objective $F_k + p_k$ in moving from $y_k$ to $x_{k+1}$. If the reduction in the true objective, $f(y_k) - f(x_{k+1})$,

Figure 5.3.2. Illustration of the test (5.66) for a serious step or a null step in the bundle method. It is based on $\delta_k = f(y_k) - \big(F_k(x_{k+1}) + p_k(x_{k+1})\big)$, the reduction in proximal cost, which is always positive, except at termination. A serious step is performed if and only if the reduction in true cost, $f(y_k) - f(x_{k+1})$, exceeds a fraction $\beta$ of the reduction $\delta_k$ in proximal cost.

does not exceed a fraction $\beta$ of $\delta_k$ (or is even negative, as in the right-hand side of Fig. 5.3.2), this indicates a large discrepancy between proximal and true objective, and an associated instability. As a result, the algorithm foregoes the move from $y_k$ to $x_{k+1}$ with a null step [cf. Eq. (5.66)], but improves the cutting plane approximation by adding the new plane corresponding to $x_{k+1}$. Otherwise, it performs a serious step, with the guarantee of true cost improvement afforded by the test (5.66).
An important point is that if $x_{k+1} \ne y_k$, then $\delta_k > 0$. Indeed, since
$$F_k(x_{k+1}) + p_k(x_{k+1}) \le F_k(y_k) + p_k(y_k) = F_k(y_k),$$
and $F_k(y_k) = f(y_k)$, we have
$$0 \le f(y_k) - \big(F_k(x_{k+1}) + p_k(x_{k+1})\big) = \delta_k,$$
with equality only if $x_{k+1} = y_k$.
The method terminates if $x_{k+1} = y_k$; in this case, Eqs. (5.64) and (5.65) imply that
$$f(y_k) + p_k(y_k) = F_k(y_k) + p_k(y_k) \le F_k(x) + p_k(x) \le f(x) + p_k(x), \qquad \forall\, x \in X,$$
so $y_k$ is a point where the proximal algorithm terminates, and must therefore be optimal. Of course, finite termination is unlikely, unless $f$ and $X$ are polyhedral.
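A Python sketch of the serious/null step logic is given below, for the same illustrative polyhedral $f$ and SLSQP-based subproblem as in the proximal cutting plane sketch above; none of the data or solver choices come from the text.

```python
# Bundle iteration (5.64)-(5.66): the proximal center y is moved to the new
# iterate only when the true cost reduction exceeds beta times delta_k.
import numpy as np
from scipy.optimize import minimize

a = np.array([[1.0, 0.0], [-1.0, 1.0], [0.0, -1.0]])
b = np.array([0.0, 1.0, 0.5])
f = lambda x: np.max(a @ x + b)
subgrad = lambda x: a[np.argmax(a @ x + b)]

c, beta = 1.0, 0.5
y = np.array([2.0, 2.0])                         # proximal center y_k
points, grads = [y.copy()], [subgrad(y)]

for k in range(20):
    def obj(v):                                  # F_k(x) + (1/2c)||x - y||^2, epigraph form
        return v[2] + np.dot(v[:2] - y, v[:2] - y) / (2 * c)
    cons = [{'type': 'ineq',
             'fun': (lambda v, xi=xi, gi=gi: v[2] - f(xi) - gi @ (v[:2] - xi))}
            for xi, gi in zip(points, grads)]
    sol = minimize(obj, np.append(y, f(y)), constraints=cons, method='SLSQP').x
    x_new, Fk_val = sol[:2], sol[2]
    delta = f(y) - (Fk_val + np.dot(x_new - y, x_new - y) / (2 * c))
    if f(y) - f(x_new) >= beta * delta:          # serious step: move the center
        y = x_new
    # otherwise a null step: keep y, but still add the new cutting plane
    points.append(x_new.copy()); grads.append(subgrad(x_new))
```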

The convergence analysis of the preceding bundle method is similar to


the corresponding analysis for the cutting plane and the proximal method.
The idea is that the method makes "substantial" progress with every serious
step. Furthermore, null steps cannot be performed indefinitely, for in this
case, the polyhedral approximation to f will become increasingly accurate
and the reduction in true cost will converge to the reduction in proximal
cost. Then, since /3 < 1, the test for a serious step will be passed eventually,
and a convergence argument can be constructed using this fact. In the case
where f and X are polyhedral, the method converges finitely, similar to
the case of the proximal and proximal cutting plane algorithms (cf. Props.
5.1.5 and 5.3.1).

Proposition 5.3.2: (Finite Termination of the Bundle Method) Consider the bundle method for the case where $X$ and $f$ are polyhedral, with
$$f(x) = \max_{i\in I}\big\{a_i'x + b_i\big\},$$
where $I$ is a finite index set, and $a_i$ and $b_i$ are given vectors and scalars, respectively. Assume that the optimal solution set is nonempty and that the subgradient added to the cutting plane approximation at each iteration is one of the vectors $a_i$, $i \in I$. Then the method terminates finitely with an optimal solution.

Proof: Since there are only finitely many vectors $a_i$ to add, eventually the polyhedral approximation $F_k$ will not change, i.e., $F_k = F_{\bar k}$ for all $k \ge \bar k$. We then have $F_{\bar k}(x_{k+1}) = f(x_{k+1})$ for all $k \ge \bar k$, since otherwise a new cutting plane would be added to $F_{\bar k}$. Thus, for $k \ge \bar k$,
$$\begin{aligned} f(y_k) - f(x_{k+1}) &= f(y_k) - F_{\bar k}(x_{k+1})\\ &= f(y_k) - \big(F_{\bar k}(x_{k+1}) + p_k(x_{k+1})\big) + p_k(x_{k+1})\\ &= \delta_k + p_k(x_{k+1})\\ &\ge \beta\delta_k. \end{aligned}$$
Therefore, according to Eq. (5.66), the method will perform serious steps for all $k \ge \bar k$, and become identical to the proximal cutting plane algorithm, which converges finitely by Prop. 5.3.1. Q.E.D.

Discarding Old Subgradients

We mentioned earlier that one of the drawbacks of the cutting plane al-
gorithms is that the number of subgradients used in the approximation
Fk may grow to be very large. The monitoring of progress through the

test (5.66) for serious/null steps can also be used to discard some of the
accumulated cutting planes. For example, at the end of a serious step,
upon updating the proximal center Yk to Yk+l = Xk+l, we may discard any
subset of the cutting planes.
It may of course be useful to retain some of the cutting planes, particularly the ones that are "active" or "nearly active" at $y_{k+1}$, i.e., those $i \le k$ for which the linearization error
$$f(y_{k+1}) - \big(f(x_i) + (y_{k+1} - x_i)'g_i\big)$$
is 0 or close to 0, respectively. The essential validity of the method is maintained, by virtue of the fact that $\{f(y_k)\}$ is a monotonically decreasing sequence, with "sufficiently large" cost reductions between proximal center updates.
An extreme possibility is to discard all past subgradients following a serious step from $y_k$ to $x_{k+1}$. Then, after a subgradient $g_{k+1}$ at $x_{k+1}$ is calculated, the next iteration becomes
$$x_{k+2} \in \arg\min_{x\in X}\Big\{ f(x_{k+1}) + g_{k+1}'(x - x_{k+1}) + \frac{1}{2c_{k+1}}\|x - x_{k+1}\|^2\Big\}.$$
It can be seen that we have
$$x_{k+2} = P_X\big(x_{k+1} - c_{k+1}g_{k+1}\big),$$
where $P_X(\cdot)$ denotes projection on $X$, so after discarding all past subgradients following a serious step, the next iteration is an ordinary subgradient iteration with stepsize equal to $c_{k+1}$.
Another possibility is, following a serious step, to replace all the cutting planes with a single cutting plane: the one obtained from the hyperplane that passes through $\big(x_{k+1}, F_k(x_{k+1})\big)$ and separates the epigraphs of the functions $F_k(x)$ and $\gamma_k - \frac{1}{2c_k}\|x - y_k\|^2$, where $\gamma_k = F_k(x_{k+1}) + \frac{1}{2c_k}\|x_{k+1} - y_k\|^2$ (see Fig. 5.3.3). This is the cutting plane
$$F_k(x_{k+1}) + \tilde g_k'(x - x_{k+1}), \qquad (5.67)$$
where $\tilde g_k$ is given by
$$\tilde g_k = \frac{y_k - x_{k+1}}{c_k}. \qquad (5.68)$$

The next iteration will then be performed with only two cutting planes: the one just given by Eqs. (5.67)-(5.68) and a new one obtained from $x_{k+1}$,
$$f(x_{k+1}) + g_{k+1}'(x - x_{k+1}),$$
where $g_{k+1} \in \partial f(x_{k+1})$.

Figure 5.3.3. Illustration of the cutting plane (5.67), where
$$\tilde g_k = \frac{y_k - x_{k+1}}{c_k}.$$
The "slope" $\tilde g_k$ can be shown to be a convex combination of the subgradients that are "active" at $x_{k+1}$.


The vector $\tilde g_k$ is sometimes called an "aggregate subgradient," because it can be shown to be a convex combination of the past subgradients $g_0,\ldots,g_k$. This is evident from Fig. 5.3.3, and can also be verified by using quadratic programming duality arguments.
There are also many other variants of bundle methods, which aim at increased efficiency and the exploitation of special structure. We refer to the literature for discussion and analysis.

5.3.3 Proximal Inner Linearization Methods

In the preceding section we saw that the proximal algorithm can be com-
bined with outer linearization to yield the proximal cutting plane algorithm
and its bundle versions. In this section we use a dual combination, involving
the dual proximal algorithm (5.41)-(5.42) and inner linearization (the dual
of outer linearization). This yields another method, which is connected to
the proximal cutting plane algorithm of Section 5.3.1 by Fenchel duality
(see Fig. 5.3.4).
Figure 5.3.4. Relations of the proximal and proximal cutting plane methods, and their duals: outer linearization relates the proximal algorithm to the proximal cutting plane method and its bundle versions, while inner linearization relates the dual proximal algorithm (augmented Lagrangian) to the proximal simplicial decomposition method and its bundle versions. The dual algorithms are obtained by application of the Fenchel Duality Theorem (Prop. 1.2.1), taking also into account the conjugacy relation between outer and inner linearization (cf. Section 4.3).

Let us recall the proximal cutting plane method applied to minimizing a real-valued convex function $f : \Re^n \mapsto \Re$ over a closed convex set $X$. The typical iteration involves a proximal minimization of the current cutting

plane approximation to $f$ given by
$$F_k(x) = \max\big\{f(x_0) + (x - x_0)'g_0,\ \ldots,\ f(x_k) + (x - x_k)'g_k\big\} + \delta_X(x), \qquad (5.69)$$
where $g_i \in \partial f(x_i)$ for all $i$ and $\delta_X$ is the indicator function of $X$. Thus,
$$x_{k+1} \in \arg\min_{x\in\Re^n}\Big\{F_k(x) + \frac{1}{2c_k}\|x - x_k\|^2\Big\},$$
where $c_k$ is a positive scalar parameter. A subgradient $g_{k+1}$ of $f$ at $x_{k+1}$ is then computed, $F_{k+1}$ is accordingly updated, and the process is repeated.
There is a dual version of this algorithm, similar to the one of Section 5.2. In particular, we may use Fenchel duality to implement the preceding proximal minimization in terms of conjugate functions [cf. Eq. (5.41)]. Thus, the Fenchel dual of this minimization can be written as [cf. Eq. (5.37)]
$$\text{minimize}\ \ F_k^*(\lambda) - x_k'\lambda + \frac{c_k}{2}\|\lambda\|^2 \qquad \text{subject to}\ \ \lambda \in \Re^n, \qquad (5.70)$$
where $F_k^*$ is the conjugate of $F_k$. Once $\lambda_{k+1}$, the unique minimizer in this dual proximal iteration, is computed, $x_k$ is updated via
$$x_{k+1} = x_k - c_k\lambda_{k+1}$$
[cf. Eq. (5.42)]. Then, a subgradient $g_{k+1}$ of $f$ at $x_{k+1}$ is obtained by "differentiation," and is used to update $F_k$.

Proximal Inner Linearization Method:

Find
$$\lambda_{k+1} \in \arg\min_{\lambda\in\Re^n}\Big\{F_k^*(\lambda) - x_k'\lambda + \frac{c_k}{2}\|\lambda\|^2\Big\}, \qquad (5.71)$$
and
$$x_{k+1} = x_k - c_k\lambda_{k+1}, \qquad (5.72)$$
where $F_k^*$ is the conjugate of the outer approximation function $F_k$ of Eq. (5.69), and $c_k$ is a positive scalar parameter. Then refine the approximation by introducing a new cutting plane based on $f(x_{k+1})$ and a subgradient $g_{k+1} \in \partial f(x_{k+1})$, to form $F_{k+1}$ and $F_{k+1}^*$.

Note that the new subgradient $g_{k+1}$ may also be obtained as a vector attaining the supremum in the conjugacy relation
$$f(x_{k+1}) = \sup_{\lambda\in\Re^n}\big\{x_{k+1}'\lambda - f^*(\lambda)\big\},$$
where $f^*$ is the conjugate function of $f$, since we have
$$g_{k+1} \in \partial f(x_{k+1}) \quad \text{if and only if} \quad g_{k+1} \in \arg\max_{\lambda\in\Re^n}\big\{x_{k+1}'\lambda - f^*(\lambda)\big\} \qquad (5.73)$$
(cf. the Conjugate Subgradient Theorem of Prop. 5.4.3 in Appendix B). The maximization above may be preferable if the "differentiation" $g_{k+1} \in \partial f(x_{k+1})$ is inconvenient.

Implementation by Simplicial Decomposition

We will now discuss the details of the preceding computations, assuming for simplicity that there are no constraints, i.e., $X = \Re^n$. According to Section 4.3, $F_k^*$ (the conjugate of the outer linear approximation $F_k$ of $f$) is the piecewise linear (inner) approximation of $f^*$ with domain
$$\mathrm{dom}(F_k^*) = \mathrm{conv}\big(\{g_0,\ldots,g_k\}\big),$$
and "break points" at $g_i$, $i = 0,\ldots,k$. In particular, using the formula of Section 4.3 for the conjugate $F_k^*$, the dual proximal optimization of Eq. (5.71) takes the form
$$\begin{aligned} &\text{minimize}\ \ \sum_{i=0}^k \alpha_i f^*(g_i) - x_k'\sum_{i=0}^k\alpha_i g_i + \frac{c_k}{2}\Big\|\sum_{i=0}^k\alpha_i g_i\Big\|^2\\ &\text{subject to}\ \ \sum_{i=0}^k\alpha_i = 1,\quad \alpha_i \ge 0,\ \ i = 0,\ldots,k. \end{aligned} \qquad (5.74)$$

Figure 5.3.5. Illustration of an iteration of the proximal simplicial decomposition algorithm. The proximal minimization determines the "slope" $x_{k+1}$ of $F_k^*$, which then determines the next subgradient/break point $g_{k+1}$ via the maximization
$$g_{k+1} \in \arg\max_{\lambda\in\Re^n}\big\{x_{k+1}'\lambda - f^*(\lambda)\big\},$$
i.e., $g_{k+1}$ is a point at which $x_{k+1}$ is a subgradient of $f^*$.

If $(\alpha_0^k,\ldots,\alpha_k^k)$ attains the minimum, Eqs. (5.71) and (5.72) yield
$$\lambda_{k+1} = \sum_{i=0}^k \alpha_i^k g_i, \qquad x_{k+1} = x_k - c_k\sum_{i=0}^k\alpha_i^k g_i. \qquad (5.75)$$
The next subgradient $g_{k+1} \in \partial f(x_{k+1})$ may also be obtained from the maximization
$$g_{k+1} \in \arg\max_{\lambda\in\Re^n}\big\{x_{k+1}'\lambda - f^*(\lambda)\big\} \qquad (5.76)$$
if this is convenient; cf. Eq. (5.73). As Fig. 5.3.5 indicates, $g_{k+1}$ provides a new break point and an improved inner approximation to $f^*$.
We refer to the algorithm defined by Eqs. (5.74)-(5.76) as the proximal simplicial decomposition algorithm. Note that all the computations of the algorithm involve the conjugate $f^*$ and not $f$. Thus, if $f^*$ is more convenient to work with than $f$, the proximal simplicial decomposition algorithm is preferable to the proximal cutting plane algorithm. Note also that the duality between the two linear approximation versions of the proximal algorithm is a special case of the generalized polyhedral approximation framework of Section 4.4.
The problem (5.74) can also be written without reference to the conjugate $f^*$. Since $g_i$ is a subgradient of $f$ at $x_i$, we have
$$f^*(g_i) = x_i'g_i - f(x_i), \qquad i = 0,\ldots,k,$$



by the Conjugate Subgradient Theorem (Prop. 5.4.3 in Appendix B), so this problem can equivalently be written as the quadratic program
$$\begin{aligned} &\text{minimize}\ \ \sum_{i=0}^k \alpha_i\big((x_i - x_k)'g_i - f(x_i)\big) + \frac{c_k}{2}\Big\|\sum_{i=0}^k\alpha_i g_i\Big\|^2\\ &\text{subject to}\ \ \sum_{i=0}^k\alpha_i = 1,\quad \alpha_i \ge 0,\ \ i = 0,\ldots,k. \end{aligned}$$
Then the vectors $\lambda_{k+1}$ and $x_{k+1}$ can be obtained from Eq. (5.75), while $g_{k+1}$ can be obtained from the differentiation $g_{k+1} \in \partial f(x_{k+1})$. This form of the algorithm may be viewed as a regularized version of the classical Dantzig-Wolfe decomposition method (see [Ber99], Section 6.4.1).
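A Python sketch of this form of the proximal simplicial decomposition iteration follows, using the quadratic program in the $\alpha$ variables (the version that avoids $f^*$) for the same illustrative polyhedral $f(x) = \max_i\{a_i'x + b_i\}$ used earlier. The data and the use of scipy's SLSQP to handle the simplex constraint are assumptions for the example only.

```python
# Proximal simplicial decomposition via the alpha-space quadratic program.
import numpy as np
from scipy.optimize import minimize

a = np.array([[1.0, 0.0], [-1.0, 1.0], [0.0, -1.0]])
b = np.array([0.0, 1.0, 0.5])
f = lambda x: np.max(a @ x + b)
subgrad = lambda x: a[np.argmax(a @ x + b)]

ck = 1.0
xk = np.array([2.0, 2.0])
pts, grads = [xk.copy()], [subgrad(xk)]

for k in range(15):
    G = np.array(grads)
    coef = np.array([(p - xk) @ g - f(p) for p, g in zip(pts, grads)])
    def obj(alpha):
        s = G.T @ alpha                          # sum_i alpha_i g_i
        return coef @ alpha + 0.5 * ck * (s @ s)
    n_pts = len(pts)
    cons = [{'type': 'eq', 'fun': lambda alpha: np.sum(alpha) - 1.0}]
    bounds = [(0.0, None)] * n_pts               # alpha_i >= 0
    alpha0 = np.ones(n_pts) / n_pts
    alpha = minimize(obj, alpha0, bounds=bounds, constraints=cons,
                     method='SLSQP').x
    xk = xk - ck * (G.T @ alpha)                 # cf. Eq. (5.75)
    pts.append(xk.copy()); grads.append(subgrad(xk))   # new break point
```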
We may also consider bundle versions of the proximal simplicial de-
composition algorithm, and in fact this may be essential to avoid the kind
of difficulties discussed at the end of Section 5.3.1. For this we need a test
to distinguish between serious steps, where we update Xk via Eq. (5.75),
and null steps, where we leave Xk unchanged, but simply add the pair
(9k+l ,f*(9k+l)) to the current inner approximation of f*.
Finally, in a different line of extension, it is possible to combine the
proximal algorithm and its bundle versions with the generalized polyhedral
approximation algorithms of Sections 4.4-4.6 for problems whose cost func-
tion has an additive form. Such combinations are clearly valid and useful
when both polyhedral approximation and regularization are beneficial, but
have not been analyzed or systematically tested so far.

5.4 ALTERNATING DIRECTION METHODS OF MULTIPLIERS

In this section we discuss an algorithm that is related to the augmented


Lagrangian method of Section 5.2.1, and is well suited for special structures
involving among others separability and large sums of component functions.
The algorithm uses alternate minimizations to decouple sets of variables
that are coupled within the augmented Lagrangian, and is known as the
alternating direction method of multipliers or ADMM. The name comes
from the similarity with some methods for solving differential equations,
known as alternating direction methods (see [FoG83] for an explanation).
The following example, which involves the additive cost problem of
Example 5.2.1, illustrates the decoupling process of ADMM.

Example 5.4.1: (Additive Cost Problems - Continued)

Consider the problem
$$\text{minimize}\ \ \sum_{i=1}^m f_i(x) \qquad \text{subject to}\ \ x \in \bigcap_{i=1}^m X_i, \qquad (5.77)$$

where $f_i : \Re^n \mapsto \Re$ are convex functions and $X_i$ are closed convex sets with nonempty intersection. As in Example 5.2.1, we can reformulate this as an equality constrained problem, by introducing additional artificial variables $z^i$, $i = 1,\ldots,m$, and the equality constraints $x = z^i$:
$$\begin{aligned} &\text{minimize}\ \ \sum_{i=1}^m f_i(z^i)\\ &\text{subject to}\ \ x = z^i,\ \ z^i \in X_i,\ \ i = 1,\ldots,m, \end{aligned}$$
[cf. Eq. (5.57)].
As motivation for the development of the ADMM for this problem, let us recall the augmented Lagrangian method that uses multipliers $\lambda^i$ for the constraints $x = z^i$ (cf. Example 5.2.1). At the typical iteration of this method, we find $x_{k+1}$ and $z_{k+1}^i$, $i = 1,\ldots,m$, that solve the problem
$$\begin{aligned} &\text{minimize}\ \ \sum_{i=1}^m \Big(f_i(z^i) + \lambda_k^{i\prime}(x - z^i) + \frac{c}{2}\|x - z^i\|^2\Big)\\ &\text{subject to}\ \ x \in \Re^n,\ \ z^i \in X_i,\ \ i = 1,\ldots,m, \end{aligned} \qquad (5.78)$$
[cf. Eqs. (5.58), (5.47)], and then update the multipliers according to
$$\lambda_{k+1}^i = \lambda_k^i + c\big(x_{k+1} - z_{k+1}^i\big), \qquad i = 1,\ldots,m.$$
The minimization in Eq. (5.78) can be done by alternating minimizations over $x$ and $z^i$ (a block coordinate descent method), and the multipliers $\lambda^i$ may be changed only after (typically) many updates of $x$ and $z^i$ (enough to minimize the augmented Lagrangian within adequate precision).
An interesting variation is to perform only a small number of minimizations with respect to $x$ and $z^i$ before changing the multipliers. In the extreme case, where only one minimization is performed, the method takes the form
$$x_{k+1} = \frac{1}{m}\sum_{i=1}^m\Big(z_k^i - \frac{\lambda_k^i}{c}\Big), \qquad (5.79)$$
$$z_{k+1}^i \in \arg\min_{z^i\in X_i}\Big\{f_i(z^i) - \lambda_k^{i\prime}z^i + \frac{c}{2}\|x_{k+1} - z^i\|^2\Big\}, \qquad i = 1,\ldots,m, \qquad (5.80)$$
[cf. Eqs. (5.59), (5.60)], followed by the multiplier update
$$\lambda_{k+1}^i = \lambda_k^i + c\big(x_{k+1} - z_{k+1}^i\big), \qquad i = 1,\ldots,m, \qquad (5.81)$$
[cf. Eq. (5.58)]. Thus the multiplier iteration is performed after just one block coordinate descent iteration on each of the (now decoupled) variables $x$ and $(z^1,\ldots,z^m)$. This is precisely the ADMM specialized to the problem of this example.

The preceding example also illustrates another advantage of ADMM.


Frequently the decoupling process results in computations that are well-
suited for parallel and distributed processing (see e.g., [BeT89a], [WeO13]).

This will be observed in many of the examples to be presented in this section.
We will now formulate the ADMM and discuss its convergence properties. The starting point is the minimization problem of the Fenchel duality context:
$$\text{minimize}\ \ f_1(x) + f_2(Ax) \qquad \text{subject to}\ \ x \in \Re^n, \qquad (5.82)$$
where $A$ is an $m\times n$ matrix, $f_1 : \Re^n \mapsto (-\infty,\infty]$ and $f_2 : \Re^m \mapsto (-\infty,\infty]$ are closed proper convex functions, and we assume that there exists a feasible solution. We convert this problem to the equivalent constrained minimization problem
$$\text{minimize}\ \ f_1(x) + f_2(z) \qquad \text{subject to}\ \ x \in \Re^n,\ z \in \Re^m,\ Ax = z, \qquad (5.83)$$
and we introduce its augmented Lagrangian function
$$L_c(x,z,\lambda) = f_1(x) + f_2(z) + \lambda'(Ax - z) + \frac{c}{2}\|Ax - z\|^2.$$

The ADMM, given the current iterates $(x_k, z_k, \lambda_k) \in \Re^n\times\Re^m\times\Re^m$, generates a new iterate $(x_{k+1}, z_{k+1}, \lambda_{k+1})$ by first minimizing the augmented Lagrangian with respect to $x$, then with respect to $z$, and finally performing a multiplier update:
$$x_{k+1} \in \arg\min_{x\in\Re^n} L_c(x, z_k, \lambda_k), \qquad (5.84)$$
$$z_{k+1} \in \arg\min_{z\in\Re^m} L_c(x_{k+1}, z, \lambda_k), \qquad (5.85)$$
$$\lambda_{k+1} = \lambda_k + c\,(Ax_{k+1} - z_{k+1}). \qquad (5.86)$$
The penalty parameter $c$ is kept constant in the ADMM. Contrary to the case of the augmented Lagrangian method (where $c_k$ is often taken to be increasing with $k$ in order to accelerate convergence), there seems to be no generally good way to adjust $c$ from one iteration to the next. Note that the iteration (5.79)-(5.81), given earlier for the additive cost problem (5.77), is a special case of the preceding iteration, with the identification $z = (z^1,\ldots,z^m)$.
We may also formulate an ADMM that applies to the closely related problem
$$\text{minimize}\ \ f_1(x) + f_2(z) \qquad \text{subject to}\ \ x \in X,\ z \in Z,\ Ax + Bz = d, \qquad (5.87)$$
where $f_1 : \Re^n \mapsto \Re$ and $f_2 : \Re^m \mapsto \Re$ are convex functions, $X$ and $Z$ are closed convex sets, and $A$, $B$, and $d$ are given matrices and vector, respectively, of appropriate dimensions. Then the corresponding augmented Lagrangian is
$$L_c(x,z,\lambda) = f_1(x) + f_2(z) + \lambda'(Ax + Bz - d) + \frac{c}{2}\|Ax + Bz - d\|^2, \qquad (5.88)$$

and the ADMM iteration takes a similar form [cf. Eqs. (5.84)-(5.86)]:
$$x_{k+1} \in \arg\min_{x\in X} L_c(x, z_k, \lambda_k), \qquad (5.89)$$
$$z_{k+1} \in \arg\min_{z\in Z} L_c(x_{k+1}, z, \lambda_k), \qquad (5.90)$$
$$\lambda_{k+1} = \lambda_k + c\,(Ax_{k+1} + Bz_{k+1} - d). \qquad (5.91)$$
For some problems, this form may be more convenient than the ADMM of Eqs. (5.84)-(5.86), although the two forms are essentially equivalent.
The important advantage that the ADMM may offer over the augmented Lagrangian method is that it involves a separate minimization with respect to $x$ and with respect to $z$. Thus the complications resulting from the coupling of $x$ and $z$ in the penalty term $\|Ax - z\|^2$ or the penalty term $\|Ax + Bz - d\|^2$ are eliminated. Here is another illustration of this advantage.

Example 5.4.2 (Finding a Point in a Set Intersection)

We are given $m$ closed convex sets $X_1,\ldots,X_m$ in $\Re^n$, and we want to find a point in their intersection. We write this problem in the form (5.87), with $x$ defined as $x = (x^1,\ldots,x^m)$,
$$f_1(x) = 0, \qquad f_2(z) = 0, \qquad X = X_1\times\cdots\times X_m, \qquad Z = \Re^n,$$
and with the constraint $Ax = z$ representing the system of equations
$$x^i = z, \qquad i = 1,\ldots,m.$$
The augmented Lagrangian is
$$L_c(x,z,\lambda) = \sum_{i=1}^m \lambda^{i\prime}(x^i - z) + \frac{c}{2}\sum_{i=1}^m\|x^i - z\|^2.$$

The parameter $c$ does not influence the algorithm, because it simply introduces a scaling of $\lambda^i$ by $1/c$, so we may assume with no loss of generality that $c = 1$. Then, by completing the square, we may write the augmented Lagrangian as
$$L_1(x,z,\lambda) = \frac{1}{2}\sum_{i=1}^m\|x^i - z + \lambda^i\|^2 - \frac{1}{2}\sum_{i=1}^m\|\lambda^i\|^2.$$
Using Eqs. (5.89)-(5.91), we see that the corresponding ADMM iterates for $x^i$ according to
$$x_{k+1}^i = P_{X_i}\big(z_k - \lambda_k^i\big), \qquad i = 1,\ldots,m,$$
then iterates for $z$ according to
$$z_{k+1} = \frac{1}{m}\sum_{i=1}^m\big(x_{k+1}^i + \lambda_k^i\big),$$
and finally iterates for the multipliers according to
$$\lambda_{k+1}^i = \lambda_k^i + x_{k+1}^i - z_{k+1}, \qquad i = 1,\ldots,m.$$

Aside from the decoupling of the iterations of the variables $x^i$ and $z$, notice that the projections on the sets $X_i$ can be carried out in parallel.
In the special case where $m = 2$, we can write the constraint more simply as $x^1 = x^2$, in which case the augmented Lagrangian takes the form
$$L_c(x^1,x^2,\lambda) = \lambda'(x^1 - x^2) + \frac{c}{2}\|x^1 - x^2\|^2.$$
Assuming as above that $c = 1$, the corresponding ADMM is
$$x_{k+1}^1 = P_{X_1}\big(x_k^2 - \lambda_k\big),$$
$$x_{k+1}^2 = P_{X_2}\big(x_{k+1}^1 + \lambda_k\big),$$
$$\lambda_{k+1} = \lambda_k + x_{k+1}^1 - x_{k+1}^2.$$
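A Python sketch of the $m = 2$ case is given below, with the two sets taken to be a Euclidean ball and a halfspace; these sets and the data defining them are illustrative assumptions, not from the text.

```python
# ADMM (c = 1) for finding a point in the intersection of two convex sets.
import numpy as np

def proj_ball(x, center, radius):                # projection on X1
    d = x - center
    nd = np.linalg.norm(d)
    return x if nd <= radius else center + radius * d / nd

def proj_halfspace(x, a, beta):                  # projection on X2 = {x | a'x <= beta}
    viol = a @ x - beta
    return x if viol <= 0 else x - viol * a / (a @ a)

center, radius = np.array([2.0, 0.0]), 1.5
a, beta = np.array([1.0, 1.0]), 1.0              # intersection is nonempty

x1, x2, lam = np.zeros(2), np.zeros(2), np.zeros(2)
for k in range(100):
    x1 = proj_ball(x2 - lam, center, radius)         # x1_{k+1} = P_{X1}(x2_k - lam_k)
    x2 = proj_halfspace(x1 + lam, a, beta)           # x2_{k+1} = P_{X2}(x1_{k+1} + lam_k)
    lam = lam + x1 - x2                              # multiplier update
# x1 and x2 converge to a common point of the intersection
```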

On the other hand there is a price for the flexibility that the ADMM
provides. A major drawback is a much slower practical convergence rate
relative to the augmented Lagrangian method of the preceding subsec-
tion. Both methods can be shown to have a linear convergence rate for
the multiplier updates under favorable circumstances (see e.g., [Ber82a]
for augmented Lagrangian, and [HoL13], [DaY14a], [DaY14b], [GiB14] for
ADMM). However, it seems difficult to compare them on the basis of theo-
retical results alone, because the geometric progression rate at which they
converge is different and also because the amount of work between multi-
plier updates must be properly taken into account. A corollary of this is
that just because the ADMM updates the multipliers more often than the
augmented Lagrangian method, it does not necessarily require less com-
putation time to solve a problem. A further consideration in comparing
the two types of methods is that while ADMM effectively decouples the
minimizations with respect to x and z, augmented Lagrangian methods
allow for some implementation flexibility that may be exploited by taking
advantage of the structure of the given problem:
(a) The minimization of the augmented Lagrangian can be done with a
broad variety of methods (not just block coordinate descent). Some
of these methods may be well suited for the problem's structure.
(b) The minimization of the augmented Lagrangian need not be done ex-
actly, and its accuracy can be readily controlled through theoretically
sound and easily implementable termination criteria.

(c) The adjustment of the penalty parameter $c$ can be used with advantage in the augmented Lagrangian method, but there is apparently no general way to do this in the ADMM. In particular, by increasing $c$ to $\infty$, superlinear or finite convergence can often be achieved in the augmented Lagrangian method [cf. Props. 5.1.4(b) and 5.1.5].
Thus, on balance, it appears that the relative performance merits of ADMM
and augmented Lagrangian methods are problem-dependent in practice.

Convergence Analysis

We now turn to the convergence analysis of ADMM of Eqs. (5.84)-(5.86).


The following proposition gives the main convergence result. The proof of
the proposition is long and not very insightful. It may be found in Section
3.4 (Prop. 4.2) of [BeT89a] (which can be accessed on-line). A variant of
this proof for the ADMM of Eqs. (5.89)-(5.91) is given in [BPCll], and
essentially the same convergence result is shown.

Proposition 5.4.1: (ADMM Convergence) Consider problem (5.82) and assume that there exists a primal and dual optimal solution pair, and that either $\mathrm{dom}(f_1)$ is compact or else $A'A$ is invertible. Then:

(a) The sequence $\{x_k, z_k, \lambda_k\}$ generated by the ADMM (5.84)-(5.86) is bounded, and every limit point of $\{x_k\}$ is an optimal solution of problem (5.83). Furthermore $\{\lambda_k\}$ converges to an optimal dual solution.

(b) The residual sequence $\{Ax_k - z_k\}$ converges to 0, and if $A'A$ is invertible, then $\{x_k\}$ converges to an optimal primal solution.

There is an alternative line of analysis, given in [EcB92], which connects the ADMM with the generalized proximal point algorithm of Section 5.1.4. This line of analysis is more insightful, and is based on the fixed point view of the proximal algorithm discussed in Section 5.1.4. In particular, it treats the ADMM as an algorithm for finding a fixed point of the composition of reflection operators corresponding to the conjugates of $f_1$ and $f_2$. The Krasnosel'skii-Mann theorem (Prop. 5.1.9), which establishes the convergence of interpolated nonexpansive iterations, applies to this fixed point algorithm (each of the reflection operators is nonexpansive by Prop. 5.1.8).
Among others, this line of analysis shows that despite similarities, the
ADMM is not really an approximate version of an augmented Lagrangian
method that uses cyclic minimization with respect to the vectors x and
z. Instead both methods may be viewed as exact versions of the proximal

algorithm, involving the same fixed point convergence mechanism, but dif-
ferent mappings (and hence also different convergence rate). The full proof
is somewhat lengthy, but we will provide an outline and some of the key
points in Exercise 5.5.

5.4.1 Applications in Machine Learning

We noted earlier the application of the ADMM to the additive cost problem of Example 5.4.1. This problem contains as special cases important machine learning contexts, such as the $\ell_1$-regularization and maximum likelihood examples of Section 1.3. Here are some more examples with similar flavor.

Example 5.4.3 (Basis Pursuit)

Consider the problem
$$\text{minimize}\ \ \|x\|_1 \qquad \text{subject to}\ \ Cx = b,$$
where $\|\cdot\|_1$ is the $\ell_1$ norm in $\Re^n$, $C$ is a given $m\times n$ matrix, and $b$ is a vector in $\Re^m$. This is the basis pursuit problem of Example 1.4.2. We reformulate it as
$$\text{minimize}\ \ f_1(x) + f_2(z) \qquad \text{subject to}\ \ x = z,$$
where $f_1$ is the indicator function of the set $\{x \mid Cx = b\}$ and $f_2(z) = \|z\|_1$. The augmented Lagrangian is
$$L_c(x,z,\lambda) = \begin{cases} \|z\|_1 + \lambda'(x - z) + \frac{c}{2}\|x - z\|^2 & \text{if } Cx = b,\\ \infty & \text{if } Cx \ne b. \end{cases}$$
The ADMM iteration (5.84)-(5.86) takes the form
$$x_{k+1} \in \arg\min_{Cx = b}\Big\{\lambda_k'(x - z_k) + \frac{c}{2}\|x - z_k\|^2\Big\},$$
$$z_{k+1} \in \arg\min_{z\in\Re^n}\Big\{\|z\|_1 + \lambda_k'(x_{k+1} - z) + \frac{c}{2}\|x_{k+1} - z\|^2\Big\},$$
$$\lambda_{k+1} = \lambda_k + c\,(x_{k+1} - z_{k+1}).$$
The iteration for $z$ can also be written as
$$z_{k+1} \in \arg\min_{z\in\Re^n}\Big\{\|z\|_1 + \frac{c}{2}\Big\|z - x_{k+1} - \frac{\lambda_k}{c}\Big\|^2\Big\}. \qquad (5.92)$$

The type of minimization over $z$ in Eq. (5.92) arises often in $\ell_1$-regularization problems. It is straightforward to verify that the solution is given by the so-called shrinkage operation, which for any $\alpha > 0$ and $w = (w^1,\ldots,w^m)\in\Re^m$ is defined as
$$S(\alpha,w) \in \arg\min_{z\in\Re^m}\Big\{\|z\|_1 + \frac{1}{2\alpha}\|z - w\|^2\Big\}, \qquad (5.93)$$
and has components given by
$$S_i(\alpha,w) = \begin{cases} w^i - \alpha & \text{if } w^i > \alpha,\\ 0 & \text{if } |w^i| \le \alpha,\\ w^i + \alpha & \text{if } w^i < -\alpha, \end{cases} \qquad i = 1,\ldots,m. \qquad (5.94)$$
Thus the minimization over $z$ in Eq. (5.92) is expressed in terms of the shrinkage operation as
$$z_{k+1} = S\Big(\frac{1}{c},\ x_{k+1} + \frac{\lambda_k}{c}\Big).$$
Example 5.4.4 ($\ell_1$-Regularization)

Consider the problem
$$\text{minimize}\ \ f(x) + \gamma\|x\|_1 \qquad \text{subject to}\ \ x \in \Re^n,$$
where $f : \Re^n \mapsto (-\infty,\infty]$ is a closed proper convex function and $\gamma$ is a positive scalar. This includes as special cases the $\ell_1$-regularization problem of Example 1.3.2, including the lasso formulation where $f$ is a quadratic function. We reformulate the problem as
$$\text{minimize}\ \ f_1(x) + f_2(z) \qquad \text{subject to}\ \ x = z,$$
where $f_1(x) = f(x)$ and $f_2(z) = \gamma\|z\|_1$. The augmented Lagrangian is
$$L_c(x,z,\lambda) = f(x) + \gamma\|z\|_1 + \lambda'(x - z) + \frac{c}{2}\|x - z\|^2.$$
The ADMM iteration (5.84)-(5.86) takes the form
$$x_{k+1} \in \arg\min_{x\in\Re^n}\Big\{f(x) + \lambda_k'x + \frac{c}{2}\|x - z_k\|^2\Big\},$$
$$z_{k+1} \in \arg\min_{z\in\Re^n}\Big\{\gamma\|z\|_1 - \lambda_k'z + \frac{c}{2}\|x_{k+1} - z\|^2\Big\},$$
$$\lambda_{k+1} = \lambda_k + c\,(x_{k+1} - z_{k+1}).$$
The iteration for $z$ can also be written in closed form, in terms of the shrinkage operation (5.93)-(5.94):
$$z_{k+1} = S\Big(\frac{\gamma}{c},\ x_{k+1} + \frac{\lambda_k}{c}\Big).$$
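As an illustration, the following Python sketch implements this ADMM for the lasso case $f(x) = \tfrac{1}{2}\|Dx - d\|^2$, so that the $x$-minimization reduces to a linear solve and the $z$-minimization to a shrinkage. The data $D$, $d$, and the parameters $\gamma$, $c$ are illustrative assumptions.

```python
# ADMM for the lasso: minimize 0.5*||D x - d||^2 + gamma*||x||_1.
import numpy as np

def shrinkage(alpha, w):
    return np.sign(w) * np.maximum(np.abs(w) - alpha, 0.0)

np.random.seed(0)
D = np.random.randn(20, 5)
d = np.random.randn(20)
gamma, c = 0.5, 1.0

x, z, lam = np.zeros(5), np.zeros(5), np.zeros(5)
M = D.T @ D + c * np.eye(5)                      # matrix of the x-update
for k in range(200):
    x = np.linalg.solve(M, D.T @ d - lam + c * z)      # x_{k+1}
    z = shrinkage(gamma / c, x + lam / c)              # z_{k+1} = S(gamma/c, x_{k+1}+lam_k/c)
    lam = lam + c * (x - z)                            # multiplier update
```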

Example 5.4.5 (Least Absolute Deviations Problem)

Consider the problem
$$\text{minimize}\ \ \|Cx - b\|_1 \qquad \text{subject to}\ \ x \in \Re^n,$$
where $C$ is an $m\times n$ matrix of rank $n$, and $b \in \Re^m$ is a given vector. This is the least absolute deviations problem of Example 1.3.1. We reformulate it as
$$\text{minimize}\ \ f_1(x) + f_2(z) \qquad \text{subject to}\ \ Cx - b = z,$$
where
$$f_1(x) = 0, \qquad f_2(z) = \|z\|_1.$$
Here the augmented Lagrangian function is modified to include the constant vector $b$ [cf. Eq. (5.88)]. It is given by
$$L_c(x,z,\lambda) = \|z\|_1 + \lambda'(Cx - b - z) + \frac{c}{2}\|Cx - b - z\|^2.$$
The ADMM iteration (5.89)-(5.91) takes the form
$$x_{k+1} = (C'C)^{-1}C'\Big(z_k + b - \frac{\lambda_k}{c}\Big),$$
together with the corresponding minimization over $z$ and the multiplier update $\lambda_{k+1} = \lambda_k + c\,(Cx_{k+1} - b - z_{k+1})$. Setting $\bar\lambda_k = \lambda_k/c$, the iteration can be written in the notationally simpler form
$$x_{k+1} = (C'C)^{-1}C'\big(z_k + b - \bar\lambda_k\big),$$
$$z_{k+1} \in \arg\min_{z\in\Re^m}\Big\{\|z\|_1 + \frac{c}{2}\big\|Cx_{k+1} - z - b + \bar\lambda_k\big\|^2\Big\}, \qquad (5.95)$$
$$\bar\lambda_{k+1} = \bar\lambda_k + Cx_{k+1} - b - z_{k+1}.$$
The minimization over $z$ in Eq. (5.95) is expressed in terms of the shrinkage operation as
$$z_{k+1} = S\Big(\frac{1}{c},\ Cx_{k+1} - b + \bar\lambda_k\Big).$$
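A Python sketch of the scaled-multiplier form of this iteration follows; the data $C$, $b$ and the parameter $c$ are illustrative assumptions.

```python
# ADMM for the least absolute deviations problem: minimize ||C x - b||_1.
import numpy as np

def shrinkage(alpha, w):
    return np.sign(w) * np.maximum(np.abs(w) - alpha, 0.0)

np.random.seed(2)
C = np.random.randn(30, 4)                       # rank n = 4 with probability 1
b = np.random.randn(30)
c = 1.0

pinv = np.linalg.solve(C.T @ C, C.T)             # (C'C)^{-1} C'
z, lam_bar = np.zeros(30), np.zeros(30)
for k in range(300):
    x = pinv @ (z + b - lam_bar)                         # x_{k+1}
    z = shrinkage(1.0 / c, C @ x - b + lam_bar)          # z_{k+1}, cf. (5.95)
    lam_bar = lam_bar + C @ x - b - z                    # scaled multiplier update
```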

5.4.2 ADMM Applied to Separable Problems

In this section we consider a separable problem of the form
$$\begin{aligned} &\text{minimize}\ \ \sum_{i=1}^m f_i(x^i)\\ &\text{subject to}\ \ \sum_{i=1}^m A_ix^i = b, \qquad x^i \in X_i,\ \ i = 1,\ldots,m, \end{aligned} \qquad (5.96)$$
where $f_i : \Re^{n_i} \mapsto \Re$ are convex functions, $X_i$ are closed convex sets, and $A_i$ and $b$ are given. We have often noted that this problem has a favorable structure that is well-suited for the application of decomposition approaches. Since the primary attractive feature of the ADMM is that it decouples the augmented Lagrangian optimization calculations, it is natural to consider its application to this problem.
An idea that readily comes to mind is to form the augmented Lagrangian
$$L_c(x^1,\ldots,x^m,\lambda) = \sum_{i=1}^m f_i(x^i) + \lambda'\Big(\sum_{i=1}^m A_ix^i - b\Big) + \frac{c}{2}\Big\|\sum_{i=1}^m A_ix^i - b\Big\|^2,$$
and use an ADMM-like iteration, whereby we minimize $L_c$ sequentially with respect to $x^1,\ldots,x^m$, i.e.,
$$x_{k+1}^i \in \arg\min_{x^i\in X_i} L_c\big(x_{k+1}^1,\ldots,x_{k+1}^{i-1},\,x^i,\,x_k^{i+1},\ldots,x_k^m,\,\lambda_k\big), \qquad i = 1,\ldots,m, \qquad (5.97)$$
and follow these minimizations with the multiplier iteration
$$\lambda_{k+1} = \lambda_k + c\Big(\sum_{i=1}^m A_ix_{k+1}^i - b\Big). \qquad (5.98)$$

Methods of this type have been proposed in various forms a long time ago, starting with the important paper [StW75], which stimulated considerable further research. The context was unrelated to the ADMM (which was unknown at that time), but the motivation was similar to the one of the ADMM: working around the coupling of variables induced by the penalty term in the augmented Lagrangian method.
When there is only one component, $m = 1$, we obtain the augmented Lagrangian method. When there are only two components, $m = 2$, the above method is equivalent to the ADMM of Eqs. (5.89)-(5.91), so it has the corresponding convergence properties. On the other hand, when $m > 2$, the method is not a special case of the ADMM that we have discussed and is not covered by similar convergence guarantees. In fact a convergence counterexample has been given for $m = 3$ in [CHY13]. The same reference shows that the iteration (5.97)-(5.98) is convergent under additional, but substantially stronger assumptions. Related convergence results are proved in [HoL13] under alternative but also strong assumptions; see also [WHM13].
In what follows we will develop an ADMM (first given in [BeT89a], Section 3.4 and Example 4.4), which is similar to the iteration (5.97)-(5.98), and is covered by the convergence guarantees of Prop. 5.4.1 without any assumptions other than the ones given in the proposition. We will derive this algorithm by formulating the separable problem as a special case of the Fenchel framework problem (5.82), and by applying the convergent ADMM (5.84)-(5.86).
We reformulate problem (5.96) by introducing additional variables $z^1,\ldots,z^m$ as follows:
$$\begin{aligned} &\text{minimize}\ \ \sum_{i=1}^m f_i(x^i)\\ &\text{subject to}\ \ A_ix^i = z^i,\ \ x^i \in X_i,\ \ i = 1,\ldots,m, \qquad \sum_{i=1}^m z^i = b. \end{aligned} \qquad (5.99)$$
We denote $x = (x^1,\ldots,x^m)$, $z = (z^1,\ldots,z^m)$, we view $X = X_1\times\cdots\times X_m$ as a constraint set for $x$, we view
$$Z = \Big\{(z^1,\ldots,z^m)\ \Big|\ \sum_{i=1}^m z^i = b\Big\}$$
as a constraint set for $z$, and we introduce a multiplier vector $p^i$ for each of the equality constraints $A_ix^i = z^i$. The augmented Lagrangian has the form
$$L_c(x,z,p) = \sum_{i=1}^m\Big(f_i(x^i) + (A_ix^i - z^i)'p^i + \frac{c}{2}\|A_ix^i - z^i\|^2\Big), \qquad x\in X,\ z\in Z,$$

and the ADMM (5.84)-(5.86) is given by
$$x_{k+1} \in \arg\min_{x\in X} L_c(x, z_k, p_k), \qquad (5.100)$$
$$z_{k+1} \in \arg\min_{z\in Z} L_c(x_{k+1}, z, p_k), \qquad (5.101)$$
$$p_{k+1}^i = p_k^i + c\big(A_ix_{k+1}^i - z_{k+1}^i\big), \qquad i = 1,\ldots,m. \qquad (5.102)$$

We will now show how to simplify this algorithm. We will first obtain the minimization (5.101) for $z$ in closed form by introducing a multiplier vector $\lambda_{k+1}$ for the constraint $\sum_{i=1}^m z^i = b$, and then show that the multipliers $p_{k+1}^i$ obtained from the update (5.102) are all equal to $\lambda_{k+1}$. To this end we note that the Lagrangian function corresponding to the minimization (5.101) is given by
$$\sum_{i=1}^m\Big((A_ix_{k+1}^i - z^i)'p_k^i + \frac{c}{2}\|A_ix_{k+1}^i - z^i\|^2 + \lambda_{k+1}'z^i\Big) - \lambda_{k+1}'b.$$
By setting to zero its gradient with respect to $z^i$, we see that the minimizing vectors $z_{k+1}^i$ are given in terms of $\lambda_{k+1}$ by
$$z_{k+1}^i = A_ix_{k+1}^i + \frac{p_k^i - \lambda_{k+1}}{c}. \qquad (5.103)$$
A key observation is that we can write this equation as
$$\lambda_{k+1} = p_k^i + c\big(A_ix_{k+1}^i - z_{k+1}^i\big), \qquad i = 1,\ldots,m,$$
so from Eq. (5.102), we have
$$p_{k+1}^i = \lambda_{k+1}, \qquad i = 1,\ldots,m.$$
Thus during the algorithm all the multipliers $p^i$ are updated to a common value: the multiplier $\lambda_{k+1}$ of the constraint $\sum_{i=1}^m z^i = b$ of problem (5.101).
We now use this fact to simplify the ADMM (5.100)-(5.102). Given $z_k$ and $\lambda_k$ (which is equal to $p_k^i$ for all $i$), we first obtain $x_{k+1}$ from the augmented Lagrangian minimization (5.100) as
$$x_{k+1}^i \in \arg\min_{x^i\in X_i}\Big\{f_i(x^i) + (A_ix^i - z_k^i)'\lambda_k + \frac{c}{2}\|A_ix^i - z_k^i\|^2\Big\}. \qquad (5.104)$$
To obtain $z_{k+1}$ and $\lambda_{k+1}$ from the augmented Lagrangian minimization (5.101), we express $z_{k+1}^i$ in terms of the unknown $\lambda_{k+1}$ as
$$z_{k+1}^i = A_ix_{k+1}^i + \frac{\lambda_k - \lambda_{k+1}}{c} \qquad (5.105)$$
[cf. Eq. (5.103)], and then obtain $\lambda_{k+1}$ by requiring that the constraint of the minimization (5.101), $\sum_{i=1}^m z_{k+1}^i = b$, is satisfied. Thus, by adding Eq. (5.105) over $i = 1,\ldots,m$, we have
$$\lambda_{k+1} = \lambda_k + \frac{c}{m}\Big(\sum_{i=1}^m A_ix_{k+1}^i - b\Big). \qquad (5.106)$$
We then obtain $z_{k+1}$ using Eq. (5.105).


In summary, the iteration of the algorithm consists of the three equa-
tions (5.104), (5.106), and (5.105), applied in that order. The vectors .Xo
and zo, which are needed at the first iteration, can be chosen arbitrar-
ily. This is a correct ADMM, mathematically equivalent to the algorithm
(5.100)-(5.102), which has guaranteed convergence as per Prop. 5.4.1. It is
as simple as the iteration (5.97)-(5.98), which, however, is not theoretically
sound form> 2 as we noted earlier.
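A Python sketch of the iteration (5.104)-(5.106) is given below, for the illustrative separable problem of minimizing $\sum_i \tfrac{1}{2}\|x^i - d_i\|^2$ subject to $\sum_i x^i = b$, with $A_i = I$ and $X_i = \Re^n$; this choice (an assumption, not from the text) makes the minimization (5.104) available in closed form.

```python
# Separable ADMM (5.104)-(5.106) for minimize sum_i 0.5*||x^i - d_i||^2
# subject to sum_i x^i = b, with A_i = I.
import numpy as np

np.random.seed(1)
m, n = 4, 3
d = [np.random.randn(n) for _ in range(m)]
bvec = np.ones(n)
c = 1.0

lam = np.zeros(n)
z = [np.zeros(n) for _ in range(m)]
for k in range(200):
    # (5.104): x^i_{k+1} minimizes f_i(x) + (x - z^i_k)'lam_k + (c/2)||x - z^i_k||^2
    x = [(d[i] - lam + c * z[i]) / (1.0 + c) for i in range(m)]
    lam_new = lam + (c / m) * (sum(x) - bvec)            # (5.106)
    z = [x[i] + (lam - lam_new) / c for i in range(m)]   # (5.105)
    lam = lam_new
```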

Example 5.4.6 (Constrained ADMM)

Consider the problem
$$\text{minimize}\ \ f_1(x) + f_2(Ax) \qquad \text{subject to}\ \ Ex = d,\ x \in X,$$
which differs from the standard format (5.82) in that it includes the convex set constraint $x \in X$ and the linear equality constraint $Ex = d$, where $E$ and $d$ are a given matrix and vector, respectively. We convert this problem into the separable form (5.96) as follows:
$$\text{minimize}\ \ f_1(x^1) + f_2(x^2) \qquad \text{subject to}\ \ \begin{pmatrix}A\\ E\end{pmatrix}x^1 + \begin{pmatrix}-I\\ 0\end{pmatrix}x^2 = \begin{pmatrix}0\\ d\end{pmatrix}, \qquad x^1 \in X,\ x^2 \in \Re^m.$$
Assigning multipliers $\lambda$ and $\mu$ to the two blocks of equality constraints, and applying the algorithm of Eqs. (5.104)-(5.106), with the notation
$$z^i = \begin{pmatrix}z^{i1}\\ z^{i2}\end{pmatrix}, \qquad i = 1,2,$$
we obtain the following iteration:
$$x_{k+1}^1 \in \arg\min_{x^1\in X}\Big\{f_1(x^1) + (Ax^1 - z_k^{11})'\lambda_k + (Ex^1 - z_k^{12})'\mu_k + \frac{c}{2}\big(\|Ax^1 - z_k^{11}\|^2 + \|Ex^1 - z_k^{12}\|^2\big)\Big\},$$
$$z_{k+1}^1 = \begin{pmatrix}A\\ E\end{pmatrix}x_{k+1}^1 + \frac{1}{c}\begin{pmatrix}\lambda_k - \lambda_{k+1}\\ \mu_k - \mu_{k+1}\end{pmatrix},$$
with the minimization for $x^2$, the update of $z^2$, and the multiplier updates for $\lambda$ and $\mu$ obtained similarly from Eqs. (5.104)-(5.106). Note that this algorithm maintains the main attractive characteristic of the ADMM: the components $f_1$ and $f_2$ of the cost function are decoupled in the augmented Lagrangian minimizations.

We mention a more refined form of the multiplier iteration (5.106), also from [BeT89a], Example 4.4, whereby the coordinates $\lambda_k^j$ of the multiplier vector $\lambda_k$ are updated according to
$$\lambda_{k+1}^j = \lambda_k^j + \frac{c}{m_j}\Big(\sum_{i=1}^m (A_ix_{k+1}^i)^j - b^j\Big), \qquad j = 1,\ldots,r, \qquad (5.107)$$
where $r$ is the row dimension of the matrices $A_i$, the superscript $j$ denotes $j$th components, and $m_j$ is the number of submatrices $A_i$ that have nonzero $j$th row. Using the $j$-dependent stepsize $c/m_j$ of Eq. (5.107) in place of the stepsize $c/m$ of Eq. (5.106) may be viewed as a form of diagonal scaling. The derivation of the algorithm of Eqs. (5.104), (5.105), and (5.107) is nearly identical to the one given for the algorithm (5.104)-(5.106). The idea is that the components of the vector $z^i$ represent estimates of the corresponding components of $A_ix^i$ at the optimum. However, if some of these components are known to be 0 because some of the rows of $A_i$ are 0, then the corresponding values of $z^i$ might as well be set to 0 rather than be estimated. If we repeat the preceding derivation of the algorithm (5.104)-(5.106), but without introducing the components of $z^i$ that are known to be 0, we obtain by a straightforward calculation the multiplier iteration (5.107).
Finally let us note that the ADMM can be applied to the dual of the
separable problem (5.96), and yield a similar decomposition algorithm. The
idea is that the dual problem has the form discussed in Example 5.4.1, for
which the ADMM can be conveniently applied. This approach is applicable
even in the case where the primal problem has some nonlinear inequality
constraints; see [Fuk92], which also discusses a connection with the method
of partial inverses of [Spi83], [Spi85].

5.5 NOTES, SOURCES, AND EXERCISES

Section 5.1: The proximal algorithm was introduced by Martinet [Mar70],


[Mar72] in a form that applies to convex optimization and monotone varia-
tional inequality problems. The generalized version that aims to find a zero
of a maximal monotone operator (cf. Section 5.1.4) received wide attention
following the work of Rockafellar [Roc76a], [Roc76b], which was based on
the fundamental work of Minty [Min62], [Min64].
Together with its special cases and variations, the proximal algo-
rithm has been analyzed by several authors, including connections with
the augmented Lagrangian method, convergence and rate of convergence
issues, and various extensions dealing with alternative stepsize rules, ap-
proximate implementations, special cases, and generalizations to noncon-
vex problems (see e.g., Brezis and Lions [BrL78], Spingarn [Spi83], [Spi85],
Luque [Luq84], Golshtein [Gol85], Lawrence and Spingarn [LaS87], Lemaire
[Lem89], Rockafellar and Wets [RoW91], Tseng [Tse91b], Eckstein and

Bertsekas [EcB92], Guler [Gul92]). For textbook discussions, see Rockafel-


lar and Wets [RoW98], Facchinei and Pang [FaP03], and Bauschke and
Combettes [BaC11], which include many references.
The rate of convergence analysis given here (Prop. 5.1.4) is due to Kort and Bertsekas [KoB76], and has been extensively discussed in the monograph [Ber82a] (Section 5.4) in a more general form where the regularization term may be nonquadratic (in this case the order of growth of the regularization term also affects the convergence rate; see Section 6.6). In particular, if a regularization term $\|x - x_k\|^p$ with $1 < p < 2$ is used in place of the quadratic $\|x - x_k\|^2$, the order of convergence is $1/\big((p-1)(\gamma-1)\big)$, where $\gamma$ is the order of growth of $f$ (cf. Prop. 5.1.4). Thus superlinear convergence is achieved when $\gamma = 2$. Limited computational experimentation, some of which is described in [KoB76], suggests that values $p < 2$ may be beneficial in some problem contexts. For an extension of the linear convergence result of Prop. 5.1.4(c), which applies to finding a zero of a maximal monotone operator, see Luque [Luq84].
The finite termination of the proximal algorithm when applied to
polyhedral functions (Prop. 5.1.5) was shown independently, using differ-
ent methods and assumptions, by Poljak and Tretjakov [PoT74], and by
Bertsekas [Ber75a], which we follow in our analysis here.
Section 5.2: The augmented Lagrangian algorithm was independently
proposed in the papers by Hestenes [Hes69], Powell [Pow69], and Haarhoff
and Buys [HaB70] in a nonlinear programming context where convexity
played no apparent role. These papers contained little analysis, and did
not suggest any relation to duality and the proximal algorithm. This re-
lation was clarified and analyzed for convex programming problems by
Rockafellar [Roc73], [Roc76b]. The convergence and rate of convergence of
the algorithm for nonconvex differentiable problems were investigated by
the author in [Ber75c], [Ber76b], [Ber76c], along with variants including
inexact augmented Lagrangian minimization and second order methods of
multipliers. A related and contemporary development was the introduc-
tion of nonquadratic augmented Lagrangians in the papers by Kort and
Bertsekas [KoB72], [KoB76], including the exponential method, which will
be discussed in Section 6.6. There has been much subsequent work; rep-
resentative references, some relating to the proximal algorithm, include
Robinson [Rob99], Pennanen [Pen02], Eckstein [Eck03], Iusem, Pennanen,
and Svaiter [IPS03], and Eckstein and Silva [EcS13].
For surveys of the augmented Lagrangian literature, see Bertsekas
[Ber76b], Rockafellar [Roc76c], and Iusem [Ius99]. Discussions of aug-
mented Lagrangian methods of varying levels of detail may also be found
in most nonlinear programming textbooks. An extensive treatment of aug-
mented Lagrangian methods, for both convex and nonconvex constrained
problems, is given in the author's research monograph [Ber82a], includ-
ing an analysis of smoothing and sequential quadratic programming algo-

rithms, and detailed references on the early history of the subject. The
distributed algorithms monograph by Bertsekas and Tsitsiklis [BeT89a]
discusses applications of augmented Lagrangians to classes of large-scale
problems with special structure, including separable problems and prob-
lems with additive cost functions. There has been considerable recent in-
terest in using augmented Lagrangian, proximal, and smoothing methods
for machine learning and signal processing applications; see e.g., Osher et
al. [OBG05], Yin et al. [YOG08], and Goldstein and Osher [Go009].
Section 5.3: The proximal cutting plane and simplicial decomposition
algorithms of Sections 5.3.1 and 5.3.3, may be viewed as regularized versions
of the classical Dantzig-Wolfe decomposition algorithm (see e.g., [Las70],
[BeT97], [Ber99]). The latter algorithm is obtained in the limit, as the regularization term diminishes to 0 ($c_k \to \infty$).
For presentations of bundle methods, see the books by Hiriart-Urruty and Lemarechal [HiL93], and Bonnans et al. [BGL06], which give many
references. For related methods, see Ruszczynski [Rus86], Lemarechal and
Sagastizabal [LeS93], Mifflin [Mif96], Burke and Qian [BuQ98], Mifflin,
Sun, and Qi [MSQ98], Frangioni [Fra02], and Teo et al. [TVS10].
The term "bundle" has been used with a few different meanings in
the convex algorithmic optimization literature, with some confusion result-
ing. To our knowledge, it was first introduced in the 1975 paper by Wolfe
[Wol75] to describe a collection of subgradients used for calculating a de-
scent direction in the context of a specific algorithm of the descent type -
a context with no connection to cutting planes or proximal minimization.
It subsequently appeared in related descent contexts through the 70s and
early 80s. The meaning of the term "bundle method" shifted gradually in
the 80s, and it is now commonly associated with the stabilized proximal
cutting plane methods that we have described in Section 5.3.2.
Section 5.4: The ADMM was first proposed by Glowinski and Marrocco [GlM75], and Gabay and Mercier [GaM76], and was further developed by Gabay [Gab79], [Gab83]. It was generalized by Lions and Mercier [LiM79], where the connection with alternating direction methods for solving differential equations was pointed out. The method and its applications in large boundary-value problems were discussed by Fortin and Glowinski [FoG83].
The recent literature on the ADMM is voluminous and cannot be
surveyed here (a Google Scholar search produced thousands of papers ap-
pearing in the two years preceding the publication of this book). The
surge of interest is largely due to the flexibility that the ADMM provides
in exploiting special problem structures, such as for example the ones from
machine learning that we have discussed in Section 5.4.1.
In our discussion we have followed the analysis of the book by Bert-
sekas and Tsitsiklis [BeT89a] (which among others gave the ADMM for
separable problems of Section 5.4.2), and in part the paper by Eckstein
and Bertsekas [EcB92] (which established the connection of the ADMM

with the general form of the proximal algorithm of Section 5.1.4, and gave
extensions involving, among others, extrapolation and inexact minimiza-
tion). In particular, the paper [EcB92] showed that the general form of
the proximal algorithm contains as a special case the Douglas-Rachford
splitting algorithm for finding a zero of the sum of two maximal monotone
operators, proposed by Lions and Mercier [LiM79]. The latter algorithm
contains in turn as a special case the ADMM, as shown by Gabay [Gab83].

EXERCISES

5.1 (Proximal Algorithm via Trust Regions)

Consider using in place of the proximal algorithm the following iteration:
$$x_{k+1} \in \arg\min_{\|x - x_k\|\le\gamma_k} f(x),$$
where $\{\gamma_k\}$ is a sequence of positive scalars. Use dual variables to relate this
algorithm with the proximal algorithm. In particular, provide conditions under
which there is a proximal algorithm, with an appropriate sequence of penalty
parameters {ck}, which generates the same iterate sequence {xk} starting from
the same point xo.

5.2 (Contraction Properties of the Proximal Operator [Roc76a])

Consider a multivalued mapping $M : \Re^n \mapsto 2^{\Re^n}$, which maps vectors $x\in\Re^n$ into subsets $M(x)\subset\Re^n$. Assume that $M$ has the following two properties:
(1) Any vector $z\in\Re^n$ can be written in exactly one way as
$$z = \bar z + c\bar v, \qquad \text{where } \bar z\in\Re^n,\ \bar v\in M(\bar z), \qquad (5.108)$$
[cf. Eq. (5.33)]. (As noted in Section 5.1.4, this is true if $M$ has a maximal monotonicity property.)
(2) For some $\alpha > 0$, we have
$$(x_1 - x_2)'(v_1 - v_2) \ge \alpha\|x_1 - x_2\|^2, \qquad \forall\ x_1,x_2\in\mathrm{dom}(M),\ v_1\in M(x_1),\ v_2\in M(x_2),$$
where $\mathrm{dom}(M) = \big\{x \mid M(x)\ne\varnothing\big\}$ (assumed nonempty). (This is called the strong monotonicity condition.)
Show that the proximal operator $P_{c,f}$, which maps $z$ to the unique vector $\bar z$ of Eq. (5.108), is a contraction mapping with respect to the Euclidean norm, and in fact
$$\big\|P_{c,f}(z_1) - P_{c,f}(z_2)\big\| \le \frac{1}{1 + c\alpha}\|z_1 - z_2\|, \qquad \forall\ z_1,z_2\in\Re^n.$$
Sec. 5.5 Notes, Sources, and Exercises 297

Note: While a fixed point iteration involving a contraction has a linear conver-
gence rate, the reverse is not true. In particular, Prop. 5.l.4(c) gives a condition
under which the proximal algorithm has a linear convergence rate. However, this
condition does not guarantee that the proximal operator Pc,f is a contraction
with respect to any particular norm. For example, all the minimizing points of
f are fixed points of Pc,f, but the condition of Prop. 5.1.4( c) does not preclude
the possibility of f having multiple minimizing points. See also [Luq84] for an
extension of Prop. 5.l.4(c), which applies to the case of a maximal monotone
operator, and other related convergence rate results. Hint: Consider the multi-
valued mapping
M(z) = M(z) - (j z,
and for any c > 0, let I\,f (z) be the unique vector z such that z = z + c( v - (j z)
[cf. Eq. (5.108)]. Note that M is monotone, and hence by the theory of Section
5.1.4, l\,t is nonexpansive. Verify that
Pc,J(z) = F\;,J((l + w)- 1 z), z E Rn,
where c = c(l + c/j)- 1 , and use the nonexpansiveness of Pc,f·

5.3 (Partial Proximal Algorithm [Ha90], [BeT94a], [IbF96])

The purpose of this exercise is to develop the elements of an algorithm that


is similar to the proximal, but uses partial regularization, that is, a quadratic
regularization term that involves only a subset of the coordinates of x. For c > 0,
let c/Jc be the real-valued convex function on Rn defined by

c/Jc(z) min {f(x) + _!_llx - zll 2 } ,


= xEX 2c
where f is a convex function over the closed convex set X. Let x 1 , ... , xn denote
the scalar components of the vector x and let I be a subset of the index set
{1, ... , n}. For any z = (z 1 , ... , zn) E Rn, consider a vector z satisfying

zE argmin {f(x)
xEX
+_!_~Ix;
2c~
- zil 2 } . (5.109)
iEl

(a) Show that for a given z, the iterate z can be obtained by the two-step
process
z E arg .min c/Jc(x),
{x[x'=z", iEI}

z E arg min {f(x) + _!_ llx - zll 2 } .


xEX 2c
(b) Interpret z as the result of a block coordinate descent iteration correspond-
ing to the components zi, i ¢:. I, followed by a proximal iteration, and show
that
c/Jc(z) ::; J(z) ::; c/Jc(z) ::; f(z).
Note: Partial regularization, as in iteration (5.109), may yield better approxima-
tion to the original problem, and accelerated convergence if f is "well-behaved"
with respect to some of the components of x (the components xi with i ¢:. I).
298 Proximal Algorithms Chap. 5

5.4 (Convergence of Proximal Cutting Plane Method)

Show that if the optimal solution set X* is nonempty, the sequence {xk} gen-
erated by the proximal cutting plane method (5.69) converges to some point in
x·.

5.5 (Fixed Point Interpretation of ADMM)

Consider the ADMM framework of Section 5.4. Let d1 : Rm f-+ (-oo, oo] and
d 2 : Rm f-+ (-oo, oo] be the functions

d2(>..) = sup {(->..)'z-h(z)},


zE~n

and note that the dual to the Fenchel problem (5.82) is to minimize d1 + d2 [cf.
Eq. (5.36) or Prop. 1.2.1]. Let N1 : Rm f-+ Rm and N2 : Rm f-+ Rm be the
reflection operators corresponding to d1 and d2, respectively [cf. Eq. (5.19)].
(a) Show that the set of fixed points of the composition N1 · N2 is the set

(5.110)

see Fig. 5.5.1.


(b) Show that the interpolated fixed point iteration

(5.111)

where ak E [O, 1) for all k and I:;;°= 0 ak(l - ak) = oo, converges to a fixed
point of N1 · N2. Moreover, when ak = 1/2, this iteration is equivalent to
the ADMM; see Fig. 5.5.2.
Hints and Notes: We have N1(z) = z - cv, where z E Rm and v E 8d1(z) are
obtained from the unique decomposition z = z + cv, and N2(z) = z - cv, where
z E Rm and v E 8d2 (z) are obtained from the unique decomposition z = z + cv
[cf. Eq. (5.25)]. Part (a) shows that finding a fixed point of N1 · N2 is equivalent
to finding two vectors ).. * and v* that satisfy the optimality conditions for the
dual problem of minimizing d1 + d2, and then computing ).. * - cv* ( assuming that
the conditions for strong duality are satisfied). In terms of the primal problem,
we will then have

A'>..* E 8Ji(x*) and - >,.* E Bh(Ax*),

as well as the equivalent condition

x* E oft(A'>..*) and Ax* E BJI(->..*),

where x* is any optimal primal solution [cf. Prop. 1.2.l(c)J. Moreover we will
have v* = -Ax*.
Sec. 5.5 Notes, Sources, and Exercises 299

Figure 5.5.1. Illustration of the mapping N1 -N2 and its fixed points (cf. Exercise
5.5). The vector>..* shown is an optimal solution of the dual problem of minimizing
d1 + d2, and according to the optimality conditions we have v* E 8d1(>..*) and
v* E -8d2 (>.. *) for some v*. It can be seen then that >.. * - cv* is a fixed point
of N1 · N2 and conversely (in the figure , by applying N2 to >..* - cv* using the
graphical process of Fig. 5.1.8, and by applying N1 to the result, we end back at
>..* - cv*).

For a proof of part (a), note that for any z we have

(5.112)

which also implies that

(5.113)

Thus from Eqs. (5.112) and (5.113), z is a fixed point of N1 · N2 if and only if

Z = Z2 + CV2 = Z1 - CV1,

while from Eq. (5.113), we have

The last two relations yield z1 = z2 and v1 = -v2. Thus, denoting>.= z1 = z2


and v = V1 = -v2, we have that z is a fixed point of N1 · N2 if and only if it has
the form A+ cv, with VE od1(>.) and -v E od2(>.), verifying Eq. (5.110) for the
set of fixed points.
For part (b), note that since both N1 and N2 are nonexpansive (cf. Prop.
5.1.8), the composition N1 · N2 is nonexpansive. Therefore, based on the Krasno-
sel'skii-Mann theorem (Prop. 5.1.9), the interpolated fixed point iteration (5.111)
300 Proximal Algorithms Chap. 5

Slope= -1/c
/

ADMM Iterate

Figure 5.5.2. Illustration of the interpolated fixed point iteration (5.111). Start-
ing from Yk, we obtain (N1 · N2)(Yk) using the process illustrated in the fig-
ure: first compute N2(Yk) as shown (cf. Fig. 5.1.8), then apply N1 to compute
(N1 · N2)(yk), and finally interpolate between Yk and (N1 · N2)(Yk) using a pa-
rameter °'k E (0, 1). When the interpolation parameter is °'k = 1/2, we obtain
the ADMM iterate, which is the midpoint between Yk and (N1 · N2)(Yk), denoted
by Yk+l in the figure. The iteration converges to a fixed point y* of N1 · N2,
which when written in the form y* = >.. * - cv*, yields a dual optimal solution >.. *.

converges to a fixed point of N1 · N2, starting from any yo E ~r


and assuming
that N1 · N2 has at least one point (see Fig. 5.5.2).
The verification that for ak = 1/2 we obtain the ADMM, is a little com-
plicated, but the idea is clear. As shown in Section 5.2.1, generally, a proximal
iteration can be dually implemented as an augmented Lagrangian minimization
involving the conjugate function. Thus a proximal iteration for d2 (or di) is
the dual of an augmented Lagrangian iteration involving Ji (or h, respectively).
Therefore, a fixed point iteration for the mapping N1 · N2, which is the com-
position of these proximal iterations, can be implemented with an augmented
Lagrangian iteration for Ji, followed by an augmented Lagrangian iteration for
h, consistently with the character of the ADMM. See [EcB92] or [Eck12] for a
detailed derivation.
6

Additional Algorithmic
Topics

Contents

6.1. Gradient Projection Methods . . . . . . . , . 1. . • p. 302


6.2. Gradient Projection with Extrapolation . . . . . . p. 322
6.2.l. An Algorithm with Optimal Iteration Complexity p. 323
6.2.2. Nondifferentiable Cost - Smoothing . . p. 326
6.3. Proximal Gradient Methods . . . . . . . . . p. 330
6.4. Incremental Subgradient Proximal Methods p. 340
6.4.l. Convergence for Methods with Cyclic Order .1 • p. 344
6.4.2. Convergence for Methods with Rando:µiized Order p. 353
6.4.3. Application in Specially Structured Problems p. 361
6.4.4. Incremental Constraint Projection Methods p. 365
6.5. Coordinate Descent Methods . . . . . . . . . p. 369
6.5.l. Variants of Coordinate Descent . . . . . . p. 373
6.5.2. Distributed Asynchronous Coordinate Descent p. 376
6.6. Generalized Proximal Methods . . . . . . . . . p. 382
6. 7. E-Descent and Extended Mono tropic Programming p. 396
6.7.l. E-Subgradients . . . . . . . . . . . . . p. 397
6.7.2. E-Descent Method . . . . . . . . . . . . p. 400
6.7.3. Extended Monotropic Programming Duality .P· 406
6.7.4. Special Cases of Strong Duality . . . . . . p. 408
6.8. Interior Point Methods . . . . . . . . . . . p. 412
6.8.l. Primal-Dual Methods for Linear Programming p. 416
6.8.2. Interior Point Methods for Conic Programming p. 423
6.8.3. Central Cutting Plane Methods p. 425
6.9. Notes, Sources, and Exercises . . . . . . . . . . p. 426

301
302 Additional Algorithmic Topics Chap. 6

In this chapter, we consider a variety of algorithmic topics that supplement


our discussion of the descent and approximation methods of the preceding
chapters. In Section 6.1, we expand our discussion of descent algorithms
for differentiable optimization of Sections 2.1.1 and 2.1.2, and the gradient
projection method in particular. We then focus on clarifying some compu-
tational complexity issues relating to convex optimization problems, and
in Section 6.2 we derive algorithms based on extrapolation ideas, which are
optimal in terms of iteration complexity. These algorithms are developed
for differentiable problems, and can be extended to the nondifferentiable
case by means of a smoothing scheme.
Following the analysis of optimal algorithms in Section 6.2, we dis-
cuss in Sections 6.3-6.5 additional algorithms that are largely based on
the descent approach. A common theme in these sections is to split the
optimization process into a sequence of simpler optimizations, with the
motivation to exploit the special structure of the problem. The simpler op-
timizations handle incrementally the components of the cost function, or
the components of the constraint set, or the components of the optimiza-
tion vector x. In particular, in Section 6.3, we discuss proximal gradient
methods, for minimizing the sum of a differentiable function (handled by
a gradient method) and a nondifferentiable function (handled by a proxi-
mal method). In Section 6.4, we discuss combinations of the incremental
subgradient algorithms of Section 3.3.1 with proximal methods that en-
hance the flexibility and reliability of the incremental approach. In Section
6.5, we discuss the classical block coordinate descent approach, its many
variations, and related issues of distributed asynchronous computation.
In Sections 6.6-6.8, we selectively describe a variety of algorithms that
supplement the ones of the preceding sections. In particular, in Section 6.6,
we generalize the proximal algorithm to use a nonquadratic regularization
term, such as an entropy function, and we obtain corresponding augmented
Lagrangian algorithms. In Section 6.7, we focus on the E-descent method,
a descent algorithm based on E-subgradients, which among others can be
used to establish strong duality properties for the extended monotropic pro-
gramming problem that we addressed algorithmically in Section 4.4, using
polyhedral approximation schemes. Finally, in Section 6.8, we discuss inte-
rior point methods and their application in linear and conic programming.
Our coverage of the various methods in this chapter is not always as
comprehensive as in earlier chapters, and occasionally it takes the character
of a survey. Part of the rationale for this is to limit the size of the book.
Another reason is that some of the methods are under active development,
and views about their character and merits have not settled.

6.1 GRADIENT PROJECTION METHODS

In this section we focus on a fundamental method that we briefly discussed


at several points earlier. In particular, we consider the gradient projection
Sec. 6.1 Gradient Projection Methods 303

Figure 6.1.1. Illustration of the projection arc

a> 0.

It starts at Xk and defines a curve continuously parametrized by a E (0, oo).

method,
(6.1)
where f : ~n H ~ is a continuously differentiable function, X is a closed
convex set, and ak > 0 is a stepsize. We have outlined in Section 2.1.2 some
of its characteristics, and its connection to feasible direction methods. In
this section we take a closer look at its convergence and rate of convergence
properties, and its implementation.

Descent Properties of the Gradient Projection Method

The gradient projection method can be viewed as the specialization of the


subgradient method of Section 3.2, for the case where f is differentiable,
so it is covered by the convergence analysis given there. However, when
f is differentiable, stronger convergence and rate of convergence results
can be shown than for the nondifferentiable case. Furthermore, there is
more flexibility in applying the method because there is a greater variety
of stepsize rules that can be used. The fundamental reason is that we can
operate the gradient projection method based on iterative cost function
descent, i.e., it is possible to select ak so that

if Xk is not optimal. By contrast, this is not possible when f is nondiffer-


entiable, and an arbitrary subgradient is used in place of the gradient, as
we have seen in Section 3.2.
For a given x k E X, let us consider the projection arc, defined by
a> O; (6.2)
304 Additional Algorithmic Topics Chap. 6

see Fig. 6.1.1. This is the set of possible next iterates parametrized by
the stepsize a. The following proposition shows that for all a > 0, unless
xk(a) = Xk (which is a condition for optimality of xk), the vector xk(a)-xk
is a feasible descent direction, i.e., v' f(xk)'(xk(a) - Xk) < 0.

Proposition 6.1.1: (Descent Properties) Let f : Rn r-+ R be a


continuously differentiable function, and let X be a closed convex set.
Then for all Xk E X and a > 0:
(a) If Xk(a) i=- Xk, then Xk(a) - Xk is a feasible descent direction at
Xk. In particular, we have

\:/a> 0. (6.3)

(b) If xk(a) = Xk for some a > 0, then xk satisfies the necessary


condition for minimizing f over X,

\:/XE X. (6.4)

Proof: (a) From the Projection Theorem (Prop. 1.1.9 in Appendix B), we
have

\:/xEX, (6.5)

so by setting x = Xk, we obtain Eq. (6.3).


(b) If xk(a) = Xk for some a> 0, Eq. (6.5) becomes Eq. (6.4). Q.E.D.

Convergence for a Constant Stepsize

The simplest way to guarantee cost function descent in the gradient pro-
jection method is to keep the stepsize fixed at a constant but sufficiently
small value a > 0. In this case, however, it is necessary to assume that f
has Lipschitz continuous gradient, i.e., for some constant L, we havet

llv'f(x) - v'f(y)II SL llx -yll, V x,y EX. (6.6)

t Without this condition the method may not converge, for any constant
stepsize choice a, as can be seen with the scalar example where f(x) = lxl 312
(the method oscillates around the minimum x* = 0, because the gradient grows
too fast around x*). A different stepsize rule that ensures cost function descent
is needed for convergence.
Sec. 6.1 Gradient Projection Methods 305

X X - t;v'f(x) y

Figure 6.1.2. Visualization of the descent lemma (cf. Prop. 6.1.2). The Lipschitz
constant L serves as an upper bound to the "curvature" of f along directions, so
½
the quadratic function £(y; x)+ lly-xll 2 is an upper bound to f(y). The steepest
descent iterate x- tv'f(x), with stepsize a= 1/L, minimizes this upper bound.

This condition can be used to provide a quadratic function/upper bound


to f in terms of the linear approximation of f based on the gradient at any
x E Rn, defined as

f(y;x) = f(x) + "vf(x)'(y- x), x,y E Rn. (6.7)

This is shown in the following proposition and is illustrated in Fig. 6.1.2.

Proposition 6.1.2: (Descent Lemma) Let f: Rn f-t R be a con-


tinuously differentiable function, with gradient satisfying the Lipschitz
condition (6.6), and let X be a closed convex set. Then for all x, y EX,
we have
L
f(y) :S f(y; x) + 2 11Y - xll 2 - (6.8)

Proof: Lett be a scalar parameter and let g(t) = f(x + t(y - x)). The
chain rule yields

(dg/dt)(t) = (y- x)'"vf(x + t(y - x)).

Thus, we have

f(y) - f(x) = g(l) - g(O)

= 1
1

o
d
_!!_ (t) dt
dt

= fo\y-x)'"vf(x+t(y-x))dt
306 Additional Algorithmic Topics Chap. 6

~ .£\y-x)'"v'f(x)dt+ Jfo\y-x)'("v'f(x+t(y-x))-"v'f(x))dt l
~ la (y-x)'"v'f(x)dt+ Jor1 \ly-x\\ · I\Vf(x+t(y-x))-"v'f(x )\\dt
I

~ (y- x)'"v'f(x) + \ly- xii fo 1 Lt\ly- xii dt


L
= (y - x)'"v' f(x) + 2 11Y - xll 2 ,

where for the second inequality we use the Schwarz inequality, and for the
third inequality we use the Lipschitz condition (6.6). Q.E.D.

We can use the preceding descent lemma to assert t hat a constant


stepsize in the range (0, 2/ L) leads to cost function descent, and guarantees
convergence. This is shown in the following classical convergence result,
which does not require convexity off.

P roposit ion 6 .1.3: Let f: iJ?n HR be a continuously differentiable


function, and let X be a closed convex set. Assume that "v f satis-
fies the Lipschitz condition (6.6), and consider the gradient projection
iteration
Xk+l = Px(xk - oSJJ(xk)),
with a constant stepsize a in the range (0, 2/L). Then every limit point
x of the generated sequence {Xk} satisfies the optimality condition
'1 f(x)'(x - x) 2'.'. o, \:/XE X.

Proof: From Eq. (6.8) with y = Xk+1, we have


L
f( xk+1 ) ~ f (xk+1; xk ) + 2 llxk+I - Xkll 2
L
= f( xk) + "v'f(xk)'(xk+l - Xk) + 2 11xk+l - xkll 2 ,
while from P rop. 6.1.1 [cf. Eq. (6.3)], we have
1 2
"v'f(xk)'(xk+1 - xk ) ~ --
a
llxk+I - xk ll .
By combining the preceding two relations,

(6.9)
Sec. 6.1 Gradient Projection Methods 307

so since a E (0,2/L), the gradient projection method (6.1) reduces the


cost function value at each iteration. It follows that if x is the limit of a
subsequence {xk}K, we have f(xk) _j,. f(x), and by Eq. (6.9),

Hence
Px(x-a'vf(x))-x= lim (xk+1-xk)=O,
k-+oo, kE/C

which, by Prop. 6.1.l(b), implies that x satisfies the optimality condition.


Q.E.D.

Connection with the Proximal Algorithm

We will now consider the convergence rate of the gradient projection me-
thod, and as a first step in this direction, we develop a connection to the
proximal algorithm of Section 5.1. The following lemma shows that the
gradient projection iteration (6.1) may also be viewed as an iteration of the
proximal algorithm, applied to the linear approximation function £(·; xk)
(plus the indicator function of X), with penalty parameter equal to the
stepsize ak (see Fig. 6.1.3).

Proposition 6.1.4: Let f: Wn i---+ W be a continuously differentiable


function, and let X be a closed convex set. Then for all x E X and
a> 0,
Px(x - a'vf(x))
is the unique vector that attains the minimum in

min
yEX
{c(y; x) + ~
2a
IIY - xll2}.

Proof: Using the definition of C [cf. Eq. (6.7)], we have for all x, y E Wn
and a> 0,

1 1
C(y;x) + -IIY-
2a
xll 2 = f(x) + 'vf(x)'(y-x) + -lly-xll
2a
2

= f(x) + ~IIY-
2a
(x -a'vf(x))ll 2 - ~ll'vf(x)ll 2 -
2
The gradient projection iterate Px (x- av' f(x)) minimizes the right-hand
side over y EX, so it minimizes the left hand side. Q.E.D.
308 Additional Algorithmic Topics Chap. 6

f(xk)
"tk

1k - 2~k llx - Xkll 2

Figure 6.1.3. Illustration of the relation of the gradient projection method and
the proximal algorithm, as per Prop. 6.1.4. The gradient projection iterate Xk+I
is the same as the proximal iterate with f replaced by l:(,; xk).

A useful consequence of the connection with the proximal algorithm


is the following adaptation of the three-term inequality of Prop. 5.1.2.

Proposition 6.1.5: Let f: Rn>---+ R be a continuously differentiable


function, and let X be a closed convex set. For a given iterate xk of
the gradient projection method (6.1), consider the arc of points

a >0.

Then for all y E Rn and a > 0, we have

llxk(a)-yll 2
::; llxk-Yll 2 -2a(fi(xk(a); Xk)-fi(y; xk))-llxk-xk(a)i!2-
(6.10)

Proof: Based on the proximal interpretation of Prop. 6.1.4, we can apply


Prop. 5.1.2, with f replaced by fi(·; xk) plus the indicator function of X,
and Ck replaced by a. The three-term inequality of this proposition yields
Eq. (6.10). Q.E.D.

Eventually Constant Stepsize Rule

While the constant stepsize rule is simple, it requires the knowledge of the
Lipschitz constant L. There is an alternative stepsize rule that aims to deal
Sec. 6.1 Gradient Projection Methods 309

with the practically common situation where L, and hence also the range
(0, 2/ L) of stepsizes that guarantee cost reduction, are unknown. Here we
start with a stepsize o: > 0 that is a guess for the midpoint 1/ L of the
range (0, 2/ L). Then we keep using o:, and generate iterates according to

as long as the condition

(6.11)

is satisfied. As soon as this condition is violated at some iteration, we


reduce o: by a certain factor, and repeat the iteration as many times as is
necessary for Eq. (6.11) to hold.
From the descent lemma (Prop. 6.1.2), the condition (6.11) will be
satisfied if o: ~ 1/ L. Thus, after a finite number of reductions, the test
(6.11) will be passed at every subsequent iteration and o: will stay constant
at some value a > 0. We refer to this as the eventually constant stepsize
rule. The next proposition shows that cost function descent is guaranteed
with this rule.

Proposition 6.1.6: Let f : ~n f-t ~ be a continuously differentiable


function, and let X be a closed convex set. Assume that v' f satisfies
the Lipschitz condition (6.6). For a given iterate xk of the gradient
projection method ,(6.1), consider the arc of points

0: > 0.

Then the inequality

(6.12)

implies the cost reduction property

(6.13)

Moreover the inequality (6.12) is satisfied for all a E (0, 1/ L] .

Proof: By setting y = Xk in Eq. (6.10), using the fact f(xk; Xk) = f(xk),
and rearranging, we have
310 Additional Algorithmic Topics Chap. 6

If the inequality (6.12) holds, it can be added to the preceding relation to


yield Eq. (6.13). Also, the descent lemma (Prop. 6.1.2) implies that the
inequality (6.12) is satisfied for all o: E (0, 1/ L]. Q.E.D.

Convergence and Convergence Rate for a Convex Cost Function

We will now show convergence and derive the convergence rate of the gra-
dient projection method for a convex cost function, under the gradient
Lipschitz condition (6 .6). It turns out that an additional benefit of convex-
ity is that we can show stronger convergence results than for the nonconvex
case (cf. Prop. 6.1.3). In particular, we will show convergence to an optimal
solution of the entire sequence of iterates {xk} as long as the optimal solu-
tion set is nonempty. t By contrast, in the absence of convexity of f, Prop.
6. 1.3 asserts that all limit points of { Xk} satisfy the optimality condition,
but there is no assertion of uniqueness of limit point.
We will assume that the stepsize satisfies some conditions, which hold
in particular if it is either a constant in the range (0, 1/ L] or it is chosen
according to the eventually constant stepsize rule described earlier.

Proposition 6.1.7: Let f: 3?n t-+ 3? be a convex differentiable func-


tion, and let X be a closed convex set. Assume that '\7 f satisfies the
Lipschitz condition (6.6), and that the set of minima X* of f over
X is nonempty. Let {xk} be a sequence generated by the gradient
projection method (6.1) using the eventually constant stepsize rule, or
more generally, any stepsize rule such that o:k .J,. a for some a > 0, and
for all k, we have

(6.14)

Then { xk} converges to some point in X* and

f( x ) _ f* < minx*EX* llxo - x*ll 2 k = 0,1, .... (6.15)


k+i - 2(k + l)a '

Proof: By applying Prop. 6.1.5, with o: = O:k and y = x*, where x* EX•,

tWe have seen a manifestation of this result in the analysis of Section 3.2,
where we showed convergence of the subgradient method with diminishing step-
size to an optimal solution. In that section we used ideas of supermartingale and
Fejer convergence. Similar ideas apply for gradient projection methods as well.
Sec. 6.1 Gradient Projection Methods 311

we have
1
l'(xk+1; Xk) + -llxk+l
2ak
- Xkll 2
1 1
:::; l'(x*;xk) + -llx*
2ak
- xkll 2 - -llx* - Xk+ill 2-
2ak
By adding Eq. (6.14), we obtain for all x* EX*,
1 1
f(xk+1):::; l'(x*; Xk) + -llx*
2ak
- Xkll 2 - -llx* - Xk+l 11 2
2ak
(6.16)
1 1
:::; f(x*) + -llx* - xkll 2 - -llx* - Xk+1ll 2,
2ak 2ak
where for the last inequality we use the convexity of f and the gradient
inequality:

f(x*) - l'(x*; xk) = J(x*) - f(xk) - v' f(xk)'(x* - xk) ~ 0.


Thus, denoting ek = f(xk) - f*, we have from Eq. (6.16) for all k
and x* EX*,
(6.17)
Using repeatedly this relation with k replaced by k - l, ... , 0, and adding,
we obtain for all k and x* E X*,

so since ao ~ 0:1 > · · · ~ ak ~ a and e1 ~ e2 ~ · · · ~ ek+l (cf. Prop.


6.1.6), we have
2a(k + l)ek+1 :::; llx* - xoll 2 .
Taking the minimum over x* E X* in the preceding relation, we obtain
Eq. (6.15). Finally, by Eq. (6.17), {xk} is bounded, and each of its limit
points must belong to X* since ek ~ 0. Also by Eq. (6.17), the distance of
Xk to each x* E X* is monotonically nonincreasing, so {xk} cannot have
multiple limit points, and it must converge to some point in X*. Q.E.D.

The monotone decrease property of {f (xk)}, shown in Prop. 6.1.6


[cf. Eq. (6.13)], has an interesting consequence. It can be used to show
the convergence result of Prop. 6.1.7 assuming that the Lipschitz condition
(6.6) holds just within the level set

Xo = {x EX I f(x):::; f(xo)}
rather than within X. The reason is that under the assumptions of the
proposition the iterates Xk are guaranteed to stay within the initial level
set Xo, and the preceding analysis still goes through. This allows the
application of the proposition to cost functions such as lxl 3 , for which the
Lipschitz condition (6.6) does not hold when X is unbounded.
312 Additional Algorithmic Topics Chap. 6

Convergence Rate Under Strong Convexity

Proposition 6.1.7 shows that the cost function error of the gradient projec-
tion method converges to Oas 0(1/k). However, this is true without assum-
ing any condition other than Lipschitz continuity of v' f. When f satisfies
a growth condition in the neighborhood of the optimal set, a faster conver-
gence rate can be proved as noted in Section 2.1.1. In the case where the
growth of f is at least quadratic, a linear convergence rate can be shown,
and in fact it turns out that if f is strongly convex, the gradient projection
mapping
Ga(x) = Px(x - av' J(x)) (6.18)
is a contraction when O < a < 2/ L. Let us recall here that the differentiable
convex function f is strongly convex over Rn with a coefficient a > 0 if

(v'f(x) -v'J(y))'(x -y) 2:: ailx -vll 2 , V x,y E Rn; (6.19)

cf. Section 1.1 of Appendix B. Note that by using the Schwarz inequality
to bound the inner product on the left above, this condition implies that

llv'J(x) - v'f(y)II 2:: allx -yll, Vx,yERn, (6.20)

so that if in addition v' f satisfies a Lipschitz condition with Lipschitz


constant L, we must have L 2:: a.t
The following proposition shows the contraction property of the gra-
dient projection mapping.

Proposition 6.1.8: (Contraction Property under Strong Con-


vexity) Let f : Rn t--+ R be a convex differentiable function, and
let X be a closed convex set. Assume that v' f satisfies the Lipschitz
condition

llv' f(x) - v' f(y)II :::; L llx -yll, (6.21)

for some L > 0, and that it is strongly convex over Rn in the sense that
for some a E (O,L] it satisfies Eq. (6.19). Then the gradient projection
mapping G 0 of Eq. (6.18) satisfies

jjGa(x) - Ga(Y)II ::S max {11- aLI, 11-aal}llx-yll, Vx,yERn,

and is a contraction for all a E (0, 2/ L).

t
A related but different property is that strong convexity of f is equivalent
to Lipschitz continuity of the gradient of the conjugate f* when f and f* are
real-valued (see [RoW98], Prop. 12.60, for a more general result).
Sec. 6.1 Gradient Projection Methods 313

We first show the following preliminary result, which provides several


useful properties of differentiable convex functions relating to Lipschitz
continuity of the gradient and strong convexity.

Proposition 6.1.9: Let f : Wn i--+ W be a convex differentiable func-


tion, and assurne that '\lf satisfies the Lipschitz condition (6.21).
(a) We have for all x, y E Wn
(i) J(x) + '7f(x)'(y- x) + A IIVJ(x) - '7f(y)ll 2 :::: f(y).
(ii) ('1f(x) - '1f(y)) 1 (x - y) 2 tllVJ(x)- '7f(y)ll 2-
(iii) ('\lf(x) - '1f(y)) 1 (x -y):::: L\lx -yl\ 2.

(b) If f is strongly convex in the sense that for some a E (0, L] it


satisfies Eq. (6.19), we have for all x, y E Wn · ·

('1f(x)-'7f(y)) , (x-y) ~ --Lllx-y\1


aL l
2+--L II '1f(x)-'7f(y) ll2 ·
a+ a+

Proof: (a) To show (i), fix x E Wn and let¢ be the function

¢(y) = f(y) - '7f(x)'y, (6.22)

We have
'1¢(y) = '1f(y) - '1f(x), (6.23)
so¢ is minimized over y at y = x, and we have

(6.24)

Moreover from Eq. (6.23), '7¢ is Lipschitz continuous with constant L, so by


applying the descent lemma (Prop. 6.1.2) with¢ in place off, y- '7¢(y) t
in place of y, and yin place of x, we have

¢ (y - ½v¢(y)) :::: ¢(y) + '7¢(y)' (-½'7¢(y)) + ½ll½v¢(y)r


= ¢(y) - 2~IIV¢(y)ii2-
By combining this inequality with Eq. (6.24), we obtain

¢(x) + 2~IIV¢(y)ll2:::: ¢(y),


314 Additional Algorithmic Topics Chap. 6

and by using the expressions (6.22) and (6.23), we obtain the desired result.
To show (ii), we use (i) twice, with the roles of x and y interchanged,
and add to obtain the desired relation. We similarly use the descent lemma
(Prop. 6.1.2) twice, with the roles of x and y interchanged, and add to
obtain (iii).
(b) If a= L, the result follows by combining (ii) of part (a) and Eq. (6.20),
which is a consequence of the strong convexity assumption. For a < L
consider the function
a
¢(x) = f(x) - 2 11xll 2 -
We will show that '\/ ¢ is Lipschitz continuous with constant L - a. Indeed,
we have
'\1¢(x) = '\lf(x) - ax, (6.25)

and it follows that

ll 9 ¢(x) - '\/¢(y)ll 2 = ll'\/f(x) - '\lf(y) - a(x -y)ll 2


= II'\/ f(x) - '\l f(y)ll 2 - 2a('\/ f(x) - '\l f(y))' (x - y) + a2 llx - Yll 2
~ (1 - 2;) II'\/ f(x) - '\l f(y)ll 2 + a2 llx - Yll 2

~ (1- 2;) £211x -yll 2 + a2 llx - Yll 2


= (L - a)211x - Yll 2 ,

where for the first inequality we use (ii) of part (a) and for the second
inequality we use the Lipschitz condition for '\/ f.
We now apply (ii) of part (a) to the function¢ and obtain

('\1¢(x) - '\1¢(y))' (x -y) ~ L ~ a ll'\/¢(x) - '\/¢(y)I(

Using the expression (6.25) for'\/¢ in this relation,

('\lf(x)- '\lf(y)-a(x-y))'(x-y) ~ L ~ )l'\lf(x)-'\lf(y)-a(x-y)ll 2 ,

which after expanding the quadratic and collecting terms, can be verified
to be equivalent to the desired relation. Q.E.D.

We note that a stronger version of Prop. 6.1.9(a) holds, namely that


each of the properties (i)-(iii) is equivalent to '\/ f satisfying the Lipschitz
condition (6.21); see Exercise 6.1. We now complete the proof of the con-
traction property of the gradient projection mapping.
Sec. 6.1 Gradient Projection Methods 315

Proof of Prop. 6.1.8: For all x, y E ~n, we have by using the nonexpan-
sive property of the projection (cf. Prop. 3.2.1)

IIGa(x) - Ga(Y)ll 2 = IIPx (x - av7 f(x)) - Px (y- av7 f(y)) 11 2


:::; ll(x-av7f(x))-(y-av7f(y))i!2-
Expanding the quadratic on the right-hand side, and using Prop. 6.1.9(b),
the Lipschitz condition (6.6), and the strong convexity condition (6.20), we
obtain

IIGa(x) - Ga(Y)ll 2 :::; llx - Yll 2 - 2a(v7 f(x) - v7 f(y))' (x - y)


+ a 2 IIVJ(x) - v7f(y)ll 2
2aaL 2a 2
:::; llx-yll 2 - a+Lllx-yll 2 - a+Lllv7f(x)-v7f(y)II

+ a 2 IIVJ(x) - v7f(y)ll 2

= (1 - !a;~) llx -yll 2

+a(a- a:L) llv7f(x)-v7f(y)11 2

:::; ( 1 - ! :~) II X - Y II 2

+amax{L2(a- a:L),a 2 (a- a:L)}llx-yll 2


= max { (1- aL)2, (1 - aa)2} llx - Yll 2 ,
(6.26)
from which the desired inequality follows. Q.E.D.

Note from the last equality of Eq. (6.26) that the smallest modulus
of contraction is obtained when
2
a---
- a+L·
When this optimal value of stepsize a is used, it can be seen by substitution
in Eq. (6.26) that for all x, y E ~n,

IIGa(x) - Ga(Y)II :::;


4aL
1 - (a+ L) 2 llx - YII =
(£f -1)
+ l llx - YII-

We can observe the similarity of this convergence rate estimate with the
one of Section 2.1.1 and Exercise 2.1 for quadratic functions: the ratio
L / a plays the role of the condition number of the problem. Indeed for a
positive definite quadratic function f we can use as L and a the maximum
and minimum eigenvalues of the Hessian of f, respectively.
316 Additional Algorithmic Topics Chap. 6

Alternative Stepsize Rules

In addition to the constant and eventually constant stepsize rules, there


are several other stepsize rules for gradient projection that are often used
in practice, and do not require that v' f satisfies the Lipschitz condition
(6.6). We will describe some of them and summarize their properties.
One possibility is a diminishing stepsize ak, satisfying the conditions
00 00

lim ak = 0, Lak = oo, La~< oo.


k--+oo
k=O k=O

With this rule, the convergence behavior of the method is very similar to
the one of the corresponding subgradient method. In particular, by Prop.
3.2.6 and the discussion following that proposition (cf. Exercise 3.6), if
there is a scalar c such that

c2 (1 + min llxk - x*l1 2 ) 2': sup{llv'f(xk)ll 2 I k


x*EX*
= 0, 1, .. . },

the gradient projection method converges to some x* EX* (assuming X*


is nonempty). The gradient Lipschitz condition (6.6) is not required for
this property; as an example, the method converges to x* = 0 for the
scalar function f(x) = lxl 3 / 2 with a diminishing stepsize, but not with a
constant stepsize. Note that if f is not convex, the standard result for
convergence with a diminishing stepsize is that every limit point x of { Xk}
satisfies v' J(x) = 0, and there is no assertion of existence or uniqueness of
the limit point. A drawback of the diminishing stepsize rule is that it leads
to a sublinear convergence rate even under the most favorable conditions,
e.g., when f is a positive definite quadratic function, in which case the
constant stepsize rule can be shown to attain a linear convergence rate (see
Exercise 2.1 in Chapter 2).
Another possibility is stepsize reduction and line search, based on
cost function descent, i.e.,

Indeed, if xk(a) -/= Xk for some a> 0, it can be shown (see [Ber99], Section
2.3.2) that there exists ak > 0 such that

(6.27)

Thus there is an interval of stepsizes a E (0, ak] that lead to reduction of the
cost function value. Stepsize reduction and line search rules are motivated
by some of the drawbacks of the constant and eventually constant stepsize
rules: along some directions the growth rate of v' f may be fast requiring a
small stepsize for guaranteed cost function descent, while in other directions
Sec. 6.1 Gradient Projection Methods 317

the growth rate of v' f may be slow, requiring a large stepsize for substantial
progress. A form of line search may deal adequately with this difficulty.
There are many variants of line search rules. Some rules use an exact
line search, aiming to find the stepsize O:k that yields the maximum possible
cost improvement; these rules are practical mostly for the unconstrained
case, where they can be implemented via one of the several possible inter-
polation and other one-dimensional algorithms (see nonlinear programming
texts such as [Ber99], [Lue84], [NoW06]). For constrained problems, step-
size reduction rules are primarily used: an initial stepsize is chosen through
some heuristic procedure (possibly a fixed constant, obtained through some
experimentation, or a crude line search based on some polynomial inter-
polation scheme). This stepsize is then successively reduced by a certain
factor until a cost reduction test is passed.
One of the most popular stepsize reduction rules searches for a step-
size along the set of points

a> 0,

cf. Eq. (6.2). This is the Armijo rule along the projection arc, proposed
in [Ber76a], which is a generalization of the Armijo rule for unconstrained
problems, given in Section 2.1. It has the form

where mk is the first integer m such that

(6.28)

with j3 E (0, 1) and a E (0, 1) being some constants, and sk > 0 being an
initial stepsize. Thus, the stepsize O:k is obtained by reducing sk as many
times as necessary for the inequality (6.28) to be satisfied; see Fig. 6.1.4.
This stepsize rule has strong convergence properties. In particular, it can
be shown that for a convex f with nonempty set X* of minima over X, and
with initial stepsize Sk that is bounded away from 0, it leads to convergence
to some x* E X*, without requiring the gradient Lipschitz condition (6.6).
The proof of this is nontrivial, and was given in [GaB84]; see [Ber99],
Section 2.3.2 for a textbook account [the original paper [Ber76a] gave an
easier convergence proof for the special case where X is the nonnegative
orthant, and also for the case where the Lipschitz condition (6.6) holds and
X is any closed convex set]. Related asymptotic convergence rate results
that involve the rate of growth of f, suitably modified for the presence of
the constraint set, are given in [Dun81], [Dun87].
The preceding Armijo rule requires that with each reduction of the
trial stepsize, a projection operation on X is performed. While this may not
involve much overhead in cases where X is simple, such as for example a box
318 Additional Algorithmic Topics Chap. 6

Set of acceptable stepsizes

/
---------~ ,, ~
0

Figure 6.1.4. Illustration of the successive points tested by the Armijo rule along
the projection arc. In this figure, ak is obtained as (32 sk after two unsuccessful
trials.

constraint consisting of lower and/or upper bounds on the variables, there


are other cases where X is more complicated. In such cases an alternative
Armijo-like rule may be used. Here we first use gradient project ion to
determine a feasible descent direction dk according to

wheres is a fixed positive scalar [cf. Prop. 6.1.l(a)], and then we set

where fJ E (0, 1) is a fixed scalar, and mk is the first nonnegative integer m


for which

In other words, we search along the line { Xk + ,dk I , > 0} by successively


trying the stepsizes 1 = 1, fl , fJ 2 , ... , until the above inequality is satisfied
for m = mk. While this rule may be simpler to implement, t the earlier rule
of Eq. (6.28), which operates on the projection arc, has the advantage that
it tends to keep the iterates on the boundary of the constraint set, and thus
tends to identify constraints that are active earlier. When X = ~n the two
Armijo rules described above coincide, and are identical to t he Armijo rule
given for unconstrained minimization in Section 2.1.1.

t An example where search along a line is considerably simpler than search


a long the projection arc is when the cost function is of the form f(x) = h(Ax),
where A is a matrix such that the calculation of the vector y = Ax for a given x
is far more expensive than the calculation of h(y).
Sec. 6.1 Gradient Projection Methods 319

For a convex cost function, both Armijo rules can be shown to guaran-
tee convergence to a unique limit point/optimal solution, without requiring
the gradient Lipschitz condition (6.6); see [Ius03] and compare also with
the comments in Exercise 2.5. When f is not convex but differentiable, the
standard convergence results with these rules state that every limit point
x of { xk} satisfies the optimality condition
v7 f (x)'(x - x) ?'.: o, \:/xE X,

but there is no assertion of existence or uniqueness of the limit point; cf.


Prop. 6.1.3. We refer to the nonlinear programming literature for further
discussion.

Complexity Issues and Gradient Projection

Let us now consider in some generality computational complexity issues


relating to optimization problems of the form

minimize f (x)
subject to x E X,

where f : Rn ,-+ R is convex and X is a closed convex set. We denote by f *


the optimal value, and we assume throughout that there exists an optimal
solution. We will aim to delineate algorithms that have good performance
guarantees, in the sense that they require a relatively low number of itera-
tions (in the worst case) to achieve a given optimal solution tolerance.
Given some E > 0, suppose we want to estimate the number of it-
erations required by a particular algorithm to obtain a solution with cost
that is within E of the optimal. If we can show that any sequence {xk}
generated by a method has the property that for any E > 0, we have

min J(xk) $ f* + E,
k$c/eP

where c and p are positive constants, we say that the method has iteration
complexity O C1) (the constant c may depend on the problem data and
the starting point xo). Alternatively, if we can show that

minf(xe) $ f*
£$k
+ kcq ,
where c and q are positive constants, we say that the method involves cost
function error of order O ( klq).
It is generally thought that if the constant c does not depend on the
dimension n of the problem, then the algorithm holds some advantage for
problems where n is large. This view favors simple gradient/subgradient-
like methods over sophisticated conjugate direction or Newton-like meth-
ods, whose overhead per iteration increases at an order up to O(n 2 ) or
320 Additional Algorithmic Topics Chap. 6

O(n 3 ). In this chapter, we will focus on algorithms with iteration complex-


ity that is independent of n, and all our subsequent references to complexity
estimates implicitly assume this. t
As an example, we mention the subgradient method for which an
0(1/E 2 ) complexity, or equivalently error O ( 1/vi), can be shown (cf.,
the discussion following Prop. 3.2.3). On the other hand, Prop. 6.1. 7 shows
that in order for the algorithm to attain a vector Xk with

it requires a number of iterations k 2'. 0(1/E), for an error O (1/k). Thus


the gradient projection method when applied to convex cost functions with
Lipschitz continuous gradient, has iteration complexity 0(1/E).t

t Some caution is necessary when considering the relative advantages of the


various gradient and subgradient methods of this and the next section, and com-
paring them with other methods, such as conjugate direction, Newton-like or
incremental, based on the complexity and error estimates that we provide. One
reason is that our complexity estimates involve unknown constants, whose size
may affect the theoretical comparisons between various methods. Moreover, ex-
perience with linear programming methods, such as simplex and ellipsoid, has
shown that good (or bad) theoretical complexity may not translate into good (or
bad, respectively) practical performance, at least in the context of continuous
optimization using worst case measures of complexity (rather than some form of
average complexity).
A further weakness of our analysis is that it does not take into account the
special structure that is typically present in large-scale problems. An example
of such special structure is cost functions of the form f(x) = h(Ax) where A
is a matrix such that the calculation of the vector y = Ax for a given x is far
more expensive than the calculation of h(y). For such problems, calculation of a
stepsize by line minimization, as in conjugate direction methods, as well as solving
the low-dimensional problems arising in simplicial decomposition methods, is
relatively inexpensive.
Another interesting example of special structure, which is not taken into
account by the rate of convergence estimates of this section, is an additive cost
function with a large number of components. We have argued in Section 2.1.5
that incremental methods are well suited for such cost functions, while the non-
incremental methods of this and the next two sections may not be. Indeed, for a
very large number of components in the cost function, it is not uncommon for an
incremental method to reach practical convergence after one or very few passes
through the components. The notion of iteration complexity loses its meaning
when so few iterations are involved. For a discussion of the subtleties underlying
the complexity analysis of incremental and nonincremental methods, see [AgB14].
+ As we have noted in Sections 2.1.1 and 2.1.2, the asymptotic convergence
rate of steepest descent and gradient projection also depends on the rate of growth
of the cost function in the neighborhood of a minimum. The 0(1/c) iteration
Sec. 6.1 Gradient Projection Methods 321

-f
0 X

Figure 6.1.5. The differentiable scalar cost function f of Example 6.1.1. It is


quadratic for \xi :<::: £ and linear for \xi > €. The gradient Lipschitz constant is
L = 1.

Example 6.1.1

Consider the unconstrained minimization of the scalar function f given by

!. ix/ 2 if !xi :S: E,


f(x) = { EX
2
1I- <2
2
I I > E,
I"f X

with E > 0 (cf. Fig. 6.1.5). Here the constant in the Lipschitz condition (6.6)
is L = 1, and for any Xk > E, we have v' f(xk) = E. Thus the gradient iteration
with stepsize a= 1/ L = 1 takes the form

It follows that the number of iterations to get within an E-neighborhood of


x* = 0 is /xo//E. The number of iterations to get to within E of the optimal
cost f* = 0, is also proportional to 1/E.

In the next section, we will discuss a variant of the gradient projec-


tion method that employs an intricate extrapolation device, and has the

complexity estimate assumes the worst case where there is no positive order of
growth . For the case where there is a unique minimum x*, this means that there
are no scalars fJ > 0, 8 > 0, and , > 1 such that

fJl!x - x*ll'Y :S: f(x) - f(x*), \/ x with !Ix - x*II :S: 8.


This is not likely to be true in the context of a practical problem.
322 Additional Algorithmic Topics Chap. 6

improved iteration complexity of O ( 1/ JE). It can be shown that O ( 1/ JE)


is a sharp estimate, i.e., it is the best that we can expect across the class
of problems with convex cost functions with Lipschitz continuous gradient
(see [Nes04], Ch. 2).

6.2 GRADIENT PROJECTION WITH EXTRAPOLATION

In this section we discuss a method for improving the iteration complexity


of the gradient projection method. A closer examination of Example 6.1.1
suggests that while a stepsize less than 2 is necessary within the region
where !xi :::; E to ensure that the method converges, a larger stepsize outside
this region would accelerate convergence. In Section 2.1.1 we discussed
briefly an acceleration scheme, the gradient method with momentum or
heavy-ball method, which has the form
Xk+I = Xk - av' f(xk) + f3(xk - Xk-1),
and adds the extrapolation term {3(xk - Xk-1) to the gradient increment,
where X-1 = xo and f3 is a scalar with O < f3 < 1.
A variant of this scheme with similar properties separates the extrap-
olation and the gradient steps as follows:
Yk = Xk + f3(xk - Xk-1), (extrapolation step),
(6.29)
Xk+I = Yk - av' f(Yk), (gradient step).
When applied to the function of the preceding example, the method con-
verges to the optimum, and reaches a neighborhood of the optimum more
quickly: it can be verified that for a starting point xo > > 1 and Xk > E, it
has the form Xk+l = Xk - Ek, with E:::; Ek < E/(1 - {3). In fact it is gener-
ally true that with extrapolation, the practical performance of the gradient
projection method is typically improved. However, for this example the
method still has an 0(1/E) iteration complexity, since for xo >> 1, the
number of iterations needed to obtain Xk <Eis 0((1-{3)/E). This can be
seen by verifying that for Xk > > 1 we have
Xk+l - Xk = {3(Xk - Xk-i) - E,

so approximately lxk+l - xkl ~ E/(1 - {3).


It turns out that for convex cost functions that have a Lipschitz con-
tinuous gradient a better iteration complexity is possible with more vig-
orous extrapolation. We will show next that what is needed is to replace
the constant extrapolation factor f3 with a variable factor f3k that con-
verges to 1 at a properly selected rate. Unfortunately, it is very difficult
to obtain strong intuition about the mechanism by which this remarkable
phenomenon occurs, at least based on the analytical framework of this sec-
tion (see [Nes04] for a line of development that provides intuition from a
different point of view).
Sec. 6.2 Gradient Projection with Extrapolation 323

6.2.1 An Algorithm with Optimal Iteration Complexity

We will consider a constrained version of the gradient/extrapolation method


(6.29) with a variable value of /3 for the problem

minimize f (x)
(6.30)
subject to x E X,

where f : ~n i-+ ~ is convex and differentiable, and X is a closed convex


set. We will assume that f has Lipschitz continuous gradient [cf. Eq. (6.6)],
and that X*, the set of minima off over X, is nonempty.
The method has the form

= Xk + f3k(Xk - Xk-1),
Yk (extrapolation step),
(6.31)
Xk+l = Px(Yk - a'vf(Yk)), (gradient projection step),

where Px(·) denotes projection on X, X-1 = xo, and f3k E (0, 1); see Fig.
6.2.1. The method has a similar flavor with the heavy ball and PARTAN
methods discussed in Section 2.1.1, but with some important differences:
it applies to constrained problems, and it also reverses the order of extrap-
olation and gradient projection within an iteration.
The following proposition shows that with proper choice of f3k, the
method has iteration complexity 0(1/JE) or equivalently error 0(1/k 2 ).
In particular, we use

k = 0, 1, ... , (6.32)

where the sequence {Bk} satisfies Bo= 01 E (0, 1], and

1 - 0k+l 1
---<- k = 0, 1, .... (6.33)
0i+1 - 0i'

As an example, it can be verified that one possible choice is

if k = 0, ifk=-1,
if k = 1, 2, ... , ifk=O,l, ....

We will assume a stepsize er = 1/ L, but the result can be extended


to stepsize reduction rules along the lines of Prop. 6.1.7. One possibility is
the eventually constant stepsize rule of the preceding section, whereby we
start with some stepsize er > 0 and we keep using er, and generate iterates
according to
324 Additional Algorithmic Topics Chap. 6

Extrapolation Step
Yk = Xk + f3k(Xk - Xk-1)

Gradient Projection Step


Xk+l = Px(Yk - oSlf(Yk))

Figure 6.2.1. Illustration of the two-step method (6.31) with extrapolation.

as long as the condition

(6.34)

is satisfied. As soon as this condition is violated at some iteration, we


reduce a by a certain factor , and repeat the iteration as many times as is
necessary for Eq. (6.34) to hold. Similar to the gradient projection case,
the condition (6.34) will be satisfied if a :S 1/ L, and after a finite number
of reductions, the test (6.34) will be passed at every subsequent iteration
and a will stay constant at some value a> 0. With this stepsize rule we
can handle the case where the constant L is not known, and the following
proof can then be modified to show that the variant has error 0(1/k 2 ) .

Proposition 6.2.1: Let f: ~HR be a convex differentiable func-


tion, and let X be a closed convex set. Assume that 'v f satisfies
the Lipschitz condition (6.6), and that the set of minima X* of f
over X is nonempty. Let { xk} be a sequence generated by the algo-
rithm (6.31), where a= 1/Land f3k satisfies Eqs. (6.32)-(6.33). Then
limk-..oo d(xk) = 0, and

2L 2
f(xk) - f* :S (k + l) 2 (d(xo)) , k = 1,2, . .. ,

where we denote

d(x) min llx - x*II,


= x*EX* XE Rn.
Sec. 6.2 Gradient Projection with Extrapolation 325

Proof: We introduce the sequence

= Xk-1 + 0k~l (xk - Xk-1),


Zk k = 0, 1, ... , (6.35)
where X-1 = xo, so that zo = xo. We note that by using Eqs. (6.31), (6.32),
Zk can also be rewritten as
Zk = Xk + 0-,; 1(Yk - Xk), k = 1,2, .... (6.36)
Fix k ~ 0 and x* EX*, and let
y* = (1 - 0k)Xk + 0kX*.
Using Eq. (6.8), we have
L
f (xk+i) ::::: £(xk+l; Yk) + 2 llxk+l - Yk 1 2 , (6.37)

where we use the notation


£(u; w) = f(w) + '\l f(w)'(u - w), \:/ u,w E ~n.

Since Xk+I is the projection of Yk - (1/ L )'\/ f (Yk) on X, it minimizes


L
£(y; Yk) + 2IIY - Ykll 2
over y EX, so using Prop. 6.1.5, we have
L L L
£(xk+1;yk) + 2 11xk+l -ykll 2 ::::: £(y*;yk) + 2 11Y* -Ykll 2 - 2 11Y* -Xk+1ll 2 .
Combining this relation with Eq. (6.37), we obtain
L L
j(Xk+1) ::::: £(y*; Yk) + 2 IIY* - Yk 1 2 - 2 IIY* - Xk+l 1 2
L
= £( (1 - 0k)Xk + 0kx*; Yk) + 211 (1 - 0k)Xk + 0kx* - Yk 11 2
L
- 2 11 (1 - 0k)Xk + 0kX* - Xk+l 11 2
0~L
= £ ( (1 - 0k)Xk + 0kx*; Yk ) + - 2-llx* + 0-,; 1 (xk -yk) - Xkll 2
0~L
- - 2-llx* + 0k 1 (xk - Xk+I) - Xkll 2
02 L
= £((1 - 0k)Xk + 0kx*; Yk) + +11x* - Zkll 2
02 L
- _k_ llx* - Zk+l 11 2
2
::::: (1 - 0k)£(xk; Yk) + 0k£(x*; Yk) + T
0 L
2
llx* - Zk 1 2

02 L
- +11x* -Zk+1ll 2 ,
326 Additional Algorithmic Topics Chap. 6

where the last equality follows from Eqs. (6.35) and (6.36), and the last
inequality follows from the convexity of£(·; Yk)- Using the inequality

we have

Finally, by rearranging terms, we obtain

By adding this inequality for k = 0, 1, ... , while using the inequality


1 - 0k+l 1
---<-
0~+1 - 0~'
we obtain

Using the facts xo = zo, f* - £(x*; Yi) ~ 0, and 0k ::::; 2/(k + 2), and taking
the minimum over all x* E X*, we obtain
2L 2
f(xk+I) - f* ::::; (k + 2 ) 2 (d(xo)) ,

from which the desired result follows. Q.E.D.

6.2.2 Nondifferentiable Cost - Smoothing

The preceding analysis applies to differentiable cost functions. However, it


can be extended to cases where f is real-valued and convex but nondiffer-
entiable by using a smoothing technique to convert the nondifferentiable
problem to a differentiable one. t In this way an iteration complexity of

t As noted in Section 2.2.5, smoothing is a general and often very effective


technique to deal with nondifferentiabilities. It can be based on differentiable
penalty methods and augmented Lagrangian methods (see the papers [Ber75b],
[Ber77], and the textbook account of [Ber82a], Chapters 3 and 5). In this section,
however, smoothing is used (in combination with the gradient projection method
with extrapolation of Section 6.2.1) as an aid for the complexity analysis, and it
is not necessarily recommended as an effective practical algorithm.
Sec. 6.2 Gradient Projection with Extrapolation 327

0(1/t) can be attained, which is much faster than the 0(1/t 2 ) complexity
of the subgradient method. The idea is to replace a nondifferentiable con-
vex cost function by a smooth t-approximation whose gradient is Lipschitz
continuous with constant L = 0(1/t). By applying the optimal method
given earlier, we obtain an t-optimal solution with iteration complexity
0(1/t) or equivalently error 0(1/k).
We will consider the smoothing technique for the special class of con-
vex functions Jo : ~n 1---t ~ of the form

Jo(x) = max{u'Ax - ¢(u)}, (6.38)


uEU

where U is a convex and compact subset of ~m, ¢ : U 1---t ~ is convex and


continuous over U, and A is an m x n matrix. Note that Jo is just the
composition of the matrix A and the conjugate function of

if u EU,
if u tf. U,

so the class of convex functions Jo of the form (6.38) is quite broad. We


introduce a function p : ~m 1---t ~ that is strictly convex and differentiable.
Let uo be the unique minimum of p over U, i.e.,

uo E argminp(u).
uEU

We assume that p(uo) = 0 and that p is strongly convex over U with


modulus of strong convexity a, i.e., that
(J
p(u) ~ 2 11u - uoll 2 -

An example is the quadratic function p(u) = %11u - uoll 2 , but there are
also other functions of interest (see the paper by [Nes05] for some other
examples, which also allow p to be nondifferentiable and to be defined only
on U).
For a parameter t > 0, consider the function

J,(x) = max { u' Ax - ¢(u) - tp(u) }, XE ~n, (6.39)


uEU

and note that J, is a uniform approximation of Jo in the sense that

J,(x)::; Jo(x) ::; J,(x) + p*t, '<:/XE ~n, (6.40)

where
p* = maxp(u).
uEU
328 Additional Algorithmic Topics Chap. 6

The following proposition shows that le is also smooth and its gradient is
Lipschitz continuous with Lipschitz constant that is proportional to 1/f..

Proposition 6.2.2: For all f. > 0, the function le of Eq. (6.39) is


convex and differentiable over ~n, with gradient given by

where ue(x) is the unique vector attaining the maximum in Eq. (6.39).
Furthermore, we have

Proof: We first note that the maximum in Eq. (6.39) is uniquely attained
in view of the strong convexity of p (which implies that p is strictly con-
vex) . Furthermore, I, is equal to l*(A'x), where I * is t he conjugate of t he
function
¢(u) + Ep(u) + 8u(u),
with 8u being the indicator function of U. It follows that le is convex, and
it is also differentiable with gradient

Vf, (x ) = A'u, (x )

by the Conjugate Subgradient Theorem (Prop. 5.4.3 in Appendix B).


Consider any vectors x, y E ~n, and let 9x and gy be subgradients
of ¢ at u,(x) and u,(y), respectively. From the subgradient inequality, we
have
¢ (ue(Y)) - ¢(u,(x)) ~ g{i,(u,(y) - Ue(x)),

¢ (u, (x)) - ¢ (u, (y)) ~ g~(u,(x) - ue(Y)),


so by adding these two inequalities, we obtain

(gx - gy)'(ue(x) - u, (y)) ~ 0. (6.41 )

By using the optimality condition for the maximization (6.39), we have

( Ax - 9x - Ev'p(u,(x) ) )' (u,(y) - u,(x)) ~ 0,

( Ay - gy - EVp(ue(Y)) )' (u, (x ) - Ue(Y)) ~ 0.


Sec. 6.2 Gradient Projection with Extrapolation 329

Adding the two inequalities, and using the convexity of φ and the strong
convexity of p, we obtain

(x − y)'A'(u_ε(x) − u_ε(y)) ≥ (g_x − g_y + ε(∇p(u_ε(x)) − ∇p(u_ε(y))))'(u_ε(x) − u_ε(y))
  ≥ ε(∇p(u_ε(x)) − ∇p(u_ε(y)))'(u_ε(x) − u_ε(y))
  ≥ εσ ‖u_ε(x) − u_ε(y)‖²,

where for the second inequality we used Eq. (6.41), and for the third in-
equality we used a standard property of strongly convex functions. Thus,

‖∇f_ε(x) − ∇f_ε(y)‖² = ‖A'(u_ε(x) − u_ε(y))‖²
  ≤ ‖A‖² ‖u_ε(x) − u_ε(y)‖²
  ≤ (‖A‖²/(εσ)) (x − y)'A'(u_ε(x) − u_ε(y))
  ≤ (‖A‖²/(εσ)) ‖x − y‖ ‖A'(u_ε(x) − u_ε(y))‖
  = (‖A‖²/(εσ)) ‖x − y‖ ‖∇f_ε(x) − ∇f_ε(y)‖,

from which the result follows. Q.E.D.

We now consider the minimization over a closed convex set X of the
function

f(x) = F(x) + f₀(x),

where f₀ is given by Eq. (6.38), and F : ℜⁿ ↦ ℜ is convex and differen-
tiable, with gradient satisfying the Lipschitz condition

‖∇F(x) − ∇F(y)‖ ≤ L ‖x − y‖,   ∀ x, y ∈ X.    (6.42)

We replace f with the smooth approximation

f̃(x) = F(x) + f_ε(x),

and note that f̃ uniformly differs from f by at most εp* [cf. Eq. (6.40)], and
has Lipschitz continuous gradient with Lipschitz constant L + L_ε = O(1/ε),
where L_ε = ‖A‖²/(εσ) is the constant of Prop. 6.2.2. Thus, by applying
the algorithm (6.31) and by using Prop. 6.2.1, we see that we can obtain
a solution x̄ ∈ X such that f(x̄) ≤ f* + εp* with

O(√((L + ‖A‖²/(εσ))/ε)) = O(1/ε)

iterations.
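As an illustration of the smoothing technique, consider f₀(x) = ‖Ax‖₁, which has the form (6.38) with U = {u : ‖u‖∞ ≤ 1} and φ ≡ 0. With the quadratic choice p(u) = ½‖u‖² (so σ = 1, u₀ = 0, and p* = m/2), the maximization (6.39) separates across coordinates and f_ε becomes a Huber-like function. The following Python sketch (an illustration written for these assumed choices, not code from the text) evaluates f_ε and its gradient ∇f_ε(x) = A'u_ε(x):

```python
import numpy as np

def smoothed_l1(A, x, eps):
    """Nesterov smoothing of f0(x) = ||Ax||_1 = max_{||u||_inf <= 1} u'Ax,
    using the prox-function p(u) = 0.5*||u||^2 (sigma = 1, u0 = 0)."""
    t = A @ x
    # u_eps(x): componentwise maximizer of u_i*t_i - (eps/2)*u_i^2 over [-1, 1]
    u = np.clip(t / eps, -1.0, 1.0)
    # Huber-like value: f_eps(x) <= f0(x) <= f_eps(x) + eps*m/2
    f_eps = np.sum(np.where(np.abs(t) <= eps, t**2 / (2 * eps), np.abs(t) - eps / 2))
    grad = A.T @ u          # gradient formula of Prop. 6.2.2
    return f_eps, grad
```

Here the gradient of f_ε is Lipschitz continuous with constant ‖A‖²/ε, in agreement with Prop. 6.2.2.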

6.3 PROXIMAL GRADIENT METHODS

In this section we consider the problem

minimize f(x) + h(x)
subject to x ∈ ℜⁿ,    (6.43)

where f : ℜⁿ ↦ ℜ is a differentiable convex function, and h : ℜⁿ ↦
(−∞, ∞] is a closed proper convex function. We will for the most part
assume that f has Lipschitz continuous gradient, i.e., for some L > 0,

‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖,   ∀ x, y ∈ ℜⁿ.    (6.44)

We will discuss an algorithm called proximal gradient, which combines
ideas from the gradient projection method and the proximal algorithm. It
replaces f with its linear approximation in the proximal minimization, i.e.,

x_{k+1} ∈ arg min_{x∈ℜⁿ} { ℓ(x; x_k) + h(x) + (1/(2α_k)) ‖x − x_k‖² },    (6.45)

where α_k > 0 is a scalar parameter, and as in Sections 6.1 and 6.2, we
denote by

ℓ(y; x) = f(x) + ∇f(x)'(y − x),   x, y ∈ ℜⁿ,

the linear approximation of f at x. Thus if f is a linear function, we obtain


the proximal algorithm for minimizing f + h. If h is the indicator function
of a closed convex set, then by Prop. 6.1.4 we obtain the gradient projec-
tion method. Because of the quadratic term in the proximal minimization
(6.45), the minimum is attained uniquely and the algorithm is well defined.
The method exploits the structure of problems where f + h is not
suitable for proximal minimization, while h alone (plus a linear function)
is. The resulting benefit is that we can treat as much as possible of the cost
function with proximal minimization, which is more general (admits non-
differentiable and/or extended real-valued cost) as well as more "stable"
than the gradient method (it involves essentially no restriction on the step-
size α). A typical example is the case where h is a regularization function
such as the ℓ₁ norm, for which the proximal iteration can be done in essen-
tially closed form by means of the shrinkage operation (cf. the discussion
in Section 5.4.1).
One way to view the iteration (6.45) is to write it as a two-step
process:

z_k = x_k − α_k ∇f(x_k),
x_{k+1} ∈ arg min_{x∈ℜⁿ} { h(x) + (1/(2α_k)) ‖x − z_k‖² };    (6.46)


Figure 6.3.1. Illustration of the implementation of the proximal gradient method
with the two-step process (6.46). Starting with x₀, we compute

x₀ − α∇f(x₀)

with the gradient step as shown. Starting from that vector, we compute x₁ with a
proximal step as shown (cf. the geometric interpretation of the proximal iteration
of Fig. 5.1.8 in Section 5.1.4). Assuming that ∇f is Lipschitz continuous (so ∇f
has bounded slope along directions) and α is sufficiently small, the method makes
progress towards the optimal solution x*. When f(x) ≡ 0 we obtain the proximal
algorithm, and when h is the indicator function of a closed convex set, we obtain
the gradient projection method.

this can be verified by expanding the quadratic term (1/(2α_k))‖x − z_k‖² with
z_k = x_k − α_k∇f(x_k), and discarding the terms that do not depend on x.
Thus the method alternates gradient steps on f with proximal steps on h.
Figure 6.3.1 illustrates the two-step process.†

† Iterative methods based on algorithmic mappings that are formed by composition of two or more simpler mappings, each possibly involving only part of the
problem data, are often called splitting algorithms. They find wide application in
optimization and the solution of linear and nonlinear equations, as they are capa-
ble of exploiting the special structures of many types of practical problems. Thus
the proximal gradient algorithm, as well as other methods that we have discussed
such as the ADMM, may be viewed as examples of splitting algorithms.

The preceding observation suggests that the convergence and con-
vergence rate properties of the method are closely linked to those of the
gradient projection and proximal methods. In particular, let us consider a
constant stepsize, α_k ≡ α, in which case the iteration becomes

x_{k+1} = P_{α,h}(G_α(x_k)),    (6.47)

where G_α is the mapping of the gradient method,

G_α(x) = x − α∇f(x),

and P_{α,h} is the proximal mapping corresponding to α and h,

P_{α,h}(z) ∈ arg min_{x∈ℜⁿ} { h(x) + (1/(2α)) ‖x − z‖² },    (6.48)

[cf. Eq. (6.46)]. It follows that the iteration aims to converge to a fixed
point of the composite mapping P_{α,h} · G_α.
Let us now note an important fact: the fixed points of P_{α,h} · G_α
coincide with the minima of f + h. This is guaranteed by the fact that
the same parameter α is used in the gradient and the proximal mappings
in the composite iteration (6.47). To see this, note that x* is a fixed point
of P_{α,h} · G_α if and only if

x* ∈ arg min_{x∈ℜⁿ} { h(x) + (1/(2α)) ‖x − G_α(x*)‖² }
   = arg min_{x∈ℜⁿ} { h(x) + (1/(2α)) ‖x − (x* − α∇f(x*))‖² },

which is true if and only if the subdifferential at x* of the function mini-
mized above contains 0:

0 ∈ ∂h(x*) + (1/(2α)) ∇(‖x − (x* − α∇f(x*))‖²)|_{x=x*} = ∂h(x*) + ∇f(x*).

This is the necessary and sufficient condition for x* to minimize f + h.†


Turning to the convergence properties of the algorithm, we know from
Prop. 5.1.8 that P_{α,h} is a nonexpansive mapping, so if the mapping G_α
corresponds to a convergent algorithm, it is plausible that the proximal

† Among others, this argument shows that it is not correct to use different
parameters α in the gradient and the proximal portions of the proximal gradient
method. This also highlights the restrictions that must be observed when replac-
ing G_α and P_{α,h} with other mappings (involving for example diagonal or Newton
scaling, or extrapolation), with the aim to accelerate convergence.

gradient algorithm is convergent, as it involves the composition of a non-
expansive mapping with a convergent algorithm. This is true in particular
if G_α is a Euclidean contraction (as it is under a strong convexity assump-
tion on f, cf. Prop. 6.1.8), in which case it follows that P_{α,h} · G_α is also a
Euclidean contraction. Then the algorithm (6.47) converges to the unique
minimum of f + h at a linear rate (determined by the product of the moduli
of contraction of G_α and P_{α,h}). Still, however, just like the unscaled gra-
dient and gradient projection methods, the proximal gradient method can
be very slow for some problems, even in the presence of strong convexity.
Indeed this can be easily shown with simple examples.
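As a concrete illustration of the two-step process (6.46), the following Python sketch applies the proximal gradient method to ℓ₁-regularized least squares, with f(x) = ½‖Cx − b‖² and h(x) = γ‖x‖₁, a case where the proximal step is the shrinkage (soft-thresholding) operation of Section 5.4.1. The example problem and the constant stepsize choice α = 1/L are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal mapping of tau*||.||_1 (shrinkage operation)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient_l1(C, b, gamma, num_iters=500):
    """Proximal gradient iteration (6.45)-(6.46) for
       minimize 0.5*||Cx - b||^2 + gamma*||x||_1 over R^n."""
    n = C.shape[1]
    x = np.zeros(n)
    L = np.linalg.norm(C, 2) ** 2            # Lipschitz constant of grad f
    alpha = 1.0 / L                          # constant stepsize in (0, 1/L]
    for _ in range(num_iters):
        z = x - alpha * (C.T @ (C @ x - b))      # gradient step on f
        x = soft_threshold(z, alpha * gamma)     # proximal step on h
    return x
```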

Convergence Analysis

The analysis of the proximal gradient method combines elements of the


analyses of the proximal and the gradient projection algorithms. In partic-
ular, using Prop. 5.1.2, we have the following three-term inequality.

Proposition 6.3.1: Let f : ℜⁿ ↦ ℜ be a differentiable convex func-
tion, and let h : ℜⁿ ↦ (−∞, ∞] be a closed proper convex function.
For a given iterate x_k of the proximal gradient method (6.45), consider
the arc of points

x_k(α) ∈ arg min_{x∈ℜⁿ} { ℓ(x; x_k) + h(x) + (1/(2α)) ‖x − x_k‖² },   α > 0.

Then for all y ∈ ℜⁿ and α > 0, we have

‖x_k(α) − y‖² ≤ ‖x_k − y‖²
  − 2α(ℓ(x_k(α); x_k) + h(x_k(α)) − ℓ(y; x_k) − h(y))
  − ‖x_k − x_k(α)‖².    (6.49)

The next proposition shows that cost function descent can be guar-
anteed by a certain inequality test, which is automatically satisfied for all
stepsizes in the range (0, 1/ L].

Proposition 6.3.2: Let f : ℜⁿ ↦ ℜ be a differentiable convex func-
tion, let h : ℜⁿ ↦ (−∞, ∞] be a closed proper convex function, and
assume that ∇f satisfies the Lipschitz condition (6.44). For a given
iterate x_k of the proximal gradient method (6.45), consider the arc of
points

x_k(α) ∈ arg min_{x∈ℜⁿ} { ℓ(x; x_k) + h(x) + (1/(2α)) ‖x − x_k‖² },   α > 0.

Then for all α > 0, the inequality

f(x_k(α)) ≤ ℓ(x_k(α); x_k) + (1/(2α)) ‖x_k(α) − x_k‖²    (6.50)

implies the cost reduction property

f(x_k(α)) + h(x_k(α)) ≤ f(x_k) + h(x_k) − (1/(2α)) ‖x_k(α) − x_k‖².    (6.51)

Moreover the inequality (6.50) is satisfied for all α ∈ (0, 1/L].

Proof: By setting y = x_k in Eq. (6.49), using the fact ℓ(x_k; x_k) = f(x_k),
and rearranging, we have

ℓ(x_k(α); x_k) + h(x_k(α)) ≤ f(x_k) + h(x_k) − (1/α) ‖x_k − x_k(α)‖².

If the inequality (6.50) holds, it can be added to the preceding relation to
yield Eq. (6.51). Also, the descent lemma (Prop. 6.1.2) implies that the
inequality (6.50) is satisfied for all α ∈ (0, 1/L]. Q.E.D.

The proofs of the preceding two propositions go through even if f is


nonconvex (but still continuously differentiable). This suggests that the
proximal gradient method has good convergence properties even in the
absence of convexity of f. We will now show a convergence and rate of
convergence result that parallels Prop. 6.1.7 for the gradient projection
method, and admits a similar proof (some other convergence rate results are
given in the exercises). The following result applies to a constant stepsize
rule with α_k ≡ α ≤ 1/L and an eventually constant stepsize rule, similar
to the one of Section 6.1. With this rule we start with a stepsize α₀ > 0
that is a guess for 1/L, and keep this stepsize unchanged while generating
iterates according to the proximal gradient iteration

x_{k+1} ∈ arg min_{x∈ℜⁿ} { ℓ(x; x_k) + h(x) + (1/(2α_k)) ‖x − x_k‖² },

as long as the condition

f(x_{k+1}) ≤ ℓ(x_{k+1}; x_k) + (1/(2α_k)) ‖x_{k+1} − x_k‖²    (6.52)

is satisfied. As soon as this condition is violated at some iteration k, we
reduce α_k by a certain factor, and repeat the iteration as many times

as is necessary for Eq. (6.52) to hold. According to Prop. 6.3.2, this rule
guarantees cost function descent, and guarantees that ak will stay constant
after finitely many iterations.
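As an illustration, the following Python sketch implements one proximal gradient step with the eventually constant stepsize rule for the ℓ₁-regularized least squares example used earlier; the test below is the condition (6.52), and the reduction factor 1/2 is an illustrative assumption.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_grad_step_adaptive(x, C, b, gamma, alpha):
    """One step of (6.45) with the eventually constant stepsize rule:
    keep alpha while the test (6.52) holds, otherwise halve it and retry."""
    fx = 0.5 * np.linalg.norm(C @ x - b) ** 2
    grad = C.T @ (C @ x - b)
    while True:
        x_new = soft_threshold(x - alpha * grad, alpha * gamma)
        lin = fx + grad @ (x_new - x)                          # l(x_new; x)
        quad = np.linalg.norm(x_new - x) ** 2 / (2 * alpha)
        f_new = 0.5 * np.linalg.norm(C @ x_new - b) ** 2
        if f_new <= lin + quad:        # condition (6.52): keep the stepsize
            return x_new, alpha
        alpha *= 0.5                   # condition violated: reduce and repeat
```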

Proposition 6.3.3: Let f : ℜⁿ ↦ ℜ be a differentiable convex func-
tion, and let h : ℜⁿ ↦ (−∞, ∞] be a closed proper convex function.
Assume that ∇f satisfies the Lipschitz condition (6.44), and that the
set of minima X* of f + h is nonempty. Let {x_k} be a sequence gen-
erated by the proximal gradient method (6.45) using the eventually
constant stepsize rule or any stepsize rule such that α_k ↓ ᾱ, for some
ᾱ > 0, and for all k,

f(x_{k+1}) ≤ ℓ(x_{k+1}; x_k) + (1/(2α_k)) ‖x_{k+1} − x_k‖².    (6.53)

Then {x_k} converges to some point of X*, and for all k = 0, 1, ..., we
have

f(x_{k+1}) + h(x_{k+1}) − min_{x∈ℜⁿ} { f(x) + h(x) } ≤ min_{x*∈X*} ‖x₀ − x*‖² / (2(k + 1)ᾱ).    (6.54)

Proof: By using Prop. 6.3.1, we have for all x ∈ ℜⁿ,

‖x_{k+1} − x‖² ≤ ‖x_k − x‖² − 2α_k(ℓ(x_{k+1}; x_k) + h(x_{k+1}) − ℓ(x; x_k) − h(x)) − ‖x_k − x_{k+1}‖².

By setting x = x*, where x* ∈ X*, and adding Eq. (6.53), we obtain

f(x_{k+1}) + h(x_{k+1}) ≤ ℓ(x*; x_k) + h(x*)
  + (1/(2α_k)) ‖x* − x_k‖² − (1/(2α_k)) ‖x* − x_{k+1}‖²    (6.55)
  ≤ f(x*) + h(x*)
  + (1/(2α_k)) ‖x* − x_k‖² − (1/(2α_k)) ‖x* − x_{k+1}‖²,

where for the last inequality we use the fact ℓ(x*; x_k) ≤ f(x*), which holds
by the convexity of f. Thus, denoting

e_k = f(x_k) + h(x_k) − min_{x∈ℜⁿ} { f(x) + h(x) },

we have from Eq. (6.55), for all k and x* ∈ X*,

2α_k e_{k+1} ≤ ‖x* − x_k‖² − ‖x* − x_{k+1}‖².    (6.56)

Using repeatedly this relation with k replaced by k − 1, ..., 0, we obtain

2 ∑_{i=0}^{k} α_i e_{i+1} + ‖x* − x_{k+1}‖² ≤ ‖x* − x₀‖²,

so since α₀ ≥ α₁ ≥ ··· ≥ α_k ≥ ᾱ and e₁ ≥ e₂ ≥ ··· ≥ e_{k+1} (cf. Prop.
6.3.2), we have

2ᾱ(k + 1) e_{k+1} ≤ ‖x* − x₀‖².

Taking the minimum over x* ∈ X*, we obtain Eq. (6.54). Finally, by Eq.
(6.56), {x_k} is bounded, and each of its limit points must belong to X*
since e_k → 0. Moreover by Eq. (6.56), the distance of x_k to each x* ∈ X*
is monotonically nonincreasing, so {x_k} cannot have multiple limit points,
and it must converge to some point in X*. Q.E.D.

Dual Proximal Gradient Algorithm

The proximal gradient algorithm may also be applied to the Fenchel dual
problem

minimize f₁*(−A'λ) + f₂*(λ)
subject to λ ∈ ℜᵐ,    (6.57)

where f₁ and f₂ are closed proper convex functions, f₁* and f₂* are their
conjugates,

f₁*(−A'λ) = sup_{x∈ℜⁿ} { (−λ)'Ax − f₁(x) },    (6.58)

f₂*(λ) = sup_{z∈ℜᵐ} { λ'z − f₂(z) },    (6.59)

and A is an m × n matrix. Note that we have reversed the sign of λ relative
to the formulation of Sections 1.2 and 5.4 [the problem has not changed, it
is still dual to minimizing f₁(x) + f₂(Ax), but with this reversal of sign, we
will obtain more convenient formulas]. The proximal gradient method for
the dual problem consists of first applying a gradient step using the function
f₁*(−A'λ) and then applying a proximal step using the function f₂*(λ) [cf.
Eq. (6.46)]. We refer to this as the dual proximal gradient algorithm, and
we will show that it admits a primal implementation that resembles the
ADMM of Section 5.4.

To apply the algorithm it is of course necessary to assume that the
function f₁*(−A'λ) is differentiable. Using Prop. 5.4.4(a) of Appendix B, it
can be seen that this is equivalent to requiring that the supremum in Eq.
(6.58) is uniquely attained for all λ ∈ ℜᵐ. Moreover, using the chain rule,
the gradient of f₁*(−A'λ), evaluated at any λ ∈ ℜᵐ, is

−A (arg min_{x∈ℜⁿ} { f₁(x) + λ'Ax }).

Thus, the gradient step of the proximal gradient algorithm is given by

λ̄_k = λ_k + α_k A x_{k+1},    (6.60)

where α_k is the stepsize, and

x_{k+1} = arg min_{x∈ℜⁿ} { f₁(x) + λ_k'Ax }.    (6.61)

The proximal step of the algorithm has the form

λ_{k+1} ∈ arg min_{λ∈ℜᵐ} { f₂*(λ) + (1/(2α_k)) ‖λ − λ̄_k‖² }.

According to the theory of the dual proximal algorithm (cf. Section 5.2), the
proximal step can be dually implemented using an augmented Lagrangian-
type minimization: first find

z_{k+1} ∈ arg min_{z∈ℜᵐ} { f₂(z) − λ̄_k'z + (α_k/2) ‖z‖² },    (6.62)

and then obtain λ_{k+1} using the iteration

λ_{k+1} = λ̄_k − α_k z_{k+1}.    (6.63)

The dual proximal gradient algorithm (6.60)-(6.63) is a valid imple-
mentation of the proximal gradient algorithm, applied to the Fenchel dual
problem (6.57). Its convergence is guaranteed by Prop. 6.3.3, provided the
gradient of f₁*(−A'λ) is Lipschitz continuous, and α_k is a sufficiently small
constant or is chosen by the eventually constant stepsize rule.
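As a small illustration (with choices of f₁ and f₂ that are assumptions made here, not taken from the text), the following Python sketch runs the iteration (6.60)-(6.63) for f₁(x) = ½‖x − c‖² and f₂(z) = γ‖z‖₁, i.e., for minimizing ½‖x − c‖² + γ‖Ax‖₁; both subproblems are then available in closed form.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def dual_proximal_gradient(A, c, gamma, alpha, num_iters=500):
    """Dual proximal gradient iteration (6.60)-(6.63) for
       f1(x) = 0.5*||x - c||^2, f2(z) = gamma*||z||_1."""
    m, n = A.shape
    lam = np.zeros(m)
    for _ in range(num_iters):
        x = c - A.T @ lam                      # (6.61): argmin of f1(x) + lam'Ax
        lam_bar = lam + alpha * (A @ x)        # (6.60): gradient step
        # (6.62): argmin_z gamma*||z||_1 - lam_bar'z + (alpha/2)*||z||^2
        z = soft_threshold(lam_bar / alpha, gamma / alpha)
        lam = lam_bar - alpha * z              # (6.63): multiplier update
    return c - A.T @ lam, lam
```

For this choice of f₁, the gradient of f₁*(−A'λ) is Lipschitz continuous with constant ‖A‖², so a constant stepsize α ∈ (0, 1/‖A‖²] is a natural choice.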
It is interesting to note that this algorithm bears similarity to the
ADMM for minimizing f₁(x) + f₂(Ax) (which applies more generally, as
it does not require that f₁ is differentiable). Indeed we may rewrite the
algorithm (6.60)-(6.63) by combining Eqs. (6.60) and (6.63), so that

λ_{k+1} = λ_k + α_k(Ax_{k+1} − z_{k+1}),

where x_{k+1} minimizes the Lagrangian,

x_{k+1} = arg min_{x∈ℜⁿ} { f₁(x) + λ_k'(Ax − z_k) },    (6.64)

while by using Eqs. (6.60) and (6.62), we can verify that z_{k+1} minimizes
the augmented Lagrangian

z_{k+1} ∈ arg min_{z∈ℜᵐ} { f₂(z) + λ_k'(Ax_{k+1} − z) + (α_k/2) ‖Ax_{k+1} − z‖² }.

Other than minimizing with respect to x the Lagrangian in Eq. (6.64),
instead of the augmented Lagrangian, the only other difference of this dual
proximal gradient algorithm from the ADMM is that there is a restriction
on the magnitude of the stepsize [it is limited by the size of the Lipschitz
constant of the gradient of the function f₁*(−A'λ), as per Prop. 6.3.2].
Note that in the ADMM the penalty parameter can be chosen freely, but
(contrary to the augmented Lagrangian method) it may not be clear how to
choose it in order to accelerate convergence. Thus all three proximal-type
methods, proximal gradient, ADMM, and augmented Lagrangian, have
similarities, and relative strengths and weaknesses. The choice between
them hinges largely on the given problem's structure.

Proximal Newton Methods

The proximal gradient method admits a straightforward generalization to
incorporate Newton and quasi-Newton scaling, similar to the steepest de-
scent and gradient projection methods. The scaled version takes the form

x_{k+1} ∈ arg min_{x∈ℜⁿ} { ℓ(x; x_k) + h(x) + ½ (x − x_k)'H_k(x − x_k) },    (6.65)

where H_k is a positive definite symmetric matrix.

This iteration can be interpreted by a two-step process, as shown in
Fig. 6.3.2 for the special case where H_k = ∇²f(x_k) (the Hessian is assumed
to exist and be positive definite). In this case, we obtain a Newton-like
iteration, sometimes referred to as the proximal Newton method. In the
case where H_k is a multiple of the identity matrix, H_k = (1/α_k)I, we obtain
the earlier proximal gradient method (6.45).
Note that when H_k = ∇²f(x_k) and h is the indicator function of a
closed convex set X, the algorithm (6.65) reduces to a constrained version
of Newton's method, and when in addition X = ℜⁿ it coincides with the
classical form of Newton's method. It is also possible to use scaling that
is simpler (such as diagonal) or that does not require the computation of
second derivatives, such as limited memory quasi-Newton schemes. Gener-
ally, the machinery of unconstrained and constrained Newton-like schemes,
developed for the case where h(x) = 0 or h is the indicator function of a
given set can be fruitfully brought to bear.
A variety of stepsize rules may also be incorporated into the scaling
matrix H k, to ensure convergence based for example on cost function de-
scent, and under certain conditions, superlinear convergence. In particular,
when f is a positive definite quadratic and Hk is chosen to be the Hessian
off (with unit stepsize), the method finds the optimal solution in a single
iteration. We refer to the literature for analysis and further discussion.

Proximal Gradient Methods with Extrapolation

There is a proximal gradient method with extrapolation, along the lines
of the corresponding optimal complexity gradient projection method of
Section 6.2 [cf. Eq. (6.31)]. The method takes the form

y_k = x_k + β_k(x_k − x_{k−1})    (extrapolation step),

z_k = y_k − α_k ∇f(y_k)    (gradient step),

x_{k+1} ∈ arg min_{x∈ℜⁿ} { h(x) + (1/(2α_k)) ‖x − z_k‖² }    (proximal step),

where x_{−1} = x₀, and β_k ∈ (0, 1). The extrapolation parameter β_k is selected
as in Section 6.2.

Figure 6.3.2. Geometric interpretation of the proximal Newton iteration. Given
x_k, the next iterate x_{k+1} is obtained by minimizing f̃(x; x_k) + h(x), where

f̃(x; x_k) = f(x_k) + ∇f(x_k)'(x − x_k) + ½ (x − x_k)'∇²f(x_k)(x − x_k)

is the quadratic approximation of f at x_k. Thus x_{k+1} satisfies

0 ∈ ∇f(x_k) + ∇²f(x_k)(x_{k+1} − x_k) + ∂h(x_{k+1}).    (6.66)

It can be shown that x_{k+1} can be generated by a two-step process: first perform
a Newton step, obtaining x̄_k = x_k − (∇²f(x_k))⁻¹∇f(x_k), and then a proximal
step

x_{k+1} ∈ arg min_{x∈ℜⁿ} { h(x) + ½ (x − x̄_k)'∇²f(x_k)(x − x̄_k) }.

To see this, write the optimality condition for the above minimization and show
that it coincides with Eq. (6.66). Note that when h(x) ≡ 0 we obtain the pure
form of Newton's method. If ∇²f(x_k) is replaced by a positive definite symmetric
matrix H_k, we obtain a proximal quasi-Newton method.

The method can be viewed as a splitting algorithm, and can be jus-


tified with similar arguments and analysis as the method without extrapo-
lation. In particular, it can be seen that x* minimizes f + h if and only if
(x*, x*) is a fixed point of the preceding iteration [viewed as an operation
that maps (xk,Xk-i) to (xk+ 1,xk)]. Since the algorithm consists of the
composition of the nonexpansive proximal mapping with the convergent
gradient method with extrapolation, it makes sense that the algorithm is
convergent similar to the case of no extrapolation. Indeed, this can be

established with rigorous analysis for which we refer to the literature.

The complexity of the method, with a proper choice of constant step-
size α and the parameter β_k, is comparable to the optimal complexity of
the method of Section 6.2. Its analysis is based on a combination of the
arguments of the present section and the preceding ones (see [BeT09b],
[BeT10], [Tse08]), and a rate of convergence result similar to the one of
Prop. 6.2.1 can be shown. Thus the method achieves the best possible iter-
ation complexity O(1/√ε) for convex functions f with Lipschitz continuous
gradient.
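For concreteness, the following Python sketch applies the proximal gradient method with extrapolation to the ℓ₁-regularized least squares example used earlier; the particular choice β_k = (k − 1)/(k + 2) is one common selection consistent with Section 6.2, used here as an illustrative assumption.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_grad_extrapolation(C, b, gamma, num_iters=500):
    """Proximal gradient method with extrapolation for
       minimize 0.5*||Cx - b||^2 + gamma*||x||_1."""
    n = C.shape[1]
    x_prev = x = np.zeros(n)                     # x_{-1} = x_0
    alpha = 1.0 / np.linalg.norm(C, 2) ** 2
    for k in range(num_iters):
        beta = (k - 1.0) / (k + 2.0) if k >= 1 else 0.0
        y = x + beta * (x - x_prev)                        # extrapolation step
        z = y - alpha * (C.T @ (C @ y - b))                # gradient step at y
        x_prev, x = x, soft_threshold(z, alpha * gamma)    # proximal step
    return x
```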

6.4 INCREMENTAL SUBGRADIENT PROXIMAL METHODS

In this section we consider the minimization of a cost function that is the
sum of a large number of component functions,

minimize ∑_{i=1}^{m} f_i(x)
subject to x ∈ X,

where f_i : ℜⁿ ↦ ℜ, i = 1, ..., m, are convex real-valued functions, and
X is a closed convex set. We have considered incremental gradient and
subgradient methods in Sections 2.1.5 and 3.3.1 for this problem. We now
consider an extension to proximal algorithms. The simplest one has the
form

x_{k+1} ∈ arg min_{x∈X} { f_{i_k}(x) + (1/(2α_k)) ‖x − x_k‖² },    (6.67)

where i_k is an index from {1, ..., m}, selected in ways to be discussed
shortly, and {α_k} is a positive scalar sequence. This method relates to
the proximal algorithm in the same way that the incremental subgradient
method of Section 3.3.1 relates to its nonincremental version of Section 3.2.
The motivation here is that with a favorable structure of the com-
ponents, the proximal iteration (6.67) may be given in closed form or be
relatively simple, in which case it may be preferable to a gradient or sub-
gradient iteration, since it is generally more stable. For example in the
nonincremental case, the proximal iteration converges essentially for any
choice of ak, while this is not so for gradient-type methods.
Unfortunately, while some cost function components may be well
suited for a proximal iteration, others may not be because the minimiza-
tion (6.67) is inconvenient. This leads us to consider combinations of gra-
dient/subgradient and proximal iterations. In fact this was the motivation
for the proximal gradient and related splitting algorithms that we discussed
in the preceding section.
With similar motivation in mind, we adopt in this section a unified al-
gorithmic framework that includes incremental gradient, subgradient, and

proximal methods, and their combinations, and serves to highlight their


common structure and behavior. We focus on problems of the form
minimize F(x) ≜ ∑_{i=1}^{m} F_i(x)    (6.68)

subject to x ∈ X,

where for all i,

F_i(x) = f_i(x) + h_i(x),    (6.69)

f_i : ℜⁿ ↦ ℜ and h_i : ℜⁿ ↦ ℜ are real-valued convex functions, and X is a
nonempty closed convex set.

One of our algorithms has the form

z_k ∈ arg min_{x∈X} { f_{i_k}(x) + (1/(2α_k)) ‖x − x_k‖² },    (6.70)

x_{k+1} = P_X(z_k − α_k ∇̃h_{i_k}(z_k)),    (6.71)

where ∇̃h_{i_k}(z_k) is an arbitrary subgradient of h_{i_k} at z_k.†


Note that the iteration is well-defined because the minimum in Eq.
(6.70) is uniquely attained, while the subdifferential ∂h_{i_k}(z_k) is nonempty
since h_{i_k} is real-valued. Note also that by choosing all the f_i or all the
h_i to be identically zero, we obtain as special cases the subgradient and
proximal iterations, respectively.

The iterations (6.70) and (6.71) maintain both sequences {z_k} and
{x_k} within the constraint set X, but it may be convenient to relax this
constraint for either the proximal or the subgradient iteration, thereby
requiring a potentially simpler computation. This leads to the algorithm

z_k ∈ arg min_{x∈ℜⁿ} { f_{i_k}(x) + (1/(2α_k)) ‖x − x_k‖² },    (6.72)

x_{k+1} = P_X(z_k − α_k ∇̃h_{i_k}(z_k)),    (6.73)

where the restriction x ∈ X has been omitted from the proximal iteration,
and the algorithm

z_k = x_k − α_k ∇̃h_{i_k}(x_k),    (6.74)

x_{k+1} ∈ arg min_{x∈X} { f_{i_k}(x) + (1/(2α_k)) ‖x − z_k‖² },    (6.75)

† To facilitate notation when using both differentiable and nondifferentiable
functions, we will use ∇̃h(x) to denote a subgradient of a convex function h at
a point x. The method of choice of subgradient from within ∂h(x) will be clear
from the context.

where the projection onto X has been omitted from the subgradient itera-
tion. It is also possible to use different stepsize sequences in the proximal
and subgradient iterations, but for notational simplicity we will not discuss
this type of algorithm.
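To make the framework concrete, the following Python sketch carries out one cycle of the iteration (6.70)-(6.71) in a generic form, where the proximal step on f_i, the subgradient of h_i, and the projection onto X are supplied by the user; the function names and the cyclic index selection are illustrative assumptions.

```python
def incremental_prox_subgrad_cycle(x, prox_f, subgrad_h, project_X, alpha, m):
    """One cycle of iteration (6.70)-(6.71): a proximal step on f_i followed by
    a subgradient step on h_i, for i = 1, ..., m.

    prox_f(i, x, alpha) : argmin_{y in X} { f_i(y) + ||y - x||^2 / (2*alpha) }
    subgrad_h(i, z)     : some subgradient of h_i at z
    project_X(y)        : projection of y onto X
    """
    for i in range(m):
        z = prox_f(i, x, alpha)                          # proximal step (6.70)
        x = project_X(z - alpha * subgrad_h(i, z))       # subgradient step (6.71)
    return x
```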
Part (a) of the following proposition is a key fact about incremental
proximal iterations. It shows that they are closely related to incremental
subgradient iterations, with the only difference being that the subgradient
is evaluated at the end point of the iteration rather than at the start point.
Part (b) of the proposition is the three-term inequality, which was shown
in Section 5.1 (cf. Prop. 5.1.2). It will be useful in our convergence analysis
and is restated here for convenience.

Proposition 6.4.1: Let f : ℜⁿ ↦ (−∞, ∞] be a closed proper convex
function, and let X be a nonempty closed convex set such that ri(X) ∩
ri(dom(f)) ≠ Ø. For any x_k ∈ ℜⁿ and α_k > 0, consider the proximal
iteration

x_{k+1} ∈ arg min_{x∈X} { f(x) + (1/(2α_k)) ‖x − x_k‖² }.    (6.76)

(a) The iteration can be written as

x_{k+1} = P_X(x_k − α_k ∇̃f(x_{k+1})),    (6.77)

where ∇̃f(x_{k+1}) is some subgradient of f at x_{k+1}.

(b) For all y ∈ ℜⁿ, we have

‖x_{k+1} − y‖² ≤ ‖x_k − y‖² − 2α_k(f(x_{k+1}) − f(y)) − ‖x_k − x_{k+1}‖²
  ≤ ‖x_k − y‖² − 2α_k(f(x_{k+1}) − f(y)).    (6.78)

Proof: (a) We use the formula for the subdifferential of the sum of the
three functions f, (1/(2α_k))‖x − x_k‖², and the indicator function of X (cf.
Prop. 5.4.6 in Appendix B), together with the condition that 0 should
belong to this subdifferential at the optimum x_{k+1}. We obtain that Eq.
(6.76) holds if and only if

(1/α_k)(x_k − x_{k+1}) ∈ ∂f(x_{k+1}) + N_X(x_{k+1}),    (6.79)

where N_X(x_{k+1}) is the normal cone of X at x_{k+1} [the set of vectors y such
that y'(x − x_{k+1}) ≤ 0 for all x ∈ X, and also the subdifferential of the
indicator function of X at x_{k+1}; cf. Section 3.1]. This is true if and only if

x_k − x_{k+1} − α_k ∇̃f(x_{k+1}) ∈ N_X(x_{k+1}),

for some ∇̃f(x_{k+1}) ∈ ∂f(x_{k+1}), which in turn is true if and only if Eq.
(6.77) holds, by the Projection Theorem (Prop. 1.1.9 in Appendix B).

(b) See Prop. 5.1.2. Q.E.D.

Based on Prop. 6.4.1(a), we see that all the iterations (6.70)-(6.71),
(6.72)-(6.73), and (6.74)-(6.75) can be written in an incremental subgradi-
ent format:

(a) Iteration (6.70)-(6.71) can be written as

z_k = P_X(x_k − α_k ∇̃f_{i_k}(z_k)),   x_{k+1} = P_X(z_k − α_k ∇̃h_{i_k}(z_k)).    (6.80)

(b) Iteration (6.72)-(6.73) can be written as

z_k = x_k − α_k ∇̃f_{i_k}(z_k),   x_{k+1} = P_X(z_k − α_k ∇̃h_{i_k}(z_k)).    (6.81)

(c) Iteration (6.74)-(6.75) can be written as

z_k = x_k − α_k ∇̃h_{i_k}(x_k),   x_{k+1} = P_X(z_k − α_k ∇̃f_{i_k}(x_{k+1})).    (6.82)

Note that in all the preceding updates, the subgradient ∇̃h_{i_k} can be any
vector in the subdifferential of h_{i_k}, while the subgradient ∇̃f_{i_k} must be
a specific vector in the subdifferential of f_{i_k}, specified according to Prop.
6.4.1(a). Note also that iteration (6.81) can be written as

x_{k+1} = P_X(x_k − α_k ∇̃F_{i_k}(z_k)),

and resembles the incremental subgradient method for minimizing over X


the cost function

F(x) = ∑_{i=1}^{m} F_i(x)

[cf. Eq. (6.68)], the only difference being that the subgradient of F_{i_k} is
computed at z_k rather than x_k.
An important issue which affects the methods' effectiveness is the
order in which the components {f_i, h_i} are chosen for iteration. In this
section we consider and analyze the convergence for two possibilities:

(1) A cyclic order, whereby {f_i, h_i} are taken up in the fixed determin-
istic order 1, ..., m, so that i_k is equal to (k modulo m) plus 1. A
contiguous block of iterations involving

{f₁, h₁}, {f₂, h₂}, ..., {f_m, h_m}

in this order and exactly once is called a cycle. We assume that the
stepsize α_k is constant within a cycle (for all k with i_k = 1 we have
α_k = α_{k+1} = ··· = α_{k+m−1}).

(2) A randomized order based on uniform sampling, whereby at each iter-
ation a component pair {f_i, h_i} is chosen randomly by sampling over
all component pairs with a uniform distribution, independently of the
past history of the algorithm.
It is essential to include all components in a cycle in the cyclic case, and to
sample according to the uniform distribution in the randomized case. Oth-
erwise some components will be sampled more often than others, leading
to a bias in the convergence process, and convergence to an incorrect limit.
Another popular technique for incremental methods, which we dis-
cussed in Section 2.1.5, is to reshuffle randomly the order of the compo-
nent functions after each cycle. This alternative order selection scheme
leads to convergence, like the preceding two, and works well in practice.
However, its convergence rate seems harder to analyze. In this section we
will focus on the easier-to-analyze randomized uniform sampling order, and
demonstrate its superiority over the cyclic order.
There are also irregular and possibly distributed incremental schemes
where multiple subgradient iterations may be performed in between two
proximal iterations, and reversely. Moreover the order of component se-
lection may be made dependent on algorithmic progress, as long as all
components are sampled with the same long-term frequency.
For the remainder of this section, we denote by F* the optimal value
of problem (6.68):

F* = inf_{x∈X} F(x),

and by X* the set of optimal solutions (which could be empty):

X* = { x* | x* ∈ X, F(x*) = F* }.

Also, for a nonempty closed convex set X, we denote by dist(·; X) the
distance function given by

dist(x; X) = min_{z∈X} ‖x − z‖,   x ∈ ℜⁿ.

6.4.1 Convergence for Methods with Cyclic Order

We first discuss convergence under the cyclic order. We focus on the se-
quence {x_k} rather than {z_k}, which need not lie within X in the case of
iterations (6.81) and (6.82) when X ≠ ℜⁿ. In summary, the idea is to show
that the effect of taking subgradients of f_i or h_i at points near x_k (e.g., at
z_k rather than at x_k) is inconsequential, and diminishes as the stepsize α_k
becomes smaller, as long as some subgradients relevant to the algorithms
are uniformly bounded in norm by some constant. This is similar to the
convergence mechanism of incremental gradient methods described infor-
mally in Section 3.3.1. We use the following assumptions throughout the
present section.

Assumption 6.4.1: [For iterations (6.80) and (6.81)] There is
a constant c ∈ ℜ such that for all k

max{ ‖∇̃f_{i_k}(z_k)‖, ‖∇̃h_{i_k}(z_k)‖ } ≤ c.    (6.83)

Furthermore, for all k that mark the beginning of a cycle (i.e., all k ≥ 0
with i_k = 1), we have for all j = 1, ..., m,

max{ f_j(x_k) − f_j(z_{k+j−1}), h_j(x_k) − h_j(z_{k+j−1}) } ≤ c ‖x_k − z_{k+j−1}‖.    (6.84)

Assumption 6.4.2: [For iteration (6.82)] There is a constant c ∈
ℜ such that for all k

max{ ‖∇̃f_{i_k}(x_{k+1})‖, ‖∇̃h_{i_k}(x_k)‖ } ≤ c.    (6.85)

Furthermore, for all k that mark the beginning of a cycle (i.e., all k ≥ 0
with i_k = 1), we have for all j = 1, ..., m,

max{ f_j(x_k) − f_j(x_{k+j−1}), h_j(x_k) − h_j(x_{k+j−1}) } ≤ c ‖x_k − x_{k+j−1}‖,    (6.86)

f_j(x_{k+j−1}) − f_j(x_{k+j}) ≤ c ‖x_{k+j−1} − x_{k+j}‖.    (6.87)

Note that the condition (6.84) is satisfied if for each i and k, there is
a subgradient of f_i at x_k and a subgradient of h_i at x_k, whose norms are
bounded by c. Conditions that imply the preceding assumptions are:

(a) For algorithm (6.80): f_i and h_i are Lipschitz continuous over the set
X.
(b) For algorithms (6.81) and (6.82): f_i and h_i are Lipschitz continuous
over the entire space ℜⁿ.
(c) For all algorithms (6.80), (6.81), and (6.82): f_i and h_i are polyhedral
[this is a special case of (a) and (b)].
(d) The sequences {x_k} and {z_k} are bounded [since then f_i and h_i, being
real-valued and convex, are Lipschitz continuous over any bounded
set that contains {x_k} and {z_k}].

The following proposition provides a key estimate that reveals the
convergence mechanism of the methods.

Proposition 6.4.2: Let {x_k} be the sequence generated by any one
of the algorithms (6.80)-(6.82), with a cyclic order of component selec-
tion. Then for all y ∈ X and all k that mark the beginning of a cycle
(i.e., all k with i_k = 1), we have

‖x_{k+m} − y‖² ≤ ‖x_k − y‖² − 2α_k(F(x_k) − F(y)) + β α_k² m² c²,    (6.88)

where β = 1/m + 4.

Proof: We first prove the result for algorithms (6.80) and (6.81), and
then indicate the modifications necessary for algorithm (6.82). Using Prop.
6.4.1(b), we have for all y ∈ X and k,

‖z_k − y‖² ≤ ‖x_k − y‖² − 2α_k(f_{i_k}(z_k) − f_{i_k}(y)).    (6.89)

Also, using the nonexpansion property of the projection,

‖P_X(u) − P_X(v)‖ ≤ ‖u − v‖,   ∀ u, v ∈ ℜⁿ,

the definition of subgradient, and Eq. (6.83), we obtain for all y ∈ X and
k,

‖x_{k+1} − y‖² = ‖P_X(z_k − α_k ∇̃h_{i_k}(z_k)) − y‖²
  ≤ ‖z_k − α_k ∇̃h_{i_k}(z_k) − y‖²
  = ‖z_k − y‖² − 2α_k ∇̃h_{i_k}(z_k)'(z_k − y) + α_k² ‖∇̃h_{i_k}(z_k)‖²
  ≤ ‖z_k − y‖² − 2α_k(h_{i_k}(z_k) − h_{i_k}(y)) + α_k² c².    (6.90)

Combining Eqs. (6.89) and (6.90), and using the definition F_j = f_j + h_j,
we have

‖x_{k+1} − y‖² ≤ ‖x_k − y‖² − 2α_k(f_{i_k}(z_k) + h_{i_k}(z_k) − f_{i_k}(y) − h_{i_k}(y)) + α_k² c²
  = ‖x_k − y‖² − 2α_k(F_{i_k}(z_k) − F_{i_k}(y)) + α_k² c².    (6.91)
Let now k mark the beginning of a cycle (i.e., i_k = 1). Then at
iteration k + j − 1, j = 1, ..., m, the selected components are {f_j, h_j}, in
view of the assumed cyclic order. We may thus replicate the preceding
inequality with k replaced by k + 1, ..., k + m − 1, and add to obtain

‖x_{k+m} − y‖² ≤ ‖x_k − y‖² − 2α_k ∑_{j=1}^{m} (F_j(z_{k+j−1}) − F_j(y)) + m α_k² c²,

or equivalently, since F = ∑_{j=1}^{m} F_j,

‖x_{k+m} − y‖² ≤ ‖x_k − y‖² − 2α_k(F(x_k) − F(y)) + m α_k² c²
  + 2α_k ∑_{j=1}^{m} (F_j(x_k) − F_j(z_{k+j−1})).    (6.92)

The remainder of the proof deals with appropriately bounding the last term
above.
From Eq. (6.84), we have for j = 1, ..., m,

F_j(x_k) − F_j(z_{k+j−1}) ≤ 2c ‖x_k − z_{k+j−1}‖.    (6.93)

We also have

‖x_k − z_{k+j−1}‖ ≤ ‖x_k − x_{k+1}‖ + ··· + ‖x_{k+j−2} − x_{k+j−1}‖ + ‖x_{k+j−1} − z_{k+j−1}‖,    (6.94)

and by the definition of the algorithms (6.80) and (6.81), the nonexpansion
property of the projection, and Eq. (6.83), each of the terms in the right-
hand side above is bounded by 2α_k c, except for the last, which is bounded
by α_k c. Thus Eq. (6.94) yields ‖x_k − z_{k+j−1}‖ ≤ α_k(2j − 1)c, which together
with Eq. (6.93), shows that

F_j(x_k) − F_j(z_{k+j−1}) ≤ 2α_k c²(2j − 1).    (6.95)

Combining Eqs. (6.92) and (6.95), we have

‖x_{k+m} − y‖² ≤ ‖x_k − y‖² − 2α_k(F(x_k) − F(y)) + m α_k² c² + 4α_k² c² ∑_{j=1}^{m} (2j − 1),

and finally

‖x_{k+m} − y‖² ≤ ‖x_k − y‖² − 2α_k(F(x_k) − F(y)) + m α_k² c² + 4α_k² c² m²,

which is of the form (6.88) with β = 1/m + 4.


For the algorithm (6.82), a similar argument goes through using As-
sumption 6.4.2. In place of Eq. (6.89), using the nonexpansion property of
the projection, the definition of subgradient, and Eq. (6.85), we obtain for
all y E X and k ~ 0,

while in place of Eq. (6.90), using Prop. 6.4.l(b), we have


348 Additional Algorithmic Topics Chap. 6

Combining these equations, in analogy with Eq. (6.91), we obtain

llxk+1 - Yll 2 ::; llxk - Yll 2 - 2ak (fik (xk+d + hik (xk) - fik (y) - hik (y))
+ a%c2
= llxk -yll 2 - 2ak(Fik(xk) - Fik(y)) + a%c2
+ 2ak(fik(xk) - Jik(xk+1)).
(6.98)
As earlier, we let k mark the beginning of a cycle (i.e., i_k = 1). We
replicate the preceding inequality with k replaced by k + 1, ..., k + m − 1,
and add to obtain [in analogy with Eq. (6.92)]

‖x_{k+m} − y‖² ≤ ‖x_k − y‖² − 2α_k(F(x_k) − F(y)) + m α_k² c²
  + 2α_k ∑_{j=1}^{m} (F_j(x_k) − F_j(x_{k+j−1}))    (6.99)
  + 2α_k ∑_{j=1}^{m} (f_j(x_{k+j−1}) − f_j(x_{k+j})).

We now bound the two sums in Eq. (6.99), using Assumption 6.4.2.
From Eq. (6.86), we have

F_j(x_k) − F_j(x_{k+j−1}) ≤ 2c ‖x_k − x_{k+j−1}‖
  ≤ 2c(‖x_k − x_{k+1}‖ + ··· + ‖x_{k+j−2} − x_{k+j−1}‖),

and since by Eq. (6.85) and the definition of the algorithm, each of the
norm terms in the right-hand side above is bounded by 2α_k c,

F_j(x_k) − F_j(x_{k+j−1}) ≤ 4α_k c²(j − 1).

Also from Eqs. (6.85) and (6.87), and the nonexpansion property of the
projection, we have

f_j(x_{k+j−1}) − f_j(x_{k+j}) ≤ c ‖x_{k+j−1} − x_{k+j}‖ ≤ 2α_k c².

Combining the preceding relations and adding, we obtain

2α_k ∑_{j=1}^{m} (F_j(x_k) − F_j(x_{k+j−1})) + 2α_k ∑_{j=1}^{m} (f_j(x_{k+j−1}) − f_j(x_{k+j}))
  ≤ 8α_k² c² ∑_{j=1}^{m} (j − 1) + 4α_k² c² m
  = 4α_k² c² (m² − m) + 4α_k² c² m
  = 4α_k² c² m²,

which together with Eq. (6.99), yields Eq. (6.88). Q.E.D.

Among other things, Prop. 6.4.2 guarantees that with a cyclic order,
given the iterate x_k at the start of a cycle and any point y ∈ X having
lower cost than x_k (for example an optimal point), the algorithm yields a
point x_{k+m} at the end of the cycle that will be closer to y than x_k, provided
the stepsize α_k satisfies

α_k < 2(F(x_k) − F(y)) / (β m² c²).

In particular, for any ε > 0 and assuming that there exists an optimal
solution x*, either we are within α_k β m² c²/2 + ε of the optimal value,

F(x_k) ≤ F(x*) + α_k β m² c²/2 + ε,

or else the squared distance to x* will be strictly decreased by at least 2α_k ε,

‖x_{k+m} − x*‖² < ‖x_k − x*‖² − 2α_k ε.

Thus, using Prop. 6.4.2, we can provide various types of convergence re-
sults. As an example, for a constant stepsize (α_k ≡ α), convergence can
be established to a neighborhood of the optimum, which shrinks to 0 as
α → 0, as stated in the following proposition.

Proposition 6.4.3: Let {x_k} be the sequence generated by any one
of the algorithms (6.80)-(6.82), with a cyclic order of component se-
lection, and let the stepsize α_k be fixed at some positive constant α.

(a) If F* = −∞, then

liminf_{k→∞} F(x_k) = F*.

(b) If F* > −∞, then

liminf_{k→∞} F(x_k) ≤ F* + αβm²c²/2,

where c and β are the constants of Prop. 6.4.2.

Proof: We prove (a) and (b) simultaneously. If the result does not hold,
there must exist an ε > 0 such that

liminf_{k→∞} F(x_{km}) − αβm²c²/2 − 2ε > F*.

Let ŷ ∈ X be such that

liminf_{k→∞} F(x_{km}) − αβm²c²/2 − 2ε ≥ F(ŷ),

and let k₀ be large enough so that for all k ≥ k₀, we have

F(x_{km}) ≥ liminf_{k→∞} F(x_{km}) − ε.

By combining the preceding two relations, we obtain for all k ≥ k₀,

F(x_{km}) − F(ŷ) ≥ αβm²c²/2 + ε.

Using Prop. 6.4.2 for the case where y = ŷ together with the above relation,
we obtain for all k ≥ k₀,

‖x_{(k+1)m} − ŷ‖² ≤ ‖x_{km} − ŷ‖² − 2α(F(x_{km}) − F(ŷ)) + βα²m²c²
  ≤ ‖x_{km} − ŷ‖² − 2αε.

This relation implies that for all k ≥ k₀,

‖x_{(k+1)m} − ŷ‖² ≤ ‖x_{k₀m} − ŷ‖² − 2(k + 1 − k₀)αε,

which cannot hold for k sufficiently large, a contradiction. Q.E.D.

The next proposition gives an estimate of the number of iterations
needed to guarantee a given level of optimality up to the threshold tolerance
αβm²c²/2 of the preceding proposition.

Proposition 6.4.4: Assume that X* is nonempty. Let {x_k} be a
sequence generated as in Prop. 6.4.3. Then for any ε > 0, we have

min_{0≤k≤N} F(x_k) ≤ F* + (αβm²c² + ε)/2,    (6.100)

where N is given by

N = m ⌊ dist(x₀; X*)² / (αε) ⌋.    (6.101)

Proof: Assume, to arrive at a contradiction, that Eq. (6.100) does not
hold, so that for all k with 0 ≤ km ≤ N, we have

F(x_{km}) > F* + (αβm²c² + ε)/2.

By using this relation in Prop. 6.4.2 with α_k replaced by α and y equal to
the vector of X* that is at minimum distance from x_{km}, we obtain for all
k with 0 ≤ km ≤ N,

dist(x_{(k+1)m}; X*)² ≤ dist(x_{km}; X*)² − 2α(F(x_{km}) − F*) + α²βm²c²
  ≤ dist(x_{km}; X*)² − (α²βm²c² + αε) + α²βm²c²
  = dist(x_{km}; X*)² − αε.

Adding the above inequalities for k = 0, ..., N/m, we obtain

dist(x_{N+m}; X*)² ≤ dist(x₀; X*)² − (N/m + 1)αε,

so that

(N/m + 1)αε ≤ dist(x₀; X*)²,

which contradicts the definition of N. Q.E.D.

According to Prop. 6.4.4, to achieve a cost function value within O(ε)
of the optimal, the term αβm²c² must also be of order O(ε), so α must be
of order O(ε/(m²c²)), and from Eq. (6.101), the number of necessary itera-
tions N is O(m³c²/ε²), and the number of necessary cycles is O((mc)²/ε²).
This is the same type of estimate as for the nonincremental subgradient
method [i.e., O(1/ε²), counting a cycle as one iteration of the nonincre-
mental method, and viewing mc as a Lipschitz constant for the entire cost
function F], and does not reveal any advantage for the incremental meth-
ods given here. However, in the next section, we demonstrate a much more
favorable iteration complexity estimate for the incremental methods that
use a randomized order of component selection.

Exact Convergence for a Diminishing Stepsize

We can also obtain an exact convergence result for the case where the step-
size α_k diminishes to zero. The idea is that with a constant stepsize α we
can get to within an O(α)-neighborhood of the optimum, as shown above,
so with a diminishing stepsize α_k, we should be able to reach an arbitrar-
ily small neighborhood of the optimum. However, for this to happen, α_k
should not be reduced too fast, and should satisfy ∑_{k=0}^∞ α_k = ∞ (so that
the method can "travel" infinitely far if necessary).

Proposition 6.4.5: Let {x_k} be the sequence generated by any one
of the algorithms (6.80)-(6.82), with a cyclic order of component se-
lection, and let the stepsize α_k satisfy

lim_{k→∞} α_k = 0,   ∑_{k=0}^∞ α_k = ∞.

Then,

liminf_{k→∞} F(x_k) = F*.

Furthermore, if X* is nonempty and

∑_{k=0}^∞ α_k² < ∞,

then {x_k} converges to some x* ∈ X*.

Proof: For the first part, it will suffice to show that

liminf_{k→∞} F(x_{km}) = F*.

Assume, to arrive at a contradiction, that there exists an ε > 0 such that

liminf_{k→∞} F(x_{km}) − 2ε > F*.

Then there exists a point ŷ ∈ X such that

liminf_{k→∞} F(x_{km}) − 2ε > F(ŷ).

Let k₀ be large enough so that for all k ≥ k₀, we have

F(x_{km}) ≥ liminf_{k→∞} F(x_{km}) − ε.

By combining the preceding two relations, we obtain for all k ≥ k₀,

F(x_{km}) − F(ŷ) > ε.

By setting y = ŷ in Prop. 6.4.2, and by using the above relation, we have
for all k ≥ k₀,

‖x_{(k+1)m} − ŷ‖² ≤ ‖x_{km} − ŷ‖² − 2α_{km} ε + β α_{km}² m² c²
  = ‖x_{km} − ŷ‖² − α_{km}(2ε − β α_{km} m² c²).

Since α_k → 0, without loss of generality, we may assume that k₀ is large
enough so that

2ε − β α_{km} m² c² ≥ ε,   ∀ k ≥ k₀.

Therefore for all k ≥ k₀, we have

‖x_{(k+1)m} − ŷ‖² ≤ ‖x_{km} − ŷ‖² − α_{km} ε ≤ ··· ≤ ‖x_{k₀m} − ŷ‖² − ε ∑_{ℓ=k₀}^{k} α_{ℓm},

which cannot hold for k sufficiently large. Hence liminf_{k→∞} F(x_{km}) = F*.

To prove the second part of the proposition, note that from Prop.
6.4.2, for every x* ∈ X* and k ≥ 0 we have

‖x_{(k+1)m} − x*‖² ≤ ‖x_{km} − x*‖² − 2α_{km}(F(x_{km}) − F(x*)) + α_{km}² β m² c².    (6.102)

If ∑_{k=0}^∞ α_k² < ∞, Prop. A.4.4 of Appendix A implies that for each x* ∈ X*,
‖x_{km} − x*‖ converges to some real number, and hence {x_{km}} is
bounded. Consider a subsequence {x_{km}}_K such that

lim_{k→∞, k∈K} F(x_{km}) = liminf_{k→∞} F(x_{km}) = F*,

and let x̄ be a limit point of {x_{km}}_K. Since F is continuous, we must
have F(x̄) = F*, so x̄ ∈ X*. Using x* = x̄ in Eq. (6.102), it follows that
‖x_{km} − x̄‖ converges to a real number, which must be equal to 0 since x̄
is a limit point of {x_{km}}_K. Thus x̄ is the unique limit point of {x_{km}}.

Finally, to show that the entire sequence {x_k} also converges to x̄,
note that from Eqs. (6.83) and (6.85), and the form of the iterations (6.80)-
(6.82), we have ‖x_{k+1} − x_k‖ ≤ 2α_k c → 0. Since {x_{km}} converges to x̄, it
follows that {x_k} also converges to x̄. Q.E.D.

6.4.2 Convergence for Methods with Randomized Order

In this section we discuss convergence for the randomized component selec-
tion order and a constant stepsize α. The randomized versions of iterations
(6.80), (6.81), and (6.82), are

z_k = P_X(x_k − α ∇̃f_{w_k}(z_k)),   x_{k+1} = P_X(z_k − α ∇̃h_{w_k}(z_k)),    (6.103)

z_k = x_k − α ∇̃f_{w_k}(z_k),   x_{k+1} = P_X(z_k − α ∇̃h_{w_k}(z_k)),    (6.104)

z_k = x_k − α ∇̃h_{w_k}(x_k),   x_{k+1} = P_X(z_k − α ∇̃f_{w_k}(x_{k+1})),    (6.105)

respectively, where {w_k} is a sequence of random variables, taking values
from the index set {1, ..., m}.

We assume the following throughout the present section.

Assumption 6.4.3: [For iterations (6.103) and (6.104)]

(a) {w_k} is a sequence of random variables, each uniformly distributed
over {1, ..., m}, and such that for each k, w_k is independent of
the past history {x_k, z_{k−1}, x_{k−1}, ..., z₀, x₀}.

(b) There is a constant c ∈ ℜ such that for all k, we have with
probability 1, for all i = 1, ..., m,

max{ ‖∇̃f_i(z_k^i)‖, ‖∇̃h_i(z_k^i)‖ } ≤ c,    (6.106)

max{ f_i(x_k) − f_i(z_k^i), h_i(x_k) − h_i(z_k^i) } ≤ c ‖x_k − z_k^i‖,    (6.107)

where z_k^i is the result of the proximal iteration, starting at x_k if
w_k would be i, i.e.,

z_k^i ∈ arg min_{x∈X} { f_i(x) + (1/(2α_k)) ‖x − x_k‖² }

in the case of iteration (6.103), and

z_k^i ∈ arg min_{x∈ℜⁿ} { f_i(x) + (1/(2α_k)) ‖x − x_k‖² }

in the case of iteration (6.104).

Assumption 6.4.4: [For iteration (6.105)]

(a) {w_k} is a sequence of random variables, each uniformly distributed
over {1, ..., m}, and such that for each k, w_k is independent of
the past history {x_k, z_{k−1}, x_{k−1}, ..., z₀, x₀}.

(b) There is a constant c ∈ ℜ such that for all k, we have with
probability 1

max{ ‖∇̃f_i(x_{k+1}^i)‖, ‖∇̃h_i(x_k)‖ } ≤ c,   ∀ i = 1, ..., m,    (6.108)

f_i(x_k) − f_i(x_{k+1}^i) ≤ c ‖x_k − x_{k+1}^i‖,   ∀ i = 1, ..., m,    (6.109)

where x_{k+1}^i is the result of the iteration, starting at x_k if w_k
would be i, i.e.,

x_{k+1}^i = P_X(z_k^i − α_k ∇̃f_i(x_{k+1}^i)),

with

z_k^i = x_k − α_k ∇̃h_i(x_k).

Note that condition (6.107) is satisfied if there exist subgradients of
f_i and h_i at x_k with norms less than or equal to c. Thus the conditions (6.106)
and (6.107) are similar, the main difference being that the first applies to
"slopes" of f_i and h_i at z_k^i while the second applies to the "slopes" of f_i
and h_i at x_k. As in the case of Assumption 6.4.1, these conditions are
guaranteed by Lipschitz continuity assumptions on f_i and h_i. The conver-
gence analysis of the randomized algorithms of this section is somewhat
more complicated than the one of the cyclic order counterparts, and relies
on the Supermartingale Convergence Theorem (Prop. A.4.5 in Appendix
A). The following proposition deals with the case of a constant stepsize,
and parallels Prop. 6.4.3 for the cyclic order case.

Proposition 6.4.6: Let {x_k} be the sequence generated by one of the
randomized incremental methods (6.103)-(6.105), and let the stepsize
α_k be fixed at some positive constant α.

(a) If F* = −∞, then with probability 1

inf_{k≥0} F(x_k) = F*.

(b) If F* > −∞, then with probability 1

inf_{k≥0} F(x_k) ≤ F* + αβmc²/2,

where β = 5.

Proof: Consider first algorithms (6.103) and (6.104). By adapting the
proof argument of Prop. 6.4.2 with F_{i_k} replaced by F_{w_k} [cf. Eq. (6.91)], we
have

‖x_{k+1} − y‖² ≤ ‖x_k − y‖² − 2α(F_{w_k}(z_k) − F_{w_k}(y)) + α²c²,   ∀ y ∈ X.

Taking conditional expectation of both sides given the set of random vari-
ables ℱ_k = {x_k, z_{k−1}, ..., z₀, x₀}, and using the fact that w_k takes the
values i = 1, ..., m with equal probability 1/m, we obtain for all y ∈ X
and k,

E{‖x_{k+1} − y‖² | ℱ_k} ≤ ‖x_k − y‖² − 2α E{F_{w_k}(z_k) − F_{w_k}(y) | ℱ_k} + α²c²
  = ‖x_k − y‖² − (2α/m) ∑_{i=1}^{m} (F_i(z_k^i) − F_i(y)) + α²c²
  = ‖x_k − y‖² − (2α/m)(F(x_k) − F(y)) + α²c²
    + (2α/m) ∑_{i=1}^{m} (F_i(x_k) − F_i(z_k^i)).    (6.110)

By using Eqs. (6.106) and (6.107),

∑_{i=1}^{m} (F_i(x_k) − F_i(z_k^i)) ≤ 2c ∑_{i=1}^{m} ‖x_k − z_k^i‖ = 2cα ∑_{i=1}^{m} ‖∇̃f_i(z_k^i)‖ ≤ 2mαc².

By combining the preceding two relations, we obtain

E{‖x_{k+1} − y‖² | ℱ_k} ≤ ‖x_k − y‖² − (2α/m)(F(x_k) − F(y)) + 4α²c² + α²c²
  = ‖x_k − y‖² − (2α/m)(F(x_k) − F(y)) + βα²c²,    (6.111)

where β = 5.
The preceding equation holds also for algorithm (6.105). To see this
note that Eq. (6.98) yields for all y ∈ X

‖x_{k+1} − y‖² ≤ ‖x_k − y‖² − 2α(F_{w_k}(x_k) − F_{w_k}(y)) + α²c²
  + 2α(f_{w_k}(x_k) − f_{w_k}(x_{k+1})),    (6.112)

and similar to Eq. (6.110), we obtain

E{‖x_{k+1} − y‖² | ℱ_k} ≤ ‖x_k − y‖² − (2α/m)(F(x_k) − F(y)) + α²c²
  + (2α/m) ∑_{i=1}^{m} (f_i(x_k) − f_i(x_{k+1}^i)).    (6.113)

From Eq. (6.109), we have

f_i(x_k) − f_i(x_{k+1}^i) ≤ c ‖x_k − x_{k+1}^i‖,

and from Eq. (6.108) and the nonexpansion property of the projection,

‖x_k − x_{k+1}^i‖ ≤ ‖x_k − z_k^i + α ∇̃f_i(x_{k+1}^i)‖
  = ‖x_k − x_k + α ∇̃h_i(x_k) + α ∇̃f_i(x_{k+1}^i)‖
  ≤ 2αc.
Combining the preceding inequalities, we obtain Eq. (6.111) with β = 5.


Let us fix a positive scalar γ, consider the level set L_γ defined by

L_γ = { x ∈ X | F(x) < −γ + 1 + αβmc²/2 }   if F* = −∞,
L_γ = { x ∈ X | F(x) < F* + 2/γ + αβmc²/2 }   if F* > −∞,

and let y_γ ∈ X be such that

F(y_γ) = −γ   if F* = −∞,   F(y_γ) = F* + 1/γ   if F* > −∞.

Note that y_γ ∈ L_γ by construction. Define a new process {x̂_k} that is
identical to {x_k}, except that once x̂_k enters the level set L_γ, the process
terminates with x̂_k = y_γ. We will now argue that for any fixed γ, {x̂_k}
(and hence also {x_k}) will eventually enter L_γ, which will prove both parts
(a) and (b).

Using Eq. (6.111) with y = y_γ, we have

E{‖x̂_{k+1} − y_γ‖² | ℱ_k} ≤ ‖x̂_k − y_γ‖² − (2α/m)(F(x̂_k) − F(y_γ)) + βα²c²,

from which

E{‖x̂_{k+1} − y_γ‖² | ℱ_k} ≤ ‖x̂_k − y_γ‖² − v_k,    (6.114)

where

v_k = (2α/m)(F(x̂_k) − F(y_γ)) − βα²c²   if x̂_k ∉ L_γ,
v_k = 0   if x̂_k = y_γ.
The idea of the subsequent argument is to show that as long as x̂_k ∉ L_γ, the
scalar v_k (which is a measure of progress) is strictly positive and bounded
away from 0.

(a) Let F* = −∞. Then if x̂_k ∉ L_γ, we have

v_k = (2α/m)(F(x̂_k) − F(y_γ)) − βα²c²
  ≥ (2α/m)(−γ + 1 + αβmc²/2 + γ) − βα²c²
  = 2α/m.

Since v_k = 0 for x̂_k ∈ L_γ, we have v_k ≥ 0 for all k, and by Eq. (6.114)
and the Supermartingale Convergence Theorem (Prop. A.4.5 in Appendix
A), ∑_{k=0}^∞ v_k < ∞ implying that x̂_k ∈ L_γ for sufficiently large k, with
probability 1. Therefore, in the original process we have

inf_{k≥0} F(x_k) ≤ −γ + 1 + αβmc²/2
with probability 1. Letting γ → ∞, we obtain inf_{k≥0} F(x_k) = −∞ with
probability 1.

(b) Let F* > −∞. Then if x̂_k ∉ L_γ, we have

v_k = (2α/m)(F(x̂_k) − F(y_γ)) − βα²c²
  ≥ (2α/m)(F* + 2/γ + αβmc²/2 − F* − 1/γ) − βα²c²
  = 2α/(mγ).

Hence, v_k ≥ 0 for all k, and by the Supermartingale Convergence Theorem
(Prop. A.4.5 in Appendix A), we have ∑_{k=0}^∞ v_k < ∞ implying that x̂_k ∈ L_γ
for sufficiently large k, so that in the original process,

inf_{k≥0} F(x_k) ≤ F* + 2/γ + αβmc²/2

with probability 1. Letting γ → ∞, we obtain inf_{k≥0} F(x_k) ≤ F* +
αβmc²/2. Q.E.D.

By comparing Prop. 6.4.6(b) with Prop. 6.4.3(b), we see that when
F* > −∞ and the stepsize α is constant, the randomized methods (6.103),
(6.104), and (6.105), have a better error bound (by a factor m) than their
nonrandomized counterparts. In fact there is an example for the incremen-
tal subgradient method (see [BNO03], p. 514) that can be adapted to show
that the bound of Prop. 6.4.3(b) is tight in the sense that for a bad prob-
lem/cyclic order we have liminf_{k→∞} F(x_k) − F* = O(αm²c²). By contrast
the randomized method will get to within O(αmc²) with probability 1 for
any problem, according to Prop. 6.4.6(b). Thus with the randomized al-
gorithm we do not run the risk of choosing by accident a bad cyclic order.
A related result is provided by the following proposition, which should be
compared with Prop. 6.4.4 for the nonrandomized methods.

Proposition 6.4.7: Assume that X* is nonempty. Let {x_k} be a
sequence generated as in Prop. 6.4.6. Then for any ε > 0, we have
with probability 1

min_{0≤k≤N} F(x_k) ≤ F* + (αβmc² + ε)/2,    (6.115)

where N is a random variable with

E{N} ≤ m dist(x₀; X*)² / (αε).    (6.116)

Proof: Let ŷ be some fixed vector in X*. Define a new process {x̂_k} which
is identical to {x_k} except that once x̂_k enters the level set

L = { x ∈ X | F(x) < F* + (αβmc² + ε)/2 },

the process {x̂_k} terminates at ŷ. Similar to the proof of Prop. 6.4.6 [cf.
Eq. (6.111) with y being the closest point of x̂_k in X*], for the process {x̂_k}
we obtain for all k,

E{dist(x̂_{k+1}; X*)² | ℱ_k} ≤ E{‖x̂_{k+1} − y‖² | ℱ_k}
  ≤ dist(x̂_k; X*)² − (2α/m)(F(x̂_k) − F*) + βα²c²
  = dist(x̂_k; X*)² − v_k,    (6.117)

where ℱ_k = {x_k, z_{k−1}, ..., z₀, x₀} and

v_k = (2α/m)(F(x̂_k) − F*) − βα²c²   if x̂_k ∉ L,
v_k = 0   otherwise.

In the case where x̂_k ∉ L, we have

v_k ≥ (2α/m)(F* + (αβmc² + ε)/2 − F*) − βα²c² = αε/m.    (6.118)

By the Supermartingale Convergence Theorem (Prop. A.4.5 in Appendix
A), from Eq. (6.117) we have ∑_{k=0}^∞ v_k < ∞ with probability 1, so that
v_k = 0 for all k ≥ N, where N is a random variable. Hence x̂_N ∈ L with
probability 1, implying that in the original process we have

min_{0≤k≤N} F(x_k) ≤ F* + (αβmc² + ε)/2

with probability 1. Furthermore, by taking the total expectation in Eq.
(6.117), we obtain for all k,

E{dist(x̂_{k+1}; X*)²} ≤ E{dist(x̂_k; X*)²} − E{v_k}
  ≤ dist(x₀; X*)² − E{ ∑_{i=0}^{k} v_i },

where in the last inequality we use the facts x̂₀ = x₀ and E{dist(x₀; X*)²} =
dist(x₀; X*)². Therefore, letting k → ∞, and using the definition of v_k and
Eq. (6.118),

dist(x₀; X*)² ≥ E{ ∑_{k=0}^∞ v_k } = E{ ∑_{k=0}^{N−1} v_k } ≥ E{ N αε/m } = (αε/m) E{N}.



Q.E.D.
Like Prop. 6.4.6, a comparison of Props. 6.4.4 and 6.4.7 again suggests
an advantage for the randomized methods: compared to their deterministic
counterparts, they achieve a much smaller error tolerance (by a factor of
m), in the same expected number of iterations. Note, however, that the
preceding assessment is based on upper bound estimates, which may not
be sharp on a given problem [although the bound of Prop. 6.4.3(b) is tight
with a worst-case problem selection as mentioned earlier; see [BNO03], p.
514]. Moreover, the comparison based on worst-case values versus expected
values may not be strictly valid. In particular, while Prop. 6.4.4 provides an
upper bound estimate on N, Prop. 6.4.7 provides an upper bound estimate
on E{N}, which is not quite the same. However, this comparison seems to
be supported by the experimental results obtained so far.

Finally for the case of a diminishing stepsize, let us give the following
proposition, which parallels Prop. 6.4.5 for the cyclic order.

Proposition 6.4.8: Let {x_k} be the sequence generated by one of the
randomized incremental methods (6.103)-(6.105), and let the stepsize
α_k satisfy

lim_{k→∞} α_k = 0,   ∑_{k=0}^∞ α_k = ∞.

Then, with probability 1,

liminf_{k→∞} F(x_k) = F*.

Furthermore, if X* is nonempty and

∑_{k=0}^∞ α_k² < ∞,

then {x_k} converges to some x* ∈ X* with probability 1.

Proof: The proof of the first part is nearly identical to the corresponding
part of Prop. 6.4.5. To prove the second part, similar to the proof of Prop.
6.4.6, we obtain for all k and all x* ∈ X*,

E{‖x_{k+1} − x*‖² | ℱ_k} ≤ ‖x_k − x*‖² − (2α_k/m)(F(x_k) − F*) + βα_k²c²    (6.119)

[cf. Eq. (6.111) with α and y replaced with α_k and x*, respectively], where
ℱ_k = {x_k, z_{k−1}, ..., z₀, x₀}. According to the Supermartingale Conver-
gence Theorem (Prop. A.4.5 in Appendix A), for each x* ∈ X*, there is a
set Ω_{x*} of sample paths of probability 1 such that for each sample path in
Ω_{x*},

∑_{k=0}^∞ (2α_k/m)(F(x_k) − F*) < ∞,    (6.120)

and the sequence {‖x_k − x*‖} converges.

Let {v_i} be a countable subset of the relative interior ri(X*) that is
dense in X* [such a set exists since ri(X*) is a relatively open subset of the
affine hull of X*; an example of such a set is the intersection of X* with the
set of vectors of the form x* + ∑_{i=1}^{p} r_i ξ_i, where ξ₁, ..., ξ_p are basis vectors
for the affine hull of X* and r_i are rational numbers]. Let also Ω_{v_i} be the
set of sample paths defined earlier that corresponds to v_i. The intersection

Ω = ∩_{i=1}^∞ Ω_{v_i}

has probability 1, since its complement Ω̄ is equal to ∪_{i=1}^∞ Ω̄_{v_i} and

Prob(∪_{i=1}^∞ Ω̄_{v_i}) ≤ ∑_{i=1}^∞ Prob(Ω̄_{v_i}) = 0.

For each sample path in Ω, all the sequences {‖x_k − v_i‖} converge
so that {x_k} is bounded, while by the first part of the proposition [or Eq.
(6.120)] liminf_{k→∞} F(x_k) = F*. Therefore, {x_k} has a limit point x̄ in
X*. Since {v_i} is dense in X*, for every ε > 0 there exists v_{i(ε)} such that
‖x̄ − v_{i(ε)}‖ < ε. Since the sequence {‖x_k − v_{i(ε)}‖} converges and x̄ is a limit
point of {x_k}, we have lim_{k→∞} ‖x_k − v_{i(ε)}‖ < ε, so that

limsup_{k→∞} ‖x_k − x̄‖ ≤ lim_{k→∞} ‖x_k − v_{i(ε)}‖ + ‖v_{i(ε)} − x̄‖ < 2ε.

By taking ε → 0, it follows that x_k → x̄. Q.E.D.

6.4.3 Application in Specially Structured Problems

We will now illustrate the application of our methods of this section to


some specially structured problems.

ℓ₁-Regularization

Let us consider the ℓ₁-regularization problem

minimize γ‖x‖₁ + (1/2) ∑_{i=1}^{m} (c_i'x − b_i)²
subject to x ∈ ℜⁿ,    (6.121)

where γ is a positive scalar and x^j is the jth coordinate of x (cf. Example
1.3.1). It is convenient to handle the regularization term with the proximal
algorithm:

z_k ∈ arg min_{x∈ℜⁿ} { γ‖x‖₁ + (1/(2α_k)) ‖x − x_k‖² }.

This proximal iteration decomposes into the n one-dimensional minimiza-
tions

z_k^j ∈ arg min_{x^j∈ℜ} { γ|x^j| + (1/(2α_k)) |x^j − x_k^j|² },   j = 1, ..., n,

and can be done in closed form:

z_k^j = x_k^j − γα_k   if γα_k ≤ x_k^j,
z_k^j = 0             if −γα_k < x_k^j < γα_k,      j = 1, ..., n;    (6.122)
z_k^j = x_k^j + γα_k   if x_k^j ≤ −γα_k,

cf. the shrinkage operation discussed in Section 5.4.1.

Thus the incremental algorithms of this section are well-suited for
solution of the ℓ₁-regularization problem. The kth incremental iteration
may consist of selecting a pair (c_{i_k}, b_{i_k}) and performing a proximal iteration
of the form (6.122) to obtain z_k, followed by a gradient iteration on the
component ½(c_{i_k}'x − b_{i_k})², starting at z_k:

x_{k+1} = z_k − α_k c_{i_k}(c_{i_k}'z_k − b_{i_k}).

This algorithm is the special case of the algorithms (6.80)-(6.82) (here
X = ℜⁿ, and all three algorithms coincide), with f_i(x) being γ‖x‖₁ (we
use m copies of this function) and h_i(x) = ½(c_i'x − b_i)².

Finally, let us note that as an alternative, the proximal iteration
(6.122) could be replaced by a proximal iteration on γ|x^j| for some index
j, with all indexes selected cyclically in incremental iterations. Random-
ized selection of the data pair (c_{i_k}, b_{i_k}) is also a possibility, particularly in
contexts where the data has a natural stochastic interpretation.
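For illustration, the following Python sketch implements the incremental iteration just described for problem (6.121): a shrinkage step (6.122) on the regularization term followed by a gradient step on a single component ½(c_i'x − b_i)², with the data pair selected by uniform random sampling (one of the orders analyzed in this section). The constant stepsize value is an illustrative assumption.

```python
import numpy as np

def shrink(x, tau):
    """Closed-form proximal step (6.122) for tau*||.||_1 (componentwise shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def incremental_prox_grad_l1(C, b, gamma, alpha, num_iters=10000, seed=0):
    """Incremental method for (6.121): f_i(x) = gamma*||x||_1, h_i(x) = 0.5*(c_i'x - b_i)^2."""
    rng = np.random.default_rng(seed)
    m, n = C.shape
    x = np.zeros(n)
    for _ in range(num_iters):
        i = rng.integers(m)                        # randomized component selection
        z = shrink(x, gamma * alpha)               # proximal step (6.122)
        x = z - alpha * C[i] * (C[i] @ z - b[i])   # gradient step on 0.5*(c_i'x - b_i)^2
    return x
```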

Incremental and Distributed Augmented Lagrangian Methods

We will now revisit the augmented Lagrangian methodology of Section
5.2.1, in the context of large-scale separable problems and incremental prox-
imal methods. Consider the separable constrained minimization problem

minimize ∑_{i=1}^{m} f_i(x_i)
subject to x_i ∈ X_i,  i = 1, ..., m,   ∑_{i=1}^{m} (A_i x_i − b_i) = 0,    (6.123)

where Ji : 3in; f-t 3i are convex functions (ni is a positive integer, which
may depend on i), Xi are nonempty closed convex subsets of 3in;, Ai are
given r x ni matrices, and bi E 3ir are given vectors. For simplicity, we
focus on linear equality constraints, but the analysis can be extended to
convex inequality constraints as well.
Similar to our discussion of separable problems in Section 1.1.1, the
dual function is given by

q(λ) = inf_{x_i∈X_i, i=1,...,m} ∑_{i=1}^{m} { f_i(x_i) + λ'(A_i x_i − b_i) },

and by decomposing the minimization over the components of x, it can be
expressed in the additive form

q(λ) = ∑_{i=1}^{m} q_i(λ),

where

q_i(λ) = inf_{x_i∈X_i} { f_i(x_i) + λ'(A_i x_i − b_i) }.    (6.124)

A classical method for maximizing the dual function $q(\lambda)$, dating to [Eve63] [see [Las70] or [Ber99] (Section 6.2.2) for a discussion], is a gradient method, which, however, can be used only in the case where the functions $q_i$ are differentiable. Assuming a constant stepsize $\alpha > 0$, the method takes the form
$$\lambda_{k+1} = \lambda_k + \alpha\sum_{i=1}^m\nabla q_i(\lambda_k),$$
with the gradients $\nabla q_i(\lambda_k)$ obtained as
$$\nabla q_i(\lambda_k) = A_i x_k^i - b_i,\qquad i = 1,\ldots,m,$$
where $x_k^i$ attains the minimum of
$$f_i(x^i) + \lambda_k'(A_i x^i - b_i)$$
over $x^i\in X_i$; cf. Eq. (6.124) and Example 3.1.2. Thus this method exploits the separability of the problem, and is well suited for distributed computation, with the gradients $\nabla q_i(\lambda_k)$ computed in parallel at separate processors. However, the differentiability requirement on $q_i$ is very strong [it is equivalent to the infimum being attained uniquely for all $\lambda$ in Eq. (6.124)], and the convergence properties of this method tend to be fragile. We will consider instead an incremental proximal method that can be dually implemented (cf. Section 5.2) with decomposable augmented Lagrangian minimizations, and has more solid convergence properties.

In this connection, we observe that the dual problem,
$$\text{maximize}\quad \sum_{i=1}^m q_i(\lambda)$$
$$\text{subject to}\quad \lambda\in\Re^r,$$
has a suitable form for application of the incremental proximal method [cf. Eq. (6.67)].† In particular, the incremental proximal algorithm updates the current vector $\lambda_k$ to a new vector $\lambda_{k+1}$ after a cycle of $m$ subiterations:
$$\lambda_{k+1} = \psi_k^m,\qquad (6.125)$$
where starting with $\psi_k^0 = \lambda_k$, we obtain $\psi_k^m$ after the $m$ proximal steps
$$\psi_k^i \in \arg\max_{\lambda\in\Re^r}\left\{q_i(\lambda) - \frac{1}{2\alpha_k}\|\lambda - \psi_k^{i-1}\|^2\right\},\qquad i = 1,\ldots,m,\qquad (6.126)$$
where $\alpha_k$ is a positive parameter.


We now recall the Fenchel duality relation between proximal and augmented Lagrangian minimization, which was discussed in Section 5.2. Based on that relation, the proximal incremental update (6.126) can be written in terms of the data of the primal problem as
$$\psi_k^i = \psi_k^{i-1} + \alpha_k\big(A_i\hat x^i - b_i\big),\qquad (6.127)$$
where $\hat x^i$ is obtained from the minimization
$$\hat x^i \in \arg\min_{x^i\in X_i}L_{\alpha_k,i}\big(x^i,\psi_k^{i-1}\big),\qquad (6.128)$$
and $L_{\alpha_k,i}$ is the "incremental" augmented Lagrangian function
$$L_{\alpha_k,i}(x^i,\lambda) = f_i(x^i) + \lambda'(A_i x^i - b_i) + \frac{\alpha_k}{2}\|A_i x^i - b_i\|^2.\qquad (6.129)$$
This algorithm allows decomposition within the augmented Lagrangian framework, which is not possible in the standard augmented Lagrangian method of Section 5.2.1, since the addition of the penalty term
$$\frac{c_k}{2}\Big\|\sum_{i=1}^m(A_i x^i - b_i)\Big\|^2$$

to the Lagrangian function destroys its separability.

† The algorithm (6.67) requires that the functions $-q_i$ have a common effective domain, which is a closed convex set. This is true for example if $q_i$ is real-valued, which occurs if $X_i$ is compact. In unusual cases where $-q_i$ has an effective domain that depends on $i$ and/or is not closed, the earlier convergence analysis does not apply and needs to be modified.


We note that the algorithm has suitable structure for distributed asynchronous implementation, along the lines discussed in Section 2.1.6. In particular, the augmented Lagrangian minimizations (6.128) can be done at separate processors, each dedicated to updating a single component $x^i$, based on a multiplier vector $\lambda$ that is updated at a central processor/coordinator, using incremental iterations of the form (6.127).
Finally, let us note that the algorithm bears similarity with the ADMM for separable problems of Section 5.4.2. The latter algorithm is not incremental, which can be a disadvantage, but uses a constant penalty parameter, which can be an advantage. Other differences are that the ADMM maintains iterates of variables $z^i$ that approximate the constraint levels $b_i$ of Eq. (6.123), and involves a form of averaging of the multiplier iterates.
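As an illustration, the following is a minimal sketch of one cycle of the incremental iterations (6.127)-(6.129) for the special case where $X_i = \Re^{n_i}$ and $f_i(x^i) = \frac{1}{2}\|x^i - d_i\|^2$, so that each incremental augmented Lagrangian minimization has a closed form; the quadratic choice of $f_i$ and the data are assumptions made for the sketch, not part of the text.

    import numpy as np

    def incremental_al_cycle(psi, A_list, b_list, d_list, alpha):
        # One cycle of the incremental augmented Lagrangian method: for each i,
        # minimize the incremental augmented Lagrangian (6.129) over x^i (closed
        # form here, since f_i is quadratic and X_i = R^{n_i}), then update the
        # multiplier as in (6.127).
        x_list = []
        for A, b, d in zip(A_list, b_list, d_list):
            n_i = A.shape[1]
            # Optimality condition of min_x (1/2)||x - d||^2 + psi'(Ax - b) + (alpha/2)||Ax - b||^2.
            lhs = np.eye(n_i) + alpha * A.T @ A
            rhs = d - A.T @ psi + alpha * A.T @ b
            x = np.linalg.solve(lhs, rhs)
            psi = psi + alpha * (A @ x - b)   # multiplier update, Eq. (6.127)
            x_list.append(x)
        return x_list, psi

    # Illustrative use (assumed data): three components coupled by sum_i (A_i x^i - b_i) = 0.
    rng = np.random.default_rng(1)
    A_list = [rng.standard_normal((2, 3)) for _ in range(3)]
    b_list = [rng.standard_normal(2) for _ in range(3)]
    d_list = [rng.standard_normal(3) for _ in range(3)]
    psi = np.zeros(2)
    for _ in range(100):
        x_list, psi = incremental_al_cycle(psi, A_list, b_list, d_list, alpha=0.5)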

6.4.4 Incremental Constraint Projection Methods

In this section we consider incremental approaches for problems involving complicated constraint sets that cannot be easily dealt with by projection or proximal algorithms. In particular, we consider the following problem,
$$\text{minimize}\quad f(x)$$
$$\text{subject to}\quad x\in\cap_{i=1}^m X_i,\qquad (6.130)$$
where $f : \Re^n\mapsto\Re$ is a convex cost function, and $X_i$ are closed convex sets that have a relatively simple form that is convenient for projection or proximal iterations.
While the problem (6.130) does not involve a sum of component functions, it may be converted into one that does by using a distance-based exact penalty function. In particular, consider the problem
$$\text{minimize}\quad f(x) + c\sum_{i=1}^m\text{dist}(x; X_i)$$
$$\text{subject to}\quad x\in\Re^n,\qquad (6.131)$$
where $c$ is a positive penalty parameter. Then for $f$ Lipschitz continuous and $c$ sufficiently large, problems (6.130) and (6.131) are equivalent, as shown by the following proposition, which was proved in Section 1.5 (cf. Prop. 1.5.3).

Proposition 6.4.9: Let $f : \Re^n\mapsto\Re$ be a function, and let $X_i$, $i = 0,1,\ldots,m$, be closed subsets of $\Re^n$ with nonempty intersection. Assume that $f$ is Lipschitz continuous over $X_0$. Then there is a scalar $\bar c > 0$ such that for all $c\ge\bar c$, the set of minima of $f$ over $\cap_{i=0}^m X_i$ coincides with the set of minima of

$$f(x) + c\sum_{i=1}^m\text{dist}(x; X_i)$$
over $X_0$.

From the preceding proposition, with $X_0 = \Re^n$, it follows that we can solve in place of the original problem (6.130) the additive cost problem (6.131) for which our incremental algorithms of the preceding sections apply (assuming a sufficiently large choice of $c$). In particular, let us consider the algorithms (6.80)-(6.82), with $X = \Re^n$, which involve a proximal iteration on one of the functions $c\,\text{dist}(x; X_i)$, $i = 1,\ldots,m$, followed by a subgradient iteration on $f$. A key fact here is that the proximal iteration
$$z_k \in \arg\min_{x\in\Re^n}\left\{c\,\text{dist}(x; X_{i_k}) + \frac{1}{2\alpha_k}\|x - x_k\|^2\right\}\qquad (6.132)$$
involves a projection on $X_{i_k}$ of $x_k$, followed by an interpolation; see Fig. 6.4.1. This is shown in the following proposition.

Proposition 6.4.10: Let $z_k$ be the vector produced by the proximal iteration (6.132). If $x_k\in X_{i_k}$ then $z_k = x_k$, while if $x_k\notin X_{i_k}$,
$$z_k = \begin{cases} (1-\beta_k)x_k + \beta_k P_{X_{i_k}}(x_k) & \text{if } \beta_k < 1,\\ P_{X_{i_k}}(x_k) & \text{if } \beta_k\ge 1,\end{cases}\qquad (6.133)$$
where
$$\beta_k = \frac{\alpha_k c}{\text{dist}(x_k; X_{i_k})}.$$

Proof: The case $x_k\in X_{i_k}$ is evident, so assume that $x_k\notin X_{i_k}$. From the nature of the cost function in Eq. (6.132) we see that $z_k$ is a vector that lies in the line segment between $x_k$ and $P_{X_{i_k}}(x_k)$. Hence there are two possibilities: either
$$z_k = P_{X_{i_k}}(x_k),\qquad (6.134)$$
or $z_k\notin X_{i_k}$, in which case setting to 0 the gradient at $z_k$ of the cost function in Eq. (6.132) yields
$$c\,\frac{z_k - P_{X_{i_k}}(z_k)}{\|z_k - P_{X_{i_k}}(z_k)\|} = \frac{1}{\alpha_k}(x_k - z_k).$$

Figure 6.4.1. Illustration of a proximal iteration applied to the distance function $\text{dist}(x; X_{i_k})$ of the set $X_{i_k}$. In the left-hand side figure, we have $\alpha_k c \ge \text{dist}(x_k; X_{i_k})$, and $z_k$ is equal to the projection of $x_k$ onto $X_{i_k}$. In the right-hand side figure, we have $\alpha_k c < \text{dist}(x_k; X_{i_k})$, and $z_k$ is obtained by calculating the projection of $x_k$ onto $X_{i_k}$, and then interpolating according to Eq. (6.133).

This equation implies that $x_k$, $z_k$, and $P_{X_{i_k}}(z_k)$ lie on the same line, so that $P_{X_{i_k}}(z_k) = P_{X_{i_k}}(x_k)$ and
$$z_k = x_k - \frac{\alpha_k c}{\text{dist}(x_k; X_{i_k})}\big(x_k - P_{X_{i_k}}(x_k)\big) = (1-\beta_k)x_k + \beta_k P_{X_{i_k}}(x_k).\qquad (6.135)$$
By calculating and comparing the value of the cost function in Eq. (6.132) for each of the possibilities (6.134) and (6.135), we can verify that (6.135) gives a lower cost if and only if $\beta_k < 1$. Q.E.D.

Let us now consider the problem
$$\text{minimize}\quad \sum_{i=1}^m\big(f_i(x) + h_i(x)\big)\qquad (6.136)$$
$$\text{subject to}\quad x\in\cap_{i=1}^m X_i.$$
Based on the preceding analysis, we can convert this problem to the unconstrained minimization problem
$$\text{minimize}\quad \sum_{i=1}^m\big(f_i(x) + h_i(x) + c\,\text{dist}(x; X_i)\big)$$
$$\text{subject to}\quad x\in\Re^n,$$

where $c$ is sufficiently large. The incremental subgradient proximal algorithm (6.82) can be applied to this problem. At iteration $k$, it first performs a subgradient iteration on a cost component $h_{i_k}$,
$$y_k = x_k - \alpha_k\tilde\nabla h_{i_k}(x_k),\qquad (6.137)$$
where $\tilde\nabla h_{i_k}(x_k)$ denotes some subgradient of $h_{i_k}$ at $x_k$. It then performs a proximal iteration on a cost component $f_{i_k}$,
$$z_k \in \arg\min_{x\in\Re^n}\left\{f_{i_k}(x) + \frac{1}{2\alpha_k}\|x - y_k\|^2\right\}.\qquad (6.138)$$
Finally, it performs a proximal iteration on a constraint distance component $c\,\text{dist}(\cdot; X_{i_k})$ according to Prop. 6.4.10 [cf. Eq. (6.133)],
$$x_{k+1} = \begin{cases} (1-\beta_k)z_k + \beta_k P_{X_{i_k}}(z_k) & \text{if } \beta_k < 1,\\ P_{X_{i_k}}(z_k) & \text{if } \beta_k\ge 1,\end{cases}\qquad (6.139)$$
where
$$\beta_k = \frac{\alpha_k c}{\text{dist}(z_k; X_{i_k})},\qquad (6.140)$$
with the convention that $\beta_k = \infty$ if $\text{dist}(z_k; X_{i_k}) = 0$. The index $i_k$ may be chosen either randomly or according to a cyclic rule. Our earlier convergence analysis extends straightforwardly to the case of three cost function components for each index $i$. Moreover the subgradient, proximal, and projection operations may be performed in any order.
Note that the penalty parameter $c$ can be taken as large as desired, and it does not affect the algorithm as long as
$$\alpha_k c \ge \text{dist}(z_k; X_{i_k}),$$
in which case $\beta_k\ge 1$ [cf. Eq. (6.140)]. Thus we may keep increasing $c$ so that $\beta_k\ge 1$, up to the point where it reaches some "very large" threshold. It would thus appear that in practice we may be able to use a stepsize $\beta_k$ that is always equal to 1 in Eq. (6.139), leading to the simpler algorithm
$$x_{k+1} = P_{X_{i_k}}(z_k),$$
which does not involve the parameter $c$. Indeed, this algorithm has been analyzed in the paper [WaB13a]. Convergence has been shown for a variety

of randomized and cyclic sampling schemes for selecting the cost function components and the constraint components.
While this algorithm does not depend on the penalty parameter $c$, its currently available convergence proof requires an additional condition. This is the so-called linear regularity condition, namely that for some $\eta > 0$,
$$\big\|x - P_{\cap_{i=1}^m X_i}(x)\big\| \le \eta\,\max_{i=1,\ldots,m}\big\|x - P_{X_i}(x)\big\|,\qquad \forall\ x\in\Re^n,$$
where $P_Y(x)$ denotes projection of a vector $x$ on the set $Y$. This property is satisfied in particular if all the sets $X_i$ are polyhedral. By contrast, the algorithm (6.137)-(6.139) does not need this condition, but uses a (large) value of $c$ to guard against the rare case where the scalar $\beta_k$ of Eq. (6.140) gets smaller than 1, requiring an interpolation as per Eq. (6.139), which ensures convergence.
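A minimal sketch of the iteration (6.137)-(6.139) follows, for the simple case $f_i \equiv 0$ (so the proximal step (6.138) reduces to $z_k = y_k$), with halfspace constraints $X_i = \{x : a_i'x \le \beta_i\}$ whose projections are available in closed form; the problem data, stepsizes, and constraint model are assumptions for illustration.

    import numpy as np

    def project_halfspace(x, a, beta):
        # Projection onto {x : a'x <= beta}.
        viol = a @ x - beta
        return x if viol <= 0 else x - (viol / (a @ a)) * a

    def constraint_projection_step(x, grad_h, a_list, beta_list, i, alpha, c):
        # Subgradient step on h (Eq. (6.137)); here f_i = 0, so z = y (Eq. (6.138)).
        z = x - alpha * grad_h(x)
        a, beta = a_list[i], beta_list[i]
        dist = max(a @ z - beta, 0.0) / np.linalg.norm(a)   # dist(z; X_i) for a halfspace
        if dist == 0.0:
            return z
        beta_k = alpha * c / dist                           # Eq. (6.140)
        p = project_halfspace(z, a, beta)
        return p if beta_k >= 1 else (1 - beta_k) * z + beta_k * p   # Eq. (6.139)

    # Illustrative use: minimize ||x - t||^2 over the intersection of a few halfspaces.
    t = np.array([2.0, 2.0])
    a_list = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
    beta_list = [1.0, 1.0, 1.5]
    x = np.zeros(2)
    for k in range(500):
        i = k % len(a_list)                                 # cyclic constraint selection
        x = constraint_projection_step(x, lambda v: 2 * (v - t),
                                       a_list, beta_list, i, alpha=1.0 / (k + 5), c=10.0)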
Finally, let us note a related problem for which incremental constraint projection methods are well-suited. This is the problem
$$\text{minimize}\quad f(x) + c\sum_{j=1}^r\max\big\{0, g_j(x)\big\}$$
$$\text{subject to}\quad x\in\cap_{i=1}^m X_i,$$
which is obtained by replacing convex inequality constraints of the form $g_j(x)\le 0$ with the nondifferentiable penalty terms $c\max\{0, g_j(x)\}$, where $c > 0$ is a penalty parameter (cf. Section 1.5). Then a possible incremental method at each iteration would either do a subgradient iteration on $f$, or select one of the violated constraints (if any) and perform a subgradient iteration on the corresponding function $g_j$, or select one of the sets $X_i$ and do an interpolated projection on it.

6.5 COORDINATE DESCENT METHODS

In this section we will consider the block coordinate descent approach that we discussed briefly in Section 2.1.2. We focus on the problem
$$\text{minimize}\quad f(x)$$
$$\text{subject to}\quad x\in X,$$
where $f : \Re^n\mapsto\Re$ is a differentiable convex function, and $X$ is a Cartesian product of closed convex sets $X_1,\ldots,X_m$:
$$X = X_1\times X_2\times\cdots\times X_m,$$
where $X_i$ is a subset of $\Re^{n_i}$ (we allow that $n_i > 1$, although the most common case is when $n_i = 1$ for all $i$). The vector $x$ is partitioned as
$$x = (x^1, x^2,\ldots,x^m),$$

where each $x^i$ is a "block component" of $x$ that is constrained to be in $X_i$, so the constraint $x\in X$ is equivalent to $x^i\in X_i$ for all $i = 1,\ldots,m$.
Similar to the proximal gradient and incremental methods of the preceding two sections, the idea is to split the optimization process into a sequence of simpler optimizations, with the motivation to exploit the special structure of the problem. In the block coordinate descent method, however, the simpler optimizations revolve around the components $x^i$ of $x$, rather than the components of $f$ or the components of $X$, as in the proximal gradient and the incremental methods.
The most common form of block coordinate descent method is defined as follows: given the current iterate $x_k = (x_k^1,\ldots,x_k^m)$, we generate the next iterate $x_{k+1} = (x_{k+1}^1,\ldots,x_{k+1}^m)$, according to
$$x_{k+1}^i \in \arg\min_{\xi\in X_i} f\big(x_{k+1}^1,\ldots,x_{k+1}^{i-1},\xi,x_k^{i+1},\ldots,x_k^m\big),\qquad i = 1,\ldots,m,\qquad (6.141)$$
where we assume that the preceding minimization has at least one optimal solution. Thus, at each iteration, the cost is minimized with respect to each of the block components $x^i$, taken in cyclic order, with each minimization incorporating the results of the preceding minimizations. Naturally, the method makes practical sense if the minimization in Eq. (6.141) is fairly easy. This is frequently so when each $x^i$ is a scalar, but there are also other cases of interest, where $x^i$ is multidimensional.
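As an illustration, here is a minimal sketch of the cyclic iteration (6.141) for a convex quadratic cost $f(x) = \frac{1}{2}x'Qx - b'x$ with scalar blocks and box constraints $X_i = [\ell_i, u_i]$, for which each one-dimensional minimization has a closed form; the data are assumptions made for the sketch.

    import numpy as np

    def coordinate_descent_cycle(x, Q, b, lo, hi):
        # One cycle of Eq. (6.141): exact minimization over each scalar coordinate,
        # in cyclic order, using the most recent values of the other coordinates.
        n = len(x)
        for i in range(n):
            # Unconstrained minimizer of f along coordinate i, then clip to [lo_i, hi_i].
            xi = (b[i] - Q[i] @ x + Q[i, i] * x[i]) / Q[i, i]
            x[i] = np.clip(xi, lo[i], hi[i])
        return x

    # Illustrative use: Q positive definite, so f is strictly convex in each coordinate.
    rng = np.random.default_rng(2)
    M = rng.standard_normal((5, 5))
    Q = M.T @ M + np.eye(5)
    b = rng.standard_normal(5)
    lo, hi = -np.ones(5), np.ones(5)
    x = np.zeros(5)
    for _ in range(100):
        x = coordinate_descent_cycle(x, Q, b, lo, hi)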
The coordinate descent approach has a sound theoretical basis thanks
to its iterative cost function descent character, and is often conveniently
applicable. In particular, when the coordinate blocks are one-dimensional,
the descent direction does not require a special calculation. Moreover if
the cost function is a sum of functions with "loose coupling" between the
block components (i.e., each block component appears in just a few of the
functions in the sum), then the calculation of the minimum along each
block component may be simplified. Another structure that favors the use
of block coordinate descent is when the cost function involves terms of
the form h(Ax), where A is a matrix such that computing y = Ax is far
more expensive than computing h(y); this simplifies the minimization over
a block component.
The following proposition gives the basic convergence result for the
method. It turns out that it is necessary to make an assumption implying
that the minimum in Eq. (6.141) is uniquely attained. This assumption
is satisfied if f is strictly convex in each block component when all other
block components are held fixed. We will discuss later a version of the
algorithm, which involves quadratic regularization and does not require this
assumption. While the proposition as stated applies to convex optimization
problems, consistently with the framework of this section, the proof can
be adapted to use only the continuous differentiability of f and not its
convexity (see the proposition after the next one).

Proposition 6.5.1: (Convergence of Block Coordinate Descent) Let $f : \Re^n\mapsto\Re$ be convex and differentiable, and let $X = X_1\times X_2\times\cdots\times X_m$, where $X_i$ are closed and convex. Assume further that for each $x = (x^1,\ldots,x^m)\in X$ and $i$,
$$f\big(x^1,\ldots,x^{i-1},\xi,x^{i+1},\ldots,x^m\big),$$
viewed as a function of $\xi$, attains a unique minimum over $X_i$. Let $\{x_k\}$ be the sequence generated by the block coordinate descent method (6.141). Then, every limit point of $\{x_k\}$ minimizes $f$ over $X$.

Proof: Denote
$$z_k^i = \big(x_{k+1}^1,\ldots,x_{k+1}^i,x_k^{i+1},\ldots,x_k^m\big).$$
Using the definition (6.141) of the method, we obtain
$$f(x_k)\ge f(z_k^1)\ge f(z_k^2)\ge\cdots\ge f(z_k^{m-1})\ge f(x_{k+1}),\qquad \forall\ k.\qquad (6.142)$$
Let $\bar x = (\bar x^1,\ldots,\bar x^m)$ be a limit point of the sequence $\{x_k\}$, and note that $\bar x\in X$ since $X$ is closed. Equation (6.142) implies that the sequence $\{f(x_k)\}$ converges to $f(\bar x)$. We will show that $\bar x$ satisfies the optimality condition
$$\nabla f(\bar x)'(x - \bar x)\ge 0,\qquad \forall\ x\in X;$$
cf. Prop. 1.1.8 in Appendix B.
Let $\{x_{k_j}\mid j = 0,1,\ldots\}$ be a subsequence of $\{x_k\}$ that converges to $\bar x$. From the definition (6.141) of the algorithm and Eq. (6.142), we have
$$f(x_{k_j+1})\le f(z_{k_j}^1)\le f\big(x^1, x_{k_j}^2,\ldots,x_{k_j}^m\big),\qquad \forall\ x^1\in X_1,\ \forall\ j.$$
Taking the limit as $j$ tends to infinity, we obtain
$$f(\bar x)\le f\big(x^1,\bar x^2,\ldots,\bar x^m\big),\qquad \forall\ x^1\in X_1.\qquad (6.143)$$
Using the optimality conditions of Prop. 1.1.8 in Appendix B, we conclude that
$$\nabla_1 f(\bar x)'(x^1 - \bar x^1)\ge 0,\qquad \forall\ x^1\in X_1,$$
where $\nabla_i f$ denotes the gradient of $f$ with respect to the component $x^i$.
The idea of the proof is now to show that $\{z_{k_j}^1\}$ converges to $\bar x$ as $j\to\infty$, so that by repeating the preceding argument with $\{z_{k_j}^1\}$ in place of $\{x_{k_j}\}$, we will have
$$\nabla_2 f(\bar x)'(x^2 - \bar x^2)\ge 0,\qquad \forall\ x^2\in X_2.$$



We can then continue similarly to obtain
$$\nabla_i f(\bar x)'(x^i - \bar x^i)\ge 0,\qquad \forall\ x^i\in X_i,$$
for all $i = 1,\ldots,m$. By adding these inequalities, and using the Cartesian product structure of the set $X$, it follows that $\nabla f(\bar x)'(x - \bar x)\ge 0$ for all $x\in X$, thereby completing the proof.
To show that $\{z_{k_j}^1\}$ converges to $\bar x$ as $j\to\infty$, we assume the contrary, or equivalently that $\{z_{k_j}^1 - x_{k_j}\}$ does not converge to zero. Let
$$\gamma_{k_j} = \|z_{k_j}^1 - x_{k_j}\|.$$
By possibly restricting to a subsequence of $\{k_j\}$, we may assume that there exists some $\bar\gamma > 0$ such that $\gamma_{k_j}\ge\bar\gamma$ for all $j$. Let
$$s_{k_j}^1 = \frac{z_{k_j}^1 - x_{k_j}}{\gamma_{k_j}}.$$
Thus, $z_{k_j}^1 = x_{k_j} + \gamma_{k_j}s_{k_j}^1$, $\|s_{k_j}^1\| = 1$, and $s_{k_j}^1$ differs from zero only along the first block component. Notice that $s_{k_j}^1$ belongs to a compact set and therefore has a limit point $\bar s^1$. By restricting to a further subsequence of $\{k_j\}$, we assume that $s_{k_j}^1$ converges to $\bar s^1$.
Let us fix some $\epsilon\in[0,1]$. Since $0\le\epsilon\bar\gamma\le\gamma_{k_j}$, the vector $x_{k_j} + \epsilon\bar\gamma s_{k_j}^1$ lies on the line segment joining $x_{k_j}$ and $x_{k_j} + \gamma_{k_j}s_{k_j}^1 = z_{k_j}^1$, and belongs to $X$ since $X$ is convex. Using the fact that $f$ is monotonically nonincreasing on the interval from $x_{k_j}$ to $z_{k_j}^1$ (by the convexity of $f$), we obtain
$$f(z_{k_j}^1)\le f\big(x_{k_j} + \epsilon\bar\gamma s_{k_j}^1\big)\le f(x_{k_j}).$$
Since $f(x_k)$ converges to $f(\bar x)$, Eq. (6.142) shows that $f(z_{k_j}^1)$ also converges to $f(\bar x)$. Taking the limit as $j$ tends to infinity, we obtain
$$f(\bar x)\le f(\bar x + \epsilon\bar\gamma\bar s^1)\le f(\bar x).$$
We conclude that $f(\bar x) = f(\bar x + \epsilon\bar\gamma\bar s^1)$, for every $\epsilon\in[0,1]$. Since $\bar\gamma\bar s^1\ne 0$ and by Eq. (6.143), $\bar x^1$ attains the minimum of $f(x^1,\bar x^2,\ldots,\bar x^m)$ over $x^1\in X_1$, this contradicts the hypothesis that $f$ is uniquely minimized when viewed as a function of the first block component. This contradiction establishes that $z_{k_j}^1$ converges to $\bar x$, which as noted earlier, shows that $\nabla_2 f(\bar x)'(x^2 - \bar x^2)\ge 0$ for all $x^2\in X_2$.
By using $\{z_{k_j}^1\}$ in place of $\{x_{k_j}\}$, and $\{z_{k_j}^2\}$ in place of $\{z_{k_j}^1\}$ in the preceding arguments, we can show that $\nabla_3 f(\bar x)'(x^3 - \bar x^3)\ge 0$ for all $x^3\in X_3$, and similarly $\nabla_i f(\bar x)'(x^i - \bar x^i)\ge 0$ for all $x^i\in X_i$ and $i$. Q.E.D.

The preceding proposition applies to convex optimization problems.


However, its proof can be simply adapted to use only the continuous dif-
ferentiability of f and not its convexity. For this an extra assumption is
needed (monotonic decrease to the minimum along each coordinate), as
stated in the following proposition from [Ber99], Section 2.7.

Proposition 6.5.2: (Convergence of Block Coordinate Descent - Nonconvex Case) Let $f : \Re^n\mapsto\Re$ be continuously differentiable, and let $X = X_1\times X_2\times\cdots\times X_m$, where $X_i$ are closed and convex. Assume further that for each $x = (x^1,\ldots,x^m)\in X$ and $i$,
$$f\big(x^1,\ldots,x^{i-1},\xi,x^{i+1},\ldots,x^m\big),\qquad (6.144)$$
viewed as a function of $\xi$, attains a unique minimum $\bar\xi$ over $X_i$, and is monotonically nonincreasing in the interval from $x^i$ to $\bar\xi$. Let $\{x_k\}$ be the sequence generated by the block coordinate descent method (6.141). Then, every limit point $\bar x$ of $\{x_k\}$ satisfies the optimality condition $\nabla f(\bar x)'(x - \bar x)\ge 0$ for all $x\in X$.

The proof is nearly identical to the one of Prop. 6.5.1, using at the right point the monotonic nonincrease assumption in place of convexity of $f$. An alternative assumption, also discussed in [Ber99], Section 2.7, under which the conclusion of Prop. 6.5.2 can be shown with a similar proof is that the sets $X_i$ are compact (as well as convex), and that for each $i$ and $x\in X$, the function (6.144) of the $i$th block component $\xi$ attains a unique minimum over $X_i$, when all other block components are held fixed.
The nonnegative matrix factorization problem, described in Section
1.3, is an important example where the cost function is convex as a function
of each block component, but not convex as a function of the entire set
of block components. For this problem, a special convergence result for
the case of just two block components applies, which does not require the
uniqueness of the minimum in the two block component minimizations.
This result can be shown with a variation of the proofs of Props. 6.5.1 and
6.5.2; see [GrS99], [GrS00].

6.5.1 Variants of Coordinate Descent

There are many variations of coordinate descent, which are aimed at im-
proved efficiency, application-specific structures, and distributed comput-
ing environments. We describe some of the possibilities here and in the
exercises, and we refer to the literature for a more detailed analysis.
(a) We may apply coordinate descent in the context of a dual problem.
This is often convenient because the dual constraint set often has the

required Cartesian product structure (e.g., X may be the nonnegative


orthant). An early algorithm of this type for quadratic programming
was given in [Hil57]. For further discussion and development, we
refer to the books [BeT89a], Section 3.4.1, [Ber99], Section 6.2.1,
[CeZ97], and [Her09], and the references quoted there. For other
applications in a dual context, including single commodity network
flow optimization, we refer to the papers [BHT87], [TsB87], [TsB90],
[TsB91], [Aus92].
(b) There is a combination of coordinate descent with the proximal algorithm, which aims to circumvent the need for the "uniqueness of minimum" assumption in Prop. 6.5.1; see [Tse91a], [Aus92], [GrS00]. This method is obtained by applying cyclically the coordinate iteration
$$x_{k+1}^i \in \arg\min_{\xi\in X_i}\left\{f\big(x_{k+1}^1,\ldots,x_{k+1}^{i-1},\xi,x_k^{i+1},\ldots,x_k^m\big) + \frac{1}{2c}\|\xi - x_k^i\|^2\right\},$$
where $c$ is a positive scalar (a sketch of this iteration is given after this list). Assuming that $f$ is convex and differentiable, it can be seen that every limit point of the sequence $\{x_k\}$ is a global minimum. This is easily proved by applying the result of Prop. 6.5.1 to the cost function
$$F(x,y) = f(x) + \frac{1}{2c}\|x - y\|^2.$$

(c) The preceding combination of coordinate descent with the proximal


algorithm is an example of a broader class of methods whereby instead
of carrying out exactly the coordinate minimization

[cf. Eq. (6.141)], we perform one or more iterations of a descent al-


gorithm aimed at solving this minimization. Aside from the prox-
imal algorithm of (b) above, there are combinations with other de-
scent algorithms, including conditional gradient, gradient projection,
and two-metric projection methods. A key fact here is that there is
guaranteed cost function descent after cycling through all the block
components. See e.g., [Lin07], [LJS12], [Spa12], [Jag13], [BeT13],
[RHL13], [RHZ14], [RiT14].
(d) When $f$ is convex but nondifferentiable, the coordinate descent approach may fail because there may exist nonoptimal points starting from which it is impossible to obtain cost function descent along any one of the scalar coordinates. As an example consider minimization of $f(x^1,x^2) = x^1 + x^2 + 2|x^1 - x^2|$ over $x^1, x^2\ge 0$. At points $(x^1,x^2)$ with $x^1 = x^2 > 0$, coordinate descent makes no progress.

There is an important special case, however, where the nondifferentiability of $f$ is inconsequential, and at each nonoptimal point, it is possible to find a direction of descent among the scalar coordinate directions. This is the case where the nondifferentiable portion of $f$ is separable, i.e., $f$ has the form
$$f(x) = F(x) + \sum_{i=1}^n G_i(x^i),\qquad (6.145)$$
where $F$ is convex and differentiable, and each $G_i$ is a function of the $i$th block component that is convex but not necessarily differentiable; see Exercise 6.10 for a brief discussion, and [Aus76], [Tse01a], [TsY09], [FHT10], [Tse10], [Nes12], [RHL13], [SaT13], [BST14], [HCW14], [RiT14] for detailed analyses. The coordinate descent approach for this type of cost function may also be combined with the ideas of the proximal gradient method of Section 6.3; see [FeR12], [QSG13], [LLX14], [LiW14], [RiT14]. A case of special interest, which arises in machine learning problems involving $\ell_1$-regularization, is when $\sum_{i=1}^n G_i(x^i)$ is a positive multiple of the $\ell_1$ norm.
(e) For the case where f is convex and nondifferentiable but not of the
separable form (6.145), a coordinate descent approach is possible but
requires substantial modifications and special assumptions. In one
possible approach, known as auction and ε-relaxation, the coordi-
nate search is nonmonotonic, in the sense that a single coordinate
may be allowed to change even if this deteriorates the cost function
value. In particular, when a coordinate is increased (or decreased),
it is set to ε > 0 (or −ε, respectively) plus the value that minimizes
the cost function along that coordinate. For (single commodity) net-
work special structures, including assignment, max-flow, linear cost
transshipment, and convex separable network flow problems with and
without gains, there are sound algorithms of this type, with excellent
computational complexity properties, which can reach or approach an
optimal solution, assuming that ε is sufficiently small; see the papers
[Ber79], [Ber92], [BPT97a], [BPT97b], [Ber98], [TsB00], the books
[BeT89a], [Ber91], [BeT97], [Ber98], and the references quoted there.
An alternative approach for nondifferentiable cost is to use (exact) co-
ordinate descent when this leads to cost function improvement, and
to compute a more complicated descent direction (a linear combi-
nation of multiple coordinate directions) otherwise. For dual prob-
lems of (single commodity) linear network optimization problems,
the latter descent direction may be computed efficiently (it is the
sum of appropriately chosen coordinate directions). This idea has
led to fast algorithms, known as relaxation methods; see the papers
[Ber81], [Ber85], [BeT94b], and the books [BeT89a], [Ber91], [Ber98].
In practice, most of the iterations use single coordinate directions, and

the computationally more complex multiple-coordinate directions are


needed infrequently. This approach has also been applied to network
flow problems with gains [BeT88] and to convex network flow prob-
lems [BHT87], and may be applied more broadly to specially struc-
tured problems where multiple-coordinate descent directions can be
efficiently computed.
(f) A possible way to improve the convergence properties of the coordi-
nate descent method is to use an irregular order instead of a fixed
cyclic order for coordinate selection. Convergence of such a method
can usually be established simply: the result of Prop. 6.5.1 can be
shown for a method where the order of iteration may be arbitrary
as long as there is an integer M such that each block component is
iterated at least once in every group of M contiguous iterations. The
proof is similar to the one of Prop. 6.5.1.
An idea to accelerate convergence that has received a lot of attention
is to use randomization in the choice of coordinate at each iteration;
see [Spa03], [StV09], [LeL10], [Nee10], [Nes12], [LeS13], [NeC13]. An
important question in this context, which has been addressed in var-
ious ways by these references, is whether the convergence rate of the
method can be enhanced in this way. In particular, a complexity
analysis in the paper [Nes12] of randomized coordinate selection has
stimulated much interest.
An alternative possibility is to use a deterministic coordinate selection
rule, which aims to identify promising coordinates that lead to large
cost improvements. A classical approach of this type is the Gauss-
Southwell order, which at each iterate chooses the coordinate direc-
tion of steepest descent (the one with minimum directional deriva-
tive). While this requires some overhead for coordinate selection, it
is known to lead to faster convergence (fewer iterations) than cyclic
or randomized selection, based on analysis and practical experience
[DRT11], [ScF14] [see Exercise 6.11(b)].
(g) A more extreme type of irregular method is a distributed asynchronous
version, which is executed concurrently at different processors with
minimal coordination. The book [BeT89a] (Section 6.3.5) discusses
the circumstances under which asynchrony may improve the conver-
gence rate of the method. We discuss the convergence of asynchronous
coordinate descent in the next section.
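Referring back to variant (b) above, the following is a minimal sketch of one cycle of the proximally regularized coordinate iteration, for a differentiable convex $f$ with scalar blocks and $X_i = \Re$; the one-dimensional minimizations are carried out numerically, and the test function and parameter value are assumptions made for the sketch.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def proximal_coordinate_cycle(x, f, c):
        # One cycle of variant (b): for each coordinate i, minimize f with the other
        # coordinates fixed, plus the proximal term (1/2c)(xi - x_k^i)^2.
        for i in range(len(x)):
            xi_old = x[i]
            def partial(xi, i=i, xi_old=xi_old):
                y = x.copy()
                y[i] = xi
                return f(y) + (xi - xi_old) ** 2 / (2.0 * c)
            x[i] = minimize_scalar(partial).x   # exact 1-D minimization (numerically)
        return x

    # Illustrative use on a smooth convex function (assumed, not from the text).
    def f(x):
        return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + 0.5 * (x[0] - x[1]) ** 2

    x = np.zeros(2)
    for _ in range(50):
        x = proximal_coordinate_cycle(x, f, c=1.0)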

6.5.2 Distributed Asynchronous Coordinate Descent

We will now consider the distributed asynchronous implementation of fixed


point algorithms, which applies as a special case to coordinate descent. The
implementation is totally asynchronous in the sense that there is no bound

on the size of the communication delays between the processors; cf. the
terminology of Section 2.1.6. The analysis in this section is based on the
author's paper [Ber83] (see [BeT89a], [BeT91], [FrS00] for broad surveys
of totally asynchronous algorithms). A different line of analysis applies
to partially asynchronous algorithms, for which it is necessary to have a
bound on the size of the communication delays. Such algorithms will not
be considered here; see [TBA86], [BeT89a], Chapter 7, for gradient-like
methods, [BeT89a] for network flow algorithms, [TBT90] for nonexpan-
sive iterations, and [LiW14] which focuses on coordinate descent methods,
under less restrictive conditions than the ones of the present section (a
sup-norm contraction property of the algorithmic map is not assumed).
Let us consider parallelizing a stationary fixed point algorithm by sep-
arating it into several local algorithms operating concurrently at different
processors. As we discussed in Section 2.1.6, in an asynchronous algorithm,
the local algorithms do not have to wait at predetermined points for pre-
determined information to become available. Thus some processors may
execute more iterations than others, while the communication delays be-
tween processors may be unpredictable. Another practical setting that may
be modeled well by a distributed asynchronous iteration is when all com-
putation takes place at a single computer, but any number of coordinates
may be simultaneously updated at a time, with the order of coordinate
selection possibly being random.
With this context in mind, we introduce a model of asynchronous distributed solution of abstract fixed point problems of the form $x = F(x)$, where $F$ is a given function. We represent $x$ as $x = (x^1,\ldots,x^m)$, where $x^i\in\Re^{n_i}$ with $n_i$ being some positive integer. Thus $x\in\Re^n$, where $n = n_1 + \cdots + n_m$, and $F$ maps $\Re^n$ to $\Re^n$. We denote by $F_i : \Re^n\mapsto\Re^{n_i}$ the $i$th component of $F$, so $F(x) = \big(F_1(x),\ldots,F_m(x)\big)$. Our computation framework involves $m$ interconnected processors, the $i$th of which updates the $i$th component $x^i$ by applying the corresponding mapping $F_i$. Thus, in a (synchronous) distributed fixed point algorithm, processor $i$ iterates at time $t$ according to
$$x_{t+1}^i = F_i(x_t^1,\ldots,x_t^m),\qquad \forall\ i = 1,\ldots,m.\qquad (6.146)$$
To accommodate the distributed algorithmic framework and its overloaded notation, we will use subscript $t$ to denote iterations/times where some (but not all) processors update their corresponding components, reserving the index $k$ for computation stages involving all processors, and also reserving superscript $i$ to denote component/processor index.
In an asynchronous version of the algorithm, processor $i$ updates $x^i$ only for $t$ in a selected subset $R_i$ of iterations, and with components $x^j$, $j\ne i$, supplied by other processors with delays $t - \tau_{ij}(t)$,
$$x_{t+1}^i = \begin{cases} F_i\big(x_{\tau_{i1}(t)}^1,\ldots,x_{\tau_{im}(t)}^m\big) & \text{if } t\in R_i,\\ x_t^i & \text{if } t\notin R_i.\end{cases}\qquad (6.147)$$

Here $\tau_{ij}(t)$ is the time at which the $j$th coordinate used in this update was computed, and the difference $t - \tau_{ij}(t)$ is referred to as the communication delay from $j$ to $i$ at time $t$.
We noted in Section 2.1.6 that an example of an algorithm of this type is a coordinate descent method, where we assume that the $i$th scalar coordinate is updated at a subset of times $R_i\subset\{0,1,\ldots\}$, according to
$$x_{t+1}^i \in \arg\min_{\xi\in\Re} f\big(x_{\tau_{i1}(t)}^1,\ldots,x_{\tau_{i,i-1}(t)}^{i-1},\xi,x_{\tau_{i,i+1}(t)}^{i+1},\ldots,x_{\tau_{im}(t)}^m\big),$$
and is left unchanged ($x_{t+1}^i = x_t^i$) if $t\notin R_i$. Here we can assume without loss of generality that each scalar coordinate is assigned to a separate processor. The reason is that a physical processor that updates a block of scalar coordinates may be replaced by a block of fictitious processors, each assigned to a single scalar coordinate, and updating their coordinates simultaneously.
To discuss the convergence of the asynchronous algorithm (6.147), we
introduce the following assumption.

Assumption 6.5.1: (Continuous Updating and Information Renewal)
(1) The set of times $R_i$ at which processor $i$ updates $x^i$ is infinite, for each $i = 1,\ldots,m$.
(2) $\lim_{t\to\infty}\tau_{ij}(t) = \infty$ for all $i,j = 1,\ldots,m$.

Assumption 6.5.1 is natural, and is essential for any kind of convergence result about the algorithm. In particular, the condition $\tau_{ij}(t)\to\infty$ guarantees that outdated information about the processor updates will eventually be purged from the computation. It is also natural to assume that $\tau_{ij}(t)$ is monotonically increasing with $t$, but this assumption is not necessary for the subsequent analysis.
We wish to show that {Xt} converges to a fixed point of F, and to this
end we employ the following convergence theorem for totally asynchronous
iterations from [Ber83]. The theorem has served as the basis for the treat-
ment of totally asynchronous iterations in the book [BeT89a] (Chapter 6),
including coordinate descent and asynchronous gradient-based optimiza-
tion.

Proposition 6.5.3: (Asynchronous Convergence Theorem) Let $F$ have a unique fixed point $x^*$, let Assumption 6.5.1 hold, and assume that there is a sequence of nonempty subsets $\{S(k)\}\subset\Re^n$ with

$$S(k+1)\subset S(k),\qquad k = 0,1,\ldots,$$
and is such that if $\{y_k\}$ is any sequence with $y_k\in S(k)$ for all $k\ge 0$, then $\{y_k\}$ converges to $x^*$. Assume further the following:
(1) Synchronous Convergence Condition: We have
$$F(x)\in S(k+1),\qquad \forall\ x\in S(k),\ k = 0,1,\ldots.$$
(2) Box Condition: For all $k$, $S(k)$ is a Cartesian product of the form
$$S(k) = S_1(k)\times\cdots\times S_m(k),$$
where $S_i(k)$ is a subset of $\Re^{n_i}$, $i = 1,\ldots,m$.
Then for every initial vector $x_0\in S(0)$, the sequence $\{x_t\}$ generated by the asynchronous algorithm (6.147) converges to $x^*$.

Proof: To explain the idea of the proof, let us note that the given conditions imply that updating any component $x^i$, by applying $F$ to $x\in S(k)$, while leaving all other components unchanged, yields a vector in $S(k)$. Thus, once enough time passes so that the delays become "irrelevant," then after $x$ enters $S(k)$, it stays within $S(k)$. Moreover, once a component $x^i$ enters the subset $S_i(k)$ and the delays become "irrelevant," $x^i$ gets permanently within the smaller subset $S_i(k+1)$ at the first time that $x^i$ is iterated on with $x\in S(k)$. Once each component $x^i$, $i = 1,\ldots,m$, gets within $S_i(k+1)$, the entire vector $x$ is within $S(k+1)$ by the box condition. Thus the iterates from $S(k)$ eventually get into $S(k+1)$ and so on, and converge pointwise to $x^*$ in view of the assumed properties of $\{S(k)\}$.
With this idea in mind, we show by induction that for each $k\ge 0$, there is a time $t_k$ such that:
(1) $x_t\in S(k)$ for all $t\ge t_k$.
(2) For all $i$ and $t\in R_i$ with $t\ge t_k$, we have
$$\big(x_{\tau_{i1}(t)}^1,\ldots,x_{\tau_{im}(t)}^m\big)\in S(k).$$
[In words, after some time, all fixed point estimates will be in $S(k)$ and all estimates used in iteration (6.147) will come from $S(k)$.]
The induction hypothesis is true for $k = 0$ since $x_0\in S(0)$. Assuming it is true for a given $k$, we will show that there exists a time $t_{k+1}$ with the required properties. For each $i = 1,\ldots,m$, let $t(i)$ be the first element of $R_i$ such that $t(i)\ge t_k$. Then by the synchronous convergence condition, we have $F(x_{t(i)})\in S(k+1)$, implying (in view of the box condition) that $x_{t(i)+1}^i\in S_i(k+1)$.

Figure 6.5.1. Geometric interpretation of the conditions of the asynchronous convergence theorem. We have a nested sequence of boxes $\{S(k)\}$ such that $F(x)\in S(k+1)$ for all $x\in S(k)$.

Similarly, for every $t\in R_i$, $t\ge t(i)$, we have $x_{t+1}^i\in S_i(k+1)$. Between elements of $R_i$, $x_t^i$ does not change. Thus,
$$x_t^i\in S_i(k+1),\qquad \forall\ t\ge t(i)+1.$$
Let $t_k' = \max_i\{t(i)\} + 1$. Then, using the box condition we have
$$x_t\in S(k+1),\qquad \forall\ t\ge t_k'.$$
Finally, since by Assumption 6.5.1, we have $\tau_{ij}(t)\to\infty$ as $t\to\infty$, $t\in R_i$, we can choose a time $t_{k+1}\ge t_k'$ that is sufficiently large so that $\tau_{ij}(t)\ge t_k'$ for all $i$, $j$, and $t\in R_i$ with $t\ge t_{k+1}$. We then have, for all $t\in R_i$ with $t\ge t_{k+1}$ and $j = 1,\ldots,m$, $x_{\tau_{ij}(t)}^j\in S_j(k+1)$, which (by the box condition) implies that
$$\big(x_{\tau_{i1}(t)}^1,\ldots,x_{\tau_{im}(t)}^m\big)\in S(k+1).$$
The induction is complete. Q.E.D.

Figure 6.5.1 illustrates the assumptions of the preceding convergence theorem. The main issue in applying the theorem is to identify the set sequence $\{S(k)\}$ and to verify the assumptions of Prop. 6.5.3. These assumptions hold in two primary contexts of interest. The first is when $S(k)$ are spheres centered at $x^*$ with respect to the weighted sup-norm
$$\|x\|_\infty^w = \max_{i=1,\ldots,n}\frac{|x^i|}{w^i},$$
where $w = (w^1,\ldots,w^n)$ is a vector of positive weights (see the following proposition). The second context is based on monotonicity conditions, and is particularly useful in dynamic programming algorithms, for which we refer to the papers [Ber82c], [BeY10], and the books [Ber12], [Ber13].

Figure 6.5.2. Geometric interpretation of the mechanism for asynchronous convergence. Iteration on a single component $x^i$ keeps $x$ in $S(k)$, while it moves $x^i$ into the corresponding component $S_i(k+1)$ of $S(k+1)$, where it remains throughout the subsequent iterations. Once all components $x^i$ have been iterated on at least once, the iterate is guaranteed to be in $S(k+1)$.

Figure 6.5.2 illustrates the mechanism by which asynchronous convergence


is achieved.
As an example, let us apply the preceding convergence theorem under
a weighted sup-norm contraction assumption.

Proposition 6.5.4: Let $F$ be a contraction with respect to a weighted sup-norm $\|\cdot\|_\infty^w$ and modulus $\rho < 1$, and let Assumption 6.5.1 hold. Then a sequence $\{x_t\}$ generated by the asynchronous algorithm (6.147) converges pointwise to $x^*$.

Proof: We apply Prop. 6.5.3 with
$$S(k) = \big\{x\in\Re^n \mid \|x - x^*\|_\infty^w \le \rho^k\|x_0 - x^*\|_\infty^w\big\},\qquad k = 0,1,\ldots.$$
Since $F$ is a contraction with modulus $\rho$, the synchronous convergence condition is satisfied. Since $F$ is a weighted sup-norm contraction, the box condition is also satisfied, and the result follows. Q.E.D.

The contraction property of the preceding proposition can be verified in a few interesting special cases. In particular, let $F$ be linear of the form
$$F(x) = Ax + b,$$
where $A$ and $b$ are a given $n\times n$ matrix and a vector in $\Re^n$. Let us denote by $|A|$ the matrix whose components are the absolute values of the components of $A$, and let $\sigma(|A|)$ denote the spectral radius of $|A|$ (the largest modulus among the moduli of the eigenvalues of $|A|$). Then it can be shown that $F$ is a contraction with respect to some weighted sup-norm if and only if

$\sigma(|A|) < 1$. A proof of this may be found in several sources, including [BeT89a], Chapter 2, Cor. 6.2. Another interesting fact is that $F$ is a contraction with respect to the (unweighted) sup-norm $\|\cdot\|_\infty$ if and only if
$$\sum_{j=1}^n|a_{ij}| < 1,\qquad \forall\ i = 1,\ldots,n,$$
where $a_{ij}$ are the components of $A$. To see this note that the $i$th component of $Ax$ satisfies
$$\Big|\sum_{j=1}^n a_{ij}x^j\Big| \le \sum_{j=1}^n|a_{ij}|\,|x^j| \le \|x\|_\infty\sum_{j=1}^n|a_{ij}|,$$
so $\|Ax\|_\infty\le\rho\|x\|_\infty$, where $\rho = \max_i\sum_{j=1}^n|a_{ij}| < 1$. This shows that $A$ (and hence also $F$) is a contraction with respect to $\|\cdot\|_\infty$. A similar argument shows also the reverse assertion.
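A small sketch of this setting follows: it checks the sup-norm row-sum condition and the spectral radius of $|A|$, and runs the asynchronous iteration (6.147) with randomly chosen update sets and bounded random delays; the matrix and the delay model are assumptions made for the illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 6
    A = 0.1 * rng.standard_normal((n, n))     # small entries, so the row sums of |A| are < 1
    b = rng.standard_normal(n)

    row_sums = np.abs(A).sum(axis=1)
    spec_rad = max(abs(np.linalg.eigvals(np.abs(A))))
    print("max row sum of |A|:", row_sums.max(), " spectral radius of |A|:", spec_rad)

    # Asynchronous iteration for F(x) = Ax + b: each component i is updated only at
    # some times t (the set R_i), using possibly outdated components of the others.
    T, max_delay = 2000, 5
    history = [np.zeros(n)]
    for t in range(T):
        x_new = history[-1].copy()
        for i in range(n):
            if rng.random() < 0.5:
                delayed = np.array([history[max(0, t - rng.integers(0, max_delay + 1))][j]
                                    for j in range(n)])
                x_new[i] = A[i] @ delayed + b[i]
        history.append(x_new)

    x_star = np.linalg.solve(np.eye(n) - A, b)         # the unique fixed point of F
    print("error after", T, "steps:", np.linalg.norm(history[-1] - x_star, ord=np.inf))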
We finally note a few extensions of the theorem. It is possible to allow $F$ to be time-varying, so in place of $F$ we operate with a sequence of mappings $F_k$, $k = 0,1,\ldots$. Then if all $F_k$ have a common fixed point, the conclusion of the theorem holds (see [BeT89a] for details). Another extension is to allow $F$ to have multiple fixed points and introduce an assumption that roughly says that $\cap_{k=0}^\infty S(k)$ is the set of all fixed points. Then the conclusion is that any limit point of $\{x_t\}$ is a fixed point.

6.6 GENERALIZED PROXIMAL METHODS

The proximal algorithm admits several extensions, which may be particularly useful in specialized application domains, such as inference and signal processing. Moreover the algorithm, with some unavoidable limitations, applies to nonconvex problems as well. A general form of the algorithm for minimizing a function $f : \Re^n\mapsto(-\infty,\infty]$ is
$$x_{k+1}\in\arg\min_{x\in\Re^n}\big\{f(x) + D_k(x,x_k)\big\},\qquad (6.148)$$
where $D_k : \Re^{2n}\mapsto(-\infty,\infty]$ is a regularization term that replaces the quadratic
$$\frac{1}{2c_k}\|x - x_k\|^2$$
in the proximal algorithm. We assume that $D_k(\cdot,x_k)$ is a closed proper (extended real-valued) convex function for each $x_k$.
The algorithm (6.148) can be graphically interpreted similar to the
proximal algorithm, as illustrated in Fig. 6.6.1. This figure, and the conver-
gence and rate of convergence results of Section 5.1 provide some qualitative

Figure 6.6.1. Illustration of the generalized proximal algorithm (6.148) for a convex cost function $f$. The regularization term is convex but need not be quadratic or real-valued. In this figure, $\gamma_k$ is the scalar by which the graph of $-D_k(\cdot,x_k)$ must be raised so that it just touches the graph of $f$.

guidelines about the kind of behavior that may be expected from the al-
gorithm when f is a closed proper convex function. In particular, under
suitable assumptions on Dk, we expect to be able to show convergence to
the optimal value and convergence to an optimal solution if one exists (cf.
Prop. 5.1.3).

Example 6.6.1: (Entropy Minimization Algorithm)

Let us consider the case where
$$D_k(x,y) = \frac{1}{c_k}\sum_{i=1}^n x^i\left(\ln\Big(\frac{x^i}{y^i}\Big) - 1\right),$$
where $x^i$ and $y^i$ denote the scalar components of $x$ and $y$, respectively. This regularization term is based on the scalar function $\phi : \Re\mapsto(-\infty,\infty]$, given by
$$\phi(x) = \begin{cases} x\big(\ln(x) - 1\big) & \text{if } x > 0,\\ 0 & \text{if } x = 0,\\ \infty & \text{if } x < 0,\end{cases}\qquad (6.149)$$
which is referred to as the entropy function, and is shown in Fig. 6.6.2. Note that $\phi$ is finite only for nonnegative arguments.
The proximal algorithm that uses the entropy function is given by
$$x_{k+1}\in\arg\min_{x\in\Re^n}\left\{f(x) + \frac{1}{c_k}\sum_{i=1}^n x^i\left(\ln\Big(\frac{x^i}{x_k^i}\Big) - 1\right)\right\},\qquad (6.150)$$
where $x_k^i$ denotes the $i$th coordinate of $x_k$; see Fig. 6.6.3. Because the logarithm is finite only for positive arguments, the algorithm requires that $x_k^i > 0$ for all $i$, and must generate a sequence that lies strictly within the positive orthant.

Figure 6.6.2. Illustration of the entropy function (6.149).

Thus the algorithm can be used only for functions $f$ for which the minimum in Eq. (6.150) is well-defined and guaranteed to be attained within the positive orthant.

Figure 6.6.3. Illustration of the entropy minimization algorithm (6.150).
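As a small illustration of iteration (6.150), when $f$ is linear, $f(x) = g'x$, the entropy proximal step has the closed form $x_{k+1}^i = x_k^i e^{-c_k g^i}$, a multiplicative update that automatically stays in the positive orthant. The following sketch applies this closed-form step to the linearization of a smooth convex $f$ at each iterate; the test function and parameter are assumptions made for the illustration (this linearized variant is in the spirit of the mirror descent method of Example 6.6.6, not the exact algorithm (6.150)).

    import numpy as np

    def entropy_prox_linear(g, x_prev, c):
        # Minimizer of  g'x + (1/c) * sum_i x^i (ln(x^i / x_prev^i) - 1)  over x > 0:
        # setting the gradient to zero gives g^i + (1/c) ln(x^i / x_prev^i) = 0.
        return x_prev * np.exp(-c * g)

    def grad_f(x):
        return 2.0 * (x - np.array([0.5, 1.5, 1.0]))   # f(x) = ||x - (0.5, 1.5, 1.0)||^2

    x = np.ones(3)
    for k in range(200):
        x = entropy_prox_linear(grad_f(x), x, c=0.1)
    print(x)   # approaches the positive minimizer (0.5, 1.5, 1.0)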

We may speculate on the properties and potential applications of the


proximal approach with nonquadratic regularization, in various algorithmic
contexts, based on the corresponding applications that we have considered
so far. These are:
(a) Dual proximal algorithms, based on application of the Fenchel Duality
Theorem (Prop. 1.2.1) to the minimization (6.148) ; cf. Section 5.2.
(b) Augmented Lagrangian methods with nonquadratic penalty functions.

Figure 6.6.4. Illustration of the generalized proximal algorithm (6.148) for the
case of a nonconvex cost function f.

Here the augmented Lagrangian function will be related to the con-


jugate of Dk(·, Xk); cf. Section 5.2.1. We will discuss later in this
section an example involving an augmented Lagrangian with expo-
nential penalty function (the conjugate of the entropy function).
(c) Combinations with polyhedral approximations. There are straight-
forward extensions of proximal cutting plane and bundle algorithms
involving nonquadratic regularization terms; cf. Section 5.3.
(d) Extensions of incremental subgradient proximal methods; cf. Section
6.4. Again such extensions are straightforward.
(e) Gradient projection algorithms with "nonquadratic metric." We will
discuss an example of such an algorithm, the so-called mirror descent
method, later in this section.
We may also consider in principle the application of the approach to a nonconvex cost function $f$ as well. Its behavior, however, in this case may be complicated and/or unreliable, as indicated in Fig. 6.6.4. An important question is whether the minimum in the proximal iteration (6.148) is attained for all $k$; if $f$ is not assumed convex, this is not automatically guaranteed, even if $D_k$ is quadratic [take for example $f(x) = -\|x\|^3$ and $D_k(x,x_k) = \|x - x_k\|^2$]. To simplify the presentation, we will implicitly assume the attainment of the minimum throughout our discussion; it is guaranteed for example if $f$ is closed proper convex, and $D_k(\cdot,x_k)$ is closed and coercive [satisfies $D_k(x,x_k)\to\infty$ as $\|x\|\to\infty$], and its effective domain intersects with $\text{dom}(f)$ for all $k$ (cf. Prop. 3.2.3 in Appendix B).
Let us now introduce two conditions on Dk that guarantee some sound
behavior for the algorithm, even when f is not convex. The first is a
"stabilization property," whereby adding Dk to f does not produce an

incentive to move from $x_k$:
$$D_k(x,x_k)\ge D_k(x_k,x_k),\qquad \forall\ x\in\Re^n,\ k = 0,1,\ldots.\qquad (6.151)$$
With this condition we are assured that the algorithm has a cost improvement property. Indeed, we have
$$f(x_{k+1})\le f(x_{k+1}) + D_k(x_{k+1},x_k) - D_k(x_k,x_k)\le f(x_k) + D_k(x_k,x_k) - D_k(x_k,x_k) = f(x_k),\qquad (6.152)$$
where the first inequality follows from Eq. (6.151), and the second inequality follows from the definition (6.148) of the algorithm. The condition (6.151) also guarantees that
$$x^*\in\arg\min_{x\in\Re^n}f(x)\qquad\Longrightarrow\qquad x^*\in\arg\min_{x\in\Re^n}\big\{f(x) + D_k(x,x^*)\big\},$$
in which case we assume that the algorithm stops.


However, for the algorithm to be reliable, an additional condition is required to guarantee that it produces strict cost improvement when outside of $X^*$, the set of (global) minima of $f$. One such condition is that the algorithm can stop only at points of $X^*$, i.e.,
$$x_k\in\arg\min_{x\in\Re^n}\big\{f(x) + D_k(x,x_k)\big\}\qquad\Longrightarrow\qquad x_k\in X^*,\qquad (6.153)$$
in which case, the second inequality in the calculation of Eq. (6.152) is strict when $x_k\notin X^*$, implying that
$$f(x_{k+1}) < f(x_k)\qquad \text{if } x_k\notin X^*.\qquad (6.154)$$
A set of assumptions guaranteeing the condition (6.153) are:
(a) $f$ is convex.
(b) $D_k(\cdot,x_k)$ satisfies Eq. (6.151), and is convex and differentiable at $x_k$.
(c) We have
$$\text{ri}\big(\text{dom}(f)\big)\cap\text{ri}\big(\text{dom}(D_k(\cdot,x_k))\big)\ne\emptyset.\qquad (6.155)$$
To see this, note that if
$$x_k\in\arg\min_{x\in\Re^n}\big\{f(x) + D_k(x,x_k)\big\},$$
by the Fenchel Duality Theorem (Prop. 1.2.1), there exists a dual optimal solution $\lambda^*$ such that $-\lambda^*$ is a subgradient of $D_k(\cdot,x_k)$ at $x_k$, so that $\lambda^* = 0$ [by Eq. (6.151)], and also $\lambda^*$ is a subgradient of $f$ at $x_k$, so that $x_k$ minimizes $f$.

Figure 6.6.5. Illustration of a case where the generalized proximal algorithm (6.156) converges to a local minimum that is not global. In this example convergence to the global minimum would be attained if the regularization term $D_k(\cdot,x_k)$ were sufficiently "flat."

Note that the condition (6.153) may fail if $D_k(\cdot,x_k)$ is not differentiable. For example if $f(x) = \frac{1}{2}\|x\|^2$ and $D_k(x,x_k) = \frac{1}{c}\|x - x_k\|$, then for any $c > 0$, the points $x_k\in[-1/c,\,1/c]$ minimize $f(\cdot) + D_k(\cdot,x_k)$. Simple examples can also be constructed to show that the relative interior condition is essential to guarantee the condition (6.153).
We summarize the preceding discussion in the following proposition.

Proposition 6.6.1: Under the conditions (6.151) and (6.153), and assuming that the minimum of $f(x) + D_k(x,x_k)$ over $x$ is attained for every $k$, the algorithm
$$x_{k+1}\in\arg\min_{x\in\Re^n}\big\{f(x) + D_k(x,x_k)\big\}\qquad (6.156)$$
improves strictly the value of $f$ at each iteration where $x_k$ is not a global minimum of $f$, and stops at a global minimum of $f$.

Of course, cost improvement is a reassuring property for the algorithm (6.156), but does not guarantee convergence to a global minimum, particularly when $f$ is not convex (see Fig. 6.6.5). Thus despite the descent property established in the preceding proposition, the convergence of the algorithm may be problematic. In fact this is true even if $f$ is assumed convex and has a nonempty set of minima $X^*$. Some extra conditions are required, but we will not pursue this issue further; see e.g., [ChT93], [Teb97].
If $f$ is not convex, the difficulties may be formidable. First, the global minimum in Eq. (6.156) may be hard to compute, since the cost function $f(\cdot) + D_k(\cdot,x_k)$ may not be convex. Second, the algorithm may converge to local minima of $f$ that are not global; see Fig. 6.6.5.

Figure 6.6.6. Illustration of a case where the generalized proximal algorithm (6.156) diverges even when started at a local minimum of $f$.

As Figure 6.6.5 indicates, convergence to a global minimum is facilitated if the regularization term $D_k(\cdot,x_k)$ is relatively "flat." The algorithm may also not converge at all, even if $f$ has local minima and the algorithm is started near or at a local minimum of $f$ (see Fig. 6.6.6). Still the algorithm has been used for nonconvex problems, often on a heuristic basis.

Some Examples

In what follows in this section we will give some examples of application of


the generalized proximal algorithm (6.156) for the case where f is closed
proper convex. We will not provide a convergence analysis, and refer in-
stead to the literature.

Example 6.6.2: (Bregman Distance Function)

Let $\psi : \Re^n\mapsto(-\infty,\infty]$ be a convex function, which is differentiable within $\text{int}\big(\text{dom}(\psi)\big)$, and define for all $x,y\in\text{int}\big(\text{dom}(\psi)\big)$,
$$D_k(x,y) = \frac{1}{c_k}\big(\psi(x) - \psi(y) - \nabla\psi(y)'(x-y)\big),\qquad (6.157)$$
where $c_k$ is a positive penalty parameter. This is known as the Bregman distance function, and has been analyzed in connection with proximal-like algorithms in the paper [CeZ92] and the book [CeZ97]. Note that in the case where $\psi(x) = \frac{1}{2}\|x\|^2$, we have
$$D_k(x,y) = \frac{1}{c_k}\left(\frac{1}{2}\|x\|^2 - \frac{1}{2}\|y\|^2 - y'(x-y)\right) = \frac{1}{2c_k}\|x - y\|^2,$$
so the quadratic regularization term of the proximal algorithm is included as a special case.

Similarly, when $\psi(x) = \sum_{i=1}^n\psi_i(x^i)$, where
$$\psi_i(x^i) = \begin{cases} x^i\ln(x^i) & \text{if } x^i > 0,\\ 0 & \text{if } x^i = 0,\\ \infty & \text{if } x^i < 0,\end{cases}$$
with gradient $\nabla\psi_i(x^i) = \ln(x^i) + 1$ for $x^i > 0$, we obtain from Eq. (6.157) the function
$$D_k(x,y) = \frac{1}{c_k}\sum_{i=1}^n\left(x^i\ln\Big(\frac{x^i}{y^i}\Big) - x^i + y^i\right).$$
Except for the constant term $\frac{1}{c_k}\sum_{i=1}^n y^i$, which is inconsequential since it does not depend on $x$, this is the regularization function that is used in the entropy minimization algorithm of Example 6.6.1.
Note that because of the convexity of $\psi$, the condition (6.151) holds. Furthermore, because of the differentiability of $D_k(\cdot,x_k)$ (a consequence of the differentiability of $\psi$), the condition (6.153) holds as well when $f$ is convex.

Example 6.6.3: (Exponential Augmented Lagrangian Method)

Consider the constrained minimization problem
$$\text{minimize}\quad f(x)$$
$$\text{subject to}\quad x\in X,\quad g_1(x)\le 0,\ \ldots,\ g_r(x)\le 0,$$
where $f, g_1,\ldots,g_r : \Re^n\mapsto\Re$ are convex functions, and $X$ is a closed convex set. Consider also the corresponding primal and dual functions
$$p(u) = \inf_{x\in X,\ g(x)\le u} f(x),\qquad q(\mu) = \inf_{x\in X}\big\{f(x) + \mu'g(x)\big\}.$$
We assume that $p$ is closed, so that there is no duality gap, and except for sign changes, $q$ and $p$ are conjugates of each other [i.e., $p(u)$ is equal to $(-q)^*(-u)$; cf. Section 4.2 in Appendix B].
Let us consider the entropy minimization algorithm of Example 6.6.1, applied to maximization over $\mu\ge 0$ of the dual function. It is given by
$$\mu_{k+1}\in\arg\max_{\mu\ge 0}\left\{q(\mu) - \frac{1}{c_k}\sum_{j=1}^r\mu^j\left(\ln\Big(\frac{\mu^j}{\mu_k^j}\Big) - 1\right)\right\},\qquad (6.158)$$
where $\mu^j$ and $\mu_k^j$ denote the $j$th coordinates of $\mu$ and $\mu_k$, respectively, and it corresponds to the case
$$D_k(\mu,\mu_k) = \frac{1}{c_k}\sum_{j=1}^r\mu^j\left(\ln\Big(\frac{\mu^j}{\mu_k^j}\Big) - 1\right).$$

We now consider a dual implementation of the proximal iteration (6.158) (cf. Section 5.2). It is based on the Fenchel dual problem, which is to minimize over $u\in\Re^r$
$$(-q)^*(-u) + D_k^*(u,\mu_k),$$
where $(-q)^*$ and $D_k^*(\cdot,\mu_k)$ are the conjugates of $(-q)$ and $D_k(\cdot,\mu_k)$, respectively. Since $(-q)^*(-u) = p(u)$, we see that the dual proximal iteration is
$$u_{k+1}\in\arg\min_{u\in\Re^r}\big\{p(u) + D_k^*(u,\mu_k)\big\},\qquad (6.159)$$
and from the optimality condition of the Fenchel Duality Theorem (Prop. 1.2.1), the primal optimal solution of Eq. (6.158) is given by
$$\mu_{k+1} = \nabla_u D_k^*(u_{k+1},\mu_k).\qquad (6.160)$$
To calculate $u_{k+1}$, we first note that the conjugate of the entropy function
$$\phi(x) = \begin{cases} x\big(\ln(x) - 1\big) & \text{if } x > 0,\\ 0 & \text{if } x = 0,\\ \infty & \text{if } x < 0,\end{cases}$$
is the exponential function $\phi^*(u) = e^u$ [to see this, simply calculate $\sup_u\{xu - e^u\}$, the conjugate of the exponential $e^u$, and show that it is equal to $\phi(x)$]. Thus the conjugate of the function
$$\mu^j\left(\ln\Big(\frac{\mu^j}{\mu_k^j}\Big) - 1\right),$$
which can be written as
$$\mu_k^j\,\phi\Big(\frac{\mu^j}{\mu_k^j}\Big),$$
is equal to
$$\mu_k^j\,e^{u^j}.$$
Since we have
$$D_k(\mu,\mu_k) = \frac{1}{c_k}\sum_{j=1}^r\mu_k^j\,\phi\Big(\frac{\mu^j}{\mu_k^j}\Big),$$
it follows that its conjugate is
$$D_k^*(u,\mu_k) = \frac{1}{c_k}\sum_{j=1}^r\mu_k^j\,e^{c_k u^j},\qquad (6.161)$$
so the proximal minimization (6.159) is written as
$$u_{k+1}\in\arg\min_{u\in\Re^r}\left\{p(u) + \frac{1}{c_k}\sum_{j=1}^r\mu_k^j\,e^{c_k u^j}\right\}.$$

Figure 6.6.7. Illustration of the exponential penalty function $\frac{1}{c}\,\mu\,e^{cg}$, as a function of the constraint level $g$.

Similar to the augmented Lagrangian method with quadratic penalty function of Section 5.2.1, the preceding minimization can be written as
$$\min_{u\in\Re^r}\ \inf_{x\in X,\ g(x)\le u}\left\{f(x) + \frac{1}{c_k}\sum_{j=1}^r\mu_k^j\,e^{c_k u^j}\right\}.$$
It can be seen that $u_{k+1} = g(x_k)$, where $x_k$ is obtained through the minimization
$$\min_{x\in X}\left\{f(x) + \frac{1}{c_k}\sum_{j=1}^r\mu_k^j\,e^{c_k g_j(x)}\right\},$$
or equivalently, by minimization of a corresponding augmented Lagrangian function:
$$x_k\in\arg\min_{x\in X}L_{c_k}(x,\mu_k),\qquad L_c(x,\mu) = f(x) + \frac{1}{c}\sum_{j=1}^r\mu^j\,e^{c\,g_j(x)}.\qquad (6.162)$$
From Eqs. (6.160) and (6.161), and the fact $u_{k+1} = g(x_k)$, it follows that the corresponding multiplier iteration is
$$\mu_{k+1}^j = \mu_k^j\,e^{c_k g_j(x_k)},\qquad j = 1,\ldots,r.\qquad (6.163)$$
The exponential penalty that is added to $f$ to form the augmented Lagrangian in Eq. (6.162) is illustrated in Fig. 6.6.7. Contrary to its quadratic counterpart for inequality constraints, it is twice differentiable, which, depending on the problem at hand, can be a significant practical advantage when Newton-like methods are used for minimizing the augmented Lagrangian.
In summary, the exponential augmented Lagrangian method consists of sequential minimizations of the form (6.162) followed by multiplier iterations of the form (6.163). The method is dual (and equivalent) to the entropy

minimization algorithm in the same way the augmented Lagrangian method with quadratic penalty function of Section 5.2.1 is dual (and equivalent) to the proximal algorithm with quadratic regularization.
The convergence properties of the exponential augmented Lagrangian method, and the equivalent entropy minimization algorithm, are quite similar to those of their quadratic counterparts. However, the analysis is somewhat more complicated because when one of the coordinates $\mu_k^j$ tends to zero, the corresponding exponential penalty term tends to 0, while the fraction $\mu^j/\mu_k^j$ of the corresponding entropy term tends to $\infty$. This analysis also applies to exponential smoothing of nondifferentiabilities; cf. the discussion of Section 2.2.5. We refer to the literature cited at the end of the chapter.
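A minimal sketch of the exponential method of multipliers (6.162)-(6.163) on a small example follows, using a general-purpose unconstrained solver for the augmented Lagrangian minimization; the test problem and the solver choice are assumptions made for the illustration.

    import numpy as np
    from scipy.optimize import minimize

    # Example problem (assumed): minimize ||x||^2 subject to g(x) = 1 - x[0] <= 0, X = R^2.
    f = lambda x: x @ x
    g = lambda x: np.array([1.0 - x[0]])

    def exponential_augmented_lagrangian(x, mu, c):
        # L_c(x, mu) = f(x) + (1/c) * sum_j mu_j * exp(c * g_j(x)),  cf. Eq. (6.162).
        return f(x) + (1.0 / c) * np.sum(mu * np.exp(c * g(x)))

    mu = np.ones(1)
    c = 2.0
    x = np.zeros(2)
    for k in range(30):
        res = minimize(exponential_augmented_lagrangian, x, args=(mu, c))   # minimization (6.162)
        x = res.x
        mu = mu * np.exp(c * g(x))                                          # multiplier update (6.163)

    print(x, mu)   # should approach the solution (1, 0) with multiplier mu close to 2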

Example 6.6.4: (Majorization-Minimization Algorithm)

An equivalent version of the generalized proximal algorithm (6.156) (known as the majorization-minimization algorithm) is obtained by absorbing the cost function into the regularization term. This leads to the algorithm
$$x_{k+1}\in\arg\min_{x\in\Re^n}M_k(x,x_k),\qquad (6.164)$$
where $M_k : \Re^{2n}\mapsto(-\infty,\infty]$ satisfies the conditions
$$M_k(x,x) = f(x),\qquad \forall\ x\in\Re^n,\ k = 0,1,\ldots,\qquad (6.165)$$
$$M_k(x,x_k)\ge f(x),\qquad \forall\ x\in\Re^n,\ k = 0,1,\ldots.\qquad (6.166)$$
By defining
$$D_k(x,x_k) = M_k(x,x_k) - f(x),$$
we have
$$M_k(x,x_k) = f(x) + D_k(x,x_k),$$
so the algorithm (6.164) can be written in the generalized proximal format (6.156). Moreover the condition (6.166) is equivalent to the condition (6.151) that guarantees cost improvement, which is strict assuming also that
$$x_k\in\arg\min_{x\in\Re^n}M_k(x,x_k)\qquad\Longrightarrow\qquad x_k\in X^*,\qquad (6.167)$$
where $X^*$ is the set of desirable points for convergence, cf. Eq. (6.153) and Prop. 6.6.1.
As an example, consider the problem of unconstrained minimization of the function
$$f(x) = R(x) + \|Ax - b\|^2,$$
where $A$ is an $m\times n$ matrix, $b$ is a vector in $\Re^m$, and $R : \Re^n\mapsto\Re$ is a nonnegative-valued convex regularization function. Let $D$ be any symmetric matrix such that $D - A'A$ is positive definite (for example $D$ may be a sufficiently large multiple of the identity). Let us define
$$M(x,y) = R(x) + \|Ay - b\|^2 + 2(x-y)'A'(Ay - b) + (x-y)'D(x-y),$$

and note that $M$ satisfies the condition $M(x,x) = f(x)$ [cf. Eq. (6.165)], as well as the condition $M(x,x_k)\ge f(x)$ for all $x$ and $k$ [cf. Eq. (6.166)] in view of the calculation
$$M(x,y) - f(x) = \|Ay - b\|^2 - \|Ax - b\|^2 + 2(x-y)'A'(Ay-b) + (x-y)'D(x-y) = (x-y)'(D - A'A)(x-y).\qquad (6.168)$$
When $D$ is the identity matrix $I$, by scaling $A$, we can make the matrix $I - A'A$ positive definite, and from Eq. (6.168), we have
$$M(x,y) = R(x) + \|Ax - b\|^2 - \|Ax - Ay\|^2 + \|x - y\|^2.$$
The majorization-minimization algorithm for this form of $M$ has been used widely in signal processing applications.
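For instance, with $R(x) = \gamma\|x\|_1$ and $D = I$ (after scaling $A$ so that $I - A'A$ is positive definite), the $x$-dependent part of $M(x, x_k)$ can be written as $\gamma\|x\|_1 + \|x - u_k\|^2$ with $u_k = x_k - A'(Ax_k - b)$, so the minimization (6.164) reduces to a soft-thresholding step. A minimal sketch, with data that are assumptions made for the illustration:

    import numpy as np

    def soft_threshold(v, tau):
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def mm_l1_step(x, A, b, gamma):
        # Minimize M(x, x_k) = gamma*||x||_1 + ||x - u_k||^2 + const, where
        # u_k = x_k - A'(A x_k - b); the minimizer is coordinatewise soft-thresholding at gamma/2.
        u = x - A.T @ (A @ x - b)
        return soft_threshold(u, gamma / 2.0)

    # Illustrative data, with A scaled so that I - A'A is positive definite.
    rng = np.random.default_rng(4)
    A = rng.standard_normal((30, 10))
    A = A / (1.1 * np.linalg.norm(A, 2))        # largest singular value now below 1
    b = A @ (rng.standard_normal(10) * (rng.random(10) < 0.4))
    x = np.zeros(10)
    for _ in range(500):
        x = mm_l1_step(x, A, b, gamma=0.01)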

Example 6.6.5: (Proximal Algorithm with Power


Regularization - Super linear Convergence)

Consider the generalized proximal algorithm

x_{k+1} ∈ arg min_{x∈ℜ^n} { f(x) + D_k(x, x_k) },    (6.169)

where D_k : ℜ^{2n} ↦ ℜ is a regularization term that grows as the pth power of
the distance to the proximal center, where p is any scalar with p > 1 (instead
of p = 2 as in the quadratic regularization case):

D_k(x, x_k) = (1/c_k) Σ_{i=1}^n φ(x^i − x_k^i),    (6.170)

where c_k is a positive parameter, and φ is a scalar convex function with order
of growth p > 1 around 0, such as for example

φ(x^i − x_k^i) = (1/p) |x^i − x_k^i|^p.    (6.171)

We will aim to show that while the algorithm has satisfactory convergence
properties for all p > 1, it attains a superlinear convergence rate, provided
p is larger than the order of growth of f around the optimum. This occurs
under natural conditions, even when c_k is kept constant - an old result, first
obtained in [KoB76] (see also [Ber82a], Section 5.4, and [BeT94a], which we
will follow in the subsequent derivation).
We assume that f : ℜ^n ↦ (−∞, ∞] is a closed convex function with a
nonempty set of minima, denoted X*. We also assume that for some scalars
β > 0, δ > 0, and γ > 1, we have

f* + β(d(x))^γ ≤ f(x),    ∀ x ∈ ℜ^n with d(x) ≤ δ,    (6.172)



where

d(x) = min_{x*∈X*} ‖x − x*‖

(cf. the assumption of Prop. 5.1.4). Moreover we require that φ : ℜ ↦ ℜ is
strictly convex, continuously differentiable, and satisfies the following
conditions:

φ(0) = 0,    ∇φ(0) = 0,    lim_{z→−∞} ∇φ(z) = −∞,    lim_{z→∞} ∇φ(z) = ∞,

and for some scalar M > 0, we have

φ(z) ≤ M |z|^p,    ∀ z ∈ [−δ, δ].    (6.173)

An example is the order-p power function φ of Eq. (6.171). Since we want to
focus on the rate of convergence, we will assume that the method converges
in the sense that

f(x_k) → f*,    d(x_k) → 0,    (6.174)

where f* is the optimal value (convergence proofs under mild conditions on
f are given in [KoB76], [Ber82a], and [BeT94a]).
Let us denote by x̄_k the vector of X* which is at minimum distance
from x_k. Let also η be a constant that bounds the ℓ_p norm in terms of the ℓ_2
norm:

‖x‖_p ≤ η ‖x‖,    ∀ x ∈ ℜ^n.

From the form of the proximal minimization (6.169)-(6.170), and using Eqs.
(6.173), (6.174), we have for all k large enough so that |x̄_k^i − x_k^i| ≤ δ for all i,

f(x_{k+1}) − f* ≤ f(x̄_k) + (1/c_k) Σ_{i=1}^n φ(x̄_k^i − x_k^i) − f*
             = (1/c_k) Σ_{i=1}^n φ(x̄_k^i − x_k^i)    (6.175)
             ≤ (M/c_k) Σ_{i=1}^n |x̄_k^i − x_k^i|^p.

Also from the growth assumption (6.172), we have for all k large enough so
that d(x_k) ≤ δ,

d(x_k) ≤ ( (f(x_k) − f*)/β )^{1/γ}.    (6.176)

By combining Eqs. (6.175) and (6.176), and using the definition of η, we obtain

f(x_{k+1}) − f* ≤ (M/c_k) ‖x̄_k − x_k‖_p^p ≤ (M η^p/c_k) (d(x_k))^p ≤ (M η^p)/(c_k β^{p/γ}) (f(x_k) − f*)^{p/γ},

so if p > γ, the convergence rate is superlinear. In particular, in the special
case where f is strongly convex so that γ = 2, {f(x_k)} converges to f*
superlinearly when p > 2 [superlinear convergence is also attained if p = 2
and c_k → ∞, cf. Prop. 5.1.4(c)].
The dual of the proximal algorithm (6.169)-(6.170) is an augmented
Lagrangian method obtained via Fenchel duality. Computational experience
with this method [KoB76] has shown that indeed its asymptotic convergence
rate is very fast when p > γ. At the same time the use of order p > 2
regularization rather than p = 2 may lead to complications, because of the
diminished regularization near 0. Such complications may be serious if a first
order method is used for the proximal minimization. If on the other hand a
Newton-like method can be used, much better results may be obtained with
p > 2.
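
The superlinear effect is easy to observe numerically. The sketch below is a toy one-dimensional instance of our own (f(x) = x², so γ = 2), comparing p = 2 with p = 4 under a constant c_k; the inner minimization is done with a generic scalar solver.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_power_step(x_k, p, c=1.0):
    # x_{k+1} = argmin_x { x^2 + (1/(c*p)) * |x - x_k|^p }
    obj = lambda x: x**2 + abs(x - x_k) ** p / (c * p)
    return minimize_scalar(obj, bounds=(-1.0, 1.0), method="bounded").x

for p in (2, 4):
    x, errors = 0.5, []
    for k in range(8):
        x = prox_power_step(x, p)
        errors.append(x**2)            # f(x_k) - f*, since f* = 0
    print(f"p = {p}:", ["%.2e" % e for e in errors])
# With p = 4 > gamma = 2 the cost errors drop superlinearly; with p = 2 and constant
# c_k the decrease is only linear, consistent with the analysis above.
```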

Example 6.6.6: (Mirror Descent Method)

Consider the general problem

minimize f(x)
subject to x ∈ X,

where f : ℜ^n ↦ ℜ is a convex function, and X is a closed convex set. We
noted earlier that the subgradient projection method

x_{k+1} = P_X(x_k − α_k ∇f(x_k)),

where ∇f(x_k) is a subgradient of f at x_k and P_X(·) denotes projection on X,
can equivalently be written as

x_{k+1} ∈ arg min_{x∈X} { ∇f(x_k)'(x − x_k) + (1/(2α_k)) ‖x − x_k‖² };

cf. Prop. 6.1.4. In this form the method resembles the proximal algorithm,
the difference being that f(x) is replaced by its linearized version

f(x_k) + ∇f(x_k)'(x − x_k),

and the stepsize α_k plays the role of the penalty parameter.
If we also replace the quadratic term (1/(2α_k))‖x − x_k‖²
with a nonquadratic proximal term D_k(x, x_k), we obtain a version of the
subgradient projection method, called mirror descent. It has the form

x_{k+1} ∈ arg min_{x∈X} { ∇f(x_k)'(x − x_k) + D_k(x, x_k) }.

One advantage of this method is that using in place of f its linearization may
simplify the minimization above for a problem with special structure.
As an example, consider the minimization of f(x) over the unit simplex

X = { x ≥ 0 | Σ_{i=1}^n x^i = 1 }.

A special case of the mirror descent method, called entropic descent, uses the
entropy regularization function of Example 6.6.1 and has the form

x_{k+1} ∈ arg min_{x∈X} { ∇f(x_k)'(x − x_k) + (1/α_k) Σ_{i=1}^n x^i ln(x^i/x_k^i) },

where ∇_i f(x_k) are the components of ∇f(x_k). It can be verified that this
minimization can be done in closed form as follows:

x_{k+1}^i = x_k^i e^{−α_k ∇_i f(x_k)} / Σ_{j=1}^n x_k^j e^{−α_k ∇_j f(x_k)},    i = 1, . . . , n.

Thus it involves less overhead per iteration than the corresponding gradient
projection iteration, which requires projection on the unit simplex, as well as
the corresponding proximal iteration.
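
A minimal sketch of the entropic descent update for a toy linear cost over the unit simplex (our own example) is given below; note that the iteration requires only exponentiations and one normalization.

```python
import numpy as np

c = np.array([1.0, 2.0, 3.0])
grad = lambda x: c                          # gradient of the linear cost f(x) = c'x

def entropic_step(x, alpha):
    w = x * np.exp(-alpha * grad(x))        # closed-form minimizer, up to normalization
    return w / w.sum()                      # normalization keeps the iterate on the simplex

x = np.ones(3) / 3.0
for k in range(100):
    x = entropic_step(x, alpha=0.5)
print(x)    # nearly all mass on the first (least-cost) coordinate
```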
When f is differentiable, the convergence properties of mirror descent
are similar to those of the gradient projection method, although depending
on the problem at hand and the nature of D_k(x, x_k) the analysis may be
more complicated. When f is nondifferentiable, an analysis similar to the
one for the subgradient projection method may be carried out; see [BeT03].
For extensions and further analysis of the method, we refer to the surveys
[JuN11a], [JuN11b], and the references quoted there.

6.7 ε-DESCENT AND EXTENDED MONOTROPIC PROGRAMMING

In this section we return to the idea of cost function descent for nondiffer-
entiable cost functions, which we discussed in Section 2.1.3. We noted there
the theoretical difficulties around the use of the steepest descent direction,
which is obtained by projection of the origin on the subdifferential. In
this section we focus on the ε-subdifferential, aiming at theoretically more
sound descent algorithms. We subsequently use these algorithms in an un-
usual way: to obtain a strong duality analysis for the extended monotropic
programming problem that we discussed in Section 4.4, in connection with
generalized polyhedral approximation.

6.7.1 ε-Subgradients

The discussion of Section 2.1.3 has indicated some of the deficiencies of
subgradients and directional derivatives: anomalies may occur near points
where the directional derivative is discontinuous, and at points of the rela-
tive boundary of the domain where the subdifferential may be empty. This
motivates an attempt to rectify these deficiencies through the use of the
ε-subdifferential, which turns out to have better continuity properties.
We recall from Section 3.3 that given a proper convex function f :
ℜ^n ↦ (−∞, ∞] and a scalar ε > 0, we say that a vector g is an ε-subgradient
of f at a point x ∈ dom(f) if

f(z) ≥ f(x) + (z − x)'g − ε,    ∀ z ∈ ℜ^n.    (6.177)

The ε-subdifferential ∂_ε f(x) is the set of all ε-subgradients of f at x, and
by convention, ∂_ε f(x) = Ø for x ∉ dom(f). It can be seen that

∂_{ε1} f(x) ⊂ ∂_{ε2} f(x)    if 0 < ε1 < ε2,

and that

∩_{ε↓0} ∂_ε f(x) = ∂f(x).

We will now discuss in more detail the properties of ε-subgradients, with a
view towards using them in cost function descent algorithms.

ε-Subgradients and Conjugate Functions

We first provide characterizations of the ε-subdifferential as a level set of
a certain conjugate function. Consider a proper convex function f : ℜ^n ↦
(−∞, ∞], and for any x ∈ dom(f), consider the x-translation of f, i.e., the
function f_x given by

f_x(d) = f(x + d) − f(x),    d ∈ ℜ^n.

The conjugate of f_x is given by

f_x*(g) = sup_{d∈ℜ^n} { d'g − f(x + d) + f(x) }.    (6.178)

Since the definition of subgradient can be written as

g ∈ ∂f(x)    if and only if    sup_{d∈ℜ^n} { g'd − f(x + d) + f(x) } ≤ 0,

we see from Eq. (6.178) that ∂f(x) can be characterized as the 0-level set
of f_x*:

∂f(x) = { g | f_x*(g) ≤ 0 }.    (6.179)

Similarly, from Eq. (6.178), we see that

∂_ε f(x) = { g | f_x*(g) ≤ ε }.    (6.180)

We will now use the preceding facts to discuss issues of nonemptiness and
compactness of ∂_ε f(x).
We first observe that, viewed as a function of d, the conjugate of f_x*
is (cl f)(x + d) − f(x) (cf. Prop. 1.6.1 of Appendix B). Hence from the
definition of conjugacy, for d = 0, we obtain

(cl f)(x) − f(x) = sup_{g∈ℜ^n} { −f_x*(g) }.

Since (cl f)(x) ≤ f(x), we see that 0 ≤ inf_{g∈ℜ^n} f_x*(g) and

inf_{g∈ℜ^n} f_x*(g) = 0    if and only if    (cl f)(x) = f(x).    (6.181)

It follows from Eqs. (6.179) and (6.180) that for every x ∈ dom(f), there
are two cases of interest:
(a) (cl f)(x) = f(x). Then, we have

∂f(x) = arg min_{g∈ℜ^n} f_x*(g) = { g | f_x*(g) = 0 },    ∂_ε f(x) = { g | f_x*(g) ≤ ε }.

In this case, ∂_ε f(x) is nonempty, although ∂f(x) may be empty.

(b) (cl f)(x) < f(x). In this case, ∂f(x) is empty, and so is ∂_ε f(x) when
ε < f(x) − (cl f)(x).

We will now summarize the main properties of the ε-subdifferential in
the following proposition. Part (b) is illustrated in Fig. 6.7.1 and will be the
basis for the ε-descent method to be introduced shortly. Note the relation
of this part with the formula for the support function of the subdifferential,
given in Prop. 3.1.1(a) of Chapter 3 and Prop. 5.4.8 of Appendix B.

Proposition 6.7.1: Let f : ℜ^n ↦ (−∞, ∞] be a proper convex func-
tion and let ε be a positive scalar. For every x ∈ dom(f), the following
hold:
(a) The ε-subdifferential ∂_ε f(x) is a closed convex set.
(b) If (cl f)(x) = f(x), then ∂_ε f(x) is nonempty and its support
function is given by

σ_{∂_ε f(x)}(d) = sup_{g∈∂_ε f(x)} d'g = inf_{α>0} ( f(x + αd) − f(x) + ε ) / α,    d ∈ ℜ^n.

Figure 6.7.1. Illustration of the ε-subdifferential of a closed convex function
f : ℜ^n ↦ (−∞, ∞], viewed along directions. The figure shows the function f along
a direction d, starting at a point x ∈ dom(f), i.e., the one-dimensional function

F_d(α) = f(x + αd).

As Prop. 6.7.1(b) shows, the minimal and maximal slopes of planes that support
the graph of F_d and pass through (0, f(x) − ε) are

inf_{g∈∂_ε f(x)} d'g    and    sup_{g∈∂_ε f(x)} d'g.

These are also the two endpoints of the ε-subdifferential ∂_ε F_d(0).

(c) If f is real-valued, ∂_ε f(x) is nonempty and compact.

Proof: (a) We have shown that ∂_ε f(x) is the ε-level set of the function f_x*
[cf. Eq. (6.180)]. Since f_x* is closed and convex, being a conjugate function,
∂_ε f(x) is closed and convex.
(b) By Eqs. (6.180) and (6.181), ∂_ε f(x) is the ε-level set of f_x*, while

inf_{g∈ℜ^n} f_x*(g) = 0.

It follows that ∂_ε f(x) is nonempty. Furthermore, by the discussion of sup-
port functions in Section 1.6 of Appendix B, the support function σ_{∂_ε f(x)}
of ∂_ε f(x) is the closed function generated by the conjugate of f_x* − ε, which
is f_x + ε. Thus, epi(σ_{∂_ε f(x)}) consists of the origin and the set

∪_{α>0} α^{−1} epi(f_x + ε) = ∪_{α>0} { (α^{−1}z, α^{−1}w) | f_x(z) + ε ≤ w }.

Hence,

σ_{∂_ε f(x)}(d) = inf_{α^{−1}z=d, f_x(z)+ε≤w, α>0} α^{−1}w = inf_{α>0} α^{−1}( f_x(αd) + ε ),

which is the desired result.

(c) If f is real-valued then, by Prop. 3.1.1(a), ∂f(x) is nonempty and com-
pact for all x, so the 0-level set of f_x* is nonempty and compact, implying
that all level sets are compact. Since by Eq. (6.180), ∂_ε f(x) is the ε-level
set of f_x*, it is nonempty and compact. Q.E.D.
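
The support function formula of Prop. 6.7.1(b) also gives a practical way to compute the endpoints of the ε-subdifferential of a scalar convex function. A small numerical sketch for the toy example f(x) = |x| (our own choice) follows; for x = 1 and ε = 0.1, the definition (6.177) gives ∂_ε f(1) = [0.9, 1], which the computation reproduces.

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda t: abs(t)
x, eps = 1.0, 0.1

def support(d, a_max=1e4):
    # sigma_{d_eps f(x)}(d) = inf_{a>0} (f(x + a*d) - f(x) + eps)/a,  cf. Prop. 6.7.1(b)
    q = lambda a: (f(x + a * d) - f(x) + eps) / a
    return minimize_scalar(q, bounds=(1e-9, a_max), method="bounded").fun

lo, hi = -support(-1.0), support(+1.0)
print(lo, hi)    # approximately 0.9 and 1.0
```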

6.7.2 ε-Descent Method

We will now discuss an iterative cost function descent algorithm that uses
ε-subgradients. Let f : ℜ^n ↦ (−∞, ∞] be a proper convex function to be
minimized.
We say that a direction d is an ε-descent direction at x ∈ dom(f),
where ε is a positive scalar, if

inf_{α>0} f(x + αd) < f(x) − ε,

or in words, there is a guarantee of a reduction of at least ε of the value of
f along the direction d. Note that by Prop. 6.7.1(b), assuming
(cl f)(x) = f(x),
we have

σ_{∂_ε f(x)}(d) = sup_{g∈∂_ε f(x)} d'g = inf_{α>0} ( f(x + αd) − f(x) + ε ) / α,    d ∈ ℜ^n,

so

d is an ε-descent direction    if and only if    sup_{g∈∂_ε f(x)} d'g < 0,    (6.182)

as illustrated in Fig. 6.7.2. Part (b) of the following proposition shows how
to obtain an ε-descent direction if this is at all possible.

Proposition 6.7.2: Let f : ℜ^n ↦ (−∞, ∞] be a proper convex func-
tion, let ε be a positive scalar, and let x ∈ dom(f) be such that
(cl f)(x) = f(x). Then:
(a) We have 0 ∈ ∂_ε f(x) if and only if

f(x) ≤ inf_{z∈ℜ^n} f(z) + ε.

(b) We have 0 ∉ ∂_ε f(x) if and only if there exists an ε-descent direc-
tion. In particular, if 0 ∉ ∂_ε f(x), the vector −ḡ, where

ḡ ∈ arg min_{g∈∂_ε f(x)} ‖g‖,

is an ε-descent direction.

Figure 6.7.2. Illustration of the connection between ∂_ε f(x) and ε-descent direc-
tions [cf. Eq. (6.182)]. In the figure on the left, we have

f(x) ≤ inf_{z∈ℜ^n} f(z) + ε,

or equivalently, that the horizontal hyperplane [normal (0, 1)] that passes through
(x, f(x) − ε) contains the epigraph of f in its upper halfspace, or equivalently,
that 0 ∈ ∂_ε f(x). In this case there is no ε-descent direction. In the figure on the
right, d is an ε-descent direction because the slope shown is negative [cf. Prop.
6.7.1(b)].

Proof: (a) By definition, 0 ∈ ∂_ε f(x) if and only if f(z) ≥ f(x) − ε for all
z ∈ ℜ^n, which is equivalent to inf_{z∈ℜ^n} f(z) + ε ≥ f(x).
(b) If there exists an ε-descent direction, then by part (a), we have 0 ∉
∂_ε f(x). Conversely, assume that 0 ∉ ∂_ε f(x). The vector ḡ is the projection
of the origin on the closed convex set ∂_ε f(x), which is nonempty in view
of the assumption (cl f)(x) = f(x) [cf. Prop. 6.7.1(b)]. By the Projection
Theorem (Prop. 1.1.9 in Appendix B),

0 ≤ (g − ḡ)'ḡ,    ∀ g ∈ ∂_ε f(x),

or

sup_{g∈∂_ε f(x)} (−ḡ)'g ≤ −‖ḡ‖² < 0,

where the last inequality follows from the hypothesis 0 ∉ ∂_ε f(x). By Eq.
(6.182), this implies that −ḡ is an ε-descent direction. Q.E.D.

The preceding proposition contains the elements of an iterative algo-
rithm for minimizing f to within a tolerance of ε. This algorithm, called
ε-descent method, is similar to the steepest descent method briefly discussed
in Section 2.1.3, but is more general since it applies to extended real-valued
functions, and has more sound theoretical convergence properties. At the
kth iteration, it stops if there is no ε-descent direction, in which case x_k is
an ε-optimal solution; otherwise it sets

x_{k+1} = x_k + α_k d_k,    (6.183)

where d_k is an ε-descent direction and α_k is a positive stepsize that reduces
the cost function by at least ε:

f(x_k + α_k d_k) ≤ f(x_k) − ε.

Thus the algorithm is guaranteed to find an ε-optimal solution, assuming
that f is bounded below, and to yield a sequence {x_k} with f(x_k) → −∞,
if f is unbounded below.
Note that by Prop. 6.7.2(a), the algorithm will stop if and only if
0 ∈ ∂_ε f(x_k). This can be checked by finding the projection g_k of the
origin onto ∂_ε f(x_k) to determine whether g_k = 0. If, however, g_k ≠ 0,
then by Prop. 6.7.2(b), −g_k is an ε-descent direction, and can be used as
the direction d_k in the iteration (6.183). In general, however, any ε-descent
direction d_k can be used, and to verify the ε-descent property, it is sufficient
to check that

sup_{g∈∂_ε f(x_k)} d_k'g < 0

[cf. Eq. (6.182)].
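
The following sketch puts these elements together for a one-dimensional toy function of our own, computing the ε-subdifferential interval numerically via Prop. 6.7.1(b), projecting the origin on it, and performing the iteration (6.183) with an exact line search along d_k = −g_k.

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: abs(x - 1.0) + 0.5 * (x - 1.0) ** 2
eps = 0.05

def interval(x):
    slope = lambda d: minimize_scalar(lambda a: (f(x + a * d) - f(x) + eps) / a,
                                      bounds=(1e-8, 100.0), method="bounded").fun
    return -slope(-1.0), slope(1.0)   # endpoints of d_eps f(x), cf. Prop. 6.7.1(b)

x = 5.0
for k in range(1000):
    lo, hi = interval(x)
    g = min(max(0.0, lo), hi)         # projection of the origin on [lo, hi]
    if g == 0.0:                      # 0 in d_eps f(x): x is eps-optimal (Prop. 6.7.2(a))
        break
    step = minimize_scalar(lambda a: f(x - a * g), bounds=(0.0, 100.0), method="bounded").x
    x -= step * g                     # cost decreases by more than eps at each iteration
print("eps-optimal point:", x)
```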


A drawback of the t-descent method is that at the typical iteration,
an explicit representation of 8,f(xk) may be needed, which can be hard
to obtain. This motivates a variant where 8d(xk) is approximated by a
set A(xk) that can be computed more easily than 8,f(xk)- In this variant,
given A(xk), the direction used in iteration (6.183) is dk = -gk, where 9k
is the projection of the origin onto A(xk)- One may consider two types of
methods:
(a) Outer approximation methods: Here 8,J(xk) is approximated by a
set A(x) such that

where 'Y is a scalar with 'Y > 1. If 9k = 0 [equivalently O E A(xk)], the


method stops, and from Prop. 6.7.2(a), it follows that Xk is within
"(t of being optimal. If 9k # 0, it follows from Prop. 6.7.2(b) that by
suitable choice of the stepsize O'.k, we can move along the direction
dk = -gk to decrease the cost function by at least t. Thus for a fixed
t > and assuming that f is bounded below, the method is guaran-
teed to terminate in a finite number of iterations with a "(t-optimal
solution. Aside from its computational value, the method will also
be used for analytical purposes, to establish strong duality results for
extended monotropic programming problems in the next section.

(b) Inner approximation methods: Here A(x_k) is the convex hull of a
finite number of ε-subgradients at x_k, and hence it is a subset of
∂_ε f(x_k). One method of this type builds incrementally the approx-
imation A(x_k) of the ε-subdifferential, one element at a time, but
does not need an explicit representation of the full ε-subdifferential
∂_ε f(x) or even the subdifferential ∂f(x). Instead it requires that we
be able to compute a single (arbitrary) element of ∂f(x) at any x.
This method was proposed in [Lem74] and is described in detail in
[HiL93]. We will not consider it further in this book.

ε-Descent Based on Outer Approximation

We will now discuss outer approximation implementations of the ε-descent
method. The idea is to replace ∂_ε f(x) by an outer approximation A(x) that
is easier to compute and/or easier to project on. To achieve this aim, it is
typically necessary to exploit some special structure of the cost function. In
the following, we restrict ourselves to an outer approximation method for
the important case where f consists of a sum of functions f = f_1 + · · · + f_m.
The next proposition shows that we may use as approximation the closure
of the vector sum of the ε-subdifferentials:

A(x) = cl( ∂_ε f_1(x) + · · · + ∂_ε f_m(x) ).

[Note here that the vector sum of the closed sets ∂_ε f_i(x) need not be closed;
cf. the discussion of Section 1.4 of Appendix B.]

Proposition 6.7.3: Let f be the sum of m closed proper convex
functions f_i : ℜ^n ↦ (−∞, ∞], i = 1, . . . , m,

f(x) = f_1(x) + · · · + f_m(x),

and let ε be a positive scalar. For any vector x ∈ dom(f), we have

∂_ε f(x) ⊂ cl( ∂_ε f_1(x) + · · · + ∂_ε f_m(x) ) ⊂ ∂_{mε} f(x).    (6.184)

Proof: We first note that, by Prop. 6.7.1(b), the ε-subdifferentials ∂_ε f_i(x)
are nonempty. Let g_i ∈ ∂_ε f_i(x) for i = 1, . . . , m. Then we have

f_i(z) ≥ f_i(x) + g_i'(z − x) − ε,    ∀ z ∈ ℜ^n, i = 1, . . . , m.

By adding over all i, we obtain

f(z) ≥ f(x) + (g_1 + · · · + g_m)'(z − x) − mε,    ∀ z ∈ ℜ^n.

Hence g_1 + · · · + g_m ∈ ∂_{mε} f(x), and it follows that

∂_ε f_1(x) + · · · + ∂_ε f_m(x) ⊂ ∂_{mε} f(x).

Since ∂_{mε} f(x) is closed, this proves the right-hand side of Eq. (6.184).
To prove the left-hand side of Eq. (6.184), assume to arrive at a
contradiction, that there exists a g ∈ ∂_ε f(x) such that

g ∉ cl( ∂_ε f_1(x) + · · · + ∂_ε f_m(x) ).

By the Strict Separation Theorem (Prop. 1.5.3 in Appendix B), there exists
a hyperplane strictly separating g from the set cl( ∂_ε f_1(x) + · · · + ∂_ε f_m(x) ).
Thus, there exist a vector d and a scalar b such that

d'(g_1 + · · · + g_m) < b < d'g,    ∀ g_1 ∈ ∂_ε f_1(x), . . . , g_m ∈ ∂_ε f_m(x).

From this we obtain

sup_{g_1∈∂_ε f_1(x)} d'g_1 + · · · + sup_{g_m∈∂_ε f_m(x)} d'g_m < d'g,

and by Prop. 6.7.1(b),

inf_{α>0} ( f_1(x + αd) − f_1(x) + ε )/α + · · · + inf_{α>0} ( f_m(x + αd) − f_m(x) + ε )/α < d'g.

Let α_1, . . . , α_m be positive scalars such that

( f_1(x + α_1 d) − f_1(x) + ε )/α_1 + · · · + ( f_m(x + α_m d) − f_m(x) + ε )/α_m < d'g,    (6.185)

and let

ᾱ = 1 / ( 1/α_1 + · · · + 1/α_m ).

By the convexity of f_i, the ratio ( f_i(x + αd) − f_i(x) )/α is monotonically
nondecreasing in α. Thus, since α_i ≥ ᾱ, we have

( f_i(x + α_i d) − f_i(x) )/α_i ≥ ( f_i(x + ᾱd) − f_i(x) )/ᾱ,    i = 1, . . . , m,

and from Eq. (6.185) and the definition of ᾱ we obtain

d'g > ( f_1(x + α_1 d) − f_1(x) + ε )/α_1 + · · · + ( f_m(x + α_m d) − f_m(x) + ε )/α_m
    ≥ ( f_1(x + ᾱd) − f_1(x) )/ᾱ + · · · + ( f_m(x + ᾱd) − f_m(x) )/ᾱ + ε( 1/α_1 + · · · + 1/α_m )
    = ( f(x + ᾱd) − f(x) + ε )/ᾱ
    ≥ inf_{α>0} ( f(x + αd) − f(x) + ε )/α.

Since g ∈ ∂_ε f(x), this contradicts Prop. 6.7.1(b), and proves the left-hand
side of Eq. (6.184). Q.E.D.

The potential lack of closure of the set ∂_ε f_1(x) + · · · + ∂_ε f_m(x) indicates
a practical difficulty in implementing the method. In particular, in order
to find an ε-descent direction one will ordinarily minimize ‖g_1 + · · · + g_m‖
over g_i ∈ ∂_ε f_i(x), i = 1, . . . , m, but an optimal solution to this problem
may not exist. Thus, it may be difficult to check computationally whether

0 ∈ cl( ∂_ε f_1(x) + · · · + ∂_ε f_m(x) ),

which is the test for mε-optimality of x. The closure of the vector sum
∂_ε f_1(x) + · · · + ∂_ε f_m(x) may be guaranteed under various assumptions (e.g.,
the ones given in Section 1.4 of Appendix B; see also Section 6.7.4).
One may use Prop. 6.7.3 to approximate ∂_ε f(x) in cases where f is
the sum of convex functions whose ε-subdifferential is easily computed or
approximated. The following is an illustrative example.

Example 6.7.1: (ε-Descent for Separable Problems)

Let us consider the optimization problem

minimize Σ_{i=1}^n f_i(x^i)
subject to x ∈ P,

where x = (x^1, . . . , x^n), each f_i : ℜ ↦ ℜ is a convex function of the scalar
component x^i, and P is a polyhedral set of the form

P = P_1 ∩ · · · ∩ P_r,

with

P_j = { x | a_j'x ≤ b_j },    j = 1, . . . , r,

for some vectors a_j and scalars b_j. We can write this problem as

minimize Σ_{i=1}^n f_i(x^i) + Σ_{j=1}^r δ_{P_j}(x)
subject to x ∈ ℜ^n,

where δ_{P_j} is the indicator function of P_j.
The ε-subdifferential of the cost function is not easily calculated, but
can be approximated by a vector sum of intervals. In particular, it can be
verified, using the definition, that the ε-subdifferential of δ_{P_j} at a point x ∈ P_j is

∂_ε δ_{P_j}(x) = { µ a_j | µ ≥ 0, µ(b_j − a_j'x) ≤ ε },

which is an interval in ℜ^n. Similarly, it can be seen that ∂_ε f_i(x^i) is a compact
interval of the ith axis. Thus

Σ_{i=1}^n ∂_ε f_i(x^i) + Σ_{j=1}^r ∂_ε δ_{P_j}(x)

is a vector sum of intervals, which is a polyhedral set and is therefore closed.
Thus by Prop. 6.7.3, it can be used as an outer approximation of ∂_ε f(x).
At each iteration of the corresponding ε-descent method, projection on this
vector sum to obtain an ε-descent direction requires solution of a quadratic
program, which depending on the problem, may be tractable.
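
As an illustration, the sketch below computes one ε-descent direction for a toy instance of the separable problem above (quadratic f_i and a single linear constraint, both our own choices). The projection of the origin on the vector sum of intervals is a small bound-constrained quadratic program.

```python
import numpy as np
from scipy.optimize import minimize

eps = 0.1
c = np.array([1.0, 2.0])                      # f_i(x^i) = (x^i - c_i)^2
a, b = np.array([1.0, 1.0]), 2.0              # single constraint a'x <= b
x = np.array([0.5, 1.0])                      # current, strictly feasible iterate

# eps-subdifferential of f_i at x^i is the interval [2(x^i - c_i) - 2*sqrt(eps), 2(x^i - c_i) + 2*sqrt(eps)]
lo = 2 * (x - c) - 2 * np.sqrt(eps)
hi = 2 * (x - c) + 2 * np.sqrt(eps)
mu_max = eps / (b - a @ x)                    # indicator term: {mu*a : 0 <= mu <= mu_max}

def sq_norm(t):                               # t = (g_1, g_2, mu); minimize ||g + mu*a||^2
    return np.sum((t[:2] + t[2] * a) ** 2)

t0 = np.concatenate([(lo + hi) / 2, [0.5 * mu_max]])
res = minimize(sq_norm, t0, bounds=[(lo[0], hi[0]), (lo[1], hi[1]), (0.0, mu_max)])
w = res.x[:2] + res.x[2] * a                  # projection of the origin on the outer approximation
print("w =", w, " eps-descent direction d = -w =", -w)
# If w were 0, the current x would be (3*eps)-optimal (m = 3 summands here), cf. Prop. 6.7.3.
```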

6.7.3 Extended Monotropic Programming Duality

We now return to the Extended Monotropic Program of Section 4.4 (EMP
for short):

minimize Σ_{i=1}^m f_i(x_i)    (6.186)
subject to x ∈ S,

where

x = (x_1, . . . , x_m)

is a vector in ℜ^{n_1+···+n_m}, with components x_i ∈ ℜ^{n_i}, i = 1, . . . , m, and
f_i : ℜ^{n_i} ↦ (−∞, ∞] is a closed proper convex function for each i, and
S is a subspace of ℜ^{n_1+···+n_m}.
The dual problem was derived in Section 4.4. It has the form

minimize Σ_{i=1}^m f_i*(λ_i)    (6.187)
subject to λ ∈ S⊥.

In this section we will use the ε-descent method as an analytical tool to
obtain conditions for strong duality.
Let f* and q* be the optimal values of the primal and dual problems
(6.186) and (6.187), respectively, and note that by weak duality, we have
q* ≤ f*. Let us introduce the functions f̃_i : ℜ^{n_1+···+n_m} ↦ (−∞, ∞] of the
vector x = (x_1, . . . , x_m), defined by

f̃_i(x) = f_i(x_i),    i = 1, . . . , m.

Note that the ε-subdifferentials of f̃_i and f_i are related by

∂_ε f̃_i(x) = { (0, . . . , 0, λ_i, 0, . . . , 0) | λ_i ∈ ∂_ε f_i(x_i) },    i = 1, . . . , m,    (6.188)

where the nonzero element in (0, . . . , 0, λ_i, 0, . . . , 0) is in the ith position.
The following proposition gives conditions for strong duality.

Proposition 6.7.4: (EMP Strong Duality) Assume that the EMP
(6.186) is feasible, and that for all feasible solutions x and all ε > 0,
the set

T(x, ε) = S⊥ + ∂_ε f̃_1(x) + · · · + ∂_ε f̃_m(x)

is closed. Then q* = f*.

Proof: If f* = −∞, then q* = f* by weak duality, so we may assume
that f* > −∞. Let X denote the feasible region of the primal problem:

X = { x ∈ S | x_i ∈ dom(f_i), i = 1, . . . , m }.

We apply the ε-descent method based on outer approximation of the ε-sub-
differential to the minimization of the function

Σ_{i=1}^m f̃_i(x) + δ_S(x) = Σ_{i=1}^m f_i(x_i) + δ_S(x),

where δ_S is the indicator function of S, for which ∂_ε δ_S(x) = S⊥ for all
x ∈ S and ε > 0. In this method, we start with a vector x_0 ∈ X, and
we generate a sequence {x_k} ⊂ X. At the kth iteration, given the current
iterate x_k, we find the vector of minimum norm w_k on the set T(x_k, ε)
(which is closed by assumption). If w_k = 0 the method stops, verifying
that 0 ∈ ∂_{(m+1)ε} f̃(x_k) [cf. the right side of Eq. (6.184)]. If w_k ≠ 0, we
generate a vector x_{k+1} ∈ X of the form x_{k+1} = x_k − α_k w_k, satisfying

f̃(x_{k+1}) < f̃(x_k) − ε;

such a vector is guaranteed to exist, since 0 ∉ T(x_k, ε) and hence 0 ∉
∂_ε f̃(x_k) by Prop. 6.7.3. Since f̃(x_k) ≥ f* and at the current stage of the
proof we have assumed that f* > −∞, the method must stop at some
iteration with a vector x̄ = (x̄_1, . . . , x̄_m) such that 0 ∈ T(x̄, ε). Thus some
vector in ∂_ε f̃_1(x̄) + · · · + ∂_ε f̃_m(x̄) must belong to S⊥. In view of Eq. (6.188),
it follows that there must exist vectors

λ_i ∈ ∂_ε f_i(x̄_i),    i = 1, . . . , m,

such that

λ = (λ_1, . . . , λ_m) ∈ S⊥.

From the definition of an ε-subgradient we have

f_i(x̄_i) ≤ x̄_i'λ_i − f_i*(λ_i) + ε,    i = 1, . . . , m,

and by adding over i, and using the fact x̄ ∈ S and λ ∈ S⊥, we obtain

Σ_{i=1}^m f_i(x̄_i) ≤ − Σ_{i=1}^m f_i*(λ_i) + mε.

Since x̄ is primal feasible and −Σ_{i=1}^m f_i*(λ_i) is the dual value at λ, it follows
that

f* ≤ q* + mε.

Taking the limit as ε → 0, we obtain f* ≤ q*, and using also the weak
duality relation q* ≤ f*, we obtain f* = q*. Q.E.D.

6.7.4 Special Cases of Strong Duality

We now delineate some special cases where the assumptions for strong
EMP duality of Prop. 6.7.4 are satisfied. We first note that in view of
Eq. (6.188), the set ∂_ε f̃_i(x) is compact if ∂_ε f_i(x_i) is compact, and it is
polyhedral if ∂_ε f_i(x_i) is polyhedral. Since the vector sum of a compact
set and a polyhedral set is closed (see the discussion at the end of Section
1.4 of Appendix B), it follows that if each of the sets ∂_ε f_i(x_i) is either
compact or polyhedral, then T(x, ε) is closed, and by Prop. 6.7.4, we have
q* = f*. Furthermore, from Prop. 5.4.1 of Appendix B, ∂f_i(x_i) and hence
also ∂_ε f_i(x_i) is compact if x_i ∈ int(dom(f_i)) (as in the case where f_i is
real-valued). Moreover ∂_ε f_i(x_i) is polyhedral if f_i is polyhedral [being the
level set of a polyhedral function, cf. Eq. (6.180)]. There are some other
interesting special cases where ∂_ε f_i(x_i) is polyhedral, as we now describe.
One such special case is when f_i depends on a single scalar component
of x, as in the case of a monotropic programming problem. The following
definition introduces a more general case.

Definition 6.7.1: We say that a closed proper convex function h :
ℜ^n ↦ (−∞, ∞] is essentially one-dimensional if it has the form

h(x) = h̄(a'x),

where a is a vector in ℜ^n and h̄ : ℜ ↦ (−∞, ∞] is a scalar closed
proper convex function.

The following proposition establishes the main associated property


for our purposes.

Proposition 6.7.5: Let h : ℜ^n ↦ (−∞, ∞] be a closed proper convex
essentially one-dimensional function. Then for all x ∈ dom(h) and
ε > 0, the ε-subdifferential ∂_ε h(x) is nonempty and polyhedral.

Proof: Let h(x) = h̄(a'x), where a is a vector in ℜ^n and h̄ is a scalar
closed proper convex function. If a = 0, then h is a constant function, and
∂_ε h(x) is equal to {0}, a polyhedral set. Thus, we may assume that a ≠ 0.
We note that λ ∈ ∂_ε h(x) if and only if

h̄(a'z) ≥ h̄(a'x) + (z − x)'λ − ε,    ∀ z ∈ ℜ^n.

Writing λ in the form λ = ξa + v with ξ ∈ ℜ and v ⊥ a, we have

h̄(a'z) ≥ h̄(a'x) + (z − x)'(ξa + v) − ε,    ∀ z ∈ ℜ^n,

and by taking z = γa + δv with γ, δ ∈ ℜ and γ such that γ‖a‖² ∈ dom(h̄),
we obtain for all δ ∈ ℜ

h̄(γ‖a‖²) ≥ h̄(a'x) + (γa + δv − x)'λ − ε = h̄(a'x) + (γa − x)'λ − ε + δ v'λ.

Since v'λ = ‖v‖² and δ can be arbitrarily large, this relation implies that
v = 0, so it follows that every λ ∈ ∂_ε h(x) must be a scalar multiple of
a. Since ∂_ε h(x) is also a closed convex set, it must be a nonempty closed
interval in ℜ^n, and hence is polyhedral. Q.E.D.

Another interesting special case is described in the following defini-


tion.

Definition 6.7.2: We say that a closed proper convex function h :
ℜ^n ↦ (−∞, ∞] is domain one-dimensional if the affine hull of dom(h)
is either a single point or a line, i.e.,

aff(dom(h)) = { γa + b | γ ∈ ℜ },

where a and b are some vectors in ℜ^n.

The following proposition parallels Prop. 6.7.5.

Proposition 6.7.6: Let h : ℜ^n ↦ (−∞, ∞] be a closed proper convex
domain one-dimensional function. Then for all x ∈ dom(h) and ε > 0,
the ε-subdifferential ∂_ε h(x) is nonempty and polyhedral.

Proof: Denote by a and b the vectors associated with the domain of h
as per Definition 6.7.2. We note that for γ̄a + b ∈ dom(h), we have λ ∈
∂_ε h(γ̄a + b) if and only if

h(γa + b) ≥ h(γ̄a + b) + (γ − γ̄)a'λ − ε,    ∀ γ ∈ ℜ,

or equivalently, if and only if a'λ ∈ ∂_ε h̄(γ̄), where h̄ is the one-dimensional
convex function

h̄(γ) = h(γa + b),    γ ∈ ℜ.

Thus,

∂_ε h(γ̄a + b) = { λ | a'λ ∈ ∂_ε h̄(γ̄) }.

Since ∂_ε h̄(γ̄) is a nonempty closed interval (h̄ is closed because h is), it
follows that ∂_ε h(γ̄a + b) is nonempty and polyhedral [if a = 0, it is equal to
ℜ^n, and if a ≠ 0, it is the vector sum of two polyhedral sets: the interval
{ γa | γ‖a‖² ∈ ∂_ε h̄(γ̄) } and the subspace that is orthogonal to a]. Q.E.D.

By combining the preceding two propositions with Prop. 6.7.4, we


obtain the following.

Proposition 6.7.7: Assume that the EMP (6.186) is feasible, and
that each function f_i is real-valued, or is polyhedral, or is essentially
one-dimensional, or is domain one-dimensional. Then q* = f*.

It turns out that there is a conjugacy relation between essentially one-
dimensional functions and domain one-dimensional functions such that the
affine hull of their domain is a subspace. This is shown in the following
proposition, which establishes a somewhat more general connection, needed
for our purposes.

Proposition 6.7.8:
(a) The conjugate of an essentially one-dimensional function is a do-
main one-dimensional function such that the affine hull of its
domain is a subspace.
(b) The conjugate of a domain one-dimensional function is the sum
of an essentially one-dimensional function and a linear function.

Proof: (a) Let h : ℜ^n ↦ (−∞, ∞] be essentially one-dimensional, so that

h(x) = h̄(a'x),

where a is a vector in ℜ^n and h̄ : ℜ ↦ (−∞, ∞] is a scalar closed proper
convex function. If a = 0, then h is a constant function, so its conjugate
is domain one-dimensional, since its domain is {0}. We may thus assume
that a ≠ 0. We claim that the conjugate

h*(λ) = sup_{x∈ℜ^n} { λ'x − h̄(a'x) },    (6.189)

takes infinite values if λ is outside the one-dimensional subspace spanned by
a, implying that h* is domain one-dimensional with the desired property.
Indeed, let λ be of the form λ = ζa + v, where ζ is a scalar, and v is a
nonzero vector with v ⊥ a. If we take x = γa + δv in Eq. (6.189), where γ
is such that γ‖a‖² ∈ dom(h̄), we obtain

h*(λ) = sup_{x∈ℜ^n} { λ'x − h̄(a'x) }
      ≥ sup_{δ∈ℜ} { (ζa + v)'(γa + δv) − h̄(γ‖a‖²) }
      = ζγ‖a‖² − h̄(γ‖a‖²) + sup_{δ∈ℜ} { δ‖v‖² },

so it follows that h*(λ) = ∞.

(b) Let h : ℜ^n ↦ (−∞, ∞] be domain one-dimensional, so that

aff(dom(h)) = { γa + b | γ ∈ ℜ },

for some vectors a and b. If a = b = 0, the domain of h is {0}, so its
conjugate is the function taking the constant value −h(0) and is essentially
one-dimensional. If b = 0 and a ≠ 0, then the conjugate is

h*(λ) = sup_{x∈ℜ^n} { λ'x − h(x) } = sup_{γ∈ℜ} { γa'λ − h(γa) },

so h*(λ) = h̄*(a'λ) where h̄* is the conjugate of the scalar function h̄(γ) =
h(γa). Since h̄ is closed proper convex, the same is true for h̄*, and it follows
that h* is essentially one-dimensional. Finally, consider the case where
b ≠ 0. Then we use a translation argument and write h(x) = ĥ(x − b),
where ĥ is a function such that the affine hull of its domain is the subspace
spanned by a. The conjugate of ĥ is essentially one-dimensional (by the
preceding argument), and the conjugate of h is obtained by adding b'λ to
it. Q.E.D.

We now turn to the dual problem, and derive a duality result that is
analogous to the one of Prop. 6.7.7. We say that a function is co-finite if
its conjugate is real-valued. If we apply Prop. 6. 7. 7 to the dual problem
(6.187), we obtain the following.

Proposition 6.7.9: Assume that the dual EMP (6.187) is feasible.
Assume further that each f_i is co-finite, or is polyhedral, or is essen-
tially one-dimensional, or is domain one-dimensional. Then q* = f*.

In the special case of a monotropic programming problem, where


the functions fi are essentially one-dimensional (they depend on the single
scalar component Xi), Props. 6.7.7 and 6.7.9 yield the following proposition.
This is a central result for monotropic programming.

Proposition 6.7.10: (Monotropic Programming Strong Dual-
ity) Consider the monotropic programming problem, the special case
of EMP where n_i = 1 for all i. Assume that either the problem is
feasible, or else its dual problem is feasible. Then q* = f*.

Proof: This is a consequence of Props. 6.7.7 and 6.7.9, and the fact that
when n_i = 1, the functions f_i and their conjugates f_i* are essentially
one-dimensional. Applying Prop. 6.7.7 to the primal problem shows that
q* = f* under the hypothesis that the primal problem is feasible. Applying
Prop. 6.7.9 to the dual problem shows that q* = f* under the hypothesis
that the dual problem is feasible. Q.E.D.

The preceding results can be used to establish conditions for q* = f*


in various specialized contexts, including multicommodity flow problems
(cf. Example 1.4.5); see [Ber10a].

6.8 INTERIOR POINT METHODS

Let us consider inequality constrained problems of the form

minimize f(x)
subject to x ∈ X,  g_j(x) ≤ 0,  j = 1, . . . , r,    (6.190)

where f and g_j are real-valued convex functions, and X is a closed con-
vex set. The interior (relative to X) of the set defined by the inequality
constraints is

S = { x ∈ X | g_j(x) < 0, j = 1, . . . , r },

and is assumed to be nonempty.
In interior point methods, we add to the cost a function B(x) that is
defined in the interior set S. This function, called the barrier function, is
continuous and tends to ∞ as any one of the constraints g_j(x) approaches
0 from negative values. A common example of barrier function is the
logarithmic,

B(x) = − Σ_{j=1}^r ln( −g_j(x) ).

Another example is the inverse,

B(x) = − Σ_{j=1}^r 1/g_j(x).

Note that both of these functions are convex since the constraints g_j are
convex. Figure 6.8.1 illustrates the form of B(x).

Figure 6.8.1. Form of a barrier function. The barrier term εB(x) tends to zero
for all interior points x ∈ S as ε → 0.

The barrier method is defined by introducing a parameter sequence
{ε_k} with

0 < ε_{k+1} < ε_k,  k = 0, 1, . . . ,    ε_k → 0.

It consists of finding

x_k ∈ arg min_{x∈S} { f(x) + ε_k B(x) },    k = 0, 1, . . . .    (6.191)

Since the barrier function is defined only on the interior set S, the successive
iterates of any method used for this minimization must be interior points.
If X = ℜ^n, one may use unconstrained methods such as Newton's
method with the stepsize properly selected to ensure that all iterates lie in

S. Indeed, Newton's method is often recommended for reasons that have


to do with ill-conditioning, a phenomenon that relates to the difficulty of
carrying out the minimization (6.191) (see Fig. 6.8.2 and nonlinear pro-
gramming sources such as [Ber99] for a discussion). Note that the barrier
term EkB(x) goes to zero for all interior points x ES as Ek-+ 0. Thus the
barrier term becomes increasingly inconsequential as far as interior points
are concerned, while progressively allowing Xk to get closer to the boundary
of S (as it should if the solutions of the original constrained problem lie on
the boundary of S). Figure 6.8.2 illustrates the convergence process, and
the following proposition gives the main convergence result.

Proposition 6.8.1: Every limit point of a sequence {xk} generated


by a barrier method is a global minimum of the original constrained
problem (6.190).

Proof: Let x̄ be the limit of a subsequence {x_k}_{k∈K}. If x̄ ∈ S, by the
continuity of B within S we have

lim_{k→∞, k∈K} ε_k B(x_k) = 0,

while if x̄ lies on the boundary of S, we have lim_{k→∞, k∈K} B(x_k) = ∞. In
either case we obtain

lim inf_{k→∞, k∈K} ε_k B(x_k) ≥ 0,

which implies that

lim inf_{k→∞, k∈K} { f(x_k) + ε_k B(x_k) } ≥ f(x̄).    (6.192)

The vector x̄ is a feasible point of the original problem (6.190), since x_k ∈ S
and X is a closed set. If x̄ were not a global minimum, there would exist a
feasible vector x* such that f(x*) < f(x̄) and therefore also [since by the
Line Segment Principle (Prop. 1.3.1 in Appendix B) x* can be approached
arbitrarily closely through the interior set S] an interior point x̃ ∈ S such
that f(x̃) < f(x̄). We now have by the definition of x_k,

f(x_k) + ε_k B(x_k) ≤ f(x̃) + ε_k B(x̃),    k = 0, 1, . . . ,

which by taking the limit as k → ∞ and k ∈ K, implies together with Eq.
(6.192), that f(x̄) ≤ f(x̃). This is a contradiction, thereby proving that x̄
is a global minimum of the original problem. Q.E.D.
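
The barrier method is easy to try on the example of Fig. 6.8.2 below. The following sketch (using a generic derivative-free minimizer for the inner problem, rather than the Newton iterations discussed in the text) reproduces the analytic formula x_k = (1 + √(1 + ε_k), 0) for that example.

```python
import numpy as np
from scipy.optimize import minimize

# Barrier method (6.191) for: minimize (1/2)((x1)^2 + (x2)^2) subject to x1 >= 2,
# with the logarithmic barrier B(x) = -ln(x1 - 2).
f = lambda x: 0.5 * (x[0] ** 2 + x[1] ** 2)

def barrier_min(eps, x_start):
    # the objective is +inf outside S = {x1 > 2}; Nelder-Mead tolerates this
    obj = lambda x: f(x) - eps * np.log(x[0] - 2.0) if x[0] > 2.0 else np.inf
    return minimize(obj, x_start, method="Nelder-Mead").x

x = np.array([3.0, 1.0])                 # interior starting point
for eps in [0.3, 0.03, 0.003, 3e-4]:
    x = barrier_min(eps, x)              # warm start from the previous minimizer
    print(eps, x)                        # x approaches the constrained optimum (2, 0)
```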

The idea of using a barrier function as an approximation to con-


straints has been used in several different ways, in methods that generate

Figure 6.8.2. Illustration of the level sets of the barrier-augmented cost function,
and the convergence process of the barrier method for the problem

minimize f(x) = ½((x^1)² + (x^2)²)
subject to 2 ≤ x^1,

with optimal solution x* = (2, 0). For the case of the logarithmic barrier function
B(x) = −ln(x^1 − 2), we have

x_k ∈ arg min_{x^1>2} { ½((x^1)² + (x^2)²) − ε_k ln(x^1 − 2) } = ( 1 + √(1 + ε_k), 0 ),

so as ε_k is decreased, the unconstrained minimum x_k approaches the constrained
minimum x* = (2, 0). The figure shows the level sets of f(x) + εB(x) for ε = 0.3
(left side) and ε = 0.03 (right side). As ε_k → 0, computing x_k becomes more
difficult due to ill-conditioning (the level sets become very elongated near x_k).

successive iterates lying in the interior of the constraint set. These methods
are generically referred to as interior point methods, and have been exten-
sively applied to linear, quadratic, and conic programming problems. The
logarithmic barrier function has been central in many of these methods.
In the next two sections we will discuss a few methods that are designed
for problems with special structure. In particular, in Section 6.8.1 we will
discuss in some detail primal-dual methods for linear programming, one

of the most popular methods for solving linear programs. In Section 6.8.2
we will address briefly interior point methods for conic programming prob-
lems. In Section 6.8.3 we will combine the cutting plane and interior point
approaches.

6.8.1 Primal-Dual Methods for Linear Programming

Let us consider the linear program

minimize c'x
subject to Ax = b,  x ≥ 0,    (LP)

where c ∈ ℜ^n and b ∈ ℜ^m are given vectors, and A is an m × n matrix of
rank m. The dual problem, given in Section 5.2 of Appendix B, is given by

maximize b'λ
subject to A'λ ≤ c.    (DP)

As indicated in Section 5.2 of Appendix B, (LP) has an optimal solution
if and only if (DP) has an optimal solution. Furthermore, when optimal
solutions to (LP) and (DP) exist, the corresponding optimal values are
equal.
Recall that the logarithmic barrier method involves finding for various
ε > 0,

x(ε) ∈ arg min_{x∈S} F_ε(x),    (6.193)

where

F_ε(x) = c'x − ε Σ_{i=1}^n ln x^i,

x^i is the ith component of x and S is the interior set

S = { x | Ax = b, x > 0 }.

We assume that S is nonempty and bounded.
Rather than directly minimizing F_ε(x) for small values of ε [cf. Eq.
(6.193)], we will apply Newton's method for solving the system of opti-
mality conditions for the problem of minimizing F_ε(·) over S. The salient
features of this approach are:
(a) Only one Newton iteration is carried out for each value of ε_k.

(b) For every k, the pair (x_k, λ_k) is such that x_k is an interior point of
the positive orthant, i.e., x_k > 0, while λ_k is an interior point of the
dual feasible region, i.e.,

c − A'λ_k > 0.

(However, x_k need not be primal-feasible, that is, it need not satisfy
the equation Ax = b.)
(c) Global convergence is enforced by ensuring that the expression

M_k = x_k'z_k + ‖Ax_k − b‖    (6.194)

is decreased to 0, where z_k is the vector

z_k = c − A'λ_k.

The expression (6.194) may be viewed as a merit function, and con-
sists of two nonnegative terms: the first term is x_k'z_k, which is posi-
tive (since x_k > 0 and z_k > 0) and can be written as

x_k'z_k = x_k'(c − A'λ_k) = c'x_k − b'λ_k + (b − Ax_k)'λ_k.

Thus when x_k is primal-feasible (Ax_k = b), x_k'z_k is equal to the dual-
ity gap, that is, the difference between the primal and the dual costs,
c'x_k − b'λ_k. The second term is the norm of the primal constraint
violation ‖Ax_k − b‖. In the method to be described, neither of the
terms x_k'z_k and ‖Ax_k − b‖ may increase at each iteration, so that
M_{k+1} ≤ M_k (and typically M_{k+1} < M_k) for all k. If we can show
that M_k → 0, then asymptotically both the duality gap and the pri-
mal constraint violation will be driven to zero. Thus every limit point
of {(x_k, λ_k)} will be a pair of primal and dual optimal solutions, in
view of the duality relation

min_{Ax=b, x≥0} c'x = max_{A'λ≤c} b'λ,

given in Section 5.2 of Appendix B.

Let us write the necessary and sufficient conditions for (x, λ) to be a
primal and dual optimal solution pair for the problem of minimizing the
barrier function F_ε(x) subject to Ax = b. They are

c − εx^{−1} − A'λ = 0,    Ax = b,    (6.195)

where x^{−1} denotes the vector with components (x^i)^{−1}. Let z be the vector
of slack variables

z = c − A'λ.

Note that λ is dual feasible if and only if z ≥ 0.
Using the vector z, we can write the first condition of Eq. (6.195) as
z − εx^{−1} = 0 or, equivalently, XZe = εe, where X and Z are the diagonal

matrices with the components of x and z, respectively, along the diagonal,
and e is the vector with unit components,

X = diag(x^1, . . . , x^n),    Z = diag(z^1, . . . , z^n),    e = (1, . . . , 1)'.

Thus the optimality conditions (6.195) can be written in the equiva-
lent form

XZe = εe,    (6.196)
Ax = b,    (6.197)
z + A'λ = c.    (6.198)

Given (x, λ, z) satisfying z + A'λ = c, and such that x > 0 and z > 0, a
Newton iteration for solving this system is

x(α, ε) = x + αΔx,    λ(α, ε) = λ + αΔλ,    z(α, ε) = z + αΔz,    (6.199)

where α is a stepsize such that 0 < α ≤ 1 and

x(α, ε) > 0,    z(α, ε) > 0,

and the Newton increment (Δx, Δλ, Δz) solves the linearized version of
the system (6.196)-(6.198)

XΔz + ZΔx = −v,    (6.200)
AΔx = b − Ax,    (6.201)
Δz + A'Δλ = 0,    (6.202)

with v defined by

v = XZe − εe.    (6.203)

After a straightforward calculation, the solution of the linearized sys-
tem (6.200)-(6.202) can be written as

Δλ = (AZ^{−1}XA')^{−1}(AZ^{−1}v + b − Ax),    (6.204)

Δz = −A'Δλ,    (6.205)

Δx = −Z^{−1}v − Z^{−1}XΔz.

Note that λ(α, ε) is dual feasible, since from Eq. (6.202) and the condition
z + A'λ = c, we see that

z(α, ε) + A'λ(α, ε) = c.

Note also that if α = 1, i.e., a pure Newton step is used, x(α, ε) is primal
feasible, since from Eq. (6.201) we have A(x + Δx) = b.

Merit Function Improvement

We will now evaluate the changes in the constraint violation and the merit
function (6.194) induced by the Newton iteration.
By using Eqs. (6.199) and (6.201), the new constraint violation is
given by

Ax(α, ε) − b = Ax + αAΔx − b = Ax + α(b − Ax) − b = (1 − α)(Ax − b).    (6.206)

Thus, since 0 < α ≤ 1, the new norm of constraint violation ‖Ax(α, ε) − b‖
is always no larger than the old one. Furthermore, if x is primal-feasible
(Ax = b), the new iterate x(α, ε) is also primal-feasible.
The inner product

p = x'z    (6.207)

after the iteration becomes

p(α, ε) = x(α, ε)'z(α, ε)
        = (x + αΔx)'(z + αΔz)    (6.208)
        = x'z + α(x'Δz + z'Δx) + α²Δx'Δz.

From Eqs. (6.201) and (6.205) we have

Δx'Δz = (Ax − b)'Δλ,

while by premultiplying Eq. (6.200) with e' and using the definition (6.203)
for v, we obtain

x'Δz + z'Δx = −e'v = nε − x'z.

By substituting the last two relations in Eq. (6.208) and by using also the
expression (6.207) for p, we see that

p(α, ε) = p − α(p − nε) + α²(Ax − b)'Δλ.    (6.209)

Let us now denote by M and M(α, ε) the value of the merit function
(6.194) before and after the iteration, respectively. We have by using the
expressions (6.206) and (6.209),

M(α, ε) = p(α, ε) + ‖Ax(α, ε) − b‖
        = p − α(p − nε) + α²(Ax − b)'Δλ + (1 − α)‖Ax − b‖,

or

M(α, ε) = M − α(p − nε + ‖Ax − b‖) + α²(Ax − b)'Δλ.

Thus if ε is chosen to satisfy

ε < p/n,

and α is chosen to be small enough so that the second order term α²(Ax −
b)'Δλ is dominated by the first order term α(p − nε), the merit function
will be improved as a result of the iteration.

A General Class of Primal-Dual Algorithms

Let us consider now the general class of algorithms of the form

x_{k+1} = x(α_k, ε_k),    λ_{k+1} = λ(α_k, ε_k),    z_{k+1} = z(α_k, ε_k),

where α_k and ε_k are positive scalars such that

x_{k+1} > 0,    z_{k+1} > 0,    ε_k < g_k/n,

where

g_k = x_k'z_k + (Ax_k − b)'λ_k,

and α_k is such that the merit function M_k is reduced. Initially we must
have x_0 > 0, and z_0 = c − A'λ_0 > 0 (such a point can often be easily
found; otherwise an appropriate reformulation of the problem is necessary
for which we refer to the specialized literature). These methods are gener-
ally called primal-dual, in view of the fact that they operate simultaneously
on the primal and dual variables.
It can be shown that it is possible to choose α_k and ε_k so that the
merit function is not only reduced at each iteration, but also converges
to zero. Furthermore, with suitable choices of α_k and ε_k, algorithms with
good theoretical properties, such as polynomial complexity and superlinear
convergence, can be derived.
Computational experience has shown that with properly chosen se-
quences α_k and ε_k, and appropriate implementation, the practical perfor-
mance of the primal-dual methods is excellent. The choice

ε_k = g_k/n²,

leading to the relation

x_{k+1}'z_{k+1} = (1 − α_k + α_k/n) x_k'z_k,

for feasible x_k, has been suggested as a good practical rule. Usually, when
x_k has already become feasible, α_k is chosen as θᾱ_k, where θ is a factor very
close to 1 (say 0.999), and ᾱ_k is the maximum stepsize α that guarantees
that x(α, ε_k) ≥ 0 and z(α, ε_k) ≥ 0:

ᾱ_k = min{ min_{i: Δx^i<0} { x_k^i / (−Δx^i) },  min_{i: Δz^i<0} { z_k^i / (−Δz^i) } }.

When x_k is not feasible, the choice of α_k must also be such that the merit
function is improved. In some works, a different stepsize for the x update
than for the (λ, z) update has been suggested. The stepsize for the x
update is near the maximum stepsize α that guarantees x(α, ε_k) ≥ 0, and

the stepsize for the (λ, z) update is near the maximum stepsize α that
guarantees z(α, ε_k) ≥ 0.
There are a number of additional practical issues related to imple-
mentation, for which we refer to the specialized literature. We refer to
the research monographs [Wri97], [Ye97], and other sources for a detailed
discussion, as well as extensions to nonlinear/convex programming prob-
lems, such as quadratic programming. There are also more sophisticated
implementations of the Newton/primal-dual idea, one of which we describe
next.
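
A compact sketch of the primal-dual iteration just described, on a tiny LP of our own, is given below. It uses Eqs. (6.204)-(6.205) for the Newton direction, the rule ε_k = g_k/n², and a 0.999 damping of the maximum stepsize; it is meant only to illustrate the mechanics, not as production code.

```python
import numpy as np

# Toy LP: minimize c'x subject to Ax = b, x >= 0; optimal x = (1, 0, 0), lam = 1.
A = np.array([[1.0, 1.0, 1.0]]); b = np.array([1.0]); c = np.array([1.0, 2.0, 3.0])
n = c.size
x = np.ones(n); lam = np.zeros(1); z = c - A.T @ lam          # x > 0 and z > 0 initially

for k in range(30):
    g = x @ z + (A @ x - b) @ lam
    eps = g / n**2                                            # barrier parameter rule
    v = x * z - eps                                           # v = XZe - eps*e
    dlam = np.linalg.solve(A @ np.diag(x / z) @ A.T,          # Eq. (6.204)
                           A @ (v / z) + b - A @ x)
    dz = -A.T @ dlam                                          # Eq. (6.205)
    dx = -(v + x * dz) / z
    # largest step keeping x and z strictly positive, damped by 0.999
    ratios = np.concatenate([-x[dx < 0] / dx[dx < 0], -z[dz < 0] / dz[dz < 0], [1.0]])
    alpha = 0.999 * min(1.0, ratios.min())
    x, lam, z = x + alpha * dx, lam + alpha * dlam, z + alpha * dz

print(x, lam)    # x close to (1, 0, 0), lam close to 1
```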

Predictor-Corrector Variants

We will now discuss some modified versions of the preceding interior point
methods, which are based on a variation of Newton's method where the
Hessian is evaluated periodically every q > 1 iterations in order to econo-
mize in iteration overhead. When q = 2 and the problem is to solve the
system h(x) = 0, where h : ℜ^n ↦ ℜ^n, this variation of Newton's method
takes the form

x̂_k = x_k − (∇h(x_k)')^{−1} h(x_k),    (6.210)

x_{k+1} = x̂_k − (∇h(x_k)')^{−1} h(x̂_k).    (6.211)

Thus, given x_k, this iteration performs a regular Newton step to ob-
tain x̂_k, and then an approximate Newton step from x̂_k, using, however,
the already available Jacobian inverse (∇h(x_k)')^{−1}. It can be shown that
if x_k → x*, the order of convergence of the error ‖x_k − x*‖ is cubic, that
is,

lim sup_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖³ < ∞,

under the same assumptions that the ordinary Newton's method (q = 1)
attains a quadratic order of convergence; see [OrR70], p. 315. Thus, the
price for the 50% saving in Jacobian evaluations and inversions is a small
degradation of the convergence rate over the ordinary Newton's method
(which attains a quartic order of convergence when two successive ordinary
Newton steps are counted as one).
Two-step Newton methods such as the iteration (6.210), (6.211),
when applied to the system of optimality conditions (6.196)-(6.198) for the
linear program (LP), are known as predictor-corrector methods (the name
comes from their similarity with predictor-corrector methods for solving
differential equations). They operate as follows:
Given (x, z, λ) with x > 0 and z = c − A'λ > 0, the predictor iteration
[cf. Eq. (6.210)] solves for (Δx̂, Δẑ, Δλ̂) the system

XΔẑ + ZΔx̂ = −v̂,    (6.212)

AΔx̂ = b − Ax,    (6.213)



Δẑ + A'Δλ̂ = 0,    (6.214)

with v̂ defined by

v̂ = XZe − ε̂e,    (6.215)

[cf. Eqs. (6.200)-(6.203)].
The corrector iteration [cf. Eq. (6.211)] solves for (Δx̃, Δz̃, Δλ̃) the
system

XΔz̃ + ZΔx̃ = −ṽ,    (6.216)

AΔx̃ = b − A(x + Δx̂),    (6.217)

Δz̃ + A'Δλ̃ = 0,    (6.218)

with ṽ defined by

ṽ = (X + ΔX̂)(Z + ΔẐ)e − ε̃e,    (6.219)

where ΔX̂ and ΔẐ are the diagonal matrices corresponding to Δx̂ and Δẑ,
respectively. Here ε̂ and ε̃ are the barrier parameters corresponding to the
two iterations.
The composite Newton direction is

Δx = Δx̂ + Δx̃,    Δz = Δẑ + Δz̃,    Δλ = Δλ̂ + Δλ̃,

and the corresponding iteration is

x(α, ε̃) = x + αΔx,    λ(α, ε̃) = λ + αΔλ,    z(α, ε̃) = z + αΔz,

where α is a stepsize such that 0 < α ≤ 1 and

x(α, ε̃) > 0,    z(α, ε̃) > 0.

We will now develop a system of equations that yields the composite
Newton direction. By adding Eqs. (6.212)-(6.214) and Eqs. (6.216)-(6.218),
we obtain

X(Δẑ + Δz̃) + Z(Δx̂ + Δx̃) = −v̂ − ṽ,    (6.220)

A(Δx̂ + Δx̃) = b − Ax + b − A(x + Δx̂),    (6.221)

Δẑ + Δz̃ + A'(Δλ̂ + Δλ̃) = 0.    (6.222)


We use the fact

b − A(x + Δx̂) = 0

[cf. Eq. (6.213)], and we also use Eqs. (6.219) and (6.212) to write

ṽ = (X + ΔX̂)(Z + ΔẐ)e − ε̃e
  = XZe + ΔX̂Ze + XΔẐe + ΔX̂ΔẐe − ε̃e
  = XZe + ZΔx̂ + XΔẑ + ΔX̂ΔẐe − ε̃e
  = XZe − v̂ + ΔX̂ΔẐe − ε̃e.

Substituting in Eqs. (6.220)-(6.222), we obtain the following system of equa-
tions for the composite Newton direction (Δx, Δz, Δλ) = (Δx̂ + Δx̃, Δẑ +
Δz̃, Δλ̂ + Δλ̃):

XΔz + ZΔx = −XZe − ΔX̂ΔẐe + ε̃e,    (6.223)

AΔx = b − Ax,    (6.224)

Δz + A'Δλ = 0.    (6.225)

To implement the predictor-corrector method, we need to solve the
system (6.212)-(6.215) for some value of ε̂ to obtain (Δx̂, Δẑ, Δλ̂), and then to
solve the system (6.223)-(6.225) for some value of ε̃ to obtain (Δx, Δz, Δλ).
It is important to note here that most of the work needed for the first
system, namely the factorization of the matrix

AZ^{−1}XA'

in Eq. (6.204), need not be repeated when solving the second system, so that
solving both systems requires relatively little extra work over solving the
first one. We refer to the specialized literature for further details [LMS92],
[Meh92], [Wri97], [Ye97].

6.8.2 Interior Point Methods for Conic Programming

We now discuss briefly interior point methods for the conic programming
problems discussed in Section 1.2. Consider first the second order cone
problem

minimize c'x
subject to A_i x − b_i ∈ C_i,  i = 1, . . . , m,    (6.226)

where x ∈ ℜ^n, c is a vector in ℜ^n, and for i = 1, . . . , m, A_i is an n_i × n
matrix, b_i is a vector in ℜ^{n_i}, and C_i is the second order cone of ℜ^{n_i}. We
approximate this problem with

minimize c'x + ε_k Σ_{i=1}^m B_i(A_i x − b_i)
subject to A_i x − b_i ∈ int(C_i),  i = 1, . . . , m,    (6.227)

where B_i is a function defined in the interior of the second order cone C_i,
and given by

B_i(y) = −ln( y_{n_i}² − (y_1² + · · · + y_{n_i−1}²) ),    y = (y_1, . . . , y_{n_i}) ∈ int(C_i),

and {ε_k} is a positive sequence that converges to 0. Note that as A_i x −
b_i approaches the boundary of C_i, the logarithmic penalty B_i(A_i x − b_i)
approaches ∞.
Similar to Prop. 6.8.1, it can be shown that if x_k is an optimal so-
lution of the approximating problem (6.227), then every limit point of the
sequence {x_k} is an optimal solution of the original problem. For theoreti-
cal as well as practical reasons, the approximating problem (6.227) should
not be solved exactly. In the most efficient methods, one or more Newton
steps corresponding to a given value ε_k are performed, and then the value
of ε_k is appropriately reduced. Similar to the interior point methods for
linear programming of the preceding section, Newton's method should be
implemented with a stepsize to ensure that the iterates keep A_i x − b_i within
the second order cone.
If the aim is to achieve a favorable polynomial complexity result, a
single Newton step should be performed between successive reductions of
ε_k, and the subsequent reduction of ε_k must be correspondingly small, ac-
cording to an appropriate formula, which is designed to enable a polynomial
complexity proof. An alternative, which has proved more efficient in prac-
tice, is to allow multiple Newton steps until an appropriate termination
criterion is satisfied, and then reduce ε_k substantially. When properly im-
plemented, methods of this type seem to require a consistently small total
number of Newton steps [a number typically no more than 50, regardless
of dimension (!) is often reported]. This empirical observation is far more
favorable than what is predicted by the theoretical complexity analysis.
There is a similar interior point method for the dual semidefinite cone
problem involving the multiplier vector λ = (λ_1, . . . , λ_m):

maximize b'λ
subject to D − (λ_1 A_1 + · · · + λ_m A_m) ∈ C,    (6.228)

where b ∈ ℜ^m, D, A_1, . . . , A_m are symmetric matrices, and C is the cone
of positive semidefinite matrices. It consists of solving the problem

maximize b'λ + ε_k ln( det(D − λ_1 A_1 − · · · − λ_m A_m) )
subject to λ ∈ ℜ^m,    (6.229)

where {ε_k} is a positive sequence that converges to 0. Furthermore, a
starting point such that D − λ_1 A_1 − · · · − λ_m A_m is positive definite should
be used, and Newton's method should be implemented with a stepsize to
ensure that the iterates keep D − λ_1 A_1 − · · · − λ_m A_m within the positive
definite cone.

The properties of this method are similar to the ones of the preceding
second order cone method. In particular, if λ_k is an optimal solution of
the approximating problem (6.229), then every limit point of {λ_k} is an
optimal solution of the original problem (6.228).
We finally note that there are primal-dual interior point methods for
conic programming problems, which bear similarity with the one given for
linear programming in the preceding section. We refer to the literature for
further discussion and a complexity analysis; see e.g., [NeN94], [BoV04].
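
The sketch below illustrates the approximating problem (6.229) on a toy semidefinite-constrained dual problem of our own, with a generic derivative-free maximization of the barrier-augmented objective in place of the Newton iterations discussed above.

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: maximize b'lam subject to D - lam_1*A_1 - lam_2*A_2 positive semidefinite.
D = np.diag([2.0, 2.0]); A1 = np.diag([1.0, 0.0]); A2 = np.diag([0.0, 1.0]); b = np.array([1.0, 0.5])

def neg_barrier_obj(lam, eps):
    S = D - lam[0] * A1 - lam[1] * A2
    eigs = np.linalg.eigvalsh(S)
    if eigs.min() <= 0:                       # outside the positive definite cone
        return np.inf
    return -(b @ lam + eps * np.sum(np.log(eigs)))   # ln det = sum of log eigenvalues

lam = np.zeros(2)                             # D itself is positive definite, so this is interior
for eps in [1.0, 0.1, 0.01, 0.001]:
    lam = minimize(neg_barrier_obj, lam, args=(eps,), method="Nelder-Mead").x
print(lam)    # approaches (2, 2), where the semidefinite constraint becomes active
```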

6.8.3 Central Cutting Plane Methods

We now return to the general problem of minimizing a real-valued convex
function f over a closed convex constraint set X. We will discuss a method
that combines the interior point and cutting plane approaches. Like the
cutting plane method of Section 4.1, it maintains a polyhedral approxima-
tion

F_k(x) = max{ f(x_0) + (x − x_0)'g_0, . . . , f(x_k) + (x − x_k)'g_k }

to f, constructed using the points x_0, . . . , x_k generated so far, and associ-
ated subgradients g_0, . . . , g_k, with g_i ∈ ∂f(x_i) for all i = 0, . . . , k. However,
it generates the next vector x_{k+1} by using a different mechanism. In partic-
ular, instead of minimizing F_k over X, the method obtains x_{k+1} by finding
a "central pair" (x_{k+1}, w_{k+1}) within the subset

S_k = { (x, w) | x ∈ X, F_k(x) ≤ w ≤ f̂_k },

where f̂_k is the best upper bound to the optimal value that has been found
so far,

f̂_k = min_{i=0,...,k} f(x_i)

(see Fig. 6.8.3).
There are several methods for finding the central pair (x_{k+1}, w_{k+1}).
Roughly, the idea is that the central pair should be "somewhere in the
middle" of S_k. For example, consider the case where S_k is polyhedral with
nonempty interior. Then (x_{k+1}, w_{k+1}) could be the analytic center of S_k,
where for any polyhedral set

P = { y | a_p'y ≤ c_p, p = 1, . . . , m }    (6.230)

with nonempty interior, its analytic center is defined as the unique maxi-
mizer of

Σ_{p=1}^m ln( c_p − a_p'y )

Figure 6.8.3. Illustration of the set

S_k = { (x, w) | x ∈ X, F_k(x) ≤ w ≤ f̂_k }

in the central cutting plane method.

over y ∈ P.
Another possibility is the ball center of S_k, i.e., the center of the largest
inscribed sphere in S_k; for the generic polyhedral set P of the form (6.230),
the ball center can be obtained by solving the following problem with op-
timization variables (y, σ):

maximize σ
subject to a_p'(y + d) ≤ c_p,  ∀ ‖d‖ ≤ σ,  p = 1, . . . , m,

assuming that P has nonempty interior. By maximizing over all d with
‖d‖ ≤ σ, it can be seen that this problem is equivalent to the linear program

maximize σ
subject to a_p'y + ‖a_p‖σ ≤ c_p,  p = 1, . . . , m.
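
The equivalent linear program above is readily solved with standard LP software. A minimal sketch for a toy polyhedron of our own follows.

```python
import numpy as np
from scipy.optimize import linprog

# Ball (Chebyshev) center of the triangle {y : y1 <= 1, y2 <= 1, -y1 - y2 <= 0}.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
c = np.array([1.0, 1.0, 0.0])
norms = np.linalg.norm(A, axis=1)

# Variables (y1, y2, sigma); linprog minimizes, so the cost is (0, 0, -1).
res = linprog(c=[0.0, 0.0, -1.0],
              A_ub=np.hstack([A, norms[:, None]]),      # a_p'y + ||a_p||*sigma <= c_p
              b_ub=c,
              bounds=[(None, None), (None, None), (0.0, None)])
y_center, radius = res.x[:2], res.x[2]
print(y_center, radius)
```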

Central cutting plane methods have satisfactory convergence proper-


ties, even though they do not terminate finitely in the case of a polyhedral
cost function f, as the ordinary cutting plane method does. Since they
are closely related to the interior point methods, they have benefited from
advances in the practical implementation methodology of these methods.

6.9 NOTES, SOURCES, AND EXERCISES

As we noted earlier, the discussion in this chapter is often not as detailed


as in earlier chapters, as some of the methods described are under active

development. Moreover the literature in the field has grown explosively in


the decade preceding the writing of this book. As a result our presentation
and references are not comprehensive. They tend to reflect to some extent
the author's reading preferences and research orientation. The textbooks,
research monographs, and surveys that we cite may provide the reader with
a more comprehensive view of the many different research lines in the field.
Section 6.1: Convergence rate analysis of optimization methods (as well
as related methods for solving systems of linear and nonlinear equations)
has traditionally been local in nature, i.e., asymptotic estimates of the
number of iterations, starting sufficiently close to the point of convergence,
and an emphasis on issues of condition number and scaling (cf. our discus-
sion in Section 2.1.1). Most of the books in nonlinear programming follow
this approach. Over time and beginning in the late 70s, a more global
approach has received attention, influenced in part by computational com-
plexity ideas, and emphasizing first order methods. In this connection, we
mention the books of Nemirovskii and Yudin [NeY83], and Nesterov and
Nemirovskii [NeN94].
Section 6.2: The optimal complexity gradient projection/extrapolation
method is due to Nesterov [Nes83], [Nes04], [Nes05]; see also Tseng [Tse08],
Beck and Teboulle [BeTlO], Lu, Monteiro, and Yuan [LMY12], and Gon-
zaga and Karas [GoK13], for proposals and analysis of variants and more
general methods. Some of these variants also apply to important classes of
nondifferentiable cost problems, similar to the ones treated by the proximal
gradient methods of Section 6.3.
The idea of using smoothing in conjunction with a gradient method
to construct optimal algorithms for nondifferentiable convex problems is
due to Nesterov [Nes05]. In his work he proves the Lipschitz property of
Prop. 6.2.2 for the more general case, where pis convex but not necessarily
differentiable, and analyzes several important special cases. The algorithm,
and its complexity oriented analysis, have had a strong influence on con-
vex optimization algorithms research. In our presentation, we follow the
analysis of Tseng [Tse08].
Section 6.3: There have been several proposals of combinations of gradi-
ent and proximal methods for minimizing the sum of two convex functions
(or more generally, finding a zero of the sum of two nonlinear monotone
operators). These methods, commonly called splitting algorithms, have
a long history, dating to the papers of Lions and Mercier [LiM79], and
Passty [Pas79]. Like the ADMM, the basic proximal gradient algorithm,
written in the form (6.46), admits an extension to the problem of finding
a zero of the sum of two nonlinear maximal monotone operators, called
the forward-backward algorithm (one of the two operators must be single-
valued). The form of the forward-backward algorithm is illustrated in Fig.
6.3.1: we simply need to replace the subdifferential ∂h with a general
multivalued maximal monotone operator, and the gradient ∇f with a general
single-valued monotone operator.
The forward-backward algorithm was proposed and analyzed by Lions
and Mercier [LiM79], and Passty [Pas79]. Additional convergence results
for the algorithm and a discussion of its applications were given by Gabay
[Gab83] and Tseng [Tse91b]. The convergence result of Prop. 6.3.3, in the
case where the stepsize is constant, descends from the more general results
of [Gab83] and [Tse91b]. A modification that converges under weaker as-
sumptions is given by Tseng [TseOO]. The rate of convergence has been
further discussed by Chen and Rockafellar [ChR97].
Variants of proximal gradient and Newton-like methods have been
proposed and analyzed by several authors, including cases where the dif-
ferentiable function is not convex; see e.g., Fukushima and Mine [FuM81],
[MiF81], Patriksson [Pat93], [Pat98], [Pat99], Tseng and Yun [TsY09], and
Schmidt [SchlO]. The methods have received renewed attention, as they
are well-matched to the structure of some large-scale machine learning and
signal processing problems; see Beck and Teboulle [BeT09a], [BeT09b],
[BeTlO], and the references they give to algorithms for problems with spe-
cial structures.
There has been a lot of additional recent work in this area, which
cannot be fully surveyed here. Methods (with and without extrapolation),
which replace the gradient with an aggregated gradient that is calculated
incrementally, are proposed and analyzed by Xiao [XialO], and Xiao and
Zhang [XiZ14]. Inexact variants that admit errors in the proximal min-
imization and the gradient calculation, in the spirit of the ε-subgradient
methods of Section 3.3, have been discussed by Schmidt, Roux, and Bach [SRB11].
The convergence rate for some interesting special cases was investigated by
Tseng [TselO], Hou et al. [HZS13], and Zhang, Jiang, and Luo [ZJL13].
Algorithms where f has an additive form with components treated incre-
mentally are discussed by Duchi and Singer [DuS09], and by Langford, Li,
and Zhang [LLZ09]. For recent work on proximal Newton-like methods, see
Becker and Fadili [BeF12], Lee, Sun, and Saunders [LSS12], [LSS14], and
Chouzenoux, Pesquet, and Repetti [CPR14]. The finite and superlinear
convergence rate results of Exercises 6.4 and 6.5 are new to the author's
knowledge.
Section 6.4: Incremental subgradient methods were proposed by sev-
eral authors in the 60s and 70s. Perhaps the earliest paper is by Litvakov
[Lit66], which considered convex/nondifferentiable extensions of linear least
squares problems. There were several other related subsequent proposals,
including the paper by Kibardin [Kib80]. These works remained unnoticed
in the Western literature, where incremental methods were reinvented of-
ten in different contexts and with different lines of analysis. We mention
the papers by Solodov and Zavriev [SoZ98], Bertsekas [Ber99] (Section
6.3.2), Ben-Tal, Margalit, and Nemirovski [BMN01], Nedic and Bertsekas

[NeB00], [NeB01], [NeB10], Nedic, Bertsekas, and Borkar [NBB01], Kiwiel
[Kiw04], Rabbat and Nowak [RaN04], [RaN05], Gaudioso, Giallombardo,
and Miglionico [GGM06], Shalev-Shwartz et al. [SSS07], Helou and De
Pierro [HeD09], Johansson, Rabi, and Johansson [JRJ09], Predd, Kulkarni,
and Poor [PKP09], Ram, Nedic, and Veeravalli [RNV09], Agarwal
and Duchi [AgD11], Duchi, Hazan, and Singer [DHS11], Nedic [Ned11],
Duchi, Bartlett, and Wainwright [DBW12], Wang and Bertsekas [WaB13a],
[WaB14], and Wang, Fang, and Liu [WFL14].
The advantage of using deliberate randomization in selecting com-
ponents for deterministic additive cost functions was first established by
Nedic and Bertsekas [NeB00], [NeB01]. Asynchronous incremental sub-
gradient methods were proposed and analyzed by Nedic, Bertsekas, and
Borkar [NBB01].
The incremental proximal methods and their combinations with sub-
gradient methods (cf. Sections 6.4.1 and 6.4.2) were first proposed by the
author in [Ber10b], [Ber11], which we follow in our development here. For
recent work and applications, see Andersen and Hansen [AnH13], Couellan
and Trafalis [CoT13], Weinmann, Demaret, and Storath [WDS13], Bacak
[Bac14], Bergmann et al. [BSL14], Richard, Gaiffas, and Vayatis [RGV14],
and You, Song, and Qiu [YSQ14].
The incremental augmented Lagrangian method of Section 6.4.3 is
new. The idea of this method is simple: just as the proximal algorithm,
when dualized, yields the augmented Lagrangian method, the incremental
proximal algorithm, when dualized, should yield a form of incremental
augmented Lagrangian method. While we have focused on the case of a
particular type of separable problem, this idea applies more broadly to
other contexts that involve additive cost functions.
Incremental constraint projection methods are related to classical fea-
sibility methods, which have been discussed by many authors; see e.g.,
Gubin, Polyak, and Raik [GPR67], the survey by Bauschke and Borwein
[BaB96], and recent papers such as Bauschke [Bau01], Bauschke, Com-
bettes, and Kruk [BCK06], Cegielski and Suchocka [CeS08], and Nedic
[Ned10], and their bibliographies.
Incremental constraint projection methods (with a nonzero convex
cost function) were first proposed by Nedic [Ned11]. The algorithm of Sec-
tion 6.4.4 was proposed by the author in [Ber10b], [Ber11], and is similar
but differs in a few ways from the one of [Ned11]. The latter algorithm
uses a stepsize β_k = 1 and requires the linear regularity assumption, noted
in Section 6.4.4, in order to prove convergence. Moreover, it does not
consider cost functions that are sums of components that can be treated
incrementally. We refer to the papers by Wang and Bertsekas [WaB13a],
[WaB14] for extensions of the results of [Ber11] and [Ned11], which involve
incremental treatment of the cost function and the constraints, and a uni-
fied convergence analysis of a variety of component selection rules, both
deterministic and randomized.

Section 6.5: There is extensive literature on coordinate descent methods,


and it has recently grown tremendously (a Google Scholar search produced
many thousands of papers appearing in the two years preceding the publica-
tion of this book). Our references here and in Section 6.5 are consequently
highly selective, with a primary focus on early contributions.
Coordinate descent algorithms have their origin in the classical Ja-
cobi and Gauss-Seidel methods for solving systems of linear and nonlinear
equations. These were among the first methods to receive extensive at-
tention in the field of scientific computation in the 50s; see the books by
Ortega and Rheinholdt [OrR70] for a numerical analysis point of view, and
by Bertsekas and Tsitsiklis [BeT89a] for a distributed synchronous and
asynchronous computation point of view.
In the field of optimization, Zangwill [Zan69] gave the first conver-
gence proof of coordinate descent for unconstrained continuously differen-
tiable problems, and focused attention on the need for a unique attain-
ment assumption of the minimum along each coordinate direction. Powell
[Pow73] gave a counterexample showing that for three or more block com-
ponents this assumption is essential.
The convergence rate of coordinate descent has been discussed by Lu-
enberger [Lue84], and more recently by Nesterov [Nes12], and by Schmidt
and Friedlander [ScF14] (see Exercise 6.11). Convergence analyses for con-
strained problems have been given by a number of authors under various
conditions. We have followed here the line of analysis of Bertsekas and
Tsitsiklis [BeT89a] (Section 3.3.5), also used in Bertsekas [Ber99] (Section
2.7).
Coordinate descent methods are often well suited for solving dual
problems, and within this specialized context there has been much anal-
ysis; see Hildreth [Hil57], Cryer [Cry71], Pang [Pan84], Censor and Her-
man [CeH87], Hearn and Lawphongpanich [HeL89], Lin and Pang [LiP87],
Bertsekas, Hossein, and Tseng [BHT87], Tseng and Bertsekas [TsB87],
[TsB90], [TsB91], Tseng [Tse91a], [Tse93], Hager and Hearn [HaH93], Luo
and Tseng [LuT93a], [LuT93b], and Censor and Zenios [CeZ97]. Some of
these papers also deal with the case of multiple dual optimal solutions.
The convergence to a unique limit and the rate of convergence of the
coordinate descent method applied to quadratic programming problems
were analyzed by Luo and Tseng [LuT91], [LuT92] (without an assumption
of a unique minimum). These two papers marked the beginning of an
important line of analysis of the convergence and rate of convergence of
several types of constrained optimization algorithms, including coordinate
descent, which is based on the Hoffman bound and other related error
bounds; see [LuT93a], [LuT93b], [LuT93c], [LuT94a], [LuT94b]. For a
survey of works in this area see Tseng [TselO].
Recent attention has focused on variants of coordinate descent meth-
ods with inexact minimization along each block component, extensions to
nondifferentiable cost problems with special structure, alternative orders

for coordinate selection, and asynchronous distributed computation issues


(see the discussion and references in Section 6.5). A survey of the field is
given by Wright [Wri14].
Section 6.6: Nonquadratic augmented Lagrangians and their associated
proximal minimization-type algorithms that use nonquadratic proximal
terms were introduced by Kort and Bertsekas; see the paper [KoB72],
which included as a special case the exponential augmented Lagrangian
method, the thesis [Kor75], the paper [KoB76], and the augmented La-
grangian monograph [Ber82a]. Progressively stronger convergence results
for the exponential method were established in [KoB72], [Ber82a], [TsB93].
There has been continuing interest in proximal algorithms with non-
quadratic regularization, and related augmented Lagrangian algorithms,
directed at obtaining additional classes of methods, sharper convergence
results, understanding the properties that enhance computational perfor-
mance, and specialized applications. For representative works, see Polyak
[Po188], [Pol92], Censor and Zenios [CeZ92], [CeZ97], Guler [Gul92], Tebou-
lle [Teb92], Chen and Teboulle [ChT93], [ChT94], Tseng and Bertsekas
[TsB93], Bertsekas and Tseng [BeT94a], Eckstein [Eck94], Iusem, Svaiter,
and Teboulle [IST94], Iusem and Teboulle [IuT95], Auslender, Cominetti,
and Haddou [ACH97], Ben-Tal and Zibulevsky [BeZ97], Polyak and Tebou-
lle [PoT97], Iusem [Ius99], Facchinei and Pang [FaP03], Auslender and
Teboulle [AuT03], and Tseng [Tse04].
Section 6.7: Descent methods for minimax problems were considered in
several works during the 60s and early 70s, as extensions of feasible di-
rection methods for constrained minimization (a minimax problem can be
converted to a constrained optimization problem); see e.g., the book by
Demjanov and Rubinov [DeR70], and the papers by Pshenichnyi [Psh65],
Demjanov [Dem66], [Dem68], and Demjanov and Malozemov [DeM71].
These methods typically involved an ε parameter to deal with convergence
difficulties due to proximity to the constraint boundaries.
The ε-descent method of Section 6.7 relates to these earlier methods,
but applies more generally to any closed proper convex function. It was pro-
posed by Bertsekas and Mitter [BeM71], [BeM73], together with its outer
approximation variant for sums of functions. This variant is well-suited
for application to convex network optimization problems; see Rockafellar
[Roc84], Bertsekas, Hossein, and Tseng [BHT87], Bertsekas and Tsitsik-
lis [BeT89a], and Bertsekas [Ber98]. Inner approximation variants of the
ε-descent method were proposed by Lemarechal [Lem74], [Lem75], and a
related algorithm was independently derived by Wolfe [Wol75]. These pa-
pers, together with other papers in the collection edited by Balinski and
Wolfe [BaW75], were influential in focusing attention on nondifferentiable
convex optimization as a field with broad applications.
The extended monotropic programming problem was introduced and
studied by the author [Ber10a], as an extension of Rockafellar's mono-
tropic programming framework, which was extensively developed in the
book [Roc84]. The latter framework is the case where each function f_i is
one-dimensional, and includes as special cases linear, quadratic, and convex
separable single commodity network flow problems (see [Roc84], [Ber98]).
Extended monotropic programming is more general in that it allows multi-
dimensional component functions f_i, but requires constraint qualifications
for strong duality, which we have discussed in Sections 6.7.3 and 6.7.4.
Monotropic programming is the largest class of convex optimization
problems with linear constraints for which strong duality holds without
any additional qualifications, such as the Slater or other relative interior
point conditions (cf. Section 1.1), or the ε-subdifferential condition of Prop.
6.7.4. The EMP strong duality theorem (Prop. 6.7.4) extends a theorem
proved for the case of a monotropic programming problem in [Roc84]. Re-
lated analyses and infinite dimensional extensions of extended monotropic
programming are given by Borwein, Burachik, and Yao [BBY12], Burachik
and Majeed [BuM13], and Burachik, Kaya, and Majeed [BKM14].
Section 6.8: Interior point methods date to the work of Frisch in the
50s [Fri56]. A textbook treatment was given by Fiacco and McCormick
[FiM68]. The methods became popular in the middle 80s, when they were
systematically applied to linear programming problems, following the in-
tense interest in the complexity oriented work of Karmarkar [Kar84].
The textbooks by Bertsimas and Tsitsiklis [BeT97], and Vanderbei
[VanOl] provide accounts of interior point methods for linear programming.
The research monographs by Wright [Wri97], Ye [Ye97], and Roos, Terlaky,
and Vial [RTV06], the edited volume by Terlaky [Ter96], and the survey
by Forsgren, Gill, and Wright [FGW02] are devoted to interior point meth-
ods for linear, quadratic, and convex programming. For a convergence
rate analysis of primal-dual interior point methods, see Zhang, Tapia, and
Dennis [ZTD92], and Ye et al. [YGT93]. The predictor-corrector variant
was proposed by Mehrotra [Meh92]; see also Lustig, Marsten, and Shanno
[LMS92]. We have followed closely here the development of the author's
textbook [Ber99].
Interior point methods were adapted for conic programming, start-
ing with the complexity-oriented monograph by Nesterov and Nemirovskii
[NeN94]. The book by Boyd and Vandenberghe [BoV04] has a special fo-
cus on interior point methods. Furthermore, related convex optimization
software has become popular (Grant, Boyd, and Ye [GBY12]).
Central cutting plane methods were introduced by Elzinga and Moore
[ElM75]. More recent proposals, some of which relate to interior point
methods, are discussed in Goffin and Vial [GoV90], Goffin, Haurie, and
Vial [GHV92], Ye [Ye92], Kortanek and No [KoN93], Goffin, Luo, and
Ye [GLY94], [GLY96], Atkinson and Vaidya [AtV95], den Hertog et al.
[HKR95], Nesterov [Nes95]. For a textbook treatment, see Ye [Ye97], and
for a survey, see Goffin and Vial [GoV02].

EXERCISES

6.1 (Equivalent Properties of Convex Functions with Lipschitz Gradient)

Given a convex differentiable function f : ℝ^n ↦ ℝ and a scalar L > 0,
show that the following five properties are equivalent:

(i) ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, for all x, y ∈ ℝ^n.

(ii) f(x) + ∇f(x)'(y − x) + (1/(2L))‖∇f(x) − ∇f(y)‖² ≤ f(y), for all x, y ∈ ℝ^n.

(iii) (∇f(x) − ∇f(y))'(x − y) ≥ (1/L)‖∇f(x) − ∇f(y)‖², for all x, y ∈ ℝ^n.

(iv) f(y) ≤ f(x) + ∇f(x)'(y − x) + (L/2)‖y − x‖², for all x, y ∈ ℝ^n.

(v) (∇f(x) − ∇f(y))'(x − y) ≤ L‖x − y‖², for all x, y ∈ ℝ^n.
Note: This equivalence, given as part of Th. 2.1.5 in [Nes04], and also given
in part as Prop. 12.60 of [RoW98], proves among others the converse to Prop.
6.1.9(a). Proof: In Prop. 6.1.9, we showed that (i) implies (ii) and that (ii)
implies (iii). Moreover, (iii) and the Schwarz inequality imply (i). Thus (i), (ii),
and (iii) are equivalent.
The proof of Prop. 6.1.9(a) also shows that (iv) implies (v). To obtain (iv)
from (v), we use a variation of the proof of the descent lemma (Prop. 6.1.2). Let
t be a scalar parameter and let g(t) = f(x + t(y − x)). We have

    f(y) − f(x) = g(1) − g(0) = ∫₀¹ (dg/dt)(t) dt = ∫₀¹ ∇f(x + t(y − x))'(y − x) dt,

so that

    f(y) − f(x) − ∇f(x)'(y − x) = ∫₀¹ (∇f(x + t(y − x)) − ∇f(x))'(y − x) dt ≤ (L/2)‖y − x‖²,

where the last inequality follows by using (v) with y replaced by x + t(y − x) to
obtain

    (∇f(x + t(y − x)) − ∇f(x))'(y − x) ≤ Lt‖y − x‖²,    ∀ t ∈ [0, 1].

Thus (iv) and (v) are equivalent.


We complete the proof by showing that (i) is equivalent to (iv). The descent
lemma (Prop. 6.1.2) states that (i) implies (iv). To show that (iv) implies (i),
we use the following proof, given to us by Huizhen Yu (this part is not proved in
Th. 2.1.5 of [Nes04]). Fix x and consider the following expression as a function
of y,
    (1/L)(f(y) − f(x) − ∇f(x)'(y − x)).

Making a change of variable z = y − x, we define

    h_x(z) = (1/L)(f(z + x) − f(x) − ∇f(x)'z).

By definition
    ∇h_x(y − x) = (1/L)(∇f(y) − ∇f(x)),                              (6.231)

while by the assumption (iv), we have h_x(z) ≤ (1/2)‖z‖² for all z, implying that

    h*_x(ν) ≥ (1/2)‖ν‖²,    ∀ ν ∈ ℝ^n,                               (6.232)

where h*_x is the conjugate of h_x.

Similarly, fix y and define

    h_y(z) = (1/L)(f(z + y) − f(y) − ∇f(y)'z),

to obtain
    ∇h_y(x − y) = (1/L)(∇f(x) − ∇f(y)),

and
    h*_y(ν) ≥ (1/2)‖ν‖²,    ∀ ν ∈ ℝ^n.                               (6.233)

Let ν = (1/L)(∇f(y) − ∇f(x)). Since by Eq. (6.231), ν is the gradient of h_x
at y − x, by the Conjugate Subgradient Theorem (Prop. 5.4.3 in Appendix B),
we have
    h*_x(ν) + h_x(y − x) = ν'(y − x),

and similarly
    h*_y(−ν) + h_y(x − y) = −ν'(x − y).

Also by the definition of h_x and h_y,

    h_x(y − x) + h_y(x − y) = ν'(y − x).

By combining the preceding three relations,

    h*_x(ν) + h*_y(−ν) = ν'(y − x),

and by using also Eqs. (6.232) and (6.233),

    ‖ν‖² ≤ h*_x(ν) + h*_y(−ν) = ν'(y − x) ≤ ‖ν‖ ‖y − x‖.

Thus we obtain ‖ν‖ ≤ ‖y − x‖, i.e., that ‖∇f(y) − ∇f(x)‖ ≤ L‖y − x‖.

6.2 (Finite Convergence of Gradient Projection for Linear Programming)

Consider the gradient projection method, applied to minimization of a linear
function f over a polyhedral set X, using a stepsize α_k such that Σ_{k=0}^∞ α_k = ∞.
Show that if the set X* of optimal solutions is nonempty, the method converges to
some x* ∈ X* in a finite number of iterations. Hint: Show that for this problem
the method is equivalent to the proximal algorithm and apply Prop. 5.1.5.
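The following minimal Python sketch (not part of the exercise) illustrates the
finite convergence phenomenon on a hypothetical instance: a linear cost is
minimized over the box [0, 1]^3 by gradient projection with a diminishing,
nonsummable stepsize.

    import numpy as np

    c = np.array([1.0, -2.0, 0.5])            # linear cost; the minimizer is the vertex (0, 1, 0)
    x = np.full(3, 0.5)                        # start at the center of the box
    for k in range(100):
        alpha = 10.0 / (k + 1)                 # stepsizes with a divergent sum
        x = np.clip(x - alpha * c, 0.0, 1.0)   # projection on [0,1]^3 is componentwise clipping
    print(x)                                   # the optimal vertex is reached after finitely many steps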

6.3 (Convergence Rate of Proximal Gradient Method Under Strong Convexity Conditions)

This exercise, based on an unpublished analysis in [Sch14b], derives conditions
under which the proximal gradient method converges linearly. Consider the minimization
of f + h, where f is strongly convex and differentiable with Lipschitz
continuous gradient [cf. Eq. (6.44)], and h is closed proper convex (cf. Section
6.3). The proximal gradient method with constant stepsize α > 0 is written as

    x_{k+1} = P_{α,h}(x_k − α∇f(x_k)),

where P_{α,h} is the proximal operator corresponding to α and h (cf. Section 5.1.4).
Denote by x* the optimal solution (which exists and is unique by the strong
convexity of f), and let z* = x* − α∇f(x*). We assume that α ≤ 1/L, where L
is the Lipschitz constant of ∇f, so that x_k → x* (cf. Prop. 6.3.3).

(a) Show that for some scalars p ∈ (0, 1) and q ∈ (0, 1], we have

    ‖x − α∇f(x) − z*‖ ≤ p ‖x − x*‖,    ∀ x ∈ ℝ^n,                    (6.234)

    ‖P_{α,h}(z) − P_{α,h}(z*)‖ ≤ q ‖z − z*‖,    ∀ z ∈ ℝ^n.           (6.235)

Show also that
    ‖x_{k+1} − x*‖ ≤ pq ‖x_k − x*‖,    ∀ k.                          (6.236)
Hint: See Prop. 6.1.8 and also [SRB11] for Eq. (6.234). Because P_{α,h} is
nonexpansive (cf. Prop. 5.1.8), we can set q = 1 in Eq. (6.235). Finally,
note that as shown in Section 6.3, the set of minima of f + h coincides with
the set of fixed points of the mapping P_{α,h}(x − α∇f(x)) [the composition
of the two mappings in Eqs. (6.234) and (6.235)]. Note: Conditions under
which P_{α,h} is a contraction, so values q < 1 can be used in Eq. (6.235),
are given in Prop. 5.1.4 and Exercise 5.2 (see also [Roc76a] and [Luq84]).
A local version of Eq. (6.236) can also be proved, when Eqs. (6.234) and
(6.235) hold locally, within spheres centered at x* and z*, respectively.

(b) Assume that f is positive definite quadratic with minimum and maximum
eigenvalues denoted by m and M, respectively, and that h is positive
semidefinite quadratic with minimum eigenvalue λ. Show that

    ‖x_{k+1} − x*‖ ≤ (max{|1 − αm|, |1 − αM|}/(1 + αλ)) ‖x_k − x*‖,    ∀ k.

Hint: Use Eq. (6.236), Exercise 2.1, and the linearity of the proximal operator.
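As a concrete illustration of the iteration analyzed in this exercise (not part of
the original text), the following Python sketch runs the proximal gradient method
on a hypothetical instance in which f is a strongly convex quadratic and h is a
multiple of the ℓ_1 norm, whose proximal operator is componentwise soft-thresholding;
the data Q, b, and the weight lam are illustrative assumptions.

    import numpy as np

    Q = np.array([[3.0, 1.0], [1.0, 2.0]])     # positive definite, so f is strongly convex
    b = np.array([1.0, -1.0])
    lam = 0.1
    L = np.max(np.linalg.eigvalsh(Q))          # Lipschitz constant of grad f
    a = 1.0 / L                                # stepsize a <= 1/L

    def prox_l1(z, t):
        # Proximal operator of t*||.||_1 (componentwise soft-thresholding).
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    x = np.zeros(2)
    for k in range(200):
        x = prox_l1(x - a * (Q @ x - b), a * lam)   # x_{k+1} = P_{a,h}(x_k - a grad f(x_k))
    print(x)                                        # the iterates converge linearly to the unique minimizer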

Figure 6.9.1. Illustration of the finite convergence process of the proximal gradient
method for the case of a sharp minimum, where h is nondifferentiable at
x* and −∇f(x*) ∈ int(∂h(x*)) (cf. Exercise 6.4). The figure also illustrates how
the method can attain superlinear convergence (cf. Exercise 6.5). These results
should be compared with the convergence rate analysis of the proximal algorithm
in Section 5.1.2.

6.4 (Finite Convergence of Proximal Gradient Method for Sharp Minima)

Consider the minimization of f + h, where f is convex and differentiable with
Lipschitz continuous gradient [cf. Eq. (6.44)], and h is closed proper convex (cf.
Section 6.3). Assume that there is a unique optimal solution, denoted x*, and
that α ≤ 1/L, where L is the Lipschitz constant of ∇f, so that x_k → x* (cf.
Prop. 6.3.3). Show that if the interior point condition

    0 ∈ int(∇f(x*) + ∂h(x*))                                         (6.237)

holds, then the method converges to x* in a finite number of iterations. Note: The
condition (6.237) is illustrated in Fig. 6.9.1, and is known as a sharp minimum
condition. It is equivalent to the existence of some β > 0 such that

    f(x*) + h(x*) + β‖x − x*‖ ≤ f(x) + h(x),    ∀ x ∈ ℝ^n,

cf. the finite convergence result of Prop. 5.1.5 for the proximal algorithm (which,
however, does not assume uniqueness of the optimal solution). Abbreviated proof:
By continuity of ∇f there is an open sphere S_{x*} centered at x* with

    x − α∇f(x) ∈ x* + α∂h(x*),    ∀ x ∈ S_{x*}.

Once x_k ∈ S_{x*}, the method terminates after one more iteration.

6.5 (Superlinear Convergence of Proximal Gradient Method)

Consider the minimization of f + h, where f is convex and differentiable with
Lipschitz continuous gradient [cf. Eq. (6.44)], and h is closed proper convex (cf.
Section 6.3). Assume that the optimal solution set X* is nonempty, and that
α ≤ 1/L, where L is the Lipschitz constant of ∇f, so that x_k converges to some
point in X* (cf. Prop. 6.3.3). Assume also that for some scalars β > 0, δ > 0,
and γ ∈ (1, 2), we have

    F* + β(d(x))^γ ≤ f(x) + h(x),    ∀ x ∈ ℝ^n with d(x) ≤ δ,        (6.238)

where F* is the optimal value of the problem and

    d(x) = min_{x*∈X*} ‖x − x*‖.

Show that there exists k̄ ≥ 0 such that for all k ≥ k̄ we have

    f(x_{k+1}) + h(x_{k+1}) − F* ≤ (1/(2αβ^{2/γ})) (f(x_k) + h(x_k) − F*)^{2/γ}.

Note: Figure 6.9.1 provides a geometric interpretation of the mechanism for
superlinear convergence. The graph of ∂h changes almost "vertically" as we
move away from the optimal solution set. Abbreviated proof: From the descent
lemma (Prop. 6.1.2), we have the inequality

    f(x_{k+1}) ≤ f(x_k) + ∇f(x_k)'(x_{k+1} − x_k) + (1/(2α))‖x_{k+1} − x_k‖²,

so that for all x* ∈ X*, we obtain

    f(x_{k+1}) + h(x_{k+1}) ≤ f(x_k) + ∇f(x_k)'(x_{k+1} − x_k) + h(x_{k+1}) + (1/(2α))‖x_{k+1} − x_k‖²
                            = min_{x∈ℝ^n} {f(x_k) + ∇f(x_k)'(x − x_k) + h(x) + (1/(2α))‖x − x_k‖²}
                            ≤ f(x_k) + ∇f(x_k)'(x* − x_k) + h(x*) + (1/(2α))‖x* − x_k‖²
                            ≤ f(x*) + h(x*) + (1/(2α))‖x* − x_k‖²,

where the last step uses the gradient inequality f(x_k) + ∇f(x_k)'(x* − x_k) ≤ f(x*).
Letting x* be the vector of X* that is at minimum distance from x_k, we obtain

    f(x_{k+1}) + h(x_{k+1}) − F* ≤ (1/(2α))(d(x_k))².

Since x_k converges to some point of X*, by using the hypothesis (6.238), we have
for sufficiently large k,

    (d(x_k))^γ ≤ (1/β)(f(x_k) + h(x_k) − F*),

and the result follows by combining the last two relations.



6.6 (Proximal Gradient Method with Diagonal Scaling)

Consider the minimization of f + h, where f is convex and differentiable with
Lipschitz continuous gradient [cf. Eq. (6.44)], and h is closed proper convex (cf.
the framework of Section 6.3). Consider the algorithm

    z_k^i = x_k^i − d_k^i ∇_i f(x_k),    i = 1, ..., n,

    x_{k+1} ∈ arg min_{x∈ℝ^n} { h(x) + Σ_{i=1}^n (1/(2d_k^i)) (x^i − z_k^i)² },

where d_k^i are positive scalars.

(a) Assume that x_k → x* for some vector x*, and d_k^i → d^i for some positive
scalars d^i, i = 1, ..., n. Show that x* minimizes f + h. Hint: Show that
the vector g_k with components given by

    g_k^i = (z_k^i − x_{k+1}^i)/d_k^i,    i = 1, ..., n,

belongs to ∂h(x_{k+1}), and that the limit of g_k belongs to ∂h(x*).

(b) Assuming that f is strongly convex, derive conditions under which x_k → x*
and d_k^i → d^i.

6.7 (Parallel Projections Algorithm as a Block Coordinate Descent Method)

Let X_1, ..., X_m be given closed convex sets in ℝ^n. Consider the problem of
finding a point in their intersection, and the equivalent problem

    minimize    (1/2) Σ_{i=1}^m ‖z^i − x‖²
    subject to  x ∈ ℝ^n,  z^i ∈ X_i,  i = 1, ..., m.

Show that a block coordinate descent algorithm is given by

    z_{k+1}^i = P_{X_i}(x_k),    i = 1, ..., m,        x_{k+1} = (1/m) Σ_{i=1}^m z_{k+1}^i,

where P_{X_i}(·) denotes projection on X_i.
Verify that the convergence result of Prop. 6.5.1 applies to this algorithm.
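The following Python sketch (not part of the exercise) carries out the parallel
projection iteration on two hypothetical sets, a halfspace and a ball, in the block
coordinate descent interpretation above: with x fixed, minimization over z^i is a
projection, and with the z^i fixed, minimization over x gives their average.

    import numpy as np

    def proj_halfspace(x, a, b):
        # Projection onto {x | a'x <= b}.
        viol = a @ x - b
        return x if viol <= 0 else x - (viol / (a @ a)) * a

    def proj_ball(x, center, r):
        # Projection onto {x | ||x - center|| <= r}.
        d = x - center
        nd = np.linalg.norm(d)
        return x if nd <= r else center + (r / nd) * d

    a, b = np.array([1.0, 1.0]), 1.0
    center, r = np.array([0.0, 0.0]), 1.0
    x = np.array([3.0, 3.0])
    for k in range(200):
        z1 = proj_halfspace(x, a, b)           # coordinate block z^1
        z2 = proj_ball(x, center, r)           # coordinate block z^2
        x = 0.5 * (z1 + z2)                    # coordinate block x: the average
    print(x)                                   # approaches a point of the intersection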

6.8 (Proximal Algorithm as a Block Coordinate Descent Method)

Let f : ℝ^n ↦ ℝ be a continuously differentiable convex function, let X be a closed
convex set, and let c be a positive scalar. Show that the proximal algorithm

    x_{k+1} ∈ arg min_{x∈X} { f(x) + (1/(2c))‖x − x_k‖² }

is a special case of the block coordinate descent method applied to the problem

    minimize    f(x) + (1/(2c))‖x − y‖²
    subject to  x ∈ X,  y ∈ ℝ^n,

which is equivalent to the problem of minimizing f over X. Hint: Consider the
cost function
    g(x, y) = f(x) + (1/(2c))‖x − y‖².

6.9 (Combination of Coordinate Descent and Proximal Algorithm [Tse91a])

Consider the minimization of a continuously differentiable function f of the vector
x = (x^1, ..., x^m) subject to x^i ∈ X_i, where X_i are closed convex sets. Consider
the following variation of the block coordinate descent method (6.141):

    x_{k+1}^i ∈ arg min_{ξ∈X_i} { f(x_{k+1}^1, ..., x_{k+1}^{i−1}, ξ, x_k^{i+1}, ..., x_k^m) + (1/(2c))‖ξ − x_k^i‖² },

for some scalar c > 0. Assuming that f is convex, show that every limit point of
the sequence of vectors x_k = (x_k^1, ..., x_k^m) is a global minimum. Hint: Apply the
result of Prop. 6.5.1 to the cost function

    g(x, y) = f(x^1, ..., x^m) + (1/(2c)) Σ_{i=1}^m ‖x^i − y^i‖².

For a related analysis of this type of algorithm see [Aus92], and for a recent
analysis see [BST14].

6.10 (Coordinate Descent for Convex Nondifferentiable Problems)

Consider the minimization of F + G, where F : ℝ^n ↦ ℝ is a differentiable convex
function and G : ℝ^n ↦ (−∞, ∞] is a closed proper convex function such that
∂G(x) ≠ ∅ for all x ∈ dom(G).

(a) Use the optimality conditions of Prop. 5.4.7 in Appendix B to show that

    x* ∈ arg min_{x∈ℝ^n} { F(x) + G(x) }

if and only if

    x* ∈ arg min_{x∈ℝ^n} { ∇F(x*)'x + G(x) }.

(b) Assume that G is separable of the form

    G(x) = Σ_{i=1}^n G_i(x^i),

where x^i, i = 1, ..., n, are the one-dimensional coordinates of x, and G_i :
ℝ ↦ (−∞, ∞] are closed proper convex functions. Use part (a) to show
that x* minimizes F + G if and only if (x^i)* minimizes F + G in the ith
coordinate direction for each i.

6.11 (Convergence Rate of Coordinate Descent Under Strong Convexity Conditions)

This exercise compares the convergence rates associated with various orders of
coordinate selection in the coordinate descent method, and illustrates how a
good deterministic order can outperform a randomized order. We consider the
minimization of a differentiable function f : ℝ^n ↦ ℝ, with Lipschitz continuous
gradient along the ith coordinate, i.e., for some L > 0, we have

    |∇_i f(x + αe_i) − ∇_i f(x)| ≤ L|α|,    ∀ x ∈ ℝ^n,  α ∈ ℝ,  i = 1, ..., n,      (6.239)

where e_i is the ith coordinate direction, e_i = (0, ..., 0, 1, 0, ..., 0) with the 1 in
the ith position, and ∇_i f is the ith component of the gradient. We also assume
that f is strongly convex in the sense that for some σ > 0

    f(y) ≥ f(x) + ∇f(x)'(y − x) + (σ/2)‖x − y‖²,    ∀ x, y ∈ ℝ^n,                   (6.240)

and we denote by x* the unique minimum of f. Consider the algorithm

    x_{k+1} = x_k − (1/L) ∇_{i_k}f(x_k) e_{i_k},

where the index i_k is chosen in some way to be specified shortly.

(a) (Randomized Coordinate Selection [LeL10], [Nes12]) Assume that i_k is selected
by uniform randomization over the index set {1, ..., n}, independently
of previous selections. Show that for all k,

    E{f(x_{k+1})} − f(x*) ≤ (1 − σ/(nL)) (f(x_k) − f(x*)).                          (6.241)



Abbreviated Solution: By the descent lemma (Prop. 6.1.2) and Eq. (6.239),
we have

    f(x_{k+1}) ≤ f(x_k) + ∇_{i_k}f(x_k)(x_{k+1}^{i_k} − x_k^{i_k}) + (L/2)(x_{k+1}^{i_k} − x_k^{i_k})²
              = f(x_k) − (1/(2L))|∇_{i_k}f(x_k)|²,                                   (6.242)

while by minimizing over y both sides of Eq. (6.240) with x = x_k, we have

    f(x*) ≥ f(x_k) − (1/(2σ))‖∇f(x_k)‖².                                            (6.243)

Taking conditional expected value, given x_k, in Eq. (6.242), we obtain

    E{f(x_{k+1})} ≤ E{f(x_k) − (1/(2L))|∇_{i_k}f(x_k)|²}
                 = f(x_k) − (1/(2L)) Σ_{i=1}^n (1/n)|∇_i f(x_k)|²
                 = f(x_k) − (1/(2Ln))‖∇f(x_k)‖².

Subtracting f(x*) from both sides and using Eq. (6.243), we obtain Eq. (6.241).

(b) (Gauss-Southwell Coordinate Selection [ScF14]) Assume that i_k is selected
according to
    i_k ∈ arg max_{i=1,...,n} |∇_i f(x_k)|,
and let the following strong convexity assumption hold for some σ_1 > 0,

    f(y) ≥ f(x) + ∇f(x)'(y − x) + (σ_1/2)‖x − y‖₁²,    ∀ x, y ∈ ℝ^n,                (6.244)

where ‖·‖₁ denotes the ℓ_1 norm. Show that for all k,

    f(x_{k+1}) − f(x*) ≤ (1 − σ_1/L) (f(x_k) − f(x*)).                              (6.245)

Abbreviated Solution: Minimizing with respect to y both sides of Eq. (6.244)
with x = x_k, we obtain

    f(x*) ≥ f(x_k) − (1/(2σ_1))‖∇f(x_k)‖_∞².

Combining this with Eq. (6.242) and using the definition of i_k, which implies
that
    |∇_{i_k}f(x_k)| = ‖∇f(x_k)‖_∞,
we obtain Eq. (6.245). Note: It can be shown that σ/n ≤ σ_1 ≤ σ, so that
the convergence rate estimate (6.245) is more favorable than the estimate
(6.241) of part (a); see [ScF14].

(c) Consider instead the algorithm

    x_{k+1} = x_k − (1/L_{i_k}) ∇_{i_k}f(x_k) e_{i_k},

where L_i is a Lipschitz constant for the gradient along the ith coordinate,
i.e.,

    |∇_i f(x + αe_i) − ∇_i f(x)| ≤ L_i|α|,    ∀ x ∈ ℝ^n,  α ∈ ℝ,  i = 1, ..., n.

Here i_k is selected as in part (a), by uniform randomization over the index
set {1, ..., n}, independently of previous selections. Show that for all k,

    E{f(x_{k+1})} − f(x*) ≤ (1 − σ/L̄) (f(x_k) − f(x*)),

where L̄ = Σ_{i=1}^n L_i. Note: This is a stronger convergence rate estimate
than the one of part (a), which applies with L = max{L_1, ..., L_n}. The
use of different stepsizes for different coordinates may be viewed as a form
of diagonal scaling. In practice, L_{i_k} may be approximated by a finite
difference approximation of the second derivative of f along e_{i_k} or some
other crude line search scheme.
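To make the comparison of selection rules concrete (this sketch is not part of the
exercise), the following Python code applies coordinate descent with the common
stepsize 1/L to a hypothetical strongly convex quadratic, once with uniformly
randomized coordinate selection and once with the Gauss-Southwell rule of part (b).

    import numpy as np

    rng = np.random.default_rng(0)
    Q = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])   # positive definite
    b = np.array([1.0, 0.0, -1.0])
    L = np.max(np.diag(Q))                     # Q_ii is a Lipschitz constant along coordinate i
    x_star = np.linalg.solve(Q, b)             # unique minimizer of f(x) = (1/2)x'Qx - b'x

    def run(rule, iters=200):
        x = np.zeros(3)
        for k in range(iters):
            g = Q @ x - b                      # gradient of f at x
            i = rng.integers(3) if rule == "random" else int(np.argmax(np.abs(g)))
            x[i] -= g[i] / L                   # x_{k+1} = x_k - (1/L) grad_i f(x_k) e_i
        return np.linalg.norm(x - x_star)

    print(run("random"), run("gauss-southwell"))   # the Gauss-Southwell run is typically at least as accurate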
APPENDIX A:
Mathematical Background

In Sections A.1-A.3 of this appendix, we provide some basic definitions,


notational conventions, and results from linear algebra and real analysis.
We assume that the reader is familiar with these subjects, so no proofs
are given. For additional related material, we refer to textbooks such as
Hoffman and Kunze [HoK71], Lancaster and Tismenetsky [LaT85], and
Strang [Str76] (linear algebra), and Ash [Ash72], Ortega and Rheinholdt
[OrR70], and Rudin [Rud76] (real analysis).
In Section A.4, we provide a few convergence theorems for deter-
ministic and random sequences, which we will use for various convergence
analyses of algorithms in the text. Except for the Supermartingale Conver-
gence Theorem for sequences of random variables (Prop. A.4.5), we provide
complete proofs.

Set Notation

If X is a set and x is an element of X, we write x E X. A set can be


specified in the form X = {x I x satisfies P}, as the set of all elements
satisfying property P. The union of two sets X 1 and X 2 is denoted by
X_1 ∪ X_2, and their intersection by X_1 ∩ X_2. The symbols ∃ and ∀ have
the meanings "there exists" and "for all," respectively. The empty set is
denoted by 0.
The set of real numbers (also referred to as scalars) is denoted by R.
The set R augmented with +oo and -oo is called the set of extended real
numbers. We write −∞ < x < ∞ for all real numbers x, and −∞ ≤ x ≤ ∞
for all extended real numbers x. We denote by [a, b] the set of (possibly
extended) real numbers x satisfying a ≤ x ≤ b. A rounded, instead of
square, bracket denotes strict inequality in the definition. Thus (a, b], [a, b),
and (a, b) denote the set of all x satisfying a < x ≤ b, a ≤ x < b, and


a < x < b, respectively. Furthermore, we use the natural extensions of the


rules of arithmetic: x · 0 = 0 for every extended real number x, x · ∞ = ∞
if x > 0, x · ∞ = −∞ if x < 0, and x + ∞ = ∞ and x − ∞ = −∞ for
every scalar x. The expression ∞ − ∞ is meaningless and is never allowed
to occur.

Inf and Sup Notation

The supremum of a nonempty set X of scalars, denoted by sup X, is defined


as the smallest scalar y such that y ≥ x for all x ∈ X. If no such scalar
exists, we say that the supremum of X is ∞. Similarly, the infimum of X,
denoted by inf X, is defined as the largest scalar y such that y ≤ x for all
x ∈ X, and is equal to −∞ if no such scalar exists. For the empty set, we
use the convention

    sup ∅ = −∞,    inf ∅ = ∞.


If sup X is equal to a scalar x that belongs to the set X, we say that
xis the maximum point of X and we write x = maxX. Similarly, if inf Xis
equal to a scalar x that belongs to the set X, we say that x is the minimum
point of X and we write x = minX. Thus, when we write maxX (or minX)
in place of sup X (or inf X, respectively), we do so just for emphasis: we
indicate that it is either evident, or it is known through earlier analysis, or
it is about to be shown that the maximum (or minimum, respectively) of
the set X is attained at one of its points.

Vector Notation

We denote by ℝ^n the set of n-dimensional real vectors. For any x ∈ ℝ^n,
we use x_i (or sometimes x^i) to indicate its ith coordinate, also called its
ith component. Vectors in ℝ^n will be viewed as column vectors, unless the
contrary is explicitly stated. For any x ∈ ℝ^n, x' denotes the transpose of
x, which is an n-dimensional row vector. The inner product of two vectors
x = (x_1, ..., x_n) and y = (y_1, ..., y_n) is defined by x'y = Σ_{i=1}^n x_i y_i. Two
vectors x, y ∈ ℝ^n satisfying x'y = 0 are called orthogonal.
If x is a vector in ℝ^n, the notations x > 0 and x ≥ 0 indicate that all
components of x are positive and nonnegative, respectively. For any two
vectors x and y, the notation x > y means that x − y > 0. The notations
x ≥ y, x < y, etc., are to be interpreted accordingly.

Function Notation and Terminology

If f is a function, we use the notation f : X ↦ Y to indicate the fact that
f is defined on a nonempty set X (its domain) and takes values in a set
Y (its range). Thus when using the notation f : X ↦ Y, we implicitly
assume that X is nonempty. If f : X ↦ Y is a function, and U and V
are subsets of X and Y, respectively, the set {f(x) | x ∈ U} is called the
image or forward image of U under f, and the set {x ∈ X | f(x) ∈ V} is
called the inverse image of V under f.
A function f : ℝ^n ↦ ℝ is said to be affine if it has the form f(x) =
a'x + b for some a ∈ ℝ^n and b ∈ ℝ. Similarly, a function f : ℝ^n ↦ ℝ^m is
said to be affine if it has the form f(x) = Ax + b for some m × n matrix
A and some b ∈ ℝ^m. If b = 0, f is said to be a linear function or linear
transformation. Sometimes, with slight abuse of terminology, an equation
or inequality involving a linear function, such as a'x = b or a'x ≤ b, is
referred to as a linear equation or inequality, respectively.

A.1 LINEAR ALGEBRA

If X is a subset of ℝ^n and λ is a scalar, we denote by λX the set {λx | x ∈ X}.
If X and Y are two subsets of ℝ^n, we denote by X + Y the set

    {x + y | x ∈ X, y ∈ Y},

which is referred to as the vector sum of X and Y. We use a similar
notation for the sum of any finite number of subsets. In the case where
one of the subsets consists of a single vector x̄, we simplify this notation as
follows:
    x̄ + X = {x̄ + x | x ∈ X}.
We also denote by X − Y the set

    {x − y | x ∈ X, y ∈ Y}.

Given sets X_i ⊂ ℝ^{n_i}, i = 1, ..., m, the Cartesian product of the X_i,
denoted by X_1 × ··· × X_m, is the set

    {(x_1, ..., x_m) | x_i ∈ X_i, i = 1, ..., m},

which is viewed as a subset of ℝ^{n_1+···+n_m}.

Subspaces and Linear Independence

A nonempty subset S of ℝ^n is called a subspace if ax + by ∈ S for every
x, y ∈ S and every a, b ∈ ℝ. An affine set in ℝ^n is a translated subspace,
i.e., a set X of the form X = x̄ + S = {x̄ + x | x ∈ S}, where x̄ is a vector
in ℝ^n and S is a subspace of ℝ^n, called the subspace parallel to X. Note
that there can be only one subspace S associated with an affine set in this
manner. [To see this, let X = x̄ + S and X = x̃ + S̃ be two representations
of the affine set X. Then, we must have x̃ = x̄ + s̄ for some s̄ ∈ S (since
x̃ ∈ X), so that X = x̄ + s̄ + S̃. Since we also have X = x̄ + S, it follows
that S̃ = S − s̄ = S.] A nonempty set X is a subspace if and only if
it contains the origin, and every line that passes through any pair of its
points that are distinct, i.e., it contains 0 and all points αx + (1 − α)y,
where α ∈ ℝ and x, y ∈ X with x ≠ y. Similarly X is affine if and only
if it contains every line that passes through any pair of its points that are
distinct. The span of a finite collection {x_1, ..., x_m} of elements of ℝ^n,
denoted by span(x_1, ..., x_m), is the subspace consisting of all vectors y of
the form y = Σ_{k=1}^m α_k x_k, where each α_k is a scalar.
The vectors x_1, ..., x_m ∈ ℝ^n are called linearly independent if there
exists no set of scalars α_1, ..., α_m, at least one of which is nonzero, such
that Σ_{k=1}^m α_k x_k = 0. An equivalent definition is that x_1 ≠ 0, and for every
k > 1, the vector x_k does not belong to the span of x_1, ..., x_{k−1}.
If S is a subspace of ℝ^n containing at least one nonzero vector, a basis
for S is a collection of vectors that are linearly independent and whose
span is equal to S. Every basis of a given subspace has the same number
of vectors. This number is called the dimension of S. By convention, the
subspace {0} is said to have dimension zero. Every subspace of nonzero
dimension has a basis that is orthogonal (i.e., any pair of distinct vectors
from the basis is orthogonal). The dimension of an affine set x̄ + S is the
dimension of the corresponding subspace S. An (n − 1)-dimensional affine
subset of ℝ^n is called a hyperplane, assuming n ≥ 2. It is a set specified by
a single linear equation, i.e., a set of the form {x | a'x = b}, where a ≠ 0
and b ∈ ℝ.
Given any subset X of ℝ^n, the set of vectors that are orthogonal to
all elements of X is a subspace denoted by X⊥:
    X⊥ = {y | y'x = 0, ∀ x ∈ X}.
If S is a subspace, S⊥ is called the orthogonal complement of S. Any vector
x can be uniquely decomposed as the sum of a vector from S and a vector
from S⊥. Furthermore, we have (S⊥)⊥ = S.

Matrices

For any matrix A, we use A_ij, [A]_ij, or a_ij to denote its ijth component.
The transpose of A, denoted by A', is defined by [A']_ij = a_ji. For any two
matrices A and B of compatible dimensions, the transpose of the product
matrix AB satisfies (AB)' = B'A'. The inverse of a square and invertible
A is denoted A^{-1}.
If X is a subset of ℝ^n and A is an m × n matrix, then the image of
X under A is denoted by AX (or A · X if this enhances notational clarity):
    AX = {Ax | x ∈ X}.
If Y is a subset of ℝ^m, the inverse image of Y under A is denoted by A^{-1}Y:
    A^{-1}Y = {x | Ax ∈ Y}.

Let A be an m × n matrix. The range space of A, denoted by R(A),
is the set of all vectors y ∈ ℝ^m such that y = Ax for some x ∈ ℝ^n. The
nullspace of A, denoted by N(A), is the set of all vectors x ∈ ℝ^n such
that Ax = 0. It is seen that the range space and the nullspace of A are
subspaces. The rank of A is the dimension of the range space of A. The
rank of A is equal to the maximal number of linearly independent columns
of A, and is also equal to the maximal number of linearly independent
rows of A. The matrix A and its transpose A' have the same rank. We say
that A has full rank, if its rank is equal to min{m, n}. This is true if and
only if either all the rows of A are linearly independent, or all the columns
of A are linearly independent. The range space of an m × n matrix A is
equal to the orthogonal complement of the nullspace of its transpose, i.e.,
    R(A) = N(A')⊥.

Square Matrices

By a square matrix we mean any n × n matrix, where n ≥ 1. The deter-
minant of a square matrix A is denoted by det(A).

Definition A.1.1: A square matrix A is called singular if its deter-
minant is zero. Otherwise it is called nonsingular or invertible.

Definition A.1.2: The characteristic polynomial φ of an n × n matrix
A is defined by φ(λ) = det(λI − A), where I is the identity matrix of
the same size as A. The n (possibly repeated and complex) roots of
φ are called the eigenvalues of A. A nonzero vector x (with possibly
complex coordinates) such that Ax = λx, where λ is an eigenvalue of
A, is called an eigenvector of A associated with λ.

Note that the only use of complex numbers in this book is in relation
to eigenvalues and eigenvectors. All other matrices or vectors are implicitly
assumed to have real components.

Proposition A.1.1:
(a) Let A be an n x n matrix. The following are equivalent:
(i) The matrix A is nonsingular.
(ii) The matrix A 1 is nonsingular.
(iii) For every nonzero x ∈ ℝ^n, we have Ax ≠ 0.
(iv) For every y ∈ ℝ^n, there is a unique x ∈ ℝ^n such that Ax = y.
(v) There is an n × n matrix B such that AB = I = BA.
(vi) The columns of A are linearly independent.
(vii) The rows of A are linearly independent.
(viii) All eigenvalues of A are nonzero.
(b) Assuming that A is nonsingular, the matrix B of statement (v)
(called the inverse of A and denoted by A^{-1}) is unique.
(c) For any two square invertible matrices A and B of the same
dimensions, we have (AB)^{-1} = B^{-1}A^{-1}.

Proposition A.1.2: Let A be an n × n matrix.

(a) If T is a nonsingular matrix and B = TAT^{-1}, then the eigenvalues
of A and B coincide.
(b) For any scalar c, the eigenvalues of cI + A are equal to c +
λ_1, ..., c + λ_n, where λ_1, ..., λ_n are the eigenvalues of A.
(c) The eigenvalues of A^k are equal to λ_1^k, ..., λ_n^k, where λ_1, ..., λ_n
are the eigenvalues of A.
(d) If A is nonsingular, then the eigenvalues of A^{-1} are the reciprocals
of the eigenvalues of A.
(e) The eigenvalues of A and A' coincide.

Let A and B be square matrices, and let C be a matrix of appropriate
dimension. Then we have

    (A + CBC')^{-1} = A^{-1} − A^{-1}C(B^{-1} + C'A^{-1}C)^{-1}C'A^{-1},

provided all the inverses appearing above exist. For a proof, multiply the
right-hand side by A + CBC' and show that the product is the identity.
Another useful formula provides the inverse of the partitioned matrix

    M = [ A  B ]
        [ C  D ].

There holds

    M^{-1} = [ Q                −QBD^{-1}                 ]
             [ −D^{-1}CQ   D^{-1} + D^{-1}CQBD^{-1} ],

where
    Q = (A − BD^{-1}C)^{-1},

provided all the inverses appearing above exist. For a proof, multiply M
with the given expression for M^{-1} and verify that the product is the identity.
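The two formulas above are easy to check numerically; the following Python
sketch (not part of the text) verifies them on small randomly generated matrices
for which all of the required inverses exist.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 4, 2

    # Matrix inversion identity (A + CBC')^{-1} = A^{-1} - A^{-1}C(B^{-1} + C'A^{-1}C)^{-1}C'A^{-1}.
    A = np.eye(n) + 0.1 * rng.standard_normal((n, n)); A = A @ A.T    # symmetric positive definite
    B = np.eye(m)
    C = rng.standard_normal((n, m))
    Ainv = np.linalg.inv(A)
    lhs = np.linalg.inv(A + C @ B @ C.T)
    rhs = Ainv - Ainv @ C @ np.linalg.inv(np.linalg.inv(B) + C.T @ Ainv @ C) @ C.T @ Ainv
    print(np.allclose(lhs, rhs))               # True

    # Partitioned inverse of M = [[A, Bm], [Cm, D]] with Q = (A - Bm D^{-1} Cm)^{-1}.
    D = np.eye(n) + 0.1 * rng.standard_normal((n, n)); D = D @ D.T
    Bm = 0.1 * rng.standard_normal((n, n))
    Cm = 0.1 * rng.standard_normal((n, n))
    M = np.block([[A, Bm], [Cm, D]])
    Dinv = np.linalg.inv(D)
    Qb = np.linalg.inv(A - Bm @ Dinv @ Cm)
    Minv = np.block([[Qb, -Qb @ Bm @ Dinv],
                     [-Dinv @ Cm @ Qb, Dinv + Dinv @ Cm @ Qb @ Bm @ Dinv]])
    print(np.allclose(Minv, np.linalg.inv(M))) # True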

Symmetric and Positive Definite Matrices

A square matrix A is said to be symmetric if A = A'. Symmetric matrices


have several special properties, particularly regarding their eigenvalues and
eigenvectors.

Proposition A.1.3: Let A be a symmetric n × n matrix. Then:

(a) The eigenvalues of A are real.
(b) The matrix A has a set of n mutually orthogonal, real, and
nonzero eigenvectors x_1, ..., x_n.
(c) There holds

    λ_min x'x ≤ x'Ax ≤ λ_max x'x,    ∀ x ∈ ℝ^n,

where λ_min and λ_max are the smallest and largest eigenvalues of A,
respectively.

Definition A.1.3: A symmetric n × n matrix A is called positive definite
if x'Ax > 0 for all x ∈ ℝ^n, x ≠ 0. It is called positive semidefinite
if x'Ax ≥ 0 for all x ∈ ℝ^n.

Throughout this book, the notion of positive definiteness applies exclusively
to symmetric matrices. Thus whenever we say that a matrix is
positive (semi)definite, we implicitly assume that the matrix is symmetric,
although we usually add the term "symmetric" for clarity.

Proposition A.1.4:
(a) A square matrix is symmetric and positive definite if and only if
it is invertible and its inverse is symmetric and positive definite.
(b) The sum of two symmetric positive semidefinite matrices is pos-
itive semidefinite. If one of the two matrices is positive definite,
the sum is positive definite.

(c) If A is a symmetric positive semidefinite n × n matrix and T is
an m × n matrix, then the matrix TAT' is positive semidefinite.
If A is positive definite and T is invertible, then TAT' is positive
definite.
(d) If A is a symmetric positive definite n × n matrix, there exists
a unique symmetric positive definite matrix that yields A when
multiplied with itself. This matrix is called the square root of A.
It is denoted by A^{1/2}, and its inverse is denoted by A^{-1/2}.

A.2 TOPOLOGICAL PROPERTIES

Definition A.2.1: A norm ‖·‖ on ℝ^n is a function that assigns a
scalar ‖x‖ to every x ∈ ℝ^n and that has the following properties:
(a) ‖x‖ ≥ 0 for all x ∈ ℝ^n.
(b) ‖αx‖ = |α| · ‖x‖ for every scalar α and every x ∈ ℝ^n.
(c) ‖x‖ = 0 if and only if x = 0.
(d) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ ℝ^n (this is referred to as the
triangle inequality).

The Euclidean norm of a vector x = (x_1, ..., x_n) is defined by

    ‖x‖ = (x'x)^{1/2} = (Σ_{i=1}^n |x_i|²)^{1/2}.

Except for specialized contexts, we use this norm. In particular, in the
absence of a clear indication to the contrary, ‖·‖ will denote the Euclidean
norm. The Schwarz inequality states that for any two vectors x and y, we
have
    |x'y| ≤ ‖x‖ · ‖y‖,
with equality holding if and only if x = αy for some scalar α. The
Pythagorean Theorem states that for any two vectors x and y that are
orthogonal, we have
    ‖x + y‖² = ‖x‖² + ‖y‖².
Two other important norms are the maximum norm ‖·‖_∞ (also called
sup-norm or ℓ_∞-norm), defined by
    ‖x‖_∞ = max_{i=1,...,n} |x_i|,
and the ℓ_1-norm ‖·‖_1, defined by
    ‖x‖_1 = Σ_{i=1}^n |x_i|.
Sequences

We use both subscripts and superscripts in sequence notation. Generally,


we prefer subscripts, but sometimes we use superscripts whenever we need
to reserve the subscript notation for indexing components of vectors and
functions. The meaning of the subscripts and superscripts should be clear
from the context in which they are used.
A scalar sequence {x_k | k = 1, 2, ...} (or {x_k} for short) is said
to converge if there exists a scalar x such that for every ε > 0 we have
|x_k − x| < ε for every k greater than some integer K (that depends on
ε). The scalar x is said to be the limit of {x_k}, and the sequence {x_k}
is said to converge to x; symbolically, x_k → x or lim_{k→∞} x_k = x. If for
every scalar b there exists some K (that depends on b) such that x_k ≥ b
for all k ≥ K, we write x_k → ∞ and lim_{k→∞} x_k = ∞. Similarly, if for
every scalar b there exists some integer K such that x_k ≤ b for all k ≥ K,
we write x_k → −∞ and lim_{k→∞} x_k = −∞. Note, however, that implicit
in any of the statements "{x_k} converges" or "the limit of {x_k} exists" or
"{x_k} has a limit" is that the limit of {x_k} is a scalar.
A scalar sequence {x_k} is said to be bounded above (respectively, below)
if there exists some scalar b such that x_k ≤ b (respectively, x_k ≥ b) for
all k. It is said to be bounded if it is bounded above and bounded below.
The sequence {x_k} is said to be monotonically nonincreasing (respectively,
nondecreasing) if x_{k+1} ≤ x_k (respectively, x_{k+1} ≥ x_k) for all k. If x_k → x
and {x_k} is monotonically nonincreasing (nondecreasing), we also use the
notation x_k ↓ x (x_k ↑ x, respectively).

Proposition A.2.1: Every bounded and monotonically nonincreas-


ing or nondecreasing scalar sequence converges.

Note that a monotonically nondecreasing sequence { Xk} is either


bounded, in which case it converges to some scalar x by the above propo-
sition, or else it is unbounded, in which case Xk -+ oo. Similarly, a mono-
tonically nonincreasing sequence { xk} is either bounded and converges, or
it is unbounded, in which case Xk -+ -oo.
Given a scalar sequence {x_k}, let
    y_m = sup{x_k | k ≥ m},    z_m = inf{x_k | k ≥ m}.
The sequences {y_m} and {z_m} are nonincreasing and nondecreasing, re-
spectively, and therefore have a limit whenever {x_k} is bounded above or
is bounded below, respectively (Prop. A.2.1). The limit of y_m is denoted
by limsup_{k→∞} x_k, and is referred to as the upper limit of {x_k}. The limit
of z_m is denoted by liminf_{k→∞} x_k, and is referred to as the lower limit of
{x_k}. If {x_k} is unbounded above, we write limsup_{k→∞} x_k = ∞, and if it
is unbounded below, we write liminf_{k→∞} x_k = −∞.

Proposition A.2.2: Let {x_k} and {y_k} be scalar sequences.

(a) We have

    inf{x_k | k ≥ 0} ≤ liminf_{k→∞} x_k ≤ limsup_{k→∞} x_k ≤ sup{x_k | k ≥ 0}.

(b) {x_k} converges if and only if

    −∞ < liminf_{k→∞} x_k = limsup_{k→∞} x_k < ∞.

Furthermore, if {x_k} converges, its limit is equal to the common
scalar value of liminf_{k→∞} x_k and limsup_{k→∞} x_k.

(c) If x_k ≤ y_k for all k, then

    liminf_{k→∞} x_k ≤ liminf_{k→∞} y_k,    limsup_{k→∞} x_k ≤ limsup_{k→∞} y_k.

(d) We have

    liminf_{k→∞} x_k + liminf_{k→∞} y_k ≤ liminf_{k→∞} (x_k + y_k),

    limsup_{k→∞} x_k + limsup_{k→∞} y_k ≥ limsup_{k→∞} (x_k + y_k).

A sequence {x_k} of vectors in ℝ^n is said to converge to some x ∈ ℝ^n
if the ith component of x_k converges to the ith component of x for every i.
We use the notations x_k → x and lim_{k→∞} x_k = x to indicate convergence
for vector sequences as well. A sequence {x_k} ⊂ ℝ^n is said to be a Cauchy
sequence if ‖x_m − x_n‖ → 0 as m, n → ∞, i.e., given any ε > 0, there exists
N such that ‖x_m − x_n‖ ≤ ε for all m, n ≥ N. A sequence is Cauchy if and
only if it converges to some vector. The sequence {x_k} is called bounded if
each of its corresponding component sequences is bounded. It can be seen
that {x_k} is bounded if and only if there exists a scalar c such that ‖x_k‖ ≤ c
for all k. An infinite subset of a sequence {x_k} is called a subsequence of
{x_k}. Thus a subsequence can itself be viewed as a sequence, and can be

represented as a set {x_k | k ∈ K}, where K is an infinite subset of positive
integers (the notation {x_k}_K will also be used).
A vector x ∈ ℝ^n is said to be a limit point of a sequence {x_k} if
there exists a subsequence of {x_k} that converges to x. The following is a
classical result that will be used often.

Proposition A.2.3: (Bolzano-Weierstrass Theorem) A bounded
sequence in ℝ^n has at least one limit point.

o(·) Notation

For a function h : ℝ^n ↦ ℝ^m we write h(x) = o(‖x‖^p), where p is a positive
integer, if
    lim_{k→∞} h(x_k)/‖x_k‖^p = 0,

for all sequences {x_k} such that x_k → 0 and x_k ≠ 0 for all k.

Closed and Open Sets

We say that x is a closure point of a subset X of Rn if there exists a


sequence {xk} C X that converges to x. The closure of X, denoted cl(X),
is the set of all closure points of X .

Definition A.2.2: A subset X of ℝ^n is called closed if it is equal to
its closure. It is called open if its complement, {x | x ∉ X}, is closed.
It is called bounded if there exists a scalar c such that ‖x‖ ≤ c for all
x ∈ X. It is called compact if it is closed and bounded.

Given x* ∈ ℝ^n and ε > 0, the sets {x | ‖x − x*‖ < ε} and {x |
‖x − x*‖ ≤ ε} are called an open sphere and a closed sphere centered at
x*, respectively. Sometimes the terms open ball and closed ball are used.
A consequence of the definitions is that a subset X of ℝ^n is open if and
only if for every x ∈ X there is an open sphere that is centered at x and is
contained in X. A neighborhood of a vector x is an open set containing x.

Definition A.2.3: We say that x is an interior point of a subset X of
ℝ^n if there exists a neighborhood of x that is contained in X. The set
of all interior points of X is called the interior of X, and is denoted
by int(X). A vector x ∈ cl(X) which is not an interior point of X is
said to be a boundary point of X. The set of all boundary points of X
is called the boundary of X.

Proposition A.2.4:
(a) The union of a finite collection of closed sets is closed.
(b) The intersection of any collection of closed sets is closed.
(c) The union of any collection of open sets is open.
(d) The intersection of a finite collection of open sets is open.
(e) A set is open if and only if all of its elements are interior points.
(f) Every subspace of ℝ^n is closed.
(g) A set X ⊂ ℝ^n is compact if and only if every sequence of elements
of X has a subsequence that converges to an element of X.
(h) If {X_k} is a sequence of nonempty and compact subsets of ℝ^n
such that X_{k+1} ⊂ X_k for all k, then the intersection ∩_{k=0}^∞ X_k is
nonempty and compact.

The topological properties of sets in Rn, such as being open, closed,


or compact, do not depend on the norm being used. This is a consequence
of the following proposition.

Proposition A.2.5: (Norm Equivalence Property)

(a) For any two norms ‖·‖ and ‖·‖' on ℝ^n, there exists a scalar c
such that
    ‖x‖ ≤ c‖x‖',    ∀ x ∈ ℝ^n.
(b) If a subset of ℝ^n is open (respectively, closed, bounded, or compact)
with respect to some norm, it is open (respectively, closed,
bounded, or compact) with respect to all other norms.

Continuity

Let f : X → Rm be a function, where X is a subset of Rn, and let x be a
vector in X. If there exists a vector y ∈ Rm such that the sequence {f(xk)}
converges to y for every sequence {xk} ⊂ X such that limk→∞ xk = x, we
write lim_{z→x} f(z) = y. If there exists a vector y ∈ Rm such that the
sequence {f(xk)} converges to y for every sequence {xk} ⊂ X such that
limk→∞ xk = x and xk ≥ x (respectively, xk ≤ x) for all k, we write
lim_{z↓x} f(z) = y [respectively, lim_{z↑x} f(z) = y].

Definition A.2.4: Let X be a nonempty subset of Rn.
(a) A function f : X → Rm is called continuous at a vector x ∈ X if
lim_{z→x} f(z) = f(x).
(b) A function f : X → Rm is called right-continuous (respectively,
left-continuous) at a vector x ∈ X if lim_{z↓x} f(z) = f(x) [respec-
tively, lim_{z↑x} f(z) = f(x)].
(c) A function f : X → Rm is called Lipschitz continuous over X if
there exists a scalar L such that

    ‖f(x) − f(y)‖ ≤ L‖x − y‖,   ∀ x, y ∈ X.

(d) A real-valued function f : X → R is called upper semicontinuous
(respectively, lower semicontinuous) at a vector x ∈ X if f(x) ≥
lim sup_{k→∞} f(xk) [respectively, f(x) ≤ lim inf_{k→∞} f(xk)] for ev-
ery sequence {xk} ⊂ X that converges to x.

If f : X → Rm is continuous at every vector in a subset of its domain
X, we say that f is continuous over that subset. If f : X → Rm is contin-
uous at every vector in its domain X, we say that f is continuous (with-
out qualification). We use similar terminology for right-continuous, left-
continuous, Lipschitz continuous, upper semicontinuous, and lower semi-
continuous functions.

Proposition A.2.6:
(a) Any vector norm on Rn is a continuous function.
(b) Let f : Rm → Rp and g : Rn → Rm be continuous functions.
The composition f · g : Rn → Rp, defined by (f · g)(x) = f(g(x)),
is a continuous function.
(c) Let f : Rn → Rm be continuous, and let Y be an open (re-
spectively, closed) subset of Rm. Then the inverse image of Y,
{x ∈ Rn | f(x) ∈ Y}, is open (respectively, closed).
(d) Let f : Rn → Rm be continuous, and let X be a compact subset
of Rn. Then the image of X, {f(x) | x ∈ X}, is compact.

If f : Rn → R is a continuous function and X ⊂ Rn is compact, by
Prop. A.2.6(c), the sets

    Vγ = {x ∈ X | f(x) ≤ γ}

are nonempty and compact for all γ ∈ R with γ > f*, where

    f* = inf_{x∈X} f(x).

Since the set of minima of f is the intersection of the nonempty and compact
sets Vγk for any sequence {γk} with γk ↓ f* and γk > f* for all k, it follows
from Prop. A.2.4(h) that the set of minima is nonempty. This proves the
following classical theorem of Weierstrass.

Proposition A.2.7: (Weierstrass' Theorem for Continuous
Functions) A continuous function f : Rn → R attains a minimum
over any compact subset of Rn.

A.3 DERIVATIVES

Let f : Rn → R be some function, fix x ∈ Rn, and consider the expression

    lim_{α→0}  ( f(x + α ei) − f(x) ) / α,

where ei is the ith unit vector (all components are 0 except for the ith
component, which is 1). If the above limit exists, it is called the ith par-
tial derivative of f at the vector x and it is denoted by (∂f/∂xi)(x) or
∂f(x)/∂xi (xi in this section will denote the ith component of the vector
x). Assuming all of these partial derivatives exist, the gradient of f at x is
defined as the column vector

    ∇f(x) = ( ∂f(x)/∂x1, ..., ∂f(x)/∂xn )′.

For any d ∈ Rn, we define the one-sided directional derivative of f at
a vector x in the direction d by

    f′(x; d) = lim_{α↓0}  ( f(x + αd) − f(x) ) / α,

provided that the limit exists.



If the directional derivative of f at a vector x exists in all directions


and f′(x; d) is a linear function of d, we say that f is differentiable at x. It
can be seen that f is differentiable at x if and only if the gradient ∇f(x)
exists and satisfies ∇f(x)′d = f′(x; d) for all d ∈ Rn, or equivalently

    f(x + αd) = f(x) + α ∇f(x)′d + o(|α|),   ∀ α ∈ R.

The function f is called differentiable over a subset S of Rn if it is differ-
entiable at every x ∈ S. The function f is called differentiable (without
qualification) if it is differentiable at all x ∈ Rn.
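
As an illustrative numerical sketch of these definitions (in Python with NumPy;
the test function, point, direction, and step sizes are arbitrary choices), one
can compare one-sided difference quotients with ∇f(x)′d:

    import numpy as np

    # Smooth test function f(x) = log(1 + ||x||^2) and its gradient (computed by hand).
    f = lambda x: np.log(1.0 + x @ x)
    grad_f = lambda x: 2.0 * x / (1.0 + x @ x)

    x = np.array([1.0, -2.0, 0.5])
    d = np.array([0.3, 0.1, -1.0])

    # One-sided difference quotients (f(x + a d) - f(x)) / a as a decreases to 0.
    for a in [1e-1, 1e-3, 1e-5]:
        quotient = (f(x + a * d) - f(x)) / a
        print(a, quotient, grad_f(x) @ d)   # the quotient approaches f'(x; d) = grad_f(x)'d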
If f is differentiable over an open set S and ∇f(·) is continuous at
all x ∈ S, f is said to be continuously differentiable over S. It can then be
shown that for any x ∈ S and norm ‖·‖,

    f(x + d) = f(x) + ∇f(x)′d + o(‖d‖).

The function f is called continuously differentiable (without qualification)
if it is differentiable and ∇f(·) is continuous at all x ∈ Rn. In our de-
velopment, whenever we assume that f is differentiable, we also assume
that it is continuously differentiable. Part of the reason is that a convex
differentiable function is automatically continuously differentiable over Rn
(see Section 3.1).
If each one of the partial derivatives of a function f : Rn → R is a
continuously differentiable function of x over an open set S, we say that f
is twice continuously differentiable over S. We then denote by

    ∂²f(x) / ∂xi ∂xj

the ith partial derivative of ∂f/∂xj at a vector x ∈ Rn. The Hessian of f
at x, denoted by ∇²f(x), is the matrix whose components are the above
second derivatives. The matrix ∇²f(x) is symmetric. In our development,
whenever we assume that f is twice differentiable, we also assume that it
is twice continuously differentiable.
We now state some theorems relating to differentiable functions.

Proposition A.3.1: (Mean Value Theorem) Let f : Rn → R be
continuously differentiable over an open sphere S, and let x be a vector
in S. Then for all y such that x + y ∈ S, there exists an α ∈ [0, 1] such
that
    f(x + y) = f(x) + ∇f(x + αy)′y.

Proposition A.3.2: (Second Order Expansions) Let f : Rn → R
be twice continuously differentiable over an open sphere S, and let x
be a vector in S. Then for all y such that x + y ∈ S:
(a) There exists an α ∈ [0, 1] such that

    f(x + y) = f(x) + y′∇f(x) + ½ y′∇²f(x + αy) y.

(b) We have

    f(x + y) = f(x) + y′∇f(x) + ½ y′∇²f(x) y + o(‖y‖²).

A.4 CONVERGENCE THEOREMS

We will now discuss a few convergence theorems relating to iterative algo-
rithms. Given a mapping T : Rn → Rn, the iteration

    xk+1 = T(xk)

aims at finding a fixed point of T, i.e., a vector x* such that x* = T(x*).


A common criterion for existence of a fixed point is that T is a contraction
mapping (or contraction for short) with respect to some norm, i.e., for some
β < 1, and some norm ‖·‖ (not necessarily the Euclidean norm), we have

    ‖T(x) − T(y)‖ ≤ β‖x − y‖,   ∀ x, y ∈ Rn.

When T is a contraction, it has a unique fixed point and the iteration


Xk+l = T(xk) converges to the fixed point. This is shown in the following
classical theorem.

Proposition A.4.1: (Contraction Mapping Theorem) Let T :
Rn → Rn be a contraction mapping. Then T has a unique fixed
point x*, and the sequence generated by the iteration xk+1 = T(xk)
converges to x*, starting from any x0 ∈ Rn.

Proof: We first note that T can have at most one fixed point: if x̄ and x̃
are two fixed points, we have

    ‖x̄ − x̃‖ = ‖T(x̄) − T(x̃)‖ ≤ β‖x̄ − x̃‖,

which implies that x̄ = x̃. Using the contraction property, we have for all
k, m > 0,

    ‖xk+m − xk‖ ≤ β^k ‖xm − x0‖ ≤ β^k Σ_{ℓ=1}^{m} ‖xℓ − xℓ−1‖ ≤ β^k Σ_{ℓ=0}^{m−1} β^ℓ ‖x1 − x0‖,

and finally,

    ‖xk+m − xk‖ ≤ ( β^k (1 − β^m) / (1 − β) ) ‖x1 − x0‖.

Thus {xk} is a Cauchy sequence, and hence converges to some x*. Taking
the limit in the equation xk+1 = T(xk) and using the continuity of T
(implied by the contraction property), we see that x* must be a fixed point
of T. Q.E.D.

In the case of a linear mapping

    T(x) = Ax + b,

where A is an n × n matrix and b ∈ Rn, it can be shown that T is a
contraction mapping with respect to some norm (but not necessarily all
norms) if and only if all the eigenvalues of A lie strictly within the unit
circle.
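
A minimal numerical sketch of this fact (assuming NumPy; the particular A
and b are arbitrary choices with the spectral radius of A below one) iterates
xk+1 = Axk + b and compares the limit with the fixed point (I − A)⁻¹ b:

    import numpy as np

    # Affine map T(x) = A x + b with spectral radius of A strictly inside the unit circle.
    A = np.array([[0.5, 0.2],
                  [0.1, 0.4]])
    b = np.array([1.0, -1.0])
    print("spectral radius:", max(abs(np.linalg.eigvals(A))))   # 0.6 here

    x = np.zeros(2)
    for _ in range(100):
        x = A @ x + b                              # x_{k+1} = T(x_k)

    x_star = np.linalg.solve(np.eye(2) - A, b)     # the unique fixed point (I - A)^{-1} b
    print(np.allclose(x, x_star))                  # True: the iteration converges to x*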
The next theorem applies to a mapping that is nonexpansive with re-
spect to the Euclidean norm. It shows that a fixed point of such a mapping
can be found by an interpolated iteration, provided at least one fixed point
exists. The idea underlying the theorem is quite intuitive: if x* is a fixed
point of T, the distance ‖T(xk) − x*‖ cannot be larger than the distance
‖xk − x*‖ (by nonexpansiveness of T):

    ‖T(xk) − x*‖ = ‖T(xk) − T(x*)‖ ≤ ‖xk − x*‖.

Hence, if xk ≠ T(xk), any point obtained by strict interpolation between xk
and T(xk) must be strictly closer to x* than xk (by Euclidean geometry).
Note, however, that for this argument to work, we need to know that T
has at least one fixed point. If T is a contraction, this is automatically
guaranteed, but if T is just nonexpansive, there may not exist a fixed point
[as an example, just let T(x) = 1 + x].

Proposition A.4.2: (Krasnosel'skii-Mann Theorem for Non-
expansive Iterations [Kra55], [Man53]) Consider a mapping T :
Rn → Rn that is nonexpansive with respect to the Euclidean norm
‖·‖, i.e.,

    ‖T(x) − T(y)‖ ≤ ‖x − y‖,   ∀ x, y ∈ Rn,

and has at least one fixed point. Then the iteration

    xk+1 = (1 − αk) xk + αk T(xk),     (A.1)

where αk ∈ [0, 1] for all k and

    Σ_{k=0}^∞ αk(1 − αk) = ∞,

converges to a fixed point of T, starting from any x0 ∈ Rn.

Proof: We will use the identity

    ‖αx + (1 − α)y‖² = α‖x‖² + (1 − α)‖y‖² − α(1 − α)‖x − y‖²,     (A.2)

which holds for all x, y ∈ Rn and α ∈ [0, 1], as can be verified by a
straightforward calculation. For any fixed point x* of T, we have

    ‖xk+1 − x*‖² = ‖(1 − αk)(xk − x*) + αk (T(xk) − T(x*))‖²
                 = (1 − αk)‖xk − x*‖² + αk‖T(xk) − T(x*)‖²
                   − αk(1 − αk)‖T(xk) − xk‖²                          (A.3)
                 ≤ ‖xk − x*‖² − αk(1 − αk)‖T(xk) − xk‖²,

where for the first equality we use iteration (A.1) and the fact x* = T(x*),
for the second equality we apply the identity (A.2), and for the inequality
we use the nonexpansiveness of T. By adding Eq. (A.3) for all k, we obtain

    Σ_{k=0}^∞ αk(1 − αk)‖T(xk) − xk‖² ≤ ‖x0 − x*‖².

In view of the hypothesis Σ_{k=0}^∞ αk(1 − αk) = ∞, it follows that

    lim_{k→∞, k∈K} ‖T(xk) − xk‖ = 0,     (A.4)

for some subsequence {xk}K. Since from Eq. (A.3), {xk}K is bounded, it
has at least one limit point, call it x̄, so {xk}K̄ → x̄ for an infinite index
set K̄ ⊂ K. Since T is nonexpansive it is continuous, so {T(xk)}K̄ → T(x̄),
and in view of Eq. (A.4), it follows that x̄ is a fixed point of T. Letting
x* = x̄ in Eq. (A.3), we see that {‖xk − x̄‖} is nonincreasing and hence
converges, necessarily to 0, so the entire sequence {xk} converges to the
fixed point x̄. Q.E.D.
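
As an illustrative sketch of iteration (A.1) (assuming NumPy; the mapping,
starting point, and stepsize are arbitrary choices), one can take T to be the
Euclidean projection onto the unit ball, which is nonexpansive and has the
ball itself as its set of fixed points:

    import numpy as np

    # T = Euclidean projection onto the unit ball: nonexpansive, fixed points = the ball.
    def T(x):
        norm = np.linalg.norm(x)
        return x if norm <= 1.0 else x / norm

    x = np.array([3.0, -4.0])
    alpha = 0.5                          # constant stepsize; sum of alpha(1 - alpha) diverges
    for _ in range(200):
        x = (1 - alpha) * x + alpha * T(x)    # iteration (A.1)

    print(x, np.linalg.norm(x))          # a fixed point of T (a point with norm <= 1)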

N onstationary Iterations

For nonstationary iterations of the form Xk+l = Tk(xk), where the function
Tk depends on k, the ideas of the preceding propositions may apply but
with modifications. The following proposition is often useful in this respect.

Proposition A.4.3: Let {ak} be a nonnegative sequence satisfying

    ak+1 ≤ (1 − γk) ak + βk,   ∀ k = 0, 1, ...,

where βk ≥ 0 and γk ∈ (0, 1] for all k,

    Σ_{k=0}^∞ γk = ∞,   and   βk/γk → 0.

Then ak → 0.

Proof: We first show that given any ε > 0, we have ak < ε for infinitely
many k. Indeed, if this were not so, by letting k̄ be such that ak ≥ ε and
βk/γk ≤ ε/2 for all k ≥ k̄, we would have for all k ≥ k̄

    ak+1 ≤ (1 − γk) ak + βk ≤ ak − γk ε + γk ε/2 = ak − γk ε/2.

Therefore, for all m ≥ k̄,

    am+1 ≤ ak̄ − (ε/2) Σ_{k=k̄}^m γk.

Since {ak} is nonnegative and Σ_{k=0}^∞ γk = ∞, we obtain a contradiction.
Thus, given any ε > 0, there exists k̄ such that βk/γk < ε for all k ≥ k̄
and ak̄ < ε. We then have

    ak̄+1 ≤ (1 − γk̄) ak̄ + βk̄ < (1 − γk̄) ε + γk̄ ε = ε.

By repeating this argument, we obtain ak < ε for all k ≥ k̄. Since ε can be
arbitrarily small, it follows that ak → 0. Q.E.D.

As an example, consider a sequence of "approximate" contraction
mappings Tk : Rn → Rn, satisfying

    ‖Tk(x) − Tk(y)‖ ≤ (1 − γk)‖x − y‖ + βk,   ∀ x, y ∈ Rn, k = 0, 1, ...,

where γk ∈ (0, 1] and βk ≥ 0 for all k,

    Σ_{k=0}^∞ γk = ∞,   and   βk/γk → 0.

Assume also that all the mappings Tk have a common fixed point x*. Then

    ‖xk+1 − x*‖ = ‖Tk(xk) − Tk(x*)‖ ≤ (1 − γk)‖xk − x*‖ + βk,

and from Prop. A.4.3, it follows that the sequence {xk} generated by the
iteration xk+1 = Tk(xk) converges to x* starting from any x0 ∈ Rn.
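
A minimal numerical sketch of Prop. A.4.3 (plain Python; the particular
sequences γk and βk below are arbitrary choices satisfying the hypotheses):

    # a_{k+1} <= (1 - g_k) a_k + b_k with sum g_k = infinity and b_k / g_k -> 0
    # forces a_k -> 0; the specific g_k, b_k are illustrative only.
    a = 10.0
    for k in range(1, 100001):
        g = 1.0 / k              # gamma_k: diverging sum
        b = 1.0 / k ** 1.5       # beta_k:  beta_k / gamma_k = 1/sqrt(k) -> 0
        a = (1 - g) * a + b
    print(a)                      # small, and tends to 0 as the number of iterations grows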

Supermartingale Convergence

We now give two theorems relating to supermartingale convergence analysis


(the term refers to a collection of convergence theorems for sequences of
nonnegative scalars or random variables, which satisfy certain inequalities
implying that the sequences are "almost" nonincreasing). The first theo-
rem relates to deterministic sequences, while the second theorem relates to
sequences of random variables. We prove the first theorem, and we refer to
the literature on stochastic processes and iterative methods for the proof
of the second.

Proposition A.4.4: Let {Yk}, {Zk}, {Wk}, and {Vk} be four scalar
sequences such that

    Yk+1 ≤ (1 + Vk) Yk − Zk + Wk,   k = 0, 1, ...,     (A.5)

{Zk}, {Wk}, and {Vk} are nonnegative, and

    Σ_{k=0}^∞ Wk < ∞,   Σ_{k=0}^∞ Vk < ∞.

Then either Yk → −∞, or else {Yk} converges to a finite value and
Σ_{k=0}^∞ Zk < ∞.

Proof: We first give the proof assuming that Vk = 0, and then generalize.
In this case, using the nonnegativity of {Zk}, we have

    Yk+1 ≤ Yk + Wk.

By writing this relation for the index set k̄, ..., k, where k ≥ k̄, and
adding, we have

    Yk+1 ≤ Yk̄ + Σ_{ℓ=k̄}^k Wℓ ≤ Yk̄ + Σ_{ℓ=k̄}^∞ Wℓ.

Since Σ_{k=0}^∞ Wk < ∞, it follows that {Yk} is bounded above, and by taking
the upper limit of the left-hand side as k → ∞ and the lower limit of the
right-hand side as k̄ → ∞, we have

    lim sup_{k→∞} Yk ≤ lim inf_{k̄→∞} Yk̄ < ∞.

This implies that either Yk → −∞, or else {Yk} converges to a finite value.
In the latter case, by writing Eq. (A.5) for the index set 0, ..., k, and
adding, we have

    Σ_{ℓ=0}^k Zℓ ≤ Y0 + Σ_{ℓ=0}^k Wℓ − Yk+1,   ∀ k = 0, 1, ...,

so by taking the limit as k → ∞, we obtain Σ_{ℓ=0}^∞ Zℓ < ∞.

We now extend the proof to the case of a general nonnegative sequence
{Vk}. We first note that

    log Π_{ℓ=0}^k (1 + Vℓ) = Σ_{ℓ=0}^k log(1 + Vℓ) ≤ Σ_{k=0}^∞ Vk,

since we generally have 1 + a ≤ e^a and log(1 + a) ≤ a for any a ≥ 0. Thus
the assumption Σ_{k=0}^∞ Vk < ∞ implies that

    Π_{ℓ=0}^∞ (1 + Vℓ) < ∞.     (A.6)

Define

    Ȳk = Yk Π_{ℓ=0}^{k−1} (1 + Vℓ)^{-1},   Z̄k = Zk Π_{ℓ=0}^{k} (1 + Vℓ)^{-1},   W̄k = Wk Π_{ℓ=0}^{k} (1 + Vℓ)^{-1}.

Multiplying Eq. (A.5) with Π_{ℓ=0}^k (1 + Vℓ)^{-1}, we obtain

    Ȳk+1 ≤ Ȳk − Z̄k + W̄k.

Since W̄k ≤ Wk, the hypothesis Σ_{k=0}^∞ Wk < ∞ implies Σ_{k=0}^∞ W̄k < ∞,
so from the special case of the result already shown, we have that either
Ȳk → −∞ or else {Ȳk} converges to a finite value and Σ_{k=0}^∞ Z̄k < ∞.
Since

    Yk = Ȳk Π_{ℓ=0}^{k−1} (1 + Vℓ),   Zk = Z̄k Π_{ℓ=0}^{k} (1 + Vℓ),

and Π_{ℓ=0}^∞ (1 + Vℓ) converges to a finite value by the nonnegativity of {Vk}
and Eq. (A.6), it follows that either Yk → −∞ or else {Yk} converges to a
finite value and Σ_{k=0}^∞ Zk < ∞. Q.E.D.

The next theorem has a long history. The particular version we give
here is due to Robbins and Siegmund [RoS71]. Their proof assumes the
special case of the theorem where Vk = 0 (see Neveu [Nev75], p. 33, for a
proof of this special case), and then uses the line of proof of the preceding
proposition. Note, however, that contrary to the preceding proposition,
the following theorem requires nonnegativity of the sequence {Yk}.

Proposition A.4.5: (Supermartingale Convergence Theorem)
Let {Yk}, {Zk}, {Wk}, and {Vk} be four nonnegative sequences of
random variables, and let Fk, k = 0, 1, ..., be sets of random variables
such that Fk ⊂ Fk+1 for all k. Assume that:
(1) For each k, Yk, Zk, Wk, and Vk are functions of the random
variables in Fk.
(2) We have

    E{Yk+1 | Fk} ≤ (1 + Vk) Yk − Zk + Wk,   k = 0, 1, ....

(3) There holds, with probability 1,

    Σ_{k=0}^∞ Vk < ∞,   Σ_{k=0}^∞ Wk < ∞.

Then {Yk} converges to a nonnegative random variable Y, and we have
Σ_{k=0}^∞ Zk < ∞, with probability 1.

Fejer Monotonicity

The supermartingale convergence theorems can be applied in a variety


of contexts. One such context, the so called Fejer monotonicity theory,
deals with iterations that "almost" decrease the distance to every element
of some given set X*. We may then often show that such iterations are
convergent to a (unique) element of X * . Applications of this idea arise
when X * is the set of optimal solutions of an optimization problem or the
set of fixed points of a certain mapping. Examples are various gradient and
subgradient projection methods with a diminishing stepsize that arise in
various contexts in this book, as well as the Krasnosel'skii-Mann Theorem
[Prop. A.4.2; see Eq. (A.3)].
The following theorem is appropriate for our purposes. There are sev-
eral related but somewhat different theorems in the literature, and for com-
plementary discussions, we refer to (BaB96], [ComOl], [BaCll], [CoV13].

Proposition A.4.6: (Fejer Convergence Theorem) Let X* be
a nonempty subset of Rn, and let {xk} ⊂ Rn be a sequence satisfying
for some p > 0 and for all k,

    ‖xk+1 − x*‖^p ≤ (1 + βk)‖xk − x*‖^p − γk φ(xk; x*) + δk,   ∀ x* ∈ X*,

where {βk}, {γk}, and {δk} are nonnegative sequences satisfying

    Σ_{k=0}^∞ βk < ∞,   Σ_{k=0}^∞ γk = ∞,   Σ_{k=0}^∞ δk < ∞,

φ : Rn × X* → [0, ∞) is some nonnegative function, and ‖·‖ is some
norm. Then:
(a) The minimum distance sequence inf_{x*∈X*} ‖xk − x*‖ converges,
and in particular, {xk} is bounded.
(b) If {xk} has a limit point x̄ that belongs to X*, then the entire
sequence {xk} converges to x̄.
(c) Suppose that for some x* ∈ X*, φ(·; x*) is lower semicontinuous
and satisfies

    φ(x; x*) = 0   if and only if   x ∈ X*.     (A.7)

Then {xk} converges to a point in X*.

Proof: (a) Let {εk} be a positive sequence such that Σ_{k=0}^∞ (1 + βk)εk < ∞,
and let x*_k be a point of X* such that

    ‖xk − x*_k‖^p ≤ inf_{x*∈X*} ‖xk − x*‖^p + εk.

Then since φ is nonnegative, we have for all k,

    ‖xk+1 − x*_k‖^p ≤ (1 + βk)‖xk − x*_k‖^p + δk,

and by combining the last two relations, we obtain

    inf_{x*∈X*} ‖xk+1 − x*‖^p ≤ (1 + βk) inf_{x*∈X*} ‖xk − x*‖^p + (1 + βk)εk + δk.

The result follows by applying Prop. A.4.4 with

    Yk = inf_{x*∈X*} ‖xk − x*‖^p,   Vk = βk,   Zk = 0,   Wk = (1 + βk)εk + δk.

(b) Following the argument of the proof of Prop. A.4.4, define for all k,

    Ȳk = ‖xk − x̄‖^p Π_{ℓ=0}^{k−1} (1 + βℓ)^{-1},   δ̄k = δk Π_{ℓ=0}^{k} (1 + βℓ)^{-1}.

Then from our hypotheses, we have Σ_{k=0}^∞ δ̄k < ∞ and

    Ȳk+1 ≤ Ȳk + δ̄k,   ∀ k = 0, 1, ...,     (A.8)

while {Ȳk} has a limit point at 0, since x̄ is a limit point of {xk}. For any
ε > 0, let k̄ be such that

    Ȳk̄ ≤ ε,   Σ_{ℓ=k̄}^∞ δ̄ℓ ≤ ε,

so that by adding Eq. (A.8), we obtain for all k > k̄,

    Ȳk ≤ Ȳk̄ + Σ_{ℓ=k̄}^∞ δ̄ℓ ≤ 2ε.

Since ε is arbitrarily small, it follows that Ȳk → 0. We now note that, as
in Eq. (A.6),

    Π_{ℓ=0}^∞ (1 + βℓ) < ∞,

so that Ȳk → 0 implies that ‖xk − x̄‖^p → 0, and hence xk → x̄.
(c) From Prop. A.4.4, it follows that

    Σ_{k=0}^∞ γk φ(xk; x*) < ∞.

Thus lim_{k→∞, k∈K} φ(xk; x*) = 0 for some subsequence {xk}K. By part (a),
{xk} is bounded, so the subsequence {xk}K has a limit point x̄, and by the
lower semicontinuity of φ(·; x*), we must have

    φ(x̄; x*) ≤ lim_{k→∞, k∈K} φ(xk; x*) = 0,

which in view of the nonnegativity of φ, implies that φ(x̄; x*) = 0. Using
the hypothesis (A.7), it follows that x̄ ∈ X*, so by part (b), the entire
sequence {xk} converges to x̄. Q.E.D.
APPENDIX B
Convex Optimization Theory:
A Summary
In this appendix, we provide a summary of theoretical concepts and results
relating to convex analysis, convex optimization, and duality theory. In
particular, we list the relevant definitions and propositions (without proofs)
of the author's book "Convex Optimization Theory," Athena Scientific,
2009. For ease of use, the chapter, section, definition, and proposition
numbers of the latter book are identical to the ones of this appendix.

CHAPTER 1: Basic Concepts of Convex Analysis


Section 1.1. Convex Sets and Functions

Definition 1.1.1: A subset C of ~n is called convex if

o:x + (1 - o:)y E C, V x,y EC, Vo: E [0,1] .

Proposition 1.1.1:
(a) The intersection ∩_{i∈I} Ci of any collection {Ci | i ∈ I} of convex
sets is convex.
(b) The vector sum C1 + C2 of two convex sets C1 and C2 is convex.
(c) The set λC is convex for any convex set C and scalar λ. Fur-
thermore, if C is a convex set and λ1, λ2 are positive scalars,

    (λ1 + λ2)C = λ1C + λ2C.

(d) The closure and the interior of a convex set are convex.
(e) The image and t he inverse image of a convex set under an affine
function are convex.

A hyperplane is a set of the form {x I a'x = b}, where a is a nonzero


vector and b is a scalar. A halfspace is a set specified by a single linear
inequality, i.e., a set of the form { x I a'x ~ b}, where a is a nonzero vector
and bis a scalar. A set is said to be polyhedral if it is nonempty and it has
the form {x I ajx ~ bj, j = l, ... , r} , where a1, ... , ar and b1, . .. , br are
some vectors in Rn and scalars, respectively. A set C is said to be a cone
if for all x E C and >. > 0, we have >.x EC.

Definition 1.1.2: Let C be a convex subset of Rn. We say that a
function f : C → R is convex if

    f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y),   ∀ x, y ∈ C, ∀ α ∈ [0, 1].

A convex function f : C → R is called strictly convex if

    f(αx + (1 − α)y) < α f(x) + (1 − α) f(y)

for all x, y ∈ C with x ≠ y, and all α ∈ (0, 1). A function f : C → R, where
C is a convex set, is called concave if the function (−f) is convex.
The epigraph of a function f : X → [−∞, ∞], where X ⊂ Rn, is
defined to be the subset of Rn+1 given by

    epi(f) = {(x, w) | x ∈ X, w ∈ R, f(x) ≤ w}.

The effective domain of f is defined to be the set

    dom(f) = {x ∈ X | f(x) < ∞}.

We say that f is proper if f(x) < ∞ for at least one x ∈ X and f(x) > −∞
for all x ∈ X, and we say that f is improper if it is not proper. Thus f is
proper if and only if epi(f) is nonempty and does not contain a vertical
line.

Definition 1.1.3: Let C be a convex subset of Rn. We say that an


extended real-valued function f : Ci--+ [-oo, oo] is convex if epi(f) is
a convex subset of Rn+1.

Definition 1.1.4: Let C and X be subsets of Rn such that C is


nonempty and convex, and C C X. We say that an extended real-
valued function f : X H [-oo, oo] is convex over C if f becomes
convex when the domain of f is restricted to C, i.e., if the function
J: CH [-00,00], defined by J(x) = f(x) for all x EC, is convex.

We say that a function f : X H [-oo, oo] is closed if epi(f) is a


closed set. We say that f is lower semicontinuous at a vector x E X if
f(x) ::; liminfk-;cx, f(xk) for every sequence {xk} C X with Xk -r x. We
say that f is lower semicontinuous if it is lower semicontinuous at each
point x in its domain X. We say that f is upper semicontinuous if - f is
lower semicontinuous.

Proposition 1.1.2: For a function f : Rn → [−∞, ∞], the following
are equivalent:
(i) The level set Vγ = {x | f(x) ≤ γ} is closed for every scalar γ.
(ii) f is lower semicontinuous.
(iii) epi(f) is closed.

Proposition 1.1.3: Let f : X H [-oo, oo] be a function. If dom(f)


is closed and f is lower semi continuous at each x E dom(f), then f is
closed.

Proposition 1.1.4: Let f : Rm H (-oo, oo] be a given function, let


A be an m x n matrix, and let F : Rn H ( -oo, oo] be the. function

F(x) = f(Ax),
If f is convex, then F is also convex, while if f is closed, then F is
also closed.

Proposition 1.1.5: Let fi : Rn → (−∞, ∞], i = 1, ..., m, be given
functions, let γ1, ..., γm be positive scalars, and let F : Rn → (−∞, ∞]
be the function

    F(x) = γ1 f1(x) + ··· + γm fm(x),   x ∈ Rn.

If f1, ..., fm are convex, then F is also convex, while if f1, ..., fm are
closed, then F is also closed.

Proprnsition 1.1.6: Let Ji : Rn H (-oo, oo] be given functions for


i EI, where I is an arbitrary index s'et, and let f: Rn H (-oo, oo] be
the function given by
f(x) = sup Ji(x).
iEl

If Ii, i E I, are convex, then f is also convex, while if Ji, i E I, are


closed, then f is also closed.

Proposition 1.1.7: Let C be a nonempty convex subset of Rn and
let f : Rn → R be differentiable over an open set that contains C.
(a) f is convex over C if and only if

    f(z) ≥ f(x) + ∇f(x)′(z − x),   ∀ x, z ∈ C.

(b) f is strictly convex over C if and only if the above inequality is
strict whenever x ≠ z.
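
As an illustrative numerical check of part (a) (assuming NumPy; the convex
test function and the sampled points are arbitrary choices):

    import numpy as np

    # The gradient inequality f(z) >= f(x) + grad_f(x)'(z - x) for the convex
    # function f(x) = ||x||^2 + exp(x_1), checked at randomly sampled pairs.
    f = lambda x: x @ x + np.exp(x[0])
    def grad_f(x):
        g = 2.0 * x
        g[0] += np.exp(x[0])
        return g

    rng = np.random.default_rng(0)
    for _ in range(1000):
        x, z = rng.normal(size=3), rng.normal(size=3)
        assert f(z) >= f(x) + grad_f(x) @ (z - x) - 1e-12
    print("gradient inequality holds at all sampled pairs")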

Proposition 1.1.8: Let C be a nonempty convex subset of Rn and


let f : Rn H R be convex and differentiable over an open set that
contains C. Then a vector x* E C minimizes f over C if and only if

v'f(x*)'(z - x*) ~ 0, V z EC.

When f is not convex but is differentiable over an open set that


contains C, the condition of the above proposition is necessary but not
sufficient for optimality of x* (see e.g., [Ber99], Section 2.1).

Proposition 1.1.9: (Projection Theorem) Let C be a nonempty
closed convex subset of Rn, and let z be a vector in Rn. There exists a
unique vector that minimizes ‖z − x‖ over x ∈ C, called the projection
of z on C. Furthermore, a vector x* is the projection of z on C if and
only if

    (z − x*)′(x − x*) ≤ 0,   ∀ x ∈ C.
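
For the special case where C is a box, the projection is the componentwise
clip of z onto the bounds; the following sketch (assuming NumPy; the data
are arbitrary choices) verifies the above characterization at sampled points
of C:

    import numpy as np

    # Projection onto the box C = {x : l <= x <= u} is the componentwise clip of z.
    l, u = np.array([0.0, -1.0, 2.0]), np.array([1.0, 1.0, 3.0])
    z = np.array([2.5, 0.3, -4.0])
    x_star = np.clip(z, l, u)                        # the (unique) projection of z on C

    # Variational inequality of Prop. 1.1.9: (z - x*)'(x - x*) <= 0 for all x in C.
    rng = np.random.default_rng(1)
    samples = rng.uniform(l, u, size=(1000, 3))      # random points of C
    print(np.max((samples - x_star) @ (z - x_star)))  # <= 0 up to roundoff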

Proposition 1.1.10: Let C be a nonempty convex subset of ~n and


let f : ~n H ~ be twice continuously differentiable over an open set
that contains C.
(a) If v' 2 f(x) is positive semidefinite for all x EC, then f is convex
over C.
(b) If v' 2 f(x) is positive definite for all x E C, then f is strictly
convex over C.
(c) If C is open and f is convex over C, then v' 2 f (x) is positive
semidefinite for all x E C.

Strong Convexity

If f : Rn → R is a function that is continuous over a closed convex set
C ⊂ dom(f), and a is a positive scalar, we say that f is strongly convex
over C with coefficient a if for all x, y ∈ C and all α ∈ [0, 1], we have

    f(αx + (1 − α)y) + (a/2) α(1 − α)‖x − y‖² ≤ α f(x) + (1 − α) f(y).

Then f is strictly convex over C. Furthermore, there exists a unique x* ∈ C
that minimizes f over C, and by applying the definition with y = x* and
letting α ↓ 0, it can be seen that

    f(x) ≥ f(x*) + (a/2)‖x − x*‖²,   ∀ x ∈ C.

If int(C), the interior of C, is nonempty, and f is continuously differentiable
over int(C), the following are equivalent:
(i) f is strongly convex with coefficient a over C.
(ii) (∇f(x) − ∇f(y))′(x − y) ≥ a‖x − y‖²,   ∀ x, y ∈ int(C).
(iii) f(y) ≥ f(x) + ∇f(x)′(y − x) + (a/2)‖x − y‖²,   ∀ x, y ∈ int(C).
Furthermore, if f is twice continuously differentiable over int(C), the above
three properties are equivalent to:
(iv) The matrix ∇²f(x) − aI is positive semidefinite for every x ∈ int(C),
where I is the identity matrix.
A proof may be found in the on-line exercises of Chapter 1 of [Ber09].
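
As an illustrative numerical check (assuming NumPy; the matrix Q and the
sampled points are arbitrary choices), a positive definite quadratic
f(x) = ½ x′Qx is strongly convex with coefficient a equal to the smallest
eigenvalue of Q, and property (iii) can be tested directly:

    import numpy as np

    Q = np.array([[4.0, 1.0],
                  [1.0, 3.0]])
    a = np.linalg.eigvalsh(Q).min()          # strong convexity coefficient for this f
    f = lambda x: 0.5 * x @ Q @ x
    grad_f = lambda x: Q @ x

    rng = np.random.default_rng(2)
    for _ in range(1000):
        x, y = rng.normal(size=2), rng.normal(size=2)
        lhs = f(y)
        rhs = f(x) + grad_f(x) @ (y - x) + 0.5 * a * np.linalg.norm(x - y) ** 2
        assert lhs >= rhs - 1e-10
    print("strong convexity inequality (iii) verified at sampled points")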

Section 1.2. Convex and Affine Hulls

The convex hull of a set X, denoted conv(X), is the intersection of all


convex sets containing X. A convex combination of elements of X is a
vector of the form Σ_{i=1}^m αi xi, where m is a positive integer, x1, ..., xm
belong to X, and α1, ..., αm are scalars such that

    Σ_{i=1}^m αi = 1,   αi ≥ 0,   i = 1, ..., m.
The convex hull conv(X) is equal to the set of all convex combinations of
elements of X. Also, for any set S and linear transformation A, we have
conv(AS) = A conv(S). From this it follows that for any sets S1, ... , Sm,
we have conv(S1 +···+Sm)= conv(S1) + · · · + conv(Sm)-
lf X is a subset of Rn, the affine hull of X, denoted aff( X), is the
intersection of all affine sets containing X. Note that aff(X) is itself an
affine set and that it contains conv(X). The dimension of aff(X) is defined
to be the dimension of the subspace parallel to aff(X). It can be shown that
aff(X) = aff(conv(X)) = aff(cl(X)). For a convex set C, the dimension of
C is defined to be the dimension of aff( C).
Given a nonempty subset X of Rn, a nonnegative combination of
elements of X is a vector of the form I:Z:
1 O:iXi, where m is a positive
integer, x1, ... , Xm belong to X, and 0:1, ... , O:m are nonnegative scalars. If
the scalars O:i are all positive, I:Z:
1 O:iXi is said to be a positive combination.
The cone generated by X, denoted cone(X), is the set of all nonnegative
combinations of elements of X.

Proposition 1.2.1: (Caratheodory's Theorem) Let X be a non-


empty subset of Rn.
(a) Every nonzero vector from cone(X) can be represented as a pos-
itive combination of linearly independent vectors from X.
(b) Every vector from conv(X) can be represented as a convex com-
bination of no more than n + l vectors from X.

Proposition 1.2.2: The convex hull of a compact set is compact.



Section 1.3. Relative Interior and Closure

Let C be a nonempty convex set. We say that x is a relative interior point


of C if x E C and there exists an open sphere S centered at x such that

Sn aff(C) c C,

i.e., x is an interior point of C relative to the affine hull of C. The set


of relative interior points of C is called the relative interior of C, and is
denoted by ri(C). The set C is said to be relatively open if ri(C) = C. The
vectors in cl( C) that are not relative interior points are said to be relative
boundary points of C, and their collection is called the relative boundary of
C.

Proposition 1.3.1: (Line Segment Principle) Let C be a nonempty
convex set. If x ∈ ri(C) and x̄ ∈ cl(C), then all points on the line seg-
ment connecting x and x̄, except possibly x̄, belong to ri(C).

Proposition 1.3.2: (Nonemptiness of Relative Interior) Let C


be a nonempty convex set. Then:
(a) ri( C) is a nonempty convex set, and has the same affine hull as
C.
(b) If m is the dimension of aff( C) and m > 0, there exist vectors
xo, x1, ... , Xm E ri(C) such that X1 - xo, ... , Xm - Xo span the
subspace parallel to aff( C).

Proposition 1.3.3: (Prolongation Lemma) Let C be a nonempty
convex set. A vector x is a relative interior point of C if and only if
every line segment in C having x as one endpoint can be prolonged
beyond x without leaving C [i.e., for every x̄ ∈ C, there exists a γ > 0
such that x + γ(x − x̄) ∈ C].

Proposition 1.3.4: Let X be a nonempty convex subset of Rn, let
f : X → R be a concave function, and let X* be the set of vectors
where f attains a minimum over X, i.e.,

    X* = { x* ∈ X | f(x*) = inf_{x∈X} f(x) }.

If X* contains a relative interior point of X, then f must be constant
over X, i.e., X* = X.

Proposition 1.3.5: Let C be a nonempty convex set.
(a) We have cl(C) = cl(ri(C)).
(b) We have ri(C) = ri(cl(C)).
(c) Let C̄ be another nonempty convex set. Then the following three
conditions are equivalent:
(i) C and C̄ have the same relative interior.
(ii) C and C̄ have the same closure.
(iii) ri(C) ⊂ C̄ ⊂ cl(C).

Proposition 1.3.6: Let C be a nonempty convex subset of ~n and


let A be an m x n matrix.
(a) We have A· ri(C) = ri(A · C).
(b) We have A· cl(C) C cl(A · C). Furthermore, if C is bounded,
then A· cl(C) = cl(A · C).

Proposition 1.3.7: Let C1 and C2 be nonempty convex sets. We
have

    ri(C1 + C2) = ri(C1) + ri(C2),   cl(C1) + cl(C2) ⊂ cl(C1 + C2).

Furthermore, if at least one of the sets C1 and C2 is bounded, then

    cl(C1 + C2) = cl(C1) + cl(C2).

Proposition 1.3.8: Let C1 and C2 be nonempty convex sets. We
have

    ri(C1) ∩ ri(C2) ⊂ ri(C1 ∩ C2),   cl(C1 ∩ C2) ⊂ cl(C1) ∩ cl(C2).

Furthermore, if the sets ri(C1) and ri(C2) have a nonempty intersec-
tion, then

    ri(C1 ∩ C2) = ri(C1) ∩ ri(C2),   cl(C1 ∩ C2) = cl(C1) ∩ cl(C2).

Proposition 1.3.9: Let C be a nonempty convex subset of ~m, and


let A be an m x n matrix. If A- 1 · ri(C) is nonempty, then

ri(A- 1 . C) = A- 1 . ri(C), cl(A- 1 . C) = A- 1 . cl(C),


where A - 1 denotes inverse image of the corresponding set under A.

Proposition 1.3.10: Let C be a convex subset of ~+m. For x E ~n,


denote
Cx = {y I (x,y) EC},
and let
D = {X I Cx -/- 0} ·
Then
ri(C) = {(x,y) Ix E ri(D), y E ri(Cx)}.

Continuity of Convex Functions

Proposition 1.3.11: If f : ~n H ~ is convex, then it is continuous.


More generally, if f : ~n H ( -oo, oo] is a proper convex function,
then f, restricted to dom(f), is continuous over the relative interior of
dom(f).

Proposition 1.3.12: If C is a closed interval of the real line, and


f: Ct-+ R is closed and convex, then f is continuous over C.

Closures of Functions
The closure of the epigraph of a function f : X t-+ [-oo, oo] can be seen
to be a legitimate epigraph of another function. This function, called the
closure off and denoted elf: Rn t-+ [-00,00], is given by

(clf)(x) = inf{w I (x,w) E cl(epi(f))},

The closure of the convex hull of the epigraph of f is the epigraph of some
function, denoted cl f called the convex closure of f. It can be seen that
elf is the closure of the function F: Rn t-+ [-oo, oo] given by

F(x) = inf{w I (x,w) E conv(epi(f))}, (B.l)

It is easily shown that F is convex, but it need not be closed and its
domain may be strictly contained in <lorn( cl f) (it can be seen though that
the closures of the domains of F and cl f coincide).

Proposition 1.3.13: Let f: X t-+ [-oo, oo] be a function. Then

inf f(x)
xEX
= xEX
inf (clf)(x) = inf (clf)(x) = inf F(x) = inf (clf)(x),
xE!Rn xE!Rn xE!Rn

where Fis given by Eq. (B.1). Furthermore, any vector that attains
the infimum off over X also attains the infimum of elf, F, and elf.

Proposition 1.3.14: Let f : Rn t-+ [-oo, oo] be a function.


(a) cl f is the greatest closed function majorized by f, i.e., if g :
Rn t-+ [-oo, oo] is closed and satisfies g(x) $ f(x) for all x E ~,
then g(x) $ ( cl f) (x) for all x E Rn.
(b) cl f is the greatest closed and convex function majorized by f,
i.e., if g : Rn t-+ [-oo, oo] is closed and convex, and satisfies
g(x) $ f(x) for all XE Rn, then g(x) $ (clf)(x) for all XE Rn.

Proposition 1.3.15: Let f: Rn i--+ [-00,00] be a convex function.


Then:
(a) We have

cl(dom(f)) = cl(dom(clf)), ri(dom(f)) = ri(dom(clf)),

(clf)(x) = f(x), \:/ x E ri(dom(f)).

Furthermore, cl f is proper if and only if f is proper.


(b) If x E ri(dom(f)), we have

(clf)(y) = limf(y
a.j..0
+ a(x -y)), \:/yERn.

Proposition 1.3.16: Let f : Rm i--+ [-oo, oo] be a convex function


and A be an m x n matrix such that the range of A contains a point
in ri(dom(f)). The function F defined by

F(x) = f(Ax),
is convex and

(cl F)(x) = (cl !)(Ax),

Proposition 1.3.17: Let fi : Rn → [−∞, ∞], i = 1, ..., m, be con-
vex functions such that

    ∩_{i=1}^m ri(dom(fi)) ≠ ∅.

The function F defined by

    F(x) = f1(x) + ··· + fm(x),   x ∈ Rn,

is convex and

    (cl F)(x) = (cl f1)(x) + ··· + (cl fm)(x),   ∀ x ∈ Rn.

Section 1.4. Recession Cones

Given a nonempty convex set C, we say that a vector d is a direction of


recession of C if x + ad E C for all x E C and a ::=: 0. The set of all
directions of recession is a cone containing the origin, called the recession
cone of C, and denoted by Re.

Proposition 1.4.1: (Recession Cone Theorem) Let C be a nonem-


pty closed convex set.
(a) The recession cone Re is closed and convex.
(b) A vector d belongs to Re if and only if there exists a vector
x EC such that x + ad EC for all a::=: 0.

Proposition 1.4.2: (Properties of Recession Cones) Let C be
a nonempty closed convex set.
(a) RC contains a nonzero direction if and only if C is unbounded.
(b) RC = Rri(C).
(c) For any collection of closed convex sets Ci, i ∈ I, where I is an
arbitrary index set and ∩_{i∈I} Ci ≠ ∅, we have

    R_{∩_{i∈I} Ci} = ∩_{i∈I} RCi.

(d) Let W be a compact and convex subset of Rm, and let A be an
m × n matrix. The recession cone of the set

    V = {x ∈ C | Ax ∈ W}

(assuming this set is nonempty) is RC ∩ N(A), where N(A) is
the nullspace of A.

Given a convex set C the lineality space of C, denoted by Le, is the


set of directions of recession d whose opposite, -d, are also directions of
recession:
Le= Ren (-Re).

Proposition 1.4.3: (Properties of Lineality Space) Let C be a
nonempty closed convex subset of Rn.
(a) LC is a subspace of Rn.
(b) LC = Lri(C).
(c) For any collection of closed convex sets Ci, i ∈ I, where I is an
arbitrary index set and ∩_{i∈I} Ci ≠ ∅, we have

    L_{∩_{i∈I} Ci} = ∩_{i∈I} LCi.

(d) Let W be a compact and convex subset of Rm, and let A be an
m × n matrix. The lineality space of the set

    V = {x ∈ C | Ax ∈ W}

(assuming it is nonempty) is LC ∩ N(A), where N(A) is the
nullspace of A.

Proposition 1.4.4: (Decomposition of a Convex Set) Let C be


a nonempty convex subset of Rn. Then, for every subspace S that is
contained in the lineality space Le, we have

C = s+ (CnS..L).

The notion of direction of recession of a convex function f can be


described in terms of its epigraph via the following proposition.

Proposition 1.4.5: Let f : Rn → (−∞, ∞] be a closed proper convex
function and consider the level sets

    Vγ = {x | f(x) ≤ γ},   γ ∈ R.

Then:
(a) All the nonempty level sets V7 have the same recession cone,
denoted Rf, and given by

Rf= {d I (d,O) E Repi(f)},

where Repi(f) is the recession cone of the epigraph off.


(b) If one nonempty level set V7 is compact, then all of these level
sets are compact.

For a closed proper convex function f : Rn → (−∞, ∞], the (com-
mon) recession cone Rf of the nonempty level sets is called the recession
cone of f. A vector d ∈ Rf is called a direction of recession of f. The
recession function of f, denoted rf, is the closed proper convex function
whose epigraph is Repi(f), the recession cone of the epigraph of f.
The lineality space of the recession cone Rf of a closed proper convex
function f is denoted by Lf, and is the subspace of all d ∈ Rn such that
both d and −d are directions of recession of f, i.e.,

    Lf = Rf ∩ (−Rf).

We have that d ∈ Lf if and only if

    f(x + αd) = f(x),   ∀ x ∈ dom(f), ∀ α ∈ R.

Consequently, any d ∈ Lf is called a direction in which f is constant, and
Lf is called the constancy space of f.

Proposition 1.4.6: Let f : Rn H (-oo, oo] be a closed proper convex


function. Then the recession cone and constancy space of f are given
in terms of its recession function by

Rt= {d I rt(d)::::; o}, Lt= {d I rt(d) = rJ(-d) = O}.

Proposition 1.4.7: Let f : Rn → (−∞, ∞] be a closed proper convex
function. Then, for all x ∈ dom(f) and d ∈ Rn,

    rf(d) = sup_{α>0} ( f(x + αd) − f(x) ) / α = lim_{α→∞} ( f(x + αd) − f(x) ) / α.

Proposition 1.4.8: (Recession Function of a Sum) Let Ji :


)Rn 1--t (-oo, oo], i = 1, ... , m, be closed proper convex functions such
that the function f = Ji + · · · + f m is proper. Then

r1(d) = r1i (d) + · · · + rrm (d), \:/dE)Rn.

Nonemptiness of Set Intersections

Let {Ck} be a sequence of nonempty closed sets in )Rn with Ck+l C Ck for
all k (such a sequence is said to be nested). We are concerned with the
question whether nk= 0 Ck is nonempty.

Definition 1.4.1: Let {Ck} be a nested sequence of nonempty closed
convex sets. We say that {xk} is an asymptotic sequence of {Ck} if
xk ≠ 0, xk ∈ Ck for all k, and

    ‖xk‖ → ∞,   xk/‖xk‖ → d/‖d‖,

where d is some nonzero common direction of recession of the sets Ck,

    d ≠ 0,   d ∈ ∩_{k=0}^∞ RCk.

A special case is when all the sets Ck are equal. In particular, for a
nonempty closed convex set C, and a sequence {xk} ⊂ C, we say that {xk}
is an asymptotic sequence of C if {xk} is asymptotic (as per the preceding
definition) for the sequence {Ck}, where Ck = C.
Given any unbounded sequence {xk} such that xk ∈ Ck for each k,
there exists a subsequence {xk}_{k∈K} that is asymptotic for the corresponding
subsequence {Ck}_{k∈K}. In fact, any limit point of {xk/‖xk‖} is a common
direction of recession of the sets Ck.

Definition 1.4.2: Let {Ck} be a nested sequence of nonempty closed
convex sets. We say that an asymptotic sequence {xk} is retractive if
for the direction d corresponding to {xk} as per Definition 1.4.1, there
exists an index k̄ such that

    xk − d ∈ Ck,   ∀ k ≥ k̄.

We say that the sequence {Ck} is retractive if all its asymptotic se-
quences are retractive. In the special case Ck = C, we say that the set
C is retractive if all its asymptotic sequences are retractive.

A closed halfspace is retractive. Intersections and Cartesian products,
involving a finite number of sets, preserve retractiveness. In particular, if
{C_k^1}, ..., {C_k^r} are retractive nested sequences of nonempty closed convex
sets, the sequences {Nk} and {Tk} are retractive, where

    Nk = C_k^1 ∩ C_k^2 ∩ ··· ∩ C_k^r,   Tk = C_k^1 × C_k^2 × ··· × C_k^r,   ∀ k,

and we assume that all the sets Nk are nonempty. A simple consequence
is that a polyhedral set is retractive, since it is the nonempty intersection
of a finite number of closed halfspaces.

Proposition 1.4.9: A polyhedral set is retractive.

The importance of retractive sequences is motivated by the following


proposition.

Proposition 1.4.10: A retractive nested sequence of nonempty closed


convex sets has nonempty intersection.

Proposition 1.4.11: Let {Ck} be a nested sequence of nonempty
closed convex sets. Denote

    R = ∩_{k=0}^∞ RCk,   L = ∩_{k=0}^∞ LCk.

(a) If R = L, then {Ck} is retractive, and ∩_{k=0}^∞ Ck is nonempty.
Furthermore,

    ∩_{k=0}^∞ Ck = L + C̃,

where C̃ is some nonempty and compact set.
(b) Let X be a retractive closed convex set. Assume that all the sets
C̄k = X ∩ Ck are nonempty, and that

    RX ∩ R ⊂ L.

Then, {C̄k} is retractive, and ∩_{k=0}^∞ C̄k is nonempty.

Proposition 1.4.12: (Existence of Solutions of Convex Quad-
ratic Programs) Let Q be a symmetric positive semidefinite n × n
matrix, let c and a1, ..., ar be vectors in Rn, and let b1, ..., br be
scalars. Assume that the optimal value of the problem

    minimize    x′Qx + c′x
    subject to  a_j′x ≤ bj,   j = 1, ..., r,

is finite. Then the problem has at least one optimal solution.
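
A minimal computational sketch for a special case of this proposition
(assuming NumPy; the data are arbitrary choices, and the polyhedron is
taken to be a box) solves the quadratic program by projected gradient
iterations:

    import numpy as np

    # Minimize x'Qx + c'x over the box l <= x <= u (a particular polyhedron).
    Q = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                 # symmetric positive semidefinite
    c = np.array([-1.0, 2.0])
    l, u = np.array([0.0, 0.0]), np.array([1.0, 1.0])

    x = np.zeros(2)
    step = 1.0 / (2 * np.linalg.norm(Q, 2))    # stepsize = 1 / Lipschitz constant of the gradient
    for _ in range(5000):
        grad = 2 * Q @ x + c                   # gradient of x'Qx + c'x
        x = np.clip(x - step * grad, l, u)     # gradient step followed by projection on the box
    print(x)                                    # an optimal solution of this box-constrained QP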

Closedness under Linear Transformation and Vector Sum

The conditions of Prop. 1.4.11 can be translated to conditions guaranteeing


the closedness of the image, A C, of a closed convex set C under a linear
transformation A.

Proposition 1.4.13: Let X and C be nonempty closed convex sets


in ~n, and let A be an m x n matrix with nullspace denoted by N(A).
If X is a retractive closed convex set and

Rx nRe nN(A) c Le,

then A(X n C) is a closed set.

A special case relates to vector sums.

Proposition 1.4.14: Let C1, ... , Cm be nonempty closed convex sub-


sets of ~n such that the equality d1 + · · · + dm = 0 for some vectors
di E Re; implies that di E Le; for all i = 1, ... , m. Then C1 +···+Cm,
is a closed set.

When specialized to just two sets, the above proposition implies that
if C1 and -C2 are closed convex sets, then C1 - C2 is closed if there is no

common nonzero direction of recession of C1 and C2, i.e.,

    RC1 ∩ RC2 = {0}.

This is true in particular if either C1 or C2 is bounded, in which case either
RC1 = {0} or RC2 = {0}, respectively. For an example of two unbounded
closed convex sets in the plane whose vector sum is not closed, one may take
(this particular pair is supplied here as an illustration)

    C1 = {(x1, x2) | x1 > 0, x2 ≥ 1/x1},   C2 = {(x1, x2) | x2 = 0},

whose vector sum is the (nonclosed) open halfplane {(x1, x2) | x2 > 0}.
Some other conditions asserting the closedness of vector sums can be
derived from Prop. 1.4.13. For example, we can show that the vector sum
of a finite number of polyhedral sets is closed, since it can be viewed as the
image of their Cartesian product (clearly a polyhedral set) under a linear
transformation. Another useful result is that if X is a polyhedral set, and
C is a closed convex set, then X + C is closed if every direction of recession
of X whose opposite is a direction of recession of C lies also in the lineality
of X whose opposite is a direction of recession of C lies also in the lineality
space of C. In particular, X + C is closed if X is polyhedral, and C is
closed.

Section 1.5. Hyperplanes

A hyperplane in ~n is a set of the form

{x I a'x = b},
where a is nonzero vector in ~n (called the normal of the hyperplane), and
b is a scalar. The sets

{x I a'x 2 b}, {x I a'x S: b},


are called the closed halfspaces associated with the hyperplane (also referred
to as the positive and negative halfspaces, respectively). The sets

{x I a'x > b}, {x I a'x < b},


are called the open halfspaces associated with the hyperplane.

Proposition 1.5.1: (Supporting Hyperplane Theorem) Let C
be a nonempty convex subset of Rn and let x̄ be a vector in Rn. If
x̄ is not an interior point of C, there exists a hyperplane that passes
through x̄ and contains C in one of its closed halfspaces, i.e., there
exists a vector a ≠ 0 such that

    a′x̄ ≤ a′x,   ∀ x ∈ C.



Proposition 1.5.2: (Separating Hyperplane Theorem) Let C1
and C2 be two nonempty convex subsets of Rn. If C1 and C2 are
disjoint, there exists a hyperplane that separates C1 and C2, i.e., there
exists a vector a ≠ 0 such that

    a′x1 ≤ a′x2,   ∀ x1 ∈ C1, ∀ x2 ∈ C2.
Proposition 1.5.3: (Strict Separation Theorem) Let C1 and
C2 be two disjoint nonempty convex sets. Then under any one of
the following five conditions, there exists a hyperplane that strictly
separates C1 and C2, i.e., a vector a ≠ 0 and a scalar b such that

    a′x1 < b < a′x2,   ∀ x1 ∈ C1, ∀ x2 ∈ C2.

(1) C2 − C1 is closed.
(2) C1 is closed and C2 is compact.
(3) C1 and C2 are polyhedral.
(4) C1 and C2 are closed, and

    RC1 ∩ RC2 = LC1 ∩ LC2,

where RCi and LCi denote the recession cone and the lineality
space of Ci, i = 1, 2.
(5) C1 is closed, C2 is polyhedral, and RC1 ∩ RC2 ⊂ LC1.


Proposition 1.5.4: The closure of the convex hull of a set C is the


intersection of the closed halfspaces that contain C. In particular,
a closed convex set is the intersection of the closed halfspaces that
contain it.

Let C1 and C2 be two subsets of ~n. We say that a hyperplane


properly separates C1 and C2 if it separates C1 and C2, and does not fully

contain both C1 and C2. If C is a subset of ~n and xis a vector in ~n, we


say that a hyperplane properly separates C and x if it properly separates
C and the singleton set {x}.

Proposition 1.5.5: (Proper Separation Theorem) Let C be a
nonempty convex subset of Rn and let x̄ be a vector in Rn. There
exists a hyperplane that properly separates C and x̄ if and only if
x̄ ∉ ri(C).

Proposition 1.5.6: (Proper Separation of Two Convex Sets)


Let C1 and C2 be two nonempty convex sub:,ets of ~n. There exists a
hyperplane that properly separates C1 and C2 if and only if

ri(C1) n ri(C2) = 0.

Proposition 1.5. 7: (Polyhedral Proper Separation Theorem)


Let C and P be two nonempty convex subsets of ~n such that P is
polyhedral. There exists a hyperplane that separates C and P, and
does not contain C if and only if

ri(C)nP=0.

Consider a hyperplane in ~n+ 1 with a normal of the form (µ, /3),


where µ E ~n and /3 E R We say that such a hyperplane is vertical if
/3 = 0, and nonvertical if /3 -=/- 0.

Proposition 1.5.8: (Nonvertical Hyperplane Theorem) Let C
be a nonempty convex subset of Rn+1 that contains no vertical lines.
Let the vectors in Rn+1 be denoted by (u, w), where u ∈ Rn and
w ∈ R. Then:
(a) C is contained in a closed halfspace corresponding to a nonverti-
cal hyperplane, i.e., there exist a vector µ ∈ Rn, a scalar β ≠ 0,
and a scalar γ such that

    µ′u + βw ≥ γ,   ∀ (u, w) ∈ C.

(b) If (ū, w̄) does not belong to cl(C), there exists a nonvertical hy-
perplane strictly separating (ū, w̄) and C.

Section 1.6. Conjugate Functions

Consider an extended real-valued function f : Rn → [−∞, ∞]. The conju-
gate function of f is the function f* : Rn → [−∞, ∞] defined by

    f*(y) = sup_{x∈Rn} { x′y − f(x) },   y ∈ Rn.     (B.2)
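
As an illustrative numerical sketch (assuming NumPy; the grid and the test
points are arbitrary choices), the supremum in Eq. (B.2) can be approximated
on a grid; for the scalar function f(x) = |x| the conjugate is the indicator
function of the interval [−1, 1]:

    import numpy as np

    # Approximate f*(y) = sup_x { x y - f(x) } for f(x) = |x| on a finite grid.
    xs = np.linspace(-50, 50, 200001)
    f = np.abs(xs)

    for y in [-2.0, -0.5, 0.0, 0.7, 2.0]:
        conj = np.max(xs * y - f)
        print(y, conj)   # roughly 0 for |y| <= 1, large (tending to +inf with the grid) otherwise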

Proposition 1.6.1: (Conjugacy Theorem) Let f : Rn → [−∞, ∞]
be a function, let f* be its conjugate, and consider the double conju-
gate f** = (f*)*. Then:
(a) We have

    f(x) ≥ f**(x),   ∀ x ∈ Rn.

(b) If f is convex, then properness of any one of the functions f, f*,
and f** implies properness of the other two.
(c) If f is closed proper convex, then

    f(x) = f**(x),   ∀ x ∈ Rn.

(d) The conjugates of f and its convex closure cl f are equal. Fur-
thermore, if cl f is proper, then

    (cl f)(x) = f**(x),   ∀ x ∈ Rn.

Positively Homogeneous Functions and Support Functions

Given a nonempty set X, consider the indicator function of X, defined by

    δX(x) = 0 if x ∈ X,   δX(x) = ∞ if x ∉ X.

The conjugate of δX is given by

    σX(y) = sup_{x∈X} y′x

and is called the support function of X.


Let C be a convex cone. The conjugate of its indicator function δC
is its support function,

    σC(y) = sup_{x∈C} y′x.

The support/conjugate function σC is the indicator function δC* of the
cone

    C* = {y | y′x ≤ 0, ∀ x ∈ C},

called the polar cone of C. By the Conjugacy Theorem [Prop. 1.6.1(c)], the
support function σC* is equal to cl δC. Thus the polar cone of C* is cl(C).
In particular, if C is closed, the polar of its polar is equal to the original.
This is a special case of the Polar Cone Theorem, given in Section 2.2.
A function f : Rn → [−∞, ∞] is called positively homogeneous if its
epigraph is a cone in Rn+1. Equivalently, f is positively homogeneous if
and only if

    f(γx) = γ f(x),   ∀ γ > 0, ∀ x ∈ Rn.

Positively homogeneous functions are closely connected with support
functions. Clearly, the support function σX of a set X is closed convex
and positively homogeneous. Moreover, if σ : Rn → (−∞, ∞] is a proper
convex positively homogeneous function, then we claim that the conjugate
of σ is the indicator function of the closed convex set

    X = {x | y′x ≤ σ(y), ∀ y ∈ Rn},

and that cl σ is the support function of X. For a proof, let δ be the
conjugate of σ:

    δ(x) = sup_{y∈Rn} {y′x − σ(y)}.

Since σ is positively homogeneous, we have for any γ > 0,

    γ δ(x) = sup_{y∈Rn} {γ y′x − γ σ(y)} = sup_{y∈Rn} {(γy)′x − σ(γy)}.

The right-hand sides of the preceding two relations are equal, so we obtain

    δ(x) = γ δ(x),   ∀ γ > 0,

which implies that δ takes only the values 0 and ∞ (since σ, and hence also
its conjugate δ, is proper). Thus, δ is the indicator function of a set, call it
X, and we have

    X = {x | δ(x) ≤ 0}
      = {x | sup_{y∈Rn} {y′x − σ(y)} ≤ 0}
      = {x | y′x ≤ σ(y), ∀ y ∈ Rn}.

Finally, since δ is the conjugate of σ, we see that cl σ is the conjugate of
δ; cf. the Conjugacy Theorem [Prop. 1.6.1(c)]. Since δ is the indicator
function of X, it follows that cl σ is the support function of X.
We now discuss a characterization of the support function of the 0-
level set of a closed proper convex function f : Rn → (−∞, ∞]. The
closure of the cone generated by epi(f) is the epigraph of a closed convex
positively homogeneous function, called the closed function generated by
f, and denoted by gen f. The epigraph of gen f is the intersection of all
the closed cones that contain epi(f). Moreover, if gen f is proper, then
epi(gen f) is the intersection of all the halfspaces that contain epi(f) and
contain 0 in their boundary.
Consider the conjugate f* of a closed proper convex function f :
Rn → (−∞, ∞]. We claim that if the level set {y | f*(y) ≤ 0} [or the
level set {x | f(x) ≤ 0}] is nonempty, its support function is gen f (or,
respectively, gen f*). Indeed, if the level set {y | f*(y) ≤ 0} is nonempty,
any y such that f*(y) ≤ 0, or equivalently y′x ≤ f(x) for all x, defines a
nonvertical hyperplane that separates the origin from epi(f), implying that
the epigraph of gen f does not contain a line, so gen f is proper. Since gen f
is also closed, convex, and positively homogeneous, by our earlier analysis
it follows that gen f is the support function of the set

    Y = {y | y′x ≤ (gen f)(x), ∀ x ∈ Rn}.

Since epi(gen f) is the intersection of all the halfspaces that contain epi(f)
and contain 0 in their boundary, the set Y can be written as

    Y = {y | y′x ≤ f(x), ∀ x ∈ Rn} = {y | sup_{x∈Rn} {y′x − f(x)} ≤ 0}.

We thus obtain that gen f is the support function of the set

    Y = {y | f*(y) ≤ 0},

assuming this set is nonempty.
Note that the method used to characterize the 0-level sets of f and
f* can be applied to any level set. In particular, a nonempty level set
Lγ = {x | f(x) ≤ γ} is the 0-level set of the function fγ defined by
fγ(x) = f(x) − γ, and its support function is the closed function generated
by f*γ, the conjugate of fγ, which is given by f*γ(y) = f*(y) + γ.

CHAPTER 2: Basic Concepts of Polyhedral Convexity

Section 2.1. Extreme Points

In this chapter, we discuss polyhedral sets, i.e., nonempty sets specified by
systems of a finite number of affine inequalities

    a_j′x ≤ bj,   j = 1, ..., r,

where a1, ..., ar are vectors in Rn, and b1, ..., br are scalars.
Given a nonempty convex set C, a vector x E C is said to be an
extreme point of C if it does not lie strictly between the endpoints of any
line segment contained in the set, i.e., if there do not exist vectors y E C
and z E C, with y -=fa x and z -=fa x, and a scalar a E (0, 1) such that
x = ay + (1 - a)z.

Proposition 2.1.1: Let C be a convex subset of Rn, and let H be a


hyperplane that contains C in one of its closed halfspaces. Then the
extreme points of C n H are precisely the extreme points of C that
belong to H.

Proposition 2.1.2: A nonempty closed convex subset of Rn has at
least one extreme point if and only if it does not contain a line, i.e.,
a set of the form {x + αd | α ∈ R}, where x and d are vectors in Rn
with d ≠ 0.

Proposition 2.1.3: Let C be a nonempty closed convex subset of Rn.


Assume that for some m x n matrix A of rank n and some b E Rm, we
have
Ax::=:: b, VxEC.
Then C has at least one extreme point.

Proposition 2.1.4: Let P be a polyhedral subset of Rn.
(a) If P has the form

    P = { x | a_j′x ≤ bj, j = 1, ..., r },

where aj ∈ Rn, bj ∈ R, j = 1, ..., r, then a vector v ∈ P is an
extreme point of P if and only if the set

    Av = { aj | a_j′v = bj, j ∈ {1, ..., r} }

contains n linearly independent vectors.
(b) If P has the form

    P = { x | Ax = b, x ≥ 0 },

where A is an m × n matrix and b is a vector in Rm, then a
vector v ∈ P is an extreme point of P if and only if the columns
of A corresponding to the nonzero coordinates of v are linearly
independent.
(c) If P has the form

    P = { x | Ax = b, c ≤ x ≤ d },

where A is an m × n matrix, b is a vector in Rm, and c, d are
vectors in Rn, then a vector v ∈ P is an extreme point of P if
and only if the columns of A corresponding to the coordinates
of v that lie strictly between the corresponding coordinates of c
and d are linearly independent.

Proposition 2.1.5: A polyhedral set in Rn of the form

{ x I a1x :S bj, j = 1, . .. , r}
has an extreme point if and only if the set {aj I j = 1, ... , r} contains
n linearly independent vectors.

Section 2.2. Polar Cones

We return to the notion of polar cone of nonempty set C, denoted by C*,


and given by C* = {y I y'x :S 0, V x E C }.

Proposition 2.2.1:
(a) For any nonempty set C, we have

C* = (cl(C))* = (conv(C))* = (cone(C))* .



(b) (Polar Cone Theorem) For any nonempty cone C, we have

(C*)* = cl(conv(C)).
In particular, if C is closed and convex, we have (C*)* = C.

Section 2.3. Polyhedral Sets and Functions

We recall that a polyhedral cone C ⊂ Rn is a polyhedral set of the form

    C = { x | a_j′x ≤ 0, j = 1, ..., r },

where a1, ..., ar are some vectors in Rn, and r is a positive integer. We
say that a cone C ⊂ Rn is finitely generated, if it is generated by a finite
set of vectors, i.e., if it has the form

    C = cone({a1, ..., ar}) = { x | x = Σ_{j=1}^r µj aj,  µj ≥ 0,  j = 1, ..., r },

where a1, ..., ar are some vectors in Rn, and r is a positive integer.

Proposition 2.3.1: {Farkas' Lemma) Let a1, ... , ar be vectors in


Rn. Then, {x I ajx :=:; 0,j = 1, ... ,r} and cone({a1, ... ,ar}) are
closed cones that are polar to each other.

Proposition 2.3.2: (Minkowski-Weyl Theorem) A cone is poly-


hedral if and only if it is finitely generated.

Proposition 2.3.3: (Minkowski-Weyl Representation) A set P


is polyhedral if and only if there is a nonempty finite set {v1, ... , Vm}
and a finitely generated cone C such that P = conv ( {v1, ... , Vm}) + C,
i.e.,

P = { x | x = Σ_{j=1}^m µj vj + y,  Σ_{j=1}^m µj = 1,  µj ≥ 0,  j = 1, ..., m,  y ∈ C }.

Proposition 2.3.4: (Alge braic Operations on Polyhedral Sets)


(a) The intersection of polyhedral sets is polyhedral, ifit is rionempty.
(b) The Cartesian product of polyhedral sets is polyhedral.
(c) The image of a polyhedral set under a linear transformation is a
polyhedral set.
(d) The vector sum of two polyhedral sets is polyhedral.
(e) The inverse image of a polyhedral set under a linear transforma-
tion is polyhedral.

We say that a function f : ~ H (-oo, oo] is polyhedral if its epigraph


is a polyhedral set in Rn+l. Note that a polyhedral function f is, by
definition, closed, convex, and also proper [since f cannot take the value
-oo, and epi(f) is closed, convex, and nonempty (based on our convention
that only nonempty sets can be polyhedral)] .

Proposition 2.3.5: Let f : Rn → (−∞, ∞] be a convex function.
Then f is polyhedral if and only if dom(f) is a polyhedral set and

    f(x) = max_{j=1,...,m} { a_j′x + bj },   ∀ x ∈ dom(f),

where aj are vectors in Rn, bj are scalars, and m is a positive integer.

Some common operations on polyhedral functions, such as sum and


linear composition preserve their polyhedral character as shown by the
following two propositions.

Proposition 2.3.6: The sum of two polyhedral functions f_1 and f_2,
such that dom(f_1) ∩ dom(f_2) ≠ Ø, is a polyhedral function.

Proposition 2.3.7: If A is a matrix and g is a polyhedral function
such that dom(g) contains a point in the range of A, the function f
given by f(x) = g(Ax) is polyhedral.

Section 2.4. Polyhedral Aspects of Optimization

Polyhedral convexity plays a very important role in optimization. The
following are two basic results related to linear programming, the
minimization of a linear function over a polyhedral set.

Proposition 2.4.1: Let C be a closed convex subset of ℝ^n that has
at least one extreme point. A concave function f : C ↦ ℝ that attains
a minimum over C attains the minimum at some extreme point of C.

Proposition 2.4.2: (Fundamental Theorem of Linear Programming)
Let P be a polyhedral set that has at least one extreme point. A linear
function that is bounded below over P attains a minimum at some
extreme point of P.

CHAPTER 3: Basic Concepts of Convex Optimization

Section 3.1. Constrained Optimization

Let us consider the problem

minimize   f(x)
subject to x ∈ X,

where f : ℝ^n ↦ (−∞, ∞] is a function and X is a nonempty subset of
ℝ^n. Any vector x ∈ X ∩ dom(f) is said to be a feasible solution of the
problem (we also use the terms feasible vector or feasible point). If there
is at least one feasible solution, i.e., X ∩ dom(f) ≠ Ø, we say that the
problem is feasible; otherwise we say that the problem is infeasible. Thus,
when f is extended real-valued, we view only the points in X ∩ dom(f) as
candidates for optimality, and we view dom(f) as an implicit constraint set.
Furthermore, feasibility of the problem is equivalent to inf_{x∈X} f(x) < ∞.
We say that a vector x* is a minimum of f over X if

x* ∈ X ∩ dom(f),   and   f(x*) = inf_{x∈X} f(x).

We also call x* a minimizing point or minimizer or global minimum of f
over X. Alternatively, we say that f attains a minimum over X at x*, and
we indicate this by writing

x* ∈ arg min_{x∈X} f(x).

If x* is known to be the unique minimizer of f over X, with slight abuse
of notation, we also occasionally write

x* = arg min_{x∈X} f(x).

We use similar terminology for maxima.

Given a subset X of ℝ^n and a function f : ℝ^n ↦ (−∞, ∞], we say
that a vector x* is a local minimum of f over X if x* ∈ X ∩ dom(f) and
there exists some ε > 0 such that

f(x*) ≤ f(x),   ∀ x ∈ X with ‖x − x*‖ < ε.

A local minimum x* is said to be strict if there is no other local minimum
within some open sphere centered at x*. Local maxima are defined
similarly.

Proposition 3.1.1: If X is a convex subset of ℝ^n and f : ℝ^n ↦
(−∞, ∞] is a convex function, then a local minimum of f over X is
also a global minimum. If in addition f is strictly convex, then there
exists at most one global minimum of f over X.

Section 3.2. Existence of Optimal Solutions

Proposition 3.2.1: (Weierstrass' Theorem) Consider a closed
proper function f : ℝ^n ↦ (−∞, ∞], and assume that any one of the
following three conditions holds:
(1) dom(f) is bounded.
(2) There exists a scalar γ such that the level set

{x | f(x) ≤ γ}

is nonempty and bounded.
(3) f is coercive, i.e., for every sequence {x_k} such that ‖x_k‖ → ∞,
we have lim_{k→∞} f(x_k) = ∞.
Then the set of minima of f over ℝ^n is nonempty and compact.
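For instance (an illustration of ours), f(x) = ‖x‖² is coercive, so by condition
(3) its set of minima over ℝ^n is nonempty and compact (it is {0}); by
contrast, f(x) = e^x on ℝ is bounded below but satisfies none of the three
conditions (all of its nonempty level sets are unbounded), and indeed it
attains no minimum.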

Proposition 3.2.2: Let X be a closed convex subset of ℝ^n, and let
f : ℝ^n ↦ (−∞, ∞] be a closed convex function with X ∩ dom(f) ≠ Ø.
The set of minima of f over X is nonempty and compact if and only
if X and f have no common nonzero direction of recession.

Proposition 3.2.3: (Existence of Solution, Sum of Functions)
Let f_i : ℝ^n ↦ (−∞, ∞], i = 1, ..., m, be closed proper convex functions
such that the function f = f_1 + ··· + f_m is proper. Assume that
the recession function of a single function f_i satisfies r_{f_i}(d) = ∞ for
all d ≠ 0. Then the set of minima of f is nonempty and compact.

Section 3.3. Partial Minimization of Convex Functions

Functions obtained by minimizing other functions partially, i.e., with
respect to some of their variables, arise prominently in the treatment of
duality and minimax theory. It is then useful to be able to deduce properties
of the function obtained, such as convexity and closedness, from
corresponding properties of the original.

Proposition 3.3.1: Consider a function F : ℝ^{n+m} ↦ (−∞, ∞] and
the function f : ℝ^n ↦ [−∞, ∞] defined by

f(x) = inf_{z∈ℝ^m} F(x, z).

Then:
(a) If F is convex, then f is also convex.
(b) We have

P(epi(F)) ⊂ epi(f) ⊂ cl(P(epi(F))),

where P(·) denotes projection on the space of (x, w), i.e., for any
subset S of ℝ^{n+m+1}, P(S) = {(x, w) | (x, z, w) ∈ S}.

Proposition 3.3.2: Let F : ℝ^{n+m} ↦ (−∞, ∞] be a closed proper
convex function, and consider the function f given by

f(x) = inf_{z∈ℝ^m} F(x, z),   x ∈ ℝ^n.

Assume that for some x̄ ∈ ℝ^n and γ̄ ∈ ℝ the set

{z | F(x̄, z) ≤ γ̄}

is nonempty and compact. Then f is closed proper convex. Furthermore,
for each x ∈ dom(f), the set of minima in the definition of f(x)
is nonempty and compact.

Proposition 3.3.3: Let X and Z be nonempty convex sets of ℝ^n and
ℝ^m, respectively, let F : X × Z ↦ ℝ be a closed convex function, and
assume that Z is compact. Then the function f given by

f(x) = inf_{z∈Z} F(x, z),   x ∈ X,

is a real-valued convex function over X.

Proposition 3.3.4: Let F : ℝ^{n+m} ↦ (−∞, ∞] be a closed proper
convex function, and consider the function f given by

f(x) = inf_{z∈ℝ^m} F(x, z).

Assume that for some x̄ ∈ ℝ^n and γ̄ ∈ ℝ the set

{z | F(x̄, z) ≤ γ̄}

is nonempty and its recession cone is equal to its lineality space. Then
f is closed proper convex. Furthermore, for each x ∈ dom(f), the set
of minima in the definition of f(x) is nonempty.

Section 3.4. Saddle Point and Minimax Theory

Let us consider a function φ : X × Z ↦ ℝ, where X and Z are nonempty
subsets of ℝ^n and ℝ^m, respectively. An issue of interest is to derive
conditions guaranteeing that

sup_{z∈Z} inf_{x∈X} φ(x, z) = inf_{x∈X} sup_{z∈Z} φ(x, z),   (B.3)

and that the infima and the suprema above are attained.

Definition 3.4.1: A pair of vectors x* ∈ X and z* ∈ Z is called a
saddle point of φ if

φ(x*, z) ≤ φ(x*, z*) ≤ φ(x, z*),   ∀ x ∈ X, ∀ z ∈ Z.
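As a simple illustration of ours, take X = Z = ℝ and φ(x, z) = x² − z². Then
φ(0, z) = −z² ≤ 0 = φ(0, 0) ≤ x² = φ(x, 0) for all x and z, so
(x*, z*) = (0, 0) is a saddle point of φ.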

Proposition 3.4.1: A pair (x*, z*) is a saddle point of φ if and only
if the minimax equality (B.3) holds, and x* is an optimal solution of
the problem

minimize   sup_{z∈Z} φ(x, z)
subject to x ∈ X,

while z* is an optimal solution of the problem

maximize   inf_{x∈X} φ(x, z)
subject to z ∈ Z.

CHAPTER 4: Geometric Duality Framework

Section 4.1. Min Common/Max Crossing Duality

We introduce a geometric framework for duality analysis, which aims to
capture the most essential characteristics of duality in two simple
geometrical problems, defined by a nonempty subset M of ℝ^{n+1}.

(a) Min Common Point Problem: Consider all vectors that are common
to M and the (n + 1)st axis. We want to find one whose (n + 1)st
component is minimum.
(b) Max Crossing Point Problem: Consider nonvertical hyperplanes that
contain M in their corresponding "upper" closed halfspace, i.e., the
closed halfspace whose recession cone contains the vertical halfline
{(0, w) | w ≥ 0}. We want to find the maximum crossing point of
the (n + 1)st axis with such a hyperplane.
We refer to the two problems as the min common/max crossing (MC/MC)
framework, and we will show that it can be used to develop much of the
core theory of convex optimization in a unified way.
Mathematically, the min common problem is

minimize   w
subject to (0, w) ∈ M.

We also refer to this as the primal problem, and we denote by w* its optimal
value,

w* = inf_{(0,w)∈M} w.

The max crossing problem is to maximize over all µ ∈ ℝ^n the maximum
crossing level corresponding to µ, i.e.,

maximize   inf_{(u,w)∈M} {w + µ'u}
subject to µ ∈ ℝ^n.                                   (B.4)

We also refer to this as the dual problem, we denote by q* its optimal value,

q* = sup_{µ∈ℝ^n} q(µ),

and we refer to q(µ) as the crossing or dual function.

Proposition 4.1.1: The dual function q is concave and upper
semicontinuous.

The following proposition states that we always have q* ≤ w*; we
refer to this as weak duality. When q* = w*, we say that strong duality
holds or that there is no duality gap.

Proposition 4.1.2: (Weak Duality Theorem) We have q* ≤ w*.



The feasible solutions of the max crossing problem are restricted by
the horizontal directions of recession of M. This is the essence of the
following proposition.

Proposition 4.1.3: Assume that the set

M̄ = M + {(0, w) | w ≥ 0}

is convex. Then the set of feasible solutions of the max crossing problem,
{µ | q(µ) > −∞}, is contained in the cone

{µ | µ'd ≥ 0 for all d with (d, 0) ∈ R_M̄},

where R_M̄ is the recession cone of M̄.

Section 4.2. Some Special Cases

There are several interesting special cases where the set M is the epigraph
of some function. For example, consider the problem of minimizing a function
f : ℝ^n ↦ [−∞, ∞]. We introduce a function F : ℝ^{n+r} ↦ [−∞, ∞] of the
pair (x, u), which satisfies

f(x) = F(x, 0),   ∀ x ∈ ℝ^n.                          (B.5)

Let the function p : ℝ^r ↦ [−∞, ∞] be defined by

p(u) = inf_{x∈ℝ^n} F(x, u),                           (B.6)

and consider the MC/MC framework with

M = epi(p).

The min common value w* is the minimal value of f, since

w* = p(0) = inf_{x∈ℝ^n} F(x, 0) = inf_{x∈ℝ^n} f(x).

The max crossing problem (B.4) can be written as

maximize   q(µ)
subject to µ ∈ ℝ^r,

where the dual function is

q(µ) = inf_{(u,w)∈M} {w + µ'u} = inf_{u∈ℝ^r} {p(u) + µ'u} = inf_{(x,u)∈ℝ^{n+r}} {F(x, u) + µ'u}.   (B.7)

Note that from Eq. (B.7), an alternative expression for q is

q(µ) = − sup_{(x,u)∈ℝ^{n+r}} {−µ'u − F(x, u)} = −F*(0, −µ),

where F* is the conjugate of F, viewed as a function of (x, u). Since

q* = sup_{µ∈ℝ^r} q(µ) = − inf_{µ∈ℝ^r} F*(0, −µ) = − inf_{µ∈ℝ^r} F*(0, µ),

the strong duality relation w* = q* can be written as

inf_{x∈ℝ^n} F(x, 0) = − inf_{µ∈ℝ^r} F*(0, µ).

Different choices of function F, as in Eqs. (B.5) and (B.6), yield
corresponding MC/MC frameworks and dual problems. An example of
this type is minimization with inequality constraints:

minimize   f(x)
subject to x ∈ X, g(x) ≤ 0,                           (B.8)

where X is a nonempty subset of ℝ^n, f : X ↦ ℝ is a given function, and
g(x) = (g_1(x), ..., g_r(x)) with g_j : X ↦ ℝ being given functions. We
introduce a "perturbed constraint set" of the form

C_u = {x ∈ X | g(x) ≤ u},   u ∈ ℝ^r,                  (B.9)

and the function

F(x, u) = { f(x)   if x ∈ C_u,
          { ∞      otherwise,

which satisfies the condition F(x, 0) = f(x) for all x ∈ C_0 [cf. Eq. (B.5)].
The function p of Eq. (B.6) is given by

p(u) = inf_{x∈ℝ^n} F(x, u) = inf_{x∈X, g(x)≤u} f(x),   (B.10)

and is known as the primal function or perturbation function. It captures
the essential structure of the constrained minimization problem, relating to
duality and other properties, such as sensitivity. Consider now the MC/MC
framework corresponding to M = epi(p). From Eq. (B.7), we obtain with
some calculation

q(µ) = { inf_{x∈X} {f(x) + µ'g(x)}   if µ ≥ 0,
       { −∞                          otherwise.
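As a small worked example of ours, let X = ℝ, f(x) = x², and g(x) = 1 − x, so
that (B.8) asks for the minimum of x² subject to x ≥ 1, with optimal value 1.
For µ ≥ 0,

q(µ) = inf_{x∈ℝ} {x² + µ(1 − x)} = µ − µ²/4,

with the infimum attained at x = µ/2. Maximizing over µ ≥ 0 gives µ* = 2 and
q* = 1, equal to the primal optimal value, so there is no duality gap for this
instance.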

The following proposition derives the primal and dual functions in
the minimax framework. In this proposition, for a given x, we denote by
(cl φ)(x, ·) the concave closure of φ(x, ·) [the smallest concave and upper
semicontinuous function that majorizes φ(x, ·)].

Proposition 4.2.1: Let X and Z be nonempty subsets of ℝ^n and
ℝ^m, respectively, and let φ : X × Z ↦ ℝ be a function. Assume
that −(cl φ)(x, ·) is proper for all x ∈ X, and consider the MC/MC
framework corresponding to M = epi(p), where p is given by

p(u) = inf_{x∈X} sup_{z∈Z} {φ(x, z) − u'z},   u ∈ ℝ^m.

Then the dual function is given by

q(µ) = inf_{x∈X} (cl φ)(x, µ),   µ ∈ ℝ^m.

Section 4.3. Strong Duality Theorem

The following propositions give general results for strong duality.

Proposition 4.3.1: (MC/MC Strong Duality) Consider the min
common and max crossing problems, and assume the following:
(1) Either w* < ∞, or else w* = ∞ and M contains no vertical lines.
(2) The set

M̄ = M + {(0, w) | w ≥ 0}

is convex.
Then, we have q* = w* if and only if for every sequence {(u_k, w_k)} ⊂
M with u_k → 0, there holds w* ≤ liminf_{k→∞} w_k.

Proposition 4.3.2: Consider the MC/MC framework, assuming that
w* < ∞.
(a) Let M be closed and convex. Then q* = w*. Furthermore, the
function

p(u) = inf{w | (u, w) ∈ M},   u ∈ ℝ^n,

is convex and its epigraph is the set

M̄ = M + {(0, w) | w ≥ 0}.

If in addition −∞ < w*, then p is closed and proper.
(b) q* is equal to the optimal value of the min common problem
corresponding to cl(conv(M)).
(c) If M is of the form

M = M̃ + {(u, 0) | u ∈ C},

where M̃ is a compact set and C is a closed convex set, then
q* is equal to the optimal value of the min common problem
corresponding to conv(M).

Section 4.4. Existence of Dual Optimal Solutions

The following propositions give general results for strong duality, as well
as existence of dual optimal solutions.

Proposition 4.4.1: (MC/MC Existence of Max Crossing Solutions)
Consider the MC/MC framework and assume the following:
(1) −∞ < w*.
(2) The set

M̄ = M + {(0, w) | w ≥ 0}

is convex.
(3) The origin is a relative interior point of the set

D = {u | there exists w ∈ ℝ with (u, w) ∈ M̄}.

Then q* = w*, and there exists at least one optimal solution of
the max crossing problem.

Proposition 4.4.2: Let the assumptions of Prop. 4.4.1 hold. Then
Q*, the set of optimal solutions of the max crossing problem, has the
form

Q* = (aff(D))^⊥ + Q̃,

where Q̃ is a nonempty, convex, and compact set. In particular, Q* is
compact if and only if the origin is an interior point of D.

Section 4.5. Duality and Polyhedral Convexity

The following propositions address special cases where the set M has
partially polyhedral structure.

Proposition 4.5.1: Consider the MC/MC framework, and assume
the following:
(1) −∞ < w*.
(2) The set M has the form

M = M̃ − {(u, 0) | u ∈ P},

where M̃ and P are convex sets.
(3) Either ri(D̃) ∩ ri(P) ≠ Ø, or P is polyhedral and ri(D̃) ∩ P ≠ Ø,
where D̃ is the set given by

D̃ = {u | there exists w ∈ ℝ with (u, w) ∈ M̃}.

Then q* = w*, and Q*, the set of optimal solutions of the max crossing
problem, is a nonempty subset of R_P^*, the polar cone of the recession
cone of P. Furthermore, Q* is compact if int(D̃) ∩ P ≠ Ø.

Proposition 4.5.2: Consider the MC/MC framework, and assume
that:
(1) −∞ < w*.
(2) The set M is defined in terms of a polyhedral set P, an r × n
matrix A, a vector b ∈ ℝ^r, and a convex function f : ℝ^n ↦
(−∞, ∞] as follows:

M = {(u, w) | Ax − b − u ∈ P for some (x, w) ∈ epi(f)}.

(3) There is a vector x̄ ∈ ri(dom(f)) such that Ax̄ − b ∈ P.

Then q* = w* and Q*, the set of optimal solutions of the max crossing
problem, is a nonempty subset of R_P^*, the polar cone of the recession
cone of P. Furthermore, Q* is compact if the matrix A has rank r and
there is a vector x̄ ∈ int(dom(f)) such that Ax̄ − b ∈ P.

CHAPTER 5: Duality and Optimization

Section 5.1. Nonlinear Farkas' Lemma

A nonlinear version of Farkas' Lemma captures the essence of convex
programming duality. The lemma involves a nonempty convex set X ⊂ ℝ^n,
and functions f : X ↦ ℝ and g_j : X ↦ ℝ, j = 1, ..., r. We denote
g(x) = (g_1(x), ..., g_r(x))', and use the following assumption.

Assumption 5.1: The functions f and g_j, j = 1, ..., r, are convex,
and

f(x) ≥ 0,   ∀ x ∈ X with g(x) ≤ 0.

Proposition 5.1.1: (Nonlinear Farkas' Lemma) Let Assumption
5.1 hold and let Q* be the subset of ℝ^r given by

Q* = {µ | µ ≥ 0, f(x) + µ'g(x) ≥ 0, ∀ x ∈ X}.

Assume that one of the following two conditions holds:

(1) There exists x̄ ∈ X such that g_j(x̄) < 0 for all j = 1, ..., r.
(2) The functions g_j, j = 1, ..., r, are affine, and there exists x̄ ∈
ri(X) such that g(x̄) ≤ 0.
Then Q* is nonempty, and under condition (1) it is also compact.
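To illustrate with an example of ours, let X = ℝ, f(x) = x − 1, and
g(x) = 1 − x, so that f(x) ≥ 0 whenever g(x) ≤ 0 and Assumption 5.1 holds;
moreover, condition (1) is satisfied by x̄ = 2. Here
f(x) + µ g(x) = (1 − µ)(x − 1), which is nonnegative for all x ∈ ℝ exactly
when µ = 1, so Q* = {1}, a nonempty and compact set, as the lemma asserts.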

The interior point condition (1) in the above proposition, and other
propositions that follow, is known as the Slater condition. By selecting f
and g_j to be linear, and X to be the entire space in the above proposition,
we obtain a version of Farkas' Lemma (cf. Section 2.3) as a special case.

Proposition 5.1.2: (Linear Farkas' Lemma) Let A be an m × n
matrix and c be a vector in ℝ^m.
(a) The system Ay = c, y ≥ 0 has a solution if and only if

A'x ≤ 0   ⟹   c'x ≤ 0.

(b) The system Ay ≥ c has a solution if and only if

A'x = 0, x ≥ 0   ⟹   c'x ≤ 0.

Section 5.2. Linear Programming Duality

One of the most important results in optimization is the linear programming
duality theorem. Consider the problem

minimize   c'x
subject to a_j'x ≥ b_j,   j = 1, ..., r,

where c ∈ ℝ^n, a_j ∈ ℝ^n, and b_j ∈ ℝ, j = 1, ..., r. In the following
proposition, we refer to this as the primal problem. We consider the dual
problem

maximize   b'µ
subject to Σ_{j=1}^r a_j µ_j = c,   µ ≥ 0,

which can be derived from the MC/MC duality framework in Section 4.2.
We denote the primal and dual optimal values by f* and q*, respectively.

Proposition 5.2.1: (Linear Programming Duality Theorem)

(a) If either f* or q* is finite, then f* = q* and both the primal and
the dual problem have optimal solutions.
(b) If f* = −∞, then q* = −∞.
(c) If q* = ∞, then f* = ∞.
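As a small numerical sanity check of part (a), consider the instance with
c = (1, 1) and constraints x_1 ≥ 1, x_2 ≥ 1. The following Python sketch
(ours, using SciPy's linprog solver) solves the primal and dual problems and
compares their optimal values:

    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[1.0, 0.0], [0.0, 1.0]])   # rows are the vectors a_j'
    b = np.array([1.0, 1.0])
    c = np.array([1.0, 1.0])

    # primal: minimize c'x subject to a_j'x >= b_j, written as -A x <= -b
    primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2)

    # dual: maximize b'mu subject to sum_j mu_j a_j = c, mu >= 0
    dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 2)

    print(primal.fun, -dual.fun)   # both equal 2.0, so f* = q* on this instance

Both problems have optimal solutions here (x* = (1, 1) and µ* = (1, 1)), and
the optimal values agree, as the theorem asserts.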

Note that the theorem allows the possibility f* = ∞ and q* = −∞.
Another related result is the following necessary and sufficient condition
for primal and dual optimality.

Proposition 5.2.2: (Linear Programming Optimality Conditions)
A pair of vectors (x*, µ*) form a primal and dual optimal solution
pair if and only if x* is primal-feasible, µ* is dual-feasible, and

µ_j^* (a_j'x* − b_j) = 0,   ∀ j = 1, ..., r.

Section 5.3. Convex Programming Duality

We first focus on the problem

minimize   f(x)
subject to x ∈ X, g(x) ≤ 0,                           (B.11)

where X is a convex set in ℝ^n, g(x) = (g_1(x), ..., g_r(x))', f : X ↦ ℝ and
g_j : X ↦ ℝ, j = 1, ..., r, are convex functions. The dual problem is

maximize   inf_{x∈X} L(x, µ)
subject to µ ≥ 0,

where L is the Lagrangian function

L(x, µ) = f(x) + µ'g(x),   x ∈ X, µ ∈ ℝ^r.

For this and other similar problems, we denote the primal and dual optimal
values by f* and q*, respectively. We always have the weak duality
relation q* ≤ f*; cf. Prop. 4.1.2. When strong duality holds, dual optimal
solutions are also referred to as Lagrange multipliers. The following eight
propositions are the main results relating to strong duality in a variety of
contexts. They provide conditions (often called constraint qualifications),
which guarantee that q* = f*.

Proposition 5.3.1: (Convex Programming Duality - Existence
of Dual Optimal Solutions) Consider problem (B.11). Assume
that f* is finite, and that one of the following two conditions holds:
(1) There exists x̄ ∈ X such that g_j(x̄) < 0 for all j = 1, ..., r.
(2) The functions g_j, j = 1, ..., r, are affine, and there exists x̄ ∈
ri(X) such that g(x̄) ≤ 0.
Then q* = f* and the set of optimal solutions of the dual problem is
nonempty. Under condition (1) this set is also compact.

Proposition 5.3.2: (Optimality Conditions) Consider problem
(B.11). There holds q* = f*, and (x*, µ*) are a primal and dual
optimal solution pair if and only if x* is feasible, µ* ≥ 0, and

x* ∈ arg min_{x∈X} L(x, µ*),   µ_j^* g_j(x*) = 0,   j = 1, ..., r.

The condition µ_j^* g_j(x*) = 0 is known as complementary slackness,
and generalizes the corresponding condition for linear programming, given
in Prop. 5.2.2. The preceding proposition actually can be proved without
the convexity assumptions of X, f, and g, although this fact will not be
useful to us.
The analysis for problem (B.11) can be refined by making more specific
assumptions regarding available polyhedral structure in the constraint
functions and the abstract constraint set X. Here is an extension of
problem (B.11) with additional linear equality constraints:

minimize   f(x)
subject to x ∈ X, g(x) ≤ 0, Ax = b,                   (B.12)

where X is a convex set, g(x) = (g_1(x), ..., g_r(x))', f : X ↦ ℝ and
g_j : X ↦ ℝ, j = 1, ..., r, are convex functions, A is an m × n matrix, and
b ∈ ℝ^m. The corresponding Lagrangian function is

L(x, µ, λ) = f(x) + µ'g(x) + λ'(Ax − b),

and the dual problem is

maximize   inf_{x∈X} L(x, µ, λ)
subject to µ ≥ 0, λ ∈ ℝ^m.

In the special case of a problem with just linear equality constraints:

minimize   f(x)
subject to x ∈ X, Ax = b,                             (B.13)

the Lagrangian function is

L(x, λ) = f(x) + λ'(Ax − b),

and the dual problem is

maximize   inf_{x∈X} L(x, λ)
subject to λ ∈ ℝ^m.

Proposition 5.3.3: (Convex Programming - Linear Equality
Constraints) Consider problem (B.13).
(a) Assume that f* is finite and that there exists x̄ ∈ ri(X) such
that Ax̄ = b. Then f* = q* and there exists at least one dual
optimal solution.
(b) There holds f* = q*, and (x*, λ*) are a primal and dual optimal
solution pair if and only if x* is feasible and

x* ∈ arg min_{x∈X} L(x, λ*).

Proposition 5.3.4: (Convex Programming - Linear Equality
and Inequality Constraints) Consider problem (B.12).
(a) Assume that f* is finite, that the functions g_j are linear, and
that there exists x̄ ∈ ri(X) such that Ax̄ = b and g(x̄) ≤ 0. Then
q* = f* and there exists at least one dual optimal solution.
(b) There holds f* = q*, and (x*, µ*, λ*) are a primal and dual
optimal solution pair if and only if x* is feasible, µ* ≥ 0, and

x* ∈ arg min_{x∈X} L(x, µ*, λ*),   µ_j^* g_j(x*) = 0,   j = 1, ..., r.

Proposition 5.3.5: (Convex Programming - Linear Equality
and Nonlinear Inequality Constraints) Consider problem (B.12).
Assume that f* is finite, that there exists x̄ ∈ X such that Ax̄ = b
and g(x̄) < 0, and that there exists x̃ ∈ ri(X) such that Ax̃ = b. Then
q* = f* and there exists at least one dual optimal solution.

Proposition 5.3.6: (Convex Programming - Mixed Polyhedral
and Nonpolyhedral Constraints) Consider problem (B.12),
where X is the intersection of a polyhedral set P and a convex set C,

X = P ∩ C,

g(x) = (g_1(x), ..., g_r(x))', the functions f : ℝ^n ↦ ℝ and g_j : ℝ^n ↦ ℝ,
j = 1, ..., r, are defined over ℝ^n, A is an m × n matrix, and b ∈ ℝ^m.
Assume that f* is finite and that for some r̄ with 1 ≤ r̄ ≤ r, the
functions g_j, j = 1, ..., r̄, are polyhedral, and the functions f and g_j,
j = r̄ + 1, ..., r, are convex over C. Assume further that:
(1) There exists a vector x̃ ∈ ri(C) in the set

P̃ = P ∩ {x | Ax = b, g_j(x) ≤ 0, j = 1, ..., r̄}.

(2) There exists x̄ ∈ P̃ ∩ C such that g_j(x̄) < 0 for all j = r̄ + 1, ..., r.
Then q* = f* and there exists at least one dual optimal solution.

We will now give a different type of result, which under some compactness
assumptions, guarantees strong duality and that there exists an
optimal primal solution (even if there may be no dual optimal solution).

Proposition 5.3.7: (Convex Programming Duality - Existence
of Primal Optimal Solutions) Assume that problem (B.11)
is feasible, that the convex functions f and g_j are closed, and that the
function

F(x, 0) = { f(x)   if g(x) ≤ 0, x ∈ X,
          { ∞      otherwise,

has compact level sets. Then f* = q* and the set of optimal solutions
of the primal problem is nonempty and compact.

We now consider another important optimization framework, the
problem

minimize   f_1(x) + f_2(Ax)
subject to x ∈ ℝ^n,                                   (B.14)

where A is an m × n matrix, f_1 : ℝ^n ↦ (−∞, ∞] and f_2 : ℝ^m ↦ (−∞, ∞]
are closed proper convex functions. We assume that there exists a feasible
solution.

Proposition 5.3.8: (Fenchel Duality)

(a) If f* is finite and (A · ri(dom(f_1))) ∩ ri(dom(f_2)) ≠ Ø, then
f* = q* and there exists at least one dual optimal solution.
(b) There holds f* = q*, and (x*, λ*) is a primal and dual optimal
solution pair if and only if

x* ∈ arg min_{x∈ℝ^n} {f_1(x) − x'A'λ*}   and   Ax* ∈ arg min_{z∈ℝ^m} {f_2(z) + z'λ*}.   (B.15)

An important special case of Fenchel duality involves the problem

minimize   f(x)
subject to x ∈ C,                                     (B.16)

where f : ℝ^n ↦ (−∞, ∞] is a closed proper convex function and C is a
closed convex cone in ℝ^n. This is known as a conic program, and some of
its special cases (semidefinite programming, second order cone programming)
have many practical applications.

Proposition 5.3.9: (Conic Duality Theorem) Assume that the
optimal value of the primal conic problem (B.16) is finite, and that
ri(dom(f)) ∩ ri(C) ≠ Ø. Consider the dual problem

minimize   f*(λ)
subject to λ ∈ Ĉ,

where f* is the conjugate of f and Ĉ is the dual cone,

Ĉ = −C* = {λ | λ'x ≥ 0, ∀ x ∈ C}.

Then there is no duality gap and the dual problem has an optimal
solution.

Section 5.4. Subgradients and Optimality Conditions

Let f : ℝ^n ↦ (−∞, ∞] be a proper convex function. We say that a vector
g ∈ ℝ^n is a subgradient of f at a point x ∈ dom(f) if

f(z) ≥ f(x) + g'(z − x),   ∀ z ∈ ℝ^n.                 (B.17)

The set of all subgradients of f at x is called the subdifferential of f at
x and is denoted by ∂f(x). By convention, ∂f(x) is considered empty
for all x ∉ dom(f). Generally, ∂f(x) is closed and convex, since based
on the subgradient inequality (B.17), it is the intersection of a collection
of closed halfspaces. Note that we restrict attention to proper functions
(subgradients are not useful and make no sense for improper functions).
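For example (an illustration of ours), for the function f(x) = |x| on ℝ, the
subgradient inequality (B.17) at x = 0 reads |z| ≥ g z for all z, which holds
exactly when |g| ≤ 1; hence ∂f(0) = [−1, 1], while ∂f(x) = {1} for x > 0 and
∂f(x) = {−1} for x < 0.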

Proposition 5.4.1: Let f : ℝ^n ↦ (−∞, ∞] be a proper convex
function. For every x ∈ ri(dom(f)),

∂f(x) = S^⊥ + G,

where S is the subspace that is parallel to the affine hull of dom(f),
and G is a nonempty convex and compact set. In particular, if x ∈
int(dom(f)), then ∂f(x) is nonempty and compact.

It follows from the preceding proposition that if f is real-valued, then
∂f(x) is nonempty and compact for all x ∈ ℝ^n. An important property is
that if f is differentiable at some x ∈ int(dom(f)), its gradient ∇f(x) is
the unique subgradient at x. We give a proof of these facts, together with
the following proposition, in Section 3.1.

Proposition 5.4.2: (Subdifferential Boundedness and Lipschitz
Continuity) Let f : ℝ^n ↦ ℝ be a real-valued convex function,
and let X be a nonempty bounded subset of ℝ^n.
(a) The set ∪_{x∈X} ∂f(x) is nonempty and bounded.
(b) The function f is Lipschitz continuous over X, i.e., there exists
a scalar L such that

|f(x) − f(z)| ≤ L ‖x − z‖,   ∀ x, z ∈ X.

Section 5.4.1. Subgradients of Conjugate Functions

We will now derive an important relation between the subdifferentials of a
proper convex function f : ℝ^n ↦ (−∞, ∞] and its conjugate f*. Using the
definition of conjugacy, we have

x'y ≤ f(x) + f*(y),   ∀ x ∈ ℝ^n, y ∈ ℝ^n.

This is known as the Fenchel inequality. A pair (x, y) satisfies this
inequality as an equation if and only if x attains the supremum in the
definition

f*(y) = sup_{z∈ℝ^n} {y'z − f(z)}.

Pairs of this type are connected with the subdifferentials of f and f*, as
shown in the following proposition.

Proposition 5.4.3: (Conjugate Subgradient Theorem) Let f :
ℝ^n ↦ (−∞, ∞] be a proper convex function and let f* be its conjugate.
The following two relations are equivalent for a pair of vectors (x, y):
(i) x'y = f(x) + f*(y).
(ii) y ∈ ∂f(x).
If in addition f is closed, the relations (i) and (ii) are equivalent to
(iii) x ∈ ∂f*(y).
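As a quick check of the theorem on an example of ours, let f(x) = (1/2)x² on
ℝ, so that f*(y) = (1/2)y². The Fenchel inequality xy ≤ (1/2)x² + (1/2)y²
holds as an equation precisely when y = x, which matches (ii) and (iii):
∂f(x) = {x} and ∂f*(y) = {y}, so y ∈ ∂f(x) if and only if x ∈ ∂f*(y), if and
only if y = x.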

For an application of the Conjugate Subgradient Theorem, note that
the necessary and sufficient optimality condition (B.15) in the Fenchel
Duality Theorem can be equivalently written as

A'λ* ∈ ∂f_1(x*),   λ* ∈ −∂f_2(Ax*).

The following proposition gives some useful corollaries of the Conjugate
Subgradient Theorem:

Proposition 5.4.4: Let f : ℝ^n ↦ (−∞, ∞] be a closed proper convex
function and let f* be its conjugate.
(a) f* is differentiable at a vector y ∈ int(dom(f*)) if and only if
the supremum of x'y − f(x) over x ∈ ℝ^n is uniquely attained.
(b) The set of minima of f is given by

arg min_{x∈ℝ^n} f(x) = ∂f*(0).

Section 5.4.2. Subdifferential Calculus

We now generalize some of the basic theorems of ordinary differentiation
(Section 3.1 gives proofs for the case of real-valued functions).

Proposition 5.4.5: (Chain Rule) Let f : ℝ^m ↦ (−∞, ∞] be a
convex function, let A be an m × n matrix, and assume that the
function F given by

F(x) = f(Ax)

is proper. Then

∂F(x) ⊃ A'∂f(Ax),   ∀ x ∈ ℝ^n.

Furthermore, if either f is polyhedral or else the range of A contains
a point in the relative interior of dom(f), we have

∂F(x) = A'∂f(Ax),   ∀ x ∈ ℝ^n.

We also have the following proposition, which is a special case of the
preceding one [cf. the proof of Prop. 3.1.3(b)].

Proposition 5.4.6: (Subdifferential of Sum of Functions) Let
f_i : ℝ^n ↦ (−∞, ∞], i = 1, ..., m, be convex functions, and assume
that the function F = f_1 + ··· + f_m is proper. Then

∂F(x) ⊃ ∂f_1(x) + ··· + ∂f_m(x),   ∀ x ∈ ℝ^n.

Furthermore, if ∩_{i=1}^m ri(dom(f_i)) ≠ Ø, we have

∂F(x) = ∂f_1(x) + ··· + ∂f_m(x),   ∀ x ∈ ℝ^n.

More generally, the same is true if for some m̄ with 1 ≤ m̄ ≤ m, the
functions f_i, i = 1, ..., m̄, are polyhedral and

(∩_{i=1}^{m̄} dom(f_i)) ∩ (∩_{i=m̄+1}^{m} ri(dom(f_i))) ≠ Ø.

Section 5.4.3. Optimality Conditions

It can be seen from the definition of subgradient that a vector x* minimizes
f over ℝ^n if and only if 0 ∈ ∂f(x*). We give the following generalization
of this condition to constrained problems.

Proposition 5.4.7: Let f : ℝ^n ↦ (−∞, ∞] be a proper convex
function, let X be a nonempty convex subset of ℝ^n, and assume that one
of the following four conditions holds:
(1) ri(dom(f)) ∩ ri(X) ≠ Ø.
(2) f is polyhedral and dom(f) ∩ ri(X) ≠ Ø.
(3) X is polyhedral and ri(dom(f)) ∩ X ≠ Ø.
(4) f and X are polyhedral, and dom(f) ∩ X ≠ Ø.
Then, a vector x* minimizes f over X if and only if there exists g ∈
∂f(x*) such that

g'(x − x*) ≥ 0,   ∀ x ∈ X.                            (B.18)

The relative interior condition (1) of the preceding proposition is
automatically satisfied when f is real-valued [we have dom(f) = ℝ^n];
Section 3.1 gives a proof of the proposition for this case. If in addition f is
differentiable, the optimality condition (B.18) reduces to the one of Prop.
1.1.8 of this appendix:

∇f(x*)'(x − x*) ≥ 0,   ∀ x ∈ X.

Section 5.4.4. Directional Derivatives

For a proper convex function f : ℝ^n ↦ (−∞, ∞], the directional derivative
at any x ∈ dom(f) in a direction d ∈ ℝ^n is defined by

f'(x; d) = lim_{α↓0} (f(x + αd) − f(x)) / α.          (B.19)

An important fact here is that the ratio in Eq. (B.19) is monotonically
nonincreasing as α ↓ 0, so that the limit above is well-defined. To verify
this, note that for any ᾱ > 0, the convexity of f implies that for all
α ∈ (0, ᾱ),

f(x + αd) ≤ (α/ᾱ) f(x + ᾱd) + (1 − α/ᾱ) f(x) = f(x) + (α/ᾱ) (f(x + ᾱd) − f(x)),

so that

(f(x + αd) − f(x)) / α ≤ (f(x + ᾱd) − f(x)) / ᾱ,   ∀ α ∈ (0, ᾱ).   (B.20)

Thus the limit in Eq. (B.19) is well-defined (as a real number, or ∞, or
−∞) and an alternative definition of f'(x; d) is

f'(x; d) = inf_{α>0} (f(x + αd) − f(x)) / α,   d ∈ ℝ^n.   (B.21)
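For instance (an illustration of ours), with f(x) = x² on ℝ and d = 1, the
ratio in Eq. (B.19) is (f(x + α) − f(x))/α = 2x + α, which indeed decreases
monotonically as α ↓ 0 and yields f'(x; 1) = 2x = ∇f(x), consistent with
Eqs. (B.20) and (B.21).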

The directional derivative is related to the support function of the
subdifferential ∂f(x), as indicated in the following proposition.

Proposition 5.4.8: (Support Function of the Subdifferential)
Let f : ℝ^n ↦ (−∞, ∞] be a proper convex function, and let (cl f')(x; ·)
be the closure of the directional derivative f'(x; ·).
(a) For all x ∈ dom(f) such that ∂f(x) is nonempty, (cl f')(x; ·) is
the support function of ∂f(x).
(b) For all x ∈ ri(dom(f)), f'(x; ·) is closed and it is the support
function of ∂f(x).

Section 5.5. Minimax Theory

We will now provide theorems regarding the validity of the minimax
equality and the existence of saddle points. These theorems are obtained by
specializing the MC/MC theorems of Chapter 4. We will assume throughout
this section the following:
(a) X and Z are nonempty convex subsets of ℝ^n and ℝ^m, respectively.
(b) φ : X × Z ↦ ℝ is a function such that φ(·, z) : X ↦ ℝ is convex and
closed for each z ∈ Z, and −φ(x, ·) : Z ↦ ℝ is convex and closed for
each x ∈ X.

Proposition 5.5.1: Assume that the function p given by

p(u) = inf_{x∈X} sup_{z∈Z} {φ(x, z) − u'z},   u ∈ ℝ^m,

satisfies either p(0) < ∞, or else p(0) = ∞ and p(u) > −∞ for all
u ∈ ℝ^m. Then

sup_{z∈Z} inf_{x∈X} φ(x, z) = inf_{x∈X} sup_{z∈Z} φ(x, z)

if and only if p is lower semicontinuous at u = 0.

Proposition 5.5.2: Assume that 0 ∈ ri(dom(p)) and p(0) > −∞.
Then

sup_{z∈Z} inf_{x∈X} φ(x, z) = inf_{x∈X} sup_{z∈Z} φ(x, z),

and the supremum over Z in the left-hand side is finite and is attained.
Furthermore, the set of z ∈ Z attaining this supremum is compact if
and only if 0 lies in the interior of dom(p).

Proposition 5.5.3: (Classical Saddle Point Theorem) Let the
sets X and Z be compact. Then the set of saddle points of φ is
nonempty and compact.

To formulate more general saddle point theorems, we consider the
convex functions t : ℝ^n ↦ (−∞, ∞] and r : ℝ^m ↦ (−∞, ∞] given by

t(x) = { sup_{z∈Z} φ(x, z)     if x ∈ X,
       { ∞                     if x ∉ X,

and

r(z) = { − inf_{x∈X} φ(x, z)   if z ∈ Z,
       { ∞                     if z ∉ Z.

Thus, by Prop. 3.4.1, (x*, z*) is a saddle point if and only if

sup_{z∈Z} inf_{x∈X} φ(x, z) = inf_{x∈X} sup_{z∈Z} φ(x, z),

and x* minimizes t while z* minimizes r.
The next two propositions provide conditions for the minimax equal-
ity to hold. These propositions are used to prove results about nonempti-
ness and compactness of the set of saddle points.

Proposition 5.5.4: Assume that t is proper and that the level sets
{x | t(x) ≤ γ}, γ ∈ ℝ, are compact. Then

sup_{z∈Z} inf_{x∈X} φ(x, z) = inf_{x∈X} sup_{z∈Z} φ(x, z),

and the infimum over X in the right-hand side above is attained at a
set of points that is nonempty and compact.

Proposition 5.5.5: Assume that t is proper, and that the recession
cone and the constancy space of t are equal. Then

sup_{z∈Z} inf_{x∈X} φ(x, z) = inf_{x∈X} sup_{z∈Z} φ(x, z),

and the infimum over X in the right-hand side above is attained.

Proposition 5.5.6: Assume that either t is proper or r is proper.

(a) If the level sets {x | t(x) ≤ γ} and {z | r(z) ≤ γ}, γ ∈ ℝ, of t
and r are compact, the set of saddle points of φ is nonempty and
compact.
(b) If the recession cones of t and r are equal to the constancy spaces
of t and r, respectively, the set of saddle points of φ is nonempty.

Proposition 5.5.7: (Saddle Point Theorem) The set of saddle
points of φ is nonempty and compact under any one of the following
conditions:
(1) X and Z are compact.
(2) Z is compact, and for some z̄ ∈ Z, γ ∈ ℝ, the level set

{x ∈ X | φ(x, z̄) ≤ γ}

is nonempty and compact.
(3) X is compact, and for some x̄ ∈ X, γ ∈ ℝ, the level set

{z ∈ Z | φ(x̄, z) ≥ γ}

is nonempty and compact.
(4) For some x̄ ∈ X, z̄ ∈ Z, γ ∈ ℝ, the level sets

{x ∈ X | φ(x, z̄) ≤ γ},   {z ∈ Z | φ(x̄, z) ≥ γ},

are nonempty and compact.
References

[ACH97] Auslender, A., Cominetti, R., and Haddou, M., 1997. "Asymp-
totic Analysis for Penalty and Barrier Methods in Convex and Linear Pro-
gramming," Math. of Operations Research, Vol. 22, pp. 43-62.
[ALS14] Abernethy, J., Lee, C., Sinha, A., and Tewari, A., 2014. "Online
Linear Optimization via Smoothing," arXiv preprint arXiv:1405.6076.
[AgB14] Agarwal, A., and Bottou, L., 2014. "A Lower Bound for the Op-
timization of Finite Sums," arXiv preprint arXiv:1410.0723.
[AgDll] Agarwal, A., and Duchi, J.C., 2011. "Distributed Delayed Stochas-
tic Optimization," In Advances in Neural Information Processing Systems
(NIPS 2011), pp. 873-881.
[AlG03] Alizadeh, F., and Goldfarb, D., 2003. "Second-Order Cone Pro-
gramming," Math. Programming, Vol. 95, pp. 3-51.
[AnH13] Andersen, M. S., and Hansen, P. C., 2013. "Generalized Row-
Action Methods for Tomographic Imaging," Numerical Algorithms, Vol.
67, pp. 1-24.
[Arm66] Armijo, L., 1966. "Minimization of Functions Having Continuous
Partial Derivatives," Pacific J. Math., Vol. 16, pp. 1-3.
[Ash72] Ash, R. B., 1972. Real Analysis and Probability, Academic Press,
NY.
[AtV95] Atkinson, D. S., and Vaidya, P. M., 1995. "A Cutting Plane Algo-
rithm for Convex Programming that Uses Analytic Centers," Math. Pro-
gramming, Vol. 69, pp. 1-44.
[AuE76] Aubin, J. P., and Ekeland, I., 1976. "Estimates of the Duality
Gap in Nonconvex Optimization," Math. of Operations Research, Vol. 1,
pp. 255- 245.
[AuT03] Auslender, A., and Teboulle, M., 2003. Asymptotic Cones and
Functions in Optimization and Variational Inequalities, Springer, NY.
[AuT04] Auslender, A., and Teboulle, M., 2004. "Interior Gradient and
Epsilon-Subgradient Descent Methods for Constrained Convex Minimiza-
tion," Math. of Operations Research, Vol. 29, pp. 1-26.
[Aus76] Auslender, A., 1976. Optimization: Methodes Numeriques, Mason,
Paris.
[Aus92] Auslender, A., 1992. "Asymptotic Properties of the Fenchel Dual

519
520 References

Functional and Applications to Decomposition Problems," J. of Optimiza-


tion Theory and Applications, Vol. 73, pp. 427-449.
[BBCll] Bertsimas, D., Brown, D. B., and Caramanis, C., 2011. "Theory
and Applications of Robust Optimization," SIAM Review, Vol. 53, pp.
464-501.
[BBG09] Bordes, A., Bottou, L., and Gallinari, P., 2009. "SGD-QN: Care-
ful Quasi-Newton Stochastic Gradient Descent," J. of Machine Learning
Research, Vol. 10, pp. 1737-1754.
[BBL07] Berry, M. W., Browne, M., Langville, A. N., Pauca, V. P., and
Plemmons, R. J., 2007. "Algorithms and Applications for Approximate
Nonnegative Matrix Factorization," Computational Statistics and Data
Analysis, Vol. 52, pp. 155-173.
[BBY12] Borwein, J. M., Burachik, R. S., and Yao, L., 2012. "Conditions
for Zero Duality Gap in Convex Programming," arXiv preprint arXiv:-
1211.4953.
[BCK06] Bauschke, H. H., Combettes, P. L., and Kruk, S. G., 2006. "Ex-
trapolation Algorithm for Affine-Convex Feasibility Problems," Numer. Al-
gorithms, Vol. 41, pp. 239-274.
[BGI95] Burachik, R., Grana Drummond, L. M., lusem, A. N., and Svaiter,
B. F., 1995. "Full Convergence of the Steepest Descent Method with Inexact
Line Searches," Optimization, Vol. 32, pp. 137-146.
[BGL06] Bonnans, J. F., Gilbert, J. C., Lemarechal, C., and Sagastizabal,
S. C., 2006. Numerical Optimization: Theoretical and Practical Aspects,
Springer, NY.
[BGN09] Ben-Tal, A., El Ghaoui, L., and Nemirovski, A., 2009. Robust
Optimization, Princeton Univ. Press, Princeton, NJ.
[BHG08] Blatt, D., Hero, A. 0., Gauchman, H., 2008. "A Convergent In-
cremental Gradient Method with a Constant Step Size," SIAM J. Opti-
mization, Vol. 18, pp. 29-51.
[BHT87] Bertsekas, D. P., Hossein, P., and Tseng, P., 1987. "Relaxation
Methods for Network Flow Problems with Convex Arc Costs," SIAM J. on
Control and Optimization, Vol. 25, pp. 1219-1243.
[BJM12] Bach, F., Jenatton, R., Mairal, J., and Obozinski, G., 2012. "Op-
timization with Sparsity-Inducing Penalties," Foundations and Trends in
Machine Learning, Vol. 4, pp. 1-106.
[BKM14] Burachik, R. S., Kaya, C. Y., and Majeed, S. N., 2014. "A Du-
ality Approach for Solving Control-Constrained Linear-Quadratic Optimal
Control Problems," SIAM J. on Control and Optimization, Vol. 52, pp.
1423-1456.
[BLY14] Bragin, M. A., Luh, P. B., Yan, J. H., Yu, N., and Stern, G. A.,
2014. "Convergence of the Surrogate Lagrangian Relaxation Method," J.
of Optimization Theory and Applications, on line.
References 521

[BMNOl] Ben-Tal, A., Margalit, T., and Nemirovski, A., 2001. "The Or-
dered Subsets Mirror Descent Optimization Method and its Use for the
Positron Emission Tomography Reconstruction," in Inherently Parallel Al-
gorithms in Feasibility and Optimization and their Applications (D. But-
nariu, Y. Censor, and S. Reich, eds.), Elsevier, Amsterdam, Netherlands.
[BMROO] Birgin, E. G., Martinez, J. M., and Raydan, M., 2000. "Non-
monotone Spectral Projected Gradient Methods on Convex Sets," SIAM
J. on Optimization, Vol. 10, pp. 1196-1211.
[BMS99] Boltyanski, V., Martini, H., and Soltan, V., 1999. Geometric
Methods and Optimization Problems, Kluwer, Boston.
[BN003] Bertsekas, D. P., Nedic, A., and Ozdaglar, A. E., 2003. Convex
Analysis and Optimization, Athena Scientific, Belmont, MA.
[BOT06] Bertsekas, D. P., Ozdaglar, A. E., and Tseng, P., 2006 "Enhanced
Fritz John Optimality Conditions for Convex Programming," SIAM J. on
Optimization, Vol. 16, pp. 766-797.
[BPCll] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J., 2011.
Distributed Optimization and Statistical Learning via the Alternating Di-
rection Method of Multipliers, Now Publishers Inc, Boston, MA.
[BPP13] Bhatnagar, S., Prasad, H., and Prashanth, L. A., 2013. Stochas-
tic Recursive Algorithms for Optimization, Lecture Notes in Control and
Information Sciences, Springer, NY.
[BPT97a] Bertsekas, D. P., Polymenakos, L. C., and Tseng, P., 1997. "An
E-Relaxation Method for Separable Convex Cost Network Flow Problems,"
SIAM J. on Optimization, Vol. 7, pp. 853-870.
[BPT97b] Bertsekas, D. P., Polymenakos, L. C., and Tseng, P., 1997.
"Epsilon-Relaxation and Auction Methods for Separable Convex Cost Net-
work Flow Problems," in Network Optimization, Pardalos, P. M., Hearn,
D. W., and Hager, W.W., (Eds.), Lecture Notes in Economics and Math-
ematical Systems, Springer-Verlag, NY, pp. 103-126.
[BSL14] Bergmann, R., Steidl, G., Laus, F., and Weinmann, A., 2014.
"Second Order Differences of Cyclic Data and Applications in Variational
Denoising," arXiv preprint arXiv:1405.5349.
[BSS06] Bazaraa, M. S., Sherali, H. D., and Shetty, C. M., 2006. Nonlinear
Programming: Theory and Algorithms, 3rd Edition, Wiley, NY.
[BST14] Bolte, J., Sabach, S., and Teboulle, M., 2014. "Proximal Alternat-
ing Linearized Minimization for Nonconvex and Nonsmooth Problems,"
Math. Programming, Vol. 146, pp. 1-36.
[BaB88] Barzilai, J., and Borwein, J.M., 1988. "Two Point Step Size Gra-
dient Methods," IMA J. of Numerical Analysis, Vol. 8, pp. 141-148.
[BaB96] Bauschke, H. H., and Borwein, J. M., 1996. "On Projection Algo-
rithms for Solving Convex Feasibility Problems," SIAM Review, Vol. 38,
pp. 367-426.
522 References

[BaCll] Bauschke, H. H., and Combettes, P. L., 2011. Convex Analysis


and Monotone Operator Theory in Hilbert Spaces, Springer, NY.
[BaMll] Bach, F., and E. Moulines, E., 2011. "Non-Asymptotic Analysis
of Stochastic Approximation Algorithms for Machine Learning," Advances
in Neural Information Processing Systems (NIPS 2011).
[BaT85] Balas, E., and Toth, P., 1985. "Branch and Bound Methods," in
The Traveling Salesman Problem, Lawler, E., Lenstra, J. K., Rinnoy Kan,
A.H. G., and Shmoys, D. B., (Eds.), Wiley, NY, pp. 361-401.
[BaW75] Balinski, M., and Wolfe, P., (Eds.), 1975. Nondifferentiable Opti-
mization, Math. Programming Study 3, North-Holland, Amsterdam.
[Bac14] Bacak, M., 2014. "Computing Medians and Means in Hadamard
Spaces," arXiv preprint arXiv:1210.2145v3.
[BauOl] Bauschke, H. H., 2001. "Projection Algorithms: Results and Open
Problems," in Inherently Parallel Algorithms in Feasibility and Optimiza-
tion and their Applications (D. Butnariu, Y. Censor, and S. Reich, eds.),
Elsevier, Amsterdam, Netherlands.
[BeF12] Becker, S., and Fadili, J., 2012. "A Quasi-Newton Proximal Split-
ting Method," in Advances in Neural Information Processing Systems (NIPS
2012), pp. 2618-2626.
[BeG83] Bertsekas, D. P., and Gafni, E., 1983. "Projected Newton Meth-
ods and Optimization of Multicommodity Flows," IEEE Trans. Automat.
Control, Vol. AC-28, pp. 1090-1096.
[BeG92] Bertsekas, D. P., and Gallager, R. G., 1992. Data Networks, 2nd
Ed., Prentice-Hall, Englewood Cliffs, NJ.
On line at http://web.mit.edu/dimitrib/www/datanets.html.
[BeL88] Becker, S., and LeCun, Y., 1988. "Improving the Convergence of
Back-Propagation Learning with Second Order Methods," in Proceedings
of the 1988 Connectionist Models Summer School, San Matteo, CA.
[BeL07] Bengio, Y., and LeCun, Y., 2007. "Scaling Learning Algorithms
Towards AI," Large-Scale Kernel Machines, Vol. 34, pp. 1-41.
[BeM71] Bertsekas, D. P, and Mitter, S. K., 1971. "Steepest Descent for
Optimization Problems with Nondifferentiable Cost Functionals," Proc.
5th Annual Princeton Confer. Inform. Sci. Systems, Princeton, NJ, pp.
347-351.
[BeM73] Bertsekas, D. P., and Mitter, S. K., 1973. "A Descent Numerical
Method for Optimization Problems with Nondifferentiable Cost Function-
als," SIAM J. on Control, Vol. 11, pp. 637-652.
[BeNOl] Ben-Tal, A., and Nemirovskii, A., 2001. Lectures on Modern Con-
vex Optimization: Analysis, Algorithms, and Engineering Applications,
SIAM, Philadelphia, PA.
[Be002] Bertsekas, D. P., Ozdaglar, A. E., 2002. "Pseudonormality and
a Lagrange Multiplier Theory for Constrained Optimization," J. of Opti-
References 523

mization Theory and Applications, Vol. 114, pp. 287-343.


[BeS82] Bertsekas, D. P., and Sandell, N. R., 1982. "Estimates of the Du-
ality Gap for Large-Scale Separable Nonconvex Optimization Problems,"
Proc. 1982 IEEE Conf. Decision and Control, pp. 782-785.
[BeT88] Bertsekas, D. P., and Tseng, P., 1988. "Relaxation Methods for
Minimum Cost Ordinary and Generalized Network Flow Problems," Op-
erations Research, Vol. 36, pp. 93-114.
[BeT89a] Bertsekas, D. P., and Tsitsiklis, J. N., 1989. Parallel and Dis-
tributed Computation: Numerical Methods, Prentice-Hall, Englewood Cl-
iffs, NJ; republished by Athena Scientific, Belmont, MA, 1997. On line at
http:/ /web.mit.edu/ dimitrib/www /pdc.html.
[BeT89b] Ben-Tal, A., and Teboulle, M., 1989. "A Smoothing Technique for
Nondifferentiable Optimization Problems," Optimization, Lecture Notes in
Mathematics, Vol. 1405, pp. 1-11.
[BeT91] Bertsekas, D. P., and Tsitsiklis, J. N., 1991. "Some Aspects of
Parallel and Distributed Iterative Algorithms - A Survey," Automatica,
Vol. 27, pp. 3-21.
[BeT94a] Bertsekas, D. P., and Tseng, P., 1994. "Partial Proximal Mini-
mization Algorithms for Convex Programming," SIAM J. on Optimization,
Vol. 4, pp. 551-572.
[BeT94b] Bertsekas, D. P., and Tseng, P., 1994. "RELAX-IV: A Faster
Version of the RELAX Code for Solving Minimum Cost Flow Problems,"
Massachusetts Institute of Technology, Laboratory for Information and De-
cision Systems Report LIDS-P-2276, Cambridge, MA.
[BeT96] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Pro-
gramming, Athena Scientific, Belmont, MA.
[BeT97] Bertsimas, D., and Tsitsiklis, J. N., 1997. Introduction to Linear
Optimization, Athena Scientific, Belmont, MA.
[BeTOO] Bertsekas, D. P., and Tsitsiklis, J. N., 2000. "Gradient Convergence
of Gradient Methods with Errors," SIAM J. on Optimization, Vol. 36, pp.
627-642.
[BeT03] Beck, A., and Teboulle, M., 2003. "Mirror Descent and Nonlin-
ear Projected Subgradient Methods for Convex Optimization," Operations
Research Letters, Vol. 31, pp. 167-175.
[BeT09a] Beck, A., and Teboulle, M., 2009. "Fast Gradient-Based Algo-
rithms for Constrained Total Variation Image Denoising and Deblurring
Problems," IEEE Trans. on Image Processing, Vol. 18, pp. 2419-2434.
[BeT09b] Beck, A., and Teboulle, M., 2009. "A Fast Iterative Shrinkage-
Thresholding Algorithm for Linear Inverse Problems," SIAM J. on Imaging
Sciences, Vol. 2, pp. 183-202.
[BeTlO] Beck, A., and Teboulle, M., 2010. "Gradient-Based Algorithms
with Applications to Signal-Recovery Problems," in Convex Optimization
524 References

in Signal Processing and Communications (Y. Eldar and D. Palomar, eds.),


Cambridge Univ. Press, pp. 42-88.
[BeT13] Beck, A., and Tetruashvili, L., 2013. "On the Convergence of Block
Coordinate Descent Type Methods," SIAM J. on Optimization, Vol. 23, pp.
2037-2060.
[Be Y09] Bertsekas, D. P., and Yu, H., 2009. "Projected Equation Methods
for Approximate Solution of Large Linear Systems," J. of Computational
and Applied Mathematics, Vol. 227, pp. 27-50.
[BeYlO] Bertsekas, D. P., and Yu, H., 2010. "Asynchronous Distributed
Policy Iteration in Dynamic Programming," Proc. of Allerton Conf. on
Communication, Control and Computing, Allerton Park, Ill, pp. 1368-1374.
[BeYll] Bertsekas, D. P., and Yu, H., 2011. "A Unifying Polyhedral Ap-
proximation Framework for Convex Optimization," SIAM J. on Optimiza-
tion, Vol. 21, pp. 333-360.
[BeZ97] Ben-Tal, A., and Zibulevsky, M., 1997. "Penalty /Barrier Multiplier
Methods for Convex Programming Problems," SIAM J. on Optimization,
Vol. 7, pp. 347-366.
[Ben09] Bengio, Y., 2009. "Learning Deep Architectures for AI," Founda-
tions and Trends in Machine Learning, Vol. 2, pp. 1-127.
[Ber72] Bertsekas, D. P., 1972. "Stochastic Optimization Problems with
Nondifferentiable Cost Functionals with an Application in Stochastic Pro-
gramming," Proc. 1972 IEEE Conf. Decision and Control, pp. 555-559.
[Ber73] Bertsekas, D. P., 1973. "Stochastic Optimization Problems with
Nondifferentiable Cost Functionals," J. of Optimization Theory and Appli-
cations, Vol. 12, pp. 218-231.
[Ber75a] Bertsekas, D. P., 1975. "Necessary and Sufficient Conditions for a
Penalty Method to be Exact," Math. Programming, Vol. 9, pp. 87-99.
[Ber75b] Bertsekas, D. P., 1975. "Nondifferentiable Optimization via Ap-
proximation," Math. Programming Study 3, Balinski, M., and Wolfe, P.,
(Eds.), North-Holland, Amsterdam, pp. 1-25.
[Ber75c] Bertsekas, D. P., 1975. "Combined Primal-Dual and Penalty Meth-
ods for Constrained Optimization," SIAM J. on Control, Vol. 13, pp. 521-
544.
[Ber75d] Bertsekas, D. P., 1975. "On the Method of Multipliers for Convex
Programming," IEEE Transactions on Aut. Control, Vol. 20, pp. 385-388.
[Ber76a] Bertsekas, D. P., 1976. "On the Goldstein-Levitin-Poljak Gradient
Projection Method," IEEE Trans. Automat. Control, Vol. 21, pp. 174-184.
[Ber76b] Bertsekas, D. P., 1976. "Multiplier Methods: A Survey," Auto-
matica, Vol. 12, pp. 133-145.
[Ber76c] Bertsekas, D. P., 1976. "On Penalty and Multiplier Methods for
Constrained Optimization," SIAM J. on Control and Optimization, Vol.
14, pp. 216-235.
References 525

[Ber77] Bertsekas, D. P., 1977. "Approximation Procedures Based on the


Method of Multipliers," J. Optimization Theory and Applications, Vol. 23,
pp. 487-510.
[Ber79] Bertsekas, D. P., 1979. "A Distributed Algorithm for the Assign-
ment Problem," Lab. for Information and Decision Systems Working Paper,
MIT, Cambridge, MA.
[Ber81] Bertsekas, D. P., 1981. "A New Algorithm for the Assignment Prob-
lem," Mathematical Programming, Vol. 21, pp.152-171.
[Ber82a] Bertsekas, D. P., 1982. Constrained Optimization and Lagrange
Multiplier Methods, Academic Press, NY; republished in 1996 by Athena
Scientific, Belmont, MA. On line at http://web.mit.edu/dimitrib/www/-
lagrmult.html.
[Ber82b] Bertsekas, D. P., 1982. "Projected Newton Methods for Opti-
mization Problems with Simple Constraints," SIAM J. on Control and
Optimization, Vol. 20, pp. 221-246.
[Ber82c] Bertsekas, D. P., 1982. "Distributed Dynamic Programming,"
IEEE Trans. Aut. Control, Vol. AC-27, pp. 610-616.
[Ber83] Bertsekas, D. P., 1983. "Distributed Asynchronous Computation of
Fixed Points," Math. Programming, Vol. 27, pp. 107-120.
[Ber85] Bertsekas, D. P., 1985. "A Unified Framework for Primal-Dual
Methods in Minimum Cost Network Flow Problems," Mathematical Pro-
gramming, Vol. 32, pp. 125-145.
[Ber91] Bertsekas, D. P., 1991. Linear Network Optimization: Algorithms
and Codes, MIT Press, Cambridge, MA.
[Ber92] Bertsekas, D. P., 1992. "Auction Algorithms for Network Problems:
A Tutorial Introduction," Computational Optimization and Applications,
Vol. 1, pp. 7-66.
[Ber96] Bertsekas, D. P., 1996. "Incremental Least Squares Methods and
the Extended Kalman Filter," SIAM J. on Optimization, Vol. 6, pp. 807-
822.
[Ber97] Bertsekas, D. P., 1997. "A New Class of Incremental Gradient
Methods for Least Squares Problems," SIAM J. on Optimization, Vol. 7,
pp. 913-926.
[Ber98] Bertsekas, D. P., 1998. Network Optimization: Continuous and
Discrete Models, Athena Scientific, Belmont, MA.
[Ber99] Bertsekas, D. P., 1999. Nonlinear Programming: 2nd Edition, Athe-
na Scientific, Belmont, MA.
[Ber07] Bertsekas, D. P., 2007. Dynamic Programming and Optimal Con-
trol, Vol. I, 3rd Edition, Athena Scientific, Belmont, MA.
[Ber09] Bertsekas, D. P., 2009. Convex Optimization Theory, Athena Sci-
entific, Belmont, MA.
[BerlOa] Bertsekas, D. P., 2010. "Extended Monotropic Programming and
526 References

Duality," Lab. for Information and Decision Systems Report LIDS-P-2692,


MIT, March 2006, corrected in Feb. 2010; a version appeared in J. of
Optimization Theory and Applications, Vol. 139, pp. 209-225.
[BerlOb] Bertsekas, D. P., 2010. "Incremental Gradient, Subgradient, and
Proximal Methods for Convex Optimization: A Survey," Lab. for Informa-
tion and Decision Systems Report LIDS-P-2848, MIT.
[Berll] Bertsekas, D. P., 2011. "Incremental Proximal Methods for Large
Scale Convex Optimization," Math. Programming, Vol. 129, pp. 163-195.
[Ber12] Bertsekas, D. P., 2012. Dynamic Programming and Optimal Con-
trol, Vol. II, 4th Edition: Approximate Dynamic Programming, Athena
Scientific, Belmont, MA.
[Ber13] Bertsekas, D. P., 2013. Abstract Dynamic Programming, Athena
Scientific, Belmont, MA.
[BiF07] Bioucas-Dias, J., and Figueiredo, M. A. T., 2007. "A New TwIST:
Two-Step Iterative Shrinkage/Thresholding Algorithms for Image Restora-
tion," IEEE Trans. Image Processing, Vol. 16, pp. 2992-3004.
[BiL97] Birge, J. R., and Louveaux, 1997. Introduction to Stochastic Pro-
gramming, Springer-Verlag, New York, NY.
[Bis95] Bishop, C. M, 1995. Neural Networks for Pattern Recognition, Ox-
ford Univ. Press, NY.
[BoLOO] Borwein, J. M., and Lewis, A. S., 2000. Convex Analysis and
Nonlinear Optimization, Springer-Ver lag, NY.
[BoL05] Bottou, L., and LeCun, Y., 2005. "On-Line Learning for Very
Large Datasets," Applied Stochastic Models in Business and Industry, Vol.
21, pp. 137-151.
[BoSOOJ Bonnans, J. F., and Shapiro, A., 2000. Perturbation Analysis of
Optimization Problems, Springer-Verlag, NY.
[BoV04] Boyd, S., and Vanderbergue, L., 2004. Convex Optimization, Cam-
bridge Univ. Press, Cambridge, UK.
[Bor08] Borkar, V. S., 2008. Stochastic Approximation: A Dynamical Sys-
tems Viewpoint, Cambridge Univ. Press.
[Bot05] Bottou, L., 2005. "SGD: Stochastic Gradient Descent,"
http://leon.bottou.org/ projects/ sgd.
[Bot09] Bottou, L., 2009. "Curiously Fast Convergence of Some Stochastic
Gradient Descent Algorithms," unpublished open problem offered to the
attendance of the SLDS 2009 conference.
[BotlO] Bottou, L., 2010. "Large-Scale Machine Learning with Stochastic
Gradient Descent," In Proc. of COMPSTAT 2010, pp. 177-186.
[BrL 78] Brezis, H., and Lions, P. L., 1978. "Produits Infinis de Resolvantes,"
Israel J. of Mathematics, Vol. 29, pp. 329-345.
[BrS13] Brown, D. B., and Smith, J. E., 2013. "Information Relaxations,
Duality, and Convex Stochastic Dynamic Programs," Working Paper, Fuqua
References 527

School of Business, Durham, NC, USA.


[Bre73] Brezis, H.,1973. "Operateurs Maximaux Monotones et Semi-Grou-
pes de Contractions Dans les Espaces de Hilbert," North-Holland, Amster-
dam.
[BuM13] Burachik, R. S., and Majeed, S. N., 2013. "Strong Duality for
Generalized Monotropic Programming in Infinite Dimensions," J. of Math-
ematical Analysis and Applications, Vol. 400, pp. 541-557.
[BuQ98] Burke, J. V., and Qian, M., 1998. "A Variable Metric Proximal
Point Algorithm for Monotone Operators," SIAM J. on Control and Opti-
mization, Vol. 37, pp. 353-375.
[Bur91] Burke, J. V., 1991. "An Exact Penalization Viewpoint of Con-
strained Optimization," SIAM J. on Control and Optimization, Vol. 29,
pp. 968-998.
[CDSOl] Chen, S. S., Donoho, D. L., and Saunders, M.A., 2001. "Atomic
Decomposition by Basis Pursuit," SIAM Review, Vol. 43, pp. 129-159.
[CFM75] Camerini, P. M., Fratta, L., and Maffioli, F., 1975. "On Improving
Relaxation Methods by Modified Gradient Techniques," Math. Program-
ming Studies, Vol. 3, pp. 26-34.
[CGTOO] Conn, A. R., Gould, N. I., and Toint, P. L., 2000. Trust Region
Methods, SIAM, Philadelphia, PA.
[CHY13] Chen, C., He, B., Ye, Y., and Yuan, X., 2013. "The Direct Ex-
tension of ADMM for Multi-Block Convex Minimization Problems is not
Necessarily Convergent," Optimization Online.
[CPR14] Chouzenoux, E., Pesquet, J. C., and Repetti, A., 2014. "Variable
Metric Forward-Backward Algorithm for Minimizing the Sum of a Differ-
entiable Function and a Convex Function," J. of Optimization Theory and
Applications, Vol. 162, pp. 107-132.
[CPS92] Cottle, R. W., Pang, J.-S., and Stone, R. E., 1992. The Linear
Complementarity Problem, Academic Press, NY.
[CRP12] Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S.,
2012. "The Convex Geometry of Linear Inverse Problems," Foundations of
Computational Mathematics, Vol. 12, pp. 805-849.
[CaC68] Canon, M. D., and Cullum, C. D., 1968. "A Tight Upper Bound
on the Rate of Convergence of the Frank-Wolfe Algorithm," SIAM J. on
Control, Vol. 6, pp. 509-516.
[CaG74] Cantor, D. G., and Gerla, M., 1974. "Optimal Routing in Packet
Switched Computer Networks," IEEE Trans. on Computers, Vol. C-23,
pp. 1062-1068.
[CaR09] Candes, E. J., and Recht, B., 2009. "Exact Matrix Completion via
Convex Optimization," Foundations of Computational Math., Vol. 9, pp.
717-772.
[CaT10] Candes, E. J., and Tao, T., 2010. "The Power of Convex Re-
laxation: Near-Optimal Matrix Completion," IEEE Trans. on Information
Theory, Vol. 56, pp. 2053-2080.
[CeH87] Censor, Y., and Herman, G. T., 1987. "On Some Optimization
Techniques in Image Reconstruction from Projections," Applied Numer.
Math., Vol. 3, pp. 365-391.
[CeS08] Cegielski, A., and Suchocka, A., 2008. "Relaxed Alternating Pro-
jection Methods," SIAM J. Optimization, Vol. 19, pp. 1093-1106.
[CeZ92] Censor, Y., and Zenios, S. A., 1992. "The Proximal Minimization
Algorithm with D-Functions," J. Opt. Theory and Appl., Vol. 73, pp. 451-
464.
[CeZ97] Censor, Y., and Zenios, S. A., 1997. Parallel Optimization: Theory,
Algorithms, and Applications, Oxford Univ. Press, NY.
[ChG59] Cheney, E. W., and Goldstein, A. A., 1959. "Newton's Method
for Convex Programming and Tchebycheff Approximation," Numer. Math.,
Vol. 1, pp. 253-268.
[ChR97] Chen, G. H., and Rockafellar, R. T., 1997. "Convergence Rates
in Forward-Backward Splitting," SIAM J. on Optimization, Vol. 7, pp.
421-444.
[ChT93] Chen, G., and Teboulle, M., 1993. "Convergence Analysis of a
Proximal-Like Minimization Algorithm Using Bregman Functions," SIAM
J. on Optimization, Vol. 3, pp. 538-543.
[ChT94] Chen, G., and Teboulle, M., 1994. "A Proximal-Based Decom-
position Method for Convex Minimization Problems," Mathematical Pro-
gramming, Vol. 64, pp. 81-101.
[Cha04] Chambolle, A., 2004. "An Algorithm for Total Variation Minimiza-
tion and Applications," J. of Mathematical Imaging and Vision, Vol. 20,
pp. 89-97.
[Che07] Chen, Y., 2007. "A Smoothing Inexact Newton Method for Min-
imax Problems," Advances in Theoretical and Applied Mathematics, Vol.
2, pp. 137-143.
[Cla10] Clarkson, K. L., 2010. "Coresets, Sparse Greedy Approximation,
and the Frank-Wolfe Algorithm," ACM Transactions on Algorithms, Vol.
6, pp. 63.
[CoT13] Couellan, N. P., and Trafalis, T. B., 2013. "On-line SVM Learning
via an Incremental Primal-Dual Technique," Optimization Methods and
Software, Vol. 28, pp. 256-275.
[CoV13] Combettes, P. L., and Vu, B. C., 2013. "Variable Metric Quasi-
Fejer Monotonicity," Nonlinear Analysis: Theory, Methods and Applica-
tions, Vol. 78, pp. 17-31.
[Com01] Combettes, P. L., 2001. "Quasi-Fejerian Analysis of Some Opti-
mization Algorithms," Studies in Computational Mathematics, Vol. 8, pp.
115-152.
[Cry71] Cryer, C. W., 1971. "The Solution of a Quadratic Programming
Problem Using Systematic Overrelaxation," SIAM J. on Control, Vol. 9,
pp. 385-392.
[DBW12] Duchi, J. C., Bartlett, P. L., and Wainwright, M. J., 2012. "Ran-
domized Smoothing for Stochastic Optimization," SIAM J. on Optimiza-
tion, Vol. 22, pp. 674-701.
[DCD14] Defazio, A. J., Caetano, T. S., and Domke, J., 2014. "Finito: A
Faster, Permutable Incremental Gradient Method for Big Data Problems,"
Proceedings of the 31st ICML, Beijing.
[DHS06] Dai, Y. H., Hager, W. W., Schittkowski, K., and Zhang, H., 2006.
"The Cyclic Barzilai-Borwein Method for Unconstrained Optimization,"
IMA J. of Numerical Analysis, Vol. 26, pp. 604-627.
[DHS11] Duchi, J., Hazan, E., and Singer, Y., 2011. "Adaptive Subgradient
Methods for Online Learning and Stochastic Optimization," J. of Machine
Learning Research, Vol. 12, pp. 2121-2159.
[DMM06] Drineas, P., Mahoney, M. W., and Muthukrishnan, S., 2006.
"Sampling Algorithms for 12 Regression and Applications," Proc. 17th
Annual SODA, pp. 1127-1136.
[DMM11] Drineas, P., Mahoney, M. W., Muthukrishnan, S., and Sarlos,
T., 2011. "Faster Least Squares Approximation," Numerische Mathematik,
Vol. 117, pp. 219-249.
[DRT11] Dhillon, I. S., Ravikumar, P., and Tewari, A., 2011. "Nearest
Neighbor Based Greedy Coordinate Descent," in Advances in Neural In-
formation Processing Systems 24, (NIPS 2011), pp. 2160-2168.
[DaW60] Dantzig, G. B., and Wolfe, P., 1960. "Decomposition Principle
for Linear Programs," Operations Research, Vol. 8, pp. 101-111.
[DaY14a] Davis, D., and Yin, W., 2014. "Convergence Rate Analysis of
Several Splitting Schemes," arXiv preprint arXiv:1406.4834.
[DaY14b] Davis, D., and Yin, W., 2014. "Convergence Rates of Relaxed
Peaceman-Rachford and ADMM Under Regularity Assumptions," arXiv
preprint arXiv:1407.5210.
[Dan67] Danskin, J. M., 1967. The Theory of Max-Min and its Application
to Weapons Allocation Problems, Springer, NY.
[Dav76] Davidon, W. C., 1976. "New Least Squares Algorithms," J. Opti-
mization Theory and Applications, Vol. 18, pp. 187-197.
[DeM71] Demjanov, V. F., and Malozemov, V. N., 1971. "On the Theory
of Non-Linear Minimax Problems," Russian Math. Surveys, Vol. 26, p. 57.
[DeR70] Demjanov, V. F., and Rubinov, A. M., 1970. Approximate Meth-
ods in Optimization Problems, American Elsevier, NY.
[DeS96] Dennis, J. E., and Schnabel, R. B., 1996. Numerical Methods for
Unconstrained Optimization and Nonlinear Equations, SIAM, Philadel-
phia, PA.
[DeT91] Dennis, J. E., and Torczon, V., 1991. "Direct Search Methods on
Parallel Machines," SIAM J. on Optimization, Vol. 1, pp. 448-474.
[Dem66] Demjanov, V. F., 1966. "The Solution of Several Minimax Prob-
lems," Kibernetika, Vol. 2, pp. 58-66.
[Dem68] Demjanov, V. F., 1968. "Algorithms for Some Minimax Prob-
lems," J. of Computer and Systems Science, Vol. 2, pp. 342-380.
[DoE03] Donoho, D. L., and Elad, M., 2003. "Optimally Sparse Representation
in General (Nonorthogonal) Dictionaries via ℓ1 Minimization," Proc. of the
National Academy of Sciences, Vol. 100, pp. 2197-2202.
[DrH04] Drezner, Z., and Hamacher, H. W., 2004. Facility Location: Ap-
plications and Theory, Springer, NY.
[DuS83] Dunn, J. C., and Sachs, E., 1983. "The Effect of Perturbations on
the Convergence Rates of Optimization Algorithms," Appl. Math. Optim.,
Vol. 10, pp. 143-157.
[DuS09] Duchi, J., and Singer, Y., 2009. "Efficient Online and Batch Learn-
ing Using Forward Backward Splitting," J. of Machine Learning Research,
Vol. 10, pp. 2899-2934.
[Dun79] Dunn, J.C., 1979. "Rates of Convergence for Conditional Gradient
Algorithms Near Singular and Nonsingular Extremals," SIAM J. on Control
and Optimization, Vol. 17, pp. 187-211.
[Dun80] Dunn, J. C., 1980. "Convergence Rates for Conditional Gradient
Sequences Generated by Implicit Step Length Rules," SIAM J. on Control
and Optimization, Vol. 18, pp. 473-487.
[Dun81] Dunn, J. C., 1981. "Global and Asymptotic Convergence Rate
Estimates for a Class of Projected Gradient Processes," SIAM J. on Control
and Optimization, Vol. 19, pp. 368-400.
[Dun87] Dunn, J. C., 1987. "On the Convergence of Projected Gradient
Processes to Singular Critical Points," J. of Optimization Theory and Ap-
plications, Vol. 55, pp. 203-216.
[Dun91] Dunn, J.C., 1991. "A Subspace Decomposition Principle for Scaled
Gradient Projection Methods: Global Theory," SIAM J. on Control and
Optimization, Vol. 29, pp. 219-246.
[EcB92] Eckstein, J., and Bertsekas, D. P., 1992. "On the Douglas-Rachford
Splitting Method and the Proximal Point Algorithm for Maximal Monotone
Operators," Math. Programming, Vol. 55, pp. 293-318.
[EcS13] Eckstein, J., and Silva, P. J. S., 2013. "A Practical Relative Error
Criterion for Augmented Lagrangians," Math. Programming, Vol. 141, Ser.
A, pp. 319-348.
[Eck94] Eckstein, J., 1994. "Nonlinear Proximal Point Algorithms Using
Bregman Functions, with Applications to Convex Programming," Math.
of Operations Research, Vol. 18, pp. 202-226.
[Eck03] Eckstein, J., 2003. "A Practical General Approximation Criterion
for Methods of Multipliers Based on Bregman Distances," Math. Program-
ming, Vol. 96, Ser. A, pp. 61-86.
[Eck12] Eckstein, J., 2012. "Augmented Lagrangian and Alternating Direc-
tion Methods for Convex Optimization: A Tutorial and Some Illustrative
Computational Results," RUTCOR Research Report RRR 32-2012, Rut-
gers Univ.
[EkT76] Ekeland, I., and Temam, R., 1976. Convex Analysis and Varia-
tional Problems, North-Holland Publ., Amsterdam.
[ElM75] Elzinga, J., and Moore, T. G., 1975. "A Central Cutting Plane
Algorithm for the Convex Programming Problem," Math. Programming,
Vol. 8, pp. 134-145.
[Erm76] Ermoliev, Yu. M., 1976. Stochastic Programming Methods, Nauka,
Moscow.
[Eve63] Everett, H., 1963. "Generalized Lagrange Multiplier Method for
Solving Problems of Optimal Allocation of Resources," Operations Re-
search, Vol. 11, pp. 399-417.
[FGW02] Forsgren, A., Gill, P. E., and Wright, M. H., 2002. "Interior
Methods for Nonlinear Optimization," SIAM Review, Vol. 44, pp. 525-597.
[FHT10] Friedman, J., Hastie, T., and Tibshirani, R., 2010. "Regulariza-
tion Paths for Generalized Linear Models via Coordinate Descent," J. of
Statistical Software, Vol. 33, pp. 1-22.
[FJS98] Facchinei, F., Judice, J., and Soares, J., 1998. "An Active Set New-
ton Algorithm for Large-Scale Nonlinear Programs with Box Constraints,"
SIAM J. on Optimization, Vol. 8, pp. 158-186.
[FLP02] Facchinei, F., Lucidi, S., and Palagi, L., 2002. "A Truncated New-
ton Algorithm for Large Scale Box Constrained Optimization," SIAM J.
on Optimization, Vol. 12, pp. 1100-1125.
[FLT02] Fukushima, M., Luo, Z.-Q., and Tseng, P., 2002. "Smoothing
Functions for Second-Order-Cone Complementarity Problems," SIAM J.
Optimization, Vol. 12, pp. 436-460.
[FaP03] Facchinei, F., and Pang, J.-S., 2003. Finite-Dimensional Varia-
tional Inequalities and Complementarity Problems, Springer Verlag, NY.
[Fab73] Fabian, V., 1973. "Asymptotically Efficient Stochastic Approxima-
tion: The RM Case," Ann. Statist., Vol. 1, pp. 486-495.
[FeM91] Ferris, M. C., and Mangasarian, O. L., 1991. "Finite Perturbation
of Convex Programs," Appl. Math. Optim., Vol. 23, pp. 263-273.
[FeM02] Ferris, M. C., and Munson, T. S., 2002. "Interior-Point Methods
for Massive Support Vector Machines," SIAM J. on Optimization, Vol. 13,
pp. 783-804.
[FeR12] Fercoq, O., and Richtarik, P., 2012. "Accelerated, Parallel, and
Proximal Coordinate Descent," arXiv preprint arXiv:1312.5799.
[Fen51] Fenchel, W., 1951. Convex Cones, Sets, and Functions, Mimeogra-
phed Notes, Princeton Univ.
[FlH95] Florian, M. S., and Hearn, D., 1995. "Network Equilibrium Models
and Algorithms," Handbooks in OR and MS, Ball, M. O., Magnanti, T.
L., Monma, C. L., and Nemhauser, G. L., (Eds.), Vol. 8, North-Holland,
Amsterdam, pp. 485-550.
[FiM68] Fiacco, A. V., and McCormick, G. P., 1968. Nonlinear Program-
ming: Sequential Unconstrained Minimization Techniques, Wiley, NY.
[FiN03] Figueiredo, M.A. T., and Nowak, R. D., 2003. "An EM Algorithm
for Wavelet-Based Image Restoration," IEEE Trans. Image Processing, Vol.
12, pp. 906-916.
[Fle00] Fletcher, R., 2000. Practical Methods of Optimization, 2nd edition,
Wiley, NY.
[FoG83] Fortin, M., and Glowinski, R., 1983. "On Decomposition-Coordina-
tion Methods Using an Augmented Lagrangian," in: M. Fortin and R.
Glowinski, eds., Augmented Lagrangian Methods: Applications to the So-
lution of Boundary-Value Problems, North-Holland, Amsterdam.
[FrG13] Friedlander, M. P., and Goh, G., 2013. "Tail Bounds for Stochastic
Approximation," arXiv preprint arXiv:1304.5586.
[FrG14] Freund, R. M., and Grigas, P., 2014. "New Analysis and Results
for the Frank-Wolfe Method," arXiv preprint arXiv:1307.0873, to appear
in Math. Programming.
[FrS00] Frommer, A., and Szyld, D. B., 2000. "On Asynchronous Itera-
tions," J. of Computational and Applied Mathematics, Vol. 123, pp. 201-
216.
[FrS12] Friedlander, M. P., and Schmidt, M., 2012. "Hybrid Deterministic-
Stochastic Methods for Data Fitting," SIAM J. Sci. Comput., Vol. 34, pp.
A1380-A1405.
[FrT07] Friedlander, M. P., and Tseng, P., 2007. "Exact Regularization of
Convex Programs," SIAM J. on Optimization, Vol. 18, pp. 1326-1350.
[FrW56] Frank, M., and Wolfe, P., 1956. "An Algorithm for Quadratic
Programming," Naval Research Logistics Quarterly, Vol. 3, pp. 95-110.
[Fra02] Frangioni, A., 2002. "Generalized Bundle Methods," SIAM J. on
Optimization, Vol. 13, pp. 117-156.
[Fri56] Frisch, M. R., 1956. "La Resolution des Problemes de Programme
Lineaire par la Methode du Potential Logarithmique," Cahiers du Semi-
naire D'Econometrie, Vol. 4, pp. 7-20.
[FuM81] Fukushima, M., and Mine, H., 1981. "A Generalized Proximal
Point Algorithm for Certain Non-Convex Minimization Problems," Inter-
nat. J. Systems Sci., Vol. 12, pp. 989-1000.
[Fuk92] Fukushima, M., 1992. "Application of the Alternating Direction
Method of Multipliers to Separable Convex Programming Problems," Com-
putational Optimization and Applications, Vol. 1, pp. 93-111.
[GBY12] Grant, M., Boyd, S., and Ye, Y., 2012. "CVX: Matlab Software
for Disciplined Convex Programming, Version 2.0 Beta," Recent Advances
in Learning and Control, cvx.com.
[GGM06] Gaudioso, M., Giallombardo, G., and Miglionico, G., 2006. "An
Incremental Method for Solving Convex Finite Min-Max Problems," Math.
of Operations Research, Vol. 31, pp. 173-187.
[GHV92] Goffin, J. L., Haurie, A., and Vial, J. P., 1992. "Decomposition
and Nondifferentiable Optimization with the Projective Algorithm," Man-
agement Science, Vol. 38, pp. 284-302.
[GKX10] Gupta, M. D., Kumar, S., and Xiao, J., 2010. "L1 Projections
with Box Constraints," arXiv preprint arXiv:1010.0141.
[GLL86] Grippo, L., Lampariello, F., and Lucidi, S., 1986. "A Nonmono-
tone Line Search Technique for Newton's Method," SIAM J. on Numerical
Analysis, Vol. 23, pp. 707-716.
[GLY94] Goffin, J. L., Luo, Z.-Q., and Ye, Y., 1994. "On the Complexity
of a Column Generation Algorithm for Convex or Quasiconvex Feasibility
Problems," in Large Scale Optimization: State of the Art, Hager, W. W.,
Hearn, D. W., and Pardalos, P. M., (Eds.), Kluwer, Boston.
[GLY96] Goffin, J. L., Luo, Z.-Q., and Ye, Y., 1996. "Complexity Analysis
of an Interior Cutting Plane Method for Convex Feasibility Problems,"
SIAM J. on Optimization, Vol. 6, pp. 638-652.
[GMW81] Gill, P. E., Murray, W., and Wright, M. H., 1981. Practical
Optimization, Academic Press, NY.
[GNS08] Griva, I., Nash, S. G., and Sofer, A., 2008. Linear and Nonlinear
Optimization, 2nd Edition, SIAM, Philadelphia, PA.
[GOP14] Gurbuzbalaban, M., Ozdaglar, A., and Parrilo, P., 2014. "A Glob-
ally Convergent Incremental Newton Method," arXiv preprint arXiv:1410.5284.
[GPR67] Gubin, L. G., Polyak, B. T., and Raik, E. V., 1967. "The Method
of Projection for Finding the Common Point in Convex Sets," USSR Com-
put. Math. Phys., Vol. 7, pp. 1-24.
[GSW12] Gamarnik, D., Shah, D., and Wei, Y., 2012. "Belief Propagation
for Min-Cost Network Flow: Convergence and Correctness," Operations
Research, Vol. 60, pp. 410-428.
[GaB84] Gafni, E. M., and Bertsekas, D. P., 1984. "Two-Metric Projection
Methods for Constrained Optimization," SIAM J. on Control and Opti-
mization, Vol. 22, pp. 936-964.
[GaM76] Gabay, D., and Mercier, B., 1976. "A Dual Algorithm for the
Solution of Nonlinear Variational Problems via Finite-Element Approxi-
mations," Comp. Math. Appl., Vol. 2, pp. 17-40.
[Gab79] Gabay, D., 1979. Methodes Numeriques pour l'Optimization Non
Lineaire, These de Doctorat d'Etat et Sciences Mathematiques, Univ. Pierre
et Marie Curie (Paris VI).
[Gab83] Gabay, D., 1983. "Applications of the Method of Multipliers to
Variational Inequalities," in M. Fortin and R. Glowinski, eds., Augmented
Lagrangian Methods: Applications to the Solution of Boundary-Value Prob-
lems, North-Holland, Amsterdam.
[GeM05] Gerlach, S., and Matzenmiller, A., 2005. "Comparison of Numeri-
cal Methods for Identification of Viscoelastic Line Spectra from Static Test
Data," International J. for Numerical Methods in Engineering, Vol. 63, pp.
428-454.
[Geo72] Geoffrion, A. M., 1972. "Generalized Benders Decomposition," J.
of Optimization Theory and Applications, Vol. 10, pp. 237-260.
[Geo77] Geoffrion, A. M., 1977. "Objective Function Approximations in
Mathematical Programming," Math. Programming, Vol. 13, pp. 23-27.
[GiB14] Giselsson, P., and Boyd, S., 2014. "Metric Selection in Douglas-
Rachford Splitting and ADMM," arXiv preprint arXiv:1410.8479.
[GiM74] Gill, P. E., and Murray, W., (Eds.), 1974. Numerical Methods for
Constrained Optimization, Academic Press, NY.
[GlM75] Glowinski, R., and Marrocco, A., 1975. "Sur l'Approximation
par Elements Finis d'Ordre un et la Resolution par Penalisation-Dualite
d'une Classe de Problemes de Dirichlet Non Lineaires," Revue Francaise
d'Automatique Informatique Recherche Operationnelle, Analyse Numerique,
R-2, pp. 41-76.
[GoK13] Gonzaga, C. C., and Karas, E. W., 2013. "Fine Tuning Nes-
terov's Steepest Descent Algorithm for Differentiable Convex Program-
ming," Math. Programming, Vol. 138, pp. 141-166.
[GoO09] Goldstein, T., and Osher, S., 2009. "The Split Bregman Method
for L1-Regularized Problems," SIAM J. on Imaging Sciences, Vol. 2, pp.
323-343.
[GoS10] Goldstein, T., and Setzer, S., 2010. "High-Order Methods for Basis
Pursuit," UCLA CAM Report, 10-41.
[GoV90] Goffin, J. L., and Vial, J. P., 1990. "Cutting Planes and Column
Generation Techniques with the Projective Algorithm," J. Opt. Th. and
Appl., Vol. 65, pp. 409-429.
[GoV02] Goffin, J. L., and Vial, J. P., 2002. "Convex Nondifferentiable
Optimization: A Survey Focussed on the Analytic Center Cutting Plane
Method," Optimization Methods and Software, Vol. 17, pp. 805-867.
[GoZ12] Gong, P., and Zhang, C., 2012. "Efficient Nonnegative Matrix
Factorization via Projected Newton Method," Pattern Recognition, Vol.
45, pp. 3557-3565.
[Gol64] Goldstein, A. A., 1964. "Convex Programming in Hilbert Space,"
Bull. Amer. Math. Soc., Vol. 70, pp. 709-710.
[Gol85] Golshtein, E. G., 1985. "A Decomposition Method for Linear and
Convex Programming Problems," Matecon, Vol. 21, pp. 1077-1091.
[Gon00] Gonzaga, C. C., 2000. "Two Facts on the Convergence of the
Cauchy Algorithm," J. of Optimization Theory and Applications, Vol. 107,
pp. 591-600.
[GrS99] Grippo, L., and Sciandrone, M., 1999. "Globally Convergent Block-
Coordinate Techniques for Unconstrained Optimization," Optimization Me-
thods and Software, Vol. 10, pp. 587-637.
[GrS00] Grippo, L., and Sciandrone, M., 2000. "On the Convergence of the
Block Nonlinear Gauss-Seidel Method Under Convex Constraints," Oper-
ations Research Letters, Vol. 26, pp. 127-136.
[Gri94] Grippo, L., 1994. "A Class of Unconstrained Minimization Methods
for Neural Network Training," Optim. Methods and Software, Vol. 4, pp.
135-150.
[Gri00] Grippo, L., 2000. "Convergent On-Line Algorithms for Supervised
Learning in Neural Networks," IEEE Trans. Neural Networks, Vol. 11, pp.
1284-1299.
[Gul92] Guler, O., 1992. "New Proximal Point Algorithms for Convex Min-
imization," SIAM J. on Optimization, Vol. 2, pp. 649-664.
[HCW14] Hong, M., Chang, T. H., Wang, X., Razaviyayn, M., Ma, S.,
and Luo, Z. Q., 2014. "A Block Successive Upper Bound Minimization
Method of Multipliers for Linearly Constrained Convex Optimization,"
arXiv preprint arXiv:1401.7079.
[HJN14] Harchaoui, Z., Juditsky, A., and Nemirovski, A., 2014. "Condi-
tional Gradient Algorithms for Norm-Regularized Smooth Convex Opti-
mization," Math. Programming, pp. 1-38.
[HKR95] den Hertog, D., Kaliski, J., Roos, C., and Terlaky, T., 1995. "A
Path-Following Cutting Plane Method for Convex Programming," Annals
of Operations Research, Vol. 58, pp. 69-98.
[HLV87] Hearn, D. W., Lawphongpanich, S., and Ventura, J. A., 1987. "Re-
stricted Simplicial Decomposition: Computation and Extensions," Math.
Programming Studies, Vol. 31, pp. 119-136.
[HMT10] Halko, N., Martinsson, P.-G., and Tropp, J. A., 2010. "Find-
ing Structure with Randomness: Probabilistic Algorithms for Constructing
Approximate Matrix Decompositions," arXiv preprint arXiv:0909.4061.
[HTF09] Hastie, T., Tibshirani, R., and Friedman, J., 2009. The Elements
of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edi-
tion, Springer, NY. On line at http://statweb.stanford.edu/~tibs/ElemStatLearn/
[HYS15] Hu, Y., Yang, X., and Sim, C. K., 2015. "Inexact Subgradient
Methods for Quasi-Convex Optimization Problems," European Journal of
Operational Research, Vol. 240, pp. 315-327.
[HZS13] Hou, K., Zhou, Z., So, A. M. C., and Luo, Z. Q., 2013. "On
the Linear Convergence of the Proximal Gradient Method for Trace Norm
Regularization," in Advances in Neural Information Processing Systems
(NIPS 2013), pp. 710-718.
[Ha90] Ha, C. D., 1990. "A Generalization of the Proximal Point Algo-
rithm," SIAM J. on Control and Optimization, Vol. 28, pp. 503-512.
[HaB70] Haarhoff, P. C., and Buys, J. D., 1970. "A New Method for the
Optimization of a Nonlinear Function Subject to Nonlinear Constraints,"
Computer J., Vol. 13, pp. 178-184.
[HaH93] Hager, W. W., and Hearn, D. W., 1993. "Application of the Dual
Active Set Algorithm to Quadratic Network Optimization," Computational
Optimization and Applications, Vol. 1, pp. 349-373.
[HaM79] Han, S. P., and Mangasarian, O. L., 1979. "Exact Penalty Func-
tions in Nonlinear Programming," Math. Programming, Vol. 17, pp. 251-
269.
[Hay08] Haykin, S., 2008. Neural Networks and Learning Machines, (3rd
Ed.), Prentice Hall, Englewood Cliffs, NJ.
[HeD09] Helou, E. S., and De Pierro, A. R., 2009. "Incremental Subgradi-
ents for Constrained Convex Optimization: A Unified Framework and New
Methods," SIAM J. on Optimization, Vol. 20, pp. 1547-1572.
[HeL89] Hearn, D. W., and Lawphongpanich, S., 1989. "Lagrangian Dual
Ascent by Generalized Linear Programming," Operations Res. Letters, Vol.
8, pp. 189-196.
[HeM11] Henrion, D., and Malick, J., 2011. "Projection Methods for Conic
Feasibility Problems: Applications to Polynomial Sum-of-Squares Decom-
positions," Optimization Methods and Software, Vol. 26, pp. 23-46.
[HeM12] Henrion, D., and Malick, J., 2012. "Projection Methods in Conic
Optimization," In Handbook on Semidefinite, Conic and Polynomial Opti-
mization, Springer, NY, pp. 565-600.
[Her09] Herman, G. T., 2009. Fundamentals of Computerized Tomography: Im-
age Reconstruction from Projections, (2nd ed.), Springer, NY.
[Hes69] Hestenes, M. R., 1969. "Multiplier and Gradient Methods," J. Opt.
Th. and Appl., Vol. 4, pp. 303-320.
[Hes75] Hestenes, M. R., 1975. Optimization Theory: The Finite Dimen-
sional Case, Wiley, NY.
[HiL93] Hiriart-Urruty, J.-B., and Lemarechal, C., 1993. Convex Analysis
and Minimization Algorithms, Vols. I and II, Springer-Verlag, Berlin and
NY.
[Hil57] Hildreth, C., 1957. "A Quadratic Programming Procedure," Naval
Res. Logist. Quart., Vol. 4, pp. 79-85. See also "Erratum," Naval Res.
Logist. Quart., Vol. 4, p. 361.
[HoK71] Hoffman, K., and Kunze, R., 1971. Linear Algebra, Pearson, En-
glewood Cliffs, NJ.
[HoL13] Hong, M., and Luo, Z. Q., 2013. "On the Linear Convergence of the
Alternating Direction Method of Multipliers," arXiv preprint arXiv:1208.3922.
[Hoh77] Hohenbalken, B. von, 1977. "Simplicial Decomposition in Nonlin-
ear Programming," Math. Programming, Vol. 13, pp. 49-68.
[Hol74] Holloway, C. A., 1974. "An Extension of the Frank and Wolfe
Method of Feasible Directions," Math. Programming, Vol. 6, pp. 14-27.
[IPS03] Iusem, A. N., Pennanen, T., and Svaiter, B. F., 2003. "Inexact
Variants of the Proximal Point Algorithm Without Monotonicity," SIAM
J. on Optimization, Vol. 13, pp. 1080-1097.
[IST94] Iusem, A. N., Svaiter, B. F., and Teboulle, M., 1994. "Entropy-
Like Proximal Methods in Convex Programming," Math. of Operations
Research, Vol. 19, pp. 790-814.
[IbF96] Ibaraki, S., and Fukushima, M., 1996. "Partial Proximal Method of
Multipliers for Convex Programming Problems," J. of Operations Research
Society of Japan, Vol. 39, pp. 213-229.
[IuT95] Iusem, A. N., and Teboulle, M., 1995. "Convergence Rate Analysis
of Nonquadratic Proximal Methods for Convex and Linear Programming,"
Math. of Operations Research, Vol. 20, pp. 657-677.
[Ius99] Iusem, A. N., 1999. "Augmented Lagrangian Methods and Proximal
Point Methods for Convex Minimization," Investigacion Operativa, Vol. 8,
pp. 11-49.
[Ius03] Iusem, A. N., 2003. "On the Convergence Properties of the Pro-
jected Gradient Method for Convex Optimization," Computational and
Applied Mathematics, Vol. 22, pp. 37-52.
[JFY09] Joachims, T., Finley, T., and Yu, C.-N. J., 2009. "Cutting-Plane
Training of Structural SVMs," Machine Learning, Vol. 77, pp. 27-59.
[JRJ09] Johansson, B., Rabi, M., and Johansson, M., 2009. "A Random-
ized Incremental Subgradient Method for Distributed Optimization in Net-
worked Systems," SIAM J. on Optimization, Vol. 20, pp. 1157-1170.
[Jag13] Jaggi, M., 2013. "Revisiting Frank-Wolfe: Projection-Free Sparse
Convex Optimization," Proc. of ICML 2013.
[JiZ14] Jiang, B., and Zhang, S., 2014. "Iteration Bounds for Finding the E-
Stationary Points for Structured Nonconvex Optimization," arXiv preprint
arXiv:1410.4066.
[JoY09] Joachims, T., and Yu, C.-N. J., 2009. "Sparse Kernel SVMs via
Cutting-Plane Training," Machine Learning, Vol. 76, pp. 179-193.
[JoZ13] Johnson, R., and Zhang, T., 2013. "Accelerating Stochastic Gra-
dient Descent Using Predictive Variance Reduction," Advances in Neural
Information Processing Systems 26 (NIPS 2013).
[Joa06] Joachims, T., 2006. "Training Linear SVMs in Linear Time," Inter-
national Conference on Knowledge Discovery and Data Mining, pp. 217-
226.
[JuN11a] Juditsky, A., and Nemirovski, A., 2011. "First Order Methods for
Nonsmooth Convex Large-Scale Optimization, I: General Purpose Meth-
ods," in Optimization for Machine Learning, by Sra, S., Nowozin, S., and
Wright, S. J. (eds.), MIT Press, Cambridge, MA, pp. 121-148.
[JuN11b] Juditsky, A., and Nemirovski, A., 2011. "First Order Methods
for Nonsmooth Convex Large-Scale Optimization, II: Utilizing Problem's
Structure," in Optimization for Machine Learning, by Sra, S., Nowozin, S.,
and Wright, S. J. (eds.), MIT Press, Cambridge, MA, pp. 149-183.
[KaW94] Kall, P., and Wallace, S. W., 1994. Stochastic Programming,
Wiley, Chichester, UK.
[Kac37] Kaczmarz, S., 1937. "Approximate Solution of Systems of Linear
Equations," Bull. Acad. Pol. Sci., Lett. A 35, pp. 335-357 (in German);
English transl.: Int. J. Control, Vol. 57, pp. 1269-1271, 1993.
[Kar84] Karmarkar, N., 1984. "A New Polynomial-Time Algorithm for Lin-
ear Programming," In Proc. of the 16th Annual ACM Symp. on Theory of
Computing, pp. 302-311.
[Kel60] Kelley, J.E., 1960. "The Cutting-Plane Method for Solving Convex
Programs," J. Soc. Indust. Appl. Math., Vol. 8, pp. 703-712.
[Kel99] Kelley, C. T., 1999. Iterative Methods for Optimization, SIAM,
Philadelphia, PA.
[Kib80] Kibardin, V. M., 1980. "Decomposition into Functions in the Min-
imization Problem," Automation and Remote Control, Vol. 40, pp. 1311-
1323.
[Kiw04] Kiwiel, K. C., 2004. "Convergence of Approximate and Incremental
Subgradient Methods for Convex Optimization," SIAM J. on Optimization,
Vol. 14, pp. 807-840.
[KoB72] Kort, B. W., and Bertsekas, D. P., 1972. "A New Penalty Function
Method for Constrained Minimization," Proc. 1972 IEEE Confer. Decision
Control, New Orleans, LA, pp. 162-166.
[KoB76] Kort, B. W., and Bertsekas, D. P., 1976. "Combined Primal-Dual
and Penalty Methods for Convex Programming," SIAM J. on Control and
Optimization, Vol. 14, pp. 268-294.
[KoN93] Kortanek, K. O., and No, H., 1993. "A Central Cutting Plane
Algorithm for Convex Semi-Infinite Programming Problems," SIAM J. on
Optimization, Vol. 3, pp. 901-918.
[Kor75] Kort, B. W., 1975. "Combined Primal-Dual and Penalty Function
Algorithms for Nonlinear Programming," Ph.D. Thesis, Dept. of Enginee-
ring-Economic Systems, Stanford Univ., Stanford, Ca.
[Kra55] Krasnosel'skii, M. A., 1955. "Two Remarks on the Method of Suc-
cessive Approximations," Uspehi Mat. Nauk, Vol. 10, pp. 123-127.
[KuC78] Kushner, H. J., and Clark, D. S., 1978. Stochastic Approxima-
tion Methods for Constrained and Unconstrained Systems, Springer-Verlag,
NY.
[KuY03] Kushner, H. J., and Yin, G., 2003. Stochastic Approximation and
Recursive Algorithms and Applications, Springer-Verlag, NY.
[LBB98] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P., 1998. "Gradient-
Based Learning Applied to Document Recognition," Proceedings of the
IEEE, Vol. 86, pp. 2278-2324.
[LJS12] Lacoste-Julien, S., Jaggi, M., Schmidt, M., and Pletscher, P., 2012.
"Block-Coordinate Frank-Wolfe Optimization for Structural SVMs," arXiv
preprint arXiv:1207.4747.
[LLS12] Lee, J., Sun, Y., and Saunders, M., 2012. "Proximal Newton-Type
Methods for Convex Optimization," NIPS 2012.
[LLS14] Lee, J., Sun, Y., and Saunders, M., 2014. "Proximal Newton-Type
Methods for Minimizing Composite Functions," arXiv preprint arXiv:1206.1623.
[LLX14] Lin, Q., Lu, Z., and Xiao, L., 2014. "An Accelerated Proximal
Coordinate Gradient Method and its Application to Regularized Empirical
Risk Minimization," arXiv preprint arXiv:1407.1296.
[LLZ09] Langford, J., Li, L., and Zhang, T., 2009. "Sparse Online Learning
via Truncated Gradient," In Advances in Neural Information Processing
Systems (NIPS 2009), pp. 905-912.
[LMS92] Lustig, I. J., Marsten, R. E., and Shanno, D. F., 1992. "On Imple-
menting Mehrotra's Predictor-Corrector Interior-Point Method for Linear
Programming," SIAM J. on Optimization, Vol. 2, pp. 435-449.
[LMY12] Lu, Z., Monteiro, R. D. C., and Yuan, M., 2012. "Convex Op-
timization Methods for Dimension Reduction and Coefficient Estimation
in Multivariate Linear Regression," Mathematical Programming, Vol. 131,
pp. 163-194.
[LPS98] Larsson, T., Patriksson, M., and Stromberg, A.-B., 1998. "Er-
godic Convergence in Subgradient Optimization," Optimization Methods
and Software, Vol. 9, pp. 93-120.
[LRW98] Lagarias, J. C., Reeds, J. A., Wright, M. H., and Wright, P. E.,
1998. "Convergence Properties of the Nelder-Mead Simplex Method in Low
Dimensions," SIAM J. on Optimization, Vol. 9, pp. 112-147.
[LVB98] Lobo, M. S., Vandenberghe, L., Boyd, S., and Lebret, H., 1998.
"Applications of Second-Order Cone Programming," Linear Algebra and
Applications, Vol. 284, pp. 193-228.
[LWS14] Liu, J., Wright, S. J., and Sridhar, S., 2014. "An Asynchronous
Parallel Randomized Kaczmarz Algorithm," Univ. of Wisconsin Report,
arXiv preprint arXiv:1401.4780.
[LaD60] Land, A. H., and Doig, A. G., 1960. "An Automatic Method for
Solving Discrete Programming Problems," Econometrica, Vol. 28, pp. 497-
520.
[LaS87] Lawrence, J., and Spingarn, J. E., 1987. "On Fixed Points of Non-
expansive Piecewise Isometric Mappings," Proc. London Math. Soc., Vol.
55, pp. 605-624.
[LaT85] Lancaster, P., and Tismenetsky, M., 1985. The Theory of Matrices,
Academic Press, NY.
[Lan14] Landi, G., 2014. "A Modified Newton Projection Method for ℓ1-
Regularized Least Squares Image Deblurring," J. of Mathematical Imaging
and Vision, pp. 1-14.
[Las70] Lasdon, L. S., 1970. Optimization Theory for Large Systems, Macmil-
lan, NY; republished by Dover Publications, 2002.
[LeL10] Leventhal, D., and Lewis, A. S., 2010. "Randomized Methods for
Linear Constraints: Convergence Rates and Conditioning," Math. of Op-
erations Research, Vol. 35, pp. 641-654.
[LeP65] Levitin, E. S., and Poljak, B. T., 1965. "Constrained Minimization
Methods," Z. Vycisl. Mat. i Mat. Fiz., Vol. 6, pp. 787-823.
[LeS93] Lemarechal, C., and Sagastizabal, C., 1993. "An Approach to Vari-
able Metric Bundle Methods," in Systems Modelling and Optimization,
Proc. of the 16th IFIP-TC7 Conference, Compiegne, Henry, J., and Yvon,
J.-P., (Eds.), Lecture Notes in Control and Information Sciences 197, pp.
144-162.
[LeS99] Lee, D., and Seung, H., 1999. "Learning the Parts of Objects by
Non-Negative Matrix Factorization," Nature, Vol. 401, pp. 788-791.
[LeS13] Lee, Y. T., and Sidford, A., 2013. "Efficient Accelerated Coordinate
Descent Methods and Faster Algorithms for Solving Linear Systems," Proc.
2013 IEEE 54th Annual Symposium on Foundations of Computer Science
(FOCS), pp. 147-156.
[LeW11] Lee, S., and Wright, S. J., 2011. "Approximate Stochastic Sub-
gradient Estimation Training for Support Vector Machines," Univ. of Wis-
consin Report, arXiv preprint arXiv:1111.0432.
[Lem74] Lemarechal, C., 1974. "An Algorithm for Minimizing Convex Func-
tions," in Information Processing '74, Rosenfeld, J. L., (Ed.), North-Holland,
Amsterdam, pp. 552-556.
[Lem75] Lemarechal, C., 1975. "An Extension of Davidon Methods to Non-
differentiable Problems," Math. Programming Study 3, Balinski, M., and
Wolfe, P., (Eds.), North-Holland, Amsterdam, pp. 95-109.
[Lem89] Lemaire, B., 1989. "The Proximal Algorithm," in New Methods
in Optimization and Their Industrial Uses, J.-P. Penot, (ed.), Birkhauser,
Basel, pp. 73-87.
[LiM79] Lions, P. L., and Mercier, B., 1979. "Splitting Algorithms for the
Sum of Two Nonlinear Operators," SIAM J. on Numerical Analysis, Vol.
16, pp. 964-979.
[LiP87] Lin, Y. Y., and Pang, J.-S., 1987. "Iterative Methods for Large
Convex Quadratic Programs: A Survey," SIAM J. on Control and Opti-
mization, Vol. 18, pp. 383-411.
[LiW14] Liu, J., and Wright, S. J., 2014. "Asynchronous Stochastic Coordi-
nate Descent: Parallelism and Convergence Properties," Univ. of Wisconsin
Report, arXiv preprint arXiv:1403.3862.
[Lin07] Lin, C. J., 2007. "Projected Gradient Methods for Nonnegative
Matrix Factorization," Neural Computation, Vol. 19, pp. 2756-2779.
[Lit66] Litvakov, B. M., 1966. "On an Iteration Method in the Problem of
Approximating a Function from a Finite Number of Observations," Avtom.
Telemech., No. 4, pp. 104-113.
[Lju77] Ljung, L., 1977. "Analysis of Recursive Stochastic Algorithms,"
IEEE Trans. on Automatic Control, Vol. 22, pp. 551-575.
[LuT91] Luo, Z. Q., and Tseng, P., 1991. "On the Convergence of a Matrix-
Splitting Algorithm for the Symmetric Monotone Linear Complementarity
Problem," SIAM J. on Control and Optimization, Vol. 29, pp. 1037-1060.
[LuT92] Luo, Z. Q., and Tseng, P., 1992. "On the Convergence of the
Coordinate Descent Method for Convex Differentiable Minimization," J.
Optim. Theory Appl., Vol. 72, pp. 7-35.
[LuT93a] Luo, Z. Q., and Tseng, P., 1993. "On the Convergence Rate
of Dual Ascent Methods for Linearly Constrained Convex Minimization,"
Math. of Operations Research, Vol. 18, pp. 846-867.
[LuT93b] Luo, Z. Q., and Tseng, P., 1993. "Error Bound and Reduced-
Gradient Projection Algorithms for Convex Minimization over a Polyhedral
Set," SIAM J. on Optimization, Vol. 3, pp. 43-59.
[LuT93c] Luo, Z. Q., and Tseng, P., 1993. "Error Bounds and Convergence
Analysis of Feasible Descent Methods: A General Approach," Annals of
Operations Research, Vol. 46, pp. 157-178.
[LuT94a] Luo, Z. Q., and Tseng, P., 1994. "Analysis of an Approximate
Gradient Projection Method with Applications to the Backpropagation Al-
gorithm," Optimization Methods and Software, Vol. 4, pp. 85-101.
[LuT94b] Luo, Z. Q., and Tseng, P., 1994. "On the Rate of Convergence of a
Distributed Asynchronous Routing Algorithm," IEEE Trans. on Automatic
Control, Vol. 39, pp. 1123-1129.
[LuT13] Luss, R., and Teboulle, M., 2013. "Conditional Gradient Algo-
rithms for Rank-One Matrix Approximations with a Sparsity Constraint,"
SIAM Review, Vol. 55, pp. 65-98.
[LuY08] Luenberger, D. G., and Ye, Y., 2008. Linear and Nonlinear Pro-
gramming, 3rd Edition, Springer, NY.
[Lue84] Luenberger, D. G., 1984. Introduction to Linear and Nonlinear
Programming, 2nd Edition, Addison-Wesley, Reading, MA.
[Luo91] Luo, Z. Q., 1991. "On the Convergence of the LMS Algorithm
with Adaptive Learning Rate for Linear Feedforward Networks," Neural
Computation, Vol. 3, pp. 226-245.
[Luq84] Luque, F. J., 1984. "Asymptotic Convergence Analysis of the Prox-
imal Point Algorithm," SIAM J. on Control and Optimization, Vol. 22, pp.
277-293.
[MRS10] Mosk-Aoyama, D., Roughgarden, T., and Shah, D., 2010. "Fully
Distributed Algorithms for Convex Optimization Problems," SIAM J. on
Optimization, Vol. 20, pp. 3260-3279.
[MSQ98] Mifflin, R., Sun, D., and Qi, L., 1998. "Quasi-Newton Bundle-
Type Methods for Nondifferentiable Convex Optimization," SIAM J. on
Optimization, Vol. 8, pp. 583-603.
[MYF03] Moriyama, H., Yamashita, N., and Fukushima, M., 2003. "The
Incremental Gauss-Newton Algorithm with Adaptive Stepsize Rule," Com-
putational Optimization and Applications, Vol. 26, pp. 107-141.
[MaM01] Mangasarian, O. L., and Musicant, D. R., 2001. "Lagrangian Support
Vector Machines," J. of Machine Learning Research, Vol. 1, pp. 161-177.
[MaS94] Mangasarian, O. L., and Solodov, M. V., 1994. "Serial and Paral-
lel Backpropagation Convergence Via Nonmonotone Perturbed Minimiza-
tion," Opt. Methods and Software, Vol. 4, pp. 103-116.
[Mai13] Mairal, J., 2013. "Optimization with First-Order Surrogate Func-
tions," arXiv preprint arXiv:1305.3120.
[Mai14] Mairal, J., 2014. "Incremental Majorization-Minimization Opti-
mization with Application to Large-Scale Machine Learning," arXiv preprint
arXiv:1402.4419.
[Man53] Mann, W. R., 1953. "Mean Value Methods in Iteration," Proc.
Amer. Math. Soc., Vol. 4, pp. 506-510.
[Mar70] Martinet, B., 1970. "Regularisation d' Inequations Variationelles
par Approximations Successives," Revue Fran. d'Automatique et Informa-
tique Rech. Operationelle, Vol. 4, pp. 154-159.
[Mar72] Martinet, B., 1972. "Determination Approche d'un Point Fixe
d'une Application Pseudo-Contractante. Cas de l'Application Prox," Com-
ptes Rendus de l'Academie des Sciences, Paris, Serie A 274, pp. 163-165.
[Meh92] Mehrotra, S., 1992. "On the Implementation of a Primal-Dual
Interior Point Method," SIAM J. on Optimization, Vol. 2, pp. 575-601.
[Mey07] Meyn, S., 2007. Control Techniques for Complex Networks, Cam-
bridge Univ. Press, NY.
[MiF81] Mine, H., and Fukushima, M., 1981. "A Minimization Method for the
Sum of a Convex Function and a Continuously Differentiable Function," J.
of Optimization Theory and Applications, Vol. 33, pp. 9-23.
[Mif96] Mifflin, R., 1996. "A Quasi-Second-Order Proximal Bundle Algo-
rithm," Math. Programming, Vol. 73, pp. 51-72.
[Min62] Minty, G. J., 1962. "Monotone (Nonlinear) Operators in Hilbert
Space," Duke J. of Math., Vol. 29, pp. 341-346.
[Min64] Minty, G. J., 1964. "On the Monotonicity of the Gradient of a
Convex Function," Pacific J. of Math., Vol. 14, pp. 243-247.
[Min86] Minoux, M., 1986. Math. Programming: Theory and Algorithms,
Wiley, NY.
[MoT89] More, J. J., and Toraldo, G., 1989. "Algorithms for Bound Con-
strained Quadratic Programming Problems," Numer. Math., Vol. 55, pp.
377-400.
[NBB01] Nedic, A., Bertsekas, D. P., and Borkar, V., 2001. "Distributed
Asynchronous Incremental Subgradient Methods," Proc. of 2000 Haifa Wor-
kshop "Inherently Parallel Algorithms in Feasibility and Optimization and
Their Applications," by D. Butnariu, Y. Censor, and S. Reich, Eds., Else-
vier, Amsterdam.
[NJL09] Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A., 2009.
"Robust Stochastic Approximation Approach to Stochastic Programming,"
SIAM J. on Optimization, Vol. 19, pp. 1574-1609.
[NSW14] Needell, D., Srebro, N., and Ward, R., 2014. "Stochastic Gra-
dient Descent and the Randomized Kaczmarz Algorithm," arXiv preprint
arXiv:1310.5715v3.
[NaT02] Nazareth, L., and Tseng, P., 2002. "Gilding the Lily: A Variant
of the Nelder-Mead Algorithm Based on Golden-Section Search," Compu-
tational Optimization and Applications, Vol. 22, pp. 133-144.
[NaZ05] Narkiss, G., and Zibulevsky, M., 2005. "Sequential Subspace Opti-
mization Method for Large-Scale Unconstrained Problems," Technion-UT,
Department of Electrical Engineering.
[NeB00] Nedic, A., and Bertsekas, D. P., 2000. "Convergence Rate of In-
cremental Subgradient Algorithms," Stochastic Optimization: Algorithms
and Applications, S. Uryasev and P. M. Pardalos, Eds., Kluwer, pp. 263-
304.
[NeB01] Nedic, A., and Bertsekas, D. P., 2001. "Incremental Subgradient
Methods for Nondifferentiable Optimization," SIAM J. on Optimization,
Vol. 12, pp. 109-138.
[NeB10] Nedic, A., and Bertsekas, D. P., 2010. "The Effect of Deterministic
Noise in Subgradient Methods," Math. Programming, Ser. A, Vol. 125, pp.
75-99.
[NeC13] Necoara, I., and Clipici, D., 2013. "Distributed Coordinate Descent
Methods for Composite Minimization," arXiv preprint arXiv:1312.5302.
[NeN94] Nesterov, Y., and Nemirovskii, A., 1994. Interior Point Polynomial
Algorithms in Convex Programming, SIAM, Studies in Applied Mathemat-
ics 13, Philadelphia, PA.
[NeO09a] Nedic, A., and Ozdaglar, A., 2009. "Distributed Subgradient
Methods for Multi-Agent Optimization," IEEE Trans. on Aut. Control,
Vol. 54, pp. 48-61.
[NeO09b] Nedic, A., and Ozdaglar, A., 2009. "Subgradient Methods for
Saddle-Point Problems," J. of Optimization Theory and Applications, Vol.
142, pp. 205-228.
[NeW88] Nemhauser, G. L., and Wolsey, L. A., 1988. Integer and Combi-
natorial Optimization, Wiley, NY.
[NeY83] Nemirovsky, A., and Yudin, D. B., 1983. Problem Complexity and
Method Efficiency, Wiley, NY.
[Ned10] Nedic, A., 2010. "Random Projection Algorithms for Convex Set
Intersection Problems," Proc. 2010 IEEE Conference on Decision and Con-
trol, Atlanta, Georgia, pp. 7655-7660.
[Ned11] Nedic, A., 2011. "Random Algorithms for Convex Minimization
Problems," Math. Programming, Ser. B, Vol. 129, pp. 225-253.
[Nee10] Needell, D., 2010. "Randomized Kaczmarz Solver for Noisy Linear
Systems," BIT Numerical Mathematics, Vol. 50, pp. 395-403.
[Nes83] Nesterov, Y., 1983. "A Method for Unconstrained Convex Mini-
mization Problem with the Rate of Convergence O(1/k^2)," Doklady AN
SSSR, Vol. 269, pp. 543-547; translated as Soviet Math. Dokl.
[Nes95] Nesterov, Y., 1995. "Complexity Estimates of Some Cutting Plane
Methods Based on Analytic Barrier," Math. Programming, Vol. 69, pp.
149-176.
[Nes04] Nesterov, Y., 2004. Introductory Lectures on Convex Optimization,
Kluwer Academic Publisher, Dordrecht, The Netherlands.
[Nes05] Nesterov, Y., 2005. "Smooth Minimization of Nonsmooth Func-
tions," Math. Programming, Vol. 103, pp. 127-152.
[Nes12] Nesterov, Y., 2012. "Efficiency of Coordinate Descent Methods on
Huge-Scale Optimization Problems," SIAM J. on Optimization, Vol. 22,
pp. 341-362.
[Nev75] Neveu, J., 1975. Discrete Parameter Martingales, North-Holland,
Amsterdam, The Netherlands.
[NoW06] Nocedal, J., and Wright, S. J., 2006. Numerical Optimization,
2nd Edition, Springer, NY.
[Noc80] Nocedal, J., 1980. "Updating Quasi-Newton Matrices with Limited
Storage," Math. of Computation, Vol. 35, pp. 773-782.
[OBG05] Osher, S., Burger, M., Goldfarb, D., Xu, J., and Yin, W., 2005.
"An Iterative Regularization Method for Total Variation-Based Image Res-
toration," Multiscale Modeling and Simulation, Vol. 4, pp. 460-489.
[OJW05] Olafsson, A., Jeraj, R., and Wright, S. J., 2005. "Optimization
of Intensity-Modulated Radiation Therapy with Biological Objectives,"
Physics in Medicine and Biology, Vol. 50, pp. 53-57.
[OMV00] Ouorou, A., Mahey, P., and Vial, J. P., 2000. "A Survey of Algo-
rithms for Convex Multicommodity Flow Problems," Management Science,
Vol. 46, pp. 126-147.
[OrR70] Ortega, J. M., and Rheinboldt, W. C., 1970. Iterative Solution of
Nonlinear Equations in Several Variables, Academic Press, NY.
[OvG14] Ovcharova, N., and Gwinner, J., 2014. "A Study of Regularization
Techniques of Nondifferentiable Optimization in View of Application to
Hemivariational Inequalities," J. of Optimization Theory and Applications,
Vol. 162, pp. 754-778.
[OzB03] Ozdaglar, A. E., and Bertsekas, D. P., 2003. "Routing and Wave-
length Assignment in Optical Networks," IEEE Trans. on Networking, Vol.
11, pp. 259-272.
[PKP09] Predd, J.B., Kulkarni, S. R., and Poor, H. V., 2009. "A Collab-
orative Training Algorithm for Distributed Learning," IEEE Transactions
on Information Theory, Vol. 55, pp. 1856-1871.
[PaE10] Palomar, D. P., and Eldar, Y. C., (Eds.), 2010. Convex Optimiza-
tion in Signal Processing and Communications, Cambridge Univ. Press,
NY.
[PaT94] Paatero, P., and Tapper, U., 1994. "Positive Matrix Factorization:
A Non-Negative Factor Model with Optimal Utilization of Error Estimates
of Data Values," Environmetrics, Vol. 5, pp. 111-126.
[PaY84] Pang, J. S., and Yu, C. S., 1984. "Linearized Simplicial Decomposition
Methods for Computing Traffic Equilibria on Networks," Networks, Vol.
14, pp. 427-438.
[Paa97] Paatero, P., 1997. "Least Squares Formulation of Robust Non-
Negative Factor Analysis," Chemometrics and Intell. Laboratory Syst., Vol.
37, pp. 23-35.
[Pan84] Pang, J. S., 1984. "On the Convergence of Dual Ascent Methods for
Large-Scale Linearly Constrained Optimization Problems," Unpublished
manuscript, The Univ. of Texas at Dallas.
[Pap81] Papavassilopoulos, G., 1981. "Algorithms for a Class of Nondiffer-
entiable Problems," J. of Optimization Theory and Applications, Vol. 34,
pp. 41-82.
[Pas79] Passty, G. B., 1979. "Ergodic Convergence to a Zero of the Sum of
Monotone Operators in Hilbert Space," J. Math. Anal. Appl., Vol. 72, pp.
383-390.
[Pat93] Patriksson, M., 1993. "Partial Linearization Methods in Nonlinear
Programming," J. of Optimization Theory and Applications, Vol. 78, pp.
227-246.
[Pat98] Patriksson, M., 1998. "Cost Approximation: A Unified Framework
of Descent Algorithms for Nonlinear Programs," SIAM J. Optimization,
Vol. 8, pp. 561-582.
[Pat99] Patriksson, M., 1999. Nonlinear Programming and Variational In-
equality Problems: A Unified Approach, Springer, NY.
[Pat01] Patriksson, M., 2001. "Simplicial Decomposition Algorithms," En-
cyclopedia of Optimization, Springer, pp. 2378-2386.
[Pat04] Patriksson, M., 2004. "Algorithms for Computing Traffic Equilib-
ria," Networks and Spatial Economics, Vol. 4, pp. 23-38.
[Pen02] Pennanen, T., 2002. "Local Convergence of the Proximal Point
Algorithm and Multiplier Methods Without Monotonicity," Math. of Op-
erations Research, Vol. 27, pp. 170-191.
[Pfl96] Pflug, G., 1996. Optimization of Stochastic Models. The Interface
Between Simulation and Optimization, Kluwer, Boston.
[PiZ94] Pinar, M., and Zenios, S., 1994. "On Smoothing Exact Penalty
Functions for Convex Constrained Optimization," SIAM J. on Optimiza-
tion, Vol. 4, pp. 486-511.
[PoJ92] Poljak, B. T., and Juditsky, A. B., 1992. "Acceleration of Stochastic
Approximation by Averaging," SIAM J. on Control and Optimization, Vol.
30, pp. 838-855.
[PoT73] Poljak, B. T., and Tsypkin, Y. Z., 1973. "Pseudogradient Adapta-
tion and Training Algorithms," Automation and Remote Control, Vol. 12,
pp. 83-94.
[PoT74] Poljak, B. T., and Tretjakov, N. V., 1974. "An Iterative Method
for Linear Programming and its Economic Interpretation," Matecon, Vol.
10, pp. 81-100.
[PoT80] Poljak, B. T., and Tsypkin, Y. Z., 1980. "Adaptive Estimation Al-
gorithms (Convergence, Optimality, Stability)," Automation and Remote
Control, Vol. 40, pp. 378-389.
[PoT81] Poljak, B. T., and Tsypkin, Y. Z., 1981. "Optimal Pseudogradient
Adaptation Algorithms," Automation and Remote Control, Vol. 41, pp.
1101-1110.
[PoT97] Polyak, R., and Teboulle, M., 1997. "Nonlinear Rescaling and
Proximal-Like Methods in Convex Optimization," Math. Programming,
Vol. 76, pp. 265-284.
[Pol64] Poljak, B. T., 1964. "Some Methods of Speeding up the Convergence
of Iteration Methods," Z. Vycisl. Mat. i Mat. Fiz., Vol. 4, pp. 1-17.
[Pol71] Polak, E., 1971. Computational Methods in Optimization: A Uni-
fied Approach, Academic Press, NY.
[Pol78] Poljak, B. T., 1978. "Nonlinear Programming Methods in the Pres-
ence of Noise," Math. Programming, Vol. 14, pp. 87-97.
[Pol79] Poljak, B. T., 1979. "On Bertsekas' Method for Minimization of
Composite Functions," Internat. Symp. Systems Opt. Analysis, Bensoussan,
A., and Lions, J. L., (Eds.), Springer-Verlag, Berlin and NY, pp. 179-186.
[Pol87] Poljak, B. T., 1987. Introduction to Optimization, Optimization
Software Inc., NY.
[Pol88] Polyak, R. A., 1988. "Smooth Optimization Methods for Minimax
Problems," SIAM J. on Control and Optimization, Vol. 26, pp. 1274-1286.
[Pol92] Polyak, R. A., 1992. "Modified Barrier Functions (Theory and
Methods)," Math. Programming, Vol. 54, pp. 177-222.
[Pow69] Powell, M. J. D., 1969. "A Method for Nonlinear Constraints
in Minimizing Problems," in Optimization, Fletcher, R., (Ed.), Academic
Press, NY, pp. 283-298.
[Pow73] Powell, M. J. D., 1973. "On Search Directions for Minimization
Algorithms," Math. Programming, Vol. 4, pp. 193-201.
[Pow11] Powell, W. B., 2011. Approximate Dynamic Programming: Solving
the Curses of Dimensionality, 2nd Ed., Wiley, NY.
[Pre95] Prekopa, A., 1995. Stochastic Programming, Kluwer, Boston.
[Psh65] Pshenichnyi, B. N., 1965. "Dual Methods in Extremum Problems,"
Kibernetika, Vol. 1, pp. 89-95.
[Pyt98] Pytlak, R., 1998. "An Efficient Algorithm for Large-Scale Nonlinear
Programming Problems with Simple Bounds on the Variables," SIAM J.
on Optimization, Vol. 8, pp. 532-560.
[QSG13] Qin, Z., Scheinberg, K., and Goldfarb, D., 2013. "Efficient Block-
Coordinate Descent Algorithms for the Group Lasso," Math. Programming
Computation, Vol. 5, pp. 143-169.
[RFP10] Recht, B., Fazel, M., and Parrilo, P. A., 2010. "Guaranteed Mini-
mum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Mini-
mization," SIAM Review, Vol. 52, pp. 471-501.
[RGV14] Richard, E., Gaiffas, S., and Vayatis, N., 2014. "Link Prediction
in Graphs with Autoregressive Features," J. of Machine Learning Research,
Vol. 15, pp. 565-593.
[RHL13] Razaviyayn, M., Hong, M., and Luo, Z. Q., 2013. "A Unified
Convergence Analysis of Block Successive Minimization Methods for Non-
smooth Optimization," SIAM J. on Optimization, Vol. 23, pp. 1126-1153.
[RHW86] Rumelhart, D. E., Hinton, G. E., and Williams, R. J., 1986.
"Learning Internal Representation by Error Backpropagation," in Parallel
Distributed Processing-Explorations in the Microstructure of Cognition,
by Rumelhart and McClelland, (eds.), MIT Press, Cambridge, MA, pp.
318-362.
[RHW88] Rumelhart, D. E., Hinton, G. E., and Williams, R. J., 1988.
"Learning Representations by Back-Propagating Errors," in Cognitive Mod-
eling, by T. A. Polk, and C. M. Seifert, (eds.), MIT Press, Cambridge, MA,
pp. 213-220.
[RHZ14] Razaviyayn, M., Hong, M., and Luo, Z. Q., 2013. "A Unified
Convergence Analysis of Block Successive Minimization Methods for Non-
smooth Optimization," SIAM J. on Optimization, Vol. 23, pp. 1126-1153.
[RNV09] Ram, S. S., Nedic, A., and Veeravalli, V. V., 2009. "Incremental
Stochastic Subgradient Algorithms for Convex Optimization," SIAM J. on
Optimization, Vol. 20, pp. 691-717.
[RNV10] Ram, S. S., Nedic, A., and Veeravalli, V. V., 2010. "Distributed
Stochastic Subgradient Projection Algorithms for Convex Optimization,"
J. of Optimization Theory and Applications, Vol. 147, pp. 516-545.
[ROF92] Rudin, L. I., Osher, S., and Fatemi, E., 1992. "Nonlinear Total
Variation Based Noise Removal Algorithms," Physica D: Nonlinear Phe-
nomena, Vol. 60, pp. 259-268.
[RRW11] Recht, B., Re, C., Wright, S. J., and Niu, F., 2011. "Hogwild:
A Lock-Free Approach to Parallelizing Stochastic Gradient Descent," in
Advances in Neural Information Processing Systems (NIPS 2011), pp. 693-
701.
[RSW13] Rao, N., Shah, P., Wright, S., and Nowak, R., 2013. "A Greedy
Forward-Backward Algorithm for Atomic Norm Constrained Minimization,"
in Proc. 2013 IEEE International Conference on Acoustics, Speech and
Signal Processing, pp. 5885-5889.
[RTV06] Roos, C., Terlaky, T., and Vial, J.P., 2006. Interior Point Methods
for Linear Optimization, Springer, NY.
[RXB11] Recht, B., Xu, W., and Hassibi, B., 2011. "Null Space Conditions
and Thresholds for Rank Minimization," Math. Programming, Vol. 127,
pp. 175-202.
[RaN04] Rabbat, M. G., and Nowak, R. D., 2004. "Distributed Optimiza-
tion in Sensor Networks," in Proc. Inf. Processing Sensor Networks, Berke-
ley, CA, pp. 20-27.
[RaN05] Rabbat M. G., and Nowak R. D., 2005. "Quantized Incremen-
tal Algorithms for Distributed Optimization," IEEE J. on Select Areas in
Communications, Vol. 23, pp. 798-808.
[Ray93] Raydan, M., 1993. "On the Barzilai and Borwein Choice of Steplen-
gth for the Gradient Method," IMA J. of Numerical Analysis, Vol. 13, pp.
321-326.
[Ray97] Raydan, M., 1997. "The Barzilai and Borwein Gradient Method
for the Large Scale Unconstrained Minimization Problem," SIAM J. on
Optimization, Vol. 7, pp. 26-33.
[ReR13] Recht, B., and Re, C., 2013. "Parallel Stochastic Gradient Algo-
rithms for Large-Scale Matrix Completion," Math. Programming Compu-
tation, Vol. 5, pp. 201-226.
[Rec11] Recht, B., 2011. "A Simpler Approach to Matrix Completion," The
J. of Machine Learning Research, Vol. 12, pp. 3413-3430.
[RiT14] Richtarik, P., and Takac, M., 2014. "Iteration Complexity of Ran-
domized Block-Coordinate Descent Methods for Minimizing a Composite
Function," Math. Programming, Vol. 144, pp. 1-38.
[RoS71] Robbins, H., and Siegmund, D. O., 1971. "A Convergence Theorem
for Nonnegative Almost Supermartingales and Some Applications," Opti-
mizing Methods in Statistics, pp. 233-257; see "Herbert Robbins Selected
Papers," Springer, NY, 1985, pp. 111-135.
[RoW91] Rockafellar, R. T., and Wets, R. J.-B., 1991. "Scenarios and Pol-
icy Aggregation in Optimization Under Uncertainty," Math. of Operations
Research, Vol. 16, pp. 119-147.
[RoW98] Rockafellar, R. T., and Wets, R. J.-B., 1998. Variational Analysis,
Springer-Verlag, Berlin.
[Rob99] Robinson, S. M., 1999. "Linear Convergence of Epsilon-Subgradient
Descent Methods for a Class of Convex Functions," Math. Programming,
Ser. A, Vol. 86, pp. 41-50.
[Roc66] Rockafellar, R. T., 1966. "Characterization of the Subdifferentials
of Convex Functions," Pacific J. of Mathematics, Vol. 17, pp. 497-510.
[Roc70] Rockafellar, R. T., 1970. Convex Analysis, Princeton Univ. Press,
Princeton, NJ.
[Roc73] Rockafellar, R. T., 1973. "A Dual Approach to Solving Nonlinear
Programming Problems by Unconstrained Optimization," Math. Program-
ming, pp. 354-373.
[Roc76a] Rockafellar, R. T., 1976. "Monotone Operators and the Proximal
Point Algorithm," SIAM J. on Control and Optimization, Vol. 14, pp. 877-
898.
[Roc76b] Rockafellar, R. T., 1976. "Augmented Lagrangians and Applica-
tions of the Proximal Point Algorithm in Convex Programming," Math. of
Operations Research, Vol. 1, pp. 97-116.
[Roc76c] Rockafellar, R. T., 1976. "Solving a Nonlinear Programming Prob-
lem by Way of a Dual Problem," Symp. Matematica, Vol. 27, pp. 135-160.
[Roc84] Rockafellar, R. T., 1984. Network Flows and Monotropic Optimiza-
tion, Wiley, NY; republished by Athena Scientific, Belmont, MA, 1998.
[Rud76] Rudin, W., 1976. Real Analysis, McGraw-Hill, NY.
[Rup85] Ruppert, D., 1985. "A Newton-Raphson Version of the Multi-
variate Robbins-Monro Procedure," The Annals of Statistics, Vol. 13, pp.
236-245.
[Rus86] Ruszczynski, A., 1986. "A Regularized Decomposition Method for
Minimizing a Sum of Polyhedral Functions," Math. Programming, Vol. 35,
pp. 309-333.
[Rus06] Ruszczynski, A., 2006. Nonlinear Optimization, Princeton Univ.
Press, Princeton, NJ.
[SBC91] Saarinen, S., Bramley, R. B., and Cybenko, G., 1991. "Neural
Networks, Backpropagation and Automatic Differentiation," in Automatic
Differentiation of Algorithms, by A. Griewank and G. F. Corliss, (eds.),
SIAM, Philadelphia, PA, pp. 31-42.
[SBK64] Shah, B., Buehler, R., and Kempthorne, O., 1964. "Some Algo-
rithms for Minimizing a Function of Several Variables," J. Soc. Indust.
Appl. Math., Vol. 12, pp. 74-92.
[SBT12] Shah, P., Bhaskar, B. N., Tang, G., and Recht, B., 2012. "Linear
System Identification via Atomic Norm Regularization," arXiv preprint
arXiv:1204.0590.
[SDR09] Shapiro, A., Dentcheva, D., and Ruszczynski, A., 2009. Lectures
on Stochastic Programming: Modeling and Theory, SIAM, Phila., PA.
[SFR09] Schmidt, M., Fung, G., and Rosales, R., 2009. "Optimization
Methods for ℓ1-Regularization," Univ. of British Columbia, Technical Re-
port TR-2009-19.
[SKS12] Schmidt, M., Kim, D., and Sra, S., 2012. "Projected Newton-Type
Methods in Machine Learning," in Optimization for Machine Learning, by
Sra, S., Nowozin, S., and Wright, S. J., (eds.), MIT Press, Cambridge, MA,
pp. 305-329.
[SLB13] Schmidt, M., Le Roux, N., and Bach, F., 2013. "Minimizing Finite
Sums with the Stochastic Average Gradient," arXiv preprint arXiv:1309.2388.
[SNW12] Sra, S., Nowozin, S., and Wright, S. J., 2012. Optimization for
Machine Learning, MIT Press, Cambridge, MA.
[SRB11] Schmidt, M., Roux, N. L., and Bach, F. R., 2011. "Convergence
Rates of Inexact Proximal-Gradient Methods for Convex Optimization,"
In Advances in Neural Information Processing Systems, pp. 1458-1466.
[SSS07] Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A., 2007.
"Pegasos: Primal Estimated Subgradient Solver for SVM," in ICML 07,
New York, NY, pp. 807-814.
[SaT13] Saha, A., and Tewari, A., 2013. "On the Nonasymptotic Conver-
gence of Cyclic Coordinate Descent Methods," SIAM J. on Optimization,
Vol. 23, pp. 576-601.
[Sak66] Sakrison, D. T., 1966. "Stochastic Approximation: A Recursive
Method for Solving Regression Problems," in Advances in Communication
Theory and Applications, 2, A. V. Balakrishnan, ed., Academic Press, NY,
pp. 51-106.
[Say14] Sayed, A. H., 2014. "Adaptation, Learning, and Optimization over
Networks," Foundations and Trends in Machine Learning, Vol. 7, no. 4-5,
pp. 311-801.
[ScF14] Schmidt, M., and Friedlander, M. P., 2014. "Coordinate Descent
Converges Faster with the Gauss-Southwell Rule than Random Selection,"
Advances in Neural Information Processing Systems 27 (NIPS 2014).
[Sch82] Schnabel, R. B., 1982. "Determining Feasibility of a Set of Non-
linear Inequality Constraints," Math. Programming Studies, Vol. 16, pp.

137-148.
[Sch86] Schrijver, A., 1986. Theory of Linear and Integer Programming,
Wiley, NY.
[Sch10] Schmidt, M., 2010. "Graphical Model Structure Learning with L1-
Regularization," PhD Thesis, Univ. of British Columbia.
[Sch14a] Schmidt, M., 2014. "Convergence Rate of Stochastic Gradient with
Constant Step Size," Computer Science Report, Univ. of British Columbia.
[Sch14b] Schmidt, M., 2014. "Convergence Rate of Proximal Gradient with
General Step-Size," Dept. of Computer Science, Unpublished Note, Univ.
of British Columbia.
[ShZ12] Shamir, O., and Zhang, T., 2012. "Stochastic Gradient Descent for
Non-Smooth Optimization: Convergence Results and Optimal Averaging
Schemes," arXiv preprint arXiv:1212.1824.
[Sha79] Shapiro, J. E., 1979. Mathematical Programming Structures and
Algorithms, Wiley, NY.
[Sho85] Shor, N. Z., 1985. Minimization Methods for Nondifferentiable
Functions, Springer-Verlag, Berlin.
[Sho98] Shor, N. Z., 1998. Nondifferentiable Optimization and Polynomial
Problems, Kluwer Academic Publishers, Dordrecht, Netherlands.
[SmS04] Smola, A. J., and Scholkopf, B., 2004. "A Tutorial on Support
Vector Regression," Statistics and Computing, Vol. 14, pp. 199-222.
[SoZ98] Solodov, M. V., and Zavriev, S. K., 1998. "Error Stability Proper-
ties of Generalized Gradient-Type Algorithms," J. Opt. Theory and Appl.,
Vol. 98, pp. 663-680.
[Sol98] Solodov, M. V., 1998. "Incremental Gradient Algorithms with Step-
sizes Bounded Away from Zero," Computational Optimization and Appli-
cations, Vol. 11, pp. 23-35.
[Spa03] Spall, J. C., 2003. Introduction to Stochastic Search and Optimiza-
tion: Estimation, Simulation, and Control, J. Wiley, Hoboken, NJ.
[Spa12] Spall, J. C., 2012. "Cyclic Seesaw Process for Optimization and
Identification," J. of Optimization Theory and Applications, Vol. 154, pp.
187-208.
[Spi83] Spingarn, J. E., 1983. "Partial Inverse of a Monotone Operator,"
Applied Mathematics and Optimization, Vol. 10, pp. 247-265.
[Spi85] Spingarn, J. E., 1985. "Applications of the Method of Partial In-
verses to Convex Programming: Decomposition," Math. Programming,
Vol. 32, pp. 199-223.
[StV09] Strohmer, T., and Vershynin, R., 2009. "A Randomized Kaczmarz
Algorithm with Exponential Convergence," J. Fourier Anal. Appl., Vol. 15,
pp. 262-278.
[StW70] Stoer, J., and Witzgall, C., 1970. Convexity and Optimization in
Finite Dimensions, Springer-Verlag, Berlin.

[StW75] Stephanopoulos, G., and Westerberg, A. W., 1975. "The Use of


Hestenes' Method of Multipliers to Resolve Dual Gaps in Engineering Sys-
tem Optimization," J. Optimization Theory and Applications, Vol. 15, pp.
285-309.
[Str76] Strang, G., 1976. Linear Algebra and Its Applications, Academic
Press, NY.
[Str97] Stromberg, A-B., 1997. Conditional Subgradient Methods and Er-
godic Convergence in Nonsmooth Optimization, Ph.D. Thesis, Univ. of
Linkoping, Sweden.
[SuB98] Sutton, R. S., and Barto, A. G., 1998. Reinforcement Learning,
MIT Press, Cambridge, MA.
[TBA86] Tsitsiklis, J. N., Bertsekas, D. P., and Athans, M., 1986. "Dis-
tributed Asynchronous Deterministic and Stochastic Gradient Optimiza-
tion Algorithms," IEEE Trans. on Automatic Control, Vol. AC-31, pp.
803-812.
[TBT90] Tseng, P., Bertsekas, D. P., and Tsitsiklis, J. N., 1990. "Partially
Asynchronous, Parallel Algorithms for Network Flow and Other Problems,"
SIAM J. on Control and Optimization, Vol. 28, pp. 678-710.
[TVS10] Teo, C. H., Vishwanathan, S. V. N., Smola, A. J., and Le, Q.
V., 2010. "Bundle Methods for Regularized Risk Minimization," The J. of
Machine Learning Research, Vol. 11, pp. 311-365.
[TaP13] Talischi, C., and Paulino, G. H., 2013. "A Consistent Operator
Splitting Algorithm and a Two-Metric Variant: Application to Topology
Optimization," arXiv preprint arXiv:1307.5100.
[Teb92] Teboulle, M., 1992. "Entropic Proximal Mappings with Applica-
tions to Nonlinear Programming," Math. of Operations Research, Vol. 17,
pp. 1-21.
[Teb97] Teboulle, M., 1997. "Convergence of Proximal-Like Algorithms,"
SIAM J. Optim., Vol. 7, pp. 1069-1083.
[Ter96] Terlaky, T. (Ed.), 1996. Interior Point Methods of Mathematical
Programming, Springer, NY.
[Tib96] Tibshirani, R., 1996. "Regression Shrinkage and Selection via the
Lasso," J. of the Royal Statistical Society, Series B (Methodological), Vol.
58, pp. 267-288.
[Tod01] Todd, M. J., 2001. "Semidefinite Optimization," Acta Numerica,
Vol. 10, pp. 515-560.
[TsB87] Tseng, P., and Bertsekas, D. P., 1987. "Relaxation Methods for
Problems with Strictly Convex Separable Costs and Linear Constraints,"
Math. Programming, Vol. 38, pp. 303-321.
[TsB90] Tseng, P., and Bertsekas, D. P., 1990. "Relaxation Methods for
Monotropic Programs," Math. Programming, Vol. 46, pp. 127-151.
[TsB91] Tseng, P., and Bertsekas, D. P., 1991. "Relaxation Methods for

Problems with Strictly Convex Costs and Linear Constraints," Math. of


Operations Research, Vol. 16, pp. 462-481.
[TsB93] Tseng, P., and Bertsekas, D. P., 1993. "On the Convergence of
the Exponential Multiplier Method for Convex Programming," Math. Pro-
gramming, Vol. 60, pp. 1-19.
[TsB00] Tseng, P., and Bertsekas, D. P., 2000. "An Epsilon-Relaxation
Method for Separable Convex Cost Generalized Network Flow Problems,"
Math. Programming, Vol. 88, pp. 85-104.
[TsY09] Tseng, P., and Yun, S., 2009. "A Coordinate Gradient Descent
Method for Nonsmooth Separable Minimization," Math. Programming,
Vol. 117, pp. 387-423.
[Tse91a] Tseng, P., 1991. "Decomposition Algorithm for Convex Differen-
tiable Minimization," J. of Optimization Theory and Applications, Vol. 70,
pp. 109-135.
[Tse91b] Tseng, P., 1991. "Applications of a Splitting Algorithm to De-
composition in Convex Programming and Variational Inequalities," SIAM
J. on Control and Optimization, Vol. 29, pp. 119-138.
[Tse93] Tseng, P., 1993. "Dual Coordinate Ascent Methods for Non-Strictly
Convex Minimization," Math. Programming, Vol. 59, pp. 231-247.
[Tse95] Tseng, P., 1995. "Fortified-Descent Simplicial Search Method," Re-
port, Dept. of Math., Univ. of Washington, Seattle, Wash.; also in SIAM
J. on Optimization, Vol. 10, 1999, pp. 269-288.
[Tse98] Tseng, P., 1998. "Incremental Gradient(-Projection) Method with
Momentum Term and Adaptive Stepsize Rule," SIAM J. on Optimization,
Vol. 8, pp. 506-531.
[Tse00] Tseng, P., 2000. "A Modified Forward-Backward Splitting Method
for Maximal Monotone Mappings," SIAM J. on Control and Optimization,
Vol. 38, pp. 431-446.
[Tse01a] Tseng, P., 2001. "Convergence of Block Coordinate Descent Meth-
ods for Nondifferentiable Minimization," J. Optim. Theory Appl., Vol. 109,
pp. 475-494.
[Tse01b] Tseng, P., 2001. "An Epsilon Out-of-Kilter Method for Monotropic
Programming," Math. of Operations Research, Vol. 26, pp. 221-233.
[Tse04] Tseng, P., 2004. "An Analysis of the EM Algorithm and Entropy-
Like Proximal Point Methods," Math. Operations Research, Vol. 29, pp.
27-44.
[Tse08] Tseng, P., 2008. "On Accelerated Proximal Gradient Methods for
Convex-Concave Optimization," Report, Math. Dept., Univ. of Washing-
ton.
[Tse09] Tseng, P., 2009. "Some Convex Programs Without a Duality Gap,"
Math. Programming, Vol. 116, pp. 553-578.
[Tse10] Tseng, P., 2010. "Approximation Accuracy, Gradient Methods, and

Error Bound for Structured Convex Optimization," Math. Programming,


Vol. 125, pp. 263-295.
[VKG14] Vetterli, M., Kovacevic, J., and Goyal, V. K., 2014. Foundations
of Signal Processing, Cambridge Univ. Press, NY.
[VMR88] Vogl, T. P., Mangis, J. K., Rigler, A. K., Zink, W. T., and
Alkon, D. L., 1988. "Accelerating the Convergence of the Back-Propagation
Method," Biological Cybernetics, Vol. 59, pp. 257-263.
[VaF08] Van Den Berg, E., and Friedlander, M. P., 2008. "Probing the
Pareto Frontier for Basis Pursuit Solutions," SIAM J. on Scientific Com-
puting, Vol. 31, pp. 890-912.
[Van01] Vanderbei, R. J., 2001. Linear Programming: Foundations and
Extensions, Springer, NY.
[VeH93] Ventura, J. A., and Hearn, D. W., 1993. "Restricted Simplicial
Decomposition for Convex Constrained Problems," Math. Programming,
Vol. 59, pp. 71-85.
[Ven67] Venter, J. H., 1967. "An Extension of the Robbins-Monro Proce-
dure," Ann. Math. Statist., Vol. 38, pp. 181-190.
[WDS13] Weinmann, A., Demaret, L., and Storath, M., 2013. "Total Varia-
tion Regularization for Manifold-Valued Data," arXiv preprint arXiv:1312.7710.
[WFL14] Wang, M., Fang, E., and Liu, H., 2014. "Stochastic Compositional
Gradient Descent: Algorithms for Minimizing Compositions of Expected-
Value Functions," Optimization Online.
[WHM13] Wang, X., Hong, M., Ma, S., Luo, Z. Q., 2013. "Solving Multiple-
Block Separable Convex Minimization Problems Using Two-Block Alter-
nating Direction Method of Multipliers," arXiv preprint arXiv:1308.5294.
[WSK14] Wytock, M., Sra, S., and Kolter, J. K., 2014. "Fast Newton
Methods for the Group Fused Lasso," Proc. of 2014 Conf. on Uncertainty
in Artificial Intelligence.
[WSV00] Wolkowicz, H., Saigal, R., and Vandenberghe, L., (eds), 2000.
Handbook of Semidefinite Programming, Kluwer, Boston.
[WaB13a] Wang, M., and Bertsekas, D. P., 2013. "Incremental Constraint
Projection-Proximal Methods for Nonsmooth Convex Optimization," Lab.
for Information and Decision Systems Report LIDS-P-2907, MIT, to appear
in SIAM J. on Optimization.
[WaB13b] Wang, M., and Bertsekas, D. P., 2013. "Convergence of Itera-
tive Simulation-Based Methods for Singular Linear Systems," Stochastic
Systems, Vol. 3, pp. 38-95.
[WaB13c] Wang, M., and Bertsekas, D. P., 2013. "Stabilization of Stochas-
tic Iterative Methods for Singular and Nearly Singular Linear Systems,"
Math. of Operations Research, Vol. 39, pp. 1-30.
[WaB14] Wang, M., and Bertsekas, D. P., 2014. "Incremental Constraint

Projection Methods for Variational Inequalities," Mathematical Program-


ming, pp. 1-43.
[Was04] Wasserman, L., 2004. All of Statistics: A Concise Course in Sta-
tistical Inference, Springer, NY.
[Wat92] Watson, G. A., 1992. "Characterization of the Subdifferential of
Some Matrix Norms," Linear Algebra and its Applications, Vol. 170, pp.
33-45.
[WeO13] Wei, E., and Ozdaglar, A., 2013. "On the O(1/k) Convergence of
Asynchronous Distributed Alternating Direction Method of Multipliers,"
arXiv preprint arXiv:1307.8254.
[WiH60] Widrow, B., and Hoff, M. E., 1960. "Adaptive Switching Circuits,"
Institute of Radio Engineers, Western Electronic Show and Convention,
Convention Record, Part 4, pp. 96-104.
[Wol75] Wolfe, P., 1975. "A Method of Conjugate Subgradients for Mini-
mizing Nondifferentiable Functions," Math. Programming Study 3, Balin-
ski, M., and Wolfe, P., (Eds.), North-Holland, Amsterdam, pp. 145-173.
[Wri97] Wright, S. J., 1997. Primal-Dual Interior Point Methods, SIAM,
Philadelphia, PA.
[Wri14] Wright, S. J., 2014. "Coordinate Descent Algorithms," Optimiza-
tion Online.
[XiZ14] Xiao, L., and Zhang, T., 2014. "A Proximal Stochastic Gradient
Method with Progressive Variance Reduction," arXiv preprint arXiv:1403.4699.
[Xia10] Xiao, L., 2010. "Dual Averaging Methods for Regularized Stochastic
Learning and Online Optimization," J. of Machine Learning Research, Vol.
11, pp. 2534-2596.
[YBR08] Yu, H., Bertsekas, D. P., and Rousu, J., 2008. "An Efficient Dis-
criminative Training Method for Generative Models," Extended Abstract,
the 6th International Workshop on Mining and Learning with Graphs
(MLG).
[YGT93] Ye, Y., Guler, O., Tapia, R. A., and Zhang, Y., 1993. "A Quadrat-
ically Convergent O(√nL)-Iteration Algorithm for Linear Programming,"
Math. Programming, Vol. 59, pp. 151-162.
[YNS10] Yousefian, F., Nedic, A., and Shanbhag, U. V., 2010. "Convex
Nondifferentiable Stochastic Optimization: A Local Randomized Smooth-
ing Technique," Proc. American Control Conference (ACC), pp. 4875-4880.
[YNS12] Yousefian, F., Nedic, A., and Shanbhag, U. V., 2012. "On Stochas-
tic Gradient and Subgradient Methods with Adaptive Steplength Sequences,"
Automatica, Vol. 48, pp. 56-67.
[YOG08] Yin, W., Osher, S., Goldfarb, D., and Darbon, J., 2008. "Bregman
Iterative Algorithms for ℓ1-Minimization with Applications to Compressed
Sensing," SIAM J. on Imaging Sciences, Vol. 1, pp. 143-168.

[YSQ14] You, K., Song, S., and Qiu, L., 2014. "Randomized Incremental
Least Squares for Distributed Estimation Over Sensor Networks," Preprints
of the 19th World Congress of the International Federation of Automatic
Control, Cape Town, South Africa.
[Ye92] Ye, Y., 1992. "A Potential Reduction Algorithm Allowing Column
Generation," SIAM J. on Optimization, Vol. 2, pp. 7-20.
[Ye97] Ye, Y., 1997. Interior Point Algorithms: Theory and Analysis, Wiley
Interscience, NY.
[YuR07] Yu, H., and Rousu, J., 2007. "An Efficient Method for Large Mar-
gin Parameter Optimization in Structured Prediction Problems," Technical
Report C-2007-87, Univ. of Helsinki.
[ZJL13] Zhang, H., Jiang, J., and Luo, Z. Q., 2013. "On the Linear Con-
vergence of a Proximal Gradient Method for a Class of Nonsmooth Convex
Minimization Problems," J. of the Operations Research Society of China,
Vol. 1, pp. 163-186.
[ZLW99] Zhao, X., Luh, P. B., and Wang, J., 1999. "Surrogate Gradient
Algorithm for Lagrangian Relaxation," J. Optimization Theory and Appli-
cations, Vol. 100, pp. 699-712.
[ZMJ13] Zhang, L., Mahdavi, M., and Jin, R., 2013. "Linear Convergence
with Condition Number Independent Access of Full Gradients," Advances
in Neural Information Processing Systems 26 (NIPS 2013), pp. 980-988.
[ZTD92] Zhang, Y., Tapia, R. A., and Dennis, J. E., 1992. "On the Su-
perlinear and Quadratic Convergence of Primal-Dual Interior Point Linear
Programming Algorithms," SIAM J. on Optimization, Vol. 2, pp. 304-324.
[Zal02] Zalinescu, C., 2002. Convex Analysis in General Vector Spaces,
World Scientific, Singapore.
[Zan69] Zangwill, W. I., 1969. Nonlinear Programming, Prentice-Hall, En-
glewood Cliffs, NJ.
[Zou60] Zoutendijk, G., 1960. Methods of Feasible Directions, Elsevier
Publ. Co., Amsterdam.
[Zou76] Zoutendijk, G., 1976. Mathematical Programming Methods, North
Holland, Amsterdam.
INDEX

ADMM 111, 280, 292, 295, 298, Bolzano-Weierstrass Theorem 453
337, 427 Boundary of a set 454
Affine function 445 Boundary point 454
Affine hull 472 Bounded sequence 451, 452
Affine set 445 Bounded set 453
Aggregated gradient method 91, 94, Branch-and-bound 7
428 Bregman distance 388
Alternating direction method 111, Bundle methods 110, 187, 272, 295,
280 385
Analytic center 425 C
Approximation algorithms 36, 54
Caratheodory's Theorem 472
Armijo rule 69, 123, 125, 317
Cartesian product 445
Asymptotic sequence 481
Cauchy sequence 452
Asynchronous computation 33 104
376 ' ' Central cutting plane methods 425
432 '
Asynchronous gradient method 104
106 ' Chain rule 142, 513
Classification 29
Atomic norm 35
Closed ball 453
Auction algorithm 180, 375
Closed function 469
Augmented Lagrangian function 260
283, 290 ' Closed halfspace 484
Closed set 453
Augmented Lagrangian method 109,
Closed set intersections 481
115, 120, 261, 294, 326, 337, 362,
Closed sphere 453
384,389,430
Closedness under linear transfor-
Averaging of iterates 120, 157, 176
mations 483
B Closedness under vector sums 483
Closure of a function 476
Backpropagation 119 Closure of a set 453
Backtracking rule 69, 123 Closure point 453
Ball center 426 Co-finite function 411
Barrier function 412 Coercive function 496
Barrier method 413 Compact set 453
Basis 446 Complementary slackness 508
Basis pursuit 34, 286 Component of a vector 444
Batching 93 Composition of functions 142, 455,
Block coordinate descent 75, 268, 469, 514
281, 429, 438-442 Concave closure 502


Concave function 468 Cyclic incremental method 84, 96,


Condition number 60, 122, 315 166,343
Conditional gradient method 71, 107,
191, 192, 374 D
Cone 468 Danskin's Theorem 146, 172
Cone decomposition 219 Dantzig-Wolfe decomposition 107,
Cone generated by a set 472, 489, 111, 229, 295
492 Decomposition algorithm 7, 77,289,
Confusion region 89, 93 363
Conic duality 14, 19, 23, 511 Decomposition of a convex set 479
Conic programming 13, 217, 224, Derivative 456
231,232,423,432,511 Descent algorithms 54
Conjugate Subgradient Theorem 201, Descent direction 58, 71
513 Descent inequality 122, 305
Conjugacy Theorem 487 Diagonal dominance 106
Conjugate direction method 64, 68, Diagonal scaling 63, 101, 333, 338
320 Differentiability 457
Conjugate function 487 Differentiable convex function 470
Conjugate gradient method 64 Differentiation theorems 457, 458,
Constancy space 480 513,514
Constant stepsize rule 56, 59, 69, Dimension of a convex set 472
153,304,349 Dimension of a subspace 446
Constraint aggregation 230 Dimension of an affine set 472
Constraint qualification 507 Diminishing stepsize rule 69, 127,
Continuity 455, 475 157, 174, 316, 351
Continuous differentiability 141, 172, Direct search methods 83
457 Direction of recession 478
Contraction mapping 56, 296, 312, Directional derivative 137, 170-173,
458 515
Convergent sequence 451, 452 Distributed computation 8, 33, 104,
Convex closure 476 365, 376
Convex combination 472 Domain 444
Convex function 468, 469 Domain one-dimensional 409
Convex hull 472 Double conjugate 487
Convex programming 4, 507 Dual cone 14, 511
Convex set 468 Dual function 3, 499
Convexification of a function 476 Dual pair representation 219
Coordinate 444 Dual problem 2, 147, 164, 499, 506,
Coordinate descent method 75, 104, 507
369, 429, 439-442 Dual proximal algorithm 257, 336,
Crossing function 499 384
Cutting plane method 107,182,211, Duality gap estimate 9
228,270 Duality theory 2, 498, 505
Cyclic coordinate descent 76, 370, Dynamic programming 36, 380
376

Dynamic stepsize rule 159, 175, 177 Farout region 89


Feasibility problem 34, 283, 429
E Feasible direction 71
£-complementary slackness 180 Feasible direction methods 71
£-descent algorithm 83, 396, 400, Feasible solution 2, 494
431 Fejer Convergence Theorem 126, 158,
465
£-descent direction 400
£-relaxation method 375 Fejer monotonicity 464
E-subdifferential 162, 397 Fenchel duality 10, 510
E-subgradient 162, 164, 169, 180, Fenchel inequality 512
397 Finitely generated cone 492
E-subgradient method 162, 179 Forward-backward algorithm 427
EMP 197, 220, 396, 406 Forward image 445
Effective domain 468 Frank-Wolfe algorithm 71, 107, 191,
Entropic descent 396 374
Entropy function 383 Fritz John optimality conditions 6
Entropy minimization algorithm 383 Full rank 447
Epigraph 468 Fundamental Theorem of Linear Pro-
Essentially one-dimensional 408 gramming 494
Euclidean norm 450 G
Eventually constant stepsize rule
GPA algorithm 200-202, 229
308,334
Exact penalty 39, 45, 365, 369 Gauss-Southwell order 376, 441
Existence of dual optimal solutions Generalized polyhedral approxima-
tion 107, 196, 201, 229
503
Generalized simplicial decomposi-
Existence of optimal solutions 483,
tion 209, 229
495
Generated cone 472, 489, 492
Exponential augmented Lagrangian
Geometric convergence 57
method 116, 134, 389, 431
Global maximum 495
Exponential loss 30
Global minimum 495
Exponential smoothing 116, 134,
Gradient 456
391
Gradient method 56, 59
Extended Kalman filter 103, 120
Extended monotropic programming Gradient method distributed 104
106 '
83, 197, 220, 229, 396, 406, 431
Extended real number 443 Gradient method with momentum
Extended real-valued function 468 63, 92
Gradient projection 73, 82, 136, 302,
Exterior penalty method 110
Extrapolation 63-66, 322, 338, 427, 374,385,396,427,434
428 H
Extreme point 490 Halfspace 484
Heavy ball method 63, 92
F
Hessian matrix 457
Farkas' Lemma 492, 505, 506 Hierarchical decomposition 77

Hinge loss 30 K
Hyperplane 484
Kaczmarz method 85, 98, 131
Hyperplane separation 484-487
Krasnosel'skii-Mann Theorem 252,
I 285,300,459
Ill-conditioning 60, 109, 413
L
Image 445, 446
Improper function 468 ℓ1-norm 451
Incremental Gauss-Newton method ℓ∞-norm 450
103, 120 LMS method 119
Incremental Newton method 97, 101, Lagrange multiplier 507
118, 119 Lagrangian function 3, 507
Incremental aggregated method 91, Lasso problem 27
94 Least absolute value deviations 27,
Incremental constraint projection 288
method 102, 365, 429 Least mean squares method 119
Incremental gradient method 84, 105, Left-continuous function 455
118, 119, 130-132 Level set 469
Incremental gradient with momen- Limit 451
tum 92 Limit point 451, 453
Incremental method 25, 83, 166, Limited memory quasi-Newton me-
320 thod 63, 338
Incremental proximal method 341, Line minimization 60, 65, 69, 320
385,429 Line segment principle 4 73
Incremental subgradient method 84, Lineality space 4 79
166,341,385,428 Linear-conic problems 15, 16
Indicator function 487 Linear convergence 57
Inner linearization 107, 182, 188, Linear equation 445
194,296 Linear function 445
Infeasible problem 494 Linear inequality 445
Infimum 444 Linear programming 16, 415, 434
Inner approximation 402 Linear programming duality 506
Inner product 444 Linear regularity 369
Instability 186, 191, 269 Linear transformation preservation
Integer programming 6 of closedness 483
Interior of a set 412, 453 Linearly independent vectors 446
Interior point 453 Lipschitz continuity 141, 455, 512
Interior point method 108, 412, 415, Local convergence 68
423,432 Local maximum 495
Interpolated iteration 249, 253, 298, Local minimum 495
459 Location theory 32
Inverse barrier 412 Logarithmic barrier 412, 416
Inverse image 445, 446 Logistic loss 30
J Lower limit 452

Lower semicontinuous function 455, Nelder-Mead algorithm 83


469 Nested sequence 481
Network optimization 37, 189, 193,
M 208,217,375
Majorization-maximization algorithm Newton's method 67, 74, 75, 97,
392 338,416,424
Matrix completion 28, 35 Nonexpansive mapping 57,249,459
Matrix factorization 30, 373 Nonlinear Farkas' Lemma 505
Max crossing problem 499 Nonmonotonic stepsize rule 70
Max function 4 70 Nonnegative combination 472
Maximal monotone mapping 255 Nonquadratic regularization 242, 294,
Maximum likelihood 31 382,393
Maximum norm 450 Nonsingular matrix 447
Maximum point 444, 495 Nonstationary iteration 57, 461
Mean Value Theorem 457, 458 Nonvertical Hyperplane Theorem
Merit function 54, 417 486
Min common problem 499 Nonvertical hyperplane 486
Min common/max crossing frame- Norm 450
work 499 Norm equivalence 454
Minimax duality 502, 516 Normal cone 145
Minimax duality gap 9 Normal of a hyperplane 484
Minimax duality theorems 516 Nuclear norm 28, 35
Minimax equality 12, 516-518 Null step 212
Minimax problems 9, 12, 35, 113, Nullspace 447
147, 164, 215, 217 0
Minimax theory 498, 502, 516
Minimizer 495 Open ball 453
Minimum point 444, 495 Open halfspace 484
Minkowski-Weyl Theorem 492 Open set 453
Minkowski-Weyl representation 492 Open sphere 453
Mirror descent 82, 385, 395 Optimality conditions 3, 144, 470,
Momentum term 63, 92 508-511, 514
Monotone mapping 255 Orthogonal complement 446
Monotonically nondecreasing sequence Orthogonal vectors 444
451 Outer approximation 402
Monotonically nonincreasing sequence Outer linearization 107, 182, 194
451 Overrelaxation 253
Monotropic programming 83, 197, p
208,431
Multicommodity flow 38, 193, 217 PARTAN 64
Multiplier method 109, 261, 267 Parallel subspace 445
Parallel projections method 76, 438
N
Parallel tangents method 64
Negative halfspace 484 Partitioning 9
Neighborhood 453 Partial cutting plane method 187

Partial derivative 456 Q


Partial minimization 496 Quadratic penalty function 39
Partial proximal algorithm 297 Quadratic programming 21, 483
Partially asynchronous algorithm 105 Quasi-Newton method 68, 74, 338
377 '
Penalty method 38, 108, 120, 326 R
Penalty parameter 38, 40, 109
Perturbation function 501 Randomized coordinate descent 376
440 '
Polar Cone Theorem 488, 492
Polar cone 488, 491 Randomized incremental method 84
86,131,344,353,429 '
Polyhedral Proper Separation The-
orem 486 Range 444
Polyhedral approximation 107 182 Range space 44 7
196,217,385 ' ' Rank 447
Polyhedral cone 492 Recession Cone Theorem 4 78
Polyhedral function 493 Recession cone of a function 480
Recession cone of a set 478
Polyhedral set 468, 489
Positive combination 4 72 Recession direction 4 78
Positive definite matrix 449 Recession function 480
Positive halfspace 484 Reflection operator 249
Positive semidefinite matrix 449 Regression 26
Positively homogeneous function 488 Regularization 26-31, 117, 133, 232,
489 ' 235,287,361,382,393
Power regularization 393 Relative boundary 473
Predictor-corrector method 421 432 Relative boundary point 473
.
Pnmal-dual method 416, 420
' Relative interior 4 73
Primal function 501 Relative interior point 4 73
Relatively open 4 73
Primal problem 2, 499
Relaxation method 375
Projection Theorem 471
Prolongation Lemma 473 Reshuffling 96
Proper Separation Theorem 486 Restricted simplicial decomposition
Proper function 468 192,204
Properly separating hyperplane 485 Retractive sequence 481
486 ' Retractive set 482
Proximal Newton algorithm 338, 428 Right-continuous function 455
Proximal algorithm 80 110 120 Robust optimization 20, 47
'
234,293,307,374,384,393,439 ' ' s
Proximal cutting plane method 270 Saddle Point Theorem 518
Proximal gradient algorithm 82, 112, Saddle point 498, 517-518
133,330,336,385,428, 436-438 Scaling 61
Proximal inner linearization 278 Schwarz inequality 450
Proximal operator 248, 296 Second order cone programming 17,
Proximal simplicial decomposition 49, 50,230,423
280 Second order expansions 458
Pythagorean Theorem 450

Second order method of multipliers Strong duality 3, 499


267 Strong monotonicity 296
Self-dual cone 17 Subdifferential 136, 141, 143, 146,
Semidefinite programming 17, 22, 167, 171, 511-514
424 Subgradient 136, 147, 148, 167, 511-
Sensitivity 501 514
Separable problems 7,289,362,405 Subgradient methods 78, 136, 179
Separating Hyperplane Theorem 485 Subsequence 452
Separating hyperplane 485 Subspace 445
Sequence definitions 451 Superlinear convergence 68, 100, 338,
Serious step 272 393,437
Set intersection theorems 482 Supermartingale Convergence 462-
Shapley-Folkman Theorem 9 464
Sharp minimum 80, 176, 179, 242- Support function 488
244, 436 Support vector machine 30, 48
Shrinkage operation 133, 287, 330, Supporting Hyperplane Theorem 484
362 Supporting hyperplane 484
Side constraints 213, 217 Supremum 444
Simplicial decomposition method 72, Symmetric matrix 449
107, 182, 188, 193, 209, 221, 228-
230, 278,280,320 T
Single commodity network flow 38, Theorems of the alternative 52
208,375 Tikhonov regularization 111
Singular matrix 44 7 Total variation denoising 28
Slater condition 8, 505 Totally asynchronous algorithm 105,
Smoothing 113, 168, 326, 427 376
Span 446 Triangle inequality 450
Sphere 453 Trust region 70, 296
Splitting algorithm 113, 120, 427 Two-metric projection method 74,
Square root of a matrix 450 102, 129, 189, 374
Stationary iteration 56
Steepest descent direction 59, 128 u
Steepest descent method 59, 78,401 Uniform sampling 96, 344
Stepsize rules 69, 308, 316 Upper limit 452
Stochastic Newton method 120 Upper semicontinuous function 455,
Stochastic approximation 94 469
Stochastic gradient method 93, 167
Stochastic programming 32 V
Stochastic subgradient method 93, Vector sum preservation of closed-
167 ness 483
Strict Separation Theorem 485 Vertical hyperplane 486
Strict local minimum 495
Strictly convex function 468 w
Strictly separating hyperplane 485 Weak Duality Theorem 3, 7, 499,
Strong convexity 312, 435, 440, 471 507

Weber point 32, 52


Weierstrass' Theorem 456, 495
Weighted sup-norm 380
y

z
Zero-sum games 9
An insightful, comprehensive, and up-to-date treatment of the theory of convex optimization
algorithms, and some of their applications in large-scale resource allocation, signal processing,
and machine learning. The book complements the author's 2009 "Convex Optimization Theory"
book, which focuses on convex analysis and duality theory. The two books can be read independently,
share notation and style, and together cover the entire finite-dimensional convex optimization field.

develops comprehensively the theory of descent and approximation methods, including
gradient and subgradient projection methods, cutting plane and simplicial decomposition
methods, and proximal methods
describes and analyzes augmented Lagrangian methods, and alternating direction methods
of multipliers
develops the modern theory of coordinate descent methods, including distributed
asynchronous convergence analysis
comprehensively covers incremental gradient, subgradient, proximal, and constraint
projection methods
includes optimal algorithms based on extrapolation techniques, and associated rate
of convergence analysis
describes a broad variety of applications of large-scale optimization and machine learning
contains many examples, illustrations, and exercises

Dimitri P. Bertsekas, a member of the U.S. National Academy of
Engineering, is McAfee Professor of Engineering at the Massachusetts
Institute of Technology. Among others, he has received the 2001 AACC
John R. Ragazzini Education Award, the 2009 INFORMS Expository
Writing Award, the 2014 AACC Richard Bellman Heritage Award,
and the 2014 Khachiyan Prize.

Related Athena Scientific books by the same author:
Visit Athena Scientific online at: www.athenasc.com

ISBN 978-1-886529-28-1
