Dimitri P. Bertsekas
Massachusetts Institute of Technology
http://www.athenasc.com
Athena Scientific, Belmont, Massachusetts
Athena Scientific
Post Office Box 805
Nashua, NH 03061-0805
U.S.A.
Email: [email protected]
WWW: http://www.athenasc.com
Preface
problem structures are discussed, often arising from Lagrange duality the-
ory and Fenchel duality theory, together with its special case, conic duality.
Some additional structures involving a large number of additive terms in
the cost, or a large number of constraints are also discussed, together with
their applications in machine learning and large-scale resource allocation.
Chapter 2: Here we provide an overview of algorithmic approaches, focus-
ing primarily on algorithms for differentiable optimization, and we discuss
their differences from their nondifferentiable convex optimization counter-
parts. We also highlight the main ideas of the two principal algorithmic
approaches of this book, iterative descent and approximation, and we illus-
trate their application with specific algorithms, reserving detailed analysis
for subsequent chapters.
Chapter 3: Here we discuss subgradient methods for minimizing a con-
vex cost function over a convex constraint set. The cost function may be
nondifferentiable, as is often the case in the context of duality and machine
learning applications. These methods are based on the idea of reduction
of distance to the optimal set, and include variations aimed at algorithmic
efficiency, such as ε-subgradient and incremental subgradient methods.
Chapter 4: Here we discuss polyhedral approximation methods for min-
imizing a convex function over a convex constraint set. The two main
approaches here are outer linearization (also called the cutting plane ap-
proach) and inner linearization (also called the simplicial decomposition
approach). We show how these two approaches are intimately connected
by conjugacy and duality, and we generalize our framework for polyhedral
approximation to the case where the cost function is a sum of two or more
convex component functions.
Chapter 5: Here we focus on proximal algorithms for minimizing a convex
function over a convex constraint set. At each iteration of the basic proxi-
mal method, we solve an approximation to the original problem. However,
unlike the preceding chapter, the approximation is not polyhedral, but
rather it is based on quadratic regularization, i.e., adding a quadratic term
to the cost function, which is appropriately adjusted at each iteration. We
discuss several variations of the basic algorithm. Some of these include
combinations with the polyhedral approximation methods of the preced-
ing chapter, yielding the class of bundle methods. Others are obtained
via duality from the basic proximal algorithm, including the augmented
Lagrangian method (also called the method of multipliers) for constrained op-
timization. Finally, we discuss extensions of the proximal algorithm for
finding a zero of a maximal monotone operator, and a major special case:
the alternating direction method of multipliers, which is well suited for
taking advantage of the structure of several types of large-scale problems.
Chapter 6: Here we discuss a variety of algorithmic topics that sup-
plement our discussion of the descent and approximation methods of the
Dimitri P. Bertsekas
[email protected]
January 2015
1

Convex Optimization Models: An Overview

1.1 LAGRANGE DUALITY
We start our overview of Lagrange duality with the basic case of nonlin-
ear inequality constraints, and then consider extensions involving linear
inequality and equality constraints. Consider the problem†

minimize f(x)
subject to x ∈ X, g(x) ≤ 0,    (1.1)

where X is a nonempty set,

g(x) = (g_1(x), ..., g_r(x))',

and f : X ↦ ℝ and g_j : X ↦ ℝ, j = 1, ..., r, are given functions. We refer
to this as the primal problem, and we denote its optimal value by f*. A
vector x satisfying the constraints of the problem is referred to as feasible.
The dual of problem (1.1) is given by

maximize q(µ)
subject to µ ∈ ℝ^r,    (1.2)
† Consistent with its overview character, this chapter contains few proofs,
and refers frequently to the literature, and to Appendix B, which contains a full
list of definitions and propositions (without proofs) relating to nonalgorithmic
aspects of convex optimization. This list reflects and summarizes the content
of the author's "Convex Optimization Theory" book [Ber09]. The proposition
numbers of [Ber09] have been preserved, so all omitted proofs of propositions in
Appendix B can be readily accessed from [Ber09].
† Appendix A contains an overview of the mathematical notation, terminol-
ogy, and results from linear algebra and real analysis that we will be using.
so that

q* = sup_{µ∈ℝ^r} q(µ) = sup_{µ≥0} q(µ) ≤ inf_{x∈X, g(x)≤0} f(x) = f*.

When q* = f*, we say that strong duality holds. The following proposition gives necessary and sufficient conditions for strong duality, and primal and dual optimality (see Prop. 5.3.2 in Appendix B).
In this manner, Prop. 1.1.3 under condition (2), together with Prop. 1.1.2,
yield the following for the case where all constraint functions are linear.
minimize f(x)
subject to x ∈ X, Ax = b,    (1.4)

x* ∈ arg min_{x∈X} L(x, λ*).
Aside from the preceding results, there are alternative optimality con-
ditions for convex and nonconvex optimization problems, which are based
on extended versions of the Fritz John theorem; see [BeO02] and [BOT06],
and the textbooks [Ber99] and [BNO03]. These conditions are derived us-
ing a somewhat different line of analysis and supplement the ones given
here, but we will not have occasion to use them in this book.
The preceding propositions deal mostly with situations where strong du-
ality holds (q* = f*). However, duality can be useful even when there is a duality gap, as often occurs in problems that have a finite constraint set X. An example is integer programming, where the components of x must be integers from a bounded range (usually 0 or 1). An important special case is the linear 0-1 integer programming problem

minimize c'x
subject to Ax ≤ b, x_i = 0 or 1, i = 1, ..., n,
which by weak duality, is a lower bound to the optimal value of the re-
stricted problem (1.5). In a strengthened version of this approach, the
given inequality constraints g(x) ≤ 0 may be augmented by additional in-
equalities that are known to be satisfied by optimal solutions of the original
problem.
An important point here is that when X is finite, the dual function
q of Eq. (1.6) is concave and polyhedral. Thus solving the dual problem
amounts to minimizing the polyhedral function -q over the nonnegative
orthant. This is a major context within which polyhedral functions arise
in convex optimization.
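As an illustration of this use of duality, the following Python sketch (with made-up data c, a, b) evaluates the dual function of a small linear 0-1 problem by enumerating the finite set X, and confirms weak duality: the best dual value found is a lower bound to the optimal value of the problem.

```python
import itertools
import numpy as np

# A small linear 0-1 problem (illustrative data):
#   minimize  c'x  subject to  a'x - b <= 0,  x in X = {0,1}^n.
c = np.array([-3.0, -2.0, -4.0, -1.0])
a = np.array([ 2.0,  3.0,  4.0,  1.0])
b = 5.0

X = [np.array(x) for x in itertools.product([0, 1], repeat=len(c))]

def q(mu):
    # Dual function: q(mu) = min_{x in X} { c'x + mu*(a'x - b) }, mu >= 0.
    # With X finite, q is a minimum of finitely many affine functions of mu,
    # hence concave and polyhedral.
    return min(c @ x + mu * (a @ x - b) for x in X)

f_star = min(c @ x for x in X if a @ x - b <= 0)           # optimal primal value
q_star = max(q(mu) for mu in np.linspace(0.0, 5.0, 501))   # crude dual maximization

print(f"f* = {f_star:.3f}, best q(mu) found = {q_star:.3f} (weak duality: q* <= f*)")
```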
minimize ∑_{i=1}^m f_i(x_i)
subject to ∑_{i=1}^m g_{ij}(x_i) ≤ 0, x_i ∈ X_i, i = 1, ..., m, j = 1, ..., r,    (1.7)

where f_i : ℝ^{n_i} ↦ ℝ and g_{ij} : ℝ^{n_i} ↦ ℝ are given functions, and X_i are given subsets of ℝ^{n_i}. By assigning a dual variable µ_j to the jth constraint, we obtain the dual problem [cf. Eq. (1.2)]

maximize ∑_{i=1}^m q_i(µ)
subject to µ ≥ 0,

where

q_i(µ) = inf_{x_i∈X_i} { f_i(x_i) + ∑_{j=1}^r µ_j g_{ij}(x_i) },
The separable structure is additionally helpful when the cost and/or the
constraints are not convex, and there is a duality gap. In particular, in this
case the duality gap turns out to be relatively small and can often be shown
to diminish to zero relative to the optimal primal value as the number m of
separable terms increases. As a result, one can often obtain a near-optimal
primal solution, starting from a dual-optimal solution, without resorting
to costly branch-and-bound procedures.
The small duality gap size is a consequence of the structure of the set
S of constraint-cost pairs of problem (1.7), which in the case of a separable
problem, can be written as a vector sum of m sets, one for each separable
term, i.e.,
S = S_1 + ··· + S_m,

where

S_i = { (g_i(x_i), f_i(x_i)) | x_i ∈ X_i },

and g_i : ℝ^{n_i} ↦ ℝ^r is the function g_i(x_i) = (g_{i1}(x_i), ..., g_{ir}(x_i)). It can
be shown that the duality gap is related to how much S "differs" from
its convex hull (a geometric explanation is given in [Ber99], Section 5.1.6,
and [Ber09], Section 5.7). Generally, a set that is the vector sum of a
large number of possibly nonconvex but roughly similar sets "tends to
be convex" in the sense that any vector in its convex hull can be closely
approximated by a vector in the set. As a result, the duality gap tends to
be relatively small. The analytical substantiation is based on a theorem
by Shapley and Folkman (see [Ber99], Section 5.1, or [Ber09], Prop. 5.7.1,
for a statement and proof of this theorem). In particular, it is shown in
[AuE76], and also [BeS82], [Ber82a], Section 5.6.1, under various reasonable
assumptions, that the duality gap satisfies

f* − q* ≤ (r + 1) · max_{i=1,...,m} ρ_i,
1.1.2 Partitioning
can be written as
minimize F(x) + inf_{By=c−Ax, y∈Y} G(y)
subject to x ∈ X,

or

minimize F(x) + p(c − Ax)
subject to x ∈ X,

where p is given by

p(u) = inf_{By=u, y∈Y} G(y).
In favorable cases, p can be dealt with conveniently (see e.g., the book
[Las70] and the paper [Geo72]).
Strategies of splitting or transforming the variables to facilitate al-
gorithmic solution will be frequently encountered in what follows, and in
a variety of contexts, including duality. The next section describes some
significant contexts of this type.
Let us consider the Fenchel duality framework (see Section 5.3.5 of Appendix B). It involves the problem†

minimize f_1(x) + f_2(Ax)
subject to x ∈ ℝ^n,    (1.10)

where f_1 : ℝ^n ↦ (−∞, ∞] and f_2 : ℝ^m ↦ (−∞, ∞] are given convex functions, and A is a given m × n matrix.

† We remind the reader that our convex analysis notation, terminology, and nonalgorithmic theory are summarized in Appendix B.
= inf_{x_1∈ℝ^n} { f_1(x_1) − λ'Ax_1 } + inf_{x_2∈ℝ^m} { f_2(x_2) + λ'x_2 }.
The dual problem of maximizing q over λ ∈ ℝ^m, after a sign change to convert it to a minimization problem, takes the form

minimize f_1*(A'λ) + f_2*(−λ)
subject to λ ∈ ℝ^m,    (1.11)

where f_1* and f_2* are the conjugate functions of f_1 and f_2. We denote by f* and q* the corresponding optimal primal and dual values.
The following Fenchel duality result is given as Prop. 5.3.8 in Appendix B. Parts (a) and (b) are obtained by applying Prop. 1.1.5(a) to problem (1.10), viewed as a problem with x_2 = Ax_1 as the only linear equality constraint. The first equation of part (c) is a consequence of Prop. 1.1.5(b). Its equivalence with the last two equations is a consequence of the Conjugate Subgradient Theorem (Prop. 5.4.3, App. B), which states that for a closed proper convex function f, its conjugate f*, and any pair of vectors (x, y), we have

x ∈ arg min_{z∈ℝ^n} { f(z) − z'y }   iff   y ∈ ∂f(x)   iff   x ∈ ∂f*(y),

with all of these three relations being equivalent to x'y = f(x) + f*(y). Here ∂f(x) denotes the subdifferential of f at x (the set of all subgradients of f at x); see Section 5.4 of Appendix B.
x* ∈ arg min_{x∈ℝ^n} { f_1(x) − x'A'λ* }   and   Ax* ∈ arg min_{z∈ℝ^m} { f_2(z) + z'λ* },    (1.12)

A'λ* ∈ ∂f_1(x*)   and   −λ* ∈ ∂f_2(Ax*),    (1.13)

x* ∈ ∂f_1*(A'λ*)   and   Ax* ∈ ∂f_2*(−λ*).    (1.14)
Minimax Problems
where f : ℝ^n ↦ ℝ and g : ℝ^n ↦ ℝ^m are given functions. Then it is seen that

minimize f(x)
subject to x ∈ X, g(x) ≤ 0.    (1.15)

f_1(x) = { f(x)   if x ∈ X,
           ∞      if x ∉ X.

It can also be verified that the Fenchel dual problem (1.11) is equivalent to maximizing over z ∈ Z the function E(z) = inf_{x∈X} φ(x, z). Again having no duality gap is equivalent to the minimax equality (1.16) holding.
minimize y
subject to x ∈ X, g_j(x) ≤ y, j = 1, ..., r,
Conic Programming
minimize f(x)
subject to x ∈ C,    (1.17)

where

C* = { λ | λ'x ≤ 0, ∀ x ∈ C }

is the polar cone of C (note that f_2* is the support function of C; cf. Section 1.6 of Appendix B). The dual problem is

minimize f*(λ)
subject to λ ∈ Ĉ,    (1.18)

where f* is the conjugate of f and Ĉ is the negative polar cone (also called the dual cone of C):

Ĉ = −C* = { λ | λ'x ≥ 0, ∀ x ∈ C }.
Note the symmetry between primal and dual problems. The strong duality relation f* = q* can be written as

inf_{x∈C} f(x) = − inf_{λ∈Ĉ} f*(λ).

Using the symmetry of the primal and dual problems, we also obtain that there is no duality gap and the primal problem (1.17) has an optimal solution if the optimal value of the dual conic problem (1.18) is finite and ri(dom(f*)) ∩ ri(Ĉ) ≠ Ø. It is also possible to derive primal and dual optimality conditions by translating the optimality conditions of the Fenchel duality framework [Prop. 1.2.1(c)].
f(x) = { c'x   if x ∈ b + S,
         ∞     if x ∉ b + S,

where b and c are given vectors, and S is a subspace. Then the primal problem can be written as

minimize c'x
subject to x − b ∈ S, x ∈ C.    (1.19)
= sup_{y∈S} (λ − c)'(y + b).

It can be seen that the dual problem min_{λ∈Ĉ} f*(λ) [cf. Eq. (1.18)], after discarding the superfluous term c'b from the cost, can be written as

minimize b'λ
subject to λ − c ∈ S⊥, λ ∈ Ĉ.    (1.20)
The primal and dual linear-conic problems (1.19) and (1.20) have been
placed in an elegant symmetric form. There are also other useful formats
that parallel and generalize similar formats in linear programming. For
example, we have the following dual problem pairs:
minimize x'µ
subject to µ − c ∈ N(A)⊥, µ ∈ Ĉ.    (1.23)

Since N(A)⊥ is equal to Ra(A'), the range of A', the constraints of problem (1.23) can be equivalently written as c − µ ∈ −Ra(A') = Ra(A'), µ ∈ Ĉ, or

c − µ = A'λ,   µ ∈ Ĉ,

for some λ ∈ ℝ^m. Making the change of variables µ = c − A'λ, the dual problem (1.23) can be written as
In this section we consider the linear-conic problem (1.22), with the cone

C = { x = (x_1, ..., x_n) | ‖(x_1, ..., x_{n−1})‖ ≤ x_n },

which is known as the second order cone (see Fig. 1.2.2, which depicts the cone in ℝ^3). The dual cone is

Ĉ = { y | 0 ≤ y'x, ∀ x ∈ C },

and we have

inf_{‖(x_1,...,x_{n−1})‖≤x_n} y'x = inf_{x_n≥0} { y_n x_n + inf_{‖(x_1,...,x_{n−1})‖≤x_n} ∑_{i=1}^{n−1} y_i x_i }
                                 = inf_{x_n≥0} { y_n x_n − x_n ‖(y_1, ..., y_{n−1})‖ },

where the second equality follows because the minimum of the inner product of a vector z ∈ ℝ^{n−1} with vectors in the unit ball of ℝ^{n−1} is −‖z‖. Combining the preceding two relations, we have

Ĉ = { y | ‖(y_1, ..., y_{n−1})‖ ≤ y_n },

so Ĉ = C.
The second order cone programming problem (SOCP for short) is

minimize c'x
subject to A_i x − b_i ∈ C_i, i = 1, ..., m,    (1.24)

where x ∈ ℝ^n, c is a vector in ℝ^n, and for i = 1, ..., m, A_i is an n_i × n matrix, b_i is a vector in ℝ^{n_i}, and C_i is the second order cone of ℝ^{n_i}. It is seen to be a special case of the primal problem in the left-hand side of the duality relation (1.22), where

C = C_1 × ··· × C_m.

Note that linear inequality constraints of the form a_i'x − b_i ≥ 0 can be written as
We now observe that from the right-hand side of the duality relation (1.22), and the self-duality relation C = Ĉ, the corresponding dual linear-conic problem has the form

maximize ∑_{i=1}^m b_i'λ_i
subject to ∑_{i=1}^m A_i'λ_i = c, λ_i ∈ C_i, i = 1, ..., m,    (1.25)

where λ = (λ_1, ..., λ_m). By applying the Linear-Conic Duality Theorem (Prop. 1.2.3), we have the following.
then there is no duality gap, and the dual problem has an optimal
solution.
(b) If the optimal value of the dual problem is finite and there exists a feasible solution λ̄ = (λ̄_1, ..., λ̄_m) such that
if and only if
where
(1.27)
For special choices of the set T_j, the function g_j can be expressed in closed form, and in the case where T_j is an ellipsoid, it turns out that the constraint g_j(x) ≤ 0 can be expressed in terms of a second order cone. To see this, let
    (1.28)
and finally
Thus,
if and only if
where C_j is the second order cone of ℝ^{r_j+1}; i.e., the "robust" constraint g_j(x) ≤ 0 is equivalent to a second order cone constraint. It follows that in the case of ellipsoidal uncertainty, the robust linear programming problem (1.26) is an SOCP of the form (1.24).
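The equivalence can be checked numerically. The following Python sketch (with randomly generated ā, P, and x, chosen only for illustration) compares the worst case of the uncertain constraint value over the ellipsoid {ā + Pu | ‖u‖ ≤ 1} with the second order cone expression ā'x + ‖P'x‖.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 5, 3
a_bar = rng.standard_normal(n)      # nominal constraint vector
P = rng.standard_normal((n, r))     # ellipsoid shape: a = a_bar + P u, ||u|| <= 1
x = rng.standard_normal(n)

# Worst case of a'x over the ellipsoid, estimated by sampling the unit sphere.
samples = rng.standard_normal((100000, r))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
worst_sampled = np.max((a_bar + samples @ P.T) @ x)

# Closed form: sup_{||u||<=1} (a_bar + P u)'x = a_bar'x + ||P'x||, so the
# "robust" constraint is the second order cone constraint a_bar'x + ||P'x|| <= b.
worst_exact = a_bar @ x + np.linalg.norm(P.T @ x)

print(worst_sampled, worst_exact)   # the sampled value approaches the closed form
```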
where Q_0, ..., Q_r are symmetric n × n positive definite matrices, q_0, ..., q_r are vectors in ℝ^n, and p_0, ..., p_r are scalars. We show that the problem can be converted to the second order cone format. A similar conversion is also possible for the quadratic programming problem where Q_0 is positive definite and Q_j = 0, j = 1, ..., r.
Indeed, since each Q_j is symmetric and positive definite, we have

x'Q_j x + 2q_j'x + p_j = (Q_j^{1/2}x)'(Q_j^{1/2}x) + 2(Q_j^{−1/2}q_j)'(Q_j^{1/2}x) + p_j
                       = ‖Q_j^{1/2}x + Q_j^{−1/2}q_j‖² + p_j − q_j'Q_j^{−1}q_j.
minimize x_{n+1}

It can be seen that this problem has the second order cone form (1.24). In particular, the first constraint is of the form A_0 x − b_0 ∈ C, where C is the second order cone of ℝ^{n+1} and the (n+1)st component of A_0 x − b_0 is x_{n+1}. The remaining r constraints are of the form A_j x − b_j ∈ C, where the (n+1)st component of A_j x − b_j is the scalar (q_j'Q_j^{−1}q_j − p_j)^{1/2}.
We finally note that the problem of this example is special in that it
has no duality gap, assuming its optimal value is finite, i.e., there is no need
for the interior point conditions of Prop. 1.2.4. This can be traced to the fact
that linear transformations preserve the closure of sets defined by quadratic
constraints (see e.g., [BNO03], Section 1.5.2).
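A quick numerical sanity check of the conversion (a sketch with randomly generated Q, q, p, used only for illustration): for positive definite Q, the quadratic constraint x'Qx + 2q'x + p ≤ 0 holds exactly when ‖Q^{1/2}x + Q^{−1/2}q‖² ≤ q'Q^{−1}q − p.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)          # symmetric positive definite
q = rng.standard_normal(n)
p = -5.0

# Symmetric square root of Q via its eigendecomposition.
w, V = np.linalg.eigh(Q)
Q_half = V @ np.diag(np.sqrt(w)) @ V.T
Q_half_inv = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

for _ in range(1000):
    x = rng.standard_normal(n)
    quad = x @ Q @ x + 2 * q @ x + p
    soc = np.linalg.norm(Q_half @ x + Q_half_inv @ q) ** 2 - (q @ np.linalg.solve(Q, q) - p)
    # The two expressions are equal up to rounding, so the two constraints coincide.
    assert abs(quad - soc) < 1e-8
print("quadratic and second order cone forms of the constraint agree")
```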
In this section we consider the linear-conic problem (1.21) with C being the cone of matrices that are positive semidefinite.† This is called the positive semidefinite cone. To define the problem, we view the space of symmetric n × n matrices as the space ℝ^{n²} with the inner product

⟨X, Y⟩ = trace(XY) = ∑_{i=1}^n ∑_{j=1}^n x_{ij} y_{ij}.
where D, A_1, ..., A_m are given n × n symmetric matrices, and b_1, ..., b_m are given scalars. It is seen to be a special case of the primal problem in the left-hand side of the duality relation (1.21).
We can view the SDP as a problem with linear cost, linear constraints, and a convex set constraint. Then, similar to the case of SOCP, it can be verified that the dual problem (1.20), as given by the right-hand side of the duality relation (1.21), takes the form

minimize z

or equivalently

minimize z
subject to zI − M(λ) ∈ C,

the problem has the form of the dual problem (1.30), with the optimization variables being (z, λ_1, ..., λ_m).
where Q_0, ..., Q_m are symmetric n × n matrices, a_0, ..., a_m are vectors in ℝ^n, and b_0, ..., b_m are scalars.
This problem can be used to model broad classes of discrete optimization problems. To see this, consider an integer constraint that a variable x_i must be either 0 or 1. Such a constraint can be expressed by the quadratic equality x_i² − x_i = 0. Furthermore, a linear inequality constraint a_j'x ≤ b_j can be expressed as the quadratic equality constraint y_j² + a_j'x − b_j = 0, where y_j is an additional variable.
Introducing a multiplier vector λ = (λ_1, ..., λ_m), the dual function is given by

q(λ) = inf_{x∈ℝ^n} { x'Q(λ)x + a(λ)'x + b(λ) },

where

Q(λ) = Q_0 + ∑_{i=1}^m λ_i Q_i,   a(λ) = a_0 + ∑_{i=1}^m λ_i a_i,   b(λ) = b_0 + ∑_{i=1}^m λ_i b_i.
Let f* and q* be the optimal values of problem (1.31) and its dual, and note that by weak duality, we have f* ≥ q*. By introducing an auxiliary scalar variable ξ, we see that the dual problem is to find a pair (ξ, λ) that solves the problem

maximize ξ
subject to q(λ) ≥ ξ.

The constraint q(λ) ≥ ξ of this problem can be written as

inf_{x∈ℝ^n, t∈ℝ} { (tx)'Q(λ)(tx) + a(λ)'(tx)t + (b(λ) − ξ)t² } ≥ 0.

Writing y = tx, this relation takes the form of a quadratic in (y, t),

inf_{y∈ℝ^n, t∈ℝ} { y'Q(λ)y + a(λ)'yt + (b(λ) − ξ)t² } ≥ 0,

or

( Q(λ)       ½a(λ)    )
( ½a(λ)'    b(λ) − ξ )  ∈ C,    (1.32)

where C is the positive semidefinite cone. Thus the dual problem is equivalent to the SDP of maximizing ξ over all (ξ, λ) satisfying the constraint (1.32), and its optimal value q* is a lower bound to f*.
Such cost functions can be minimized with specialized methods, called incremental, which exploit their additive structure, by updating x using one component function f_i at a time (see Section 2.1.5). Problems with additive cost functions can also be treated with specialized outer and inner linearization methods that approximate the component functions f_i individually (rather than approximating f); see Section 4.4.
An important special case is the cost function of the dual of a separable problem

maximize ∑_{i=1}^m q_i(µ)
subject to µ ≥ 0,

where

q_i(µ) = inf_{x_i∈X_i} { f_i(x_i) + ∑_{j=1}^r µ_j g_{ij}(x_i) },

and µ = (µ_1, ..., µ_r) [cf. Eq. (1.8)]. After a sign change to convert to minimization it takes the form (1.33) with f_i(µ) = −q_i(µ). This is a major class of additive cost problems.
We will next describe some applications from a variety of fields. The
following five examples arise in many machine learning contexts.
where Ci and bi are given vectors and scalars, respectively. The regularization
function R is often taken to be differentiable, and particularly quadratic.
However, there are practically important examples of nondifferentiable choices
(see the next example).
In statistical applications, such a problem arises when constructing a
linear model for an unknown input-output relation. The model involves a
vector of parameters x, to be determined, which weigh input data (the com-
ponents of the vectors c_i). The inner products c_i'x produced by the model are matched against the scalars b_i, which are observed output data, corresponding to inputs c_i from the true input-output relation that we try to represent. The optimal vector of parameters x* provides the model that (in the absence of a regularization function) minimizes the sum of the squared errors (c_i'x* − b_i)².
In a more general version of the problem, a nonlinear parametric model is constructed, giving rise to a nonlinear least squares problem of the form

minimize ∑_{i=1}^m (g_i(x))²
subject to x ∈ ℝ^n,

where g_i : ℝ^n ↦ ℝ are given nonlinear functions that depend on the data. This is also a common problem, referred to as nonlinear regression, which, however, is often nonconvex [it is convex if the functions g_i are convex and also nonnegative, i.e., g_i(x) ≥ 0 for all x ∈ ℝ^n].
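As a small illustration of the quadratically regularized linear case (a sketch with synthetic data C, b and a regularization parameter γ chosen for the example), the regularized least squares problem has a closed form solution obtained from the normal equations.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 5
C = rng.standard_normal((m, n))                  # rows are the data vectors c_i'
x_true = rng.standard_normal(n)
b = C @ x_true + 0.1 * rng.standard_normal(m)    # noisy outputs b_i
gamma = 1.0                                      # regularization parameter

# minimize (1/2) sum_i (c_i'x - b_i)^2 + (gamma/2) ||x||^2
# The minimizer solves (C'C + gamma I) x = C'b.
x_hat = np.linalg.solve(C.T @ C + gamma * np.eye(n), C.T @ b)

residuals = C @ x_hat - b
print("true x:  ", np.round(x_true, 3))
print("estimate:", np.round(x_hat, 3))
print("sum of squared errors:", float(residuals @ residuals))
```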
minimize ∑_{i=1}^m |c_i'x − b_i|
subject to x ∈ ℝ^n,

which tends to result in a more robust estimate than least squares in the presence of large outliers in the data. This is known as the least absolute deviations method.
There are also constrained variants of the problems just discussed,
where the parameter vector x is required to belong to some subset of Rn,
such as the nonnegative orthant or a "box" formed by given upper and lower
bounds on the components of x. Such constraints may be used to encode into
the model some prior knowledge about the nature of the solution.
γ is a positive scalar and x_j is the jth coordinate of x. The reason for the popularity of the ℓ_1 norm ‖x‖_1 is that it tends to produce optimal solutions where a greater number of components x_j are zero, relative to the case of quadratic regularization (see Fig. 1.3.1). This is considered desirable in many statistical applications, where the number of parameters to include in a model may not be known a priori; see e.g., [Tib96], [DoE03], [BJM12]. The special case where a linear least squares model is used,

subject to x ∈ ℝ^n,

is known as the total variation denoising problem; see e.g., [ROF92], [Cha04], [BeT09a]. The regularization term here encourages consecutive variables to take similar values, and tends to produce more smoothly varying solutions.
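The zero-producing effect of ℓ_1 regularization is already visible in one dimension: the minimizer of γ|x| + ½(x − b)² is the soft-thresholded value of b, which is exactly zero whenever |b| ≤ γ, whereas quadratic regularization only shrinks b toward zero without ever setting it to zero. The following sketch (with made-up values of b) illustrates this.

```python
import numpy as np

def soft_threshold(b, gamma):
    # argmin_x  gamma*|x| + 0.5*(x - b)^2   (componentwise shrinkage)
    return np.sign(b) * np.maximum(np.abs(b) - gamma, 0.0)

def quadratic_shrink(b, gamma):
    # argmin_x  (gamma/2)*x^2 + 0.5*(x - b)^2
    return b / (1.0 + gamma)

b = np.array([-2.0, -0.5, 0.2, 0.8, 3.0])
gamma = 1.0
print("l1 regularization:       ", soft_threshold(b, gamma))    # small entries become 0
print("quadratic regularization:", quadratic_shrink(b, gamma))  # entries shrink but stay nonzero
```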
Another related example is matrix completion with nuclear norm regularization; see e.g., [CaR09], [CaT10], [RFP10], [Rec11], [ReR13]. Here the minimization is over all m × n matrices X, with components denoted X_ij. We have a set of entries M_ij, (i, j) ∈ Ω, where Ω is a subset of index pairs, and we want to find X whose entries X_ij are close to M_ij for (i, j) ∈ Ω, and has as small rank as possible, a property that is desirable on the basis of statistical considerations. The following more tractable version of the problem is solved instead:

minimize γ‖X‖_* + ½ ∑_{(i,j)∈Ω} (X_ij − M_ij)²
subject to X ∈ ℝ^{m×n},

where ‖X‖_* is the nuclear norm of X, defined as the sum of the singular values of X. There is substantial theory that justifies this approximation, for which we refer to the literature. It turns out that the nuclear norm is a convex function with some nice properties. In particular, its subdifferential at any X can be conveniently characterized for use in algorithms.
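As a brief computational note (an illustrative sketch with a random low-rank matrix), the nuclear norm is computed from the singular values of X, and the shrinkage operation used by many algorithms for this problem soft-thresholds those singular values, which tends to reduce rank.

```python
import numpy as np

def nuclear_norm(X):
    # Sum of the singular values of X.
    return np.linalg.svd(X, compute_uv=False).sum()

def singular_value_shrink(X, gamma):
    # Soft-threshold the singular values of X by gamma; this shrinkage step is
    # the workhorse of many algorithms for nuclear norm regularized problems.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - gamma, 0.0)) @ Vt

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))   # rank <= 4 matrix
print("nuclear norm:", round(nuclear_norm(X), 3))
print("rank after shrinkage:", np.linalg.matrix_rank(singular_value_shrink(X, 2.0)))
```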
if b_i = +1,

subject to x ∈ ℝ^n, y ∈ ℝ,

where R is a suitable regularization function, and h : ℝ ↦ ℝ is a convex function that penalizes negative values of its argument. It would make some sense to use a penalty of one unit for misclassification, i.e.,

h(z) = { 0   if z ≥ 0,
         1   if z < 0,

but such a penalty function is discontinuous. To obtain a continuous cost function, we allow a continuous transition of h from negative to positive
maximize P_Z(z; x)
subject to x ∈ ℝ^n.    (1.34)

The cost function P_Z(z; ·) of this problem may either have an additive structure or may be equivalent to a problem that has an additive structure. For example the event that Z = z may be the union of a large number of disjoint events, so P_Z(z; x) is the sum of the probabilities of these events. For another important context, suppose that the data z consists of m independent samples z_1, ..., z_m drawn from a distribution P(·; x), in which case

P_Z(z; x) = P(z_1; x) ··· P(z_m; x).

Then the maximization (1.34) is equivalent to the additive cost minimization

minimize ∑_{i=1}^m f_i(x)
subject to x ∈ ℝ^n,

where

f_i(x) = −log P(z_i; x).

In many applications the number of samples m is very large, in which case special methods that exploit the additive structure of the cost are recommended. Often a suitable regularization term is added to the cost function, similar to the preceding examples.
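For a concrete instance (a sketch with simulated data), suppose the samples are drawn from the exponential distribution P(z; x) = x e^{−xz} with unknown rate x > 0; then f_i(x) = −log x + x z_i, and the additive cost is minimized at the reciprocal of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(4)
x_true = 2.0
z = rng.exponential(scale=1.0 / x_true, size=10000)   # m independent samples

def f_i(x, z_i):
    # One additive component of the negative log-likelihood for P(z; x) = x*exp(-x z).
    return -np.log(x) + x * z_i

def total_cost(x):
    return sum(f_i(x, z_i) for z_i in z)   # additive cost, one term per sample

x_ml = 1.0 / z.mean()                      # closed-form maximum likelihood estimate
print("ML estimate:", round(x_ml, 3), " true value:", x_true)
print("cost at estimate vs. cost at x = 1:",
      round(total_cost(x_ml), 1), "vs.", round(total_cost(1.0), 1))
```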
where w is a random variable taking a finite but very large number of values w_i, i = 1, ..., m, with corresponding probabilities π_i. Then the cost function consists of the sum of the m functions π_i F(x, w_i).
For example, in stochastic programming, a classical model of two-stage optimization under uncertainty, a vector x ∈ X is selected, a random event occurs that has m possible outcomes w_1, ..., w_m, and another vector y ∈ Y is selected with knowledge of the outcome that occurred (see e.g., the books [BiL97], [KaW94], [Pre95], [SDR09]). Then for optimization purposes, we need to specify a different vector y_i ∈ Y for each outcome w_i. The problem is to minimize the expected cost

F(x) + ∑_{i=1}^m π_i G_i(y_i),

where G_i(y_i) is the cost associated with the choice y_i and the occurrence of w_i, and π_i is the corresponding probability. This is a problem with an additive cost function.
Additive cost functions also arise when the expected value cost function E{F(x, w)} is approximated by an m-sample average

f(x) = (1/m) ∑_{i=1}^m F(x, w_i),
minimize ∑_{i=1}^m w_i ‖x − y_i‖
subject to x ∈ ℝ^n,

where w_1, ..., w_m are given positive scalars. This problem has many variations, including constrained versions, and descends from the famous Fermat-Torricelli-Viviani problem (see [BMS99] for an account of the history of this problem). We refer to the book [DrH04] for a survey of recent research, and to the paper [BeT10] for a discussion that is relevant to our context.
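As a small computational illustration (with made-up points y_i and weights w_i), the cost above is nondifferentiable at the points y_i, and a simple subgradient iteration of the kind studied in Chapter 3 can be used to approximately minimize it.

```python
import numpy as np

y = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])    # anchor points y_i
w = np.array([1.0, 1.0, 1.0])                          # positive weights w_i

def cost(x):
    return float(np.sum(w * np.linalg.norm(x - y, axis=1)))

def subgradient(x):
    # A subgradient of sum_i w_i ||x - y_i||; at x == y_i the zero vector
    # is used for that (nondifferentiable) term.
    diffs = x - y
    norms = np.linalg.norm(diffs, axis=1)
    g = np.zeros(2)
    for wi, d, nd in zip(w, diffs, norms):
        if nd > 1e-12:
            g += wi * d / nd
    return g

x = np.array([10.0, 10.0])
for k in range(1, 2001):
    x = x - (1.0 / k) * subgradient(x)    # diminishing stepsize
print("approximate minimizer:", np.round(x, 3), " cost:", round(cost(x), 4))
```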
The structure of the additive cost function (1.33) often facilitates the
use of a distributed computing system that is well-suited for the incremental
approach. The following is an illustrative example.
Consider a network of m sensors where data are collected and are used to solve
some inference problem involving a parameter vector x. If f_i(x) represents an error penalty for the data collected by the ith sensor, the inference problem involves an additive cost function ∑_{i=1}^m f_i. While it is possible to collect all the data at a fusion center where the problem will be solved in a centralized manner, it may be preferable to adopt a distributed approach in order to
save in data communication overhead and/or take advantage of parallelism
in computation. In such an approach the current iterate Xk is passed on from
one sensor to another, with each sensor i performing an incremental iteration
involving just its local component f_i. The entire cost function need not be
known at any one location. For further discussion we refer to representative
sources such as [RaN04], [RaN05], [BHG08], [MRS10], [GSW12], and [Say14].
The approach of computing incrementally the values and subgradients
of the components f_i in a distributed manner can be substantially extended
to apply to general systems of asynchronous distributed computation, where
the components are processed at the nodes of a computing network, and the
results are suitably combined [NBB01] (see our discussion in Sections 2.1.5
and 2.1.6).
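A minimal sketch of the incremental idea (with synthetic data, and components f_i(x) = ½(c_i'x − b_i)² standing in for the per-sensor penalties): the iterate is updated by cycling through the components one at a time, so no single location ever needs the full cost function.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 50, 3
C = rng.standard_normal((m, n))
b = C @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.standard_normal(m)

def component_gradient(x, i):
    # Gradient of the single component f_i(x) = 0.5*(c_i'x - b_i)^2.
    return (C[i] @ x - b[i]) * C[i]

x = np.zeros(n)
for cycle in range(200):
    alpha = 1.0 / (cycle + 10)                 # diminishing stepsize
    for i in range(m):                         # one component (e.g., one sensor) at a time
        x = x - alpha * component_gradient(x, i)

print("incremental estimate:  ", np.round(x, 3))
print("least squares solution:", np.round(np.linalg.lstsq(C, b, rcond=None)[0], 3))
```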
X_i = dom(f_i),

resulting in a problem of the form

minimize ∑_{i=1}^m f_i(x)
subject to x ∈ ∩_{i=1}^m X_i,

where each f_i is real-valued over the set X_i. Methods that are well-suited for the unconstrained version of the problem where X_i = ℝ^n can often be modified to apply to the constrained version, as we will see in Chapter 6, where we will discuss incremental constraint projection methods. However, the case of constraint sets with many components arises independently of whether the cost function is additive or not, and has its own character, as we discuss in the next section.
1.4 LARGE NUMBER OF CONSTRAINTS
minimize f(x)
subject to x ∈ X, g_j(x) ≤ 0, j = 1, ..., r,    (1.35)

where the number r of constraints is very large. Problems of this type occur often in practice, either directly or via reformulation from other problems. A similar type of problem arises when the abstract constraint set X consists of the intersection of many simpler sets:

X = ∩_{ℓ∈L} X_ℓ,

where L is a finite or infinite index set. There may or may not be additional inequality constraints g_j(x) ≤ 0 like the ones in problem (1.35). We provide a few examples.
A simple but important problem, which arises in many contexts and embodies important algorithmic ideas, is a classical feasibility problem, where the objective is to find a common point within a collection of sets X_ℓ, ℓ ∈ L, where each X_ℓ is a closed convex set. In the feasibility problem the cost function is zero. A somewhat more complex problem with a similar structure arises when there is a cost function, i.e., a problem of the form

minimize f(x)
subject to x ∈ ∩_{ℓ∈L} X_ℓ,
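A classical algorithmic idea for the feasibility problem is to project cyclically onto the sets X_ℓ. The following sketch (with two illustrative sets, a ball and a halfspace, chosen only for the example) alternates the two projections until a point in the intersection is found.

```python
import numpy as np

# X1: ball of radius 2 centered at the origin; X2: halfspace {x | a'x >= 1}.
a = np.array([1.0, 1.0])

def project_ball(x, radius=2.0):
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def project_halfspace(x):
    violation = 1.0 - a @ x
    return x if violation <= 0 else x + (violation / (a @ a)) * a

x = np.array([10.0, -7.0])
for k in range(200):
    x = project_halfspace(project_ball(x))   # one cycle of alternating projections

in_ball = np.linalg.norm(x) <= 2.0 + 1e-6
in_halfspace = a @ x >= 1.0 - 1e-6
print("point:", np.round(x, 4), " feasible for both sets:", in_ball and in_halfspace)
```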
weights for a large number n of basis functions (m < n). We want to satisfy exactly the measurement equations Ax = b, while using only a few of the basis functions in our model. Consequently, we introduce the ℓ_1 norm in the cost function of problem (1.36), aiming to delineate a small subset of basis functions, corresponding to nonzero coordinates of x at the optimal solution. This is called the basis pursuit problem (see, e.g., [CDS01], [VaF08]), and its underlying idea is similar to the one of ℓ_1-regularization (cf. Example 1.3.2).
It is also possible to consider a norm other than ℓ_1 in Eq. (1.36). An example is the atomic norm ‖·‖_A induced by a subset A that is centrally symmetric around the origin (a ∈ A if and only if −a ∈ A):

This problem, and other related problems involving atomic norms, have many applications; see for example [CRP12], [SBT12], [RSW13].
A related problem is

minimize ‖X‖_*
subject to AX = B,

where the optimization is over all m × n matrices X. The matrices A, B are given and have dimensions ℓ × m and ℓ × n, respectively, and ‖X‖_* is the nuclear norm of X. This problem aims to produce a low-rank matrix X that satisfies an underdetermined set of linear equations AX = B (see e.g., [CaR09], [RFP10], [RXB11]). When these equations specify that a subset of entries X_ij, (i, j) ∈ Ω, are fixed at given values M_ij,

X_ij = M_ij,   (i, j) ∈ Ω,

we obtain an alternative formulation of the matrix completion problem discussed in Example 1.3.2.
minimize y

which involves a large number of constraints (one constraint for each z in the set Z, which could be infinite). Of course in this problem the set X may also be of the form X = ∩_{ℓ∈L} X_ℓ as in earlier examples.
minimize ∑_{i=1}^m f_i(y_i)
subject to ∑_{i=1}^m g_{ij}(y_i) ≤ 0, ∀ j = 1, ..., r,  y ≥ 0,    (1.37)

φ_i'x ≥ 0,   i = 1, ..., m,

where φ_i denotes the ith row of Φ, and φ_i'x is viewed as an approximation of y_i. Thus the dimension of the problem is reduced from m to n. However, the constraint set of the problem became more complicated, because the simple constraints y_i ≥ 0 take the more complex form φ_i'x ≥ 0. Moreover the number m of additive components in the cost function, as well as the number of its constraints, is still large. Thus the problem has the additive cost structure of the preceding section, as well as a large number of constraints.
An important application of this approach is in approximate dynamic programming (see e.g., [BeT96], [SuB98], [Pow11], [Ber12]), where the functions f_i and g_{ij} are linear. The corresponding problem (1.37) relates to the solution of the optimality condition (Bellman equation) of an infinite horizon Markovian decision problem (the constraint y ≥ 0 may not be present in this context). Here the numbers m and r are often astronomical (in fact r can be much larger than m), in which case an exact solution cannot be obtained. For such problems, approximation based on problem (1.38) has been one of the major algorithmic approaches (see [Ber12] for a textbook presentation and references). For very large m, it may be impossible to calculate the cost function value ∑_{i=1}^m f_i(φ_i'x) for a given x, and one may at most be able to sample individual cost components f_i. For this reason optimization by stochastic simulation is one of the most prominent approaches in large scale dynamic programming.
Let us also mention that related approaches based on randomization and simulation have been proposed for the solution of large scale instances of classical linear algebra problems; see [BeY09], [Ber12] (Section 7.3), [DMM06], [StV09], [HMT10], [Nee10], [DMM11], [WaB13a], [WaB13b].
The collection of all path flows {x_p | p ∈ P_w, w ∈ W} must satisfy the constraints

∑_{p∈P_w} x_p = r_w,   ∀ w ∈ W,    (1.39)

F_ij = ∑_{all paths p containing (i,j)} x_p,    (1.41)

∑_{(i,j)} D_ij(F_ij).    (1.42)

The problem is to find a set of path flows {x_p} that minimize this cost function subject to the constraints of Eqs. (1.39)-(1.41). It is typically assumed that D_ij is a convex function of F_ij. In data routing applications, the form of D_ij is often based on a queueing model of average delay, in which case D_ij is continuously differentiable within its domain (see e.g., [BeG92]). In a related context, arising in optical networks, the problem involves additional integer constraints on x_p, but may be addressed as a problem with continuous flow variables (see [OzB03]).
minimize D(x)
subject to ∑_{p∈P_w} x_p = r_w, ∀ w ∈ W,
x_p ≥ 0, ∀ p ∈ P_w, w ∈ W,

where

D(x) = ∑_{(i,j)} D_ij( ∑_{all paths p containing (i,j)} x_p ),

and x is the vector of path flows x_p. There is a potentially huge number of variables as well as constraints in this problem. However, by judiciously taking into account the special structure of the problem, the constraint set can be simplified and approximated by the convex hull of a small number of vectors x, and the number of variables and constraints can be reduced to a manageable size (see e.g., [BeG83], [FlH95], [OMV00], and our discussion in Section 4.2).
subject to x ∈ X,

where P(·) is a scalar penalty function satisfying P(u) = 0 if u ≤ 0, and P(u) > 0 if u > 0, and c is a positive penalty parameter. We discuss this
possibility in the next section.
1.5 EXACT PENALTY FUNCTIONS
minimize f(x)
subject to x ∈ X, g_j(x) ≤ 0, j = 1, ..., r,    (1.43)

where X is a convex subset of ℝ^n, and f : X → ℝ and g_j : X → ℝ are given convex functions. We denote by f* the primal optimal value, and by q* the dual optimal value, i.e.,

q* = sup_{µ≥0} q(µ),

where

q(µ) = inf_{x∈X} { f(x) + µ'g(x) },   ∀ µ ≥ 0,

with g(x) = (g_1(x), ..., g_r(x))'. We assume that −∞ < q* = f* < ∞.
We introduce a convex penalty function P : ℝ^r ↦ ℝ, which satisfies

and

P(u) = c ∑_{j=1}^r max{0, u_j},
Figure 1.5.1. Illustration of various penalty functions P(u) and their conjugates Q(µ); for P(u) = c·max{0, u}, the conjugate is Q(µ) = 0 if 0 ≤ µ ≤ c, and ∞ otherwise.
We have

inf_{x∈X} { f(x) + P(g(x)) } = inf_{x∈X} inf_{u∈ℝ^r, g(x)≤u} { f(x) + P(u) }
                             = inf_{x∈X, u∈ℝ^r, g(x)≤u} { f(x) + P(u) }
                             = inf_{u∈ℝ^r} inf_{x∈X, g(x)≤u} { f(x) + P(u) }
                             = inf_{u∈ℝ^r} { p(u) + P(u) }.
Moreover, −∞ < q* and f* < ∞ by assumption, and since for any µ with q(µ) > −∞, we have
it follows that p(0) < ∞ and p(u) > −∞ for all u ∈ ℝ^r, so p is proper.
We can now apply the Fenchel Duality Theorem (Prop. 1.2.1) with the identifications f_1 = p, f_2 = P, and A = I. We use the conjugacy relation between the primal function p and the dual function q to write
so that

inf_{x∈X} { f(x) + P(g(x)) } = sup_{µ≥0} { q(µ) − Q(µ) };    (1.48)
see Fig. 1.5.2. Note that the conditions for application of the theorem are
satisfied since the penalty function P is real-valued, so that the relative
j+Q(µ)
]= inf {f(x)+P(g(x))}.
xEX
q(µ)
/
f
0 µ
(c) In order for the penalized problem (1.46) and the constrained
problem (1.43) to have the same set of optimal solutions, it is
sufficient that there exists a dual optimal solution µ* such that
u'µ* < P(u),   ∀ u ∈ ℝ^r with u_j > 0 for some j.    (1.50)
if and only if equality holds in Eq. (1.51). This is true if and only if

0 ∈ arg min_{u∈ℝ^r} { p(u) + P(u) },

which by Prop. 5.4.7 in Appendix B, is true if and only if there exists some µ* ∈ −∂p(0) with µ* ∈ ∂P(0) (in view of the fact that P is real-valued). Since the set of dual optimal solutions is −∂p(0) (under our assumption −∞ < q* = f* < ∞; see Example 5.4.2, [Ber09]), the result follows.
(b) If x* is an optimal solution of both problems (1.43) and (1.46), then by
feasibility of x*, we have P(g(x*)) = 0, so these two problems have equal
optimal values. From part (a), there must exist a dual optimal solution
µ* ∈ ∂P(0), which is equivalent to Eq. (1.49), by the subgradient inequality.
(c) If x* is an optimal solution of the constrained problem (1.43), then
P(g(x*)) = 0, so we have

f* = f(x*) = f(x*) + P(g(x*)) ≥ inf_{x∈X} { f(x) + P(g(x)) }.
The condition (1.50) implies the condition (1.49), so that by part (a),
equality holds throughout in the above relation, showing that x* is also
an optimal solution of the penalized problem (1.46).
Conversely, let x* ∈ X be an optimal solution of the penalized problem (1.46). If x* is feasible [i.e., satisfies in addition g(x*) ≤ 0], then it is an optimal solution of the constrained problem (1.43) [since P(g(x)) = 0 for all feasible vectors x], and we are done. Otherwise x* is infeasible in which case g_j(x*) > 0 for some j. Then, by using the given condition (1.50), it follows that there exists a dual optimal solution µ* and an ε > 0 such that
so q(µ) = 0 for µ ∈ [1, ∞) and q(µ) = −∞ otherwise. Let P(u) = c·max{0, u}, so the penalized problem is min_{x≥0} { −x + c·max{0, x} }. Then parts (a) and (b) of the proposition apply if c ≥ 1. However, part (c) applies only if c > 1. In terms of Fig. 1.5.2, the conjugate of P is Q(µ) = 0 if µ ∈ [0, c] and Q(µ) = ∞ otherwise, so when c = 1, Q is "flat" over an area not including an interior point of the dual optimal solution set [1, ∞).
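To make the threshold behavior concrete, consider the illustrative one-dimensional problem (made up for this sketch) of minimizing (x − 2)² subject to x ≤ 1, which has optimal solution x* = 1 and dual optimal solution µ* = 2. The sketch below minimizes the penalized function (x − 2)² + c·max{0, x − 1} for several values of c: for c < 2 the penalized minimizer 2 − c/2 is infeasible, while once c reaches the multiplier level it coincides with x*.

```python
import numpy as np

def penalized_minimizer(c):
    # minimize F_c(x) = (x - 2)**2 + c*max(0, x - 1) over a fine grid.
    xs = np.linspace(-1.0, 4.0, 200001)
    Fc = (xs - 2.0) ** 2 + c * np.maximum(0.0, xs - 1.0)
    return xs[np.argmin(Fc)]

# Constrained problem: minimize (x - 2)^2 subject to x <= 1, with solution x* = 1
# and dual optimal solution mu* = 2.
for c in [0.5, 1.0, 2.0, 3.0, 10.0]:
    x_c = penalized_minimizer(c)
    x_analytic = 1.0 if c >= 2.0 else 2.0 - c / 2.0
    print(f"c = {c:5.1f}:  penalized minimizer ~ {x_c:.3f}  (exact: {x_analytic:.3f})")
```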
To elaborate on the idea of the preceding example, let

P(u) = c ∑_{j=1}^r max{0, u_j},

u'µ* ≤ P(u),

[cf. Eq. (1.49)], is equivalent to

µ_j* ≤ c,   ∀ j = 1, ..., r.

Similarly, the condition u'µ* < P(u) for all u ∈ ℝ^r with u_j > 0 for some j [cf. Eq. (1.50)], is equivalent to

µ_j* < c,   ∀ j = 1, ..., r.
The reader may consult the literature for other results on exact penalty
functions, starting with their first proposal in the book [Zan69]. The pre-
ceding development is based on [Ber75], and focuses on convex program-
ming problems. For additional representative references, some of which
also discuss nonconvex problems, see [HaM79], [Ber82a], [Bur91], [FeM91],
[BNO03], [FrT07]. In what follows we develop an exact penalty function
result for the case of an abstract constraint set, which will be used in the
context of incremental constraint projection algorithms in Section 6.4.4.
dist(x; X) = inf_{y∈X} ‖x − y‖.

The next proposition from [Ber11] provides the basic result (see Fig. 1.5.3).
over Y.
F_c(x) = f(x) + c‖x − x̂‖ = f(x̂) + (f(x) − f(x̂)) + c‖x − x̂‖ ≥ f(x̂) = F_c(x̂),

with strict inequality if x ≠ x̂. Thus all minima of F_c over Y must lie in X, and also minimize f over X (since F_c = f on X). Conversely, all minima
Figure 1.5.3. Illustration of Prop. 1.5.2. For c greater than the Lipschitz constant of f, the "slope" of the penalty function counteracts the "slope" of f at the optimal solution x*.
over X_0.

Proof: Let L be the Lipschitz constant for f, and let c_1, ..., c_m be scalars satisfying

c_k > L + c_1 + ··· + c_{k−1},   ∀ k = 1, ..., m,

where c_0 = 0. Define

H_k over ( ∩_{i=k+1}^m X_i ) ∩ X_0 coincides with the set of minima of H_{k−1} over ( ∩_{i=k}^m X_i ) ∩ X_0. Thus, for k = 1, we obtain that the set of minima of H_m over X_0 coincides with the set of minima of H_0, which is f, over ∩_{i=0}^m X_i.
Let

X* ⊂ ∩_{i=0}^m X_i

be this set of minima. For c ≥ c_m, we have F_c ≥ H_m, while F_c coincides with H_m on X*. Hence X* is the set of minima of F_c over X_0. Q.E.D.
We finally note that exact penalty functions, and particularly the dis-
tance function dist(x; Xi), are often relatively convenient in various con-
texts where difficult constraints complicate the algorithmic solution. As an
example, see Section 6.4.4, where incremental proximal methods for highly
constrained problems are discussed.
dimensional spaces. The book by Rockafellar and Wets [RoW98] also has
a substantial finite-dimensional treatment of this subject. The books by
Cottle, Pang, and Stone [CPS92], and Facchinei and Pang [FaP03] focus on
complementarity and variational inequality problems. The books by Palo-
mar and Eldar [PaE10], and Vetterli, Kovacevic, and Goyal [VKG14], and
the surveys in the May 2010 issue of the IEEE Signal Processing Magazine
describe applications of convex optimization in communications and sig-
nal processing. The books by Hastie, Tibshirani, and Friedman [HTF09],
and Sra, Nowozin, and Wright [SNW12] describe applications of convex
optimization in machine learning.
EXERCISES
subject to x ∈ ℝ^n, y ∈ ℝ,

with quadratic regularization, where β is a positive regularization parameter (cf. Example 1.3.3).
(a) Write the problem in the equivalent form

where

q(µ) = ∑_{i=1}^m µ_i − ½ ∑_{i=1}^m ∑_{j=1}^m b_i b_j c_i'c_j µ_i µ_j.
always have a solution? Is the solution unique? Note: The dual problem
may have high dimension, but it has a generally more favorable structure
than the primal. The reason is the simplicity of its constraint set, which
makes it suitable for special types of quadratic programming methods, and
the two-metric projection and coordinate descent methods of Section 2.1.2.
(b) Consider an alternative formulation where the variable y is set to 0, leading to the problem

minimize ½‖x‖² + β ∑_{i=1}^m max{0, 1 − b_i c_i'x}
subject to x ∈ ℝ^n.

Show that the dual problem should be modified so that the constraint ∑_{j=1}^m µ_j b_j = 0 is not present, thus leading to a bound-constrained quadratic dual problem.
Note: The literature of the support vector machine field is extensive. Many of the
nondifferentiable optimization methods to be discussed in subsequent chapters
have been applied in connection to this field; see e.g., [MaM01], [FeM02], [SmS04], [Bot05], [Joa06], [JFY09], [JoY09], [SSS07], [LeW11].
minimize ∑_{i=1}^p ‖F_i x + g_i‖
subject to x ∈ ℝ^n,    (1.53)

and

minimize max_{i=1,...,p} ‖F_i x + g_i‖
subject to x ∈ ℝ^n,

where F_i and g_i are given matrices and vectors, respectively. Convert these problems to second order cone form and derive the corresponding dual problems.
1.4
The purpose of this exercise is to show that the SOCP can be viewed as a special
case of SDP.
(a) Show that a vector x ∈ ℝ^n belongs to the second order cone if and only if the matrix

( x_n I    x̄  )
( x̄'      x_n )

where x̄ = (x_1, ..., x_{n−1}) and I is the (n−1) × (n−1) identity matrix, is positive semidefinite. Hint: We have that for any positive definite symmetric n × n matrix A, vector b ∈ ℝ^n, and scalar d, the matrix

( A    b )
( b'   d )

(b) Use part (a) to show that the primal SOCP can be written in the form of the dual SDP.
i = 1, ..., m,

i = 1, ..., m,

where µ_i ∈ ℝ^{n_i−1} and ν_i ∈ ℝ. Show that the dual problem (1.25) can be written in the form

(c) Show that the primal and dual interior point conditions for strong duality (Prop. 1.2.4) hold if there exist primal and dual feasible solutions x and (µ_i, ν_i) such that

i = 1, ..., m,

and

i = 1, ..., m,

respectively.
minimize ∑_{i=1}^m f_i(x_i)
subject to x ∈ S ∩ C,

where x = (x_1, ..., x_m) with x_i ∈ ℝ^{n_i}, i = 1, ..., m, and f_i : ℝ^{n_i} ↦ (−∞, ∞] is a proper convex function for each i, and S and C are a subspace and a cone of ℝ^{n_1+···+n_m}, respectively. Show that a dual problem is

maximize ∑_{i=1}^m q_i(λ_i)
subject to λ ∈ Ĉ + S⊥,

i = 1, ..., m.
1.7 (Weber Points)

Consider the problem of finding a circle of minimum radius that contains r points y_1, ..., y_r in the plane, i.e., find x and z that minimize z subject to ‖x − y_j‖ ≤ z for all j = 1, ..., r, where x is the center of the circle under optimization.
(a) Introduce multipliers µ_j, j = 1, ..., r, for the constraints, and show that the dual problem has an optimal solution and there is no duality gap.
(b) Show that calculating the dual function at some µ ≥ 0 involves the computation of a Weber point of y_1, ..., y_r with weights µ_1, ..., µ_r, i.e., the solution of the problem
Let g_j : ℝ^n ↦ ℝ, j = 1, ..., r, be convex functions over the nonempty convex set X ⊂ ℝ^n. Show that the system

has no solution within X if and only if there exists a vector µ ∈ ℝ^r such that

µ ≥ 0,
µ'g(x) ≥ 0,   ∀ x ∈ X.
minimize y
subject to x ∈ X, y ∈ ℝ, g_j(x) ≤ y, j = 1, ..., r.
2
Optimization Algorithms:
An Overview
x_{k+1} = x_k − α_k ∇f(x_k),    (2.4)

where the stepsize α_k is not constant. Still many of these algorithms admit a convergence analysis based on a descent approach, whereby we introduce a function φ that measures the progress of the algorithm towards optimality, and show that

Two common cases are when φ(x) = f(x) or φ(x) = dist(x, X*), the Euclidean minimum distance of x from the set X* of minima of f. For example convergence of the gradient algorithm (2.4) is often analyzed by showing that for all k,

where γ_k is a positive scalar that depends on α_k and some characteristics of f, and is such that ∑_{k=0}^∞ γ_k = ∞; this brings to bear the convergence

k = 0, 1, ...,
(cf. Section A.3 of Appendix A). From this formula it follows that if d_k is a descent direction at x_k, in the sense that

f'(x_k; d_k) < 0,

we may reduce the cost by moving from x_k along d_k with a small enough positive stepsize α. In the unconstrained case where X = ℝ^n, this leads to an algorithm of the form

x_{k+1} = x_k + α_k d_k.    (2.6)
For the case where f is differentiable and X = ℝ^n, there are many popular descent algorithms of the form (2.6). An important example is the classical gradient method, where we use d_k = −∇f(x_k) in Eq. (2.6):

x_{k+1} = x_k − α_k ∇f(x_k).

Since for differentiable f we have

f'(x_k; d) = ∇f(x_k)'d,

it follows that

− ∇f(x_k)/‖∇f(x_k)‖ = arg min_{‖d‖≤1} f'(x_k; d)

[assuming ∇f(x_k) ≠ 0]. Thus the gradient method is the descent algorithm of the form (2.6) that uses the direction that yields the greatest rate of cost improvement. For this reason it is also called the method of steepest descent.
Let us now discuss the convergence rate of the steepest descent method, assuming that f is twice continuously differentiable. With proper stepsize choice, it can be shown that the method has a linear rate, assuming that it generates a sequence {x_k} that converges to a vector x* such that ∇f(x*) = 0 and ∇²f(x*) is positive definite. For example, if α_k is a sufficiently small constant α > 0, the corresponding iteration

x_{k+1} = x_k − α∇f(x_k),    (2.7)

can be shown to be contractive within a sphere centered at x*, so it converges linearly.
To get a sense of this, assume for convenience that f is quadratic,† so by adding a suitable constant to f, we have

f(x) = ½(x − x*)'Q(x − x*),   ∇f(x) = Q(x − x*),
† Convergence analysis using a quadratic model is commonly used in nonlin-
ear programming. The rationale is that behavior of an algorithm for a positive
definite quadratic cost function is typically a correct predictor of its behavior for
a twice differentiable cost function in the neighborhood of a minimum where the
Hessian matrix is positive definite. Since the gradient is zero at that minimum,
the positive definite quadratic term dominates the other terms in the Taylor se-
ries expansion, and the asymptotic behavior of the method does not depend on
terms of order higher than two.
This time-honored line of analysis underlies some of the most widely used
unconstrained optimization methods, such as Newton, quasi-Newton, and conju-
gate direction methods, which will be briefly discussed later. However, the ratio-
nale for these methods is weakened when the Hessian is singular at the minimum,
since in this case third order terms may become significant. For this reason, when
considering algorithmic options for a given differentiable optimization problem,
it is important to consider (in addition to its cost function structure) whether
the problem is "singular" or "nonsingular."
where Q is the positive definite symmetric Hessian of f. Then for a constant stepsize α, the steepest descent iteration (2.7) can be written as

x_{k+1} − x* = (I − αQ)(x_k − x*).

For α < 2/λ_max, where λ_max is the largest eigenvalue of Q, the matrix I − αQ has eigenvalues strictly within the unit circle, and is a contraction with respect to the Euclidean norm. It can be shown (cf. Exercise 2.1) that the optimal modulus of contraction can be achieved with the stepsize choice

α* = 2/(M + m),

where M and m are the maximum and minimum eigenvalues of Q. With this stepsize, we obtain the linear convergence rate estimate

‖x_{k+1} − x*‖ ≤ ( (M/m − 1)/(M/m + 1) ) ‖x_k − x*‖.    (2.8)
With this rule, when the steepest descent method converges to a vector x* such that ∇f(x*) = 0 and ∇²f(x*) is positive definite, its convergence rate is also linear, but not faster than the one of Eq. (2.8), which is associated with an optimally chosen constant stepsize (see [Ber99], Section 1.3).
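The linear convergence estimate is easy to reproduce numerically. The following sketch (with an illustrative diagonal Q chosen for the example) runs the constant stepsize iteration (2.7) with α = 2/(M + m) and prints the error ratios, which settle at (M/m − 1)/(M/m + 1).

```python
import numpy as np

Q = np.diag([1.0, 10.0])        # eigenvalues m = 1, M = 10
x_star = np.array([0.0, 0.0])   # minimum of f(x) = 0.5*(x - x*)'Q(x - x*)

def grad(x):
    return Q @ (x - x_star)

m_eig, M_eig = 1.0, 10.0
alpha = 2.0 / (M_eig + m_eig)                                  # stepsize 2/(M + m)
target_ratio = (M_eig / m_eig - 1.0) / (M_eig / m_eig + 1.0)

x = np.array([1.0, 1.0])
prev_err = np.linalg.norm(x - x_star)
for k in range(15):
    x = x - alpha * grad(x)                                    # steepest descent iteration (2.7)
    err = np.linalg.norm(x - x_star)
    print(f"k = {k:2d}  error = {err:.3e}  ratio = {err / prev_err:.4f}")
    prev_err = err
print("predicted asymptotic ratio:", target_ratio)
```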
If the method converges to an optimal point x* where the Hessian matrix ∇²f(x*) is singular or does not exist, the convergence rate that we can guarantee is typically slower than linear. For example, with a properly chosen constant stepsize, and under some reasonable conditions (Lipschitz continuity of ∇f), we can show that

f(x_k) − f* ≤ c(x_0)/k,   k = 1, 2, ...,    (2.9)

where f* is the optimal value of f and c(x_0) is a constant that depends on the initial point x_0 (see Section 6.1).
For problems where v' f is continuous but cannot be assumed Lips-
chitz continuous at or near the minimum, it is necessary to use a stepsize
rule that can produce time-varying stepsizes. For example in the scalar case
where f(x) = lxl 312 , the steepest descent method with any constant step-
size oscillates around the minimum x* = 0, because the gradient grows too
fast around x*. However, the line minimization rule as well as other rules,
such as the Armijo rule to be discussed shortly, guarantee a satisfactory
form of convergence (see the end-of-chapter exercises and the discussion of
Section 6.1).
On the other hand, with additional assumptions on the structure of
f, we can obtain a faster convergence than the O(1/k) estimate on the
cost function error of Eq. (2.9). In particular, the rate of convergence to
a singular minimum depends on the order of growth of the cost function
near that minimum; see [Dun81], which shows that if f is convex, has a
unique minimum x*, and satisfies the growth condition
for some scalars β > 0 and γ > 2, then for the method of steepest descent
with the Armijo rule and other related rules we have
Thus for example, with a quartic order of growth of f (γ = 4), an O(1/k²) estimate is obtained for the cost function error after k iterations. The paper
[Dun81] provides a more comprehensive analysis of the convergence rate of
gradient-type methods based on order of growth conditions, including cases
where the convergence rate is linear and faster than linear.
Scaling
To improve the convergence rate of the steepest descent method one may "scale" the gradient ∇f(x_k) by multiplication with a positive definite symmetric matrix D_k, i.e., use a direction d_k = −D_k∇f(x_k), leading to the algorithm

x_{k+1} = x_k − α_k D_k ∇f(x_k);    (2.11)

cf. Fig. 2.1.1. Since for ∇f(x_k) ≠ 0 we have

it follows that we still have a cost descent method, as long as the positive stepsize α_k is sufficiently small so that f(x_{k+1}) < f(x_k).
for the equivalent problem of minimizing the function h_k(y) = f(D_k^{1/2} y). For a quadratic problem, where f(x) = ½x'Qx − b'x, the condition number of h_k is the ratio of largest to smallest eigenvalue of the matrix D_k^{1/2} Q D_k^{1/2} (rather than Q).
Much of unconstrained nonlinear programming methodology deals with ways to compute "good" scaling matrices D_k, i.e., matrices that result in fast convergence rate. The "best" scaling in this sense is attained with

D_k = (∇²f(x_k))^{−1},

assuming that the inverse above exists and is positive definite, which asymptotically leads to an "effective condition number" of 1. This is Newton's method, which will be discussed shortly. A simpler alternative is to use a diagonal approximation to the Hessian matrix ∇²f(x_k), i.e., the diagonal
i = 1, ... ,n,
along the diagonal. This often improves the performance of the classical gradient method dramatically, by providing automatic scaling of the units in which the components x^i of x are measured, and also facilitates the choice of stepsize - good values of α_k are often close to 1 (see the subsequent discussion of Newton's method and sources such as [Ber99], Section 1.3).
The nonlinear programming methodology also prominently includes
quasi-Newton methods, which construct scaling matrices iteratively, using
gradient information collected during the algorithmic process (see nonlin-
ear programming textbooks such as [Pol71], [GMW81], [Lue84], [DeS96],
[Ber99], [Fle00], [NoW06], [LuY08]). Some of these methods approximate
the full inverse Hessian of f, and eventually attain the fast convergence
rate of Newton's method. Other methods use a limited number of gradient
vectors from previous iterations (have "limited memory") to construct a
relatively crude but still effective approximation to the Hessian off, and
attain a convergence rate that is considerably faster than the one of the
unscaled gradient method; see [Noc80], [NoW06].
x_{k+1} = x_k − α_k ∇f(x_k) + β_k(x_k − x_{k−1}),    (2.12)

where β_k is a scalar in [0, 1), and we define x_{−1} = x_0. When α_k and β_k are chosen to be constant scalars α and β, respectively, the method is known as the heavy ball method [Pol64]; see Fig. 2.1.2. This is a sound method with guaranteed convergence under a Lipschitz continuity assumption on ∇f. It can be shown to have faster convergence rate than the corresponding gradient method where α_k is constant and β_k = 0 (see [Pol87], Section 3.2.1, or [Ber99], Section 1.3). In particular, for a positive definite quadratic problem, and with optimal choices of the constants α and β, the convergence rate of the heavy ball method is linear, and is governed by the formula (2.8) but with √(M/m) in place of M/m. This is a substantial improvement over the steepest descent method, although the method can still be very slow. Simple examples also suggest that with a momentum term, the steepest descent method is less prone to getting trapped at "shallow" local minima, and deals better with cost functions that are alternately very flat and very steep along the path of the algorithm.
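A minimal side-by-side sketch (with an illustrative ill-conditioned quadratic) of the heavy ball iteration (2.12) with constant α and β against plain steepest descent; the momentum term sharply reduces the number of iterations needed for a given accuracy. The constants used for the heavy ball run are the classical optimal choices for a quadratic.

```python
import numpy as np

Q = np.diag([1.0, 100.0])                      # condition number M/m = 100
grad = lambda x: Q @ x                         # f(x) = 0.5 x'Qx, minimum at 0
m_eig, M_eig = 1.0, 100.0

def run(alpha, beta, iters=300):
    x_prev = x = np.array([1.0, 1.0])
    errs = []
    for _ in range(iters):
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x   # iteration (2.12)
        errs.append(np.linalg.norm(x))
    return errs

# Plain gradient method (beta = 0) with its optimal constant stepsize, versus
# heavy ball with the classical optimal constants for a quadratic.
gd = run(alpha=2.0 / (M_eig + m_eig), beta=0.0)
hb = run(alpha=4.0 / (np.sqrt(M_eig) + np.sqrt(m_eig)) ** 2,
         beta=((np.sqrt(M_eig) - np.sqrt(m_eig)) / (np.sqrt(M_eig) + np.sqrt(m_eig))) ** 2)

for k in [10, 50, 100, 300]:
    print(f"after {k:3d} iterations:  gradient error = {gd[k-1]:.2e},"
          f"  heavy ball error = {hb[k-1]:.2e}")
```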
Figure 2.1.2. Illustration of the heavy ball method (2.12), where α_k = α and β_k = β.
with β_k chosen in a special way so that β_k → 1, and then a gradient step with constant stepsize α, and gradient calculated at y_k,
It turns out that if the parameters α_k and β_k in iteration (2.12) are chosen optimally for each k so that

(α_k, β_k) ∈ arg min_{α∈ℝ, β∈ℝ} f( x_k − α∇f(x_k) + β(x_k − x_{k−1}) ),   k = 0, 1, ...,    (2.13)

with x_{−1} = x_0, the resulting method is an implementation of the conjugate gradient method (see e.g., [Ber99], Section 1.6). By this we mean that if f is a convex quadratic function, the method (2.12) with the stepsize choice (2.13) generates exactly the same iterates as the conjugate gradient method, and hence minimizes f in at most n iterations. Finding the optimal parameters according to Eq. (2.13) requires solution of a two-dimensional optimization problem in α and β, which may be impractical in the absence of special structure. However, this optimization is facilitated in some important special cases, which also favor the use of other types of conjugate direction methods.†
There are several other ways to implement the conjugate gradient method, all of which generate identical iterates for quadratic cost functions, but may differ substantially in their behavior for nonquadratic ones. One of them, which resembles the preceding extrapolation methods, is the method of parallel tangents or PARTAN, first proposed in the paper [SBK64]. In particular, each iteration of PARTAN involves an extrapolation and two one-dimensional line minimizations. At the typical iteration, given x_k, we obtain x_{k+1} as follows:
(1) We find a vector y_k that minimizes f over the line
    {x_k − γ ∇f(x_k) | γ ≥ 0}.
(2) We generate x_{k+1} by minimizing f over the line that passes through x_{k−1} and y_k.
By writing the PARTAN iteration in the two-step form
    Gradient step:       y_k = x_k − γ_k ∇f(x_k),
    Extrapolation step:  x_{k+1} = y_k + β_k (y_k − x_{k−1}),
where
    γ_k = α_k / (1 + β_k),
we see that the heavy ball method (2.12) with constant parameters α and β is obtained when γ_k = α/(1 + β) and β_k = β. The PARTAN method is obtained when γ_k and β_k are chosen by line minimization, in which case the corresponding parameter α_k of iteration (2.12) is α_k = γ_k(1 + β_k). Thus, starting from x_k, the parameter β_k is determined by the second line search of PARTAN as the optimal stepsize along the line that passes through x_{k−1} and y_k, and then α_k is determined as γ_k(1 + β_k), where γ_k is the optimal stepsize along the line {x_k − γ ∇f(x_k) | γ ≥ 0}.
Newton's Method
Here the scaling matrix is D_k = (∇²f(x_k))^{−1}, provided ∇²f(x_k) exists and is positive definite, so the iteration takes the form
    x_{k+1} = x_k − α_k (∇²f(x_k))^{−1} ∇f(x_k).
If ∇²f(x_k) is not positive definite, some modification is necessary. There are several possible modifications of this type, for which the reader may consult nonlinear programming textbooks. The simplest one is to add to ∇²f(x_k) a small positive multiple of the identity. Generally, when f is convex, ∇²f(x_k) is positive semidefinite (Prop. 1.1.10 in Appendix B), and this facilitates the implementation of reliable Newton-type algorithms.
The idea in Newton's method is to minimize at each iteration the quadratic approximation of f around the current point x_k given by
    f̃_k(x) = f(x_k) + ∇f(x_k)'(x − x_k) + ½ (x − x_k)' ∇²f(x_k) (x − x_k).        (2.14)
is positive definite, and that a stepsize α_k = 1 is used, at least after some iteration. For a simple argument, we may use Taylor's theorem to write
in order to find the Newton direction. There are many iterative algorithms
that are patterned after Newton's method, and aim to strike a balance be-
tween fast convergence and high overhead (e.g., quasi-Newton, conjugate
direction, and others, extensive discussions of which may be found in non-
linear programming textbooks such as [GMW81], [DeS96], [Ber99], [Fle00],
[BSS06], [NoW06], [LuY08]).
We finally note that for some problems the special structure of the
Hessian matrix can be exploited to facilitate the implementation of New-
ton's method. For example the Hessian matrix of the dual function of the
separable convex programming problem of Section 1.1, when it exists, has
particularly favorable structure; see [Ber99], Section 6.1. The same is true
for optimal control problems that involve a discrete-time dynamic system
and a cost function that is additive over time; see [Ber99], Section 1.9.
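The following is a minimal sketch of a Newton iteration of this type, using the simple modification of adding a small positive multiple of the identity when the Hessian is not positive definite; all names and the test function are illustrative assumptions.

```python
import numpy as np

def newton_method(grad, hess, x0, alpha=1.0, reg=1e-8, num_iters=20):
    """Newton iteration x_{k+1} = x_k - alpha * (hess f(x_k))^{-1} grad f(x_k).
    A small multiple of the identity is added if the Hessian is not positive definite."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(num_iters):
        H = hess(x)
        shift = reg
        while True:  # increase the shift until a Cholesky factorization succeeds
            try:
                L = np.linalg.cholesky(H + shift * np.eye(n))
                break
            except np.linalg.LinAlgError:
                shift *= 10.0
        # Solve (H + shift*I) d = grad f(x) using the Cholesky factor.
        d = np.linalg.solve(L.T, np.linalg.solve(L, grad(x)))
        x = x - alpha * d
    return x

# Example: smooth convex function f(x) = log(exp(x1) + exp(x2)) + 0.5*||x||^2.
def grad(x):
    p = np.exp(x - np.max(x)); p /= p.sum()
    return p + x

def hess(x):
    p = np.exp(x - np.max(x)); p /= p.sum()
    return np.diag(p) - np.outer(p, p) + np.eye(x.size)

print(newton_method(grad, hess, x0=np.array([3.0, -2.0])))
```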
Stepsize Rules
There are several methods to choose the stepsize α_k in the scaled gradient iteration (2.11). For example, α_k may be chosen by line minimization:
    α_k ∈ arg min_{α≥0} f(x_k − α D_k ∇f(x_k)),    k = 0, 1, ...,
and when α_k is chosen to be diminishing to 0, while satisfying the conditions†
    Σ_{k=0}^∞ α_k = ∞,    Σ_{k=0}^∞ α_k² < ∞.
Figure 2.1.4. Illustration of the successive points tested by the Armijo rule along the descent direction d_k = −D_k ∇f(x_k). In this figure, α_k is obtained as β²s_k after two unsuccessful trials. Because σ ∈ (0, 1), the set of acceptable stepsizes begins with a nontrivial interval when d_k ≠ 0. This implies that if d_k ≠ 0, the Armijo rule will find an acceptable stepsize with a finite number of stepsize reductions.
In this rule, the stepsize is set to α_k = β^{m_k} s_k, where m_k is the first nonnegative integer m for which
    f(x_k) − f(x_k + β^m s_k d_k) ≥ −σ β^m s_k ∇f(x_k)' d_k,
where β ∈ (0, 1) and σ ∈ (0, 1) are some constants, and s_k > 0 is a positive initial stepsize, chosen to be either constant or through some simplified search or polynomial interpolation. In other words, starting with an initial trial s_k, the stepsizes β^m s_k, m = 0, 1, ..., are tried successively until the above inequality is satisfied for m = m_k; see Fig. 2.1.4. We will explore the convergence properties of this rule in the exercises.
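For concreteness, here is a minimal sketch of the successive reduction (Armijo) rule along a direction d_k, assuming the sufficient decrease test stated above; the constants and the test function are placeholders.

```python
import numpy as np

def armijo_stepsize(f, grad_fx, x, d, s=1.0, beta=0.5, sigma=1e-4, max_reductions=50):
    """Backtracking (Armijo) rule: try stepsizes s, beta*s, beta^2*s, ... until the
    sufficient decrease test f(x) - f(x + a*d) >= -sigma * a * grad_fx' d is passed."""
    fx = f(x)
    slope = float(np.dot(grad_fx, d))   # negative when d is a descent direction
    a = s
    for _ in range(max_reductions):
        if fx - f(x + a * d) >= -sigma * a * slope:
            return a
        a *= beta
    return a  # fallback: return the last (very small) trial stepsize

# Usage with a steepest descent direction on f(x) = x1^2 + 10*x2^2.
f = lambda x: x[0]**2 + 10.0 * x[1]**2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
x = np.array([1.0, 1.0])
d = -grad(x)
print(armijo_stepsize(f, grad(x), x, d))
```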
Aside from guaranteeing cost function descent, successive reduction rules have the additional benefit of adapting the size of the stepsize α_k to the search direction −D_k ∇f(x_k), particularly when the initial stepsize s_k
is chosen by some simplified search process. We refer to nonlinear program-
ming sources for detailed discussions.
Note that the diminishing stepsize rule does not guarantee cost func-
tion descent at each iteration, although it reduces the cost function value
once the stepsize becomes sufficiently small. There are also some other
rules, often called nonmonotonic, which do not explicitly try to enforce
cost function descent and have achieved some success, but are based on
ideas that we will not discuss in this book; see [GLL86], [BaB88], [Ray93],
[Ray97], [BMR00], [DHS06]. An alternative approach to enforce descent
without explicitly using stepsizes is based on the trust region methodol-
ogy, for which we refer to book sources such as [Ber99], [CGT00], [Fle00],
[NoW06].
    x_{k+1} = x_k + α_k (x̄_k − x_k),        (2.16)
where x̄_k is obtained by solving the linear cost subproblem
    x̄_k ∈ arg min_{x∈X} ∇f(x_k)'(x − x_k),        (2.17)
and set in Eq. (2.16); see Fig. 2.1.5. Clearly ∇f(x_k)'(x̄_k − x_k) ≤ 0, with equality holding only if ∇f(x_k)'(x − x_k) ≥ 0 for all x ∈ X, which is a necessary condition for optimality of x_k.
This is the conditional gradient method (also known as the Frank-
Wolfe algorithm) proposed in [FrW56] for convex programming problems
with linear constraints, and for more general problems in [LeP65]. The
method has been used widely in many contexts, as it is theoretically sound,
quite simple, and often convenient. In particular, when X is a polyhedral
set, computation of x̄_k requires the solution of a linear program. In some
important cases, this linear program has special structure, which results in
great simplifications, e.g., in the multicommodity flow problem of Example
1.4.5 (see the book [BeG92], or the surveys [FlH95], [PatOl]). There has
been intensified interest in the conditional gradient method, thanks to ap-
plications in machine learning; see e.g., [ClalO], [Jag13], [LuT13], [RSW13],
[FrG14], [HJN14], and the references quoted there.
However, the conditional gradient method often tends to converge
very slowly relative to its competitors (its asymptotic convergence rate
can be slower than linear even for positive definite quadratic programming
problems); see [CaC68], [Dun79], [Dun80]. For this reason, other methods
with better practical convergence rate properties are often preferred.
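As an illustration of the linear cost subproblem (2.17), the following minimal sketch applies the conditional gradient method over the unit simplex, where the subproblem is solved by inspection; the stepsize rule 2/(k+2) and the test problem are illustrative assumptions.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, num_iters=200):
    """Conditional gradient (Frank-Wolfe) method over the unit simplex {x >= 0, sum x = 1}:
    solve min_{x in X} grad f(x_k)' x (attained at a vertex of the simplex),
    then move toward that vertex with stepsize 2/(k+2)."""
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        g = grad(x)
        x_bar = np.zeros_like(x)
        x_bar[np.argmin(g)] = 1.0          # minimizer of a linear function over the simplex
        alpha = 2.0 / (k + 2.0)
        x = x + alpha * (x_bar - x)        # stays feasible by convexity of the simplex
    return x

# Example: min 0.5*||x - y||^2 over the simplex.
y = np.array([0.1, 0.6, 0.8])
grad = lambda x: x - y
print(frank_wolfe_simplex(grad, x0=np.ones(3) / 3.0))
```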
One of these methods is the simplicial decomposition algorithm (first proposed independently in [CaG74] and [Hol74]), which will be discussed
in detail in Chapter 4. This method is not a feasible direction method of
the form (2.16), but instead it is based on multidimensional optimizations
over approximations of the constraint set by convex hulls of finite numbers
of points. When X is a polyhedral set, it converges in a finite number of
iterations, and while this number can potentially be very large, the method
often attains practical convergence in very few iterations. Generally, simpli-
cial decomposition can provide an attractive alternative to the conditional
gradient method because it tends to be well-suited for the same type of
problems [it also requires solution of linear cost subproblems of the form
(2.17); see the discussion of Section 4.2].
Somewhat peculiarly, the practical performance of the conditional
gradient method tends to improve in highly constrained problems. An
explanation for this is given in the papers [Dun79], [DuS83], where it is
shown among others that the convergence rate of the method is linear when
the cost function is positive definite quadratic, and the constraint set is not
polyhedral but rather has a "positive curvature" property (for example it is
a sphere). When there are many linear constraints, the constraint set tends
to have very many closely spaced extreme points, and has this "positive
curvature" property in an approximate sense.
    x_{k+1} = P_X(x_k − α_k ∇f(x_k)),        (2.18)
where α_k > 0 is a stepsize and P_X(·) denotes projection on X (the projection is well defined since X is closed and convex; see Fig. 2.1.6).
To get a sense of the validity of the method, note that from the
Projection Theorem (Prop. 1.1.9 in Appendix B), we have
and by the optimality condition for convex functions (cf. Prop. 1.1.8 in
Appendix B), the inequality is strict unless x_k is optimal. Thus x_{k+1} − x_k defines a feasible descent direction at x_k, and based on this fact, we can show the descent property f(x_{k+1}) < f(x_k) when α_k is sufficiently small.
The stepsize α_k is chosen similar to the unconstrained gradient me-
thod, i.e., constant, diminishing, or through some kind of reduction rule
to ensure cost function descent and guarantee convergence to the opti-
mum; see the convergence analysis of Section 6.1, and [Ber99], Section 2.3,
for a detailed discussion and references. Moreover the convergence rate
estimates given earlier for unconstrained steepest descent in the positive
definite quadratic cost case [cf. Eq. (2.8)] and in the singular case [cf. Eqs.
(2.9) and (2.10)] generalize to the gradient projection method under vari-
ous stepsize rules (see Exercise 2.1 for the former case and [Dun81] for the
latter case).
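For concreteness, here is a minimal sketch of the gradient projection iteration (2.18) for the case where X is a box, so that the projection reduces to componentwise clipping; the stepsize and data are illustrative.

```python
import numpy as np

def gradient_projection_box(grad, x0, lower, upper, alpha=0.1, num_iters=500):
    """Gradient projection: x_{k+1} = P_X(x_k - alpha * grad f(x_k)), where P_X
    clips each component to its interval [lower_i, upper_i]."""
    x = np.clip(np.asarray(x0, dtype=float), lower, upper)
    for _ in range(num_iters):
        x = np.clip(x - alpha * grad(x), lower, upper)
    return x

# Example: min 0.5*||x - y||^2 over the box [0, 1]^3 with y outside the box.
y = np.array([-0.5, 0.4, 1.7])
grad = lambda x: x - y
print(gradient_projection_box(grad, x0=np.zeros(3), lower=0.0, upper=1.0))
# Expected limit: the projection of y onto the box, i.e., [0.0, 0.4, 1.0].
```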
Despite its simplicity, the gradient projection method has some significant
drawbacks:
(a) Its rate of convergence is similar to the one of steepest descent, and
is often slow. It is possible to overcome this potential drawback by
a form of scaling. This can be accomplished with an iteration of the
form
    x_{k+1} ∈ arg min_{x∈X} { ∇f(x_k)'(x − x_k) + (1/(2α_k)) (x − x_k)' H_k (x − x_k) },        (2.19)
where H_k is a positive definite symmetric matrix and α_k is a positive stepsize. When H_k is the identity, it can be seen that this iteration gives the same iterate x_{k+1} as the unscaled gradient projection iteration (2.18). When H_k = ∇²f(x_k) and α_k = 1, we obtain a constrained form of Newton's method (see nonlinear programming sources for analysis; e.g., [Ber99]).
(b) Depending on the nature of X, the projection operation may involve substantial overhead. The projection is simple when H_k is the identity (or more generally, is diagonal), and X consists of simple lower and/or upper bounds on the components of x:
    X = {x | b̲ⁱ ≤ xⁱ ≤ b̄ⁱ, i = 1, ..., n}.        (2.20)
Then the projection decomposes into scalar projections of each component xⁱ onto the interval of corresponding bounds [b̲ⁱ, b̄ⁱ], and is very simple. However, for general nondiagonal scaling the overhead for solving the quadratic programming problem (2.19) is substantial even if X has the simple bound structure of Eq. (2.20).
To overcome the difficulty with the projection overhead, a scaled projection method known as the two-metric projection method has been proposed for the case of the bound constraints (2.20) in [Ber82a], [Ber82b]. It has a similar form to the scaled gradient method (2.11), and it is given by
    x_{k+1} = P_X(x_k − α_k D_k ∇f(x_k)).        (2.21)
A difficulty here is that an arbitrary positive definite matrix D_k will not necessarily yield a descent direction. However, it turns out that if some of the off-diagonal terms of D_k that correspond to components of x_k that are at their boundary are set to zero, one can obtain descent (see
Exercise 2.8). Furthermore, one can select Dk as the inverse of a partially
diagonalized version of the Hessian matrix ∇²f(x_k) and attain the fast
convergence rate of Newton's method (see [Ber82a], [Ber82b], [GaB84]).
The idea of simple two-metric projection with partial diagonaliza-
tion may be generalized to more complex constraint sets, and it has been
adapted in [Ber82b], and subsequent papers such as [GaB84], [Dun91],
[LuT93b], to problems of the form
    minimize  f(x)
    subject to  b̲ ≤ x ≤ b,  Ax = c,
where A is an m × n matrix, and b̲, b ∈ ℝⁿ and c ∈ ℝᵐ are given vectors. For example the algorithm (2.21) can be easily modified when the constraint set involves bounds on the components of x together with a few linear constraints, e.g., problems involving a simplex constraint such as
    minimize  f(x)
    subject to  0 ≤ x,  a'x = c,
where a ∈ ℝⁿ and c ∈ ℝ, or a Cartesian product of simplexes. For an
example of a Newton algorithm of this type, applied to the multicommodity
flow problem of Example 1.4.5, see [BeG83]. For representative applications
in related large-scale contexts we refer to the papers [Dun91], [LuT93b],
[FJS98], [Pyt98], [GeM05], [OJW05], [TaP13], [WSK14].
The advantage that the two-metric projection approach can offer is to
identify quickly the constraints that are active at an optimal solution. After
this happens, the method reduces essentially to an unconstrained scaled
gradient method (possibly Newton method, if Dk is a partially diagonalized
Hessian matrix), and attains a fast convergence rate. This property has
also motivated variants of the two-metric projection method for problems
involving ℓ₁-regularization, such as the ones of Example 1.3.2; see [SFR09], [Sch10], [GKX10], [SKS12], [Lan14].
The preceding methods require the computation of the gradient and possi-
bly the Hessian of the cost function at each iterate. An alternative descent
approach that does not require derivatives or other direction calculations
is the classical block coordinate descent method, which we will briefly de-
scribe here and consider further in Section 6.5. The method applies to the
problem
minimize f (x)
subject to x E X,
where X is a Cartesian product of closed convex sets X₁, ..., X_m, and the vector x is partitioned as
    x = (x¹, x², ..., xᵐ),
where each xⁱ belongs to ℝ^{n_i}, so the constraint x ∈ X is equivalent to
    xⁱ ∈ X_i,    i = 1, ..., m.
The most common case is when ni = 1 for all i, so the components xi
are scalars. The method involves minimization with respect to a single
component xi at each iteration, with all other components kept fixed.
In an example of such a method, given the current iterate x_k = (x_k¹, ..., x_kᵐ), we generate the next iterate x_{k+1} = (x_{k+1}¹, ..., x_{k+1}ᵐ), according to the "cyclic" iteration
    x_{k+1}ⁱ ∈ arg min_{ξ∈X_i} f(x_{k+1}¹, ..., x_{k+1}^{i−1}, ξ, x_k^{i+1}, ..., x_kᵐ),    i = 1, ..., m.        (2.22)
Thus, at each iteration, the cost is minimized with respect to each of the
"block coordinate" vectors xi, taken one-at-a-time in cyclic order.
Naturally, the method makes practical sense only if it is possible to
perform this minimization fairly easily. This is frequently so when each xi
is a scalar, but there are also other cases of interest, where xi is a multi-
dimensional vector. Moreover, the method can take advantage of special
structure off; an example of such structure is a form of "sparsity," where
f is the sum of component functions, and for each i, only a relatively small
number of the component functions depend on xi, thereby simplifying the
minimization (2.22). The following is an example of a classical algorithm
that can be viewed as a special case of block coordinate descent.
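As an illustration of iteration (2.22), the following minimal sketch applies cyclic coordinate minimization to a convex quadratic cost, for which each scalar minimization is available in closed form; the data are illustrative.

```python
import numpy as np

def coordinate_descent_quadratic(Q, b, x0, num_cycles=100):
    """Cyclic coordinate descent for f(x) = 0.5 x'Qx - b'x with Q positive definite:
    at each step, minimize exactly over the single coordinate x_i with the rest fixed."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(num_cycles):
        for i in range(n):
            # Minimizing over x_i gives x_i = (b_i - sum_{j != i} Q_ij x_j) / Q_ii.
            x[i] = (b[i] - Q[i] @ x + Q[i, i] * x[i]) / Q[i, i]
    return x

Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = coordinate_descent_quadratic(Q, b, x0=np.zeros(3))
print(x, np.linalg.solve(Q, b))  # the two should be close
```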
We are given m closed convex sets X₁, ..., X_m in ℝⁿ, and we want to find a point in their intersection. This problem can equivalently be written as
    minimize  Σ_{i=1}^m ‖yⁱ − x‖²
    subject to  x ∈ ℝⁿ,  yⁱ ∈ X_i,  i = 1, ..., m,
which minimizes the cost function with respect to x when each yⁱ is fixed at y_{k+1}ⁱ, i = 1, ..., m.
In this example, the algorithm fails even though it never encounters a point
where f is nondifferentiable, which suggests that convergence questions in
convex optimization are delicate and should not be treated lightly. The
problem here is a lack of continuity: the steepest descent direction may
undergo a large/discontinuous change close to the convergence limit. By
contrast, this would not happen if f were continuously differentiable at
the limit, and in fact the steepest descent method with the minimization
stepsize rule has sound convergence properties when used for differentiable
functions.
Because the implementation of cost function descent has the limita-
tions outlined above, a different kind of descent approach, based on the
notion of subgradient, is often used when f is nondifferentiable. The the-
ory of subgradients of extended real-valued functions is outlined in Section
5.4 of Appendix B, as developed in the textbook [Ber09]. The properties of
Figure 2.1. 7. An example of failure of the steepest descent method with the line
minimization stepsize rule for a convex nondifferentiable cost function [Wol75].
Here we have the two-dimensional cost function
if x₁ > |x₂|,
if x₁ ≤ |x₂|,
shown in the figure. Consider the method that moves in the direction of steepest
descent from the current point, with the stepsize determined by cost minimization
along that direction (this can be done analytically). Suppose that the algorithm
starts anywhere within the set
The generated iterates are shown in the figure, and it can be verified that they
converge to the nonoptimal point (0, 0).
k=0,1, ....
In this case only approximate convergence can be guaranteed, i.e.,
convergence to a neighborhood of the optimum whose size depends
on a. Moreover the convergence rate may be slow. However, there is
an important case where some favorable results can be shown. This is
when a so-called sharp minimum condition holds, i.e., for some β > 0,
    f* + β min_{x*∈X*} ‖x − x*‖ ≤ f(x),    ∀ x ∈ X,        (2.24)
Aside from methods that are based on gradients or subgradients, like the
ones of the preceding sections, there are some other approaches to effect
cost function descent. A major approach, which applies to any convex cost function, is the proximal algorithm, to be discussed in detail in Chapter 5. This algorithm embodies both the cost improvement and the approximation ideas. In its basic form, it approximates the minimization of a closed proper convex function f : ℝⁿ → (−∞, ∞] with another minimization that involves a quadratic term. It is given by
    x_{k+1} ∈ arg min_{x∈ℝⁿ} { f(x) + (1/(2c_k)) ‖x − x_k‖² },        (2.25)
where x₀ is an arbitrary starting point and c_k is a positive scalar parameter.
Figure 2.1.8. Illustration of the proximal algorithm (2.25) and its descent property. The minimum of f(x) + (1/(2c_k))‖x − x_k‖² is attained at the unique point x_{k+1} at which the graph of the quadratic function −(1/(2c_k))‖x − x_k‖², raised by the amount γ_k, just touches the graph of f. Since γ_k < f(x_k), it follows that f(x_{k+1}) < f(x_k), unless x_k minimizes f, which happens if and only if x_{k+1} = x_k.
    x_{k+1} ∈ arg min_{x∈ℝⁿ} { f(x) + D_k(x; x_k) },        (2.26)
where D_k(x; x_k) is a regularization term that replaces the quadratic (1/(2c_k))‖x − x_k‖² in Eq. (2.25).
This approach may be useful when D_k has a special form that matches the structure of f.
(b) Linear approximation off using its gradient at Xk
When the proximal term D_k(x; x_k) is the quadratic (1/(2c_k))‖x − x_k‖², this iteration can be seen to be equivalent to the gradient projection iteration (2.18):
with g_{i,k} being a subgradient of f_i at ψ_{i−1,k} [or the gradient ∇f_i(ψ_{i−1,k}) in the differentiable case].
In a randomized version of the method, given x_k at iteration k, an index i_k is chosen from the set {1, ..., m} randomly, and the next iterate x_{k+1} is generated by
where starting with ψ_{0,k} = x_k, we generate ψ_{m,k} after the m steps
Let
    f_i(x) = (1/(2‖c_i‖²)) (c_i'x − b_i)²,    i = 1, ..., m,
where c_i are given nonzero vectors in ℝⁿ and b_i are given scalars, so we have a linear least squares problem. The constant term 1/(2‖c_i‖²) multiplying each of the squared functions (c_i'x − b_i)² serves a scaling purpose: with its inclusion, all the components f_i have a Hessian matrix
    ∇²f_i(x) = (1/‖c_i‖²) c_i c_i'
with trace equal to 1. This type of scaling is often used in least squares
problems (see [Ber99] for explanations). The incremental gradient method
(2.32)-(2.33) takes the form x_{k+1} = ψ_{m,k}, where ψ_{m,k} is obtained after the m steps
    ψ_{i,k} = ψ_{i−1,k} − (α_k/‖c_i‖²) (c_i'ψ_{i−1,k} − b_i) c_i,    i = 1, ..., m,        (2.34)
starting with ψ_{0,k} = x_k.
Figure 2.1.9. Illustration of the Kaczmarz method (2.34) with unit stepsize α_k = 1: (a) ψ_{i,k} is obtained by projecting ψ_{i−1,k} onto the hyperplane defined by the single equation c_i'x = b_i. (b) The convergence process for the case where the system of equations c_i'x = b_i, i = 1, ..., m, is consistent and has a unique solution x*. Here m = 3, and x_k is the vector obtained after k cycles through the equations. Each incremental iteration decreases the distance to x*, unless the current iterate lies on the hyperplane defined by the corresponding equation.
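For concreteness, here is a minimal sketch of the Kaczmarz iteration (2.34) with unit stepsize, cycling through the equations c_i'x = b_i; the random consistent test system is an illustrative assumption.

```python
import numpy as np

def kaczmarz(C, b, x0, num_cycles=50):
    """Kaczmarz method: within each cycle, project the current iterate successively
    onto the hyperplanes c_i'x = b_i (unit stepsize alpha_k = 1, cf. Eq. (2.34))."""
    x = np.asarray(x0, dtype=float)
    m = C.shape[0]
    for _ in range(num_cycles):
        for i in range(m):
            c = C[i]
            x = x - ((c @ x - b[i]) / (c @ c)) * c   # projection onto the i-th hyperplane
    return x

# Consistent random square system with a unique solution.
rng = np.random.default_rng(0)
C = rng.standard_normal((5, 5))
x_star = rng.standard_normal(5)
b = C @ x_star
print(np.linalg.norm(kaczmarz(C, b, x0=np.zeros(5)) - x_star))  # should be small
```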
(a) Progress when far from convergence. Here the incremental method
can be much faster. For an extreme case let X = ℝⁿ (no constraints), and take m large and all components f_i identical to each other. Then
an incremental iteration requires m times less computation than a
classical gradient iteration, but gives exactly the same result, when
the stepsize is appropriately scaled to be m times larger. While this
is an extreme example, it reflects the essential mechanism by which
incremental methods can be much superior: far from the minimum
a single component gradient will point to "more or less" the right
direction, at least most of the time.
(b) Progress when close to convergence. Here the incremental method
can be inferior. As a case in point, assume that all components Ji
are differentiable functions. Then the nonincremental gradient pro-
jection method can be shown to converge with a constant stepsize
under reasonable assumptions, as we will see in Section 6.1. How-
ever, the incremental method requires a diminishing stepsize, and its
ultimate rate of convergence can be much slower. When the compo-
nent functions Ji are nondifferentiable, both the nonincremental and
the incremental subgradient methods require a diminishing stepsize.
The nonincremental method tends to require a smaller number of it-
erations, but each of the iterations involves all the components Ji and
thus larger computation overhead, so that on balance, in terms of
computation time, the incremental method tends to perform better.
As an illustration consider the following example.
Example 2.1.4:
Consider a scalar linear least squares problem where the components f_i have the form
    f_i(x) = ½ (c_i x − b_i)²,    x ∈ ℝ,
where c_i and b_i are given scalars with c_i ≠ 0 for all i. The minimum of each of the components f_i is
    x_i* = b_i / c_i.
It can be seen that x* lies within the range of the component minima
    R = [min_i x_i*, max_i x_i*],
and that for all x outside R, the gradient ∇f_i(x) has the same sign as ∇f(x) (see Fig. 2.1.11). As a result, when outside the region R, the incremental gradient method approaches x* at every step.
However, for x inside the region R, the ith step of a cycle of the incremental gradient method need not make progress. It will approach x* (for small enough stepsize α_k) only if the current point ψ_{i−1} does not lie in the interval connecting x_i* and x*. This induces an oscillatory behavior within R, and as a result, the incremental gradient method will typically not converge to x* unless α_k → 0.
Let us now compare the incremental gradient method with the nonincremental version, which takes the form
    x_{k+1} = x_k − α_k Σ_{i=1}^m c_i (c_i x_k − b_i).
It can be shown that this method converges to x* for any constant stepsize α_k ≡ α satisfying
    0 < α ≤ 1 / Σ_{i=1}^m c_i².
On the other hand, for x outside the region R, an iteration of the nonin-
cremental method need not make more progress towards the solution than
a single step of the incremental method. In other words, with comparably
intelligent stepsize choices, far from the solution ( outside R), a single cy-
cle through the entire set of component functions by the incremental method
is roughly as effective as m iterations by the nonincremental method, which
require m times as many component gradient calculations.
Example 2.1.5:
The preceding example assumes that each component function f_i has a minimum, so that the range of component minima is defined. In cases where the components f_i have no minima, a similar phenomenon may occur. As an example consider the case where f is the sum of increasing and decreasing convex exponentials, i.e.,
    f_i(x) = a_i e^{b_i x},    x ∈ ℝ,
where a_i and b_i are scalars with a_i > 0 and b_i ≠ 0. Let
    I⁺ = {i | b_i > 0},    I⁻ = {i | b_i < 0},
and assume that I⁺ and I⁻ have roughly equal numbers of components. Let also x* be the minimum of Σ_{i=1}^m f_i.
Consider the incremental gradient method that, given the current point, call it x_k, chooses some component f_{i_k} and iterates according to the incremental iteration
Then it can be seen that if x_k ≫ x*, x_{k+1} will be substantially closer to x* if i_k ∈ I⁺, and negligibly further away from x* if i_k ∈ I⁻. The net effect, averaged over many incremental iterations, is that if x_k ≫ x*, an incremental gradient iteration makes roughly one half the progress of a full gradient iteration, with m times less overhead for calculating gradients. The same is true if x_k ≪ x*. On the other hand as x_k gets closer to x* the advantage of incrementalism is reduced, similar to the preceding example. In fact in order for the incremental method to converge, a diminishing stepsize is necessary, which will ultimately make the convergence slower than the one of the nonincremental gradient method with a constant stepsize.
Stepsize Selection
where f_{i_k} is the new component function selected for iteration k. Here, the component indexes i_k may either be selected in a cyclic order [i_k = (k modulo m) + 1], or according to some randomization scheme, consistently with Eq. (2.31). Also for k < m, the summation should go up to ℓ = k, and α should be replaced by a correspondingly larger value, such as α_k = mα/(k + 1). This method, first proposed in [BHG08], computes
the gradient incrementally, one component per iteration, but in place of
the single component gradient, it uses an approximation to the total cost
gradient v' f(xk), which is the sum of the component gradients computed
in the past m iterations.
There is analytical and experimental evidence that by aggregating
the component gradients one may be able to attain a faster asymptotic
convergence rate, by ameliorating the effect of approximating the full gra-
dient with component gradients; see the original paper [BHG08], which
provides an analysis for quadratic problems, the paper [SLB13], which pro-
vides a more general convergence and convergence rate analysis, and ex-
tensive computational results, and the papers [Mai13], [Mai14], [DCD14],
which describe related methods. The expectation of faster convergence
should be tempered, however, because in order for the effect of aggregat-
ing the component gradients to fully manifest itself, at least one pass (and
possibly quite a few more) through the components must be made, which
may be too long if m is very large.
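The following is a minimal sketch of an aggregated gradient iteration in the spirit of Eq. (2.35): one component gradient is refreshed per iteration while the stored gradients of the other components are reused. The component functions, stepsize, cyclic order, and the simplified initialization of the gradient table are illustrative assumptions.

```python
import numpy as np

def incremental_aggregated_gradient(grads, x0, alpha=0.01, num_iters=2000):
    """Aggregated gradient method for f = sum_i f_i: store the last computed gradient of
    each component, refresh one component per iteration (cyclically), and step along
    the sum of the stored gradients."""
    x = np.asarray(x0, dtype=float)
    m = len(grads)
    table = [grads[i](x) for i in range(m)]   # initialize the table with gradients at x0
    total = np.sum(table, axis=0)
    for k in range(num_iters):
        i = k % m                              # cyclic component selection
        new_g = grads[i](x)
        total += new_g - table[i]              # cheap update of the aggregated gradient
        table[i] = new_g
        x = x - alpha * total
    return x

# Example: f_i(x) = 0.5*(a_i'x - b_i)^2 for a small random linear least squares problem.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)
grads = [lambda x, a=A[i], bi=b[i]: (a @ x - bi) * a for i in range(20)]
print(incremental_aggregated_gradient(grads, x0=np.zeros(3)))
```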
    s_k = Σ_{ℓ=0}^{m−1} ∇f_{i_{k−ℓ}}(x_{k−ℓ})
in Eq. (2.35), these methods use
[both iterations (2.35) and (2.37) involve different types of diminishing de-
pendence on past gradient components]. Thus, the heavy ball iteration
(2.36) provides an approximate implementation of the incremental aggre-
gated gradient method (2.35), while it does not have the memory storage
issue of the latter.
A further way to intertwine the ideas of the aggregated gradient
method (2.35) and the heavy ball method (2.36) for the unconstrained
case (X = ~n) is to form an infinite sequence of components
and group together blocks of successive components into batches. One way
to implement this idea is to add p preceding gradients (with 1 < p < m) to
the current component gradient in iteration (2.36), thus iterating according
to p
f(x) = E{F(x,w)},
(2.40)
(2.41)
for minimizing a finite sum Σ_i f_i, when randomization is used for com-
ponent selection [cf. Eq. (2.31)]. An important difference is that the former
method involves sequential sampling of cost components F(x, w) from an
infinite population under some statistical assumptions, while in the latter
the set of cost components Ji is predetermined and finite. However, it is
possible to view the incremental subgradient method (2.41), with uniform
randomized selection of the component function Ji (i.e., with ik chosen
to be any one of the indexes 1, ... , m, with equal probability 1/m, and
independently of preceding choices), as a stochastic subgradient method.
Despite the apparent similarity of the incremental and the stochastic subgradient methods, the view that the problem
    minimize  Σ_{i=1}^m f_i(x)
    subject to  x ∈ X,        (2.42)
can simply be treated as a special case of a stochastic minimization problem is questionable.
One reason is that once we convert the finite sum problem to a
stochastic problem, we preclude the use of methods that exploit the finite
sum structure, such as the aggregated gradient methods we discussed ear-
lier. Under certain conditions, these methods offer more attractive conver-
gence rate guarantees than incremental and stochastic gradient methods,
and can be very effective for many problems, as we have noted.
Another reason is that the finite-component problem (2.42) is often
genuinely deterministic, and to view it as a stochastic problem at the outset
Example 2.1.6:
    minimize  f(x) = (1/(2m)) Σ_{i=1}^m (x − w_i)²
    subject to  x ∈ ℝ,
where
    w_i = 1 if i is odd,    w_i = −1 if i is even.
starting with some initial iterate x₀. It is then easily verified by induction that
    x_k = x₀/k + (w_{i_0} + ⋯ + w_{i_{k−1}})/k,    k = 1, 2, ....
Thus the iteration error, which is x_k (since x* = 0), consists of two terms. The first is the error term x₀/k, which is independent of the method of selecting i_k, and the second is the error term
may obtain much worse results with an unfavorable deterministic order (such
as selecting first all the odd components and then all the even components).
However, the point here is that if we take the view that we are minimizing
an expected value, we are disregarding at the outset information about the
problem's structure that could be algorithmically useful.
where the functions f_i : ℝⁿ → ℝ are convex and twice continuously differentiable. Consider the quadratic approximation f̃_i of a function f_i at a vector ψ ∈ ℝⁿ, i.e., the second order Taylor expansion of f_i at ψ:
    f̃_i(x; ψ) = f_i(ψ) + ∇f_i(ψ)'(x − ψ) + ½ (x − ψ)' ∇²f_i(ψ) (x − ψ),    ∀ x, ψ ∈ ℝⁿ.
where starting with ψ_{0,k} = x_k, we obtain ψ_{m,k} after the m steps
    ψ_{i,k} ∈ arg min_{x∈ℝⁿ} Σ_{ℓ=1}^{i} f̃_ℓ(x; ψ_{ℓ−1,k}),    i = 1, ..., m.        (2.43)
If all the functions f_i are quadratic, it can be seen that the method finds the solution in a single cycle.† The reason is that when f_i is quadratic, each f_i(x) differs from f̃_i(x; ψ) by a constant, which does not depend on x. Thus the difference
    Σ_{ℓ=1}^m f_ℓ(x) − Σ_{ℓ=1}^m f̃_ℓ(x; ψ_{ℓ−1,k})
does not depend on x, so ψ_{i,k} minimizes Σ_{ℓ=1}^i f_ℓ(x), i = 1, ..., m, and when i = m, the solution of the problem is obtained (see Fig. 2.1.12). This convergence behavior should be compared with the one for the Kaczmarz method (cf. Fig. 2.1.10).

Figure 2.1.12. Illustration of the incremental Newton method for the case of a two-dimensional linear least squares problem with m = 3 cost function components (compare with the Kaczmarz method, cf. Fig. 2.1.10).
It is important to note that the quadratic minimizations of Eq. (2.43) can be carried out efficiently. For simplicity, let us assume that f̃₁(x; ψ) is a positive definite quadratic, so that for all i, ψ_{i,k} is well defined as the unique solution of the minimization problem in Eq. (2.43). We will show that the incremental Newton method (2.43) can be implemented in terms of the incremental update formula
    ψ_{i,k} = ψ_{i−1,k} − D_{i,k} ∇f_i(ψ_{i−1,k}),        (2.44)
where D_{i,k} is given by
    D_{i,k} = ( Σ_{ℓ=1}^{i} ∇²f_ℓ(ψ_{ℓ−1,k}) )^{−1}.        (2.45)
(2.46)
Indeed, from the definition of the method (2.43), the quadratic function Σ_{ℓ=1}^{i−1} f̃_ℓ(x; ψ_{ℓ−1,k}) is minimized by ψ_{i−1,k} and its Hessian matrix is D_{i−1,k}^{−1}, so we have
    Σ_{ℓ=1}^{i−1} f̃_ℓ(x; ψ_{ℓ−1,k}) = ½ (x − ψ_{i−1,k})' D_{i−1,k}^{−1} (x − ψ_{i−1,k}) + constant.
Adding the second order expansion f̃_i(x; ψ_{i−1,k}), we obtain
    Σ_{ℓ=1}^{i} f̃_ℓ(x; ψ_{ℓ−1,k}) = ½ (x − ψ_{i−1,k})' D_{i−1,k}^{−1} (x − ψ_{i−1,k}) + constant
        + ½ (x − ψ_{i−1,k})' ∇²f_i(ψ_{i−1,k}) (x − ψ_{i−1,k}) + ∇f_i(ψ_{i−1,k})' (x − ψ_{i−1,k}).
In particular, when the Hessian ∇²f_i(ψ_{i−1,k}) is a rank-one matrix of the form a_i a_i' (as in the case of a linear least squares component f_i), the matrix D_{i,k} can be updated without matrix inversion, according to
    D_{i,k} = D_{i−1,k} − (D_{i−1,k} a_i a_i' D_{i−1,k}) / (1 + a_i' D_{i−1,k} a_i);
this is the well-known Sherman-Morrison formula for the inverse of the sum of an invertible matrix and a rank-one matrix (see the matrix inversion formula in Section A.1 of Appendix A).
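For concreteness, here is a minimal sketch of one cycle of an incremental Newton-type iteration for linear least squares components f_i(x) = ½(a_i'x − b_i)², with the matrices D_{i,k} updated by the Sherman-Morrison formula; the small regularization used to initialize D is an illustrative device to keep the sketch self-contained.

```python
import numpy as np

def incremental_newton_cycle(A, b, x0, delta=1e-6):
    """One cycle of an incremental Newton-type method for sum_i 0.5*(a_i'x - b_i)^2.
    D holds the inverse of the accumulated Hessian; it is updated by the
    Sherman-Morrison formula as each rank-one term a_i a_i' is added."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    D = np.eye(n) / delta          # inverse of the initial regularization delta * I
    for a, bi in zip(A, b):
        Da = D @ a
        D = D - np.outer(Da, Da) / (1.0 + a @ Da)   # Sherman-Morrison update
        x = x - D @ a * (a @ x - bi)                # step in the spirit of Eq. (2.44)
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 4))
b = rng.standard_normal(50)
x = incremental_newton_cycle(A, b, x0=np.zeros(4))
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x - x_ls))   # close to the least squares solution for small delta
```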
We have considered so far a single cycle of the incremental Newton
method. One algorithmic possibility for cycling through the component
functions multiple times, is to simply create a larger set of components by
concatenating multiple copies of the original set, that is, by forming what
we refer to as the extended set of components
by introducing a fading factor λ_k ∈ (0, 1), which can be used to accelerate the practical convergence rate of the method (see [Ber96] for an analysis of schemes where λ_k → 1; in cases where λ_k is some constant λ < 1, linear convergence to within a neighborhood of the optimum may be shown).
The following example provides some insight regarding the behavior
of the method when the cost function f has a very large number of cost
components, as is the case when f is defined as the average of a very large
number of random samples.
    minimize  f(x) = lim_{k→∞} (1/k) Σ_{i=1}^k F(x, w_i)
    subject to  x ∈ ℝⁿ,
where {w_k} is a given sequence from some set, and each function F(·, w_i) : ℝⁿ → ℝ is positive semidefinite quadratic. We assume that f is well-defined (i.e., the limit above exists for each x ∈ ℝⁿ), and is a positive definite
quadratic. This type of problem arises in linear regression models (cf. Ex-
ample 1.3.1) involving an infinite amount of data that is obtained through
random sampling.
The natural extension of the incremental Newton's method, applied to the infinite set of components F(·, w₁), F(·, w₂), ..., generates the sequence {x̃_k}, where x̃_k is the minimum of the function
    f_k(x) = (1/k) Σ_{i=1}^k F(x, w_i).
Since f is positive definite and the same is true for f_k when k is large enough, we have x̃_k → x*, where x* is the minimum of f. The rate of convergence
is determined strictly by the rate at which the vectors x̃_k approach x*, or equivalently by the rate at which f_k approaches f. It is impossible to achieve a faster rate of convergence with an algorithm that is nonanticipative in the sense that it uses just the first k cost components in the first k iterations. By contrast, if we were to apply the natural extension of the incremental gradient method to this problem, the convergence rate could be much worse. There would be an error due to the difference (x̃_k − x*), but also an additional error due to the difference (x̃_k − x_k) between x̃_k and the kth iterate x_k of the incremental gradient method, which is generally diminishing quite slowly, possibly more slowly than (x̃_k − x*). The same is true for other gradient-type methods based on incremental computations, including the aggregated gradient methods discussed earlier.
    minimize  Σ_{i=1}^m ‖g_i(x)‖²
    subject to  x ∈ ℝⁿ,
where the g_i are given (possibly nonlinear) functions.
If all the functions g_i are linear, we have g̃_i(x; ψ) = g_i(x), and the method solves the problem exactly in a single cycle. It then becomes identical to the incremental Newton method.
When the functions gi are nonlinear the algorithm differs from the
incremental Newton method because it does not involve second deriva-
tives of gi. It may be viewed instead as an incremental version of the
Gauss-Newton method, a classical nonincremental scaled gradient method
for solving nonlinear least squares problems (see e.g., [Ber99], Section 1.5).
It is also known as the extended Kalman filter, and has found extensive ap-
plication in state estimation and control of dynamic systems, where it was
introduced in the mid-60s (it was also independently proposed in [Dav76]).
The implementation issues of the extended Kalman filter are simi-
lar to the ones of the incremental Newton method. This is because both
methods solve similar linear least squares problems at each iteration [cf.
Eqs. (2.43) and (2.49)]. The convergence behaviors of the two methods are
    x_{k+1}ⁱ ∈ arg min_{ξ∈X_i} f( x¹_{τ_{i1}(k)}, ..., x^{i−1}_{τ_{i,i−1}(k)}, ξ, x^{i+1}_{τ_{i,i+1}(k)}, ..., xⁿ_{τ_{in}(k)} )    if k ∈ R_i,    i = 1, ..., n,
and is left unchanged (x_{k+1}ⁱ = x_kⁱ) if k ∉ R_i. The meanings of the subsets of updating times R_i and indexes τ_{ij}(k) are the same as in the case of gradient methods. Also the distributed environment where the method can be applied is similar to the case of the gradient method.
Another practical setting that may be modeled well by this iteration
is when all computation takes place at a single computer, but any
number of coordinates may be simultaneously updated at a time,
with the order of coordinate selection possibly being random.
(c) Incremental gradient methods for the case where
    f(x) = Σ_{i=1}^m f_i(x).
At the next iteration, Fk+l and Xk+l are generated by refining the approx-
imation, based on the new point Xk+l, and possibly on the earlier points
Xk, ... , xo. Of course such a method makes sense only if the approximat-
ing problems are simpler than the original. There is a great variety of
approximation methods, with different aims, and suitable for different cir-
cumstances. The present section provides a brief overview and orientation,
while Chapters 4-6 provide a detailed analysis.
    P(u) = 0 if u = 0,  and  P(u) > 0 if u ≠ 0.
The scalar Ck is a positive penalty parameter, so by increasing Ck to oo,
the solution Xk of the penalized problem tends to decrease the constraint
violation, thereby providing an increasingly accurate approximation to the
original problem. An important practical point here is that Ck should
be increased gradually, using the optimal solution of each approximating
problem to start the algorithm that solves the next approximating problem.
Otherwise serious numerical problems occur due to "ill-conditioning."
A common choice for P is the quadratic penalty function
    P(u) = ½ ‖u‖²,
(2.54)
This is also known as the first order augmented Lagrangian method (also
called first order method of multipliers). It is a major general purpose,
highly reliable, constrained optimization method, which applies to non-
convex problems as well. It has a rich theory, with a strong connection
to duality, and many variations that are aimed at increased efficiency, in-
volving for example second order multiplier updates and inexact minimiza-
tion of the augmented Lagrangian. In the convex programming setting of
P(u) = ½(max{O,u})2.
The quadratic regularization term makes the cost function of the preced-
ing problem strictly convex, and guarantees that it has a unique minimum.
Sometimes the quadratic term in Eq. (2.56) is scaled and a term ‖Sx‖² is used instead, where S is a suitable scaling matrix. The difference with
the proximal algorithm (2.55) is that Xk does not enter directly the min-
imization to determine Xk+l, so the method relies for its convergence on
increasing ck to oo. By contrast this is not necessary for the proximal al-
gorithm, which is generally convergent even when Ck is left constant (as
we will see in Section 5.1), and is typically much faster. Similar to the
proximal algorithm, there is a dual and essentially equivalent algorithm
to Tikhonov regularization. This is the penalty method that consists of
sequential minimization of the quadratically penalized cost function (2.52)
for a sequence {ck} with Ck ---+ oo.
    x_{k+1} ∈ arg min_x L_c(x, z_k, λ_k),        (2.59)
    z_{k+1} ∈ arg min_z L_c(x_{k+1}, z, λ_k),        (2.60)
    λ_{k+1} = λ_k + c (A x_{k+1} − z_{k+1}),        (2.61)
where L_c denotes the corresponding augmented Lagrangian function.
The important advantage that the ADMM may offer over the augmented Lagrangian method is that it does not involve a joint minimization
with respect to x and z. Thus the complications resulting from the coupling
of x and z in the penalty term IIAx - zll 2 of the augmented Lagrangian
are eliminated. This property can be exploited in special applications, for
which the ADMM is structurally well suited, as we will discuss in Sec-
tion 5.4. On the other hand the ADMM may converge more slowly than
the augmented Lagrangian method, so the flexibility it provides must be weighed against this potential drawback.
In Chapter 5, we will see that the proximal algorithm for minimiza-
tion can be viewed as a special case of a generalized proximal algorithm for
finding a solution of an equation involving a multivalued monotone opera-
tor. While we will not fully develop the range of algorithms that are based
on this generalization, we will show that both the augmented Lagrangian
method and the ADMM are special cases of the generalized proximal al-
gorithm, corresponding to two different multivalued monotone operators.
Because of the differences of these two operators, some of the properties of
the two methods are quite different. For example, contrary to the case of
the augmented Lagrangian method (where Ck is often taken to be increasing
with k in order to accelerate convergence), there seems to be no generally
good way to adjust c in ADMM from one iteration to the next. Moreover,
even when both methods have a linear convergence rate, the performance of
the two methods may differ markedly in practice. Still there is more than a
superficial connection between the two methods, which can be understood
within the context of their common proximal algorithm ancestry.
In Section 6.3, we will also see another connection of ADMM with
proximal-related methods, and particularly the proximal gradient method,
which we briefly discussed in Section 2.1.4 [cf. Eq. (2.27)]. It turns out
that both the ADMM and the proximal gradient method can be viewed
where f₁, ..., f_m are differentiable functions, can be converted to the differentiable constrained problem
    minimize  z
    subject to  f_j(x) ≤ z,  j = 1, ..., m.        (2.63)
The conjugates of φ₁(u) = f(x − u) and φ₂(u) = λ'u + (c/2)‖u‖² are φ₁*(y) = f*(−y) + y'x and φ₂*(y) = (1/(2c))‖y − λ‖², so by using the Fenchel duality formula inf_{u∈ℝⁿ} { φ₁(u) + φ₂(u) } = sup_{y∈ℝⁿ} { −φ₁*(−y) − φ₂*(y) }, we have
    lim_{c→∞} f_{c,λ}(x) = f**(x) = f(x),
It can be verified using Eqs. (2.64) and (2.65) that
    f_{c,λ}(x) = x − (1 − λ)²/(2c)    for x ≥ (1 − λ)/c;
see Fig. 2.2.2. The function f(x) = max{0, x} may also be used as a building block to construct more complicated nondifferentiable functions, such as for example
    max{x₁, x₂} = x₁ + max{0, x₂ − x₁};
see [Ber82a], Ch. 3.
The smoothing technique just described can also be combined with the augmented Lagrangian method. As an example, let f : ℝⁿ → (−∞, ∞] be a closed proper convex function with conjugate denoted by f*. Let F : ℝⁿ → ℝ be another convex function, and let X be a closed convex set.
Consider the problem
where
subject to x E X,
which is closed and convex, being the supremum of closed and convex
functions. The augmented Lagrangian minimization (2.53) for this problem
takes the form
    x_{k+1} ∈ arg min_{x∈X} f_{c_k,λ_k}(x),
where
    y_{k+1} ∈ arg min_{y∈ℝᵐ} { f(x_{k+1}, y) + λ_k'y + (c_k/2) ‖y‖² }.
This method of course makes sense only in the case where the function f
has a convenient form that facilitates the preceding minimization.
For further discussion of the relations and combination of smoothing
with the augmented Lagrangian method, see [Ber75b], [Ber77], [Pap81],
and for a detailed textbook analysis, [Ber82a], Ch. 3. There have also been
many variations of smoothing ideas and applications in different contexts;
see [Ber73], [Geo77], [Pol79], [Pol88], [BeT89b], [PiZ94], [Nes05], [Che07],
[OvG 14]. In Section 6.2, we will also see an application of smoothing as an
analytical device, in the context of complexity analysis.
Exponential Smoothing
We have used so far a quadratic penalty function as the basis for smooth-
ing. It is also possible to use other types of penalty functions. A simple
penalty function, which often leads to convenient formulas is the exponen-
tial, which will also be discussed further in Section 6.6. The advantage of
the exponential function over the quadratic is that it produces twice differ-
entiable approximating functions. This may be significant when Newton's
method is used to solve the smoothed problem.
In particular, the smoothed version of the function max{f₁(x), ..., f_m(x)} is given by
    f_{c,λ}(x) = (1/c) ln { Σ_{i=1}^m λⁱ e^{c f_i(x)} },        (2.66)
where c > 0 and λ = (λ¹, ..., λᵐ) is a vector with
    Σ_{i=1}^m λⁱ = 1,    λⁱ > 0,  ∀ i = 1, ..., m.
and by repeating the process. t The generated sequence { Xk} can be shown
to converge to the minimum of f under mild assumptions, based on gen-
eral convergence properties of augmented Lagrangian methods that use
nonquadratic penalty functions; see [Ber82a], Ch. 5, for a detailed devel-
opment.
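As an illustration of the exponential smoothing formula (2.66), the following minimal sketch evaluates the smoothed maximum of m scalar values, using the translation device mentioned in the footnote below to avoid overflow; the weights and test values are illustrative.

```python
import numpy as np

def smoothed_max(f_values, c, lam):
    """Exponential smoothing of max_i f_i: (1/c) * log( sum_i lam_i * exp(c * f_i) ),
    cf. Eq. (2.66), with the exponents shifted by their maximum to avoid overflow."""
    f_values = np.asarray(f_values, dtype=float)
    lam = np.asarray(lam, dtype=float)          # lam_i > 0, sum_i lam_i = 1
    shift = np.max(c * f_values)
    return (shift + np.log(np.sum(lam * np.exp(c * f_values - shift)))) / c

# The smoothed value underestimates the true max and approaches it as c grows.
f_values = [1.0, 2.5, 2.4]
lam = [1.0 / 3.0] * 3
for c in (1.0, 10.0, 100.0):
    print(c, smoothed_max(f_values, c, lam))    # approaches max = 2.5 from below
```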
    minimize  γ Σ_{j=1}^n |x^j| + ½ Σ_{i=1}^m (a_i'x − b_i)²
    subject to  x ∈ ℝⁿ,
† Sometimes the use of the exponential in Eq. (2.67) and other related formulas, such as (2.66), may lead to very large numbers and computer overflow. In this case one may use a translation device to avoid such numbers, e.g., multiplying numerator and denominator in Eq. (2.67) by e^{−β_k}, where
where a_i and b_i are given vectors and scalars, respectively (cf. Example 1.3.1). The nondifferentiable ℓ₁ penalty may be smoothed by writing each term |x^j| as max{x^j, −x^j} and by smoothing it using Eq. (2.66), i.e., replacing it by
    R_{c,λ^j}(x^j) = (1/c) ln { λ^j e^{c x^j} + (1 − λ^j) e^{−c x^j} },
where c and λ^j are scalars satisfying c > 0 and λ^j ∈ (0, 1) (see Fig. 2.2.3).
We may then consider an exponential type of augmented Lagrangian method, whereby we minimize over ℝⁿ the twice differentiable function
    γ Σ_{j=1}^n R_{c_k,λ_k^j}(x^j) + ½ Σ_{i=1}^m (a_i'x − b_i)²,        (2.68)
[cf. Eq. (2.67)], and by repeating the process. Note that the minimization of
the exponentially smoothed cost function (2.68) can be carried out efficiently
by incremental methods, such as the incremental gradient and Newton meth-
ods of Section 2.1.5.
As Fig. 2.2.3 suggests, the adjustment of the multiplier λ^j can selectively reduce the error
kin [PoT80], [PoT81], and more recently by Bottou and LeCun [BoL05],
and Bhatnagar, Prasad, and Prashanth [BPP13]. Among others, these
references quantify the convergence rate advantage that stochastic New-
ton methods have over stochastic gradient methods. Deterministic
incremental Newton methods have received little attention (for a re-
cent work see Gurbuzbalaban, Ozdaglar, and Parrilo [GOP14]). How-
ever, they admit an analysis that is similar to a deterministic analysis
of the extended Kalman filter, the incremental version of the Gauss-
Newton method (see Bertsekas [Ber96], and Moriyama, Yamashita, and
Fukushima [MYF03]). There are also many stochastic analyses of the
extended Kalman filter in the literature of estimation and control of
dynamic systems.
Let us also note another approach to accelerate the theoretical
convergence rate of incremental gradient methods, which involves using
a larger than O(1/k) stepsize and averaging the iterates (for analysis of
the corresponding stochastic gradient methods, see Ruppert [Rup85],
and Poljak and Juditsky [PoJ92], and for a textbook account, Kushner
and Yin [KuY03]).
Section 2.2: The nonlinear programming textbooks cited earlier con-
tain a lot of material on approximation methods. In particular, the lit-
erature on polyhedral approximation is extensive. It dates to the early
days of nonlinear and convex programming, and it involves applications
in data communication and transportation networks, and large-scale
resource allocation. This literature will be reviewed in Chapter 4.
The research monographs by Fiacco and MacCormick [FiM68],
and Bertsekas [Ber82a] focus on penalty and augmented Lagrangian
methods, respectively. The latter book also contains a lot of mate-
rial on smoothing methods and the proximal algorithm, including cases
where nonquadratic regularization is involved, leading in turn to non-
quadratic penalty terms in the augmented Lagrangian (e.g., logarithmic
regularization and exponential penalty).
The proximal algorithm was proposed in the early 70s by Martinet
[Mar70], [Mar72]. The literature on the algorithm and its extensions,
spurred by the influential paper by Rockafellar [Roc76a], is voluminous,
and reflects the central importance of proximal ideas in convex optimiza-
tion and other problems. The ADMM, an important special case of the
proximal context, was proposed by Glowinski and Marrocco [GlM75],
and Gabay and Mercier [GaM76], and was further developed by Gabay
[Gab79], [Gab83]. We refer to Section 5.4 for a detailed discussion of this
algorithm, its applications, and its connections to more general operator
splitting methods. Recent work involving proximal ideas has focused on
combinations with other algorithms, such as gradient, subgradient, and
coordinate descent methods. Some of these combined methods will be
discussed in detail in Chapters 5 and 6.
EXERCISES
(b) Show that the value of α that minimizes the bound of part (a) is
    α* = 2/(M + m),
in which case
that this contraction property implies for steepest descent with con-
stant stepsize is sharp, in the sense that there exist starting points xo
for which the preceding inequality holds as an equation for all k (see
[Ber99], Section 2.3).
This exercise deals with an inequality that is fundamental for the convergence analysis of gradient methods. Let X be a convex set, and let f : ℝⁿ → ℝ be a differentiable function such that for some constant L > 0, we have
    ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖,    ∀ x, y ∈ X.
Show that
    f(y) ≤ f(x) + ∇f(x)'(y − x) + (L/2) ‖y − x‖²,    ∀ x, y ∈ X.        (2.70)
Proof: Let t be a scalar parameter and let g(t) = f(x + t(y − x)). The chain rule yields (dg/dt)(t) = ∇f(x + t(y − x))'(y − x). Thus, we have
    f(y) − f(x) = g(1) − g(0)
        = ∫₀¹ (dg/dt)(t) dt
        = ∫₀¹ (y − x)'∇f(x + t(y − x)) dt
        ≤ ∫₀¹ (y − x)'∇f(x) dt + | ∫₀¹ (y − x)'(∇f(x + t(y − x)) − ∇f(x)) dt |
        ≤ ∫₀¹ (y − x)'∇f(x) dt + ∫₀¹ ‖y − x‖ · ‖∇f(x + t(y − x)) − ∇f(x)‖ dt
where 0 < α < 2/L. Show that if {x_k} has a limit point, then ∇f(x_k) → 0, and
every limit point x̄ of {x_k} satisfies ∇f(x̄) = 0. Proof: We use the descent inequality (2.70) to show that the cost function is reduced at each iteration according to
Thus if there exists a limit point x̄ of {x_k}, we have f(x_k) → f(x̄) and ∇f(x_k) → 0. This implies that ∇f(x̄) = 0, since ∇f(·) is continuous by Eq. (2.71).
where d_k is a descent direction. Given fixed scalars β and σ, with 0 < β < 1, 0 < σ < 1, and s_k with inf_{k≥0} s_k > 0, the stepsize α_k is determined as follows: we set α_k = β^{m_k} s_k, where m_k is the first nonnegative integer m for which
    f(x_k) − f(x_k + β^m s_k d_k) ≥ −σ β^m s_k ∇f(x_k)' d_k.
Assume that there exist positive scalars c₁, c₂ such that for all k we have
    c₁ ‖∇f(x_k)‖² ≤ −∇f(x_k)' d_k,    ‖d_k‖² ≤ c₂ ‖∇f(x_k)‖².        (2.72)
(a) Show that the stepsize α_k is well-defined, i.e., that it will be determined after a finite number of reductions if ∇f(x_k) ≠ 0. Proof: We have for all s > 0
which is satisfied for s in some interval (0, s̄_k]. Thus the test will be passed for all m for which β^m s_k ≤ s̄_k.
(b) Show that every limit point x̄ of the generated sequence {x_k} satisfies ∇f(x̄) = 0. Proof: Assume, to arrive at a contradiction, that there is a subsequence {x_k}_K that converges to some x̄ with ∇f(x̄) ≠ 0. Since {f(x_k)} is monotonically nonincreasing, {f(x_k)} either converges to a finite value or diverges to −∞. Since f is continuous, f(x̄) is a limit point of {f(x_k)}, so it follows that the entire sequence {f(x_k)} converges to f(x̄). Hence,
    f(x_k) − f(x_{k+1}) → 0.
By the definition of the Armijo rule and the descent property ∇f(x_k)'d_k ≤ 0 of the direction d_k, we have
    f(x_k) − f(x_{k+1}) ≥ −σ α_k ∇f(x_k)' d_k ≥ 0.        (2.73)
From the left side of Eq. (2.72) and the hypothesis ∇f(x̄) ≠ 0, it follows that
    limsup_{k→∞, k∈K} ∇f(x_k)' d_k < 0,        (2.74)
and, in view of Eq. (2.73) and the fact f(x_k) − f(x_{k+1}) → 0, that
    {α_k}_K → 0.
Since s_k, the initial trial value for α_k, is bounded away from 0, s_k will be reduced at least once for all k ∈ K that are greater than some iteration index k̄. Thus we must have, for all k ∈ K with k > k̄,
(2.75)
From the right side of Eq. (2.72), {d_k}_K is bounded, and it follows that there exists a subsequence {d_k}_{K̄} of {d_k}_K such that
    ∀ k ∈ K̄, k ≥ k̄,
where ᾱ_k = α_k/β. By using the mean value theorem, this relation is written as
    ∀ k ∈ K̄, k ≥ k̄,
where α̃_k is a scalar in the interval [0, ᾱ_k]. Taking limits in the preceding relation we obtain
or
    0 ≤ (1 − σ) ∇f(x̄)' d̄.
Since σ < 1, it follows that
    0 ≤ ∇f(x̄)' d̄,
a contradiction of Eq. (2.74).
Let f : ℝⁿ → ℝ be a differentiable convex function, and assume that for some L > 0, we have
    ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖,    ∀ x, y ∈ ℝⁿ.
Let X* be the set of minima of f, and assume that X* is nonempty. Consider the steepest descent method
    x_{k+1} = x_k − α_k ∇f(x_k).
Show that {x_k} converges to a minimizing point of f under each of the following two stepsize rule conditions:
(i) For some ε > 0, we have
    ε ≤ α_k ≤ (2 − ε)/L,    ∀ k = 0, 1, ....
minimization rule may produce a sequence with multiple limit points (all of
which are of course optimal), even for a convex cost function. There is also
a "local capture" theorem that applies to gradient methods for nonconvex
continuously differentiable cost functions f and an isolated local minimum of
f (a local minimum x* that is unique within a neighborhood of x*). Under
mild conditions it asserts that there is an open sphere S_{x*} centered at x* such that once the generated sequence {x_k} enters S_{x*}, it converges to x*
there). Abbreviated Proof: Consider the stepsize rule (i). From the descent
inequality (Exercise 2.2), we have for all k
since ∇f(x_k) = (x_k − x_{k+1})/α_k. Moreover any limit point of {x_k} belongs to X*, since ∇f(x_k) → 0 and f is convex.
Using the convexity of f, we have for all x* ∈ X*,
    ‖x_{k+1} − x*‖² − ‖x_k − x*‖² − ‖x_{k+1} − x_k‖² = −2 (x* − x_k)'(x_{k+1} − x_k)
        = 2 α_k (x* − x_k)' ∇f(x_k)
        ≤ 2 α_k (f(x*) − f(x_k))
        ≤ 0,
so that
    ‖x_{k+1} − x*‖² ≤ ‖x_k − x*‖² + ‖x_{k+1} − x_k‖²,    ∀ x* ∈ X*.        (2.77)
We now use Eqs. (2.76) and (2.77), and the Fejér Convergence Theorem (Prop. A.4.6 in Appendix A). From part (a) of that theorem it follows that {x_k} is bounded, and hence it has a limit point x̄, which must belong to X* as shown earlier. Using this fact and part (b) of the theorem, it follows that {x_k} converges to x̄.
The proof for the case of the stepsize rule (ii) is similar. Using the assumptions α_k → 0 and Σ_{k=0}^∞ α_k = ∞, and the descent inequality, we show that ∇f(x_k) → 0, that {f(x_k)} converges, and that Eq. (2.76) holds. From this point, the preceding proof applies.
k = 0, 1, .... (2.78)
Show that either f(x_k) → −∞ or else f(x_k) converges to a finite value and lim_{k→∞} ∇f(x_k) = 0. Furthermore, every limit point x̄ of {x_k} satisfies ∇f(x̄) = 0. Abbreviated Proof: The descent inequality (2.70) yields
Combining the preceding three relations and collecting terms, it follows that
Since α_k → 0, we have for some positive constants c and d, and all k sufficiently large
Using the inequality ‖∇f(x_k)‖ ≤ 1 + ‖∇f(x_k)‖², the above relation yields for all k sufficiently large
Let f : ℝⁿ → ℝ be a convex function, and let us view the steepest descent direction at x as the solution of the problem
    minimize  f'(x; d)
    subject to  ‖d‖ ≤ 1.
Show that this direction is −g*, where g* is the vector of minimum norm in ∂f(x). Abbreviated Solution: From Prop. 5.4.8 in Appendix B, f'(x; ·) is the support function of the nonempty and compact subdifferential ∂f(x), i.e.,
    f'(x; d) = max_{g∈∂f(x)} d'g.        (2.80)
Since the sets {d | ‖d‖ ≤ 1} and ∂f(x) are convex and compact, and the function d'g is linear in each variable when the other variable is fixed, by the Saddle Point Theorem of Prop. 5.5.3 in Appendix B, it follows that
    min_{‖d‖≤1} max_{g∈∂f(x)} d'g = max_{g∈∂f(x)} min_{‖d‖≤1} d'g,
and that a saddle point exists. For any saddle point (d*, g*), g* maximizes the function min_{‖d‖≤1} d'g = −‖g‖ over ∂f(x), so g* is the unique vector of minimum norm in ∂f(x). Moreover, d* minimizes max_{g∈∂f(x)} d'g, or equivalently f'(x; d) [by Eq. (2.80)], subject to ‖d‖ ≤ 1 (so it is a direction of steepest descent), and minimizes d'g* subject to ‖d‖ ≤ 1, so it has the form
    d* = −g*/‖g*‖.
    ψ_{i,k} = ψ_{i−1,k} − (α_k/‖c_i‖²) max{0, c_i'ψ_{i−1,k} − b_i} c_i,    i = 1, ..., m,
starting with ψ_{0,k} = x_k. Show that the algorithm can be viewed as an incremental gradient method for a suitable differentiable cost function.
(b) Implement the algorithm of (a) for two examples where n = 2 and m = 100. In the first example, the vectors c_i have the form c_i = (ξ_i, ζ_i), where ξ_i, ζ_i, as well as b_i, are chosen randomly and independently from [−100, 100] according to a uniform distribution. In the second example, the vectors c_i have the form c_i = (ξ_i, ζ_i), where ξ_i, ζ_i are chosen randomly and independently within [−10, 10] according to a uniform distribution, while b_i is chosen randomly and independently within [0, 1000] according to a uniform distribution. Experiment with
different starting points and stepsize choices, and deterministic and
randomized orders of selection of the indexes i for iteration. Explain
your experimental results in terms of the theoretical behavior described
in Section 2.1.
    f(x) = Σ_{i=1}^m f_i(x),
and
    ‖∇f_i(x)‖ ≤ C + D ‖∇f(x)‖,    ∀ x ∈ ℝⁿ,  i = 1, ..., m.
Assume also that
    Σ_{k=0}^∞ α_k = ∞,    Σ_{k=0}^∞ α_k² < ∞.
Show that either f(x_k) → −∞ or else f(x_k) converges to a finite value
and lim_{k→∞} ∇f(x_k) = 0. Furthermore, every limit point x̄ of {x_k} satisfies ∇f(x̄) = 0. Abbreviated Solution: The idea is to view the incremental
gradient method as a gradient method with errors, so that the result of Exercise 2.6 can be used. For simplicity we assume that m = 2. The proof is
similar when m > 2. We have
where
We have
where i_k is an index randomly chosen from the set {1, ..., m} with equal
probabilities 1/m, independently of previous choices. Let P(x) denote the
Euclidean projection of a vector x ∈ ℝⁿ onto the set of solutions of the
system, and let C be the matrix whose rows are c_1, ..., c_m. Show that
where λ_min is the minimum eigenvalue of the matrix C'C. Hint: Show that
where b_1 and b_2 are given scalars, and the incremental gradient algorithm
that generates x_{k+1} from x_k according to
where
\[
\psi_k = x_k - \alpha(x_k - b_1),
\]
and α is a positive stepsize. Assuming that α < 1, show that {x_k} and {ψ_k}
converge to limits x(α) and ψ(α), respectively. However, unless b_1 = b_2,
x(α) and ψ(α) are neither equal to each other, nor equal to the least squares
solution x* = (b_1 + b_2)/2. Verify that
over x ∈ ℝⁿ, where the vectors z_i and the matrices C_i are given. Let x_k be
the vector at the start of cycle k of the incremental gradient method that
operates in cycles, where components are selected according to a fixed order.
Thus we have
i = 1, ..., m.
Assume that Σ_{i=1}^m C_i'C_i is a positive definite matrix and let x* be the optimal
solution. Then:
(a) There exists ᾱ > 0 such that if α_k is equal to some constant α ∈ (0, ᾱ]
for all k, {x_k} converges to some vector x(α). Furthermore, the error
‖x_k − x(α)‖ converges to 0 linearly. In addition, we have lim_{α→0} x(α) =
x*. Hint: Show that the mapping that produces x_{k+1} starting from x_k
is a contraction mapping for α sufficiently small.
(b) If α_k > 0 for all k, and
k = 0, 1, ...,
where f_0, f_1, ..., are quadratic functions with eigenvalues lying within some
interval [γ, Γ], where γ > 0. Suppose that for a given ε > 0, there is a vector
x* such that
∀ k = 0, 1, ....
Show that for all α with 0 < α ≤ 2/(γ + Γ), the generated sequence {x_k}
converges to a 2ε/γ-neighborhood of x*, i.e.,
For other related convergence rate results, see [NeBOO] and [Sch14a].
The proximal gradient iteration (2.27) is well suited for problems involving a
nondifferentiable function component that is convenient for a proximal iteration. This exercise considers the important case of the ℓ_1 norm. Consider
the problem
minimize f(x) + γ‖x‖_1
subject to x ∈ ℝⁿ,
where f : ℝⁿ ↦ ℝ is a differentiable convex function, ‖·‖_1 is the ℓ_1 norm,
and γ > 0. The proximal gradient iteration is given by the gradient step
[cf. Eq. (2.28)]. Show that the proximal step can be performed separately for
each coordinate x^i of x, and is given by the so-called shrinkage operation:
i = 1, ..., n.
Note: Since the shrinkage operation tends to set many coordinates x^i_{k+1} to
0, it tends to produce "sparse" iterates.
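A minimal sketch (not from the text) of the proximal gradient iteration with the shrinkage step follows, using the standard closed form of the ℓ_1 proximal step; the quadratic choice of f, the data, and the stepsize rule are illustrative assumptions.

```python
# Sketch: proximal gradient (gradient step on f, then shrinkage) for
# minimizing f(x) + gamma*||x||_1 with f(x) = 1/2 ||Ax - b||^2.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
gamma = 0.5
alpha = 1.0 / np.linalg.norm(A.T @ A, 2)        # stepsize <= 1/L for the smooth part

def shrink(z, t):
    # componentwise soft threshold: sign(z_i) * max(|z_i| - t, 0)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

x = np.zeros(20)
for k in range(500):
    z = x - alpha * (A.T @ (A @ x - b))          # gradient step on f
    x = shrink(z, alpha * gamma)                 # proximal (shrinkage) step

print("nonzero coordinates:", int(np.count_nonzero(x)))
```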
where c > 0 is some scalar, and the scalars λ_i, i = 1, ..., m, are such that
λ_i > 0, i = 1, ..., m,
(a) Show that if the system is feasible (or strictly feasible), the optimal
value is nonpositive (or strictly negative, respectively). If the system is
infeasible, then
where (c_i, b_i) are randomly generated as in the two problems of Exercise
2.9(b). Experiment with different starting points and stepsize choices,
and deterministic and randomized orders of selection of the indexes i
for iteration.
(c) Repeat part (b) where the problem is instead to minimize f(x) =
max_{i=1,...,m} g_i(x) and the exponential smoothing method of Section
2.2.5 is used, possibly with the augmented Lagrangian update (2.67)
of λ. Compare three methods of operation: (1) c is kept constant and
λ is updated, (2) c is increased to ∞ and λ is kept constant, and (3) c
is increased to ∞ and λ is updated.
3
Subgradient Methods
Given a proper convex function f : ℝⁿ ↦ (−∞, ∞], we say that a vector
g ∈ ℝⁿ is a subgradient of f at a point x ∈ dom(f) if
Thus, g is a subgradient of f at x if and only if the hyperplane in ℝⁿ⁺¹ that has
normal (−g, 1) and passes through (x, f(x)) supports the epigraph of f, as shown
in the figure.
(cf. Section 5.4.4 of Appendix B). The ratio on the right-hand side is mono-
tonically nonincreasing to f'(x; d), as shown in Section 5.4.4 of Appendix
B; also see Fig. 3.1.3.
Our first result shows some basic properties, and provides the connection between ∂f(x) and f'(x; d) for real-valued f. A related and more
refined result is given in Prop. 5.4.8 in Appendix B for extended real-valued
f. Its proof, however, is more intricate and includes some conditions that
are unnecessary for the case where f is real-valued.
\[
f'(x; d) = \max_{g \in \partial f(x)} g'd, \qquad \forall\, d \in \mathbb{R}^n, \tag{3.3}
\]
Figure 3.1.3. Illustration of the directional derivative of a convex function f. The ratio
(f(x + αd) − f(x))/α is monotonically nonincreasing as α ↓ 0 and converges to f'(x; d).
\[
\phi(0) \le \frac{\alpha}{\alpha+\beta}\,\phi(-\beta) + \frac{\beta}{\alpha+\beta}\,\phi(\alpha),
\]
or equivalently
\[
f(x) \le \frac{\alpha}{\alpha+\beta}\, f(x - \beta d) + \frac{\beta}{\alpha+\beta}\, f(x + \alpha d).
\]
This relation can be written as
\[
\frac{f(x) - f(x - \beta d)}{\beta} \le \frac{f(x + \alpha d) - f(x)}{\alpha}.
\]
{g | f'(x; d) ≥ g'd},
where d ranges over the nonzero vectors of ℝⁿ. It follows that ∂f(x) is
closed and convex. Also it cannot be unbounded, since otherwise, for some
d ∈ ℝⁿ, g'd could be made unbounded from above by proper choice of
g ∈ ∂f(x), contradicting Eq. (3.4) [since f'(x; ·) was proved to be real-valued].
Next we show that ∂f(x) is nonempty. Take any x and d in ℝⁿ, and
consider the convex subset of ℝⁿ⁺¹
see Fig. 3.1.4. Since the quotient on the right in Eq. (3.2) is monotonically
nonincreasing and converges to f'(x; d) as α ↓ 0 (cf. Fig. 3.1.3), we have
It follows that the convex sets C_1 and C_2 are disjoint. By applying the
Separating Hyperplane Theorem (Prop. 1.5.2 in Appendix B), we see that
there exists a nonzero vector (µ, γ) ∈ ℝⁿ⁺¹ such that
Figure 3.1.4. Illustration of the sets C_1 and C_2 used in the hyperplane separation
argument of the proof of Prop. 3.1.1(a).
is f'(x; d) = ∇f(x)'d. Thus, from Eq. (3.3), f has ∇f(x) as its unique
subgradient at x. Q.E.D.
where
\[
L = \sup_{g \in \bigcup_{x \in X} \partial f(x)} \|g\|.
\]
Since both {x k} and { dk} are bounded, they contain convergent subse-
quences. We assume without loss of generality that {xk} and {dk} con-
verge to some vectors. Therefore, by the continuity off (cf. Prop. 1.3.11
in Appendix B), the left-hand side of the preceding relation is bounded.
Hence the right-hand side is also bounded, thereby contradicting the un-
boundedness of {gk}.
(b) Let x and z be any two points in X. By the subgradient inequality
(3.1), we have for all g ∈ ∂f(x),
so that
\[
f(x) - f(z) \le \|g\| \cdot \|x - z\| \le L\,\|x - z\|.
\]
By exchanging the roles of x and z, we similarly obtain
The next proposition provides the analog of the chain rule for sub-
differentials of real-valued convex functions. The proposition is a special
case of more general results that apply to extended real-valued functions
(Props. 5.4.5 and 5.4.6 of Appendix B), but admits a simpler proof.
Proposition 3.1.3:
(a) (Chain Rule): Let F be the composition of a convex function
h : Rm 1-t R and an m x n matrix A,
Then
Then
and A is the matrix defined by the equation Ax= (x, ... , x). The subdif-
ferential sum formula then follows from part (a). Q.E.D.
(cf. Prop. 1.1.8 in Appendix B). The proposition can be generalized in turn
to the case where f can be extended real-valued. In this case the optimality
condition requires the additional assumption that ri(dom(f)) ∩ ri(X) ≠ ∅,
or some polyhedral assumption on f and/or X; see Prop. 5.4.7 in Appendix
B, whose proof is simple but requires a more sophisticated version of the
chain rule that applies to extended real-valued functions.
g'(z − x) ≥ 0, ∀ z ∈ X.
Proof: Suppose that g'(z − x) ≥ 0 for some g ∈ ∂f(x) and all z ∈ X. Then
from the subgradient inequality (3.1), we have f(z) − f(x) ≥ g'(z − x) for
all z ∈ X, so f(z) − f(x) ≥ 0 for all z ∈ X, and x minimizes f over X.
Conversely, suppose that x minimizes f over X. Consider the set of
feasible directions of X at x, i.e., the cone
Figure 3.1.5. Illustration of the optimality condition of Prop. 3.1.4. In the figure
on the left, f is differentiable and the optimality condition is −∇f(x*) ∈ N_X(x*).
Thus, using Prop. 3.1.1(a), we have f'(x; d) < 0, while from Eq. (3.7),
we see that d belongs to the polar cone of W*, which by the Polar Cone
Theorem [Prop. 2.2.1(b) in Appendix B] is the closure of W. Hence there
is a sequence {y_k} ⊂ W that converges to d. Since f'(x; ·) is a continuous
function [being convex and real-valued by Prop. 3.1.1(a)] and f'(x; d) < 0,
we have f'(x; y_k) < 0 for all k after some index, which contradicts the
optimality of x. Q.E.D.
Computation of Subgradients
Let
\[
f(x) = \sup_{z \in Z} \phi(x, z), \tag{3.12}
\]
\[
f(y) = \sup_{z \in Z} \phi(y, z) \ge \phi(y, z_x) \ge \phi(x, z_x) + g_x'(y - x) = f(x) + g_x'(y - x),
\]
i.e.,
\[
g_x \in \partial f(x).
\]
This relation provides a convenient method for calculating a single subgradient of f at x with little extra computation, once a maximizer z_x ∈ Z of
φ(x, ·) has been found: we simply use any subgradient in ∂_x φ(x, z_x).
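As a minimal numerical check of this rule (not from the text), take φ(x, z) = z'x − ‖z‖² over a finite set Z, so that ∂_xφ(x, z) = {z} and the maximizer z_x is a subgradient of f at x; the data below are illustrative assumptions.

```python
# Sketch: subgradient of f(x) = max_{z in Z} phi(x, z) via a maximizer z_x.
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((30, 4))                      # a finite subset of R^4
phi = lambda x, z: z @ x - z @ z
f = lambda x: max(phi(x, z) for z in Z)

x = rng.standard_normal(4)
z_x = max(Z, key=lambda z: phi(x, z))                 # a maximizer of phi(x, .)
g = z_x                                               # subgradient candidate at x

# Verify the subgradient inequality f(y) >= f(x) + g'(y - x) at random points y.
ok = all(f(y) >= f(x) + g @ (y - x) - 1e-9
         for y in rng.standard_normal((100, 4)))
print("subgradient inequality holds:", ok)
```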
minimize f(x)
subject to x ∈ X, g(x) ≤ 0,
where f : ℝⁿ ↦ ℝ, g : ℝⁿ ↦ ℝʳ are given functions, and X is a subset of ℝⁿ.
Consider the dual problem
maximize q(µ)
subject to µ ≥ 0,
Thus the dual problem involves minimization of the convex function −q over
µ ≥ 0. Note that in many cases, q is real-valued (for example when f and g
are continuous, and X is compact).
For a convenient way to obtain a subgradient of −q at µ ∈ ℝʳ, suppose
that x_µ minimizes the Lagrangian over x ∈ X,
This is essentially a special case of the preceding example, and can also be
verified directly by writing for all ν ∈ ℝʳ,
\[
q(\nu) = \inf_{x \in X}\bigl\{f(x) + \nu' g(x)\bigr\}
\]
This follows from the Conjugate Subgradient Theorem (see Props. 5.4.3
and 5.4.4 of Appendix B). Thus a subgradient of f at a given x can be
obtained by finding a solution to a maximization problem that involves f*.
(3.13)
∀ x ∈ X.
Similarly,
\[
\bigl(P_X(x) - P_X(y)\bigr)'\bigl(y - P_X(y)\bigr) \le 0.
\]
By adding these two inequalities, we see that
∀ α > 0,
(see Fig. 3.2.1). However, if the stepsize is small enough, the distance of
the current iterate to the optimal solution set is reduced (this is illustrated
Proof: (a) Using the nonexpansion property of the projection [cf. Eq.
(3.14)], we obtain for all y ∈ X and k,
(3.15)
\[
\left(0,\; \frac{2\bigl(f(x_k) - f(x^*)\bigr)}{\|g_k\|^2}\right),
\]
where x* is any optimal solution, and reduces the distance of the current
iterate to x*.
Unfortunately, however, the stepsize (3.15) requires that we know f*,
which is rare. In practice, one must either estimate f* or use some simpler
scheme for selecting a stepsize. The simplest possibility is to select α_k to
be the same for all k, i.e., α_k = α for some α > 0. Then, if the subgradients
g_k are bounded, i.e., ‖g_k‖ ≤ c for some constant c and all k, Prop. 3.2.2(a)
shows that for all optimal solutions x*, we have
\[
0 < \alpha < \frac{2\bigl(f(x_k) - f^*\bigr)}{c^2},
\]
Figure. Illustration of the level set {x ∈ X | f(x) ≤ f* + αc²/2} and the optimal solution set.
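A minimal sketch (not from the text) of the projected subgradient method with a constant stepsize is given below; the ℓ_1-type cost, the box constraint, and the stepsize are illustrative assumptions, and the iterates only approach a neighborhood of the minimum whose size grows with α.

```python
# Sketch: projected subgradient method, constant stepsize,
# for f(x) = ||Ax - b||_1 over the box X = {x : -1 <= x_i <= 1}.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
f = lambda x: np.sum(np.abs(A @ x - b))
subgrad = lambda x: A.T @ np.sign(A @ x - b)      # a subgradient of f at x
proj = lambda x: np.clip(x, -1.0, 1.0)            # projection onto X

alpha = 1e-3                                      # constant stepsize
x = np.zeros(10)
best = f(x)
for k in range(5000):
    x = proj(x - alpha * subgrad(x))
    best = min(best, f(x))
print("best value found:", best)
```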
Convergence Analysis
\[
f^* = \inf_{x \in X} f(x), \qquad X^* = \bigl\{x \in X \mid f(x) = f^*\bigr\}.
\]
We will consider the subgradient method
and three different types of rules for selecting the stepsize α_k:
which follows from Prop. 3.2.2(a). This type of inequality allows the use of
supermartingale convergence arguments (see Section A.4 in Appendix A),
and lies at the heart of the convergence proofs of this section, as well as
the convergence proofs of other subgradient-like methods given in the next
section and Section 6.4.
Constant Stepsize
When the stepsize is constant in the subgradient method (i.e., α_k = α), we
cannot expect to prove convergence, in the absence of additional assumptions. As indicated in Fig. 3.2.3, we may only guarantee that asymptotically we will approach a neighborhood of the set of minima, whose size
will depend on α. The following proposition quantifies the size of this
neighborhood and provides an estimate on the difference between the cost
value
\[
f_\infty = \liminf_{k \to \infty} f(x_k)
\]
that the method achieves, and the optimal cost value f*.
\[
f_\infty \ge f(\hat{y}) + \frac{\alpha c^2}{2} + 2\epsilon,
\]
Thus we have
\[
\min_{0 \le k \le K} f(x_k) \le f^* + \frac{\alpha c^2 + \epsilon}{2},
\]
where
Proof: Assume the contrary, i.e., that for all k with 0 ≤ k ≤ K, we have
\[
f(x_k) > f^* + \frac{\alpha c^2 + \epsilon}{2}.
\]
From this relation, and Eq. (3.16) with y = x* ∈ X* and α_k = α, we obtain
for all x* ∈ X* and k with 0 ≤ k ≤ K,
\/xEX, (3.17)
and the result follows using the fact $\sum_{j=0}^{\infty}(1 - 2\alpha\gamma)^j \le \frac{1}{2\alpha\gamma}$. Q.E.D.
and compact. To see this, note that for polyhedral f and X, there exists
β > 0 such that
\[
f(x) - f^* \ge \beta\, d(x), \qquad \forall\, x \in X;
\]
Diminishing Stepsize
We next consider the case where the stepsize α_k diminishes to zero, but
satisfies Σ_{k=0}^∞ α_k = ∞. This condition is needed so that the method
can "travel" infinitely far if necessary to attain convergence; otherwise,
convergence to X* may be impossible from starting points x_0 that are far
from X*, as for example in the case where X = ℝⁿ and
Since α_k → 0, without loss of generality, we may assume that k_0 is large
enough so that
∀ k ≥ k_0.
Therefore for all k ≥ k_0 we have
† Note that this argument is essentially the same as the one we used to prove
the Fejér Convergence Theorem (Prop. A.4.6 in Appendix A). Indeed we could
have invoked that theorem for the last part of the proof.
Vk ~ 0, (3.19)
in place of the stronger Assumption 3.2.1 (thus covering for example the
case where f is positive definite quadratic and X = 1Rn, which is not covered
by Assumption 3.2.1). This is shown in Exercise 3.6, with essentially the
same proof, after replacing Eq. (3.18) with another inequality that relies
on the assumption (3.19).
Vk ~ 0, (3.20)
V x* E X*, k 2=: 0.
This implies that {x_k} is bounded. Furthermore, f(x_k) → f*, since otherwise we would have ‖x_{k+1} − x*‖ ≤ ‖x_k − x*‖ − ε for some suitably small
ε > 0 and infinitely many k. Hence for any limit point x̄ of {x_k}, we have
\[
\alpha_k = \frac{f(x_k) - f_k}{\|g_k\|^2}, \qquad \forall\, k \ge 0, \tag{3.21}
\]
where f_k is a target level given by
\[
f_k = \min_{0 \le j \le k} f(x_j) - \delta_k, \tag{3.22}
\]
and δ_k is updated according to
\[
\delta_{k+1} = \begin{cases} \theta\,\delta_k & \text{if } f(x_{k+1}) \le f_k,\\ \max\bigl\{\beta\,\delta_k,\; \delta\bigr\} & \text{if } f(x_{k+1}) > f_k, \end{cases} \tag{3.23}
\]
where δ, β, and θ are fixed positive constants with β < 1 and θ ≥ 1.
Thus in this scheme, we essentially "aspire" to reach a target level f_k
that is smaller by δ_k over the best value achieved thus far [cf. Eq. (3.22)].
Whenever the target level is achieved, we increase δ_k (if θ > 1) or we
keep it at the same value (if θ = 1). If the target level is not attained
at a given iteration, δ_k is reduced up to a threshold δ. If the subgradient
boundedness Assumption 3.2.1 holds, this threshold guarantees that the
stepsize α_k of Eq. (3.21) is bounded away from zero, since from Eq. (3.22),
we have f(x_k) − f_k ≥ δ and hence α_k ≥ δ/‖g_k‖² ≥ δ/c². As a result, the
method behaves somewhat similar to the one with a constant stepsize (cf.
Prop. 3.2.3), as indicated by the following proposition.
\[
\inf_{0 \le j} f(x_j) = f^*,
\]
\[
f^* + \delta < \inf_{0 \le j} f(x_j). \tag{3.24}
\]
Each time the target level is attained [i.e., f(x_k) ≤ f_{k−1}], the current best
function value min_{0≤j≤k} f(x_j) decreases by at least δ [cf. Eqs. (3.22) and
(3.23)], so in view of Eq. (3.24), the target level can be attained only a
finite number of times. From Eq. (3.23) it follows that after finitely many
iterations, δ_k is decreased to the threshold value δ, and remains at that
value for all subsequent iterations, i.e., there is an index k̄ such that
∀ k ≥ k̄.
∀ k ≥ k̄.
(3.25)
Applying Prop. 3.2.2(a) with y = ŷ, together with the preceding relation,
we have
\[
\begin{aligned}
\|x_{k+1} - \hat{y}\|^2 &\le \|x_k - \hat{y}\|^2 - 2\alpha_k\bigl(f(x_k) - f(\hat{y})\bigr) + \alpha_k^2\|g_k\|^2\\
&\le \|x_k - \hat{y}\|^2 - 2\alpha_k\bigl(f(x_k) - f_k\bigr) + \alpha_k^2\|g_k\|^2, \qquad \forall\, k \ge \bar{k}.
\end{aligned}
\]
By using the definition of α_k [cf. Eq. (3.21)] and Eq. (3.25), we obtain
where the last inequality follows from the right side of Eq. (3.25). Hence
{x_k} is bounded, which implies that {g_k} is also bounded (cf. Prop. 3.1.2).
Letting c be such that ‖g_k‖ ≤ c for all k and adding Eq. (3.26) over k, we
have
∀ k ≥ k̄,
which cannot hold for sufficiently large k, a contradiction. Q.E.D.
∀ k ≥ 0,
where γ is a fixed positive scalar and f_k is given by the same adjustment
procedure (3.22)-(3.23). This will guard against the potential practical
difficulty of α_k becoming too large due to very small values of ‖g_k‖. The
result of the preceding proposition still holds with this modification (see
Exercise 3.9).
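A minimal sketch (not from the text) of the method with the dynamic stepsize and target-level adjustment described above is given below; the problem data and the constants delta_bar, beta, theta are illustrative assumptions.

```python
# Sketch: subgradient method with dynamic stepsize and target-level adjustment.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
f = lambda x: np.sum(np.abs(A @ x - b))
subgrad = lambda x: A.T @ np.sign(A @ x - b)
proj = lambda x: np.clip(x, -5.0, 5.0)            # projection onto the constraint set

delta_bar, beta, theta = 0.1, 0.5, 1.5            # threshold, reduction, expansion factors
delta = 1.0
x = np.zeros(10)
best = f(x)
for k in range(2000):
    level = best - delta                          # target level f_k [cf. Eq. (3.22)]
    g = subgrad(x)
    alpha = (f(x) - level) / (g @ g)              # stepsize of Eq. (3.21)
    x = proj(x - alpha * g)
    if f(x) <= level:
        delta = theta * delta                     # target attained: increase delta
    else:
        delta = max(beta * delta, delta_bar)      # not attained: reduce toward threshold
    best = min(best, f(x))
print("best value found:", best)
```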
We finally note that the line of convergence analysis of this section can
be applied with small modifications to related methods that are based on
subgradients, most notably to the ε-subgradient methods of the next sec-
tion, and the incremental subgradient and incremental proximal methods
of Section 6.4.
The ε-subdifferential ∂_εf(x) is the set of all ε-subgradients of f at x, and
by convention, ∂_εf(x) = ∅ for x ∉ dom(f). It can be seen that
and that
\[
\bigcap_{\epsilon \downarrow 0} \partial_\epsilon f(x) = \partial f(x).
\]
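As a simple illustration (not worked out in the text), one can verify directly from the definition that for the scalar function f(x) = |x| the ε-subdifferential is given by the formula below, so that it shrinks to ∂f(x) as ε ↓ 0, consistently with the intersection formula above.

```latex
% Assumed example, derived directly from the definition of an epsilon-subgradient.
\[
\partial_\epsilon f(x) =
\begin{cases}
\bigl[\,\max\{-1,\; 1 - \epsilon/x\},\; 1\,\bigr] & \text{if } x > 0,\\[4pt]
[-1,\; 1] & \text{if } x = 0,\\[4pt]
\bigl[-1,\; \min\{1,\; -1 + \epsilon/|x|\}\,\bigr] & \text{if } x < 0.
\end{cases}
\]
```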
Figure 3.3.1. Illustration of an ε-subgradient of a convex function f. A vector g
is an ε-subgradient at x ∈ dom(f) if and only if there is a hyperplane with normal
(−g, 1), which passes through the point (x, f(x) − ε), and separates this point from
the epigraph of f.
(3.28)
and that g_x is some subgradient of the convex function φ(·, z_x) at x, i.e.,
g_x ∈ ∂φ(x, z_x). Then, for all y ∈ ℝⁿ, we have using the subgradient inequality,
\[
f(y) = \sup_{z \in Z} \phi(y, z) \ge \phi(y, z_x) \ge \phi(x, z_x) + g_x'(y - x) \ge f(x) - \epsilon + g_x'(y - x),
\]
Using this inequality, one can essentially replicate the convergence analysis
of Section 3.2, while carrying along the ε parameter.
As an example, consider the case where α_k and ε_k are constant: α_k =
α for some α > 0 and ε_k = ε for some ε > 0. Then, if the ε-subgradients
g_k are bounded, with ‖g_k‖ ≤ c for some constant c and all k, we obtain for
all optimal solutions x*,
where f* = inf_{x∈X} f(x) is the optimal value [cf. Eq. (3.16)]. This implies
that the distance to all optimal x* decreases if
\[
0 < \alpha < \frac{2\bigl(f(x_k) - f^* - \epsilon\bigr)}{c^2},
\]
(cf. Fig. 3.2.3). With analysis similar to the one for the subgradient case,
we can also show that if
\[
\sum_{k=0}^{\infty} \alpha_k = \infty,
\]
we have
(cf. Prop. 3.2.6). There is also a related convergence result for an analog of
the dynamic stepsize rule and other rules (see [NeB10]). If we have ε_k → 0
instead of ε_k = ε, the convergence properties of the ε-subgradient method
(3.28) are essentially the same as the ones of the ordinary subgradient
method, both for a constant and for a diminishing stepsize.
x_{k+1} = ψ_{m,k},
where starting with ψ_{0,k} = x_k, we obtain ψ_{m,k} after the m steps
Thus, g ∈ ∂f(x̄) implies that g ∈ ∂_εf(x), with ε small when x is near x̄.
We now observe from Eq. (3.30) that the ith step within a cycle of
the incremental subgradient method involves the direction g_{i,k}, which is
a subgradient of f_i at the corresponding vector ψ_{i−1,k}. If the stepsize α_k
is small, then ψ_{i−1,k} is close to the vector x_k available at the start of the
cycle, and hence g_{i,k} is an ε_i-subgradient of f_i at x_k, where ε_i is small. In
particular, assuming for simplicity that X = ℝⁿ, we have
\[
x_{k+1} = x_k - \alpha_k \sum_{i=1}^{m} g_{i,k}, \tag{3.31}
\]
where ε = ε_1 + · · · + ε_m.
From this analysis it follows that the incremental subgradient iteration (3.31) can be viewed as an ε-subgradient iteration at x_k, the starting
point of the cycle. The size of ε depends on the size of the stepsize α_k, as
well as the function f, and we have ε → 0 as α_k → 0. As a result, when
\[
\sum_{k=0}^{\infty} \alpha_k = \infty,
\]
Section 3.1: Subgradients are central in the work of Fenchel [Fen51]. The
original theorem by Danskin [Dan67] provides a formula for the directional
derivative of the maximum of a (not necessarily convex) directionally dif-
ferentiable function. When adapted to a convex function f, this formula
yields Eq. (3.10) for the subdifferential of f; see Exercise 3.5.
Another important subdifferential formula relates to the subgradients
of an expected value function
f(x) = E{F(x, ω)},
so Prop. 3.1.3(b) applies. However, the formulas (3.32) hold even in the
case where Ω is uncountably infinite, with appropriate mathematical interpretation of the integral of set-valued functions E{∂F(x, ω)} as the set of
integrals
\[
\int_{\omega \in \Omega} g(x, \omega)\, dP(\omega), \tag{3.33}
\]
EXERCISES
\[
x_{k+1} = x_k - \alpha_k \sum_{i=1}^{m} g_{i,k},
\]
where
\[
g_{i,k} = \begin{cases} c_i & \text{if } c_i' x_k > b_i,\\ 0 & \text{otherwise.} \end{cases}
\]
starting with ψ_{0,k} = x_k. Compare this method with the algorithm of (a)
computationally with two examples where n = 2 and m = 100. In the first
example, the vectors c_i have the form c_i = (ξ_i, ζ_i), where ξ_i, ζ_i, as well
as b_i, are chosen randomly and independently from [−100, 100] according
to a uniform distribution. In the second example, the vectors c_i have the
form c_i = (ξ_i, ζ_i), where ξ_i, ζ_i are chosen randomly and independently
within [−10, 10] according to a uniform distribution, while b_i is chosen
randomly and independently within [0, 1000] according to a uniform distribution. Experiment with different starting points and stepsize choices,
and deterministic and randomized orders of selection of the indexes i for
iteration. In the case of the second example, under what circumstances
does the method stop after a finite number of iterations?
The purpose of this exercise is to express the necessary and sufficient condition
for optimality of Prop. 3.1.4 in terms of the directional derivative of the cost
function. Consider the minimization of a convex function f : ℝⁿ ↦ ℝ over a
convex set X ⊂ ℝⁿ. For any x ∈ X, the set of feasible directions of X at x is
defined to be the convex cone
Note: In words, this condition says that x is optimal if and only if there is no
feasible descent direction of f at x. Solution: Let D̄(x) denote the closure of
D(x). By Prop. 3.1.4, x minimizes f over X if and only if there exists g ∈ ∂f(x)
such that
g'd ≥ 0, ∀ d ∈ D̄(x),
which is equivalent to
g'd ≥ 0,
Since the minimization and maximization above are over convex and compact
sets, by the Saddle Point Theorem of Prop. 5.5.3 in Appendix B, this is equivalent
to
\[
\min_{\|d\| \le 1,\; d \in \bar{D}(x)}\ \max_{g \in \partial f(x)} g'd \ge 0,
\]
or by Prop. 3.1.1(a),
\[
\min_{\|d\| \le 1,\; d \in \bar{D}(x)} f'(x; d) \ge 0.
\]
This is in turn equivalent to the desired condition (3.34), since f'(x; ·) is continuous, being convex and real-valued.
where N_X(x) is the normal cone of X at x ∈ X. Note: If h is convex but
extended real-valued, this formula requires the assumption ri(dom(h)) ∩
ri(X) ≠ ∅ or some polyhedral conditions on h and X; see Prop. 5.4.6 of
Appendix B. Proof: By the subgradient inequality (3.1), we have g ∈ ∂f(x)
if and only if x minimizes p(z) = h(z) − g'z over z ∈ X, or equivalently,
some subgradient of p at x [i.e., a vector in ∂h(x) − {g}, by Prop. 3.1.3]
belongs to −N_X(x) (cf. Prop. 3.1.4).
(b) Let f(x) = −√x if x ≥ 0 and f(x) = ∞ if x < 0. Verify that f is a closed
convex function that cannot be written in the form (3.35) and does not
have a subgradient at x = 0.
(c) Show the following formula for the subdifferential of the sum of functions
Ji that have the form (3.35) for some hi and Xi:
\[
f'(x; y) = \inf_{\alpha > 0} \frac{f(x + \alpha y) - f(x)}{\alpha},
\]
\[
f_k'(x_k; y_k) \le \frac{f_k(x_k + \alpha y_k) - f_k(x_k)}{\alpha} < f'(x; y) + \epsilon,
\]
Since this is true for all ε > 0, we obtain limsup_{k→∞} f_k'(x_k; y_k) ≤ f'(x; y).
If f is differentiable at all x ∈ ℝⁿ, then by Prop. 3.1.1(b), we have f'(x; y) =
∇f(x)'y for all x, y ∈ ℝⁿ, and by using the part of the proposition just proved,
it follows that for every sequence {x_k} converging to x and every y ∈ ℝⁿ,
\[
\limsup_{k \to \infty} \nabla f(x_k)' y = \limsup_{k \to \infty} f'(x_k; y) \le f'(x; y) = \nabla f(x)' y.
\]
\[
-\liminf_{k \to \infty} \nabla f(x_k)' y = \limsup_{k \to \infty}\bigl(-\nabla f(x_k)' y\bigr) \le -\nabla f(x)' y.
\]
Combining the preceding two relations, we have ∇f(x_k)'y → ∇f(x)'y for every
y, which implies that ∇f(x_k) → ∇f(x). Hence, ∇f(·) is continuous.
Let Z be a compact subset of ℝᵐ, and let φ : ℝⁿ × Z ↦ ℝ be continuous and
such that φ(·, z) : ℝⁿ ↦ ℝ is convex for each z ∈ Z.
(a) Show that the function f : ℝⁿ ↦ ℝ given by
\[
f(x) = \max_{z \in Z} \phi(x, z) \tag{3.36}
\]
(b) Show that if φ(·, z) is differentiable for all z ∈ Z and ∇_xφ(x, ·) is continuous
on Z for each x, then
Solution: (a) We note that since φ is continuous and Z is compact, the set Z(x)
is nonempty by Weierstrass' Theorem and f is real-valued. For any z ∈ Z(x),
y ∈ ℝⁿ, and α > 0, we use the definition of f to obtain
Taking the limit as α decreases to zero, we obtain f'(x; y) ≥ φ'(x, z; y). Since
this is true for every z ∈ Z(x), we conclude that
We will next prove the reverse inequality and that the supremum in the
right-hand side of the above inequality is attained. To this end, we fix x, we
consider a sequence {α_k} of positive scalars that converges to zero, and we let
x_k = x + α_k y. For each k, let z_k be a vector in Z(x_k). Since {z_k} belongs to the
∀ z ∈ Z,
so by taking the limit as k → ∞ and by using the continuity of φ, we obtain
We take the limit in inequality (3.39) as k → ∞, and we use inequality (3.40) to
conclude that
\[
f'(x; y) \le \phi'(x, \bar{z}; y).
\]
This relation together with inequality (3.38) proves Eq. (3.37).
For the last statement of part (a), if Z(x) consists of the unique point z̄,
the differentiability assumption on φ and Eq. (3.37) yield
\[
f'(x; y) = \max_{z \in Z(x)} \nabla_x \phi(x, z)' y,
\]
\[
f(y) = \max_{z \in Z} \phi(y, z) \ge \phi(y, \bar{z}) \ge \phi(x, \bar{z}) + \nabla_x \phi(x, \bar{z})'(y - x) = f(x) + \nabla_x \phi(x, \bar{z})'(y - x).
\]
Therefore, we have
This exercise shows an enhanced version of Prop. 3.2.6, whereby we assume that
for some scalar c, we have
∀ k, (3.41)
in place of the stronger Assumption 3.2.1. Assume also that X* is nonempty and
that
\[
\sum_{k=0}^{\infty} \alpha_k = \infty, \qquad \sum_{k=0}^{\infty} \alpha_k^2 < \infty. \tag{3.42}
\]
Show that {x_k} converges to some optimal solution. Abbreviated proof: Similar
to the proof of Prop. 3.2.6 [cf. Eq. (3.18)], we apply Prop. 3.2.2(a) with y equal
to any x* ∈ X*, and then use the assumption (3.41) to obtain
(3.43)
In view of the assumption (3.42), the convergence result of Prop. A.4.4 of Appendix A applies, and shows that {x_k} is bounded and that liminf_{k→∞} f(x_k) =
f*. From this point the proof follows the one of Prop. 3.2.6.
Consider the subgradient method x_{k+1} = P_X(x_k − α_k g_k) with the dynamic stepsize rule
(3.44)
and assume that the optimal solution set X* is nonempty. Show that:
(a) {x_k} and {g_k} are bounded sequences. Proof: Let x* be an optimal solution. From Prop. 3.2.2(a), we have
(3.45)
Therefore
implying that {x_k} is bounded, and by Prop. 3.1.2, that {g_k} is bounded.
(b) (Sublinear Convergence) We have
∀ k ≥ k̄,
implying that
On the other hand, by adding Eq. (3.45) over all k, and using the boundedness of {g_k} shown in part (a), we have
a contradiction.
(c) (Linear Convergence) Assume that there exists a scalar β > 0 such that
where ρ = √(1 − β²/γ²) and γ is any upper bound to ‖g_k‖ with γ > β [cf.
part (a)]. Proof: From Eqs. (3.45), (3.46), we have for all k
Using the fact sup_{k≥0} ‖g_k‖ ≤ γ, the desired relation follows.
is chosen to be large (such as constant, or such that the condition Σ_{k=0}^∞ α_k² < ∞
is violated) the method may not converge. This exercise shows that by averaging
the iterates of the method, we may obtain convergence with larger stepsizes. Let
the optimal solution set X* be nonempty, and assume that for some scalar c, we
have
c ≥ sup{‖g_k‖ | k = 0, 1, ...}, ∀ k ≥ 0,
(cf. Assumption 3.2.1). Assume further that α_k is chosen according to
\[
\alpha_k = \frac{\theta}{c\sqrt{k+1}}, \qquad k = 0, 1, \ldots,
\]
k = 0, 1, ...,
Note: The averaging approach seems to be less sensitive to the choice of stepsize parameters. Practical variants include restarting the method with the most
recent averaged iterate, and averaging over just a subset of recent iterates. A
Applying Prop. 3.2.2(a) with y equal to the projection of x_k onto X*, we obtain
∀ k ≥ 0, (3.47)
Hence {x_k} is bounded, which implies that {g_k} is also bounded (cf. Prop. 3.1.2).
Let c be such that ‖g_k‖ ≤ c for all k. Assume that α_k is chosen according to the
first rule in Eq. (3.47). Then from the preceding relation we have for all k ≥ k̄,
As in the proof of Prop. 3.2.8, this leads to a contradiction and the result follows.
The proof is similar if α_k is chosen according to the second rule in Eq. (3.47).
k = 0, 1, ...,
where ε is some positive scalar with ε < β. Assume further that for some c > 0,
we have
∀ k ≥ 0,
cf. the subgradient boundedness Assumption 3.2.1.
(a) Show that if α_k is equal to some constant α for all k, then
\[
\liminf_{k \to \infty} f(x_k) \le f^* + \frac{\alpha\beta(c + \epsilon)^2}{2(\beta - \epsilon)}, \tag{3.49}
\]
while if
\[
\sum_{k=0}^{\infty} \alpha_k = \infty,
\]
and hence
(b) Use the scalar function f(x) = |x| to show that the estimate (3.49) is tight.
The purpose of this exercise (based on unpublished joint work with P. Tseng)
is to show how to calculate ε-subgradients of the dual function of the separable
problem
\[
\text{minimize } \sum_{i=1}^{n} f_i(x_i)
\]
\[
\text{subject to } \sum_{i=1}^{n} g_{ij}(x_i) \le 0, \quad j = 1, \ldots, r, \qquad \alpha_i \le x_i \le \beta_i, \quad i = 1, \ldots, n,
\]
where f_i : ℝ ↦ ℝ, g_{ij} : ℝ ↦ ℝ are convex functions. For an ε > 0, we say that a
pair (x, µ) satisfies ε-complementary slackness if µ ≥ 0, x_i ∈ [α_i, β_i] for all i, and
where Γ = {i | x_i < β_i}, Γ⁺ = {i | α_i < x_i}, and f_i^-, g_{ij}^- and f_i^+, g_{ij}^+ denote
the left and right derivatives of f_i, g_{ij}, respectively. Show that if (x, µ) satisfies ε-complementary slackness, the r-dimensional vector with jth component
Σ_{i=1}^n g_{ij}(x_i) is an ε̄-subgradient of the dual function q at µ, where
\[
\bar{\epsilon} = \epsilon \sum_{i=1}^{n}(\beta_i - \alpha_i).
\]
Polyhedral Approximation
Methods
Figure 4.1.1. Illustration of the cutting plane method. With each new iterate x_k,
a new hyperplane f(x_k) + (x − x_k)'g_k is added to the polyhedral approximation
of the cost function, where g_k is a subgradient of f at x_k.
minimize F_k(x)
subject to x ∈ X,
see Fig. 4.1.1. We assume that the minimum of Fk(x) above is attained
for all k. For those k for which this is not guaranteed (as may happen in
the early iterations if X is unbounded), artificial bounds may be placed on
the components of x, so that the minimization will be carried out over a
compact set and consequently the minimum will be attained by Weierstrass'
Theorem.
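A minimal sketch (not from the text) of the method for a polyhedral f over a box X follows, with the master problem solved as a linear program in the variables (x, z); the data, bounds, and stopping rule are illustrative assumptions.

```python
# Sketch: cutting plane method for f(x) = max_i {a_i'x + b_i} over a box X.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m = 3, 40
a = rng.standard_normal((m, n))
b = rng.standard_normal(m)
f = lambda x: np.max(a @ x + b)
subgrad = lambda x: a[int(np.argmax(a @ x + b))]   # a maximizing a_i is a subgradient

lo, hi = -5.0, 5.0                  # X = {x : lo <= x_i <= hi} (artificial bounds)
cuts = []                           # pairs (g_k, f(x_k) - g_k'x_k)
x = np.zeros(n)
for k in range(30):
    g = subgrad(x)
    cuts.append((g, f(x) - g @ x))
    # Master problem: minimize z subject to g_j'x + c_j <= z, x in X.
    A_ub = np.array([np.append(gj, -1.0) for gj, _ in cuts])
    b_ub = np.array([-cj for _, cj in cuts])
    c = np.append(np.zeros(n), 1.0)
    bounds = [(lo, hi)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    x_new, lower = res.x[:n], res.x[n]
    if f(x) - lower < 1e-6:         # upper bound minus lower bound
        break
    x = x_new
print("approximate minimum value:", f(x))
```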
∀ x ∈ X,
so from the definitions (4.1) and (4.2) of F_k and x_k, it follows that
By taking the upper limit above as j → ∞, k → ∞, j < k, j ∈ K, k ∈ K,
we obtain
Since the subsequence {x_k}_K is bounded and the union of the subdifferentials of a real-valued convex function over a bounded set is bounded (cf.
Prop. 3.1.2), it follows that the subgradient subsequence {g_j}_K is bounded.
Moreover, we have
\[
\lim_{\substack{j \to \infty,\; k \to \infty,\; j < k\\ j \in K,\; k \in K}} (x_k - x_j) = 0,
\]
so that
\[
\lim_{\substack{j \to \infty,\; k \to \infty,\; j < k\\ j \in K,\; k \in K}} (x_k - x_j)' g_j = 0. \tag{4.5}
\]
k = 0,1, ... ,
(4.7)
where I is a finite index set, and a_i and b_i are given vectors and scalars,
respectively. Then, any vector a_{i_k} that maximizes a_i'x_k + b_i over {a_i | i ∈ I}
is a subgradient of f at x_k (cf. Example 3.1.1). We assume that the cutting
plane method selects such a vector at iteration k, call it a_{i_k}. We also assume
that the method terminates when
Then, since F_{k−1}(x) ≤ f(x) for all x ∈ X and x_k minimizes F_{k−1} over X,
we see that, upon termination, x_k minimizes f over X and is therefore optimal. The following proposition shows that the method converges finitely;
see also Fig. 4.1.2.
Proof: If (a_{i_k}, b_{i_k}) is equal to some pair (a_{i_j}, b_{i_j}) generated at some earlier
iteration j < k, then
Figure 4.1.2. Illustration of the finite convergence property of the cutting plane
method in the case where f is polyhedral. What happens here is that if x_k is not
optimal, a new cutting plane will be added at the corresponding iteration, and
there can be only a finite number of cutting planes.
where the first inequality follows since a_{i_j}'x_k + b_{i_j} corresponds to one of the
hyperplanes defining F_{k−1}, and the last inequality follows from the fact
F_{k−1}(x) ≤ f(x) for all x ∈ X. Hence equality holds throughout in the
preceding relation, and it follows that the method terminates if the pair
(a_{i_k}, b_{i_k}) has been generated at some earlier iteration. Since the number of
pairs (a_i, b_i), i ∈ I, is finite, the method must terminate finitely. Q.E.D.
(c) The convergence is often slow. Indeed, for challenging problems, even
when f is polyhedral, one should base termination on the upper and
lower bounds
f(x) + c(x),
where f : X ↦ ℝ and c : X ↦ ℝ are convex functions, but one of them,
say c, is convenient for optimization, e.g., is quadratic. It may then be
preferable to use a piecewise linear approximation of f only, while leaving
c unchanged. This leads to a partial cutting plane algorithm, involving
solution of the problems
where as before
with g_j ∈ ∂f(x_j) for all j, and x_{k+1} minimizes F_k(x) over x ∈ X,
Consider the case where the constraint set X is polyhedral of the form
As earlier,
minimize f(x)
subject to x ∈ conv(X_{k+1}). (4.10)
Note that this is a problem of the form (4.8). The process is illustrated in
Fig. 4.2.1.
Proof: There are two possibilities for the extreme point x̃_k that minimizes
∇f(x_k)'(x − x_k) over x ∈ X [cf. problem (4.9)]:
(a) We have
over the convex hull of a finite number of points [cf. problem (4.8)], while
the latter requires a search over the line segment [x_k, x̃_k].
We will now discuss some variations and extensions of the simplicial decomposition method. The essence of the convergence proof of Prop. 4.2.1
is that the extreme point x̃_k does not belong to X_k, unless the optimal solution has been reached. Thus it is not necessary that x̃_k solves exactly the
linearized problem (4.9). Instead it is sufficient that x̃_k is an extreme point
and that the inner product ∇f(x_k)'(x̃_k − x_k) is negative [cf. Eq. (4.11)].
This idea may be used in variants of the simplicial decomposition method
whereby ∇f(x_k)'(x − x_k) is minimized inexactly over x ∈ X. Moreover, one
may add multiple extreme points x̃_k, as long as they satisfy the condition
∇f(x_k)'(x̃_k − x_k) < 0.
There are a few other variants of the method. For example to address
the case where X is an unbounded polyhedral set, one may augment X with
additional constraints to make it bounded (an alternative for the case where
X is a cone is discussed in Section 4.6). There are extensions that allow
for a nonpolyhedral constraint set, which is approximated by the convex
hull of some of its extreme points in the course of the algorithm; see the
discussion in Sections 4.4-4.6. Finally, one may use variants, known as
restricted simplicial decomposition methods, which allow discarding some
of the extreme points generated so far. In particular, given the minimum
x_{k+1} of f over X_{k+1} [cf. problem (4.10)], we may discard from X_{k+1} all
points x such that
∇f(x_{k+1})'(x − x_{k+1}) > 0,
while possibly augmenting the constraint set with the additional constraint
(4.12)
The idea is that the costs of the subsequent points x_{k+2}, x_{k+3}, ..., generated
by the method will all be no greater than the cost of x_{k+1}, so they will
satisfy the constraint (4.12).
In fact a stronger result can be shown: any number of extreme points
may be discarded, as long as conv(X_{k+1}) contains x_k and x̃_k. The proof is
based on the theory of feasible direction methods (cf. Section 2.1.2), and
the fact that x̃_k − x_k is a descent direction for f, since if x_k is not optimal,
we have
so a point with improved cost can be found along the line segment connecting x_k and x̃_k. Indeed, the method that discards all previous points
x_0, x̃_0, ..., x̃_{k−1}, replacing them with just x̃_k, is essentially the same as the
conditional gradient method that was discussed in Section 2.1.2.
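A minimal sketch (not from the text) of the basic method, for a quadratic cost over the unit simplex (whose extreme points are the unit vectors), is given below; the use of scipy for the low-dimensional master problem, the data, and the tolerances are illustrative assumptions.

```python
# Sketch: simplicial decomposition for min f(x) = 1/2 ||Ax - b||^2 over the
# unit simplex X = {x >= 0, sum(x) = 1}.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((30, n))
b = rng.standard_normal(30)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

x = np.full(n, 1.0 / n)            # initial feasible point
hull = [x.copy()]                  # retained points whose convex hull approximates X

for k in range(20):
    g = grad(x)
    j = int(np.argmin(g))          # linearized problem over X picks a unit vector
    x_new = np.zeros(n); x_new[j] = 1.0
    if g @ (x_new - x) >= -1e-10:  # no descent direction: x is (near) optimal
        break
    hull.append(x_new)
    # Master problem: minimize f over conv(hull), parameterized by simplex weights w.
    P = np.array(hull).T
    m = P.shape[1]
    obj = lambda w: f(P @ w)
    cons = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
    res = minimize(obj, np.full(m, 1.0 / m), bounds=[(0.0, 1.0)] * m, constraints=cons)
    x = P @ res.x

print("final cost:", f(x))
```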
\[
\text{subject to } \sum_{p \in P_w} x_p = r_w, \quad \forall\, w \in W, \qquad x_p \ge 0, \quad \forall\, p \in P_w,\; w \in W,
\]
where F_{ij} is the total flow that passes through arc (i, j):
\[
F_{ij} = \sum_{\substack{\text{all paths } p\\ \text{containing } (i,j)}} x_p. \tag{4.13}
\]
at the kth iterate x_k of the simplicial decomposition method. Here x_{p,k} denotes the path flow component of x_k that goes through path p, and "(i, j) ∈
p" means that (i, j) is part of path p [we use Eq. (4.13) in the preceding
expression]. The key fact is that minimizing this linear approximation over
the constraint set is a shortest path problem, which can be solved with
very fast algorithms: the length of arc (i, j) is ∇D_{ij}(Σ_{{p | (i,j)∈p}} x_{p,k}),
the length of path p is the sum of the lengths of the arcs on the path, and
the computation of the path of minimum length over paths p ∈ P_w can
be done separately for each w ∈ W. Once the shortest path for each w is
determined, the input flow r_w is placed on that shortest path, and the new
extreme point x̃_k is the flow vector formed by these shortest path flows.
We also note that the minimization of D over the convex hull of the
extreme points forming Xk+ 1 [cf. Eq. (4.10)] is a low-dimensional problem
that can be conveniently solved by two-metric Newton-like methods (in
practice, few extreme points are typically required). In conclusion, the mul-
ticommodity flow problem combines all the important structural elements
that are necessary for the effective application of simplicial decomposition.
We refer to the end-of-chapter references for further discussion, including
the application of alternative algorithms.
and it is illustrated in the left side of Fig. 4.3.1. The choices of x_j such
that y_j ∈ ∂f(x_j) may not be unique, but result in the same function F(x):
the epigraph of F is determined by the supporting hyperplanes to the
epigraph of f with normals defined by y_j, and the points of support x_j are
immaterial. In particular, the definition (4.14) can be equivalently written
in terms of the conjugate f* of f as
\[
F(x) = \max_{j=1,\ldots,\ell} \bigl\{x' y_j - f^*(y_j)\bigr\}, \tag{4.15}
\]
\[
\sup_{x \in \mathbb{R}^n}\Bigl\{y'x - \max_{j=1,\ldots,\ell}\bigl\{y_j'x - f^*(y_j)\bigr\}\Bigr\}
= \sup_{\substack{x \in \mathbb{R}^n,\ \xi \in \mathbb{R}\\ y_j'x - f^*(y_j) \le \xi,\ j=1,\ldots,\ell}} \bigl\{y'x - \xi\bigr\},
\]
\[
F^*(y) = \begin{cases} \displaystyle \inf_{\substack{\alpha_j \ge 0,\ \sum_{j=1}^{\ell}\alpha_j = 1\\ \sum_{j=1}^{\ell}\alpha_j y_j = y}} \ \sum_{j=1}^{\ell}\alpha_j f^*(y_j) & \text{if } y \in \operatorname{conv}\bigl(\{y_1,\ldots,y_\ell\}\bigr),\\[10pt] \infty & \text{otherwise,} \end{cases} \tag{4.16}
\]
where α_j is the dual variable of the constraint y_j'x − f*(y_j) ≤ ξ.
From this formula, it can be seen that F* is a piecewise linear approximation of f* with domain
and "break points" at y_1, ..., y_ℓ with values equal to the corresponding
values of f*. In particular, as indicated in Fig. 4.3.1, the epigraph of F*
is the convex hull of the union of the vertical halflines corresponding to
y_1, ..., y_ℓ:
subject to x ∈ S,
where
\[
x \stackrel{\text{def}}{=} (x_1, \ldots, x_m),
\]
\[
\sum_{i=1}^{m} f_i(x_i),
\]
they are coupled through the subspace constraint. This allows a variety of
transformations to the EMP format. For example, consider a cost function
of the form
where F is a closed proper convex function of all the components x_i. Then,
by introducing an auxiliary vector z ∈ ℝ^{n_1+···+n_m}, the problem of minimizing f over a subspace X can be transformed to the problem
S = {(x, x) | x ∈ X}.
This problem is of the form (4.18).
Another problem that can be converted to the EMP format (4.18) is
minimize Σ_{i=1}^m f_i(x)
subject to x ∈ X,
where f_i : ℝⁿ ↦ (−∞, ∞] are closed proper convex functions, and X is
a subspace of ℝⁿ. This can be done by introducing m copies of x, i.e.,
auxiliary vectors z_i ∈ ℝⁿ that are constrained to be equal, and write the
problem as
minimize Σ_{i=1}^m f_i(z_i)
\[
= \inf_{(x_1,\ldots,x_m) \in S} \sum_{i=1}^{m} \lambda_i' x_i + \sum_{i=1}^{m} \inf_{z_i \in \mathbb{R}^{n_i}} \bigl\{f_i(z_i) - \lambda_i' z_i\bigr\},
\]
where
i = 1, ..., m,
where f_i^* is the conjugate of f_i. Thus the dual problem has the same
form as the primal. Moreover, assuming that the functions f_i are closed,
when the dual problem is dualized, it yields the primal, so the duality has
a symmetric character, like Fenchel duality. We will discuss further the
duality theory for EMP in Sections 6.7.3 and 6.7.4, using algorithmic ideas
(the ε-descent method of Section 6.7.2).
i = 1, ..., m,
and
\[
x_i^{\mathrm{opt}} \in \arg\min_{x_i \in \mathbb{R}^{n_i}} \bigl\{f_i(x_i) - x_i'\lambda_i^{\mathrm{opt}}\bigr\}, \qquad i = 1, \ldots, m. \tag{4.24}
\]
i = 1, ... ,m.
These conditions are significant, because they show that once (x_1^{opt}, ..., x_m^{opt})
or (λ_1^{opt}, ..., λ_m^{opt}) is obtained, its dual counterpart can be computed by
"differentiation" of the functions f_i or f_i^*, respectively. We will often use
the equivalences of the preceding formulas in the remainder of this chapter,
so we depict them in Fig. 4.4.1.
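A small numerical illustration (not from the text) of these equivalences, for the quadratic f(x) = ½x'Qx with conjugate f*(λ) = ½λ'Q⁻¹λ, is given below; the data are illustrative assumptions.

```python
# Sketch: x minimizes f(.) - (.)'lam exactly when lam = grad f(x),
# and at such a pair the conjugacy equality f(x) + f*(lam) = x'lam holds.
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
Q = M @ M.T + np.eye(4)                 # positive definite
lam = rng.standard_normal(4)

x = np.linalg.solve(Q, lam)             # minimizer of f(x) - x'lam, i.e., Qx = lam
print(np.allclose(Q @ x, lam))          # lam is the "dual counterpart" of x
print(np.isclose(0.5 * x @ Q @ x + 0.5 * lam @ np.linalg.solve(Q, lam), x @ lam))
```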
For i ∈ Ī, given a finite set X_i ⊂ dom(f_i) such that ∂f_i(x) ≠ ∅ for
all x ∈ X_i, we consider the inner linearization of f_i corresponding to X_i by
if x_i ∈ conv(X_i),
otherwise,
where C(x_i, X_i) is the set of all vectors with components α_x, x ∈ X_i,
satisfying
[cf. Eq. (4.16)]. As noted in Section 4.3, this is the function whose epigraph
is the convex hull of the halflines {(x_i, w) | f_i(x_i) ≤ w}, x_i ∈ X_i (cf. Fig.
4.3.1).
We assume that at least one of the sets I and Ī is nonempty. At
the start of the typical iteration, we have for each i ∈ I, a finite subset
Λ_i ⊂ dom(f_i^*), and for each i ∈ Ī, a finite subset X_i ⊂ dom(f_i). The
iteration is as follows:
where $\underline{f}_{i,\Lambda_i}$ and $\bar{f}_{i,X_i}$ are the outer and inner linearizations of f_i corresponding to Λ_i and X_i, respectively.
Step 2: (Test for Termination and Enlargement) Enlarge the
sets Λ_i and X_i as follows (see Fig. 4.4.2):
(a) For i ∈ I, we add any subgradient λ̄_i ∈ ∂f_i(x̄_i) to Λ_i.
(b) For i ∈ Ī, we add any subgradient x̄_i ∈ ∂f_i^*(λ̄_i) to X_i.
If there is no strict enlargement, i.e., for all i ∈ I we have λ̄_i ∈ Λ_i, and
for all i ∈ Ī we have x̄_i ∈ X_i, the algorithm terminates. Otherwise,
we proceed to the next iteration, using the enlarged sets Λ_i and X_i.
Figure 4.4.2. Illustration of the enlargement step in the GPA algorithm, after
we obtain a primal-dual optimal solution pair
(cf. Fig. 4.4.1 and the Conjugate Subgradient Theorem, Prop. 5.4.3 in Appendix B).
The enlargement step on the left (finding λ̄_i) is also equivalent to λ̄_i satisfying
x̄_i ∈ ∂f_i^*(λ̄_i), or equivalently, solving the optimization problem
The enlargement step on the right (finding x̄_i) is also equivalent to solving the
optimization problem
(a) The refinement process may be faster, because at each iteration, mul-
tiple cutting planes and break points are added (as many as one per
function Ji). As a result, in a single iteration, a more refined approx-
imation may be obtained, compared with the methods of Sections 4.1
and 4.2, where a single cutting plane or extreme point is added. More-
over, when the component functions Ji are scalar, adding a cutting
plane/break point to the polyhedral approximation of Ji can be very
simple, as it requires a one-dimensional differentiation or minimiza-
tion for each Ji. Of course if the number m of component functions
is large, maintaining these multiple cutting planes and break points
may add significant overhead to the method, in which case a scheme
for discarding some old cutting planes and break points may be used,
similar to the case of the restricted simplicial decomposition scheme.
(b) The approximation process may preserve some of the special struc-
ture of the cost function and/or the constraint set. For example if
the component functions f_i are scalar, or have partially overlapping
dependences, such as for example,
\[
f(x_1, \ldots, x_m) = f_1(x_1, x_2) + f_2(x_2, x_3) + \cdots + f_{m-1}(x_{m-1}, x_m) + f_m(x_m),
\]
the minimization of f by the cutting plane method of Section 4.1 leads
to general/unstructured linear programming problems. By contrast,
using separate outer approximation of the component functions leads
to linear programs with special structure, which can be solved effi-
ciently by specialized methods, such as network flow algorithms, or
interior point algorithms that can exploit the sparsity structure of the
problem.
Generally, in specially structured problems, the preceding two advantages
can be of decisive importance.
Note two prerequisites for the GPA algorithm to be effective:
(1) The (partially) linearized problem (4.25) must be easier to solve than
the original problem (4.18). For example, problem (4.25) may be a
linear program, while the original may be nonlinear (cf. the cutting
plane method of Section 4.1); or it may effectively have much smaller
dimension than the original (cf. the simplicial decomposition method
of Section 4.2).
(2) Finding the enlargement vectors (λ̄_i for i ∈ I, and x̄_i for i ∈ Ī)
must not be too difficult. This can be done by the differentiation
λ̄_i ∈ ∂f_i(x̄_i) for i ∈ I, and x̄_i ∈ ∂f_i^*(λ̄_i) for i ∈ Ī. Alternatively, if this
is not convenient for some of the functions (e.g., because some of the
f_i or the f_i^* are not available in closed form), one may calculate λ̄_i
and/or x̄_i via the relations
x̄_i ∈ ∂f_i^*(λ̄_i), λ̄_i ∈ ∂f_i(x̄_i);
where ft is the conjugate of f;. Then the inner (or outer) linearized index
set I of the primal becomes the outer (or inner, respectively) linearized in-
dex set of the dual. At each iteration, the algorithm solves the approximate
dual EMP,
which is simply the dual of the approximate primal EMP (4.25) [since the
outer (or inner) linearization of ft is the conjugate of the inner (or respec-
tively, outer) linearization of Ji]. Thus the algorithm produces mathemat-
ically identical results when applied to the primal or the dual EMP. The
choice of whether to apply the algorithm in its primal or its dual form is
simply a matter of whether calculations with Ji or with their conjugates
ft are more or less convenient. In fact, when the algorithm makes use of
both the primal solution x and the dual solution 5.. in the enlargement step,
the question of whether the starting point is the primal or the dual EMP
becomes moot: it is best to view the algorithm as applied to the pair of
primal and dual EMP, without designating which is primal and which is
dual.
Now let us discuss the validity of the GPA algorithm. To this end, we will
use two basic properties of outer approximations. The first is that for any
closed proper convex functions f and $\underline{f}$, and vector x ∈ dom(f), we have
To see this, use the subgradient inequality to write for any g ∈ ∂$\underline{f}$(x),
which implies that g ∈ ∂f(x). The second property is that for any outer
linearization $\underline{f}_{\Lambda}$ of f, we have
To see this, consider vectors x_λ such that λ ∈ ∂f(x_λ), λ ∈ Λ, and write
\[
\underline{f}_{\Lambda}(\bar{x}) = \max_{\lambda \in \Lambda}\bigl\{f(x_\lambda) + \lambda'(\bar{x} - x_\lambda)\bigr\} \ge f(x_{\bar{\lambda}}) + \bar{\lambda}'(\bar{x} - x_{\bar{\lambda}}) \ge f(\bar{x}),
\]
where the second inequality follows from λ̄ ∈ ∂f(x̄). Since we also have
$\underline{f}_{\Lambda}$ ≤ f, we obtain $\underline{f}_{\Lambda}$(x̄) = f(x̄). We first show the optimality of the
primal and dual solution pair obtained upon termination of the algorithm.
Proof: From Prop. 4.4.1 and the definition of (x̄_1, ..., x̄_m) and (λ̄_1, ..., λ̄_m)
as a primal and dual optimal solution pair of the approximate problem
(4.25), we have
(4.30)
4.4.1). We will complete the proof by showing that it holds for all i ∈ I
(the proof for i ∈ Ī follows by a dual argument).
Indeed, let us fix i ∈ I and let λ̄_i ∈ ∂f_i(x̄_i) be the vector generated by
the enlargement step upon termination. We must have λ̄_i ∈ Λ_i, since there
is no strict enlargement when termination occurs. Since $\underline{f}_{i,\Lambda_i}$ is an outer
linearization of f_i, by Eq. (4.29), the fact λ̄_i ∈ Λ_i, λ̄_i ∈ ∂f_i(x̄_i) implies that
\[
\underline{f}_{i,\Lambda_i}(\bar{x}_i) = f_i(\bar{x}_i).
\]
By Prop. 4.4.1, we also have λ̄_i ∈ ∂$\underline{f}_{i,\Lambda_i}$(x̄_i), so λ̄_i ∈ ∂f_i(x̄_i). Q.E.D.
be the vectors used for the corresponding enlargements. If the set Ī is empty
(no inner approximation) and the sequence {λ_i^k} is bounded for every i ∈ I,
then we can easily show that every limit point of {x̄^k} is primal optimal.
To see this, note that for all k, ℓ ≤ k − 1, and (x_1, ..., x_m) ∈ S, we have
\[
\le \sum_{i=1}^{m} f_i(x_i).
\]
for all (x 1 , ... , Xm) E S. It follows that xis primal optimal, i.e., every limit
point of {i;k} is optimal. The preceding convergence argument also goes
through even if the sequences { >.}} are not assumed bounded, as long as the
limit points Xi belong to the relative interior of the corresponding functions
fi (this follows from the subgradient decomposition result of Prop. 5.4.1 in
Appendix B).
Exchanging the roles of primal and dual, we similarly obtain a conver-
gence result for the case where I is empty (no outer linearization): assuming
that the sequence { x7} is bounded for every i E l, every limit point of { ).k}
is dual optimal.
We finally state a more general convergence result from [BeYll],
which applies to the mixed case where we simultaneously use outer and
inner approximation (both l and I are nonempty). The proof is more
complicated than the preceding ones, and we refer to [BeYl 1] for the cor-
responding analysis.
In this section we will aim to highlight some of the applications and the
fine points of the general algorithm of the preceding section. As vehicle we
will use the simplicial decomposition approach, and the problem
minimize f(x) + c(x)
subject to x ∈ ℝⁿ, (4.31)
where f : ℝⁿ ↦ (−∞, ∞] and c : ℝⁿ ↦ (−∞, ∞] are closed proper convex
functions. This is the Fenchel duality context, and it contains as a special
case the problem to which the ordinary simplicial decomposition method of
Section 4.2 applies (where f is differentiable, and c is the indicator function
of a bounded polyhedral set). Here we will mainly focus on the case where
f is nondifferentiable and possibly extended real-valued.
We apply the polyhedral approximation scheme of the preceding sec-
tion to the equivalent EMP
minimize f_1(x_1) + f_2(x_2)
subject to (x1,x2) ES,
where
(1) We obtain
\[
x_k \in \arg\min_{x \in \mathbb{R}^n}\bigl\{f(x) + C_k(x)\bigr\}, \tag{4.32}
\]
(4.33)
(4.34)
and form
As in the case of the GPA algorithm, we assume that f and c are such
that the steps (1)-(3) above can be carried out. In particular, the existence
of the subgradient λ_k in step (2) is guaranteed by the optimality conditions
of Prop. 5.4.7 in Appendix B, applied to the minimization in Eq. (4.32),
under the appropriate relative interior conditions.
Note that step (3) is equivalent to finding
if x ∈ conv(X_k),
if x ∉ conv(X_k),
The dimension of this problem is the cardinality of J_k, which can be quite
small relative to the dimension of the original problem.
where λ_j and x_j are vectors that can be obtained either by using the
generalized simplicial decomposition method (4.32)-(4.34), or by using its
dual, the cutting plane method based on solving the outer approximation
problems (4.37). The ordinary cutting plane method, described in the
beginning of Section 4.1, is obtained as the special case where f*(λ) = 0
[or equivalently, f(x) = ∞ if x ≠ 0, and f(0) = 0].
Let us first consider the favorable case of the generalized simplicial decom-
position algorithm (4.32)-(4.35), where f is differentiable and c is poly-
hedral with bounded effective domain. Then the method is essentially
equivalent to the simple version of the simplicial decomposition method of
Section 4.2. In particular:
(a) When c is the indicator function of a bounded polyhedral set X, and
Xo = {xo}, the method reduces to the earlier simplicial decomposi-
tion method (4.9)-(4.10). Indeed, step (1) corresponds to the min-
imization (4.10), step (2) simply yields λ_k = ∇f(x_k), and step (3),
as implemented by Eq. (4.35), corresponds to solution of the linear
program (4.9) that generates a new extreme point.
(b) When c is a general polyhedral function, the method can be viewed
as essentially the special case of the earlier simplicial decomposition
method (4.9)-(4.10) applied to the problem of minimizing f(x) + w
subject to x EX and (x, w) E epi(c) [the only difference is that epi(c)
is not bounded, but this is inconsequential if we assume that dom(c)
is bounded, or more generally that the problem (4.32) has a solution].
In this case, the method terminates finitely, assuming that the vectors
(x̃_k, c(x̃_k)) obtained by solving the linear program (4.35) are extreme
points of epi(c) (cf. Prop. 4.2.1).
For the more general case where f is differentiable and c is a (non-
polyhedral) convex function, the method is illustrated in Fig. 4.5.2. The
existence of a solution Xk to problem (4.32) [or equivalently (4.36)] is guar-
anteed by the compactness of conv(Xk) and Weierstrass' Theorem, while
step (2) yields λ_k = ∇f(x_k). The existence of a solution to problem (4.35)
must be guaranteed by some assumption such as for example compactness
of the effective domain of c.
Let us now consider the problem of minimizing f + c [cf. Eq. (4.31)] for
the more complicated case where f is extended real-valued and nondiffer-
minimize z
subject to f_j(x) ≤ z, j = 1, ..., r, x ∈ conv(X_k), (4.40)
where conv(X_k) is a polyhedral inner approximation to X. According to the
optimality conditions of Prop. 1.1.3, the optimal solution (x_k, z*) together
with dual optimal variables µ_j^*, satisfy
(4.41)
\[
\sum_{j=1}^{r} \mu_j^* = 1, \tag{4.43}
\]
[which is the optimality condition for the optimization in Eq. (4.41)]. Using
Eqs. (4.42) and (4.43), it can be shown that the vector
\[
\lambda_k = \sum_{j=1}^{r} \mu_j^* \nabla f_j(x_k) \tag{4.45}
\]
Consider a more general version of the preceding example, where there are
additional inequality constraints defining the domain of f. This is the problem
of minimizing f + c where c is the indicator of a closed convex set X, and f
is of the form
\[
f(x) = \begin{cases} \max\bigl\{f_1(x), \ldots, f_r(x)\bigr\} & \text{if } g_i(x) \le 0,\ i = 1, \ldots, p,\\ \infty & \text{otherwise,} \end{cases} \tag{4.47}
\]
with f_j and g_i being convex differentiable functions. Applications of this type
include multicommodity flow problems with "side constraints" (the inequalities g_i(x) ≤ 0, which are separate from the network flow constraints that
comprise the set X; cf. the discussion of Section 4.2).
Similarly, to calculate λ_k, we introduce dual variables ν_i^* ≥ 0 for the
constraints g_i(x) ≤ 0, and we write the Lagrangian optimality and complementary slackness conditions. Then Eq. (4.44) takes the form
∀ x ∈ conv(X_k).
preceding two sections is not well-suited for the case where some of the
component functions of the cost are indicator functions of unbounded sets
such as cones. There are two main reasons for this:
(1) The enlargement procedure of the GPA algorithm may not be imple-
mentable by optimization, as in Fig. 4.4.2, because this optimization
may not have a solution. This may be true in particular if the function
involved is the indicator function of an unbounded set.
(2) The inner linearization procedure of the GPA algorithm approximates
an unbounded set by the convex hull of a finite number of points,
which is a compact set. It would appear that an unbounded polyhe-
dral set may provide a more effective approximation.
Motivated by these concerns, we extend the generalized polyhedral
approximation approach of Section 4.4 so that it applies to the problem of
minimizing the sum Σ_{i=1}^m f_i(x_i) of convex extended real-valued functions
f_i, subject to (x_1, ..., x_m) being in the intersection of a given subspace and
the Cartesian product of closed convex cones. To this end we first discuss
an alternative method for linearization of a cone, which allows enlargements
using directions of recession rather than points.
In particular, given a closed convex cone C and a finite subset X ⊂ C,
we view cone(X), the cone generated by X (see Section 1.2 in Appendix
B), as an inner linearization of C. Its polar, denoted cone(X)*, is an outer
linearization of the polar C* (see Fig. 4.6.1). This type of linearization has
a twofold advantage: a cone is approximated by a cone (rather than by a
compact set), and outer and inner linearizations yield convex functions of
the same type as the original (indicator functions of cones).
As a first step in our analysis, we introduce some duality concepts
relating to cones. We say that (x, λ) is a dual pair with respect to the
where P_C(y) and P_{C*}(y) denote the projection of a vector y onto C and C*,
respectively. We also say that (x, λ) is a dual pair representation of a vector
y if y = x + λ and (x, λ) is a dual pair with respect to C and C*. The
following proposition shows that (P_C(y), P_{C*}(y)) is the unique dual pair
representation of y, and provides a related characterization; see Fig. 4.6.1.
    (y − P_C(y))' P_C(y) = 0.                                                    (4.49)

By combining Eqs. (4.48) and (4.49), we obtain (y − P_C(y))'z ≤ 0 for all z ∈ C,
implying that y − P_C(y) ∈ C*. Moreover, since P_C(y) ∈ C, we have

where the second equality follows from Eq. (4.49). Thus y − P_C(y) satisfies the
necessary and sufficient condition for being the projection P_{C*}(y).
(b) Suppose that property (i) holds, i.e., x and λ are the projections of
x + λ on C and C*, respectively. Then we have, using also the Projection
Theorem,

    x ∈ C,   λ ∈ C*,   ((x + λ) − x)'x = 0,

or

    x ∈ C,   λ ∈ C*,   λ'x = 0,

which is property (ii).
     Conversely, suppose that property (ii) holds. Then, since λ ∈ C*, we
have λ'z ≤ 0 for all z ∈ C, and hence

where the second equality follows from the fact x ⊥ λ. Thus x satisfies
the necessary and sufficient condition for being the projection P_C(x + λ).
By a symmetric argument, it follows that λ is the projection P_{C*}(x + λ).
Q.E.D.
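As a minimal numerical illustration of Prop. 4.6.1, assume that C is the
nonnegative orthant, so that C* is the nonpositive orthant and both projections
are componentwise truncations (the choice of cone and the data below are
assumptions made only for this sketch). The following Python fragment verifies
that (P_C(y), P_{C*}(y)) is a dual pair representation of y.

    import numpy as np

    def proj_C(y):
        # projection onto C = nonnegative orthant (componentwise truncation)
        return np.maximum(y, 0.0)

    def proj_Cstar(y):
        # the polar of the nonnegative orthant is the nonpositive orthant
        return np.minimum(y, 0.0)

    rng = np.random.default_rng(0)
    y = rng.standard_normal(5)
    x, lam = proj_C(y), proj_Cstar(y)

    # properties of a dual pair representation: x in C, lam in C*, x'lam = 0, x + lam = y
    assert np.all(x >= 0) and np.all(lam <= 0)
    assert abs(x @ lam) < 1e-12
    assert np.allclose(x + lam, y)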
where (x_1, ..., x_r) is a vector in ℜ^{n_1+···+n_r}, with components x_i ∈ ℜ^{n_i},
i = 1, ..., r, and

    f_i : ℜ^{n_i} ↦ (−∞, ∞] is a closed proper convex function for each i,

    S is a subspace of ℜ^{n_1+···+n_r},
and has the same form as the primal problem (4.50). Furthermore, since f_i
is assumed closed proper and convex, and C_i is assumed closed convex, the
conjugate of f_i* is (f_i*)* = f_i and the polar cone of C_i* is (C_i*)* = C_i. Thus
when the dual problem is dualized, it yields the primal problem, similar to
the EMP problem of Section 4.4.
     Let us denote by f^opt and q^opt the optimal primal and dual values.
According to Prop. 1.1.5, (x^opt, λ^opt) form an optimal primal and dual
solution pair if and only if they satisfy the standard primal feasibility, dual
feasibility, and Lagrangian optimality conditions. By working out these
conditions similar to Section 4.4, we obtain the following proposition, which
parallels Prop. 4.4.1.
    x_i^opt ∈ arg min_{x_i ∈ ℜ^{n_i}} { f_i(x_i) − x_i'λ_i^opt },    i = 1, ..., m,        (4.53)

    λ_i^opt ∈ ∂f_i(x_i^opt),    i = 1, ..., m;
(cf. Fig. 4.4.1). Thus the optimality conditions are fully symmetric, con-
sistently with the symmetric form of the primal and dual problems (4.50)
and (4.51).
functions f_i and/or cones C_i are inner linearized while others are outer
linearized.
     We introduce a fixed subset I ⊂ {1, ..., m}, which corresponds to
functions f_i that are inner linearized. For notational convenience, we de-
note by Ī the complement of I in {1, ..., m}:

    {1, ..., m} = I ∪ Ī,

and we also denote

    I_C = {m+1, ..., r}.

     At the typical iteration of the algorithm, we have for each i ∈ I, a
finite set X_i such that ∂f_i(x_i) ≠ ∅ for all x_i ∈ X_i, and for each i ∈ I_C a
finite set X_i ⊂ C_i. The iteration is as follows.
of the problem

where f̄_{i,X_i} are the inner linearizations of f_i corresponding to X_i, i ∈ I.

Step 2: (Test for Termination and Enlargement) Enlarge the
sets X_i as follows (see Fig. 4.6.2):

(a) For i ∈ I, we add any subgradient x̄_i ∈ ∂f_i*(λ̂_i) to X_i.

(b) For i ∈ I_C, we add the projection x̄_i = P_{C_i}(λ̂_i) to X_i.

If there is no strict enlargement for all i ∈ I, i.e., we have x̄_i ∈ X_i, and
moreover x̄_i = 0 for all i ∈ I_C, the algorithm terminates. Otherwise,
we proceed to the next iteration, using the enlarged sets X_i.
Figure 4.6.2. Illustration of the enlargement step of the algorithm, after we ob-
tain a primal and dual optimal solution pair (x̃_1, ..., x̃_r, λ̂_1, ..., λ̂_r). The enlarge-
ment step on the left [finding x̄_i with x̄_i ∈ ∂f_i*(λ̂_i) for i ∈ I] is also equivalent
to finding x̄_i satisfying λ̂_i ∈ ∂f_i(x̄_i), or equivalently, solving the optimization
problem

    maximize  λ̂_i'x_i − f_i(x_i)
    subject to  x_i ∈ ℜ^{n_i}.

The enlargement step on the right, for i ∈ I_C, is to add to X_i the vector
x̄_i = P_{C_i}(λ̂_i), the projection on C_i of λ̂_i.
    minimize
    subject to  x_i ∈ C_i,  ||x_i||² ≤ γ²,

which is used for the enlargement of the set of break points X_i for the
functions f_i (cf. Fig. 4.6.2).
     Note that the projection on a cone that is needed for the enlargement
process can be done conveniently in some important special cases. For
example when C_i is a polyhedral cone (in which case the projection is a
quadratic program), or when C_i is the second order cone (see Exercise 4.3
and [FLT02], [Sch10]), or in other cases, including when C_i is the semidefi-
nite cone (see [BoV04], Section 8.1.1, or [HeM11], [HeM12]). The following
is an illustration of the algorithm for a simple special case.
     By transcribing our algorithm to this special case, we see that (x^k, x^k)
and (λ^k, −λ^k) are optimal primal and dual solutions of the corresponding
approximate problem of the algorithm if and only if

and

    λ^k ∈ ∂f(x^k),                                                              (4.58)

(cf. Prop. 4.6.2). Once λ^k is found, X_k is enlarged by adding x̄^k, the projec-
tion of −λ^k onto C. This construction is illustrated in Fig. 4.6.3.
introduce a dual variable µ for the constraint ||x_i||² ≤ γ², and show that if
λ_i ∉ C_i*, then the optimal solution is x̄_i = (1/(2µ)) P_{C_i}(λ_i).
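When C_i is the second order cone mentioned above, the projection required in
the enlargement step has a simple closed form. The following Python sketch uses
the standard formula for projecting y = (z, t) onto C = {(z, t) | ||z|| ≤ t} (the
formula itself is an assumption of this illustration, consistent with the references
cited above but not derived in the text).

    import numpy as np

    def proj_soc(y):
        # projection onto the second-order cone C = {(z, t) : ||z||_2 <= t}
        z, t = y[:-1], y[-1]
        nz = np.linalg.norm(z)
        if nz <= t:                      # already in the cone
            return y.copy()
        if nz <= -t:                     # projection is the origin
            return np.zeros_like(y)
        alpha = 0.5 * (nz + t)           # otherwise project onto the boundary
        return np.concatenate([alpha * z / nz, [alpha]])

    y = np.array([3.0, -4.0, 1.0])
    x = proj_soc(y)
    print(x, np.linalg.norm(x[:-1]) <= x[-1] + 1e-12)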
Convergence Analysis
Proof: We will verify that upon termination, the three conditions of Prop.
4.6.2 are satisfied for the original problem (4.50). From the definition of
(x̃_1, ..., x̃_r) and (λ̂_1, ..., λ̂_r) as a primal and dual optimal solution pair of

where the vectors λ_x can be any vectors such that x ∈ ∂f_i*(λ_x). Therefore,
the relations x̄_i ∈ X_i and x̄_i ∈ ∂f_i*(λ̂_i) imply that

    f̄*_{i,X_i}(λ̂_i) = f_i*(λ̂_i),

which by Eq. (4.28), shows that

    ∂f̄*_{i,X_i}(λ̂_i) ⊂ ∂f_i*(λ̂_i).

By Eq. (4.53), we also have x̃_i ∈ ∂f̄*_{i,X_i}(λ̂_i), so x̃_i ∈ ∂f_i*(λ̂_i). Thus Eq.
(4.59) is shown for i ∈ I, and all the optimality conditions of Prop. 4.6.2
are satisfied for the original problem (4.50). Q.E.D.
Proof: (a) Let us fix i ∈ I_C. Since x̄_i^k = P_{C_i}(λ̂_i^k), the subsequence {x̄_i^k}_K
converges to x̄_i = P_{C_i}(λ̄_i). We will show that x̄_i = 0, which implies that
λ̄_i ∈ C_i*.
     Denote X_i^∞ = ∪_{k=0}^∞ X_i^k. Since λ̂_i^k ∈ cone(X_i^k)*, we have x_i'λ̂_i^k ≤ 0
for all x_i ∈ X_i^k, so that x_i'λ̄_i ≤ 0 for all x_i ∈ X_i^∞. Since x̄_i belongs to
the closure of X_i^∞, it follows that x̄_i'λ̄_i ≤ 0. On the other hand, since
x̄_i = P_{C_i}(λ̄_i), from Prop. 4.6.1(b) we have x̄_i'(λ̄_i − x̄_i) = 0, which together
with x̄_i'λ̄_i ≤ 0, implies that ||x̄_i||² ≤ 0, or x̄_i = 0.
(b) From the definition of f̄*_{i,X_i^k} [cf. Eq. (4.60)], we have for all i ∈ I and
k, ℓ ∈ K with ℓ < k,

    f_i*(λ̂_i^ℓ) + (λ̂_i^k − λ̂_i^ℓ)'x̄_i^ℓ ≤ f̄*_{i,X_i^k}(λ̂_i^k).

Using this relation and the optimality of λ̂^k for the kth approximate dual
problem, we can write for all k, ℓ ∈ K with ℓ < k

for all (λ_1, ..., λ_m) such that there exist λ_i ∈ cone(X_i^k)*, i ∈ I_C, with
(λ_1, ..., λ_r) ∈ S. Since C_i* ⊂ cone(X_i^k)*, it follows that

                                                                                 (4.61)

for all (λ_1, ..., λ_m) such that there exist λ_i ∈ C_i*, i ∈ I_C, with (λ_1, ..., λ_r) ∈
S, where the last inequality holds since f̄*_{i,X_i^k} is an outer linearization of
f_i*.
     By taking the limit inferior in Eq. (4.61), as k, ℓ → ∞ with k, ℓ ∈ K, and
by using the lower semicontinuity of f_i*, which implies that

we obtain

                                                                                 (4.62)

for all (λ_1, ..., λ_m) such that there exist λ_i ∈ C_i*, i ∈ I_C, with (λ_1, ..., λ_r) ∈
S. We have λ̄ ∈ S and λ̄_i ∈ C_i* for all i ∈ I_C, from part (a). Thus Eq.
(4.62) implies that λ̄ is dual optimal. The sequence of optimal values of the
dual approximation problem [the dual of problem (4.55)] is monotonically
nondecreasing (since the outer approximation is monotonically refined) and
converges to −f^opt since λ̄ is dual optimal. This sequence is the opposite
of the sequence of optimal values of the primal approximation problem
(4.55), so the latter sequence is monotonically nonincreasing and converges
to f^opt. Q.E.D.
Section 4.1: Cutting plane methods were introduced by Cheney and Gold-
stein [ChG59], and by Kelley [Kel60]. For analysis of related methods, see
Ruszczynski [Rus86], Mifflin [Mif96], Burke and Qian [BuQ98], Mifflin,
Sun, and Qi [MSQ98], and Bonnans et al. [BGL09].
Section 4.2: The simplicial decomposition method was introduced by Hol-
loway [Hol74]; see also Hohenbalken [Hoh77], Pang and Yu [PaY84], Hearn,
Lawphongpanich, and Ventura [HLV87], Ventura and Hearn [VeH93], and
Patriksson [Pat01]. The method was also independently proposed in the
context of multicommodity flow problems by Cantor and Gerla [CaG74].
Some of these references describe applications to communication and trans-
portation networks; see also the surveys by Florian and Hearn [FlH95], Pa-
triksson [Pat04], the nonlinear programming textbook [Ber99] (Examples
2.1.3 and 2.1.4), and the discussion of the application of gradient projection
methods in [BeG83], [BeG92]. Simplicial decomposition in a dual setting
for problems with a large number of constraints (Exercise 4.4), was pro-
posed by Huizhen Yu, and was developed in the context of some large-scale
parameter estimation/machine learning problems in the papers [YuR07]
and [YBR08].
Section 4.3: The duality relation between outer and inner linearization
has been known for a long time, particularly in the context of the Dantzig-
Wolfe decomposition algorithm [DaW60], which is a cutting plane/simpli-
cial decomposition algorithm applied to separable problems (see textbooks
such as [Las70], [BeT97], [Ber99] for descriptions and analysis). Our de-
velopment of the conjugacy-based form of this duality follows the paper by
Bertsekas and Yu [BeY11].
Section 4.4: The generalized polyhedral approximation algorithm is due
to Bertsekas and Yu [BeY11], which contains a detailed convergence analy-
sis. Extended monotropic programming and its duality theory were devel-
oped in the author's paper [Ber10a], and will be discussed in greater detail
in Section 6.7.
Section 4.5: The generalized simplicial decomposition material of this
section follows the paper [BeY11]. A different simplicial decomposition
method for minimizing a nondifferentiable convex function over a poly-
hedral set, based on concepts of ergodic sequences of subgradients and
a conditional subgradient method, is given by Larsson, Patriksson, and
Stromberg (see [Str97], [LPS98]).
Section 4.6: The simplicial decomposition algorithm with conical approx-
imations is new and was developed as the book was being written.
EXERCISES
Consider using the cutting plane method for finding a solution of a system of
inequality constraints g_i(x) ≤ 0, i = 1, ..., m, where g_i : ℜ^n ↦ ℜ are convex
functions. Formulate this as a problem of unconstrained minimization of the
convex function

    f(x) = max_{i=1,...,m} g_i(x).
(a) State the cutting plane method, making sure that the method is well-
defined.
(b) Implement the method of part (a) for the case where g_i(x) = c_i'x − b_i,
    n = 2, and m = 100. The vectors c_i have the form c_i = (ξ_i, ζ_i), where
    ξ_i, ζ_i are chosen randomly and independently within [−1, 1] according to a
    uniform distribution, while b_i is chosen randomly and independently within
    [0, 1] according to a uniform distribution. Does the method converge in a
    finite number of iterations? Is the problem solved after a finite number
of iterations? How can you monitor the progress of the method towards
optimality using upper and lower bounds?
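A minimal computational sketch for part (b) follows; the box [−R, R]² that keeps
the subproblems well defined, and the use of a linear programming solver, are
assumptions of this illustration rather than part of the exercise statement. The
lower bound F_k(x_{k+1}) and the upper bound f(x_{k+1}) monitor progress
towards optimality.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(1)
    n, m, R = 2, 100, 10.0
    C = rng.uniform(-1.0, 1.0, size=(m, n))      # rows c_i = (xi_i, zeta_i)
    b = rng.uniform(0.0, 1.0, size=m)

    def f_and_cut(x):
        vals = C @ x - b
        i = int(np.argmax(vals))
        return vals[i], i                        # f(x) = max_i (c_i'x - b_i) and the active index

    cuts, x = [], np.zeros(n)
    for k in range(200):
        fx, i = f_and_cut(x)
        if i not in cuts:
            cuts.append(i)
        # cutting plane subproblem: minimize z s.t. c_i'x - b_i <= z (i in cuts), x in [-R, R]^2
        A_ub = np.hstack([C[cuts], -np.ones((len(cuts), 1))])
        res = linprog(c=[0.0, 0.0, 1.0], A_ub=A_ub, b_ub=b[cuts],
                      bounds=[(-R, R), (-R, R), (None, None)])
        x, lower = res.x[:n], res.x[n]
        upper, _ = f_and_cut(x)
        print(k, "lower bound", lower, "upper bound", upper)
        if upper - lower < 1e-9:                 # F_k(x_{k+1}) = f(x_{k+1}): optimal
            break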
4.2
and the problem of Euclidean projection of a given vector x̄ = (x̄_1, ..., x̄_n) onto
C. Let z ∈ ℜ^{n−1} be the vector z = (x̄_1, ..., x̄_{n−1}). Show that the projection,
denoted x̂, is given by
    minimize  f(x)
    subject to  Ax ≤ 0,  x ∈ X,

    maximize  h(ξ)
    subject to  ξ ∈ C,

where

    h(ξ) = inf_{x ∈ X} { f(x) + ξ'x },

and C is the cone {A'µ | µ ≥ 0}, the polar of the cone {x | Ax ≤ 0}.
(b) Suppose that the cone C of the dual problem of (a) is approximated by a
    polyhedral cone of the form

    where ξ_1, ..., ξ_m are m vectors from C. Show that the resulting approxi-
    mate problem is dual to the problem

        minimize  f(x)
        subject to  µ_i'Ax ≤ 0,  i = 1, ..., m,  x ∈ X,

    maximize  h(ξ) − ζ
    subject to  (ξ, ζ) ∈ C,
The algorithms and analysis of Section 4.6 apply to cases where the constraint
set involves the intersection of compact sets and cones, which can be inner lin-
earized separately (the compact set constraints can be represented as indicator
functions via the functions f_i). This exercise deals with the related case where
the constraints are vector sums of compact sets and cones, which again can be
linearized separately. Describe how the algorithm of Section 4.6 can be applied
to the problem

    minimize  f(x)
    subject to  x ∈ X + C,

where X is a compact set and C is a closed convex cone. Hint: Write the problem
as

    minimize  f(x_1) + δ(x_2 | X) + δ(x_3 | C)
    subject to  x_1 = x_2 + x_3,

which is of the form (4.50) with
    minimize  f(x) + w
    subject to  (x, w) ∈ epi(h),

    minimize  f(x) + w
    subject to  (x, w) ∈ C,
Proximal Algorithms
Figure 5.1.1. Geometric view of the proximal algorithm (5.1). The minimum of

    f(x) + (1/(2c_k)) ||x − x_k||²

is attained at a unique point x_{k+1} as shown. In this figure, γ_k is the scalar by
which the graph of the quadratic −(1/(2c_k)) ||x − x_k||² must be raised so that it
just touches the graph of f. The slope shown in the figure, (x_k − x_{k+1})/c_k,
is the common subgradient of f(x) and −(1/(2c_k)) ||x − x_k||² at the minimizing
point x_{k+1}, cf. the Fenchel Duality Theorem (Prop. 1.2.1).
in Eq. (5.1) [cf. Prop. 3.1.1 and Prop. 3.2.1 in Appendix B; also the broader
discussion of existence of minima in Chapter 3 of [Ber09]].
Evidently, the algorithm is useful only for problems that can benefit
from regularization. It turns out, however, that many interesting problems
fall in this category, and often in unexpected and diverse ways. In particu-
lar, as we will see in this and the next chapter, the creative application of
the proximal algorithm and its variations, together with duality ideas, can
allow the elimination of constraints and nondifferentiabilities, the stabiliza-
tion of the linear approximation methods of Chapter 4, and the effective
exploitation of special problem structures.
5.1.1 Convergence
    (x_k − x_{k+1}) / c_k ∈ ∂f(x_{k+1}).                                        (5.2)

     Generally, starting from any nonoptimal point x_k, the cost function
value is reduced at each iteration, since from the minimization in the algo-
rithm's definition [cf. Eq. (5.1)], by setting x = x_k, we have

    f(x_{k+1}) + (1/(2c_k)) ||x_{k+1} − x_k||² ≤ f(x_k).
The following proposition provides an inequality, which among others shows
that the iterate distance to any optimal solution is also reduced. This in-
equality resembles (but is more favorable than) the fundamental inequality
of Prop. 3.2.2(a) for the subgradient method.
Proof: We have

    ||x_k − y||² = ||x_k − x_{k+1} + x_{k+1} − y||²
                 = ||x_k − x_{k+1}||² + 2(x_k − x_{k+1})'(x_{k+1} − y) + ||x_{k+1} − y||².

Using Eq. (5.2) and the definition of subgradient, we obtain

    (1/c_k)(x_k − x_{k+1})'(x_{k+1} − y) ≥ f(x_{k+1}) − f(y).
By multiplying this relation with 2ck and adding it to the preceding rela-
tion, the result follows. Q.E.D.
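As a simple numerical illustration of these reduction properties, consider the
scalar function f(x) = |x| (an assumed example, not part of the text), for which
the proximal minimization has the closed form x_{k+1} = sign(x_k) max{|x_k| − c_k, 0},
the shrinkage operation encountered again in Section 5.4. The Python sketch
below shows the monotone decrease of both f(x_k) and the distance to x* = 0.

    def prox_abs(z, c):
        # argmin_x { |x| + (1/(2c)) (x - z)^2 } = soft-thresholding of z at level c
        if z > c:
            return z - c
        if z < -c:
            return z + c
        return 0.0

    x, c = 5.3, 1.0
    for k in range(8):
        x = prox_abs(x, c)
        print(k, "x =", x, "f(x) =", abs(x))   # both |x| and f(x) decrease monotonically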
(which may be −∞) and by X* the set of minima of f (which may be
empty),

    X* = arg min_{x ∈ ℜ^n} f(x).
The following is the basic convergence result for the proximal algorithm.
Proof: We first note that since x_{k+1} minimizes f(x) + (1/(2c_k)) ||x − x_k||², we
have by setting x = x_k,

    f(x_{k+1}) + (1/(2c_k)) ||x_{k+1} − x_k||² ≤ f(x_k),    ∀ k.

It follows that {f(x_k)} is monotonically nonincreasing. Hence f(x_k) ↓ f_∞,
where f_∞ is either a scalar or −∞, and satisfies f_∞ ≥ f*.
     From Eq. (5.3), we have for all y ∈ ℜ^n,

    ||x_{k+1} − y||² ≤ ||x_k − y||² − 2c_k (f(x_{k+1}) − f(y)).                  (5.4)

By adding this inequality over k = 0, ..., N, we obtain

    ||x_{N+1} − y||² + 2 ∑_{k=0}^N c_k (f(x_{k+1}) − f(y)) ≤ ||x_0 − y||²,    ∀ y ∈ ℜ^n, N ≥ 0,

so that

    2 ∑_{k=0}^N c_k (f(x_{k+1}) − f(y)) ≤ ||x_0 − y||²,    ∀ y ∈ ℜ^n, N ≥ 0.

Taking the limit as N → ∞, we have

    2 ∑_{k=0}^∞ c_k (f(x_{k+1}) − f(y)) ≤ ||x_0 − y||²,    ∀ y ∈ ℜ^n.            (5.5)

     Assume to arrive at a contradiction that f_∞ > f*, and let ŷ be such
that

    f_∞ > f(ŷ) > f*.

Since {f(x_k)} is monotonically nonincreasing, we have

    f(x_{k+1}) − f(ŷ) ≥ f_∞ − f(ŷ) > 0.

Then in view of the assumption ∑_{k=0}^∞ c_k = ∞, Eq. (5.5), with y = ŷ, leads
to a contradiction. Thus f_∞ = f*.
     Consider now the case where X* is nonempty, and let x* be any point
in X*. Applying Eq. (5.4) with y = x*, we have

    ||x_{k+1} − x*||² ≤ ||x_k − x*||² − 2c_k (f(x_{k+1}) − f(x*)),    k = 0, 1, ....    (5.6)

From this relation it follows that ||x_k − x*||² is monotonically nonincreasing,
so {x_k} is bounded. If x̄ is a limit point of {x_k}, we have

    f(x̄) ≤ lim inf_{k → ∞, k ∈ K} f(x_k) = f*
The following proposition describes how the convergence rate of the proxi-
mal algorithm depends on the magnitude of c_k and on the order of growth
of f near the optimal solution set (see also Fig. 5.1.3).
Figure 5.1.3. Illustration of the convergence rate of the proximal algorithm and
the effect of the growth properties of f near the optimal solution set. In the figure
on the left, f grows slowly and the convergence is slow. In the figure on the right,
f grows fast and the convergence is fast.
where

    d(x) = min_{x* ∈ X*} ||x − x*||,

    ∑_{k=0}^∞ c_k = ∞,

                                                                                 (5.8)

if γ > 1, and

    lim sup_{k → ∞}  d(x_{k+1}) / d(x_k)  ≤  1 / (1 + βc),

    lim sup_{k → ∞}  d(x_{k+1}) / d(x_k)^{2/γ}  <  ∞.
Proof: (a) The proof uses an argument that can be visualized from Fig.
5.1.4. Since the conclusion clearly holds when x_{k+1} ∈ X*, we assume that
x_{k+1} ∉ X* and we denote by x̂_{k+1} and x̂_k the projections of x_{k+1} and x_k
on X*, respectively. From the subgradient relation (5.2), we have

                                                                                 (5.10)

Figure 5.1.4. Visualization of the proof argument of Prop. 5.1.4(a), based on the
growth condition β(d(x_{k+1}))^γ ≤ f(x_{k+1}) − f* and the subgradient
(x_k − x_{k+1})/c_k, where δ_{k+1} is the scalar shown in the figure. Canceling
d(x_{k+1}) from both sides, we obtain Eq. (5.8).
     Proposition 5.1.4 shows that as the growth order γ in Eq. (5.7) in-
creases, the rate of convergence becomes slower. An important threshold
value is γ = 2; in this case the distance of the iterates to X* decreases
at a rate that is at least linear if c_k remains bounded, and decreases even
faster (superlinearly) if c_k → ∞. Generally, the convergence is accelerated
if c_k is increased with k, rather than kept constant; this is illustrated most
clearly when γ = 2 [cf. Prop. 5.1.4(c)]. When 1 < γ < 2, the convergence
rate is faster than linear (superlinear) [cf. Prop. 5.1.4(b)]. When γ > 2, the
convergence rate is generally slower than when γ = 2, and examples show
that d(x_k) may converge to 0 sublinearly, i.e., slower than any geometric
progression [cf. Prop. 5.1.4(d)].
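The dependence of the convergence rate on the growth order γ is easy to observe
numerically. In the sketch below (the test functions f(x) = |x|^γ and the
parameter values are assumptions made only for this illustration), each proximal
iterate is computed by a one-dimensional numerical minimization, and the
distances d(x_k) = |x_k| roughly exhibit finite, superlinear, linear, and sublinear
behavior for γ = 1, 1.5, 2, 3, respectively.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def prox_step(xk, c, gamma):
        # numerically minimize |x|^gamma + (1/(2c)) (x - xk)^2
        res = minimize_scalar(lambda x: abs(x) ** gamma + (x - xk) ** 2 / (2 * c),
                              bounds=(-10.0, 10.0), method='bounded')
        return res.x

    for gamma in (1.0, 1.5, 2.0, 3.0):
        x, c, dists = 1.0, 1.0, []
        for k in range(10):
            x = prox_step(x, c, gamma)
            dists.append(abs(x))
        print("gamma =", gamma, ["%.2e" % d for d in dists])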
     The threshold value of γ = 2 for linear convergence is related to the
quadratic growth property of the regularization term. A generalized version
of the proposition, with similar proof, is possible for proximal algorithms
that use nonquadratic regularization functions (see [KoB76], and [Ber82a],
Section 3.5, and also Example 6.6.5 in Section 6.6). In this context, the
threshold value for linear convergence is related to the order of growth of
the regularization function.
     When γ = 1, f is said to have a sharp minimum, a favorable condition
that we encountered in Chapter 3. Then the proximal algorithm converges
finitely. This is shown in the following proposition (see also Fig. 5.1.5).
Proof: The assumption (5.7) of Prop. 5.1.4 holds with γ = 1 and all δ > 0,
so Eq. (5.9) yields

    d(x_{k+1}) + βc_k ≤ d(x_k),    if x_{k+1} ∉ X*.                             (5.12)

Figure 5.1.5. Finite convergence of the proximal algorithm for the case of a
sharp minimum, when f(x) grows at a linear rate near the optimal solution set
(e.g., when f is polyhedral). In the figure on the right, convergence occurs in a
single iteration for sufficiently large c_0.

If ∑_{k=0}^∞ c_k = ∞ and x_k ∉ X* for all k, by adding Eq. (5.12) over all k,
we obtain a contradiction. Hence we must have x_k ∈ X* for k sufficiently
large. Also if c_0 ≥ d(x_0)/β, Eq. (5.12) cannot hold with k = 0, so we must
have x_1 ∈ X*. Q.E.D.
    f̂(x) = f* + β d(x) + (1/(2c_0)) ||x − x_0||²,                               (5.13)

    ∂f̂(x_0) = { βξ (x_0 − x̂_0)/||x_0 − x̂_0|| + (1/c_0)(x_0 − x_0)  |  ξ ∈ [0, 1] }.

Therefore, if c_0 ≥ d(x_0)/β, then 0 ∈ ∂f̂(x_0), so that x_0 minimizes f̂(x).
Since from Eqs. (5.11) and (5.13), we have

    f̂(x) ≤ f(x) + (1/(2c_0)) ||x − x_0||²,    ∀ x ∈ ℜ^n,

    f(x) + (1/(2c_0)) ||x − x_0||²
Proof: We assume first that f is linear within dom(f), and then general-
ize. Then, there exists a ∈ ℜ^n such that for all x, x̄ ∈ dom(f), we have

    f(x) − f(x̄) = a'(x − x̄).

For any x ∈ X*, let B_x be the cone of vectors d that are in the normal
cone N_{X*}(x) of X* at x, and are also feasible directions in the sense that
x + αd ∈ dom(f) for a small enough α > 0. Since X* and dom(f) are
polyhedral sets, there exist only a finite number of possible cones B_x as x
ranges over X*. Thus, there is a finite set of nonzero vectors {c_j | j ∈ J},
such that for any x ∈ X*, B_x is either equal to {0}, or is the cone generated
by a subset {c_j | j ∈ J_x}, where J = ∪_{x ∈ X*} J_x. In addition, for all x ∈ X*
and d ∈ B_x with ||d|| = 1, we have

    d = ∑_{j ∈ J_x} γ_j c_j,

for some scalars γ_j ≥ 0 with ∑_{j ∈ J_x} γ_j ≥ γ̄, where γ̄ = 1/max_{j ∈ J} ||c_j||.
Also we can show that for all j ∈ J, we have a'c_j > 0, by using the fact
c_j ∈ B_x for some x ∈ X*.
     For x ∈ dom(f) with x ∉ X*, let x̂ be the projection of x on X*.
Then the vector x − x̂ belongs to B_x̂, and we have

where β = γ̄ min_{j ∈ J} a'c_j. Since J is finite, we have β > 0, and this implies
the desired result for the case where f is linear within dom(f).
     Assume now that f is of the form

    f(x) = max_{i ∈ I} { a_i'x + b_i },    ∀ x ∈ dom(f),

where I is a finite set, and a_i and b_i are some vectors and scalars, respec-
tively. Let

    Y = { (x, z) | z ≥ f(x), x ∈ dom(f) },

and consider the function

Since

    d(x, z) ≥ min_{x* ∈ X*} ||x − x*|| = d(x),

we have

    f* + β d(x) ≤ g(x, z),    ∀ (x, z) ∉ Y*,

and by taking the infimum of the right-hand side over z for any fixed x,

Q.E.D.
    φ_c(z) = inf_{x ∈ ℜ^n} { f(x) + (1/(2c)) ||x − z||² }.

Figure 5.1.7. Illustration of the function φ_c. We have φ_c(z) ≤ f(z) for all
z ∈ ℜ^n, and at the set of minima of f, φ_c coincides with f. We also have

    ∇φ_c(z) = (z − x_c(z)) / c.
    inf_{x ∈ ℜ^n} f(x) ≤ φ_c(z) ≤ f(z),    ∀ z ∈ ℜ^n,

from which it follows that the set of minima of f and φ_c coincide (this is also
evident from the geometric view of the proximal minimization given in Fig.
5.1.7). The following proposition shows that φ_c is a convex differentiable
function, and derives its gradient.
Proposition 5.1.7: The function φ_c of Eq. (5.14) is convex and dif-
ferentiable, and we have

    ∇φ_c(z) = (z − x_c(z)) / c,    ∀ z ∈ ℜ^n.                                   (5.15)

    ∇φ_c(z) ∈ ∂f(x_c(z)),
(cf. Prop. 3.3.1 in Appendix B). Furthermore, φ_c is real-valued, since the
infimum in Eq. (5.14) is attained.
     Let us fix z, and for notational simplicity, denote z̄ = x_c(z). To show
that φ_c is differentiable with the given form of gradient, we note that by
the optimality condition of Prop. 3.1.4, we have v ∈ ∂φ_c(z), or equivalently
0 ∈ ∂φ_c(z) − v, if and only if z attains the minimum over y ∈ ℜ^n of

[This last step is obtained by viewing F as the sum of the function f and
the differentiable function

    (1/(2c)) ||x − y||² − v'y,

and by writing

    ∂F(x, y) = { (g, 0) | g ∈ ∂f(x) } + { ((x − y)/c, (y − x)/c − v) };

cf. Prop. 5.4.6 in Appendix B.] The right side of Eq. (5.16) uniquely defines
v, so that v is the unique subgradient of φ_c at z, and it has the form
v = (z − z̄)/c, as required by Eq. (5.15). From the left side of Eq. (5.16),
we also see that v = ∇φ_c(z) ∈ ∂f(x_c(z)). Q.E.D.
     Using the gradient formula (5.15), we see that the proximal iteration
can be written as

    x_{k+1} = x_k − c_k ∇φ_{c_k}(x_k),                                          (5.17)

so it is a gradient iteration for minimizing φ_{c_k} with stepsize equal to c_k.
This interpretation provides insight into the working mechanism of the al-
gorithm and has formed the basis for various acceleration schemes, based
on gradient and Newton-like schemes, particularly in connection with the
augmented Lagrangian method, to be discussed in Section 5.2.1. In this
connection, we will show in the next subsection that a stepsize as large as
2ck can be used in place of Ck in Eq. (5.17). Moreover, the use of extrapo-
lation schemes to modify the stepsize Ck has been shown to be beneficial in
the constrained optimization context of the augmented Lagrangian method
(see [Ber82a], Section 2.3.1).
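The gradient formula (5.15) can also be checked numerically. In the sketch below
(the particular nondifferentiable f and the value of c are assumptions made only
for this illustration), φ_c(z) is computed by one-dimensional minimization and
(z − x_c(z))/c is compared with a central finite-difference estimate of ∇φ_c(z).

    from scipy.optimize import minimize_scalar

    f = lambda x: abs(x - 1.0) + 0.5 * x ** 2      # a simple nondifferentiable convex function
    c = 0.7

    def prox_and_env(z):
        # returns x_c(z) and phi_c(z) = min_x { f(x) + (1/(2c)) (x - z)^2 }
        res = minimize_scalar(lambda x: f(x) + (x - z) ** 2 / (2 * c),
                              bounds=(-10.0, 10.0), method='bounded')
        return res.x, res.fun

    z, h = 2.3, 1e-5
    xc, _ = prox_and_env(z)
    grad_formula = (z - xc) / c
    grad_fd = (prox_and_env(z + h)[1] - prox_and_env(z - h)[1]) / (2 * h)
    print(grad_formula, grad_fd)                   # the two values should agree closely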
    N_{c,f}(z) = 2P_{c,f}(z) − z,    z ∈ ℜ^n.                                    (5.19)

    P_{c,f}(z) = (N_{c,f}(z) + z) / 2,
so P_{c,f}(z) is the midpoint of the line segment connecting N_{c,f}(z) and z. For
this reason, N_{c,f} is called the reflection operator. Some interesting facts
here are that:

(a) The set of fixed points of N_{c,f} is equal to the set of fixed points of
    P_{c,f} and hence the set of minima of f. Moreover, as we will show
    shortly, the mapping N_{c,f} is nonexpansive, i.e.,

    Thus for any x, N_{c,f}(x) is at least as close to the set of minima of f
    as x.

(b) The interpolated iteration

        x_{k+1} = α_k N_{c,f}(x_k) + (1 − α_k) x_k,                             (5.20)

    where the interpolation parameter α_k satisfies α_k ∈ [ε, 1 − ε] for some
    scalar ε > 0 and all k, converges to a fixed point of N_{c,f}, provided
    N_{c,f} has at least one fixed point (this is a consequence of a classical
    result on the convergence of interpolated nonexpansive iterations, to
    be stated shortly).

(c) The preceding interpolated iteration (5.20), in view of the definition
    of N_{c,f} [cf. Eq. (5.19)], can be written as

        x_{k+1} = x_k + 2α_k (P_{c,f}(x_k) − x_k),                              (5.21)

    and as a special case, for α_k = 1/2, yields the proximal algorithm
    x_{k+1} = P_{c,f}(x_k). We thus obtain a generalized form of the proximal
    algorithm, which depending on the parameter α_k, provides for extrap-
    olation (when 1/2 < α_k < 1) or interpolation (when 0 < α_k < 1/2).
     We will now prove the facts just stated in the following two propo-
sitions. To this end, we note that for any z ∈ ℜ^n, the proximal iterate
P_{c,f}(z) is uniquely defined, and we have

since the right side above is the necessary condition for optimality of z̄
in the proximal minimization (5.18) that defines P_{c,f}(z). Moreover the
converse also holds,

since the left side above is the sufficiency condition for z̄ to be (uniquely)
optimal in the proximal minimization. An equivalent way to state the two
Figure 5.1.8. The figure on the left provides a graphical interpretation of the
proximal iteration at a vector z for a one-dimensional problem. The line that
passes through z and has slope −1/c intercepts the graph of the (monotone)
subdifferential mapping ∂f(x) at a unique point v, which corresponds to z̄, the
unique vector P_{c,f}(z) produced by the proximal iteration [cf. Eqs. (5.22)-(5.24)].
The figure on the left also illustrates the reflection operator N_{c,f}(z) = 2P_{c,f}(z) − z.
The iterate P_{c,f}(z) lies at the midpoint between z and N_{c,f}(z) [cf. Eq. (5.25)].
Note that all points between z and N_{c,f}(z) are at least as close to x* as z. The
figure on the right illustrates the proximal iteration x_{k+1} = P_{c,f}(x_k).
relations (5.22) and (5.23) is that any vector z ∈ ℜ^n can be written in
exactly one way as

    z = z̄ + cv,    where z̄ ∈ ℜ^n, v ∈ ∂f(z̄),                                   (5.24)

and moreover the vector z̄ is equal to P_{c,f}(z),

    z̄ = P_{c,f}(z).

Using Eq. (5.19), we also obtain a corresponding formula for N_{c,f}:

    N_{c,f}(z) = 2P_{c,f}(z) − z = 2z̄ − (z̄ + cv) = z̄ − cv.                      (5.25)
Figure 5.1.8 illustrates the preceding relations and provides a graphical
interpretation of the proximal algorithm. The following proposition verifies
the nonexpansiveness property of N_{c,f}.
Proposition 5.1.8: For any c > 0 and closed proper convex function
f : ℜ^n ↦ (−∞, ∞], the mapping

    N_{c,f}(z) = 2P_{c,f}(z) − z

with

The nonexpansiveness of N_{c,f} will follow if we can show that the inner
product in the right-hand side is nonnegative. Indeed this is obtained by
using the definition of subgradients to write

                                                                                 (5.28)

and the result follows. Finally the nonexpansiveness of N_{c,f} clearly implies
the nonexpansiveness of the interpolated mapping. Q.E.D.
                                                                                 (5.29)

where α_k ∈ [0, 1] for all k and ∑_{k=0}^∞ α_k (1 − α_k) = ∞, converges to a
fixed point of T, starting from any x_0 ∈ ℜ^n.

                                                                                 (5.30)

where γ_k ∈ [ε, 2 − ε] for some scalar ε > 0 and all k, converges to a
minimum of f, assuming at least one minimum exists.

with γ_k = 2α_k. Since the fixed points of N_{c,f} are the minima of f, the
result follows from Prop. 5.1.9 with T = N_{c,f}. Q.E.D.
that the set of fixed points of N_{c,f} does not depend on c as long as c > 0.
Note that for γ_k = 1, we obtain the proximal algorithm x_{k+1} = P_{c_k,f}(x_k).
     Another interesting fact is that the iteration (5.31) can also be written
as a gradient iteration

where

    φ_c(z) = inf_{x ∈ ℜ^n} { f(x) + (1/(2c)) ||x − z||² },

[cf. Eq. (5.14)], based on the fact

    ∇φ_{c_k}(x_k) = (x_k − P_{c_k,f}(x_k)) / c_k,

[cf. Eq. (5.17)]. Since the performance of gradient methods is often im-
proved by intelligent stepsize choice, this motivates stepsize selection sche-
mes that are aimed at acceleration of convergence.
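For a simple quadratic f the effect of the relaxation parameter γ_k in the iteration
x_{k+1} = x_k + γ_k (P_{c,f}(x_k) − x_k) discussed above can be seen directly. The
sketch below (the scalar quadratic f(x) = (a/2)x², for which P_{c,f}(x) = x/(1 + ca),
and the parameter values are assumptions made only for this illustration) compares
γ_k = 1 with the over-relaxed choice γ_k = 1.8.

    a, c = 1.0, 1.0                     # f(x) = (a/2) x^2, so P_{c,f}(x) = x / (1 + c a)
    prox = lambda x: x / (1.0 + c * a)

    for gamma in (1.0, 1.8):
        x, errs = 1.0, []
        for k in range(10):
            x = x + gamma * (prox(x) - x)       # relaxed proximal iteration
            errs.append(abs(x))
        print("gamma =", gamma, ["%.1e" % e for e in errs])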
     Indeed, it turns out that with extrapolation along the interval con-
necting x_k and P_{c,f}(x_k), we can always obtain points that are closer to the
set of optimal solutions X* than P_{c,f}(x_k). By this we mean that for each
x with P_{c,f}(x) ∉ X*, there exists γ ∈ (1, 2) such that

    min_{x* ∈ X*} ||x + γ (P_{c,f}(x) − x) − x*||  <  min_{x* ∈ X*} ||P_{c,f}(x) − x*||.        (5.32)
This can be seen with a simple geometrical argument (cf. Fig. 5.1.9). Thus
the proximal algorithm can always benefit from overrelaxation, i.e., γ_k ∈
(1, 2), if only we knew how to do it effectively. One may consider a trial and
error scheme to determine a constant value of γ_k ∈ (1, 2) that accelerates
convergence relative to γ_k = 1; this may work well when c_k is kept constant.
More systematic procedures for variable values c_k have been suggested in
[Ber75d] and [Ber82a], Section 2.3.1. In the procedure of [Ber82a], the
overrelaxation parameter is chosen within (1, 2) as
Thus the angle between P_{c,f}(x) − x and P_{c,f}(x) − x* is strictly greater than π/2.
From triangle geometry, it follows that there exist points in the interval connecting
P_{c,f}(x) and N_{c,f}(x) that are closer to x* than P_{c,f}(x), so there exists γ ∈ (1, 2)
such that Eq. (5.32) holds.
[cf. Eq. (5.24)]. This was necessary in order for the mapping P_{c,f} that
maps x to x̄, and the corresponding mapping

(b) We have

    (x_1 − x_2)'(v_1 − v_2) ≥ 0,    ∀ x_1, x_2 ∈ dom(M)
                                    and v_1 ∈ M(x_1), v_2 ∈ M(x_2),             (5.34)

where

    dom(M) = { x | M(x) ≠ ∅ }

(assumed nonempty). This property, known as monotonicity of M,
was used to prove that the mapping N_{c,f} is nonexpansive in Prop.
5.1.8 [cf. Eq. (5.28)].
     It can be shown that both of the preceding two properties hold if and
only if M is maximal monotone, i.e., it is monotone in the sense of Eq.
(5.34), and its graph {(x, v) | v ∈ M(x)} is not strictly contained in the
graph of any other monotone mapping on ℜ^n† (the subdifferential mapping
can be shown to be maximal monotone; this is shown in several sources,
[Roc66], [Roc70], [RoW98], [BaC11]). Maximal monotone mappings, the
associated proximal algorithms, and related subjects have been extensively
treated in the literature, to which we refer for further discussion; see the
end-of-chapter references.
     In summary, the proximal algorithm in its full generality applies to
the problem of finding a zero of a maximal monotone multivalued mapping
M : ℜ^n ↦ 2^{ℜ^n} [a vector x* such that 0 ∈ M(x*)]. It takes the form

    x_{k+1} = x_k − c v_k,

where v_k is the unique point v such that v ∈ M(x_{k+1}); cf. Fig. 5.1.10. If
M is a single-valued mapping, we have x_{k+1} = x_k − cM(x_{k+1}), or

    x_{k+1} = (I + cM)^{−1}(x_k),

where I is the identity mapping and (I + cM)^{−1} is the inverse of the
mapping I + cM. Moreover, a more general version of the algorithm is
valid, allowing for a stepsize γ_k ∈ (0, 2),

    x_{k+1} = x_k − γ_k c v_k,
† Note that the monotonicity property (5.34) and the existence of the rep-
resentation (5.33) for some x ∈ ℜ^n implies the uniqueness of this representation
[if x = x_1 + cv_1 = x_2 + cv_2, then 0 ≤ c(x_1 − x_2)'(v_1 − v_2) = −||x_1 − x_2||², so
x_1 = x_2]. Thus maximal monotonicity of M is equivalent to monotonicity and
existence of a representation of the form (5.33) for every x ∈ ℜ^n, something that
can be easily visualized (cf. Fig. 5.1.10) but quite hard to prove (see the original
work [Min62], or subsequent sources such as [Bre73], [RoW98], [BaC11]).
Figure 5.1.10. Illustration of the proximal iteration x_{k+1} = x_k − cv_k for
finding a zero of a multivalued mapping M(x).
    f_1(x) = f(x),

where the last equality follows by noting that the supremum over x is
attained at x = x_k + c_k λ. Introducing f*, the conjugate of f,

and substituting into Eq. (5.36), we see that the dual problem (5.36) can
be written as

    minimize  f*(λ) − x_k'λ + (c_k/2) ||λ||²                                      (5.37)
    subject to  λ ∈ ℜ^n.
     We also note that there is no duality gap, since f_1 and f_2 are real-
valued, so the relative interior conditions of the Fenchel Duality Theorem
[Prop. 1.2.1(a),(b)] are satisfied. In fact there exist unique primal and dual
optimal solutions, since both primal and dual problems involve a strictly
convex cost function with compact level sets.
     Let λ_{k+1} be the unique solution of the minimization (5.37). Then
λ_{k+1} together with x_{k+1} satisfy the necessary and sufficient optimality
conditions of Prop. 1.2.1(c),

                                                                                 (5.38)

    λ_{k+1} = (x_k − x_{k+1}) / c_k;                                             (5.39)

see Fig. 5.2.1. This equation can be used to find the primal proximal iterate
x_{k+1} of Eq. (5.35), once λ_{k+1} is known,

    x_{k+1} = x_k − c_k λ_{k+1}.                                                 (5.40)
Figure 5.2.1. Illustration of the optimal primal and dual proximal solutions,
x_{k+1} and λ_{k+1} [cf. Eq. (5.39)], and the relation between the primal and dual
proximal solutions.
    ∀ k ≥ 0,

[cf. Eq. (5.38) for the left side, and the Conjugate Subgradient Theorem
(Prop. 5.4.3 in Appendix B) for the right side], and as λ_k converges to 0
and x_k converges to a minimum x* of f, we have

    0 ∈ ∂f(x*),    x* ∈ ∂f*(0).
The primal and dual implementations of the proximal algorithm are
mathematically equivalent and generate identical sequences {xk}, assuming
the same starting point xo and penalty parameter sequence {Ck}. Whether
one is preferable over the other depends on which of the minimizations
(5.35) and (5.41) is easier, i.e., whether f or its conjugate f* has more
convenient structure. In the next section we will discuss a case where
the dual proximal algorithm is more convenient and yields the augmented
Lagrangian method.
Figure 5.2.2. Illustration of primal and dual proximal algorithms. The primal
algorithm aims to find x*, a minimum of f. The dual algorithm aims to find x*
as a subgradient of f* at 0 [cf. Prop. 5.4.4(b) in Appendix B].
We will now apply the proximal algorithm to the dual problem of a con-
strained optimization problem. We will show how the corresponding dual
proximal algorithm leads to the class of augmented Lagrangian methods.
These methods are popular because they allow the solution of constrained
optimization problems, through a sequence of easier unconstrained (or less
constrained) optimizations, which can be performed with fast and reliable
algorithms, such as Newton, quasi-Newton, and conjugate gradient meth-
ods. Augmented Lagrangian methods can also be used for smoothing of
nondifferentiable cost functions, as described in Section 2.2.5; see nonlinear
programming textbooks, and the monograph [Ber82a], which is a compre-
hensive reference on augmented Lagrangian, and related smoothing and
sequential quadratic programming methods.
     Consider the constrained minimization problem

    minimize  f(x)
    subject to  x ∈ X,  Ax = b,                                                  (5.43)

where f : ℜ^n ↦ (−∞, ∞] is a convex function, X is a convex set, A is an
m × n matrix, and b ∈ ℜ^m.†
In view of the conjugacy relation between q and p, it can be seen that the
dual proximal algorithm (5.41)-(5.42) has the form

    inf_{u ∈ ℜ^m} {  inf_{x ∈ X, Ax−b=u} f(x) + λ_k'u + (c_k/2) ||u||²  }
        = inf_{u ∈ ℜ^m}  inf_{x ∈ X, Ax−b=u} { f(x) + λ_k'(Ax − b) + (c_k/2) ||Ax − b||² }
so the vector u_{k+1} is the one for which −λ_k is a subgradient of p(u) + (c_k/2)||u||²
at u = u_{k+1}, as shown in the figure. By combining the last two relations, we obtain
−λ_{k+1} ∈ ∂p(u_{k+1}), as shown in the figure. The optimal value in the minimization
(5.46) is equal to inf_{x ∈ X} L_{c_k}(x, λ_k), and can be geometrically interpreted as in
the figure.
where x_{k+1} is any vector that minimizes L_{c_k}(x, λ_k) over X (we assume
that such a vector exists; while the existence of the minimizing u_{k+1} is
guaranteed, since the minimization (5.44) has a solution, the existence of
the minimizing x_{k+1} is not guaranteed, and must be either assumed or
verified independently).
     Using the preceding expression for u_{k+1}, we see that the dual proximal
algorithm (5.44)-(5.45), applied to the maximization of the dual function
q, starts with an arbitrary initial vector λ_0, and iterates according to

    λ_{k+1} = λ_k + c_k (A x_{k+1} − b),

where x_{k+1} is any vector that minimizes L_{c_k}(x, λ_k) over X. This method is
known as the augmented Lagrangian algorithm or the method of multipliers.
     Note that there is no guarantee that {x_k} has a limit point, and indeed
the dual sequence {λ_k} will converge to a dual optimal solution, if one
exists, even if the primal problem (5.43) does not have an optimal solution.
As an example, the reader may verify that for the two-dimensional/single-
constraint problem where f(x) = e^{x_1}, x_1 + x_2 = 0, x_1 ∈ ℜ, x_2 ≥ 0, the dual
optimal solution is λ* = 0, but there is no primal optimal solution. For
this problem, the augmented Lagrangian algorithm will generate sequences
{λ_k} and {x_k} such that λ_k → 0 and x_k^1 → −∞, while f(x_k) → f* = 0.
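A minimal computational sketch of the method of multipliers for a small instance
of problem (5.43) follows (the specific f, A, b, the constant penalty parameter,
and the use of a general-purpose solver for the inner minimization are assumptions
made only for this illustration). Each iteration minimizes L_{c_k}(x, λ_k) over x and
then sets λ_{k+1} = λ_k + c_k(Ax_{k+1} − b).

    import numpy as np
    from scipy.optimize import minimize

    x0 = np.array([3.0, -1.0, 2.0])
    f = lambda x: 0.5 * np.sum((x - x0) ** 2)      # a strictly convex quadratic cost
    A = np.array([[1.0, 1.0, 1.0]])                # single equality constraint: sum(x) = 1
    b = np.array([1.0])

    lam, c, x = np.zeros(1), 1.0, np.zeros(3)
    for k in range(15):
        L = lambda x: f(x) + lam @ (A @ x - b) + 0.5 * c * np.sum((A @ x - b) ** 2)
        x = minimize(L, x, method='BFGS').x        # unconstrained inner minimization
        lam = lam + c * (A @ x - b)                # multiplier update
        print(k, "violation", abs((A @ x - b)[0]), "lam", lam)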
    minimize  f(x)
    subject to  x ∈ X,  z ≥ 0,  a_1'x + z^1 = b_1, ..., a_r'x + z^r = b_r,       (5.50)
for a sequence of values of µ = (µ^1, ..., µ^r) and c > 0. This type of mini-
mization can be done by first minimizing L_c(x, z, µ) over z ≥ 0, obtaining

    L_c(x, µ) = min_{z ≥ 0} L_c(x, z, µ),

and then by minimizing L_c(x, µ) over x ∈ X. A key observation is that the
first minimization with respect to z can be carried out in closed form for
each fixed x, thereby yielding a closed form expression for L_c(x, µ).
     Indeed, we have

    min_{z ≥ 0} L_c(x, z, µ) = f(x) + ∑_{j=1}^r min_{z^j ≥ 0} { µ^j (a_j'x − b_j + z^j) + (c/2) |a_j'x − b_j + z^j|² }.    (5.51)

The function in braces above is quadratic in z^j. Its constrained minimum
is z^j = max{0, ẑ^j}, where ẑ^j is the unconstrained minimum at which the
derivative is zero. The derivative is µ^j + c(a_j'x − b_j + z^j), so we obtain
Denoting

    g_j^+(x, µ^j, c) = max{ a_j'x − b_j,  −µ^j/c },                               (5.52)

we have

    L_c(x, µ) = f(x) + ∑_{j=1}^r { µ^j g_j^+(x, µ^j, c) + (c/2) (g_j^+(x, µ^j, c))² }.    (5.53)
After some calculation, left for the reader, we can also write this expression
as

    L_c(x, µ) = f(x) + (1/(2c)) ∑_{j=1}^r { (max{0, µ^j + c(a_j'x − b_j)})² − (µ^j)² },    (5.54)

and we can view it as the augmented Lagrangian function for the inequality
constrained problem (5.49).
     It follows from the preceding transcription that the augmented La-
grangian method for the inequality constrained problem (5.49) consists of
a sequence of minimizations of L_{c_k}(x, µ_k) over x ∈ X, each followed by a
multiplier update of the form

    µ_{k+1}^j = max{ 0, µ_k^j + c_k (a_j'x_{k+1} − b_j) },    j = 1, ..., r,
Figure 5.2.4. Form of the quadratic penalty term

    (1/(2c)) { (max{0, µ + c g})² − µ² }

for a single inequality constraint g(x) ≤ 0.
    minimize  L_{c_k}(x, µ_k)
    subject to  x ∈ X,

where L_{c_k}(x, µ_k) is given by Eq. (5.53) or Eq. (5.54), {µ_k} is a sequence
updated as above, and {c_k} is a positive penalty parameter sequence with
∑_{k=0}^∞ c_k = ∞. Since this method is equivalent to the equality-constrained
method applied to the corresponding equality-constrained problem (5.50),
our earlier convergence results (cf. Prop. 5.2.1) apply with the obvious
modifications.
We finally note that there is a similar augmented Lagrangian method
for problems with nonlinear inequality constraints
in place of the linear ones in problem (5.49). This method has identical
form to the one developed above, with the functions g_j^+ defined by

    g_j^+(x, µ^j, c) = max{ g_j(x),  −µ^j/c };

cf. Eq. (5.52) (the derivation is very similar; see [Ber82a]). In particular, the
method consists of successive minimizations of the augmented Lagrangian

    j = 1, ..., r;

see the end-of-chapter references. Note that L_{c_k}(·, µ_k) is continuously dif-
ferentiable in x if f and g_j are, and is also convex in x if f and g_j are.
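In implementation terms the quantities above amount to a few lines of code. The
sketch below (the function names are ours, introduced only for this illustration)
evaluates the penalty term of Fig. 5.2.4 and the standard first order multiplier
update µ ← max{0, µ + c g(x)} for inequality constraints g_j(x) ≤ 0.

    import numpy as np

    def penalty(g, mu, c):
        # (1/(2c)) * ( max(0, mu + c*g)^2 - mu^2 ), cf. Fig. 5.2.4
        return (np.maximum(0.0, mu + c * g) ** 2 - mu ** 2) / (2.0 * c)

    def multiplier_update(g, mu, c):
        # mu_{k+1}^j = max(0, mu_k^j + c_k * g_j(x_{k+1}))
        return np.maximum(0.0, mu + c * g)

    # the augmented Lagrangian value is then L_c(x, mu) = f(x) + sum_j penalty(g_j(x), mu_j, c)
    g, mu = np.array([-0.5, 0.2]), np.array([0.0, 1.0])
    print(penalty(g, mu, 2.0), multiplier_update(g, mu, 2.0))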
                                                                                 (5.56)

where y_c(λ) is the unique vector attaining the maximum in Eq. (5.55).
Since y_{c_k}(λ_k) = λ_{k+1}, we have, using Eqs. (5.48) and (5.56),

    ∇q_{c_k}(λ_k) = (λ_{k+1} − λ_k) / c_k = A x_{k+1} − b,
    minimize  ∑_{i=1}^m f_i(x)
    subject to  x ∈ ∩_{i=1}^m X_i,

where f_i : ℜ^n ↦ ℜ are convex functions and X_i are closed, convex sets with
nonempty intersection. This problem contains as special cases several of the
examples given in Section 1.3, such as regularized regression, classification,
and maximum likelihood.
     We introduce additional artificial variables z^i, i = 1, ..., m, and we consider
the equivalent problem

    minimize  ∑_{i=1}^m f_i(z^i)
    subject to  x = z^i,  z^i ∈ X_i,  i = 1, ..., m,                             (5.57)

and we apply the augmented Lagrangian method to eliminate the constraints
z^i = x, using corresponding multiplier vectors λ^i. The method takes the form

    minimize  ∑_{i=1}^m ( f_i(z^i) + λ_k^{i'}(x − z^i) + (c_k/2) ||x − z^i||² )
Note that there is coupling between x and the vectors z^i, so this problem
cannot be decomposed into separate minimizations with respect to some of the
variables. On the other hand, the problem has a Cartesian product constraint
set, and a structure that is suitable for the application of block coordinate
descent methods that cyclically minimize the cost function, one component at
a time. In particular, we can consider a method that minimizes the augmented
Lagrangian with respect to x ∈ ℜ^n with the iteration

                                                                                 (5.59)

then minimizes the augmented Lagrangian with respect to z^i ∈ X_i, with the
iteration

    z^i ∈ arg min_{z^i ∈ X_i} { f_i(z^i) − λ_k^{i'} z^i + (c_k/2) ||x − z^i||² },    i = 1, ..., m,    (5.60)
Figure 5.3.1. Illustration of the proximal algorithm with outer linearization. The
point x_{k+1} is the one at which the graph of the negative proximal term, raised by
some amount, first touches the graph of F_k. A new cutting plane is added, based
on a subgradient g_{k+1} of f at x_{k+1}. Note that the proximal term reduces the
effect of instability: x_{k+1} tends to be closer to x_k, with the distance ||x_{k+1} − x_k||
depending on the size of the proximal term, i.e., the penalty parameter c_k.
     The method terminates if x_{k+1} = x_k; in this case, Eqs. (5.62) and
(5.63) imply that

    ∀ x ∈ X,

where I is a finite index set, and a_i and b_i are given vectors and scalars,
respectively. Assume that the optimal solution set is nonempty and
that the subgradient added to the cutting plane approximation at each
iteration is one of the vectors a_i, i ∈ I. Then the method terminates
finitely with an optimal solution.
Proof: Since there are only finitely many vectors a_i to add, eventually
the polyhedral approximation F_k will not change, i.e., F_k = F_k̄ for all
k > k̄. Thus, for k ≥ k̄, the method will become the proximal algorithm
for minimizing F_k̄, so by Prop. 5.1.5, it will terminate with a point x̄ that
minimizes F_k̄ subject to x ∈ X. But then, we will have concluded an
iteration of the cutting plane method for minimizing f over X, with no
new vector added to the approximation F_k̄. This implies termination of
the cutting plane method, necessarily at a minimum of f over X. Q.E.D.
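A compact sketch of the proximal cutting plane iteration just described is given
below (the piecewise linear test function, the constant c, and the use of a general-
purpose solver for the quadratic subproblem are assumptions made only for this
illustration). Each iteration minimizes F_k(x) + (1/(2c))||x − x_k||² over (x, z)
with the accumulated cuts as constraints, and terminates when x_{k+1} = x_k.

    import numpy as np
    from scipy.optimize import minimize

    A = np.array([[1.0, 2.0], [-1.0, 1.0], [0.0, -3.0]])   # f(x) = max_i (a_i'x + b_i)
    b = np.array([0.0, 1.0, -2.0])
    f = lambda x: float(np.max(A @ x + b))
    subgrad = lambda x: A[int(np.argmax(A @ x + b))]

    c, xk = 1.0, np.array([5.0, 5.0])
    cuts = [(f(xk), subgrad(xk), xk.copy())]               # each cut: (f(x_i), g_i, x_i)
    for k in range(20):
        obj = lambda v: v[2] + np.sum((v[:2] - xk) ** 2) / (2 * c)
        cons = [{'type': 'ineq',
                 'fun': (lambda v, fi=fi, gi=gi, xi=xi: v[2] - fi - gi @ (v[:2] - xi))}
                for fi, gi, xi in cuts]                    # enforce z >= f(x_i) + g_i'(x - x_i)
        v = minimize(obj, np.append(xk, f(xk)), constraints=cons, method='SLSQP').x
        if np.allclose(v[:2], xk, atol=1e-7):              # x_{k+1} = x_k: terminate
            break
        xk = v[:2]
        cuts.append((f(xk), subgrad(xk), xk.copy()))
        print(k, xk, f(xk))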
(5.64)
The proximal center of P_k need not be x_k (as in the proximal cutting plane
method), but is rather one of the past iterates x_i, i ≤ k.
     In one version of the method, F_k is given by

where y_k ∈ {x_i | i ≤ k}. Following the computation of x_{k+1}, the new
proximal center y_{k+1} is set to x_{k+1}, or is left unchanged (y_{k+1} = y_k)
depending on whether, according to a certain test, "sufficient progress"
has been made or not. An example of such a test is

Thus,

    y_{k+1} = { x_{k+1}   if f(y_k) − f(x_{k+1}) ≥ β δ_k,
             { y_k       if f(y_k) − f(x_{k+1}) < β δ_k,                        (5.66)
Figure 5.3.2. Illustration of the test (5.66) for a serious or a null step in the
bundle method. It is based on
Proof: Since there are only finitely many vectors a_i to add, eventually the
polyhedral approximation F_k will not change, i.e., F_k = F_k̄ for all k > k̄.
We then have F_k(x_{k+1}) = f(x_{k+1}) for all k > k̄, since otherwise a new
cutting plane would be added to F_k. Thus, for k > k̄,

Therefore, according to Eq. (5.66), the method will perform serious steps
for all k > k̄, and become identical to the proximal cutting plane algorithm,
which converges finitely by Prop. 5.3.1. Q.E.D.
We mentioned earlier that one of the drawbacks of the cutting plane al-
gorithms is that the number of subgradients used in the approximation
Fk may grow to be very large. The monitoring of progress through the
test (5.66) for serious/null steps can also be used to discard some of the
accumulated cutting planes. For example, at the end of a serious step,
upon updating the proximal center Yk to Yk+l = Xk+l, we may discard any
subset of the cutting planes.
It may of course be useful to retain some of the cutting planes, par-
ticularly the ones that are "active" or "nearly active" at Yk+l, i.e., those
i ≤ k for which the linearization error

                                                                                 (5.67)

where g_k is given by

                                                                                 (5.68)

The next iteration will then be performed with only two cutting planes:
the one just given by Eqs. (5.67)-(5.68) and a new one obtained from x_{k+1},
where

    g_k = (y_k − x_{k+1}) / c_k.
In the preceding section we saw that the proximal algorithm can be com-
bined with outer linearization to yield the proximal cutting plane algorithm
and its bundle versions. In this section we use a dual combination, involving
the dual proximal algorithm (5.41)-(5.42) and inner linearization (the dual
of outer linearization). This yields another method, which is connected to
the proximal cutting plane algorithm of Section 5.3.1 by Fenchel duality
(see Fig. 5.3.4).
     Let us recall the proximal cutting plane method applied to minimizing
a real-valued convex function f : ℜ^n ↦ ℜ, over a closed convex set X. The
typical iteration involves a proximal minimization of the current cutting
    Proximal Algorithm  --- outer linearization --->  Proximal Cutting Plane / Bundle Versions
        |  Fenchel Duality                                 |  Fenchel Duality
    Dual Proximal Algorithm /
    Augmented Lagrangian    --- inner linearization --->  Proximal Simplicial Decomposition / Bundle Versions
Figure 5.3.4. Relations of the proximal and proximal cutting plane methods,
and their duals. The dual algorithms are obtained by application of the Fenchel
Duality Theorem (Prop. 1.2.1), taking also into account the conjugacy relation
between outer and inner linearization (cf. Section 4.3).
where F_k* is the conjugate of F_k. Once λ_{k+1}, the unique minimizer in this
dual proximal iteration, is computed, x_k is updated via

Note that the new subgradient g_{k+1} may also be obtained as a vector
attaining the supremum in the conjugacy relation

    f(x_{k+1}) = sup_{λ ∈ ℜ^n} { x_{k+1}'λ − f*(λ) },

where f* is the conjugate function of f, since we have

    g_{k+1} ∈ ∂f(x_{k+1})   if and only if   g_{k+1} ∈ arg max_{λ ∈ ℜ^n} { x_{k+1}'λ − f*(λ) },    (5.73)

(cf. the Conjugate Subgradient Theorem of Prop. 5.4.3 in Appendix B).
The maximization above may be preferable if the "differentiation" g_{k+1} ∈
∂f(x_{k+1}) is inconvenient.
    minimize  ∑_{i=0}^k α_i f*(g_i) − x_k' ∑_{i=0}^k α_i g_i + (c_k/2) ||∑_{i=0}^k α_i g_i||²        (5.74)
    subject to  ∑_{i=0}^k α_i = 1,  α_i ≥ 0,  i = 0, ..., k.
If (α_0^k, ..., α_k^k) attains the minimum, Eqs. (5.71) and (5.72) yield

    λ_{k+1} = ∑_{i=0}^k α_i^k g_i,        x_{k+1} = x_k − c_k ∑_{i=0}^k α_i^k g_i.        (5.75)

The next subgradient g_{k+1} ∈ ∂f(x_{k+1}) may also be obtained from the
maximization

    g_{k+1} ∈ arg max_{λ ∈ ℜ^n} { x_{k+1}'λ − f*(λ) },                                    (5.76)

if this is convenient; cf. Eq. (5.73). As Fig. 5.3.5 indicates, g_{k+1} provides a
new break point and an improved inner approximation to f*.
We refer to the algorithm defined by Eqs. (5.74)-(5.76), as the prox-
imal simplicial decomposition algorithm. Note that all the computations
of the algorithm involve the conjugate f* and not f. Thus, if f* is more
convenient to work with than f, the proximal simplicial decomposition al-
gorithm is preferable to the proximal cutting plane algorithm. Note also
that the duality between the two linear approximation versions of the prox-
imal algorithm is a special case of the generalized polyhedral approximation
framework of Section 4.4.
     The problem (5.74) can also be written without reference to the con-
jugate f*. Since g_i is a subgradient of f at x_i, we have
where f_i : ℜ^n ↦ ℜ are convex functions and X_i are closed convex sets with
nonempty intersection. As in Example 5.2.1 we can reformulate this as an
equality constrained problem, by introducing additional artificial variables z^i,
i = 1, ..., m, and the equality constraints x = z^i:

    minimize  ∑_{i=1}^m ( f_i(z^i) + λ_k^{i'}(x − z^i) + (c_k/2) ||x − z^i||² )          (5.78)
    subject to  x ∈ ℜ^n,  z^i ∈ X_i,  i = 1, ..., m,

[cf. Eqs. (5.58), (5.47)], and then update the multipliers according to

    i = 1, ..., m.
The minimization in Eq. (5.78) can be done by alternating minimizations of
x and z^i (a block coordinate descent method), and the multipliers λ^i may be
changed only after (typically) many updates of x and z^i (enough to minimize
the augmented Lagrangian within adequate precision).
     An interesting variation is to perform only a small number of minimiza-
tions with respect to x and z^i before changing the multipliers. In the extreme
case, where only one minimization is performed, the method takes the form

                                                                                 (5.79)

                                                                                 (5.80)

[cf. Eq. (5.58)]. Thus the multiplier iteration is performed after just one block
coordinate descent iteration on each of the (now decoupled) variables x and
(z^1, ..., z^m). This is precisely the ADMM specialized to the problem of this
example.
     The ADMM, given the current iterates (x_k, z_k, λ_k) ∈ ℜ^n × ℜ^m × ℜ^m,
generates a new iterate (x_{k+1}, z_{k+1}, λ_{k+1}) by first minimizing the aug-
mented Lagrangian with respect to x, then with respect to z, and finally
performing a multiplier update:

                                                                                 (5.84)

                                                                                 (5.85)

                                                                                 (5.86)

The penalty parameter c is kept constant in ADMM. Contrary to the case of
the augmented Lagrangian method (where c_k is often taken to be increasing
with k in order to accelerate convergence), there seems to be no generally
good way to adjust c from one iteration to the next. Note that the iteration
(5.79)-(5.81), given earlier for the additive cost problem (5.77), is a special
case of the preceding iteration, with the identification z = (z^1, ..., z^m).
     We may also formulate an ADMM that applies to the closely related
problem

    minimize  f_1(x) + f_2(z)
    subject to  x ∈ X,  z ∈ Z,  Ax + Bz = d,                                     (5.87)

where f_1 : ℜ^n ↦ ℜ, f_2 : ℜ^m ↦ ℜ are convex functions, X and Z are closed
convex sets, and A, B, and d are given matrices and vector, respectively, of
appropriate dimensions. Then the corresponding augmented Lagrangian is

    L_c(x, z, λ) = f_1(x) + f_2(z) + λ'(Ax + Bz − d) + (c/2) ||Ax + Bz − d||²,   (5.88)
and the ADMM iteration takes a similar form [cf. Eqs. (5.84)-(5.86)]:
(5.89)
(5.90)
(5.91)
For some problems, this form may be more convenient than the ADMM of
Eqs. (5.84)-(5.86), although the two forms are essentially equivalent.
     The important advantage that the ADMM may offer over the aug-
mented Lagrangian method is that it involves a separate minimization with
respect to x and with respect to z. Thus the complications resulting from
the coupling of x and z in the penalty term ||Ax − z||² or the penalty
term ||Ax + Bz − d||² are eliminated. Here is another illustration of this
advantage.
We are given m closed convex sets X_1, ..., X_m in ℜ^n, and we want to find a
point in their intersection. We write this problem in the form (5.83), with x
defined as x = (x^1, ..., x^m),

    i = 1, ..., m.

The parameter c does not influence the algorithm, because it simply intro-
duces scaling of λ^i by 1/c, so we may assume with no loss of generality that
c = 1. Then, by completing the square, we may write the augmented La-
grangian as

Using Eqs. (5.89)-(5.91), we see that the corresponding ADMM iterates for
x^i according to

    i = 1, ..., m,

Aside from the decoupling of the iterations of the variables x^i and z, notice
that the projections on X_i can be carried out in parallel.
     In the special case where m = 2, we can write the constraint more
simply as x^1 = x^2, in which case the augmented Lagrangian takes the form

    L_c(x^1, x^2, λ) = λ'(x^1 − x^2) + (c/2) ||x^1 − x^2||²,
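For this m = 2 case the iteration reduces to alternating projections onto the two
sets together with a multiplier update. A minimal sketch follows (the two sets, a
box and a Euclidean ball, and the scaling c = 1 are assumptions made only for
this illustration); it produces a point in X_1 ∩ X_2.

    import numpy as np

    # X1: the box [0, 1]^2;  X2: the Euclidean ball of radius 1 centered at (1.5, 0)
    proj_X1 = lambda y: np.clip(y, 0.0, 1.0)
    def proj_X2(y):
        center, r = np.array([1.5, 0.0]), 1.0
        d = y - center
        nd = np.linalg.norm(d)
        return y if nd <= r else center + r * d / nd

    x1, x2, lam, c = np.zeros(2), np.zeros(2), np.zeros(2), 1.0
    for k in range(100):
        x1 = proj_X1(x2 - lam / c)       # minimization over x1 in X1
        x2 = proj_X2(x1 + lam / c)       # minimization over x2 in X2
        lam = lam + c * (x1 - x2)        # multiplier update
    print(x1, x2, np.linalg.norm(x1 - x2))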
On the other hand there is a price for the flexibility that the ADMM
provides. A major drawback is a much slower practical convergence rate
relative to the augmented Lagrangian method of the preceding subsec-
tion. Both methods can be shown to have a linear convergence rate for
the multiplier updates under favorable circumstances (see e.g., [Ber82a]
for augmented Lagrangian, and [HoL13], [DaY14a], [DaY14b], [GiB14] for
ADMM). However, it seems difficult to compare them on the basis of theo-
retical results alone, because the geometric progression rate at which they
converge is different and also because the amount of work between multi-
plier updates must be properly taken into account. A corollary of this is
that just because the ADMM updates the multipliers more often than the
augmented Lagrangian method, it does not necessarily require less com-
putation time to solve a problem. A further consideration in comparing
the two types of methods is that while ADMM effectively decouples the
minimizations with respect to x and z, augmented Lagrangian methods
allow for some implementation flexibility that may be exploited by taking
advantage of the structure of the given problem:
(a) The minimization of the augmented Lagrangian can be done with a
broad variety of methods (not just block coordinate descent). Some
of these methods may be well suited for the problem's structure.
(b) The minimization of the augmented Lagrangian need not be done ex-
actly, and its accuracy can be readily controlled through theoretically
sound and easily implementable termination criteria.
(c) The adjustment of the penalty parameter c can be used with advan-
tage in the augmented Lagrangian method, but there is apparently
no general way to do this in ADMM. In particular, by increasing c
to ∞, superlinear or finite convergence can often be achieved in the
augmented Lagrangian method [cf. Props. 5.1.4(b) and 5.1.5].
Thus, on balance, it appears that the relative performance merits of ADMM
and augmented Lagrangian methods are problem-dependent in practice.
Convergence Analysis
(a) The sequence {x_k, z_k, λ_k} generated by the ADMM (5.84)-(5.86)
    is bounded, and every limit point of {x_k} is an optimal solution
    of problem (5.83). Furthermore {λ_k} converges to an optimal
    dual solution.

(b) The residual sequence {Ax_k − z_k} converges to 0, and if A'A is
    invertible, then {x_k} converges to an optimal primal solution.
algorithm, involving the same fixed point convergence mechanism, but dif-
ferent mappings (and hence also different convergence rate). The full proof
is somewhat lengthy, but we will provide an outline and some of the key
points in Exercise 5.5.
    minimize  ||x||_1
    subject to  Cx = b,                                                          (5.92)

is given by the so-called shrinkage operation, which for any α > 0 and
w = (w^1, ..., w^m) ∈ ℜ^m, is defined componentwise by

    S_α(w)^i = { w^i − α   if w^i > α,
               { 0         if |w^i| ≤ α,
               { w^i + α   if w^i < −α,

where

    f_1(x) = 0,    f_2(z) = ||z||_1.
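A minimal sketch of an ADMM iteration for problem (5.92) follows. The particular
splitting used here, with the x-update a projection onto the affine set {x | Cx = b}
and the z-update the shrinkage operation, is one standard choice and, like the data
and the value of c, is an assumption made only for this illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    m, n = 5, 12
    C = rng.standard_normal((m, n))
    b = C @ (rng.standard_normal(n) * (rng.random(n) < 0.3))    # b = C x_true with x_true sparse

    shrink = lambda w, a: np.sign(w) * np.maximum(np.abs(w) - a, 0.0)
    Cp = np.linalg.pinv(C)
    proj_affine = lambda y: y - Cp @ (C @ y - b)                # projection onto {x : Cx = b}

    c = 1.0
    x, z, lam = np.zeros(n), np.zeros(n), np.zeros(n)
    for k in range(300):
        x = proj_affine(z - lam / c)       # x-update
        z = shrink(x + lam / c, 1.0 / c)   # z-update: proximal step for ||.||_1
        lam = lam + c * (x - z)            # multiplier update
    print(np.linalg.norm(C @ x - b), np.linalg.norm(x, 1))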
Here the augmented Lagrangian function is modified to include the constant
vector b [cf. Eq. (5.88)]. It is given by

Setting λ̄_k = λ_k / c, the iteration can be written in the notationally simpler
form
    minimize  ∑_{i=1}^m f_i(x^i)
    subject to  ∑_{i=1}^m A_i x^i = b,  x^i ∈ X_i,  i = 1, ..., m,               (5.96)

where f_i : ℜ^{n_i} ↦ ℜ are convex functions, X_i are closed convex sets, and
Ai and b are given. We have often noted that this problem has a fa-
vorable structure that is well-suited for the application of decomposition
approaches. Since the primary attractive feature of ADMM is that it de-
couples the augmented Lagrangian optimization calculations, it is natural
to consider its application to this problem.
     An idea that readily comes to mind is to form the augmented La-
grangian

    x_{k+1}^i ∈ arg min_{x^i ∈ X_i} L_c(x_{k+1}^1, ..., x_{k+1}^{i−1}, x^i, x_k^{i+1}, ..., x_k^m, λ_k),    i = 1, ..., m,    (5.97)

and follow these minimizations with the multiplier iteration

                                                                                 (5.98)
Methods of this type have been proposed in various forms long time ago,
starting with the important paper [StW75], which stimulated considerable
further research. The context was unrelated to ADMM (which was un-
known at that time), but the motivation was similar to the one of the
ADMM: working around the coupling of variables induced by the penalty
term in the augmented Lagrangian method.
When there is only one component, m = 1, we obtain the augmented
Lagrangian method. When there are only two components, m = 2, the
above method is equivalent to the ADMM of Eqs. (5.89)-(5.91), so it has
the corresponding convergence properties. On the other hand, when m > 2,
the method is not a special case of the ADMM that we have discussed and
is not covered by similar convergence guarantees. In fact a convergence
minimize Σ_{i=1}^m f_i(x^i)
We will now show how to simplify this algorithm. We will first obtain the minimization (5.101) for z in closed form by introducing a multiplier vector λ_{k+1} for the constraint Σ_{i=1}^m z^i = b, and then show that the multipliers p^i_{k+1} obtained from the update (5.102) are all equal to λ_{k+1}. To this end we note that the Lagrangian function corresponding to the minimization (5.101) is given by
Σ_{i=1}^m ( (A_i x^i_{k+1} − z^i)′ p^i + (c/2) ‖A_i x^i_{k+1} − z^i‖² + λ′_{k+1} z^i ) − λ′_{k+1} b.

By setting to zero its gradient with respect to z^i, we see that the minimizing vectors z^i_{k+1} are given in terms of λ_{k+1} by

z^i_{k+1} = A_i x^i_{k+1} + ( p^i − λ_{k+1} ) / c.   (5.103)
A key observation is that we can write this equation as

λ_{k+1} = p^i + c ( A_i x^i_{k+1} − z^i_{k+1} ),   i = 1, ..., m,

so from Eq. (5.102), we have

p^i_{k+1} = λ_{k+1},   i = 1, ..., m.

Thus during the algorithm all the multipliers p^i are updated to a common value: the multiplier λ_{k+1} of the constraint Σ_{i=1}^m z^i = b of problem (5.101).
We now use this fact to simplify the ADMM (5.100)-(5.102). Given z_k and λ_k (which is equal to p^i for all i), we first obtain x_{k+1} from the augmented Lagrangian minimization (5.100) as

x^i_{k+1} ∈ arg min_{x^i ∈ X_i} { f_i(x^i) + (A_i x^i − z^i_k)′ λ_k + (c/2) ‖A_i x^i − z^i_k‖² }.   (5.104)
minimize f₁(x) + f₂(Ax)
subject to Ex = d,   x ∈ X,

which differs from the standard format (5.82) in that it includes the convex set constraint x ∈ X, and the linear equality constraint Ex = d, where E and d are a given matrix and vector, respectively. We convert this problem into the separable form (5.96) as follows:
z¹ = ( z^{11}, z^{21} ),

z¹_{k+1} = ( A x_{k+1}, E x_{k+1} ) + (1/c) ( λ_k − λ_{k+1}, μ_k − μ_{k+1} ).
Note that this algorithm maintains the main attractive characteristic of the ADMM: the components f₁ and f₂ of the cost function are decoupled in the augmented Lagrangian minimizations.
where r is the row dimension of the matrices A_i, and m_j is the number of submatrices A_i that have nonzero jth row. Using the j-dependent stepsize c/m_j of Eq. (5.107) in place of the stepsize c/m of Eq. (5.106) may be viewed as a form of diagonal scaling. The derivation of the algorithm of Eqs. (5.104), (5.105), and (5.107) is nearly identical to the one given for the algorithm (5.104)-(5.106). The idea is that the components of the vector z^i represent estimates of the corresponding components of A_i x^i at the optimum. However, if some of these components are known to be 0 because some of the rows of A_i are 0, then the corresponding values of z^i might as well be set to 0 rather than be estimated. If we repeat the preceding derivation of the algorithm (5.104)-(5.106), but without introducing the components of z^i that are known to be 0, we obtain by a straightforward calculation the multiplier iteration (5.107).
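To make the decoupling concrete, the following schematic Python sketch (an illustration, not part of the original development) carries out one cycle of the simplified ADMM for the separable problem (5.96). It assumes that the multiplier update uses the stepsize c/m mentioned above, λ_{k+1} = λ_k + (c/m)(Σ_i A_i x^i_{k+1} − b), and that z^i_{k+1} is then recovered from Eq. (5.103) with p^i = λ_k; the per-block solvers block_min[i] for the minimization (5.104) are left abstract and are assumptions of the example.

import numpy as np

def separable_admm_cycle(block_min, A, b, z, lam, c):
    # block_min[i](z_i, lam, c) should return a minimizer over x^i in X_i of
    #   f_i(x^i) + (A_i x^i - z^i)' lam + (c/2) ||A_i x^i - z^i||^2   [cf. Eq. (5.104)]
    m = len(A)
    x_new = [block_min[i](z[i], lam, c) for i in range(m)]    # decoupled block minimizations
    residual = sum(A[i] @ x_new[i] for i in range(m)) - b
    lam_new = lam + (c / m) * residual                        # multiplier update (assumed form)
    z_new = [A[i] @ x_new[i] + (lam - lam_new) / c for i in range(m)]  # cf. Eq. (5.103)
    return x_new, z_new, lam_new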
Finally let us note that the ADMM can be applied to the dual of the separable problem (5.96), yielding a similar decomposition algorithm. The
idea is that the dual problem has the form discussed in Example 5.4.1, for
which the ADMM can be conveniently applied. This approach is applicable
even in the case where the primal problem has some nonlinear inequality
constraints; see [Fuk92], which also discusses a connection with the method
of partial inverses of [Spi83], [Spi85].
rithms, and detailed references on the early history of the subject. The
distributed algorithms monograph by Bertsekas and Tsitsiklis [BeT89a]
discusses applications of augmented Lagrangians to classes of large-scale
problems with special structure, including separable problems and prob-
lems with additive cost functions. There has been considerable recent in-
terest in using augmented Lagrangian, proximal, and smoothing methods
for machine learning and signal processing applications; see e.g., Osher et al. [OBG05], Yin et al. [YOG08], and Goldstein and Osher [GoO09].
Section 5.3: The proximal cutting plane and simplicial decomposition
algorithms of Sections 5.3.1 and 5.3.3, may be viewed as regularized versions
of the classical Dantzig-Wolfe decomposition algorithm (see e.g., [Las70],
[BeT97], [Ber99]). The latter algorithm is obtained in the limit, as the regularization term diminishes to 0 (c_k → ∞).
For presentations of bundle methods, see the books by Hiriart-Urruty and Lemarechal [HiL93], and Bonnans et al. [BGL06], which give many
references. For related methods, see Ruszczynski [Rus86], Lemarechal and
Sagastizabal [LeS93], Mifflin [Mif96], Burke and Qian [BuQ98], Mifflin,
Sun, and Qi [MSQ98], Frangioni [Fra02], and Teo et al. [TVS10].
The term "bundle" has been used with a few different meanings in
the convex algorithmic optimization literature, with some confusion result-
ing. To our knowledge, it was first introduced in the 1975 paper by Wolfe
[Wol75] to describe a collection of subgradients used for calculating a de-
scent direction in the context of a specific algorithm of the descent type -
a context with no connection to cutting planes or proximal minimization.
It subsequently appeared in related descent contexts through the 70s and
early 80s. The meaning of the term "bundle method" shifted gradually in
the 80s, and it is now commonly associated with the stabilized proximal
cutting plane methods that we have described in Section 5.3.2.
Section 5.4: The ADMM was first proposed by Glowinski and Marroco [GlM75], and Gabay and Mercier [GaM76], and was further developed by Gabay [Gab79], [Gab83]. It was generalized by Lions and Mercier [LiM79], where the connection with alternating direction methods for solving differential equations was pointed out. The method and its applications in large boundary-value problems were discussed by Fortin and Glowinski [FoG83].
The recent literature on the ADMM is voluminous and cannot be
surveyed here (a Google Scholar search produced thousands of papers ap-
pearing in the two years preceding the publication of this book). The
surge of interest is largely due to the flexibility that the ADMM provides
in exploiting special problem structures, such as for example the ones from
machine learning that we have discussed in Section 5.4.1.
In our discussion we have followed the analysis of the book by Bert-
sekas and Tsitsiklis [BeT89a] (which among others gave the ADMM for
separable problems of Section 5.4.2), and in part the paper by Eckstein
and Bertsekas [EcB92] (which established the connection of the ADMM
with the general form of the proximal algorithm of Section 5.1.4, and gave
extensions involving, among others, extrapolation and inexact minimiza-
tion). In particular, the paper [EcB92] showed that the general form of
the proximal algorithm contains as a special case the Douglas-Rachford
splitting algorithm for finding a zero of the sum of two maximal monotone
operators, proposed by Lions and Mercier [LiM79]. The latter algorithm
contains in turn as a special case the ADMM, as shown by Gabay [Gab83].
EXERCISES
where {"Yk} is a sequence of positive scalars. Use dual variables to relate this
algorithm with the proximal algorithm. In particular, provide conditions under
which there is a proximal algorithm, with an appropriate sequence of penalty
parameters {ck}, which generates the same iterate sequence {xk} starting from
the same point xo.
[cf. Eq. (5.33)]. (As noted in Section 5.1.4, this is true if M has a maximal
monotonicity property.)
(2) For some a > 0, we have

(x₁ − x₂)′(v₁ − v₂) ≥ a ‖x₁ − x₂‖²,   ∀ x₁, x₂ ∈ dom(M) and v₁ ∈ M(x₁), v₂ ∈ M(x₂).
Note: While a fixed point iteration involving a contraction has a linear convergence rate, the reverse is not true. In particular, Prop. 5.1.4(c) gives a condition under which the proximal algorithm has a linear convergence rate. However, this condition does not guarantee that the proximal operator P_{c,f} is a contraction with respect to any particular norm. For example, all the minimizing points of f are fixed points of P_{c,f}, but the condition of Prop. 5.1.4(c) does not preclude the possibility of f having multiple minimizing points. See also [Luq84] for an extension of Prop. 5.1.4(c), which applies to the case of a maximal monotone operator, and other related convergence rate results. Hint: Consider the multivalued mapping

M̄(z) = M(z) − σz,

and for any c > 0, let P̄_c(z̄) be the unique vector z such that z̄ = z + c(v − σz) with v ∈ M(z) [cf. Eq. (5.108)]. Note that M̄ is monotone, and hence by the theory of Section 5.1.4, P̄_c is nonexpansive. Verify that

P_{c,f}(z̄) = P̄_{c̄}( (1 + cσ)^{-1} z̄ ),   z̄ ∈ ℜ^n,

where c̄ = c(1 + cσ)^{-1}, and use the nonexpansiveness of P̄_{c̄}.
z̄ ∈ arg min_{x∈X} { f(x) + (1/2c) Σ_{i∈I} |x^i − z^i|² }.   (5.109)
(a) Show that for a given z, the iterate z̄ can be obtained by the two-step process

ẑ ∈ arg min_{ {x | x^i = z^i, i ∈ I} } φ_c(x),
Show that if the optimal solution set X* is nonempty, the sequence {xk} gen-
erated by the proximal cutting plane method (5.69) converges to some point in
X*.
Consider the ADMM framework of Section 5.4. Let d₁ : ℜ^m ↦ (−∞, ∞] and d₂ : ℜ^m ↦ (−∞, ∞] be the functions

and note that the dual to the Fenchel problem (5.82) is to minimize d₁ + d₂ [cf. Eq. (5.36) or Prop. 1.2.1]. Let N₁ : ℜ^m ↦ ℜ^m and N₂ : ℜ^m ↦ ℜ^m be the reflection operators corresponding to d₁ and d₂, respectively [cf. Eq. (5.19)].
(a) Show that the set of fixed points of the composition N1 · N2 is the set
(5.110)
(5.111)
where α_k ∈ [0, 1) for all k and Σ_{k=0}^∞ α_k(1 − α_k) = ∞, converges to a fixed point of N₁ · N₂. Moreover, when α_k = 1/2, this iteration is equivalent to the ADMM; see Fig. 5.5.2.
Hints and Notes: We have N₁(z̄) = z − cv, where z ∈ ℜ^m and v ∈ ∂d₁(z) are obtained from the unique decomposition z̄ = z + cv, and N₂(z̄) = z − cv, where z ∈ ℜ^m and v ∈ ∂d₂(z) are obtained from the unique decomposition z̄ = z + cv [cf. Eq. (5.25)]. Part (a) shows that finding a fixed point of N₁ · N₂ is equivalent to finding two vectors λ* and v* that satisfy the optimality conditions for the dual problem of minimizing d₁ + d₂, and then computing λ* − cv* (assuming that the conditions for strong duality are satisfied). In terms of the primal problem, we will then have
where x* is any optimal primal solution [cf. Prop. 1.2.1(c)]. Moreover we will have v* = −Ax*.
Figure 5.5.1. Illustration of the mapping N₁ · N₂ and its fixed points (cf. Exercise 5.5). The vector λ* shown is an optimal solution of the dual problem of minimizing d₁ + d₂, and according to the optimality conditions we have v* ∈ ∂d₁(λ*) and v* ∈ −∂d₂(λ*) for some v*. It can be seen then that λ* − cv* is a fixed point of N₁ · N₂ and conversely (in the figure, by applying N₂ to λ* − cv* using the graphical process of Fig. 5.1.8, and by applying N₁ to the result, we end back at λ* − cv*).
(5.112)
(5.113)
Thus from Eqs. (5.112) and (5.113), z̄ is a fixed point of N₁ · N₂ if and only if

z̄ = z₂ + cv₂ = z₁ − cv₁,
Figure 5.5.2. Illustration of the interpolated fixed point iteration (5.111). Starting from y_k, we obtain (N₁ · N₂)(y_k) using the process illustrated in the figure: first compute N₂(y_k) as shown (cf. Fig. 5.1.8), then apply N₁ to compute (N₁ · N₂)(y_k), and finally interpolate between y_k and (N₁ · N₂)(y_k) using a parameter α_k ∈ (0, 1). When the interpolation parameter is α_k = 1/2, we obtain the ADMM iterate, which is the midpoint between y_k and (N₁ · N₂)(y_k), denoted by y_{k+1} in the figure. The iteration converges to a fixed point y* of N₁ · N₂, which when written in the form y* = λ* − cv*, yields a dual optimal solution λ*.
Additional Algorithmic Topics
We first consider the gradient projection method,

x_{k+1} = P_X( x_k − α_k ∇f(x_k) ),   (6.1)

where f : ℜ^n ↦ ℜ is a continuously differentiable function, X is a closed convex set, and α_k > 0 is a stepsize. We have outlined in Section 2.1.2 some of its characteristics, and its connection to feasible direction methods. In this section we take a closer look at its convergence and rate of convergence properties, and its implementation.
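As a simple illustration, the following Python sketch (ours, not part of the text) implements iteration (6.1) with a constant stepsize for the case where X is a box, so that the projection is a componentwise clipping; the function names and the box-constrained setting are assumptions made for the example.

import numpy as np

def gradient_projection_box(grad_f, x0, lower, upper, alpha, num_iters=1000):
    # Gradient projection iteration x_{k+1} = P_X(x_k - alpha * grad f(x_k)),
    # where X = {x : lower <= x <= upper}, so P_X is componentwise clipping.
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = np.clip(x - alpha * grad_f(x), lower, upper)
    return x

# Example use: minimize f(x) = (1/2)||x - d||^2 over the unit box,
# with grad_f = lambda x: x - d, lower = 0.0, upper = 1.0.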
see Fig. 6.1.1. This is the set of possible next iterates parametrized by the stepsize α. The following proposition shows that for all α > 0, unless x_k(α) = x_k (which is a condition for optimality of x_k), the vector x_k(α) − x_k is a feasible descent direction, i.e., ∇f(x_k)′( x_k(α) − x_k ) < 0.
∀ α > 0.   (6.3)

∀ x ∈ X.   (6.4)

Proof: (a) From the Projection Theorem (Prop. 1.1.9 in Appendix B), we have

∀ x ∈ X,   (6.5)
The simplest way to guarantee cost function descent in the gradient projection method is to keep the stepsize fixed at a constant but sufficiently small value α > 0. In this case, however, it is necessary to assume that f has Lipschitz continuous gradient, i.e., for some constant L, we have†

† Without this condition the method may not converge for any constant stepsize choice α, as can be seen with the scalar example where f(x) = |x|^{3/2} (the method oscillates around the minimum x* = 0, because the gradient grows too fast around x*). A different stepsize rule that ensures cost function descent is needed for convergence.
Figure 6.1.2. Visualization of the descent lemma (cf. Prop. 6.1.2). The Lipschitz constant L serves as an upper bound to the "curvature" of f along directions, so the quadratic function ℓ(y; x) + (L/2)‖y − x‖² is an upper bound to f(y). The steepest descent iterate x − (1/L)∇f(x), with stepsize α = 1/L, minimizes this upper bound.
Proof: Let t be a scalar parameter and let g(t) = f( x + t(y − x) ). The chain rule yields

(dg/dt)(t) = (y − x)′ ∇f( x + t(y − x) ).

Thus, we have

f(y) − f(x) = g(1) − g(0) = ∫₀¹ (dg/dt)(t) dt
 = ∫₀¹ (y − x)′ ∇f( x + t(y − x) ) dt
 ≤ ∫₀¹ (y − x)′ ∇f(x) dt + | ∫₀¹ (y − x)′ ( ∇f( x + t(y − x) ) − ∇f(x) ) dt |
 ≤ ∫₀¹ (y − x)′ ∇f(x) dt + ∫₀¹ ‖y − x‖ · ‖∇f( x + t(y − x) ) − ∇f(x)‖ dt
 ≤ (y − x)′ ∇f(x) + ‖y − x‖ ∫₀¹ tL ‖y − x‖ dt
 = (y − x)′ ∇f(x) + (L/2) ‖y − x‖²,

where for the second inequality we use the Schwarz inequality, and for the third inequality we use the Lipschitz condition (6.6). Q.E.D.
(6.9)

Hence

P_X( x̄ − α ∇f(x̄) ) − x̄ = lim_{k→∞, k∈K} ( x_{k+1} − x_k ) = 0,
We will now consider the convergence rate of the gradient projection method, and as a first step in this direction, we develop a connection to the proximal algorithm of Section 5.1. The following lemma shows that the gradient projection iteration (6.1) may also be viewed as an iteration of the proximal algorithm, applied to the linear approximation function ℓ(·; x_k) (plus the indicator function of X), with penalty parameter equal to the stepsize α_k (see Fig. 6.1.3).

min_{y∈X} { ℓ(y; x) + (1/2α) ‖y − x‖² }.
Proof: Using the definition of ℓ [cf. Eq. (6.7)], we have for all x, y ∈ ℜ^n and α > 0,

ℓ(y; x) + (1/2α) ‖y − x‖² = f(x) + ∇f(x)′(y − x) + (1/2α) ‖y − x‖²
 = f(x) + (1/2α) ‖y − ( x − α∇f(x) )‖² − (α/2) ‖∇f(x)‖².

The gradient projection iterate P_X( x − α∇f(x) ) minimizes the right-hand side over y ∈ X, so it minimizes the left-hand side. Q.E.D.
Figure 6.1.3. Illustration of the relation of the gradient projection method and the proximal algorithm, as per Prop. 6.1.4. The gradient projection iterate x_{k+1} is the same as the proximal iterate with f replaced by ℓ(·; x_k).
α > 0,

‖x_k(α) − y‖² ≤ ‖x_k − y‖² − 2α ( ℓ(x_k(α); x_k) − ℓ(y; x_k) ) − ‖x_k − x_k(α)‖².   (6.10)
While the constant stepsize rule is simple, it requires the knowledge of the Lipschitz constant L. There is an alternative stepsize rule that aims to deal with the practically common situation where L, and hence also the range (0, 2/L) of stepsizes that guarantee cost reduction, are unknown. Here we start with a stepsize α > 0 that is a guess for the midpoint 1/L of the range (0, 2/L). Then we keep using α, and generate iterates according to

(6.11)
α > 0.
(6.12)
(6.13)
Proof: By setting y = x_k in Eq. (6.10), using the fact ℓ(x_k; x_k) = f(x_k), and rearranging, we have
We will now show convergence and derive the convergence rate of the gradient projection method for a convex cost function, under the gradient Lipschitz condition (6.6). It turns out that an additional benefit of convexity is that we can show stronger convergence results than for the nonconvex case (cf. Prop. 6.1.3). In particular, we will show convergence to an optimal solution of the entire sequence of iterates {x_k} as long as the optimal solution set is nonempty.† By contrast, in the absence of convexity of f, Prop. 6.1.3 asserts that all limit points of {x_k} satisfy the optimality condition, but there is no assertion of uniqueness of limit point.
We will assume that the stepsize satisfies some conditions, which hold in particular if it is either a constant in the range (0, 1/L] or it is chosen according to the eventually constant stepsize rule described earlier.
(6.14)
Proof: By applying Prop. 6.1.5, with α = α_k and y = x*, where x* ∈ X*,
† We have seen a manifestation of this result in the analysis of Section 3.2,
where we showed convergence of the subgradient method with diminishing step-
size to an optimal solution. In that section we used ideas of supermartingale and
Fejer convergence. Similar ideas apply for gradient projection methods as well.
we have

ℓ(x_{k+1}; x_k) + (1/2α_k) ‖x_{k+1} − x_k‖² ≤ ℓ(x*; x_k) + (1/2α_k) ‖x* − x_k‖² − (1/2α_k) ‖x* − x_{k+1}‖².

By adding Eq. (6.14), we obtain for all x* ∈ X*,

f(x_{k+1}) ≤ ℓ(x*; x_k) + (1/2α_k) ‖x* − x_k‖² − (1/2α_k) ‖x* − x_{k+1}‖²   (6.16)
 ≤ f(x*) + (1/2α_k) ‖x* − x_k‖² − (1/2α_k) ‖x* − x_{k+1}‖²,

where for the last inequality we use the convexity of f and the gradient inequality:
X₀ = { x ∈ X | f(x) ≤ f(x₀) }

rather than within X. The reason is that under the assumptions of the proposition the iterates x_k are guaranteed to stay within the initial level set X₀, and the preceding analysis still goes through. This allows the application of the proposition to cost functions such as |x|³, for which the Lipschitz condition (6.6) does not hold when X is unbounded.
Proposition 6.1.7 shows that the cost function error of the gradient projection method converges to 0 as O(1/k). However, this is true without assuming any condition other than Lipschitz continuity of ∇f. When f satisfies a growth condition in the neighborhood of the optimal set, a faster convergence rate can be proved as noted in Section 2.1.1. In the case where the growth of f is at least quadratic, a linear convergence rate can be shown, and in fact it turns out that if f is strongly convex, the gradient projection mapping

G_α(x) = P_X( x − α ∇f(x) )   (6.18)

is a contraction when 0 < α < 2/L. Let us recall here that the differentiable convex function f is strongly convex over ℜ^n with a coefficient a > 0 if
cf. Section 1.1 of Appendix B. Note that by using the Schwarz inequality to bound the inner product on the left above, this condition implies that

for some L > 0, and that it is strongly convex over ℜ^n in the sense that for some a ∈ (0, L] it satisfies Eq. (6.19). Then the gradient projection mapping G_α of Eq. (6.18) satisfies
† A related but different property is that strong convexity of f is equivalent to Lipschitz continuity of the gradient of the conjugate f* when f and f* are real-valued (see [RoW98], Prop. 12.60, for a more general result).
We have

∇φ(y) = ∇f(y) − ∇f(x),   (6.23)

so φ is minimized over y at y = x, and we have

(6.24)

and by using the expressions (6.22) and (6.23), we obtain the desired result.
To show (ii), we use (i) twice, with the roles of x and y interchanged, and add to obtain the desired relation. We similarly use the descent lemma (Prop. 6.1.2) twice, with the roles of x and y interchanged, and add to obtain (iii).
(b) If a = L, the result follows by combining (ii) of part (a) and Eq. (6.20), which is a consequence of the strong convexity assumption. For a < L consider the function

φ(x) = f(x) − (a/2) ‖x‖².

We will show that ∇φ is Lipschitz continuous with constant L − a. Indeed, we have

∇φ(x) = ∇f(x) − ax,   (6.25)

where for the first inequality we use (ii) of part (a) and for the second inequality we use the Lipschitz condition for ∇f.
We now apply (ii) of part (a) to the function φ and obtain

which after expanding the quadratic and collecting terms, can be verified to be equivalent to the desired relation. Q.E.D.
Proof of Prop. 6.1.8: For all x, y ∈ ℜ^n, we have, by using the nonexpansive property of the projection (cf. Prop. 3.2.1),

‖G_α(x) − G_α(y)‖² ≤ ‖x − y‖² − 2α (x − y)′( ∇f(x) − ∇f(y) ) + α² ‖∇f(x) − ∇f(y)‖²
 ≤ ( 1 − 2αaL/(a + L) ) ‖x − y‖²
Note from the last equality of Eq. (6.26) that the smallest modulus of contraction is obtained when

α = 2/(a + L).

When this optimal value of stepsize α is used, it can be seen by substitution in Eq. (6.26) that for all x, y ∈ ℜ^n,

‖G_α(x) − G_α(y)‖ ≤ ( (L/a − 1)/(L/a + 1) ) ‖x − y‖.

We can observe the similarity of this convergence rate estimate with the one of Section 2.1.1 and Exercise 2.1 for quadratic functions: the ratio L/a plays the role of the condition number of the problem. Indeed for a positive definite quadratic function f we can use as L and a the maximum and minimum eigenvalues of the Hessian of f, respectively.
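As a small illustrative calculation (ours, not from the text), if f is a positive definite quadratic whose Hessian has minimum eigenvalue a = 1 and maximum eigenvalue L = 100, then the optimal stepsize is α = 2/(a + L) = 2/101, and the resulting modulus of contraction is (L/a − 1)/(L/a + 1) = 99/101 ≈ 0.98, so the distance to the optimum shrinks by a factor of roughly 0.98 per iteration; a condition number of 100 thus translates into on the order of a hundred iterations per significant digit of accuracy.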
With this rule, the convergence behavior of the method is very similar to
the one of the corresponding subgradient method. In particular, by Prop.
3.2.6 and the discussion following that proposition (cf. Exercise 3.6), if
there is a scalar c such that
Indeed, if x_k(α) ≠ x_k for some α > 0, it can be shown (see [Ber99], Section 2.3.2) that there exists ᾱ_k > 0 such that
(6.27)
Thus there is an interval of stepsizes α ∈ (0, ᾱ_k] that lead to reduction of the cost function value. Stepsize reduction and line search rules are motivated by some of the drawbacks of the constant and eventually constant stepsize rules: along some directions the growth rate of ∇f may be fast, requiring a small stepsize for guaranteed cost function descent, while in other directions the growth rate of ∇f may be slow, requiring a large stepsize for substantial progress. A form of line search may deal adequately with this difficulty.
There are many variants of line search rules. Some rules use an exact
line search, aiming to find the stepsize α_k that yields the maximum possible
cost improvement; these rules are practical mostly for the unconstrained
case, where they can be implemented via one of the several possible inter-
polation and other one-dimensional algorithms (see nonlinear programming
texts such as [Ber99], [Lue84], [NoW06]). For constrained problems, step-
size reduction rules are primarily used: an initial stepsize is chosen through
some heuristic procedure (possibly a fixed constant, obtained through some
experimentation, or a crude line search based on some polynomial inter-
polation scheme). This stepsize is then successively reduced by a certain
factor until a cost reduction test is passed.
One of the most popular stepsize reduction rules searches for a stepsize along the set of points

x_k(α) = P_X( x_k − α ∇f(x_k) ),   α > 0,

cf. Eq. (6.2). This is the Armijo rule along the projection arc, proposed in [Ber76a], which is a generalization of the Armijo rule for unconstrained problems, given in Section 2.1. It has the form

α_k = β^{m_k} s_k, where m_k is the first nonnegative integer m such that
f(x_k) − f( x_k(β^m s_k) ) ≥ σ ∇f(x_k)′ ( x_k − x_k(β^m s_k) ),   (6.28)

with β ∈ (0, 1) and σ ∈ (0, 1) being some constants, and s_k > 0 being an initial stepsize. Thus, the stepsize α_k is obtained by reducing s_k as many times as necessary for the inequality (6.28) to be satisfied; see Fig. 6.1.4.
This stepsize rule has strong convergence properties. In particular, it can
be shown that for a convex f with nonempty set X* of minima over X, and
with initial stepsize s_k that is bounded away from 0, it leads to convergence
to some x* E X*, without requiring the gradient Lipschitz condition (6.6).
The proof of this is nontrivial, and was given in [GaB84]; see [Ber99],
Section 2.3.2 for a textbook account [the original paper [Ber76a] gave an
easier convergence proof for the special case where X is the nonnegative
orthant, and also for the case where the Lipschitz condition (6.6) holds and
X is any closed convex set]. Related asymptotic convergence rate results
that involve the rate of growth of f, suitably modified for the presence of
the constraint set, are given in [Dun81], [Dun87].
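The following Python sketch (ours, not from the text) shows the stepsize search (6.28): starting from the trial value s, the stepsize is multiplied by β until the sufficient descent test is passed. The helper names, the projection routine proj, and the use of NumPy vectors are assumptions made for the example.

def armijo_projection_arc(f, grad_f, proj, x, s=1.0, beta=0.5, sigma=0.1, max_reductions=50):
    # Armijo rule along the projection arc [cf. Eq. (6.28)]:
    # return alpha = beta^m * s for the first m such that
    #   f(x) - f(x(alpha)) >= sigma * grad_f(x)' (x - x(alpha)),
    # where x(alpha) = proj(x - alpha * grad_f(x)); x and grad_f(x) are NumPy vectors.
    g = grad_f(x)
    alpha = s
    for _ in range(max_reductions):
        x_trial = proj(x - alpha * g)
        if f(x) - f(x_trial) >= sigma * g.dot(x - x_trial):
            return alpha, x_trial
        alpha *= beta
    return alpha, proj(x - alpha * g)   # fallback if the test was never passed

Note that, as discussed next in the text, each reduction of the trial stepsize requires one projection onto X.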
The preceding Armijo rule requires that with each reduction of the
trial stepsize, a projection operation on X is performed. While this may not
involve much overhead in cases where X is simple, such as for example a box
Figure 6.1.4. Illustration of the successive points tested by the Armijo rule along the projection arc. In this figure, α_k is obtained as β²s_k after two unsuccessful trials.
where s is a fixed positive scalar [cf. Prop. 6.1.1(a)], and then we set
For a convex cost function, both Armijo rules can be shown to guaran-
tee convergence to a unique limit point/optimal solution, without requiring
the gradient Lipschitz condition (6.6); see [Ius03] and compare also with
the comments in Exercise 2.5. When f is not convex but differentiable, the
standard convergence results with these rules state that every limit point
x̄ of {x_k} satisfies the optimality condition

∇f(x̄)′(x − x̄) ≥ 0,   ∀ x ∈ X,
minimize f(x)
subject to x ∈ X,

min_{k ≤ c/ε^p} f(x_k) ≤ f* + ε,

where c and p are positive constants, we say that the method has iteration complexity O(1/ε^p) (the constant c may depend on the problem data and the starting point x₀). Alternatively, if we can show that

min_{ℓ ≤ k} f(x_ℓ) ≤ f* + c/k^q,

where c and q are positive constants, we say that the method involves cost function error of order O(1/k^q).
It is generally thought that if the constant c does not depend on the
dimension n of the problem, then the algorithm holds some advantage for
problems where n is large. This view favors simple gradient/subgradient-
like methods over sophisticated conjugate direction or Newton-like meth-
ods, whose overhead per iteration increases at an order up to O(n 2 ) or
Example 6.1.1
with ε > 0 (cf. Fig. 6.1.5). Here the constant in the Lipschitz condition (6.6) is L = 1, and for any x_k > ε, we have ∇f(x_k) = ε. Thus the gradient iteration with stepsize α = 1/L = 1 takes the form
complexity estimate assumes the worst case where there is no positive order of growth. For the case where there is a unique minimum x*, this means that there are no scalars β > 0, δ > 0, and γ > 1 such that
minimize f(x)
subject to x ∈ X,   (6.30)
y_k = x_k + β_k ( x_k − x_{k−1} ),   (extrapolation step),   (6.31)
x_{k+1} = P_X( y_k − α ∇f(y_k) ),   (gradient projection step),

where P_X(·) denotes projection on X, x_{−1} = x₀, and β_k ∈ (0, 1); see Fig. 6.2.1. The method has a similar flavor to the heavy ball and PARTAN methods discussed in Section 2.1.1, but with some important differences: it applies to constrained problems, and it also reverses the order of extrapolation and gradient projection within an iteration.
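A minimal Python sketch of iteration (6.31) is given below (ours, not from the text). It assumes the common choice θ_k = 2/(k + 2), with θ_{−1} = 1, for generating the extrapolation parameters β_k = θ_k(1/θ_{k−1} − 1), and a box constraint for which the projection is a clipping operation; these choices are assumptions of the example.

import numpy as np

def extrapolated_gradient_projection(grad_f, x0, lower, upper, alpha, num_iters=500):
    # Iteration (6.31): y_k = x_k + beta_k (x_k - x_{k-1}),
    #                   x_{k+1} = P_X(y_k - alpha * grad f(y_k)).
    x_prev = x = np.asarray(x0, dtype=float)
    theta_prev = 1.0                       # theta_{-1} = 1, so beta_0 = 0
    for k in range(num_iters):
        theta = 2.0 / (k + 2)              # assumed choice with theta_k <= 2/(k+2)
        beta = theta * (1.0 / theta_prev - 1.0)
        y = x + beta * (x - x_prev)        # extrapolation step
        x_prev, x = x, np.clip(y - alpha * grad_f(y), lower, upper)  # gradient projection step
        theta_prev = theta
    return x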
The following proposition shows that with proper choice of β_k, the method has iteration complexity O(1/√ε) or equivalently error O(1/k²). In particular, we use

β_k = θ_k ( 1/θ_{k−1} − 1 ),   k = 0, 1, ...,   (6.32)

where the scalar sequence {θ_k} satisfies θ_{−1} = 1 and

(1 − θ_{k+1}) / θ²_{k+1} ≤ 1/θ²_k,   k = 0, 1, ....   (6.33)

One possibility is θ_{−1} = 1 and θ_k = 2/(k + 2) for k = 0, 1, ..., in which case β₀ = 0.
(6.34)

f(x_k) − f* ≤ ( 2L/(k + 1)² ) ( d(x₀) )²,   k = 1, 2, ...,

where we denote
+ ( θ²_k L / 2 ) ‖x* − z_{k+1}‖²,
where the last equality follows from Eqs. (6.35) and (6.36), and the last inequality follows from the convexity of ℓ(·; y_k). Using the inequality

we have

Using the facts x₀ = z₀, f* − ℓ(x*; y_k) ≥ 0, and θ_k ≤ 2/(k + 2), and taking the minimum over all x* ∈ X*, we obtain

f(x_{k+1}) − f* ≤ ( 2L/(k + 2)² ) ( d(x₀) )²,
O(1/ε) can be attained, which is much faster than the O(1/ε²) complexity of the subgradient method. The idea is to replace a nondifferentiable convex cost function by a smooth ε-approximation whose gradient is Lipschitz continuous with constant L = O(1/ε). By applying the optimal method given earlier, we obtain an ε-optimal solution with iteration complexity O(1/ε) or equivalently error O(1/k).
We will consider the smoothing technique for the special class of convex functions f₀ : ℜ^n ↦ ℜ of the form

if u ∈ U,
if u ∉ U,

u₀ ∈ arg min_{u∈U} p(u).
An example is the quadratic function p(u) = (1/2) ‖u − u₀‖², but there are also other functions of interest (see the paper by [Nes05] for some other examples, which also allow p to be nondifferentiable and to be defined only on U).
For a parameter ε > 0, consider the function

where

p* = max_{u∈U} p(u).
The following proposition shows that f_ε is also smooth and its gradient is Lipschitz continuous with Lipschitz constant that is proportional to 1/ε.

where u_ε(x) is the unique vector attaining the maximum in Eq. (6.39). Furthermore, we have
Proof: We first note that the maximum in Eq. (6.39) is uniquely attained in view of the strong convexity of p (which implies that p is strictly convex). Furthermore, f_ε is equal to the conjugate of the function

φ(u) + εp(u) + δ_U(u),

evaluated at Ax, with δ_U being the indicator function of U. It follows that f_ε is convex, and it is also differentiable with gradient

∇f_ε(x) = A′ u_ε(x).
Adding the two inequalities, and using the convexity of φ and the strong convexity of p, we obtain

≥ εσ ‖u_ε(x) − u_ε(y)‖²,

where for the second inequality we used Eq. (6.41), and for the third inequality we used a standard property of strongly convex functions. Thus,
iterations.
with the gradient step as shown. Starting from that vector, we compute x₁ with a proximal step as shown (cf. the geometric interpretation of the proximal iteration of Fig. 5.1.8 in Section 5.1.4). Assuming that ∇f is Lipschitz continuous (so ∇f has bounded slope along directions) and α is sufficiently small, the method makes progress towards the optimal solution x*. When f(x) = 0 we obtain the proximal algorithm, and when h is the indicator function of a closed convex set, we obtain the gradient projection method.
(6.47)

[cf. Eq. (6.46)]. It follows that the iteration aims to converge to a fixed point of the composite mapping P_{α,h} · G_α.
Let us now note an important fact: the fixed points of P_{α,h} · G_α coincide with the minima of f + h. This is guaranteed by the fact that the same parameter α is used in the gradient and the proximal mappings in the composite iteration (6.47). To see this, note that x* is a fixed point of P_{α,h} · G_α if and only if
† Among others, this argument shows that it is not correct to use different parameters α in the gradient and the proximal portions of the proximal gradient method. This also highlights the restrictions that must be observed when replacing G_α and P_{α,h} with other mappings (involving for example diagonal or Newton scaling, or extrapolation), with the aim to accelerate convergence.
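For concreteness, a small Python sketch of the proximal gradient iteration is shown below (ours, not from the text), for the common case h(x) = γ‖x‖₁, whose proximal mapping is the shrinkage operation; the fixed stepsize α is assumed to be small enough (e.g., α ≤ 1/L), and the function names are choices made for the example.

import numpy as np

def proximal_gradient_l1(grad_f, x0, gamma, alpha, num_iters=1000):
    # Proximal gradient iteration x_{k+1} = P_{alpha,h}(x_k - alpha * grad f(x_k)),
    # with h(x) = gamma * ||x||_1, so P_{alpha,h} is soft thresholding at level alpha*gamma.
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        z = x - alpha * grad_f(x)                                      # gradient step on f
        x = np.sign(z) * np.maximum(np.abs(z) - alpha * gamma, 0.0)    # proximal step on h
    return x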
Convergence Analysis
− ‖x_k − x_k(α)‖².   (6.49)
The next proposition shows that cost function descent can be guar-
anteed by a certain inequality test, which is automatically satisfied for all
stepsizes in the range (0, 1/ L].
(6.50)
(6.51)
Proof: By setting y = x_k in Eq. (6.49), using the fact ℓ(x_k; x_k) = f(x_k), and rearranging, we have
(6.52)
as is necessary for Eq. (6.52) to hold. According to Prop. 6.3.2, this rule
guarantees cost function descent, and guarantees that α_k will stay constant after finitely many iterations.
(6.53)
Then {x_k} converges to some point of X*, and for all k = 0, 1, ..., we have
(6.56)
The proximal gradient algorithm may also be applied to the Fenchel dual problem

minimize f₁*(−A′λ) + f₂*(λ)
subject to λ ∈ ℜ^m,   (6.57)

where f₁ and f₂ are closed proper convex functions, f₁* and f₂* are their conjugates

f₁*(−A′λ) = sup_{x∈ℜ^n} { (−λ)′Ax − f₁(x) },   (6.58)

and A is an m × n matrix. Note that we have reversed the sign of λ relative to the formulation of Sections 1.2 and 5.4 [the problem has not changed, it is still dual to minimizing f₁(x) + f₂(Ax), but with this reversal of sign, we will obtain more convenient formulas]. The proximal gradient method for the dual problem consists of first applying a gradient step using the function f₁*(−A′λ) and then applying a proximal step using the function f₂*(λ) [cf. Eq. (6.46)]. We refer to this as the dual proximal gradient algorithm, and we will show that it admits a primal implementation that resembles the ADMM of Section 5.4.
To apply the algorithm it is of course necessary to assume that the function f₁*(−A′λ) is differentiable. Using Prop. 5.4.4(a) of Appendix B, it can be seen that this is equivalent to requiring that the supremum in Eq. (6.58) is uniquely attained for all λ ∈ ℜ^m. Moreover, using the chain rule, the gradient of f₁*(−A′λ), evaluated at any λ ∈ ℜ^m, is
According to the theory of the dual proximal algorithm (cf. Section 5.2), the proximal step can be dually implemented using an augmented Lagrangian-type minimization: first find

z_{k+1} ∈ arg min_{z∈ℜ^m} { f₂(z) − λ̄_k′ z + (α_k/2) ‖z‖² },   (6.62)

and then obtain λ_{k+1} using the iteration

λ_{k+1} = λ̄_k − α_k z_{k+1}.   (6.63)
The dual proximal gradient algorithm (6.60)-(6.63) is a valid implementation of the proximal gradient algorithm, applied to the Fenchel dual problem (6.57). Its convergence is guaranteed by Prop. 6.3.3, provided the gradient of f₁*(−A′λ) is Lipschitz continuous, and α_k is a sufficiently small constant or is chosen by the eventually constant stepsize rule.
It is interesting to note that this algorithm bears similarity to the ADMM for minimizing f₁(x) + f₂(Ax) (which applies more generally, as it does not require that f₁* is differentiable). Indeed we may rewrite the algorithm (6.60)-(6.63) by combining Eqs. (6.60) and (6.63), so that

λ_{k+1} = λ_k + α_k ( A x_{k+1} − z_{k+1} ),

where x_{k+1} minimizes the Lagrangian,

x_{k+1} = arg min_{x∈ℜ^n} { f₁(x) + λ_k′ (Ax − z_k) },   (6.64)

while by using Eqs. (6.60) and (6.62), we can verify that z_{k+1} minimizes the augmented Lagrangian

z_{k+1} ∈ arg min_{z∈ℜ^m} { f₂(z) + λ_k′ (A x_{k+1} − z) + (α_k/2) ‖A x_{k+1} − z‖² }.
Other than minimizing with respect to x the Lagrangian in Eq. (6.64),
instead of the augmented Lagrangian, the only other difference of this dual
proximal gradient algorithm from the ADMM is that there is a restriction
on the magnitude of the stepsize [it is limited by the size of the Lipschitz constant of the gradient of the function f₁*(−A′λ), as per Prop. 6.3.2].
Note that in the ADMM the penalty parameter can be chosen freely, but
(contrary to the augmented Lagrangian method) it may not be clear how to
choose it in order to accelerate convergence. Thus all three proximal-type
methods, proximal gradient, ADMM, and augmented Lagrangian, have
similarities, and relative strengths and weaknesses. The choice between
them hinges largely on the given problem's structure.
y_k = x_k + β_k ( x_k − x_{k−1} ),   (extrapolation step),   (6.65)
x_{k+1} = P_{α,h}( y_k − α ∇f(y_k) ),   (gradient step),

where x_{−1} = x₀, and β_k ∈ (0, 1). The extrapolation parameter β_k is selected as in Section 6.2.
(6.66)
It can be shown that x_{k+1} can be generated by a two-step process: first perform a Newton step, obtaining x̄_k = x_k − ( ∇²f(x_k) )^{-1} ∇f(x_k), and then a proximal step

To see this, write the optimality condition for the above minimization and show that it coincides with Eq. (6.66). Note that when h(x) = 0 we obtain the pure form of Newton's method. If ∇²f(x_k) is replaced by a positive definite symmetric matrix H_k, we obtain a proximal quasi-Newton method.
subject to x ∈ X,
(6.70)
(6.71)
(6.72)
(6.73)
where the restriction x ∈ X has been omitted from the proximal iteration, and the algorithm

z_k = x_k − α_k ∇̃h_{i_k}(x_k),   (6.74)

where the projection onto X has been omitted from the subgradient itera-
tion. It is also possible to use different stepsize sequences in the proximal
and subgradient iterations, but for notational simplicity we will not discuss
this type of algorithm.
Part (a) of the following proposition is a key fact about incremental
proximal iterations. It shows that they are closely related to incremental
subgradient iterations, with the only difference being that the subgradient
is evaluated at the end point of the iteration rather than at the start point.
Part (b) of the proposition is the three-term inequality, which was shown
in Section 5.1 (cf. Prop. 5.1.2). It will be useful in our convergence analysis
and is restated here for convenience.
(6.77)
Proof: (a) We use the formula for the subdifferential of the sum of the three functions f, (1/2α_k)‖x − x_k‖², and the indicator function of X (cf. Prop. 5.4.6 in Appendix B), together with the condition that 0 should belong to this subdifferential at the optimum x_{k+1}. We obtain that Eq. (6.76) holds if and only if

(1/α_k)( x_k − x_{k+1} ) ∈ ∂f(x_{k+1}) + N_X(x_{k+1}),   (6.79)

where N_X(x_{k+1}) is the normal cone of X at x_{k+1} [the set of vectors y such that y′(x − x_{k+1}) ≤ 0 for all x ∈ X, and also the subdifferential of the indicator function of X at x_{k+1}; cf. Section 3.1]. This is true if and only if
for some "9 f(xk+I) E af(xk+1), which in turn is true if and only if Eq.
(6.77) holds, by the Projection Theorem (Prop. 1.1.9 in Appendix B).
(b) See Prop. 5.1.2. Q.E.D.
Note that in all the preceding updates, the subgradient ∇̃h_{i_k} can be any vector in the subdifferential of h_{i_k}, while the subgradient ∇̃f_{i_k} must be a specific vector in the subdifferential of f_{i_k}, specified according to Prop. 6.4.1(a). Note also that iteration (6.81) can be written as

[cf. Eq. (6.68)], the only difference being that the subgradient of F_{i_k} is computed at z_k rather than x_k.
An important issue which affects the methods' effectiveness is the order in which the components {f_i, h_i} are chosen for iteration. In this section we consider and analyze the convergence for two possibilities:
(1) A cyclic order, whereby {f_i, h_i} are taken up in the fixed deterministic order 1, ..., m, so that i_k is equal to (k modulo m) plus 1. A contiguous block of iterations involving

dist(x; X) = min_{z∈X} ‖x − z‖,   x ∈ ℜ^n.
We first discuss convergence under the cyclic order. We focus on the sequence {x_k} rather than {z_k}, which need not lie within X in the case of iterations (6.81) and (6.82) when X ≠ ℜ^n. In summary, the idea is to show that the effect of taking subgradients of f_i or h_i at points near x_k (e.g., at z_k rather than at x_k) is inconsequential, and diminishes as the stepsize α_k becomes smaller, as long as some subgradients relevant to the algorithms are uniformly bounded in norm by some constant. This is similar to the convergence mechanism of incremental gradient methods described informally in Section 3.3.1. We use the following assumptions throughout the present section.
(6.83)

Furthermore, for all k that mark the beginning of a cycle (i.e., all k ≥ 0 with i_k = 1), we have for all j = 1, ..., m,

(6.85)

Furthermore, for all k that mark the beginning of a cycle (i.e., all k ≥ 0 with i_k = 1), we have for all j = 1, ..., m,
Note that the condition (6.84) is satisfied if for each i and k, there is a subgradient of f_i at x_k and a subgradient of h_i at x_k, whose norms are bounded by c. Conditions that imply the preceding assumptions are:
(a) For algorithm (6.80): f_i and h_i are Lipschitz continuous over the set X.
(b) For algorithms (6.81) and (6.82): f_i and h_i are Lipschitz continuous over the entire space ℜ^n.
(c) For all algorithms (6.80), (6.81), and (6.82): f_i and h_i are polyhedral [this is a special case of (a) and (b)].
(d) The sequences {x_k} and {z_k} are bounded (since then f_i and h_i, being real-valued and convex, are Lipschitz continuous over any bounded set that contains {x_k} and {z_k}).
The following proposition provides a key estimate that reveals the
convergence mechanism of the methods.
where β = 1/m + 4.
Proof: We first prove the result for algorithms (6.80) and (6.81), and then indicate the modifications necessary for algorithm (6.82). Using Prop. 6.4.1(b), we have for all y ∈ X and k,

the definition of subgradient, and Eq. (6.83), we obtain for all y ∈ X and k,

‖x_{k+1} − y‖² ≤ ‖x_k − y‖² − 2α_k ( f_{i_k}(z_k) + h_{i_k}(z_k) − f_{i_k}(y) − h_{i_k}(y) ) + α_k²c²
 = ‖x_k − y‖² − 2α_k ( F_{i_k}(z_k) − F_{i_k}(y) ) + α_k²c².   (6.91)
Let now k mark the beginning of a cycle (i.e., i_k = 1). Then at iteration k + j − 1, j = 1, ..., m, the selected components are {f_j, h_j}, in view of the assumed cyclic order. We may thus replicate the preceding inequality with k replaced by k + 1, ..., k + m − 1, and add to obtain

‖x_{k+m} − y‖² ≤ ‖x_k − y‖² − 2α_k Σ_{j=1}^m ( F_j(x_k) − F_j(y) ) + mα_k²c²
 + 2α_k Σ_{j=1}^m ( F_j(x_k) − F_j(z_{k+j−1}) ).   (6.92)
The remainder of the proof deals with appropriately bounding the last term
above.
From Eq. (6.84), we have for j = 1, ... , m,
We also have
(6.95)
and finally
‖x_{k+m} − y‖² ≤ ‖x_k − y‖² − 2α_k ( F(x_k) − F(y) ) + mα_k²c² + 4α_k²c²m²,
‖x_{k+1} − y‖² ≤ ‖x_k − y‖² − 2α_k ( f_{i_k}(x_{k+1}) + h_{i_k}(x_k) − f_{i_k}(y) − h_{i_k}(y) ) + α_k²c²
 = ‖x_k − y‖² − 2α_k ( F_{i_k}(x_k) − F_{i_k}(y) ) + α_k²c² + 2α_k ( f_{i_k}(x_k) − f_{i_k}(x_{k+1}) ).   (6.98)
As earlier, we let k mark the beginning of a cycle (i.e., i_k = 1). We replicate the preceding inequality with k replaced by k + 1, ..., k + m − 1, and add to obtain [in analogy with Eq. (6.92)]

(6.99)
We now bound the two sums in Eq. (6.99), using Assumption 6.4.2.
From Eq. (6.86), we have
Also from Eqs. (6.85) and (6.87), and the nonexpansion property of the
projection, we have
Among other things, Prop. 6.4.2 guarantees that with a cyclic order, given the iterate x_k at the start of a cycle and any point y ∈ X having lower cost than x_k (for example an optimal point), the algorithm yields a point x_{k+m} at the end of the cycle that will be closer to y than x_k, provided the stepsize α_k satisfies

α_k < 2 ( F(x_k) − F(y) ) / ( β m² c² ).

In particular, for any ε > 0 and assuming that there exists an optimal solution x*, either we are within α_k β m² c² / 2 + ε of the optimal value,

F(x_k) ≤ F(x*) + α_k β m² c² / 2 + ε,

or else the squared distance to x* will be strictly decreased by at least 2α_k ε,
Thus, using Prop. 6.4.2, we can provide various types of convergence results. As an example, for a constant stepsize (α_k ≡ α), convergence can be established to a neighborhood of the optimum, which shrinks to 0 as α → 0, as stated in the following proposition.
lim inf_{k→∞} F(x_k) = F*.

lim inf_{k→∞} F(x_k) ≤ F* + αβm²c²/2,
Proof: We prove (a) and (b) simultaneously. If the result does not hold, there must exist an ε > 0 such that

lim inf_{k→∞} F(x_{km}) − αβm²c²/2 − 2ε > F*.
By combining the preceding two relations, we obtain for all k ≥ k₀,

F(x_{km}) − F(ȳ) ≥ αβm²c²/2 + ε.

Using Prop. 6.4.2 for the case where y = ȳ together with the above relation, we obtain for all k ≥ k₀,
min_{0≤k≤N} F(x_k) ≤ F* + ( αβm²c² + ε ) / 2,   (6.100)
where N is given by

N = m ⌊ dist(x₀; X*)² / (αε) ⌋,   (6.101)

so that

( N/m + 1 ) αε > dist(x₀; X*)²,
We can also obtain an exact convergence result for the case where the stepsize α_k diminishes to zero. The idea is that with a constant stepsize α we can get to within an O(α)-neighborhood of the optimum, as shown above, so with a diminishing stepsize α_k, we should be able to reach an arbitrarily small neighborhood of the optimum. However, for this to happen, α_k should not be reduced too fast, and should satisfy Σ_{k=0}^∞ α_k = ∞ (so that the method can "travel" infinitely far if necessary).
lim_{k→∞} α_k = 0,   Σ_{k=0}^∞ α_k = ∞.
Then,

lim inf_{k→∞} F(x_k) = F*.

Furthermore, if X* is nonempty and

Σ_{k=0}^∞ α_k² < ∞,
‖x_{(k+1)m} − ȳ‖² ≤ ‖x_{km} − ȳ‖² − α_{km} ε ≤ ··· ≤ ‖x_{k₀m} − ȳ‖² − ε Σ_{ℓ=k₀}^k α_{ℓm},
which cannot hold for k sufficiently large. Hence lim inf_{k→∞} F(x_{km}) = F*.
To prove the second part of the proposition, note that from Prop. 6.4.2, for every x* ∈ X* and k ≥ 0 we have
z^i_k ∈ arg min_{x∈X} { f_i(x) + (1/2α_k) ‖x − x_k‖² },

with
inf_{k≥0} F(x_k) = F*.

inf_{k≥0} F(x_k) ≤ F* + αβmc²/2,

where β = 5.
Taking conditional expectation of both sides given the set of random variables ℱ_k = {x_k, z_{k−1}, ..., z₀, x₀}, and using the fact that w_k takes the values i = 1, ..., m with equal probability 1/m, we obtain for all y ∈ X
and k,

E{ ‖x_{k+1} − y‖² | ℱ_k } ≤ ‖x_k − y‖² − 2α E{ F_{w_k}(z_k) − F_{w_k}(y) | ℱ_k } + α²c²
 = ‖x_k − y‖² − (2α/m) Σ_{i=1}^m ( F_i(z^i_k) − F_i(y) ) + α²c²
 = ‖x_k − y‖² − (2α/m) ( F(x_k) − F(y) ) + α²c² + (2α/m) Σ_{i=1}^m ( F_i(x_k) − F_i(z^i_k) ).   (6.110)
By using Eqs. (6.106) and (6.107),

Σ_{i=1}^m ( F_i(x_k) − F_i(z^i_k) ) ≤ 2c Σ_{i=1}^m ‖x_k − z^i_k‖ = 2cα Σ_{i=1}^m ‖∇̃f_i(z^i_k)‖ ≤ 2mαc².   (6.113)
F(y_γ) = −γ if F* = −∞,   and   F(y_γ) = F* + 1/γ if F* > −∞.
Note that y_γ ∈ L_γ by construction. Define a new process {x̂_k} that is identical to {x_k}, except that once x̂_k enters the level set L_γ, the process terminates with x̂_k = y_γ. We will now argue that for any fixed γ, {x̂_k} (and hence also {x_k}) will eventually enter L_γ, which will prove both parts (a) and (b).
Using Eq. (6.111) with y = y_γ, we have
from which

(6.114)

where

v_k = (2α/m) ( F(x̂_k) − F(y_γ) ) − βα²c²   if x̂_k ∉ L_γ,   and   v_k = 0   if x̂_k = y_γ.
The idea of the subsequent argument is to show that as long as x̂_k ∉ L_γ, the scalar v_k (which is a measure of progress) is strictly positive and bounded away from 0.
(a) Let F* = −∞. Then if x̂_k ∉ L_γ, we have

v_k = (2α/m) ( F(x̂_k) − F(y_γ) ) − βα²c²
 ≥ (2α/m) ( −γ + 1 + αβmc²/2 + γ ) − βα²c²
 = 2α/m.

It follows that x̂_k eventually enters L_γ, so that

inf_{k≥0} F(x_k) ≤ −γ + 1 + αβmc²/2
min_{0≤k≤N} F(x_k) ≤ F* + ( αβmc² + ε ) / 2,   (6.115)
Proof: Let ȳ be some fixed vector in X*. Define a new process {x̂_k} which is identical to {x_k} except that once x̂_k enters the level set

L = { x ∈ X | F(x) < F* + ( αβmc² + ε ) / 2 },
the process {x̂_k} terminates at ȳ. Similar to the proof of Prop. 6.4.6 [cf. Eq. (6.111) with y being the closest point of x̂_k in X*], for the process {x̂_k} we obtain for all k,

if x̂_k ∉ L,
otherwise.
v_k ≥ (2α/m) ( F* + ( αβmc² + ε ) / 2 − F* ) − βα²c² = αε/m.   (6.118)
min_{0≤k≤N} F(x_k) ≤ F* + ( αβmc² + ε ) / 2
where in the last inequality we use the facts x̂₀ = x₀ and E{ dist(x₀; X*)² } = dist(x₀; X*)². Therefore, letting k → ∞, and using the definition of v_k and Eq. (6.118),

Q.E.D.
Like Prop. 6.4.6, a comparison of Props. 6.4.4 and 6.4.7 again suggests an advantage for the randomized methods: compared to their deterministic
counterparts, they achieve a much smaller error tolerance (by a factor of
m), in the same expected number of iterations. Note, however, that the
preceding assessment is based on upper bound estimates, which may not
be sharp on a given problem [although the bound of Prop. 6.4.3(b) is tight
with a worst-case problem selection as mentioned earlier; see [BNO03], p.
514]. Moreover, the comparison based on worst-case values versus expected
values may not be strictly valid. In particular, while Prop. 6.4.4 provides an
upper bound estimate on N, Prop. 6.4.7 provides an upper bound estimate
on E{N}, which is not quite the same. However, this comparison seems to
be supported by the experimental results obtained so far.
Finally for the case of a diminishing stepsize, let us give the following
proposition, which parallels Prop. 6.4.5 for the cyclic order.
lim_{k→∞} α_k = 0,   Σ_{k=0}^∞ α_k = ∞.

Σ_{k=0}^∞ α_k² < ∞,
Proof: The proof of the first part is nearly identical to the corresponding part of Prop. 6.4.5. To prove the second part, similar to the proof of Prop. 6.4.6, we obtain for all k and all x* ∈ X*,

[cf. Eq. (6.111) with α and y replaced with α_k and x*, respectively], where ℱ_k = {x_k, z_{k−1}, ..., z₀, x₀}. According to the Supermartingale Convergence Theorem (Prop. A.4.5 in Appendix A), for each x* ∈ X*, there is a
set Ω_{x*} of sample paths of probability 1 such that for each sample path in Ω_{x*},

Σ_{k=0}^∞ (2α_k/m) ( F(x_k) − F* ) < ∞,   (6.120)

For each sample path in Ω̄, all the sequences {‖x_k − v_i‖} converge so that {x_k} is bounded, while by the first part of the proposition [or Eq. (6.120)] lim inf_{k→∞} F(x_k) = F*. Therefore, {x_k} has a limit point x̄ in X*. Since {v_i} is dense in X*, for every ε > 0 there exists v_{i(ε)} such that ‖x̄ − v_{i(ε)}‖ < ε. Since the sequence {‖x_k − v_{i(ε)}‖} converges and x̄ is a limit point of {x_k}, we have lim_{k→∞} ‖x_k − v_{i(ε)}‖ < ε, so that
ℓ₁-Regularization
minimize γ‖x‖₁ + (1/2) Σ_{i=1}^m ( c_i′x − b_i )²
subject to x ∈ ℜ^n,   (6.121)
where γ is a positive scalar and x^j is the jth coordinate of x (cf. Example 1.3.1). It is convenient to handle the regularization term with the proximal algorithm:

z_k ∈ arg min_{x∈ℜ^n} { γ‖x‖₁ + (1/2α_k) ‖x − x_k‖² }.
where f_i : ℜ^{n_i} ↦ ℜ are convex functions (n_i is a positive integer, which may depend on i), X_i are nonempty closed convex subsets of ℜ^{n_i}, A_i are given r × n_i matrices, and b_i ∈ ℜ^r are given vectors. For simplicity, we focus on linear equality constraints, but the analysis can be extended to convex inequality constraints as well.
Similar to our discussion of separable problems in Section 1.1.1, the dual function is given by

where

q_i(λ) = inf_{x^i∈X_i} { f_i(x^i) + λ′( A_i x^i − b_i ) }.   (6.124)
λ_{k+1} = λ_k + α Σ_{i=1}^m ∇q_i(λ_k),

i = 1, ..., m,

over x^i ∈ X_i; cf. Eq. (6.124) and Example 3.1.2. Thus this method ex-
ploits the separability of the problem, and is well suited for distributed
computation, with the gradients ∇q_i(λ_k) computed in parallel at separate processors. However, the differentiability requirement on q_i is very strong [it is equivalent to the infimum being attained uniquely for all λ in Eq. (6.124)], and the convergence properties of this method tend to be frag-
ile. We will consider instead an incremental proximal method that can
be dually implemented (cf. Section 5.2) with decomposable augmented La-
grangian minimizations, and has more solid convergence properties.
has a suitable form for application of the incremental proximal method [cf. Eq. (6.67)].† In particular, the incremental proximal algorithm updates the current vector λ_k to a new vector λ_{k+1} after a cycle of m subiterations:

(6.125)

where starting with ψ⁰_k = λ_k, we obtain ψ^m_k after the m proximal steps

(6.127)

(6.129)
† The algorithm (6.67) requires that the functions −q_i have a common effective domain, which is a closed convex set. This is true for example if q_i is real-valued, which occurs if X_i is compact. In unusual cases where −q_i has an effective domain that depends on i and/or is not closed, the earlier convergence analysis does not apply and needs to be modified.
minimize f(x)
subject to x ∈ ∩_{i=1}^m X_i,   (6.130)

f(x) + c Σ_{i=1}^m dist(x; X_i)

over X₀.
(6.133)
where
Proof: The case x_k ∈ X_{i_k} is evident, so assume that x_k ∉ X_{i_k}. From the nature of the cost function in Eq. (6.132) we see that z_k is a vector that lies in the line segment between x_k and P_{X_{i_k}}(x_k). Hence there are two possibilities: either

z_k = P_{X_{i_k}}(x_k),   (6.134)

or

c ( z_k − P_{X_{i_k}}(z_k) ) / ‖z_k − P_{X_{i_k}}(z_k)‖ = (1/α_k)( x_k − z_k ).   (6.135)
and z_k is equal to the projection of x_k onto X_{i_k}. In the right-hand side figure, we have

α_k c < dist(x_k; X_{i_k}),

and z_k is obtained by calculating the projection of x_k onto X_{i_k}, and then interpolating according to Eq. (6.133).
This equation implies that x_k, z_k, and P_{X_{i_k}}(z_k) lie on the same line, so that P_{X_{i_k}}(z_k) = P_{X_{i_k}}(x_k) and

By calculating and comparing the value of the cost function in Eq. (6.132) for each of the possibilities (6.134) and (6.135), we can verify that (6.135) gives a lower cost if and only if β_k < 1. Q.E.D.
subject to x ∈ ∩_{i=1}^m X_i.

Based on the preceding analysis, we can convert this problem to the unconstrained minimization problem

minimize Σ_{i=1}^m ( f_i(x) + h_i(x) + c dist(x; X_i) )
subject to x ∈ ℜ^n,
(6.137)

where ∇̃h_{i_k}(x_k) denotes some subgradient of h_{i_k} at x_k. It then performs a proximal iteration on a cost component f_{i_k},

(6.138)

if β_k < 1,
(6.139)
if β_k ≥ 1,

where

β_k = α_k c / dist(z_k; X_{i_k}),   (6.140)

with the convention that β_k = ∞ if dist(z_k; X_{i_k}) = 0. The index i_k may be chosen either randomly or according to a cyclic rule. Our earlier convergence analysis extends straightforwardly to the case of three cost function components for each index i. Moreover the subgradient, proximal, and projection operations may be performed in any order.
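A schematic Python sketch of one such iteration is given below (ours, not from the text). It assumes that the three steps take the forms indicated in the comments, consistent with Eqs. (6.137)-(6.140); the callables subgrad_h, prox_f, project, and dist are supplied by the user and, together with the variable names, are assumptions of the example.

def incremental_step(x, i, subgrad_h, prox_f, project, dist, alpha, c):
    # Subgradient step on h_i:  z = x - alpha * (a subgradient of h_i at x)          [cf. (6.137)]
    # Proximal step on f_i:     v = argmin_u { f_i(u) + (1/(2*alpha)) ||u - z||^2 }  [cf. (6.138)]
    # Interpolated projection toward X_i with beta = alpha*c/dist(v, X_i)            [cf. (6.139)-(6.140)]
    z = x - alpha * subgrad_h(i, x)
    v = prox_f(i, z, alpha)
    d = dist(i, v)
    if d == 0.0:
        return v                      # v already lies in X_i (beta = infinity by convention)
    beta = alpha * c / d
    p = project(i, v)
    return p if beta >= 1.0 else v + beta * (p - v)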
Note that the penalty parameter c can be taken as large as desired, and it does not affect the algorithm as long as

in which case β_k ≥ 1 [cf. Eq. (6.140)]. Thus we may keep increasing c so that β_k ≥ 1, up to the point where it reaches some "very large" threshold. It would thus appear that in practice we may be able to use a stepsize β_k that is always equal to 1 in Eq. (6.139), leading to the simpler algorithm

of randomized and cyclic sampling schemes for selecting the cost function components and the constraint components.
While this algorithm does not depend on the penalty parameter c,
its currently available convergence proof requires an additional condition.
This is the so-called linear regularity condition, namely that for some η > 0,

∀ x ∈ ℜ^n,
subject to x ∈ ∩_{i=1}^m X_i,

which is obtained by replacing convex inequality constraints of the form g_j(x) ≤ 0 with the nondifferentiable penalty terms c max{0, g_j(x)}, where c > 0 is a penalty parameter (cf. Section 1.5). Then a possible incremental method at each iteration would either do a subgradient iteration on f, or select one of the violated constraints (if any) and perform a subgradient iteration on the corresponding function g_j, or select one of the sets X_i and do an interpolated projection on it.
In this section we will consider the block coordinate descent approach that we discussed briefly in Section 2.1.2. We focus on the problem

minimize f(x)
subject to x ∈ X,

where f : ℜ^n ↦ ℜ is a differentiable convex function, and X is a Cartesian product of closed convex sets X₁, ..., X_m:

X = X₁ × X₂ × ··· × X_m,

where X_i is a subset of ℜ^{n_i} (we allow that n_i > 1, although the most common case is when n_i = 1 for all i). The vector x is partitioned as

x = (x¹, x², ..., x^m),
x^i_{k+1} ∈ arg min_{ξ∈X_i} f( x¹_{k+1}, ..., x^{i−1}_{k+1}, ξ, x^{i+1}_k, ..., x^m_k ),   i = 1, ..., m,   (6.141)

where we assume that the preceding minimization has at least one optimal solution. Thus, at each iteration, the cost is minimized with respect to each of the block components x^i, taken in cyclic order, with each minimization incorporating the results of the preceding minimizations. Naturally, the method makes practical sense if the minimization in Eq. (6.141) is fairly easy. This is frequently so when each x^i is a scalar, but there are also other cases of interest, where x^i is multidimensional.
The coordinate descent approach has a sound theoretical basis thanks
to its iterative cost function descent character, and is often conveniently
applicable. In particular, when the coordinate blocks are one-dimensional,
the descent direction does not require a special calculation. Moreover if
the cost function is a sum of functions with "loose coupling" between the
block components (i.e., each block component appears in just a few of the
functions in the sum), then the calculation of the minimum along each
block component may be simplified. Another structure that favors the use
of block coordinate descent is when the cost function involves terms of
the form h(Ax), where A is a matrix such that computing y = Ax is far
more expensive than computing h(y); this simplifies the minimization over
a block component.
The following proposition gives the basic convergence result for the
method. It turns out that it is necessary to make an assumption implying
that the minimum in Eq. (6.141) is uniquely attained. This assumption
is satisfied if f is strictly convex in each block component when all other
block components are held fixed. We will discuss later a version of the
algorithm, which involves quadratic regularization and does not require this
assumption. While the proposition as stated applies to convex optimization
problems, consistently with the framework of this section, the proof can
be adapted to use only the continuous differentiability of f and not its
convexity (see the proposition after the next one).
viewed as a function of ξ, attains a unique minimum over X_i. Let {x_k} be the sequence generated by the block coordinate descent method (6.141). Then, every limit point of {x_k} minimizes f over X.
Proof: Denote
\:/ k. (6.142)
Let ;i; = (x1, ... ,xm) be a limit point of the sequence {xk}, and note
that ;i; E X since X is closed. Equation (6.142) implies that the sequence
{f (xk)} converges to f (x). We will show that ;i; satisfies the optimality
condition
v' f (x)'(x - x) ;:::: o, \:/x EX;
cf. Prop. 1.1.8 in Appendix B.
Let $\{x_{k_j} \mid j = 0, 1, \ldots\}$ be a subsequence of $\{x_k\}$ that converges to $\bar x$.
Using the definition (6.141) of the algorithm and Eq. (6.142), we will show that
$$\nabla_i f(\bar x)'(x^i - \bar x^i) \ge 0, \qquad \forall\, x^i \in X_i,$$
for all $i = 1, \ldots, m$. By adding these inequalities, and using the Cartesian
product structure of the set $X$, it follows that $\nabla f(\bar x)'(x - \bar x) \ge 0$ for all
$x \in X$, thereby completing the proof.
We first show that $\{z_{k_j}^1\}$ converges to $\bar x$ as $j \to \infty$. Assume the contrary,
or equivalently that $\{z_{k_j}^1 - x_{k_j}\}$ does not converge to zero. Let
$\gamma_{k_j} = \|z_{k_j}^1 - x_{k_j}\|$; by restricting to a further subsequence if necessary,
we may assume that $\gamma_{k_j} \ge \bar\gamma$ for all $j$, for some $\bar\gamma > 0$, and we let
$s_{k_j}^1 = (z_{k_j}^1 - x_{k_j})/\gamma_{k_j}$.
Thus, $z_{k_j}^1 = x_{k_j} + \gamma_{k_j} s_{k_j}^1$, $\|s_{k_j}^1\| = 1$, and $s_{k_j}^1$ differs from zero only along
the first block component. Notice that $s_{k_j}^1$ belongs to a compact set and
therefore has a limit point $\bar s^1$. By restricting to a further subsequence of
$\{k_j\}$, we assume that $s_{k_j}^1$ converges to $\bar s^1$.
Let us fix some $\epsilon \in [0, 1]$. Since $0 \le \epsilon\bar\gamma \le \gamma_{k_j}$, the vector $x_{k_j} + \epsilon\bar\gamma\, s_{k_j}^1$
lies on the line segment joining $x_{k_j}$ and $x_{k_j} + \gamma_{k_j} s_{k_j}^1 = z_{k_j}^1$, and belongs to
$X$ since $X$ is convex. Using the fact that $f$ is monotonically nonincreasing
on the interval from $x_{k_j}$ to $z_{k_j}^1$ (by the convexity of $f$), we obtain
$$f\big(z_{k_j}^1\big) \le f\big(x_{k_j} + \epsilon\bar\gamma\, s_{k_j}^1\big) \le f(x_{k_j}).$$
Since $f(x_k)$ converges to $f(\bar x)$, Eq. (6.142) shows that $f(z_{k_j}^1)$ also converges
to $f(\bar x)$. Taking the limit as $j$ tends to infinity, we obtain
$$f(\bar x) \le f\big(\bar x + \epsilon\bar\gamma\, \bar s^1\big) \le f(\bar x).$$
We conclude that $f(\bar x) = f(\bar x + \epsilon\bar\gamma\, \bar s^1)$ for every $\epsilon \in [0, 1]$. Since $\bar\gamma\, \bar s^1 \ne 0$ and,
by Eq. (6.143), $\bar x^1$ attains the minimum of $f(x^1, \bar x^2, \ldots, \bar x^m)$ over $x^1 \in X_1$,
this contradicts the hypothesis that $f$ is uniquely minimized when viewed as
a function of the first block component. This contradiction establishes that
$z_{k_j}^1$ converges to $\bar x$, which, as noted earlier, shows that $\nabla_2 f(\bar x)'(x^2 - \bar x^2) \ge 0$
for all $x^2 \in X_2$.
By using $\{z_{k_j}^1\}$ in place of $\{x_{k_j}\}$, and $\{z_{k_j}^2\}$ in place of $\{z_{k_j}^1\}$ in the
preceding arguments, we can show that $\nabla_3 f(\bar x)'(x^3 - \bar x^3) \ge 0$ for all $x^3 \in X_3$,
and similarly $\nabla_i f(\bar x)'(x^i - \bar x^i) \ge 0$ for all $x^i \in X_i$ and $i$. Q.E.D.
The proof is nearly identical to the one of Prop. 6.5.1, using at the
right point the monotonic nonincrease assumption in place of convexity of
f. An alternative assumption, also discussed in [Ber99], Section 2.7, under
which the conclusion of Prop. 6.5.2 can be shown with a similar proof is
that the sets Xi are compact (as well as convex), and that for each i and
x EX, the function (6.144) of the ith block-component~ attains a unique
minimum over Xi, when all other block-components are held fixed.
The nonnegative matrix factorization problem, described in Section
1.3, is an important example where the cost function is convex as a function
of each block component, but not convex as a function of the entire set
of block components. For this problem, a special convergence result for
the case of just two block components applies, which does not require the
uniqueness of the minimum in the two block component minimizations.
This result can be shown with a variation of the proofs of Props. 6.5.1 and
6.5.2; see [GrS99], [GrSOO].
There are many variations of coordinate descent, which are aimed at im-
proved efficiency, application-specific structures, and distributed comput-
ing environments. We describe some of the possibilities here and in the
exercises, and we refer to the literature for a more detailed analysis.
(a) We may apply coordinate descent in the context of a dual problem.
This is often convenient because the dual constraint set often has the
$$x_{k+1}^i \in \arg\min_{\xi \in X_i} \Big\{ f\big(x_{k+1}^1, \ldots, x_{k+1}^{i-1}, \xi, x_k^{i+1}, \ldots, x_k^m\big) + \frac{1}{2c}\|\xi - x_k^i\|^2 \Big\},$$
$$F(x, y) = f(x) + \frac{1}{2c}\|x - y\|^2.$$
on the size of the communication delays between the processors; cf. the
terminology of Section 2.1.6. The analysis in this section is based on the
author's paper [Ber83] (see [BeT89a], [BeT91], [FrSOO] for broad surveys
of totally asynchronous algorithms). A different line of analysis applies
to partially asynchronous algorithms, for which it is necessary to have a
bound on the size of the communication delays. Such algorithms will not
be considered here; see [TBA86], [BeT89a], Chapter 7, for gradient-like
methods, [BeT89a] for network flow algorithms, [TBT90] for nonexpan-
sive iterations, and [LiW14] which focuses on coordinate descent methods,
under less restrictive conditions than the ones of the present section (a
sup-norm contraction property of the algorithmic map is not assumed).
Let us consider parallelizing a stationary fixed point algorithm by sep-
arating it into several local algorithms operating concurrently at different
processors. As we discussed in Section 2.1.6, in an asynchronous algorithm,
the local algorithms do not have to wait at predetermined points for pre-
determined information to become available. Thus some processors may
execute more iterations than others, while the communication delays be-
tween processors may be unpredictable. Another practical setting that may
be modeled well by a distributed asynchronous iteration is when all com-
putation takes place at a single computer, but any number of coordinates
may be simultaneously updated at a time, with the order of coordinate
selection possibly being random.
With this context in mind, we introduce a model of asynchronous
distributed solution of abstract fixed point problems of the form x = F(x),
where F is a given function. We represent x as x = (x1, ... , xm), where
xi E Rni with ni being some positive integer. Thus x E ~n, where n =
n1 +···+nm, and F maps ~n to ~n. We denote by Fi : ~n H ~ni
the ith component of F, so F(x) = (Fi(x), ... ,Fm(x)). Our computation
framework involves m interconnected processors, the ith of which updates
the ith component xi by applying the corresponding mapping Fi. Thus, in
a (synchronous) distributed fixed point algorithm, processor $i$ iterates at
time $t$ according to
$$x_{t+1}^i = F_i\big(x_t^1, \ldots, x_t^m\big), \qquad i = 1, \ldots, m.$$
In an asynchronous version, processor $i$ updates $x^i$ only at a subset of times
$R_i \subset \{0, 1, \ldots\}$, using component values that may be outdated because of
communication delays:
$$x_{t+1}^i = \begin{cases} F_i\big(x^1_{\tau_{i1}(t)}, \ldots, x^m_{\tau_{im}(t)}\big) & \text{if } t \in R_i, \\[4pt] x_t^i & \text{if } t \notin R_i. \end{cases} \tag{6.147}$$
Here Tij ( t) is the time at which the jth coordinate used in this update was
computed, and the difference t- Ti1 (t) is referred to as the communication
delay from j to i at time t.
We noted in Section 2.1.6 that an example of an algorithm of this
type is a coordinate descent method, where we assume that the ith scalar
coordinate is updated at a subset of times Ri C {O, 1, ... }, according to
$$x_{t+1}^i \in \arg\min_{\xi \in \Re} f\big(x^1_{\tau_{i1}(t)}, \ldots, x^{i-1}_{\tau_{i,i-1}(t)}, \xi, x^{i+1}_{\tau_{i,i+1}(t)}, \ldots, x^m_{\tau_{im}(t)}\big),$$
and is left unchanged ($x_{t+1}^i = x_t^i$) if $t \notin R_i$. Here we can assume with-
out loss of generality that each scalar coordinate is assigned to a separate
processor. The reason is that a physical processor that updates a block
of scalar coordinates may be replaced by a block of fictitious processors,
each assigned to a single scalar coordinate, and updating their coordinates
simultaneously.
To discuss the convergence of the asynchronous algorithm (6.147), we
introduce the following assumption.
(2) Box Condition: For all $k$, $S(k)$ is a Cartesian product of the form
$$S(k) = S_1(k) \times \cdots \times S_m(k),$$
where $S_i(k)$ is a subset of $\Re^{n_i}$, $i = 1, \ldots, m$.
Proof: To explain the idea of the proof, let us note that the given condi-
tions imply that updating any component xi, by applying F to x E S(k),
while leaving all other components unchanged, yields a vector in S(k).
Thus, once enough time passes so that the delays become "irrelevant,"
then after x enters S(k), it stays within S(k). Moreover, once a compo-
nent xi enters the subset Si(k) and the delays become "irrelevant," xi gets
permanently within the smaller subset Si(k + 1) at the first time that xi
is iterated on with x E S (k). Once each component xi, i = 1, ... , m, gets
within Si(k+ 1), the entire vector xis within S(k+ 1) by the box condition.
Thus the iterates from S(k) eventually get into S(k + 1) and so on, and
converge pointwise to x* in view of the assumed properties of { S(k) }.
With this idea in mind, we show by induction that for each k 2'.: 0,
there is a time $t_k$ such that:
(1) $x_t \in S(k)$ for all $t \ge t_k$.
(2) $x^j_{\tau_{ij}(t)} \in S_j(k)$ for all $j$ and all $t \in R_i$ with $t \ge t_k$, $i = 1, \ldots, m$.
[In words, after some time, all fixed point estimates will be in $S(k)$ and all
estimates used in iteration (6.147) will come from $S(k)$.]
The induction hypothesis is true for $k = 0$ since $x_0 \in S(0)$. Assuming
it is true for a given $k$, we will show that there exists a time $t_{k+1}$ with the
required properties. For each $i = 1, \ldots, m$, let $t(i)$ be the first element of
$R_i$ such that $t(i) \ge t_k$. Then by the synchronous convergence condition,
we have $F(x_{t(i)}) \in S(k+1)$, implying (in view of the box condition) that
$x^i_{t(i)+1} \in S_i(k+1)$.
Similarly, for every $t \in R_i$ with $t \ge t(i)$, we have $x_{t+1}^i \in S_i(k+1)$. Between
elements of $R_i$, $x_t^i$ does not change. Thus,
$$x_t^i \in S_i(k+1), \qquad \forall\, t \ge t(i) + 1.$$
Let $t_k' = \max_i\{t(i)\} + 1$. Then, using the box condition, we have
$$x_t \in S(k+1), \qquad \forall\, t \ge t_k'.$$
[Figure: illustration of the asynchronous convergence process, with the iterates of the two coordinates $x^1$ and $x^2$ entering the nested sets $S(0) \supset \cdots \supset S(k) \supset S(k+1) \ni x^*$; the example shown uses a weighted sup-norm $\|x\|_\infty^w = \max_{i=1,\ldots,n} |x^i|/w^i$.]
F(x)=Ax+b,
where $A$ and $b$ are a given $n \times n$ matrix and vector in $\Re^n$. Let us denote by $|A|$
the matrix whose components are the absolute values of the components
of $A$, and let $\sigma(|A|)$ denote the spectral radius of $|A|$ (the largest modulus
among the moduli of the eigenvalues of $|A|$). Then it can be shown that
$F$ is a contraction with respect to some weighted sup-norm if and only if
$\sigma(|A|) < 1$. In particular, $F$ is a contraction with respect to the (unweighted)
sup-norm $\|\cdot\|_\infty$ when
$$\sum_{j=1}^n |a_{ij}| < 1, \qquad i = 1, \ldots, n,$$
where $a_{ij}$ are the components of $A$. To see this, note that the $i$th component
of $Ax$ satisfies
$$\Big|\sum_{j=1}^n a_{ij} x^j\Big| \le \sum_{j=1}^n |a_{ij}|\,|x^j| \le \Big(\sum_{j=1}^n |a_{ij}|\Big)\|x\|_\infty,$$
so $\|Ax\|_\infty \le \rho\|x\|_\infty$, where $\rho = \max_i \sum_{j=1}^n |a_{ij}| < 1$. This shows that
$A$ (and hence also $F$) is a contraction with respect to $\|\cdot\|_\infty$. A similar
argument shows also the reverse assertion.
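The following Python sketch (our own simulation, not from the text) illustrates the totally asynchronous iteration (6.147) for the linear mapping $F(x) = Ax + b$ of this example: one randomly chosen coordinate is updated at each time using possibly outdated values of the other coordinates, and convergence to the fixed point is obtained as long as the row sums of $|A|$ are less than 1; the delay model and all names are assumptions made for the illustration.

```python
import numpy as np

def async_fixed_point(A, b, x0, steps=3000, max_delay=5, seed=0):
    """Simulate x^i_{t+1} = F_i(delayed components) for F(x) = Ax + b,
    updating a single randomly chosen coordinate at each time t."""
    rng = np.random.default_rng(seed)
    n = len(x0)
    history = [np.array(x0, float)]          # past iterates, used to model delays
    x = np.array(x0, float)
    for _ in range(steps):
        i = rng.integers(n)                  # coordinate (processor) updated now
        delays = rng.integers(0, min(max_delay, len(history)), size=n)
        x_read = np.array([history[-1 - delays[j]][j] for j in range(n)])
        x = x.copy()
        x[i] = A[i] @ x_read + b[i]          # apply the ith component mapping F_i
        history.append(x)
    return x

rng = np.random.default_rng(1)
A = rng.uniform(-1, 1, (4, 4))
A *= 0.9 / np.abs(A).sum(axis=1, keepdims=True)   # make row sums of |A| equal 0.9
b = rng.standard_normal(4)
x_star = np.linalg.solve(np.eye(4) - A, b)        # the unique fixed point
print(np.abs(async_fixed_point(A, b, np.zeros(4)) - x_star).max())
```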
We finally note a few extensions of the theorem. It is possible to
allow F to be time-varying, so in place of F we operate with a sequence
of mappings Fk, k = 0, l, .... Then if all Fk have a common fixed point,
the conclusion of the theorem holds (see [BeT89a] for details). Another
extension is to allow F to have multiple fixed points and introduce an
assumption that roughly says that $\cap_{k=0}^{\infty} S(k)$ is the set of all fixed points.
Then the conclusion is that any limit point of { Xt} is a fixed point.
6.6 GENERALIZED PROXIMAL METHODS

In this section we consider the generalized proximal algorithm
$$x_{k+1} \in \arg\min_{x \in \Re^n}\big\{ f(x) + D_k(x, x_k) \big\}, \tag{6.148}$$
where $D_k : \Re^{2n} \mapsto (-\infty, \infty]$ is a regularization term that replaces the quadratic
$$\frac{1}{2c_k}\|x - x_k\|^2$$
of the proximal algorithm of Chapter 5.
Figure 6.6.1. Illustration of the generalized proximal algorithm (6.148) for a convex cost function $f$. The regularization term is convex but need not be quadratic or real-valued. In this figure, $\gamma_k$ is the scalar by which the graph of $-D_k(\cdot, x_k)$ must be raised so that it just touches the graph of $f$.
guidelines about the kind of behavior that may be expected from the al-
gorithm when f is a closed proper convex function. In particular, under
suitable assumptions on Dk, we expect to be able to show convergence to
the optimal value and convergence to an optimal solution if one exists (cf.
Prop. 5.1.3).
where $x_k^i$ denotes the $i$th coordinate of $x_k$; see Fig. 6.6.3. Because the logarithm
is finite only for positive arguments, the algorithm requires that $x_k^i > 0$
for all i, and must generate a sequence that lies strictly within the positive
orthant. Thus the algorithm can be used only for functions f for which the
minimum above is well-defined and guaranteed to be attained within the pos-
itive orthant.
Figure 6.6.4. Illustration of the generalized proximal algorithm (6.148) for the
case of a nonconvex cost function f.
With this condition we are assured that the algorithm has a cost improve-
ment property. Indeed, we have
$$x_{k+1} = x_k \qquad \text{if } x_k \in X^*, \tag{6.153}$$
$$f(x_{k+1}) < f(x_k) \qquad \text{if } x_k \notin X^*. \tag{6.154}$$
by the Fenchel Duality Theorem (Prop. 1.2.1), there exists a dual optimal
solution $\lambda^*$ such that $-\lambda^*$ is a subgradient of $D_k(\cdot, x_k)$ at $x_k$, so that
$\lambda^* = 0$ [by Eq. (6.151)], and also $\lambda^*$ is a subgradient of $f$ at $x_k$, so that $x_k$
minimizes $f$. Note that the condition (6.153) may fail if $D_k(\cdot, x_k)$ is not
differentiable. For example, if $f(x) = \frac{1}{2}\|x\|^2$ and $D_k(x, x_k) = \frac{1}{c}\|x - x_k\|$,
then for any $c > 0$, the points $x_k \in [-1/c, 1/c]$ minimize $f(\cdot) + D_k(\cdot, x_k)$.
Simple examples can also be constructed to show that the relative interior
condition is essential to guarantee the condition (6.153).
We summarize the preceding discussion in the following proposition.
Some Examples
Let $\psi : \Re^n \mapsto (-\infty, \infty]$ be a convex function, which is differentiable within
$\mathrm{int}\big(\mathrm{dom}(\psi)\big)$, and define for all $x, y \in \mathrm{int}\big(\mathrm{dom}(\psi)\big)$,
$$D_k(x, y) = \frac{1}{c_k}\big(\psi(x) - \psi(y) - \nabla\psi(y)'(x - y)\big). \tag{6.157}$$
For example, with the choice $\psi(x) = \sum_{i=1}^n \psi_i(x^i)$, where
$$\psi_i(x^i) = \begin{cases} x^i \ln(x^i) & \text{if } x^i > 0, \\ 0 & \text{if } x^i = 0, \\ \infty & \text{if } x^i < 0, \end{cases}$$
with gradient $\nabla\psi_i(x^i) = \ln(x^i) + 1$, we obtain from Eq. (6.157) the function
$$D_k(x, y) = \frac{1}{c_k}\sum_{i=1}^n \big(x^i \ln(x^i/y^i) - x^i + y^i\big).$$
Except for the constant term $\frac{1}{c_k}\sum_{i=1}^n y^i$, which is inconsequential since it
does not depend on $x$, this is the regularization function that is used in the
entropy minimization algorithm of Example 6.6.1.
Note that because of the convexity of $\psi$, the condition (6.151) holds.
Furthermore, because of the differentiability of $D_k(\cdot, x_k)$ (a consequence of the
differentiability of $\psi$), the condition (6.153) holds as well when $f$ is convex.
minimize f(x)
subject to $\ x \in X$, $\ g_1(x) \le 0, \ldots, g_r(x) \le 0$,
where $f, g_1, \ldots, g_r : \Re^n \mapsto \Re$ are convex functions, and $X$ is a closed convex
set. Consider also the corresponding primal and dual functions
We assume that p is closed, so that there is no duality gap, and except for sign
changes, q and p are conjugates of each other [i.e., p( u) is equal to ( -q) * ( -u);
cf. Section 4.2 in Appendix BJ.
Let us consider the entropy minimization algorithm of Example 6.6.1,
applied to maximization over µ 2: 0 of the dual function. It is given by
where µi andµ{ denote the jth coordinates ofµ and µk, respectively, and it
corresponds to the case
(6.159)
and from the optimality condition of the Fenchel Duality Theorem (Prop.
1.2.1), the primal optimal solution of Eq. (6.158) is given by
(6.160)
To calculate $u_{k+1}$, we first note that the conjugate of the entropy function
$$\phi(x) = \begin{cases} x\big(\ln(x) - 1\big) & \text{if } x > 0, \\ 0 & \text{if } x = 0, \\ \infty & \text{if } x < 0, \end{cases}$$
is the exponential function $\phi^*(u) = e^u$ [to see this, simply calculate $\sup_u\{xu - e^u\}$,
the conjugate of the exponential $e^u$, and show that it is equal to $\phi(x)$].
Thus the conjugate of the function
is equal to
Since we have
(6.161)
[Figure: the exponential penalty term $\tfrac{1}{c}\,\mu\, e^{c g}$ plotted as a function of the constraint level $g$.]
It can be seen that $u_{k+1} = g(x_k)$, where $x_k$ is obtained through the mini-
mization
$$x_k \in \arg\min_{x \in X}\Big\{ f(x) + \frac{1}{c_k}\sum_{j=1}^r \mu_k^j\, e^{c_k g_j(x)} \Big\}. \tag{6.162}$$
From Eqs. (6.160) and (6.161), and the fact $u_{k+1} = g(x_k)$, it follows that the
corresponding multiplier iteration is
$$\mu_{k+1}^j = \mu_k^j\, e^{c_k g_j(x_k)}, \qquad j = 1, \ldots, r. \tag{6.163}$$
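For illustration, the following Python sketch applies the iteration (6.162)-(6.163) to a small convex problem of our own choosing (the problem data, names, and the use of scipy for the inner minimization are assumptions, not part of the text); the multiplier approaches its optimal value $\mu^* = 2$.

```python
import numpy as np
from scipy.optimize import minimize

# minimize (x1-1)^2 + (x2-2)^2  subject to  g(x) = x1 + x2 - 1 <= 0;
# optimal solution (0, 1), optimal multiplier mu* = 2 (hypothetical example).
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2
g = lambda x: x[0] + x[1] - 1.0

c, mu, x = 1.0, 1.0, np.zeros(2)
for k in range(30):
    # inner minimization (6.162): f(x) + (1/c) * mu * exp(c * g(x))
    x = minimize(lambda y: f(y) + (mu / c) * np.exp(c * g(y)), x).x
    mu *= np.exp(c * g(x))                  # exponential multiplier update (6.163)
print(np.round(x, 4), round(mu, 4))         # approximately (0, 1) and 2
```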
(6.164)
we have
Mk(x, Xk) = f(x) + Dk(X,Xk),
so the algorithm (6.164) can be written in the generalized proximal format
(6.156). Moreover the condition (6.166) is equivalent to the condition (6.151)
that guarantees cost improvement, which is strict assuming also that
Xk EX*, (6.167)
where X* is the set of desirable points for convergence, cf. Eq. (6.153) and
Prop. 6.6.1.
As an example, consider the problem of unconstrained minimization of
the function
$$f(x) = R(x) + \|Ax - b\|^2,$$
where $A$ is an $m \times n$ matrix, $b$ is a vector in $\Re^m$, and $R : \Re^n \mapsto \Re$ is a
nonnegative-valued convex regularization function. Let $D$ be any symmetric
matrix such that $D - A'A$ is positive definite (for example $D$ may be a
sufficiently large multiple of the identity). Let us define
$$M(x, y) = R(x) + \|Ay - b\|^2 + 2(x - y)'A'(Ay - b) + (x - y)'D(x - y),$$
and note that $M$ satisfies the condition $M(x, x) = f(x)$ [cf. Eq. (6.165)], as
well as the condition $M(x, x_k) \ge f(x)$ for all $x$ and $k$ [cf. Eq. (6.166)], in view
of the calculation
$$M(x, x_k) - f(x) = (x - x_k)'(D - A'A)(x - x_k) \ge 0.$$
When Dis the identity matrix I, by scaling A, we can make the matrix
I - A' A positive definite, and from Eq. (6.168), we have
(6.170)
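As a small numerical illustration of this majorization scheme (our construction, under the stated assumption that $I - A'A$ is positive definite), take $R(x) = \lambda\|x\|_1$ and $D = I$; then the minimization of $M(\cdot, x_k)$ separates across coordinates and reduces to a soft-thresholding step.

```python
import numpy as np

def soft(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def mm_l1(A, b, lam, iters=500):
    """Majorization-minimization for f(x) = lam*||x||_1 + ||Ax - b||^2 with
    D = I; each step minimizes M(., x_k), i.e., a soft-thresholded gradient step."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - A.T @ (A @ x - b)            # gradient step on the quadratic part
        x = soft(z, lam / 2.0)               # exact minimization of M(., x_k)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
A /= 1.1 * np.linalg.norm(A, 2)              # scale A so that I - A'A > 0
b = A @ np.concatenate([np.ones(3), np.zeros(47)])
print(np.round(mm_l1(A, b, lam=0.05)[:6], 3))
```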
$$\phi\big(x^i - x_k^i\big) = \frac{1}{p}\,\big|x^i - x_k^i\big|^p. \tag{6.171}$$
We will aim to show that while the algorithm has satisfactory convergence
properties for all p > l, it attains a superlinear convergence rate, provided
p is larger than the order of growth of f around the optimum. This occurs
under natural conditions, even when Ck is kept constant - an old result, first
obtained in [KoB76] (see also [Ber82a], Section 5.4, and [BeT94a], which we
will follow in the subsequent derivation).
We assume that $f : \Re^n \mapsto (-\infty, \infty]$ is a closed convex function with a
nonempty set of minima, denoted $X^*$. We also assume that for some scalars
$\beta > 0$, $\delta > 0$, and $\gamma > 1$, we have
$$f^* + \beta\big(d(x)\big)^\gamma \le f(x), \qquad \forall\, x \in \Re^n \text{ with } d(x) \le \delta, \tag{6.172}$$
where $f^* = \min_{x \in \Re^n} f(x)$ and
$$d(x) = \min_{x^* \in X^*} \|x - x^*\|, \qquad x \in \Re^n. \tag{6.174}$$
From the form of the proximal minimization (6.169)-(6.170), and using Eqs.
(6.173), (6.174), we have for all k large enough so that lxl+i -xll ~ 8,
~ f(xk) +-
1 I:n ¢,(x1. - x1). - f*
Ck
i=l
=-
1 I:n ¢,(x1. - x1).
.
Ck
i=l
(6.175)
n
Also from the growth assumption (6.172), we have for all k large enough so
that d(xk) ~ 8,
(6.176)
minimize f(x)
subject to x E X,
$$x_{k+1} \in \arg\min_{x \in X}\Big\{ \nabla f(x_k)'(x - x_k) + \frac{1}{2\alpha_k}\|x - x_k\|^2 \Big\};$$
cf. Prop. 6.1.4. In this form the method resembles the proximal algorithm,
the difference being that f(x) is replaced by its linearized version
and the stepsize O'.k plays the role of the penalty parameter.
If we also replace the quadratic
One advantage of this method is that using in place of $f$ its linearization may
simplify the minimization above for a problem with special structure.
As an example, consider the minimization of $f(x)$ over the unit simplex
$$X = \Big\{x \ge 0 \ \Big|\ \sum_{i=1}^n x^i = 1\Big\}.$$
A special case of the mirror descent method, called entropic descent, uses the
entropy regularization function of Example 6.6.1 and has the form
$$x_{k+1} \in \arg\min_{x \in X}\Big\{ \nabla f(x_k)'(x - x_k) + \frac{1}{\alpha_k}\sum_{i=1}^n x^i \ln\big(x^i/x_k^i\big) \Big\},$$
where $\nabla_i f(x_k)$ are the components of $\nabla f(x_k)$. It can be verified that this
minimization can be done in closed form as follows:
$$x_{k+1}^i = \frac{x_k^i\, e^{-\alpha_k \nabla_i f(x_k)}}{\sum_{j=1}^n x_k^j\, e^{-\alpha_k \nabla_j f(x_k)}}, \qquad i = 1, \ldots, n.$$
Thus it involves less overhead per iteration than the corresponding gradient
projection iteration, which requires projection on the unit simplex, as well as
the corresponding proximal iteration.
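A minimal Python sketch of the entropic descent iteration above (our own illustration; the objective and stepsize are hypothetical) shows how cheap each iteration is: a componentwise multiplicative update followed by a normalization.

```python
import numpy as np

def entropic_descent(grad_f, x0, alpha, iters=300):
    """Mirror descent over the unit simplex with entropy regularization:
    x_{k+1}^i proportional to x_k^i * exp(-alpha * grad_i f(x_k))."""
    x = np.array(x0, float)
    for _ in range(iters):
        g = grad_f(x)
        w = x * np.exp(-alpha * (g - g.min()))   # the shift cancels after normalization
        x = w / w.sum()
    return x

# Example: minimize 0.5*||x - p||^2 over the simplex; the solution is the
# projection of p onto the simplex, here (0.85, 0.15, 0).
p = np.array([0.8, 0.1, -0.2])
print(np.round(entropic_descent(lambda x: x - p, np.ones(3) / 3, alpha=0.5), 3))
```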
When $f$ is differentiable, the convergence properties of mirror descent
are similar to those of the gradient projection method, although depending
on the problem at hand and the nature of $D_k(x, x_k)$ the analysis may be
more complicated. When $f$ is nondifferentiable, an analysis similar to the
one for the subgradient projection method may be carried out; see [BeT03].
For extensions and further analysis of the method, we refer to the surveys
[JuN11a], [JuN11b], and the references quoted there.
In this section we return to the idea of cost function descent for nondiffer-
entiable cost functions, which we discussed in Section 2.1.3. We noted there
the theoretical difficulties around the use of the steepest descent direction,
which is obtained by projection of the origin on the subdifferential. In
this section we focus on the E-subdifferential, aiming at theoretically more
sound descent algorithms. We subsequently use these algorithms in an un-
usual way: to obtain a strong duality analysis for the extended monotropic
programming problem that we discussed in Section 4.4, in connection with
generalized polyhedral approximation.
6.7.1 ε-Subgradients

and that
$$\bigcap_{\epsilon \downarrow 0} \partial_\epsilon f(x) = \partial f(x).$$
We will now discuss in more detail the properties of E-subgradients, with a
view towards using them in cost function descent algorithms.
we see from Eq. (6.178) that $\partial f(x)$ can be characterized as the $0$-level set
of the conjugate function $f_x^*$:
$$\partial f(x) = \big\{g \mid f_x^*(g) \le 0\big\}. \tag{6.179}$$
It follows from Eqs. (6.179) and (6.180) that for every $x \in \mathrm{dom}(f)$, there
are two cases of interest:
(a) $(\mathrm{cl}\, f)(x) = f(x)$. Then, we have
$$\partial f(x) = \arg\min_{g \in \Re^n} f_x^*(g) = \big\{g \mid f_x^*(g) = 0\big\},$$
$$\partial_\epsilon f(x) = \big\{g \mid f_x^*(g) \le \epsilon\big\}.$$
Proposition 6.7.1: Let $f : \Re^n \mapsto (-\infty, \infty]$ be a proper convex function
and let $\epsilon$ be a positive scalar. For every $x \in \mathrm{dom}(f)$, the following hold:
(a) The $\epsilon$-subdifferential $\partial_\epsilon f(x)$ is a closed convex set.
(b) If $(\mathrm{cl}\, f)(x) = f(x)$, then $\partial_\epsilon f(x)$ is nonempty and its support
function is given by
$$\sigma_{\partial_\epsilon f(x)}(d) = \sup_{g \in \partial_\epsilon f(x)} d'g = \inf_{\alpha > 0} \frac{f(x + \alpha d) - f(x) + \epsilon}{\alpha}, \qquad d \in \Re^n.$$
[Figure: illustration of the function $F_d(\alpha) = f(x + \alpha d)$. As Prop. 6.7.1(b) shows, the support function value $\sigma_{\partial_\epsilon f(x)}(d)$ is the minimal slope of the lines that support the graph of $F_d$ and pass through $\big(0, f(x) - \epsilon\big)$.]
Proof: (a) We have shown that $\partial_\epsilon f(x)$ is the $\epsilon$-level set of the function $f_x^*$
[cf. Eq. (6.180)]. Since $f_x^*$ is closed and convex, being a conjugate function,
$\partial_\epsilon f(x)$ is closed and convex.
(b) By Eqs. (6.180) and (6.181), $\partial_\epsilon f(x)$ is the $\epsilon$-level set of $f_x^*$, while
$$\inf_{g \in \Re^n} f_x^*(g) = 0.$$
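As a simple numerical check of the support function formula of Prop. 6.7.1(b) (our own illustration), take $f(x) = |x|$ in one dimension with $x > 0$; then $\partial_\epsilon f(x)$ is the interval $[\max(-1,\, 1 - \epsilon/x),\, 1]$, and the infimum over $\alpha$ in the formula can be approximated on a grid and compared with the directly computed value of $\sup_g d\,g$ over that interval.

```python
import numpy as np

x, eps = 1.0, 0.5
f = lambda t: abs(t)

def support_formula(d, alphas=np.linspace(1e-3, 200.0, 400001)):
    # inf over alpha > 0 of (f(x + alpha*d) - f(x) + eps) / alpha (grid approximation)
    return np.min((np.abs(x + alphas * d) - f(x) + eps) / alphas)

lo, hi = max(-1.0, 1.0 - eps / x), 1.0        # eps-subdifferential of |.| at x > 0
for d in (1.0, -1.0):
    print(d, max(d * lo, d * hi), round(support_formula(d), 3))
```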
6.7.2 ε-Descent Method
We will now discuss an iterative cost function descent algorithm that uses
ε-subgradients. Let $f : \Re^n \mapsto (-\infty, \infty]$ be a proper convex function to be
minimized.
We say that a direction $d$ is an ε-descent direction at $x \in \mathrm{dom}(f)$,
where $\epsilon$ is a positive scalar, if
$$\inf_{\alpha > 0} f(x + \alpha d) < f(x) - \epsilon,$$
is an E-descent direction.
Figure 6.7.2. Illustration of the connection between $\partial_\epsilon f(x)$ and ε-descent directions [cf. Eq. (6.182)]. In the figure on the left, we have
$$\inf_{z \in \Re^n} f(z) \ge f(x) - \epsilon,$$
or equivalently, that the horizontal hyperplane [normal $(0, 1)$] that passes through $\big(x, f(x) - \epsilon\big)$ contains the epigraph of $f$ in its upper halfspace, or equivalently, that $0 \in \partial_\epsilon f(x)$. In this case there is no ε-descent direction. In the figure on the right, $d$ is an ε-descent direction because the slope shown is negative [cf. Prop. 6.7.1(b)].
Proof: (a) By definition, $0 \in \partial_\epsilon f(x)$ if and only if $f(z) \ge f(x) - \epsilon$ for all
$z \in \Re^n$, which is equivalent to $\inf_{z \in \Re^n} f(z) + \epsilon \ge f(x)$.
(b) If there exists an ε-descent direction, then by part (a), we have $0 \notin \partial_\epsilon f(x)$.
Conversely, assume that $0 \notin \partial_\epsilon f(x)$. The vector $\bar g$ is the projection
of the origin on the closed convex set $\partial_\epsilon f(x)$, which is nonempty in view
of the assumption $(\mathrm{cl}\, f)(x) = f(x)$ [cf. Prop. 6.7.1(b)]. By the Projection
Theorem (Prop. 1.1.9 in Appendix B),
$$(g - \bar g)'(0 - \bar g) \le 0, \qquad \forall\, g \in \partial_\epsilon f(x),$$
or
$$\sup_{g \in \partial_\epsilon f(x)} (-\bar g)'g \le -\|\bar g\|^2 < 0,$$
where the last inequality follows from the hypothesis $0 \notin \partial_\epsilon f(x)$. By Eq.
(6.182), this implies that $-\bar g$ is an ε-descent direction. Q.E.D.
(6.183)
Since $\partial_{m\epsilon} f(x)$ is closed, this proves the right-hand side of Eq. (6.184).
To prove the left-hand side of Eq. (6.184), assume, to arrive at a
contradiction, that there exists a $\bar g \in \partial_\epsilon f(x)$ such that
$$d'(g_1 + \cdots + g_m) < b < d'\bar g, \qquad \forall\, g_1 \in \partial_\epsilon f_1(x), \ldots, g_m \in \partial_\epsilon f_m(x).$$
Then, using Prop. 6.7.1(b), there exist scalars $\alpha_1 > 0, \ldots, \alpha_m > 0$ such that
$$\frac{f_1(x + \alpha_1 d) - f_1(x) + \epsilon}{\alpha_1} + \cdots + \frac{f_m(x + \alpha_m d) - f_m(x) + \epsilon}{\alpha_m} < d'\bar g, \tag{6.185}$$
and let
$$\alpha = \frac{1}{1/\alpha_1 + \cdots + 1/\alpha_m}.$$
By the convexity of $f_i$, the ratio $\big(f_i(x + \alpha d) - f_i(x)\big)/\alpha$ is monotonically
nondecreasing in $\alpha$. Thus, since $\alpha_i \ge \alpha$, we have
Since $\bar g \in \partial_\epsilon f(x)$, this contradicts Prop. 6.7.1(b), and proves the left-hand
side of Eq. (6.184). Q.E.D.
The potential lack of closure of the set $\partial_\epsilon f_1(x) + \cdots + \partial_\epsilon f_m(x)$ indicates
a practical difficulty in implementing the method. In particular, in order
to find an ε-descent direction one will ordinarily minimize $\|g_1 + \cdots + g_m\|$
over $g_i \in \partial_\epsilon f_i(x)$, $i = 1, \ldots, m$, but an optimal solution to this problem
may not exist. Thus, it may be difficult to check computationally whether
$$0 \in \mathrm{cl}\big(\partial_\epsilon f_1(x) + \cdots + \partial_\epsilon f_m(x)\big),$$
which is the test for $m\epsilon$-optimality of $x$. The closure of the vector sum
$\partial_\epsilon f_1(x) + \cdots + \partial_\epsilon f_m(x)$ may be guaranteed under various assumptions (e.g.,
the ones given in Section 1.4 of Appendix B; see also Section 6.7.4).
One may use Prop. 6.7.3 to approximate $\partial_\epsilon f(x)$ in cases where $f$ is
the sum of convex functions whose ε-subdifferential is easily computed or
approximated. The following is an illustrative example.
minimize L f;(xi)
i=l
subject to x E P,
P = P1 n · · · n Pr,
with
j = 1, ... ,r,
for some vectors a1 and scalars b1 . We can write this problem as
subject to x E Rn,
which is an interval in ar. Similarly, it can be seen that 8,fi (Xi) is a compact
interval of the ith axis. Thus
n m
i=l i=l
subject to x E S,
where
$$x \;\stackrel{\text{def}}{=}\; (x_1, \ldots, x_m)$$
is a vector in $\Re^{n_1 + \cdots + n_m}$, with components $x_i \in \Re^{n_i}$, $i = 1, \ldots, m$, and
$f_i : \Re^{n_i} \mapsto (-\infty, \infty]$ is a closed proper convex function for each $i$, while
$S$ is a subspace of $\Re^{n_1 + \cdots + n_m}$.
The dual problem was derived in Section 4.4. It has the form
m
where the nonzero element in (0, ... , 0, Ai, 0, ... , 0) is in the ith position.
The following proposition gives conditions for strong duality.
i=l i=l
where $\delta_S$ is the indicator function of $S$, for which $\partial_\epsilon \delta_S(x) = S^\perp$ for all
$x \in S$ and $\epsilon > 0$. In this method, we start with a vector $x_0 \in X$, and
we generate a sequence $\{x_k\} \subset X$. At the $k$th iteration, given the current
iterate $x_k$, we find the vector of minimum norm $w_k$ on the set $T(x_k, \epsilon)$
(which is closed by assumption). If $w_k = 0$ the method stops, verifying
that $0 \in \partial_{(m+1)\epsilon} f(x_k)$ [cf. the right side of Eq. (6.184)]. If $w_k \ne 0$, we
generate a vector $x_{k+1} \in X$ of the form $x_{k+1} = x_k - \alpha_k w_k$, satisfying
$$f(x_{k+1}) < f(x_k) - \epsilon;$$
such a vector is guaranteed to exist, since $0 \notin T(x_k, \epsilon)$ and hence $0 \notin \partial_\epsilon f(x_k)$
by Prop. 6.7.3. Since $f(x_k) \ge f^*$ and at the current stage of the
proof we have assumed that $f^* > -\infty$, the method must stop at some
iteration with a vector $\bar x = (\bar x_1, \ldots, \bar x_m)$ such that $0 \in T(\bar x, \epsilon)$. Thus some
vector in $\partial_\epsilon f_1(\bar x) + \cdots + \partial_\epsilon f_m(\bar x)$ must belong to $S^\perp$. In view of Eq. (6.188),
it follows that there must exist vectors
$$\lambda_i \in \partial_\epsilon f_i(\bar x_i), \qquad i = 1, \ldots, m,$$
such that
$$\lambda = (\lambda_1, \ldots, \lambda_m) \in S^\perp.$$
From the definition of an ε-subgradient we have
$$f_i(\bar x_i) \le \lambda_i'\bar x_i - f_i^*(\lambda_i) + \epsilon, \qquad i = 1, \ldots, m,$$
and by adding over $i$, and using the fact $\bar x \in S$ and $\lambda \in S^\perp$, we obtain
$$\sum_{i=1}^m f_i(\bar x_i) \le -\sum_{i=1}^m f_i^*(\lambda_i) + m\epsilon.$$
Since $\bar x$ is primal feasible and $-\sum_{i=1}^m f_i^*(\lambda_i)$ is the dual value at $\lambda$, it follows
that
$$f^* \le q^* + m\epsilon.$$
Taking the limit as $\epsilon \to 0$, we obtain $f^* \le q^*$, and using also the weak
duality relation $q^* \le f^*$, we obtain $f^* = q^*$. Q.E.D.
We now delineate some special cases where the assumptions for strong
EMP duality of Prop. 6.7.4 are satisfied. We first note that, in view of
Eq. (6.188), the ε-subdifferential of $f_i$ viewed as a function of the entire vector
$x$ is compact if $\partial_\epsilon f_i(x_i)$ is compact, and it is polyhedral if $\partial_\epsilon f_i(x_i)$ is
polyhedral. Since the vector sum of a compact set and a polyhedral set is
closed (see the discussion at the end of Section 1.4 of Appendix B), it follows
that if each of the sets $\partial_\epsilon f_i(x_i)$ is either compact or polyhedral, then
$T(x, \epsilon)$ is closed, and by Prop. 6.7.4, we have $q^* = f^*$. Furthermore, from
Prop. 5.4.1 of Appendix B, $\partial f_i(x_i)$ and hence also $\partial_\epsilon f_i(x_i)$ is compact if
$x_i \in \mathrm{int}\big(\mathrm{dom}(f_i)\big)$ (as in the case where $f_i$ is real-valued). Moreover,
$\partial_\epsilon f_i(x_i)$ is polyhedral if $f_i$ is polyhedral [being the level set of a polyhedral
function, cf. Eq. (6.180)]. There are some other interesting special cases where
$\partial_\epsilon f_i(x_i)$ is polyhedral, as we now describe.
One such special case is when $f_i$ depends on a single scalar component
of $x$, as in the case of a monotropic programming problem. The following
definition introduces a more general case.
$$h(x) = \bar h(a'x),$$
where $a$ is a vector in $\Re^n$ and $\bar h : \Re \mapsto (-\infty, \infty]$ is a scalar closed
proper convex function.
Thus,
Proposition 6.7.8:
(a) The conjugate of an essentially one-dimensional function is a do-
main one-dimensional function such that the affine hull of its
domain is a subspace.
(b) The conjugate of a domain one-dimensional function is the sum
of an essentially one-dimensional function and a linear function.
$$h(x) = \bar h(a'x),$$
so $h^*(\lambda) = \bar h^*(a'\lambda)$, where $\bar h^*$ is the conjugate of the scalar function $\bar h(\gamma) = h(\gamma a)$.
Since $\bar h$ is closed proper convex, the same is true for $\bar h^*$, and it follows
that $h^*$ is essentially one-dimensional. Finally, consider the case where
$b \ne 0$. Then we use a translation argument and write $h(x) = \hat h(x - b)$,
where $\hat h$ is a function such that the affine hull of its domain is the subspace
spanned by $a$. The conjugate of $\hat h$ is essentially one-dimensional (by the
preceding argument), and the conjugate of $h$ is obtained by adding $b'\lambda$ to
it. Q.E.D.
We now turn to the dual problem, and derive a duality result that is
analogous to the one of Prop. 6.7.7. We say that a function is co-finite if
its conjugate is real-valued. If we apply Prop. 6. 7. 7 to the dual problem
(6.187), we obtain the following.
Proof: This is a consequence of Props. 6.7.7 and 6.7.9, and the fact that
when ni = l, the functions Ji and Qi are essentially one-dimensional. Ap-
plying Prop. 6.7.7 to the primal problem, shows that q* = f* under the
hypothesis that the primal problem is feasible. Applying Prop. 6.7.9 to
the dual problem, shows that q* = f * under the hypothesis that the dual
problem is feasible. Q.E.D.
minimize $\ f(x)$
subject to $\ x \in X$, $\ g_j(x) \le 0, \quad j = 1, \ldots, r$,   (6.190)
Examples are the logarithmic barrier function
$$B(x) = -\sum_{j=1}^r \ln\big(-g_j(x)\big),$$
and the inverse barrier function
$$B(x) = -\sum_{j=1}^r \frac{1}{g_j(x)}.$$
Note that both of these functions are convex since the constraints gj are
convex. Figure 6.8.1 illustrates the form of B(x).
Figure 6.8.1. Form of a barrier function. The barrier term $\epsilon B(x)$ tends to zero
for all interior points $x \in S$ as $\epsilon \to 0$.
$$x_k \in \arg\min_{x \in S}\big\{ f(x) + \epsilon_k B(x) \big\}, \qquad k = 0, 1, \ldots. \tag{6.191}$$
Since the barrier function is defined only on the interior set $S$, the successive
iterates of any method used for this minimization must be interior points.
If $X = \Re^n$, one may use unconstrained methods such as Newton's
method with the stepsize properly selected to ensure that all iterates lie in
The vector $\bar x$ is a feasible point of the original problem (6.190), since $x_k \in S$
and $X$ is a closed set. If $\bar x$ were not a global minimum, there would exist a
feasible vector $x^*$ such that $f(x^*) < f(\bar x)$, and therefore also [since by the
Line Segment Principle (Prop. 1.3.1 in Appendix B) $x^*$ can be approached
arbitrarily closely through the interior set $S$] an interior point $\tilde x \in S$ such
that $f(\tilde x) < f(\bar x)$. We now have, by the definition of $x_k$,
$$f(x_k) + \epsilon_k B(x_k) \le f(\tilde x) + \epsilon_k B(\tilde x), \qquad k = 0, 1, \ldots,$$
which by taking the limit as $k \to \infty$ and $k \in \mathcal K$, implies together with Eq.
(6.192), that $f(\bar x) \le f(\tilde x)$. This is a contradiction, thereby proving that $\bar x$
is a global minimum of the original problem. Q.E.D.
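The following Python sketch implements the barrier iteration (6.191) for the two-dimensional example of Fig. 6.8.2 (the cost $f(x) = (x^1)^2 + (x^2)^2$ is our own choice consistent with the stated optimal solution $(2,0)$; the use of scipy's Nelder-Mead for the inner minimization is also an assumption of the sketch, not a recommendation of the text).

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] ** 2 + x[1] ** 2
# logarithmic barrier for the constraint 2 <= x1; +inf outside the interior set S
B = lambda x: -np.log(x[0] - 2.0) if x[0] > 2.0 else np.inf

x, eps = np.array([3.0, 1.0]), 1.0            # strictly feasible starting point
for k in range(6):
    # inner minimization of f + eps*B over S, warm-started at the previous x_k
    x = minimize(lambda y: f(y) + eps * B(y), x, method="Nelder-Mead").x
    eps *= 0.1                                # decrease the barrier parameter
print(np.round(x, 3))                         # approaches the solution (2, 0)
```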
Figure 6.8.2. Illustration of the level sets of the barrier-augmented cost function,
and the convergence process of the barrier method for the problem
subject to $2 \le x^1$,
with optimal solution $x^* = (2, 0)$. For the case of the logarithmic barrier function
$B(x) = -\ln(x^1 - 2)$, we have
successive iterates lying in the interior of the constraint set. These methods
are generically referred to as interior point methods, and have been exten-
sively applied to linear, quadratic, and conic programming problems. The
logarithmic barrier function has been central in many of these methods.
In the next two sections we will discuss a few methods that are designed
for problems with special structure. In particular, in Section 6.8.1 we will
discuss in some detail primal-dual methods for linear programming, one
of the most popular methods for solving linear programs. In Section 6.8.2
we will address briefly interior point methods for conic programming prob-
lems. In Section 6.8.3 we will combine the cutting plane and interior point
approaches.
minimize $\ c'x$
subject to $\ Ax = b$, $\ x \ge 0$,   (LP)

maximize $\ b'\lambda$
subject to $\ A'\lambda \le c$.   (DP)

where
$$F_\epsilon(x) = c'x - \epsilon\sum_{i=1}^n \ln x^i,$$
(b) For every k, the pair ( Xk, Ak) is such that Xk is an interior point of
the positive orthant, i.e., Xk > 0, while Ak is an interior point of the
dual feasible region, i.e.,
c-A'..\k > 0.
(6.194)
$$z_k = c - A'\lambda_k.$$
where $x^{-1}$ denotes the vector with components $(x^i)^{-1}$. Let $z$ be the vector
of slack variables
$$z = c - A'\lambda.$$
Note that $\lambda$ is dual feasible if and only if $z \ge 0$.
Using the vector $z$, we can write the first condition of Eq. (6.195) as
$z - \epsilon x^{-1} = 0$ or, equivalently, $XZe = \epsilon e$, where $X$ and $Z$ are the diagonal
$$X = \begin{pmatrix} x^1 & 0 & \cdots & 0\\ 0 & x^2 & \cdots & 0\\ \vdots & & \ddots & \vdots\\ 0 & 0 & \cdots & x^n \end{pmatrix}, \qquad
Z = \begin{pmatrix} z^1 & 0 & \cdots & 0\\ 0 & z^2 & \cdots & 0\\ \vdots & & \ddots & \vdots\\ 0 & 0 & \cdots & z^n \end{pmatrix}, \qquad
e = \begin{pmatrix} 1\\ 1\\ \vdots\\ 1 \end{pmatrix}.$$
Note also that if $\alpha = 1$, i.e., a pure Newton step is used, $x(\alpha, \epsilon)$ is primal
feasible, since from Eq. (6.201) we have $A(x + \Delta x) = b$.
We will now evaluate the changes in the constraint violation and the merit
function (6.194) induced by the Newton iteration.
By using Eqs. (6.199) and (6.201), the new constraint violation is
given by
for feasible $x_k$, has been suggested as a good practical rule. Usually, when
$x_k$ has already become feasible, $\alpha_k$ is chosen as $\theta\bar\alpha_k$, where $\theta$ is a factor very
close to 1 (say 0.999), and $\bar\alpha_k$ is the maximum stepsize $\alpha$ that guarantees
that $x(\alpha, \epsilon_k) \ge 0$ and $z(\alpha, \epsilon_k) \ge 0$.
When $x_k$ is not feasible, the choice of $\alpha_k$ must also be such that the merit
function is improved. In some works, a different stepsize for the $x$ update
than for the $(\lambda, z)$ update has been suggested. The stepsize for the $x$
update is near the maximum stepsize $\alpha$ that guarantees $x(\alpha, \epsilon_k) \ge 0$, and
the stepsize for the $(\lambda, z)$ update is near the maximum stepsize $\alpha$ that
guarantees $z(\alpha, \epsilon_k) \ge 0$.
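To make the mechanics concrete, here is a minimal Python sketch (our own implementation outline, with hypothetical names) of a primal-dual step: it solves the Newton system for the perturbed conditions $Ax = b$, $A'\lambda + z = c$, $XZe = \epsilon e$ via the normal equations, applies the fraction-to-boundary rule with $\theta = 0.999$ as discussed above, and then reduces $\epsilon$.

```python
import numpy as np

def frac_to_boundary(v, dv, theta=0.999):
    mask = dv < 0
    return 1.0 if not mask.any() else min(1.0, theta * np.min(-v[mask] / dv[mask]))

def primal_dual_lp(A, b, c, x, lam, z, eps=1.0, sigma=0.2, iters=25):
    """Basic primal-dual interior point iteration for (LP)-(DP)."""
    e = np.ones(len(x))
    for _ in range(iters):
        r_p = b - A @ x                       # primal feasibility residual
        r_d = c - A.T @ lam - z               # dual feasibility residual
        r_c = eps * e - x * z                 # centering residual
        d = x / z
        M = A @ (d[:, None] * A.T)            # normal equations matrix A Z^{-1} X A'
        dlam = np.linalg.solve(M, r_p + A @ (d * r_d - r_c / z))
        dz = r_d - A.T @ dlam
        dx = (r_c - x * dz) / z
        a = min(frac_to_boundary(x, dx), frac_to_boundary(z, dz))
        x, lam, z = x + a * dx, lam + a * dlam, z + a * dz
        eps *= sigma                          # decrease the barrier parameter
    return x, lam, z

# Example: minimize x1 + 2*x2 subject to x1 + x2 = 1, x >= 0 (optimum (1, 0), lam* = 1)
A, b, c = np.array([[1.0, 1.0]]), np.array([1.0]), np.array([1.0, 2.0])
x, lam, z = primal_dual_lp(A, b, c, np.array([0.5, 0.5]), np.zeros(1), c.copy())
print(np.round(x, 4), np.round(lam, 4))
```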
There are a number of additional practical issues related to imple-
mentation, for which we refer to the specialized literature. We refer to
the research monographs [Wri97], [Ye97], and other sources for a detailed
discussion, as well as extensions to nonlinear/ convex programming prob-
lems, such as quadratic programming. There are also more sophisticated
implementations of the Newton/primal-dual idea, one of which we describe
next.
Predictor-Corrector Variants
We will now discuss some modified versions of the preceding interior point
methods, which are based on a variation of Newton's method where the
Hessian is evaluated periodically every q > 1 iterations in order to econo-
mize in iteration overhead. When q = 2 and the problem is to solve the
system $h(x) = 0$, where $h : \Re^n \mapsto \Re^n$, this variation of Newton's method
takes the form
$$\bar x_k = x_k - \big(\nabla h(x_k)'\big)^{-1} h(x_k), \tag{6.210}$$
$$\Delta z + A'\Delta\lambda = 0, \tag{6.218}$$
with $v$ defined by
where $\hat\Delta X$ and $\hat\Delta Z$ are the diagonal matrices corresponding to the first directions $\hat\Delta x$ and $\hat\Delta z$,
respectively. Here $\epsilon$ and $\hat\epsilon$ are the barrier parameters corresponding to the
two iterations.
The composite Newton direction is
$$\Delta x = \hat\Delta x + \tilde\Delta x, \qquad \Delta z = \hat\Delta z + \tilde\Delta z, \qquad \Delta\lambda = \hat\Delta\lambda + \tilde\Delta\lambda,$$
and the corresponding iteration is
$$x(\alpha, \epsilon) = x + \alpha\,\Delta x, \qquad \lambda(\alpha, \epsilon) = \lambda + \alpha\,\Delta\lambda, \qquad z(\alpha, \epsilon) = z + \alpha\,\Delta z,$$
where $\alpha$ is a stepsize such that $0 < \alpha \le 1$ and
We now discuss briefly interior point methods for the conic programming
problems discussed in Section 1.2. Consider first the second order cone
problem
minimize $\ c'x$
subject to $\ A_i x - b_i \in C_i, \qquad i = 1, \ldots, m$,   (6.226)
where $x \in \Re^n$, $c$ is a vector in $\Re^n$, and for $i = 1, \ldots, m$, $A_i$ is an $n_i \times n$
matrix, $b_i$ is a vector in $\Re^{n_i}$, and $C_i$ is the second order cone of $\Re^{n_i}$. We
approximate this problem with
$$\text{minimize}\quad c'x + \epsilon_k\sum_{i=1}^m B_i(A_i x - b_i) \qquad \text{subject to } x \in \Re^n,$$
where $B_i$ is a function defined in the interior of the second order cone $C_i$,
and given by
maximize $\ b'\lambda$
subject to $\ D - (\lambda_1 A_1 + \cdots + \lambda_m A_m) \in C$,   (6.228)
where $b \in \Re^m$, $D, A_1, \ldots, A_m$ are symmetric matrices, and $C$ is the cone
of positive semidefinite matrices. It consists of solving the problem
The properties of this method are similar to the ones of the preceding
second order cone method. In particular, if Xk is an optimal solution of
the approximating problem (6.229), then every limit point of { xk} is an
optimal solution of the original problem (6.228).
We finally note that there are primal-dual interior point methods for
conic programming problems, which bear similarity with the one given for
linear programming in the preceding section. We refer to the literature for
further discussion and a complexity analysis; see e.g., [NeN94], [BoV04].
to f, constructed using the points xo, ... , Xk generated so far, and associ-
ated subgradients go, ... , gk, with gi E 8 f (xi) for all i = 0, ... , k. However,
it generates the next vector Xk+l by using a different mechanism. In partic-
ular, instead of minimizing Fk over X, the method obtains xk+ 1 by finding
a "central pair" (xk+1,Wk+i) within the subset
where $\hat f_k$ is the best upper bound to the optimal value that has been found
so far,
$$\hat f_k = \min_{i=0,\ldots,k} f(x_i),$$
with nonempty interior, its analytic center is defined as the unique maxi-
mizer of
$$\sum_{p=1}^m \ln\big(c_p - a_p' y\big)$$
[Figure: illustration of the set
$$S_k = \big\{(x, w) \mid x \in X,\ F_k(x) \le w \le \hat f_k\big\}$$
in the central cutting plane method, together with the linearization $f(x_1) + (x - x_1)'g_1$.]
over y E P.
Another possibility is the ball center of $S_k$, i.e., the center of the largest
inscribed sphere in $S_k$; for the generic polyhedral set $P$ of the form (6.230),
the ball center can be obtained by solving the following problem with op-
timization variables $(y, \sigma)$:
maximize $\ \sigma$
subject to $\ a_p'(y + d) \le c_p, \qquad \forall\, \|d\| \le \sigma, \ \ p = 1, \ldots, m$,
valued maximal monotone operator, and the gradient v' f with a general
single-valued monotone operator.
The forward-backward algorithm was proposed and analyzed by Lions
and Mercier [LiM79], and Passty [Pas79]. Additional convergence results
for the algorithm and a discussion of its applications were given by Gabay
[Gab83] and Tseng [Tse91b]. The convergence result of Prop. 6.3.3, in the
case where the stepsize is constant, descends from the more general results
of [Gab83] and [Tse91b]. A modification that converges under weaker as-
sumptions is given by Tseng [Tse00]. The rate of convergence has been
further discussed by Chen and Rockafellar [ChR97].
Variants of proximal gradient and Newton-like methods have been
proposed and analyzed by several authors, including cases where the dif-
ferentiable function is not convex; see e.g., Fukushima and Mine [FuM81],
[MiF81], Patriksson [Pat93], [Pat98], [Pat99], Tseng and Yun [TsY09], and
Schmidt [SchlO]. The methods have received renewed attention, as they
are well-matched to the structure of some large-scale machine learning and
signal processing problems; see Beck and Teboulle [BeT09a], [BeT09b],
[BeTlO], and the references they give to algorithms for problems with spe-
cial structures.
There has been a lot of additional recent work in this area, which
cannot be fully surveyed here. Methods (with and without extrapolation),
which replace the gradient with an aggregated gradient that is calculated
incrementally, are proposed and analyzed by Xiao [Xia10], and Xiao and
Zhang [XiZ14]. Inexact variants that admit errors in the proximal min-
imization and the gradient calculation, in the spirit of the ε-subgradient
methods of Section 3.3, have been discussed by Schmidt, Roux, and Bach [SRB11].
The convergence rate for some interesting special cases was investigated by
Tseng [Tse10], Hou et al. [HZS13], and Zhang, Jiang, and Luo [ZJL13].
Algorithms where $f$ has an additive form with components treated incre-
mentally are discussed by Duchi and Singer [DuS09], and by Langford, Li,
and Zhang [LLZ09]. For recent work on proximal Newton-like methods, see
Becker and Fadili [BeF12], Lee, Sun, and Saunders [LSS12], [LSS14], and
Chouzenoux, Pesquet, and Repetti [CPR14]. The finite and superlinear
convergence rate results of Exercises 6.4 and 6.5 are new to the author's
knowledge.
Section 6.4: Incremental subgradient methods were proposed by sev-
eral authors in the 60s and 70s. Perhaps the earliest paper is by Litvakov
[Lit66], which considered convex/nondifferentiable extensions of linear least
squares problems. There were several other related subsequent proposals,
including the paper by Kibardin [Kib80]. These works remained unnoticed
in the Western literature, where incremental methods were reinvented of-
ten in different contexts and with different lines of analysis. We mention
the papers by Solodov and Zavriev [SoZ98], Bertsekas [Ber99] (Section
6.3.2), Ben-Tal, Margalit, and Nemirovski [BMN01], Nedic and Bertsekas
EXERCISES
(ii) $f(x) + \nabla f(x)'(y - x) + \frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|^2 \le f(y)$, for all $x, y \in \Re^n$.
(iii) $\big(\nabla f(x) - \nabla f(y)\big)'(x - y) \ge \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2$, for all $x, y \in \Re^n$.
(iv) $f(y) \le f(x) + \nabla f(x)'(y - x) + \frac{L}{2}\|y - x\|^2$, for all $x, y \in \Re^n$.
(v) $\big(\nabla f(x) - \nabla f(y)\big)'(x - y) \le L\|x - y\|^2$, for all $x, y \in \Re^n$.
Note: This equivalence, given as part of Th. 2.1.5 in [Nes04], and also given
in part as Prop. 12.60 of [RoW98], proves among others the converse to Prop.
6.1.9(a). Proof: In Prop. 6.1.9, we showed that (i) implies (ii) and that (ii)
implies (iii). Moreover, (iii) and the Schwarz inequality imply (i). Thus (i), (ii),
and (iii) are equivalent.
The proof of Prop. 6.1.9(a) also shows that (iv) implies (v). To obtain (iv)
from (v), we use a variation of the proof of the descent lemma (Prop. 6.1.2). Let
$t$ be a scalar parameter and let $g(t) = f\big(x + t(y - x)\big)$. We have
$$f(y) - f(x) = g(1) - g(0) = \int_0^1 \frac{dg}{dt}(t)\,dt = \int_0^1 \nabla f\big(x + t(y - x)\big)'(y - x)\,dt,$$
so that
$$f(y) - f(x) - \nabla f(x)'(y - x) = \int_0^1 \big(\nabla f(x + t(y - x)) - \nabla f(x)\big)'(y - x)\,dt \le \tfrac{L}{2}\|y - x\|^2,$$
where the last inequality follows by using (v) with $y$ replaced by $x + t(y - x)$ to
obtain
$$\big(\nabla f(x + t(y - x)) - \nabla f(x)\big)'(y - x) \le L\,t\,\|y - x\|^2, \qquad \forall\, t \in [0, 1].$$
By definition
1
'vhx(y-x) = L('vf(y) - 'vf(x)), (6.231)
while by the assumption (iv), we have hx(z)::; ½llzll 2 for all z, implying that
to obtain
1
'vhy(x - y) = L ('v J(x) - 'v J(y) ),
and
(6.233)
and similarly
h;(-0) + hy(x - y) = -0'(x -y).
Thus we obtain 11011::; IIY - xii, i.e., that ll'vf(y) - 'vf(x)II::; LIiy- xii-
where $P_{\alpha,h}$ is the proximal operator corresponding to $\alpha$ and $h$ (cf. Section 5.1.4).
Denote by $x^*$ the optimal solution (which exists and is unique by the strong
convexity of $f$), and let $z^* = x^* - \alpha\nabla f(x^*)$. We assume that $\alpha \le 1/L$, where $L$
is the Lipschitz constant of $\nabla f$, so that $x_k \to x^*$ (cf. Prop. 6.3.3).
(a) Show that for some scalars $p \in (0, 1)$ and $q \in (0, 1]$, we have
$$\|x - \alpha\nabla f(x) - z^*\| \le p\,\|x - x^*\|, \qquad \forall\, x \in \Re^n, \tag{6.234}$$
Hint: Use Eq. (6.236), Exercise 2.1, and the linearity of the proximal op-
erator.
Figure 6.9.1. Illustration of the finite convergence process of the proximal gradient method for the case of a sharp minimum, where $h$ is nondifferentiable at $x^*$ and $-\nabla f(x^*) \in \mathrm{int}\big(\partial h(x^*)\big)$ (cf. Exercise 6.4). The figure also illustrates how the method can attain superlinear convergence (cf. Exercise 6.5). These results should be compared with the convergence rate analysis of the proximal algorithm in Section 5.1.2.
$$d(x) = \min_{x^* \in X^*}\|x - x^*\|.$$
Show that there exists $\bar k \ge 0$ such that for all $k \ge \bar k$ we have
$$f(x_{k+1}) \le f(x_k) + \nabla f(x_k)'(x_{k+1} - x_k) + \frac{1}{2\alpha}\|x_{k+1} - x_k\|^2,$$
where the last step uses the gradient inequality $f(x_k) + \nabla f(x_k)'(x^* - x_k) \le f(x^*)$.
Letting $x^*$ be the vector of $X^*$ that is at minimum distance from $x_k$, we obtain
Since $x_k$ converges to some point of $X^*$, by using the hypothesis (6.238), we have
for sufficiently large $k$,
i = 1, .. . ,n,
i = 1, ... ,n,
1 ~.
Xk+I = m ~z);,
i=l
Verify that the convergence result of Prop. 6.5.1 applies to this algorithm.
is a special case of the block coordinate descent method applied to the problem
$$x_{k+1}^i \in \arg\min_{\xi \in X_i} \Big\{ f\big(x_{k+1}^1, \ldots, x_{k+1}^{i-1}, \xi, x_k^{i+1}, \ldots, x_k^m\big) + \frac{1}{2c}\|\xi - x_k^i\|^2 \Big\},$$
for some scalar $c > 0$. Assuming that $f$ is convex, show that every limit point of
the sequence of vectors $x_k = (x_k^1, \ldots, x_k^m)$ is a global minimum. Hint: Apply the
result of Prop. 6.5.1 to the cost function
For a related analysis of this type of algorithm see [Aus92], and for a recent
analysis see [BST14].
(a) Use the optimality conditions of Prop. 5.4.7 in Appendix B to show that
if and only if
$$x^* \in \arg\min_{x \in \Re^n}\big\{ \nabla F(x^*)'x + G(x) \big\}.$$
G(x) = ~ Gi(xt
i=l
This exercise compares the convergence rates associated with various orders of
coordinate selection in the coordinate descent method, and illustrates how a
good deterministic order can outperform a randomized order. We consider the
minimization of a differentiable function f : Rn >-* R, with Lipschitz continuous
gradient along the ith coordinate, i.e., for some L > 0, we have
where Ci is the ith coordinate direction, Ci = (0, ... , 0, 1, 0, ... , 0) with the 1 in
the ith position, and "\!if is the ith component of the gradient. We also assume
that f is strongly convex in the sense that for some a > 0
Abbreviated Solution: By the descent lemma (Prop. 6.1.2) and Eq. (6.239),
we have
while by minimizing over y both sides of Eq. (6.240) with x = Xk, we have
(6.243)
= J(xk) - 2~ t
i=l
¾lvikJ(xk)l 2
Subtracting J(x*) from both sides and using Eq. (6.243), we obtain Eq.
(6.241).
(b) (Gauss-Southwell Coordinate Selection [ScF14]) Assume that $i_k$ is selected
according to
$$i_k \in \arg\max_{i=1,\ldots,n} \big|\nabla_i f(x_k)\big|,$$
and let the following strong convexity assumption hold for some a1 > 0,
Abbreviated Solution: Minimizing with respect toy both sides of Eq. (6.244)
with x = Xk, we obtain
Combining this with Eq. (6.242) and using the definition of ik, which im-
plies that
we obtain Eq. (6.245). Note: It can be shown that ~ ::; a1 ::; a, so that
the convergence rate estimate (6.245) is more favorable than the estimate
(6.241) of part (a); see [ScF14].
where Li is a Lipschitz constant for the gradient along the ith coordinate,
i.e.,
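The following Python sketch (our own, with hypothetical data) compares the three coordinate selection rules discussed in this exercise, cyclic, random, and Gauss-Southwell, on a strongly convex quadratic, where each coordinate minimization is exact.

```python
import numpy as np

def coord_descent(Q, b, rule, steps=300, rng=None):
    """Coordinate descent for f(x) = 0.5*x'Qx - b'x with exact coordinate steps."""
    x, n = np.zeros(len(b)), len(b)
    for k in range(steps):
        g = Q @ x - b                          # gradient at the current iterate
        if rule == "cyclic":
            i = k % n
        elif rule == "random":
            i = int(rng.integers(n))
        else:                                  # Gauss-Southwell: largest gradient entry
            i = int(np.argmax(np.abs(g)))
        x[i] -= g[i] / Q[i, i]                 # exact minimization along coordinate i
    return 0.5 * x @ Q @ x - b @ x

rng = np.random.default_rng(0)
M = rng.standard_normal((30, 10))
Q = M.T @ M + 0.1 * np.eye(10)                 # strongly convex quadratic
b = rng.standard_normal(10)
f_star = -0.5 * b @ np.linalg.solve(Q, b)
for rule in ("cyclic", "random", "gauss-southwell"):
    print(rule, coord_descent(Q, b, rule, rng=rng) - f_star)
```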
Set Notation
Vector Notation
{x + y Ix EX, y E Y},
{x -y Ix EX, y E Y}.
Matrices
For any matrix A, we use Aij, [A]ij, or aij to denote its ijth component.
The transpose of A, denoted by A', is defined by [A']ij = aji· For any two
matrices A and B of compatible dimensions, the transpose of the product
matrix AB satisfies (AB)' = B' A'. The inverse of a square and invertible
A is denoted A- 1.
If X is a subset of ?Rn and A is an m x n matrix, then the image of
X under A is denoted by AX (or A· X if this enhances notational clarity):
AX= {Ax Ix EX}.
If Y is a subset of ?Rm, the inverse image of Y under A is denoted by A- 1Y:
A- 1 Y = {x I Ax E Y}.
Square Matrices
Note that the only use of complex numbers in this book is in relation
to eigenvalues and eigenvectors. All other matrices or vectors are implicitly
assumed to have real components.
Proposition A.1.1:
(a) Let A be an n x n matrix. The following are equivalent:
(i) The matrix A is nonsingular.
(ii) The matrix A 1 is nonsingular.
(iii) For every nonzero x E ~n, we have Ax # 0.
provided all the inverses appearing above exist. For a proof, multiply the
right-hand side by A+ CBC' and show that the product is the identity.
Another useful formula provides the inverse of the partitioned matrix
$$M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}.$$
There holds
$$M^{-1} = \begin{bmatrix} Q & -QBD^{-1} \\ -D^{-1}CQ & \ D^{-1} + D^{-1}CQBD^{-1} \end{bmatrix},$$
where
$$Q = \big(A - BD^{-1}C\big)^{-1},$$
provided all the inverses appearing above exist. For a proof, multiply M
with the given expression for M- 1 and verify that the product is the iden-
tity.
Proposition A.1.4:
(a) A square matrix is symmetric and positive definite if and only if
it is invertible and its inverse is symmetric and positive definite.
(b) The sum of two symmetric positive semidefinite matrices is pos-
itive semidefinite. If one of the two matrices is positive definite,
the sum is positive definite.
$$\|x\| = (x'x)^{1/2} = \Big(\sum_{i=1}^n |x^i|^2\Big)^{1/2}.$$
Except for specialized contexts, we use this norm. In particular, in the
absence of a clear indication to the contrary, $\|\cdot\|$ will denote the Euclidean
norm. The Schwarz inequality states that for any two vectors $x$ and $y$, we
have
$$|x'y| \le \|x\|\cdot\|y\|,$$
with equality holding if and only if $x = \alpha y$ for some scalar $\alpha$. The
Pythagorean Theorem states that for any two vectors $x$ and $y$ that are
orthogonal, we have
$$\|x + y\|^2 = \|x\|^2 + \|y\|^2.$$
Two other important norms are the maximum norm $\|\cdot\|_\infty$ (also called
sup-norm or $\ell_\infty$-norm), defined by
$$\|x\|_\infty = \max_{i=1,\ldots,n}|x^i|,$$
and the $\ell_1$-norm $\|\cdot\|_1$, defined by
$$\|x\|_1 = \sum_{i=1}^n |x^i|.$$
Sequences
(d) We have
lim inf Xk
k-too
+ lim inf Yk Slim inf(xk + Yk),
k-too k-too
o( ·) Notation
for all sequences {xk} such that Xk-+ 0 and Xk -=J. 0 for all k.
Proposition A.2.4:
(a) The union of a finite collection of closed sets is closed.
(b) The intersection of any collection of closed sets is clo::;ed.
(c) The union of any collection of open sets is open.
(d) The intersection of a finite collection of open sets is open.
(e) A set is open if and only if all of its elements are interior points.
(f) Every subspace of Rn is closed.
(g) A set X c Rn is compact if and only if every sequence of elements
of X has a subsequence that converges to an element of X.
(h) If {Xk} is a sequence of nonempty and compact subsets of Rn
such that Xk+i C Xk for all k, then the intersection nk= 0 Xk is
nonempty and compact.
Continuity
Proposition A.2.6 :
(a) Any vector norm on Rn is a continuous function .
(b) Let f : Wm H WP and g : Rn H Wm be continuous functions.
The composition f ·g: Rn H WP, defined by (f-g)(x) = f(g(x)),
is a continuous function.
(c) Let f : Rn H Wm be continuous, and let Y be an open (re-
spectively, closed) subset of Rm. Then the inverse image of Y,
{x E wnI f(x) E Y}, is open (respectively, closed).
(d) Let f: Rn H Wm be continuous, and let X be a compact subset
of Rn. Then the image of X, {f(x) Ix EX}, is compact.
are nonempty and compact for all 'Y E R with 'Y > f*, where
f* = inf f(x).
xEX
Since the set of minima off is the intersection of the nonempty and compact
sets V'Yk for any sequence bk} with 'Yk -1.- f* and 'Yk > f* for all k, it follows
from Prop. A.2.4(h) that the set of minima is nonempty. This proves the
following classical theorem of Weierstrass.
A.3 DERIVATIVES
where ei is the ith unit vector (all components are O except for the ith
component which is 1). If the above limit exists, it is called the ith par-
tial derivative off at the vector x and it is denoted by (of /oxi)(x) or
of(x)/oxi (xi in this section will denote the ith component of the vector
x). Assuming all of these partial derivatives exist, the gradient off at xis
defined as the column vector
'vf(x) = [
of(x)
8x1
of(x)
:
l .
OXn
82 f(x)
OXiOXj
the ith partial derivative of of /oxj at a vector XE Rn. The Hessian off
at x, denoted by "v 2 f(x), is the matrix whose components are the above
second derivatives. The matrix "v 2 f(x) is symmetric. In our development,
whenever we assume that f is twice differentiable, we also assume that it
is twice continuously differentiable.
We now state some theorems relating to differentiable functions.
(b) We have
Proof: We first note that $T$ can have at most one fixed point (if $\bar x$ and $\tilde x$
are two fixed points, we have
$$\|\bar x - \tilde x\| = \|T(\bar x) - T(\tilde x)\| \le \beta\|\bar x - \tilde x\|,$$
which implies that $\bar x = \tilde x$). Using the contraction property, we have for all
$k, m > 0$,
$$\|x_{k+m} - x_k\| \le \beta^k\|x_m - x_0\| \le \beta^k\sum_{\ell=1}^m \|x_\ell - x_{\ell-1}\| \le \beta^k\sum_{\ell=0}^{m-1}\beta^\ell\|x_1 - x_0\|,$$
and finally,
$$\|x_{k+m} - x_k\| \le \frac{\beta^k(1 - \beta^m)}{1 - \beta}\|x_1 - x_0\|.$$
Thus $\{x_k\}$ is a Cauchy sequence, and hence converges to some $x^*$. Taking
the limit in the equation $x_{k+1} = T(x_k)$ and using the continuity of $T$
(implied by the contraction property), we see that $x^*$ must be a fixed point
of $T$. Q.E.D.
which holds for all x, y E ~n, and o: E [O, 1], as can be verified by a
straightforward calculation. For any fixed point x* of T, we have
where for the first equality we use iteration (A.1) and the fact x* = T(x*),
for the second equality we apply the identity (A.2), and for the inequality
we use the nonexpansiveness of T. By adding Eq. (A.3) for all k, we obtain
$$\lim_{k\to\infty,\ k\in\mathcal K} \|T(x_k) - x_k\| = 0, \tag{A.4}$$
for some subsequence $\{x_k\}_{\mathcal K}$. Since from Eq. (A.3), $\{x_k\}_{\mathcal K}$ is bounded, it
has at least one limit point, call it $\bar x$, so $\{x_k\}_{\bar{\mathcal K}} \to \bar x$ for an infinite index
set $\bar{\mathcal K} \subset \mathcal K$. Since $T$ is nonexpansive it is continuous, so $\{T(x_k)\}_{\bar{\mathcal K}} \to T(\bar x)$,
and in view of Eq. (A.4), it follows that $\bar x$ is a fixed point of $T$. Letting
$x^* = \bar x$ in Eq. (A.3), we see that $\{\|x_k - \bar x\|\}$ is nonincreasing and hence
converges, necessarily to $0$, so the entire sequence $\{x_k\}$ converges to the
fixed point $\bar x$. Q.E.D.
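As a small numerical illustration of this result (our own example; we take the iteration (A.1) to be the interpolated iteration $x_{k+1} = (1-\alpha)x_k + \alpha T(x_k)$ with constant $\alpha \in (0,1)$), let $T$ be a rotation by 90 degrees in the plane: it is nonexpansive with unique fixed point at the origin, the pure iteration $x_{k+1} = T(x_k)$ merely rotates and never converges, while the interpolated iteration converges.

```python
import numpy as np

R = np.array([[0.0, -1.0], [1.0, 0.0]])       # rotation by 90 degrees (nonexpansive)
T = lambda x: R @ x

alpha = 0.5
x_pure = np.array([1.0, 0.0])
x_avg = np.array([1.0, 0.0])
for _ in range(100):
    x_pure = T(x_pure)                               # pure iteration: cycles, norm stays 1
    x_avg = (1 - alpha) * x_avg + alpha * T(x_avg)   # interpolated iteration
print(np.linalg.norm(x_pure), np.linalg.norm(x_avg))  # 1.0 vs. essentially 0
```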
N onstationary Iterations
For nonstationary iterations of the form Xk+l = Tk(xk), where the function
Tk depends on k, the ideas of the preceding propositions may apply but
with modifications. The following proposition is often useful in this respect.
Vk = 0,1, ... ,
where f3k 2 0, "/k > 0 for all k, and
'oo
L"lk =oo,
k=O
Then ak-----+ 0.
Proof: We first show that given any E > 0, we have ak < E for infinitely
many k. Indeed, if this were not so, by letting k be such that ak :?'. E and
f3khk ~ E/2 for all k :?'. k, we would have for all k 2 k
By repeating this argument, we obtain ak < E for all k :?'. k. Since E can be
arbitrarily small, it follows that a k -----+ 0. Q.E.D.
V x, y E ~ n, k = 0, 1, . . . ,
L,k = oo,
k=O
Assume also that all the mappings Tk have a common fixed point x*. Then
llxk+l - x*II = IITk(xk) - n(x*)II : : ; (1- 1k)llxk - x*II + f3k,
and from Prop. A.4.3, it follows that the sequence {xk} generated by the
iteration Xk+1 = Tk(xk) converges to x* starting from any xo E lRn.
Supermartingale Convergence
Proposition A.4.4: Let {Yk}, {Zk}, {Wk}, and {Vk} be four scalar
sequences such that
Then either Yk ---r -oo, or ·else {Yk} converges to a finite value and
I:%:oZk < oo.
Proof: We first give the proof assuming that Vk = 0, and then generalize.
In this case, using the nonnegativity of {Zk}, we have
Yk+1 ::::; Yk + Wk.
By writing this relation for the index k set to k, ... , k, where k 2:: k, and
adding, we have
k 00
Since E~o Wk < oo, it follows that {Yk} is bounded above, and by taking
upper limit of the left hand side as k -+ oo and lower limit of the right
hand side as k -+ oo, we have
This implies that either Yk -+ -oo, or else {Yk} converges to a finite value.
In the latter case, by writing Eq. (A.5) for the index k set to 0, ... , k, and
adding, we have
k k
LZt $Yo+ I:w1-Yk+1, Vk = 0, l, ... ,
l=O l=O
since we generally have (1 +a) $ ea and log(l +a) $ a for any a~ 0. Thus
the assumption E~o Vk < oo implies that
00
Define
k-1 k k
The next theorem has a long history. The particular version we give
here is due to Robbins and Sigmund (RoS71]. Their proof assumes the
special case of the theorem where Vk = 0 (see Neveu [Nev75], p. 33, for a
proof of this special case), and then uses the line of proof of the preceding
proposition. Note, however, that contrary to the preceding proposition,
the following theorem requires nonnegativity of the sequence {Yk}.
Fejer Monotonicity
V x• EX*,
00
L'Yk = oo,
k=O
Proof: (a) Let {Ek } be a positive sequence such t hat I::=0 (1 + ,Bk)Ek < oo,
and let xk be a point of X * such that
(b) Following the argument of the proof of Prop. A.4.4, define for all k,
k-1 k
Vk = 0, l, ... , (A.8)
while {Yk} has a limit point at 0, since xis a limit point of {xk}· For any
E > 0, let k be such that
00
so that Y k --+ 0 implies that llxk - xllP --+ 0, and hence Xk --+ x.
(c) From Prop. A.4.4, it follows that
CX)
Thus limk-+oo, kEK ¢(xk; x*) = 0 for some subsequence {xk}K, By part (a),
{xk} is bounded, so the subsequence {xk}K has a limit point x, and by the
lower semicontinuity of¢(·; x*), we must have
which in view of the nonnegativity of¢, implies that ¢(x; x*) = 0. Using
the hypothesis (A.7), it follows that x E X*, so by part (b), the entire
sequence {xk} converges to x. Q.E.D.
APPENDIX B
Convex Optimization Theory:
A Summary
In this appendix, we provide a summary of theoretical concepts and results
relating to convex analysis, convex optimization, and duality theory. In
particular, we list the relevant definitions and propositions (without proofs)
of the author's book "Convex Optimization Theory," Athena Scientific,
2009. For ease of use, the chapter, section, definition, and proposition
numbers of the latter book are identical to the ones of this appendix.
Proposition 1.1.1:
(a) The intersection niEICi of any collection {Ci Ii E J} of convex
sets is convex.
(b) The vector sum C1 + C2 of two convex sets C1 and C2 is convex.
(c) The set >..C is convex for any convex set C and scalar >... Fur-
thermore, if C is a convex set and >..1, >..2 are positive scalars,
(d) The closure and the interior of a convex set are convex.
(e) The image and t he inverse image of a convex set under an affine
function are convex.
F(x) = f(Ax),
If f is convex, then F is also convex, while if f is closed, then F is
also closed.
be the function
If Ji, ... , f m are convex, then F is also convex, while if Ji, ... , f m are
closed, then F is also closed.
Strong Convexity
a
f (ax+ (1 - o:)y) + 20:(1 - o:)llx - Yll 2 :S o:f(x) + (1 - o:)f(y).
a
f(x) ~ f(x*) + 2 11x - x*ll 2 , V XE C.
Sn aff(C) c C,
Closures of Functions
The closure of the epigraph of a function f : X t-+ [-oo, oo] can be seen
to be a legitimate epigraph of another function. This function, called the
closure off and denoted elf: Rn t-+ [-00,00], is given by
The closure of the convex hull of the epigraph of f is the epigraph of some
function, denoted cl f called the convex closure of f. It can be seen that
elf is the closure of the function F: Rn t-+ [-oo, oo] given by
It is easily shown that F is convex, but it need not be closed and its
domain may be strictly contained in <lorn( cl f) (it can be seen though that
the closures of the domains of F and cl f coincide).
$$\inf_{x \in X} f(x) \;=\; \inf_{x \in X} (\operatorname{cl} f)(x) \;=\; \inf_{x \in \Re^n} (\operatorname{cl} f)(x) \;=\; \inf_{x \in \Re^n} F(x) \;=\; \inf_{x \in \Re^n} (\check{\operatorname{cl}} f)(x),$$
where $F$ is given by Eq. (B.1). Furthermore, any vector that attains
the infimum of $f$ over $X$ also attains the infimum of $\operatorname{cl} f$, $F$, and $\check{\operatorname{cl}} f$.
$$(\operatorname{cl} f)(y) = \lim_{\alpha \downarrow 0} f\bigl(y + \alpha(x - y)\bigr), \qquad \forall\, y \in \Re^n.$$
$$F(x) = f(Ax),$$
is convex and
is convex and
$$V = \{x \in C \mid Ax \in W\}$$
$$V = \{x \in C \mid Ax \in W\}$$
(assuming it is nonempty) is $L_C \cap N(A)$, where $N(A)$ is the
nullspace of $A$.
$$C = S + (C \cap S^{\perp}).$$
$$V_{\gamma} = \{x \mid f(x) \le \gamma\}, \qquad \gamma \in \Re.$$
Then:
(a) All the nonempty level sets $V_{\gamma}$ have the same recession cone,
denoted $R_f$, and given by
Let $\{C_k\}$ be a sequence of nonempty closed sets in $\Re^n$ with $C_{k+1} \subset C_k$ for
all $k$ (such a sequence is said to be nested). We are concerned with the
question whether $\cap_{k=0}^{\infty} C_k$ is nonempty. We say that $\{x_k\}$ is an asymptotic
sequence of $\{C_k\}$ if $x_k \ne 0$, $x_k \in C_k$ for all $k$, and
$$\|x_k\| \to \infty, \qquad \frac{x_k}{\|x_k\|} \to \frac{d}{\|d\|},$$
where $d$ is some nonzero common direction of recession of the sets $C_k$,
$$d \ne 0, \qquad d \in \cap_{k=0}^{\infty} R_{C_k}.$$
A special case is when all the sets Ck are equal. In particular, for a
nonempty closed convex set $C$, and a sequence $\{x_k\} \subset C$, we say that $\{x_k\}$
is an asymptotic sequence of $C$ if $\{x_k\}$ is asymptotic (as per the preceding
definition) for the sequence $\{C_k\}$, where $C_k = C$.
Given any unbounded sequence $\{x_k\}$ such that $x_k \in C_k$ for each $k$,
there exists a subsequence $\{x_k\}_{k \in K}$ that is asymptotic for the corresponding
subsequence $\{C_k\}_{k \in K}$. In fact, any limit point of $\{x_k / \|x_k\|\}$ is a common
direction of recession of the sets $C_k$. An asymptotic sequence $\{x_k\}$ is called
retractive if for the direction $d$ corresponding to it as above, there exists an
index $\bar{k}$ such that
$$x_k - d \in C_k, \qquad \forall\, k \ge \bar{k}.$$
We say that the sequence {Ck} is retractive if all its asymptotic se-
quences are retractive. In the special case Ck = C, we say that the set
C is retractive if all its asymptotic sequences are retractive.
and we assume that all the sets Nk are nonempty. A simple consequence
is that a polyhedral set is retractive, since it is the nonempty intersection
of a finite number of closed halfspaces.
$$R_X \cap R \subset L.$$
Then, $\{C_k\}$ is retractive, and $\cap_{k=0}^{\infty} C_k$ is nonempty.
When specialized to just two sets, the above proposition implies that
if $C_1$ and $-C_2$ are closed convex sets, then $C_1 - C_2$ is closed if there is no common nonzero direction of recession of $C_1$ and $C_2$.
$$\{x \mid a'x = b\},$$
where $a$ is a nonzero vector in $\Re^n$ (called the normal of the hyperplane), and
$b$ is a scalar. The sets
(1) $C_2 - C_1$ is closed.
(2) $C_1$ is closed and $C_2$ is compact.
(3) $C_1$ and $C_2$ are polyhedral.
(4) $C_1$ and $C_2$ are closed, and
$$R_{C_1} \cap R_{C_2} \subset L_{C_1} \cap L_{C_2},$$
where $R_{C_i}$ and $L_{C_i}$ denote the recession cone and the lineality
space of $C_i$, $i = 1, 2$.
(5) $C_1$ is closed, $C_2$ is polyhedral, and $R_{C_1} \cap R_{C_2} \subset L_{C_1}$.
$$\operatorname{ri}(C_1) \cap \operatorname{ri}(C_2) = \varnothing.$$
$$\operatorname{ri}(C) \cap P = \varnothing.$$
(b) If (u, w) does not belong to cl(C), there exists a nonvertical hy-
perplane strictly separating (u, w) and C.
(d) The conjugates of $f$ and its convex closure $\check{\operatorname{cl}} f$ are equal. Furthermore, if $\check{\operatorname{cl}} f$ is proper, then
$$(\check{\operatorname{cl}} f)(x) = f^{**}(x),$$
The right-hand sides of the preceding two relations are equal, so we obtain
$$X = \{x \mid a_j' x \le b_j,\ j = 1, \ldots, r\},$$
where $a_1, \ldots, a_r$ are vectors in $\Re^n$, and $b_1, \ldots, b_r$ are scalars.
Given a nonempty convex set C, a vector x E C is said to be an
extreme point of C if it does not lie strictly between the endpoints of any
line segment contained in the set, i.e., if there do not exist vectors $y \in C$
and $z \in C$, with $y \ne x$ and $z \ne x$, and a scalar $\alpha \in (0, 1)$ such that
$x = \alpha y + (1 - \alpha) z$.
$$P = \{x \mid Ax = b,\ x \ge 0\},$$
$$P = \{x \mid Ax = b,\ c \le x \le d\},$$
$$\{x \mid a_j' x \le b_j,\ j = 1, \ldots, r\}$$
has an extreme point if and only if the set $\{a_j \mid j = 1, \ldots, r\}$ contains
$n$ linearly independent vectors.
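As a simple illustration of this criterion, in $\Re^2$ the nonnegative orthant $\{x \mid -x_1 \le 0,\ -x_2 \le 0\}$ involves the two linearly independent normals $(-1, 0)'$ and $(0, -1)'$, and indeed it has the extreme point $x = 0$. By contrast, the halfspace $\{x \mid -x_1 \le 0\}$ involves a single normal, so no two of its normals can be linearly independent, and it has no extreme points.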
Proposition 2.2.1:
(a) For any nonempty cone $C$, we have
$$(C^*)^* = \operatorname{cl}\bigl(\operatorname{conv}(C)\bigr).$$
In particular, if $C$ is closed and convex, we have $(C^*)^* = C$.
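For example, if $C$ is the nonnegative orthant in $\Re^2$, a closed convex cone, its polar is $C^* = \{y \mid y'x \le 0,\ \forall\, x \in C\} = -C$, and $(C^*)^* = C$. If instead $C$ is the (nonconvex) cone consisting of the two coordinate axes, then $C^* = \{0\}$ and $(C^*)^* = \Re^2$, which is indeed $\operatorname{cl}\bigl(\operatorname{conv}(C)\bigr)$.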
$$C = \{x \mid a_j' x \le 0,\ j = 1, \ldots, r\},$$
$$P = \Bigl\{ x \;\Bigm|\; x = \sum_{j=1}^{m} \mu_j v_j + y,\ \ \sum_{j=1}^{m} \mu_j = 1,\ \ \mu_j \ge 0,\ j = 1, \ldots, m,\ \ y \in C \Bigr\}.$$
$$\forall\, x \in \operatorname{dom}(f),$$
$$x^* = \arg\min_{x \in X} f(x).$$
$$\{x \mid f(x) \le \gamma\}$$
is nonempty and bounded.
(3) $f$ is coercive, i.e., for every sequence $\{x_k\}$ such that $\|x_k\| \to \infty$,
we have $\lim_{k \to \infty} f(x_k) = \infty$.
Then the set of minima of $f$ over $\Re^n$ is nonempty and compact.
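As a one-dimensional illustration of condition (3), the function $f(x) = e^x$ is bounded below on $\Re$ but is not coercive, and it attains no minimum; the function $f(x) = x^2$ is coercive, and its set of minima $\{0\}$ is nonempty and compact, consistent with the conclusion above.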
$$f(x) = \inf_{z \in \Re^m} F(x, z).$$
Then:
(a) If F is convex, then f is also convex.
(b) We have
$$P\bigl(\operatorname{epi}(F)\bigr) \subset \operatorname{epi}(f) \subset \operatorname{cl}\Bigl(P\bigl(\operatorname{epi}(F)\bigr)\Bigr),$$
where $P(\cdot)$ denotes projection on the space of $(x, w)$, i.e., for any
subset $S$ of $\Re^{n+m+1}$, $P(S) = \{(x, w) \mid (x, z, w) \in S\}$.
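As an illustration of why the closure operation is needed in the second inclusion (closedness of $F$ need not be inherited by $f$), consider
$$F(x, z) = \begin{cases} e^{-\sqrt{xz}} & \text{if } x \ge 0,\ z \ge 0, \\ \infty & \text{otherwise}, \end{cases}$$
which is closed and convex (it is the composition of the convex nonincreasing function $e^{-t}$ with the concave function $\sqrt{xz}$ on the nonnegative orthant). Its partial minimum $f(x) = \inf_{z \in \Re} F(x, z)$ satisfies $f(x) = 0$ for $x > 0$, $f(0) = 1$, and $f(x) = \infty$ for $x < 0$, so $f$ is convex but not closed.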
$$\{z \mid F(x, z) \le \bar{\gamma}\}$$
$$\{z \mid F(x, z) \le \bar{\gamma}\}$$
is nonempty and its recession cone is equal to its lineality space. Then
and that the infima and the suprema above are attained.
Proposition 3.4.1: A pair $(x^*, z^*)$ is a saddle point of $\phi$ if and only
if the minimax equality (B.3) holds, and $x^*$ is an optimal solution of
the problem
minimize   $\sup_{z \in Z} \phi(x, z)$
subject to $x \in X$,
while $z^*$ is an optimal solution of the problem
maximize   $\inf_{x \in X} \phi(x, z)$
subject to $z \in Z$.
(a) Min Common Point Problem: Consider all vectors that are common
to $M$ and the $(n+1)$st axis. We want to find one whose $(n+1)$st
component is minimum.
(b) Max Crossing Point Problem: Consider nonvertical hyperplanes that
contain $M$ in their corresponding "upper" closed halfspace, i.e., the
closed halfspace whose recession cone contains the vertical halfline
$\{(0, w) \mid w \ge 0\}$. We want to find the maximum crossing point of
the $(n+1)$st axis with such a hyperplane.
We refer to the two problems as the min common/max crossing (MC/MC)
framework, and we will show that it can be used to develop much of the
core theory of convex optimization in a unified way.
Mathematically, the min common problem is
minimize w
subject to $(0, w) \in M$.
We also refer to this as the primal problem, and we denote by w* its optimal
value,
$$w^* = \inf_{(0, w) \in M} w.$$
maximize   $\inf_{(u, w) \in M} \{w + \mu' u\}$
subject to $\mu \in \Re^n$.    (B.4)
We also refer to this as the dual problem, and we denote by $q^*$ its optimal value,
$$q^* = \sup_{\mu \in \Re^n} q(\mu),$$
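As a simple illustration of the MC/MC framework, take $n = 1$ and $M = \{(u, w) \mid w \ge u^2\}$. The min common value is $w^* = \inf\{w \mid (0, w) \in M\} = 0$. For the max crossing problem,
$$q(\mu) = \inf_{(u, w) \in M} \{w + \mu u\} = \inf_{u \in \Re} \{u^2 + \mu u\} = -\frac{\mu^2}{4},$$
so $q^* = \sup_{\mu} q(\mu) = 0$, attained at $\mu = 0$, and there is no duality gap.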
$$\overline{M} = M + \{(0, w) \mid w \ge 0\}$$
is convex. Then the set of feasible solutions of the max crossing problem, $\{\mu \mid q(\mu) > -\infty\}$, is contained in the cone
There are several interesting special cases where the set $M$ is the epigraph of
some function. For example, consider the problem of minimizing a function
$f : \Re^n \mapsto [-\infty, \infty]$. We introduce a function $F : \Re^{n+r} \mapsto [-\infty, \infty]$ of the
pair $(x, u)$, which satisfies
$$f(x) = F(x, 0), \qquad \forall\, x \in \Re^n.$$
Let $p(u) = \inf_{x \in \Re^n} F(x, u)$ and take
$$M = \operatorname{epi}(p).$$
The min common value $w^*$ is the minimal value of $f$, since
$$w^* = p(0) = \inf_{x \in \Re^n} F(x, 0) = \inf_{x \in \Re^n} f(x).$$
maximize   $q(\mu)$
subject to $\mu \in \Re^r$,
$$\inf_{x \in \Re^n} F(x, 0) = -\inf_{\mu \in \Re^r} F^*(0, \mu).$$
minimize   $f(x)$
subject to $x \in X,\ g(x) \le 0,$    (B.8)
$(\operatorname{cl} \phi)(x, \cdot)$ the concave closure of $\phi(x, \cdot)$ [the smallest concave and upper
semicontinuous function that majorizes $\phi(x, \cdot)$].
$$\overline{M} = M + \{(u, 0) \mid u \in C\},$$
The following propositions give general results for strong duality, as well
as existence of dual optimal solutions.
The following propositions address special cases where the set M has par-
tially polyhedral structure.
Then $q^* = w^*$, and $Q^*$, the set of optimal solutions of the max crossing
problem, is a nonempty subset of the polar cone of the recession
cone of $P$. Furthermore, $Q^*$ is compact if $\operatorname{int}(\tilde{D}) \cap P \ne \varnothing$.
$$A'x \le 0 \quad \Longrightarrow \quad c'x \le 0.$$
which can be derived from the MC/MC duality framework in Section 4.2.
We denote the primal and dual optimal values by f * and q*, respectively.
where $X$ is a convex set in $\Re^n$, $g(x) = \bigl(g_1(x), \ldots, g_r(x)\bigr)'$, and $f : X \mapsto \Re$ and
$g_j : X \mapsto \Re$, $j = 1, \ldots, r$, are convex functions. The dual problem is
maximize   $\inf_{x \in X} L(x, \mu)$
subject to $\mu \ge 0$,
where $L$ is the Lagrangian function
$$L(x, \mu) = f(x) + \mu' g(x), \qquad x \in X,\ \mu \in \Re^r.$$
For this and other similar problems, we denote the primal and dual opti-
mal values by f * and q*, respectively. We always have the weak duality
relation $q^* \le f^*$; cf. Prop. 4.1.2. When strong duality holds, dual optimal
solutions are also referred to as Lagrange multipliers. The following eight
propositions are the main results relating to strong duality in a variety of
contexts. They provide conditions (often called constraint qualifications),
which guarantee that q* = f *.
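The weak duality relation itself follows from a one-line argument: for any $\mu \ge 0$ and any $x \in X$ with $g(x) \le 0$,
$$q(\mu) = \inf_{x' \in X} L(x', \mu) \;\le\; L(x, \mu) = f(x) + \mu' g(x) \;\le\; f(x),$$
and taking the supremum over $\mu \ge 0$ on the left and the infimum over feasible $x$ on the right yields $q^* \le f^*$.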
$$x^* \in \arg\min_{x \in X} L(x, \mu^*), \qquad \mu_j^* g_j(x^*) = 0, \quad j = 1, \ldots, r.$$
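For a concrete (if elementary) example of these conditions, consider minimizing $f(x) = x^2$ over $x \in \Re$ subject to $g(x) = 1 - x \le 0$. The Lagrangian is $L(x, \mu) = x^2 + \mu(1 - x)$, the dual function is $q(\mu) = \inf_{x} L(x, \mu) = \mu - \mu^2/4$, and maximizing over $\mu \ge 0$ gives $\mu^* = 2$ and $q^* = 1 = f^*$. The conditions above hold with $x^* = 1$: indeed $L(x, 2) = (x - 1)^2 + 1$ is minimized at $x^* = 1$, and $\mu^* g(x^*) = 2(1 - 1) = 0$.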
minimize   $f(x)$
subject to $x \in X,\ g(x) \le 0,\ Ax = b,$    (B.12)
minimize   $f(x)$
subject to $x \in X,\ Ax = b,$    (B.13)
the Lagrangian function is
$$L(x, \mu, \lambda) = f(x) + \mu' g(x) + \lambda'(Ax - b),$$
$$x^* \in \arg\min_{x \in X} L(x, \mu^*, \lambda^*),$$
$$X = P \cap C,$$
$g(x) = \bigl(g_1(x), \ldots, g_r(x)\bigr)'$, the functions $f : \Re^n \mapsto \Re$ and $g_j : \Re^n \mapsto \Re$,
$j = 1, \ldots, r$, are defined over $\Re^n$, $A$ is an $m \times n$ matrix, and $b \in \Re^m$.
Assume that $f^*$ is finite and that for some $\bar{r}$ with $1 \le \bar{r} \le r$, the
functions $g_j$, $j = 1, \ldots, \bar{r}$, are polyhedral, and the functions $f$ and $g_j$,
$j = \bar{r} + 1, \ldots, r$, are convex over $C$. Assume further that:
(1) There exists a vector $x \in \operatorname{ri}(C)$ in the set
(2) There exists $x \in P \cap C$ such that $g_j(x) < 0$ for all $j = \bar{r} + 1, \ldots, r$.
Then $q^* = f^*$ and there exists at least one dual optimal solution.
We will now give a different type of result, which under some com-
pactness assumptions, guarantees strong duality and that there exists an
optimal primal solution (even if there may be no dual optimal solution).
where $A$ is an $m \times n$ matrix, and $f_1 : \Re^n \mapsto (-\infty, \infty]$ and $f_2 : \Re^m \mapsto (-\infty, \infty]$
are closed proper convex functions. We assume that there exists a feasible
solution.
(b) There holds $f^* = q^*$, and $(x^*, \lambda^*)$ is a primal and dual optimal
solution pair if and only if
$$x^* \in \arg\min_{x \in \Re^n} \bigl\{ f_1(x) - x' A' \lambda^* \bigr\} \quad \text{and} \quad Ax^* \in \arg\min_{z \in \Re^m} \bigl\{ f_2(z) + z' \lambda^* \bigr\}. \tag{B.15}$$
minimize   $f(x)$
subject to $x \in C$,    (B.16)
minimize   $f^*(\lambda)$
subject to $\lambda \in \hat{C}$,
Then there is no duality gap and the dual problem has an optimal
solution.
$$\partial f(x) = S^{\perp} + G,$$
$$x' y \;\le\; f(x) + f^*(y), \qquad \forall\, x \in \Re^n,\ y \in \Re^n.$$
This is known as the Fenchel inequality. A pair $(x, y)$ satisfies this inequality
as an equation if and only if $x$ attains the supremum in the definition
$$f^*(y) = \sup_{z \in \Re^n} \bigl\{ y' z - f(z) \bigr\}.$$
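For example, for $f(x) = \frac{1}{2}\|x\|^2$ we have $f^*(y) = \frac{1}{2}\|y\|^2$, and the Fenchel inequality reads $x'y \le \frac{1}{2}\|x\|^2 + \frac{1}{2}\|y\|^2$, which holds as an equation exactly when $y = x$; this is also exactly when $x$ attains the supremum $\sup_{z} \bigl\{ y'z - \frac{1}{2}\|z\|^2 \bigr\}$.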
$$F(x) = f(Ax)$$
is proper. Then
$$\partial F(x) = A' \partial f(Ax),$$
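As a simple instance of this formula, take $f(t) = |t|$ on $\Re$ and $A = a'$ for a vector $a \in \Re^n$, so that $F(x) = |a'x|$. The formula gives $\partial F(x) = a\, \partial f(a'x)$, that is, $\partial F(x) = \{a\}$ if $a'x > 0$, $\partial F(x) = \{-a\}$ if $a'x < 0$, and $\partial F(x) = \{\lambda a \mid \lambda \in [-1, 1]\}$ if $a'x = 0$.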
Proposition 5.4.7: Let $f : \Re^n \mapsto (-\infty, \infty]$ be a proper convex function, let $X$ be a nonempty convex subset of $\Re^n$, and assume that one
of the following four conditions holds:
For a proper convex function $f : \Re^n \mapsto (-\infty, \infty]$, the directional derivative
at any $x \in \operatorname{dom}(f)$ in a direction $d \in \Re^n$, is defined by
$$f'(x; d) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha d) - f(x)}{\alpha}. \tag{B.19}$$
By the convexity of $f$, for any $\alpha \in (0, \bar{\alpha})$,
$$f(x + \alpha d) \le \frac{\alpha}{\bar{\alpha}}\, f(x + \bar{\alpha} d) + \Bigl(1 - \frac{\alpha}{\bar{\alpha}}\Bigr) f(x) = f(x) + \frac{\alpha}{\bar{\alpha}} \bigl(f(x + \bar{\alpha} d) - f(x)\bigr),$$
so that
$$\frac{f(x + \alpha d) - f(x)}{\alpha} \;\le\; \frac{f(x + \bar{\alpha} d) - f(x)}{\bar{\alpha}}, \qquad \forall\, \alpha \in (0, \bar{\alpha}). \tag{B.20}$$
Thus the limit in Eq. (B.19) is well-defined (as a real number, or $\infty$, or
$-\infty$) and an alternative definition of $f'(x; d)$ is
$$f'(x; d) = \inf_{\alpha > 0} \frac{f(x + \alpha d) - f(x)}{\alpha}.$$
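For example, for $f(x) = |x|$ on $\Re$ and $x = 0$, the difference quotient is $\bigl(f(\alpha d) - f(0)\bigr)/\alpha = |d|$ for every $\alpha > 0$, so both expressions give $f'(0; d) = |d|$; the directional derivative exists even though $f$ is not differentiable at $0$.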
We will now provide theorems regarding the validity of the minimax equal-
ity and the existence of saddle points. These theorems are obtained by
specializing the MC/MC theorems of Chapter 4. We will assume through-
out this section the following:
(a) $X$ and $Z$ are nonempty convex subsets of $\Re^n$ and $\Re^m$, respectively.
(b) $\phi : X \times Z \mapsto \Re$ is a function such that $\phi(\cdot, z) : X \mapsto \Re$ is convex and
closed for each $z \in Z$, and $-\phi(x, \cdot) : Z \mapsto \Re$ is convex and closed for
each $x \in X$.
satisfies either $p(0) < \infty$, or else $p(0) = \infty$ and $p(u) > -\infty$ for all
$u \in \Re^m$. Then
$$\sup_{z \in Z} \inf_{x \in X} \phi(x, z) = \inf_{x \in X} \sup_{z \in Z} \phi(x, z),$$
and the supremum over $Z$ in the left-hand side is finite and is attained.
Furthermore, the set of $z \in Z$ attaining this supremum is compact if
and only if $0$ lies in the interior of $\operatorname{dom}(p)$.
Proposition 5.5.4: Assume that $t$ is proper and that the level sets
$\{x \mid t(x) \le \gamma\}$, $\gamma \in \Re$, are compact. Then
Proposition 5.5.5: Assume that $t$ is proper, and that the recession
cone and the constancy space of $t$ are equal. Then
$$\{x \in X \mid \phi(x, z) \le \gamma\}$$
$$\{z \in Z \mid \phi(x, z) \ge \gamma\}$$
[ACH97] Auslender, A., Cominetti, R., and Haddou, M., 1997. "Asymp-
totic Analysis for Penalty and Barrier Methods in Convex and Linear Pro-
gramming," Math. of Operations Research, Vol. 22, pp. 43-62.
[ALS14] Abernethy, J., Lee, C., Sinha, A., and Tewari, A., 2014. "Online
Linear Optimization via Smoothing," arXiv preprint arXiv:1405.6076.
[AgB14] Agarwal, A., and Bottou, L., 2014. "A Lower Bound for the Op-
timization of Finite Sums," arXiv preprint arXiv:1410.0723.
[AgDll] Agarwal, A., and Duchi, J.C., 2011. "Distributed Delayed Stochas-
tic Optimization," In Advances in Neural Information Processing Systems
(NIPS 2011), pp. 873-881.
[AlG03] Alizadeh, F., and Goldfarb, D., 2003. "Second-Order Cone Pro-
gramming," Math. Programming, Vol. 95, pp. 3-51.
[AnH13] Andersen, M. S., and Hansen, P. C., 2013. "Generalized Row-
Action Methods for Tomographic Imaging," Numerical Algorithms, Vol.
67, pp. 1-24.
[Arm66] Armijo, L., 1966. "Minimization of Functions Having Continuous
Partial Derivatives," Pacific J. Math., Vol. 16, pp. 1-3.
[Ash72] Ash, R. B., 1972. Real Analysis and Probability, Academic Press,
NY.
[AtV95] Atkinson, D. S., and Vaidya, P. M., 1995. "A Cutting Plane Algo-
rithm for Convex Programming that Uses Analytic Centers," Math. Pro-
gramming, Vol. 69, pp. 1-44.
[AuE76] Aubin, J. P., and Ekeland, I., 1976. "Estimates of the Duality
Gap in Nonconvex Optimization," Math. of Operations Research, Vol. 1,
pp. 225-245.
[AuT03] Auslender, A., and Teboulle, M., 2003. Asymptotic Cones and
Functions in Optimization and Variational Inequalities, Springer, NY.
[AuT04] Auslender, A., and Teboulle, M., 2004. "Interior Gradient and
Epsilon-Subgradient Descent Methods for Constrained Convex Minimiza-
tion," Math. of Operations Research, Vol. 29, pp. 1-26.
[Aus76] Auslender, A., 1976. Optimization: Methodes Numeriques, Mason,
Paris.
[Aus92] Auslender, A., 1992. "Asymptotic Properties of the Fenchel Dual
[BMN01] Ben-Tal, A., Margalit, T., and Nemirovski, A., 2001. "The Or-
dered Subsets Mirror Descent Optimization Method and its Use for the
Positron Emission Tomography Reconstruction," in Inherently Parallel Al-
gorithms in Feasibility and Optimization and their Applications (D. But-
nariu, Y. Censor, and S. Reich, eds.), Elsevier, Amsterdam, Netherlands.
[BMR00] Birgin, E. G., Martinez, J. M., and Raydan, M., 2000. "Non-
monotone Spectral Projected Gradient Methods on Convex Sets," SIAM
J. on Optimization, Vol. 10, pp. 1196-1211.
[BMS99] Boltyanski, V., Martini, H., and Soltan, V., 1999. Geometric
Methods and Optimization Problems, Kluwer, Boston.
[BNO03] Bertsekas, D. P., Nedic, A., and Ozdaglar, A. E., 2003. Convex
Analysis and Optimization, Athena Scientific, Belmont, MA.
[BOT06] Bertsekas, D. P., Ozdaglar, A. E., and Tseng, P., 2006. "Enhanced
Fritz John Optimality Conditions for Convex Programming," SIAM J. on
Optimization, Vol. 16, pp. 766-797.
[BPC11] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J., 2011.
Distributed Optimization and Statistical Learning via the Alternating Di-
rection Method of Multipliers, Now Publishers Inc, Boston, MA.
[BPP13] Bhatnagar, S., Prasad, H., and Prashanth, L. A., 2013. Stochas-
tic Recursive Algorithms for Optimization, Lecture Notes in Control and
Information Sciences, Springer, NY.
[BPT97a] Bertsekas, D. P., Polymenakos, L. C., and Tseng, P., 1997. "An
ε-Relaxation Method for Separable Convex Cost Network Flow Problems,"
SIAM J. on Optimization, Vol. 7, pp. 853-870.
[BPT97b] Bertsekas, D. P., Polymenakos, L. C., and Tseng, P., 1997.
"Epsilon-Relaxation and Auction Methods for Separable Convex Cost Net-
work Flow Problems," in Network Optimization, Pardalos, P. M., Hearn,
D. W., and Hager, W.W., (Eds.), Lecture Notes in Economics and Math-
ematical Systems, Springer-Verlag, NY, pp. 103-126.
[BSL14] Bergmann, R., Steidl, G., Laus, F., and Weinmann, A., 2014.
"Second Order Differences of Cyclic Data and Applications in Variational
Denoising," arXiv preprint arXiv:1405.5349.
[BSS06] Bazaraa, M. S., Sherali, H. D., and Shetty, C. M., 2006. Nonlinear
Programming: Theory and Algorithms, 3rd Edition, Wiley, NY.
[BST14] Bolte, J., Sabach, S., and Teboulle, M., 2014. "Proximal Alternat-
ing Linearized Minimization for Nonconvex and Nonsmooth Problems,"
Math. Programming, Vol. 146, pp. 1-36.
[BaB88] Barzilai, J., and Borwein, J.M., 1988. "Two Point Step Size Gra-
dient Methods," IMA J. of Numerical Analysis, Vol. 8, pp. 141-148.
[BaB96] Bauschke, H. H., and Borwein, J. M., 1996. "On Projection Algo-
rithms for Solving Convex Feasibility Problems," SIAM Review, Vol. 38,
pp. 367-426.
[DeT91] Dennis, J. E., and Torczon, V., 1991. "Direct Search Methods on
Parallel Machines," SIAM J. on Optimization, Vol. 1, pp. 448-474.
[Dem66] Demjanov, V. F., 1966. "The Solution of Several Minimax Prob-
lems," Kibernetika, Vol. 2, pp. 58-66.
[Dem68] Demjanov, V. F., 1968. "Algorithms for Some Minimax Prob-
lems," J. of Computer and Systems Science, Vol. 2, pp. 342-380.
[DoE03] Donoho, D. L., and Elad, M., 2003. "Optimally Sparse Representation
in General (Nonorthogonal) Dictionaries via ℓ1 Minimization," Proc. of the
National Academy of Sciences, Vol. 100, pp. 2197-2202.
[DrH04] Drezner, Z., and Hamacher, H. W., 2004. Facility Location: Ap-
plications and Theory, Springer, NY.
[DuS83] Dunn, J. C., and Sachs, E., 1983. "The Effect of Perturbations on
the Convergence Rates of Optimization Algorithms," Appl. Math. Optim.,
Vol. 10, pp. 143-157.
[DuS09] Duchi, J., and Singer, Y., 2009. "Efficient Online and Batch Learn-
ing Using Forward Backward Splitting," J. of Machine Learning Research,
Vol. 10, pp. 2899-2934.
[Dun79] Dunn, J.C., 1979. "Rates of Convergence for Conditional Gradient
Algorithms Near Singular and Nonsingular Extremals," SIAM J. on Control
and Optimization, Vol. 17, pp. 187-211.
[Dun80] Dunn, J. C., 1980. "Convergence Rates for Conditional Gradient
Sequences Generated by Implicit Step Length Rules," SIAM J. on Control
and Optimization, Vol. 18, pp. 473-487.
[Dun81] Dunn, J. C., 1981. "Global and Asymptotic Convergence Rate
Estimates for a Class of Projected Gradient Processes," SIAM J. on Control
and Optimization, Vol. 19, pp. 368-400.
[Dun87] Dunn, J. C., 1987. "On the Convergence of Projected Gradient
Processes to Singular Critical Points," J. of Optimization Theory and Ap-
plications, Vol. 55, pp. 203-216.
[Dun91] Dunn, J.C., 1991. "A Subspace Decomposition Principle for Scaled
Gradient Projection Methods: Global Theory," SIAM J. on Control and
Optimization, Vol. 29, pp. 219-246.
[EcB92] Eckstein, J., and Bertsekas, D. P., 1992. "On the Douglas-Rachford
Splitting Method and the Proximal Point Algorithm for Maximal Monotone
Operators," Math. Programming, Vol. 55, pp. 293-318.
[EcS13] Eckstein, J., and Silva, P. J. S., 2013. "A Practical Relative Error
Criterion for Augmented Lagrangians," Math. Programming, Vol. 141, Ser.
A, pp. 319-348.
[Eck94] Eckstein, J., 1994. "Nonlinear Proximal Point Algorithms Using
Bregman Functions, with Applications to Convex Programming," Math.
of Operations Research, Vol. 18, pp. 202-226.
[Eck03] Eckstein, J., 2003. "A Practical General Approximation Criterion
[Fen51] Fenchel, W., 1951. Convex Cones, Sets, and Functions, Mimeogra-
phed Notes, Princeton Univ.
[FlH95] Florian, M. S., and Hearn, D., 1995. "Network Equilibrium Models
and Algorithms," Handbooks in OR and MS, Ball, M. O., Magnanti, T.
L., Monma, C. L., and Nemhauser, G. L., (Eds.), Vol. 8, North-Holland,
Amsterdam, pp. 485-550.
[FiM68] Fiacco, A. V., and McCormick, G. P., 1968. Nonlinear Program-
ming: Sequential Unconstrained Minimization Techniques, Wiley, NY.
[FiN03] Figueiredo, M.A. T., and Nowak, R. D., 2003. "An EM Algorithm
for Wavelet-Based Image Restoration," IEEE Trans. Image Processing, Vol.
12, pp. 906-916.
[FleOO] Fletcher, R., 2000. Practical Methods of Optimization, 2nd edition,
Wiley, NY.
[FoG83] Fortin, M., and Glowinski, R., 1983. "On Decomposition-Coordina-
tion Methods Using an Augmented Lagrangian," in: M. Fortin and R.
Glowinski, eds., Augmented Lagrangian Methods: Applications to the So-
lution of Boundary-Value Problems, North-Holland, Amsterdam.
[FrG13] Friedlander, M. P., and Goh, G., 2013. "Tail Bounds for Stochastic
Approximation," arXiv preprint arXiv:1304.5586.
[FrG14] Freund, R. M., and Grigas, P., 2014. "New Analysis and Results
for the Frank-Wolfe Method," arXiv preprint arXiv:1307.0873, to appear
in Math. Programming.
[FrS00] Frommer, A., and Szyld, D. B., 2000. "On Asynchronous Itera-
tions," J. of Computational and Applied Mathematics, Vol. 123, pp. 201-
216.
[FrS12] Friedlander, M. P., and Schmidt, M., 2012. "Hybrid Deterministic-
Stochastic Methods for Data Fitting," SIAM J. Sci. Comput., Vol. 34, pp.
A1380-A1405.
[FrT07] Friedlander, M. P., and Tseng, P., 2007. "Exact Regularization of
Convex Programs," SIAM J. on Optimization, Vol. 18, pp. 1326-1350.
[FrW56] Frank, M., and Wolfe, P., 1956. "An Algorithm for Quadratic
Programming," Naval Research Logistics Quarterly, Vol. 3, pp. 95-110.
[Fra02] Frangioni, A., 2002. "Generalized Bundle Methods," SIAM J. on
Optimization, Vol. 13, pp. 117-156.
[Fri56] Frisch, M. R., 1956. "La Resolution des Problemes de Programme
Lineaire par la Methode du Potential Logarithmique," Cahiers du Semi-
naire D'Econometrie, Vol. 4, pp. 7-20.
[FuM81] Fukushima, M., and Mine, H., 1981. "A Generalized Proximal
Point Algorithm for Certain Non-Convex Minimization Problems," Inter-
nat. J. Systems Sci., Vol. 12, pp. 989-1000.
[Fuk92] Fukushima, M., 1992. "Application of the Alternating Direction
Method of Multipliers to Separable Convex Programming Problems," Com-
[Gol85] Golshtein, E. G., 1985. "A Decomposition Method for Linear and
Convex Programming Problems," Matecon, Vol. 21, pp. 1077-1091.
[Gon00] Gonzaga, C. C., 2000. "Two Facts on the Convergence of the
Cauchy Algorithm," J. of Optimization Theory and Applications, Vol. 107,
pp. 591-600.
[GrS99] Grippo, L., and Sciandrone, M., 1999. "Globally Convergent Block-
Coordinate Techniques for Unconstrained Optimization," Optimization Me-
thods and Software, Vol. 10, pp. 587-637.
[GrS00] Grippo, L., and Sciandrone, M., 2000. "On the Convergence of the
Block Nonlinear Gauss-Seidel Method Under Convex Constraints," Oper-
ations Research Letters, Vol. 26, pp. 127-136.
[Gri94] Grippo, L., 1994. "A Class of Unconstrained Minimization Methods
for Neural Network Training," Optim. Methods and Software, Vol. 4, pp.
135-150.
[Gri00] Grippo, L., 2000. "Convergent On-Line Algorithms for Supervised
Learning in Neural Networks," IEEE Trans. Neural Networks, Vol. 11, pp.
1284-1299.
[Gul92] Guler, O., 1992. "New Proximal Point Algorithms for Convex Min-
imization," SIAM J. on Optimization, Vol. 2, pp. 649-664.
[HCW14] Hong, M., Chang, T. H., Wang, X., Razaviyayn, M., Ma, S.,
and Luo, Z. Q., 2014. "A Block Successive Upper Bound Minimization
Method of Multipliers for Linearly Constrained Convex Optimization,"
arXiv preprint arXiv:1401.7079.
[HJN14] Harchaoui, Z., Juditsky, A., and Nemirovski, A., 2014. "Condi-
tional Gradient Algorithms for Norm-Regularized Smooth Convex Opti-
mization," Math. Programming, pp. 1-38.
[HKR95] den Hertog, D., Kaliski, J., Roos, C., and Terlaky, T., 1995. "A
Path-Following Cutting Plane Method for Convex Programming," Annals
of Operations Research, Vol. 58, pp. 69-98.
[HLV87] Hearn, D. W., Lawphongpanich, S., and Ventura, J. A., 1987. "Re-
stricted Simplicial Decomposition: Computation and Extensions," Math.
Programming Studies, Vol. 31, pp. 119-136.
[HMT10] Halko, N., Martinsson, P.-G., and Tropp, J. A., 2010. "Find-
ing Structure with Randomness: Probabilistic Algorithms for Constructing
Approximate Matrix Decompositions," arXiv preprint arXiv:0909.4061.
[HTF09] Hastie, T., Tibshirani, R., and Friedman, J., 2009. The Elements
of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edi-
tion, Springer, NY. On line at http://statweb.stanford.edu/ tibs/ElemStat-
Learn/
[HYS15] Hu, Y., Yang, X., and Sim, C. K., 2015. "Inexact Subgradient
Methods for Quasi-Convex Optimization Problems," European Journal of
Operational Research, Vol. 240, pp. 315-327.
[HZS13] Hou, K., Zhou, Z., So, A. M. C., and Luo, Z. Q., 2013. "On
the Linear Convergence of the Proximal Gradient Method for Trace Norm
Regularization," in Advances in Neural Information Processing Systems
(NIPS 2013), pp. 710-718.
[Ha90] Ha, C. D., 1990. "A Generalization of the Proximal Point Algo-
rithm," SIAM J. on Control and Optimization, Vol. 28, pp. 503-512.
[HaB70] Haarhoff, P. C., and Buys, J. D., 1970. "A New Method for the
Optimization of a Nonlinear Function Subject to Nonlinear Constraints,"
Computer J., Vol. 13, pp. 178-184.
[HaH93] Hager, W. W., and Hearn, D. W., 1993. "Application of the Dual
Active Set Algorithm to Quadratic Network Optimization," Computational
Optimization and Applications, Vol. 1, pp. 349-373.
[HaM79] Han, S. P., and Mangasarian, O. L., 1979. "Exact Penalty Func-
tions in Nonlinear Programming," Math. Programming, Vol. 17, pp. 251-
269.
[Hay08] Haykin, S., 2008. Neural Networks and Learning Machines, (3rd
Ed.), Prentice Hall, Englewood Cliffs, NJ.
[HeD09] Helou, E. S., and De Pierro, A. R., 2009. "Incremental Subgradi-
ents for Constrained Convex Optimization: A Unified Framework and New
Methods," SIAM J. on Optimization, Vol. 20, pp. 1547-1572.
[HeL89] Hearn, D. W., and Lawphongpanich, S., 1989. "Lagrangian Dual
Ascent by Generalized Linear Programming," Operations Res. Letters, Vol.
8, pp. 189-196.
[HeMll] Henrion, D., and Malick, J., 2011. "Projection Methods for Conic
Feasibility Problems: Applications to Polynomial Sum-of-Squares Decom-
positions," Optimization Methods and Software, Vol. 26, pp. 23-46.
[HeM12] Henrion, D., and Malick, J., 2012. "Projection Methods in Conic
Optimization," In Handbook on Semidefinite, Conic and Polynomial Opti-
mization, Springer, NY, pp. 565-600.
[Her09] Herman, G. T., 2009. Fundamentals of Computerized Tomography:
Image Reconstruction from Projections, 2nd Edition, Springer, NY.
[Hes69] Hestenes, M. R., 1969. "Multiplier and Gradient Methods," J. Opt.
Th. and Appl., Vol. 4, pp. 303-320.
[Hes75] Hestenes, M. R., 1975. Optimization Theory: The Finite Dimen-
sional Case, Wiley, NY.
[HiL93] Hiriart-Urruty, J.-B., and Lemarechal, C., 1993. Convex Analysis
and Minimization Algorithms, Vols. I and II, Springer-Verlag, Berlin and
NY.
[Hil57] Hildreth, C., 1957. "A Quadratic Programming Procedure," Naval
Res. Logist. Quart., Vol. 4, pp. 79-85. See also "Erratum," Naval Res.
Logist. Quart., Vol. 4, p. 361.
[HoK71] Hoffman, K., and Kunze, R., 1971. Linear Algebra, Pearson, En-
glewood Cliffs, NJ.
[HoL13] Hong, M., and Luo, Z. Q., 2013. "On the Linear Convergence of the
Alternating Direction Method of Multipliers," arXiv preprint arXiv:1208.-
3922.
[Hoh77] Hohenbalken, B. von, 1977. "Simplicial Decomposition in Nonlin-
ear Programming," Math. Programming, Vol. 13, pp. 49-68.
[Hol74] Holloway, C. A., 1974. "An Extension of the Frank and Wolfe
Method of Feasible Directions," Math. Programming, Vol. 6, pp. 14-27.
[IPS03] Iusem, A. N., Pennanen, T., and Svaiter, B. F., 2003. "Inexact
Variants of the Proximal Point Algorithm Without Monotonicity," SIAM
J. on Optimization, Vol. 13, pp. 1080-1097.
[IST94] Iusem, A. N., Svaiter, B. F., and Teboulle, M., 1994. "Entropy-
Like Proximal Methods in Convex Programming," Math. of Operations
Research, Vol. 19, pp. 790-814.
[IbF96] Ibaraki, S., and Fukushima, M., 1996. "Partial Proximal Method of
Multipliers for Convex Programming Problems," J. of Operations Research
Society of Japan, Vol. 39, pp. 213-229.
[IuT95] Iusem, A. N., and Teboulle, M., 1995. "Convergence Rate Analysis
of Nonquadratic Proximal Methods for Convex and Linear Programming,"
Math. of Operations Research, Vol. 20, pp. 657-677.
[Ius99] Iusem, A. N., 1999. "Augmented Lagrangian Methods and Proximal
Point Methods for Convex Minimization," Investigacion Operativa, Vol. 8,
pp. 11-49.
[Ius03] Iusem, A. N., 2003. "On the Convergence Properties of the Pro-
jected Gradient Method for Convex Optimization," Computational and
Applied Mathematics, Vol. 22, pp. 37-52.
[JFY09] Joachims, T., Finley, T., and Yu, C.-N. J., 2009. "Cutting-Plane
Training of Structural SVMs," Machine Learning, Vol. 77, pp. 27-59.
[JRJ09] Johansson, B., Rabi, M., and Johansson, M., 2009. "A Random-
ized Incremental Subgradient Method for Distributed Optimization in Net-
worked Systems," SIAM J. on Optimization, Vol. 20, pp. 1157-1170.
[Jag13] Jaggi, M., 2013. "Revisiting Frank-Wolfe: Projection-Free Sparse
Convex Optimization," Proc. of ICML 2013.
[JiZ14] Jiang, B., and Zhang, S., 2014. "Iteration Bounds for Finding the ε-
Stationary Points for Structured Nonconvex Optimization," arXiv preprint
arXiv:1410.4066.
[JoY09] Joachims, T., and Yu, C.-N. J., 2009. "Sparse Kernel SVMs via
Cutting-Plane Training," Machine Learning, Vol. 76, pp. 179-193.
[JoZ13] Johnson, R., and Zhang, T., 2013. "Accelerating Stochastic Gra-
dient Descent Using Predictive Variance Reduction," Advances in Neural
Information Processing Systems 26 (NIPS 2013).
[Joa06] Joachims, T., 2006. "Training Linear SVMs in Linear Time," Inter-
national Conference on Knowledge Discovery and Data Mining, pp. 217-
226.
[JuN11a] Juditsky, A., and Nemirovski, A., 2011. "First Order Methods for
Nonsmooth Convex Large-Scale Optimization, I: General Purpose Meth-
ods," in Optimization for Machine Learning, by Sra, S., Nowozin, S., and
Wright, S. J. (eds.), MIT Press, Cambridge, MA, pp. 121-148.
[JuN11b] Juditsky, A., and Nemirovski, A., 2011. "First Order Methods
for Nonsmooth Convex Large-Scale Optimization, II: Utilizing Problem's
Structure," in Optimization for Machine Learning, by Sra, S., Nowozin, S.,
and Wright, S. J. (eds.), MIT Press, Cambridge, MA, pp. 149-183.
[KaW94] Kall, P., and Wallace, S. W., 1994. Stochastic Programming,
Wiley, Chichester, UK.
[Kac37] Kaczmarz, S., 1937. "Approximate Solution of Systems of Linear
Equations," Bull. Acad. Pol. Sci., Lett. A 35, pp. 335-357 (in German);
English transl.: Int. J. Control, Vol. 57, pp. 1269-1271, 1993.
[Kar84] Karmarkar, N., 1984. "A New Polynomial-Time Algorithm for Lin-
ear Programming," In Proc. of the 16th Annual ACM Symp. on Theory of
Computing, pp. 302-311.
[Kel60] Kelley, J.E., 1960. "The Cutting-Plane Method for Solving Convex
Programs," J. Soc. Indust. Appl. Math., Vol. 8, pp. 703-712.
[Kel99] Kelley, C. T., 1999. Iterative Methods for Optimization, Siam,
Philadelphia, PA.
[Kib80] Kibardin, V. M., 1980. "Decomposition into Functions in the Min-
imization Problem," Automation and Remote Control, Vol. 40, pp. 1311-
1323.
[Kiw04] Kiwiel, K. C., 2004. "Convergence of Approximate and Incremental
Subgradient Methods for Convex Optimization," SIAM J. on Optimization,
Vol. 14, pp. 807-840.
[KoB72] Kort, B. W., and Bertsekas, D. P., 1972. "A New Penalty Function
Method for Constrained Minimization," Proc. 1972 IEEE Confer. Decision
Control, New Orleans, LA, pp. 162-166.
[KoB76] Kort, B. W., and Bertsekas, D. P., 1976. "Combined Primal-Dual
and Penalty Methods for Convex Programming," SIAM J. on Control and
Optimization, Vol. 14, pp. 268-294.
[KoN93] Kortanek, K. O., and No, H., 1993. "A Central Cutting Plane
Algorithm for Convex Semi-Infinite Programming Problems," SIAM J. on
Optimization, Vol. 3, pp. 901-918.
[Kor75] Kort, B. W., 1975. "Combined Primal-Dual and Penalty Function
Algorithms for Nonlinear Programming," Ph.D. Thesis, Dept. of Enginee-
ring-Economic Systems, Stanford Univ., Stanford, Ca.
[LiM79] Lions, P. L., and Mercier, B., 1979. "Splitting Algorithms for the
Sum of Two Nonlinear Operators," SIAM J. on Numerical Analysis, Vol.
16, pp. 964-979.
[LiP87] Lin, Y. Y., and Pang, J.-S., 1987. "Iterative Methods for Large
Convex Quadratic Programs: A Survey," SIAM J. on Control and Opti-
mization, Vol. 18, pp. 383-411.
[LiW14] Liu, J., and Wright, S. J., 2014. "Asynchronous Stochastic Coordi-
nate Descent: Parallelism and Convergence Properties," Univ. of Wisconsin
Report, arXiv preprint arXiv:1403.3862.
[Lin07] Lin, C. J., 2007. "Projected Gradient Methods for Nonnegative
Matrix Factorization," Neural Computation, Vol. 19, pp. 2756-2779.
[Lit66] Litvakov, B. M., 1966. "On an Iteration Method in the Problem of
Approximating a Function from a Finite Number of Observations," Avtom.
Telemech., No. 4, pp. 104-113.
[Lju77] Ljung, L., 1977. "Analysis of Recursive Stochastic Algorithms,"
IEEE Trans. on Automatic Control, Vol. 22, pp. 551-575.
[LuT91] Luo, Z. Q., and Tseng, P., 1991. "On the Convergence of a Matrix-
Splitting Algorithm for the Symmetric Monotone Linear Complementarity
Problem," SIAM J. on Control and Optimization, Vol. 29, pp. 1037-1060.
[LuT92] Luo, Z. Q., and Tseng, P., 1992. "On the Convergence of the
Coordinate Descent Method for Convex Differentiable Minimization," J.
Optim. Theory Appl., Vol. 72, pp. 7-35.
[LuT93a] Luo, Z. Q., and Tseng, P., 1993. "On the Convergence Rate
of Dual Ascent Methods for Linearly Constrained Convex Minimization,"
Math. of Operations Research, Vol. 18, pp. 846-867.
[LuT93b] Luo, Z. Q., and Tseng, P., 1993. "Error Bound and Reduced-
Gradient Projection Algorithms for Convex Minimization over a Polyhedral
Set," SIAM J. on Optimization, Vol. 3, pp. 43-59.
[LuT93c] Luo, Z. Q., and Tseng, P., 1993. "Error Bounds and Convergence
Analysis of Feasible Descent Methods: A General Approach," Annals of
Operations Research, Vol. 46, pp. 157-178.
[LuT94a] Luo, Z. Q., and Tseng, P., 1994. "Analysis of an Approximate
Gradient Projection Method with Applications to the Backpropagation Al-
gorithm," Optimization Methods and Software, Vol. 4, pp. 85-101.
[LuT94b] Luo, Z. Q., and Tseng, P., 1994. "On the Rate of Convergence of a
Distributed Asynchronous Routing Algorithm," IEEE Trans. on Automatic
Control, Vol. 39, pp. 1123-1129.
[LuT13] Luss, R., and Teboulle, M., 2013. "Conditional Gradient Algo-
rithms for Rank-One Matrix Approximations with a Sparsity Constraint,"
SIAM Review, Vol. 55, pp. 65-98.
[LuY08] Luenberger, D. G., and Ye, Y., 2008. Linear and Nonlinear Pro-
gramming, 3rd Edition, Springer, NY.
[Sch86] Schrijver, A., 1986. Theory of Linear and Integer Programming,
Wiley, NY.
[Sch10] Schmidt, M., 2010. "Graphical Model Structure Learning with L1-
Regularization," PhD Thesis, Univ. of British Columbia.
[Sch14a] Schmidt, M., 2014. "Convergence Rate of Stochastic Gradient with
Constant Step Size," Computer Science Report, Univ. of British Columbia.
[Sch14b] Schmidt, M., 2014. "Convergence Rate of Proximal Gradient with
General Step-Size," Dept. of Computer Science, Unpublished Note, Univ.
of British Columbia.
[ShZ12] Shamir, O., and Zhang, T., 2012. "Stochastic Gradient Descent for
Non-Smooth Optimization: Convergence Results and Optimal Averaging
Schemes," arXiv preprint arXiv:1212.1824.
[Sha79] Shapiro, J. E., 1979. Mathematical Programming Structures and
Algorithms, Wiley, NY.
[Sho85] Shor, N. Z., 1985. Minimization Methods for Nondifferentiable
Functions, Springer-Verlag, Berlin.
[Sho98] Shor, N. Z., 1998. Nondifferentiable Optimization and Polynomial
Problems, Kluwer Academic Publishers, Dordrecht, Netherlands.
[SmS04] Smola, A. J., and Scholkopf, B., 2004. "A Tutorial on Support
Vector Regression," Statistics and Computing, Vol. 14, pp. 199-222.
[SoZ98] Solodov, M. V., and Zavriev, S. K., 1998. "Error Stability Proper-
ties of Generalized Gradient-Type Algorithms," J. Opt. Theory and Appl.,
Vol. 98, pp. 663-680.
[Sol98] Solodov, M. V., 1998. "Incremental Gradient Algorithms with Step-
sizes Bounded Away from Zero," Computational Optimization and Appli-
cations, Vol. 11, pp. 23-35.
[Spa03] Spall, J. C., 2003. Introduction to Stochastic Search and Optimiza-
tion: Estimation, Simulation, and Control, J. Wiley, Hoboken, NJ.
[Spa12] Spall, J. C., 2012. "Cyclic Seesaw Process for Optimization and
Identification," J. of Optimization Theory and Applications, Vol. 154, pp.
187-208.
[Spi83] Spingarn, J. E., 1983. "Partial Inverse of a Monotone Operator,"
Applied Mathematics and Optimization, Vol. 10, pp. 247-265.
[Spi85] Spingarn, J. E., 1985. "Applications of the Method of Partial In-
verses to Convex Programming: Decomposition," Math. Programming,
Vol. 32, pp. 199-223.
[StV09] Strohmer, T., and Vershynin, R., 2009. "A Randomized Kaczmarz
Algorithm with Exponential Convergence," J. Fourier Anal. Appl., Vol. 15,
pp. 262-278.
[StW70] Stoer, J., and Witzgall, C., 1970. Convexity and Optimization in
Finite Dimensions, Springer-Verlag, Berlin.
[YSQ14] You, K., Song, S., and Qiu, L., 2014. "Randomized Incremental
Least Squares for Distributed Estimation Over Sensor Networks," Preprints
of the 19th World Congress The International Federation of Automatic
Control Cape Town, South Africa.
[Ye92] Ye, Y., 1992. "A Potential Reduction Algorithm Allowing Column
Generation," SIAM J. on Optimization, Vol. 2, pp. 7-20.
[Ye97] Ye, Y., 1997. Interior Point Algorithms: Theory and Analysis, Wiley
Interscience, NY.
[YuR07] Yu, H., and Rousu, J., 2007. "An Efficient Method for Large Mar-
gin Parameter Optimization in Structured Prediction Problems," Technical
Report C-2007-87, Univ. of Helsinki.
[ZJL13] Zhang, H., Jiang, J., and Luo, Z. Q., 2013. "On the Linear Con-
vergence of a Proximal Gradient Method for a Class of Nonsmooth Convex
Minimization Problems," J. of the Operations Research Society of China,
Vol. 1, pp. 163-186.
[ZLW99] Zhao, X., Luh, P. B., and Wang, J., 1999. "Surrogate Gradient
Algorithm for Lagrangian Relaxation," J. Optimization Theory and Appli-
cations, Vol. 100, pp. 699-712.
[ZMJ13] Zhang, L., Mahdavi, M., and Jin, R., 2013. "Linear Convergence
with Condition Number Independent Access of Full Gradients," Advances
in Neural Information Processing Systems 26 (NIPS 2013), pp. 980-988.
[ZTD92] Zhang, Y., Tapia, R. A., and Dennis, J. E., 1992. "On the Su-
perlinear and Quadratic Convergence of Primal-Dual Interior Point Linear
Programming Algorithms," SIAM J. on Optimization, Vol. 2, pp. 304-324.
[Za102] Zalinescu, C., 2002. Convex Analysis in General Vector Spaces,
World Scientific, Singapore.
[Zan69] Zangwill, W. I., 1969. Nonlinear Programming, Prentice-Hall, En-
glewood Cliffs, NJ.
[Zou60] Zoutendijk, G., 1960. Methods of Feasible Directions, Elsevier
Publ. Co., Amsterdam.
[Zou76] Zoutendijk, G., 1976. Mathematical Programming Methods, North
Holland, Amsterdam.
INDEX
Hinge loss 30
Hyperplane 484
Hyperplane separation 484-487
I
Ill-conditioning 60, 109, 413
Image 445, 446
Improper function 468
Incremental Gauss-Newton method 103, 120
Incremental Newton method 97, 101, 118, 119
Incremental aggregated method 91, 94
Incremental constraint projection method 102, 365, 429
Incremental gradient method 84, 105, 118, 119, 130-132
Incremental gradient with momentum 92
Incremental method 25, 83, 166, 320
Incremental proximal method 341, 385, 429
Incremental subgradient method 84, 166, 341, 385, 428
Indicator function 487
Inner linearization 107, 182, 188, 194, 296
Infeasible problem 494
Infimum 444
Inner approximation 402
Inner product 444
Instability 186, 191, 269
Integer programming 6
Interior of a set 412, 453
Interior point 453
Interior point method 108, 412, 415, 423, 432
Interpolated iteration 249, 253, 298, 459
Inverse barrier 412
Inverse image 445, 446
J
K
Kaczmarz method 85, 98, 131
Krasnosel'skii-Mann Theorem 252, 285, 300, 459
L
ℓ1-norm 451
ℓ∞-norm 450
LMS method 119
Lagrange multiplier 507
Lagrangian function 3, 507
Lasso problem 27
Least absolute value deviations 27, 288
Least mean squares method 119
Left-continuous function 455
Level set 469
Limit 451
Limit point 451, 453
Limited memory quasi-Newton method 63, 338
Line minimization 60, 65, 69, 320
Line segment principle 473
Lineality space 479
Linear-conic problems 15, 16
Linear convergence 57
Linear equation 445
Linear function 445
Linear inequality 445
Linear programming 16, 415, 434
Linear programming duality 506
Linear regularity 369
Linear transformation preservation of closedness 483
Linearly independent vectors 446
Lipschitz continuity 141, 455, 512
Local convergence 68
Local maximum 495
Local minimum 495
Location theory 32
Logarithmic barrier 412, 416
Logistic loss 30
Lower limit 452
Z
Zero-sum games 9
An insightful, comprehensive, and up-to-date treatment of the theory of convex optimization
algorithms, and some of their applications in large-scale resource allocation, signal processing,
and machine learning. The book complements the author's 2009 "Convex Optimization Theory"
book, which focuses on convex analysis and duality theory. The two books can be read independently,
share notation and style, and together cover the entire finite-dimensional convex optimization field.
Related Athena Scientific books by the same author: Visit Athena Scientific online at:
www.athenasc.com
ISBN-13: 978-1-886529-28-1