Subgradient Method
Ryan Tibshirani
Convex Optimization 10-725
Last last time: gradient descent
Consider the problem

    min_x f(x)

for f convex and differentiable, dom(f) = R^n. Gradient descent:
choose initial x^(0) ∈ R^n, repeat:

    x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),   k = 1, 2, 3, . . .
Step sizes tk chosen to be fixed and small, or by backtracking line
search
If ∇f is Lipschitz, gradient descent has convergence rate O(1/ε).
Downsides:
• Requires f differentiable — addressed this lecture
• Can be slow to converge — addressed next lecture
Subgradient method
Now consider f convex, having dom(f ) = Rn , but not necessarily
differentiable
Subgradient method: like gradient descent, but replacing gradients
with subgradients. Initialize x^(0), repeat:

    x^(k) = x^(k−1) − t_k · g^(k−1),   k = 1, 2, 3, . . .

where g^(k−1) ∈ ∂f(x^(k−1)), any subgradient of f at x^(k−1)

Subgradient method is not necessarily a descent method, thus we
keep track of best iterate x_best^(k) among x^(0), . . . , x^(k) so far, i.e.,

    f(x_best^(k)) = min_{i=0,...,k} f(x^(i))
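As a minimal sketch (not from the slides; helper names are mine), the method with best-iterate tracking might look like this in Python, on the toy nondifferentiable problem f(x) = |x|:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step, iters):
    """Subgradient method with best-iterate tracking.

    subgrad(x) returns any element of the subdifferential at x;
    step(k) gives the pre-specified step size t_k.
    """
    x = x0
    x_best, f_best = x0, f(x0)
    for k in range(1, iters + 1):
        x = x - step(k) * subgrad(x)
        if f(x) < f_best:          # not a descent method, so keep the best iterate
            x_best, f_best = x, f(x)
    return x_best, f_best

# Toy problem: f(x) = |x|, nondifferentiable at the minimizer x = 0.
# sign(x) is a valid subgradient everywhere (sign(0) = 0 lies in [-1, 1]).
x_best, f_best = subgradient_method(
    abs, np.sign, x0=5.0,
    step=lambda k: 1.0 / k,        # diminishing: square summable, not summable
    iters=1000)
```

Note the iterates eventually oscillate around 0 rather than settling, which is exactly why the best iterate is tracked.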
Outline
Today:
• How to choose step sizes
• Convergence analysis
• Intersection of sets
• Projected subgradient method
Step size choices
• Fixed step sizes: t_k = t, all k = 1, 2, 3, . . .
• Diminishing step sizes: choose to meet conditions

    ∑_{k=1}^∞ t_k^2 < ∞,   ∑_{k=1}^∞ t_k = ∞,

  i.e., square summable but not summable. Important here that
  step sizes go to zero, but not too fast
There are several other options too, but the key difference from gradient
descent: step sizes are pre-specified, not adaptively computed
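For concreteness, a quick numerical illustration (values chosen by me, not from the slides) that t_k = 1/k is square summable but not summable:

```python
import numpy as np

# Fixed: t_k = t for all k; diminishing: e.g., t_k = t/k.
fixed = lambda k, t=0.01: t
diminishing = lambda k, t=1.0: t / k

# For t_k = 1/k: the squares sum to pi^2/6 (finite), while the
# partial sums of t_k grow like log(k) without bound.
t = 1.0 / np.arange(1, 100001)
sum_of_squares = np.sum(t ** 2)   # approaches pi^2/6 ~ 1.6449
partial_sum = np.sum(t)           # ~ log(1e5) + 0.5772 ~ 12.09, still growing
```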
Convergence analysis
Assume that f convex, dom(f) = R^n, and also that f is Lipschitz
continuous with constant G > 0, i.e.,

    |f(x) − f(y)| ≤ G‖x − y‖_2   for all x, y

Theorem: For a fixed step size t, subgradient method satisfies

    lim_{k→∞} f(x_best^(k)) ≤ f⋆ + G^2 t/2

Theorem: For diminishing step sizes, subgradient method satisfies

    lim_{k→∞} f(x_best^(k)) = f⋆
Basic inequality
Can prove both results from same basic inequality. Key steps:
• Using definition of subgradient,

    ‖x^(k) − x⋆‖_2^2 ≤ ‖x^(k−1) − x⋆‖_2^2 − 2 t_k (f(x^(k−1)) − f(x⋆)) + t_k^2 ‖g^(k−1)‖_2^2

• Iterating last inequality,

    ‖x^(k) − x⋆‖_2^2 ≤ ‖x^(0) − x⋆‖_2^2 − 2 ∑_{i=1}^k t_i (f(x^(i−1)) − f(x⋆)) + ∑_{i=1}^k t_i^2 ‖g^(i−1)‖_2^2
• Using ‖x^(k) − x⋆‖_2 ≥ 0, and letting R = ‖x^(0) − x⋆‖_2,

    0 ≤ R^2 − 2 ∑_{i=1}^k t_i (f(x^(i−1)) − f(x⋆)) + G^2 ∑_{i=1}^k t_i^2

• Introducing f(x_best^(k)) = min_{i=0,...,k} f(x^(i)), and rearranging, we
have the basic inequality

    f(x_best^(k)) − f(x⋆) ≤ (R^2 + G^2 ∑_{i=1}^k t_i^2) / (2 ∑_{i=1}^k t_i)

For different step size choices, convergence results can be directly
obtained from this bound, e.g., previous theorems follow
Convergence rate
The basic inequality tells us that after k steps, we have

    f(x_best^(k)) − f(x⋆) ≤ (R^2 + G^2 ∑_{i=1}^k t_i^2) / (2 ∑_{i=1}^k t_i)

With fixed step size t, this gives

    f(x_best^(k)) − f⋆ ≤ R^2/(2kt) + G^2 t/2

For this to be ≤ ε, let's make each term ≤ ε/2. So we can choose
t = ε/G^2, and k = R^2/t · 1/ε = R^2 G^2/ε^2

That is, subgradient method has convergence rate O(1/ε^2) ... note
that this is slower than O(1/ε) rate of gradient descent
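As a sanity check on this arithmetic (the function name is mine, for illustration only):

```python
import math

def fixed_step_plan(R, G, eps):
    """Choose t and k so that each term of the bound R^2/(2kt) + G^2 t/2
    is at most eps/2, making the total suboptimality gap at most eps."""
    t = eps / G ** 2
    k = math.ceil(R ** 2 * G ** 2 / eps ** 2)   # = R^2/t * 1/eps
    return t, k

t, k = fixed_step_plan(R=1.0, G=1.0, eps=0.01)
bound = 1.0 / (2 * k * t) + t / 2   # evaluates the bound; should be <= eps
```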
Example: regularized logistic regression
Given (x_i, y_i) ∈ R^p × {0, 1} for i = 1, . . . , n, the logistic regression
loss is

    f(β) = ∑_{i=1}^n ( −y_i x_i^T β + log(1 + exp(x_i^T β)) )

This is a smooth and convex function with

    ∇f(β) = ∑_{i=1}^n ( p_i(β) − y_i ) x_i

where p_i(β) = exp(x_i^T β)/(1 + exp(x_i^T β)), i = 1, . . . , n. Consider
the regularized problem:

    min_β f(β) + λ · P(β)

where P(β) = ‖β‖_2^2, ridge penalty; or P(β) = ‖β‖_1, lasso penalty
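As a hedged sketch (synthetic data and helper names are mine, not the slides'), a subgradient for the lasso-penalized problem combines ∇f with λ · sign(β):

```python
import numpy as np

def grad_logistic(beta, X, y):
    """Gradient of the logistic loss: sum_i (p_i(beta) - y_i) x_i."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (p - y)

def subgrad_lasso(beta, X, y, lam):
    """A subgradient of f(beta) + lam*||beta||_1: add lam * s, where
    s_j = sign(beta_j) is one valid choice (any s_j in [-1,1] when beta_j = 0)."""
    return grad_logistic(beta, X, y) + lam * np.sign(beta)

def objective(beta, X, y, lam):
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z) + lam * np.abs(beta).sum()

# Synthetic data mirroring the example's dimensions (n = 1000, p = 20).
rng = np.random.default_rng(0)
n, p, lam = 1000, 20, 5.0
X = rng.standard_normal((n, p))
y = (X @ rng.standard_normal(p) + rng.standard_normal(n) > 0).astype(float)

beta = np.zeros(p)
f_best = objective(beta, X, y, lam)
for k in range(1, 201):
    beta = beta - (0.001 / k) * subgrad_lasso(beta, X, y, lam)   # t_k = 0.001/k
    f_best = min(f_best, objective(beta, X, y, lam))
```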
Ridge: use gradients; lasso: use subgradients. Example here has
n = 1000, p = 20:
[Figure: f − f⋆ versus iteration k (0 to 200), two panels. Left: gradient
descent on the ridge problem with t = 0.001. Right: subgradient method on
the lasso problem with t = 0.001 and t = 0.001/k]
Step sizes hand-tuned to be favorable for each method (of course
comparison is imperfect, but it reveals the convergence behaviors)
Polyak step sizes
Polyak step sizes: when the optimal value f⋆ is known, take

    t_k = (f(x^(k−1)) − f⋆) / ‖g^(k−1)‖_2^2,   k = 1, 2, 3, . . .

Can be motivated from first step in subgradient proof:

    ‖x^(k) − x⋆‖_2^2 ≤ ‖x^(k−1) − x⋆‖_2^2 − 2 t_k (f(x^(k−1)) − f(x⋆)) + t_k^2 ‖g^(k−1)‖_2^2

Polyak step size minimizes the right-hand side

With Polyak step sizes, can show subgradient method converges to
optimal value. Convergence rate is still O(1/ε^2)
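A tiny illustration (toy problem of my own): Polyak steps on f(x) = ‖x‖_1, whose optimal value f⋆ = 0 is known:

```python
import numpy as np

f = lambda x: np.abs(x).sum()      # f* = 0, attained at x = 0
subgrad = lambda x: np.sign(x)     # a valid subgradient of the l1 norm

x = np.array([3.0, -1.5, 0.5, 2.0])
fstar = 0.0
for k in range(100):
    if f(x) == fstar:              # reached the optimum exactly
        break
    g = subgrad(x)
    t = (f(x) - fstar) / (g @ g)   # Polyak step size
    x = x - t * g
```

On this example the iterates reach the minimizer exactly after a few steps, much faster than fixed or diminishing schedules would.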
Example: intersection of sets
Suppose we want to find x⋆ ∈ C_1 ∩ · · · ∩ C_m, i.e., find a point in
intersection of closed, convex sets C_1, . . . , C_m

First define

    f_i(x) = dist(x, C_i),   i = 1, . . . , m
    f(x) = max_{i=1,...,m} f_i(x)

and now solve

    min_x f(x)

Check: is this convex?

Note that f⋆ = 0 ⇐⇒ x⋆ ∈ C_1 ∩ · · · ∩ C_m
Recall the distance function dist(x, C) = min_{y∈C} ‖y − x‖_2. Last
time we computed its gradient

    ∇dist(x, C) = (x − P_C(x)) / ‖x − P_C(x)‖_2

where P_C(x) is the projection of x onto C

Also recall subgradient rule: if f(x) = max_{i=1,...,m} f_i(x), then

    ∂f(x) = conv( ∪_{i : f_i(x) = f(x)} ∂f_i(x) )

So if f_i(x) = f(x) and g_i ∈ ∂f_i(x), then g_i ∈ ∂f(x)
Put these two facts together for intersection of sets problem, with
f_i(x) = dist(x, C_i): if C_i is farthest set from x (so f_i(x) = f(x)),
and

    g_i = ∇f_i(x) = (x − P_{C_i}(x)) / ‖x − P_{C_i}(x)‖_2

then g_i ∈ ∂f(x)

Now apply subgradient method, with Polyak size t_k = f(x^(k−1)).
At iteration k, with C_i farthest from x^(k−1), we perform update

    x^(k) = x^(k−1) − f(x^(k−1)) · (x^(k−1) − P_{C_i}(x^(k−1))) / ‖x^(k−1) − P_{C_i}(x^(k−1))‖_2
          = P_{C_i}(x^(k−1))
For two sets, this is the famous alternating projections algorithm¹,
i.e., just keep projecting back and forth

[Figure: alternating projections between two convex sets (from Boyd's
lecture notes)]

¹ von Neumann (1950), "Functional operators, volume II: The geometry of
orthogonal spaces"
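A sketch of alternating projections for two simple sets (my choices: the unit ℓ2 ball and a halfspace, which intersect):

```python
import numpy as np

def proj_ball(x):
    """Project onto the unit l2 ball {x : ||x||_2 <= 1}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

def proj_halfspace(x, a, b):
    """Project onto the halfspace {x : a^T x >= b}."""
    s = a @ x - b
    return x if s >= 0.0 else x - s * a / (a @ a)

# Keep projecting back and forth between the two sets.
a, b = np.array([1.0, 0.0]), 0.5
x = np.array([-3.0, 2.0])
for _ in range(100):
    x = proj_halfspace(proj_ball(x), a, b)
```

When the sets intersect, the iterates converge to a point in the intersection; here the final x lies in both sets.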
Projected subgradient method
To optimize a convex function f over a convex set C,
    min_x f(x)   subject to x ∈ C
we can use the projected subgradient method. Just like the usual
subgradient method, except we project onto C at each iteration:
    x^(k) = P_C( x^(k−1) − t_k · g^(k−1) ),   k = 1, 2, 3, . . .
Assuming we can do this projection, we get the same convergence
guarantees as the usual subgradient method, with the same step
size choices
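For instance (a toy problem of my own), minimizing ‖x − c‖_1 over the nonnegative orthant, where the projection is just a componentwise max:

```python
import numpy as np

# Minimize f(x) = ||x - c||_1 subject to x >= 0.
c = np.array([2.0, -1.0, 0.5])           # solution: x* = max(c, 0), with f* = 1
proj = lambda x: np.maximum(x, 0.0)      # projection onto R^n_+
subgrad = lambda x: np.sign(x - c)       # a subgradient of ||x - c||_1

x = np.zeros(3)
x_best, f_best = x.copy(), np.abs(x - c).sum()
for k in range(1, 2001):
    x = proj(x - (1.0 / k) * subgrad(x))  # subgradient step, then project onto C
    fx = np.abs(x - c).sum()
    if fx < f_best:                       # best-iterate tracking, as before
        x_best, f_best = x.copy(), fx
```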
What sets C are easy to project onto? Lots, e.g.,
• Affine images: {Ax + b : x ∈ R^n}
• Solution set of linear system: {x : Ax = b}
• Nonnegative orthant: R^n_+ = {x : x ≥ 0}
• Some norm balls: {x : ‖x‖_p ≤ 1} for p = 1, 2, ∞
• Some simple polyhedra and simple cones
Warning: it is easy to write down a seemingly simple set C for which
P_C turns out to be very hard! E.g., it is generally hard to project onto
an arbitrary polyhedron C = {x : Ax ≤ b}
Note: projected gradient descent works too, more next time ...
Can we do better?
Upside of the subgradient method: broad applicability. Downside:
O(1/ε^2) convergence rate over problem class of convex, Lipschitz
functions is really slow
Nonsmooth first-order methods: iterative methods updating x^(k) in

    x^(0) + span{g^(0), g^(1), . . . , g^(k−1)}

where subgradients g^(0), g^(1), . . . , g^(k−1) come from a weak oracle
Theorem (Nesterov): For any k ≤ n − 1 and starting point x^(0),
there is a function in the problem class such that any nonsmooth
first-order method satisfies

    f(x^(k)) − f⋆ ≥ RG / (2(1 + √(k + 1)))
Improving on the subgradient method
In words, we cannot do better than the O(1/ε^2) rate of subgradient
method (unless we go beyond nonsmooth first-order methods)
So instead of trying to improve across the board, we will focus on
minimizing composite functions of the form
f (x) = g(x) + h(x)
where g is convex and differentiable, h is convex and nonsmooth
but “simple”
For a lot of problems (i.e., functions h), we can recover the O(1/ε)
rate of gradient descent with a simple algorithm, having important
practical consequences
References and further reading
• S. Boyd, Lecture notes for EE 264B, Stanford University,
Spring 2010-2011
• Y. Nesterov (1998), “Introductory lectures on convex
optimization: a basic course”, Chapter 3
• B. Polyak (1987), “Introduction to optimization”, Chapter 5
• L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring
2011-2012