Subgradient Method
Ryan Tibshirani
Convex Optimization 10-725
Last last time: gradient descent
Consider the problem

    min_x f(x)

for f convex and differentiable, dom(f) = R^n. Gradient descent:
choose initial x^(0) ∈ R^n, repeat:

    x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),   k = 1, 2, 3, . . .
Step sizes tk chosen to be fixed and small, or by backtracking line
search
If ∇f is Lipschitz, gradient descent has convergence rate O(1/ε).
Downsides:
• Requires f differentiable — addressed this lecture
• Can be slow to converge — addressed next lecture
Subgradient method
Now consider f convex, having dom(f ) = Rn , but not necessarily
differentiable
Subgradient method: like gradient descent, but replacing gradients
with subgradients. Initialize x^(0), repeat:

    x^(k) = x^(k−1) − t_k · g^(k−1),   k = 1, 2, 3, . . .

where g^(k−1) ∈ ∂f(x^(k−1)), any subgradient of f at x^(k−1)

Subgradient method is not necessarily a descent method, thus we
keep track of best iterate x_best^(k) among x^(0), . . . , x^(k) so far, i.e.,

    f(x_best^(k)) = min_{i=0,...,k} f(x^(i))
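As a minimal sketch (not from the slides; helper names are mine), the method with best-iterate tracking might look like this in Python, on the toy nondifferentiable problem f(x) = |x|:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step, iters):
    """Subgradient method with best-iterate tracking.

    subgrad(x) returns any element of the subdifferential at x;
    step(k) gives the pre-specified step size t_k.
    """
    x = x0
    x_best, f_best = x0, f(x0)
    for k in range(1, iters + 1):
        x = x - step(k) * subgrad(x)
        if f(x) < f_best:          # not a descent method, so keep the best iterate
            x_best, f_best = x, f(x)
    return x_best, f_best

# Toy problem: f(x) = |x|, nondifferentiable at the minimizer x = 0.
# sign(x) is a valid subgradient everywhere (sign(0) = 0 lies in [-1, 1]).
x_best, f_best = subgradient_method(
    abs, np.sign, x0=5.0,
    step=lambda k: 1.0 / k,        # diminishing: square summable, not summable
    iters=1000)
```

Note the iterates eventually oscillate around 0 rather than settling, which is exactly why the best iterate is tracked.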
Outline
Today:
• How to choose step sizes
• Convergence analysis
• Intersection of sets
• Projected subgradient method
Step size choices
• Fixed step sizes: t_k = t, all k = 1, 2, 3, . . .
• Diminishing step sizes: choose to meet conditions

    ∑_{k=1}^∞ t_k^2 < ∞,   ∑_{k=1}^∞ t_k = ∞,

  i.e., square summable but not summable. Important here that
  step sizes go to zero, but not too fast
There are several other options too, but the key difference from gradient
descent: step sizes are pre-specified, not adaptively computed
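For concreteness, a quick numerical illustration (values chosen by me, not from the slides) that t_k = 1/k is square summable but not summable:

```python
import numpy as np

# Fixed: t_k = t for all k; diminishing: e.g., t_k = t/k.
fixed = lambda k, t=0.01: t
diminishing = lambda k, t=1.0: t / k

# For t_k = 1/k: the squares sum to pi^2/6 (finite), while the
# partial sums of t_k grow like log(k) without bound.
t = 1.0 / np.arange(1, 100001)
sum_of_squares = np.sum(t ** 2)   # approaches pi^2/6 ~ 1.6449
partial_sum = np.sum(t)           # ~ log(1e5) + 0.5772 ~ 12.09, still growing
```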
Convergence analysis
Assume that f convex, dom(f) = R^n, and also that f is Lipschitz
continuous with constant G > 0, i.e.,

    |f(x) − f(y)| ≤ G‖x − y‖_2   for all x, y

Theorem: For a fixed step size t, subgradient method satisfies

    lim_{k→∞} f(x_best^(k)) ≤ f⋆ + G^2 t/2

Theorem: For diminishing step sizes, subgradient method satisfies

    lim_{k→∞} f(x_best^(k)) = f⋆
Basic inequality
Can prove both results from same basic inequality. Key steps:
• Using definition of subgradient,

    ‖x^(k) − x⋆‖_2^2 ≤ ‖x^(k−1) − x⋆‖_2^2 − 2 t_k (f(x^(k−1)) − f(x⋆)) + t_k^2 ‖g^(k−1)‖_2^2

• Iterating last inequality,

    ‖x^(k) − x⋆‖_2^2 ≤ ‖x^(0) − x⋆‖_2^2 − 2 ∑_{i=1}^k t_i (f(x^(i−1)) − f(x⋆)) + ∑_{i=1}^k t_i^2 ‖g^(i−1)‖_2^2
• Using ‖x^(k) − x⋆‖_2 ≥ 0, and letting R = ‖x^(0) − x⋆‖_2,

    0 ≤ R^2 − 2 ∑_{i=1}^k t_i (f(x^(i−1)) − f(x⋆)) + G^2 ∑_{i=1}^k t_i^2

• Introducing f(x_best^(k)) = min_{i=0,...,k} f(x^(i)), and rearranging, we
have the basic inequality

    f(x_best^(k)) − f(x⋆) ≤ (R^2 + G^2 ∑_{i=1}^k t_i^2) / (2 ∑_{i=1}^k t_i)

For different step size choices, convergence results can be directly
obtained from this bound, e.g., previous theorems follow
Convergence rate
The basic inequality tells us that after k steps, we have

    f(x_best^(k)) − f(x⋆) ≤ (R^2 + G^2 ∑_{i=1}^k t_i^2) / (2 ∑_{i=1}^k t_i)

With fixed step size t, this gives

    f(x_best^(k)) − f⋆ ≤ R^2/(2kt) + G^2 t/2

For this to be ≤ ε, let's make each term ≤ ε/2. So we can choose
t = ε/G^2, and k = R^2/t · 1/ε = R^2 G^2/ε^2

That is, subgradient method has convergence rate O(1/ε^2) ... note
that this is slower than O(1/ε) rate of gradient descent
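As a sanity check on this arithmetic (the function name is mine, for illustration only):

```python
import math

def fixed_step_plan(R, G, eps):
    """Choose t and k so that each term of the bound R^2/(2kt) + G^2 t/2
    is at most eps/2, making the total suboptimality gap at most eps."""
    t = eps / G ** 2
    k = math.ceil(R ** 2 * G ** 2 / eps ** 2)   # = R^2/t * 1/eps
    return t, k

t, k = fixed_step_plan(R=1.0, G=1.0, eps=0.01)
bound = 1.0 / (2 * k * t) + t / 2   # evaluates the bound; should be <= eps
```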
Example: regularized logistic regression
Given (x_i, y_i) ∈ R^p × {0, 1} for i = 1, . . . , n, the logistic regression
loss is

    f(β) = ∑_{i=1}^n ( −y_i x_i^T β + log(1 + exp(x_i^T β)) )

This is a smooth and convex function with

    ∇f(β) = ∑_{i=1}^n ( p_i(β) − y_i ) x_i

where p_i(β) = exp(x_i^T β)/(1 + exp(x_i^T β)), i = 1, . . . , n. Consider
the regularized problem:

    min_β f(β) + λ · P(β)

where P(β) = ‖β‖_2^2, ridge penalty; or P(β) = ‖β‖_1, lasso penalty
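As a hedged sketch (synthetic data and helper names are mine, not the slides'), a subgradient for the lasso-penalized problem combines ∇f with λ · sign(β):

```python
import numpy as np

def grad_logistic(beta, X, y):
    """Gradient of the logistic loss: sum_i (p_i(beta) - y_i) x_i."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (p - y)

def subgrad_lasso(beta, X, y, lam):
    """A subgradient of f(beta) + lam*||beta||_1: add lam * s, where
    s_j = sign(beta_j) is one valid choice (any s_j in [-1,1] when beta_j = 0)."""
    return grad_logistic(beta, X, y) + lam * np.sign(beta)

def objective(beta, X, y, lam):
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z) + lam * np.abs(beta).sum()

# Synthetic data mirroring the example's dimensions (n = 1000, p = 20).
rng = np.random.default_rng(0)
n, p, lam = 1000, 20, 5.0
X = rng.standard_normal((n, p))
y = (X @ rng.standard_normal(p) + rng.standard_normal(n) > 0).astype(float)

beta = np.zeros(p)
f_best = objective(beta, X, y, lam)
for k in range(1, 201):
    beta = beta - (0.001 / k) * subgrad_lasso(beta, X, y, lam)   # t_k = 0.001/k
    f_best = min(f_best, objective(beta, X, y, lam))
```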
Ridge: use gradients; lasso: use subgradients. Example here has
n = 1000, p = 20:
[Figure: f − f⋆ versus iteration k (0 to 200), two panels. Left: gradient
descent on the ridge problem with t = 0.001. Right: subgradient method on
the lasso problem with t = 0.001 and t = 0.001/k]
Step sizes hand-tuned to be favorable for each method (of course
comparison is imperfect, but it reveals the convergence behaviors)
Polyak step sizes
Polyak step sizes: when the optimal value f⋆ is known, take

    t_k = (f(x^(k−1)) − f⋆) / ‖g^(k−1)‖_2^2,   k = 1, 2, 3, . . .

Can be motivated from first step in subgradient proof:

    ‖x^(k) − x⋆‖_2^2 ≤ ‖x^(k−1) − x⋆‖_2^2 − 2 t_k (f(x^(k−1)) − f(x⋆)) + t_k^2 ‖g^(k−1)‖_2^2

Polyak step size minimizes the right-hand side

With Polyak step sizes, can show subgradient method converges to
optimal value. Convergence rate is still O(1/ε^2)
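A tiny illustration (toy problem of my own): Polyak steps on f(x) = ‖x‖_1, whose optimal value f⋆ = 0 is known:

```python
import numpy as np

f = lambda x: np.abs(x).sum()      # f* = 0, attained at x = 0
subgrad = lambda x: np.sign(x)     # a valid subgradient of the l1 norm

x = np.array([3.0, -1.5, 0.5, 2.0])
fstar = 0.0
for k in range(100):
    if f(x) == fstar:              # reached the optimum exactly
        break
    g = subgrad(x)
    t = (f(x) - fstar) / (g @ g)   # Polyak step size
    x = x - t * g
```

On this example the iterates reach the minimizer exactly after a few steps, much faster than fixed or diminishing schedules would.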
Example: intersection of sets
Suppose we want to find x⋆ ∈ C_1 ∩ · · · ∩ C_m, i.e., find a point in
intersection of closed, convex sets C_1, . . . , C_m

First define

    f_i(x) = dist(x, C_i),   i = 1, . . . , m
    f(x) = max_{i=1,...,m} f_i(x)

and now solve

    min_x f(x)

Check: is this convex?

Note that f⋆ = 0 ⇐⇒ x⋆ ∈ C_1 ∩ · · · ∩ C_m
Recall the distance function dist(x, C) = min_{y∈C} ‖y − x‖_2. Last
time we computed its gradient

    ∇dist(x, C) = (x − P_C(x)) / ‖x − P_C(x)‖_2

where P_C(x) is the projection of x onto C

Also recall subgradient rule: if f(x) = max_{i=1,...,m} f_i(x), then

    ∂f(x) = conv( ∪_{i : f_i(x) = f(x)} ∂f_i(x) )

So if f_i(x) = f(x) and g_i ∈ ∂f_i(x), then g_i ∈ ∂f(x)
Put these two facts together for intersection of sets problem, with
f_i(x) = dist(x, C_i): if C_i is farthest set from x (so f_i(x) = f(x)),
and

    g_i = ∇f_i(x) = (x − P_{C_i}(x)) / ‖x − P_{C_i}(x)‖_2

then g_i ∈ ∂f(x)

Now apply subgradient method, with Polyak size t_k = f(x^(k−1)).
At iteration k, with C_i farthest from x^(k−1), we perform update

    x^(k) = x^(k−1) − f(x^(k−1)) · (x^(k−1) − P_{C_i}(x^(k−1))) / ‖x^(k−1) − P_{C_i}(x^(k−1))‖_2
          = P_{C_i}(x^(k−1))
For two sets, this is the famous alternating projections algorithm¹,
i.e., just keep projecting back and forth

[Figure: alternating projections between two convex sets (from Boyd's
lecture notes)]

¹ von Neumann (1950), "Functional operators, volume II: The geometry of
orthogonal spaces"
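A sketch of alternating projections for two simple sets (my choices: the unit ℓ2 ball and a halfspace, which intersect):

```python
import numpy as np

def proj_ball(x):
    """Project onto the unit l2 ball {x : ||x||_2 <= 1}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

def proj_halfspace(x, a, b):
    """Project onto the halfspace {x : a^T x >= b}."""
    s = a @ x - b
    return x if s >= 0.0 else x - s * a / (a @ a)

# Keep projecting back and forth between the two sets.
a, b = np.array([1.0, 0.0]), 0.5
x = np.array([-3.0, 2.0])
for _ in range(100):
    x = proj_halfspace(proj_ball(x), a, b)
```

When the sets intersect, the iterates converge to a point in the intersection; here the final x lies in both sets.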
Projected subgradient method
To optimize a convex function f over a convex set C,
    min_x f(x)   subject to x ∈ C
we can use the projected subgradient method. Just like the usual
subgradient method, except we project onto C at each iteration:
    x^(k) = P_C( x^(k−1) − t_k · g^(k−1) ),   k = 1, 2, 3, . . .
Assuming we can do this projection, we get the same convergence
guarantees as the usual subgradient method, with the same step
size choices
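For instance (a toy problem of my own), minimizing ‖x − c‖_1 over the nonnegative orthant, where the projection is just a componentwise max:

```python
import numpy as np

# Minimize f(x) = ||x - c||_1 subject to x >= 0.
c = np.array([2.0, -1.0, 0.5])           # solution: x* = max(c, 0), with f* = 1
proj = lambda x: np.maximum(x, 0.0)      # projection onto R^n_+
subgrad = lambda x: np.sign(x - c)       # a subgradient of ||x - c||_1

x = np.zeros(3)
x_best, f_best = x.copy(), np.abs(x - c).sum()
for k in range(1, 2001):
    x = proj(x - (1.0 / k) * subgrad(x))  # subgradient step, then project onto C
    fx = np.abs(x - c).sum()
    if fx < f_best:                       # best-iterate tracking, as before
        x_best, f_best = x.copy(), fx
```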
What sets C are easy to project onto? Lots, e.g.,
• Affine images: {Ax + b : x ∈ R^n}
• Solution set of linear system: {x : Ax = b}
• Nonnegative orthant: R^n_+ = {x : x ≥ 0}
• Some norm balls: {x : ‖x‖_p ≤ 1} for p = 1, 2, ∞
• Some simple polyhedra and simple cones
Warning: it is easy to write down a seemingly simple set C for which
P_C turns out to be very hard! E.g., it is generally hard to project onto
an arbitrary polyhedron C = {x : Ax ≤ b}
Note: projected gradient descent works too, more next time ...
Can we do better?
Upside of the subgradient method: broad applicability. Downside:
O(1/ε^2) convergence rate over problem class of convex, Lipschitz
functions is really slow
Nonsmooth first-order methods: iterative methods updating x^(k) in

    x^(0) + span{g^(0), g^(1), . . . , g^(k−1)}

where subgradients g^(0), g^(1), . . . , g^(k−1) come from a weak oracle
Theorem (Nesterov): For any k ≤ n − 1 and starting point x^(0),
there is a function in the problem class such that any nonsmooth
first-order method satisfies

    f(x^(k)) − f⋆ ≥ RG / (2(1 + √(k + 1)))
Improving on the subgradient method
In words, we cannot do better than the O(1/ε^2) rate of subgradient
method (unless we go beyond nonsmooth first-order methods)
So instead of trying to improve across the board, we will focus on
minimizing composite functions of the form
f (x) = g(x) + h(x)
where g is convex and differentiable, h is convex and nonsmooth
but “simple”
For a lot of problems (i.e., functions h), we can recover the O(1/ε)
rate of gradient descent with a simple algorithm, having important
practical consequences
References and further reading
• S. Boyd, Lecture notes for EE 264B, Stanford University,
Spring 2010-2011
• Y. Nesterov (1998), “Introductory lectures on convex
optimization: a basic course”, Chapter 3
• B. Polyak (1987), “Introduction to optimization”, Chapter 5
• L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring
2011-2012