A Gentle Introduction To Gradient-Based Optimization
Neha S. Wadia¹*, Yatin Dandi², and Michael I. Jordan³
¹ Center for Computational Mathematics, Flatiron Institute
² SPOC Laboratory and IdePHICS Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL)
³ Department of EECS and Department of Statistics, University of California, Berkeley
* [email protected]
Abstract
The rapid progress in machine learning in recent years has been based on a highly productive
connection to gradient-based optimization. Further progress hinges in part on a shift in focus from
pattern recognition to decision-making and multi-agent problems. In these broader settings, new
mathematical challenges emerge that involve equilibria and game theory instead of optima. Gradient-
based methods remain essential—given the high dimensionality and large scale of machine-learning
problems—but simple gradient descent is no longer the point of departure for algorithm design.
We provide a gentle introduction to a broader framework for gradient-based algorithms in machine
learning, beginning with saddle points and monotone games, and proceeding to general variational
inequalities. While we provide convergence proofs for several of the algorithms that we present, our
main focus is that of providing motivation and intuition.
Contents
1 Introduction 4
1.1 The Challenges of Decision-Making Processes . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Multi-Way Markets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Challenges at the Intersection of Machine Learning and Economics . . . . . . . . . . 7
1.4 Two Illustrative Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Strategic Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 Distribution-Free Uncertainty Quantification for Decision-Making . . . . . . . 8
1.5 Overview of the Lectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 The Subgradient Method and a First Convergence Proof . . . . . . . . . . . . . . . . 10
3 Computing Equilibria 24
3.1 Monotone Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Fixed-Point Finding Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 A Naive Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.2 The Proximal Point Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.3 The Extragradient Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.4 High-Resolution Continuous-Time Limits . . . . . . . . . . . . . . . . . . . . 30
Acknowledgements
These notes are based on three lectures delivered by Michael Jordan at the summer school “Statis-
tical Physics and Machine Learning” held in Les Houches, France, in July 2022. Neha Wadia and
Yatin Dandi were students at the school. The authors are grateful to Florent Krzakala and Lenka
Zdeborová for organizing the school.
The authors thank Sai Praneeth Karimireddy for useful discussions clarifying the material on
monotone operators, and Sidak Pal Singh for collaboration in the early stages of this manuscript.
MJ was supported in part by the Mathematical Data Science program of the Office of Naval
Research under grant number N00014-18-1-2764 and by the Vannevar Bush Faculty Fellowship pro-
gram under grant number N00014-21-1-2941. NW’s attendance at the summer school was supported
by a UC Berkeley Graduate Division Conference Travel Grant. The Flatiron Institute is a division
of the Simons Foundation.
Author contributions: MJ prepared and delivered the lectures and edited the text. YD partially
drafted an early version of the first lecture and made the figures. NW wrote the text and edited the
figures.
List of Algorithms
1 Subgradient Method with Constant Step Size . . . . . . . . . . . . . . . . . . . . . . 11
2 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Accelerated Gradient Descent (Nesterov’s Method) . . . . . . . . . . . . . . . . . . . 17
4 Proximal Point Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5 Extragradient Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1 Introduction
Much research in current machine learning is devoted to “solving” or even surpassing human intel-
ligence, with a certain degree of opacity about what that means. A broader perspective shifts the
focus from the intelligence of a single individual to instead consider the intelligence of the collective.
The goals at this level of analysis may take into consideration overall flow, stability, robustness, and
adaptivity. Desired behavior may be formulated in terms of tradeoffs and equilibria rather than
optima. Social welfare can be an explicit objective, and we might ask that each individual be better
off when participating in the system than they were before. Thus, the goal would be to enhance
human intelligence instead of supplanting or superseding it.
The mathematical concepts needed to support such a perspective incorporate the pattern recog-
nition and optimization tools familiar in current machine learning research, but they go further,
bringing in dynamical systems methods that aim to find equilibria and bringing in ideas from mi-
croeconomics such as incentive theory, social welfare, and mechanism design. A first point of contact
between such concepts and machine learning recognizes that learning systems are not only pattern
recognition systems, but they are also decision-making systems. While a classical pattern recognition
system may construe decision-making as the thresholding of a prediction, a more realistic perspective
views a single prediction as a constituent of an active process by which a decision-maker gathers
evidence, designs further data-gathering efforts, considers counterfactuals, investigates the relevance
of older data to the current decision, and manages uncertainty. A still broader perspective takes
into account the interactions with other decision-makers, particularly in domains in which there is
scarcity. These interactions may include signaling mechanisms and other coordination devices. As
we broaden our scope in this way, we arrive at microeconomic concerns, while retaining the focus
on data-driven, online algorithms that is the province of machine learning.
The remainder of our presentation is organized as follows. We first expand on the limitations of
a pure pattern-recognition perspective on machine learning, bringing in aspects of decision-making
and social context. We then turn to an exemplary microeconomic model—the multi-way matching
market—and begin to consider blends of machine learning and microeconomics. Finally, we turn
to a didactic presentation of some of the mathematical concepts that bridge between the problem
of finding optima—specifically in the machine-learning setting of nonconvex functions and high-
dimensional spaces—and that of finding equilibria.
4. Explainability and interpretability. In a decision-making context, it is often important for the
human beings interacting with a model to be able to interpret how the model arrives at its
predictions so that they can reason about the relevance of those predictions to their lives.
All the problems we have listed above involve understanding how to design models whose outputs
we can guarantee are interpretable, fair, safe, and reliable for people to interact with. The discipline
that has historically dealt with this type of problem is not machine learning but economics.
To further illustrate the point, we discuss two specific examples of scenarios where the output of a pattern recognition system is clearly and naturally part of a decision-making process.
Suppose your physician has access to a trained neural network that takes as input a data vector
containing all your health information and generates an output vector of the predicted risk of your
developing various diseases. At your next annual medical check-in, you take a standard battery
of tests and your physician updates your data vector with the results and then queries the neural
network. Let’s focus on just one of the predictions of the network: your risk, represented by a
number between 0 and 1, of developing a fatal heart condition within the next five years, the only
treatment for which is preemptive heart surgery. Let’s say that on average people who score a 0.7
or higher develop the condition. The neural network outputs a 0.71 for you. Is there any sense in
which this prediction and the accompanying threshold constitutes a decision? Do you immediately
get the surgery?
The most likely scenario is that you sit down with your physician to reason through how to
interpret the result and make the most appropriate choice for your healthcare. A number of questions
might reasonably come up. What was the provenance of the training data for the neural network?
Were there enough contributions to that data from individuals with similar health profiles to yourself
to engender some degree of relevance in the prediction? What is the uncertainty in the prediction?
What are the sources of that uncertainty? Over the course of the conversation, you may suddenly
remember a piece of information that did not seem relevant before—that you had an uncle who died
of a suspected heart condition, let’s say—but that is now highly relevant. What is the effect of that
new data on your reasoning? Furthermore, you may consider counterfactuals. What if you were
to change your diet and swim for thirty minutes every day for six months—would that change the
prediction, and would it be worth risking postponing any treatment for that period of time in order
to ward off the risk of potentially unnecessary heart surgery?
The neural network can clearly help orient this decision-making process, but we cannot naively
threshold the output of the model to make a decision, especially because the stakes in this context
are high. We also see that the notion of a tradeoff appears naturally in decision-making contexts.
Let us consider another example. Recommendation systems are some of the most successful
pattern recognition systems built to date. Consider recommendation systems for books and movies.
Is it problematic if a single book or movie is recommended to a large number of people? Not
particularly. Books and movies are not scarce. Today, it is possible to print books on demand and
to stream movies on demand. Now let’s consider a recommendation system for routes to the airport.
Is it problematic to recommend the same route to the airport to a large number of people? In this
case, the answer is a clear yes. If everyone is directed down the same route, this will cause congestion.
The same applies in the case of a recommendation system for restaurants. Recommending the same
restaurant to a large number of people will result in lines around the block at a handful of places and
an artificial lack of customers at others. A simplistic approach to fixing this type of problem is to
treat recommendation in the context of scarcity as a load-balancing problem. The trouble with this
approach is that in many cases, it will not be clear on what basis to assign one subset of people to,
say, one restaurant versus another. In fact, any such assignment is likely to be arbitrarily made and
will introduce further problems with regards to fairness and also economic efficiency. Once again,
we see that the naive approach to making recommendations fails in certain contexts, in particular,
in contexts where a scarce resource is being allocated.
We hope we have convinced the reader that no matter how accurate the predictions of a model
may be in some technical sense, naive approaches to using those predictions to make decisions do
not work well. We argue that a better way forward is to think of the model and the parties that
interact with it as constituting a multi-way market.
parties.
In these lectures, we will focus mostly on the first item in this list. To give the reader a sense
of the flavor of research involved in addressing some of the other points in the list, however, we
give brief vignettes in the next section of research in strategic classification and distribution-free
uncertainty quantification.
Figure 1: The poverty index score was approximately Gaussian-distributed at the time when it was selected
as a measure of poverty that could be thresholded for the purposes of making social policy. With time, its
distribution became further and further skewed to the left of the threshold. This figure is reproduced from
[1].
to the left of the threshold. In a mere decade, the entire peak of the distribution of poverty index
scores had moved to the left of the threshold, which now qualified almost the whole population for
welfare assistance.
Strategic classification is a problem in which a population that is being classified for some purpose
may strategically modify their features in order to elicit a favorable classification. This problem finds
a natural formulation as a sequential two-player Stackelberg game between a player (the decision-
maker) deploying the classifier and a player (the strategic agents) being classified. The goal of the
decision-maker is to arrive at the Stackelberg equilibrium, which is the model that minimizes the
decision-maker’s loss after the strategic agents have adapted to it.
In the prevailing formulation of this game, the decision-maker leads by deploying a model, and
the strategic agents follow by immediately playing their best response to the model. It is assumed
that the strategic agents adapt to the model instantaneously. In [2], the authors formulate the game
more realistically by introducing a finite natural timescale of adaptation for the strategic agents.
Relative to this timescale, the decision-maker may choose to play on a timescale that is either
faster (a “reactive” decision-maker) or slower (a “proactive” decision-maker).1 These timescales
are naturally introduced to the problem by formulating the game such that each player must use a
learning algorithm to learn their strategies, as opposed to playing a strategy from a predetermined
set of possibilities. Within this formulation of strategic classification, the authors of [2] found that
the decision-maker can drive the game toward one of two different equilibria depending on whether
they chose to play proactively or reactively. Thus the dynamics of play directly influences the
outcome of the game. Furthermore, the authors were able to identify certain game settings in which
both players prefer the equilibrium that results when the decision-maker plays reactively.
The results of [2] exemplify how incorporating ideas of statistical learning into a game-theoretic
problem can yield a richer model of some social phenomenon that is productive of new insights. We
refer the reader to [3] for a broader discussion of this emerging field at the interface of learning and
economics.
statistical methods of uncertainty quantification such as the bootstrap are prohibitively computa-
tionally intensive at the scales of modern problems.
Conformal prediction is a methodology that gives us a way forward. Given the predictions of a
pattern recognition model, such as an image classifier, a conformal prediction algorithm efficiently
computes a confidence interval over the outputs of the model, thus providing statistical coverage
without the need to manipulate the model itself or to make strong assumptions about its architecture
or about the distribution of the data. The interested reader may refer to [4] for an accessible
introduction to conformal prediction.
To give the reader a flavor of the mathematics involved in conformal prediction, we give a high-
level overview of a method of constructing risk-controlling prediction sets [5]. Prediction sets are sets
that provably contain the true value of the entity being predicted with high probability. Prediction
sets are an actionable form of uncertainty quantification, and therefore useful for decision-making
purposes in certain contexts. For example, if a physician were to put an image of a patient’s colonic
epithelium through a classifier while trying to diagnose the cause of their chronic stomach pain, it
would be just as important to know what the likely causes of the pain are as to know which causes
can safely be ruled out.
Consider a dataset {(Xi, Yi)}, i = 1, . . . , m, of independent, identically distributed pairs consisting of feature vectors Xi ∈ X and labels Yi ∈ Y. In many applications, X = Rd for some large d ∈ N. Consider a partition of the dataset into sets of size n and m − n; the former forms the calibration set Ical, the latter the training set Itrain. We are in possession of a trained model fˆ (say, a classifier) that was
trained on Itrain . The goal is to construct a set-valued predictor Tλ : X → Y ′ , where Y ′ is a space
of sets (in many problems, Y ′ = 2Y ) and λ ∈ R is a scalar parameter; given some user-specified
γ, δ ∈ R, for an appropriate choice λ̂ of λ, we require the risk R of Tλ̂(X) to be bounded above with high probability:
P( R(Tλ̂(X)) ≤ γ ) ≥ 1 − δ. (1)
In [5], the authors provide a procedure by which to build T and choose λ̂.
Tλ has the following nesting property: for λ1 < λ2, we have Tλ1 ⊂ Tλ2. We define a loss function L : Y × Y′ → R≥0 on T. A typical choice of L(Y, Tλ(X)) is 1{Y ∉ Tλ(X)}. We assume that the loss function satisfies a monotonicity condition: Tλ1 ⊂ Tλ2 implies L(Y, Tλ1(X)) ≥ L(Y, Tλ2(X)); in other words, enlarging the prediction set drives down its loss. The risk of the prediction set is the average value of the loss over the data: R(Tλ(X)) = E L(Y, Tλ(X)).
Using the calibration set Ical , it is possible to construct a quantity Rmax (λ) such that for all λ,
P (R (Tλ (X)) ≤ Rmax (λ)) ≥ 1 − δ. Generally speaking, this construction involves inverting certain
concentration inequalities. The reader is referred to Section 3 of [5] for details. Once we have
Rmax(λ), taking
λ̂ = inf{λ : Rmax(λ′) < γ ∀λ′ ≥ λ} (2)
ensures that the property (1) is satisfied. This is the content of the main theorem of [5]. This
formalism is applied to a number of examples in [5], including protein folding, in which case fˆ is
AlphaFold, scene segmentation, and tumor detection in the colon. Note that the assumption of a
monotonic loss can be removed, and λ can be a low-dimensional vector instead of a number; these
generalizations can be found in [6].
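To make the calibration step concrete, here is a minimal Python sketch of the rule (2) on a finite grid of λ values. The function r_max standing in for Rmax(λ), the grid, and the error handling are illustrative assumptions, not details taken from [5]:

import numpy as np

def calibrate_lambda(r_max, gamma, lam_grid):
    """Return the smallest lambda on the grid such that the upper confidence
    bound Rmax(lambda') is below gamma for every lambda' >= lambda, as in (2).
    r_max: callable giving Rmax(lambda), assumed already built from the calibration set."""
    bounds = np.array([r_max(lam) for lam in lam_grid])
    ok = bounds < gamma
    # suffix condition: Rmax(lambda') < gamma at all grid points lambda' >= lambda
    suffix_ok = np.flip(np.logical_and.accumulate(np.flip(ok)))
    valid = np.flatnonzero(suffix_ok)
    if valid.size == 0:
        raise ValueError("no lambda on the grid satisfies the risk bound")
    return lam_grid[valid[0]]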
We will present the problem of computing equilibria as a generalization of the problem of com-
puting optima. We will provide a self-contained introduction to both these problems, starting with
optima and moving to equilibria. Our goal is to equip the reader with enough terminology and
technical knowledge to read and contextualize the primary literature.
Although the algorithms we will discuss are usually implemented in discrete time, on occasion
we will find it useful to study them in continuous time. As we saw in our discussion of Stackelberg
equilibria in Section 1.4.1, the dynamics of an algorithm can influence what the algorithm converges
to. This is true both in optimization and in games, and we will find in each case that the continuous-
time perspective makes this dependency clear. It is also the case that proofs of convergence are often
much simpler to write in continuous time than in discrete time.
We begin with the essentials of convex optimization through the discussion of two fundamental
algorithms: the subgradient method, in Section 1.6 and gradient descent, in Section 2.1. Convex
functions are the simplest class of functions for which we can provide convergence guarantees to
the global minimum of the function. We will see that the subgradient method reduces to gradient
descent when the function being optimized is differentiable in addition to being convex. In Section
2.2, we talk about the convergence properties of gradient descent on nonconvex functions, which
may in general have more than one minimum and may also have saddle points and maxima. We
will find it useful to have this discussion in continuous time.
Gradient descent is a first-order algorithm; its continuous-time limit is a first-order differen-
tial equation in time. We also provide an exposition on optimization with a description of some
recent results on accelerated algorithms; the continuous-time limits of these algorithms are second-
order differential equations in time, and can be interpreted as the equations of motion of a certain
Lagrangian.
The dynamics of a marketplace can often be formulated as a game. In order to effectively design
new markets and understand existing ones, it is crucial to be able to compute the equilibria of the
relevant games. Returning to discrete time, in Section 2.4, we graduate from studying optima to
studying equilibria. We introduce some basic vocabulary, including the concept of a variational
inequality, and we show that the Nash equilibria of certain kinds of games can be written as the
fixed points of variational inequalities involving monotone operators. In Section 3.1 we define mono-
tonicity and strong monotonicity, and in Section 3.2 we discuss the proximal point method and the
extragradient algorithm, and their convergence behavior on monotone operators. We wrap up these
lectures with a brief foray back into continuous time in Section 3.2.4, with a comparative study of
a number of fixed-point finding algorithms.
Throughout the text, we will omit mathematical details where necessary in the interests of clarity
and brevity (although we will make note of where we do this). The interested reader may find the
details of almost everything we discuss here in the following references: Convex Analysis [7], Convex
Analysis and Monotone Operator Theory in Hilbert Spaces [8], and Finite-Dimensional Variational
Inequalities and Complementarity Problems [9].
where f is a convex function. A function is said to be convex if the line segment connecting any two function values f(a) and f(b) lies on or above f(x) at all points x between a and b. We will give another
definition of convexity in the context of differentiable functions in the next lecture. In this section,
we do not assume that f is differentiable. In what follows, we do require some further regularity
conditions on f beyond convexity (such as convexity of its domain), but we will not trouble with
those details here.
A central notion in convex analysis is that of the subgradient of a function.
Definition 1.1 (Subgradient). gx is a subgradient of f at x if, ∀y, f (y) ≥ f (x) + ⟨gx , y − x⟩. This
inequality is called the subgradient inequality.
In the definition above and throughout these notes we use ⟨·, ·⟩ to denote the Euclidean inner product: ⟨x, y⟩ = Σ_{i=1}^d xi yi.
In general the subgradient is not unique. For example, the subgradient of the function f(x) = |x| where x ∈ R is given by
gx = 1 for x > 0, gx = −1 for x < 0, and gx ∈ [−1, 1] for x = 0. (4)
For this reason, it is also useful to give a name to the set of all subgradients of a function, which we
do next.
Definition 1.2 (Subdifferential). The subdifferential ∂f(x) of a function f at a point x is the set of all subgradients of f at x, i.e., ∂f(x) = {gx : f(y) ≥ f(x) + ⟨gx, y − x⟩ ∀y}.
In Algorithm 1, T is the number of iterations for which we run the algorithm, η is the (constant)
step size, gk is shorthand for gxk , and gk ∈ ∂f (xk ). We will often refer to xk as “the k-th iterate”.
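As a concrete illustration, the following is a minimal Python sketch of Algorithm 1; the test function f(x) = ||x||₁ and the choice of subgradient oracle are ours, purely for illustration:

import numpy as np

def subgradient_method(subgrad, x1, eta, T):
    """Algorithm 1: x_{k+1} = x_k - eta * g_k, with g_k any subgradient at x_k.
    Returns the averaged iterate, which is what the bound (8) controls."""
    x = np.asarray(x1, dtype=float)
    avg = np.zeros_like(x)
    for k in range(T):
        avg += x / T
        x = x - eta * subgrad(x)
    return avg

# Example: f(x) = ||x||_1; sign(x) is a valid subgradient (0 at the kink is allowed).
f = lambda x: np.abs(x).sum()
x_avg = subgradient_method(lambda x: np.sign(x), np.array([3.0, -2.0]), eta=0.05, T=2000)
print(f(x_avg))   # close to the optimal value f(0) = 0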
The subgradient method is not a descent method: it is not guaranteed to make progress toward
the optimum with each iteration. This is different from the case of gradient descent, which we will
discuss in the sequel (see Lemma 2.1). Nonetheless, we can prove that it converges in function value with a rate 1/√T. Let x⋆ denote the optimum (i.e., the solution to (3)), and || · || the Euclidean norm, and consider the quantity ||xk+1 − x⋆||²:
||xk+1 − x⋆||² = ||xk − ηgk − x⋆||² ≤ ||xk − x⋆||² − 2η(f(xk) − f(x⋆)) + η²G². (5)
To arrive at (5), we applied the subgradient inequality and assumed that ||gk ||2 is bounded above
by a constant G2 . We now move f (xk ) − f (x⋆ ) to the left-hand side, sum over k, and divide by 2ηT :
(1/T) Σ_{k=1}^T (f(xk) − f(x⋆)) ≤ (1/(2ηT)) Σ_{k=1}^T ( ||xk − x⋆||² − ||xk+1 − x⋆||² ) + (1/2)ηG². (6)
On the right-hand side of (6) we have a telescoping sum, in which all but the first and last terms
cancel. On the left-hand side, we take advantage of the convexity of f and apply Jensen’s inequality,
which states that f evaluated at the average of a set of points xi is at most the average of function
values at those points (for f convex). We arrive at
f( (1/T) Σ_{k=1}^T xk ) − f(x⋆) ≤ (1/(2ηT)) ( ||x1 − x⋆||² − ||xT+1 − x⋆||² ) + (1/2)ηG²
≤ R²/(2ηT) + (1/2)ηG². (7)
In the last step above, we have dropped the term ||xT +1 − x⋆ ||2 and assumed that ||x1 − x⋆ ||2 is
bounded above by a constant R². To obtain a rate of convergence, we choose η to minimize the right-hand side of the last inequality. The optimal value of η turns out to be R/(G√T). Substituting this in the inequality, we finally arrive at
f( (1/T) Σ_{k=1}^T xk ) − f(x⋆) ≤ RG/√T. (8)
2 Computing Optima in Discrete and Continuous Time
2.1 Convergence Guarantees for Gradient Descent on Convex Functions
We remind the reader that we are currently discussing algorithms for the following optimization
problem
min_{x ∈ Rd} f(x).
In Section 1.6, we assumed that f is convex but not necessarily differentiable, and we studied the
convergence behavior of the subgradient method (Algorithm 1) on this problem. When f is a smooth
function of x, the subgradient method reduces to the well-known gradient descent algorithm.
We will now give a definition of convexity for differentiable functions, and then explore the
consequences on the convergence behavior of gradient descent of two further assumptions we can
make on f (x): (1) Lipschitz smoothness, and (2) strong convexity. The proof techniques that arise
will be useful when we study variational inequalities.
Definition 2.1 (Convexity). A (differentiable) function f is convex on a convex set if and only if
f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ (9)
for all x, y in the set.
There are several other equivalent definitions of convexity. Similarly, there are many ways to
characterize smoothness. One possibility is the following.
Definition 2.2 (Smoothness). A function f is (Lipschitz) smooth if it satisfies:
||∇f (x) − ∇f (y)|| ≤ L||x − y||, (10)
where L > 0. Such a function is often referred to as L-smooth.
Note that (10) is equivalent to
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)||x − y||². (11)
The proof of this equivalence is left as an exercise to the reader. Hint: use the fundamental theorem
of calculus and the Cauchy-Schwarz inequality. The inequality (11) indicates that an L-smooth
function can be upper-bounded by a quadratic at every point y in its domain. With this intuition it is simple to see why, for example, the function f(x) = |x| is not smooth: there is no quadratic that touches f at x = 0 and sits above f in a neighborhood of x = 0.
We will now see that gradient descent on L-smooth convex functions converges with a rate 1/T, where T is the total number of iterations. This is faster than the 1/√T rate that we had in (8), when we did not assume smoothness.
We begin by proving a type of result called a descent lemma, which guarantees that we make a
certain amount of progress in decreasing the function value at each iteration of gradient descent.
Lemma 2.1 (Descent lemma for gradient descent on smooth convex functions). Consider an L-
smooth convex function f . Under the dynamics of Algorithm 2 with η = L−1 , we have
f(xk+1) ≤ f(xk) − (1/(2L))||∇f(xk)||². (12)
Proof. Take y = xk+1 and x = xk in (11), and replace xk+1 − xk with −L−1 ∇f (xk ).
Theorem 2.2. On an L-smooth convex function f , gradient descent with step size L−1 converges
in function value with a rate 1/T , where T is the number of iterations.
Proof. Beginning with Lemma 2.1, we have
f(xk+1) ≤ f(xk) − (1/(2L))||∇f(xk)||²
(i) ≤ f(x⋆) + ⟨∇f(xk), xk − x⋆⟩ − (1/(2L))||∇f(xk)||²
(ii) = f(x⋆) + (L/2) ( ||xk − x⋆||² − ||xk − x⋆ − (1/L)∇f(xk)||² )
(iii) = f(x⋆) + (L/2) ( ||xk − x⋆||² − ||xk+1 − x⋆||² ). (13)
In (i), we applied the definition of convexity (take y = xk and x = x⋆ in (9)). In (ii), we completed
the square, and in (iii), we recognized that xk − L−1 ∇f (xk ) is simply a gradient descent update
with step size L−1 and replaced it with xk+1 . Now we move f (x⋆ ) to the left-hand side and sum
(13) over k. On the right-hand side, this results in a telescoping sum. We are left with the following:
Σ_{k=1}^T (f(xk+1) − f(x⋆)) ≤ (L/2) ( ||x1 − x⋆||² − ||xT+1 − x⋆||² ). (14)
We can drop the second term on the right-hand side of (14), since it is nonpositive. Note that Lemma 2.1 implies f(xk+1) − f(x⋆) ≤ f(xk) − f(x⋆) for all k. Each term in the sum on the left-hand side of (14) is therefore lower-bounded by the quantity f(xT+1) − f(x⋆). Using this lower bound, (14) implies
T (f(xT+1) − f(x⋆)) ≤ (L/2)||x1 − x⋆||². (15)
Lastly, we bound the distance between x1 and the optimum by some R > 0, and divide through by T. We thus arrive at the result
f(xT+1) − f(x⋆) ≤ LR²/(2T). (16)
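A minimal Python sketch of gradient descent with step size 1/L (Algorithm 2), together with a numerical check of the monotone decrease promised by Lemma 2.1, follows; the quadratic test function and its smoothness constant are illustrative choices:

import numpy as np

def gradient_descent(grad, x1, L, T):
    """Algorithm 2 with the step size eta = 1/L used in Lemma 2.1 and Theorem 2.2."""
    x = np.asarray(x1, dtype=float)
    traj = [x.copy()]
    for _ in range(T):
        x = x - grad(x) / L
        traj.append(x.copy())
    return traj

# Illustrative L-smooth convex function: f(x) = 0.5 * x^T A x with A symmetric positive definite.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L_smooth = np.linalg.eigvalsh(A).max()          # Lipschitz constant of the gradient
traj = gradient_descent(grad, np.array([5.0, -3.0]), L_smooth, T=100)
vals = [f(x) for x in traj]
assert all(v2 <= v1 + 1e-12 for v1, v2 in zip(vals, vals[1:]))  # the descent lemma in action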
Theorem 2.3. On an L-smooth and µ-strongly convex function f , gradient descent with step size
η ≤ 2/(L + µ) converges at the rate (1 − C)T , where C ≤ 1 is a constant that depends on η, µ, and
L, and T is the total number of iterations. In particular,
||xT+1 − x⋆||² ≤ (1 − 2ηµL/(µ + L))^T ||x1 − x⋆||². (18)
See [10], Theorem 2.1.15, for a proof. The rate in Theorem 2.3 is known as a linear rate of
convergence in optimization parlance.
(We say that f is ρ-Hessian Lipschitz.) Gradient descent halts naturally when ∇f (x⋆ ) = 0. When f
is nonconvex, this can happen not just when x⋆ is a minimum, but also if it is a maximum or a saddle
point. We need therefore to understand whether gradient descent can efficiently avoid maxima and
saddle points. Here, in particular, we focus on how to efficiently escape saddle points. Note that in
the nonconvex setting, we will no longer be looking for convergence guarantees to global minima,
but to local minima. The guarantees thus obtained are useful nonetheless. In many applications,
it is the case that f does not have any spurious local minima; principal components analysis is a
notable example.
First, we introduce some terminology:
• A first-order stationary point of f is any point x such that ∇f (x) = 0.
• A second-order stationary point of f is any point x such that ∇f(x) = 0 and ∇²f(x) ⪰ 0.
Definition 2.4 (Strict saddle point). x is a strict saddle point of f if it is a saddle point and the
smallest eigenvalue of the Hessian at x is strictly negative: λmin (∇2 f (x)) < 0.
In the following, we will discuss convergence guarantees for a variant of gradient descent that
avoids/escapes strict saddle points. That is, we will not discuss how to escape saddles that have no
escape directions (i.e., at which λmin (∇2 f (x)) = 0).
We will be talking about convergence rates in a different way than we have so far. Instead of
bounding the error in function value (or distance to the optimum) after T iterations, we will ask how
many iterations are necessary for the error to drop below some ε. Let us restate the convergence
guarantee of Theorem 2.2 in this manner:
Theorem 2.4. On an L-smooth convex function f , gradient descent with step size η = L−1 converges
to an ε-FOSP in (f (x1 ) − f ⋆ )L/ε2 iterations.
Previous results on optimizing nonconvex functions with gradient descent had established that
the algorithm asymptotically avoids saddle points [13, 14]. However, these results made no finite-
time guarantee of convergence. Other work then established that gradient descent can escape saddles
in a length of time that is polynomial in the dimension d [15]. In [11], we showed that this dimension
dependence can be improved. We developed a variant of gradient descent we call perturbed gradient
descent (PGD) that converges with high probability to ε-SOSPs at a rate that has a polylogarithmic
dependence on d. PGD takes an injection of noise when the norm of the gradient of f is small,
and otherwise behaves like vanilla gradient descent. The injected noise is sampled uniformly from
a d-dimensional ball of radius r. (For details, see the description of Algorithm 2 in [11].) We prove
the following theorem guaranteeing convergence of PGD to ε-SOSPs.
Theorem 2.5 (PGD escapes strict saddle points). Consider an L-smooth and ρ-Hessian Lipschitz
function f : Rd → R. There exists an absolute constant cmax such that for any δ > 0, ε < L2 /ρ,
∆f ≥ f (x1 ) − f ⋆ , and c ≤ cmax , PGD with step size η = L−1 will output an ε-SOSP with probability
1 − δ, and will terminate in a number of iterations of order
((f(x1) − f⋆)L / ε²) log⁴( dL∆f / (ε²δ) ). (20)
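The following is a rough Python sketch of the idea behind PGD: take vanilla gradient steps, and inject noise drawn uniformly from a ball of radius r whenever the gradient is small and no perturbation has been added recently. The thresholds and bookkeeping here are placeholder choices; the precise schedule is specified in Algorithm 2 of [11]:

import numpy as np

def perturbed_gradient_descent(grad, x1, eta, T, g_thresh, r, t_noise, rng=None):
    """Gradient descent that adds a uniform-ball perturbation when ||grad f|| is small.
    g_thresh, r, and t_noise (minimum gap between perturbations) are illustrative knobs."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x1, dtype=float)
    d = x.size
    last_noise = -t_noise
    for k in range(T):
        g = grad(x)
        if np.linalg.norm(g) <= g_thresh and k - last_noise >= t_noise:
            # perturbation step: sample uniformly from the d-dimensional ball of radius r
            u = rng.standard_normal(d)
            u = u / np.linalg.norm(u) * r * rng.uniform() ** (1.0 / d)
            x = x + u
            last_noise = k
        else:
            x = x - eta * g   # vanilla gradient step
    return x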
rate achieved by gradient descent, discussed in Theorem 2.2. This optimal rate is achieved by the
family of second-order or so-called accelerated algorithms. A representative member of this family
is Nesterov’s method (Algorithm 3).
λk = (k − 1)/(k + 2)
yk+1 = xk − η∇f(xk)
xk+1 = (1 + λk) yk+1 − λk yk
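For concreteness, a minimal Python sketch of Algorithm 3, with the step size η = 1/L and a quadratic test function chosen purely for illustration:

import numpy as np

def nesterov(grad, x1, eta, T):
    """Accelerated gradient descent: a gradient step followed by a momentum
    (extrapolation) step with weight lambda_k = (k - 1)/(k + 2)."""
    x = np.asarray(x1, dtype=float)
    y_prev = x.copy()
    for k in range(1, T + 1):
        lam = (k - 1) / (k + 2)
        y = x - eta * grad(x)          # gradient step
        x = y + lam * (y - y_prev)     # momentum / extrapolation step
        y_prev = y
    return y_prev

# Example on an illustrative smooth convex quadratic:
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad = lambda x: A @ x
x_out = nesterov(grad, np.array([5.0, -3.0]), eta=1.0 / np.linalg.eigvalsh(A).max(), T=200)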
We use dots to denote derivatives with respect to time. Nesterov’s method maps to the following
second-order differential equation, first written down in [18]:
ẍt + (3/t) ẋt + ∇f(xt) = 0. (22)
Note that xt is shorthand for x(t), and is not to be confused with the discrete-time iterate xk . The
authors of [18] also showed that the Euler discretization of (22) recovers Algorithm 3. This result
provided some insight, and also gave rise to new questions and problems. We will briefly describe
one such question and its resolution before returning to the main theme of this section.
More than one discrete-time algorithm collapses to (22) in the continuous-time limit. In partic-
ular, the heavy ball method, a second-order algorithm due to Polyak [19], also maps to (22). The
trouble with this is that the heavy ball method does not achieve the optimal 1/T 2 rate of Nesterov’s
method, indicating that (22) can fail to distinguish between two algorithms that achieve different
rates in discrete time. This problem can be addressed by taking high-resolution continuous-time
limits. High-resolution ordinary differential equations appear in fluid dynamics and are useful to
describe systems with multiple timescales. See [20] for details on how to use high-resolution limits
to derive distinct continuous-time analogs of the heavy ball method and Nesterov’s method. We
will revisit high-resolution differential equations in another context at the end of the third lecture,
in Section 3.2.4.
Let us return to the primary point of discussion. We have mentioned that a continuous-time
limit of Nesterov’s method was studied in [18] and that taking the continuous-time point of view was
productive of insight into the behavior of the algorithm. Our work, which we will now describe, was
motivated by the question of whether there exists a generative mechanism for accelerated algorithms.
The work described in the following is joint work with Ashia Wilson, Andre Wibisono, and Michael
Betancourt [21, 22].
Let us equip ourselves with a definition. For any choice of a convex distance-generating function h, we can define the Bregman divergence Dh between two points as follows:
Dh(x, y) = h(x) − h(y) − ⟨∇h(y), x − y⟩. (23)
Note that if h(x) = (1/2)||x||², the Bregman divergence is simply the squared Euclidean distance between x and y (up to the factor of 1/2).
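As another standard instance (not used elsewhere in these notes): taking h to be the negative entropy recovers the Kullback-Leibler divergence. With h(x) = Σ_i xi log xi, we have ∇h(y)_i = log yi + 1, and the definition above gives
Dh(x, y) = Σ_i xi log(xi/yi) − Σ_i xi + Σ_i yi,
which is the usual KL divergence Σ_i xi log(xi/yi) when x and y are probability vectors.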
In [21], we introduce the Bregman Lagrangian:
L(x, ẋ, t) = e^{αt + γt} ( Dh(x + e^{−αt} ẋ, x) − e^{βt} f(x) ), (24)
where αt, γt, βt are scaling functions, and f is the continuously differentiable convex function we are interested in optimizing. Note that we use the shorthand αt for α(t), and similarly for βt and γt. By making specific choices of the scaling functions, we will see later that we can generate a family of algorithms. The first term in (24) is a generalized kinetic energy term (note that if we take h(x) = x²/2 and ignore the scaling factors, this term reduces to ẋ²/2, the familiar kinetic energy from physics). The second term can be interpreted as a potential energy.
Given a Lagrangian, we can set up the following variational problem over paths:
min_x ∫ dt L(x, ẋ, t). (25)
The solutions to this optimization problem are precisely the solutions to the Euler-Lagrange equation
d/dt ( ∂L(x, ẋ, t)/∂ẋ ) = ∂L(x, ẋ, t)/∂x. (26)
Under what we call the ideal scaling β̇t ≤ e^{αt}, γ̇t = e^{αt}, the Euler-Lagrange equation of (24) is
ẍt + (e^{αt} − α̇t) ẋt + e^{2αt + βt} [∇²h(xt + e^{−αt} ẋt)]⁻¹ ∇f(xt) = 0. (27)
The proof of this result is relatively simple. It involves writing down a Lyapunov function E for
the dynamics and showing that its time derivative is nonpositive.
Proof.
f is convex, and so Df (x⋆ , xt ) ≥ 0. The first term in Ėt is therefore nonpositive. Under the ideal
scaling, β̇t − eαt ≤ 0, and so the second term in Ėt is also nonpositive. Thus Ėt ≤ 0.
With suitable choices of the scaling functions, we can now generate a whole family of continuous-
time optimization algorithms from (27) with an accompanying convergence rate guarantee (in con-
tinuous time). For example, in the Euclidean case, where h is quadratic in x, if we make the choices
αt = log p − log t, βt = p log t + log C, and γt = p log t where C > 0, (27) reduces to (22), the
Euler discretization of which is Nesterov’s method. We can also recover mirror descent, the cubic-
regularized Newton method, and other known discrete-time algorithms from (27), which serves as a
generator of accelerated algorithms. The interested reader may refer to [21] for details.
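As a quick check of the reduction just claimed, take p = 2 and C = 1/4 (a concrete choice consistent with the constraints above), with h quadratic so that ∇²h is the identity. Then e^{αt} = 2/t and α̇t = −1/t, so the damping coefficient in (27) is e^{αt} − α̇t = 3/t; and e^{2αt + βt} = (4/t²)(t²/4) = 1, so the coefficient multiplying ∇f(xt) is 1. Equation (27) therefore reads ẍt + (3/t)ẋt + ∇f(xt) = 0, which is exactly (22). The ideal scaling also holds, since β̇t = 2/t = e^{αt} and γ̇t = 2/t = e^{αt}.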
Let us now address the problem of mapping back into discrete time.
We begin by noting that the Bregman Lagrangian (24) is a covariant operator. The dynamics
generated by covariant operators has the property that given any two endpoints, the path between
them is fixed. However, we are free to choose βt , which controls the rate at which this fixed path is
traversed. In continuous time, then, we can obtain arbitrarily fast convergence by manipulating βt ,
and the algorithmic notion of acceleration that we began by wishing to understand disappears. It
returns, however, when one considers how to discretize (27).
It turns out that it is not possible to obtain arbitrarily fast stable discretizations of (27). To
understand this, we note that in classical mechanics, the dynamics of a system often has conserved
quantities associated with it, such as energy or momentum. Not all discretization schemes inherit
or preserve these conservation laws. Symplectic integrators are a class of discretization schemes that
do preserve what is known as phase space volume, and therefore conserve energy, etc. The method
of symplectic integration traces its roots back to Jacobi, Hamilton, and Poincaré, and grew out of
the need to discretize the dynamics of physical systems while preserving the relevant conservation
laws. Importantly for us, symplectic integrators are provably stable. Their stability enables them
to take larger step sizes than other kinds of integrators. This is the origin of acceleration in discrete time.
Symplectic integrators are traditionally defined for time-independent Hamiltonians. (We have
thus far been talking about continuous-time algorithms in a Lagrangian framework, but a simple
Legendre transform allows us to equivalently frame the discussion in a Hamiltonian framework.)
However, the Bregman Lagrangian (24) is time-dependent. Indeed, as we saw, its associated Lya-
punov function, which can be interpreted as an energy, decreases with time. Intuitively, this is
necessary from an optimization standpoint in order to converge to a point. The question that
arises is, how can we use symplectic integrators for dissipative systems? In joint work [23] with
Guilherme França and René Vidal, we solved this problem by lifting the relevant Hamiltonian to
higher dimensions so that it acquires time independence. We discretize the time-independent higher-
dimensional Hamiltonian, and gauge-fix to obtain rate-preserving algorithms corresponding to the
original Hamiltonian. We call the integrators so-obtained presymplectic integrators. When we use
the term ‘rate-preserving’ to describe an integrator, we mean that it preserves the rate of conver-
gence of the continuous-time dynamics up to an error that can be controlled. One of the central
results in [23] is that presymplectic integrators are rate-preserving. The technical tool used to prove this type of result is called backward error analysis [24, 25].
In conclusion, we now have a systematic pipeline for generating accelerated algorithms with
convergence guarantees. The practitioner may (1) ask for a certain rate of convergence for a given
f , (2) write down the corresponding Bregman Hamiltonian, (3) apply a symplectic integrator, and
(4) code up the resulting algorithm.
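To give a flavor of step (3), here is a Python sketch of the classical leapfrog (Störmer-Verlet) scheme for a separable, time-independent Hamiltonian H(x, v) = ||v||²/2 + f(x). This is only meant to show what a symplectic update looks like; the presymplectic construction of [23] for the time-dependent Bregman Hamiltonian is more involved:

import numpy as np

def leapfrog(grad_f, x0, v0, step, n_steps):
    """Stormer-Verlet integration of x' = v, v' = -grad f(x).
    Each update is symplectic: it preserves phase-space volume exactly."""
    x, v = np.asarray(x0, float), np.asarray(v0, float)
    for _ in range(n_steps):
        v = v - 0.5 * step * grad_f(x)   # half kick
        x = x + step * v                 # drift
        v = v - 0.5 * step * grad_f(x)   # half kick
    return x, v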
We mentioned at the beginning of this section that taking a continuous-time perspective can
enable proofs of results we do not yet know how to prove in discrete time. We give an example of
such a result now. Let us briefly return to the problem of escaping saddles in nonconvex optimization.
We discussed previously that perturbed gradient descent (PGD) can escape saddles and converge to
ε-SOSPs in a number of iterations that goes as 1/ε2 . This was the content of Theorem 2.5. In [26],
we showed that a perturbed version of accelerated gradient descent (PAGD) can do the same in a
number of iterations that goes as 1/ε7/4 , i.e., it converges faster than PGD. In order to prove this
result, we worked in continuous time, studying the Hamiltonian that generates PAGD. The result
implies that acceleration can be helpful even in nonconvex optimization, and not just in the convex
case, where all the prior discussion of this section has been set.
An open problem. The continuous-time framework we have described here only applies to deter-
ministic algorithms. An open challenge is to construct a variational framework for diffusions, which
may enable similar illuminating and unifying perspectives on accelerated stochastic algorithms.
We end this section by briefly reviewing a few results for stochastic algorithms in continuous
time.
where k is the iteration number, δ is the time increment, and ξ is a d-dimensional standard
normal random variable. A natural question to ask is, how close are we to p⋆ after n iterations
of the MCMC algorithm? Assuming U(x) is L-smooth and µ-strongly convex, guarantees of the following type have been proven: if n ≥ O(d/ε²), then d(p(n), p⋆) ≤ ε, where d is the relevant distance measure, and p(n) is the probability density generated by the MCMC algorithm at the nth iteration. Results exist in the cases where d is the total variation distance [27], the 2-Wasserstein distance [28], and the Kullback-Leibler divergence [29]. (A minimal code sketch of this update is given just after this list.)
2. Now consider an underdamped Langevin diffusion
dxt = vt dt,
dvt = −γvt dt − λ∇U(xt) dt + √(2γλ) dBt, (31)
where γ is the coefficient of friction. The stationary distribution p⋆ is proportional to exp(−U(x) − ||v||²/(2λ)). In [30], we are able to make a similar guarantee on the MCMC algorithm corresponding to (31) as for (30). That is, assuming U(x) is L-smooth and µ-strongly convex, if n ≥ O(√d/ε), then d(p(n), p⋆) ≤ ε, where d is the 2-Wasserstein distance.
This is a faster rate than in the overdamped case, where we needed the square of the number
of iterations required here to make the same guarantee on the distance between p(n) and p⋆ .
The intuition behind this faster rate lies in the fact that underdamped diffusions are smoother
than overdamped ones, and therefore easier to discretize.
The proof technique in [30] involves another coupling time argument. We saw one of these in
the discussion on how PGD escapes saddle points. In this case we study the coupling time of
two Brownian motions that are reflections of each other.
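For reference, a minimal Python sketch of the overdamped Langevin MCMC update from item 1 above (the standard unadjusted Langevin step; the potential and step size below are illustrative choices):

import numpy as np

def ula(grad_U, x0, delta, n_iters, rng=None):
    """Unadjusted Langevin algorithm: x_{k+1} = x_k - delta * grad U(x_k) + sqrt(2*delta) * xi_k,
    with xi_k a standard normal vector. Targets p*(x) proportional to exp(-U(x))."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - delta * grad_U(x) + np.sqrt(2.0 * delta) * rng.standard_normal(x.size)
    return x

# Example: sample (approximately) from a standard Gaussian, U(x) = ||x||^2 / 2.
sample = ula(grad_U=lambda x: x, x0=np.zeros(3), delta=0.01, n_iters=5000)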
Now that we have some context in the results just stated, we can make precise the type of
open problem mentioned earlier. Suppose we wish to diffuse (under some stochastic differential
equation) to a target distribution p⋆ . What equation gets us to p⋆ the fastest, and what is the
correct mathematical formulation for this problem?
2.4 Variational Inequalities: From Minima to Nash Equilibria and Fixed
Points
In this last section of the lecture, we will introduce the setting for the third lecture. From now on, we
will be concerned with how to find (Nash) equilibria of certain types of functions f (we will make this
precise). The discussion will break with all the technical material we have seen so far in an important
way: Nash equilibria are saddle points of f . Therefore, instead of studying algorithms that avoid
saddle points, we will talk about algorithms that converge to saddle points. In the next lecture, we
will study two such algorithms, the proximal point method, and the extragradient method.
In the following, we introduce (1) zero-sum games, (2) Nash equilibria, and (3) variational in-
equalities. We then give some context for why we introduce these ideas, and for why we are interested
in computing Nash equilibria.
The game is called “zero-sum” because the cost functions of players x1 and x2 , respectively f and
−f, add to zero. A simple instance of (32) is the zero-sum bilinear game f(x1, x2) = x1⊤ A x2, where A is a positive semi-definite symmetric matrix. (32) is a member of the set of convex-concave games, called so since f is convex in x1 and concave in x2. Convex-concave games are in turn a subset of the family of minimax games, which are a subset of the family of monotone games. This last piece of terminology and its significance will be explained in Section 3.1.
The solutions of (32) are saddle points (and not minima) of f ; they are the set of points (x⋆1 , x⋆2 )
such that
f (x⋆1 , x2 ) ≤ f (x⋆1 , x⋆2 ) ≤ f (x1 , x⋆2 ) . (33)
In the parlance of game theory, some (but not, in general, all) of the solutions of the optimization
problem that defines the dynamics of a game are called equilibria.
We have already encountered one example of a (Stackelberg) game in Section 1.4.1. Stackelberg
games are sequential (players take turns to play). (32) is not a sequential game. Its equilibria
are called Nash equilibria. We will define a Nash equilibrium for an N -player game formally later.
Intuitively, a Nash equilibrium is a saddle point of f that neither player is incentivized to move
away from. When we introduce N -player games, this intuition will carry over: Nash equilibria will
correspond to points where each player is as happy as possible relative to what all the other players
are doing.
Definition 2.5 (Variational Inequality). Given a subset X of Rd and an operator (or vector field)
F : X → Rd , a variational inequality is an optimization problem of the following form. Find x⋆ such
that
⟨F (x⋆ ), x − x⋆ ⟩ ≥ 0 ∀x ∈ X . (34)
Note that x⋆ need not always be unique, and that X may or may not be convex. (There are
certain requirements on the topology of X but we will omit discussion of these here.) Variational
inequalities unify the formulations of unconstrained and constrained optimization problems, as well as optimization problems involving equilibria, and many others. If we replace the inequality with equality in (34), and F is a gradient field, then we recover classical optimization; furthermore, if X is all of Rd the optimization problem is unconstrained, and if X is a proper subset of Rd it is constrained.
In the language of variational inequalities, the two-player zero-sum game is represented by the vector field
F(x) = ( ∇x1 f(x1, x2), −∇x2 f(x1, x2) ), (35)
with X = X1 ⊗ X2. Note that F is not the gradient of an underlying function.
Representing a game as a VI allows us to express the problem of computing the equilibria of the
game as the problem of computing the fixed points of the relevant operator F . Although on the
face of it this mathematical mapping may not appear to confer an advantage, in fact there exists
a vast machinery to compute the fixed points of nonlinear operators that we can take advantage of
through it. We will come back to this point in the next section.
if and only if each Ki is a closed convex subset of Rdi and each gi is a convex function of xi .
Exercise to the reader: check that if N = 2, g1 = f , and g2 = −f , then (36) reduces to a
two-player zero sum game.
Nash equilibria are not the only equilibria that can be represented as solutions of variational
inequalities. Some other interesting examples, which we will not address here, are Markov perfect
equilibria (from probability theory), Walrasian equilibria (from economics), and the equilibria of
frictional contact problems (from mechanics).
Armed with the definitions of games and Nash equilibria, we can briefly mention a gradient-based method developed in collaboration with Eric Mazumdar and S. Sankar Sastry for finding local Nash equilibria in two-player zero-sum games [35]. This example also pulls together some of the ideas
we saw in the last section, namely of working in continuous time and paying attention to symplectic
geometry. Consider the problem (32). We have mentioned that the solutions to this optimization
problem are saddle points of f . However, not all saddle points of f are Nash equilibria. The latter
are specifically saddle points that are axis-aligned ; we refer the reader to the paper [35] for further
details on what this means. Prior to our work, it was known that gradient-based algorithms for
finding Nash equilibria could (in the worst case) diverge, or get stuck in limit cycles (i.e., simply
follow the flow of the vector field). We showed that in addition, these algorithms could also converge
to non-Nash fixed points. The distinction between non-Nash and Nash fixed points has to do with
the symplectic component of the vector field. When the symplectic component of the vector field
near a fixed point is removed, only Nash fixed points persist. We exploited this fact and developed an
algorithm we called “local symplectic surgery” that could effect this removal, and therefore provably
converge only to Nash equilibria.
3 Computing Equilibria
Consider the two kinds of vector fields pictured in Fig. 2. In Fig. 2(a), the flow is towards a point;
our discussion of convex optimization in Lectures 1 and 2 was in this context. In Fig. 2(b) the flow
moves around a point. Both the flows pictured in Fig. 2 are generated by monotone operators. Given
a monotone game, it will not a priori be apparent which of the two types of flows we are dealing
with. To reliably compute equilibria, therefore, we will need algorithms that handle both cases. We
will spend the rest of these notes developing and studying such algorithms.
Figure 2: The notion of a fixed point is broader than that of a minimum. Here we have examples of two
types of vector fields F (x). In (a), the negative flow (indicated by arrow heads) of the field is toward the
fixed point; this fixed point could be a minimum. In (b), the negative flow of the vector field is around the
fixed point; this fixed point is not a minimum. In this lecture, we study algorithms that compute the fixed
points of vector fields associated with monotone operators, which can be either of the flavor (a) or (b).
To begin the discussion, we introduce the notion of monotonicity, which generalizes the notion
of convexity, and a strengthening of this notion called strong monotonicity.
Definition 3.1 (Monotone Operator). An operator F is said to be monotone on a set X if
⟨F (x) − F (y), x − y⟩ ≥ 0 ∀x, y ∈ X . (37)
The property (37) is known as monotonicity.
We will assume that the constraint set X is convex. When this is the case, we are guaranteed
that a fixed point of F exists.
We remind the reader that our notation for a fixed point of F is x⋆ . Taking x = x⋆ and y = x
and setting F (x⋆ ) = 0 in (37), we see that monotonicity of F implies that the angle θ between the
vectors −F (x) and x⋆ − x for an arbitrary x can be no greater than π/2. This is illustrated in Fig. 3.
Definition 3.2 (Strong Monotonicity). An operator F is said to be strongly monotone on a set X
if
⟨F (x) − F (y), x − y⟩ ≥ µ||x − y||2 ∀x, y ∈ X , (38)
for some µ > 0.
Strong monotonicity is a strengthening of monotonicity. It guarantees that the angle between
−F (x) and x⋆ − x is strictly less than π/2. Furthermore, given any arbitrary point x, strong
monotonicity of F ensures that the fixed point lies on or above the quadratic function µ||x − y||2
centered at x. In contrast, monotonicity merely implies that x⋆ is on the same side of the hyperplane
defined by −F at x. This is illustrated in Fig. 4.
Figure 3: Monotonicity implies that the angle θ between the operator −F (x) and the vector pointing from
x to x⋆ is at most π/2 for all x. To see this in (37), take x = x⋆ , y = x and without loss of generality let
F (x⋆ ) = 0.
Figure 4: While monotonicity of F guarantees that x⋆ is on the same side of the hyperplane defined by
−F at x as the quadratic µ||x − y||2 , strong monotonicity additionally guarantees that x⋆ is on or above
the quadratic. We use the label x⋆M (x⋆SM ) to indicate the fixed point of F when F is monotone (strongly
monotone). In general, x⋆M may be either aligned with the axis y or anywhere above it, while x⋆SM is confined
to be on or above µ||x − y||2 .
Figure 5: Projection onto a convex set X is a contractive operation: the distance between the projections of two points is at most the distance between the points themselves.
fact to drop the projection operator in (40) at the expense of an inequality. Doing so, and expanding
the resulting squared norm, we have
||xk+1 − x⋆||² ≤ ||xk − ηF(xk) − x⋆ + ηF(x⋆)||²
= ||xk − x⋆||² − 2η⟨F(xk) − F(x⋆), xk − x⋆⟩ + η²||F(xk) − F(x⋆)||². (41)
Applying the monotonicity property (37) to the second term in (41), we arrive at
||xk+1 − x⋆||² ≤ ||xk − x⋆||² + η²||F(xk) − F(x⋆)||². (42)
Since both the terms on the right-hand side of (42) are positive, we cannot conclude that ||xk+1 −x⋆ ||
is shrinking with iteration number. Thus, the naive gradient-descent-like iteration (39) may diverge
on generally monotone problems.
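A small numerical experiment makes this concrete. For the monotone (but not strongly monotone) operator F(x1, x2) = (x2, −x1), arising from the bilinear game f(x1, x2) = x1x2, the naive iteration (39) inflates the distance to the fixed point at the origin on every step; this example is ours, chosen for illustration:

import numpy as np

F = lambda z: np.array([z[1], -z[0]])   # monotone: <F(z) - F(w), z - w> = 0 for all z, w
z, eta = np.array([1.0, 0.0]), 0.1
for k in range(100):
    z = z - eta * F(z)                  # naive gradient-descent-like iteration (39)
print(np.linalg.norm(z))                # grows like (1 + eta^2)^(k/2); here about 1.64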
As we have already mentioned, we can fix this by assuming that F is not just monotone but
strongly monotone. This allows us to retain the second term in (41). It is important to retain this
term, since it is the only one with a negative sign in (41), and is therefore needed to counteract
the expansive effect of the third term. We will also assume that F is L-Lipschitz. With these two
assumptions, we can easily show per-iteration contraction of the distance between the iterate and
the fixed point. Starting once again from (41), we have
||xk+1 − x⋆||² ≤ ||xk − x⋆||² − 2η⟨F(xk) − F(x⋆), xk − x⋆⟩ + η²||F(xk) − F(x⋆)||²
≤ ||xk − x⋆||² − 2ηµ||xk − x⋆||² + η²L²||xk − x⋆||²
= (1 − 2ηµ + η²L²) ||xk − x⋆||²
= (1 − µ²/L²) ||xk − x⋆||²   (taking η = µ/L²). (43)
With (43), we have proved the following theorem.
Theorem 3.1. For a µ-strongly monotone and L-Lipschitz operator F and a convex constraint set
X , the iterates of the gradient-descent-like algorithm (39) with step size µ/L2 converge to the fixed
point x⋆ of F with the rate 1 − (µ/L)2 .
The rate of convergence in Theorem 3.1 is slower than the (roughly) 1 − µ/L rate achieved by
gradient descent on strongly convex functions in Theorem 2.3. In fact, it is possible to extract this
faster rate from (39) by making one further assumption on F , that of co-coercivity. We give its
definition next.
Definition 3.3 (Co-coercivity). An operator F is said to be co-coercive on a set X if there exists
an α > 0 such that
⟨F(x) − F(y), x − y⟩ ≥ (1/α)||F(x) − F(y)||² ∀x, y ∈ X. (44)
In fact, a µ-strongly monotone and L-Lipschitz operator is co-coercive with α = L²/µ, but this is in general a suboptimal value of α. It is possible to achieve a faster rate by identifying the optimal value of α.
In particular, if F is µ-strongly monotone and α-co-coercive, it is possible to achieve a convergence
rate of 1 − µ/α. Lastly, we note that if F is the gradient field ∇f of a strongly convex and L-
smooth function f , then it is L-co-coercive. This fact and the convergence result stated just before
it together recover the fast convergence rate of gradient descent from Theorem 2.3 in the variational
inequality framework.
where η > 0 is the step size, I is the identity operator and F is a monotone operator. A simple
calculation shows that the operator (I + ηF ) is expansive:
||(I + ηF)(x) − (I + ηF)(y)||² = ||x − y||² + 2η⟨F(x) − F(y), x − y⟩ + η²||F(x) − F(y)||²
≥ ||x − y||² + η²||F(x) − F(y)||². (46)
Applying (I + ηF ) to two points x and y thus strictly increases the squared distance between them.
This implies that (45) always diverges. This can also be seen simply by inspection in Fig. 2: following
the flow F (as opposed to −F , which is what gradient descent does) is a bad strategy both when F
is a gradient field and when it isn’t.
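The divergence of (45) is easy to observe numerically. The following small sketch uses the hypothetical skew-symmetric field F(z) = (z2, −z1), which is monotone (its symmetric part is zero); following it forward moves every iterate away from the fixed point at the origin, with the squared norm growing by exactly a factor of 1 + η² per step:
\begin{verbatim}
import numpy as np

# A skew-symmetric, hence (merely) monotone, field: F(z) = (z2, -z1).
F = lambda z: np.array([z[1], -z[0]])

eta = 0.1
z = np.array([1.0, 0.0])
for k in range(100):
    z = z + eta * F(z)                  # the forward iteration (45): z_{k+1} = (I + eta F)(z_k)
print("||z|| after 100 steps:", np.linalg.norm(z))   # equals (1 + eta^2)^{50}, roughly 1.64
\end{verbatim}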
We can use the expansive property of (I + ηF ) to our advantage. In particular, notice that if we
run (45) backward in time,
\[
x_k = (I + \eta F)^{-1}(x_{k+1}), \tag{47}
\]
then we might expect to obtain a procedure that shrinks the distance between any two points
instead of increasing it. The proximal point method, which we now introduce, does precisely this.
The proximal point update is given by rearranging (47) as follows:
\[
x_{k+1} = x_k - \eta F(x_{k+1}). \tag{48}
\]
In the form that we present it here, it was first studied by Rockafellar in 1976 [37].
Notice that the proximal point update given in (48) is an implicit equation, i.e., it contains xk+1
on both sides, and must be solved at each iteration. Since there are multiple ways to solve the
implicit equation, the proximal point method as stated defines a family of algorithms.
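As one concrete member of this family, suppose (purely for illustration) that F is affine, F(x) = Ax + b; then the implicit equation (48) reads (I + ηA)xk+1 = xk − ηb and can be solved exactly at every iteration with a single linear solve:
\begin{verbatim}
import numpy as np

# Proximal point step for a hypothetical affine operator F(x) = A x + b:
# x_{k+1} = x_k - eta * F(x_{k+1})  <=>  (I + eta*A) x_{k+1} = x_k - eta*b.
def proximal_point_step(x, A, b, eta):
    n = len(x)
    return np.linalg.solve(np.eye(n) + eta * A, x - eta * b)

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])             # a bilinear saddle field: monotone, not strongly monotone
b = np.zeros(2)

x = np.array([1.0, 1.0])
eta = 0.5
for k in range(200):
    x = proximal_point_step(x, A, b, eta)
print("distance to the fixed point at the origin:", np.linalg.norm(x))
\end{verbatim}
Even though this F is only monotone, the iterates spiral in to the fixed point, consistent with the intuition above.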
Figure 6: Following the flow of −F at the point xk leads us away from x⋆, to a point x^{GD}_{k+1} (the superscript GD stands for gradient descent). However, taking a step from xk in the direction of the flow −F evaluated at x^{GD}_{k+1} brings us to x^{EG}_{k+1}, which is closer to x⋆ than xk. This is the general logic behind the proximal point method (48), and is precisely what the extragradient algorithm (Algorithm 5) does; hence the superscript in x^{EG}_{k+1}.
Before we enter the discussion of ways to solve for xk+1 , let us study the effect on ||xk+1 −x⋆ ||2 of
applying a single step of the proximal point method to confirm the intuition that led us to introduce
it:
\begin{align*}
\|x_{k+1} - x^\star\|^2 &= \|x_k - \eta F(x_{k+1}) - x^\star\|^2 \\
&= \|x_k - x^\star\|^2 - 2\eta \langle F(x_{k+1}),\, x_k - x^\star \rangle + \eta^2 \|F(x_{k+1})\|^2 \\
&\overset{(a)}{=} \|x_k - x^\star\|^2 - 2\eta \langle F(x_{k+1}),\, x_{k+1} - x^\star \rangle - 2\eta \langle F(x_{k+1}),\, x_k - x_{k+1} \rangle + \eta^2 \|F(x_{k+1})\|^2 \\
&\le \|x_k - x^\star\|^2 - 2\eta^2 \langle F(x_{k+1}),\, F(x_{k+1}) \rangle + \eta^2 \|F(x_{k+1})\|^2 \\
&= \|x_k - x^\star\|^2 - \eta^2 \|F(x_{k+1})\|^2. \tag{49}
\end{align*}
To arrive at (a), we added and subtracted 2η⟨F(xk+1), xk+1⟩ from the line before it. The inner product in the second term of (a) is nonnegative by monotonicity of F (recall that the fixed point of (48) satisfies F(x⋆) = 0), so the term itself is nonpositive; dropping it produced the inequality in the next line. We used (48) to replace xk − xk+1 with ηF(xk+1) in the third term of (a).
We have proved the following descent lemma:
Lemma 3.2 (Descent lemma for the proximal point method on monotone operators). Consider a
monotone operator F . Under the dynamics of Algorithm 4 with step size η,
\[
\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2 - \eta^2 \|F(x_{k+1})\|^2. \tag{50}
\]
Lemma 3.2 guarantees that the distance to the fixed point shrinks on every iteration of the
proximal point method, and therefore implies convergence.
In the interests of completeness, we will now give some more vocabulary and state an important
theorem concerning monotone operators.
Let us give the inverse operator (I + ηF)⁻¹ a name: we call it the backward operator B. B is also known as the resolvent of F. We can also define the operator 2B − I. See Fig. 7 for some intuition for the distinction between the results of applying B versus 2B − I to xk.
Now we can state a theorem that connects the properties of F , B and 2B − I.
Theorem 3.3. The following are equivalent.
• F is a monotone operator.
• There exists a firmly non-expansive operator B such that B is the resolvent of F.
• The operator 2B − I, where B is the resolvent of F, is non-expansive.
Figure 7: Applying the backward operator B to xk brings us to the point xk+1; this is the proximal point update. Applying the operator 2B − I to xk results in a reflection about xk+1.
An operator B is said to be firmly non-expansive if
\[
\|B(x) - B(y)\|^2 + \|(I - B)(x) - (I - B)(y)\|^2 \le \|x - y\|^2 \quad \forall x, y. \tag{51}
\]
In words, firm non-expansivity guarantees that applying the operator to two points does not increase the distance between them; the amount by which the squared distance can decrease is given by the second term on the left-hand side of (51).
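As a quick numerical check of this property, the following sketch evaluates (51) for the resolvent of a hypothetical affine monotone operator F(x) = Ax (the symmetric part of the A below is the identity, so ⟨Au, u⟩ > 0 and the inequality holds strictly):
\begin{verbatim}
import numpy as np

# Check firm non-expansivity (51) for the resolvent B = (I + eta*A)^{-1} of F(x) = A x.
A = np.array([[1.0, 2.0],
              [-2.0, 1.0]])
eta = 0.7
B = np.linalg.inv(np.eye(2) + eta * A)   # the backward operator / resolvent

rng = np.random.default_rng(1)
for _ in range(3):
    x, y = rng.normal(size=2), rng.normal(size=2)
    Bx, By = B @ x, B @ y
    lhs = np.linalg.norm(Bx - By)**2 + np.linalg.norm((x - Bx) - (y - By))**2
    print(f"{lhs:.4f} <= {np.linalg.norm(x - y)**2:.4f}")
\end{verbatim}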
3.2.3 The Extragradient Algorithm

The extragradient algorithm (Algorithm 5), due to Korpelevich [38], replaces the implicit proximal point update (48) with two explicit steps:
\begin{align*}
\tilde{x}_k &= x_k - \eta F(x_k), \\
x_{k+1} &= x_k - \eta F(\tilde{x}_k).
\end{align*}
The extragradient update can explicitly be shown to be an approximation to the proximal point
update [39], but we will not do that here. Nonetheless, we will leverage the idea to prove convergence
of the extragradient method by bounding the discrepancy between the iterates it produces and the
iterates that would be produced if we could exactly solve the proximal point update.
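Here is a minimal sketch of the extragradient iteration, run on the same kind of strongly monotone affine test operator used earlier (an illustrative assumption, not part of the analysis), with the step size that appears in the proof below:
\begin{verbatim}
import numpy as np

def extragradient(F, x0, eta, num_iters):
    x = x0.copy()
    for _ in range(num_iters):
        x_tilde = x - eta * F(x)        # extrapolation step
        x = x - eta * F(x_tilde)        # update using F evaluated at the extrapolated point
    return x

# Strongly monotone plus rotational affine field: F(x) = A x, fixed point x_star = 0.
mu, gamma = 0.5, 2.0
A = np.array([[mu, gamma],
              [-gamma, mu]])
L = np.sqrt(mu**2 + gamma**2)
eta = 1.0 / (2.0 * (L + mu))            # the step size used in the proof below

x = extragradient(lambda z: A @ z, np.array([1.0, 1.0]), eta, 200)
print("distance to x_star after 200 steps:", np.linalg.norm(x))
\end{verbatim}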
We give the proof of convergence of the extragradient method in the case where F is strongly
monotone (so we can easily get a rate) and L-Lipschitz:
\begin{align*}
\|x_{k+1} - x^\star\|^2 &= \|x_k - x^\star\|^2 - 2\eta \langle F(\tilde{x}_k),\, x_k - x^\star \rangle + \eta^2 \|F(\tilde{x}_k)\|^2 \\
&\overset{(a)}{=} \|x_k - x^\star\|^2 - 2\eta \langle F(\tilde{x}_k),\, \tilde{x}_k - x^\star \rangle - 2\eta \langle F(\tilde{x}_k),\, x_k - \tilde{x}_k \rangle + \eta^2 \|F(\tilde{x}_k)\|^2 \\
&\overset{(b)}{\le} \|x_k - x^\star\|^2 - 2\eta\mu \|\tilde{x}_k - x^\star\|^2 + \|x_k - \tilde{x}_k - \eta F(\tilde{x}_k)\|^2 - \|x_k - \tilde{x}_k\|^2 \\
&\overset{(c)}{\le} \|x_k - x^\star\|^2 - 2\eta\mu \|\tilde{x}_k - x^\star\|^2 + \left(\eta^2 L^2 - 1\right) \|x_k - \tilde{x}_k\|^2 \\
&\overset{(d)}{\le} \left(1 - \eta\mu\right) \|x_k - x^\star\|^2 + \left(\eta^2 L^2 - 1 + 2\eta\mu\right) \|x_k - \tilde{x}_k\|^2 \\
&\overset{(e)}{\le} \left(1 - \frac{\mu}{2(L + \mu)}\right) \|x_k - x^\star\|^2 \quad \text{for } \eta = \frac{1}{2(L + \mu)} \\
&\overset{(f)}{\le} \left(1 - \frac{\mu}{4L}\right) \|x_k - x^\star\|^2. \tag{52}
\end{align*}
In (a), we added and subtracted 2η⟨F(x̃k), x̃k⟩, our usual trick. In (b), we applied the strong monotonicity property to the second term, and completed the square with the last two terms of the previous line. To arrive at the third term in (c), we manipulated the third term in (b) as follows: first, we used the dynamics of the algorithm to replace xk − x̃k with ηF(xk); second, we applied the Lipschitz property to the quantity ||F(xk) − F(x̃k)||²; and third, we grouped the resulting term, η²L²||xk − x̃k||², with the last term of line (b). To arrive at (d), we applied the inequality −2ηµ||a||² ≤ −ηµ||a + b||² + 2ηµ||b||² (a consequence of ||a + b||² ≤ 2||a||² + 2||b||²), with a = x̃k − x⋆ and b = xk − x̃k. In (e), we took η = 1/[2(L + µ)]. This ensures that the prefactor of the error term ||xk − x̃k||² in (d) is negative, so that the term can be dropped. Note that with this assignment of the step size, line (d) is a descent lemma for the extragradient method. Lastly, to simplify the rate of convergence in (f), we used the fact that 2L ≥ L + µ, which follows from µ ≤ L; the latter is easily proved as follows:
\[
\mu \le \frac{\langle F(x) - F(y),\, x - y \rangle}{\|x - y\|^2} \le \frac{\|F(x) - F(y)\|}{\|x - y\|} \le L. \tag{53}
\]
In (53), the first inequality follows from strong monotonicity, the second from the Cauchy-Schwarz
inequality, and the third from the Lipschitz property of F .
We have proved the following theorem:
Theorem 3.4. For a µ-strongly monotone and L-Lipschitz operator F, the iterates of Algorithm 5 with step size η = 1/[2(µ + L)] converge to the fixed point of F at the linear rate 1 − µ/(4L). In particular,
\[
\|x_{k+1} - x^\star\|^2 \le \left(1 - \frac{\mu}{4L}\right)^{k} \|x_1 - x^\star\|^2. \tag{54}
\]
The extragradient algorithm is one practical instantiation of the proximal point method. There
are others. We will mention some of them in the next section. We end this section by noting that
one way to construct approximate proximal point algorithms is by expanding the resolvent of F in
powers of η, truncating the expansion after a fixed number of terms m, and then rearranging the
result to solve for xk+1 . The accuracy of the approximation to the proximal point update can be
controlled by increasing m.
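A small sketch of this idea, under the simplifying (and hypothetical) assumption that F(x) = Ax is linear, so that the resolvent is (I + ηA)⁻¹ and its expansion in powers of η is the Neumann series I − ηA + (ηA)² − ⋯ :
\begin{verbatim}
import numpy as np

# Approximate the resolvent step x_{k+1} = (I + eta*A)^{-1} x_k by keeping the first
# m terms of the Neumann series I - eta*A + (eta*A)^2 - ...
def truncated_resolvent_step(x, A, eta, m):
    out = np.zeros_like(x)
    term = x.copy()
    for _ in range(m):
        out += term
        term = -eta * (A @ term)
    return out

A = np.array([[0.5, 2.0],
              [-2.0, 0.5]])
eta = 0.2                                # eta * ||A|| < 1, so the series converges
x = np.array([1.0, 1.0])
exact = np.linalg.solve(np.eye(2) + eta * A, x)
for m in (1, 2, 3, 5, 8):
    approx = truncated_resolvent_step(x, A, eta, m)
    print(m, np.linalg.norm(approx - exact))   # the error decays geometrically in m
\end{verbatim}
Keeping m = 2 terms recovers the explicit forward step xk − ηF(xk), while m = 3 yields the map (I − ηA + η²A²)xk, which for this linear F coincides with an extragradient step.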
3.2.4 High-Resolution Continuous-Time Limits

We return to the two-player zero-sum game (32) to contextualize the discussion, which is based on work done in [40]. We remind the reader that the vector field F of the game (32) is given in (35). There are several algorithms that can be used to compute the fixed points of F. We consider four of them: (1) gradient-descent-ascent (GDA), (2) the extragradient algorithm (EG), (3) optimistic GDA (OGDA), and (4) the lookahead algorithm (LA) [41]. Writing the state vector of the two players x_1 and x_2 together as z = (x_1, x_2), the update rules for all four algorithms are given in (55)-(58).
In the lookahead algorithm, the quantity z̃k+ℓ is computed by applying ℓ iterations of some base
algorithm. Here, we will take the base algorithm to be GDA.
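Since the exact parameterizations in (55)-(58) are not reproduced here, the following sketch uses standard textbook forms of the four updates (an assumption on our part) and runs them on the simplest bilinear zero-sum game, f(x1, x2) = x1 x2, whose equilibrium is the origin:
\begin{verbatim}
import numpy as np

# Sketches of the four update rules on the joint state z = (x1, x2). The vector field of
# the bilinear game f(x1, x2) = x1 * x2 is F(z) = (x2, -x1).
F = lambda z: np.array([z[1], -z[0]])
eta, iters = 0.1, 500

def run_gda(z):
    for _ in range(iters):
        z = z - eta * F(z)                        # GDA
    return z

def run_eg(z):
    for _ in range(iters):
        z_tilde = z - eta * F(z)
        z = z - eta * F(z_tilde)                  # EG
    return z

def run_ogda(z):
    g_prev = F(z)
    for _ in range(iters):
        g = F(z)
        z = z - 2 * eta * g + eta * g_prev        # OGDA
        g_prev = g
    return z

def run_lookahead(z, ell=10, alpha=0.5):
    for _ in range(iters // ell):
        z_tilde = z.copy()
        for _ in range(ell):
            z_tilde = z_tilde - eta * F(z_tilde)  # ell steps of the GDA base algorithm
        z = z + alpha * (z_tilde - z)             # LA: average back toward the anchor
    return z

for name, run in [("GDA", run_gda), ("EG", run_eg), ("OGDA", run_ogda), ("LA", run_lookahead)]:
    print(name, np.linalg.norm(run(np.array([1.0, 1.0]))))   # GDA grows; the other three shrink
\end{verbatim}
On this game the GDA iterates spiral outward while the EG, OGDA, and LA iterates contract toward the equilibrium, which is the discrete-time behavior discussed next.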
Now we might ask the following questions: what are the relative merits of each of these algorithms, and how might we pick the most appropriate among them in the context of a specific problem? We have seen previously how taking a continuous-time perspective in optimization can yield insight into such questions, and so we take that perspective again here. An interesting technical roadblock arises, however: when we naively take the continuous-time limits of (55)-(58), they all collapse to a single differential equation. This is particularly problematic because it is known that GDA can diverge on two-player zero-sum games [42, 40], whereas EG, OGDA, and LA do not. The naive continuous-time limit therefore fails to distinguish between these algorithms even though they display distinct convergence behaviors in discrete time.
The solution to this problem is to take more careful continuous-time limits. We take what is called a high-resolution limit, which has its roots in the study of hydrodynamics and is useful for studying systems that contain multiple timescales. Here is the basic idea. Typically, we are faced with an algorithm of the form x_{n+1} = G(x_n, η), where G is the update rule and η is the step size. In order to arrive at a continuous-time representation x(t), we make the association t = nη and take the limit η → 0. The idea behind a high-resolution limit is to instead take t = n g(η), where g is a function of η other than the identity; for example, the high-resolution analysis of accelerated gradient methods [20] uses the scaling g(η) = √η.
The high-resolution continuous-time limits of GDA, EG, OGDA, and LA are distinct; writing β = 2/η, they are given in (59)-(62). The limits of EG, OGDA, and LA each contain a term involving the matrix
\[
J(z) = \begin{pmatrix} \nabla^2_{x_1} f(z) & \nabla_{x_2} \nabla_{x_1} f(z) \\ -\nabla_{x_1} \nabla_{x_2} f(z) & \nabla^2_{x_2} f(z) \end{pmatrix}, \tag{63}
\]
which is not present in the high-resolution continuous-time limit of GDA. This suggests that the three convergent algorithms all leverage information contained in the Jacobian of F in order to converge. It is also interesting that the continuous-time limit of LA is L-dependent. The structure of the continuous-time equations (59)-(62) thus immediately gives clues to understanding the different convergence properties of the various algorithms and motivates new questions for future research.
References
[1] Adriana Camacho and Emily Conover. Manipulation of social program eligibility. American
Economic Journal: Economic Policy, 3(2):41–65, 2011.
[2] Tijana Zrnic, Eric Mazumdar, S. Shankar Sastry, and Michael I. Jordan. Who leads and
who follows in strategic classification? Advances in Neural Information Processing Systems,
34:15257–15269, 2021.
[3] Juan C. Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative
prediction. In Proceedings of the 37th International Conference on Machine Learning, volume
119, pages 7599–7609, 2020.
[4] Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction
and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.
[5] Stephen Bates, Anastasios N. Angelopoulos, Lihua Lei, Jitendra Malik, and Michael I. Jordan.
Distribution-free, risk-controlling prediction sets. J. ACM, 68(6), 2021.
[6] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua
Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. arXiv preprint
arXiv:2110.01052, 2022.
[7] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
[8] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory
in Hilbert Spaces. Springer, 2011.
[9] Francisco Facchinei and Jong-Shi Pang. Finite-Dimensional Variational Inequalities and Com-
plementarity Problems. Springer, 2003.
[10] Yurii Nesterov. Introductory Lectures on Convex Optimization. Springer, 2004.
[11] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to
escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine
Learning, volume 70, pages 1724–1732, 2017.
[12] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, and Michael I. Jordan. On nonconvex
optimization for machine learning: Gradients, stochasticity, and saddle points. J. ACM, 68(2),
2021.
[13] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only
converges to minimizers. In 29th Annual Conference on Learning Theory, volume 49, pages
1246–1257, 2016.
[14] Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and
Benjamin Recht. First-order methods almost always avoid strict saddle points. Mathematical
Programming, 176:311–337, 2019.
[15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points – online
stochastic gradient for tensor decomposition. In Proceedings of the 28th Conference on Learning
Theory, volume 40, pages 797–842, 2015.
[16] Arkadi S. Nemirovski and David B. Yudin. Problem Complexity and Method Efficiency in
Optimization. J. Wiley & Sons, 1983.
[17] Michael Baes. Estimate sequence methods: extensions and approximations. IFOR Internal
Report, ETH Zürich, 2009.
[18] Weijie Su, Stephen Boyd, and Emmanuel J. Candès. A differential equation for modeling
Nesterov’s accelerated gradient method: Theory and insights. Journal of Machine Learning
Research, 17:1–43, 2016.
[19] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[20] Bin Shi, Simon S. Du, Michael I. Jordan, and Weijie J. Su. Understanding the acceleration
phenomenon via high-resolution differential equations. Mathematical Programming, 195:79–148,
2022.
[21] Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on acceler-
ated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–
E7358, 2016.
[22] Michael Betancourt, Michael I. Jordan, and Ashia C. Wilson. On symplectic optimization.
arXiv preprint arXiv:1802.03653, 2018.
[23] Guilherme França, Michael I. Jordan, and René Vidal. On dissipative symplectic integration
with applications to gradient-based optimization. Journal of Statistical Mechanics: Theory and
Experiment, 2021(4):043402, 2021.
[24] Ernst Hairer, Syvert P. Nørsett, and Gerhard Wanner. Solving Ordinary Differential Equations
I. Springer, second edition, 1993.
[25] Ernst Hairer and Gerhard Wanner. Solving Ordinary Differential Equations II. Springer, second
edition, 1996.
[26] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes
saddle points faster than gradient descent. In Proceedings of the 31st Conference On Learning
Theory, volume 75, pages 1042–1085, 2018.
[27] Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-
concave densities. Journal of the Royal Statistical Society. Series B (Statistical Methodology),
79(3):651–676, 2017.
[28] Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted
Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
[29] Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In
Proceedings of Algorithmic Learning Theory, volume 83, pages 186–211, 2018.
[30] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped
Langevin MCMC: A non-asymptotic analysis. In Proceedings of the 31st Conference On Learn-
ing Theory, volume 75, pages 300–323, 2018.
[31] Yi-An Ma, Niladri S. Chatterji, Xiang Cheng, Nicolas Flammarion, Peter L. Bartlett, and
Michael I. Jordan. Is there an analog of Nesterov acceleration for gradient-based MCMC?
Bernoulli, 27(3):1942–1992, 2021.
[32] Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan.
High-order Langevin diffusion yields an accelerated MCMC algorithm. Journal of Machine
Learning Research, 22(42):1–41, 2021.
[33] Niladri Chatterji, Jelena Diakonikolas, Michael I. Jordan, and Peter L. Bartlett. Langevin
Monte Carlo without smoothness. In Proceedings of the 23rd International Conference on
Artificial Intelligence and Statistics, volume 108, pages 1716–1726, 2020.
[34] Eric Mazumdar, Aldo Pacchiano, Yian Ma, Michael Jordan, and Peter Bartlett. On approxi-
mate Thompson sampling with Langevin algorithms. In Proceedings of the 37th International
Conference on Machine Learning, volume 119, pages 6797–6807, 2020.
[35] Eric V. Mazumdar, Michael I. Jordan, and S. Shankar Sastry. On finding local Nash equilibria
(and only local Nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838, 2019.
[36] Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity
of computing a Nash equilibrium. SIAM Journal on Computing, 39:195–259, 2009.
[37] R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal
on Control and Optimization, 14(5):877–898, 1976.
[38] Galina M. Korpelevich. The extragradient method for finding saddle points and other problems.
Matecon, 12:747–756, 1976.
[39] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient
and optimistic gradient methods for saddle point problems: Proximal point approach. In Pro-
ceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pages
1497–1507. PMLR, 2020.
[40] Tatjana Chavdarova, Michael I. Jordan, and Manolis Zampetakis. Last-iterate conver-
gence of saddle point optimizers via high-resolution differential equations. arXiv preprint
arXiv:2112.13826, 2022.
[41] Tatjana Chavdarova, Matteo Pagliardini, Martin Jaggi, and Francois Fleuret. Taming GANs with lookahead. Technical report, Idiap, 2020.
[42] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs
with optimism. In Proceedings of the 6th International Conference on Learning Representations,
2018.