
arXiv:2309.04877v1 [cs.LG] 9 Sep 2023

A Gentle Introduction to Gradient-Based Optimization and Variational Inequalities for Machine Learning

Neha S. Wadia¹*, Yatin Dandi², and Michael I. Jordan³

¹ Center for Computational Mathematics, Flatiron Institute
² SPOC Laboratory and IdePHICS Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL)
³ Department of EECS and Department of Statistics, University of California, Berkeley

September 12, 2023

* [email protected]
Abstract

The rapid progress in machine learning in recent years has been based on a highly productive
connection to gradient-based optimization. Further progress hinges in part on a shift in focus from
pattern recognition to decision-making and multi-agent problems. In these broader settings, new
mathematical challenges emerge that involve equilibria and game theory instead of optima. Gradient-
based methods remain essential—given the high dimensionality and large scale of machine-learning
problems—but simple gradient descent is no longer the point of departure for algorithm design.
We provide a gentle introduction to a broader framework for gradient-based algorithms in machine
learning, beginning with saddle points and monotone games, and proceeding to general variational
inequalities. While we provide convergence proofs for several of the algorithms that we present, our
main focus is that of providing motivation and intuition.
Contents
1 Introduction
  1.1 The Challenges of Decision-Making Processes
  1.2 Multi-Way Markets
  1.3 Challenges at the Intersection of Machine Learning and Economics
  1.4 Two Illustrative Examples
    1.4.1 Strategic Classification
    1.4.2 Distribution-Free Uncertainty Quantification for Decision-Making
  1.5 Overview of the Lectures
  1.6 The Subgradient Method and a First Convergence Proof

2 Computing Optima in Discrete and Continuous Time
  2.1 Convergence Guarantees for Gradient Descent on Convex Functions
  2.2 Gradient Descent on Nonconvex Functions: Escaping Saddle Points Efficiently
  2.3 Variational, Hamiltonian, and Symplectic Perspectives on Acceleration
  2.4 Variational Inequalities: From Minima to Nash Equilibria and Fixed Points
    2.4.1 Two-Player Zero-Sum Games
    2.4.2 Variational Inequalities
    2.4.3 Nash Equilibria
    2.4.4 Computing Nash Equilibria

3 Computing Equilibria
  3.1 Monotone Operators
  3.2 Fixed-Point Finding Algorithms
    3.2.1 A Naive Algorithm
    3.2.2 The Proximal Point Method
    3.2.3 The Extragradient Algorithm
    3.2.4 High-Resolution Continuous-Time Limits

Acknowledgements
These notes are based on three lectures delivered by Michael Jordan at the summer school “Statis-
tical Physics and Machine Learning” held in Les Houches, France, in July 2022. Neha Wadia and
Yatin Dandi were students at the school. The authors are grateful to Florent Krzakala and Lenka
Zdeborová for organizing the school.
The authors thank Sai Praneeth Karimireddy for useful discussions clarifying the material on
monotone operators, and Sidak Pal Singh for collaboration in the early stages of this manuscript.
MJ was supported in part by the Mathematical Data Science program of the Office of Naval
Research under grant number N00014-18-1-2764 and by the Vannevar Bush Faculty Fellowship pro-
gram under grant number N00014-21-1-2941. NW’s attendance at the summer school was supported
by a UC Berkeley Graduate Division Conference Travel Grant. The Flatiron Institute is a division
of the Simons Foundation.
Author contributions: MJ prepared and delivered the lectures and edited the text. YD partially
drafted an early version of the first lecture and made the figures. NW wrote the text and edited the
figures.

List of Algorithms
1 Subgradient Method with Constant Step Size
2 Gradient Descent
3 Accelerated Gradient Descent (Nesterov’s Method)
4 Proximal Point Method
5 Extragradient Method

1 Introduction
Much research in current machine learning is devoted to “solving” or even surpassing human intel-
ligence, with a certain degree of opacity about what that means. A broader perspective shifts the
focus from the intelligence of a single individual to instead consider the intelligence of the collective.
The goals at this level of analysis may take into consideration overall flow, stability, robustness, and
adaptivity. Desired behavior may be formulated in terms of tradeoffs and equilibria rather than
optima. Social welfare can be an explicit objective, and we might ask that each individual be better
off when participating in the system than they were before. Thus, the goal would be to enhance
human intelligence instead of supplanting or superseding it.
The mathematical concepts needed to support such a perspective incorporate the pattern recog-
nition and optimization tools familiar in current machine learning research, but they go further,
bringing in dynamical systems methods that aim to find equilibria and bringing in ideas from mi-
croeconomics such as incentive theory, social welfare, and mechanism design. A first point of contact
between such concepts and machine learning recognizes that learning systems are not only pattern
recognition systems, but they are also decision-making systems. While a classical pattern recognition
system may construe decision-making as the thresholding of a prediction, a more realistic perspective
views a single prediction as a constituent of an active process by which a decision-maker gathers
evidence, designs further data-gathering efforts, considers counterfactuals, investigates the relevance
of older data to the current decision, and manages uncertainty. A still broader perspective takes
into account the interactions with other decision-makers, particularly in domains in which there is
scarcity. These interactions may include signaling mechanisms and other coordination devices. As
we broaden our scope in this way, we arrive at microeconomic concerns, while retaining the focus
on data-driven, online algorithms that is the province of machine learning.
The remainder of our presentation is organized as follows. We first expand on the limitations of
a pure pattern-recognition perspective on machine learning, bringing in aspects of decision-making
and social context. We then turn to an exemplary microeconomic model—the multi-way matching
market—and begin to consider blends of machine learning and microeconomics. Finally, we turn
to a didactic presentation of some of the mathematical concepts that bridge between the problem
of finding optima—specifically in the machine-learning setting of nonconvex functions and high-
dimensional spaces—and that of finding equilibria.

1.1 The Challenges of Decision-Making Processes


Here is a partial list of obvious challenges that must be addressed when the predictions of pattern
recognition systems are used in the context of decision-making. All of these are active areas of
research in current machine learning.
1. Uncertainty quantification. It is important to be able to give calibrated notions of confidence
in the predictions of a model in order to effectively use those predictions to make decisions.
However, it is not always clear how to systematically assign uncertainty to model predic-
tions, partially because our understanding of the properties of modern model architectures
is incomplete. Moreover, the traditional and well-studied methods of classical statistics such
as the jackknife and the bootstrap are not computationally efficient at the scale of modern
applications. In Section 1.4.2, we will discuss some new ideas that help address this problem.
2. Robustness and adversaries. It is important to build models that can provably maintain
performance under distribution shift, i.e., differences between test and training data, including
differences that are imposed adversarially.
3. Biases and fairness. We would like to control for bias in our models and ensure that people
are treated fairly.

4. Explainability and interpretability. In a decision-making context, it is often important for the
human beings interacting with a model to be able to interpret how the model arrives at its
predictions so that they can reason about the relevance of those predictions to their lives.
All the problems we have listed above involve understanding how to design models whose outputs
we can guarantee are interpretable, fair, safe, and reliable for people to interact with. The discipline
that has historically dealt with this type of problem is not machine learning but economics.
To further illustrate the point, we discuss two specific examples of scenarios where the output of
a pattern recognition system is clearly and naturally part of a decision-making process.
Suppose your physician has access to a trained neural network that takes as input a data vector
containing all your health information and generates an output vector of the predicted risk of your
developing various diseases. At your next annual medical check-in, you take a standard battery
of tests and your physician updates your data vector with the results and then queries the neural
network. Let’s focus on just one of the predictions of the network: your risk, represented by a
number between 0 and 1, of developing a fatal heart condition within the next five years, the only
treatment for which is preemptive heart surgery. Let’s say that on average people who score a 0.7
or higher develop the condition. The neural network outputs a 0.71 for you. Is there any sense in
which this prediction and the accompanying threshold constitutes a decision? Do you immediately
get the surgery?
The most likely scenario is that you sit down with your physician to reason through how to
interpret the result and make the most appropriate choice for your healthcare. A number of questions
might reasonably come up. What was the provenance of the training data for the neural network?
Were there enough contributions to that data from individuals with similar health profiles to yourself
to engender some degree of relevance in the prediction? What is the uncertainty in the prediction?
What are the sources of that uncertainty? Over the course of the conversation, you may suddenly
remember a piece of information that did not seem relevant before—that you had an uncle who died
of a suspected heart condition, let’s say—but that is now highly relevant. What is the effect of that
new data on your reasoning? Furthermore, you may consider counterfactuals. What if you were
to change your diet and swim for thirty minutes every day for six months—would that change the
prediction, and would it be worth risking postponing any treatment for that period of time in order
to ward off the risk of potentially unnecessary heart surgery?
The neural network can clearly help orient this decision-making process, but we cannot naively
threshold the output of the model to make a decision, especially because the stakes in this context
are high. We also see that the notion of a tradeoff appears naturally in decision-making contexts.
Let us consider another example. Recommendation systems are some of the most successful
pattern recognition systems built to date. Consider recommendation systems for books and movies.
Is it problematic if a single book or movie is recommended to a large number of people? Not
particularly. Books and movies are not scarce. Today, it is possible to print books on demand and
to stream movies on demand. Now let’s consider a recommendation system for routes to the airport.
Is it problematic to recommend the same route to the airport to a large number of people? In this
case, the answer is a clear yes. If everyone is directed down the same route, this will cause congestion.
The same applies in the case of a recommendation system for restaurants. Recommending the same
restaurant to a large number of people will result in lines around the block at a handful of places and
an artificial lack of customers at others. A simplistic approach to fixing this type of problem is to
treat recommendation in the context of scarcity as a load-balancing problem. The trouble with this
approach is that in many cases, it will not be clear on what basis to assign one subset of people to,
say, one restaurant versus another. In fact, any such assignment is likely to be arbitrarily made and
will introduce further problems with regards to fairness and also economic efficiency. Once again,
we see that the naive approach to making recommendations fails in certain contexts, in particular,
in contexts where a scarce resource is being allocated.
We hope we have convinced the reader that no matter how accurate the predictions of a model
may be in some technical sense, naive approaches to using those predictions to make decisions do
not work well. We argue that a better way forward is to think of the model and the parties that
interact with it as constituting a multi-way market.

1.2 Multi-Way Markets


Let us continue discussing the example of a recommendation system for restaurants. A traditional
recommendation system uses all the information at its disposal, such as the user’s browsing history,
location history, a questionnaire about their culinary preferences that the user fills out when they
sign up to use the system, and average customer ratings of all the restaurants in its database, to
recommend a ranked list of restaurants to the user. We have already discussed that this approach
naturally results in competition to obtain a table at a small number of restaurants. The problem
gets worse as the number of people using the system increases.
A more efficient, scalable, and flexible way to structure this problem is as a two-way market, with
the restaurants on one side and the customers on the other. When the customer wishes to find a
restaurant for dinner, they consent to share certain information, such as their current location and a
few preferences—cuisine, ambiance, and price range, for example—with the restaurants temporarily.
The restaurants then process the incoming information from multiple customers and make offers
to them, and perhaps some of them offer a ten percent discount on the total price of dinner if the
customer arrives in the next fifteen minutes. Customers then choose the offer that appeals to them.
Recommendations emerge organically from an economic system of information sharing and bidding.
Notice the mathematical structure of this problem, which is to compute an optimal matching
between customers and restaurants while accounting for the interactions between and within both
groups, and recognizing the strategic nature of the participants. The overall goals of the system
design can be those of classical learning systems, including accuracy, coverage, and precision, but
they can also include an economic aspect, including social welfare or revenue.
Two-way matching markets that are not learning-based have had a significant impact in many
social and commercial domains. A classic example is the matching algorithm that assigns medical
residents to hospitals in the United States every year. It takes as input a ranked list of hospitals from
each prospective resident, and a ranked list of prospective residents from each hospital, and computes
an optimal matching. Each side of the market must state their preferences up front. However, it
is now recognized that this may not always be possible. One or both parties may not know their
preferences beforehand. An obvious solution suggests itself, which is for one or both parties to learn
their preferences from data. A new subfield of study, called learning-based mechanism design, is now
emerging to address how to combine learning with matching to design new algorithms that can do
both.
The market we described for restaurant matchings was hypothetical. However, such markets
are beginning to become a reality. An example is the company United Masters, which provides a
platform to match musicians, especially young musicians who may not be represented by a record
label, with their listeners. The platform is structured as a three-way market, involving the musicians,
listeners, and brands. The latter can use the platform to locate and make contracts directly with
specific musicians in order to use their music for branding purposes. The digital marketplace also
provides artists with data that informs them of where the highest numbers of listeners of their music
are located; they can use this information, for example, to convince venue owners in the relevant
locations to sign them for paid performances. Overall this is a matching system in which data
analysis and fairness issues are key.
Creating markets that work as intended is not a simple task. We have already hinted at some of
the mathematical challenges involved in our discussion of restaurant recommendations and United
Masters; problems such as identifying and aligning incentives, designing matching algorithms in the
face of unknown or partially known preferences, and coming up with alternative business models
to advertising-based revenue generation. In fact, there is a great deal of new mathematics that
will need to be invented in order to enable the effective design of markets that are beneficial for all
parties.

1.3 Challenges at the Intersection of Machine Learning and Economics


We provide a few rough and ready examples of mathematical challenges that arise when we try to
develop new machine learning algorithms that bring in ideas from game theory, mechanism design,
and incentive theory. The goal of addressing these challenges is to discover new principles to build
healthy learning-based markets that are stable over long periods of time.

1. Understanding relationships among optima, equilibria, and dynamics. Training a pattern
recognition model typically involves finding the minimum of a loss function. The dynamics of
marketplaces, which can be described using the language of game theory, typically converge to
equilibria. Gradient descent, which provably converges to minima, provably fails to compute
equilibria; other algorithms, such as the extragradient algorithm, are needed for the latter
task. We will study both these algorithms in the following lectures.
2. Designing multi-way markets in which individual agents must explore to learn their preferences.
This involves integrating bandit algorithms with matching algorithms.
3. Designing large-scale multi-way markets in which agents view the other sides of the market
through interaction with recommendation systems. We described an example of such a mar-
ketplace in the context of a recommendation system for restaurants.
4. Uncertainty quantification for blackbox and adversarial settings. Correctly quantifying uncer-
tainty is crucial to construct safe algorithms and facilitate effective decision-making.
5. Mechanism design with learned preferences.
6. The design of contracts that incentivize agents to provide data in a federated learning setting.
7. Incentive-aware classification and evaluation.
8. Tradeoffs involving fairness, privacy, and statistical efficiency.

In these lectures, we will focus mostly on the first item in this list. To give the reader a sense
of the flavor of research involved in addressing some of the other points in the list, however, we
give brief vignettes in the next section of research in strategic classification and distribution-free
uncertainty quantification.

1.4 Two Illustrative Examples


1.4.1 Strategic Classification
Predictive models are routinely deployed in society for various purposes: for example, an insurance
company may use a predictive model to decide whether or not to insure a prospective client and
what premium to charge. Similarly, a college admissions committee evaluates applicants based on
some set of criteria to decide whether or not to admit them to the college. When a predictive
model is deployed in society, the people who interact with it will naturally strategize to maximize
their utility under the model. Such strategic behavior can eventually render the predictions of the
model meaningless. In fact, this effect is so well known that it bears a name—Goodhart’s Law—in
economics. A classic illustration of the effect is given in Figure 1 in the context of the poverty index
score [1], a measure of poverty that was designed to be thresholded to determine the subpopulation
that qualified for social welfare programs. When the threshold was first introduced, the poverty
index score was approximately Gaussian-distributed across the population. However, people soon
learned to manipulate their scores in such a way as to increase their chances of placing themselves

to the left of the threshold. In a mere decade, the entire peak of the distribution of poverty index
scores had moved to the left of the threshold, which now qualified almost the whole population for
welfare assistance.

Figure 1: The poverty index score was approximately Gaussian-distributed at the time when it was selected
as a measure of poverty that could be thresholded for the purposes of making social policy. With time, its
distribution became further and further skewed to the left of the threshold. This figure is reproduced from
[1].
Strategic classification is a problem in which a population that is being classified for some purpose
may strategically modify their features in order to elicit a favorable classification. This problem finds
a natural formulation as a sequential two-player Stackelberg game between a player (the decision-
maker) deploying the classifier and a player (the strategic agents) being classified. The goal of the
decision-maker is to arrive at the Stackelberg equilibrium, which is the model that minimizes the
decision-maker’s loss after the strategic agents have adapted to it.
In the prevailing formulation of this game, the decision-maker leads by deploying a model, and
the strategic agents follow by immediately playing their best response to the model. It is assumed
that the strategic agents adapt to the model instantaneously. In [2], the authors formulate the game
more realistically by introducing a finite natural timescale of adaptation for the strategic agents.
Relative to this timescale, the decision-maker may choose to play on a timescale that is either
faster (a “reactive” decision-maker) or slower (a “proactive” decision-maker). (In this language, most prior work dealt with the setting of a proactive decision-maker.) These timescales
are naturally introduced to the problem by formulating the game such that each player must use a
learning algorithm to learn their strategies, as opposed to playing a strategy from a predetermined
set of possibilities. Within this formulation of strategic classification, the authors of [2] found that
the decision-maker can drive the game toward one of two different equilibria depending on whether
they chose to play proactively or reactively. Thus the dynamics of play directly influences the
outcome of the game. Furthermore, the authors were able to identify certain game settings in which
both players prefer the equilibrium that results when the decision-maker plays reactively.
The results of [2] exemplify how incorporating ideas of statistical learning into a game-theoretic
problem can yield a richer model of some social phenomenon that is productive of new insights. We
refer the reader to [3] for a broader discussion of this emerging field at the interface of learning and
economics.

1.4.2 Distribution-Free Uncertainty Quantification for Decision-Making


As we discussed previously, translating predictions into decisions requires reliable uncertainty quan-
tification. Existing methods in pattern recognition do not provide reliable error bars, and traditional
statistical methods of uncertainty quantification such as the bootstrap are prohibitively computa-
tionally intensive at the scales of modern problems.
Conformal prediction is a methodology that gives us a way forward. Given the predictions of a
pattern recognition model, such as an image classifier, a conformal prediction algorithm efficiently
computes a confidence interval over the outputs of the model, thus providing statistical coverage
without the need to manipulate the model itself or to make strong assumptions about its architecture
or about the distribution of the data. The interested reader may refer to [4] for an accessible
introduction to conformal prediction.
To give the reader a flavor of the mathematics involved in conformal prediction, we give a high-
level overview of a method of constructing risk-controlling prediction sets [5]. Prediction sets are sets
that provably contain the true value of the entity being predicted with high probability. Prediction
sets are an actionable form of uncertainty quantification, and therefore useful for decision-making
purposes in certain contexts. For example, if a physician were to put an image of a patient’s colonic
epithelium through a classifier while trying to diagnose the cause of their chronic stomach pain, it
would be just as important to know what the likely causes of the pain are as to know which causes
can safely be ruled out.
Consider a dataset {(Xi, Yi)}, i = 1, . . . , m, of independent, identically distributed pairs consisting of feature
vectors Xi ∈ X and labels Yi ∈ Y. In many applications, X = R^d for some large d ∈ N. Consider a
partition of the dataset into sets of size n and m − n; the former forms the calibration set Ical, the
latter the training set Itrain. We are in possession of a trained model fˆ (say, a classifier) that was
trained on Itrain. The goal is to construct a set-valued predictor Tλ : X → Y′, where Y′ is a space
of sets (in many problems, Y′ = 2^Y) and λ ∈ R is a scalar parameter; given some user-specified
γ, δ ∈ R, for an appropriate choice λ̂ of λ, we require the risk R of Tλ̂(X) to be bounded above with
high probability:

P( R(Tλ̂(X)) ≤ γ ) ≥ 1 − δ.   (1)

In [5], the authors provide a procedure by which to build T and choose λ̂.
Tλ has the following nesting property: for λ1 < λ2, we have Tλ1(X) ⊂ Tλ2(X). We define a loss function
L : Y × Y′ → R≥0 on T. A typical choice of L(Y, Tλ(X)) is the miscoverage indicator 1{Y ∉ Tλ(X)}. We assume
that the loss function satisfies a monotonicity condition: Tλ1(X) ⊂ Tλ2(X) implies L(Y, Tλ1(X)) ≥ L(Y, Tλ2(X));
in other words, enlarging the prediction set drives down its loss. The risk of the prediction set is the average
value of the loss over the data: R(Tλ(X)) = E[ L(Y, Tλ(X)) ].
Using the calibration set Ical, it is possible to construct a quantity Rmax(λ) such that for all λ,
P( R(Tλ(X)) ≤ Rmax(λ) ) ≥ 1 − δ. Generally speaking, this construction involves inverting certain
concentration inequalities. The reader is referred to Section 3 of [5] for details. Once we have
Rmax(λ), taking

λ̂ = inf{ λ : Rmax(λ′) < γ ∀λ′ ≥ λ }   (2)
ensures that the property (1) is satisfied. This is the content of the main theorem of [5]. This
formalism is applied to a number of examples in [5], including protein folding, in which case fˆ is
AlphaFold, scene segmentation, and tumor detection in the colon. Note that the assumption of a
monotonic loss can be removed, and λ can be a low-dimensional vector instead of a number; these
generalizations can be found in [6].
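To make the recipe concrete, here is a minimal Python sketch of the calibration step, assuming a loss bounded in [0, 1] and using a Hoeffding-style upper confidence bound as the concentration inequality; the grid of λ values, the specific bound, and the function name are illustrative assumptions rather than the exact construction of [5].

import numpy as np

def choose_lambda_hat(losses_by_lambda, lambdas, gamma, delta):
    # losses_by_lambda[j, i] = L(Y_i, T_{lambda_j}(X_i)) on the calibration set,
    # assumed to lie in [0, 1] and to be nonincreasing in lambda.
    n = losses_by_lambda.shape[1]
    r_hat = losses_by_lambda.mean(axis=1)                       # empirical risk at each lambda
    r_max = r_hat + np.sqrt(np.log(1.0 / delta) / (2.0 * n))    # Hoeffding-style R_max(lambda)
    valid = r_max < gamma
    # mimic the infimum in (2): require the bound to hold for all larger lambdas
    valid_from_here = np.flip(np.cumprod(np.flip(valid))).astype(bool)
    if not valid_from_here.any():
        raise ValueError("no lambda satisfies the risk constraint")
    return lambdas[np.argmax(valid_from_here)]

At test time one then reports the set Tλ̂(x); the heavy lifting in [5] lies in showing that this data-dependent selection of λ̂ preserves the high-probability guarantee (1).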

1.5 Overview of the Lectures


We hope we have convinced the reader that there are many open mathematical challenges involved
in combining ideas from machine learning and economics to design the algorithms we will live among
in the future, and that these challenges are worth tackling. We will devote the rest of these lectures
to one of these mathematical challenges, namely, the challenge of computing equilibria (in games
and other dynamical systems).

We will present the problem of computing equilibria as a generalization of the problem of com-
puting optima. We will provide a self-contained introduction to both these problems, starting with
optima and moving to equilibria. Our goal is to equip the reader with enough terminology and
technical knowledge to read and contextualize the primary literature.
Although the algorithms we will discuss are usually implemented in discrete time, on occasion
we will find it useful to study them in continuous time. As we saw in our discussion of Stackelberg
equilibria in Section 1.4.1, the dynamics of an algorithm can influence what the algorithm converges
to. This is true both in optimization and in games, and we will find in each case that the continuous-
time perspective makes this dependency clear. It is also the case that proofs of convergence are often
much simpler to write in continuous time than in discrete time.
We begin with the essentials of convex optimization through the discussion of two fundamental
algorithms: the subgradient method, in Section 1.6 and gradient descent, in Section 2.1. Convex
functions are the simplest class of functions for which we can provide convergence guarantees to
the global minimum of the function. We will see that the subgradient method reduces to gradient
descent when the function being optimized is differentiable in addition to being convex. In Section
2.2, we talk about the convergence properties of gradient descent on nonconvex functions, which
may in general have more than one minimum and may also have saddle points and maxima. We
will find it useful to have this discussion in continuous time.
Gradient descent is a first-order algorithm; its continuous-time limit is a first-order differen-
tial equation in time. We also provide an exposition on optimization with a description of some
recent results on accelerated algorithms; the continuous-time limits of these algorithms are second-
order differential equations in time, and can be interpreted as the equations of motion of a certain
Lagrangian.
The dynamics of a marketplace can often be formulated as a game. In order to effectively design
new markets and understand existing ones, it is crucial to be able to compute the equilibria of the
relevant games. Returning to discrete time, in Section 2.4, we graduate from studying optima to
studying equilibria. We introduce some basic vocabulary, including the concept of a variational
inequality, and we show that the Nash equilibria of certain kinds of games can be written as the
fixed points of variational inequalities involving monotone operators. In Section 3.1 we define mono-
tonicity and strong monotonicity, and in Section 3.2 we discuss the proximal point method and the
extragradient algorithm, and their convergence behavior on monotone operators. We wrap up these
lectures with a brief foray back into continuous time in Section 3.2.4, with a comparative study of
a number of fixed-point finding algorithms.
Throughout the text, we will omit mathematical details where necessary in the interests of clarity
and brevity (although we will make note of where we do this). The interested reader may find the
details of almost everything we discuss here in the following references: Convex Analysis [7], Convex
Analysis and Monotone Operator Theory in Hilbert Spaces [8], and Finite-Dimensional Variational
Inequalities and Complementarity Problems [9].

1.6 The Subgradient Method and a First Convergence Proof


We begin our exposition of the mathematics of optimization with the problem of finding the minimum
of a convex function that need not be differentiable. As we have mentioned previously, convexity is
the simplest setting in which we can provide convergence guarantees to the global minimum of the
function.
Consider the optimization problem
min_{x ∈ R^d} f(x),   (3)

where f is a convex function. A function is said to be convex if the line segment connecting any two
function values f (a) and f (b) lies above f (x) at all points x between a and b. We will give another
definition of convexity in the context of differentiable functions in the next lecture. In this section,
we do not assume that f is differentiable. In what follows, we do require some further regularity
conditions on f beyond convexity (such as convexity of its domain), but we will not trouble with
those details here.
A central notion in convex analysis is that of the subgradient of a function.
Definition 1.1 (Subgradient). gx is a subgradient of f at x if, ∀y, f (y) ≥ f (x) + ⟨gx , y − x⟩. This
inequality is called the subgradient inequality.
In the definition above and throughout these notes we use ⟨·, ·⟩ to denote the Euclidean inner
product: ⟨x, y⟩ = Σ_{i=1}^{d} xi yi.
In general the subgradient is not unique. For example, the subgradient of the function f (x) = |x|
where x ∈ R is given by

gx = 1 for x > 0,   gx ∈ [−1, 1] for x = 0,   gx = −1 for x < 0.   (4)

For this reason, it is also useful to give a name to the set of all subgradients of a function, which we
do next.
Definition 1.2 (Subdifferential). The subdifferential ∂f (x) of a function f at a point x is the set
of all subgradients of f at x, i.e.,

∂f (x) = {gx : ∀y, f (y) ≥ f (x) + ⟨gx , y − x⟩}.

If f is differentiable at x, then ∂f (x) contains a single element equal to the derivative of f at x.


We remind the reader that in this section we do not assume that f is differentiable. We are now
ready to introduce the subgradient method for solving (3).

Algorithm 1 Subgradient Method with Constant Step Size


Require: η > 0, T > 0
while k < T do
    xk+1 = xk − η gk
end while

In Algorithm 1, T is the number of iterations for which we run the algorithm, η is the (constant)
step size, gk is shorthand for gxk , and gk ∈ ∂f (xk ). We will often refer to xk as “the k-th iterate”.
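As a concrete illustration, here is a minimal Python sketch of Algorithm 1 applied to the nondifferentiable convex function f(x) = ||x||_1, whose minimizer is x⋆ = 0; the objective and the constants are illustrative choices, and the step size anticipates the R/(G√T) choice made in the analysis below.

import numpy as np

def subgradient_method(subgrad, x1, eta, T):
    # Algorithm 1 with constant step size; returns the average iterate,
    # for which the 1/sqrt(T) guarantee derived below holds.
    x = x1.copy()
    avg = np.zeros_like(x)
    for k in range(T):
        x = x - eta * subgrad(x)      # x_{k+1} = x_k - eta * g_k
        avg += x / T
    return avg

# f(x) = ||x||_1 has subgradient sign(x) (any value in [-1, 1] is valid where x = 0).
d, T = 10, 10_000
x1 = np.ones(d)
R = np.linalg.norm(x1)                # bound on ||x_1 - x_star||
G = np.sqrt(d)                        # bound on ||g_k|| for the l1 norm
x_avg = subgradient_method(np.sign, x1, eta=R / (G * np.sqrt(T)), T=T)
print(np.abs(x_avg).sum())            # f(average iterate) - f(x_star), on the order of RG/sqrt(T)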
The subgradient method is not a descent method: it is not guaranteed to make progress toward
the optimum with each iteration. This is different from the case of gradient descent, which we will
discuss in the sequel (see Lemma 2.1). Nonetheless, we can prove that it converges in function value
with a rate 1/√T. Let x⋆ denote the optimum (i.e., the solution to (3)), and || · || the Euclidean
norm, and consider the quantity ||xk+1 − x⋆||²:

||xk+1 − x⋆||² = ||xk − ηgk − x⋆||²
             = ||xk − x⋆||² − 2η ⟨gk, xk − x⋆⟩ + η² ||gk||²
             ≤ ||xk − x⋆||² − 2η (f(xk) − f(x⋆)) + η² G².   (5)

To arrive at (5), we applied the subgradient inequality and assumed that ||gk ||2 is bounded above
by a constant G2 . We now move f (xk ) − f (x⋆ ) to the left-hand side, sum over k, and divide by 2ηT :
(1/T) Σ_{k=1}^{T} (f(xk) − f(x⋆)) ≤ (1/(2ηT)) Σ_{k=1}^{T} ( ||xk − x⋆||² − ||xk+1 − x⋆||² ) + (η/2) G².   (6)

On the right-hand side of (6) we have a telescoping sum, in which all but the first and last terms
cancel. On the left-hand side, we take advantage of the convexity of f and apply Jensen’s inequality,
which states that f evaluated at the average of a set of points xi is at most the average of function
values at those points (for f convex). We arrive at
f( (1/T) Σ_{k=1}^{T} xk ) − f(x⋆) ≤ (1/(2ηT)) ( ||x1 − x⋆||² − ||xT+1 − x⋆||² ) + (η/2) G²
                                  ≤ (1/(2ηT)) R² + (η/2) G².   (7)

In the last step above, we have dropped the term ||xT+1 − x⋆||² and assumed that ||x1 − x⋆||² is
bounded above by a constant R². To obtain a rate of convergence, we choose η to minimize the
right-hand side of the last inequality. The optimal value of η turns out to be R/(G√T). Substituting
this in the inequality, we finally arrive at
f( (1/T) Σ_{k=1}^{T} xk ) − f(x⋆) ≤ RG/√T.   (8)

We have thus proved the following result.


Theorem 1.1. On a convex function f, the subgradient method converges in function value on the
average iterate with a rate 1/√T, where T is the number of iterations.
A remarkable fact about (8) is that it has no explicit dependence on the dimension of x (it
depends implicitly on the dimension through R and G). It can further be shown that the 1/√T rate
is tight; i.e., that there exist convex functions such that f(xk) − f(x⋆) scales as T^(−1/2). Lastly, we
note that in order to extract the 1/√T rate in (8), we had to choose a step size that depended on
T. We will not see this type of dependence of the step size on the total number of iterations again
in these lectures.
This discussion of the subgradient method introduces a pattern that will recur as we continue
our study of optimization in the subsequent sections: we make certain regularity assumptions on the
function and write down an algorithm to optimize it; we then exploit the dynamics of the algorithm
as well as the regularity properties of the function to prove convergence. Typically, we are interested
in the rate at which the distance between the iterate and the optimum shrinks, and in particular,
we are interested in algorithms that achieve a fast rate.

2 Computing Optima in Discrete and Continuous Time
2.1 Convergence Guarantees for Gradient Descent on Convex Functions
We remind the reader that we are currently discussing algorithms for the following optimization
problem
min_{x ∈ R^d} f(x).
In Section 1.6, we assumed that f is convex but not necessarily differentiable, and we studied the
convergence behavior of the subgradient method (Algorithm 1) on this problem. When f is a smooth
function of x, the subgradient method reduces to the well-known gradient descent algorithm.

Algorithm 2 Gradient Descent


Require: η > 0

xk+1 = xk − η∇f (xk )

We will now give a definition of convexity for differentiable functions, and then explore the
consequences on the convergence behavior of gradient descent of two further assumptions we can
make on f (x): (1) Lipschitz smoothness, and (2) strong convexity. The proof techniques that arise
will be useful when we study variational inequalities.
Definition 2.1 (Convexity). A (differentiable) function f is convex on an interval if and only if
f (x) ≥ f (y) + ⟨∇f (y), x − y⟩ (9)
for all x, y in the interval.
There are several other equivalent definitions of convexity. Similarly, there are many ways to
characterize smoothness. One possibility is the following.
Definition 2.2 (Smoothness). A function f is (Lipschitz) smooth if it satisfies:
||∇f (x) − ∇f (y)|| ≤ L||x − y||, (10)
where L > 0. Such a function is often referred to as L-smooth.
Note that (10) is equivalent to
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2) ||x − y||².   (11)
The proof of this equivalence is left as an exercise to the reader. Hint: use the fundamental theorem
of calculus and the Cauchy-Schwarz inequality. The inequality (11) indicates that an L-smooth
function can be upper-bounded by a quadratic at every point y in its domain. With this intuition it
is simple to see why, for example, the function f (x) = |x| is not smooth: at the point x = 0, there
is no quadratic that touches f at x = 0 and sits above f in a neighborhood of that point.
We will now see that gradient descent on L-smooth convex functions converges with a rate 1/T,
where T is the total number of iterations. This is faster than the 1/√T rate that we had in (8),
when we did not assume smoothness.
We begin by proving a type of result called a descent lemma, which guarantees that we make a
certain amount of progress in decreasing the function value at each iteration of gradient descent.
Lemma 2.1 (Descent lemma for gradient descent on smooth convex functions). Consider an L-
smooth convex function f . Under the dynamics of Algorithm 2 with η = L−1 , we have
f(xk+1) ≤ f(xk) − (1/(2L)) ||∇f(xk)||².   (12)

Proof. Take y = xk+1 and x = xk in (11), and replace xk+1 − xk with −L−1 ∇f (xk ).
Theorem 2.2. On an L-smooth convex function f , gradient descent with step size L−1 converges
in function value with a rate 1/T , where T is the number of iterations.
Proof. Beginning with Lemma 2.1, we have
f(xk+1) ≤ f(xk) − (1/(2L)) ||∇f(xk)||²
        ≤ f(x⋆) + ⟨∇f(xk), xk − x⋆⟩ − (1/(2L)) ||∇f(xk)||²                    (i)
        = f(x⋆) + (L/2) ( ||xk − x⋆||² − ||xk − x⋆ − (1/L)∇f(xk)||² )         (ii)
        = f(x⋆) + (L/2) ( ||xk − x⋆||² − ||xk+1 − x⋆||² ).                    (iii)   (13)
In (i), we applied the definition of convexity (take y = xk and x = x⋆ in (9)). In (ii), we completed
the square, and in (iii), we recognized that xk − L−1 ∇f (xk ) is simply a gradient descent update
with step size L−1 and replaced it with xk+1 . Now we move f (x⋆ ) to the left-hand side and sum
(13) over k. On the right-hand side, this results in a telescoping sum. We are left with the following:
Σ_{k=1}^{T} (f(xk+1) − f(x⋆)) ≤ (L/2) ( ||x1 − x⋆||² − ||xT+1 − x⋆||² ).   (14)

We can drop the second term on the right-hand side of (14), since it is nonpositive. Note that
Lemma 2.1 implies f(xk+1) − f(x⋆) ≤ f(xk) − f(x⋆) for all k. Each term in the sum on the left-hand
side of (14) is therefore lower-bounded by the quantity f(xT+1) − f(x⋆). Using this lower bound, (14)
implies

T (f(xT+1) − f(x⋆)) ≤ (L/2) ||x1 − x⋆||².   (15)

Lastly, we bound the squared distance between x1 and the optimum by some R² > 0, and divide
through by T. We thus arrive at the result

f(xT+1) − f(x⋆) ≤ (R²L/2) (1/T).   (16)

Theorem 2.2 can be found in Nesterov [10].
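The 1/T behavior is easy to check numerically. The following Python sketch runs gradient descent with step size 1/L on a least-squares objective f(x) = (1/2)||Ax − b||², which is L-smooth and convex but, because A is chosen rank-deficient, not strongly convex; the sizes and random seed are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
m, d, r = 50, 30, 10
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, d))   # rank-deficient design matrix
b = rng.normal(size=m)

L = np.linalg.eigvalsh(A.T @ A).max()                    # smoothness constant
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])

x = np.zeros(d)
for T in range(1, 2001):
    x = x - grad(x) / L                                  # Algorithm 2 with eta = 1/L
    if T % 500 == 0:
        # Theorem 2.2: f(x_T) - f(x_star) <= O(1/T), so this product should stay bounded
        print(T, (f(x) - f_star) * T)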


In Section 1.6, assuming only convexity, we saw that the subgradient method converges in average
function value with the rate 1/√T. Since the subgradient method reduces to gradient descent when
f is differentiable, we have just argued that the same method achieves a faster rate, 1/T , when we
assume f is L-smooth in addition to being convex. If we were to make a third assumption, that
f is not just convex but strongly convex, we would obtain an even faster rate of convergence for
gradient descent. We present the definition of strong convexity next, and state the corresponding
convergence result for gradient descent.
Definition 2.3 (Strong convexity). A function f is said to be strongly convex if there exists µ > 0
such that
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2) ||x − y||².   (17)
Such a function is said to be µ-strongly convex.
All strongly convex functions are also convex.

Theorem 2.3. On an L-smooth and µ-strongly convex function f, gradient descent with step size
η ≤ 2/(L + µ) converges at the rate (1 − C)^T, where C ≤ 1 is a constant that depends on η, µ, and
L, and T is the total number of iterations. In particular,

||xT+1 − x⋆||² ≤ (1 − 2ηµL/(µ + L))^T ||x1 − x⋆||².   (18)

See [10], Theorem 2.1.15, for a proof. The rate in Theorem 2.3 is known as a linear rate of
convergence in optimization parlance.
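The linear rate is also easy to see numerically. In the Python sketch below, gradient descent is run on a strongly convex quadratic, and the per-iteration contraction of the squared distance to the optimum is compared with the factor appearing in (18); the matrix and seed are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
d = 20
M = rng.normal(size=(d, d))
H = M @ M.T + np.eye(d)                        # Hessian of f(x) = 0.5 x^T H x, whose optimum is x_star = 0
mu, L = np.linalg.eigvalsh(H)[[0, -1]]
eta = 2.0 / (L + mu)
rate = 1.0 - 2.0 * eta * mu * L / (mu + L)     # contraction factor from Theorem 2.3

x = rng.normal(size=d)
for k in range(10):
    x_new = x - eta * (H @ x)                  # gradient descent step
    print(x_new @ x_new / (x @ x), "<=", rate)
    x = x_new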

2.2 Gradient Descent on Nonconvex Functions: Escaping Saddle Points Efficiently
So far we have dealt with convergence guarantees for gradient descent on convex functions. In this
section we investigate guarantees for the performance of gradient descent on nonconvex functions,
and study how these differ from the guarantees we can make for convex functions. One of the
differences, we will see, is that when f is nonconvex, convergence guarantees for gradient descent
can have a dimension dependence. The discussion in this section is based on joint work [11, 12] with
Chi Jin, Rong Ge, Praneeth Netrapalli, and Sham Kakade.
Consider an L-smooth function f (x) : Rd → R. We assume that f also has ρ-Lipschitz Hessians:

||∇2 f (x) − ∇2 f (y)|| ≤ ρ||x − y|| ∀x, y. (19)

(We say that f is ρ-Hessian Lipschitz.) Gradient descent halts naturally when ∇f (x⋆ ) = 0. When f
is nonconvex, this can happen not just when x⋆ is a minimum, but also if it is a maximum or a saddle
point. We need therefore to understand whether gradient descent can efficiently avoid maxima and
saddle points. Here, in particular, we focus on how to efficiently escape saddle points. Note that in
the nonconvex setting, we will no longer be looking for convergence guarantees to global minima,
but to local minima. The guarantees thus obtained are useful nonetheless. In many applications,
it is the case that f does not have any spurious local minima; principal components analysis is a
notable example.
First, we introduce some terminology:
• A first-order stationary point of f is any point x such that ∇f(x) = 0.
• A second-order stationary point of f is any point x such that ∇f(x) = 0 and ∇²f(x) ⪰ 0.
• An ε-first-order stationary point (ε-FOSP) is a point x such that ||∇f(x)|| ≤ ε.
• An ε-second-order stationary point (ε-SOSP) is an ε-FOSP at which ∇²f(x) ⪰ −√(ρε) I. Here, I
is the d-dimensional identity matrix.
Note that saddle points are first-order stationary points but not second-order stationary points.

Definition 2.4 (Strict saddle point). x is a strict saddle point of f if it is a saddle point and the
smallest eigenvalue of the Hessian at x is strictly negative: λmin (∇2 f (x)) < 0.
In the following, we will discuss convergence guarantees for a variant of gradient descent that
avoids/escapes strict saddle points. That is, we will not discuss how to escape saddles that have no
escape directions (i.e., at which λmin (∇2 f (x)) = 0).
We will be talking about convergence rates in a different way than we have so far. Instead of
bounding the error in function value (or distance to the optimum) after T iterations, we will ask how
many iterations are necessary for the error to drop below some ε. Let us restate the convergence
guarantee of Theorem 2.2 in this manner:

Theorem 2.4. On an L-smooth convex function f , gradient descent with step size η = L−1 converges
to an ε-FOSP in (f (x1 ) − f ⋆ )L/ε2 iterations.
Previous results on optimizing nonconvex functions with gradient descent had established that
the algorithm asymptotically avoids saddle points [13, 14]. However, these results made no finite-
time guarantee of convergence. Other work then established that gradient descent can escape saddles
in a length of time that is polynomial in the dimension d [15]. In [11], we showed that this dimension
dependence can be improved. We developed a variant of gradient descent we call perturbed gradient
descent (PGD) that converges with high probability to ε-SOSPs at a rate that has a polylogarithmic
dependence on d. PGD takes an injection of noise when the norm of the gradient of f is small,
and otherwise behaves like vanilla gradient descent. The injected noise is sampled uniformly from
a d-dimensional ball of radius r. (For details, see the description of Algorithm 2 in [11].) We prove
the following theorem guaranteeing convergence of PGD to ε-SOSPs.
Theorem 2.5 (PGD escapes strict saddle points). Consider an L-smooth and ρ-Hessian Lipschitz
function f : Rd → R. There exists an absolute constant cmax such that for any δ > 0, ε < L2 /ρ,
∆f ≥ f (x1 ) − f ⋆ , and c ≤ cmax , PGD with step size η = L−1 will output an ε-SOSP with probability
1 − δ, and will terminate in a number of iterations of order

( (f(x1) − f⋆) L / ε² ) · log⁴( dL∆f / (ε²δ) ).   (20)

(See Theorem 3 in [11] and the proof therein.)


This result implies that the presence of strict saddles does not slow gradient descent down
by much: the number of iterations needed to converge goes as 1/ε2 , which is precisely the same
dependence we find for gradient descent on smooth convex functions (see Theorem 2.4).
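The following Python sketch conveys the structure of PGD: a plain gradient step everywhere, plus an occasional perturbation drawn uniformly from a ball of radius r when the gradient is small. The threshold, radius, and waiting time below are placeholders for the carefully tuned quantities of [11], and the toy objective is an illustrative choice.

import numpy as np

def perturbed_gradient_descent(grad, x0, eta, T, g_thresh, radius, t_wait, rng):
    # Gradient descent that injects uniform-ball noise when the gradient norm is small.
    x, last_perturb = x0.astype(float).copy(), -t_wait
    for t in range(T):
        if np.linalg.norm(grad(x)) <= g_thresh and t - last_perturb >= t_wait:
            u = rng.normal(size=x.shape)
            u *= radius * rng.uniform() ** (1.0 / x.size) / np.linalg.norm(u)   # uniform in the ball
            x = x + u
            last_perturb = t
        x = x - eta * grad(x)
    return x

# f(x, y) = x^2 - y^2 has a strict saddle at the origin. Plain gradient descent started
# exactly on the x-axis never leaves it; PGD escapes once a perturbation makes y nonzero.
grad = lambda z: np.array([2.0 * z[0], -2.0 * z[1]])
z = perturbed_gradient_descent(grad, np.array([1.0, 0.0]), eta=0.1, T=100,
                               g_thresh=0.05, radius=0.01, t_wait=10,
                               rng=np.random.default_rng(0))
print(z)   # the second coordinate has grown away from the saddle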
We give a brief description of the proof technique. In the vicinity of a saddle point, there exists
a band (a ‘stuck region’) in which the flow of the vector field ∇f is mostly toward the saddle, i.e.,
where PGD can get stuck. The core technical part of the proof involved bounding the thickness of
this band. To do so, we studied the coupling time of two Brownian motions initialized a distance
k apart. When k is small enough, the diffusions eventually couple. Otherwise, they never couple.
Essentially, there is a phase transition in the coupling time that depends on how far apart the
diffusions are initialized. The value of this initial distance at the phase transition can be bounded,
and this bound gives us control over the thickness of the stuck region.
Achieving tighter control than was previously possible over the thickness of the stuck region is
precisely what enables us to improve on the polynomial dimension dependence of the convergence
rate from previous work [15]. We note that the requirement of ρ-Hessian Lipschitzness is absolutely
crucial to the analysis. Without this assumption, PGD cannot find the negative directions it must
follow to escape the vicinity of a saddle.
Two open questions follow. The first is, is the polylogarithmic dimension dependence of the
convergence rate simply an artifact of the proof technique? Since we have neither a lower bound on
the dimension dependence, nor an alternative proof technique, we do not know the answer to this
question. The second is, is there a way to bound the thickness of the stuck region that relies not on
ideas from probability theory but on ideas from differential geometry?

2.3 Variational, Hamiltonian, and Symplectic Perspectives on Acceleration
On a given class of functions, is there a limit to how fast an algorithm can reach the optimum?
This type of question is addressed by proving a lower bound [16] on the convergence rate. Lower
bounds are usually established by studying worst-case function instances. For the class of smooth
convex functions, the lower bound is a rate of 1/T², which is faster than the 1/T
rate achieved by gradient descent, discussed in Theorem 2.2. This optimal rate is achieved by the
family of second-order or so-called accelerated algorithms. A representative member of this family
is Nesterov’s method (Algorithm 3).

Algorithm 3 Accelerated Gradient Descent (Nesterov’s Method)


Require: η > 0

λk = (k − 1)/(k + 2)
yk+1 = xk − η∇f (xk )
xk+1 = yk+1 + λk (yk+1 − yk )
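As a quick sanity check of the speedup, the following Python sketch compares Algorithm 3 with plain gradient descent on a poorly conditioned quadratic; the problem, step size, and horizon are illustrative choices.

import numpy as np

d = 100
H = np.diag(np.linspace(1e-3, 1.0, d))          # ill-conditioned Hessian of f(x) = 0.5 x^T H x
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x
eta = 1.0                                        # = 1/L, since L = 1 here

x_gd = np.ones(d)
x_agd = np.ones(d)
y_prev = x_agd.copy()
for k in range(1, 501):
    x_gd = x_gd - eta * grad(x_gd)               # Algorithm 2
    lam = (k - 1) / (k + 2)                      # Algorithm 3
    y = x_agd - eta * grad(x_agd)
    x_agd = y + lam * (y - y_prev)
    y_prev = y

print(f(x_gd), f(x_agd))    # the accelerated iterate reaches a much smaller function value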

On a technical level, accelerated algorithms are somewhat mysterious. Proofs of convergence of
these methods are not easy to motivate or parse. They rely on the technique of estimate sequences,
first developed by Nesterov [10] and later generalized by Baes [17].
We will now discuss a series of results in which we take a continuous-time view of acceleration.
We will see that this viewpoint will allow us to interpret acceleration in discrete time as arising
from the discretization mechanism of a specific second-order differential equation. In general, we
believe that taking a continuous-time viewpoint in optimization can provide a unifying perspective
and enable results that we do not yet know how to prove in discrete time. We will see examples of
this later in the discussion.
By taking the limit η → 0, we can derive continuous-time limits for discrete-time algorithms.
When the latter are deterministic, the former are ordinary differential equations. It is simple to
show, for example, that in continuous time, gradient descent maps to the gradient flow equation,

ẋt = −∇f (xt ). (21)

We use dots to denote derivatives with respect to time. Nesterov’s method maps to the following
second-order differential equation, first written down in [18]:
ẍt + (3/t) ẋt + ∇f(xt) = 0.   (22)
Note that xt is shorthand for x(t), and is not to be confused with the discrete-time iterate xk . The
authors of [18] also showed that the Euler discretization of (22) recovers Algorithm 3. This result
provided some insight, and also gave rise to new questions and problems. We will briefly describe
one such question and its resolution before returning to the main theme of this section.
More than one discrete-time algorithm collapses to (22) in the continuous-time limit. In partic-
ular, the heavy ball method, a second-order algorithm due to Polyak [19], also maps to (22). The
trouble with this is that the heavy ball method does not achieve the optimal 1/T² rate of Nesterov’s
method, indicating that (22) can fail to distinguish between two algorithms that achieve different
rates in discrete time. This problem can be addressed by taking high-resolution continuous-time
limits. High-resolution ordinary differential equations appear in fluid dynamics and are useful to
describe systems with multiple timescales. See [20] for details on how to use high-resolution limits
to derive distinct continuous-time analogs of the heavy ball method and Nesterov’s method. We
will revisit high-resolution differential equations in another context at the end of the third lecture,
in Section 3.2.4.
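To see the effect of the momentum term in (22) numerically, one can integrate both (21) and (22) on a simple quadratic. The Python sketch below uses basic Euler-type discretizations, starts the second-order dynamics at t0 = 1 to avoid the singularity of 3/t at t = 0, and treats the step size, horizon, and curvature as illustrative choices.

import numpy as np

mu = 0.01                         # f(x) = 0.5 * mu * x^2 in one dimension
grad = lambda x: mu * x

dt, t0, t_end = 0.01, 1.0, 100.0
steps = int((t_end - t0) / dt)

x_flow = 1.0                      # gradient flow (21): xdot = -grad f(x)
x_acc, v_acc = 1.0, 0.0           # Nesterov ODE (22): xddot + (3/t) xdot + grad f(x) = 0
t = t0
for _ in range(steps):
    x_flow += dt * (-grad(x_flow))
    v_acc += dt * (-(3.0 / t) * v_acc - grad(x_acc))
    x_acc += dt * v_acc
    t += dt

# Under (21), f(x_t) decays like f(x_0) * exp(-2 * mu * t). Under (22), f(x_t) - f(x_star)
# is O(1/t^2) for any smooth convex f; for this weakly curved objective the second value is much smaller.
print(0.5 * mu * x_flow**2, 0.5 * mu * x_acc**2)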
Let us return to the primary point of discussion. We have mentioned that a continuous-time
limit of Nesterov’s method was studied in [18] and that taking the continuous-time point of view was
productive of insight into the behavior of the algorithm. Our work, which we will now describe, was
motivated by the question of whether there exists a generative mechanism for accelerated algorithms.
The work described in the following is joint work with Ashia Wilson, Andre Wibisono, and Michael
Betancourt [21, 22].

Let us equip ourselves with a definition. For any choice of a convex distance-generating function
h, we can define the Bregman divergence Dh between two points as follows:

Dh (y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩. (23)

Note that if h is quadratic, the Bregman divergence reduces to the squared Euclidean distance between x
and y (up to a constant factor).
In [21], we introduce the Bregman Lagrangian:

L(x, ẋ, t) = e^(γt + αt) ( Dh(x + e^(−αt) ẋ, x) − e^(βt) f(x) ),   (24)

where αt , γt , βt are scaling functions, and f is the continuously differentiable convex function we
are interested in optimizing. Note that we use the shorthand αt for α(t), and similarly for βt and
γt . By making specific choices of the scaling functions, we will see later that we can generate a
family of algorithms. The first term in (24) is a generalized kinetic energy term (note that if we
take h(x) = x2 , this term reduces to ẋ2 /2, the familiar kinetic energy from physics). The second
term can be interpreted as a potential energy.
Given a Lagrangian, we can set up the following variational problem over paths:
min_x ∫ L(x, ẋ, t) dt.   (25)

The solutions to this optimization problem are precisely the solutions to the Euler-Lagrange equation

(d/dt) ∂L/∂ẋ (x, ẋ, t) = ∂L/∂x (x, ẋ, t).   (26)

Under what we call the ideal scaling β̇t ≤ eαt , γ̇t = eαt , the Euler-Lagrange equation of (24) is
ẍt + (e^(αt) − α̇t) ẋt + e^(2αt + βt) [∇²h(xt + e^(−αt) ẋt)]⁻¹ ∇f(xt) = 0.   (27)

We establish the following general convergence result for this dynamics.


Theorem 2.6. The dynamics (27) has the convergence rate

f (xt) − f (x⋆) ≤ O(e^{−βt}).    (28)

The proof of this result is relatively simple. It involves writing down a Lyapunov function E for
the dynamics and showing that its time derivative is nonpositive.
Proof. Consider the Lyapunov function

Et = Dh(x⋆, xt + e^{−αt} ẋt) + e^{βt} (f (xt) − f (x⋆)),

whose time derivative along the dynamics (27) is

Ėt = −e^{αt + βt} Df (x⋆, xt) + e^{βt} (β̇t − e^{αt}) (f (xt) − f (x⋆)).

f is convex, and so Df (x⋆, xt) ≥ 0. The first term in Ėt is therefore nonpositive. Under the ideal
scaling, β̇t − e^{αt} ≤ 0, and so the second term in Ėt is also nonpositive. Thus Ėt ≤ 0. Since Dh ≥ 0,
it follows that e^{βt}(f (xt) − f (x⋆)) ≤ Et ≤ E0, which gives (28).
With suitable choices of the scaling functions, we can now generate a whole family of continuous-
time optimization algorithms from (27) with an accompanying convergence rate guarantee (in con-
tinuous time). For example, in the Euclidean case, where h is quadratic in x, if we make the choices
αt = log p − log t, βt = p log t + log C, and γt = p log t where C > 0, then for p = 2 the dynamics (27) reduces to (22), the
Euler discretization of which is Nesterov’s method. We can also recover mirror descent, the cubic-
regularized Newton method, and other known discrete-time algorithms from (27), which serves as a
generator of accelerated algorithms. The interested reader may refer to [21] for details.

Let us now address the problem of mapping back into discrete time.
We begin by noting that the Bregman Lagrangian (24) is a covariant operator. The dynamics
generated by covariant operators has the property that given any two endpoints, the path between
them is fixed. However, we are free to choose βt , which controls the rate at which this fixed path is
traversed. In continuous time, then, we can obtain arbitrarily fast convergence by manipulating βt ,
and the algorithmic notion of acceleration that we began by wishing to understand disappears. It
returns, however, when one considers how to discretize (27).
It turns out that it is not possible to obtain arbitrarily fast stable discretizations of (27). To
understand this, we note that in classical mechanics, the dynamics of a system often has conserved
quantities associated with it, such as energy or momentum. Not all discretization schemes inherit
or preserve these conservation laws. Symplectic integrators are a class of discretization schemes that
do preserve what is known as phase space volume, and therefore conserve energy, etc. The method
of symplectic integration traces its roots back to Jacobi, Hamilton, and Poincaré, and grew out of
the need to discretize the dynamics of physical systems while preserving the relevant conservation
laws. Importantly for us, symplectic integrators are provably stable. Their stability enables them
to take larger step sizes than other kinds of integrators. This is the origin of acceleration in discrete
time.
Symplectic integrators are traditionally defined for time-independent Hamiltonians. (We have
thus far been talking about continuous-time algorithms in a Lagrangian framework, but a simple
Legendre transform allows us to equivalently frame the discussion in a Hamiltonian framework.)
However, the Bregman Lagrangian (24) is time-dependent. Indeed, as we saw, its associated Lya-
punov function, which can be interpreted as an energy, decreases with time. Intuitively, this is
necessary from an optimization standpoint in order to converge to a point. The question that
arises is, how can we use symplectic integrators for dissipative systems? In joint work [23] with
Guilherme França and René Vidal, we solved this problem by lifting the relevant Hamiltonian to
higher dimensions so that it acquires time independence. We discretize the time-independent higher-
dimensional Hamiltonian, and gauge-fix to obtain rate-preserving algorithms corresponding to the
original Hamiltonian. We call the integrators so-obtained presymplectic integrators. When we use
the term ‘rate-preserving’ to describe an integrator, we mean that it preserves the rate of conver-
gence of the continuous-time dynamics up to an error that can be controlled. One of the central
results in [23] is that presymplectic integrators are rate-preserving. The technical tool used to prove
this type of result is called backward error analysis [24, 25].
In conclusion, we now have a systematic pipeline for generating accelerated algorithms with
convergence guarantees. The practitioner may (1) ask for a certain rate of convergence for a given
f , (2) write down the corresponding Bregman Hamiltonian, (3) apply a symplectic integrator, and
(4) code up the resulting algorithm.
We mentioned at the beginning of this section that taking a continuous-time perspective can
enable proofs of results we do not yet know how to prove in discrete time. We give an example of
such a result now. Let us briefly return to the problem of escaping saddles in nonconvex optimization.
We discussed previously that perturbed gradient descent (PGD) can escape saddles and converge to
ε-SOSPs in a number of iterations that goes as 1/ε². This was the content of Theorem 2.5. In [26],
we showed that a perturbed version of accelerated gradient descent (PAGD) can do the same in a
number of iterations that goes as 1/ε^{7/4}, i.e., it converges faster than PGD. In order to prove this
result, we worked in continuous time, studying the Hamiltonian that generates PAGD. The result
implies that acceleration can be helpful even in nonconvex optimization, and not just in the convex
case, where all the prior discussion of this section has been set.

An open problem. The continuous-time framework we have described here only applies to deter-
ministic algorithms. An open challenge is to construct a variational framework for diffusions, which
may enable similar illuminating and unifying perspectives on accelerated stochastic algorithms.

We end this section by briefly reviewing a few results for stochastic algorithms in continuous
time.

1. Consider an overdamped Langevin diffusion



dxt = −∇U (xt) dt + √2 dBt,    (29)

where U : Rd → R is the potential and Bt is standard Brownian motion. The stationary
distribution p⋆ of this diffusion is of course the Boltzmann distribution p⋆ ∝ exp(−U ). A discretization of
(29) is the following Markov Chain Monte Carlo (MCMC) algorithm

x̃_{(k+1)δ} = x̃_{kδ} − δ ∇U (x̃_{kδ}) + √(2δ) ξk,    (30)

where k is the iteration number, δ is the time increment, and ξ is a d-dimensional standard
normal random variable. A natural question to ask is, how close are we to p⋆ after n iterations
of the MCMC algorithm? Assuming U (x) is L-smooth and µ-strongly convex, guarantees
of the following type have been proven: if n ≥ O(d/ε²), then d(p^{(n)}, p⋆) ≤ ε, where d is
the relevant distance measure and p^{(n)} is the probability density generated by the MCMC
algorithm at the nth iteration. Results exist in the cases where d is the total variation distance
[27], the 2-Wasserstein distance [28], and the Kullback-Leibler divergence [29]. (A minimal
simulation sketch of the update (30) is given after this list.)
2. Now consider an underdamped Langevin diffusion

dxt = vt dt,
dvt = −γ vt dt − λ ∇U (xt) dt + √(2γλ) dBt,    (31)

where γ is the coefficient of friction. The stationary distribution p⋆ is proportional to
exp(−U (x) − ||v||²/(2λ)). In [30], we are able to make a similar guarantee on the MCMC
algorithm corresponding to (31) as for (30). That is, assuming U (x) is L-smooth and µ-
strongly convex, if n ≥ O(√d/ε), then d(p^{(n)}, p⋆) ≤ ε, where d is the 2-Wasserstein distance.
This is a faster rate than in the overdamped case, where we needed the square of the number
of iterations required here to make the same guarantee on the distance between p(n) and p⋆ .
The intuition behind this faster rate lies in the fact that underdamped diffusions are smoother
than overdamped ones, and therefore easier to discretize.
The proof technique in [30] involves another coupling time argument. We saw one of these in
the discussion on how PGD escapes saddle points. In this case we study the coupling time of
two Brownian motions that are reflections of each other.

3. An underdamped Langevin diffusion corresponds to Nesterov acceleration in the simplex of


probability measures [31].
4. Higher-order Langevin diffusion yields an accelerated MCMC algorithm [32].
5. Smoothness isn’t necessary for Langevin diffusion [33].

6. Langevin-based algorithms can achieve logarithmic regret on multi-armed bandits [34].
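As promised in item 1, here is a minimal simulation sketch of the update (30). The Gaussian potential, step size, and iteration count are illustrative assumptions, not values taken from the references; the point is only that the discretized diffusion produces samples whose empirical moments approach those of p⋆.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative strongly convex potential U(x) = 0.5 ||x||^2, so that the
# target p* is a standard Gaussian (an assumption for this sketch).
grad_U = lambda x: x

d, delta, n_iters = 2, 0.01, 50_000
x = np.zeros(d)
samples = np.empty((n_iters, d))

for k in range(n_iters):
    xi = rng.standard_normal(d)
    # Unadjusted Langevin update (30): Euler-Maruyama discretization of (29).
    x = x - delta * grad_U(x) + np.sqrt(2.0 * delta) * xi
    samples[k] = x

# Empirical moments should be close to those of the standard Gaussian target
# (up to a discretization bias controlled by delta).
print("mean:", samples[10_000:].mean(axis=0))
print("cov :", np.cov(samples[10_000:].T))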

With the results just stated as context, we can make precise the type of
open problem mentioned earlier. Suppose we wish to diffuse (under some stochastic differential
equation) to a target distribution p⋆ . What equation gets us to p⋆ the fastest, and what is the
correct mathematical formulation for this problem?

2.4 Variational Inequalities: From Minima to Nash Equilibria and Fixed
Points
In this last section of the lecture, we will introduce the setting for the third lecture. From now on, we
will be concerned with how to find (Nash) equilibria of certain types of functions f (we will make this
precise). The discussion will break with all the technical material we have seen so far in an important
way: Nash equilibria are saddle points of f . Therefore, instead of studying algorithms that avoid
saddle points, we will talk about algorithms that converge to saddle points. In the next lecture, we
will study two such algorithms, the proximal point method, and the extragradient method.
In the following, we introduce (1) zero-sum games, (2) Nash equilibria, and (3) variational in-
equalities. We then give some context for why we introduce these ideas, and for why we are interested
in computing Nash equilibria.

2.4.1 Two-Player Zero-Sum Games


Consider a function f : Rd1 × Rd2 → R, involving two “players” x1 and x2 . One player acts to
minimize f , and the other acts to maximize it. This leads to the following optimization problem,
also called a two-player zero-sum game:

min_{x1 ∈ R^{d1}} max_{x2 ∈ R^{d2}} f (x1, x2).    (32)

The game is called “zero-sum” because the cost functions of players x1 and x2 , respectively f and
−f , add to zero. A simple instance of (32) is the zero-sum bilinear game f (x1, x2) = x1⊤ A x2, where
A is a positive semi-definite symmetric matrix. (32) is a member of the set of convex-concave games,
called so since f is convex in x1 and concave in x2 . Convex-concave games are in turn a subset of
the family of minimax games, which are a subset of the family of monotone games. This last piece
of terminology and its significance will be explained in Section 3.1.
The solutions of (32) are saddle points (and not minima) of f ; they are the set of points (x⋆1 , x⋆2 )
such that
f (x⋆1 , x2 ) ≤ f (x⋆1 , x⋆2 ) ≤ f (x1 , x⋆2 ) . (33)
In the parlance of game theory, some (but not, in general, all) of the solutions of the optimization
problem that defines the dynamics of a game are called equilibria.
We have already encountered one example of a (Stackelberg) game in Section 1.4.1. Stackelberg
games are sequential (players take turns to play). (32) is not a sequential game. Its equilibria
are called Nash equilibria. We will define a Nash equilibrium for an N -player game formally later.
Intuitively, a Nash equilibrium is a saddle point of f that neither player is incentivized to move
away from. When we introduce N -player games, this intuition will carry over: Nash equilibria will
correspond to points where each player is as happy as possible relative to what all the other players
are doing.
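As a tiny numerical illustration (the particular convex-concave function below is an assumption of this sketch, not an example from the text), one can check the saddle condition (33) directly at a candidate equilibrium.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative convex-concave game with scalar players:
# f(x1, x2) = x1^2 - x2^2 + x1 * x2, whose Nash equilibrium is at the origin.
f = lambda x1, x2: x1**2 - x2**2 + x1 * x2

x1_star, x2_star = 0.0, 0.0

# Saddle condition (33): f(x1*, x2) <= f(x1*, x2*) <= f(x1, x2*).
for _ in range(5):
    x1, x2 = rng.standard_normal(), rng.standard_normal()
    assert f(x1_star, x2) <= f(x1_star, x2_star) <= f(x1, x2_star)
print("saddle condition (33) holds at (0, 0) for the sampled deviations")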

2.4.2 Variational Inequalities


Two-player zero-sum games (and indeed, the larger class of monotone games) can be written as
instances of a general class of optimization problems called variational inequalities (VIs).

Definition 2.5 (Variational Inequality). Given a subset X of Rd and an operator (or vector field)
F : X → Rd , a variational inequality is an optimization problem of the following form. Find x⋆ such
that
⟨F (x⋆ ), x − x⋆ ⟩ ≥ 0 ∀x ∈ X . (34)
Note that x⋆ need not always be unique, and that X may or may not be convex. (There are
certain requirements on the topology of X but we will omit discussion of these here.) Variational

inequalities unify the formulations of unconstrained and constrained optimization problems, as well
as optimization problems involving equilibria, and many others. If we replace the inequality with
equality in (34), and F is a gradient field, then we recover classical optimization; furthermore, if X
is all of Rd the problem is unconstrained, while if X is a proper subset of Rd it is constrained.
In the language of variational inequalities, the two-player zero-sum game is represented by the
vector field

F (x) = ( ∇x1 f (x1, x2), −∇x2 f (x1, x2) ),    (35)
with X = X1 ⊗ X2 . Note that F is not the gradient of an underlying function.
Representing a game as a VI allows us to express the problem of computing the equilibria of the
game as the problem of computing the fixed points of the relevant operator F . Although on the
face of it this mathematical mapping may not appear to confer an advantage, in fact there exists
a vast machinery to compute the fixed points of nonlinear operators that we can take advantage of
through it. We will come back to this point in the next section.

2.4.3 Nash Equilibria


Finally, let us define the notion of a Nash equilibrium. Consider a game of N players xi. Let each
xi ∈ R^{di}. Stacking all the players into a vector gives the strategy vector x, which lives in R^{Σ_{i=1}^N di}.
Each player has an associated cost function gi(xi, x−i), where we use x−i to denote the (Σ_{j≠i} dj)-dimensional
vector representing all players except xi. The solution for player i lives in the solution
set Si(x−i), which is the set of solutions to the optimization problem min_{xi} gi(xi, x−i) such that
xi ∈ Ki ⊆ R^{di}, where Ki is a constraint set.
xi ∈ Ki ⊆ Rdi where Ki is a constraint set.
PN 
Definition 2.6 (Nash Equilibrium). A Nash equilibrium is a vector x⋆ ∈ R i=1 di s.t. x⋆i ∈ Si x⋆−i
∀i.
The problem of finding a Nash equilibrium can be written as a variational inequality with
 
X = K1 ⊗ · · · ⊗ KN ,    F (x) = ( ∇x1 g1(x), . . . , ∇xN gN (x) ),    (36)
if and only if each Ki is a closed convex subset of Rdi and each gi is a convex function of xi .
Exercise to the reader: check that if N = 2, g1 = f , and g2 = −f , then (36) reduces to a
two-player zero-sum game.
Nash equilibria are not the only equilibria that can be represented as solutions of variational
inequalities. Some other interesting examples, which we will not address here, are Markov perfect
equilibria (from probability theory), Walrasian equilibria (from economics), and the equilibria of
frictional contact problems (from mechanics).
Armed with the definitions of games and Nash equilibria, we briefly mention a gradient-based
method developed in collaboration with Eric Mazumdar and S. Shankar Sastry for finding local
Nash equilibria in two-player zero-sum games [35]. This example also pulls together some of the ideas
we saw in the last section, namely of working in continuous time and paying attention to symplectic
geometry. Consider the problem (32). We have mentioned that the solutions to this optimization
problem are saddle points of f . However, not all saddle points of f are Nash equilibria. The latter
are specifically saddle points that are axis-aligned ; we refer the reader to the paper [35] for further
details on what this means. Prior to our work, it was known that gradient-based algorithms for
finding Nash equilibria could (in the worst case) diverge, or get stuck in limit cycles (i.e., simply
follow the flow of the vector field). We showed that in addition, these algorithms could also converge
to non-Nash fixed points. The distinction between non-Nash and Nash fixed points has to do with
the symplectic component of the vector field. When the symplectic component of the vector field

near a fixed point is removed, only Nash fixed points persist. We exploited this fact and developed an
algorithm we called “local symplectic surgery” that could effect this removal, and therefore provably
converge only to Nash equilibria.

2.4.4 Computing Nash Equilibria


Why are we interested in computing Nash equilibria? And why do we bring up the framework of
variational inequalities in this context? We explain.
Nash equilibria are hard to compute in general [36]. This fact has profound implications for the
mathematical study of markets. Much effort in microeconomics has been made to understand the
properties of Nash equilibria under the assumption that the game/system/market under consider-
ation is already in such an equilibrium. For this assumption to hold, the players must be able to
arrive at/compute the equilibrium. However, the negative computability result of [36] implies that
this is generally not possible. The questions then are, what are the classes of games in which it is
tractable to compute Nash equilibria, and what are the algorithms that are needed for this purpose?
It turns out that the class of two-player zero-sum games is tractable in this specific sense. We have
already seen that as long as the cost function is convex, two-player zero-sum games can be written
as variational inequalities, and in particular, variational inequalities involving monotone operators.
We can therefore apply and build on mathematical machinery originally developed to compute the
fixed points of monotone operators to compute the Nash equilibria of two-player zero-sum games.
We will define monotonicity in the next lecture. For now, it suffices to note that monotone
operators are a class of operators for which it is guaranteed that a fixed point exists, or equivalently,
it is a class of operators for which it is possible to guarantee that the corresponding variational
inequalities actually have solutions.
Another reason why the language of variational inequalities is useful to us is that it naturally
centralizes the role of dynamics in understanding the behavior of interacting agents. As we have seen
in [2], and in our discussion of nonconvex optimization, the dynamics influences which fixed point
we arrive at, and so it is important to understand the relationship between a fixed-point-finding
algorithm and the properties of the fixed points it converges to. This is helpful both to understand
the behaviors of existing markets and to effectively design new markets with certain properties (and
not others) in the future.
The literature on fixed-point-finding algorithms for monotone operators is vast. To simplify the
exposition in what follows, and to make good use of the previous discussion, we will take a relatively
unusual approach. In the next lecture, we will first establish that gradient descent (which does
well at finding the minima of convex functions) generally diverges on monotone problems. We will
identify the reason why, and then ask how we can “fix” it. This will motivate the discussion of the
proximal point method and the extragradient algorithm.

3 Computing Equilibria
Consider the two kinds of vector fields pictured in Fig. 2. In Fig. 2(a), the flow is towards a point;
our discussion of convex optimization in Lectures 1 and 2 was in this context. In Fig. 2(b) the flow
moves around a point. Both the flows pictured in Fig. 2 are generated by monotone operators. Given
a monotone game, it will not a priori be apparent which of the two types of flows we are dealing
with. To reliably compute equilibria, therefore, we will need algorithms that handle both cases. We
will spend the rest of these notes developing and studying such algorithms.

3.1 Monotone Operators


Figure 2: The notion of a fixed point is broader than that of a minimum. Here we have examples of two
types of vector fields F (x). In (a), the negative flow (indicated by arrow heads) of the field is toward the
fixed point; this fixed point could be a minimum. In (b), the negative flow of the vector field is around the
fixed point; this fixed point is not a minimum. In this lecture, we study algorithms that compute the fixed
points of vector fields associated with monotone operators, which can be either of the flavor (a) or (b).

To begin the discussion, we introduce the notion of monotonicity, which generalizes the notion
of convexity, and a strengthening of this notion called strong monotonicity.
Definition 3.1 (Monotone Operator). An operator F is said to be monotone on a set X if
⟨F (x) − F (y), x − y⟩ ≥ 0 ∀x, y ∈ X . (37)
The property (37) is known as monotonicity.
We will assume that the constraint set X is convex. When this is the case, we are guaranteed
that a fixed point of F exists.
We remind the reader that our notation for a fixed point of F is x⋆ . Taking x = x⋆ and y = x
and setting F (x⋆ ) = 0 in (37), we see that monotonicity of F implies that the angle θ between the
vectors −F (x) and x⋆ − x for an arbitrary x can be no greater than π/2. This is illustrated in Fig. 3.
Definition 3.2 (Strong Monotonicity). An operator F is said to be strongly monotone on a set X
if
⟨F (x) − F (y), x − y⟩ ≥ µ||x − y||2 ∀x, y ∈ X , (38)
for some µ > 0.
Strong monotonicity is a strengthening of monotonicity. It guarantees that the angle between
−F (x) and x⋆ − x is strictly less than π/2. Furthermore, given any arbitrary point x, strong
monotonicity of F ensures that the fixed point lies on or above the quadratic function µ||x − y||2
centered at x. In contrast, monotonicity merely implies that x⋆ is on the same side of the hyperplane
defined by −F at x. This is illustrated in Fig. 4.
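The two flavors of field in Fig. 2 can be probed directly against Definitions 3.1 and 3.2. The sketch below (the two linear operators are illustrative assumptions) estimates the smallest value of ⟨F(x) − F(y), x − y⟩/||x − y||² over random pairs: for a pure rotation field this is zero (monotone but not strongly monotone), while adding a pull toward the origin makes it strictly positive (strongly monotone).

import numpy as np

rng = np.random.default_rng(2)

# F_rot generates the "circulating" flow of Fig. 2(b); F_sm adds a pull
# toward the fixed point at the origin (both are illustrative assumptions).
A_rot = np.array([[0.0, 1.0], [-1.0, 0.0]])      # skew-symmetric: rotation
A_sm = A_rot + 0.5 * np.eye(2)                   # symmetric part is 0.5 * I

def min_monotonicity_ratio(A, trials=10_000):
    """Smallest observed <F(x) - F(y), x - y> / ||x - y||^2 for F(z) = A z."""
    vals = []
    for _ in range(trials):
        x, y = rng.standard_normal(2), rng.standard_normal(2)
        d = x - y
        vals.append((A @ d) @ d / (d @ d))
    return min(vals)

print("rotation field   :", min_monotonicity_ratio(A_rot))  # ~0: monotone, not strongly
print("rotation + 0.5 I :", min_monotonicity_ratio(A_sm))   # ~0.5: strongly monotone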


Figure 3: Monotonicity implies that the angle θ between the operator −F (x) and the vector pointing from
x to x⋆ is at most π/2 for all x. To see this in (37), take x = x⋆ , y = x and without loss of generality let
F (x⋆ ) = 0.


Figure 4: While monotonicity of F guarantees that x⋆ is on the same side of the hyperplane defined by
−F at x as the quadratic µ||x − y||2 , strong monotonicity additionally guarantees that x⋆ is on or above
the quadratic. We use the label x⋆M (x⋆SM ) to indicate the fixed point of F when F is monotone (strongly
monotone). In general, x⋆M may be either aligned with the axis y or anywhere above it, while x⋆SM is confined
to be on or above µ||x − y||2 .

3.2 Fixed-Point Finding Algorithms


3.2.1 A Naive Algorithm
We will now see that a (naive) gradient-descent-like algorithm diverges on monotone problems. The
cause of the divergent behavior is that monotonicity allows the angle θ between −F (x) and x⋆ − x to
be as large as π/2 (see Fig. 3). For problems of the type in Fig. 2(b), this implies that the algorithm
may simply follow the flow of F around x⋆ and not make progress towards it, or it may even diverge.
As the reader may have guessed, we will be able to fix this problem if we assume that F is strongly
monotone, ensuring that θ is strictly bounded away from π/2.
Using the notation projX (x) for the projection of the vector x onto the convex constraint set X ,
let us study the behavior of the iteration
xk+1 = projX (xk − ηF (xk )). (39)
Does (39) converge to x⋆ ?
As will be familiar by now, we begin by looking at the distance ||xk+1 − x⋆ ||:
||xk+1 − x⋆||² = ||projX (xk − ηF (xk)) − projX (x⋆ − ηF (x⋆))||².    (40)
Projection onto a convex set is a contractive operation (in fact, it is firmly non-expansive, a property
we define formally later in this lecture). This implies that the difference of two projections is at most
the difference of the arguments (see Fig. 5 for a pictorial proof).

Figure 5: Projection onto a convex set X is a contractive operation, which means that the difference of
projections is at most the difference of the arguments.

We can use this fact to drop the projection operator in (40) at the expense of an inequality. Doing so, and expanding
the resulting squared norm, we have
||xk+1 − x⋆||² ≤ ||xk − ηF (xk) − x⋆ + ηF (x⋆)||²
             = ||xk − x⋆||² − 2η⟨F (xk) − F (x⋆), xk − x⋆⟩ + η²||F (xk) − F (x⋆)||².    (41)
Applying the monotonicity property (37) to the second term in (41), we arrive at
||xk+1 − x⋆||² ≤ ||xk − x⋆||² + η²||F (xk) − F (x⋆)||².    (42)
Since both the terms on the right-hand side of (42) are positive, we cannot conclude that ||xk+1 −x⋆ ||
is shrinking with iteration number. Thus, the naive gradient-descent-like iteration (39) may diverge
on generally monotone problems.
As we have already mentioned, we can fix this by assuming that F is not just monotone but
strongly monotone. This allows us to retain the second term in (41). It is important to retain this
term, since it is the only one with a negative sign in (41), and is therefore needed to counteract
the expansive effect of the third term. We will also assume that F is L-Lipschitz. With these two
assumptions, we can easily show per-iteration contraction of the distance between the iterate and
the fixed point. Starting once again from (41), we have
||xk+1 − x⋆||² ≤ ||xk − x⋆||² − 2η⟨F (xk) − F (x⋆), xk − x⋆⟩ + η²||F (xk) − F (x⋆)||²
             ≤ ||xk − x⋆||² − 2ηµ||xk − x⋆||² + η²L²||xk − x⋆||²
             = (1 − 2ηµ + η²L²) ||xk − x⋆||²
             = (1 − µ²/L²) ||xk − x⋆||²    (taking η = µ/L²).    (43)
With (43), we have proved the following theorem.
Theorem 3.1. For a µ-strongly monotone and L-Lipschitz operator F and a convex constraint set
X , the iterates of the gradient-descent-like algorithm (39) with step size µ/L² converge to the fixed
point x⋆ of F with the rate 1 − (µ/L)².
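Both behaviors described above are easy to reproduce numerically. In the sketch below (the operators and step sizes are illustrative assumptions, and the constraint set is taken to be all of R², so that the projection is the identity), the iteration (39) spirals away from the fixed point of the merely monotone rotation field, but contracts when the field is made strongly monotone and the step size of Theorem 3.1 is used.

import numpy as np

# Unconstrained case: X = R^2, so proj_X is the identity (an assumption made
# to keep the sketch short).
A_rot = np.array([[0.0, 1.0], [-1.0, 0.0]])        # monotone, not strongly
mu = 0.5
A_sm = A_rot + mu * np.eye(2)                      # mu-strongly monotone
L = np.linalg.norm(A_sm, 2)                        # Lipschitz constant of the linear field

def run(A, eta, n_iters=1000):
    x = np.array([1.0, 0.0])
    for _ in range(n_iters):
        x = x - eta * (A @ x)                      # iteration (39), projection omitted
    return np.linalg.norm(x)                       # distance to the fixed point 0

print("merely monotone F, eta = 0.1      :", run(A_rot, 0.1))         # grows: divergence
print("strongly monotone F, eta = mu/L^2 :", run(A_sm, mu / L**2))    # shrinks, consistent with Theorem 3.1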
The rate of convergence in Theorem 3.1 is slower than the (roughly) 1 − µ/L rate achieved by
gradient descent on strongly convex functions in Theorem 2.3. In fact, it is possible to extract this
faster rate from (39) by making one further assumption on F , that of co-coercivity. We give its
definition next.
Definition 3.3 (Co-coercivity). An operator F is said to be co-coercive on a set X if there exists
an α > 0 such that
⟨F (x) − F (y), x − y⟩ ≥ (1/α) ||F (x) − F (y)||²    ∀x, y ∈ X .    (44)

In fact, a µ-strongly monotone and L-Lipschitz operator is co-coercive with α = L²/µ, but this is in
general a suboptimal value of α. It is possible to achieve a faster rate by identifying the optimal value of α.
In particular, if F is µ-strongly monotone and α-co-coercive, it is possible to achieve a convergence
rate of 1 − µ/α. Lastly, we note that if F is the gradient field ∇f of a strongly convex and L-
smooth function f , then it is L-co-coercive. This fact and the convergence result stated just before
it together recover the fast convergence rate of gradient descent from Theorem 2.3 in the variational
inequality framework.

3.2.2 The Proximal Point Method


Let us take a second look at (42), which implies that if F is monotone, then applying an iteration
of (39) can in general only increase the distance to the optimum. In the special case where F is the
gradient field of a convex and L-smooth function, however, (39) reduces to gradient descent, which
does converge. In other words, (39) diverges for general monotone vector fields (such as the example
in Fig. 2(b)), but not in the special case of a gradient field (Fig. 2(a)). We will now briefly introduce
a modification to (39) that we will show causes it to diverge on all monotone problems. We will
then fix this algorithm to obtain another algorithm—the proximal point method—that converges
for all monotone problems.
Let us consider the update

xk+1 = xk + ηF (xk ) = (I + ηF ) xk , (45)

where η > 0 is the step size, I is the identity operator and F is a monotone operator. A simple
calculation shows that the operator (I + ηF ) is expansive:
||(I + ηF )(x) − (I + ηF )(y)||² = ||x − y||² + 2η⟨F (x) − F (y), x − y⟩ + η²||F (x) − F (y)||²
                                ≥ ||x − y||² + η²||F (x) − F (y)||².    (46)

Applying (I + ηF ) to two points x and y thus strictly increases the squared distance between them.
This implies that (45) always diverges. This can also be seen simply by inspection in Fig. 2: following
the flow F (as opposed to −F , which is what gradient descent does) is a bad strategy both when F
is a gradient field and when it isn’t.
We can use the expansive property of (I + ηF ) to our advantage. In particular, notice that if we
run (45) backward in time,
xk = (I + ηF )^{−1} xk+1,    (47)
then we might expect to obtain a procedure that shrinks the distance between any two points
instead of increasing it. The proximal point method, which we now introduce, does precisely this.
The proximal point update is given by rearranging (47) as follows:

xk+1 = xk − ηF (xk+1 ). (48)

In the form that we present it here, it was first studied by Rockafellar in 1976 [37].

Algorithm 4 Proximal Point Method


Require: η > 0

xk+1 = xk − ηF (xk+1 )

Notice that the proximal point update given in (48) is an implicit equation, i.e., it contains xk+1
on both sides, and must be solved at each iteration. Since there are multiple ways to solve the
implicit equation, the proximal point method as stated defines a family of algorithms.
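When F is linear, say F(z) = Az, the implicit equation (48) can be solved exactly: xk+1 = (I + ηA)^{−1} xk. The sketch below (the matrix, step size, and starting point are illustrative assumptions) applies this exact proximal point update to the monotone rotation field on which the explicit iteration (39) diverges; the iterates contract toward the fixed point, as the descent lemma proved below guarantees.

import numpy as np

A_rot = np.array([[0.0, 1.0], [-1.0, 0.0]])    # monotone rotation field F(z) = A z
eta = 0.5

# For linear F the resolvent (I + eta * F)^{-1} is just a matrix inverse, so the
# implicit update (48) can be applied exactly.
resolvent = np.linalg.inv(np.eye(2) + eta * A_rot)

x = np.array([1.0, 0.0])
for k in range(50):
    x = resolvent @ x                          # proximal point step x_{k+1} = B(x_k)
print("distance to fixed point after 50 steps:", np.linalg.norm(x))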

Figure 6: Following the flow of −F at the point xk leads us away from x⋆, to a point x^{GD}_{k+1} (the superscript
GD stands for gradient descent). However, taking a step in the direction of the flow −F at x^{GD}_{k+1} from xk
brings us to x^{EG}_{k+1}, which is closer to x⋆ than xk. This is the general logic behind the proximal point method
(48), and is precisely what the extragradient algorithm (Algorithm 5) does. Hence the superscript in x^{EG}_{k+1}.

Before we enter the discussion of ways to solve for xk+1 , let us study the effect on ||xk+1 −x⋆ ||2 of
applying a single step of the proximal point method to confirm the intuition that led us to introduce
it:
||xk+1 − x⋆||² = ||xk − ηF (xk+1) − x⋆||²
             = ||xk − x⋆||² − 2η⟨F (xk+1), xk − x⋆⟩ + η²||F (xk+1)||²
         (a) = ||xk − x⋆||² − 2η⟨F (xk+1), xk+1 − x⋆⟩ − 2η⟨F (xk+1), xk − xk+1⟩ + η²||F (xk+1)||²
             ≤ ||xk − x⋆||² − 2η²⟨F (xk+1), F (xk+1)⟩ + η²||F (xk+1)||²
             = ||xk − x⋆||² − η²||F (xk+1)||².    (49)

To arrive at (a), we added and subtracted ⟨F (xk+1 ), xk+1 ⟩ from the line before it. The second term
in (a) is nonnegative by monotonicity of F ; neglecting it resulted in the inequality in the next line.
We used (48) to replace xk − xk+1 with ηF (xk+1 ) in the third term of (a).
We have proved the following descent lemma:
Lemma 3.2 (Descent lemma for the proximal point method on monotone operators). Consider a
monotone operator F . Under the dynamics of Algorithm 4 with step size η,
||xk+1 − x⋆||² ≤ ||xk − x⋆||² − η²||F (xk+1)||².    (50)

Lemma 3.2 guarantees that the distance to the fixed point shrinks on every iteration of the
proximal point method, and therefore implies convergence.
In the interests of completeness, we will now give some more vocabulary and state an important
theorem concerning monotone operators.
Let us give the inverse operator (I + ηF )^{−1} a name. We call it the backward operator B. B
is also known as the resolvent of F . We can also define the operator 2B − I. See Fig. 7 for some
intuition for the distinction between the results of applying B versus 2B − I to xk .
Now we can state a theorem that connects the properties of F , B and 2B − I.
Theorem 3.3. The following are equivalent.
• F is a monotone operator.
• There exists a firmly non-expansive operator B such that B is the resolvent of F .

xk

xk+1 = B(xk )

(2B − I)xk

Figure 7: Applying the backward operator to xk brings us to the point xk+1 . This is the proximal point
update. Applying the operator 2B − I to xk results in a reflection about xk+1 .

• The operator 2B − I is non-expansive.


This theorem and its proof can be found in [8] (see Prop 4.4 and Cor. 23.9 therein).
The utility of Theorem 3.3 for our purposes is that given a monotone operator F , we are guar-
anteed that the resolvent exists, and so we can always construct a proximal point method for F .
Since the resolvent is also firmly non-expansive, we are further guaranteed that the proximal point
method will converge. To see why, and for completeness, we end this section with the definition of
firm non-expansivity.
Definition 3.4 (Firm Non-expansivity). An operator B is firmly non-expansive on a set X if
||B(x) − B(y)||² + ||(I − B)(x) − (I − B)(y)||² ≤ ||x − y||²    ∀x, y ∈ X .    (51)

In words, firm non-expansivity guarantees that applying the operator to two points does not increase
the distance between them; the amount by which the squared distance shrinks is at least the second
term on the left-hand side of (51).

3.2.3 The Extragradient Algorithm


Let us now address the implicit nature of the proximal point update (48) by giving a practical algo-
rithm that approximates it—the extrapolated gradient algorithm [38], or, as it is often abbreviated,
the extragradient algorithm. The extragradient algorithm is a two-step method in which we compute
an intermediate iterate x̃k by applying a gradient-descent-like update to xk , and then apply another
such update to arrive at xk+1 . In the second update step, F is evaluated at x̃k . This is illustrated
in Fig. 6.

Algorithm 5 Extragradient Method


Require: η > 0

x̃k = xk − ηF (xk )
xk+1 = xk − ηF (x̃k )

The extragradient update can explicitly be shown to be an approximation to the proximal point
update [39], but we will not do that here. Nonetheless, we will leverage the idea to prove convergence
of the extragradient method by bounding the discrepancy between the iterates it produces and the
iterates that would be produced if we could exactly solve the proximal point update.
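Here is a minimal sketch of Algorithm 5 (the operator, step size, and starting point are illustrative assumptions). On the merely monotone rotation field the one-step iteration (39) spirals outward, while the extragradient iterates contract; note that the convergence proof given next assumes strong monotonicity, so the behavior on this example goes a little beyond what the theorem below covers.

import numpy as np

A_rot = np.array([[0.0, 1.0], [-1.0, 0.0]])    # monotone rotation field F(z) = A z
F = lambda z: A_rot @ z
eta = 0.2

def gd_like(z, n):                             # iteration (39), shown for contrast
    for _ in range(n):
        z = z - eta * F(z)
    return np.linalg.norm(z)

def extragradient(z, n):                       # Algorithm 5
    for _ in range(n):
        z_tilde = z - eta * F(z)               # extrapolation step
        z = z - eta * F(z_tilde)               # update using F at the extrapolated point
    return np.linalg.norm(z)

z0 = np.array([1.0, 0.0])
print("one-step method, 200 iters :", gd_like(z0, 200))        # grows
print("extragradient,   200 iters :", extragradient(z0, 200))  # shrinks toward the fixed point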

We give the proof of convergence of the extragradient method in the case where F is strongly
monotone (so we can easily get a rate) and L-Lipschitz:
||xk+1 − x⋆||² = ||xk − x⋆||² − 2η⟨F (x̃k), xk − x⋆⟩ + η²||F (x̃k)||²
         (a) = ||xk − x⋆||² − 2η⟨F (x̃k), x̃k − x⋆⟩ − 2η⟨F (x̃k), xk − x̃k⟩ + η²||F (x̃k)||²
         (b) ≤ ||xk − x⋆||² − 2ηµ||x̃k − x⋆||² + ||xk − x̃k − ηF (x̃k)||² − ||xk − x̃k||²
         (c) ≤ ||xk − x⋆||² − 2ηµ||x̃k − x⋆||² + (η²L² − 1)||xk − x̃k||²
         (d) ≤ (1 − ηµ)||xk − x⋆||² + (η²L² − 1 + 2ηµ)||xk − x̃k||²
         (e) ≤ (1 − µ/(2(L + µ)))||xk − x⋆||²    for η = 1/(2(L + µ))
         (f) ≤ (1 − µ/(4L))||xk − x⋆||².    (52)
In (a), we added and subtracted 2η⟨F (x̃k ), x̃k ⟩, our usual trick. In (b), we applied the strong
monotonicity property to the second term, and completed the square with the last two terms of the
previous line. To arrive at the third term in (c), we manipulated the third term in (b) as follows:
first, we used the dynamics of the algorithm to replace xk − x̃k with ηF (xk ), second, we applied
the Lipschitz property to the quantity ||F (xk ) − F (x̃k )||2 , and third, we grouped the resulting term,
η 2 L2 ||xk − x̃k ||2 , with the last term of line (b). To arrive at (d), we applied the triangle inequality,
−2ηµ||a||2 ≤ −ηµ||a + b||2 + 2ηµ||b||2 , with a = x̃k − xk + xk − x⋆ and b = x̃k − xk . In (e), we took
η = [2(L + µ)]−1 . This ensures that the prefactor of the error term ||xk − x̃k ||2 in (d) is negative,
and that the term can therefore be dropped. Note that with this assignment for the step size, line
(d) is a descent lemma for the extragradient method. Lastly, to simplify the rate of convergence in
(f), we used the fact that 2L ≥ L + µ, which is easily proved as follows:
µ ≤ ⟨F (x) − F (y), x − y⟩ / ||x − y||² ≤ ||F (x) − F (y)|| / ||x − y|| ≤ L.    (53)
In (53), the first inequality follows from strong monotonicity, the second from the Cauchy-Schwarz
inequality, and the third from the Lipschitz property of F .
We have proved the following theorem:
Theorem 3.4. For µ-strongly monotone and L-Lipschitz operator F , the iterates of Algorithm 5
with step size η = 1/[2(µ + L)] converge to the fixed point of F at the linear rate 1 − µ/(4L). In
particular,
||xk+1 − x⋆||² ≤ (1 − µ/(4L))^k ||x1 − x⋆||².    (54)
The extragradient algorithm is one practical instantiation of the proximal point method. There
are others. We will mention some of them in the next section. We end this section by noting that
one way to construct approximate proximal point algorithms is by expanding the resolvent of F in
powers of η, truncating the expansion after a fixed number of terms m, and then rearranging the
result to solve for xk+1 . The accuracy of the approximation to the proximal point update can be
controlled by increasing m.

3.2.4 High-Resolution Continuous-Time Limits


We studied the phenomenon of acceleration in optimization from a continuous-time perspective in
Section 2.3. We end these notes by illustrating with an example the value of thinking in continuous-
time about fixed-point problems.

We return to the two-player zero-sum game (32) to contextualize the discussion, which is based
on work done in [40]. We remind the reader that the vector field F of the game (32) is given in (35).
There are several algorithms that can be used to compute the fixed points of F . We consider four
of them: (1) gradient-descent-ascent (GDA), (2) the extragradient algorithm (EG), (3) optimistic
GDA (OGDA), and (4) the lookahead   algorithm (LA) [41]. Writing the state vector of the two
x1
players x1 and x2 together as z = , the update rules for all four algorithms are given below:
x2

GDA : zk+1 = zk − γF (zk),    (55)
EG : zk+1 = zk − γF (zk − γF (zk)),    (56)
OGDA : zk+1 = zk − 2γF (zk) + γF (zk−1),    (57)
LA : zk+1 = zk + α (z̃k+ℓ − zk),  α ∈ (0, 1].    (58)

In the lookahead algorithm, the quantity z̃k+ℓ is computed by applying ℓ iterations of some base
algorithm. Here, we will take the base algorithm to be GDA.
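For concreteness, here is a small sketch of the OGDA update (57) and the lookahead update (58) with a GDA base optimizer (the bilinear problem, step size, and the parameters α and ℓ are illustrative assumptions). Both converge to the equilibrium of this bilinear game, whereas plain GDA (55) does not.

import numpy as np

A = np.array([[0.0, 1.0], [-1.0, 0.0]])       # F(z) = A z for a bilinear game
F = lambda z: A @ z
gamma, alpha, ell = 0.2, 0.5, 5

def ogda(z, n):                                # update (57)
    z_prev = z.copy()
    for _ in range(n):
        z, z_prev = z - 2 * gamma * F(z) + gamma * F(z_prev), z
    return np.linalg.norm(z)

def lookahead_gda(z, n):                       # update (58) with a GDA base optimizer
    for _ in range(n):
        z_tilde = z.copy()
        for _ in range(ell):                   # ell base GDA steps from the current iterate
            z_tilde = z_tilde - gamma * F(z_tilde)
        z = z + alpha * (z_tilde - z)          # averaging step toward the lookahead point
    return np.linalg.norm(z)

z0 = np.array([1.0, 0.0])
print("OGDA,   200 iters :", ogda(z0, 200))
print("LA-GDA, 200 iters :", lookahead_gda(z0, 200))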
Now we might ask the following questions—what are the relative merits of each of these al-
gorithms and how might we pick the most appropriate among them in the context of a specific
problem? We have seen previously how taking a continuous-time perspective in optimization can be
productive of insight into such questions, and so we take that perspective again here. An interesting
technical roadblock arises, however, which is that when we naively take the continuous-time limits of
(55)-(58), they all collapse to a single differential equation. This is particularly problematic because
it is known that GDA can diverge on two-player zero-sum games [42, 40], but that EG, OGDA,
and LA do not. The continuous-time limit of these algorithms therefore fails to distinguish between
them even though they display distinct convergence behaviors in discrete time.
The solution to this problem is to take more careful continuous-time limits. We take what is
called a high-resolution limit, which has its roots in the study of hydrodynamics, and is useful to
study systems that contain multiple timescales. Here is the basic idea: Typically, we are faced with
an algorithm of the form xn+1 = G(xn , η) where G is the update rule and η is the step size. In
order to arrive at a continuous-time representation x(t), we make the association t = nη and take
the limit η → 0. The idea behind a high-resolution limit is to instead take t = ng(η) where g is a
function (other than the identity function) of η.
The high-resolution continuous-time limits of GDA, EG, OGDA, and LA are distinct. Writing
β = 2/η, they are

GDA : z̈(t) = −β ż(t) − βF (z(t)), (59)


EG : z̈(t) = −β ż(t) − βF (z(t)) + 2J(z(t))F (z(t)), (60)
OGDA : z̈(t) = −β ż(t) − βF (z(t)) − 2J(z(t))ż(t), (61)
LA, ℓ = 2 : z̈(t) = −β ż(t) − αβF (z(t)) + 2αJ(z(t))F (z(t)). (62)

In (60)-(62), we note the appearance of the Jacobian J of the vector field,

J(z) = [ ∇²x1 f (z)        ∇x2 ∇x1 f (z)
        −∇x1 ∇x2 f (z)    −∇²x2 f (z) ],    (63)

which is not present in the high-resolution continuous-time limit of GDA. This suggests that the
three convergent algorithms all leverage information contained in the Jacobian of F in order to
converge. It is also interesting that the continuous-time limit of LA is L-dependent. The structure
of the continuous-time equations (59)-(62) thus immediately gives clues to understanding the different
convergence properties of the various algorithms and motivates new questions for future research.

References
[1] Adriana Camacho and Emily Conover. Manipulation of social program eligibility. American
Economic Journal: Economic Policy, 3(2):41–65, 2011.
[2] Tijana Zrnic, Eric Mazumdar, S. Shankar Sastry, and Michael I. Jordan. Who leads and
who follows in strategic classification? Advances in Neural Information Processing Systems,
34:15257–15269, 2021.

[3] Juan C. Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative
prediction. In Proceedings of the 37th International Conference on Machine Learning, volume
119, pages 7599–7609, 2020.
[4] Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction
and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.

[5] Stephen Bates, Anastasios N. Angelopoulos, Lihua Lei, Jitendra Malik, and Michael I. Jordan.
Distribution-free, risk-controlling prediction sets. J. ACM, 68(6), 2021.
[6] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua
Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. arXiv preprint
arXiv:2110.01052, 2022.
[7] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
[8] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory
in Hilbert Spaces. Springer, 2011.

[9] Francisco Facchinei and Jong-Shi Pang. Finite-Dimensional Variational Inequalities and Com-
plementarity Problems. Springer, 2003.
[10] Yurii Nesterov. Introductory Lectures on Convex Optimization. Springer, 2004.
[11] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to
escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine
Learning, volume 70, pages 1724–1732, 2017.
[12] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, and Michael I. Jordan. On nonconvex
optimization for machine learning: Gradients, stochasticity, and saddle points. J. ACM, 68(2),
2021.

[13] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only
converges to minimizers. In 29th Annual Conference on Learning Theory, volume 49, pages
1246–1257, 2016.
[14] Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and
Benjamin Recht. First-order methods almost always avoid strict saddle points. Mathematical
Programming, 176:311–337, 2019.
[15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points – online
stochastic gradient for tensor decomposition. In Proceedings of the 28th Conference on Learning
Theory, volume 40, pages 797–842, 2015.
[16] Arkadi S. Nemirovski and David B. Yudin. Problem Complexity and Method Efficiency in
Optimization. J. Wiley & Sons, 1983.

[17] Michael Baes. Estimate sequence methods: extensions and approximations. IFOR Internal
Report, ETH Zürich, 2009.
[18] Weijie Su, Stephen Boyd, and Emmanuel J. Candès. A differential equation for modeling
Nesterov’s accelerated gradient method: Theory and insights. Journal of Machine Learning
Research, 17:1–43, 2016.

[19] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[20] Bin Shi, Simon S. Du, Michael I. Jordan, and Weijie J. Su. Understanding the acceleration
phenomenon via high-resolution differential equations. Mathematical Programming, 195:79–148,
2022.
[21] Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on acceler-
ated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–
E7358, 2016.
[22] Michael Betancourt, Michael I. Jordan, and Ashia C. Wilson. On symplectic optimization.
arXiv preprint arXiv:1802.03653, 2018.
[23] Guilherme França, Michael I. Jordan, and René Vidal. On dissipative symplectic integration
with applications to gradient-based optimization. Journal of Statistical Mechanics: Theory and
Experiment, 2021(4):043402, 2021.

[24] Ernst Hairer, Syvert P. Nørsett, and Gerhard Wanner. Solving Ordinary Differential Equations
I. Springer, second edition, 1993.
[25] Ernst Hairer and Gerhard Wanner. Solving Ordinary Differential Equations II. Springer, second
edition, 1996.
[26] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes
saddle points faster than gradient descent. In Proceedings of the 31st Conference On Learning
Theory, volume 75, pages 1042–1085, 2018.
[27] Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-
concave densities. Journal of the Royal Statistical Society. Series B (Statistical Methodology),
79(3):651–676, 2017.

[28] Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted
Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
[29] Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In
Proceedings of Algorithmic Learning Theory, volume 83, pages 186–211, 2018.

[30] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped
Langevin MCMC: A non-asymptotic analysis. In Proceedings of the 31st Conference On Learn-
ing Theory, volume 75, pages 300–323, 2018.
[31] Yi-An Ma, Niladri S. Chatterji, Xiang Cheng, Nicolas Flammarion, Peter L. Bartlett, and
Michael I. Jordan. Is there an analog of Nesterov acceleration for gradient-based MCMC?
Bernoulli, 27(3):1942 – 1992, 2021.
[32] Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan.
High-order Langevin diffusion yields an accelerated MCMC algorithm. Journal of Machine
Learning Research, 22(42):1–41, 2021.

[33] Niladri Chatterji, Jelena Diakonikolas, Michael I. Jordan, and Peter L. Bartlett. Langevin
Monte Carlo without smoothness. In Proceedings of the 23rd International Conference on
Artificial Intelligence and Statistics, volume 108, pages 1716–1726, 2020.
[34] Eric Mazumdar, Aldo Pacchiano, Yian Ma, Michael Jordan, and Peter Bartlett. On approxi-
mate Thompson sampling with Langevin algorithms. In Proceedings of the 37th International
Conference on Machine Learning, volume 119, pages 6797–6807, 2020.
[35] Eric V. Mazumdar, Michael I. Jordan, and S. Shankar Sastry. On finding local Nash equilibria
(and only local Nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838, 2019.
[36] Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity
of computing a Nash equilibrium. SIAM Journal on Computing, 39:195–259, 2009.
[37] R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal
on Control and Optimization, 14(5):877–898, 1976.
[38] Galina M Korpelevich. The extragradient method for finding saddle points and other problems.
Matecon, 12:747–756, 1976.

[39] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient
and optimistic gradient methods for saddle point problems: Proximal point approach. In Pro-
ceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pages
1497–1507. PMLR, 2020.

[40] Tatjana Chavdarova, Michael I. Jordan, and Manolis Zampetakis. Last-iterate conver-
gence of saddle point optimizers via high-resolution differential equations. arXiv preprint
arXiv:2112.13826, 2022.
[41] Tatjana Chavdarova, Matteo Pagliardini, Martin Jaggi, and Francois Fleuret. Taming GANs
with lookahead. Technical report, Idiap, 2020.

[42] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs
with optimism. In Proceedings of the 6th International Conference on Learning Representations,
2018.
