Optimization Methods for Large-Scale Machine Learning

Léon Bottou∗ Frank E. Curtis† Jorge Nocedal‡

February 12, 2018


arXiv:1606.04838v3 [stat.ML] 8 Feb 2018

Abstract
This paper provides a review and commentary on the past, present, and future of numerical
optimization algorithms in the context of machine learning applications. Through case studies
on text classification and the training of deep neural networks, we discuss how optimization
problems arise in machine learning and what makes them challenging. A major theme of our
study is that large-scale machine learning represents a distinctive setting in which the stochastic
gradient (SG) method has traditionally played a central role while conventional gradient-based
nonlinear optimization techniques typically falter. Based on this viewpoint, we present a com-
prehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior,
and highlight opportunities for designing algorithms with improved performance. This leads to
a discussion about the next generation of optimization methods for large-scale machine learning,
including an investigation of two main streams of research on techniques that diminish noise in
the stochastic directions and methods that make use of second-order derivative approximations.

Contents
1 Introduction

2 Machine Learning Case Studies
2.1 Text Classification via Convex Optimization
2.2 Perceptual Tasks via Deep Neural Networks
2.3 Formal Machine Learning Procedure

3 Overview of Optimization Methods
3.1 Formal Optimization Problem Statements
3.2 Stochastic vs. Batch Optimization Methods
3.3 Motivation for Stochastic Methods
3.4 Beyond SG: Noise Reduction and Second-Order Methods

∗ Facebook AI Research, New York, NY, USA. E-mail: [email protected]

† Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, PA, USA. Supported by U.S.
Department of Energy grant DE-SC0010615 and U.S. National Science Foundation grant DMS-1016291. E-mail:
[email protected]

‡ Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, USA.
Supported by Office of Naval Research grant N00014-14-1-0313 P00003 and Department of Energy grant
DE-FG02-87ER25047. E-mail: [email protected]

4 Analyses of Stochastic Gradient Methods
4.1 Two Fundamental Lemmas
4.2 SG for Strongly Convex Objectives
4.3 SG for General Objectives
4.4 Work Complexity for Large-Scale Learning
4.5 Commentary

5 Noise Reduction Methods
5.1 Reducing Noise at a Geometric Rate
5.2 Dynamic Sample Size Methods
5.2.1 Practical Implementation
5.3 Gradient Aggregation
5.3.1 SVRG
5.3.2 SAGA
5.3.3 Commentary
5.4 Iterate Averaging Methods

6 Second-Order Methods
6.1 Hessian-Free Inexact Newton Methods
6.1.1 Subsampled Hessian-Free Newton Methods
6.1.2 Dealing with Nonconvexity
6.2 Stochastic Quasi-Newton Methods
6.2.1 Deterministic to Stochastic
6.2.2 Algorithms
6.3 Gauss-Newton Methods
6.4 Natural Gradient Method
6.5 Methods that Employ Diagonal Scalings

7 Other Popular Methods
7.1 Gradient Methods with Momentum
7.2 Accelerated Gradient Methods
7.3 Coordinate Descent Methods

8 Methods for Regularized Models
8.1 First-Order Methods for Generic Convex Regularizers
8.1.1 Iterative Soft-Thresholding Algorithm (ISTA)
8.1.2 Bound-Constrained Methods for $\ell_1$-norm Regularized Problems
8.2 Second-Order Methods
8.2.1 Proximal Newton Methods
8.2.2 Orthant-Based Methods

9 Summary and Perspectives

A Convexity and Analyses of SG

B Proofs

1 Introduction
The promise of artificial intelligence has been a topic of both public and private interest for decades.
Starting in the 1950s, there has been great hope that classical artificial intelligence techniques based
on logic, knowledge representation, reasoning, and planning would result in revolutionary software
that could, amongst other things, understand language, control robots, and provide expert advice.
Although advances based on such techniques may be in store in the future, many researchers have
started to doubt these classical approaches, choosing instead to focus their efforts on the design
of systems based on statistical techniques, such as in the rapidly evolving and expanding field of
machine learning.
Machine learning and the intelligent systems that have been born out of it (such as search
engines, recommendation platforms, and speech and image recognition software) have become an
indispensable part of modern society. Rooted in statistics and relying heavily on the efficiency of
numerical algorithms, machine learning techniques capitalize on the world’s increasingly powerful
computing platforms and the availability of datasets of immense size. In addition, as the fruits of
its efforts have become so easily accessible to the public through various modalities (such as the
cloud), interest in machine learning is bound to continue its dramatic rise, yielding further societal,
economic, and scientific impacts.
One of the pillars of machine learning is mathematical optimization, which, in this context,
involves the numerical computation of parameters for a system designed to make decisions based
on yet unseen data. That is, based on currently available data, these parameters are chosen to
be optimal with respect to a given learning problem. The success of certain optimization methods
for machine learning has inspired great numbers in various research communities to tackle even
more challenging machine learning problems, and to design new methods that are more widely
applicable.
The purpose of this paper is to provide a review and commentary on the past, present, and future
of the use of numerical optimization algorithms in the context of machine learning applications.
A major theme of this work is that large-scale machine learning represents a distinctive setting in
which traditional nonlinear optimization techniques typically falter, and so should be considered
secondary to alternative classes of approaches that respect the statistical nature of the underlying
problem of interest.
Overall, this paper attempts to provide answers for the following questions.

1. How do optimization problems arise in machine learning applications and what makes them
challenging?

2. What have been the most successful optimization methods for large-scale machine learning
and why?

3. What recent advances have been made in the design of algorithms and what are open questions
in this research area?

We answer the first question with the aid of two case studies. The first, a study of text classifica-
tion, represents an application in which the success of machine learning has been widely recognized
and celebrated. The second, a study of perceptual tasks such as speech or image recognition, represents
an application in which machine learning has also had great success, but in a much more
enigmatic manner that leaves many questions unanswered. These case studies also illustrate the
variety of optimization problems that arise in machine learning: the first involves convex optimiza-
tion problems—derived from the use of logistic regression or support vector machines—while the
second typically involves highly nonlinear and nonconvex problems—derived from the use of deep
neural networks.
With these case studies in hand, we turn our attention to the latter two questions on optimiza-
tion algorithms, the discussions around which represent the bulk of the paper. Whereas traditional
gradient-based methods may be effective for solving small-scale learning problems in which a batch
approach may be used, in the context of large-scale machine learning it has been a stochastic
algorithm—namely, the stochastic gradient method (SG) proposed by Robbins and Monro [130]—
that has been the core strategy of interest. Due to this central role played by SG, we discuss its
fundamental theoretical and practical properties within a few contexts of interest. We also discuss
recent trends in the design of optimization methods for machine learning, organizing them accord-
ing to their relationship to SG. We discuss: (i) noise reduction methods that attempt to borrow
from the strengths of batch methods, such as their fast convergence rates and ability to exploit
parallelism; (ii) methods that incorporate approximate second-order derivative information with
the goal of dealing with nonlinearity and ill-conditioning; and (iii) methods for solving regularized
problems designed to avoid overfitting and allow for the use of high-dimensional models. Rather
than contrast SG and other methods based on the results of numerical experiments—which might
bias our review toward a limited test set and implementation details—we focus our attention on
fundamental computational trade-offs and theoretical properties of optimization methods.
We close the paper with a look back at what has been learned, as well as additional thoughts
about the future of optimization methods for machine learning.

2 Machine Learning Case Studies


Optimization problems arise throughout machine learning. We provide two case studies that illus-
trate their role in the selection of prediction functions in state-of-the-art machine learning systems.
We focus on cases that involve very large datasets and for which the number of model parameters
to be optimized is also large. By remarking on the structure and scale of such problems, we provide
a glimpse into the challenges that make them difficult to solve.

2.1 Text Classification via Convex Optimization


The assignment of natural language text to predefined classes based on their contents is one of the
fundamental tasks of information management [56]. Consider, for example, the task of determining
whether a text document is one that discusses politics. Educated humans can make a determination
of this type unambiguously, say by observing that the document contains the names of well-known
politicians. Early text classification systems attempted to consolidate knowledge from human
experts by building on such observations to formally characterize the word sequences that signify a
discussion about a topic of interest (e.g., politics). Unfortunately, however, concise characterizations
of this type are difficult to formulate. Rules about which word sequences do or do not signify a
topic need to be refined when new documents arise that cannot be classified accurately based
on previously established rules. The need to coordinate such a growing collection of possibly
contradictory rules limits the applicability of such systems to relatively simple tasks.
By contrast, the statistical machine learning approach begins with the collection of a sizeable
set of examples $\{(x_1, y_1), \dots, (x_n, y_n)\}$, where for each $i \in \{1, \dots, n\}$ the vector $x_i$ represents
the features of a text document (e.g., the words it includes) and the scalar $y_i$ is a label indicating
whether the document belongs ($y_i = 1$) or not ($y_i = -1$) to a particular class (i.e., topic of interest).
With such a set of examples, one can construct a classification program, defined by a prediction
function $h$, and measure its performance by counting how often the program prediction $h(x_i)$ differs
from the correct prediction $y_i$. In this manner, it seems judicious to search for a prediction function
that minimizes the frequency of observed misclassifications, otherwise known as the empirical risk
of misclassification:
\[
R_n(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(x_i) \neq y_i],
\quad \text{where} \quad
\mathbb{1}[A] = \begin{cases} 1 & \text{if $A$ is true,} \\ 0 & \text{otherwise.} \end{cases}
\tag{2.1}
\]

The idea of minimizing such a function gives rise to interesting conceptual issues. Consider, for
example, a function that simply memorizes the examples, such as
\[
h_{\text{rote}}(x) = \begin{cases} y_i & \text{if $x = x_i$ for some $i \in \{1, \dots, n\}$,} \\ \pm 1 & \text{(arbitrarily) otherwise.} \end{cases}
\tag{2.2}
\]

This prediction function clearly minimizes (2.1), but it offers no performance guarantees on documents
that do not appear in the examples. To avoid such rote memorization, one should aim to
find a prediction function that generalizes the concepts that may be learned from the examples.
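To make the issue concrete, the following Python sketch evaluates the empirical risk (2.1) of the memorizing predictor (2.2). The toy examples, the `default` guess used for unseen inputs, and the function names are illustrative assumptions, not part of the original study.

```python
def empirical_risk(h, examples):
    """Fraction of examples (x, y) on which the prediction h(x) differs from y, as in (2.1)."""
    return sum(1 for x, y in examples if h(x) != y) / len(examples)


def make_rote(examples, default=1):
    """Memorizing predictor of (2.2): look up seen inputs, guess `default` otherwise."""
    table = {x: y for x, y in examples}
    return lambda x: table.get(x, default)


# Toy data (hypothetical): inputs are tuples of word indicators, labels are +1 / -1.
train = [((1, 0, 0), 1), ((0, 1, 0), -1), ((1, 1, 0), 1), ((0, 0, 1), -1)]
unseen = [((1, 0, 1), 1), ((0, 1, 1), -1)]

h_rote = make_rote(train)
train_risk = empirical_risk(h_rote, train)    # zero by construction
unseen_risk = empirical_risk(h_rote, unseen)  # set entirely by the arbitrary default
```

On this toy data the rote predictor attains zero empirical risk on the training examples, while its error on the unseen inputs is determined solely by the arbitrary default label.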
One way to achieve good generalized performance is to choose amongst a carefully selected class
of prediction functions, perhaps satisfying certain smoothness conditions and enjoying a convenient
parametric representation. How such a class of functions may be selected is not straightforward;
we discuss this issue in more detail in §2.3. For now, we only mention that common practice for
choosing between prediction functions belonging to a given class is to compare them using cross-
validation procedures that involve splitting the examples into three disjoint subsets: a training set,
a validation set, and a testing set. The process of optimizing the choice of h by minimizing Rn
in (2.1) is carried out on the training set over the set of candidate prediction functions, the goal
of which is to pinpoint a small subset of viable candidates. The generalized performance of each
of these remaining candidates is then estimated using the validation set, the best performing of
which is chosen as the selected function. The testing set is only used to estimate the generalized
performance of this selected function.
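The selection procedure just described might be sketched as follows. The split fractions, the helper names, and the threshold classifiers used as candidates are hypothetical; in practice the candidates would come from minimizing $R_n$ on the training set.

```python
import random


def split_examples(examples, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle and split the examples into disjoint training, validation, and testing sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(frac_train * len(shuffled))
    n_val = int(frac_val * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def error_rate(h, examples):
    """Empirical misclassification rate of h on the given examples."""
    return sum(1 for x, y in examples if h(x) != y) / len(examples)


def select_and_evaluate(candidates, validation, testing):
    """Pick the candidate with the lowest validation error; report its testing error."""
    best = min(candidates, key=lambda h: error_rate(h, validation))
    return best, error_rate(best, testing)


# Toy problem (assumed): scalar inputs labeled +1 when x >= 5 and -1 otherwise.
data = [(x, 1 if x >= 5 else -1) for x in range(20)]
train, val, test = split_examples(data)
# Stand-ins for the candidates that survived training: simple threshold classifiers.
candidates = [lambda x, t=t: 1 if x >= t else -1 for t in (5, 2, 8)]
best, test_error = select_and_evaluate(candidates, val, test)
```

The testing set is touched only once, after the validation set has already determined the choice, so `test_error` is not biased by the selection step.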
Such experimental investigations have, for instance, shown that the bag of words approach
works very well for text classification [56, 81]. In such an approach, a text document is represented
by a feature vector $x \in \mathbb{R}^d$ whose components are associated with a prescribed set of vocabulary
words; i.e., each nonzero component indicates that the associated word appears in the document.
(The capabilities of such a representation can also be increased by augmenting the set of words
with entries representing short sequences of words.) This encoding scheme leads to very sparse
vectors. For instance, the canonical encoding for the standard RCV1 dataset [94] uses a vocabulary
of d = 47,152 words to represent news stories that typically contain fewer than 1,000 words. Scaling
techniques can be used to give more weight to distinctive words, while scaling to ensure that each
document has $\|x\| = 1$ can be performed to compensate for differences in document lengths [94].
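As a rough illustration of this encoding, the sketch below builds a sparse bag-of-words vector and scales it to unit Euclidean norm. The five-word vocabulary, the whitespace tokenizer, and the use of raw word counts as weights are assumptions made for the example; the canonical RCV1 encoding uses its own weighting scheme.

```python
import math


def bag_of_words(text, vocabulary):
    """Sparse bag-of-words vector {word index: weight}, scaled so that ||x|| = 1."""
    counts = {}
    for token in text.lower().split():
        if token in vocabulary:
            idx = vocabulary[token]
            counts[idx] = counts.get(idx, 0.0) + 1.0
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {i: v / norm for i, v in counts.items()} if norm > 0 else {}


# Hypothetical vocabulary mapping each word to a component of x.
vocabulary = {"senate": 0, "vote": 1, "election": 2, "goal": 3, "match": 4}
x = bag_of_words("The senate vote follows the election and a second vote", vocabulary)
```

Only the components of words that actually occur are stored, which is what keeps the representation manageable when $d$ is in the tens of thousands.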
Thanks to such a high-dimensional sparse representation of documents, it has been deemed
empirically sufficient to consider prediction functions of the form $h(x; w, \tau) = w^T x - \tau$. Here, $w^T x$
is a linear discriminant parameterized by $w \in \mathbb{R}^d$ and $\tau \in \mathbb{R}$ is a bias that provides a way to
compromise between precision and recall.¹ The accuracy of the predictions could be determined by
counting the number of times that $\operatorname{sign}(h(x; w, \tau))$ matches the correct label, i.e., $1$ or $-1$. However,
while such a prediction function may be appropriate for classifying new documents, formulating
an optimization problem around it to choose the parameters $(w, \tau)$ is impractical in large-scale
settings due to the combinatorial structure introduced by the sign function, which is discontinuous.
Instead, one typically employs a continuous approximation through a loss function that measures
a cost for predicting $h$ when the true label is $y$; e.g., one may choose a log-loss function of the form
$\ell(h, y) = \log(1 + \exp(-hy))$. Moreover, one obtains a class of prediction functions via the addition
of a regularization term parameterized by a scalar $\lambda > 0$, leading to a convex optimization problem:
\[
\min_{(w, \tau) \in \mathbb{R}^d \times \mathbb{R}} \; \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i; w, \tau), y_i) + \frac{\lambda}{2} \|w\|_2^2.
\tag{2.3}
\]

This problem may be solved multiple times for a given training set with various values of λ > 0,
with the ultimate solution $(w_*, \tau_*)$ being the one that yields the best performance on a validation
set.
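A minimal sketch of problem (2.3) with the log-loss is given below, using dense Python lists and plain (batch) gradient descent purely for illustration. The toy examples, the step size `lr`, the iteration count, and the value of `lam` are assumptions; a large-scale solver would exploit sparsity and would typically use a different algorithm.

```python
import math


def log_loss(h, y):
    """Log-loss log(1 + exp(-h*y)), written to avoid overflow for large |h|."""
    z = -h * y
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, tau, x):
    """Linear prediction function h(x; w, tau) = w^T x - tau."""
    return sum(wj * xj for wj, xj in zip(w, x)) - tau


def objective(w, tau, examples, lam):
    """Regularized empirical risk of (2.3)."""
    risk = sum(log_loss(predict(w, tau, x), y) for x, y in examples) / len(examples)
    return risk + 0.5 * lam * sum(wj * wj for wj in w)


def gradient(w, tau, examples, lam):
    """Gradient of (2.3) with respect to w and tau."""
    n = len(examples)
    gw, gtau = [lam * wj for wj in w], 0.0
    for x, y in examples:
        # d/dh log(1 + exp(-h*y)) = -y * sigmoid(-h*y)
        dh = -y * sigmoid(-predict(w, tau, x) * y) / n
        for j, xj in enumerate(x):
            gw[j] += dh * xj
        gtau -= dh  # because dh/dtau = -1
    return gw, gtau


def fit(examples, lam, steps=500, lr=0.5):
    """Plain gradient descent from the origin (illustrative settings)."""
    w, tau = [0.0] * len(examples[0][0]), 0.0
    for _ in range(steps):
        gw, gtau = gradient(w, tau, examples, lam)
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        tau -= lr * gtau
    return w, tau


examples = [((1.0, 0.0), 1), ((0.0, 1.0), -1), ((1.0, 1.0), 1), ((0.0, 0.0), -1)]
lam = 0.1
initial_value = objective([0.0, 0.0], 0.0, examples, lam)  # equals log(2)
w, tau = fit(examples, lam)
final_value = objective(w, tau, examples, lam)
```

Choosing $\lambda$ would amount to calling `fit` once per candidate value and comparing the resulting prediction functions on a validation set.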
Many variants of problem (2.3) have also appeared in the literature. For example, the well-known
support vector machine problem [38] amounts to using the hinge loss $\ell(h, y) = \max(0, 1 - hy)$.
Another popular feature is to use an $\ell_1$-norm regularizer, namely $\lambda \|w\|_1$, which favors sparse
solutions and therefore restricts attention to a subset of vocabulary words. Other more sophisticated
losses can also target a specific precision/recall tradeoff or deal with hierarchies of classes; e.g., see
[65, 49]. The choice of loss function is usually guided by experimentation.
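For comparison, the short sketch below tabulates the misclassification indicator next to the hinge loss and the log-loss for a few prediction values $h$ with label $y = 1$. The sample values of $h$ are arbitrary, and counting $h = 0$ as a misclassification is a convention assumed here.

```python
import math


def zero_one_loss(h, y):
    """Misclassification indicator; h = 0 is counted as an error (assumed convention)."""
    return 0.0 if h * y > 0 else 1.0


def hinge_loss(h, y):
    """Hinge loss max(0, 1 - h*y) used by support vector machines."""
    return max(0.0, 1.0 - h * y)


def log_loss(h, y):
    """Log-loss log(1 + exp(-h*y)), evaluated in an overflow-safe form."""
    z = -h * y
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


values = [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]
table = [(h, zero_one_loss(h, 1), hinge_loss(h, 1), log_loss(h, 1)) for h in values]
```

Both surrogates are continuous and convex in $h$, whereas the indicator jumps at $h = 0$; the hinge loss additionally vanishes once $hy \geq 1$.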
In summary, both theoretical arguments [38] and experimental evidence [81] indicate that a
carefully selected family of prediction functions and such high-dimensional representations of documents
lead to good performance while avoiding overfitting (recall (2.2)) of the training set. The
use of a simple surrogate, such as (2.3), facilitates experimentation and has been very successful for
a variety of problems even beyond text classification. Simple surrogate models are, however, not the
most effective in all applications. For example, for certain perceptual tasks, the use of deep neural
networks, which lead to large-scale, highly nonlinear, and nonconvex optimization problems, has
produced great advances that are unmatched by approaches involving simpler, convex models. We
discuss such problems next.

2.2 Perceptual Tasks via Deep Neural Networks


Just like text classification tasks, perceptual tasks such as speech and image recognition are not well
performed in an automated manner using computer programs based on sets of prescribed rules. For
instance, the infinite diversity of writing styles easily defeats attempts to concisely specify which
pixel combinations represent the digit four; see Figure 2.1. One may attempt to design heuristic
techniques to handle such eclectic instantiations of the same object, but as previous attempts to
design such techniques have often failed, computer vision researchers are increasingly embracing
machine learning techniques.
During the past five years, spectacular applicative successes on perceptual problems have been
achieved by machine learning techniques through the use of deep neural networks (DNNs). Although
there are many kinds of DNNs, most recent advances have been made using essentially the same
types that were popular in the 1990s and almost forgotten in the 2000s [111]. What has made
recent successes possible is the availability of much larger datasets and greater computational
resources.

Fig. 2.1: No known prescribed rules express all pixel combinations that represent four.

¹ Precision and recall, defined as the probabilities $\mathbb{P}[y = 1 \mid h(x) = 1]$ and $\mathbb{P}[h(x) = 1 \mid y = 1]$, respectively, are convenient
measures of classifier performance when the class of interest represents only a small fraction of all documents.

Because DNNs were initially inspired by simplified models of biological neurons [134, 135], they
are often described with jargon borrowed from neuroscience. For the most part, this jargon turns
into a rather effective language for describing a prediction function $h$ whose value is computed by
applying successive transformations to a given input vector $x_i \in \mathbb{R}^{d_0}$. These transformations are
made in layers. For example, a canonical fully connected layer performs the computation
\[
x_i^{(j)} = s(W_j x_i^{(j-1)} + b_j) \in \mathbb{R}^{d_j},
\tag{2.4}
\]
where $x_i^{(0)} = x_i$, the matrix $W_j \in \mathbb{R}^{d_j \times d_{j-1}}$ and vector $b_j \in \mathbb{R}^{d_j}$ contain the $j$th layer parameters,
and $s$ is a component-wise nonlinear activation function. Popular choices for the activation function
include the sigmoid function $s(x) = 1/(1 + \exp(-x))$ and the hinge function $s(x) = \max\{0, x\}$ (often
called a rectified linear unit (ReLU) in this context). In this manner, the ultimate output vector
$x_i^{(J)}$ leads to the prediction function value $h(x_i; w)$, where the parameter vector $w$ collects all the
parameters $\{(W_1, b_1), \dots, (W_J, b_J)\}$ of the successive layers.
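The layer recursion (2.4) can be sketched in a few lines of NumPy. The layer sizes, the random initialization, and the choice of ReLU for every layer are assumptions made for the example.

```python
import numpy as np


def relu(z):
    """Rectified linear unit s(x) = max{0, x}, applied component-wise."""
    return np.maximum(0.0, z)


def forward(x, layers, s=relu):
    """Apply x^(j) = s(W_j x^(j-1) + b_j) for j = 1, ..., J, starting from x^(0) = x."""
    for W, b in layers:
        x = s(W @ x + b)
    return x


rng = np.random.default_rng(0)
dims = [4, 5, 3, 2]  # d_0, d_1, d_2, d_3 (hypothetical sizes)
layers = [(rng.standard_normal((dims[j], dims[j - 1])), np.zeros(dims[j]))
          for j in range(1, len(dims))]
x0 = rng.standard_normal(dims[0])
output = forward(x0, layers)
n_params = sum(W.size + b.size for W, b in layers)  # length of the parameter vector w
```

The parameter vector $w$ of the text corresponds to all entries of the pairs in `layers` taken together, which is why `n_params` grows quickly with the layer widths.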
Similar to (2.3), an optimization problem in this setting involves the collection of a training set
$\{(x_1, y_1), \dots, (x_n, y_n)\}$ and the choice of a loss function $\ell$ leading to
\[
\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i; w), y_i).
\tag{2.5}
\]

However, in contrast to (2.3), this optimization problem is highly nonlinear and nonconvex, making
it intractable to solve to global optimality. That being said, machine learning experts have made
great strides in the use of DNNs by computing approximate solutions by gradient-based methods.
This has been made possible by the conceptually straightforward, yet crucial observation that the
gradient of the objective in (2.5) with respect to the parameter vector w can be computed by the
chain rule using algorithmic differentiation [71]. This differentiation technique is known in the
machine learning community as back propagation [134, 135].
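As a small illustration of this chain-rule computation, the sketch below differentiates the log-loss of a network with a single sigmoid hidden layer by hand and compares one gradient entry against a central finite difference. The architecture, the sizes, and the random data are assumptions chosen to keep the example short.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_grads(params, x, y):
    """Forward pass, then a reverse sweep that applies the chain rule layer by layer."""
    W1, b1, w2, b2 = params
    a = W1 @ x + b1                   # hidden pre-activation
    z = sigmoid(a)                    # hidden layer output
    h = w2 @ z + b2                   # scalar prediction
    loss = np.log1p(np.exp(-h * y))   # log-loss for a label y in {-1, +1}

    dh = -y / (1.0 + np.exp(h * y))   # d loss / d h
    g_w2, g_b2 = dh * z, dh
    dz = dh * w2                      # propagate back through the output layer
    da = dz * z * (1.0 - z)           # sigmoid'(a) = z * (1 - z)
    g_W1, g_b1 = np.outer(da, x), da
    return loss, (g_W1, g_b1, g_w2, g_b2)


rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), 1.0
params = (rng.standard_normal((4, 3)), rng.standard_normal(4),
          rng.standard_normal(4), 0.0)
loss, grads = loss_and_grads(params, x, y)

# Finite-difference check of one entry of the gradient with respect to W1.
eps = 1e-6
W1_plus, W1_minus = params[0].copy(), params[0].copy()
W1_plus[0, 1] += eps
W1_minus[0, 1] -= eps
fd_estimate = (loss_and_grads((W1_plus,) + params[1:], x, y)[0]
               - loss_and_grads((W1_minus,) + params[1:], x, y)[0]) / (2 * eps)
```

The reverse sweep costs about as much as the forward pass, whereas the finite-difference estimate needs two extra loss evaluations for every single parameter.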
The number of layers in a DNN and the size of each layer are usually determined by performing
comparative experiments and evaluating the system performance on a validation set, as in the
procedure in §2.1. A contemporary fully connected neural network for speech recognition typically
has five to seven layers. This amounts to tens of millions of parameters to be optimized, the
training of which may require up to thousands of hours of speech data (representing hundreds of
millions of training examples) and weeks of computation on a supercomputer. Figure 2.2 illustrates
the word error rate gains achieved by using DNNs for acoustic modeling in three state-of-the-art
speech recognition systems. These gains in accuracy are so significant that DNNs are now used in
all the main commercial speech recognition products.

Fig. 2.2: Word error rates reported by three different research groups on three standard speech
recognition benchmarks. For all three groups, deep neural networks (DNNs) significantly outper-
form the traditional Gaussian mixture models (GMMs) [50]. These experiments were performed
between 2010 and 2012 and were instrumental in the recent revival of DNNs.

At the same time, convolutional neural networks (CNNs) have proved to be very effective for
computer vision and signal processing tasks [87, 24, 88, 85]. Such a network is composed of convolutional
layers, wherein the parameter matrix $W_j$ is a circulant matrix and the input $x_i^{(j-1)}$ is
interpreted as a multichannel image. The product $W_j x_i^{(j-1)}$ then computes the convolution of the
image by a trainable filter while the activation functions (which are piecewise linear functions as
opposed to sigmoids) can perform more complex operations that may be interpreted as image
rectification, contrast normalization, or subsampling. Figure 2.3 represents the architecture of the
winner of the landmark 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
[148]. The figure illustrates a CNN with five convolutional layers and three fully connected layers [85].
The input vector represents the pixel values of a 224 × 224 image while the output scores
represent the odds that the image belongs to each of 1,000 categories. This network contains about
60 million parameters, the training of which on a few million labeled images takes a few days on a
dual GPU workstation.
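The link between circulant matrices and convolution can be checked numerically. The sketch below uses a one-dimensional, single-channel signal with periodic boundary conditions, a simplifying assumption relative to the multichannel images handled by an actual CNN; the filter and signal values are made up.

```python
import numpy as np


def circulant(c):
    """Circulant matrix with first column c: entry (i, k) equals c[(i - k) mod n]."""
    n = len(c)
    return np.array([[c[(i - k) % n] for k in range(n)] for i in range(n)])


# Hypothetical two-tap filter, zero-padded to the length of the signal.
filt = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
signal = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

via_matrix = circulant(filt) @ signal
# Independent check: circular convolution computed through the FFT.
via_fft = np.real(np.fft.ifft(np.fft.fft(filt) * np.fft.fft(signal)))
```

Although the matrix has $n^2$ entries, it is fully determined by the few nonzero filter coefficients, which is why a convolutional layer has far fewer trainable parameters than a fully connected layer of the same size.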

Fig. 2.3: Architecture for image recognition. The 2012 ILSVRC winner consists of eight layers [85].
Each layer performs a linear transformation (specifically, convolutions in layers C1–C5 and matrix
multiplication in layers F6–F8) followed by nonlinear transformations (rectification in all layers,
contrast normalization in C1–C2, and pooling in C1–C2 and C5). Regularization with dropout
noise is used in layers F6–F7.

Figure 2.4 illustrates the historical error rates of the annual winners of the ILSVRC. In this competition,
a classification is deemed successful if the correct category appeared among the top five
categories returned by the system. The large performance gain achieved in 2012 was confirmed in
the following years, and today CNNs are considered the tool of choice for visual object recogni-
tion [129]. They are currently deployed by numerous Internet companies for image search and face
recognition.
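The success criterion used in the competition can be written down directly. In the sketch below, the score rows and labels are made-up values over seven categories rather than 1,000.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose correct label is not among the k highest-scoring categories."""
    errors = 0
    for row, label in zip(scores, labels):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        errors += label not in top_k
    return errors / len(labels)


# Hypothetical scores for two images over seven categories.
scores = [
    [0.10, 0.50, 0.20, 0.90, 0.30, 0.40, 0.05],
    [0.80, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60],
]
labels = [2, 1]
top5 = top_k_error(scores, labels, k=5)
top1 = top_k_error(scores, labels, k=1)
```

The first image counts as a success under the top-five rule even though its correct category is not ranked first, so the top-five error is never larger than the top-one error.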

Fig. 2.4: Historical top-5 error rate of the annual winner of the ImageNet image classification
challenge. A convolutional neural network (CNN) achieved a significant performance improvement
over all traditional methods in 2012. The following years have cemented CNNs as the current
state-of-the-art in visual object recognition [85, 129].

The successes of DNNs in modern machine learning applications are undeniable. Although the
training process requires extreme skill and care—e.g., it is crucial to initialize the optimization
process with a good starting point and to monitor its progress while correcting conditioning issues
as they appear [89]—the mere fact that one can do anything useful with such large, highly nonlinear
and nonconvex models is remarkable.

2.3 Formal Machine Learning Procedure


Through our case studies, we have illustrated how a process of machine learning leads to the
selection of a prediction function h through solving an optimization problem. Moving forward, it
is necessary to formalize our presentation by discussing in greater detail the principles behind the
selection process, stressing the theoretical importance of uniform laws of large numbers as well as
the practical importance of structural risk minimization.
For simplicity, we continue to focus on the problems that arise in the context of supervised
classification; i.e., we focus on the optimization of prediction functions for labeling unseen data
based on information contained in a set of labeled training data. Such a focus is reasonable as
many unsupervised and other learning techniques reduce to optimization problems of comparable
form; see, e.g., [155].

Fundamentals. Our goal is to determine a prediction function $h : \mathcal{X} \to \mathcal{Y}$ from an input space $\mathcal{X}$
to an output space $\mathcal{Y}$ such that, given $x \in \mathcal{X}$, the value $h(x)$ offers an accurate prediction about the
true output $y$. That is, our goal is to choose a prediction function that avoids rote memorization
and instead generalizes the concepts that can be learned from a given set of examples. To do this,
one should choose the prediction function $h$ by attempting to minimize a risk measure over an
adequately selected family of prediction functions [158], call it $\mathcal{H}$.
To formalize this idea, suppose that the examples are sampled from a joint probability distribution
function $P(x, y)$ that simultaneously represents the distribution $P(x)$ of inputs as well as
the conditional probability $P(y|x)$ of the label $y$ being appropriate for an input $x$. (With this view,
one often refers to the examples as samples; we use both terms throughout the rest of the paper.)
Rather than one that merely minimizes the empirical risk (2.1), one should seek to find $h$ that
yields a small expected risk of misclassification over all possible inputs, i.e., an $h$ that minimizes
\[
R(h) = \mathbb{P}[h(x) \neq y] = \mathbb{E}[\mathbb{1}[h(x) \neq y]],
\tag{2.6}
\]
where $\mathbb{P}[A]$ and $\mathbb{E}[A]$ respectively denote the probability and expected value of $A$. Such a framework
is variational since we are optimizing over a set of functions, and is stochastic since the objective
function involves an expectation.
While one may desire to minimize the expected risk (2.6), in practice one must attempt to do
so without explicit knowledge of P . Instead, the only tractable option is to construct a surrogate
problem that relies solely on the examples $\{(x_i, y_i)\}_{i=1}^n$. Overall, there are two main issues that
must be addressed: (i) how to choose the parameterized family of prediction functions $\mathcal{H}$ and (ii)
how to determine (and find) the particular prediction function $h \in \mathcal{H}$ that is optimal.

Choice of Prediction Function Family. The family of functions $\mathcal{H}$ should be determined with
three potentially competing goals in mind. First, $\mathcal{H}$ should contain prediction functions that are
able to achieve a low empirical risk over the training set, so as to avoid bias or underfitting the
data. This can be achieved by selecting a rich family of functions or by using a priori knowledge
to select a well-targeted family. Second, the gap between expected risk and empirical risk, namely,
$R(h) - R_n(h)$, should be small over all $h \in \mathcal{H}$. Generally, this gap decreases when one uses more
training examples, but, due to potential overfitting, it increases when one uses richer families of
functions (see below). This latter fact puts the second goal at odds with the first. Third, H should
be selected so that one can efficiently solve the corresponding optimization problem, the difficulty
of which may increase when one employs a richer family of functions and/or a larger training set.
Our observation about the gap between expected and empirical risk can be understood by
recalling certain laws of large numbers. For instance, when the expected risk represents a misclas-
sification probability as in (2.6), the Hoeffding inequality [75] guarantees that, with probability at
least 1 − η, one has
\[
|R(h) - R_n(h)| \leq \sqrt{ \frac{1}{2n} \log\left(\frac{2}{\eta}\right) } \quad \text{for a given } h \in \mathcal{H}.
\]

This bound offers the intuitive explanation that the gap decreases as one uses more training exam-
ples. However, this view is insufficient for our purposes since, in the context of machine learning,
h is not a fixed function! Rather, h is the variable over which one is optimizing.
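As a quick numerical illustration of the bound above, the following sketch evaluates its right-hand side for a few sample sizes. The function name `hoeffding_gap` and the choice η = 0.05 are ours and are used only for illustration.

```python
import math


def hoeffding_gap(n: int, eta: float) -> float:
    """Right-hand side of the Hoeffding bound on |R(h) - R_n(h)| for one fixed h.

    The bound holds with probability at least 1 - eta over the draw of n examples.
    """
    return math.sqrt(math.log(2.0 / eta) / (2.0 * n))


# The bound shrinks like 1/sqrt(n) as the number of training examples grows.
for n in (100, 10_000, 1_000_000):
    print(n, hoeffding_gap(n, eta=0.05))
```

Multiplying n by 100 shrinks the bound by a factor of 10. The calculation is valid only for a function h that is fixed before the examples are drawn, which is exactly the limitation noted above.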
For this reason, one often turns to uniform laws of large numbers and the concept of the Vapnik-
Chervonenkis (VC) dimension of H, a measure of the capacity of such a family of functions [158].
For the intuition behind this concept, consider, e.g., a binary classification scheme in R2 where
one assigns a label of 1 for points above a polynomial and −1 for points below. The set of linear
polynomials has a low capacity in the sense that it is only capable of accurately classifying training
points that can be separated by a line; e.g., in two variables, a linear classifier has a VC dimension
of three. A set of high-degree polynomials, on the other hand, has a high capacity since it can
accurately separate training points that are interspersed; the VC dimension of a polynomial of
degree D in d variables is \binom{d+D}{d}. That being said, the gap between empirical and expected
risk can be larger for a set of high-degree polynomials since the high capacity allows them to overfit
a given set of training data.
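To give a feel for how quickly this capacity grows with the degree, the following small sketch evaluates the binomial coefficient quoted above. The helper name `poly_vc_dimension` is ours.

```python
import math


def poly_vc_dimension(d: int, D: int) -> int:
    """VC dimension quoted in the text for polynomials of degree D in d variables."""
    return math.comb(d + D, d)


# Capacity of polynomial classifiers in the plane (d = 2) for increasing degree.
for degree in (1, 2, 5, 10):
    print(degree, poly_vc_dimension(2, degree))
```

For d = 2 and D = 1 the formula gives 3, in agreement with the value stated for a linear classifier in two variables.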
Mathematically, with the VC dimension measuring capacity, one can establish one of the most
important results in learning theory: with dH defined as the VC dimension of H, one has with
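probability at least 1 − η that
\[
\sup_{h \in \mathcal{H}} |R(h) - R_n(h)| \leq \mathcal{O}\left( \sqrt{ \frac{1}{2n} \log\left(\frac{2}{\eta}\right) + \frac{d_{\mathcal{H}}}{n} \log\left(\frac{n}{d_{\mathcal{H}}}\right) } \right). \tag{2.7}
\]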

This bound gives a more accurate picture of the dependence of the gap on the choice of H. For
example, it shows that for a fixed dH , uniform convergence is obtained by increasing the number
of training points n. However, it also shows that, for a fixed n, the gap can widen for larger dH .
Indeed, to maintain the same gap, one must increase n at the same rate if dH is increased. The
uniform convergence embodied in this result is crucial in machine learning since one wants to ensure
that the prediction system performs well with any data provided to it. In §4.4, we employ a slight
variant of this result to discuss computational trade-offs that arise in large-scale learning.²
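The qualitative behavior of (2.7) can be explored numerically. The sketch below evaluates only the expression inside the O(·), so the constant hidden by the order notation is ignored and the numbers are meaningful only relative to one another. The function name `uniform_gap_term` and the values of n, d_H, and η are ours.

```python
import math


def uniform_gap_term(n: int, d_H: int, eta: float) -> float:
    """Expression inside the O(.) of (2.7); requires n > d_H. Hidden constant ignored."""
    confidence_term = math.log(2.0 / eta) / (2.0 * n)
    capacity_term = (d_H / n) * math.log(n / d_H)
    return math.sqrt(confidence_term + capacity_term)


# Fixed capacity: the term decreases as n grows.
print([uniform_gap_term(n, 10, 0.05) for n in (10**3, 10**4, 10**5)])

# Fixed n: the term increases with the capacity d_H.
print([uniform_gap_term(10**4, d, 0.05) for d in (10, 100, 1000)])
```

Note that the capacity term depends on n and d_H only through the ratio n/d_H, which is one way to read the remark that n must grow at the same rate as d_H to keep the gap under control.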
Interestingly, one quantity that does not enter in (2.7) is the number of parameters that distin-
guish a particular member function h of the family H. In some settings such as logistic regression,
this number is essentially the same as dH , which might suggest that the task of optimizing over
h ∈ H is more cumbersome as dH increases. However, this is not always the case. Certain families
of functions are amenable to minimization despite having a very large or even infinite number of
parameters [156, Section 4.11]. For example, support vector machines [38] were designed to take
advantage of this fact [156, Theorem 10.3].
All in all, while bounds such as (2.7) are theoretically interesting and provide useful insight, they
are rarely used directly in practice since, as we have suggested in §2.1 and §2.2, it is typically easier
to estimate the gap between empirical and expected risk with cross-validation experiments. We
now present ideas underlying a practical framework that respects the trade-offs mentioned above.

Structural Risk Minimization An approach for choosing a prediction function that has proved
to be widely successful in practice is structural risk minimization [157, 156]. Rather than choose
a generic family of prediction functions (over which it would be difficult both to optimize and to
estimate the gap between empirical and expected risks), one chooses a structure, i.e., a collection of
nested function families. For instance, such a structure can be formed as a collection of subsets of
a given family H in the following manner: given a preference function Ω, choose various values of a
hyperparameter C, according to each of which one obtains the subset HC := {h ∈ H : Ω(h) ≤ C}.
Given a fixed number of examples, increasing C reduces the empirical risk (i.e., the minimum of
Rn (h) over h ∈ HC ), but, after some point, it typically increases the gap between expected and
empirical risks. This phenomenon is illustrated in Figure 2.5.

² We also note that considerably better bounds hold when one can collect statistics on actual examples, e.g.,
by determining gaps dependent on an observed variance of the risk or by considering uniform bounds restricted to
families of prediction functions that achieve a risk within a certain threshold of the optimum [55, 102, 26].

Other ways to introduce structures are to consider a regularized empirical risk Rn (h) + λΩ(h)
(an idea introduced in problem (2.3), which may be viewed as the Lagrangian for minimizing Rn (h)
subject to Ω(h) ≤ C), enlarge the dictionary in a bag-of-words representation, increase the degree
of a polynomial model function, or add to the dimension of an inner layer of a DNN.

Fig. 2.5: Illustration of structural risk minimization. Given a set of n examples, a decision function
family H, and a relative preference function Ω, the figure illustrates a typical relationship between
the expected and empirical risks corresponding to a prediction function obtained by an optimization
algorithm that minimizes an empirical risk Rn (h) subject to Ω(h) ≤ C. The optimal empirical risk
decreases when C increases. Meanwhile, the deviation between empirical and expected risk is
bounded above by a quantity—which depends on H and Ω—that increases with C. While not
shown in the figure, the value of C that offers the best guarantee on the expected risk increases
with n, i.e., the number of examples; recall (2.7).

Given such a set-up, one can avoid estimating the gap between empirical and expected risk
by splitting the available data into subsets: a training set used to produce a subset of candidate
solutions, a validation set used to estimate the expected risk for each such candidate, and a testing
set used to estimate the expected risk for the candidate that is ultimately chosen. Specifically, over
the training set, one minimizes an empirical risk measure Rn over HC for various values of C. This
results in a handful of candidate functions. The validation set is then used to estimate the expected
risk corresponding to each candidate solution, after which one chooses the function yielding the
lowest estimated risk value. Assuming a large enough range for C has been used, one often finds
that the best solution
does not correspond to the largest value of C considered; again, see Figure 2.5.
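The procedure just described can be sketched on a toy regression problem. Everything specific in the sketch is our own assumption rather than part of the paper: the nested families are piecewise-constant functions on [0, 1) with C equal cells (so C plays the role of the hyperparameter), the loss is the squared error, and the data are synthetic.

```python
import math
import random


def fit_histogram(xs, ys, bins):
    """Empirical risk minimizer among functions that are constant on `bins` equal cells of [0, 1)."""
    sums = [0.0] * bins
    counts = [0] * bins
    for x, y in zip(xs, ys):
        j = min(int(x * bins), bins - 1)
        sums[j] += y
        counts[j] += 1
    overall = sum(ys) / len(ys)
    # Cells that received no training point fall back to the overall mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]


def risk(model, xs, ys):
    """Average squared error of a fitted model on the given examples."""
    bins = len(model)
    return sum((model[min(int(x * bins), bins - 1)] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sample(n, rng):
    """Synthetic examples: a noisy sine wave (an assumption made for this illustration)."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(0)
train, valid, test = sample(200, rng), sample(200, rng), sample(200, rng)

capacities = [1, 2, 4, 8, 16, 32, 64, 128]  # each family contains the previous one
models = {C: fit_histogram(*train, C) for C in capacities}
train_risk = {C: risk(models[C], *train) for C in capacities}
valid_risk = {C: risk(models[C], *valid) for C in capacities}

best_C = min(capacities, key=valid_risk.get)   # chosen on the validation set
test_risk = risk(models[best_C], *test)        # reported on the testing set
print(best_C, train_risk[best_C], valid_risk[best_C], test_risk)
```

Because the families are nested, the training risk can only decrease as C grows, whereas the validation risk is smallest at an intermediate capacity. The testing set is touched only once, for the candidate that was ultimately selected.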
Another, albeit indirect, avenue toward risk minimization is to employ an algorithm for mini-
mizing Rn , but terminate the algorithm early, i.e., before an actual minimizer of Rn is found. In
this manner, the role of the hyperparameter is played by the training time allowed, according to
which one typically finds the relationships illustrated in Figure 2.6. Theoretical analyses related to
the idea of early stopping are much more challenging than those for other forms of structural risk
minimization. However, it is worthwhile to mention these effects since early stopping is a popular
technique in practice, and is often essential due to computational budget limitations.

Fig. 2.6: Illustration of early stopping. Prematurely stopping the optimization of the empirical risk
Rn often results in a better expected risk R. In this manner, the stopping time plays a similar role
as the hyperparameter C in the illustration of structural risk minimization in Figure 2.5.

Overall, the structural risk minimization principle has proved useful for many applications, and
can be viewed as an alternative to the approach of employing expert human knowledge mentioned in
§2.1. Rather than encoding knowledge as formal classification rules, one encodes it via preferences
for certain prediction functions over others, then explores the performance of various prediction
functions that have been optimized under the influence of such preferences.

3 Overview of Optimization Methods


We now turn our attention to the main focus of our study, namely, numerical algorithms for solving
optimization problems that arise in large-scale machine learning. We begin by formalizing our
problems of interest, which can be seen as generic statements of problems of the type described
in §2 for minimizing expected and empirical risks. We then provide an overview of two main
classes of optimization methods—stochastic and batch—that can be applied to solve such problems,
emphasizing some of the fundamental reasons why stochastic methods have inherent advantages.
We close this section with a preview of some of the advanced optimization techniques that are
discussed in detail in later sections, which borrow ideas from both stochastic and batch methods.

3.1 Formal Optimization Problem Statements


As seen in §2, optimization problems in machine learning arise through the definition of prediction
and loss functions that appear in measures of expected and empirical risk that one aims to minimize.
Our discussions revolve around the following definitions.

Prediction and Loss Functions Rather than consider a variational optimization problem over
a generic family of prediction functions, we assume that the prediction function h has a fixed form
and is parameterized by a real vector w ∈ R^d over which the optimization is to be performed.
Formally, for some given h(·; ·) : R^{d_x} × R^d → R^{d_y}, we consider the family of prediction functions

H := {h(·; w) : w ∈ R^d}.

We aim to find the prediction function in this family that minimizes the losses incurred from
inaccurate predictions. For this purpose, we assume a given loss function ℓ : R^{d_y} × R^{d_y} → R as
one that, given an input-output pair (x, y), yields the loss ℓ(h(x; w), y) when h(x; w) and y are the
predicted and true outputs, respectively.

Expected Risk Ideally, the parameter vector w is chosen to minimize the expected loss that
would be incurred from any input-output pair. To state this idea formally, we assume that losses
are measured with respect to a probability distribution P (x, y) representing the true relationship
between inputs and outputs. That is, we assume that the input-output space R^{d_x} × R^{d_y} is endowed
with P : R^{d_x} × R^{d_y} → [0, 1] and the objective function we wish to minimize is
\[
R(w) = \int_{\mathbb{R}^{d_x} \times \mathbb{R}^{d_y}} \ell(h(x; w), y) \, dP(x, y) = \mathbb{E}[\ell(h(x; w), y)]. \tag{3.1}
\]

We say that R : R^d → R yields the expected risk (i.e., expected loss) given a parameter vector w
with respect to the probability distribution P.

Empirical Risk While it may be desirable to minimize (3.1), such a goal is untenable when
one does not have complete information about P . Thus, in practice, one seeks the solution of a
problem that involves an estimate of the expected risk R. In supervised learning, one has access
(either all-at-once or incrementally) to a set of n ∈ N independently drawn input-output samples
{(x_i, y_i)}_{i=1}^n ⊆ R^{d_x} × R^{d_y}, with which one may define the empirical risk function R_n : R^d → R by
\[
R_n(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i; w), y_i). \tag{3.2}
\]

Generally speaking, minimization of Rn may be considered the practical optimization problem
of interest. For now, we consider the unregularized measure (3.2), remarking that the optimiza-
tion methods that we discuss in the subsequent sections can be applied readily when a smooth
regularization term is included. (We leave a discussion of nonsmooth regularizers until §8.)
Note that, in §2, the functions R and Rn represented misclassification error; see (2.1) and (2.6).
However, these new definitions of R and Rn measure the loss as determined by the function ℓ. We
use these latter definitions for the rest of the paper.

Simplified Notation The expressions (3.1) and (3.2) show explicitly how the expected and
empirical risks depend on the loss function, sample space or sample set, etc. However, when
discussing optimization methods, we will often employ a simplified notation that also offers some
avenues for generalizing certain algorithmic ideas. In particular, let us represent a sample (or set
of samples) by a random seed ξ; e.g., one may imagine a realization of ξ as a single sample (x, y)
from R^{d_x} × R^{d_y}, or a realization of ξ might be a set of samples {(x_i, y_i)}_{i∈S}. In addition, let us
refer to the loss incurred for a given (w, ξ) as f(w; ξ), i.e.,
f is the composition of the loss function ℓ and the prediction function h. (3.3)
In this manner, the expected risk for a given w is the expected value of this composite function
taken with respect to the distribution of ξ:
(Expected Risk) R(w) = E[f (w; ξ)]. (3.4)

In a similar manner, when given a set of realizations {ξ_[i]}_{i=1}^n of ξ corresponding to a sample set
{(x_i, y_i)}_{i=1}^n, let us define the loss incurred by the parameter vector w with respect to the ith sample
as
\[
f_i(w) := f(w; \xi_{[i]}), \tag{3.5}
\]
and then write the empirical risk as the average of the sample losses:
\[
\text{(Empirical Risk)} \qquad R_n(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w). \tag{3.6}
\]

For future reference, we use ξ[i] to denote the ith element of a fixed set of realizations of a random
variable ξ, whereas, starting in §4, we will use ξk to denote the kth element of a sequence of random
variables.
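To make the simplified notation concrete, here is a minimal sketch in which the prediction function is linear and the loss is the logistic loss. Both choices, and the three toy samples, are our own assumptions; the paper leaves h and ℓ generic.

```python
import math


def h(x, w):
    """Prediction function h(x; w); assumed linear here: the inner product of w and x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def loss(prediction, y):
    """Loss function; assumed logistic here: log(1 + exp(-y * prediction)) for y in {-1, +1}."""
    return math.log1p(math.exp(-y * prediction))


def f(w, xi):
    """Composite function of (3.3): the loss of the prediction for the sample xi = (x, y)."""
    x, y = xi
    return loss(h(x, w), y)


def empirical_risk(w, samples):
    """R_n(w) as in (3.6): the average of f_i(w) = f(w; xi_[i]) over the sample set."""
    return sum(f(w, xi) for xi in samples) / len(samples)


samples = [((1.0, 2.0), 1), ((-1.0, 0.5), -1), ((0.5, -1.0), -1)]
print(empirical_risk((0.0, 0.0), samples))
```

At w = 0 every prediction is zero, so each sample loss equals log 2 and so does the empirical risk. An optimization method only ever needs access to f and its gradient, which is what makes the notation convenient.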

3.2 Stochastic vs. Batch Optimization Methods


Let us now introduce some fundamental optimization algorithms for minimizing risk. For the
moment, since it is the typical setting in practice, we introduce two algorithm classes in the context
of minimizing the empirical risk measure Rn in (3.6). Note, however, that much of our later
discussion will focus on the performance of algorithms when considering the true measure of interest,
namely, the expected risk R in (3.4).
Optimization methods for machine learning fall into two broad categories. We refer to them as
stochastic and batch. The prototypical stochastic optimization method is the stochastic gradient
method (SG) [130], which, in the context of minimizing R_n and with w_1 ∈ R^d given, is defined by

\[
w_{k+1} \leftarrow w_k - \alpha_k \nabla f_{i_k}(w_k). \tag{3.7}
\]

Here, for all k ∈ N := {1, 2, . . . }, the index i_k (corresponding to the seed ξ_[i_k], i.e., the sample pair
(x_{i_k}, y_{i_k})) is chosen randomly from {1, . . . , n} and α_k is a positive stepsize. Each iteration of this
method is thus very cheap, involving only the computation of the gradient ∇f_{i_k}(w_k) corresponding
to one sample. The method is notable in that the iterate sequence is not determined uniquely by the
function R_n, the starting point w_1, and the sequence of stepsizes {α_k}, as it would be in a deterministic
optimization algorithm. Rather, {w_k} is a stochastic process whose behavior is determined by the
random sequence {i_k}. Still, as we shall see in our analysis in §4, while each direction −∇f_{i_k}(w_k)
might not be one of descent from w_k (in the sense of yielding a negative directional derivative for
R_n from w_k), if it is a descent direction in expectation, then the sequence {w_k} can be guided
toward a minimizer of R_n.
For many in the optimization research community, a batch approach is a more natural and well-
known idea. The simplest such method in this class is the steepest descent algorithm—also referred
to as the gradient, batch gradient, or full gradient method—which is defined by the iteration
\[
w_{k+1} \leftarrow w_k - \alpha_k \nabla R_n(w_k) = w_k - \frac{\alpha_k}{n} \sum_{i=1}^{n} \nabla f_i(w_k). \tag{3.8}
\]

Computing the step −αk ∇Rn (wk ) in such an approach is more expensive than computing the step
−αk ∇fik (wk ) in SG, though one may expect that a better step is computed when all samples are
considered in an iteration.
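The two iterations (3.7) and (3.8) can be placed side by side in a short sketch. The least-squares sample loss f_i(w) = ½(w·x_i − y_i)², the synthetic data, and the stepsize 0.05 are our own assumptions, chosen only to keep the example small.

```python
import random


def grad_fi(w, sample):
    """Gradient of the assumed sample loss f_i(w) = 0.5 * (w . x_i - y_i)^2."""
    x, y = sample
    residual = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [residual * xj for xj in x]


def sg_step(w, samples, alpha, rng):
    """One iteration of (3.7): a step along the gradient of one randomly chosen f_i."""
    g = grad_fi(w, rng.choice(samples))
    return [wj - alpha * gj for wj, gj in zip(w, g)]


def batch_step(w, samples, alpha):
    """One iteration of (3.8): a step along the average gradient over all n samples."""
    n = len(samples)
    g = [0.0] * len(w)
    for s in samples:
        g = [gj + gij / n for gj, gij in zip(g, grad_fi(w, s))]
    return [wj - alpha * gj for wj, gj in zip(w, g)]


def empirical_risk(w, samples):
    return sum(0.5 * (sum(wj * xj for wj, xj in zip(w, x)) - y) ** 2
               for x, y in samples) / len(samples)


rng = random.Random(0)
w_true = [2.0, -1.0]
samples = []
for _ in range(500):
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    y = sum(a * b for a, b in zip(w_true, x)) + rng.gauss(0.0, 0.1)
    samples.append((x, y))

# Give both methods the same budget of n sample-gradient evaluations.
w_sg = [0.0, 0.0]
for _ in range(len(samples)):
    w_sg = sg_step(w_sg, samples, alpha=0.05, rng=rng)   # n cheap steps
w_batch = batch_step([0.0, 0.0], samples, alpha=0.05)    # one expensive step

print(empirical_risk(w_sg, samples), empirical_risk(w_batch, samples))
```

With an equal budget of sample-gradient evaluations, SG takes n steps where the batch method takes one. On this toy problem that is enough for SG to reach a much lower empirical risk, although the outcome depends on the stepsize and on the problem.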

Stochastic and batch approaches offer different trade-offs in terms of per-iteration costs and
expected per-iteration improvement in minimizing empirical risk. Why, then, has SG risen to such
prominence in the context of large-scale machine learning? Understanding the reasoning behind
this requires careful consideration of the computational trade-offs between stochastic and batch
methods, as well as a deeper look into their abilities to guarantee improvement in the underlying
expected risk R. We start to investigate these topics in the next subsection.
We remark in passing that the stochastic and batch approaches mentioned here have analogues
in the simulation and stochastic optimization communities, where they are referred to as stochastic
approximation (SA) and sample average approximation (SAA), respectively [63].

Inset 3.1: Herbert Robbins and Stochastic Approximation


The paper by Robbins and Monro [130] represents a landmark in the history of numerical
optimization methods. Together with the invention of back propagation [134, 135], it also
represents one of the most notable developments in the field of machine learning. The SG
method was first proposed in [130], not as a gradient method, but as a Markov chain.
Viewed more broadly, the works by Robbins and Monro [130] and Kalman [83] mark the
beginning of the field of stochastic approximation, which studies the behavior of iterative meth-
ods that use noisy signals. The initial focus on optimization led to the study of algorithms
that track the solution of the ordinary differential equation ẇ = −∇F (w). Stochastic approx-
imation theory has had a major impact in signal processing and in areas closer to the subject
of this paper, such as pattern recognition [4] and neural networks [20].
After receiving his PhD, Herbert Robbins became a lecturer at New York University, where
he co-authored with Richard Courant the popular book What is Mathematics? [39], which is
still in print after more than seven decades [40]. Robbins went on to become one of the
most prominent mathematicians of the second half of the twentieth century, known for his
contributions to probability, algebra, and graph theory.

3.3 Motivation for Stochastic Methods


Before discussing the strengths of stochastic methods such as SG, one should not lose sight of
the fact that batch approaches possess some intrinsic advantages. First, when one has reduced
the stochastic problem of minimizing the expected risk R to focus exclusively on the deterministic
problem of minimizing the empirical risk Rn , the use of full gradient information at each iterate
opens the door for many deterministic gradient-based optimization methods. That is, in a batch
approach, one has at their disposal the wealth of nonlinear optimization techniques that have been
developed over the past decades, including the full gradient method (3.8), but also accelerated
gradient, conjugate gradient, quasi-Newton, and inexact Newton methods [114]. (See §6 and §7 for
discussion of these techniques.) Second, due to the sum structure of Rn , a batch method can easily
benefit from parallelization since the bulk of the computation lies in evaluations of Rn and ∇Rn .
Calculations of these quantities can even be done in a distributed manner.
Despite these advantages, there are intuitive, practical, and theoretical reasons for following a
stochastic approach. Let us motivate them by contrasting the hallmark SG iteration (3.7) with the
full batch gradient iteration (3.8).

Intuitive Motivation On an intuitive level, SG employs information more efficiently than a
batch method. To see this, consider a situation in which a training set, call it S, consists of ten
copies of a set Ssub . A minimizer of empirical risk for the larger set S is clearly given by a minimizer
for the smaller set Ssub , but if one were to apply a batch approach to minimize Rn over S, then
each iteration would be ten times more expensive than if one only had one copy of Ssub . On the
other hand, SG performs the same computations in both scenarios, in the sense that the stochastic
gradient computations involve choosing elements from Ssub with the same probabilities. In reality,
a training set typically does not consist of exact duplicates of sample data, but in many large-scale
applications the data does involve a good deal of (approximate) redundancy. This suggests that
using all of the sample data in every optimization iteration is inefficient.
A similar conclusion can be drawn by recalling the discussion in §2 related to the use of training,
validation, and testing sets. If one believes that working with only, say, half of the data in the
training set is sufficient to make good predictions on unseen data, then one may argue against
working with the entire training set in every optimization iteration. Repeating this argument,
working with only a quarter of the training set may be useful at the start, or even with only an
eighth of the data, and so on. In this manner, we arrive at motivation for the idea that working
with small samples, at least initially, can be quite appealing.
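The duplicated-data argument above is easy to verify on a toy problem. In the sketch below, the scalar quadratic losses f_i(w) = ½(w − a_i)² and the four data values are our own assumptions.

```python
def grad_fi(w, a):
    """Gradient of the assumed scalar sample loss f_i(w) = 0.5 * (w - a_i)^2."""
    return w - a


def batch_gradient(w, data):
    """Full gradient of R_n at w, together with the number of sample gradients evaluated."""
    return sum(grad_fi(w, a) for a in data) / len(data), len(data)


S_sub = [-1.0, 0.0, 0.5, 2.5]
S = S_sub * 10  # a training set made of ten copies of S_sub

g_sub, cost_sub = batch_gradient(1.0, S_sub)
g_full, cost_full = batch_gradient(1.0, S)
print(g_sub, cost_sub)
print(g_full, cost_full)
```

The batch gradient over S coincides with the one over S_sub but costs ten times as many sample-gradient evaluations. An SG step, by contrast, draws each distinct value with the same probability in both cases, so its cost and its distribution are unchanged.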

Practical Motivation The intuitive benefits just described have been observed repeatedly in
practice, where one often finds very real advantages of SG in many applications. As an example,
Figure 3.1 compares the performance of a batch L-BFGS method [97, 113] (see §6) and the SG
method (3.7) with a constant stepsize (i.e., αk = α for all k ∈ N) on a binary classification problem
using a logistic loss objective function and the data from the RCV1 dataset mentioned in §2.1.
The figure plots the empirical risk Rn as a function of the number of accesses of a sample from
the training set, i.e., the number of evaluations of a sample gradient ∇fik (wk ). Each set of n
consecutive accesses is called an epoch. The batch method performs only one step per epoch while
SG performs n steps per epoch. The plot shows the behavior over the first 10 epochs. The advantage
of SG is striking and representative of typical behavior in practice. (One should note, however, that
to obtain such efficient behavior, it was necessary to run SG repeatedly using different choices for
the stepsize α until a good choice was identified for this particular problem. We discuss theoretical
and practical issues related to the choice of stepsize in our analysis in §4.)
At this point, it is worthwhile to mention that the fast initial improvement achieved by SG,
followed by a drastic slowdown after 1 or 2 epochs, is common in practice and is fairly well under-
stood. An intuitive way to explain this behavior is by considering the following example due to
Bertsekas [15].

Example 3.1. Suppose that each f_i in (3.6) is a convex quadratic with minimal value at zero
and minimizers w_{i,∗} evenly distributed in [−1, 1] such that the minimizer of R_n is w∗ = 0; see
Figure 3.2. At w_1 ≪ −1, SG will, with certainty, move to the right (toward w∗). Indeed, even if
a subsequent iterate lies slightly to the right of the minimizer w_{1,∗} of the “leftmost” quadratic, it is
likely (but not certain) that SG will continue moving to the right. However, as iterates near w∗,
the algorithm enters a region of confusion in which there is a significant chance that a step will not
move toward w∗ . In this manner, progress will slow significantly. Only with more complete gradient
information could the method know with certainty how to move toward w∗ .
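The region of confusion in Example 3.1 can be quantified with a short calculation. The sketch assumes the specific quadratics f_i(w) = ½(w − w_{i,∗})² with n = 21 evenly spaced minimizers; the example itself only requires convex quadratics with minimal value zero.

```python
def prob_step_toward_minimizer(w, minimizers, w_star=0.0):
    """Fraction of indices i for which the SG direction -f_i'(w) points from w toward w_star.

    Assumes f_i(w) = 0.5 * (w - w_i)^2, so that -f_i'(w) = w_i - w.
    """
    toward = 1.0 if w_star > w else -1.0
    return sum(1 for wi in minimizers if (wi - w) * toward > 0) / len(minimizers)


n = 21
minimizers = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]  # evenly spaced in [-1, 1], mean 0

for w in (-2.0, -0.55, -0.05):
    print(w, prob_step_toward_minimizer(w, minimizers))
```

Far to the left of all minimizers every sample gradient agrees on the direction, so the probability is one. As w approaches w∗ = 0 the probability falls toward one half, which is the loss of reliable directional information that slows SG down.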

Despite the issues illustrated by this example, we shall see in §4 that one can nevertheless ensure
convergence by employing a sequence of diminishing stepsizes to overcome any oscillatory behavior
of the algorithm.

[Figure 3.1 plot: empirical risk (vertical axis, 0 to 0.6) versus accessed data points (horizontal axis, 0 to 4 × 10^5), with one curve labeled LBFGS and one labeled SGD.]

Fig. 3.1: Empirical risk Rn as a function of the number of accessed data points (ADP) for a batch
L-BFGS method and the stochastic gradient (SG) method (3.7) on a binary classification problem
with a logistic loss objective and the RCV1 dataset. SG was run with a fixed stepsize of α = 4.

[Figure 3.2 sketch: a horizontal axis with marks at w1, −1, w1,∗, and 1.]

Fig. 3.2: Simple illustration to motivate the fast initial behavior of the SG method for minimizing
empirical risk (3.6), where each fi is a convex quadratic. This example is adapted from [15].

Theoretical Motivation One can also cite theoretical arguments for a preference of SG over a
batch approach. Let us give a preview of these arguments now, which are studied in more depth
and further detail in §4.

• It is well known that a batch approach can minimize Rn at a fast rate; e.g., if Rn is strongly
convex (see Assumption 4.5) and one applies a batch gradient method, then there exists a
constant ρ ∈ (0, 1) such that, for all k ∈ N, the training error satisfies

\[
R_n(w_k) - R_n^* \leq \mathcal{O}(\rho^k), \tag{3.9}
\]

where R_n^* denotes the minimal value of R_n. The rate of convergence exhibited here is referred
to as R-linear convergence in the optimization literature [117] and geometric convergence in
the machine learning research community; we shall simply refer to it as linear convergence.
From (3.9), one can conclude that, in the worst case, the total number of iterations in which
the training error can be above a given ε > 0 is proportional to log(1/ε). This means that, with
a per-iteration cost proportional to n (due to the need to compute ∇R_n(w_k) for all k ∈ N),
the total work required to obtain ε-optimality for a batch gradient method is proportional to
n log(1/ε).

• The rate of convergence of a basic stochastic method is slower than for a batch gradient
method; e.g., if Rn is strictly convex and each ik is drawn uniformly from {1, . . . , n}, then,
for all k ∈ N, the SG iterates defined by (3.7) satisfy the sublinear convergence property (see
Theorem 4.7)
\[
\mathbb{E}[R_n(w_k) - R_n^*] = \mathcal{O}(1/k). \tag{3.10}
\]
However, it is crucial to note that neither the per-iteration cost nor the right-hand side of
(3.10) depends on the sample set size n. This means that the total work required to obtain
ε-optimality for SG is proportional to 1/ε. Admittedly, this can be larger than n log(1/ε)
for moderate values of n and ε, but, as discussed in detail in §4.4, the comparison favors
SG when one moves to the big data regime where n is large and one is merely limited by a
computational time budget.

• Another important feature of SG is that, in a stochastic optimization setting, it yields the
same convergence rate as in (3.10) for the error in expected risk, R − R∗, where R∗ is the
minimal value of R. Specifically, by applying the SG iteration (3.7), but with ∇f_{i_k}(w_k)
replaced by ∇f(w_k; ξ_k) with each ξ_k drawn independently according to the distribution P,
one finds that
\[
\mathbb{E}[R(w_k) - R^*] = \mathcal{O}(1/k); \tag{3.11}
\]
again a sublinear rate, but on the expected risk. Moreover, in this context, a batch approach
is not even viable without the ability to compute ∇R. Of course, this represents a different
setting than one in which only a finite training set is available, but it reveals that if n is large
with respect to k, then the behavior of SG in terms of minimizing the empirical risk Rn or
the expected risk R is practically indistinguishable up to iteration k. This property cannot
be claimed by a batch method.
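The work estimates in the first two bullets can be compared numerically. The sketch below keeps only the stated proportionalities, n log(1/ε) for the batch gradient method and 1/ε for SG; all constants (which in practice depend on quantities such as the conditioning of the problem) are set to one, so only the trend is meaningful. The function names are ours.

```python
import math


def batch_work(n: int, eps: float) -> float:
    """Work to reach eps-optimality with a batch gradient method, up to constants."""
    return n * math.log(1.0 / eps)


def sg_work(eps: float) -> float:
    """Work to reach eps-optimality with SG, up to constants; independent of n."""
    return 1.0 / eps


eps = 1e-4
for n in (10**3, 10**6, 10**9):
    print(n, batch_work(n, eps), sg_work(eps))
```

With ε = 10⁻⁴, the batch estimate is slightly smaller than the SG estimate for n = 10³ (about 9.2 × 10³ versus 10⁴), but it is larger by orders of magnitude once n reaches 10⁶, in line with the remark that the comparison favors SG in the big data regime.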

In summary, there are intuitive, practical, and theoretical arguments in favor of stochastic over
batch approaches in optimization methods for large-scale machine learning. For these reasons,
and since SG is used so pervasively by practitioners, we frame our discussions about optimization
methods in the context of their relationship with SG. We do not claim, however, that batch methods
have no place in practice. For one thing, if Figure 3.1 were to consider a larger number of epochs,
then one would see the batch approach eventually overtake the stochastic method and yield a
lower training error. This motivates why many recently proposed methods try to combine the best
properties of batch and stochastic algorithms. Moreover, the SG iteration is difficult to parallelize
and requires excessive communication between nodes in a distributed computing setting, providing
further impetus for the design of new and improved optimization algorithms.

3.4 Beyond SG: Noise Reduction and Second-Order Methods


Looking forward, one of the main questions being asked by researchers and practitioners alike is:
what lies beyond SG that can serve as an efficient, reliable, and easy-to-use optimization method
for the kinds of applications discussed in §2?

To answer this question, we depict in Figure 3.3 methods that aim to improve upon SG as lying
on a two-dimensional plane. At the origin of this organizational scheme is SG, representing the
base from which all other methods may be compared.

[Figure 3.3 diagram: four corners labeled stochastic gradient method (upper left), batch gradient method (upper right), stochastic Newton method (lower left), and batch Newton method (lower right); the horizontal direction is labeled noise reduction methods and the vertical direction second-order methods.]

Fig. 3.3: Schematic of a two-dimensional spectrum of optimization methods for machine learn-
ing. The horizontal axis represents methods designed to control stochastic noise; the second axis,
methods that deal with ill conditioning.

From the origin along the horizontal axis, we place methods that are neither purely stochastic
nor purely batch, but attempt to combine the best properties of both approaches. For example,
observing the iteration (3.7), one quickly realizes that there is no particular reason to employ infor-
mation from only one sample point per iteration. Instead, one can employ a mini-batch approach
in which a small subset of samples, call it S_k ⊆ {1, . . . , n}, is chosen randomly in each iteration,
leading to
\[
w_{k+1} \leftarrow w_k - \frac{\alpha_k}{|S_k|} \sum_{i \in S_k} \nabla f_i(w_k). \tag{3.12}
\]

Such an approach falls under the framework set out by Robbins and Monro [130], and allows some
degree of parallelization to be exploited in the computation of mini-batch gradients. In addition,
one often finds that, due to the reduced variance of the stochastic gradient estimates, the method
is easier to tune in terms of choosing the stepsizes {αk }. Such a mini-batch SG method has been
widely used in practice.
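The variance reduction offered by the mini-batch iteration (3.12) can be observed directly. The sketch assumes scalar quadratic sample losses f_i(w) = ½(w − a_i)², synthetic Gaussian data, and mini-batches drawn without replacement; these choices and the helper names are ours.

```python
import random
import statistics


def minibatch_gradient(w, data, batch_size, rng):
    """Average sample gradient over a random subset S_k, as in (3.12).

    Assumes f_i(w) = 0.5 * (w - a_i)^2, whose gradient is w - a_i.
    """
    batch = rng.sample(data, batch_size)
    return sum(w - a for a in batch) / batch_size


def estimate_variance(w, data, batch_size, rng, trials=2000):
    """Empirical variance of the mini-batch gradient estimate at a fixed point w."""
    draws = [minibatch_gradient(w, data, batch_size, rng) for _ in range(trials)]
    return statistics.pvariance(draws)


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
w = 3.0

var_1 = estimate_variance(w, data, 1, rng)     # plain SG: one sample per step
var_16 = estimate_variance(w, data, 16, rng)   # mini-batch of 16 samples
print(var_1, var_16)

# One mini-batch SG step with a fixed stepsize.
alpha = 0.1
w_next = w - alpha * minibatch_gradient(w, data, 16, rng)
```

The variance of the gradient estimate falls roughly in proportion to the mini-batch size, which is one reason the stepsizes are easier to tune. The sample gradients within a mini-batch are independent of one another and can be computed in parallel.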
Along this horizontal axis, one finds other methods as well. In our investigation, we classify
two main groups as dynamic sample size and gradient aggregation methods, both of which aim to
improve the rate of convergence from sublinear to linear. These methods do not simply compute
mini-batches of fixed size, nor do they compute full gradients in every iteration. Instead, they
dynamically replace or incorporate new gradient information in order to construct a more reliable
step with smaller variance than an SG step. For this reason, we refer to the methods along the
horizontal axis as noise reduction methods. We discuss methods of this type in §5.
Along the second axis in Figure 3.3 are algorithms that, in a broad sense, attempt to overcome
the adverse effects of high nonlinearity and ill-conditioning. For such algorithms, we use the term
second-order methods, which encompasses a variety of strategies; see §6. We discuss well known
inexact Newton and quasi-Newton methods, as well as (generalized) Gauss-Newton methods [14,
141], the natural gradient method [5], and scaled gradient iterations [152, 54].
We caution that the schematic representation of methods presented in Figure 3.3 should not be
taken too literally since it is not possible to truly organize algorithms so simply, or to include all
methods along only two such axes. For example, one could argue that iterate averaging methods
do not neatly belong in the category of second-order methods, even though we place them there,
and one could argue that gradient methods with momentum [123] or acceleration [107, 108] do
belong in this category, even though we discuss them separately in §7. Nevertheless, Figure 3.3
provides a useful road map as we describe and analyze a large collection of optimization methods
of various forms and characteristics. Moreover, our two-dimensional roadmap is useful in that it
suggests that optimization methods do not need to exist along the coordinate axes only; e.g., a batch
Newton method is placed at the lower-right corner, and one may consider various combinations of
second-order and noise reduction schemes.

4 Analyses of Stochastic Gradient Methods


In this section, we provide insights into the behavior of a stochastic gradient method (SG) by
establishing its convergence properties and worst-case iteration complexity bounds. A preview of
such properties was given in (3.10)–(3.11), but now we prove these and other interesting results in
detail, all within the context of a generalized SG algorithm. We start by analyzing our SG algorithm
when it is invoked to minimize a strongly convex objective function, where it is possible to establish
a global rate of convergence to the optimal objective value. This is followed by analyses when our
SG algorithm is employed to minimize a generic nonconvex objective. To emphasize the generality
of the results proved in this section, we remark that the objective function under consideration
could be the expected risk (3.4) or empirical risk (3.6); i.e., we refer to the objective function
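$F : \mathbb{R}^d \to \mathbb{R}$, which represents either

$$
F(w) =
\begin{cases}
R(w) = \mathbb{E}[f(w;\xi)] & \text{(expected risk)} \\
\quad \text{or} \\
R_n(w) = \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} f_i(w) & \text{(empirical risk).}
\end{cases}
\tag{4.1}
$$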

Our analyses apply equally to both objectives; the only difference lies in the way that one picks
the stochastic gradient estimates in the method.3
We define our generalized SG method as Algorithm 4.1. The algorithm merely presumes that
three computational tools exist: (i) a mechanism for generating a realization of a random variable
ξk (with {ξk } representing a sequence of jointly independent random variables); (ii) given an iterate
wk ∈ Rd and the realization of ξk , a mechanism for computing a stochastic vector g(wk , ξk ) ∈ Rd ;
and (iii) given an iteration number k ∈ N, a mechanism for computing a scalar stepsize αk > 0.
3
Picking samples uniformly from a finite training set, replacing them into the set for each iteration, corresponds
to sampling from a discrete distribution giving equal weight to every sample. In this case, the SG algorithm in
this section optimizes the empirical risk F = Rn . Alternatively, picking samples in each iteration according to the
distribution P , the SG algorithm optimizes the expected risk F = R. One could also imagine picking samples without
replacement until one exhausts a finite training set. In this case, the SG algorithm here can be viewed as optimizing
either Rn or R, but only until the training set is exhausted. After that point, our analyses no longer apply. Generally
speaking, the analyses of such incremental algorithms often require specialized techniques [15, 72].

Algorithm 4.1 Stochastic Gradient (SG) Method
1: Choose an initial iterate w1 .
2: for k = 1, 2, . . . do
3: Generate a realization of the random variable ξk .
4: Compute a stochastic vector g(wk , ξk ).
5: Choose a stepsize αk > 0.
6: Set the new iterate as wk+1 ← wk − αk g(wk , ξk ).
7: end for
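The loop of Algorithm 4.1 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `sg` and the callables `sample_xi`, `direction`, and `stepsize` are hypothetical stand-ins for the three computational tools (i)–(iii), and the toy objective F(w) = ½‖w‖² with additive Gaussian noise is an assumption made only for the usage example.

```python
import random

def sg(w1, sample_xi, direction, stepsize, num_iters):
    """Generic SG loop mirroring Algorithm 4.1 (all names are illustrative)."""
    w = list(w1)                          # Step 1: initial iterate w_1
    for k in range(1, num_iters + 1):
        xi = sample_xi()                  # Step 3: realization of xi_k
        g = direction(w, xi)              # Step 4: stochastic vector g(w_k, xi_k)
        alpha = stepsize(k)               # Step 5: stepsize alpha_k > 0
        w = [wi - alpha * gi for wi, gi in zip(w, g)]  # Step 6: update
    return w

# Toy usage (an assumption): F(w) = 0.5 * ||w||^2, so grad F(w) = w,
# and g(w, xi) = w + xi with zero-mean Gaussian noise xi.
rng = random.Random(0)
w_final = sg(
    w1=[1.0, -2.0],
    sample_xi=lambda: [rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)],
    direction=lambda w, xi: [wi + ni for wi, ni in zip(w, xi)],
    stepsize=lambda k: 1.0 / (1.0 + k),
    num_iters=2000,
)
```

Passing the stepsize as a function of k keeps the fixed and diminishing stepsize regimes analyzed later in this section under one interface.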

The generality of Algorithm 4.1 can be seen in various ways. First, the value of the random
variable ξk need only be viewed as a seed for generating a stochastic direction; as such, a realization
of it may represent the choice of a single training sample as in the simple SG method stated as (3.7),
or may represent a set of samples as in the mini-batch SG method (3.12). Second, g(wk , ξk ) could
represent a stochastic gradient—i.e., an unbiased estimator of ∇F (wk ), as in the classical method
of Robbins and Monro [130]—or it could represent a stochastic Newton or quasi-Newton direction;
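see §6. That is, our analyses cover the choices

$$
g(w_k, \xi_k) =
\begin{cases}
\nabla f(w_k; \xi_k) \\[4pt]
\dfrac{1}{n_k} \displaystyle\sum_{i=1}^{n_k} \nabla f(w_k; \xi_{k,i}) \\[4pt]
H_k \dfrac{1}{n_k} \displaystyle\sum_{i=1}^{n_k} \nabla f(w_k; \xi_{k,i}),
\end{cases}
\tag{4.2}
$$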

where, for all k ∈ N, one has flexibility in the choice of mini-batch size nk and symmetric positive
definite scaling matrix Hk . No matter what choice is made, we shall come to see that all of our
theoretical results hold as long as the expected angle between g(wk , ξk ) and ∇F (wk ) is sufficiently
positive. Third, Algorithm 4.1 allows various choices of the stepsize sequence {αk }. Our analyses
focus on two choices, one involving a fixed stepsize and one involving diminishing stepsizes, as both
are interesting in theory and in practice. Finally, we note that Algorithm 4.1 also covers active
learning techniques in which the iterate wk influences the sample selection.4
Notwithstanding all of this generality, we henceforth refer to Algorithm 4.1 as SG. The particular
instance (3.7) will be referred to as simple SG or basic SG, whereas the instance (3.12) will be
referred to as mini-batch SG.
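To make the three choices in (4.2) concrete, the following sketch evaluates them for a small least-squares problem. The data `A`, `b`, the loss f_i(w) = ½(a_i·w − b_i)², and the diagonal form of the scaling matrix H_k are assumptions chosen for illustration only; they are not taken from the text.

```python
import random

# Hypothetical toy data for f_i(w) = 0.5 * (a_i . w - b_i)^2.
A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, -1.0]]
b = [1.0, 2.0, 0.0, 1.0]

def grad_fi(w, i):
    """Gradient of f_i at w: (a_i . w - b_i) * a_i."""
    r = sum(aj * wj for aj, wj in zip(A[i], w)) - b[i]
    return [r * aj for aj in A[i]]

def minibatch_gradient(w, batch):
    """Second row of (4.2); a batch of size one gives the first row."""
    total = [0.0] * len(w)
    for i in batch:
        total = [t + x for t, x in zip(total, grad_fi(w, i))]
    return [t / len(batch) for t in total]

def scaled_minibatch_gradient(w, batch, h_diag):
    """Third row of (4.2), assuming a diagonal positive definite H_k."""
    g = minibatch_gradient(w, batch)
    return [h * x for h, x in zip(h_diag, g)]

w = [0.5, -0.5]
full = minibatch_gradient(w, range(len(b)))             # gradient of R_n
rng = random.Random(1)
g_single = minibatch_gradient(w, [rng.randrange(len(b))])
g_batch = minibatch_gradient(w, rng.sample(range(len(b)), 2))
g_scaled = scaled_minibatch_gradient(w, [0, 3], h_diag=[0.5, 0.25])
```

Averaging `grad_fi` over all indices reproduces `full`, which is the unbiasedness property that the first two rows of (4.2) rely on under uniform sampling.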
Beyond our convergence and complexity analyses, a complete appreciation for the properties of
SG is not possible without highlighting its theoretical advantages over batch methods in terms of
computational complexity. Thus, we include in §4.4 a discussion of the trade-offs between
rate of convergence and computational effort among prototypical stochastic and batch methods for
large-scale learning.
4
We have assumed that the elements of the random variable sequence {ξk } are independent in order to avoid
requiring certain machinery from the analyses of stochastic processes. Viewing ξk as a seed instead of a sample
during iteration k makes this restriction minor. However, it is worthwhile to mention that all of the results in this
section still hold if, instead, {ξk } forms an adapted (non-anticipating) stochastic process and expectations taken with
respect to ξk are replaced by expectations taken with respect to the conditional distribution of ξk given {ξ1 , . . . , ξk−1 }.

4.1 Two Fundamental Lemmas
Our approach for establishing convergence guarantees for SG is built upon an assumption of smooth-
ness of the objective function. (Alternative foundations are possible; see Appendix A.) This, and
an assumption about the first and second moments of the stochastic vectors {g(wk , ξk )} lead to two
fundamental lemmas from which all of our results will be derived.
Our first assumption is formally stated as the following. Recall that, as already mentioned in
(4.1), F can stand for either expected or empirical risk.

Assumption 4.1 (Lipschitz-continuous objective gradients). The objective function
$F : \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable and the gradient function of $F$, namely,
$\nabla F : \mathbb{R}^d \to \mathbb{R}^d$, is Lipschitz continuous with Lipschitz constant $L > 0$, i.e.,

$$
\|\nabla F(w) - \nabla F(\bar{w})\|_2 \le L \|w - \bar{w}\|_2 \quad \text{for all } \{w, \bar{w}\} \subset \mathbb{R}^d .
$$

Intuitively, Assumption 4.1 ensures that the gradient of F does not change arbitrarily quickly
with respect to the parameter vector. Such an assumption is essential for convergence analyses of
most gradient-based methods; without it, the gradient would not provide a good indicator for how
far to move to decrease F . An important consequence of Assumption 4.1 is that

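$$
F(w) \le F(\bar{w}) + \nabla F(\bar{w})^T (w - \bar{w}) + \tfrac{1}{2} L \|w - \bar{w}\|_2^2 \quad \text{for all } \{w, \bar{w}\} \subset \mathbb{R}^d . \tag{4.3}
$$

This inequality is proved in Appendix B, but note that it also follows immediately if $F$ is twice
continuously differentiable and the Hessian function $\nabla^2 F : \mathbb{R}^d \to \mathbb{R}^{d \times d}$ satisfies $\|\nabla^2 F(w)\|_2 \le L$
for all $w \in \mathbb{R}^d$.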
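As a quick numerical sanity check of (4.3), the sketch below evaluates the quadratic upper model on a grid for a one-dimensional example. The choice F(w) = log(1 + e^w), whose second derivative is bounded by 1/4, is an assumption made for illustration; the helper names are hypothetical.

```python
import math

def F(w):
    """Softplus, F(w) = log(1 + exp(w)); its derivative is (1/4)-Lipschitz."""
    return math.log1p(math.exp(w))

def dF(w):
    """Derivative of softplus: the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-w))

L = 0.25  # sup of F''(w) = s(w) * (1 - s(w)), attained at w = 0

def upper_model(w, w_bar):
    """Right-hand side of (4.3), built at the reference point w_bar."""
    return F(w_bar) + dF(w_bar) * (w - w_bar) + 0.5 * L * (w - w_bar) ** 2

grid = [x / 4.0 for x in range(-20, 21)]
# Smallest slack of the bound over all pairs (w, w_bar) on the grid.
gap = min(upper_model(w, wb) - F(w) for w in grid for wb in grid)
```

The slack is nonnegative on the whole grid and vanishes only when w equals the reference point, as the inequality predicts.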
Under Assumption 4.1 alone, we obtain the following lemma. In the result, we use Eξk [·] to
denote an expected value taken with respect to the distribution of the random variable ξk given wk .
Therefore, Eξk [F (wk+1 )] is a meaningful quantity since wk+1 depends on ξk through the update in
Step 6 of Algorithm 4.1.

Lemma 4.2. Under Assumption 4.1, the iterates of SG (Algorithm 4.1) satisfy the following in-
equality for all k ∈ N:

$$
\mathbb{E}_{\xi_k}[F(w_{k+1})] - F(w_k) \le -\alpha_k \nabla F(w_k)^T \mathbb{E}_{\xi_k}[g(w_k, \xi_k)] + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_{\xi_k}[\|g(w_k, \xi_k)\|_2^2]. \tag{4.4}
$$

Proof. By Assumption 4.1, the iterates generated by SG satisfy

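$$
\begin{aligned}
F(w_{k+1}) - F(w_k) &\le \nabla F(w_k)^T (w_{k+1} - w_k) + \tfrac{1}{2} L \|w_{k+1} - w_k\|_2^2 \\
&\le -\alpha_k \nabla F(w_k)^T g(w_k, \xi_k) + \tfrac{1}{2} \alpha_k^2 L \|g(w_k, \xi_k)\|_2^2 .
\end{aligned}
$$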

Taking expectations in these inequalities with respect to the distribution of ξk , and noting that
wk+1 —but not wk —depends on ξk , we obtain the desired bound.

This lemma shows that, regardless of how SG arrived at $w_k$, the expected decrease in the
objective function yielded by the $k$th step is bounded above by a quantity involving: (i) the expected
directional derivative of $F$ at $w_k$ along $-g(w_k, \xi_k)$ and (ii) the second moment of $g(w_k, \xi_k)$. For
example, if $g(w_k, \xi_k)$ is an unbiased estimate of $\nabla F(w_k)$, then it follows from Lemma 4.2 that

$$
\mathbb{E}_{\xi_k}[F(w_{k+1})] - F(w_k) \le -\alpha_k \|\nabla F(w_k)\|_2^2 + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_{\xi_k}[\|g(w_k, \xi_k)\|_2^2]. \tag{4.5}
$$

We shall see that convergence of SG is guaranteed as long as the stochastic directions and stepsizes
are chosen such that the right-hand side of (4.4) is bounded above by a deterministic quantity that
asymptotically ensures sufficient descent in F . One can ensure this in part by stating additional
requirements on the first and second moments of the stochastic directions {g(wk , ξk )}. In particular,
in order to limit the harmful effect of the last term in (4.5), we restrict the variance of g(wk , ξk ),
i.e.,
$$
\mathbb{V}_{\xi_k}[g(w_k, \xi_k)] := \mathbb{E}_{\xi_k}[\|g(w_k, \xi_k)\|_2^2] - \|\mathbb{E}_{\xi_k}[g(w_k, \xi_k)]\|_2^2 . \tag{4.6}
$$
Assumption 4.3 (First and second moment limits). The objective function and SG (Algo-
rithm 4.1) satisfy the following:
(a) The sequence of iterates {wk } is contained in an open set over which F is bounded below by
a scalar Finf .
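(b) There exist scalars $\mu_G \ge \mu > 0$ such that, for all $k \in \mathbb{N}$,

$$
\begin{aligned}
\nabla F(w_k)^T \mathbb{E}_{\xi_k}[g(w_k, \xi_k)] &\ge \mu \|\nabla F(w_k)\|_2^2 \quad \text{and} && \text{(4.7a)} \\
\|\mathbb{E}_{\xi_k}[g(w_k, \xi_k)]\|_2 &\le \mu_G \|\nabla F(w_k)\|_2 . && \text{(4.7b)}
\end{aligned}
$$

(c) There exist scalars $M \ge 0$ and $M_V \ge 0$ such that, for all $k \in \mathbb{N}$,

$$
\mathbb{V}_{\xi_k}[g(w_k, \xi_k)] \le M + M_V \|\nabla F(w_k)\|_2^2 . \tag{4.8}
$$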

The first condition, Assumption 4.3(a), merely requires the objective function to be bounded
below over the region explored by the algorithm. The second requirement, Assumption 4.3(b), states
that, in expectation, the vector −g(wk , ξk ) is a direction of sufficient descent for F from wk with a
norm comparable to the norm of the gradient. The properties in this requirement hold immediately
with µG = µ = 1 if g(wk , ξk ) is an unbiased estimate of ∇F (wk ), and are maintained if such an
unbiased estimate is multiplied by a positive definite matrix Hk that is conditionally uncorrelated
with g(wk , ξk ) given wk and whose eigenvalues lie in a fixed positive interval for all k ∈ N. The
third requirement, Assumption 4.3(c), states that the variance of g(wk , ξk ) is restricted, but in a
relatively minor manner. For example, if F is a convex quadratic function, then the variance is
allowed to be nonzero at any stationary point for F and is allowed to grow quadratically in any
direction.
All together, Assumption 4.3, combined with the definition (4.6), requires that the second
moment of g(wk , ξk ) satisfies
$$
\mathbb{E}_{\xi_k}[\|g(w_k, \xi_k)\|_2^2] \le M + M_G \|\nabla F(w_k)\|_2^2 \quad \text{with} \quad M_G := M_V + \mu_G^2 \ge \mu^2 > 0. \tag{4.9}
$$
In fact, all of our analyses in this section hold if this bound on the second moment were to be
assumed directly. (We have stated Assumption 4.3 in the form above merely to facilitate our
discussion in §5.)
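The constants in Assumption 4.3 can be computed exactly for a very simple finite-sum problem, which may help fix ideas. The sketch below assumes f_i(w) = ½(w − b_i)² in one dimension with uniform sampling of i; the data `b` and all helper names are hypothetical. In this special case the variance of the stochastic gradient does not depend on w, so M_V = 0 and (4.9) holds with equality.

```python
b = [0.0, 1.0, 2.0, 5.0]   # hypothetical data: f_i(w) = 0.5 * (w - b_i)^2
n = len(b)
b_mean = sum(b) / n

def grad_F(w):
    """Gradient of the empirical risk F(w) = (1/n) * sum_i f_i(w)."""
    return w - b_mean

def second_moment(w):
    """E[g^2] when g = f_i'(w) and i is uniform on {1, ..., n}."""
    return sum((w - bi) ** 2 for bi in b) / n

def variance(w):
    """Definition (4.6): second moment minus squared norm of the mean."""
    return second_moment(w) - grad_F(w) ** 2

mu, mu_G = 1.0, 1.0                            # unbiased estimate: (4.7) holds with 1
M = sum((bi - b_mean) ** 2 for bi in b) / n    # the variance is constant in w here
M_V = 0.0
M_G = M_V + mu_G ** 2                          # as in (4.9)

for w in [-3.0, 0.0, 2.0, 10.0]:
    assert variance(w) <= M + M_V * grad_F(w) ** 2 + 1e-9         # (4.8)
    assert second_moment(w) <= M + M_G * grad_F(w) ** 2 + 1e-9    # (4.9)
```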
The following lemma builds on Lemma 4.2 under the additional conditions now set forth in
Assumption 4.3.
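Lemma 4.4. Under Assumptions 4.1 and 4.3, the iterates of SG (Algorithm 4.1) satisfy the
following inequalities for all $k \in \mathbb{N}$:

$$
\begin{aligned}
\mathbb{E}_{\xi_k}[F(w_{k+1})] - F(w_k) &\le -\mu \alpha_k \|\nabla F(w_k)\|_2^2 + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_{\xi_k}[\|g(w_k, \xi_k)\|_2^2] && \text{(4.10a)} \\
&\le -(\mu - \tfrac{1}{2} \alpha_k L M_G) \alpha_k \|\nabla F(w_k)\|_2^2 + \tfrac{1}{2} \alpha_k^2 L M. && \text{(4.10b)}
\end{aligned}
$$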

Proof. By Lemma 4.2 and (4.7a), it follows that

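$$
\begin{aligned}
\mathbb{E}_{\xi_k}[F(w_{k+1})] - F(w_k) &\le -\alpha_k \nabla F(w_k)^T \mathbb{E}_{\xi_k}[g(w_k, \xi_k)] + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_{\xi_k}[\|g(w_k, \xi_k)\|_2^2] \\
&\le -\mu \alpha_k \|\nabla F(w_k)\|_2^2 + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_{\xi_k}[\|g(w_k, \xi_k)\|_2^2],
\end{aligned}
$$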

which is (4.10a). Assumption 4.3, giving (4.9), then yields (4.10b).

As mentioned, this lemma reveals that regardless of how the method arrived at the iterate $w_k$,
the optimization process continues in a Markovian manner in the sense that $w_{k+1}$ is a random
variable that depends only on the iterate $w_k$, the seed $\xi_k$, and the stepsize $\alpha_k$ and not on any past
iterates. This can be seen in the fact that the difference $\mathbb{E}_{\xi_k}[F(w_{k+1})] - F(w_k)$ is bounded above by
a deterministic quantity. Note also that the first term in (4.10b) is strictly negative for small $\alpha_k$ and
suggests a decrease in the objective function by a magnitude proportional to $\|\nabla F(w_k)\|_2^2$. However,
the second term in (4.10b) could be large enough to allow the objective value to increase. Balancing
these terms is critical in the design of SG methods.

4.2 SG for Strongly Convex Objectives


The most benign setting for analyzing the SG method is in the context of minimizing a strongly
convex objective function. For the reasons described in Inset 4.2, when not considering a generic
nonconvex objective F , we focus on the strongly convex case and only briefly mention the (not
strongly) convex case on certain occasions.

Inset 4.2: Perspectives on SG Analyses


All of the convergence rate and complexity results presented in this paper relate to the
minimization of strongly convex functions. This is in contrast with a large portion of the
literature on optimization methods for machine learning, in which much effort is devoted to
strengthening convergence guarantees for methods applied to functions that are convex, but not
strongly convex. We have made this choice for a few reasons. First, it leads to a focus on
results that are relevant to actual machine learning practice, since in many situations when a
convex model is employed—such as in logistic regression—it is often regularized by a strongly
convex function to facilitate the solution process. Second, there exist a variety of situations in
which the objective function is not globally (strongly) convex, but is so in the neighborhood of
local minimizers, meaning that our results can represent the behavior of the algorithm in such
regions of the search space. Third, one can argue that related results when minimizing non-
strongly convex models can be derived as extensions of the results presented here [3], making
our analyses a starting point for deriving a more general theory.
We have also taken a pragmatic approach in the types of convergence guarantees that
we provide. In particular, in our analyses, we focus on results that reveal the properties of
SG iterates in expectation. The stochastic approximation literature, on the other hand, often
relies on martingale techniques to establish almost sure convergence [66, 131] under the same
assumptions [21]. For our purposes, we omit these complications since, in our view, they do
not provide significant additional insights into the forces driving convergence of the method.

We formalize a strong convexity assumption as the following.

Assumption 4.5 (Strong convexity). The objective function $F : \mathbb{R}^d \to \mathbb{R}$ is strongly convex in
that there exists a constant $c > 0$ such that

$$
F(\bar{w}) \ge F(w) + \nabla F(w)^T (\bar{w} - w) + \tfrac{1}{2} c \|\bar{w} - w\|_2^2 \quad \text{for all } (\bar{w}, w) \in \mathbb{R}^d \times \mathbb{R}^d . \tag{4.11}
$$

Hence, $F$ has a unique minimizer, denoted as $w_* \in \mathbb{R}^d$ with $F_* := F(w_*)$.

A useful fact from convex analysis (proved in Appendix B) is that, under Assumption 4.5, one
can bound the optimality gap at a given point in terms of the squared $\ell_2$-norm of the gradient of
the objective at that point:

$$
2c(F(w) - F_*) \le \|\nabla F(w)\|_2^2 \quad \text{for all } w \in \mathbb{R}^d . \tag{4.12}
$$

We use this inequality in several proofs. We also observe that, from (4.3) and (4.11), the constants
in Assumptions 4.1 and 4.5 must satisfy c ≤ L.
We now state our first convergence theorem for SG, describing its behavior when minimizing
a strongly convex objective function when employing a fixed stepsize. In this case, it will not be
possible to prove convergence to the solution, but only to a neighborhood of the optimal value.
(Intuitively, this limitation should be clear from (4.10b) since the first term on the right-hand side
decreases in magnitude as the solution is approached—i.e., as ∇F (wk ) tends to zero—but the last
term remains constant. Thus, after some point, a reduction in the objective cannot be expected.)
We use E[·] to denote an expected value taken with respect to the joint distribution of all random
variables. For example, since wk is completely determined by the realizations of the independent
random variables {ξ1 , ξ2 , . . . , ξk−1 }, the total expectation of F (wk ) for any k ∈ N can be taken as

$$
\mathbb{E}[F(w_k)] = \mathbb{E}_{\xi_1} \mathbb{E}_{\xi_2} \cdots \mathbb{E}_{\xi_{k-1}} [F(w_k)].
$$

The theorem shows that if the stepsize is not too large, then, in expectation, the sequence of
function values {F (wk )} converges near the optimal value.

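Theorem 4.6 (Strongly Convex Objective, Fixed Stepsize). Under Assumptions 4.1, 4.3,
and 4.5 (with $F_{\inf} = F_*$), suppose that the SG method (Algorithm 4.1) is run with a fixed stepsize,
$\alpha_k = \bar{\alpha}$ for all $k \in \mathbb{N}$, satisfying

$$
0 < \bar{\alpha} \le \frac{\mu}{L M_G}. \tag{4.13}
$$

Then, the expected optimality gap satisfies the following inequality for all $k \in \mathbb{N}$:

$$
\mathbb{E}[F(w_k) - F_*] \le \frac{\bar{\alpha} L M}{2 c \mu} + (1 - \bar{\alpha} c \mu)^{k-1} \left( F(w_1) - F_* - \frac{\bar{\alpha} L M}{2 c \mu} \right)
\xrightarrow{k \to \infty} \frac{\bar{\alpha} L M}{2 c \mu}. \tag{4.14}
$$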

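Proof. Using Lemma 4.4 with (4.13) and (4.12), we have for all $k \in \mathbb{N}$ that

$$
\begin{aligned}
\mathbb{E}_{\xi_k}[F(w_{k+1})] - F(w_k) &\le -(\mu - \tfrac{1}{2} \bar{\alpha} L M_G) \bar{\alpha} \|\nabla F(w_k)\|_2^2 + \tfrac{1}{2} \bar{\alpha}^2 L M \\
&\le -\tfrac{1}{2} \bar{\alpha} \mu \|\nabla F(w_k)\|_2^2 + \tfrac{1}{2} \bar{\alpha}^2 L M \\
&\le -\bar{\alpha} c \mu (F(w_k) - F_*) + \tfrac{1}{2} \bar{\alpha}^2 L M.
\end{aligned}
$$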

Subtracting F∗ from both sides, taking total expectations, and rearranging, this yields

$$
\mathbb{E}[F(w_{k+1}) - F_*] \le (1 - \bar{\alpha} c \mu) \, \mathbb{E}[F(w_k) - F_*] + \tfrac{1}{2} \bar{\alpha}^2 L M.
$$

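Subtracting the constant $\bar{\alpha} L M / (2 c \mu)$ from both sides, one obtains

$$
\begin{aligned}
\mathbb{E}[F(w_{k+1}) - F_*] - \frac{\bar{\alpha} L M}{2 c \mu}
&\le (1 - \bar{\alpha} c \mu) \, \mathbb{E}[F(w_k) - F_*] + \frac{\bar{\alpha}^2 L M}{2} - \frac{\bar{\alpha} L M}{2 c \mu} \\
&= (1 - \bar{\alpha} c \mu) \left( \mathbb{E}[F(w_k) - F_*] - \frac{\bar{\alpha} L M}{2 c \mu} \right).
\end{aligned}
\tag{4.15}
$$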

Observe that (4.15) is a contraction inequality since, by (4.13) and (4.9),

$$
0 < \bar{\alpha} c \mu \le \frac{c \mu^2}{L M_G} \le \frac{c \mu^2}{L \mu^2} = \frac{c}{L} \le 1. \tag{4.16}
$$

The result thus follows by applying (4.15) repeatedly through iteration k ∈ N.

If $g(w_k, \xi_k)$ is an unbiased estimate of $\nabla F(w_k)$, then $\mu = 1$, and if there is no noise in $g(w_k, \xi_k)$,
then we may presume that $M_G = 1$ (due to (4.9)). In this case, (4.13) reduces to $\bar{\alpha} \in (0, 1/L]$, a
classical stepsize requirement of interest for a steepest descent method.
Theorem 4.6 illustrates the interplay between the stepsizes and bound on the variance of the
stochastic directions. If there were no noise in the gradient computation or if noise were to decay
with k∇F (wk )k22 (i.e., if M = 0 in (4.8) and (4.9)), then one can obtain linear convergence to
the optimal value. This is a standard result for the full gradient method with a sufficiently small
positive stepsize. On the other hand, when the gradient computation is noisy, one clearly loses
this property. One can still use a fixed stepsize and be sure that the expected objective values will
converge linearly to a neighborhood of the optimal value, but, after some point, the noise in the
gradient estimates prevents further progress; recall Example 3.1. It is apparent from (4.14) that
selecting a smaller stepsize worsens the contraction constant in the convergence rate, but allows
one to arrive closer to the optimal value.
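The behavior described by Theorem 4.6 can be checked without any sampling on a one-dimensional quadratic, for which the expected optimality gap obeys an exact recursion. The sketch below assumes F(w) = ½cw² and g = cw + noise with zero-mean noise of variance `sigma2`, so that L = c, μ = μ_G = 1, M_V = 0, M = `sigma2`, and M_G = 1; the function name and constants are hypothetical.

```python
def fixed_stepsize_bound(k, alpha, L, M, c, mu, gap1):
    """Right-hand side of (4.14) for k >= 1, with gap1 = F(w_1) - F_*."""
    floor = alpha * L * M / (2.0 * c * mu)
    return floor + (1.0 - alpha * c * mu) ** (k - 1) * (gap1 - floor)

# Toy problem (an assumption): F(w) = 0.5 * c * w^2, g = c * w + noise.
c, sigma2, w1 = 2.0, 0.5, 3.0
L, mu, M, M_G = c, 1.0, sigma2, 1.0
alpha = 0.5 * mu / (L * M_G)          # satisfies the stepsize condition (4.13)

gap = 0.5 * c * w1 ** 2               # exact E[F(w_1) - F_*]
gap1 = gap
exact, bound = [], []
for k in range(1, 51):
    exact.append(gap)
    bound.append(fixed_stepsize_bound(k, alpha, L, M, c, mu, gap1))
    # Exact recursion for this quadratic, obtained from
    # E[w_{k+1}^2] = (1 - alpha*c)^2 * E[w_k^2] + alpha^2 * sigma2.
    gap = (1.0 - alpha * c) ** 2 * gap + 0.5 * c * alpha ** 2 * sigma2
```

In this example the exact expected gap stays below the bound at every iteration and both level off at a nonzero value, which is the noise floor that a fixed stepsize cannot remove.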
These observations provide a foundation for a strategy often employed in practice by which
SG is run with a fixed stepsize, and, if progress appears to stall, a smaller stepsize is selected and
the process is repeated. A straightforward instance of such an approach can be motivated with
the following sketch. Suppose that $\alpha_1 \in (0, \mu/(L M_G)]$ is chosen as in (4.13) and the SG method is
run with this stepsize from iteration $k_1 = 1$ until iteration $k_2$, where $w_{k_2}$ is the first iterate at
which the expected suboptimality gap is smaller than twice the asymptotic value in (4.14), i.e.,
$\mathbb{E}[F(w_{k_2}) - F_*] \le 2 F_{\alpha_1}$, where $F_\alpha := \alpha L M / (2 c \mu)$. Suppose further that, at this point, the stepsize is halved
and the process is repeated; see Figure 4.1. This leads to the stepsize schedule $\{\alpha_{r+1}\} = \{\alpha_1 2^{-r}\}$,
index sequence $\{k_r\}$, and bound sequence $\{F_{\alpha_r}\} = \{\alpha_r L M / (2 c \mu)\} \searrow 0$ such that, for all $r \in \{2, 3, \dots\}$,

$$
\mathbb{E}[F(w_{k_{r+1}}) - F_*] \le 2 F_{\alpha_r} \quad \text{where} \quad \mathbb{E}[F(w_{k_r}) - F_*] \approx 2 F_{\alpha_{r-1}} = 4 F_{\alpha_r}. \tag{4.17}
$$

In this manner, the expected suboptimality gap converges to zero.


However, this does not occur by halving the stepsize in every iteration, but only once the gap
itself has been cut in half from a previous threshold. To see what is the appropriate effective rate
of stepsize decrease, we may invoke Theorem 4.6, from which it follows that to achieve the first
bound in (4.17) one needs

$$
(1 - \alpha_r c \mu)^{(k_{r+1} - k_r)} (4 F_{\alpha_r} - F_{\alpha_r}) \le F_{\alpha_r}
\;\implies\;
k_{r+1} - k_r \ge \frac{\log(1/3)}{\log(1 - \alpha_r c \mu)} \approx \frac{\log(3)}{\alpha_r c \mu} = O(2^r). \tag{4.18}
$$

In other words, each time the stepsize is cut in half, twice as many iterations are required.
This is a sublinear rate of stepsize decrease (e.g., if $\{k_r\} = \{2^{r-1}\}$, then $\alpha_k = \alpha_1 / k$ for all
$k \in \{2^r\}$), which, from $\{F_{\alpha_r}\} = \{\alpha_r L M / (2 c \mu)\}$ and (4.17), means that a sublinear convergence rate of the
suboptimality gap is achieved.
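The doubling of stage lengths implied by (4.18) is easy to tabulate. The following sketch computes, for each stage r with stepsize α_r = α_1 2^{−(r−1)}, the smallest integer number of iterations satisfying the left inequality in (4.18). The function name and the numerical values of α_1, c, and μ are assumptions chosen for illustration.

```python
import math

def stage_lengths(alpha1, c, mu, num_stages):
    """Smallest integer stage lengths satisfying (4.18) for alpha_r = alpha1 * 2^{-(r-1)}."""
    lengths = []
    for r in range(1, num_stages + 1):
        alpha_r = alpha1 * 2.0 ** (-(r - 1))
        # Solve (1 - alpha_r*c*mu)^n * 3 * F_alpha_r <= F_alpha_r for n.
        needed = math.log(1.0 / 3.0) / math.log(1.0 - alpha_r * c * mu)
        lengths.append(math.ceil(needed))
    return lengths

lengths = stage_lengths(alpha1=0.1, c=1.0, mu=1.0, num_stages=8)
# Ratio of consecutive stage lengths; (4.18) suggests values close to 2.
ratios = [lengths[r + 1] / lengths[r] for r in range(len(lengths) - 1)]
```

Each entry of `ratios` is close to 2, matching the observation that halving the stepsize requires holding it for about twice as many iterations.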


Fig. 4.1: Depiction of the strategy of halving the stepsize α when the expected suboptimality gap
is smaller than twice the asymptotic limit Fα . In the figure, the segment B–B′ has one third of the
length of A–A′. This is the amount of decrease that must be made in the exponential term in (4.14)
by raising the contraction factor to the power of the number of steps during which one maintains
a given constant stepsize; see (4.18). Since the contraction factor is (1 − αcµ), the number of steps
must be inversely proportional to α. Therefore, whenever the stepsize is halved, one must maintain it
twice as long. Overall, doubling the number of iterations halves the suboptimality gap each time,
yielding an effective rate of O(1/k).

In fact, these conclusions can be obtained in a more rigorous manner that also allows more
flexibility in the choice of stepsize sequence. The following result harks back to the seminal work
of Robbins and Monro [130], where the stepsize requirement takes the form

$$
\sum_{k=1}^{\infty} \alpha_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \alpha_k^2 < \infty. \tag{4.19}
$$

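Theorem 4.7 (Strongly Convex Objective, Diminishing Stepsizes). Under Assumptions 4.1,
4.3, and 4.5 (with $F_{\inf} = F_*$), suppose that the SG method (Algorithm 4.1) is run with a stepsize
sequence such that, for all $k \in \mathbb{N}$,

$$
\alpha_k = \frac{\beta}{\gamma + k} \quad \text{for some } \beta > \frac{1}{c \mu} \text{ and } \gamma > 0 \text{ such that } \alpha_1 \le \frac{\mu}{L M_G}. \tag{4.20}
$$

Then, for all $k \in \mathbb{N}$, the expected optimality gap satisfies

$$
\mathbb{E}[F(w_k) - F_*] \le \frac{\nu}{\gamma + k}, \tag{4.21}
$$

where

$$
\nu := \max \left\{ \frac{\beta^2 L M}{2 (\beta c \mu - 1)}, \; (\gamma + 1)(F(w_1) - F_*) \right\}. \tag{4.22}
$$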
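Proof. By (4.20), the inequality $\alpha_k L M_G \le \alpha_1 L M_G \le \mu$ holds for all $k \in \mathbb{N}$. Hence, along with
Lemma 4.4 and (4.12), one has for all $k \in \mathbb{N}$ that

$$
\begin{aligned}
\mathbb{E}_{\xi_k}[F(w_{k+1})] - F(w_k) &\le -(\mu - \tfrac{1}{2} \alpha_k L M_G) \alpha_k \|\nabla F(w_k)\|_2^2 + \tfrac{1}{2} \alpha_k^2 L M \\
&\le -\tfrac{1}{2} \alpha_k \mu \|\nabla F(w_k)\|_2^2 + \tfrac{1}{2} \alpha_k^2 L M \\
&\le -\alpha_k c \mu (F(w_k) - F_*) + \tfrac{1}{2} \alpha_k^2 L M.
\end{aligned}
$$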

Subtracting F∗ from both sides, taking total expectations, and rearranging, this yields

$$
\mathbb{E}[F(w_{k+1}) - F_*] \le (1 - \alpha_k c \mu) \, \mathbb{E}[F(w_k) - F_*] + \tfrac{1}{2} \alpha_k^2 L M. \tag{4.23}
$$

We now prove (4.21) by induction. First, the definition of ν ensures that it holds for k = 1. Then,
assuming (4.21) holds for some k ≥ 1, it follows from (4.23) that

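$$
\begin{aligned}
\mathbb{E}[F(w_{k+1}) - F_*]
&\le \left( 1 - \frac{\beta c \mu}{\hat{k}} \right) \frac{\nu}{\hat{k}} + \frac{\beta^2 L M}{2 \hat{k}^2}
\qquad (\text{with } \hat{k} := \gamma + k) \\
&= \left( \frac{\hat{k} - \beta c \mu}{\hat{k}^2} \right) \nu + \frac{\beta^2 L M}{2 \hat{k}^2} \\
&= \left( \frac{\hat{k} - 1}{\hat{k}^2} \right) \nu
\underbrace{- \left( \frac{\beta c \mu - 1}{\hat{k}^2} \right) \nu + \frac{\beta^2 L M}{2 \hat{k}^2}}_{\text{nonpositive by the definition of } \nu}
\le \frac{\nu}{\hat{k} + 1},
\end{aligned}
$$

where the last inequality follows because $\hat{k}^2 \ge (\hat{k} + 1)(\hat{k} - 1)$.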
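The induction above can also be checked numerically by iterating the worst case permitted by (4.23) and comparing it with ν/(γ + k) from (4.21). The sketch below is only a sanity check under hypothetical constants; `nu_constant` and the chosen values of L, M, M_G, c, μ, and the initial gap are assumptions, picked so that the stepsize condition (4.20) holds with α_1 = μ/(L M_G).

```python
def nu_constant(beta, gamma, L, M, c, mu, gap1):
    """Constant nu from (4.22), with gap1 = F(w_1) - F_*."""
    return max(beta ** 2 * L * M / (2.0 * (beta * c * mu - 1.0)),
               (gamma + 1.0) * gap1)

# Hypothetical constants chosen so that (4.20) holds.
L, M, M_G, c, mu, gap1 = 4.0, 1.0, 1.0, 1.0, 1.0, 10.0
beta = 2.0 / (c * mu)                    # beta > 1 / (c * mu)
gamma = beta * L * M_G / mu - 1.0        # makes alpha_1 = mu / (L * M_G)
nu = nu_constant(beta, gamma, L, M, c, mu, gap1)

worst = gap1                             # worst case allowed by (4.23)
ok = True
for k in range(1, 10001):
    ok = ok and worst <= nu / (gamma + k) + 1e-12   # bound (4.21)
    alpha_k = beta / (gamma + k)
    worst = (1.0 - alpha_k * c * mu) * worst + 0.5 * alpha_k ** 2 * L * M
```

With these values the flag `ok` remains true throughout, so the recursion never exceeds the O(1/k) envelope.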

Let us now remark on what can be learned from Theorems 4.6 and 4.7.

Role of Strong Convexity Observe the crucial role played by the strong convexity parameter
c > 0, the positivity of which is needed to argue that (4.15) and (4.23) contract the expected
optimality gap. However, the strong convexity constant impacts the stepsizes in different ways in
Theorems 4.6 and 4.7. In the case of constant stepsizes, the possible values of ᾱ are constrained
by the upper bound (4.13) that does not depend on c. In the case of diminishing stepsizes, the
initial stepsize α1 is subject to the same upper bound (4.20), but the stepsize parameter β must be
larger than 1/(cµ). This additional requirement is critical to ensure the O(1/k) convergence rate.
How critical? Consider, e.g., [105] in which the authors provide a simple example (with unbiased
gradient estimates and µ = 1) involving the minimization of a deterministic quadratic function with
only one optimization variable in which c is overestimated, which results in β being underestimated.
In the example, even after $10^9$ iterations, the distance to the solution remains greater than $10^{-2}$.

Role of the Initial Point Also observe the role played by the initial point, which determines the
initial optimality gap, namely, F (w1 )−F∗ . When using a fixed stepsize, the initial gap appears with
an exponentially decreasing factor; see (4.14). In the case of diminishing stepsizes, the gap appears
prominently in the second term defining ν in (4.22). However, with an appropriate initialization
phase, one can easily diminish the role played by this term.5 For example, suppose that one begins
by running SG with a fixed stepsize ᾱ until one (approximately) obtains a point, call it w1 , with
F (w1 ) − F∗ ≤ ᾱLM/(2cµ). A guarantee for this bound can be argued from (4.14). Starting here
with α1 = ᾱ, the choices for β and γ in Theorem 4.7 yield

$$
(\gamma + 1) \, \mathbb{E}[F(w_1) - F_*] \le \beta \alpha_1^{-1} \frac{\alpha_1 L M}{2 c \mu} = \frac{\beta L M}{2 c \mu} < \frac{\beta^2 L M}{2 (\beta c \mu - 1)},
$$

meaning that the value for ν is dominated by the first term in (4.22).
On a related note, we claim that for practical purposes the initial stepsize should be chosen as
large as allowed, i.e., α1 = µ/(LMG ). Given this choice of α1 , the best asymptotic regime with
decreasing stepsizes (4.21) is achieved by making ν as small as possible. Since we have argued that
only the first term matters in the definition of ν, this leads to choosing β = 2/(cµ). Under these
conditions, one has
$$
\nu = \frac{\beta^2 L M}{2 (\beta c \mu - 1)} = \frac{2}{\mu^2} \left( \frac{L}{c} \right) \left( \frac{M}{c} \right). \tag{4.24}
$$
We shall see the (potentially large) ratios L/c and M/c arise again later.
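The choice β = 2/(cμ) can be verified directly: setting the derivative of β²/(βcμ − 1) to zero gives β(βcμ − 2) = 0, and substituting β = 2/(cμ) yields the closed form in (4.24). The short sketch below confirms this numerically for hypothetical constants; the function name and the values of L, M, c, and μ are assumptions.

```python
def nu_first_term(beta, L, M, c, mu):
    """First term in the definition (4.22) of nu."""
    return beta ** 2 * L * M / (2.0 * (beta * c * mu - 1.0))

L, M, c, mu = 10.0, 3.0, 0.5, 0.8      # hypothetical constants
beta_star = 2.0 / (c * mu)             # the choice leading to (4.24)
closed_form = (2.0 / mu ** 2) * (L / c) * (M / c)   # right-hand side of (4.24)

# Evaluate the first term at a few admissible values beta > 1/(c*mu).
candidates = [beta_star * s for s in (0.6, 0.8, 1.0, 1.25, 2.0)]
values = [nu_first_term(bt, L, M, c, mu) for bt in candidates]
```

Among the candidates, the smallest value occurs at `beta_star` and agrees with `closed_form`.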

Trade-Offs of (Mini-)Batching As a final observation about what can be learned from Theo-
rems 4.6 and 4.7, let us take a moment to compare the theoretical performance of two fundamental
algorithms—the simple SG iteration (3.7) and the mini-batch SG iteration (3.12)—when these re-
sults are applied for minimizing empirical risk, i.e., when F = Rn . This provides a glimpse into
how such results can be used to compare algorithms in terms of their computational trade-offs.
The most elementary instance of our SG algorithm is simple SG, which, as we have seen, consists
of picking a random sample index ik at each iteration and computing

g(wk , ξk ) = ∇fik (wk ). (4.25)

By contrast, instead of picking a single sample, mini-batch SG consists of randomly selecting a
subset $S_k$ of the sample indices and computing

$$
g(w_k, \xi_k) = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla f_i(w_k). \tag{4.26}
$$

To compare these methods, let us assume for simplicity that we employ the same number of samples
in each iteration so that the mini-batches are of constant size, i.e., $|S_k| = n_{mb}$. There are then
two distinct regimes to consider, namely, when $n_{mb} \ll n$ and when $n_{mb} \approx n$. Our goal here is
to use the results of Theorems 4.6 and 4.7 to show that, in the former scenario, the theoretical
benefit of mini-batching can appear to be somewhat ambiguous, meaning that one must leverage
certain computational tools to benefit from mini-batching in practice. As for the scenario when
$n_{mb} \approx n$, the comparison is more complex due to a trade-off between per-iteration costs and overall
convergence rate of the method (recall §3.3). We leave a more formal treatment of this scenario,
specifically with the goals of large-scale machine learning in mind, for §4.4.
5
In fact, the bound (4.21) slightly overstates the asymptotic influence of the initial optimality gap. Applying
Chung’s lemma [36] to the contraction equation (4.23) shows that the first term in the definition of ν effectively
determines the asymptotic convergence rate of $\mathbb{E}[F(w_k) - F_*]$.
Suppose then that the mini-batch size is $n_{mb} \ll n$. The computation of the stochastic direction
g(wk , ξk ) in (4.26) is clearly nmb times more expensive than in (4.25). In return, the variance of
the direction is reduced by a factor of 1/nmb . (See §5.2 for further discussion of this fact.) That is,
with respect to our analysis, the constants M and MV that appear in Assumption 4.3 (see (4.8))
are reduced by the same factor, becoming M/nmb and MV /nmb for mini-batch SG. It is natural to
ask whether this reduction of the variance pays for the higher per-iteration cost.
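Consider, for instance, the case of employing a sufficiently small constant stepsize $\bar{\alpha} > 0$. For
mini-batch SG, Theorem 4.6 leads to

$$
\mathbb{E}[F(w_k) - F_*] \le \frac{\bar{\alpha} L M}{2 c \mu \, n_{mb}} + [1 - \bar{\alpha} c \mu]^{k-1} \left( F(w_1) - F_* - \frac{\bar{\alpha} L M}{2 c \mu \, n_{mb}} \right).
$$

Using the simple SG method with stepsize $\bar{\alpha} / n_{mb}$ leads to a similar asymptotic gap:

$$
\mathbb{E}[F(w_k) - F_*] \le \frac{\bar{\alpha} L M}{2 c \mu \, n_{mb}} + \left[ 1 - \frac{\bar{\alpha} c \mu}{n_{mb}} \right]^{k-1} \left( F(w_1) - F_* - \frac{\bar{\alpha} L M}{2 c \mu \, n_{mb}} \right).
$$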

The worse contraction constant (indicated using square brackets) means that one needs to run nmb
times more iterations of the simple SG algorithm to obtain an equivalent optimality gap. That said,
since the computation in a simple SG iteration is nmb times cheaper, this amounts to effectively
the same total computation as for the mini-batch SG method. A similar analysis employing the
result of Theorem 4.7 can be performed when decreasing stepsizes are used.
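The claim that both methods need about the same total computation can be illustrated by counting sample-gradient evaluations. The sketch below asks how many iterations each contraction factor needs to shrink the transient term by a fixed factor, then multiplies by the per-iteration cost. The constants `alpha_bar`, `c`, `mu`, `n_mb`, and `factor` are hypothetical, and the helper name is an assumption.

```python
import math

def iterations_to_shrink(contraction, factor):
    """Smallest k with contraction**k <= factor, for 0 < contraction < 1."""
    return math.ceil(math.log(factor) / math.log(contraction))

alpha_bar, c, mu, n_mb = 0.01, 1.0, 1.0, 32   # hypothetical constants
factor = 1e-3                                  # target shrinkage of the transient term

# Mini-batch SG with stepsize alpha_bar versus simple SG with alpha_bar / n_mb.
iters_mb = iterations_to_shrink(1.0 - alpha_bar * c * mu, factor)
iters_sg = iterations_to_shrink(1.0 - alpha_bar * c * mu / n_mb, factor)

grads_mb = iters_mb * n_mb      # each mini-batch step costs n_mb sample gradients
grads_sg = iters_sg * 1         # each simple SG step costs one sample gradient
```

Simple SG needs roughly `n_mb` times more iterations, yet `grads_sg` and `grads_mb` differ by well under five percent in this example.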
These observations suggest that the methods can be comparable. However, an important con-
sideration remains. In particular, the convergence theorems require that the initial stepsize be
smaller than $\mu / (L M_G)$. Since (4.9) shows that $M_G \ge \mu^2$, the largest this stepsize can be is $1/(\mu L)$.
Therefore, one cannot simply assume that the mini-batch SG method is allowed to employ a stepsize
that is nmb times larger than the one used by SG. In other words, one cannot always compensate
for the higher per-iteration cost of a mini-batch SG method by selecting a larger stepsize.
One can, however, realize benefits of mini-batching in practice since it offers important op-
portunities for software optimization and parallelization; e.g., using sizeable mini-batches is often
the only way to fully leverage a GPU processor. Dynamic mini-batch sizes can also be used as a
substitute for decreasing stepsizes; see §5.2.

4.3 SG for General Objectives


As mentioned in our case study of deep neural networks in §2.2, many important machine learning
models lead to nonconvex optimization problems, which are currently having a profound impact
in practice. Analyzing the SG method when minimizing nonconvex objectives is more challenging
than in the convex case since such functions may possess multiple local minima and other stationary
points. Still, we show in this subsection that one can provide meaningful guarantees for the SG
method in nonconvex settings.
Paralleling §4.2, we present two results—one for employing a fixed positive stepsize and one for
diminishing stepsizes. We maintain the same assumptions about the stochastic directions g(wk , ξk ),

but do not assume convexity of F . As before, the results in this section still apply to a wide class
of methods since g(wk , ξk ) could be defined as a (mini-batch) stochastic gradient or a Newton-like
direction; recall (4.2).
Our first result describes the behavior of the sequence of gradients of F when fixed stepsizes are
employed. Recall from Assumption 4.3 that the sequence of function values {F (wk )} is assumed to
be bounded below by a scalar Finf .

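Theorem 4.8 (Nonconvex Objective, Fixed Stepsize). Under Assumptions 4.1 and 4.3,
suppose that the SG method (Algorithm 4.1) is run with a fixed stepsize, $\alpha_k = \bar{\alpha}$ for all $k \in \mathbb{N}$,
satisfying

$$
0 < \bar{\alpha} \le \frac{\mu}{L M_G}. \tag{4.27}
$$

Then, the expected sum-of-squares and average-squared gradients of $F$ corresponding to the SG
iterates satisfy the following inequalities for all $K \in \mathbb{N}$:

$$
\mathbb{E}\left[ \sum_{k=1}^{K} \|\nabla F(w_k)\|_2^2 \right] \le \frac{K \bar{\alpha} L M}{\mu} + \frac{2 (F(w_1) - F_{\inf})}{\mu \bar{\alpha}} \tag{4.28a}
$$

and therefore

$$
\mathbb{E}\left[ \frac{1}{K} \sum_{k=1}^{K} \|\nabla F(w_k)\|_2^2 \right] \le \frac{\bar{\alpha} L M}{\mu} + \frac{2 (F(w_1) - F_{\inf})}{K \mu \bar{\alpha}}
\xrightarrow{K \to \infty} \frac{\bar{\alpha} L M}{\mu}. \tag{4.28b}
$$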

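Proof. Taking the total expectation of (4.10b) and from (4.27),
\[
\begin{aligned}
  \mathbb{E}[F(w_{k+1})] - \mathbb{E}[F(w_k)]
    &\le -(\mu - \tfrac{1}{2} \bar{\alpha} L M_G) \bar{\alpha} \, \mathbb{E}[\|\nabla F(w_k)\|_2^2] + \tfrac{1}{2} \bar{\alpha}^2 L M \\
    &\le -\tfrac{1}{2} \mu \bar{\alpha} \, \mathbb{E}[\|\nabla F(w_k)\|_2^2] + \tfrac{1}{2} \bar{\alpha}^2 L M.
\end{aligned}
\]
Summing both sides of this inequality for $k \in \{1, \dots, K\}$ and recalling Assumption 4.3(a) gives
\[
  F_{\inf} - F(w_1) \le \mathbb{E}[F(w_{K+1})] - F(w_1)
    \le -\tfrac{1}{2} \mu \bar{\alpha} \sum_{k=1}^{K} \mathbb{E}[\|\nabla F(w_k)\|_2^2] + \tfrac{1}{2} K \bar{\alpha}^2 L M.
\]
Rearranging yields (4.28a), and dividing further by K yields (4.28b).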

If M = 0, meaning that there is no noise or that noise reduces proportionally to $\|\nabla F(w_k)\|_2^2$
(see equations (4.8) and (4.9)), then (4.28a) captures a classical result for the full gradient method
applied to nonconvex functions, namely, that the sum of squared gradients remains finite, implying
that $\{\|\nabla F(w_k)\|_2\} \to 0$. In the presence of noise (i.e., M > 0), Theorem 4.8 illustrates the
interplay between the stepsize ᾱ and the variance of the stochastic directions. While one cannot
bound the expected optimality gap as in the convex case, inequality (4.28b) bounds the average
squared norm of the gradient of the objective function observed on {wk } visited during the first K
iterations. This quantity gets smaller when K increases, indicating that the SG method spends
increasingly more time in regions where the objective function has a (relatively) small gradient.
Moreover, the asymptotic result that one obtains from (4.28b) illustrates that noise in the gradients
inhibits further progress, as in (4.14) for the convex case. The average squared norm of the gradients
can be made

arbitrarily small by selecting a small stepsize, but doing so reduces the speed at which the norm of
the gradient approaches its limiting distribution.
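
To make the bound (4.28b) concrete, the following Python sketch runs SG with a fixed stepsize on a
one-dimensional nonconvex function and compares the observed average of squared gradients with the
right-hand side of (4.28b). The test function F(w) = w^2 + 3 sin^2(w), the additive Gaussian noise
model, and all numerical constants are illustrative assumptions that do not come from the text.
Under these assumptions one may take µ = µG = 1, MV = 0 (hence MG = 1), M = σ^2, L = 8, and
Finf = 0.

```python
import math
import random

def F(w):
    # Assumed nonconvex test function; F >= 0 with F(0) = 0, so F_inf = 0.
    return w * w + 3.0 * math.sin(w) ** 2

def grad_F(w):
    # F'(w) = 2w + 3 sin(2w); since F''(w) = 2 + 6 cos(2w), the gradient is 8-Lipschitz.
    return 2.0 * w + 3.0 * math.sin(2.0 * w)

def average_squared_gradient(alpha, K, w1, sigma, runs, seed=0):
    """Average of F'(w_k)^2 over k = 1..K, averaged again over independent runs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        w = w1
        for _ in range(K):
            total += grad_F(w) ** 2
            g = grad_F(w) + rng.gauss(0.0, sigma)  # unbiased noisy gradient
            w -= alpha * g                         # SG step with fixed stepsize
    return total / (runs * K)

L, mu, M_G, sigma = 8.0, 1.0, 1.0, 1.0
M = sigma ** 2
F_inf = 0.0
alpha = 0.05          # satisfies (4.27): alpha <= mu / (L * M_G) = 0.125
K, w1 = 2000, 3.0

observed = average_squared_gradient(alpha, K, w1, sigma, runs=20)
bound = alpha * L * M / mu + 2.0 * (F(w1) - F_inf) / (K * mu * alpha)
print(f"observed average: {observed:.4f}, bound (4.28b): {bound:.4f}")
```

Since (4.28b) is a statement about an expectation, the sketch averages over several independent
runs; a single short run could, in principle, exceed the bound.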
We now turn to the case when the SG method is applied to a nonconvex objective with a
decreasing sequence of stepsizes satisfying the classical conditions (4.19). While not the strongest
result that one can prove in this context—and, in fact, we prove a stronger result below—the
following theorem is perhaps the easiest to interpret and remember. Hence, we state it first.

Theorem 4.9 (Nonconvex Objective, Diminishing Stepsizes). Under Assumptions 4.1 and
4.3, suppose that the SG method (Algorithm 4.1) is run with a stepsize sequence satisfying (4.19).
Then
\[
  \liminf_{k \to \infty} \mathbb{E}[\|\nabla F(w_k)\|_2^2] = 0. \tag{4.29}
\]

The proof of this theorem follows based on the results given in Theorem 4.10 below. A “lim inf”
result of this type should be familiar to those knowledgeable of the nonlinear optimization litera-
ture. After all, such a result is all that can be shown for certain important methods, such as the
nonlinear conjugate gradient method [114]. The intuition that one should gain from the statement
of Theorem 4.9 is that, for the SG method with diminishing stepsizes, the expected gradient norms
cannot stay bounded away from zero.
The following result characterizes more precisely the convergence property of SG.

Theorem 4.10 (Nonconvex Objective, Diminishing Stepsizes). Under Assumptions 4.1 and
4.3, suppose that the SG method (Algorithm 4.1) is run with a stepsize sequence satisfying (4.19).
Then, with $A_K := \sum_{k=1}^{K} \alpha_k$,
\[
  \lim_{K \to \infty} \mathbb{E}\left[ \sum_{k=1}^{K} \alpha_k \|\nabla F(w_k)\|_2^2 \right] < \infty \tag{4.30a}
\]
and therefore
\[
  \mathbb{E}\left[ \frac{1}{A_K} \sum_{k=1}^{K} \alpha_k \|\nabla F(w_k)\|_2^2 \right]
    \xrightarrow{K \to \infty} 0. \tag{4.30b}
\]

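Proof. The second condition in (4.19) ensures that $\{\alpha_k\} \to 0$, meaning that, without loss of
generality, we may assume that $\alpha_k L M_G \le \mu$ for all $k \in \mathbb{N}$. Then, taking the total
expectation of (4.10b),
\[
\begin{aligned}
  \mathbb{E}[F(w_{k+1})] - \mathbb{E}[F(w_k)]
    &\le -(\mu - \tfrac{1}{2} \alpha_k L M_G) \alpha_k \, \mathbb{E}[\|\nabla F(w_k)\|_2^2] + \tfrac{1}{2} \alpha_k^2 L M \\
    &\le -\tfrac{1}{2} \mu \alpha_k \, \mathbb{E}[\|\nabla F(w_k)\|_2^2] + \tfrac{1}{2} \alpha_k^2 L M.
\end{aligned}
\]
Summing both sides of this inequality for $k \in \{1, \dots, K\}$ gives
\[
  F_{\inf} - \mathbb{E}[F(w_1)] \le \mathbb{E}[F(w_{K+1})] - \mathbb{E}[F(w_1)]
    \le -\tfrac{1}{2} \mu \sum_{k=1}^{K} \alpha_k \mathbb{E}[\|\nabla F(w_k)\|_2^2]
        + \tfrac{1}{2} L M \sum_{k=1}^{K} \alpha_k^2.
\]
Dividing by µ/2 and rearranging the terms, we obtain
\[
  \sum_{k=1}^{K} \alpha_k \mathbb{E}[\|\nabla F(w_k)\|_2^2]
    \le \frac{2 (\mathbb{E}[F(w_1)] - F_{\inf})}{\mu} + \frac{L M}{\mu} \sum_{k=1}^{K} \alpha_k^2.
\]
The second condition in (4.19) implies that the right-hand side of this inequality converges to a
finite limit when K increases, proving (4.30a). Then, (4.30b) follows since the first condition in
(4.19) ensures that $A_K \to \infty$ as $K \to \infty$.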

Theorem 4.10 establishes results about a weighted sum-of-squares and a weighted average of
squared gradients of F similar to those in Theorem 4.8. However, unlike (4.28b), the conclusion
(4.30b) states that the weighted average of the squared gradient norms converges to zero even
if the gradients are noisy, i.e., if M > 0. The fact that (4.30b) only specifies a property of a
weighted average (with weights dictated by {αk }) is only of minor importance since one can still
conclude that the expected gradient norms cannot asymptotically stay far from zero.
We can now see that Theorem 4.9 is a direct consequence of Theorem 4.10, for if (4.29) did not
hold, it would contradict Theorem 4.10.
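
The contradiction can be spelled out in a few lines. The following derivation is a sketch that
fills in this step using only the two conditions in (4.19) and the bound (4.30a); the symbols
$\delta$ and $k_0$ are introduced here for illustration.

```latex
% Suppose (4.29) fails. Then there exist \delta > 0 and k_0 \in \mathbb{N} such that
%   \mathbb{E}[\|\nabla F(w_k)\|_2^2] \ge \delta   for all k \ge k_0.
% Consequently, for every K \ge k_0,
\[
  \sum_{k=1}^{K} \alpha_k \, \mathbb{E}[\|\nabla F(w_k)\|_2^2]
    \;\ge\; \delta \sum_{k=k_0}^{K} \alpha_k
    \;\xrightarrow{K \to \infty}\; \infty ,
\]
% because the first condition in (4.19) states that \sum_k \alpha_k = \infty.
% This contradicts (4.30a), which states that the left-hand side has a finite limit.
```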
The next result gives a stronger conclusion than Theorem 4.9, at the expense of only showing
a property of the gradient of F at a randomly selected iterate.

Corollary 4.11. Suppose the conditions of Theorem 4.10 hold. For any $K \in \mathbb{N}$, let
$k(K) \in \{1, \dots, K\}$ represent a random index chosen with probabilities proportional to
$\{\alpha_k\}_{k=1}^{K}$. Then, $\|\nabla F(w_{k(K)})\|_2 \xrightarrow{K \to \infty} 0$ in probability.

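Proof. Using Markov’s inequality and (4.30a), for any ε > 0, we can write
\[
  \mathbb{P}\{\|\nabla F(w_k)\|_2 \ge \varepsilon\}
    = \mathbb{P}\{\|\nabla F(w_k)\|_2^2 \ge \varepsilon^2\}
    \le \varepsilon^{-2} \, \mathbb{E}\big[ \mathbb{E}_k[\|\nabla F(w_k)\|_2^2] \big]
    \xrightarrow{K \to \infty} 0,
\]
which is the definition of convergence in probability.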
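
The random index k(K) of Corollary 4.11 is simple to draw in practice. The Python sketch below is
a minimal illustration; the stepsize choice αk = 1/k (which satisfies (4.19)) and the helper name
`sample_index` are assumptions made for the example.

```python
import random

def sample_index(alphas, rng):
    """Return k(K) in {1, ..., K} with P[k(K) = k] proportional to alphas[k-1]."""
    K = len(alphas)
    return rng.choices(range(1, K + 1), weights=alphas, k=1)[0]

K = 1000
alphas = [1.0 / k for k in range(1, K + 1)]   # assumed stepsizes alpha_k = 1/k
rng = random.Random(0)

draws = [sample_index(alphas, rng) for _ in range(10000)]
frac_first = sum(d == 1 for d in draws) / len(draws)
expected_first = alphas[0] / sum(alphas)      # P[k(K) = 1] = alpha_1 / A_K
print(f"empirical P[k=1] = {frac_first:.3f}, alpha_1 / A_K = {expected_first:.3f}")
```

An implementation would then report the iterate with the sampled index. Note that, because the
weights follow the stepsizes, a decreasing stepsize sequence favors earlier iterates.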

Finally, we present the following result (with proof in Appendix B) to illustrate that stronger
convergence results also follow under additional regularity conditions.

Corollary 4.12. Under the conditions of Theorem 4.10, if we further assume that the objective
function F is twice differentiable, and that the mapping $w \mapsto \|\nabla F(w)\|_2^2$ has
Lipschitz-continuous derivatives, then
\[
  \lim_{k \to \infty} \mathbb{E}[\|\nabla F(w_k)\|_2^2] = 0.
\]

4.4 Work Complexity for Large-Scale Learning


Our discussion thus far has focused on the convergence properties of SG when minimizing a given
objective function representing either expected or empirical risk. However, our investigation would
be incomplete without considering how these properties impact the computational workload associ-
ated with solving an underlying machine learning problem. As previewed in §3, there are arguments
that a more slowly convergent algorithm such as SG, with its sublinear rate of convergence, is more
efficient for large-scale learning than (full, batch) gradient-based methods that have a linear rate
of convergence. The purpose of this section is to present these arguments in further detail.
As a first attempt for setting up a framework in which one may compare optimization algorithms
for large-scale learning, one might be tempted to consider the situation of having a given training
set size n and asking what type of algorithm—e.g., a simple SG or batch gradient method—would
provide the best guarantees in terms of achieving a low expected risk. However, such a comparison
is difficult to make when one cannot determine the precise trade-off between per-iteration costs and
overall progress of the optimization method that one can guarantee.
An easier way to approach the issue is to consider a big data scenario with an infinite supply of
training examples, but a limited computational time budget. One can then ask whether running a

simple optimization algorithm such as SG works better than running a more sophisticated batch
optimization algorithm. We take this approach now, following the work in [22, 23].
Suppose that both the expected risk R and the empirical risk Rn attain their minima with
parameter vectors $w_* \in \arg\min R(w)$ and $w_n \in \arg\min R_n(w)$, respectively. In addition, let
$\tilde{w}_n$ be the approximate empirical risk minimizer returned by a given optimization algorithm when
the time budget $T_{\max}$ is exhausted. The tradeoffs associated with this scenario can be formalized as
choosing the family of prediction functions $\mathcal{H}$, the number of examples n, and the optimization
accuracy $\epsilon := \mathbb{E}[R_n(\tilde{w}_n) - R_n(w_n)]$ in order to minimize, within time $T_{\max}$, the total error
\[
  \mathbb{E}[R(\tilde{w}_n)]
    = \underbrace{R(w_*)}_{\mathcal{E}_{app}(\mathcal{H})}
    + \underbrace{\mathbb{E}[R(w_n) - R(w_*)]}_{\mathcal{E}_{est}(\mathcal{H}, n)}
    + \underbrace{\mathbb{E}[R(\tilde{w}_n) - R(w_n)]}_{\mathcal{E}_{opt}(\mathcal{H}, n, \epsilon)}. \tag{4.31}
\]

To minimize this error, one needs to balance the contributions from each of the three terms on the
right-hand side. For instance, if one decides to make the optimization more accurate, i.e., to reduce
$\epsilon$ in the hope of also reducing the optimization error $\mathcal{E}_{opt}(\mathcal{H}, n, \epsilon)$ (evaluated with respect
to R rather than Rn ), one might need to make up for the additional computing time by: (i) reducing
the sample size n, potentially increasing the estimation error $\mathcal{E}_{est}(\mathcal{H}, n)$; or (ii) simplifying the
function family $\mathcal{H}$, potentially increasing the approximation error $\mathcal{E}_{app}(\mathcal{H})$.
Useful guidelines for achieving an optimal balance between these errors can be obtained by
setting aside the choice of $\mathcal{H}$ and carrying out a worst-case analysis on the influence of the sample
size n and optimization tolerance $\epsilon$, which together only influence the estimation and optimization
errors. This simplified set-up can be formalized in terms of the macroscopic optimization problem
\[
  \min_{n, \epsilon} \; \mathcal{E}(n, \epsilon) = \mathbb{E}[R(\tilde{w}_n) - R(w_*)]
    \quad \text{s.t.} \quad T(n, \epsilon) \le T_{\max}. \tag{4.32}
\]

The computing time $T(n, \epsilon)$ depends on the details of the optimization algorithm in interesting
ways. For example, the computing time of a batch algorithm increases linearly (at least) with
the number of examples n, whereas, crucially, the computing time of a stochastic algorithm is
independent of n. With a batch optimization algorithm, one could consider increasing $\epsilon$ in order to
make time to use more training examples. However, with a stochastic algorithm, one should always
use as many examples as possible because the per-iteration computing time is independent of n.
To be specific, let us compare the solutions of (4.32) for prototypical stochastic and batch
methods, namely simple SG and a batch gradient method, using simplified forms for the worst
case of the error function $\mathcal{E}$ and the time function $T$. For the error function, a direct application
of the uniform laws of large numbers [155] yields
\[
\begin{aligned}
  \mathcal{E}(n, \epsilon) = \mathbb{E}[R(\tilde{w}_n) - R(w_*)]
    = {} & \underbrace{\mathbb{E}[R(\tilde{w}_n) - R_n(\tilde{w}_n)]}_{= O(\sqrt{\log(n)/n})}
         + \underbrace{\mathbb{E}[R_n(\tilde{w}_n) - R_n(w_n)]}_{= \epsilon} \\
       & + \underbrace{\mathbb{E}[R_n(w_n) - R_n(w_*)]}_{\le 0}
         + \underbrace{\mathbb{E}[R_n(w_*) - R(w_*)]}_{= O(\sqrt{\log(n)/n})},
\end{aligned}
\]
which leads to the upper bound
\[
  \mathcal{E}(n, \epsilon) = O\left( \sqrt{\frac{\log(n)}{n}} + \epsilon \right). \tag{4.33}
\]

The inverse-square-root dependence on the number of examples n that appears here is typical of
statistical problems. However, even faster convergence rates for reducing these terms with respect
to n can be established under specific conditions. For instance, when the loss function is strongly
convex [91] or when the data distribution satisfies certain assumptions [154], it is possible to show
that
\[
  \mathcal{E}(n, \epsilon) = O\left( \frac{\log(n)}{n} + \epsilon \right).
\]
To simplify further, let us work with the asymptotic (i.e., for large n) equivalence
\[
  \mathcal{E}(n, \epsilon) \sim \frac{1}{n} + \epsilon, \tag{4.34}
\]
which is the fastest rate that remains compatible with elementary statistical results.⁶ Under this
assumption, noting that the time constraint in (4.32) will always be active (since one can always
lower $\epsilon$, and hence $\mathcal{E}(n, \epsilon)$, by giving more time to the optimization algorithm), and recalling the
worst-case computing time bounds introduced in §3.3, one arrives at the following conclusions.

• A simple SG method can achieve $\epsilon$-optimality with a computing time of $T_{stoch}(n, \epsilon) \sim 1/\epsilon$.
Hence, within the time budget $T_{\max}$, the accuracy achieved is proportional to $1/T_{\max}$,
regardless of n. This means that, to minimize the error $\mathcal{E}(n, \epsilon)$, one should simply choose n as
large as possible. Since the maximum number of examples that can be processed by SG during
the time budget is proportional to $T_{\max}$, it follows that the optimal error is proportional
to $1/T_{\max}$.

• A batch gradient method can achieve $\epsilon$-optimality with a computing time of
$T_{batch}(n, \epsilon) \sim n \log(1/\epsilon)$. This means that, within the time budget $T_{\max}$, it can achieve
$\epsilon$-optimality by processing $n \sim T_{\max} / \log(1/\epsilon)$ examples. One now finds that the optimal error
is not necessarily achieved by choosing n as large as possible, but rather by choosing $\epsilon$ (which
dictates n) to minimize (4.34). Differentiating $\mathcal{E}(n, \epsilon) \sim \log(1/\epsilon)/T_{\max} + \epsilon$ with respect to
$\epsilon$ and setting the result equal to zero, one finds that optimality is achieved with
$\epsilon \sim 1/T_{\max}$, from which it follows that the optimal error is proportional to
$\log(T_{\max})/T_{\max} + 1/T_{\max}$.
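
The two conclusions above can be checked numerically. The following Python sketch is an
illustration under the simplifying assumption that every hidden constant in (4.34) and in the
computing-time estimates equals one; the function names and the grid search are ours. It locates
the minimizer of the batch error by brute force and compares the resulting error with that of SG.

```python
import math

def batch_error(eps, T_max):
    # Batch: n ~ T_max / log(1/eps) examples fit in the budget, so E ~ 1/n + eps.
    return math.log(1.0 / eps) / T_max + eps

def best_batch_eps(T_max, grid_size=20000):
    # Log-spaced grid over eps in [1e-12, 1e-1]; a crude stand-in for the calculus argument.
    candidates = (10.0 ** (-12.0 + 11.0 * i / grid_size) for i in range(grid_size + 1))
    return min(candidates, key=lambda eps: batch_error(eps, T_max))

def stochastic_error(T_max):
    # SG: eps ~ 1/T_max and n ~ T_max, so E ~ 1/n + eps ~ 2/T_max.
    return 1.0 / T_max + 1.0 / T_max

for T_max in (1e3, 1e5, 1e7):
    eps_star = best_batch_eps(T_max)
    print(f"T_max={T_max:g}: eps* x T_max = {eps_star * T_max:.3f}, "
          f"batch error = {batch_error(eps_star, T_max):.3e}, "
          f"SG error = {stochastic_error(T_max):.3e}")
```

Under these assumptions the product of the batch minimizer and the time budget stays close to
one, in line with $\epsilon \sim 1/T_{\max}$, while the batch error exceeds the SG error by a factor that
grows like $\log(T_{\max})$.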

These results are summarized in Table 4.1. Even though a batch approach possesses a better
dependency on $\epsilon$, this advantage does not make up for its dependence on n. This is true even though
we have employed (4.34), the most favorable rate that one may reasonably assume. In conclusion,
we have found that a stochastic optimization algorithm performs better in terms of expected error,
and, hence, makes a better learning algorithm in the sense considered here. These observations are
supported by practical experience (recall Figure 3.1 in §3.3).

                  Batch                                   Stochastic
  T(n, ε) ∼       n log(1/ε)                              1/ε
  E∗ ∼            log(Tmax)/Tmax + 1/Tmax                 1/Tmax

Table 4.1: The first row displays the computing times of idealized batch and stochastic optimization
algorithms. The second row gives the corresponding solutions of (4.32), assuming (4.34).

A Lower Bound The results reported in Table 4.1 are also notable because the SG algorithm
matches a lower complexity bound that has been established for any optimization method employing
a noisy oracle. To be specific, in the widely employed model for studying optimization algorithms
proposed by Nemirovsky and Yudin [106], one assumes that information regarding the objective
function is acquired by querying an oracle, ignoring the computational demands of doing so. Using
such a model, it has been established, e.g., that the full gradient method applied to minimize a
strongly convex objective function is not optimal in terms of the accuracy that can be achieved
within a given number of calls to the oracle, but that one can achieve an optimal method through
acceleration techniques; see §7.2.
The case when only gradient estimates are available through a noisy oracle has been studied,
e.g., in [1, 128]. Roughly speaking, these investigations show that, again when minimizing a strongly
convex function, no algorithm that performs k calls to the oracle can guarantee accuracy better
than Ω(1/k). As we have seen, SG achieves this lower bound up to constant factors. This analysis
applies both for the optimization of expected risk and empirical risk.

⁶ For example, suppose that one is estimating the mean of a distribution P defined on the real line
by minimizing the risk $R(\mu) = \int (x - \mu)^2 \, dP(x)$. The convergence rate (4.34) amounts to estimating
the distribution mean with a variance proportional to 1/n. A faster convergence rate would violate
the Cramér-Rao bound.

4.5 Commentary
Although the analysis presented in §4.4 can be quite compelling, it would be premature to conclude
that SG is a perfect solution for large-scale machine learning problems. There is, in fact, a large
gap between asymptotical behavior and practical realities. Next, we discuss issues related to this
gap.

Fragility of the Asymptotic Performance of SG The convergence speed given, e.g., by
Theorem 4.7, holds when the stepsize constant β exceeds a quantity inversely proportional to the
strong convexity parameter c (see (4.20)). In some settings, determining such a value is relatively
easy, such as when the objective function includes a squared ℓ2-norm regularizer (e.g., as in (2.3)),
in which case the regularization parameter provides a lower bound for c. However, despite the fact
that this can work well in practice, it is not completely satisfactory because one should reduce the
regularization parameter when the number of samples increases. It is therefore desirable to design
algorithms that adapt to local convexity properties of the objective, so as to avoid having to place
cumbersome restrictions on the stepsizes.
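
The sensitivity to β can be illustrated on a toy problem. The Python sketch below assumes the
one-dimensional quadratic F(w) = (c/2) w^2 with stochastic gradients g = c w + noise, where the
noise has zero mean and variance σ^2 and is independent across iterations. Under these assumptions
(which are ours, not the text's), the expected squared distance to the minimizer obeys the exact
recursion coded below, so no random sampling is needed. Stepsizes take the form αk = β/(γ + k),
and the two values of β lie on either side of the threshold 1/c.

```python
def expected_sq_distance(beta, c, sigma, K, gamma=1.0, e1=100.0):
    """Iterate e_{k+1} = (1 - alpha_k c)^2 e_k + alpha_k^2 sigma^2, alpha_k = beta/(gamma + k).

    Here e_k stands for E[w_k^2] on the assumed toy problem F(w) = (c/2) w^2.
    """
    e = e1
    for k in range(1, K + 1):
        alpha = beta / (gamma + k)
        e = (1.0 - alpha * c) ** 2 * e + (alpha * sigma) ** 2
    return e

c, sigma, K = 0.1, 1.0, 100_000
e_small_beta = expected_sq_distance(beta=2.0, c=c, sigma=sigma, K=K)    # beta < 1/c
e_large_beta = expected_sq_distance(beta=20.0, c=c, sigma=sigma, K=K)   # beta > 1/c
print(f"beta = 2:  E[w^2] after K steps = {e_small_beta:.3e}")
print(f"beta = 20: E[w^2] after K steps = {e_large_beta:.3e}")
```

In this sketch the run with β below the threshold ends with a much larger expected error after the
same number of iterations, which is the fragility described above: underestimating the curvature
constant c, and hence choosing β too small, forfeits the O(1/k) rate.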

SG and Ill-Conditioning The analysis of §4.4 is compelling since, as long as the optimization
problem is reasonably well-conditioned, the constant factors favor the SG algorithm. In particular,
the minimalism of the SG algorithm allows for very efficient implementations that either fully
leverage the sparsity of training examples (as in the case study on text classification in §2.1) or
harness the computational power of GPU processors (as in the case study on deep neural networks
in §2.2). In contrast, state-of-the-art batch optimization algorithms often carry more overhead.
However, this advantage erodes when the conditioning of the objective function worsens. Again,
consider Theorem 4.7. This result involves constant factors that grow with both the condition

number L/c and the ratio M/c. Both of these ratios can be improved greatly by adaptively
rescaling the stochastic directions based on matrices that capture local curvature information of
the objective function; see §6.

Opportunities for Distributed Computing Distributing the SG step computation can poten-
tially reduce the computing time by a constant factor equal to the number of machines. However,
such an improvement is difficult to realize. The SG algorithm is notoriously difficult to distribute
efficiently because it accesses the shared parameter vector w with relatively high frequency. Con-
sequently, even though it is very robust to additional noise and can be run with very relaxed
synchronization [112, 45], distributed SG algorithms suffer from large communication overhead.
Since this overhead is potentially much larger than the additional work associated with mini-batch
and other methods with higher per-iteration costs, distributed computing offers new opportunities
for the success of such methods for machine learning.

Alternatives with Faster Convergence As mentioned above, [1, 128] establish lower complex-
ity bounds for optimization algorithms that only access information about the objective function
through noisy estimates of F (wk ) and ∇F (wk ) acquired in each iteration. The bounds apply, e.g.,
when SG is employed to minimize the expected risk R using gradient estimates evaluated on sam-
ples drawn from the distribution P . However, an algorithm that optimizes the empirical risk Rn
has access to an additional piece of information: it knows when a gradient estimate is evaluated
on a training example that has already been visited during previous iterations. Recent gradient
aggregation methods (see §5.3) make use of this information and improve upon the lower bound in
[1] for the optimization of the empirical risk (though not for the expected risk). These algorithms
enjoy linear convergence with low computing times in practice. Another avenue for improving the
convergence rate is to employ a dynamic sampling approach (see §5.2), which, as we shall see, can
match the optimal asymptotic efficiency of SG in big data settings.

Inset 4.3: Regret Bounds
Convergence results for SG and its variants are occasionally established using regret bounds
as an intermediate step [164, 144, 54]. Regret bounds can be traced to Novikoff’s analysis of the
Perceptron [115] and to Cover’s universal portfolios [41]. To illustrate, suppose that one aims
to minimize a convex expected risk measure $R(w) = \mathbb{E}[f(w; \xi)]$ over $w \in \mathbb{R}^d$ with minimizer
$w_* \in \mathbb{R}^d$. At a given iterate $w_k$, one obtains by convexity of $f(w; \xi_k)$ (recall (A.1)) that
\[
  \|w_{k+1} - w_*\|_2^2 - \|w_k - w_*\|_2^2
    \le -2 \alpha_k (f(w_k; \xi_k) - f(w_*; \xi_k)) + \alpha_k^2 \|\nabla f(w_k; \xi_k)\|_2^2.
\]
Following [164], assuming that $\|\nabla f(w_k; \xi_k)\|_2^2 \le M$ and $\|w_k - w_*\|_2^2 < B$ for some constants
M > 0 and B > 0 for all $k \in \mathbb{N}$, one finds that
\[
\begin{aligned}
  \alpha_{k+1}^{-1} \|w_{k+1} - w_*\|_2^2 - \alpha_k^{-1} \|w_k - w_*\|_2^2
    &\le -2 (f(w_k; \xi_k) - f(w_*; \xi_k)) + \alpha_k M + (\alpha_{k+1}^{-1} - \alpha_k^{-1}) \|w_k - w_*\|_2^2 \\
    &\le -2 (f(w_k; \xi_k) - f(w_*; \xi_k)) + \alpha_k M + (\alpha_{k+1}^{-1} - \alpha_k^{-1}) B.
\end{aligned}
\]
Summing for $k \in \{1, \dots, K\}$ with stepsizes $\alpha_k = 1/\sqrt{k}$ leads to the regret bound
\[
  \left( \sum_{k=1}^{K} f(w_k; \xi_k) \right)
    \le \left( \sum_{k=1}^{K} f(w_*; \xi_k) \right) + M \sqrt{K} + o(\sqrt{K}), \tag{4.35}
\]
which bounds the losses incurred from {wk } compared to those yielded by the fixed vector w∗ .
Such a bound is remarkable because its derivation holds for any sequence of noise variables {ξk }.
This means that the average loss observed during the execution of the SG algorithm is never
much worse than the best average loss one would have obtained if the optimal parameter w∗
were known in advance. Further assuming that the noise variables are independent and using
a martingale argument [34] leads to more traditional results of the form
\[
  \mathbb{E}\left[ \frac{1}{K} \sum_{k=1}^{K} F(w_k) \right] \le F_* + O\left( \frac{1}{\sqrt{K}} \right).
\]

As long as one makes the same independent noise assumption, results obtained with this tech-
nique cannot be fundamentally different from the results that we have established. However,
one should appreciate that the regret bound (4.35) itself remains meaningful when the noise
variables are dependent or even adversarial. Such results reveal interesting connections between
probability theory, machine learning, and game theory [143, 34].

5 Noise Reduction Methods
The theoretical arguments in the previous section, together with extensive computational expe-
rience, have led many in the machine learning community to view SG as the ideal optimization
approach for large-scale applications. We argue, however, that this is far from settled. SG suffers
from, amongst other things, the adverse effect of noisy gradient estimates. This prevents it from
converging to the solution when fixed stepsizes are used and leads to a slow, sublinear rate of
convergence when a diminishing stepsize sequence {αk } is employed.
To address this limitation, methods endowed with noise reduction capabilities have been devel-
oped. These methods, which reduce the errors in the gradient estimates and/or iterate sequence,
have proved to be effective in practice and enjoy attractive theoretical properties. Recalling the
schematic of optimization methods in Figure 3.3, we depict these methods on the horizontal axis
given in Figure 5.1.

[Figure 5.1 schematic: the horizontal axis runs from the stochastic gradient method to the batch
gradient method and is labeled "Noise reduction methods: dynamic sampling, gradient aggregation,
iterate averaging"; the other axis leads to the stochastic Newton method.]

Fig. 5.1: View of the schematic from Figure 3.3 with a focus on noise reduction methods.

The first two classes of methods that we consider achieve noise reduction in a manner that
allows them to possess a linear rate of convergence to the optimal value using a fixed stepsize. The
first type, dynamic sampling methods, achieve noise reduction by gradually increasing the mini-
batch size used in the gradient computation, thus employing increasingly more accurate gradient
estimates as the optimization process proceeds. Gradient aggregation methods, on the other hand,
improve the quality of the search directions by storing gradient estimates corresponding to samples
employed in previous iterations, updating one (or some) of these estimates in each iteration, and
defining the search direction as a weighted average of these estimates.
The third class of methods that we consider, iterate averaging methods, accomplish noise reduction
not by averaging gradient estimates, but by maintaining an average of iterates computed during
the optimization process. Employing a more aggressive stepsize sequence (of order $O(1/\sqrt{k})$ rather
than $O(1/k)$, which is appealing in itself), it is this sequence of averaged iterates that converges
to the solution. These methods are closer in spirit to SG and their rate of convergence remains
sublinear, but it can be shown that the variance of the sequence of average iterates is smaller than
the variance of the SG iterates.
To formally motivate a concept of noise reduction, we begin this section by discussing a fun-
damental result that stipulates a rate of decrease in noise that allows a stochastic-gradient-type

method to converge at a linear rate. We then show that a dynamic sampling method that enforces
such noise reduction enjoys optimal complexity bounds, as defined in §4.4. Next, we discuss three
gradient aggregation methods (SVRG, SAGA, and SAG), the first of which can be viewed as a
bridge between methods that control errors in the gradient and methods like SAGA and SAG in
which noise reduction is accomplished in a more subtle manner. We conclude with a discussion of
the practical and theoretical properties of iterate averaging methods.

5.1 Reducing Noise at a Geometric Rate


Let us recall the fundamental inequality (4.4), which we restate here for convenience:
\[
  \mathbb{E}_{\xi_k}[F(w_{k+1})] - F(w_k)
    \le -\alpha_k \nabla F(w_k)^T \mathbb{E}_{\xi_k}[g(w_k, \xi_k)]
        + \tfrac{1}{2} \alpha_k^2 L \, \mathbb{E}_{\xi_k}[\|g(w_k, \xi_k)\|_2^2].
\]
(Recall that, as stated in (4.1), the objective F could stand for the expected risk R or empirical risk
Rn .) It is intuitively clear that if $-g(w_k, \xi_k)$ is a descent direction in expectation (making the first
term on the right-hand side negative) and if we are able to decrease $\mathbb{E}_{\xi_k}[\|g(w_k, \xi_k)\|_2^2]$ fast enough
(along with $\|\nabla F(w_k)\|_2^2$), then the effect of having noisy directions will not impede a fast rate of
convergence. From another point of view, we can expect such behavior if, in Assumption 4.3, we
suppose instead that the variance of $g(w_k, \xi_k)$ vanishes sufficiently quickly.
We formalize this intuitive argument by considering the SG method with a fixed stepsize and
showing that the sequence of expected optimality gaps vanishes at a linear rate as long as the
variance of the stochastic vectors, denoted by Vξk [g(wk , ξk )] (recall (4.6)), decreases geometrically.

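Theorem 5.1 (Strongly Convex Objective, Noise Reduction). Suppose that Assumptions 4.1,
4.3, and 4.5 (with $F_{\inf} = F_*$) hold, but with (4.8) refined to the existence of constants $M \ge 0$ and
$\zeta \in (0, 1)$ such that, for all $k \in \mathbb{N}$,
\[
  \mathbb{V}_{\xi_k}[g(w_k, \xi_k)] \le M \zeta^{k-1}. \tag{5.1}
\]
In addition, suppose that the SG method (Algorithm 4.1) is run with a fixed stepsize, $\alpha_k = \bar{\alpha}$ for
all $k \in \mathbb{N}$, satisfying
\[
  0 < \bar{\alpha} \le \min\left\{ \frac{\mu}{L \mu_G^2}, \frac{1}{c \mu} \right\}. \tag{5.2}
\]
Then, for all $k \in \mathbb{N}$, the expected optimality gap satisfies
\[
  \mathbb{E}[F(w_k) - F_*] \le \omega \rho^{k-1}, \tag{5.3}
\]
where
\[
  \omega := \max\left\{ \frac{\bar{\alpha} L M}{c \mu}, \; F(w_1) - F_* \right\} \tag{5.4a}
\]
\[
  \text{and} \quad \rho := \max\left\{ 1 - \frac{\bar{\alpha} c \mu}{2}, \; \zeta \right\} < 1. \tag{5.4b}
\]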

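Proof. By Lemma 4.4 (specifically, (4.10a)), we have
\[
  \mathbb{E}_{\xi_k}[F(w_{k+1})] - F(w_k)
    \le -\mu \bar{\alpha} \|\nabla F(w_k)\|_2^2 + \tfrac{1}{2} \bar{\alpha}^2 L \, \mathbb{E}_{\xi_k}[\|g(w_k, \xi_k)\|_2^2].
\]
Hence, from (4.6), (4.7b), (5.1), (5.2), and (4.12), we have
\[
\begin{aligned}
  \mathbb{E}_{\xi_k}[F(w_{k+1})] - F(w_k)
    &\le -\mu \bar{\alpha} \|\nabla F(w_k)\|_2^2
         + \tfrac{1}{2} \bar{\alpha}^2 L \left( \mu_G^2 \|\nabla F(w_k)\|_2^2 + M \zeta^{k-1} \right) \\
    &\le -(\mu - \tfrac{1}{2} \bar{\alpha} L \mu_G^2) \bar{\alpha} \|\nabla F(w_k)\|_2^2 + \tfrac{1}{2} \bar{\alpha}^2 L M \zeta^{k-1} \\
    &\le -\tfrac{1}{2} \mu \bar{\alpha} \|\nabla F(w_k)\|_2^2 + \tfrac{1}{2} \bar{\alpha}^2 L M \zeta^{k-1} \\
    &\le -\bar{\alpha} c \mu (F(w_k) - F_*) + \tfrac{1}{2} \bar{\alpha}^2 L M \zeta^{k-1}.
\end{aligned}
\]
Adding and subtracting $F_*$ and taking total expectations, this yields
\[
  \mathbb{E}[F(w_{k+1}) - F_*] \le (1 - \bar{\alpha} c \mu) \, \mathbb{E}[F(w_k) - F_*] + \tfrac{1}{2} \bar{\alpha}^2 L M \zeta^{k-1}. \tag{5.5}
\]
We now use induction to prove (5.3). By the definition of ω, it holds for k = 1. Then, assuming
it holds for k ≥ 1, we have from (5.5), (5.4a), and (5.4b) that
\[
\begin{aligned}
  \mathbb{E}[F(w_{k+1}) - F_*]
    &\le (1 - \bar{\alpha} c \mu) \, \omega \rho^{k-1} + \tfrac{1}{2} \bar{\alpha}^2 L M \zeta^{k-1} \\
    &= \omega \rho^{k-1} \left( 1 - \bar{\alpha} c \mu + \frac{\bar{\alpha}^2 L M}{2 \omega} \left( \frac{\zeta}{\rho} \right)^{k-1} \right) \\
    &\le \omega \rho^{k-1} \left( 1 - \bar{\alpha} c \mu + \frac{\bar{\alpha}^2 L M}{2 \omega} \right) \\
    &\le \omega \rho^{k-1} \left( 1 - \bar{\alpha} c \mu + \frac{\bar{\alpha} c \mu}{2} \right) \\
    &= \omega \rho^{k-1} \left( 1 - \frac{\bar{\alpha} c \mu}{2} \right) \\
    &\le \omega \rho^{k},
\end{aligned}
\]
as desired.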
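
The induction can be checked numerically. The Python sketch below iterates the recursion (5.5) with
equality, which is the worst case it allows, and records how close the resulting sequence comes to
the bound ωρ^(k−1) from (5.3). The numerical constants and the function name `check_linear_rate`
are illustrative assumptions.

```python
def check_linear_rate(c, L, mu, mu_G, M, zeta, gap1, K):
    """Iterate (5.5) with equality and compare against omega * rho**(k-1) from (5.3)."""
    alpha = min(mu / (L * mu_G ** 2), 1.0 / (c * mu))   # largest stepsize allowed by (5.2)
    omega = max(alpha * L * M / (c * mu), gap1)          # (5.4a)
    rho = max(1.0 - alpha * c * mu / 2.0, zeta)          # (5.4b)
    gap, worst_ratio = gap1, 0.0
    for k in range(1, K + 1):
        # gap plays the role of E[F(w_k) - F_*]; the ratio should never exceed 1.
        worst_ratio = max(worst_ratio, gap / (omega * rho ** (k - 1)))
        gap = (1.0 - alpha * c * mu) * gap + 0.5 * alpha ** 2 * L * M * zeta ** (k - 1)
    return alpha, omega, rho, worst_ratio

alpha, omega, rho, worst_ratio = check_linear_rate(
    c=1.0, L=10.0, mu=1.0, mu_G=1.0, M=1.0, zeta=0.9, gap1=5.0, K=200)
print(f"alpha = {alpha}, omega = {omega}, rho = {rho}, max gap/bound = {worst_ratio:.6f}")
```

With these constants the contraction factor is governed by the stepsize term in (5.4b) rather than
by ζ; choosing ζ closer to one would instead make the noise decay rate the bottleneck.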

Consideration of the typical magnitudes of the constants µ, L, µG , and c in (5.2) reveals that
the admissible range of values of ᾱ is large, i.e., the restriction on the stepsize ᾱ is not unrealistic
in practical situations.
Now, a natural question is whether one can design efficient optimization methods for attaining
the critical bound (5.1) on the variance of the stochastic directions. We show next that a dynamic
sampling method is one such approach.

5.2 Dynamic Sample Size Methods


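Consider the iteration
\[
  w_{k+1} \leftarrow w_k - \bar{\alpha} \, g(w_k, \xi_k), \tag{5.6}
\]
where the stochastic directions are computed for some τ > 1 as
\[
  g(w_k, \xi_k) := \frac{1}{n_k} \sum_{i \in S_k} \nabla f(w_k; \xi_{k,i})
    \quad \text{with} \quad n_k := |S_k| = \lceil \tau^{k-1} \rceil. \tag{5.7}
\]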

That is, consider a mini-batch SG iteration with a fixed stepsize in which the mini-batch size
used to compute unbiased stochastic gradient estimates increases geometrically as a function of the

iteration counter k. To show that such an approach can fall under the class of methods considered
in Theorem 5.1, suppose that the samples represented by the random variables {ξk,i }i∈Sk are drawn
independently according to P for all k ∈ N. If we assume that each stochastic gradient ∇f (wk ; ξk,i )
has an expectation equal to the true gradient ∇F (wk ), then (4.7a) holds with µG = µ = 1. If, in
addition, the variance of each such stochastic gradient is equal and is bounded by M ≥ 0, then for
arbitrary $i \in S_k$ we have (see [61, p.183])
\[
  \mathbb{V}_{\xi_k}[g(w_k, \xi_k)] \le \frac{\mathbb{V}_{\xi_k}[\nabla f(w_k; \xi_{k,i})]}{n_k} \le \frac{M}{n_k}. \tag{5.8}
\]
This bound, when combined with the rate of increase in nk given in (5.7), yields (5.1). We have thus
shown that if one employs a mini-batch SG method with (unbiased) gradient estimates computed
as in (5.7), then, by Theorem 5.1, one obtains linear convergence to the optimal value of a strongly
convex function. We state this formally as the following corollary to Theorem 5.1.

Corollary 5.2. Let {wk } be the iterates generated by (5.6)–(5.7) with unbiased gradient estimates,
i.e., Eξk,i [∇f (wk ; ξk,i )] = ∇F (wk ) for all k ∈ N and i ∈ Sk . Then, the variance condition (5.1)
is satisfied, and if all other assumptions of Theorem 5.1 hold, then the expected optimality gap
vanishes linearly in the sense of (5.3).
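
A minimal Python sketch of the iteration (5.6)–(5.7) follows. The toy problem
f(w; ξ) = (w − ξ)^2 / 2 with ξ drawn from a standard normal distribution is an assumption made for
illustration; for it, F(w) = w^2/2 + 1/2, F∗ = 1/2, c = L = 1, µ = µG = 1, and each stochastic
gradient w − ξ has variance M = 1. The stepsize and growth factor are likewise illustrative choices
that satisfy (5.2).

```python
import math
import random

def dynamic_sampling_sg(w1, alpha, tau, num_iters, seed=0):
    """Mini-batch SG (5.6) with batch sizes n_k = ceil(tau**(k-1)) as in (5.7)."""
    rng = random.Random(seed)
    w, evaluations, gaps = w1, 0, []
    for k in range(1, num_iters + 1):
        gaps.append(0.5 * w * w)                      # F(w_k) - F_* on the toy problem
        n_k = math.ceil(tau ** (k - 1))
        # Average of n_k independent stochastic gradients (w - xi).
        g = sum(w - rng.gauss(0.0, 1.0) for _ in range(n_k)) / n_k
        w -= alpha * g
        evaluations += n_k
    gaps.append(0.5 * w * w)
    return gaps, evaluations

gaps, evaluations = dynamic_sampling_sg(w1=5.0, alpha=0.5, tau=1.3, num_iters=30)
print(f"initial gap = {gaps[0]:.3f}, final gap = {gaps[-1]:.3e}, "
      f"gradient evaluations = {evaluations}")
```

The stepsize stays fixed throughout; the only mechanism driving the optimality gap down is the
geometric growth of the mini-batch size.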

The reader may question whether it is meaningful to describe a method as linearly convergent if
the per-iteration cost increases without bound. In other words, it is not immediately apparent that
such an algorithm is competitive with a classical SG approach even though the desired reduction in
the gradient variance is achieved. To address this question, let us estimate the total work complexity
of the dynamic sampling algorithm, defined as the number of evaluations of the individual gradients
$\nabla f(w_k; \xi_{k,i})$ required to compute an $\epsilon$-optimal solution, i.e., to achieve
\[
  \mathbb{E}[F(w_k) - F_*] \le \epsilon. \tag{5.9}
\]
We have seen that the simple SG method (3.7) requires one such evaluation per iteration, and that
its rate of convergence for diminishing stepsizes (i.e., the only set-up in which convergence to the
solution can be guaranteed) is given by (4.21). Therefore, as previously mentioned, the number of
stochastic gradient evaluations required by the SG method to guarantee (5.9) is $O(\epsilon^{-1})$. We now
show that the method (5.6)–(5.7) can attain the same complexity.

Theorem 5.3. Suppose that the dynamic sampling SG method (5.6)–(5.7) is run with a stepsize ᾱ
satisfying (5.2) and some $\tau \in (1, (1 - \frac{\bar{\alpha} c \mu}{2})^{-1}]$. In addition, suppose that Assumptions 4.1,
4.3, and 4.5 (with $F_{\inf} = F_*$) hold. Then, the total number of evaluations of a stochastic
gradient of the form $\nabla f(w_k; \xi_{k,i})$ required to obtain (5.9) is $O(\epsilon^{-1})$.

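Proof. We have that the conditions of Theorem 5.1 hold with ζ = 1/τ . Hence, we have from (5.3)
that there exists $\bar{k} \in \mathbb{N}$ such that (5.9) holds for all $k \ge \bar{k}$. We can then use
$\omega \rho^{\bar{k}-1} \le \epsilon$ to write $(\bar{k} - 1) \log \rho \le \log(\epsilon/\omega)$, which along with
$\rho \in (0, 1)$ (recall (5.4b)) implies that
\[
  \bar{k} - 1 \ge \left\lceil \frac{\log(\epsilon/\omega)}{\log \rho} \right\rceil
              = \left\lceil \frac{\log(\omega/\epsilon)}{-\log \rho} \right\rceil. \tag{5.10}
\]
Let us now estimate the total number of sample gradient evaluations required up through
iteration $\bar{k}$. We claim that, without loss of generality, we may assume that
$\log(\omega/\epsilon)/(-\log \rho)$ is integer-valued and that (5.10) holds at equality. Then, by (5.7), the
number of sample gradients required in iteration $\bar{k}$ is $\lceil \tau^{\bar{k}-1} \rceil$ where
\[
\begin{aligned}
  \tau^{\bar{k}-1}
    &= \tau^{\frac{\log(\omega/\epsilon)}{-\log \rho}} \\
    &= \exp\left( \log\left( \tau^{\frac{\log(\omega/\epsilon)}{-\log \rho}} \right) \right) \\
    &= \exp\left( \frac{\log(\omega/\epsilon)}{-\log \rho} \log \tau \right) \\
    &= \left( \exp(\log(\omega/\epsilon)) \right)^{\frac{\log \tau}{-\log \rho}} \\
    &= \left( \frac{\omega}{\epsilon} \right)^{\theta} \quad \text{with} \quad \theta := \frac{\log \tau}{-\log \rho}.
\end{aligned}
\]
Therefore, the total number of sample gradient evaluations for the first $\bar{k}$ iterations is
\[
  \sum_{j=1}^{\bar{k}} \lceil \tau^{j-1} \rceil
    \le 2 \sum_{j=1}^{\bar{k}} \tau^{j-1}
    = 2 \left( \frac{\tau^{\bar{k}} - 1}{\tau - 1} \right)
    = 2 \left( \frac{\tau (\omega/\epsilon)^{\theta} - 1}{\tau - 1} \right)
    \le 2 \left( \frac{\omega}{\epsilon} \right)^{\theta} \left( \frac{1}{1 - 1/\tau} \right).
\]
In fact, since $\tau \le (1 - \frac{\bar{\alpha} c \mu}{2})^{-1}$, it follows from (5.4b) that $\rho = \zeta = \tau^{-1}$, which
implies that θ = 1. Specifically, with $\tau = (1 - \sigma (\frac{\bar{\alpha} c \mu}{2}))^{-1}$ for some $\sigma \in (0, 1]$,
then θ = 1 and
\[
  \sum_{j=1}^{\bar{k}} \lceil \tau^{j-1} \rceil \le \frac{4 \omega}{\sigma \epsilon \bar{\alpha} c \mu},
\]
as desired.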
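
The counting argument can be reproduced with a few lines of arithmetic. In the Python sketch below,
the constants and the function name are illustrative assumptions. Because the sketch rounds the
iteration count up to an integer instead of assuming that (5.10) holds with equality, the
comparison allows one extra factor of τ relative to the bound derived in the proof.

```python
import math

def total_gradient_evaluations(eps, alpha, c, mu, omega, sigma=1.0):
    """Count sum_{j=1}^{k_bar} ceil(tau**(j-1)) for tau = (1 - sigma*alpha*c*mu/2)**(-1)."""
    tau = 1.0 / (1.0 - sigma * alpha * c * mu / 2.0)
    rho = max(1.0 - alpha * c * mu / 2.0, 1.0 / tau)   # (5.4b) with zeta = 1/tau
    # An integer k_bar satisfying (5.10), so that omega * rho**(k_bar - 1) <= eps.
    k_bar = 1 + max(0, math.ceil(math.log(omega / eps) / (-math.log(rho))))
    total = sum(math.ceil(tau ** (j - 1)) for j in range(1, k_bar + 1))
    bound = 4.0 * omega / (sigma * eps * alpha * c * mu)
    return total, bound, tau

for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    total, bound, tau = total_gradient_evaluations(eps, alpha=0.1, c=1.0, mu=1.0, omega=1.0)
    print(f"eps = {eps:g}: evaluations = {total}, eps * evaluations = {eps * total:.2f}, "
          f"bound = {bound:g}")
```

The product of ε and the number of evaluations stays bounded as ε decreases, which is the
$O(\epsilon^{-1})$ work complexity claimed in Theorem 5.3.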

The discussion so far has focused on dynamic sampling strategies for a gradient method, but
these ideas also apply for second-order methods that incorporate the matrix H as in (4.2).
This leads to the following question: given the rate of convergence of a batch optimization
algorithm on strongly convex functions (i.e., linear, superlinear, etc.), what should be the sampling
rate so that the overall algorithm is efficient in the sense that it results in the lowest computational
complexity? To answer this question, certain results have been established [120]: (i) if the opti-
mization method has a sublinear rate of convergence, then there is no sampling rate that makes
the algorithm “efficient”; (ii) if the optimization algorithm is linearly convergent, then the sam-
pling rate must be geometric (with restrictions on the constant in the rate) for the algorithm to be
“efficient”; (iii) for superlinearly convergent methods, increasing the sample size at a rate that is
slightly faster than geometric will yield an “efficient” method.

5.2.1 Practical Implementation


We now address the question of how to design practical algorithms that enjoy the theoretical
benefits of noise reduction through the use of increasing sample sizes.
One approach is to follow the theoretical results described above and tie the rate of growth in
the sample size nk = |Sk | to the rate of convergence of the optimization iteration [62, 77]. Such an
approach, which presets the sampling rate before running the optimization algorithm, requires some
experimentation. For example, for the iteration (5.6), one needs to find a value of the parameter
τ > 1 in (5.7) that yields good performance for the application at hand. In addition, one may want
to delay the application of dynamic sampling to prevent the full sample set from being employed
too soon (or at all). Such heuristic adaptations could be difficult in practice.
An alternative is to consider mechanisms for choosing the sample sizes not according to a
prescribed sequence, but adaptively according to information gathered during the optimization
process. One avenue that has been explored along these lines has been to design techniques that
produce descent directions sufficiently often [28, 73]. Such an idea is based on two observations.
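First, one can show that any direction $g(w_k, \xi_k)$ is a descent direction for $F$ at $w_k$ if, for some
$\chi \in [0, 1)$, one has

$$\delta(w_k, \xi_k) := \|g(w_k, \xi_k) - \nabla F(w_k)\|_2 \leq \chi\|g(w_k, \xi_k)\|_2. \tag{5.11}$$

To see this, note that $\|\nabla F(w_k)\|_2 \geq (1 - \chi)\|g(w_k, \xi_k)\|_2$, which after squaring (5.11) implies

$$\begin{aligned}
\nabla F(w_k)^T g(w_k, \xi_k) &\geq \tfrac{1}{2}\left((1 - \chi^2)\|g(w_k, \xi_k)\|_2^2 + \|\nabla F(w_k)\|_2^2\right) \\
&\geq \tfrac{1}{2}\left(1 - \chi^2 + (1 - \chi)^2\right)\|g(w_k, \xi_k)\|_2^2 \\
&\geq (1 - \chi)\|g(w_k, \xi_k)\|_2^2.
\end{aligned}$$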

The second observation is that while one cannot cheaply verify the inequality in (5.11) because it
involves the evaluation of ∇F (wk ), one can estimate the left-hand side δ(wk , ξk ), and then choose
nk so that (5.11) holds sufficiently often.
Specifically, if we assume that $g(w_k, \xi_k)$ is an unbiased gradient estimate, then

$$\mathbb{E}[\delta(w_k, \xi_k)^2] = \mathbb{E}[\|g(w_k, \xi_k) - \nabla F(w_k)\|_2^2] = \mathbb{V}_{\xi_k}[g(w_k, \xi_k)].$$

This variance is expensive to compute, but one can approximate it with a sample variance. For
example, if the samples are drawn without replacement from a set of (very large) size $n_k$, then one
has the approximation
$$\mathbb{V}_{\xi_k}[g(w_k, \xi_k)] \approx \frac{\operatorname{trace}(\operatorname{Cov}(\{\nabla f(w_k; \xi_{k,i})\}_{i \in S_k}))}{n_k} =: \varphi_k.$$
An adaptive sampling algorithm thus tests the following condition in place of (5.11):

$$\varphi_k \leq \chi^2\|g(w_k, \xi_k)\|_2^2. \tag{5.12}$$

If this condition is not satisfied, then one may consider increasing the sample size—either imme-
diately in iteration k or in a subsequent iteration—to a size that one might predict would satisfy
such a condition. This technique is algorithmically attractive, but does not guarantee that the
sample size nk increases at a geometric rate. One can, however, employ a backup [28, 73]: if (5.12)
increases the sampling rate more slowly than a preset geometric sequence, then a growth in the
sample size is imposed.
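
As an illustration, the test (5.12) can be sketched as follows in Python with NumPy. This is a sketch under stated assumptions rather than the procedure of [28, 73]: the function name `norm_test`, its return values, and the rule for suggesting a new sample size (the smallest size that would satisfy (5.12) if the sample variance and the gradient norm stayed unchanged) are our own choices.

```python
import numpy as np

def norm_test(sample_grads, chi):
    """Check condition (5.12) from per-sample gradients of shape (n_k, d).

    Returns (passed, phi_k, suggested_n); suggested_n is a heuristic guess.
    """
    n_k = sample_grads.shape[0]
    g = sample_grads.mean(axis=0)                 # mini-batch gradient g(w_k, xi_k)
    # trace(Cov) is the sum of the per-coordinate sample variances.
    trace_cov = sample_grads.var(axis=0, ddof=1).sum()
    phi_k = trace_cov / n_k
    g_sq = float(g @ g)
    passed = bool(phi_k <= chi**2 * g_sq)
    if passed or chi == 0.0 or g_sq == 0.0:
        suggested_n = n_k                         # no finite prediction is made
    else:
        # Size for which (5.12) would hold if trace_cov and ||g|| were unchanged.
        suggested_n = int(np.ceil(trace_cov / (chi**2 * g_sq)))
    return passed, phi_k, suggested_n

# Two per-sample gradients that disagree strongly: the test fails.
grads = np.array([[3.0, 0.0], [-1.0, 0.0]])
print(norm_test(grads, chi=0.5))
```

A geometric backup of the kind mentioned above could be imposed by taking the maximum of `suggested_n` and a preset sequence such as $\lceil \tau^{k-1} \rceil$.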
Dynamic sampling techniques are not yet widely used in machine learning, and we suspect
that the practical technique presented here might merely serve as a starting point for further
investigations. Ultimately, an algorithm that performs like SG at the start and transitions to a
regime of reduced variance in an efficient manner could prove to be a very powerful method for
large-scale learning.

5.3 Gradient Aggregation
The dynamic sample size methods described in the previous subsection require a careful balance
in order to achieve the desired linear rate of convergence without jeopardizing per-iteration costs.
Alternatively, one can attempt to achieve an improved rate by asking a different question: rather
than compute increasingly more new stochastic gradient information in each iteration, is it possible
to achieve a lower variance by reusing and/or revising previously computed information? After
all, if the current iterate has not been displaced too far from previous iterates, then stochastic
gradient information from previous iterates may still be useful. In addition, if one maintains
indexed gradient estimates in storage, then one can revise specific estimates as new information
is collected. Ideas such as these lead to concepts of gradient aggregation. In this subsection, we
present ideas for gradient aggregation in the context of minimizing a finite sum such as an empirical
risk measure Rn , for which they were originally designed.
Gradient aggregation algorithms for minimizing finite sums that possess cheap per-iteration
costs have a long history. For example, Bertsekas and co-authors [13] have proposed incremental
gradient methods, the randomized versions of which can be viewed as instances of a basic SG
method for minimizing a finite sum. However, the basic variants of these methods only achieve a
sublinear rate of convergence. By contrast, the methods on which we focus in this section are able
to achieve a linear rate of convergence on strongly convex problems. This improved rate is achieved
primarily by either an increase in computation or an increase in storage.

5.3.1 SVRG
The first method we consider operates in cycles. At the beginning of each cycle, an iterate $w_k$ is
available at which the algorithm computes a batch gradient $\nabla R_n(w_k) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w_k)$. Then,
after initializing $\tilde{w}_1 \leftarrow w_k$, a set of $m$ inner iterations indexed by $j$ with an update $\tilde{w}_{j+1} \leftarrow \tilde{w}_j - \alpha\tilde{g}_j$
is performed, where
$$\tilde{g}_j \leftarrow \nabla f_{i_j}(\tilde{w}_j) - (\nabla f_{i_j}(w_k) - \nabla R_n(w_k)) \tag{5.13}$$
and $i_j \in \{1, \dots, n\}$ is chosen at random. This formula has a simple interpretation. Since the
expected value of $\nabla f_{i_j}(w_k)$ over all possible $i_j \in \{1, \dots, n\}$ is equal to $\nabla R_n(w_k)$, one can view
$\nabla f_{i_j}(w_k) - \nabla R_n(w_k)$ as the bias in the gradient estimate $\nabla f_{i_j}(w_k)$. Thus, in every iteration, the
algorithm randomly draws a stochastic gradient $\nabla f_{i_j}(\tilde{w}_j)$ evaluated at the current inner iterate $\tilde{w}_j$
and corrects it based on a perceived bias. Overall, $\tilde{g}_j$ represents an unbiased estimator of $\nabla R_n(\tilde{w}_j)$,
but with a variance that one can expect to be smaller than if one were simply to choose $\tilde{g}_j = \nabla f_{i_j}(\tilde{w}_j)$
(as in simple SG). This is the reason that the method is referred to as the stochastic variance reduced
gradient (SVRG) method.
A formal description of a few variants of SVRG is presented as Algorithm 5.1. For both options
(b) and (c), it has been shown that when applied to minimize a strongly convex $R_n$, Algorithm 5.1
can achieve a linear rate of convergence [82]. More precisely, if the stepsize $\alpha$ and the length of the
inner cycle $m$ are chosen so that
$$\rho := \frac{1}{1 - 2\alpha L}\left(\frac{1}{mc\alpha} + 2L\alpha\right) < 1,$$
then, given that the algorithm has reached $w_k$, one obtains

$$\mathbb{E}[R_n(w_{k+1}) - R_n(w_*)] \leq \rho\,\mathbb{E}[R_n(w_k) - R_n(w_*)]$$

(where expectation is taken with respect to the random variables in the inner cycle). It should be
emphasized that the resulting linear convergence rate applies to the outer iterates $\{w_k\}$, where each
step from $w_k$ to $w_{k+1}$ requires $2m + n$ evaluations of component gradients: Step 7 requires two
stochastic gradient evaluations while Step 3 requires $n$ (a full gradient). Therefore, one iteration
of SVRG is much more expensive than one in SG, and in fact is comparable to a full gradient
iteration.

Algorithm 5.1 SVRG Methods for Minimizing an Empirical Risk Rn


1: Choose an initial iterate $w_1 \in \mathbb{R}^d$, stepsize $\alpha > 0$, and positive integer $m$.
2: for $k = 1, 2, \dots$ do
3:   Compute the batch gradient $\nabla R_n(w_k)$.
4:   Initialize $\tilde{w}_1 \leftarrow w_k$.
5:   for $j = 1, \dots, m$ do
6:     Choose $i_j$ uniformly from $\{1, \dots, n\}$.
7:     Set $\tilde{g}_j \leftarrow \nabla f_{i_j}(\tilde{w}_j) - (\nabla f_{i_j}(w_k) - \nabla R_n(w_k))$.
8:     Set $\tilde{w}_{j+1} \leftarrow \tilde{w}_j - \alpha\tilde{g}_j$.
9:   end for
10:  Option (a): Set $w_{k+1} = \tilde{w}_{m+1}$.
11:  Option (b): Set $w_{k+1} = \frac{1}{m}\sum_{j=1}^m \tilde{w}_{j+1}$.
12:  Option (c): Choose $j$ uniformly from $\{1, \dots, m\}$ and set $w_{k+1} = \tilde{w}_{j+1}$.
13: end for
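
A minimal Python sketch of Algorithm 5.1 with option (a) follows. It is illustrative only: the callback `grad_i`, the toy least-squares data, and the choices $\alpha = 0.1/L$ and $m = 5n$ are assumptions made for this example, not recommendations from the text.

```python
import numpy as np

def svrg(grad_i, n, w1, alpha, m, num_cycles, rng):
    """SVRG, option (a). grad_i(w, i) returns the gradient of f_i at w."""
    w = w1.copy()
    for _ in range(num_cycles):
        # Step 3: batch gradient at the outer iterate w_k.
        full_grad = sum(grad_i(w, i) for i in range(n)) / n
        w_tilde = w.copy()                                       # Step 4
        for _ in range(m):
            i = rng.integers(n)                                  # Step 6
            g = grad_i(w_tilde, i) - (grad_i(w, i) - full_grad)  # Step 7, (5.13)
            w_tilde = w_tilde - alpha * g                        # Step 8
        w = w_tilde                                              # Step 10, option (a)
    return w

# Toy least-squares problem with f_i(w) = 0.5 * (x_i^T w - y_i)^2 (hypothetical data).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]    # minimizer of R_n, for reference
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]
L = np.max(np.sum(X**2, axis=1))                 # Lipschitz constant valid for every grad f_i

w = svrg(grad_i, n, np.zeros(d), alpha=0.1 / L, m=5 * n, num_cycles=30, rng=rng)
print(np.linalg.norm(w - w_star))
```

Each outer cycle here costs $2m + n$ component gradient evaluations, matching the count given above.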

In practice, SVRG appears to be quite effective in certain applications compared with SG if
one requires high training accuracy. For the first epochs, SG is more efficient, but once the iterates
approach the solution the benefits of the fast convergence rate of SVRG can be observed. Without
explicit knowledge of L and c, the length of the inner cycle m and the stepsize α are typically both
chosen by experimentation.

5.3.2 SAGA
The second method we consider employs an iteration that is closer in form to SG in that it does
not operate in cycles, nor does it compute batch gradients (except possibly at the initial point).
Instead, in each iteration, it computes a stochastic vector gk as the average of stochastic gradients
evaluated at previous iterates. Specifically, in iteration k, the method will have stored ∇fi (w[i] ) for
all i ∈ {1, . . . , n} where w[i] represents the latest iterate at which ∇fi was evaluated. An integer
$j \in \{1, \dots, n\}$ is then chosen at random and the stochastic vector is set by
$$g_k \leftarrow \nabla f_j(w_k) - \nabla f_j(w_{[j]}) + \frac{1}{n}\sum_{i=1}^n \nabla f_i(w_{[i]}). \tag{5.14}$$

Taking the expectation of $g_k$ with respect to all choices of $j \in \{1, \dots, n\}$, one again has that
$\mathbb{E}[g_k] = \nabla R_n(w_k)$. Thus, the method employs unbiased gradient estimates, but with variances that
are expected to be smaller than those of the stochastic gradients that would be employed in a basic SG routine.
A precise algorithm employing (5.14), referred to as SAGA [46], is given in Algorithm 5.2.
Beyond its initialization phase, the per-iteration cost of Algorithm 5.2 is the same as in a basic
SG method. However, it has been shown that the method can achieve a linear rate of convergence

Algorithm 5.2 SAGA Method for Minimizing an Empirical Risk Rn
1: Choose an initial iterate $w_1 \in \mathbb{R}^d$ and stepsize $\alpha > 0$.
2: for $i = 1, \dots, n$ do
3:   Compute $\nabla f_i(w_1)$.
4:   Store $\nabla f_i(w_{[i]}) \leftarrow \nabla f_i(w_1)$.
5: end for
6: for $k = 1, 2, \dots$ do
7:   Choose $j$ uniformly in $\{1, \dots, n\}$.
8:   Compute $\nabla f_j(w_k)$.
9:   Set $g_k \leftarrow \nabla f_j(w_k) - \nabla f_j(w_{[j]}) + \frac{1}{n}\sum_{i=1}^n \nabla f_i(w_{[i]})$.
10:  Store $\nabla f_j(w_{[j]}) \leftarrow \nabla f_j(w_k)$.
11:  Set $w_{k+1} \leftarrow w_k - \alpha g_k$.
12: end for

when minimizing a strongly convex $R_n$. Specifically, with $\alpha = 1/(2(cn + L))$, one can show that
$$\mathbb{E}[\|w_k - w_*\|_2^2] \leq \left(1 - \frac{c}{2(cn + L)}\right)^k \left(\|w_1 - w_*\|_2^2 + \frac{nD}{cn + L}\right),$$
where $D := R_n(w_1) - \nabla R_n(w_*)^T(w_1 - w_*) - R_n(w_*)$.
Of course, attaining such a result requires knowledge of the strong convexity constant c and Lips-
chitz constant L. If c is not known, then the stepsize can instead be chosen to be α = 1/(3L) and
a similar convergence result can be established; see [46].
Alternative initialization techniques could be used in practice, which may be more effective
than evaluating all the gradients {∇fi }ni=1 at the initial point. For example, one could perform
one epoch of simple SG, or one can assimilate iterates one-by-one and compute gk only using the
gradients available up to that point.
One important drawback of Algorithm 5.2 is the need to store n stochastic gradient vectors,
which would be prohibitive in many large-scale applications. Note, however, that if the component
functions are of the form $f_i(w_k) = \hat{f}(x_i^T w_k)$, then
$$\nabla f_i(w_k) = \hat{f}'(x_i^T w_k)\,x_i.$$
That is, when the feature vectors $\{x_i\}$ are already available in storage, one need only store the
scalar $\hat{f}'(x_i^T w_k)$ in order to construct $\nabla f_i(w_k)$ at a later iteration. Such a functional form of $f_i$
occurs in logistic and least squares regression.
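
The following Python sketch combines Algorithm 5.2 with this scalar-storage idea for least squares, where $f_i(w) = \frac{1}{2}(x_i^T w - y_i)^2$ and the stored scalar is the residual $x_i^T w_{[i]} - y_i$. The function name, the toy data, and the iteration budget are assumptions made for illustration; the stepsize $\alpha = 1/(3L)$ is the choice mentioned above for the case in which $c$ is unknown.

```python
import numpy as np

def saga_least_squares(X, y, alpha, num_iters, rng):
    """SAGA for f_i(w) = 0.5 * (x_i^T w - y_i)^2, storing one scalar per example."""
    n, d = X.shape
    w = np.zeros(d)
    # Stored scalars s_i = x_i^T w_[i] - y_i, so that grad f_i(w_[i]) = s_i * x_i.
    s = X @ w - y
    avg = X.T @ s / n                       # (1/n) * sum_i grad f_i(w_[i])
    for _ in range(num_iters):
        j = rng.integers(n)                 # Step 7
        s_new = X[j] @ w - y[j]             # Step 8: scalar defining grad f_j(w_k)
        g = (s_new - s[j]) * X[j] + avg     # Step 9, (5.14)
        avg += (s_new - s[j]) * X[j] / n    # keep the stored average consistent
        s[j] = s_new                        # Step 10
        w = w - alpha * g                   # Step 11
    return w

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]    # minimizer of R_n, for reference
L = np.max(np.sum(X**2, axis=1))                 # Lipschitz constant valid for every grad f_i

w = saga_least_squares(X, y, alpha=1.0 / (3.0 * L), num_iters=100 * n, rng=rng)
print(np.linalg.norm(w - w_star))
```

Beyond the data itself, the memory footprint is $n$ scalars plus one $d$-vector for the running average, instead of $n$ gradient vectors.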
Algorithm 5.2 has its origins in the stochastic average gradient (SAG) algorithm [139, 90], where
the stochastic direction is defined as
$$g_k \leftarrow \frac{1}{n}\left(\nabla f_j(w_k) - \nabla f_j(w_{[j]}) + \sum_{i=1}^n \nabla f_i(w_{[i]})\right). \tag{5.15}$$

Although this gk is not an unbiased estimator of ∇Rn (wk ), the method enjoys a linear rate of
convergence. One finds that the SAG algorithm is a randomized version of the Incremental Ag-
gregated Gradient (IAG) method proposed in [17] and analyzed in [72], where the index j of the
component function updated at every iteration is chosen cyclically. Interestingly, randomizing this
choice yields good practical benefits.

5.3.3 Commentary
Although the gradient aggregation methods discussed in this section enjoy a faster rate of con-
vergence than SG (i.e., linear vs. sublinear), they should not be regarded as clearly superior to
SG. After all, following similar analysis as in §4.4, the computing time for SG can be shown to be
$T(n, \epsilon) \sim \kappa^2/\epsilon$ with $\kappa := L/c$. (In fact, a computing time of $\kappa/\epsilon$ is often observed in practice.) On
the other hand, the computing times for SVRG, SAGA, and SAG are

$$T(n, \epsilon) \sim (n + \kappa)\log(1/\epsilon),$$

which grows with the number of examples n. Thus, following similar analysis as in §4.4, one
finds that, for very large n, gradient aggregation methods are comparable to batch algorithms and
therefore cannot beat SG in this regime. For example, if κ is close to 1, then SG is clearly more
efficient since within a single epoch it reaches the optimal testing error [25]. On the other hand,
there exists a regime with $\kappa \gg n$ in which gradient aggregation methods may be superior, and
perhaps even easier to tune. At present, it is not known how useful gradient aggregation methods
will prove to be in the future of large-scale machine learning. That being said, they certainly
represent a class of optimization methods of interest due to their clever use of past information.
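
To make the two regimes concrete, the short Python sketch below evaluates both estimates for two hypothetical settings. The estimates hold only up to constants, so the printed values indicate orders of magnitude rather than actual running times; the function names and the sample values of $n$, $\kappa$, and $\epsilon$ are our own.

```python
import math

def sg_time(kappa, eps):
    """Order-of-magnitude computing time for SG: kappa^2 / eps."""
    return kappa**2 / eps

def aggregation_time(n, kappa, eps):
    """Order-of-magnitude computing time for SVRG, SAGA, SAG: (n + kappa) log(1/eps)."""
    return (n + kappa) * math.log(1.0 / eps)

# (n, kappa, eps): a very large, well-conditioned problem and a small, ill-conditioned one.
for n, kappa, eps in [(10**9, 2.0, 1e-3), (10**2, 1e3, 1e-6)]:
    print(n, kappa, eps, sg_time(kappa, eps), aggregation_time(n, kappa, eps))
```

In the first setting the SG estimate is smaller by several orders of magnitude, while in the second (with $\kappa \gg n$ and a tight tolerance) the aggregation estimate is far smaller.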

5.4 Iterate Averaging Methods


Since the inception of SG, it has been observed that the method generates noisy iterate sequences that tend to
oscillate around minimizers during the optimization process. Hence, a natural idea is to compute
a corresponding sequence of iterate averages that would automatically possess less noisy behavior.
Specifically, for minimizing a continuously differentiable F with unbiased gradient estimates, the
idea is to employ the iteration

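$$\begin{aligned}
w_{k+1} &\leftarrow w_k - \alpha_k g(w_k, \xi_k) \\
\text{and} \quad \tilde{w}_{k+1} &\leftarrow \frac{1}{k+1}\sum_{j=1}^{k+1} w_j,
\end{aligned} \tag{5.16}$$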

where the averaged sequence {w̃k } has no effect on the computation of the SG iterate sequence {wk }.
Early hopes were that this auxiliary averaged sequence might possess better convergence properties
than the SG iterates themselves. However, such improved behavior was found to be elusive when
using classical stepsize sequences that diminished with a rate of O(1/k) [124].
A fundamental advancement in the use of iterate averaging came with the work of Polyak [125],
which was subsequently advanced with Juditsky [126]; see also the work of Ruppert [136] and
Nemirovski and Yudin [106]. Here, the idea remains to employ the iteration (5.16), but with
stepsizes diminishing at a slower rate of $O(1/k^a)$ for some $a \in (\tfrac{1}{2}, 1)$. When minimizing strongly
convex objectives, it follows from this choice that $\mathbb{E}[\|w_k - w_*\|_2^2] = O(1/k^a)$ while $\mathbb{E}[\|\tilde{w}_k - w_*\|_2^2] = O(1/k)$.
What is interesting, however, is that in certain cases this combination of long steps and
averaging yields an optimal constant in the latter rate (for the iterate averages) in the sense that
no rescaling of the steps—through multiplication with a positive definite matrix (see (6.2) in the
next section)—can improve the asymptotic rate or constant. This shows that, due to averaging,
the adverse effects caused by ill-conditioning disappear. (In this respect, the effect of averaging has
been characterized as being similar to that of using a second-order method in which the Hessian
approximations approach the Hessian of the objective at the minimizer [125, 22]; see §6.2.1.) This
asymptotic behavior is difficult to achieve in practice, but is possible in some circumstances with
careful selection of the stepsize sequence [161].
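
The iteration (5.16) with stepsizes $\alpha_k = \alpha_0/k^a$ can be sketched as follows in Python. The noisy quadratic test problem, the constants $\alpha_0 = 0.5$ and $a = 0.75$, and the function name are assumptions chosen for illustration; the average is maintained as a running mean so that past iterates need not be stored.

```python
import numpy as np

def averaged_sg(grad_est, w1, alpha0, a, num_iters):
    """Run (5.16) with alpha_k = alpha0 / k**a; return the last iterate and the average."""
    w = w1.copy()
    w_avg = w1.copy()                            # w_tilde_1 = w_1
    for k in range(1, num_iters + 1):
        w = w - (alpha0 / k**a) * grad_est(w)    # w_{k+1}
        w_avg += (w - w_avg) / (k + 1)           # running mean of w_1, ..., w_{k+1}
    return w, w_avg

# Noisy quadratic F(w) = 0.5 * sum_i lam_i * w_i^2 with minimizer w_* = 0 (hypothetical).
rng = np.random.default_rng(0)
d = 10
lam = np.linspace(1.0, 2.0, d)
grad_est = lambda w: lam * w + rng.standard_normal(d)    # unbiased gradient estimate

w_last, w_avg = averaged_sg(grad_est, np.ones(d), alpha0=0.5, a=0.75, num_iters=20000)
print(np.sum(w_last**2), np.sum(w_avg**2))   # squared distances to w_* = 0
```

On problems of this kind one would typically expect the averaged iterate to be closer to the minimizer than the last iterate, in line with the $O(1/k)$ versus $O(1/k^a)$ rates, although any single run is random.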
Iterate averaging has since been incorporated into various schemes in order to allow longer steps
while maintaining desired rates of convergence. Examples include the robust SA and mirror descent
SA methods presented in [105], as well as Nesterov’s primal-dual averaging method proposed in
[109, §6]. This latter method is notable in this section as it employs gradient aggregation and yields
an O(1/k) rate of convergence for the averaged iterate sequence.

6 Second-Order Methods
In §5, we looked beyond classical SG to methods that are less affected by noise in the stochastic
directions. Another manner to move beyond classical SG is to address the adverse effects of high
nonlinearity and ill-conditioning of the objective function through the use of second-order informa-
tion. As we shall see, these methods improve convergence rates of batch methods or the constants
involved in the sublinear convergence rate of stochastic methods.
A common way to motivate second-order algorithms is to observe that first-order methods, such
as SG and the full gradient method, are not scale invariant. Consider, for example, the full gradient
iteration for minimizing a continuously differentiable function $F : \mathbb{R}^d \to \mathbb{R}$, namely

$$w_{k+1} \leftarrow w_k - \alpha_k \nabla F(w_k). \tag{6.1}$$

An alternative iteration is obtained by applying a full gradient approach after a linear transformation
of variables, i.e., by considering $\min_{\tilde{w}} F(B\tilde{w})$ for some symmetric positive definite matrix $B$.
The full gradient iteration for this problem has the form

$$\tilde{w}_{k+1} \leftarrow \tilde{w}_k - \alpha_k B\nabla F(B\tilde{w}_k),$$

which, after scaling by $B$ and defining $\{w_k\} := \{B\tilde{w}_k\}$, corresponds to the iteration

$$w_{k+1} \leftarrow w_k - \alpha_k B^2 \nabla F(w_k). \tag{6.2}$$

Comparing (6.2) with (6.1), it is clear that the behavior of the algorithm will be different under this
change of variables. For instance, when $F$ is a strongly convex quadratic with unique minimizer $w_*$,
the full gradient method (6.1) generally requires many iterations to approach the minimizer, but
from any initial point $w_1$ the iteration (6.2) with $B = (\nabla^2 F(w_1))^{-1/2}$ and $\alpha_1 = 1$ will yield
$w_2 = w_*$. These latter choices correspond to a single iteration of Newton's method [52]. In general,
it is natural to seek transformations that perform well in theory and in practice.
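
The quadratic case can be verified numerically. The Python sketch below builds a random strongly convex quadratic (hypothetical data), forms $B = (\nabla^2 F)^{-1/2}$ through an eigendecomposition, and compares one step of (6.2) with one step of (6.1); the stepsize $1/\lambda_{\max}$ used for (6.1) is our own choice for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)            # symmetric positive definite Hessian
b = rng.standard_normal(d)
grad = lambda w: A @ w - b             # gradient of F(w) = 0.5 * w^T A w - b^T w
w_star = np.linalg.solve(A, b)         # unique minimizer

# B = (Hessian)^(-1/2) via an eigendecomposition, so that B @ B equals A^{-1}.
evals, evecs = np.linalg.eigh(A)
B = evecs @ np.diag(evals**-0.5) @ evecs.T

w1 = rng.standard_normal(d)
w2_scaled = w1 - 1.0 * (B @ B) @ grad(w1)          # iteration (6.2) with alpha_1 = 1
w2_plain = w1 - (1.0 / evals.max()) * grad(w1)     # iteration (6.1) with alpha_1 = 1/lambda_max

print(np.linalg.norm(w2_scaled - w_star))          # zero up to rounding error
print(np.linalg.norm(w2_plain - w_star))           # generally far from zero
```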
Another motivation for second-order algorithms comes from the observation that each iteration
of the form (6.1) or (6.2) chooses the subsequent iterate by first computing the minimizer of a
second-order Taylor series approximation $q_k : \mathbb{R}^d \to \mathbb{R}$ of $F$ at $w_k$, which has the form

$$q_k(w) = F(w_k) + \nabla F(w_k)^T(w - w_k) + \tfrac{1}{2}(w - w_k)^T B^{-2}(w - w_k). \tag{6.3}$$

The full gradient iteration corresponds to $B^{-2} = I$ while Newton's method corresponds to $B^{-2} = \nabla^2 F(w_k)$
(assuming this Hessian is positive definite). Thus, in general, a full gradient iteration
works with a model that is only first-order accurate while Newton’s method applies successive local
re-scalings based on minimizing an exact second-order Taylor model of F at each iterate.

Deterministic (i.e., batch) methods are known to benefit from the use of second-order informa-
tion; e.g., Newton’s method achieves a quadratic rate of convergence if w1 is sufficiently close to
a strong minimizer [52]. On the other hand, stochastic methods like the SG method in §4 cannot
achieve a convergence rate that is faster than sublinear, regardless of the choice of B; see [1, 104].
(More on this in §6.2.1.) Therefore, it is natural to ask: can there be a benefit to incorporating
second-order information in stochastic methods? We address this question throughout this sec-
tion by showing that the careful use of successive re-scalings based on (approximate) second-order
derivatives can be beneficial between the stochastic and batch regimes.
We begin this section by considering a Hessian-free Newton method that employs exact second-
order information, but in a judicious manner that exploits the stochastic nature of the objective
function. We then describe methods that attempt to mimic the behavior of a Newton algorithm
through first-order information computed over sequences of iterates; these include quasi-Newton,
Gauss-Newton, and related algorithms that employ only diagonal re-scalings. We also discuss the
natural gradient method, which defines a search direction in the space of realizable distributions,
rather than in the space of the real parameter vector w. Whereas Newton’s method is invariant
to linear transformations of the variables, the natural gradient method is invariant with respect to
more general invertible transformations.
We depict the methods of interest in this section on the downward axis illustrated in Figure 6.1.
We use double-sided arrows for the methods that can be effective throughout the spectrum between
the stochastic and batch regimes. Single-sided arrows are used for those methods that one might
consider to be effective only with at least a moderate batch size in the stochastic gradient estimates.
We explain these distinctions as we describe the methods.

[Figure 6.1 schematic. Horizontal axis: stochastic gradient method to batch gradient method. Downward axis, toward the stochastic Newton method: diagonal scaling, quasi-Newton, Gauss-Newton, Hessian-free Newton, natural gradient.]

Fig. 6.1: View of the schematic from Figure 3.3 with a focus on second-order methods. The
dotted arrows indicate the effective regime of each method: the first three methods can employ
mini-batches of any size, whereas the last two methods are efficient only for moderate-to-large
mini-batch sizes.

6.1 Hessian-Free Inexact Newton Methods
Due to its scale invariance properties and its ability to achieve a quadratic rate of convergence in
the neighborhood of a strong local minimizer, Newton’s method represents an ideal in terms of
optimization algorithms. It does not scale well with the dimension d of the optimization problem,
but there are variants that can scale well while also being able to deal with nonconvexity.
When minimizing a twice-continuously differentiable $F$, a Newton iteration is
$$w_{k+1} \leftarrow w_k + \alpha_k s_k, \tag{6.4a}$$
$$\text{where } s_k \text{ satisfies } \nabla^2 F(w_k)s_k = -\nabla F(w_k). \tag{6.4b}$$
This iteration demands much in terms of computation and storage. However, rather than solve the
Newton system (6.4b) exactly through matrix factorization techniques, one can instead only solve
it inexactly through an iterative approach such as the conjugate gradient (CG) method [69]. By
ensuring that the linear solves are accurate enough, such an inexact Newton-CG method can enjoy
a superlinear rate of convergence [47].
In fact, the computational benefits of inexact Newton-CG go beyond its ability to maintain
classical convergence rate guarantees. Like many iterative linear system techniques, CG applied to
(6.4b) does not require access to the Hessian itself, only Hessian-vector products [121]. It is in this
sense that such a method may be called Hessian-free. This is ideal when such products can be coded
directly without having to form an explicit Hessian, as Example 6.1 below demonstrates. Each
product is at least as expensive as a gradient evaluation, but as long as the number of products—
one per CG iteration—is not too large, the improved rate of convergence can compensate for the
extra per-iteration work required over a simple full gradient method.
Example 6.1. Consider the function of the parameter vector $w = (w_1, w_2)$ given by $F(w) = \exp(w_1 w_2)$.
Let us define, for any $d \in \mathbb{R}^2$, the function
$$\phi(w; d) = \nabla F(w)^T d = w_2\exp(w_1 w_2)d_1 + w_1\exp(w_1 w_2)d_2.$$
Computing the gradient of $\phi$ with respect to $w$, we have
$$\nabla_w \phi(w; d) = \nabla^2 F(w)d = \begin{bmatrix} w_2^2\exp(w_1 w_2)d_1 + (\exp(w_1 w_2) + w_1 w_2\exp(w_1 w_2))d_2 \\ (\exp(w_1 w_2) + w_1 w_2\exp(w_1 w_2))d_1 + w_1^2\exp(w_1 w_2)d_2 \end{bmatrix}.$$
We have thus obtained, for any $d \in \mathbb{R}^2$, a formula for computing $\nabla^2 F(w)d$ that does not require
$\nabla^2 F(w)$ explicitly. Note that by storing the scalars $w_1 w_2$ and $\exp(w_1 w_2)$ from the evaluation of $F$,
the additional costs of computing the gradient-vector and Hessian-vector products are small.
The idea illustrated in this example can be applied in general; e.g., see also Example 6.2.
For a smooth objective function $F$, one can compute $\nabla^2 F(w)d$ at a cost that is a small
multiple of the cost of evaluating $\nabla F$, and without forming the Hessian, which would require $O(d^2)$
storage. The savings in computation come at the expense of storage of some additional quantities,
as explained in Example 6.1.
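
The formula of Example 6.1 translates directly into code. In the Python sketch below, the function names are our own, and the explicit Hessian is formed only to check the product; a Hessian-free method would never build it.

```python
import numpy as np

def hessian_vector_product(w, d):
    """Return the product of the Hessian of F(w) = exp(w1 * w2) with d, without forming the Hessian."""
    w1, w2 = w
    d1, d2 = d
    e = np.exp(w1 * w2)            # scalar reused from the evaluation of F
    cross = e + w1 * w2 * e        # mixed second derivative
    return np.array([w2**2 * e * d1 + cross * d2,
                     cross * d1 + w1**2 * e * d2])

def explicit_hessian(w):
    """Explicit 2-by-2 Hessian, used here only as a check."""
    w1, w2 = w
    e = np.exp(w1 * w2)
    cross = e + w1 * w2 * e
    return np.array([[w2**2 * e, cross], [cross, w1**2 * e]])

w = np.array([0.3, -1.2])
d = np.array([2.0, 0.5])
print(hessian_vector_product(w, d))
print(explicit_hessian(w) @ d)     # agrees with the line above
```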
In machine learning applications, including those involving multinomial logistic regression and
deep neural networks, Hessian-vector products can be computed in this manner, and an inexact
Newton-CG method can be applied. A concern is that, in certain cases, the cost of the CG iterations
may render such a method uncompetitive with alternative approaches, such as an SG method or a
limited memory BFGS method (see §6.2), which have small computational overhead. Interestingly,
however, the structure of the risk measures (3.1) and (3.2) can be exploited so that the resulting
method has lighter computational overheads, as described next.

6.1.1 Subsampled Hessian-Free Newton Methods
The motivation for the method we now describe stems from the observation that, in inexact Newton
methods, the Hessian matrix need not be as accurate as the gradient to yield an effective iteration.
Translated to the context of large-scale machine learning applications, this means that the iteration
is more tolerant to noise in the Hessian estimate than it is to noise in the gradient estimate.
Based on this idea, the technique we state here employs a smaller sample for defining the Hessian
than for the stochastic gradient estimate. Following similar notation as introduced in §5.2, let the
stochastic gradient estimate be
$$\nabla f_{S_k}(w_k; \xi_k) = \frac{1}{|S_k|}\sum_{i \in S_k} \nabla f(w_k; \xi_{k,i})$$
and let the stochastic Hessian estimate be
$$\nabla^2 f_{S_k^H}(w_k; \xi_k^H) = \frac{1}{|S_k^H|}\sum_{i \in S_k^H} \nabla^2 f(w_k; \xi_{k,i}), \tag{6.5}$$
where $S_k^H$ is conditionally uncorrelated with $S_k$ given $w_k$. If one chooses the subsample size $|S_k^H|$
small enough, then the cost of each product involving the Hessian approximation can be reduced
significantly, thus reducing the cost of each CG iteration. On the other hand, one should choose
$|S_k^H|$ large enough so that the curvature information captured through the Hessian-vector products
is productive. If done appropriately, Hessian subsampling is robust and effective [2, 29, 122, 132].
An inexact Newton method that incorporates this technique is outlined in Algorithm 6.1. The
algorithm is stated with a backtracking (Armijo) line search [114], though other stepsize-selection
techniques could be considered as well.

Algorithm 6.1 Subsampled Hessian-Free Inexact Newton Method


1: Choose an initial iterate $w_1$.
2: Choose constants $\rho \in (0, 1)$, $\gamma \in (0, 1)$, $\eta \in (0, 1)$, and $\mathrm{max}_{cg} \in \mathbb{N}$.
3: for $k = 1, 2, \dots$ do
4:   Generate realizations of $\xi_k$ and $\xi_k^H$ corresponding to $S_k$ and $S_k^H$.
5:   Compute $s_k$ by applying Hessian-free CG to solve
       $$\nabla^2 f_{S_k^H}(w_k; \xi_k^H)s = -\nabla f_{S_k}(w_k; \xi_k) \tag{6.6}$$
     until $\mathrm{max}_{cg}$ iterations have been performed or a trial solution yields
       $$\|r_k\|_2 := \|\nabla^2 f_{S_k^H}(w_k; \xi_k^H)s + \nabla f_{S_k}(w_k; \xi_k)\|_2 \leq \rho\|\nabla f_{S_k}(w_k; \xi_k)\|_2.$$
6:   Set $w_{k+1} \leftarrow w_k + \alpha_k s_k$, where $\alpha_k \in \{\gamma^0, \gamma^1, \gamma^2, \dots\}$ is the largest element with
       $$f_{S_k}(w_{k+1}; \xi_k) \leq f_{S_k}(w_k; \xi_k) + \eta\alpha_k \nabla f_{S_k}(w_k; \xi_k)^T s_k. \tag{6.7}$$
7: end for

As previously mentioned, the (subsampled) Hessian-vector products required in Algorithm 6.1
can be computed efficiently in the context of many machine learning applications. For instance,
one such case is the following.

Example 6.2. Consider a binary classification problem where the training function is given by the
logistic loss with an $\ell_2$-norm regularization parameterized by $\lambda > 0$:
$$R_n(w) = \frac{1}{n}\sum_{i=1}^n \log(1 + \exp(-y_i w^T x_i)) + \frac{\lambda}{2}\|w\|^2. \tag{6.8}$$

A (subsampled) Hessian-vector product can be computed efficiently by observing that

$$\nabla^2 f_{S_k^H}(w_k; \xi_k^H)d = \frac{1}{|S_k^H|}\sum_{i \in S_k^H} \frac{\exp(-y_i w^T x_i)}{(1 + \exp(-y_i w^T x_i))^2}(x_i^T d)x_i + \lambda d.$$
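
In code, this product requires only two matrix-vector products with the subsampled data. The Python sketch below assumes that the rows of `X` are the feature vectors $x_i$, that the labels satisfy $y_i \in \{-1, +1\}$, and that `idx` holds the indices in $S_k^H$; these names and the random test data are hypothetical.

```python
import numpy as np

def logistic_hvp(w, d, X, y, lam, idx):
    """Subsampled Hessian-vector product for (6.8) over the index set idx."""
    Xs, ys = X[idx], y[idx]
    z = np.exp(-ys * (Xs @ w))
    weights = z / (1.0 + z) ** 2          # exp(-y w^T x) / (1 + exp(-y w^T x))^2
    # Two matrix-vector products; the p-by-p Hessian is never formed.
    return Xs.T @ (weights * (Xs @ d)) / len(idx) + lam * d

rng = np.random.default_rng(0)
n, p = 500, 6
X = rng.standard_normal((n, p))
y = np.where(rng.standard_normal(n) > 0, 1.0, -1.0)
w, d = rng.standard_normal(p), rng.standard_normal(p)
idx = rng.choice(n, size=50, replace=False)       # Hessian subsample
print(logistic_hvp(w, d, X, y, lam=0.1, idx=idx))
```

The cost is proportional to the subsample size times $p$, which is what makes a small $|S_k^H|$ attractive.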

To quantify the step computation cost in an inexact Newton-CG framework such as Algorithm 6.1,
let $g_{cost}$ be the cost of computing a gradient estimate $\nabla f_{S_k}(w_k; \xi_k)$ and $\mathit{factor} \times g_{cost}$
denote the cost of one Hessian-vector product. If the maximum number of CG iterations, $\mathrm{max}_{cg}$,
is performed for every outer iteration, then the step computation cost in Algorithm 6.1 is
$$\mathrm{max}_{cg} \times \mathit{factor} \times g_{cost} + g_{cost}.$$
In a deterministic inexact Newton-CG method for minimizing the empirical risk $R_n$, i.e., when
$|S_k^H| = |S_k| = n$ for all $k \in \mathbb{N}$, the factor is at least 1 and $\mathrm{max}_{cg}$ would typically be chosen as 5,
20, or more, leading to an iteration that is many times the cost of an SG iteration. However, in a
stochastic framework using Hessian sub-sampling, the factor can be chosen to be sufficiently small
such that $\mathrm{max}_{cg} \times \mathit{factor} \approx 1$, leading to a per-iteration cost proportional to that of SG.
Implicit in this discussion is the assumption that the gradient sample size $|S_k|$ is large enough
so that taking subsamples for the Hessian estimate is sensible. If, by contrast, the algorithm were
to operate in the stochastic regime of SG where $|S_k|$ is small and gradients are very noisy, then it
may be necessary to choose $|S_k^H| > |S_k|$ so that Hessian approximations do not corrupt the step. In
such circumstances, the method would be far less attractive than SG. Therefore, the subsampled
Hessian-free Newton method outlined here is only recommended when $|S_k|$ is large. This is why, in
Figure 6.1, the Hessian-free Newton method is illustrated only with an arrow to the right, i.e., in
the direction of larger sample sizes.
Convergence of Algorithm 6.1 is easy to establish when minimizing a strongly convex empirical
risk measure $F = R_n$ with $S_k \leftarrow \{1, \dots, n\}$ for all $k \in \mathbb{N}$, i.e., when full gradients are always
used. In this case, a benefit of employing CG to solve (6.6) is that it immediately improves upon
the direction employed in a steepest descent iteration. Specifically, when initialized at zero, it
produces in its first iteration a scalar multiple of the steepest descent direction $-\nabla F(w_k)$, and
further iterations monotonically improve upon this step (in terms of minimizing a quadratic model
of the form in (6.3)) until the Newton step is obtained, which occurs after at most $d$ iterations
of CG (in exact arithmetic). Therefore, by using any number of CG iterations, convergence can
be established using standard techniques to choose the stepsize $\alpha_k$ [114]. When exact Hessians are
also used, the rate of convergence can be controlled through the accuracy with which the systems
(6.4b) are solved. Defining $r_k := \nabla^2 F(w_k)s_k + \nabla F(w_k)$ for all $k \in \mathbb{N}$, the iteration can enjoy a
linear, superlinear, or quadratic rate of convergence by controlling $\|r_k\|_2$, where for the superlinear
rates one must have $\{\|r_k\|_2/\|\nabla F(w_k)\|_2\} \to 0$ [47].
When the Hessians are subsampled (i.e., $S_k^H \subset S_k$ for all $k \in \mathbb{N}$), it has not been shown that
the rate of convergence is faster than linear; nevertheless, the number of iterations
required to produce a good approximate solution is often significantly lower than if no Hessian
information is used in the algorithm.
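
A compact Python sketch of Algorithm 6.1 for the regularized logistic loss (6.8) is given below. It is a sketch under stated assumptions, not a reference implementation: all function names, the default constants, the synthetic data, and the cap of 30 backtracking steps are our own choices, and the usage example takes full gradients ($S_k = \{1, \dots, n\}$) with a Hessian subsample of 100 examples.

```python
import numpy as np

def loss(w, X, y, lam):
    return np.mean(np.logaddexp(0.0, -y * (X @ w))) + 0.5 * lam * (w @ w)

def grad(w, X, y, lam):
    p = 1.0 / (1.0 + np.exp(y * (X @ w)))          # derivative weight of the logistic loss
    return -(X.T @ (y * p)) / len(y) + lam * w

def hvp(w, v, X, y, lam):
    z = np.exp(-y * (X @ w))
    return X.T @ ((z / (1.0 + z) ** 2) * (X @ v)) / len(y) + lam * v

def cg(matvec, b, rho, max_cg):
    """Approximately solve matvec(s) = b by CG, starting from s = 0."""
    s, r = np.zeros_like(b), b.copy()              # r is the residual b - matvec(s)
    p, rr = r.copy(), r @ r
    tol = rho * np.sqrt(rr)                        # relative residual test of Step 5
    for _ in range(max_cg):
        if np.sqrt(rr) <= tol:
            break
        Ap = matvec(p)
        step = rr / (p @ Ap)
        s += step * p
        r -= step * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return s

def subsampled_newton_cg(X, y, lam, w1, num_iters, grad_batch, hess_batch, rng,
                         rho=0.1, gamma=0.5, eta=1e-4, max_cg=10):
    n, w = len(y), w1.copy()
    for _ in range(num_iters):
        S = rng.choice(n, size=grad_batch, replace=False)      # Step 4
        SH = rng.choice(n, size=hess_batch, replace=False)
        g = grad(w, X[S], y[S], lam)
        # Step 5: CG on (6.6), using only subsampled Hessian-vector products.
        s = cg(lambda v: hvp(w, v, X[SH], y[SH], lam), -g, rho, max_cg)
        # Step 6: backtracking (Armijo) line search (6.7) on the sampled function.
        f0, slope, alpha = loss(w, X[S], y[S], lam), g @ s, 1.0
        for _ in range(30):                                    # cap is an added safeguard
            if loss(w + alpha * s, X[S], y[S], lam) <= f0 + eta * alpha * slope:
                break
            alpha *= gamma
        w = w + alpha * s
    return w

rng = np.random.default_rng(0)
n, p, lam = 500, 5, 0.1
X = rng.standard_normal((n, p))
y = np.where(X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n) > 0, 1.0, -1.0)
w = subsampled_newton_cg(X, y, lam, np.zeros(p), num_iters=30,
                         grad_batch=n, hess_batch=100, rng=rng)
print(loss(w, X, y, lam), np.linalg.norm(grad(w, X, y, lam)))
```

With `grad_batch` smaller than `n` the same code runs in the stochastic regime, where, as discussed above, the gradient sample should remain reasonably large for the Hessian subsampling to make sense.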

6.1.2 Dealing with Nonconvexity
Hessian-free Newton methods are routinely applied for the solution of nonconvex problems. In
such cases, it is common to employ a trust region [37] instead of a line search and to introduce
an additional condition in Step 5 of Algorithm 6.1: terminate CG if a candidate solution sk is a
direction of negative curvature, i.e., sTk ∇2 fS H (wk ; ξkH )sk < 0 [149]. A number of more sophisticated
k
strategies have been proposed throughout the years with some success, but none have proved to be
totally satisfactory or universally accepted.
Instead of coping with indefiniteness, one can focus on strategies for ensuring positive (semi)definite
Hessian approximations. One of the most attractive ways of doing this in the context of machine
learning is to employ a (subsampled) Gauss-Newton approximation to the Hessian, which is a
matrix of the form
$$G_{S_k^H}(w_k; \xi_k^H) = \frac{1}{|S_k^H|} \sum_{i \in S_k^H} J_h(w_k, \xi_{k,i})^T H_\ell(w_k, \xi_{k,i}) J_h(w_k, \xi_{k,i}). \tag{6.9}$$

Here, the matrix $J_h$ captures the stochastic gradient information for the prediction function h(x; w),
whereas the matrix $H_\ell$ only captures the second order information for the (convex) loss function
$\ell(h, y)$; see §6.3 for a detailed explanation. As before, one can directly code the product of this
matrix times a vector without forming the matrix components explicitly. This approach has been
applied successfully in the training of deep neural networks [12, 100].
We mention in passing that there has been much discussion about the role that negative cur-
vature and saddle points play in the optimization of deep neural networks; see e.g., [44, 70, 35].
Numerical tests designed to probe the geometry of the objective function in the neighborhood of a
minimizer when training a deep neural network have shown the presence of negative curvature. It is
believed that the inherent stochasticity of the SG method allows it to navigate efficiently through
this complex landscape, but it is not known whether classical techniques to avoid approaching
saddle points will prove to be successful for either batch or stochastic methods.

6.2 Stochastic Quasi-Newton Methods


One of the most important developments in the field of nonlinear optimization came with the
advent of quasi-Newton methods. These methods construct approximations to the Hessian using
only gradient information, and are applicable for convex and nonconvex problems. Versions that
scale well with the number of variables, such as limited memory methods, have proved to be effective
in a wide range of applications where the number of variables can be in the millions. It is therefore
natural to ask whether quasi-Newton methods can be extended to the stochastic setting arising in
machine learning. Before we embark on this discussion, let us review the basic principles underlying
quasi-Newton methods, focusing on the most popular scheme, namely BFGS [27, 60, 68, 146].
In the spirit of Newton’s method (6.4), the BFGS iteration for minimizing a twice continuously
differentiable function F has the form

wk+1 ← wk − αk Hk ∇F (wk ), (6.10)

where Hk is a symmetric positive definite approximation of (∇2 F (wk ))−1 . This form of the iteration
is consistent with (6.4), but the signifying feature of a quasi-Newton scheme is that the sequence
{Hk } is updated dynamically by the algorithm rather than through a second-order derivative com-
putation at each iterate. Specifically, in the BFGS method, the new inverse Hessian approximation
is obtained by defining the iterate and gradient displacements

sk := wk+1 − wk and vk := ∇F (wk+1 ) − ∇F (wk ),

then setting
$$H_{k+1} \leftarrow \left(I - \frac{v_k s_k^T}{s_k^T v_k}\right)^T H_k \left(I - \frac{v_k s_k^T}{s_k^T v_k}\right) + \frac{s_k s_k^T}{s_k^T v_k}. \tag{6.11}$$

One important aspect of this update is that it ensures that the secant equation $H_{k+1}^{-1} s_k = v_k$ holds,
meaning that a second-order Taylor expansion is satisfied along the most recent displacement
(though not necessarily along other directions).
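As an illustration of the update (6.11) and the secant equation, here is a minimal NumPy sketch. The function name `bfgs_inverse_update` and the quadratic test problem are assumptions made for this example; for a quadratic with Hessian `A`, the gradient displacement is exactly `A @ s`.

```python
import numpy as np

def bfgs_inverse_update(H, s, v):
    """Apply the inverse BFGS update (6.11); requires s^T v > 0."""
    rho = 1.0 / (s @ v)
    V = np.eye(len(s)) - rho * np.outer(v, s)   # I - v s^T / (s^T v)
    return V.T @ H @ V + rho * np.outer(s, s)

rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)        # Hessian of an assumed convex quadratic
s = rng.standard_normal(d)     # iterate displacement s_k
v = A @ s                      # gradient displacement v_k for that quadratic

H_new = bfgs_inverse_update(np.eye(d), s, v)
```

Here `H_new @ v` reproduces `s`, which is the secant equation written in terms of the inverse approximation, and `H_new` remains symmetric positive definite because `s @ v > 0`.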
A remarkable fact about the BFGS iteration (6.10)–(6.11) is that it enjoys a local superlinear
rate of convergence [51], and this with only first-order information and without the need for any
linear system solves (which are required by Newton’s method for it to be quadratically convergent).
However, a number of issues need to be addressed to have an effective method in practice. For one
thing, the update (6.11) yields dense matrices, even when the exact Hessians are sparse, restricting
its use to small and midsize problems. A common solution for this is to employ a limited memory
scheme, leading to a method such as L-BFGS [97, 113]. A key feature of this approach is that the
matrices {Hk } need not be formed explicitly; instead, each product of the form Hk ∇F (wk ) can be
computed using a formula that only requires recent elements of the sequence of displacement pairs
{(sk , vk )} that have been saved in storage. Such an approach incurs per-iteration costs of order
O(d), and delivers practical performance that is significantly better than a full gradient iteration,
though the rate of convergence is only provably linear.

6.2.1 Deterministic to Stochastic


Let us consider the extension of a quasi-Newton approach from the deterministic to the stochastic
setting arising in machine learning. The iteration now takes the form

wk+1 ← wk − αk Hk g(wk , ξk ). (6.12)

Since we are interested in large-scale problems, we assume that (6.12) implements an L-BFGS
scheme, which avoids the explicit construction of Hk . A number of questions arise when consider-
ing (6.12). We list them now with some proposed solutions.

Theoretical Limitations The convergence rate of a stochastic iteration such as (6.12) cannot
be faster than sublinear [1]. Given that SG also has a sublinear rate of convergence, what benefit,
if any, could come from incorporating Hk in (6.12)? This is an important question. As it happens,
one can see a benefit of Hk in terms of the constant that appears in the sublinear rate. Recall
that for the SG method (Algorithm 4.1), the constant depends on L/c, which in turn depends
on the conditioning of {∇2 F (wk )}. This is typical of first-order methods. In contrast, one can
show [25] that if the sequence of Hessian approximations in (6.12) satisfies {Hk } → ∇2 F (w∗ )−1 ,
then the constant is independent of the conditioning of the Hessian. Although constructing Hessian
approximations with this property might not be viable in practice, this fact suggests that stochastic
quasi-Newton methods could be better equipped to cope with ill-conditioning than SG.

Additional Per-Iteration Costs The SG iteration is very inexpensive, requiring only the eval-
uation of g(wk , ξk ). The iteration (6.12), on the other hand, also requires the product Hk g(wk , ξk ),
which is known to require 4md operations where m is the memory in the L-BFGS updating scheme.
Assuming for concreteness that the cost of evaluating g(wk , ξk ) is exactly d operations (using only
one sample) and that the memory parameter is set to the typical value of m = 5, one finds that
the stochastic quasi-Newton method is 20 times more expensive than SG. Can the iteration (6.12)
yield fast enough progress as to offset this additional per-iteration cost? To address this question,
one need only observe that the calculation just mentioned focuses on the gradient g(wk , ξk ) being
based on a single sample. However, when employing mini-batch gradient estimates, the additional
cost of the iteration (6.12) is only marginal. (Mini-batches of size 256 are common in practice.)
The use of mini-batches may therefore be considered essential when one contemplates the use of a
stochastic quasi-Newton method. This mini-batch need not be large, as in the Hessian-free Newton
method discussed in the previous section, but it should not be less than, say, 20 or 50 in light of
the additional costs of computing the matrix-vector products.
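The cost argument above amounts to simple arithmetic, sketched below. The helper `lbfgs_overhead` is a hypothetical name; it adopts the assumptions stated in the text, namely that one per-sample gradient costs d operations and that the L-BFGS product costs 4md operations.

```python
def lbfgs_overhead(batch_size, memory=5):
    """Cost of the product H_k g relative to the cost of the mini-batch gradient.

    Assumes d operations per sample for the gradient and 4*m*d operations
    for the L-BFGS product, so the dimension d cancels in the ratio.
    """
    return 4 * memory / batch_size

for b in (1, 20, 50, 256):
    print(f"mini-batch size {b:3d}: overhead ratio {lbfgs_overhead(b):.3f}")
```

With a single sample the extra work is 20 times the gradient cost, whereas with a mini-batch of 256 it falls below one tenth of the gradient cost.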

Conditioning of the Scaling Matrices The BFGS formula (6.11) for updating Hk involves
differences in gradient estimates computed in consecutive iterations. In stochastic settings, the
gradients {g(wk , ξk )} are noisy estimates of {∇F (wk )}. This can cause the updating process to
yield poor curvature estimates, which may have a detrimental rather than beneficial effect on
the quality of the computed steps. Since BFGS, like all quasi-Newton schemes, is an overwriting
process, the effects of even a single bad update may linger for numerous iterations. How could such
effects be avoided in the stochastic regime? There have been various proposals to avoid differencing
noisy gradient estimates. One possibility is to employ the same sample when computing gradient
differences [18, 142]. An alternative approach that allows greater freedom in the choice of the
stochastic gradient is to decouple the step computation (6.12) and the Hessian update. In this
manner, one can employ a larger sample, if necessary, when computing the gradient displacement
vector. We discuss these ideas further in §6.2.2.
It is worthwhile to note that if the gradient estimate g(wk , ξk ) does not have high variance, then
standard BFGS updating can be applied without concerns. Therefore, in the rest of this section,
we focus on algorithms that employ noisy gradient estimates in the step computation (6.12). This
means, e.g., that we are not considering the potential to tie the method with noise reduction
techniques described in §5, though such an idea is natural and could be effective in practice.

6.2.2 Algorithms
A straightforward adaptation of L-BFGS involves only the replacement of deterministic gradients
with stochastic gradients throughout the iterative process. The displacement pairs might then be
defined as
sk := wk+1 − wk and vk := ∇fSk (wk+1 , ξk ) − ∇fSk (wk , ξk ). (6.13)
Note the use of the same seed ξk in the two gradient estimates, in order to address the issues related
to noise mentioned above. If each $f_i$ is strongly convex, then $s_k^T v_k > 0$, and positive definiteness
of the updates is also maintained. Such an approach is sometimes referred to as online L-BFGS
[142, 103].
One disadvantage of this method is the need to compute two, as opposed to only one, gradient
estimate per iteration: one to compute the gradient displacement (namely, g(wk+1 , ξk )) and another
(namely, g(wk+1 , ξk+1 )) to compute the subsequent step. This is not too onerous, at least as long
as the per-iteration improvement outweighs the extra per-iteration cost. A more worrisome feature
is that updating the inverse Hessian approximations with every step may not be warranted, and
may even be detrimental when the gradient displacement is based on a small sample, since it could
easily represent a poor approximation of the action of the true Hessian of F .
An alternative strategy, which might better represent the action of the true Hessian even when
g(wk , ξk ) has high variance, is to employ an alternative vk . In particular, since ∇F (wk+1 ) −
∇F (wk ) ≈ ∇2 F (wk )(wk+1 − wk ), one could define

$$v_k := \nabla^2 f_{S_k^H}(w_k; \xi_k^H) s_k, \tag{6.14}$$

where $\nabla^2 f_{S_k^H}(w_k; \xi_k^H)$ is a subsampled Hessian and $|S_k^H|$ is large enough to provide useful curvature
information. As in the case of Hessian-free Newton from §6.1.1, the product (6.14) can be performed
without explicitly constructing $\nabla^2 f_{S_k^H}(w_k; \xi_k^H)$.
Regardless of the definition of vk , when |SkH | is much larger than |Sk |, the cost of quasi-Newton
updating is excessive due to the cost of computing vk . To address this issue, the computation of
vk can be performed only after a sequence of iterations, so as to amortize costs. This leads to the
idea of decoupling the step computation from the quasi-Newton update. This approach, which we
refer to for convenience as SQN, performs a sequence of iterations of (6.12) with Hk fixed, then
computes a new displacement pair (sk , vk ) with sk defined as in (6.13) and vk set using one of the
strategies outlined above. This pair replaces one of the old pairs in storage, which in turn defines
the limited memory BFGS step.
To formalize all of these alternatives, we state the general stochastic quasi-Newton method
presented as Algorithm 6.2, with some notation borrowed from Algorithm 6.1. In the method,
the step computation is based on a collection of m displacement pairs P = {sj , vj } in storage
and the current stochastic gradient ∇fSk (wk ; ξk ), where the matrix-vector product in (6.12) can be
computed through a two-loop recursion [113, 114]. To demonstrate the generality of the method, we
note that the online L-BFGS method sets SkH ← Sk and update pairs = true in every iteration.
In SQN using (6.14), on the other hand, |SkH | should be chosen larger than |Sk | and one sets update
pairs = true only every, say, 10 or 20 iterations.
To guarantee that the BFGS update is well defined, each displacement pair $(s_j, v_j)$ must satisfy
$s_j^T v_j > 0$. In deterministic optimization, this issue is commonly addressed by either performing
a line search (involving exact gradient computations) or modifying the displacement vectors (e.g.,
through damping) so that $s_j^T v_j > 0$, in which case one does ensure that (6.11) maintains positive
definite approximations. However, these mechanisms have not been fully developed in the stochastic
regime when exact gradient information is unavailable and the gradient displacement vectors are
noisy. Simple ways to overcome these difficulties are to replace the Hessian matrix with a
Gauss-Newton approximation or to introduce a combination of damping and regularization (say, through
the addition of simple positive definite matrices).
There remains much to be explored in terms of stochastic quasi-Newton methods for machine
learning applications. Experience has shown that some gains in performance can be achieved, but
the full potential of the quasi-Newton schemes discussed above (and potentially others) is not yet
known.

Algorithm 6.2 Stochastic Quasi-Newton Framework
 1: Choose an initial iterate w1 and initialize P ← ∅.
 2: Choose a constant m ∈ N.
 3: Choose a stepsize sequence {αk } ⊂ R++ .
 4: for k = 1, 2, . . . do
 5:     Generate realizations of ξk and ξkH corresponding to Sk and SkH .
 6:     Compute ŝk = Hk g(wk , ξk ) using the two-loop recursion based on the set P.
 7:     Set sk ← −αk ŝk .
 8:     Set wk+1 ← wk + sk .
 9:     if update pairs then
10:         Compute sk and vk (based on the sample SkH ).
11:         Add the new displacement pair (sk , vk ) to P.
12:         If |P| > m, then remove eldest pair from P.
13:     end if
14: end for
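To make the framework concrete, the following is a minimal sketch of Algorithm 6.2 on a synthetic least-squares problem, with vk computed from (6.14). It is an illustration under stated assumptions rather than a reference implementation: the function names, the noiseless synthetic data, the constant stepsize, the sample sizes, and the initial scaling inside the two-loop recursion (a common choice that the text does not specify) are all chosen for this example.

```python
import numpy as np
from collections import deque

def two_loop(grad, pairs):
    """Return H_k @ grad via the L-BFGS two-loop recursion.

    `pairs` lists (s_j, v_j) from oldest to newest, each with s_j^T v_j > 0.
    With no stored pairs, H_k is the identity.
    """
    q = grad.copy()
    alphas = []
    for s, v in reversed(pairs):                 # newest to oldest
        a = (s @ q) / (s @ v)
        q -= a * v
        alphas.append(a)
    if pairs:                                    # assumed initial scaling of H
        s, v = pairs[-1]
        q *= (s @ v) / (v @ v)
    for (s, v), a in zip(pairs, reversed(alphas)):   # oldest to newest
        b = (v @ q) / (s @ v)
        q += (a - b) * s
    return q

def sqn_least_squares(X, y, m=5, alpha=0.1, batch=20, hess_batch=200,
                      update_every=10, iters=200, seed=0):
    """Algorithm 6.2 for f_i(w) = 0.5 * (x_i^T w - y_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    pairs = deque(maxlen=m)                      # a full deque drops its eldest pair
    for k in range(1, iters + 1):
        S = rng.choice(n, size=batch, replace=False)
        g = X[S].T @ (X[S] @ w - y[S]) / batch   # stochastic gradient
        s = -alpha * two_loop(g, list(pairs))    # steps 6 and 7
        w = w + s                                # step 8
        if k % update_every == 0:                # "update pairs" condition
            SH = rng.choice(n, size=hess_batch, replace=False)
            v = X[SH].T @ (X[SH] @ s) / hess_batch   # (6.14) without forming the Hessian
            if s @ v > 1e-12:                    # keep the update well defined
                pairs.append((s, v))
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
w_true = rng.standard_normal(10)
y = X @ w_true                                   # noiseless synthetic targets
w_hat = sqn_least_squares(X, y)
```

Storing the pairs in a `deque` with `maxlen=m` implements step 12 implicitly. Setting `hess_batch = batch` and `update_every = 1`, and replacing the Hessian-vector product with the gradient difference (6.13), would give a variant in the spirit of online L-BFGS.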

6.3 Gauss-Newton Methods


The Gauss-Newton method is a classical approach for nonlinear least squares, i.e., minimization
problems in which the objective function is a sum of squares. This method readily applies for
optimization problems arising in machine learning involving a least squares loss function, but the
idea generalizes for other popular loss functions as well. The primary advantage of Gauss-Newton
is that it constructs an approximation to the Hessian using only first-order information, and this
approximation is guaranteed to be positive semidefinite, even when the full Hessian itself may
be indefinite. The price to pay for this convenient representation is that it ignores second-order
interactions between elements of the parameter vector w, which might mean a loss of curvature
information that could be useful for the optimization process.

Classical Gauss-Newton Let us introduce the classical Gauss-Newton approach by considering
a situation in which, for a given input-output pair (x, y), the loss incurred by a parameter vector w
is measured via a squared norm discrepancy between $h(x; w) \in \mathbb{R}^d$ and $y \in \mathbb{R}^d$. Representing the
input-output pair being chosen randomly via the subscript ξ, we may thus write

$$f(w; \xi) = \ell(h(x_\xi; w), y_\xi) = \tfrac{1}{2} \|h(x_\xi; w) - y_\xi\|_2^2.$$

Writing a second-order Taylor series model of this function in the vicinity of parameter vector wk
would involve its gradient and Hessian at wk , and minimizing the resulting model (recall (6.3))
would lead to a Newton iteration. Alternatively, a Gauss-Newton approximation of the function
is obtained by making an affine approximation of the prediction function inside the quadratic
loss function. Letting Jh (·; ξ) represent the Jacobian of h(xξ ; ·) with respect to w, we have the
approximation
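$$h(x_\xi; w) \approx h(x_\xi; w_k) + J_h(w_k; \xi)(w - w_k),$$

which leads to

$$\begin{aligned}
f(w; \xi) &\approx \tfrac{1}{2} \|h(x_\xi; w_k) + J_h(w_k; \xi)(w - w_k) - y_\xi\|_2^2 \\
&= \tfrac{1}{2} \|h(x_\xi; w_k) - y_\xi\|_2^2 + (h(x_\xi; w_k) - y_\xi)^T J_h(w_k; \xi)(w - w_k) \\
&\quad + \tfrac{1}{2} (w - w_k)^T J_h(w_k; \xi)^T J_h(w_k; \xi)(w - w_k).
\end{aligned}$$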

In fact, this approximation is similar to a second-order Taylor series model, except that the terms
involving the second derivatives of the prediction function h with respect to the parameter vector
have been dropped. The remaining second-order terms are those resulting from the positive
curvature of the quadratic loss $\ell$. This leads to replacing the subsampled Hessian matrix (6.5) by the
Gauss-Newton matrix

$$G_{S_k^H}(w_k; \xi_k^H) = \frac{1}{|S_k^H|} \sum_{i \in S_k^H} J_h(w_k; \xi_{k,i})^T J_h(w_k; \xi_{k,i}). \tag{6.15}$$

Since the Gauss-Newton matrix only differs from the true Hessian by terms that involve the factors
h(xξ ; wk ) − yξ , these two matrices are the same when the loss is equal to zero, i.e., when h(xξ ; wk ) =
yξ .
A challenge in the application of a Gauss-Newton scheme is that the Gauss-Newton matrix is often
singular or nearly singular. In practice, this is typically handled by adding to it
a positive multiple of the identity matrix. For least-squares loss functions, the inexact Hessian-free
Newton methods of §6.1 and the stochastic quasi-Newton methods of §6.2 with gradient displace-
ment vectors defined as in (6.14) can be applied with (regularized) Gauss-Newton approximations.
This has the benefit that the scaling matrices are guaranteed to be positive definite.
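A regularized Gauss-Newton iteration of this kind can be sketched as follows. The scalar model h(x; w) = w[0] * exp(w[1] * x), the synthetic data, the regularization value `mu`, and the helper names are assumptions made purely for illustration, and the full data set is used in place of a subsample.

```python
import numpy as np

def gauss_newton_step(J, residual, mu=1e-3):
    """Solve (J^T J + mu I) s = -J^T residual for the step s."""
    d = J.shape[1]
    G = J.T @ J + mu * np.eye(d)       # regularized Gauss-Newton matrix
    return np.linalg.solve(G, -J.T @ residual)

# Hypothetical scalar prediction function h(x; w) = w[0] * exp(w[1] * x).
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.0 * x)             # zero-residual synthetic targets

def loss(w):
    return 0.5 * np.mean((w[0] * np.exp(w[1] * x) - y) ** 2)

w = np.array([1.0, 0.0])
loss_start = loss(w)
for _ in range(20):
    e = np.exp(w[1] * x)
    scale = np.sqrt(len(x))            # so that J^T J is an average over samples
    residual = (w[0] * e - y) / scale
    J = np.column_stack([e, w[0] * x * e]) / scale
    w = w + gauss_newton_step(J, residual)
loss_end = loss(w)
```

Because `mu > 0`, the matrix passed to the linear solver is positive definite even if `J.T @ J` is singular, which is the benefit referred to above.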
The computational cost of the Gauss-Newton method depends on the dimensionality of the
prediction function. When the prediction function is scalar-valued, the Jacobian matrix Jh is a
single row whose elements are already being computed as an intermediate step in the computation
of the stochastic gradient ∇f (w; ξ). However, this is no longer true when the dimensionality is
larger than one since then computing the stochastic gradient vector ∇f (w; ξ) does not usually
require the explicit computation of all rows of the Jacobian matrix. This happens, for instance, in
deep neural networks when one uses back propagation [134, 135].

Generalized Gauss-Newton Gauss-Newton ideas can also be generalized for other standard
loss functions [141]. To illustrate, let us consider a slightly more general situation in which loss is
measured by a composition of an arbitrary convex loss function $\ell(h, y)$ and a prediction function
h(x; w). Combining the affine approximation of the prediction function h(xξ ; w) with a second
order Taylor expansion of the loss function $\ell$ leads to the generalized Gauss-Newton matrix

$$G_{S_k^H}(w_k; \xi_k^H) = \frac{1}{|S_k^H|} \sum_{i \in S_k^H} J_h(w_k; \xi_{k,i})^T H_\ell(w_k; \xi_{k,i}) J_h(w_k; \xi_{k,i}) \tag{6.16}$$

(recall (6.9)), where $H_\ell(w_k; \xi) = \frac{\partial^2 \ell}{\partial h^2}(h(x_\xi; w_k), y_\xi)$ captures the curvature of the loss function $\ell$.
This generalizes (6.15), which corresponds to the case $H_\ell = I$.
When training a deep neural network, one may exploit this generalized strategy by redefining $\ell$
and h so that as much as possible of the network’s computation is formally performed by $\ell$ rather
than by h. If this can be done in such a way that convexity of $\ell$ is maintained, then one can faithfully
capture second-order terms for $\ell$ using the generalized Gauss-Newton scheme. Interestingly, in many
useful situations, this strategy gives simpler and more elegant expressions for $H_\ell$. For instance,
probability estimation problems often reduce to using logarithmic losses of the form f (w; ξ) =
− log(h(xξ ; w)). The generalized Gauss-Newton matrix then reduces to

$$\begin{aligned}
G_{S_k^H}(w_k; \xi_k^H) &= \frac{1}{|S_k^H|} \sum_{i \in S_k^H} J_h(w_k; \xi_{k,i})^T \frac{1}{h(w; \xi_{k,i})^2} J_h(w_k; \xi_{k,i}) \\
&= \frac{1}{|S_k^H|} \sum_{i \in S_k^H} \nabla f(w; \xi_{k,i}) \nabla f(w; \xi_{k,i})^T,
\end{aligned} \tag{6.17}$$

which does not require explicit computation of the Jacobian Jh .
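The equality of the two forms in (6.17) follows from ∇f = −(1/h) Jh^T for the logarithmic loss, and can be checked numerically. In the sketch below, the logistic prediction function and the random inputs are assumptions chosen only so that h takes values in (0, 1).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))          # hypothetical inputs x_i
w = rng.standard_normal(3)

h = sigmoid(X @ w)                       # h(x_i; w), one value per sample
J = (h * (1.0 - h))[:, None] * X         # row i is the Jacobian of h(x_i; .) at w
grad_f = -J / h[:, None]                 # row i is the gradient of -log h(x_i; w)

# First form of (6.17): J^T (1/h^2) J averaged over the sample.
G_jacobian = (J / h[:, None] ** 2).T @ J / len(X)
# Second form of (6.17): averaged outer products of per-sample gradients.
G_outer = grad_f.T @ grad_f / len(X)
```

The arrays `G_jacobian` and `G_outer` agree up to rounding error, and only the second requires nothing beyond the per-sample stochastic gradients.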

6.4 Natural Gradient Method


We have seen that Newton’s method is invariant to linear transformations of the parameter vec-
tor w. By contrast, the natural gradient method [5, 6] aims to be invariant with respect to all
differentiable and invertible transformations. The essential idea consists of formulating the gra-
dient descent algorithm in the space of prediction functions rather than specific parameters. Of
course, the actual computation takes place with respect to the parameters, but accounts for the
anisotropic relation between the parameters and the decision function. That is, in parameter space,
the natural gradient algorithm will move the parameters more quickly along directions that have
a small impact on the decision function, and more cautiously along directions that have a large
impact on the decision function.
We remark at the outset that many authors [119, 99] propose quasi-natural-gradient methods
that are strikingly similar to the quasi-Newton methods described in §6.2. The natural gradient
approach therefore offers a different justification for these algorithms, one that involves qualitatively
different approximations. It should also be noted that research on the design of methods inspired
by the natural gradient is ongoing and may lead to markedly different algorithms [33, 76, 101].

Information Geometry In order to directly formulate the gradient descent in the space of
prediction functions, we must elucidate the geometry of this space. Amari’s work on information
geometry [6] demonstrates this for parametric density estimation. The space H of prediction
functions for such a problem is a family of densities hw (x) parametrized by w ∈ W and satisfying
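the normalization condition

$$\int h_w(x)\,dx = 1 \quad \text{for all } w \in W.$$

Assuming sufficient regularity, the derivatives of such densities satisfy the identity

$$\forall t > 0 \quad \int \frac{\partial^t h_w(x)}{\partial w^t}\,dx = \frac{\partial^t}{\partial w^t} \int h_w(x)\,dx = \frac{\partial^t 1}{\partial w^t} = 0. \tag{6.18}$$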
To elucidate the geometry of the space H, we seek to quantify how the density hw changes when
one adds a small quantity δw to its parameter. We achieve this in a statistically meaningful way
by observing the Kullback-Leibler (KL) divergence

$$D_{KL}(h_w \| h_{w+\delta w}) = \mathbb{E}_{h_w}\left[\log\left(\frac{h_w(x)}{h_{w+\delta w}(x)}\right)\right], \tag{6.19}$$
where Ehw denotes the expectation with respect to the distribution hw . Note that (6.19) only
depends on the values of the two density functions hw and hw+δw and therefore is invariant with
respect to any invertible transformation of the parameter w. Approximating the divergence with a
second-order Taylor expansion, one obtains
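$$\begin{aligned}
D_{KL}(h_w \| h_{w+\delta w}) &= \mathbb{E}_{h_w}\left[\log(h_w(x)) - \log(h_{w+\delta w}(x))\right] \\
&\approx -\delta w^T \mathbb{E}_{h_w}\left[\frac{\partial \log(h_w(x))}{\partial w}\right] - \tfrac{1}{2} \delta w^T \mathbb{E}_{h_w}\left[\frac{\partial^2 \log(h_w(x))}{\partial w^2}\right] \delta w,
\end{aligned}$$

which, after observing that (6.18) implies that the first-order term is null, yields

$$D_{KL}(h_w \| h_{w+\delta w}) \approx \tfrac{1}{2} \delta w^T G(w) \delta w. \tag{6.20}$$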
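This is a quadratic form defined by the Fisher information matrix

$$G(w) := -\mathbb{E}_{h_w}\left[\frac{\partial^2 \log(h_w(x))}{\partial w^2}\right] = \mathbb{E}_{h_w}\left[\left(\frac{\partial \log(h_w(x))}{\partial w}\right) \left(\frac{\partial \log(h_w(x))}{\partial w}\right)^T\right], \tag{6.21}$$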

where the latter equality follows again from (6.18). The second form of G(w) is often preferred
because it makes clear that the Fisher information matrix G(w) is symmetric and always positive
semidefinite, though not necessarily positive definite.
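These properties can be illustrated on a small discrete family. The sketch below assumes a three-outcome softmax model hw(x) = exp(wx) / Σj exp(wj), which is not taken from the text; for this family the Fisher matrix is known in closed form as diag(p) − p pᵀ, and it is singular, showing that G(w) need not be positive definite.

```python
import numpy as np

def softmax(w):
    z = np.exp(w - w.max())
    return z / z.sum()

def fisher(w):
    """Fisher matrix as the expected outer product of scores, as in (6.21)."""
    p = softmax(w)
    scores = np.eye(len(w)) - p          # row x holds d log h_w(x) / dw = e_x - p
    return (scores * p[:, None]).T @ scores

def kl(w_a, w_b):
    """KL divergence between the softmax distributions at w_a and w_b."""
    p, q = softmax(w_a), softmax(w_b)
    return np.sum(p * (np.log(p) - np.log(q)))

w = np.array([0.5, -1.0, 0.2])
dw = 1e-2 * np.array([1.0, -2.0, 0.5])   # small parameter perturbation

p = softmax(w)
G = fisher(w)
exact = kl(w, w + dw)
quadratic = 0.5 * dw @ G @ dw            # right-hand side of (6.20)
```

For a perturbation of this size, `quadratic` matches `exact` to within a few percent, and `G` annihilates the all-ones vector because adding a constant to every component of w leaves the distribution unchanged.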
The relation (6.20) means that the KL divergence behaves locally like a norm associated with
G(w). Therefore, every small region of H looks like a small region of a Euclidean space. However, as
we traverse larger regions of H, we cannot ignore that the matrix G(w) changes. Such a construction
defines a Riemannian geometry.⁷
Suppose, for instance, that we move along a smooth path connecting two densities, call them hw0
and hw1 . A parametric representation of the path can be given by a differentiable function, for
which we define
$$\phi : t \in [0, 1] \mapsto \phi(t) \in W \quad \text{with } \phi(0) = w_0 \text{ and } \phi(1) = w_1.$$

We can compute the length of the path by viewing it as a sequence of infinitesimal segments
[φ(t), φ(t + dt)] whose length is given by (6.20), i.e., the total length is

$$D_\phi = \int_0^1 \sqrt{\left(\frac{d\phi}{dt}(t)\right)^T G(\phi(t)) \left(\frac{d\phi}{dt}(t)\right)}\,dt.$$
An important tool for the study of Riemannian geometries is the characterization of its geodesics,
i.e., the shortest paths connecting two points. In a Euclidean space, the shortest path between
two points is always the straight line segment connecting them. In a Riemannian space, on the
other hand, the shortest path between two points can be curved and does not need to be unique.
Such considerations are relevant to optimization since every iterative optimization algorithm can be
viewed as attempting to follow a particular path connecting the initial point w0 to the optimum w∗ .
In particular, following the shortest path is attractive because it means that the algorithm reaches
the optimum after making the fewest number of changes to the optimization variables, hopefully
requiring the least amount of computation.
⁷ The objective of information geometry [6] is to exploit the Riemannian structure of parametric families of density
functions to gain geometrical insights on the fundamental statistical phenomena. The natural gradient algorithm is
only a particular aspect of this broader goal [5].

Natural Gradient Let us now assume that the space H of prediction functions {hw : w ∈ W} has
a Riemannian geometry locally described by an identity of the form (6.20). We seek an algorithm
that minimizes a functional $F : h_w \in H \mapsto F(h_w) = F(w) \in \mathbb{R}$ and is invariant with respect to
differentiable invertible transformations of the parameters represented by the vector w.
Each iteration of a typical iterative optimization algorithm computes a new iterate hwk+1 on the
basis of information pertaining to the current iterate hwk . Since we can only expect this information
to be valid in a small region surrounding hwk , we restrict our attention to algorithms that make
a step from hwk to hwk+1 of some small length ηk > 0. The number of iterations needed to reach
the optimum then depends directly on the length of the path followed by the algorithm, which is
desired to be as short as possible. Unfortunately, it is rarely possible to exactly follow a geodesic
using only local information. We can, however, formulate the greedy strategy that

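$$h_{w_{k+1}} = \arg\min_{h \in H} F(h) \quad \text{s.t.} \quad D(h_{w_k} \| h) \le \eta_k^2, \tag{6.22}$$

and use (6.20) to reformulate this problem in terms of the parameters:

$$w_{k+1} = \arg\min_{w \in W} F(w) \quad \text{s.t.} \quad \tfrac{1}{2} (w - w_k)^T G(w_k) (w - w_k) \le \eta_k^2. \tag{6.23}$$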

The customary derivation of the natural gradient algorithm handles the constraint in (6.23) using
a Lagrangian formulation with Lagrange multiplier 1/αk . In addition, since ηk is assumed small,
it replaces F (w) in (6.23) by the first-order approximation F (wk ) + ∇F (wk )T (w − wk ). These two
choices lead to the expression
$$w_{k+1} = \arg\min_{w \in W} \nabla F(w_k)^T (w - w_k) + \frac{1}{2\alpha_k} (w - w_k)^T G(w_k) (w - w_k),$$

the optimization of the right-hand side of which leads to the natural gradient iteration

$$w_{k+1} = w_k - \alpha_k G^{-1}(w_k) \nabla F(w_k). \tag{6.24}$$
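For completeness, the step from the regularized problem to the iteration (6.24) can be written out as follows. This short derivation assumes that G(wk) is invertible (i.e., positive definite rather than merely positive semidefinite) and that the minimizer lies in the interior of W.

```latex
% Stationarity of the quadratic objective with respect to w:
\nabla F(w_k) + \frac{1}{\alpha_k} G(w_k) (w - w_k) = 0
\quad \Longrightarrow \quad
G(w_k) (w - w_k) = -\alpha_k \nabla F(w_k)
\quad \Longrightarrow \quad
w = w_k - \alpha_k G(w_k)^{-1} \nabla F(w_k).
```

Since the objective is a strictly convex quadratic under this assumption, the stationary point is its unique minimizer.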

We can also replace F (w) in (6.23) by a noisy first-order approximation, leading to a stochastic
natural gradient iteration where ∇F (wk ) in (6.24) is replaced by a stochastic gradient estimate.
Both batch and stochastic versions of (6.24) resemble the quasi-Newton update rules discussed
in §6.2. Instead of multiplying the gradient by the inverse of an approximation of the Hessian
(which is not necessarily positive definite), it employs the positive semidefinite matrix G(wk ) that
expresses the local geometry of the space of prediction functions. In principle, this matrix does not
even take into account the objective function F . However, as we shall now describe, one finds that
these choices are all closely related in practice.

Practical Natural Gradient Because the discovery of the natural gradient algorithm is closely
associated with information geometry, nearly all its applications involve density estimation [5,
33] or conditional probability estimation [119, 76, 99] using objective functions that are closely
connected to the KL divergence. Natural gradient in this context is closely related to Fisher’s
scoring algorithm [118]. For instance, in the case of density estimation, the objective is usually the
negative log likelihood
$$F(w) = -\frac{1}{n} \sum_{i=1}^{n} \log(h_w(x_i)) \approx \text{constant} + D_{KL}(P \| h_w),$$
where {x1 , . . . , xn } represent independent training samples from an unknown distribution P . Re-
calling the expression of the Fisher information matrix (6.21) then clarifies its connection with the
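Hessian as one finds that

$$G(w) = -\mathbb{E}_{h_w}\left[\frac{\partial^2 \log(h_w(x))}{\partial w^2}\right] \quad \text{and} \quad \nabla^2 F(w) = -\mathbb{E}_{P}\left[\frac{\partial^2 \log(h_w(x))}{\partial w^2}\right].$$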

These two expressions do not coincide in general because the expectations involve different distri-
butions. However, when the natural gradient algorithm approaches the optimum, the parametric
density hwk ideally approaches the true distribution P , in which case the Fisher information matrix
G(wk ) approaches the Hessian matrix ∇2 F (wk ). This means that the natural gradient algorithm
and Newton’s method perform very similarly as optimality is approached.
Although it is occasionally possible to determine a convenient analytic expression [33, 76], the
numerical computation of the Fisher information matrix G(wk ) in large learning systems is generally
very challenging. Moreover, estimating the expectation (6.21) with, say, a Monte-Carlo approach
is usually prohibitive due to the cost of sampling the current density estimate hwk .
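Several authors [119, 99] suggest instead using a subset of training examples and computing a
quantity of the form

$$\tilde{G}(w_k) = \frac{1}{|S_k|} \sum_{i \in S_k} \left(\left.\frac{\partial \log(h_w(x_i))}{\partial w}\right|_{w_k}\right) \left(\left.\frac{\partial \log(h_w(x_i))}{\partial w}\right|_{w_k}\right)^T.$$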

Although such algorithms are essentially equivalent to the generalized Gauss-Newton schemes de-
scribed in §6.3, the natural gradient perspective comes with an interesting insight into the relation
between the generalized Gauss-Newton matrix (6.17) and the Hessian matrix (6.5). Similar to the
equality (6.21), these two matrices would be equal if the expectation was taken with respect to the
model distribution hw instead of the empirical sample distribution.

6.5 Methods that Employ Diagonal Scalings


The methods that we have discussed so far in this section are forced to overcome the fact that when
employing an iteration involving an Rd × Rd scaling matrix, one needs to ensure that the improved
per-iteration progress outweighs the added per-iteration cost. We have seen that these added costs
can be as little as 4md operations and therefore amount to a moderate multiplicative factor on the
cost of each iteration.
A strategy to further reduce this multiplicative factor, while still incorporating second-order-
type information, is to restrict attention to diagonal or block-diagonal scaling matrices. Rather
than perform a more general linear transformation through a symmetric positive definite matrix
(i.e., corresponding to a scaling and rotation of the direction), the incorporation of a diagonal
scaling matrix only has the effect of scaling the individual search direction components. This can
be efficiently achieved by multiplying each coefficient of the gradient vector by the corresponding
diagonal term of the scaling matrix, or, when the prediction function is linear, by adaptively
renormalizing the input pattern coefficients [133].

Computing Diagonal Curvature A first family of algorithms that we consider directly com-
putes the diagonal terms of the Hessian or Gauss-Newton matrix, then divides each coefficient of
the stochastic gradient vector g(wk , ξk ) by the corresponding diagonal term. Since the computation
overhead of this operation is very small, it becomes important to make sure that the estimation of
the diagonal terms of the curvature matrix is very efficient.
For instance, in the context of deep neural networks, [12] describes a back-propagation algorithm
to efficiently compute the diagonal terms of the squared Jacobian matrix Jh (wk ; ξk )T Jh (wk ; ξk )
that appears in the expression of the Gauss-Newton matrix (6.15). Each iteration of the proposed
algorithm picks a training example, computes the stochastic gradient g(wk , ξk ), updates a running
estimate of the diagonal coefficients of the Gauss-Newton matrix by

$$[G_k]_i = (1 - \lambda) [G_{k-1}]_i + \lambda \left[J_h(w_k; \xi_k)^T J_h(w_k; \xi_k)\right]_{ii} \quad \text{for some } 0 < \lambda < 1,$$

then performs the scaled stochastic weight update

$$[w_{k+1}]_i = [w_k]_i - \left(\frac{\alpha}{[G_k]_i + \mu}\right) [g(w_k, \xi_k)]_i.$$

The small regularization constant µ > 0 is introduced to handle situations where the Gauss-Newton
matrix is singular or nearly singular. Since the computation of the diagonal of the squared Jacobian
has a cost that is comparable to the cost of the computation of the stochastic gradient, each iteration
of this algorithm is roughly twice as expensive as a first-order stochastic gradient iteration. The
experience described in [12] shows that improvement in per-iteration progress can be sufficient to
modestly outperform a well-tuned SG algorithm.
After describing this algorithm in later work [89, §9.1], the authors make two comments that
illustrate well how this algorithm was used in practice. They first observe that the curvature
only changes very slowly in the particular type of neural network that was considered. Due to
this observation, a natural idea is to further reduce the computational overhead of the method by
estimating the ratios α/([Gk+1 ]i + µ) only once every few epochs, for instance using a small subset
of examples as in (6.15). The authors also mention that, as a rule of thumb, this diagonal scheme
typically improves the convergence speed by only a factor of three relative to SG. Therefore, it might
be more enlightening to view such an algorithm as a scheme to periodically retune a first-order SG
approach rather than as a complete second-order method.
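The running-average estimate and the scaled update above can be sketched in a few lines. This is only an illustration under stated assumptions, not the algorithm of [12]: it assumes a linear least-squares model h(w; x) = xᵀw, for which the diagonal of the squared Jacobian is simply x² (elementwise). The function name `diag_gn_sgd`, the initial value of G, and all hyperparameter values are hypothetical.

```python
import numpy as np

def diag_gn_sgd(X, y, alpha=0.02, lam=0.05, mu=1e-3, epochs=30, seed=0):
    """SG with diagonal Gauss-Newton scaling for the loss 0.5 * (x^T w - y)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    G = np.ones(d)  # running estimate of diag(J^T J); initial value is an assumption
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = X[i]
            g = (x @ w - y[i]) * x           # stochastic gradient g(w_k, xi_k)
            G = (1 - lam) * G + lam * x**2   # linear model: diag(J^T J) = x^2
            w -= alpha / (G + mu) * g        # per-coordinate scaled update
    return w
```

On noise-free data whose features have very different scales, a single value of α then serves all coordinates, which is the practical point of the rescaling.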

Estimating Diagonal Curvature Instead of explicitly computing the diagonal terms of the
curvature matrix, one can follow the template of §6.2 and directly estimate the diagonal [Hk ]i
of the inverse Hessian using displacement pairs {(sk , vk )} as defined in (6.13). For instance, [18]
proposes to compute the scaling terms [Hk ]i with the running average
$$[H_{k+1}]_i = (1 - \lambda)\,[H_k]_i + \lambda\, \mathrm{Proj}\!\left( \frac{[s_k]_i}{[v_k]_i} \right),$$

where Proj(·) represents a projection onto a predefined positive interval. It was later found that
a direct application of (6.13) after a parameter update introduces a correlated noise that ruins
the curvature estimate [19, §3]. Moreover, correcting this problem made the algorithm perform
substantially worse because the chaotic behavior of the rescaling factors [Hk ]i makes the choice of
the stepsize α very difficult.
These problems can be addressed with a combination of two ideas [19, §5]. The first idea consists
of returning to estimating the diagonal of the Hessian instead of the diagonal of its inverse, which
amounts to working with the ratio [vk ]i /[sk ]i instead of [sk ]i /[vk ]i . The second idea ensures that
the effective stepsizes are monotonically decreasing by replacing the running average by the sum

$$[G_{k+1}]_i = [G_k]_i + \mathrm{Proj}\!\left( \frac{[v_k]_i}{[s_k]_i} \right).$$

This effectively constructs a separate diminishing stepsize sequence α/[Gk ]i for each coefficient of
the parameter vector. Keeping the curvature estimates in a fixed positive interval ensures that
the effective stepsizes decrease at the rate O(1/k) as prescribed by Theorem 4.7, while taking the
local curvature into account. This combination was shown to perform very well when the input
pattern coefficients have very different variances [19], something that often happens, e.g., in text
classification problems.
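To make the effect of the summed ratios concrete, the following sketch applies the update to a deterministic separable quadratic F(w) = ½ Σ h_i w_i², where each ratio [v_k]_i/[s_k]_i equals the true curvature h_i. It is not the stochastic procedure of [19]; the function name, the projection interval [lo, hi], and the initial value of G are assumptions made for illustration.

```python
import numpy as np

def secant_diag_steps(h, w0, alpha=1.0, lo=1e-2, hi=1e2, iters=50):
    """Per-coordinate stepsizes alpha/[G_k]_i built from clipped ratios v_i/s_i.

    Uses exact gradients of F(w) = 0.5 * sum_i h[i] * w[i]**2.
    """
    grad = lambda w: h * w
    w = np.asarray(w0, dtype=float).copy()
    G = np.ones_like(w)                     # initial value is an assumption
    for _ in range(iters):
        w_new = w - alpha / G * grad(w)
        s = w_new - w                       # displacement
        v = grad(w_new) - grad(w)           # gradient difference
        ratio = np.divide(v, s, out=np.ones_like(s), where=s != 0)
        G = G + np.clip(ratio, lo, hi)      # project onto [lo, hi], then sum
        w = w_new
    return w, alpha / G
```

After k iterations each effective stepsize is α/(1 + k h_i), that is, a separate O(1/k) sequence per coordinate that is shorter along directions of higher curvature.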

Diagonal Rescaling without Curvature The algorithms described above often require some
form of regularization to handle situations where the Hessian matrix is (nearly) singular. To
illustrate why this is needed, consider, e.g., optimization of the convex objective function

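$$F(w_1, w_2) = \tfrac{1}{2} w_1^2 + \log\left(e^{w_2} + e^{-w_2}\right),$$

for which one finds

$$\nabla F(w_1, w_2) = \begin{bmatrix} w_1 \\ \tanh(w_2) \end{bmatrix} \quad \text{and} \quad \nabla^2 F(w_1, w_2) = \begin{bmatrix} 1 & 0 \\ 0 & 1/\cosh^2(w_2) \end{bmatrix}.$$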

Performing a first-order gradient method update from a starting point of (3, 3) yields the negative
gradient step −∇F ≈ [−3, −1], which unfortunately does not point towards the optimum, namely
the origin. Moreover, rescaling the step with the inverse Hessian actually gives a worse update
direction −(∇2 F )−1 ∇F ≈ [−3, −101] whose large second component requires a small stepsize to
keep the step well contained. Batch second-order optimization algorithms can avoid having to guess
a good stepsize by using, e.g., line search techniques. Stochastic second-order algorithms, on the
other hand, cannot rely on such procedures as easily.
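The two directions quoted above can be checked numerically. The short script below evaluates the gradient and the (diagonal) Hessian of F at (3, 3); the variable names are illustrative.

```python
import numpy as np

w = np.array([3.0, 3.0])
grad = np.array([w[0], np.tanh(w[1])])
hess_diag = np.array([1.0, 1.0 / np.cosh(w[1]) ** 2])  # the Hessian is diagonal here

gradient_step = -grad              # approximately [-3, -1]
newton_step = -grad / hess_diag    # approximately [-3, -101]
```

The second component of the Newton step is large because 1/cosh²(3) is roughly 0.01, so dividing by it amplifies the gradient component by a factor of about 100.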
This problem is of great concern in situations where the objective function is nonconvex. For
instance, optimization algorithms for deep neural networks must navigate around saddle points
and handle near-singular curvature matrices. It is therefore tempting to consider diagonal rescaling
techniques that simply ensure equal progress along each axis, rather than attempt to approximate
curvature very accurately.
For instance, RMSprop [152] estimates the average magnitude of each element of the stochastic
gradient vector g(wk , ξk ) by maintaining the running averages
$$[R_k]_i = (1 - \lambda)\,[R_{k-1}]_i + \lambda\,[g(w_k,\xi_k)]_i^2 .$$

The rescaling operation then consists of dividing each component of g(wk , ξk ) by the square root of
the corresponding running average, ensuring that the expected second moment of each coefficient
of the rescaled gradient is close to unity:

$$[w_{k+1}]_i = [w_k]_i - \frac{\alpha}{\sqrt{[R_k]_i + \mu}}\, [g(w_k,\xi_k)]_i .$$

This surprisingly simple approach has been shown to be very effective for the optimization of deep neural
networks. Various improvements have been proposed [162, 84] on an empirical basis. The theoretical
explanation of this performance on nonconvex problems is still the object of active research [43].
The popular Adagrad algorithm [54] can be viewed as a member of this family that replaces
the running average by a sum:
$$[R_k]_i = [R_{k-1}]_i + [g(w_k,\xi_k)]_i^2 .$$

In this manner, the approach constructs a sequence of diminishing effective stepsizes α/√([Rk ]i + µ)
for each coefficient of the parameter vector. This algorithm was initially proposed and analyzed
for the optimization of (not necessarily strongly) convex functions for which SG theory suggests
diminishing stepsizes that scale with O(1/√k). Adagrad is also known to perform well on deep
learning networks, but one often finds that its stepsizes decrease too aggressively early in the
optimization [162].
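Both accumulation rules differ only in how [R_k]_i is updated, which the following sketch makes explicit. It is a minimal illustration rather than a reference implementation of RMSprop [152] or Adagrad [54]; the function name `rescaled_sg`, its signature, and the default values are assumptions, and the gradient oracle may be exact or stochastic.

```python
import numpy as np

def rescaled_sg(grad_fn, w0, alpha, mu=1e-8, lam=None, iters=1000):
    """Diagonal rescaling without curvature.

    lam in (0, 1): running average of squared gradients (RMSprop-style).
    lam is None:   running sum of squared gradients (Adagrad-style).
    """
    w = np.asarray(w0, dtype=float).copy()
    R = np.zeros_like(w)
    for _ in range(iters):
        g = grad_fn(w)
        if lam is None:
            R = R + g**2                       # sum: stepsizes can only shrink
        else:
            R = (1 - lam) * R + lam * g**2     # average: stepsizes can recover
        w = w - alpha / np.sqrt(R + mu) * g
    return w
```

Applied to the two-variable example F(w1, w2) above with exact gradients, the sum-based rule drives both coordinates to the origin, while the average-based rule with a constant α settles into a neighborhood of the origin whose size is on the order of α.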

Structural Methods The performance of deep neural network training can of course be im-
proved by employing better optimization algorithms. However, it can also be improved by changing
the structure of the network in a manner that facilitates the optimization [80, 74]. We now describe
one of these techniques, batch normalization [80], and discuss its relation to diagonal second-order
methods.
Consider a particular fully connected layer in a deep neural network of the form discussed in
§2.2. Using the notation of equation (2.4), the vector $x_i^{(j)}$ represents the input values of layer j
when the network is processing the i-th training example. Omitting the layer index for simplicity,
let $\hat{x}_i = (x_i^{(j)}, 1)$ denote the input vector augmented with an additional unit coefficient and let
$\hat{w}_r = (W_{r1}, \dots, W_{r d_{j-1}}, b_r)$ be the rth row of the matrix $W_j$ augmented with the rth coefficient of
the bias vector $b_j$. The layer outputs are then obtained by applying the activation function to the
quantities $s_r = \hat{w}_r^T \hat{x}_i$ for $r \in \{1, \dots, d_j\}$. Assuming for simplicity that all other parameters of the
network are kept fixed, we can write

$$F(\hat{w}_1, \dots, \hat{w}_{d_j}) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(\hat{w}_1^T \hat{x}_i, \hat{w}_2^T \hat{x}_i, \dots, \hat{w}_{d_j}^T \hat{x}_i),\, y_i\big),$$

where h(s1 , . . . , sdj ) encapsulates all subsequent layers in the network. The diagonal block of the
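Gauss-Newton matrix (6.16) corresponding to the parameters ŵr then has the form

$$G_{[r]} = \frac{1}{|S|} \sum_{i \in S} \left[ \left(\frac{dh}{ds_r}\right)^{T} \frac{\partial^2 \ell}{\partial h^2}\, \frac{dh}{ds_r} \right] \hat{x}_i \hat{x}_i^T, \tag{6.25}$$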

which can be viewed as a weighted second moment matrix of the augmented input vectors {x̂i }i∈S .
In particular, this matrix is perfectly conditioned if the weighted distribution of the layer inputs
is white, i.e., they have zero mean and a unit covariance matrix. This could be achieved by first
preprocessing the inputs by an affine transform that whitens their weighted distribution.
Two simplifications can drastically reduce the computational cost of this operation. First, we
can ignore the bracketed coefficient in (6.25) and assume that we can use the same whitening
transformation for all outputs r ∈ {1, . . . , dj }. Second, we can ignore the input cross-correlations
and simply ensure that each input variable has zero mean and unit variance by replacing the input
vector coefficients x̂i [t] for each t ∈ {1, . . . , dj−1 } by the linearly transformed values αt x̂i [t] + βt .
Despite these simplifications, this normalization operation is very likely to improve the second
order properties of the objective function. An important detail here is the computation of the
normalization constants αt and βt . Estimating the mean and the standard deviation of each input
with a simple running average works well if one expects these quantities to change very slowly.
This is unfortunately not true in recent neural networks.8
Batch normalization [80] defines a special kind of neural network layer that performs this nor-
malization using statistics collected with the current mini-batch of examples. The back-propagation
algorithm that computes the gradients must of course be adjusted to account for the on-the-fly com-
putation of the normalization coefficients. Assuming that one uses sufficiently large mini-batches,
computing the statistics in this manner ensures that the normalization constants are very fresh.
This comes at the price of making the output of the neural network on a particular training pattern
dependent on the other patterns in the mini-batch. Since these other examples are a priori random,
this amounts to generating additional noise in the stochastic gradient optimization. Although the
variance of this noise is poorly controlled, inserting batch normalization layers in various points of a
deep neural network is extremely effective and is now standard practice. Whether one can achieve
the same improvement with more controlled techniques remains to be seen.
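The normalization step itself is simple to write down. The sketch below computes the per-coordinate constants from the current mini-batch and applies the linear transformation; it covers only the forward normalization and omits the learned rescaling and the modified back-propagation of [80]. The names `batch_normalize`, `scale`, `shift`, and the constant `eps` are illustrative assumptions (`scale` and `shift` play the roles of αt and βt).

```python
import numpy as np

def batch_normalize(X_batch, eps=1e-5):
    """Normalize each input coordinate with statistics of the current mini-batch.

    Equivalent to x[t] -> scale[t] * x[t] + shift[t] with
    scale[t] = 1/std_t and shift[t] = -mean_t/std_t.
    """
    mean = X_batch.mean(axis=0)
    var = X_batch.var(axis=0)
    scale = 1.0 / np.sqrt(var + eps)   # eps guards against zero variance
    shift = -mean * scale
    return scale * X_batch + shift, scale, shift
```

Because `mean` and `var` depend on every row of `X_batch`, the normalized value of one example changes when the other examples in the mini-batch change, which is the source of the additional noise mentioned above.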

7 Other Popular Methods


Some optimization methods for machine learning are not well characterized as being within the two-
dimensional schematic introduced in §3.4 (see Figure 3.3 on page 20), yet represent fundamentally
unique approaches that offer theoretical and/or practical advantages. The purpose of this section
is to discuss a few such ideas, namely, gradient methods with momentum, accelerated gradient
methods, and coordinate descent methods. For ease of exposition, we introduce these techniques
under the assumption that one is minimizing a continuously differentiable (not necessarily convex)
function F : Rn → R and that full gradients can be computed in each iteration. Then, after each
technique is introduced, we discuss how they may be applied in stochastic settings.

7.1 Gradient Methods with Momentum


Gradient methods with momentum are procedures in which each step is chosen as a combination
of the steepest descent direction and the most recent iterate displacement. Specifically, with an
initial point w1 , scalar sequences {αk } and {βk } that are either predetermined or set dynamically,
and w0 := w1 , these methods are characterized by the iteration

wk+1 ← wk − αk ∇F (wk ) + βk (wk − wk−1 ). (7.1)

Here, the last term is referred to as the momentum term, which, recursively, maintains the algorithm’s
movement along previous search directions.
Footnote 8: This used to be true in the 1990s because neural networks were using bounded activation functions such as
the sigmoid s(x) = 1/(1 + e^(−x)). However, many recent results were achieved using the ReLU activation function
s(x) = max{0, x}, which is unbounded and homogeneous. The statistics of the intermediate variables in such networks
can change extremely quickly during the first phases of the optimization process [86].

The iteration (7.1) can be motivated in various ways; e.g., it is named after the fact that it
represents a discretization of a certain second-order ordinary differential equation with friction. Of
course, when βk = 0 for all k ∈ N, it reduces to the steepest descent method. When αk = α and
βk = β for some constants α > 0 and β > 0 for all k ∈ N, it is referred to as the heavy ball method
[123], which is known to yield a superior rate of convergence as compared to steepest descent with
a fixed stepsize for certain functions of interest. For example, when F is a strictly convex quadratic
with minimum and maximum eigenvalues given by c > 0 and L ≥ c, respectively, steepest descent
and the heavy ball method each yield a linear rate of convergence (in terms of the distance to the
solution converging to zero) with contraction constants respectively given by

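$$\frac{\kappa - 1}{\kappa + 1} \quad \text{and} \quad \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}, \quad \text{where} \quad \kappa := \frac{L}{c} \ge 1. \tag{7.2}$$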

Choosing (α, β) to achieve these rates requires knowledge of (c, L), which might be unavailable.
Still, even without this knowledge, the heavy ball method often outperforms steepest descent.
Additional connections with (7.1) can be made when F is a strictly convex quadratic. In
particular, if (αk , βk ) is chosen optimally for all k ∈ N, in the sense that the pair is chosen to solve

$$\min_{(\alpha,\beta)} \; F\big(w_k - \alpha \nabla F(w_k) + \beta (w_k - w_{k-1})\big), \tag{7.3}$$

then (7.1) is exactly the linear conjugate gradient (CG) algorithm. While the heavy ball method is
a stationary iteration (in the sense that the pair (α, β) is fixed), the CG algorithm is nonstationary
and its convergence behavior is relatively more complex; in particular, the step-by-step behavior of
CG depends on the eigenvalue distribution of the Hessian of F [69]. That said, in contrast to the
heavy ball method, CG has a finite convergence guarantee. This, along with the fact that problems
with favorable eigenvalue distributions are quite prevalent, has led to the great popularity of CG
in a variety of situations. More generally, nonlinear CG methods, which also follow the procedure
in (7.1), can be viewed as techniques that approximate the optimal values defined by (7.3) when F
is not quadratic.
An alternative view of the heavy ball method is obtained by expanding (7.1):
$$w_{k+1} \leftarrow w_k - \alpha \sum_{j=1}^{k} \beta^{k-j} \nabla F(w_j);$$

thus, each step can be viewed as an exponentially decaying average of past gradients. By writing
the iteration this way, one can see that the steps tend to accumulate contributions in directions of
persistent descent, while directions that oscillate tend to get cancelled, or at least remain small.
This latter interpretation provides some intuitive explanation as to why a stochastic heavy ball
method, and stochastic gradient methods with momentum in general, might be successful in various
settings. In particular, their practical performance has made them popular in the community
working on training deep neural networks [150]. Replacing the true gradient with a stochastic
gradient in (7.1), one obtains an iteration that, over the long run, tends to continue moving in
directions that the stochastic gradients suggest are ones of improvement, whereas movement is
limited along directions along which contributions of many stochastic gradients cancel each other
out. Theoretical guarantees about the inclusion of momentum in stochastic settings are elusive,
and although practical gains have been reported [92, 150], more experimentation is needed.
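The contraction constants in (7.2) can be observed on a small quadratic. The sketch below implements iteration (7.1) with constant parameters and compares steepest descent with the heavy ball method. The specific parameter formulas, α = 2/(L + c) for steepest descent and α = 4/(√L + √c)², β = ((√L − √c)/(√L + √c))² for heavy ball, are the classical choices for quadratics and are assumed here rather than taken from the text; the function name and test problem are illustrative.

```python
import numpy as np

def momentum_method(grad, w1, alpha, beta, iters):
    """Iteration (7.1) with constant (alpha, beta); beta = 0 gives steepest descent."""
    w_prev = w = np.asarray(w1, dtype=float)   # w_0 := w_1
    for _ in range(iters):
        w, w_prev = w - alpha * grad(w) + beta * (w - w_prev), w
    return w

c, L = 1.0, 100.0                       # extreme eigenvalues, so kappa = 100
A = np.diag([c, 10.0, L])
grad = lambda w: A @ w                  # F(w) = 0.5 * w^T A w, minimized at 0
w1 = np.ones(3)

w_sd = momentum_method(grad, w1, alpha=2.0 / (L + c), beta=0.0, iters=100)
sc, sL = np.sqrt(c), np.sqrt(L)
w_hb = momentum_method(grad, w1, alpha=4.0 / (sL + sc) ** 2,
                       beta=((sL - sc) / (sL + sc)) ** 2, iters=100)
```

With κ = 100 the two constants in (7.2) are 99/101 and 9/11, so after 100 iterations the heavy ball iterate is many orders of magnitude closer to the solution than the steepest descent iterate.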

7.2 Accelerated Gradient Methods
A method with an iteration similar to (7.1), but with its own unique properties, is the accelerated
gradient method proposed by Nesterov [107]. Written as a two-step procedure, it involves the
updates
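$$\begin{aligned} \tilde{w}_k &\leftarrow w_k + \beta_k (w_k - w_{k-1}) \\ \text{and} \quad w_{k+1} &\leftarrow \tilde{w}_k - \alpha_k \nabla F(\tilde{w}_k), \end{aligned} \tag{7.4}$$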
which leads to the condensed form

wk+1 ← wk − αk ∇F (wk + βk (wk − wk−1 )) + βk (wk − wk−1 ). (7.5)

In this manner, it is easy to compare the approach with (7.1). In particular, one can describe
their difference as being a reversal in the order of computation. In (7.1), one can imagine taking
the steepest descent step and then applying the momentum term, whereas (7.5) results when one
follows the momentum term first, then applies a steepest descent step (with the gradient evaluated
at w̃k , not at wk ).
While this difference may appear to be minor, it is well known that (7.5) with appropriately
chosen αk = α > 0 for all k ∈ N and {βk } ↗ 1 leads to an optimal iteration complexity when F
is convex and continuously differentiable with a Lipschitz continuous gradient. Specifically, while
in such cases a steepest descent method converges with a distance to the optimal value decaying
with a rate O(1/k), the iteration (7.5) converges with a rate O(1/k²), which is provably the best rate
that can be achieved by a gradient method. Unfortunately, no intuitive explanation as to how
Nesterov’s method achieves this optimal rate has been widely accepted. Still, one cannot deny the
analysis and the practical gains that the technique has offered.
Acceleration ideas have been applied in a variety of other contexts as well, including for the
minimization of nonsmooth convex functions; see [96]. For now, we merely mention that when
applied in stochastic settings—with stochastic gradients employed in place of full gradients—one
can only hope that acceleration might improve the constants in the convergence rate offered in
Theorem 4.7; i.e., the rate itself cannot be improved [1].
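The difference between the O(1/k) and O(1/k²) regimes is visible on an ill-conditioned convex quadratic with full gradients. The sketch below implements the two-step form (7.4); the momentum schedule βk = (k − 1)/(k + 2) is one common choice that increases to 1 and is an assumption here, as are the function names and the test problem.

```python
import numpy as np

def accelerated_gradient(grad, w1, alpha, iters):
    """Two-step form (7.4) with beta_k = (k - 1)/(k + 2)."""
    w_prev = w = np.asarray(w1, dtype=float)
    for k in range(1, iters + 1):
        beta = (k - 1) / (k + 2)
        w_tilde = w + beta * (w - w_prev)                 # momentum step first
        w, w_prev = w_tilde - alpha * grad(w_tilde), w    # gradient step at w_tilde
    return w

def steepest_descent(grad, w1, alpha, iters):
    w = np.asarray(w1, dtype=float)
    for _ in range(iters):
        w = w - alpha * grad(w)
    return w

eigs = np.logspace(-4, 0, 50)               # Hessian eigenvalues; L = 1
F = lambda w: 0.5 * np.sum(eigs * w**2)     # convex quadratic with F_* = 0
grad = lambda w: eigs * w
w1 = np.ones(50)

F_sd = F(steepest_descent(grad, w1, alpha=1.0, iters=200))
F_ag = F(accelerated_gradient(grad, w1, alpha=1.0, iters=200))
```

Both runs use the same stepsize α = 1/L and the same number of gradient evaluations; only the point at which the gradient is evaluated differs.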

7.3 Coordinate Descent Methods


Coordinate descent (CD) methods are among the oldest in the optimization literature. As their
name suggests, they operate by taking steps along coordinate directions: one attempts to minimize
the objective with respect to a single variable while all others are kept fixed, then other variables
are updated similarly in an iterative manner. Such a simple idea is easy to implement, so it is
not surprising that CD methods have a long history in many settings. Their limitations have been
documented and well understood for many years (more on these below), but one can argue that
their advantages were not fully recognized until recent work in machine learning and statistics
demonstrated their ability to take advantage of certain structures commonly arising in practice.
The CD method for minimizing F : Rd → R is given by the iteration
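
$$w_{k+1} \leftarrow w_k - \alpha_k \nabla_{i_k} F(w_k)\, e_{i_k}, \quad \text{where} \quad \nabla_{i_k} F(w_k) := \frac{\partial F}{\partial w^{i_k}}(w_k), \tag{7.6}$$

$w^{i_k}$ represents the ik -th element of the parameter vector, and eik represents the ik -th coordinate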
vector for some ik ∈ {1, . . . , d}. In other words, the solution estimates wk+1 and wk differ only in
their ik -th element as a result of a move in the ik -th coordinate from wk .

Specific versions of the CD method are defined by the manner in which the sequences {αk }
and {ik } are chosen. In some applications, it is possible to choose αk as the global minimizer
of F from wk along the ik -th coordinate direction. An important example of this, which has
contributed to the recent revival of CD methods, occurs when the objective function has the form
F (w) = q(w) + ‖w‖1 , where q is a convex quadratic. Here, the exact minimization along each
coordinate is not only possible, but desirable as it promotes the generation of sparse iterates; see
also §8. More often, an exact one-dimensional minimization of F is not practical, in which case one
is typically satisfied with αk yielding a sufficient reduction in F from wk . For example, so-called
second-order CD methods compute αk as the minimizer of a quadratic model of F along the ik -th
coordinate direction.
Concerning the choice of ik , one could select it in each iteration in at least three different
ways: by cycling through {1, . . . , d}; by cycling through a random reordering of these indices
(with the indices reordered after each set of d steps); or simply by choosing an index randomly
with replacement in each iteration. Randomized CD algorithms (represented by the latter two
strategies) have theoretical properties superior to those of the cyclic method (represented by the first
strategy) as they are less likely to choose an unfortunate series of coordinates; more on this below.
However, it remains an open question whether such randomized algorithms are more effective in
typical applications.
We mention in passing that it is also natural to consider a block-coordinate descent method in
which a handful of elements are chosen in each iteration. This is particularly effective when the
objective function is (partially) block separable, which occurs in matrix factorization problems and
least squares and logistic regression when each sample only depends on a few features. Clearly, in
such settings, there are great advantages of a block-coordinate descent approach. However, since
their basic properties are similar to the case of using a single index in each iteration, we focus on
the iteration (7.6).

Convergence Properties Contrary to what intuition might suggest, a CD method is not guar-
anteed to converge when applied to minimize any given continuously differentiable function. Powell
[127] gives an example of a nonconvex continuously differentiable function of three variables for
which a cyclic CD method with αk chosen by exact one-dimensional minimization cycles without
converging to a solution, i.e., at any limit point the gradient of F is nonzero. Although one can
argue that failures of this type are unlikely to occur in practice, particularly for a randomized CD
method, they show the weakness of the myopic strategy in a CD method that considers only one
variable at a time. This is in contrast with the full gradient method, which guarantees convergence
to stationarity even when the objective is nonconvex.
On the other hand, if the objective F is strongly convex, the CD method will not fail and
one can establish a linear rate of convergence. The analysis is very simple when using a constant
stepsize and we present one such result to provide some insights into the tradeoffs that arise with
a CD approach. Let us assume that ∇F is coordinate-wise Lipschitz continuous in the sense that,
for all w ∈ Rd , i ∈ {1, . . . , d}, and ∆wi ∈ R, there exists a constant Li > 0 such that

|∇i F (w + ∆wi ei ) − ∇i F (w)| ≤ Li |∆wi |. (7.7)

We then define the maximum coordinate-wise Lipschitz constant as

$$\hat{L} := \max_{i \in \{1,\dots,d\}} L_i .$$

Loosely speaking, L̂ is a bound on the curvature of the function along all coordinates.

Theorem 7.1. Suppose that the objective function F : Rd → R is continuously differentiable,
strongly convex with constant c > 0, and has a gradient that is coordinate-wise Lipschitz continuous
with constants {L1 , . . . , Ld }. In addition, suppose that αk = 1/L̂ and ik is chosen independently
and uniformly from {1, . . . , d} for all k ∈ N. Then, for all k ∈ N, the iteration (7.6) yields

$$\mathbb{E}[F(w_{k+1})] - F_* \le \left(1 - \frac{c}{d\hat{L}}\right)^{k} (F(w_1) - F_*). \tag{7.8}$$

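Proof. As Assumption 4.1 leads to (4.3), coordinate-wise Lipschitz continuity of ∇F yields

$$F(w_{k+1}) \le F(w_k) + \nabla_{i_k} F(w_k)\,(w_{k+1}^{i_k} - w_k^{i_k}) + \tfrac{1}{2}\hat{L}\,(w_{k+1}^{i_k} - w_k^{i_k})^2 .$$

Thus, with the stepsize chosen as αk = 1/L̂, it follows that

$$F(w_{k+1}) - F(w_k) \le -\frac{1}{\hat{L}} \nabla_{i_k} F(w_k)^2 + \frac{1}{2\hat{L}} \nabla_{i_k} F(w_k)^2 = -\frac{1}{2\hat{L}} \nabla_{i_k} F(w_k)^2 .$$

Taking expectations with respect to the distribution of ik , one obtains

$$\begin{aligned} \mathbb{E}_{i_k}[F(w_{k+1})] - F(w_k) &\le -\frac{1}{2\hat{L}}\, \mathbb{E}_{i_k}\big[\nabla_{i_k} F(w_k)^2\big] \\ &= -\frac{1}{2\hat{L}} \left( \frac{1}{d} \sum_{i=1}^{d} \nabla_i F(w_k)^2 \right) \\ &= -\frac{1}{2\hat{L}d}\, \|\nabla F(w_k)\|_2^2 . \end{aligned}$$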
Subtracting F∗ , taking total expectations, recalling (4.12), and applying the above inequality re-
peatedly over the first k ∈ N iterations yields (7.8), as desired.
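The setting of Theorem 7.1 is easy to reproduce for a strongly convex quadratic F(w) = ½ wᵀAw − bᵀw, for which the coordinate-wise Lipschitz constants are the diagonal entries of A. The sketch below is an illustration under that assumption; the function name and the recorded `values` array are hypothetical conveniences.

```python
import numpy as np

def randomized_cd(A, b, w1, iters, seed=0):
    """Iteration (7.6) for F(w) = 0.5 w^T A w - b^T w with alpha_k = 1/L_hat."""
    rng = np.random.default_rng(seed)
    L_hat = np.max(np.diag(A))               # L_i = A_ii for a quadratic
    F = lambda w: 0.5 * w @ A @ w - b @ w
    w = np.asarray(w1, dtype=float).copy()
    values = [F(w)]
    for _ in range(iters):
        i = rng.integers(len(w))             # uniform index in {0, ..., d-1}
        w[i] -= (A[i] @ w - b[i]) / L_hat    # step along coordinate i only
        values.append(F(w))
    return w, np.array(values)
```

The second inequality in the proof holds for every realization of ik , so the recorded objective values are nonincreasing along any single run; the bound (7.8) itself concerns the expectation over the random indices.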

A brief overview of the convergence properties of other CD methods under the assumptions of
Theorem 7.1 and in other settings is also worthwhile. First, it is interesting to compare the result
of Theorem 7.1 with a result obtained using the deterministic Gauss-Southwell rule, in which ik
is chosen in each iteration according to the largest (in absolute value) component of the gradient.
Using this approach, one obtains a similar result in which c is replaced by ĉ, the strong convexity
parameter as measured by the ℓ1 -norm [116]. Since c/d ≤ ĉ ≤ c, the Gauss-Southwell rule can be
up to d times faster than a randomized strategy, but in the worst case it is no better (and yet
more expensive due to the need to compute the full gradient vector in each iteration). Alternative
methods have also been proposed in which, in each iteration, the index ik is chosen randomly with
probabilities proportional to Li or according to the largest ratio |∇i F (wk )|/√Li [93, 110]. These
strategies also lead to linear convergence rates with constants that are better in some cases.

Favorable Problem Structures Theorem 7.1 shows that a simple randomized CD method is
linearly convergent with constant dependent on the parameter dimension d. At first glance, this
appears to imply that such a method is less efficient than a standard full gradient method. However,
in situations in which d coordinate updates can be performed at cost similar to the evaluation of
one full gradient, the method is competitive with a full gradient method both theoretically and in
practice. Classes of problems in this category include those in which the objective function is

$$F(w) = \frac{1}{n} \sum_{j=1}^{n} \tilde{F}_j(x_j^T w) + \sum_{i=1}^{d} \hat{F}_i(w_i), \tag{7.9}$$

where, for all j ∈ {1, . . . , n}, the function F̃j is continuously differentiable and dependent on the
sparse data vector xj , and, for all i ∈ {1, . . . , d}, the function F̂i is a (nonsmooth) regularization
function. Such a form arises in least squares and logistic regression; see also §8.
For example, consider an objective function of the form

$$f(w) = \frac{1}{2} \|Xw - y\|_2^2 + \sum_{i=1}^{d} \hat{F}_i(w_i) \quad \text{with} \quad X = \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix},$$

which might be the original function of interest or might represent a model of (7.9) in which the
first term is approximated by a convex quadratic model. In this setting,

$$\nabla_{i_k} f(w_{k+1}) = x_{i_k}^T r_{k+1} + \hat{F}'_{i_k}(w_{k+1}^{i_k}) \quad \text{with} \quad r_{k+1} := X w_{k+1} - y,$$

where, with wk+1 = wk + βk eik , one may observe that rk+1 = rk + βk xik . That is, since the residuals
{rk } can be updated with cost proportional to the number of nonzeros in xik , call it nnz(xik ), the
overall cost of computing the search direction in iteration k + 1 is also O(nnz(xik )). On the other
hand, an evaluation of the entire gradient requires a cost of $O\big(\sum_{i=1}^{n} \mathrm{nnz}(x_i)\big)$.
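The residual bookkeeping described above can be sketched as follows for the pure least-squares case, i.e., assuming F̂i = 0 for all i. Column i of X plays the role of xik . The function name is illustrative, and dense NumPy arrays are used for brevity, so each update here costs O(n); a column-compressed sparse format would be needed to realize the O(nnz) cost.

```python
import numpy as np

def cd_least_squares(X, y, iters, seed=0):
    """Coordinate descent on 0.5 * ||X w - y||^2 that maintains r = X w - y."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    r = -np.asarray(y, dtype=float)       # residual at w = 0
    col_sq = (X**2).sum(axis=0)           # curvature along each coordinate
    for _ in range(iters):
        i = rng.integers(d)
        if col_sq[i] == 0.0:
            continue                      # objective does not depend on w[i]
        g_i = X[:, i] @ r                 # i-th gradient component from the residual
        step = -g_i / col_sq[i]           # exact minimization along coordinate i
        w[i] += step
        r += step * X[:, i]               # residual update touches only column i
    return w, r
```

At no point is the product Xw recomputed from scratch: the gradient component and the residual update each involve a single column.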
Overall, there exist various types of objective functions for which minimization by a CD method
(with exact gradient computations) can be effective. These include objectives that are (partially)
block separable (which arise in dictionary learning and non-negative matrix factorization problems),
have structures that allow for the efficient computation of individual gradient components, or are
diagonally dominant in the sense that each step along a coordinate direction yields a reduction in
the objective proportional to that which would have been obtained by a step along the steepest
descent direction. Additional settings in which CD methods are effective are online problems where
gradient information with respect to a group of variables becomes available in time, in which case
it is natural to update these variables as soon as information is received.

Stochastic Dual Coordinate Ascent What about stochastic CD methods? As an initial
thought, one might consider the replacement of ∇ik F (wk ) in (7.6) with a stochastic approxima-
tion, but this is not typical since one can usually as easily compute a d-dimensional stochastic
gradient to apply an SG method. However, an interesting setting for the application of stochas-
tic CD methods arises when one considers approaches to minimize a convex objective function
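of the form (7.9) by maximizing its dual. In particular, defining the convex conjugate of F̃j as
$\tilde{F}_j^\star(u) := \max_w \,(w^T u - \tilde{F}_j(w))$, the Fenchel-Rockafellar dual of (7.9) when $\hat{F}_i(\cdot) = \frac{\lambda}{2}(\cdot)^2$ for all
i ∈ {1, . . . , d} is given by

$$F_{\text{dual}}(v) = \frac{1}{n} \sum_{j=1}^{n} -\tilde{F}_j^\star(-v_j) - \frac{\lambda}{2} \left\| \frac{1}{\lambda n} \sum_{j=1}^{n} v_j x_j \right\|_2^2 .$$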

The stochastic dual coordinate ascent (SDCA) method [145] applied to a function of this form has
an iteration similar to (7.6), except that negative gradient steps are replaced by gradient steps due
to the fact that one aims to maximize the dual. At the conclusion of a run of the algorithm, the
corresponding primal solution can be obtained as $w \leftarrow \frac{1}{\lambda n} \sum_{j=1}^{n} v_j x_j$. The per-iteration cost of this
approach is on par with that of a stochastic gradient method.
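For a concrete instance, take the squared loss F̃j (s) = ½(s − yj )², for which the dual can be maximized exactly along one coordinate in closed form. The sketch below is an illustration under that assumption and is not presented as the exact procedure of [145]; the function name and defaults are hypothetical. It maintains the primal vector w = (1/(λn)) Σj vj xj incrementally, so each iteration touches a single example.

```python
import numpy as np

def sdca_ridge(X, y, lam, epochs=200, seed=0):
    """Dual coordinate ascent for (1/n) sum_j 0.5 (x_j^T w - y_j)^2 + (lam/2) ||w||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    v = np.zeros(n)                     # dual variables, one per example
    w = np.zeros(d)                     # kept equal to (1/(lam n)) sum_j v_j x_j
    sq_norms = (X**2).sum(axis=1)
    for _ in range(epochs * n):
        j = rng.integers(n)
        # Exact maximization of the dual along coordinate j (squared loss only).
        delta = (y[j] - X[j] @ w - v[j]) / (1.0 + sq_norms[j] / (lam * n))
        v[j] += delta
        w += delta * X[j] / (lam * n)
    return w, v
```

The closed-form increment follows from setting the derivative of the dual with respect to vj to zero; for other losses the one-dimensional maximization generally requires a different formula or an inner solver.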

Parallel CD Methods We close this section by noting that CD methods are also attractive
when one considers the exploitation of parallel computation. For example, consider a multicore
system in which the parameter vector w is stored in shared memory. Each core can then execute
a CD iteration independently and in an asynchronous manner, where if d is large compared to the
number of cores, then it is unlikely that two cores are attempting to update the same variable at
the same time. Since, during the time it takes a core to perform an update, the parameter vector w
has likely changed (due to updates produced by other cores), each update is being made based on
slightly stale information. However, convergence of the method can be proved, and improves when
one can bound the degree of staleness of each update. For further information and insight into
these ideas, we refer the reader to [16, 98].

8 Methods for Regularized Models


Our discussion of structural risk minimization (see §2.3) highlighted the key role played by regular-
ization functions in the formulation of optimization problems for machine learning. The optimiza-
tion methods that we presented and analyzed in the subsequent sections (§3–§7) are all applicable
when the objective function involves a smooth regularizer, such as the squared ℓ2 -norm. In this
section, we expand our investigation by considering optimization methods that handle the regular-
ization as a distinct entity, in particular when that function is nonsmooth. One such regularizer
that deserves special attention is the ℓ1 -norm, which induces sparsity in the optimal solution vector.
For machine learning, sparsity can be beneficial since it leads to simpler models, and hence can be
seen as a form of feature selection, i.e., for biasing the optimization toward solutions where only a
few elements of the parameter vector are nonzero.
Broadly, this section focuses on the nonsmooth optimization problem

$$\min_{w \in \mathbb{R}^d} \; \Phi(w) := F(w) + \lambda \Omega(w), \tag{8.1}$$

where F : Rd → R includes the composition of a loss and prediction function, λ > 0 is a regular-
ization parameter, and Ω : Rd → R is a convex, nonsmooth regularization function. Specifically,
we pay special attention to methods for solving the problem

$$\min_{w \in \mathbb{R}^d} \; \phi(w) := F(w) + \lambda \|w\|_1 . \tag{8.2}$$

As discussed in §2, it is often necessary to solve a series of such problems over a sequence of values
for the parameter λ. For further details in terms of problem (8.2), we refer the reader to [147, 9] for
examples in a variety of applications. However, in our presentation of optimization methods, we
assume that λ has been prescribed and is fixed. We remark in passing that (8.2) has as a special
case the well-known LASSO problem [151] when F (w) = ‖Aw − b‖₂² for some A ∈ Rn×d and b ∈ Rn .
Although nondifferentiable, the ℓ1 -regularized problem (8.2) has a structure that can be
exploited in the design of algorithms. The algorithms that have been proposed can be grouped into
classes of first- or second-order methods, and distinguished as those that either minimize the
nonsmooth objective directly, as in a proximal gradient method, or approximately minimize a
sequence of more complicated models, such as in a proximal Newton method.
There exist other sparsity-inducing regularizers besides the ℓ1 -norm, including group-sparsity-
inducing regularizers that combine ℓ1 - and ℓ2 -norms taken with respect to groups of variables
[147, 9], as well as the nuclear norm for optimization over spaces of matrices [32]. While we do
not discuss other such regularizers in detail, our presentation of methods for ℓ1 -norm regularized
problems represents how methods for alternative regularizers can be designed and characterized.
As in the previous sections, we introduce the algorithms in this section under the assumption
that F is continuously differentiable and that full (batch) gradients can be computed for it in each
iteration, commenting on stochastic method variants once motivating ideas have been described.

8.1 First-Order Methods for Generic Convex Regularizers


The fundamental algorithm in unconstrained smooth optimization is the gradient method. For
solving problem (8.1), the proximal gradient method represents a similar fundamental approach.
Given an iterate wk , a generic proximal gradient iteration, with αk > 0, is given by
 
\[ w_{k+1} \leftarrow \arg\min_{w \in \mathbb{R}^d} \left( F(w_k) + \nabla F(w_k)^T (w - w_k) + \frac{1}{2\alpha_k} \|w - w_k\|_2^2 + \lambda \Omega(w) \right). \tag{8.3} \]

The term proximal refers to the presence of the third term in the minimization problem on the
right-hand side, which encourages the new iterate to be close to wk . Notice that if the regularization
(i.e., last) term were not present, then (8.3) exactly recovers the gradient method update wk+1 ←
wk − αk ∇F (wk ); hence, as previously, we refer to αk as the stepsize parameter. On the other hand,
if the regularization term is present, then, similar to the gradient method, each new iterate is found
by minimizing a model formed by a first-order Taylor series expansion of the objective function plus
a simple scaled quadratic. Overall, the only thing that distinguishes a proximal gradient method
from the gradient method is the regularization term, which is left untouched and included explicitly
in each step computation.
To show how an analysis similar to those seen in previous sections can be used to analyze (8.3),
we prove the following theorem. In it, we show that if F is strongly convex and its gradient function
is Lipschitz continuous, then the iteration yields a global linear rate of convergence to the optimal
objective value provided that the stepsizes are sufficiently small.

Theorem 8.1. Suppose that $F : \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable, strongly convex with
constant c > 0, and has a gradient that is Lipschitz continuous with constant L > 0. In addition,
suppose that $\alpha_k = \alpha \in (0, 1/L]$ for all $k \in \mathbb{N}$. Then, for all $k \in \mathbb{N}$, the iteration (8.3) yields

\[ \Phi(w_{k+1}) - \Phi(w_*) \le (1 - \alpha c)^k (\Phi(w_1) - \Phi(w_*)), \]

where $w_* \in \mathbb{R}^d$ is the unique global minimizer of Φ in (8.1).

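Proof. Since $\alpha_k = \alpha \in (0, 1/L]$, it follows from (4.3) that

\[
\begin{aligned}
\Phi(w_{k+1}) &= F(w_{k+1}) + \lambda \Omega(w_{k+1}) \\
&\le F(w_k) + \nabla F(w_k)^T (w_{k+1} - w_k) + \tfrac{1}{2} L \|w_{k+1} - w_k\|_2^2 + \lambda \Omega(w_{k+1}) \\
&\le F(w_k) + \nabla F(w_k)^T (w_{k+1} - w_k) + \tfrac{1}{2\alpha} \|w_{k+1} - w_k\|_2^2 + \lambda \Omega(w_{k+1}) \\
&\le F(w_k) + \nabla F(w_k)^T (w - w_k) + \tfrac{1}{2\alpha} \|w - w_k\|_2^2 + \lambda \Omega(w) \quad \text{for all } w \in \mathbb{R}^d,
\end{aligned}
\]

where the last inequality follows since $w_{k+1}$ is defined by (8.3). Representing $w = w_k + d$, we obtain

\[
\begin{aligned}
\Phi(w_{k+1}) &\le F(w_k) + \nabla F(w_k)^T d + \tfrac{1}{2\alpha} \|d\|_2^2 + \lambda \Omega(w_k + d) \\
&\le F(w_k) + \nabla F(w_k)^T d + \tfrac{1}{2} c \|d\|_2^2 - \tfrac{1}{2} c \|d\|_2^2 + \tfrac{1}{2\alpha} \|d\|_2^2 + \lambda \Omega(w_k + d) \\
&\le F(w_k + d) + \lambda \Omega(w_k + d) - \tfrac{1}{2} c \|d\|_2^2 + \tfrac{1}{2\alpha} \|d\|_2^2 \\
&= \Phi(w_k + d) + \tfrac{1}{2} \left( \tfrac{1}{\alpha} - c \right) \|d\|_2^2,
\end{aligned}
\]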

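which for $d = -\alpha c (w_k - w_*)$ means that

\[
\begin{aligned}
\Phi(w_{k+1}) &\le \Phi(w_k - \alpha c (w_k - w_*)) + \tfrac{1}{2} \left( \tfrac{1}{\alpha} - c \right) \|\alpha c (w_k - w_*)\|_2^2 \\
&= \Phi(w_k - \alpha c (w_k - w_*)) + \tfrac{1}{2} \alpha c^2 (1 - \alpha c) \|w_k - w_*\|_2^2.
\end{aligned}
\tag{8.4}
\]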

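On the other hand, since the c-strongly convex function Φ satisfies (e.g., see [108, pp. 63–64])

\[
\Phi(\tau w + (1 - \tau) \bar{w}) \le \tau \Phi(w) + (1 - \tau) \Phi(\bar{w}) - \tfrac{1}{2} c \tau (1 - \tau) \|w - \bar{w}\|_2^2
\quad \text{for all } (w, \bar{w}, \tau) \in \mathbb{R}^d \times \mathbb{R}^d \times [0, 1],
\tag{8.5}
\]

we have (considering $\bar{w} = w_k$, $w = w_*$, and $\tau = \alpha c \in (0, 1]$ in (8.5)) that

\[
\begin{aligned}
\Phi(w_k - \alpha c (w_k - w_*)) &\le \alpha c \Phi(w_*) + (1 - \alpha c) \Phi(w_k) - \tfrac{1}{2} c (\alpha c)(1 - \alpha c) \|w_k - w_*\|_2^2 \\
&= \alpha c \Phi(w_*) + (1 - \alpha c) \Phi(w_k) - \tfrac{1}{2} \alpha c^2 (1 - \alpha c) \|w_k - w_*\|_2^2.
\end{aligned}
\tag{8.6}
\]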

Combining (8.6) with (8.4) and subtracting Φ(w∗ ), it follows that

Φ(wk+1 ) − Φ(w∗ ) ≤ (1 − αc)(Φ(wk ) − Φ(w∗ )).

The result follows by applying this inequality repeatedly over the first k ∈ N iterations.
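For readers who want the combination of (8.4) and (8.6) spelled out, the following worked step is a sketch that uses only quantities already defined in the proof. Substituting the bound (8.6) for the first term on the right-hand side of (8.4), the two quadratic terms cancel:

\[
\begin{aligned}
\Phi(w_{k+1}) &\le \alpha c \Phi(w_*) + (1 - \alpha c) \Phi(w_k) - \tfrac{1}{2} \alpha c^2 (1 - \alpha c) \|w_k - w_*\|_2^2 + \tfrac{1}{2} \alpha c^2 (1 - \alpha c) \|w_k - w_*\|_2^2 \\
&= \alpha c \Phi(w_*) + (1 - \alpha c) \Phi(w_k).
\end{aligned}
\]

Subtracting $\Phi(w_*)$ from both sides and writing $\alpha c \Phi(w_*) - \Phi(w_*) = -(1 - \alpha c) \Phi(w_*)$ gives the one-step contraction $\Phi(w_{k+1}) - \Phi(w_*) \le (1 - \alpha c)(\Phi(w_k) - \Phi(w_*))$.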

Theorem 8.1 gives a result identical to that for a gradient method applied to a smooth
strongly convex function. As in such methods, the choice of the stepsize α is critical in practice;
the convergence guarantees demand that it be sufficiently small, but a value that is too small might
unduly slow the optimization process.
The proximal gradient iteration (8.3) is practical only when the proximal mapping

\[ \operatorname{prox}_{\lambda\Omega,\alpha_k}(\tilde{w}) := \arg\min_{w \in \mathbb{R}^d} \left( \lambda \Omega(w) + \frac{1}{2\alpha_k} \|w - \tilde{w}\|_2^2 \right) \]

can be computed efficiently. This can be seen in the fact that the iteration (8.3) can equivalently
be written as $w_{k+1} \leftarrow \operatorname{prox}_{\lambda\Omega,\alpha_k}(w_k - \alpha_k \nabla F(w_k))$, i.e., the iteration is equivalent to applying a
proximal mapping to the result of a gradient descent step. Situations when the proximal mapping
is inexpensive to compute include when Ω is the indicator function for a simple set, such as a
polyhedral set, when it is the $\ell_1$-norm, or, more generally, when it is separable.
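The "gradient step followed by a proximal mapping" form can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the names `proximal_gradient` and `prox_box`, the fixed stepsize, the box [0, 1]^d, and the quadratic test function are all hypothetical choices made for the example.

```python
import numpy as np

def proximal_gradient(grad_F, prox, w0, alpha, num_iters=200):
    """Iterate w <- prox(w - alpha * grad_F(w), alpha), i.e., (8.3) with a fixed stepsize."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(num_iters):
        w = prox(w - alpha * grad_F(w), alpha)
    return w

def prox_box(w_tilde, alpha, lo=0.0, hi=1.0):
    # Prox of the indicator function of the box [lo, hi]^d.
    # It is the Euclidean projection onto the box and does not depend on alpha.
    return np.clip(w_tilde, lo, hi)

# Hypothetical smooth term: F(w) = 0.5 * ||w - t||^2, so grad F(w) = w - t and L = 1.
t = np.array([1.5, -0.3, 0.4])
w = proximal_gradient(lambda w: w - t, prox_box, np.zeros(3), alpha=1.0)
```

With this choice of F, the constrained minimizer is the projection of `t` onto the box, which makes the result easy to check by hand. Swapping `prox_box` for another inexpensive proximal mapping changes the regularizer without touching the loop.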
A stochastic version of the proximal gradient method can be obtained, not surprisingly, by
replacing ∇F(wk) in (8.3) by a stochastic approximation g(wk, ξk). The iteration remains cheap to
perform (since F(wk) can be ignored as it does not affect the computed step). The resulting method
behaves similarly to a stochastic gradient method; analyses can be found, e.g., in [140, 8].
We now turn our attention to the most popular nonsmooth regularizer, namely the one defined
by the $\ell_1$-norm.

8.1.1 Iterative Soft-Thresholding Algorithm (ISTA)
In the context of solving the $\ell_1$-norm regularized problem (8.2), the proximal gradient method is

\[ w_{k+1} \leftarrow \arg\min_{w \in \mathbb{R}^d} \left( F(w_k) + \nabla F(w_k)^T (w - w_k) + \frac{1}{2\alpha_k} \|w - w_k\|_2^2 + \lambda \|w\|_1 \right). \tag{8.7} \]

The optimization problem on the right-hand side of this expression is separable and can be solved
in closed form. The solution can be written component-wise as

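\[
[w_{k+1}]_i \leftarrow
\begin{cases}
[w_k - \alpha_k \nabla F(w_k)]_i + \alpha_k \lambda & \text{if } [w_k - \alpha_k \nabla F(w_k)]_i < -\alpha_k \lambda \\
0 & \text{if } [w_k - \alpha_k \nabla F(w_k)]_i \in [-\alpha_k \lambda, \alpha_k \lambda] \\
[w_k - \alpha_k \nabla F(w_k)]_i - \alpha_k \lambda & \text{if } [w_k - \alpha_k \nabla F(w_k)]_i > \alpha_k \lambda.
\end{cases}
\tag{8.8}
\]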

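One also finds that this iteration can be written, with $(\cdot)_+ := \max\{\cdot, 0\}$, as

\[ w_{k+1} \leftarrow T_{\alpha_k \lambda}(w_k - \alpha_k \nabla F(w_k)), \quad \text{where } [T_{\alpha_k \lambda}(\tilde{w})]_i = (|\tilde{w}_i| - \alpha_k \lambda)_+ \operatorname{sgn}(\tilde{w}_i). \tag{8.9} \]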

In this form, Tαk λ is referred to as the soft-thresholding operator, which leads to the name iterative
soft-thresholding algorithm (ISTA) being used for (8.7)–(8.8) [53, 42].
It is clear from (8.8) that the ISTA iteration induces sparsity in the iterates. If the steepest
descent step with respect to F yields a component with absolute value less than αk λ, then that
component is set to zero in the subsequent iterate; otherwise, the operator still has the effect
of shrinking components of the solution estimates in terms of their magnitudes. When only a
stochastic estimate g(wk , ξk ) of the gradient is available, it can be used instead of ∇F (wk ).
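The soft-thresholding update (8.9) translates almost directly into NumPy. The following is a minimal sketch, assuming a fixed stepsize; the function names `soft_threshold` and `ista` and the simple quadratic F used to exercise them are illustrative assumptions, not part of the original presentation.

```python
import numpy as np

def soft_threshold(w_tilde, tau):
    """Soft-thresholding operator T_tau of (8.9), applied component-wise."""
    return np.sign(w_tilde) * np.maximum(np.abs(w_tilde) - tau, 0.0)

def ista(grad_F, w0, alpha, lam, num_iters=500):
    """ISTA: w <- T_{alpha*lam}(w - alpha * grad_F(w)) with a fixed stepsize alpha."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(num_iters):
        w = soft_threshold(w - alpha * grad_F(w), alpha * lam)
    return w

# Hypothetical smooth term: F(w) = 0.5 * ||w - t||^2, for which the
# l1-regularized minimizer is known in closed form: T_lam(t).
t = np.array([3.0, -0.5, 0.2, -2.0])
lam = 1.0
w_ista = ista(lambda w: w - t, np.zeros_like(t), alpha=1.0, lam=lam)
```

In this example the two components of `t` with magnitude below `lam` are set exactly to zero, while the others are shrunk toward zero by `lam`, which is the sparsity-inducing behavior described above.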
A variant of ISTA with acceleration (recall §7.2), known as FISTA [10], is popular in practice.
We also mention that effective techniques have been developed for computing the stepsize αk , in
ISTA or FISTA, based on an estimate of the Lipschitz constant of ∇F or on curvature measured
in recent iterations [10, 159, 11].

8.1.2 Bound-Constrained Methods for $\ell_1$-norm Regularized Problems


By observing the structure created by the $\ell_1$-norm, one finds that an equivalent smooth
reformulation of problem (8.2) is easily derived. In particular, by writing w = u − v where u plays the
role of the positive part of w while v plays the role of the negative part, problem (8.2) can equivalently be written as

\[ \min_{(u,v) \in \mathbb{R}^d \times \mathbb{R}^d} \tilde{\phi}(u, v) \quad \text{s.t.} \ (u, v) \ge 0, \quad \text{where } \tilde{\phi}(u, v) = F(u - v) + \lambda \sum_{i=1}^{d} (u_i + v_i). \tag{8.10} \]

Now with a bound-constrained problem in hand, one has at their disposal a variety of optimization
methods that have been developed in the optimization literature.
The fundamental iteration for solving bound-constrained optimization problems is the gradient
projection method. In the context of (8.10), the iteration reduces to
       
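\[
\begin{bmatrix} u_{k+1} \\ v_{k+1} \end{bmatrix}
\leftarrow P_+ \left( \begin{bmatrix} u_k \\ v_k \end{bmatrix} - \alpha_k \begin{bmatrix} \nabla_u \tilde{\phi}(u_k, v_k) \\ \nabla_v \tilde{\phi}(u_k, v_k) \end{bmatrix} \right)
= P_+ \left( \begin{bmatrix} u_k - \alpha_k \nabla F(u_k - v_k) - \alpha_k \lambda e \\ v_k + \alpha_k \nabla F(u_k - v_k) - \alpha_k \lambda e \end{bmatrix} \right),
\tag{8.11}
\]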

where P+ projects onto the nonnegative orthant and e ∈ Rd is a vector of ones.

Interestingly, the gradient projection method can also be derived from the perspective of a
proximal gradient method where the regularization term Ω is chosen to be the indicator function
for the feasible set (a box). In this case, the mapping Tαk λ is replaced by the projection operator onto
the bound constraints, causing the corresponding proximal gradient method to coincide with the
gradient projection method. In the light of this observation, one should expect the iteration (8.11)
to inherit the property of being globally linearly convergent when F satisfies the assumptions of
Theorem 8.1. However, since the variables in (8.10) have been split into positive and negative parts,
this property is maintained only if the iteration maintains complementarity of each iterate pair,
i.e., if [uk ]i [vk ]i = 0 for all k ∈ N and i ∈ {1, . . . , d}. This behavior is also critical for the practical
performance of the method in general since, without it, the algorithm would not generate sparse
solutions. In particular, maintaining this property allows the algorithm to be implemented in such
a way that one effectively only needs d optimization variables.
A natural question that arises is whether the iteration (8.11) actually differs from an ISTA
iteration, especially given that both are built upon proximal gradient ideas. In fact, the iterations
can lead to the same update, but do not always. Consider, for example, an iterate wk = uk − vk
such that for i ∈ {1, . . . , d} one finds [wk ]i > 0 with [uk ]i = [wk ]i and [vk ]i = 0. (A similar look,
with various signs reversed, can be taken when [wk ]i < 0.) If [wk − αk ∇F (wk )]i > αk λ, then (8.8)
and (8.11) yield
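\[ [w_{k+1}]_i \leftarrow [w_k - \alpha_k \nabla F(w_k)]_i - \alpha_k \lambda > 0 \quad \text{and} \quad [u_{k+1}]_i \leftarrow [u_k - \alpha_k \nabla F(w_k)]_i - \alpha_k \lambda > 0. \]
However, it is important to note the step taken in the negative part; in particular, if $[\nabla F(w_k)]_i \le \lambda$,
then $[v_{k+1}]_i \leftarrow 0$, but, if $[\nabla F(w_k)]_i > \lambda$, then $[v_{k+1}]_i \leftarrow \alpha_k [\nabla F(w_k)]_i - \alpha_k \lambda$, in which case the lack
of complementarity between $u_{k+1}$ and $v_{k+1}$ should be rectified. A more significant difference arises
when, e.g., $[w_k - \alpha_k \nabla F(w_k)]_i < -\alpha_k \lambda$, in which case (8.8) and (8.11) yield
\[
\begin{aligned}
[w_{k+1}]_i &\leftarrow [w_k - \alpha_k \nabla F(w_k)]_i + \alpha_k \lambda < 0, \\
[u_{k+1}]_i &\leftarrow 0, \\
\text{and} \quad [v_{k+1}]_i &\leftarrow [v_k + \alpha_k \nabla F(w_k)]_i - \alpha_k \lambda > 0.
\end{aligned}
\]
The pair $([u_{k+1}]_i, [v_{k+1}]_i)$ is complementary, but $[w_{k+1}]_i$ and $[-v_{k+1}]_i$ differ by $[w_k]_i > 0$.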
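The second case above can be checked numerically on a single component. The sketch below is illustrative only: the helper names `ista_step` and `gradient_projection_step` and the scalar values chosen for the iterate, gradient, stepsize, and λ are assumptions picked so that [wk]i > 0 and [wk − αk∇F(wk)]i < −αkλ.

```python
import numpy as np

def ista_step(w, grad, alpha, lam):
    # One ISTA update (8.9), given grad = grad F(w).
    z = w - alpha * grad
    return np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)

def gradient_projection_step(u, v, grad, alpha, lam):
    # One update (8.11) on the split variables, given grad = grad F(u - v).
    u_new = np.maximum(u - alpha * grad - alpha * lam, 0.0)
    v_new = np.maximum(v + alpha * grad - alpha * lam, 0.0)
    return u_new, v_new

# Hypothetical one-dimensional data: w_k = 1 > 0 and w_k - alpha * grad = -1 < -alpha * lam.
w_k = np.array([1.0])
grad = np.array([4.0])
alpha, lam = 0.5, 1.0
u_k, v_k = np.maximum(w_k, 0.0), np.maximum(-w_k, 0.0)   # complementary split of w_k

w_ista = ista_step(w_k, grad, alpha, lam)
u_next, v_next = gradient_projection_step(u_k, v_k, grad, alpha, lam)
w_gp = u_next - v_next        # iterate implied by the gradient projection step
gap = w_ista - w_gp           # the text predicts that this equals w_k
```

For these values the two updates land on the same side of zero but at different points, and the difference between them is exactly the previous iterate component, in agreement with the discussion above.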
Several first-order [59, 58] and second-order [138] gradient projection methods have been pro-
posed to solve (8.2). Such algorithms should be preferred over similar techniques for general bound-
constrained optimization, e.g., those in [163, 95], since such general techniques may be less effective
by not exploiting the structure of the reformulation (8.10) of (8.2).
A stochastic projected gradient method, with ∇F(wk) replaced by g(wk, ξk), has convergence
properties similar to those of a standard SG method; e.g., see [105]. These properties apply in the present
context, but also apply when a proximal gradient method is used to solve (8.1) when Ω represents
the indicator function of a box constraint.

8.2 Second-Order Methods


We now turn our attention to methods that, like Newton’s method for smooth optimization, are
designed to solve regularized problems through successive minimization of second-order models
constructed along the iterate sequence {wk }. As in a proximal gradient method, the smooth
function F is approximated by a Taylor series and the regularization term is kept unchanged. We
focus on two classes of methods for solving (8.2): proximal Newton and orthant-based methods.

Both classes of methods fall under the category of active-set methods. One could also consider
the application of an interior-point method to solve the bound-constrained problem (8.10) [114].
This, by its nature, constitutes a second-order method that would employ Hessians of F or cor-
responding quasi-Newton approximations. However, a disadvantage of the interior-point approach
is that, by staying away from the boundary of the feasible region, it does not promote the fast
generation of sparse solutions, which is in stark contrast with the methods described below.

8.2.1 Proximal Newton Methods


We use the term proximal Newton to refer to techniques that directly minimize the nonsmooth
function arising as the sum of a quadratic model of F and the regularizer. In particular, for solving
problem (8.2), a proximal Newton method is one that constructs, at each k ∈ N, a model
\[ q_k(w) = F(w_k) + \nabla F(w_k)^T (w - w_k) + \frac{1}{2} (w - w_k)^T H_k (w - w_k) + \lambda \|w\|_1 \approx \phi(w), \tag{8.12} \]
where $H_k$ represents $\nabla^2 F(w_k)$ or a quasi-Newton approximation of it. This model has a similar form
as the one in (8.7), except that the simple quadratic is replaced by the quadratic form defined by Hk .
A proximal Newton method would involve (approximately) minimizing this model to compute a
trial iterate w̃k , then a stepsize αk > 0 would be taken from a predetermined sequence or chosen
by a line search to ensure that the new iterate wk+1 ← wk + αk (w̃k − wk ) yields Φ(wk+1 ) < Φ(wk ).
Proximal Newton methods are more challenging to design, analyze, and implement than prox-
imal gradient methods. That being said, they can perform better in practice once a few key
challenges are addressed. The three ingredients below have proved to be essential in ensuring
the practical success and scalability of a proximal Newton method. For simplicity, we assume
throughout that Hk has been chosen to be positive definite.

Choice of Subproblem Solver The model qk inherits the nonsmooth structure of φ, which has
the benefit of allowing a proximal Newton method to cross manifolds of nondifferentiability while
simultaneously promoting sparsity of the iterates. However, the method needs to overcome the
fact that the model qk is nonsmooth, which makes the subproblem for minimizing qk challenging.
Fortunately, with the particular structure created by a quadratic plus an $\ell_1$-norm term, various
methods are available for minimizing such a nonsmooth function. For example, coordinate descent
is particularly well-suited in this context [153, 78] since the global minimizer of qk along a coordi-
nate descent direction can be computed analytically. Such a minimizer often occurs at a point of
nondifferentiability (namely, when a component is zero), thus ensuring that the method will gener-
ate sparse iterates. Updating the gradient of the model qk after each coordinate descent step can
also be performed efficiently, even if Hk is given as a limited memory quasi-Newton approximation
[137]. Numerical experiments have shown that employing a coordinate descent iteration is more
efficient in certain applications than employing, say, an ISTA iteration to minimize qk , though the
latter is also a viable strategy in some applications.
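To make the coordinate descent idea concrete, the sketch below minimizes the model (8.12) one coordinate at a time, using the closed-form one-dimensional minimizer and an O(d) update of the model gradient after each step. It is an assumption-laden illustration: the function name `cd_prox_newton_subproblem`, the cyclic sweep order, the fixed number of sweeps, and the dense matrix `H` are choices made here, and the constant F(wk) is dropped from the model since it does not affect the minimizer.

```python
import numpy as np

def cd_prox_newton_subproblem(g, H, w_k, lam, num_sweeps=50):
    """Cyclic coordinate descent on q_k(w) from (8.12).

    g is grad F(w_k) and H is assumed symmetric positive definite,
    so every diagonal entry H[i, i] is positive.
    """
    w = np.asarray(w_k, dtype=float).copy()
    r = np.asarray(g, dtype=float).copy()   # gradient of the smooth part of q_k at w
    for _ in range(num_sweeps):
        for i in range(w.size):
            h_ii = H[i, i]
            z = w[i] - r[i] / h_ii
            # Exact minimizer of q_k along coordinate i (scalar soft-thresholding).
            new_wi = np.sign(z) * max(abs(z) - lam / h_ii, 0.0)
            delta = new_wi - w[i]
            if delta != 0.0:
                r += delta * H[:, i]        # keep the model gradient up to date
                w[i] = new_wi
    return w

# Hypothetical data with a diagonal H, for which the minimizer is available by hand.
H = np.diag([2.0, 4.0])
g = np.array([-2.0, 1.0])
w_k = np.array([1.0, -1.0])
w_trial = cd_prox_newton_subproblem(g, H, w_k, lam=1.0)
```

A practical implementation would stop based on a termination test rather than a fixed number of sweeps, and would exploit the structure of a limited memory quasi-Newton matrix instead of forming `H` explicitly.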

Inaccurate Subproblem Solves A proximal Newton method needs to overcome the fact that, in
large-scale settings, it is impractical to minimize qk accurately for all k ∈ N. Hence, it is natural to
consider the computation of an inexact minimizer of qk . The issue then becomes: what are practical,
yet theoretically sufficient termination criteria when computing an approximate minimizer of the
nonsmooth function $q_k$? A common suggestion in the literature has been to use the norm of the
minimum-norm subgradient of $q_k$ at a given approximate minimizer. However, this measure is not
continuous, making it inadequate for these purposes (see footnote 9). Interestingly, the norm of an ISTA step is
an appropriate measure. In particular, letting $\mathrm{ista}_k(w)$ represent the result of an ISTA step applied
to $q_k$ from w, the value $\|\mathrm{ista}_k(w) - w\|_2$ satisfies the following two key properties: (i) it equals zero
if and only if w is a minimizer of $q_k$, and (ii) it varies continuously over $\mathbb{R}^d$.
Complete and sufficient termination criteria are then as follows: a trial point w̃k represents a
sufficiently accurate minimizer of qk if, for some η ∈ [0, 1), one finds

\[ \|\mathrm{ista}_k(\tilde{w}_k) - \tilde{w}_k\|_2 \le \eta \|\mathrm{ista}_k(w_k) - w_k\|_2 \quad \text{and} \quad q_k(\tilde{w}_k) < q_k(w_k). \]

The latter condition, requiring a decrease in qk , must also be imposed since the ISTA criterion
alone does not exert sufficient control to ensure convergence. Employing such criteria, it has
been observed to be efficient to perform the minimization of qk inaccurately at the start, and to
increase the accuracy of the model minimizers as one approaches the solution. A superlinear rate of
convergence for the overall proximal Newton iteration can be obtained by replacing η by $\eta_k$ where
$\{\eta_k\} \searrow 0$, along with imposing a stronger descent condition on the decrease in $q_k$ [31].
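The termination test based on the ISTA step can be sketched as follows. The text does not fix the stepsize used inside the ISTA step on the model, so the parameter `alpha` here is an assumption (for instance, the reciprocal of the largest eigenvalue of Hk); the helper names `make_model`, `ista_residual`, and `is_acceptable` and the small diagonal example are likewise hypothetical.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def make_model(g, H, w_k, lam):
    """Return q_k (up to the constant F(w_k)) and the gradient of its smooth part."""
    def q(w):
        s = w - w_k
        return g @ s + 0.5 * s @ H @ s + lam * np.abs(w).sum()
    def grad_smooth(w):
        return g + H @ (w - w_k)
    return q, grad_smooth

def ista_residual(w, grad_smooth, lam, alpha):
    # ||ista_k(w) - w||_2 for one ISTA step on q_k from w with (assumed) stepsize alpha.
    step = soft_threshold(w - alpha * grad_smooth(w), alpha * lam)
    return np.linalg.norm(step - w)

def is_acceptable(w_trial, w_k, q, grad_smooth, lam, alpha, eta):
    # Both conditions of the inexactness test: small ISTA residual and model decrease.
    small = ista_residual(w_trial, grad_smooth, lam, alpha) <= eta * ista_residual(w_k, grad_smooth, lam, alpha)
    return bool(small and q(w_trial) < q(w_k))

# Hypothetical data: diagonal H, for which the exact model minimizer is [1.5, -1.0].
H = np.diag([2.0, 4.0])
g = np.array([-2.0, 1.0])
w_k = np.array([1.0, -1.0])
lam, alpha, eta = 1.0, 0.25, 0.1
q, grad_smooth = make_model(g, H, w_k, lam)
w_exact = np.array([1.5, -1.0])
accepted = is_acceptable(w_exact, w_k, q, grad_smooth, lam, alpha, eta)
```

In this example the residual vanishes at the exact minimizer, illustrating property (i), while the current iterate itself fails the test because its residual is not reduced by the factor `eta`.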

Elimination of Variables Due to the structure created by the $\ell_1$-norm regularizer, it can be
effective in some applications to first identify a set of active variables (i.e., ones that are predicted
to be equal to zero at a minimizer for $q_k$), then compute an approximate minimizer of $q_k$ over the
remaining free variables. Specifically, supposing that a set $\mathcal{A}_k \subseteq \{1, \dots, d\}$ of active variables has
been identified, one may compute an (approximate) minimizer of $q_k$ by (approximately) solving the
reduced-space problem
\[ \min_{w \in \mathbb{R}^d} \ q_k(w) \quad \text{s.t.} \ [w]_i = 0, \ i \in \mathcal{A}_k. \tag{8.13} \]

Moreover, during the minimization process for this problem, one may have reason to believe that
the process may be improved by adding or removing elements from the active set estimate Ak . In
any case, performing the elimination of variables imposed in (8.13) has the effect of reducing the
size of the subproblem, and can often lead to fewer iterations being required in the overall proximal
Newton method. How should the active set Ak be defined? A technique that has become popular
recently is to use sensitivity information as discussed in more detail in the next subsection.

8.2.2 Orthant-Based Methods


Our second class of second-order methods is based on the observation that the $\ell_1$-norm regularized
objective φ in problem (8.2) is smooth in any orthant in Rd . Based on this observation, orthant-
based methods construct, in every iteration, a smooth quadratic model of the objective, then
produce a search direction by minimizing this model. After performing a line search designed to
reduce the objective function, a new orthant is selected and the process is repeated. This approach
can be motivated by the success of second-order gradient projection methods for bound constrained
optimization, which at every iteration employ a gradient projection search to identify an active set
and perform a minimization of the objective function over a space of free variables to compute a
search direction.
Footnote 9: Consider the one-dimensional case of having $q_k(w) = |w|$. The minimum-norm subgradient of $q_k$ has a magnitude
of 1 at all points, except at the minimizer $w_* = 0$; hence, this norm does not provide a measure of proximity to $w_*$.

The selection of an orthant is typically done using sensitivity information. Specifically, with the
minimum norm subgradient of φ at w ∈ Rd , which is given component-wise for all i ∈ {1, . . . , d} by

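\[
\hat{g}_i(w) =
\begin{cases}
[\nabla F(w)]_i + \lambda & \text{if } w_i > 0 \text{ or } \{w_i = 0 \text{ and } [\nabla F(w)]_i + \lambda < 0\} \\
[\nabla F(w)]_i - \lambda & \text{if } w_i < 0 \text{ or } \{w_i = 0 \text{ and } [\nabla F(w)]_i - \lambda > 0\} \\
0 & \text{otherwise,}
\end{cases}
\tag{8.14}
\]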

the active orthant for an iterate wk is characterized by the sign vector


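\[
\zeta_{k,i} =
\begin{cases}
\operatorname{sgn}([w_k]_i) & \text{if } [w_k]_i \ne 0 \\
\operatorname{sgn}(-[\hat{g}(w_k)]_i) & \text{if } [w_k]_i = 0.
\end{cases}
\tag{8.15}
\]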

Along these lines, let us also define the subsets of {1, . . . , d} given by

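\[ \mathcal{A}_k = \{ i : [w_k]_i = 0 \text{ and } |[\nabla F(w_k)]_i| \le \lambda \} \tag{8.16} \]

\[ \text{and} \quad \mathcal{F}_k = \{ i : [w_k]_i \ne 0 \} \cup \{ i : [w_k]_i = 0 \text{ and } |[\nabla F(w_k)]_i| > \lambda \}, \tag{8.17} \]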

where Ak represents the indices of variables that are active and kept at zero while Fk represents
those that are free to move.
Given these quantities, an orthant-based method proceeds as follows. First, one computes the
(approximate) solution dk of the (smooth) quadratic problem

\[ \min_{d \in \mathbb{R}^d} \ \hat{g}(w_k)^T d + \tfrac{1}{2} d^T H_k d \quad \text{s.t.} \ d_i = 0, \ i \in \mathcal{A}_k, \]

where $H_k$ represents $\nabla^2 F(w_k)$ or an approximation of it. Then, the algorithm performs a line
search—over a path contained in the current orthant—to compute the next iterate. For example,
one option is a projected backtracking line search along $d_k$, which involves computing the largest
$\alpha_k$ in a decreasing geometric sequence such that

\[ F(P_k(w_k + \alpha_k d_k)) < F(w_k). \]

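Here, $P_k(w)$ projects $w \in \mathbb{R}^d$ onto the orthant defined by $\zeta_k$, i.e.,

\[
[P_k(w)]_i =
\begin{cases}
w_i & \text{if } \operatorname{sgn}(w_i) = \zeta_{k,i} \\
0 & \text{otherwise.}
\end{cases}
\tag{8.18}
\]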

In this way, the initial and final points of an iteration lie in the same orthant. Orthant-based
methods have proved to be quite useful in practice; e.g., see [7, 30].
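The building blocks (8.14)–(8.18) are simple component-wise operations, as the following sketch suggests. It is not taken from any of the cited implementations: the function names, the use of boolean masks for the index sets, and the example vectors (including the trial direction `d`) are assumptions made for illustration.

```python
import numpy as np

def min_norm_subgradient(w, grad, lam):
    """Component-wise minimum-norm subgradient of phi at w, following (8.14)."""
    g_hat = np.zeros_like(w, dtype=float)
    plus = (w > 0) | ((w == 0) & (grad + lam < 0))
    minus = (w < 0) | ((w == 0) & (grad - lam > 0))
    g_hat[plus] = grad[plus] + lam
    g_hat[minus] = grad[minus] - lam
    return g_hat

def orthant_signs(w, g_hat):
    # Sign vector zeta_k of (8.15).
    return np.where(w != 0, np.sign(w), np.sign(-g_hat))

def active_and_free(w, grad, lam):
    # Index sets (8.16) and (8.17), returned as boolean masks.
    active = (w == 0) & (np.abs(grad) <= lam)
    return active, ~active

def project_onto_orthant(w, zeta):
    # Projection P_k of (8.18): zero out components whose sign disagrees with zeta.
    return np.where(np.sign(w) == zeta, w, 0.0)

# Hypothetical iterate and gradient of F at that iterate.
w_k = np.array([1.0, 0.0, 0.0, -2.0])
grad = np.array([0.3, -2.0, 0.5, 1.0])
lam = 1.0

g_hat = min_norm_subgradient(w_k, grad, lam)
zeta = orthant_signs(w_k, g_hat)
active, free = active_and_free(w_k, grad, lam)
d = np.array([-1.5, 0.4, 0.0, 0.5])          # made-up direction with d_i = 0 on the active set
w_proj = project_onto_orthant(w_k + d, zeta)
```

Here the first component of the trial point crosses into a different orthant and is therefore set to zero by the projection, while the second component leaves zero in the direction permitted by the sign vector.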

Commentary Proximal Newton and orthant-based methods represent two efficient classes of
second-order active-set methods for solving the $\ell_1$-norm regularized problem (8.2). The proximal
Newton method is reminiscent of the sequential quadratic programming method (SQP) for
constrained optimization; they both solve a complex subproblem that yields a useful estimate of the
optimal active set. Although solving the piecewise quadratic model (8.12) is very expensive in
general, the coordinate descent method has proven to be well suited for this task, and allows the
proximal Newton method to be applied to very large problems [79]. Orthant-based methods have
been shown to be equally useful, but in a more heuristic way, since some popular implementations lack
convergence guarantees [58, 30]. Stochastic variants of both proximal Newton and orthant-based
schemes can be devised in natural ways, and generally inherit the properties of stochastic proximal
gradient methods as long as the Hessian approximations are forced to possess eigenvalues within a
positive interval.

9 Summary and Perspectives


Mathematical optimization is one of the foundations of machine learning, touching almost every
aspect of the discipline. In particular, numerical optimization algorithms, the main subject of this
paper, have played an integral role in the transformational progress that machine learning has
experienced over the past two decades. In our study, we highlight the dominant role played by
the stochastic gradient method (SG) of Robbins and Monro [130], whose success derives from its
superior work complexity guarantees. A concise, yet broadly applicable convergence and complexity
theory for SG is presented here, providing insight into how these guarantees have translated into
practical gains.
Although the title of this paper suggests that our treatment of optimization methods for machine
learning is comprehensive, much more could be said about this rapidly evolving field. Perhaps most
importantly, we have neither discussed nor analyzed at length the opportunities offered by parallel
and distributed computing, which may alter our perspectives in the years to come. In fact, it has
already been widely acknowledged that SG, despite its other benefits, may not be the best suited
method for emerging computer architectures.
This leads to our discussion of a spectrum of methods that have the potential to surpass SG
in the next generation of optimization methods for machine learning. These methods, such as
those built on noise reduction and second-order techniques, offer the ability to attain improved
convergence rates, overcome the adverse effects of high nonlinearity and ill-conditioning, and exploit
parallelism and distributed architectures in new ways. There are important methods that are not
included in our presentation—such as the alternating direction method of multipliers (ADMM) [57,
64, 67] and the expectation-maximization (EM) method and its variants [48, 160]—but our study
covers many of the core algorithmic frameworks in optimization for machine learning, with emphasis
on methods and theoretical guarantees that have the largest impact on practical performance.
With the great strides that have been made and the various avenues for continued contributions,
numerical optimization promises to continue to have a profound impact on the rapidly growing field
of machine learning.

A Convexity and Analyses of SG


Our analyses of SG in §4 can be characterized as relying primarily on smoothness in the sense of
Assumption 4.1. This has advantages and disadvantages. On the positive side, it allows us to prove
convergence results that apply equally for the minimization of convex and nonconvex functions,
the latter of which has been rising in importance in machine learning; recall §2.2. It also allows
us to present results that apply equally to situations in which the stochastic vectors are unbiased
estimators of the gradient of the objective, or when such estimators are scaled by a symmetric
positive definite matrix; recall (4.2). A downside, however, is that it requires us to handle the
minimization of nonsmooth models separately, which we do in §8.
As an alternative, a common tactic employed by many authors is to leverage convexity in-
stead of smoothness, allowing for the establishment of guarantees that can be applied in smooth
and nonsmooth settings. For example, a typical approach for analyzing stochastic gradient-based
methods is to commence with the following fundamental equations related to squared distances to
the optimum:

\[
\begin{aligned}
\|w_{k+1} - w_*\|_2^2 - \|w_k - w_*\|_2^2 &= 2 (w_{k+1} - w_k)^T (w_k - w_*) + \|w_{k+1} - w_k\|_2^2 \\
&= -2 \alpha_k g(w_k, \xi_k)^T (w_k - w_*) + \alpha_k^2 \|g(w_k, \xi_k)\|_2^2.
\end{aligned}
\tag{A.1}
\]

Assuming that $\mathbb{E}_{\xi_k}[g(w_k, \xi_k)] = \hat{g}(w_k) \in \partial F(w_k)$, one then obtains

\[
\mathbb{E}_{\xi_k}[\|w_{k+1} - w_*\|_2^2] - \|w_k - w_*\|_2^2
= -2 \alpha_k \hat{g}(w_k)^T (w_k - w_*) + \alpha_k^2 \mathbb{E}_{\xi_k}[\|g(w_k, \xi_k)\|_2^2],
\tag{A.2}
\]

which has certain similarities with (4.10a). One can now introduce an assumption of convexity to
bound the first term on the right-hand side in this expression; in particular, convexity offers the
subgradient inequality
\[ \hat{g}(w_k)^T (w_k - w_*) \ge F(w_k) - F(w_*) \ge 0 \]
while strong convexity offers the stronger condition (4.11). Combined with a suitable assumption
on the second moment of the stochastic subgradients to bound the second term in the expression,
the entire right-hand side can be adequately controlled through judicious stepsize choices. The
resulting analysis then has many similarities with that presented in §4, especially if one introduces
an assumption about Lipschitz continuity of the gradients of F in order to translate results on
decreases in the distance to the solution into results on decreases in F itself. The interested reader
will find a clear exposition of such results in [105].
Note, however, that one can see in (A.2) that analyses based on distances to the solution do
not carry over easily to nonconvex settings or when (quasi-)Newton type steps are employed. In
such situations, without explicit knowledge of w∗ , one cannot easily ensure that the first term on
the right-hand side can be bounded appropriately.
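The identity (A.1) is purely algebraic and can be sanity-checked numerically. The snippet below is a small sketch with made-up random vectors standing in for the iterate, the solution, and the stochastic (sub)gradient; none of these values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical iterate w_k, solution w_star, and stochastic (sub)gradient g in R^5.
w_k, w_star, g = rng.normal(size=(3, 5))
alpha = 0.1
w_next = w_k - alpha * g          # SG-type step w_{k+1} = w_k - alpha_k * g(w_k, xi_k)

# Left- and right-hand sides of (A.1).
lhs = np.sum((w_next - w_star) ** 2) - np.sum((w_k - w_star) ** 2)
rhs = -2 * alpha * g @ (w_k - w_star) + alpha ** 2 * np.sum(g ** 2)
```

The two sides agree to floating-point precision for any choice of vectors, which reflects that (A.1) uses no property of F; convexity only enters when the inner-product term is bounded.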

B Proofs
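Inequality (4.3). Under Assumption 4.1, one obtains

\[
\begin{aligned}
F(w) &= F(\bar{w}) + \int_0^1 \frac{\partial F(\bar{w} + t(w - \bar{w}))}{\partial t} \, dt \\
&= F(\bar{w}) + \int_0^1 \nabla F(\bar{w} + t(w - \bar{w}))^T (w - \bar{w}) \, dt \\
&= F(\bar{w}) + \nabla F(\bar{w})^T (w - \bar{w}) + \int_0^1 [\nabla F(\bar{w} + t(w - \bar{w})) - \nabla F(\bar{w})]^T (w - \bar{w}) \, dt \\
&\le F(\bar{w}) + \nabla F(\bar{w})^T (w - \bar{w}) + \int_0^1 L \|t(w - \bar{w})\|_2 \|w - \bar{w}\|_2 \, dt,
\end{aligned}
\]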

from which the desired result follows.
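For completeness, the final step can be written out explicitly. Writing $\bar{w}$ for the base point of the expansion and $w$ for the evaluation point (a labeling assumed here for the worked step), the remaining integral evaluates as

\[
\int_0^1 L \|t(w - \bar{w})\|_2 \|w - \bar{w}\|_2 \, dt = L \|w - \bar{w}\|_2^2 \int_0^1 t \, dt = \tfrac{1}{2} L \|w - \bar{w}\|_2^2,
\]

so that $F(w) \le F(\bar{w}) + \nabla F(\bar{w})^T (w - \bar{w}) + \tfrac{1}{2} L \|w - \bar{w}\|_2^2$, which is the form of the bound used in the proof of Theorem 8.1.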

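Inequality (4.12). Given $w \in \mathbb{R}^d$, the quadratic model

\[ q(\bar{w}) := F(w) + \nabla F(w)^T (\bar{w} - w) + \tfrac{1}{2} c \|\bar{w} - w\|_2^2 \]

has the unique minimizer $\bar{w}_* := w - \tfrac{1}{c} \nabla F(w)$ with $q(\bar{w}_*) = F(w) - \tfrac{1}{2c} \|\nabla F(w)\|_2^2$. Hence, the
inequality (4.11) with $\bar{w} = w_*$ and any $w \in \mathbb{R}^d$ yields

\[ F_* \ge F(w) + \nabla F(w)^T (w_* - w) + \tfrac{1}{2} c \|w_* - w\|_2^2 \ge F(w) - \tfrac{1}{2c} \|\nabla F(w)\|_2^2, \]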

from which the desired result follows.

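Corollary 4.12. Define $G(w) := \|\nabla F(w)\|_2^2$ and let $L_G$ be the Lipschitz constant of $\nabla G(w) = 2 \nabla^2 F(w) \nabla F(w)$. Then,

\[
\begin{aligned}
G(w_{k+1}) - G(w_k) &\le \nabla G(w_k)^T (w_{k+1} - w_k) + \tfrac{1}{2} L_G \|w_{k+1} - w_k\|_2^2 \\
&\le -\alpha_k \nabla G(w_k)^T g(w_k, \xi_k) + \tfrac{1}{2} \alpha_k^2 L_G \|g(w_k, \xi_k)\|_2^2.
\end{aligned}
\]

Taking the expectation with respect to the distribution of $\xi_k$, one obtains from Assumptions 4.1
and 4.3 and the inequality (4.9) that

\[
\begin{aligned}
\mathbb{E}_{\xi_k}[G(w_{k+1})] - G(w_k)
&\le -2 \alpha_k \nabla F(w_k)^T \nabla^2 F(w_k)^T \mathbb{E}_{\xi_k}[g(w_k, \xi_k)] + \tfrac{1}{2} \alpha_k^2 L_G \mathbb{E}_{\xi_k}[\|g(w_k, \xi_k)\|_2^2] \\
&\le 2 \alpha_k \|\nabla F(w_k)\|_2 \|\nabla^2 F(w_k)\|_2 \|\mathbb{E}_{\xi_k}[g(w_k, \xi_k)]\|_2 + \tfrac{1}{2} \alpha_k^2 L_G \mathbb{E}_{\xi_k}[\|g(w_k, \xi_k)\|_2^2] \\
&\le 2 \alpha_k L \mu_G \|\nabla F(w_k)\|_2^2 + \tfrac{1}{2} \alpha_k^2 L_G (M + M_V \|\nabla F(w_k)\|_2^2).
\end{aligned}
\]

Taking the total expectation simply yields

\[
\mathbb{E}[G(w_{k+1})] - \mathbb{E}[G(w_k)]
\le 2 \alpha_k L \mu_G \mathbb{E}[\|\nabla F(w_k)\|_2^2] + \tfrac{1}{2} \alpha_k^2 L_G (M + M_V \mathbb{E}[\|\nabla F(w_k)\|_2^2]).
\tag{B.1}
\]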

Recall that Theorem 4.10 establishes that the first component of this bound is the term of a
convergent sum. The second component of this bound is also the term of a convergent sum since
$\sum_{k=1}^{\infty} \alpha_k^2$ converges and since $\alpha_k^2 \le \alpha_k$ for sufficiently large $k \in \mathbb{N}$, meaning that again the result of
Theorem 4.10 can be applied. Therefore, the right-hand side of (B.1) is the term of a convergent
sum. Let us now define

\[
S_K^+ = \sum_{k=1}^{K} \max(0, \mathbb{E}[G(w_{k+1})] - \mathbb{E}[G(w_k)])
\quad \text{and} \quad
S_K^- = \sum_{k=1}^{K} \max(0, \mathbb{E}[G(w_k)] - \mathbb{E}[G(w_{k+1})]).
\]

Since the bound (B.1) is positive and forms a convergent sum, the nondecreasing sequence $S_K^+$ is
upper bounded and therefore converges. Since, for any $K \in \mathbb{N}$, one has $G(w_K) = G(w_0) + S_K^+ - S_K^- \ge 0$,
the nondecreasing sequence $S_K^-$ is upper bounded and therefore also converges. Therefore $G(w_K)$
converges and Theorem 4.9 means that this limit must be zero.

References
[1] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-Theoretic
Lower Bounds on the Oracle Complexity of Stochastic Convex Optimization. IEEE Trans-
actions on Information Theory, 58(5):3235–3249, 2012.
[2] Naman Agarwal, Brian Bullins, and Elad Hazan. Second order stochastic optimization in
linear time. arXiv preprint arXiv:1602.03943, 2016.
[3] Zeyuan Allen Zhu and Elad Hazan. Optimal black-box reductions between optimization
objectives. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors,
Advances in Neural Information Processing Systems 29, pages 1606–1614. 2016.
[4] S.-I. Amari. A theory of adaptive pattern classifiers. IEEE Transactions on Electronic
Computers, EC-16:299–307, 1967.
[5] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–
276, 1998.
[6] S.-I. Amari and H. Nagaoka. Methods of Information Geometry. American Mathematical
Society, Providence, RI, USA, 1997.
[7] Galen Andrew and Jianfeng Gao. Scalable training of L1-regularized log-linear models. In
Proceedings of the 24th international conference on Machine learning, pages 33–40. ACM,
2007.
[8] Yves F Atchade, Gersende Fort, and Eric Moulines. On stochastic proximal gradient algo-
rithms. arXiv preprint arXiv:1402.2365, 2014.
[9] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing
penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

[10] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[11] S. Becker, E. J. Candès, and M. Grant. Templates for convex cone problems with applications
to sparse signal recovery. Mathematical Programming Computation, 3(3):165–218, 2011.
[12] S. Becker and Y. LeCun. Improving the convergence of back-propagation learning with
second-order methods. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proc. of the
1988 Connectionist Models Summer School, pages 29–37, San Mateo, 1989. Morgan Kaufmann.
[13] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, Massachusetts, 1995.
[14] D. P. Bertsekas. Incremental least squares methods and the extended Kalman filter. SIAM
Journal on Optimization, 6(3):807–822, 1996.
[15] D. P. Bertsekas. Convex Optimization Algorithms. Athena Scientific, Nashua, NH, USA,
2015.
[16] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Meth-
ods. Prentice-Hall, Englewood Cliffs, NJ, USA, 1989.

[17] D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a
constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.

[18] A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: Careful Quasi-Newton Stochastic Gradient
Descent. Journal of Machine Learning Research, 10:1737–1754, 2009.

[19] Antoine Bordes, Léon Bottou, Patrick Gallinari, Jonathan Chang, and S. Alex Smith.
Erratum: SGDQN is less careful than expected. Journal of Machine Learning Research, 11:2229–2240, 2010.

[20] L. Bottou. Stochastic Gradient Learning in Neural Networks. In Proceedings of Neuro-Nîmes
91, Nîmes, France, 1991. EC2.

[21] L. Bottou. Online Algorithms and Stochastic Approximations. In D. Saad, editor, Online
Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.

[22] L. Bottou. Large-Scale Machine Learning with Stochastic Gradient Descent. In Y. Lechevallier
and G. Saporta, editors, Proceedings of the 19th International Conference on Computational
Statistics (COMPSTAT’2010), pages 177–187, Paris, France, August 2010. Springer.

[23] L. Bottou and O. Bousquet. The Tradeoffs of Large Scale Learning. In J. C. Platt, D. Koller,
Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20,
pages 161–168. Curran Associates, Inc., 2008.

[24] L. Bottou, F. Fogelman Soulié, P. Blanchet, and J. S. Lienard. Experiments with Time Delay
Networks and Dynamic Time Warping for Speaker Independent Isolated Digit Recognition.
In Proceedings of EuroSpeech 89, volume 2, pages 537–540, Paris, France, 1989.

[25] L. Bottou and Y. Le Cun. On-line Learning for Very Large Datasets. Applied Stochastic
Models in Business and Industry, 21(2):137–151, 2005.

[26] O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Anal-
ysis of Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002.

[27] C. G. Broyden. The Convergence of a Class of Double-Rank Minimization Algorithms. Journal
of the Institute of Mathematics and Its Applications, 6(1):76–90, 1970.

[28] R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu. Sample Size Selection in Optimization
Methods for Machine Learning. Mathematical Programming, Series B, 134(1):127–155, 2012.

[29] Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal. On the use of stochas-
tic Hessian information in optimization methods for machine learning. SIAM Journal on
Optimization, 21(3):977–995, 2011.

[30] Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Figen Oztoprak. A family of
second-order methods for convex ℓ1-regularized optimization. Mathematical Programming, pages
1–33, 2012.

[31] Richard H Byrd, Jorge Nocedal, and Figen Oztoprak. An inexact successive quadratic ap-
proximation method for convex L-1 regularized optimization. arXiv preprint arXiv:1309.3529,
2013.

[32] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization.
Foundations of Computational Mathematics, 9(6):717–772, 2009.

[33] Jean-François Cardoso. Blind signal separation: statistical principles. Proc. of the IEEE,
86(10):2009–2025, Oct 1998.

[34] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the Generalization Ability of On-Line
Learning Algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

[35] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun.
The loss surfaces of multilayer networks. In AISTATS, 2015.

[36] K. L. Chung. On a stochastic approximation method. Annals of Mathematical Statistics,
25(3):463–483, 1954.

[37] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. Trust Region Methods. Society for Industrial
and Applied Mathematics, 2000.

[38] C. Cortes and V. N. Vapnik. Support-Vector Networks. Machine Learning, 20(3):273–297,
1995.

[39] R. Courant and H. Robbins. What is Mathematics? Oxford University Press, First edition,
1941.

[40] R. Courant and H. Robbins. What is Mathematics? Oxford University Press, Second edition,
1996. Revised by I. Stewart.

[41] T. M. Cover. Universal Portfolios. Mathematical Finance, 1(1):1–29, 1991.

[42] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear
inverse problems with a sparsity constraint. Communications on Pure and Applied Mathe-
matics, 58:1413–1457, 2004.

[43] Yann N. Dauphin, Harm de Vries, Junyoung Chung, and Yoshua Bengio. RMSProp and
equilibrated adaptive learning rates for non-convex optimization. CoRR, abs/1502.04390,
2015.

[44] Yann N. Dauphin, Razvan Pascanu, Çaglar Gülçehre, KyungHyun Cho, Surya Ganguli, and
Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-
convex optimization. In Advances in Neural Information Processing Systems 27: Annual
Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal,
Quebec, Canada, pages 2933–2941, 2014.

[45] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato,
A. W. Senior, P. A. Tucker, K. Yang, and A. Y. Ng. Large Scale Distributed Deep Networks.
In Advances in Neural Information Processing Systems 25, pages 1232–1240, 2012.

[46] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method
With Support for Non-Strongly Convex Composite Objectives. In Z. Ghahramani, M. Welling,
C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information
Processing Systems 27, pages 1646–1654. Curran Associates, Inc., 2014.

[47] R. S. Dembo, S. C. Eisenstat, and T. Steihaug. Inexact Newton Methods. SIAM Journal on
Numerical Analysis, 19(2):400–408, 1982.

[48] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data
via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological),
39(1):1–38, 1977.

[49] Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li,
Hartmut Neven, and Hartwig Adam. Large-scale object classification using label relation
graphs. In Proceedings of the 13th European Conference on Computer Vision (ECCV), Part
I, pages 48–64, 2014.

[50] L. Deng, G. E. Hinton, and B. Kingsbury. New types of deep neural network learning for
speech recognition and related applications: An overview. In IEEE International Conference
on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, pages
8599–8603, 2013.

[51] J. E. Dennis and J. J. Moré. A Characterization of Superlinear Convergence and Its Appli-
cation to Quasi-Newton Methods. Mathematics of Computation, 28(126):549–560, 1974.

[52] J. E. Dennis, Jr. and R. B. Schnabel. Numerical Methods for Unconstrained Optimization
and Nonlinear Equations. Classics in Applied Mathematics. SIAM, Philadelphia, PA, USA,
1996.

[53] D. L. Donoho. De-noising by soft-thresholding. IEEE Transactions on Information Theory,
41(3):613–627, 1995.

[54] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[55] R. M. Dudley. Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathe-
matics, 63. Cambridge University Press, 1999.

[56] S. T. Dumais, J. C. Platt, D. Heckerman, and M. Sahami. Inductive Learning Algorithms
and Representations for Text Categorization. In Proceedings of the 1998 ACM CIKM
International Conference on Information and Knowledge Management, Bethesda, Maryland,
USA, pages 148–155, 1998.

[57] J. Eckstein and D. P. Bertsekas. On the Douglas-Rachford splitting method and the proximal
point algorithm for maximal monotone operators. Mathematical Programming, 55:293–318,
1992.

[58] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR:
A library for large linear classification. The Journal of Machine Learning Research, 9:1871–
1874, 2008.

[59] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright. Gradient Projection for Sparse Recon-
struction: Application to Compressed Sensing and Other Inverse Problems. IEEE Journal of
Selected Topics in Signal Processing, 1(4):586–597, 2007.

[60] R. Fletcher. A New Approach to Variable Metric Algorithms. Computer Journal, 13(3):317–
322, 1970.

[61] J. E. Freund. Mathematical Statistics. Prentice-Hall, Englewood Cliffs, NJ, USA, 1962.

[62] M. P. Friedlander and M. Schmidt. Hybrid Deterministic-Stochastic Methods for Data Fit-
ting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

[63] M. C. Fu. Optimization for simulation: Theory vs. practice. INFORMS Journal on Comput-
ing, 14(3):192–215, 2002.

[64] D. Gabay and B. Mercier. A Dual Algorithm for the Solution of Nonlinear Variational
Problems via Finite Element Approximations. Computers and Mathematics with Applications,
2:17–40, 1976.

[65] Gilles Gasso, Aristidis Pappaioannou, Marina Spivak, and Léon Bottou. Batch and online
learning algorithms for nonconvex Neyman-Pearson classification. ACM Transactions on
Intelligent Systems and Technology, 3(2), 2011.

[66] E. G. Gladyshev. On stochastic approximations. Theory of Probability and its Applications,
10:275–278, 1965.

[67] R. Glowinski and A. Marrocco. Sur l’approximation, par éléments finis d’ordre un, et la
résolution, par pénalisation-dualité, d’une classe de problèmes de Dirichlet non linéaires. Revue
Française d’Automatique, Informatique, et Recherche Opérationnelle, 9:41–76, 1975.

[68] D. Goldfarb. A Family of Variable Metric Updates Derived by Variational Means. Mathe-
matics of Computation, 24(109):23–26, 1970.

[69] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press,
Baltimore, MD, USA, fourth edition, 2012.

[70] Ian J. Goodfellow and Oriol Vinyals. Qualitatively characterizing neural network optimization
problems. CoRR, abs/1412.6544, 2014.

[71] A. Griewank. Automatic Differentiation. Princeton Companion to Applied Mathematics,
Nicholas Higham, Ed., Princeton University Press, 2014.

[72] Mert Gurbuzbalaban, Asuman Ozdaglar, and Pablo Parrilo. On the convergence rate of
incremental aggregated gradient algorithms. ArXiv, abs/1506.02081, 2015.

[73] F. S. Hashemi, S. Ghosh, and R. Pasupathy. On adaptive sampling rules for stochastic
recursions. In Simulation Conference (WSC), 2014 Winter, pages 3959–3970, 2014.

[74] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im-
age recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2016.

[75] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. Journal of
the American Statistical Association, 58(301):13–30, 1963.

[76] Matthew D. Hoffman, David M. Blei, Chong Wang, and John William Paisley. Stochastic
variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[77] T. Homem-de-Mello and A. Shapiro. A Simulation-Based Approach to Two-Stage Stochastic
Programming with Recourse. Mathematical Programming, 81(3):301–325, 1998.

[78] C. J. Hsieh, M. A. Sustik, P. Ravikumar, and I. S. Dhillon. Sparse inverse covariance ma-
trix estimation using quadratic approximation. Advances in Neural Information Processing
Systems (NIPS), 24, 2011.

[79] Cho-Jui Hsieh, Mátyás A Sustik, Inderjit S Dhillon, Pradeep K Ravikumar, and Russell
Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances
in Neural Information Processing Systems, pages 3165–3173, 2013.

[80] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. In Proceedings of the 32nd International Conference on
Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 448–456, 2015.

[81] T. Joachims. Text Categorization with Support Vector Machines: Learning with Many
Relevant Features. In Proceedings of the 10th European Conference on Machine Learning
(ECML’98), Chemnitz, Germany, pages 137–142, 1998.

[82] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive vari-
ance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[83] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Transactions
of the ASME – Journal of Basic Engineering (Series D), 82:35–45, 1960.

[84] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR,
abs/1412.6980, 2014.

[85] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolu-
tional Neural Networks. In Advances in Neural Information Processing Systems 25, 2012.

[86] Jean Lafond, Nicolas Vasilache, and Léon Bottou. Manuscript in preparation, 2016.

[87] Y. Le Cun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and
L. D. Jackel. Handwritten Digit Recognition with a Back-Propagation Network. In Advances
in Neural Information Processing Systems 2, pages 396–404, 1989.

[88] Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient Based Learning Applied to
Document Recognition. Proceedings of IEEE, 86(11):2278–2324, 1998.

[89] Y. Le Cun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient Backprop. In Neural Networks,
Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag, 1998.

[90] N. Le Roux, M. Schmidt, and F. R. Bach. A Stochastic Gradient Method with an Exponential
Convergence Rate for Finite Training Sets. In F. Pereira, C. J. C. Burges, L. Bottou, and
K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages
2663–2671. Curran Associates, Inc., 2012.

[91] W. S. Lee, P. L. Bartlett, and R. C. Williamson. The Importance of Convexity in Learning
with Squared Loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998.

[92] Todd K. Leen and Genevieve B. Orr. Optimal stochastic search and adaptive momentum.
In Advances in Neural Information Processing Systems 6, [7th NIPS Conference, Denver,
Colorado, USA, 1993], pages 477–484, 1993.

[93] Dennis Leventhal and Adrian S Lewis. Randomized methods for linear constraints: conver-
gence rates and conditioning. Mathematics of Operations Research, 35(3):641–654, 2010.

[94] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A New Benchmark Collection for Text
Categorization Research. Journal of Machine Learning Research, 5:361–397, 2004.

[95] Chih-Jen Lin and Jorge J Moré. Newton’s method for large bound-constrained optimization
problems. SIAM Journal on Optimization, 9(4):1100–1127, 1999.

[96] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. Advances
in Neural Information Processing Systems (NIPS), 2015.

[97] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization.
Mathematical Programming, 45:503–528, 1989.

[98] J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar. An Asynchronous Parallel Stochastic
Coordinate Descent Algorithm. Journal of Machine Learning Research, 16:285–322, 2015.

[99] Gaétan Marceau-Caron and Yann Ollivier. Practical Riemannian neural networks. CoRR,
abs/1602.08007, 2016.

[100] J. Martens and I. Sutskever. Learning Recurrent Neural Networks with Hessian-Free
Optimization. In 28th International Conference on Machine Learning. ICML, 2011.

[101] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint
arXiv:1412.1193, 2014.

[102] P. Massart. Some applications of concentration inequalities to Statistics. Annales de la
Faculté des Sciences de Toulouse, series 6, 9(2):245–303, 2000.

[103] A. Mokhtari and A. Ribeiro. RES: Regularized Stochastic BFGS algorithm. IEEE Transac-
tions on Signal Processing, 62(23):6089–6104, 2014.

[104] N. Murata. A Statistical Study of On-line Learning. In D. Saad, editor, On-line Learning in
Neural Networks, pages 63–92. Cambridge University Press, New York, NY, USA, 1998.

[105] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust Stochastic Approximation
Approach to Stochastic Programming. SIAM Journal on Optimization, 19(4):1574–1609,
2009.

[106] A. S. Nemirovski and D. B. Yudin. On Cezaro’s convergence of the steepest descent method
for approximating saddle point of convex-concave functions. Soviet Mathematics–Doklady,
19, 1978.

[107] Y. Nesterov. A method of solving a convex programming problem with convergence rate
O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983.

[108] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Aca-
demic Publishers, 2004.

[109] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Program-
ming, Series B, 120(1):221–259, 2009.

[110] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems.
SIAM Journal on Optimization, 22(2):341–362, 2012.

[111] NIPS Foundation. Advances in neural information processing systems (29 volumes), 1987–
2016. Volumes 0–28, http://papers.nips.cc.

[112] F. Niu, B. Recht, C. Ré, and S. J. Wright. Hogwild!: A Lock-Free Approach to Parallelizing
Stochastic Gradient Descent. In Advances in Neural Information Processing Systems 24,
pages 693–701, 2011.

[113] J. Nocedal. Updating Quasi-Newton Matrices With Limited Storage. Mathematics of Com-
putation, 35(151):773–782, 1980.

[114] J. Nocedal and S. J. Wright. Numerical Optimization. Springer New York, Second edition,
2006.

[115] A. B. J. Novikoff. On Convergence Proofs on Perceptrons. In Proceedings of the Symposium
on the Mathematical Theory of Automata, volume 12, pages 615–622, 1962.

[116] J. Nutini, M. Schmidt, I. H. Laradji, M. Friedlander, and H. Koepke. Coordinate Descent
Converges Faster with the Gauss-Southwell Rule than Random Selection. In 32nd
International Conference on Machine Learning. ICML, 2015.

[117] J. Ortega and W. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables.
Society for Industrial and Applied Mathematics, 2000.

[118] M. R. Osborne. Fisher’s method of scoring. International Statistical Review / Revue Inter-
nationale de Statistique, 60(1):99–117, Apr 1992.

[119] Hyeyoung Park, Shun-ichi Amari, and Kenji Fukumizu. Adaptive natural gradient learning
algorithms for various stochastic models. Neural Networks, 13(7):755–764, 2000.

[120] R. Pasupathy, P. W. Glynn, S. Ghosh, and F. S. Hashemi. On Sampling Rates in Stochastic
Recursions. 2015. Under review.

[121] B. A. Pearlmutter. Fast Exact Multiplication by the Hessian. Neural Computation, 6(1):147–
160, 1994.

[122] Mert Pilanci and Martin J Wainwright. Newton sketch: A linear-time optimization algorithm
with linear-quadratic convergence. arXiv preprint arXiv:1505.02250, 2015.

[123] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[124] B. T. Polyak. Comparison of the convergence rates for single-step and multi-step optimization
algorithms in the presence of noise. Engineering Cybernetics, 15(1):6–10, 1977.

[125] B. T. Polyak. New Method of Stochastic Approximation Type. Automation and Remote
Control, 51(4):937–946, 1991.

[126] B. T. Polyak and A. B. Juditsky. Acceleration of Stochastic Approximation by Averaging.
SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[127] M. J. D. Powell. On Search Directions for Minimization Algorithms. Mathematical
Programming, 4(1):193–201, 1973.

[128] Maxim Raginsky and Alexander Rakhlin. Information-based complexity, feedback and dy-
namics in convex programming. IEEE Transactions on Information Theory, 57(10):7036–
7056, 2011.

[129] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an
astounding baseline for recognition. arXiv:1403.6382, 2014.

[130] H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical
Statistics, 22(3):400–407, 1951.

[131] H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartin-
gales and some applications. In Jagdish S. Rustagi, editor, Optimizing Methods in Statistics.
Academic Press, 1971.

[132] Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods II: Local
convergence rates. arXiv preprint arXiv:1601.04738, 2016.

[133] Stéphane Ross, Paul Mineiro, and John Langford. Normalized online learning. In Proceedings
of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI), 2013.

[134] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by
error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of
Cognition, volume 1, chapter 8, pages 318–362. MIT Press, Cambridge, MA, 1986.

[135] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning Representations by
Back-Propagating Errors. Nature, 323:533–536, 1986.

[136] D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical
report, Cornell University Operations Research and Industrial Engineering, 1988.

[137] Katya Scheinberg and Xiaocheng Tang. Practical inexact proximal quasi-newton method
with global complexity analysis. arXiv preprint arXiv:1311.6547, 2013.

[138] M. Schmidt. Graphical Model Structure Learning with ℓ1-Regularization. PhD thesis,
University of British Columbia, 2010.

[139] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average
gradient. arXiv preprint arXiv:1309.2388, 2013.

[140] Mark Schmidt, Nicolas Le Roux, and Francis R Bach. Convergence rates of inexact
proximal-gradient methods for convex optimization. In Advances in Neural Information Processing
Systems, pages 1458–1466, 2011.

[141] N. N. Schraudolph. Fast Curvature Matrix-Vector Products. In G. Dorffner, H. Bischof, and
K. Hornik, editors, Artificial Neural Networks, ICANN 2001, volume 2130 of Lecture Notes
in Computer Science, pages 19–26. Springer-Verlag Berlin Heidelberg, 2001.

[142] N. N. Schraudolph, J. Yu, and S. Günter. A stochastic quasi-Newton method for online
convex optimization. In International Conference on Artificial Intelligence and Statistics,
pages 436–443. Society for Artificial Intelligence and Statistics, 2007.

[143] G. Shafer and V. Vovk. Probability and finance: It’s only a game!, volume 491. John Wiley
& Sons, 2005.

[144] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: primal estimated sub-
gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.

[145] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized
loss. Journal of Machine Learning Research, 14(1):567–599, 2013.

[146] D. F. Shanno. Conditioning of Quasi-Newton Methods for Function Minimization.
Mathematics of Computation, 24(111):647–656, 1970.

[147] S. Sra, S. Nowozin, and S. J. Wright. Optimization for Machine Learning. MIT Press, 2011.

[148] Stanford Vision Lab. ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
http://image-net.org/challenges/LSVRC/, 2015 (accessed October 25, 2015).

[149] T. Steihaug. The Conjugate Gradient Method and Trust Regions in Large Scale Optimization.
SIAM Journal on Numerical Analysis, 20(3):626–637, 1983.

[150] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and
momentum in deep learning. In 30th International Conference on Machine Learning. ICML,
2013.

[151] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B (Methodological), pages 267–288, 1996.

[152] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5. RMSPROP: Divide the gradient by a run-
ning average of its recent magnitude. COURSERA: Neural Networks for Machine Learning,
2012.

[153] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable mini-
mization. Mathematical Programming, 117(1-2):387–423, 2009.

[154] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of
Statistics, 32(1), 2004.

[155] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 1983.

[156] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[157] V. N. Vapnik and A. Y. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974.
German Translation: Theorie der Zeichenerkennung, Akademie–Verlag, Berlin, 1979.

[158] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of
events to their probabilities. Proceedings of the USSR Academy of Sciences, 181(4):781–783,
1968. English translation: Soviet Math. Dokl., 9:915-918, 1968.

[159] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse Reconstruction by Separable
Approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.

[160] C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics,
11(1):95–103, 1983.

[161] W. Xu. Towards optimal one pass large scale learning with averaged stochastic gradient
descent. arXiv preprint arXiv:1107.2490v2, 2011.

[162] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701,
2012.

[163] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for
large-scale bound constrained optimization. ACM Transactions on Mathematical Software,
23(4):550–560, 1997.

[164] M. Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In
Proceedings of the Twentieth International Conference on Machine Learning (ICML), pages
928–936, 2003.
