Fengxiang He · Dacheng Tao
Foundations of Deep Learning
Machine Learning: Foundations, Methodologies, and Applications
Springer, 2025
Series Editors
Kay Chen Tan, Department of Computing, Hong Kong Polytechnic University,
Hong Kong, China
Dacheng Tao, Nanyang Technological University, Singapore, Singapore
Books published in this series focus on the theory and computational foundations, advanced methodologies and practical applications of machine learning, ideally combining mathematically rigorous treatments of contemporary topics in machine learning with specific illustrations in relevant algorithm designs and demonstrations in real-world applications. The intended readership includes research students and researchers in computer science, computer engineering, electrical engineering, data science, and related areas seeking a convenient medium to track the progress made in the foundations, methodologies, and applications of machine learning.
Topics considered include all areas of machine learning, including but not limited
to:
• Decision tree
• Artificial neural networks
• Kernel learning
• Bayesian learning
• Ensemble methods
• Dimension reduction and metric learning
• Reinforcement learning
• Meta learning and learning to learn
• Imitation learning
• Computational learning theory
• Probabilistic graphical models
• Transfer learning
• Multi-view and multi-task learning
• Graph neural networks
• Generative adversarial networks
• Federated learning
This series includes monographs, introductory and advanced textbooks, and state-of-the-art collections. Furthermore, it supports Open Access publication.
Fengxiang He · Dacheng Tao

Foundations of Deep Learning

Fengxiang He
School of Informatics
University of Edinburgh
Edinburgh, United Kingdom

Dacheng Tao
College of Computing and Data Science
Nanyang Technological University
Singapore, Singapore
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2025
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
to the limitations in computing power and the inability to train deeper architectures
effectively.
The resurgence of interest in neural networks came about in the 1980s, marking
the second wave of deep learning research. This period, characterized as “connec-
tionist”, saw the development of more sophisticated neural network architectures
and learning algorithms. Notably, the introduction of the backpropagation algorithm
by Rumelhart, Hinton, and Williams in 1986 revolutionized the field by enabling
efficient training of deep neural networks. More importantly, the significance of
this paper lies not only in the proposal of an algorithm but also in a major turning
point where neural networks moved from psychology and physiology to the field
of machine learning. Despite significant advancements, the limitations of computa-
tional resources and the lack of large-scale datasets led to another decline in interest
in neural networks during the 1990s, marking the end of the second wave.
The third wave of deep learning, which dawned around 2010, emerged from
obscurity with a surge of game-changing advances, shaking the very foundations of
artificial intelligence. Leading this charge was Hinton’s inspiring research on deep
belief networks and unsupervised pretraining in 2006, revolutionizing the landscape
of deep neural network training. This momentum culminated in the remarkable emer-
gence of AlexNet, a successor to LeNet, which swept the field with its triumphant
victory in the 2012 ImageNet challenge, demonstrating the unparalleled capabilities
of deep learning in computer vision. The pioneering efforts by luminaries sparked a
revolution, unleashing a torrent of innovation that transcended disciplinary bound-
aries and reshaped societal norms. The pinnacle of recognition arrived with the 2018
ACM Turing Award, honoring Yoshua Bengio, Geoffrey Hinton, and Yann LeCun for
their monumental contributions, firmly establishing deep learning as the vanguard
of artificial intelligence innovation.
The success of deep learning is not only attributed to advancements in algorithms
but also relies significantly on two critical pillars: the availability of vast amounts
of data and the remarkable computational power of GPUs. Deep learning models
thrive on large datasets, which provide the diverse and abundant examples neces-
sary for training robust models. Additionally, the emergence of GPUs, particularly
those developed by NVIDIA, has revolutionized the field by enabling accelerated
training and inference processes. These GPUs are optimized for parallel processing,
allowing deep learning algorithms to harness their immense computational capa-
bilities for training complex models with unprecedented speed and efficiency. As
a result, researchers and practitioners can explore more sophisticated architectures
and tackle increasingly complex problems across various domains, from computer
vision to natural language processing.
The impact of deep learning has been nothing short of transformative, permeating
every facet of our lives with its boundless potential. From personalized recommen-
dations and intelligent assistants to breakthroughs in healthcare and autonomous
vehicles, deep learning has become the driving force behind some of the most
awe-inspiring advancements of our time. Yet, amid the dazzling array of practical
successes lies a profound challenge—the quest to unravel the mysteries of deep
learning’s inner working mechanisms.
omissions, please do not hesitate to bring them to our attention. Your feedback is important to ensure the accuracy and clarity of this work for future readers.
Contents

Part I Background

2 Statistical Learning Theory
  2.1 Glivenko-Cantelli Theorem and Concentration Inequalities
    2.1.1 Glivenko-Cantelli Theorem
    2.1.2 Concentration Inequality
  2.2 Probably Approximately Correct (PAC) Learning
    2.2.1 The PAC Learning Model
    2.2.2 Generalities
    2.2.3 Insights from Glivenko-Cantelli Theorem
  2.3 PAC-Bayesian Learning
  References

3 Hypothesis Complexity
  3.1 Worst-Case Bounds Based on the Rademacher Complexity
  3.2 Worst-Case Bounds Based on the Vapnik-Chervonenkis (VC) Dimension
  References

4 Algorithmic Stability
  4.1 Definition of Algorithmic Stability
  4.2 Algorithmic Stability and Generalization Error Bounds
  4.3 Uniform Stability of Regularized Learning
  References
Chapter 1
Deep Learning: A (Currently) Black-Box Model

Deep learning refers to a subset of machine learning techniques that employ artificial neural networks characterized by multiple layers of interconnected nodes (or neurons) followed by nonlinear activation functions; such networks are typically of significant size. These deep neural networks are designed to process and learn from experience,
extracting complex patterns and features through successive layers of computation.
Such experience can be obtained either from human-annotated electronic records
such as datasets or from the learner’s own interactions with its perceived environment.
By leveraging these layered architectures, deep learning models can autonomously
discover and represent intricate relationships within the data, enabling them to make
sophisticated predictions and informed decisions based on learned knowledge.
To date, however, the development of deep learning has relied heavily on experi-
ments that have not thoroughly explored or sought an understanding of its theoretical
foundations. The black-box nature of deep learning introduces unmanageable risk
into its applications since we do not know why deep learning works, when it may
fail, or how to prevent failures.
1.1 Definitions and Terminology

• Example: A pair z = (x, y) ∈ Z = X × Y, where x is the feature and y is the corresponding label. For object recognition, images and their representations are the features, and the labels are the concepts depicted such as 'tiger', 'horse', and 'cheetah'.
• Feature: A feature is a collection of attributes that represents an example and is
typically structured as a vector or matrix denoted as x ∈ X. Features are often
obtained through sensor measurements, such as those from optical cameras or
light detection and ranging (LiDAR) devices. Certain deep learning models are
specifically engineered for feature extraction, as they excel at producing highly
informative feature representations.
• Label: An annotation of an example. It can be a categorical value, a continuous
real value, or a combined set of such values. Deep learning models learn mappings
between features and labels. In this book, we denote a label by y ∈ Y.
• Training sample: A set of examples used to train a deep learning model:
$$S = \{z_1, \ldots, z_m\}, \qquad Z \sim \mathcal{D}, \quad S \sim \mathcal{D}^m.$$
For clarity, we use upper-case and lower-case letters to represent random variables
and their corresponding observations, respectively.
Remark 1.1 It’s crucial to acknowledge that the assumption of i.i.d. data, where
each data point is assumed to be drawn from the same distribution and is unrelated
to other data points, may not hold in many real-world scenarios. A notable coun-
terexample is observed in time series data, such as financial data, where observations
can exhibit significant temporal dependencies and variability over time. Despite this
departure from the i.i.d. assumption, deep learning models can still effectively capture
complex patterns and trends in such data due to their ability to learn from sequential
information. Although the temporal variability in time series data challenges the i.i.d.
assumption, it typically does not critically impede the performance of deep learning
models. Therefore, in this book, we maintain the use of the i.i.d. assumption for
simplicity and to establish the fundamental principles and theoretical foundations of
deep learning, recognizing that the flexibility and adaptability of these models allow
them to handle real-world data complexities effectively.
Remark 1.3 Deep learning has shown significant value in dealing with the chal-
lenges arising in conventional reinforcement learning (RL), which is an important
machine learning paradigm. In RL, a learner determines what action an agent needs
to take in each specific state of the environment. Deep RL exploits deep learning for
efficient feature extraction to boost the policy approximation ability in RL, which is
a considerable barrier for conventional methods.
$$f(x) = \frac{L}{1 + e^{-k(x - x_0)}},$$

$$f(\mathbf{b})_i = \frac{e^{b_i}}{\sum_{j=1}^{l} e^{b_j}},$$
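As a quick numerical illustration of the two activation functions above, the following sketch (not from the book; a minimal NumPy rendering, with L, k, and x_0 interpreted as the maximum value, steepness, and midpoint of the logistic curve) evaluates the logistic function and a numerically stable softmax.

```python
import numpy as np

def logistic(x, L=1.0, k=1.0, x0=0.0):
    # Logistic function with maximum value L, steepness k, and midpoint x0.
    return L / (1.0 + np.exp(-k * (x - x0)))

def softmax(b):
    # Numerically stable softmax: subtracting the maximum does not change
    # the result because the shift cancels in the ratio.
    b = np.asarray(b, dtype=float)
    exp_b = np.exp(b - b.max())
    return exp_b / exp_b.sum()

if __name__ == "__main__":
    print(logistic(np.array([-2.0, 0.0, 2.0])))   # ~[0.12, 0.5, 0.88]
    p = softmax([2.0, 1.0, 0.1])
    print(p, p.sum())                              # probabilities summing to 1
```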
Notably, deep learning breaks free of the framework of learning a mapping from
features to labels.
Unsupervised learning is also of significant interest for deep learning theory.
Prevalent scenarios are listed as follows.
• Data generation: A variational autoencoder (VAE) consists of an encoder and a decoder trained to encode and decode training data. The encoder maps input data points to a latent-space representation, while the decoder takes a point from the latent space and reconstructs the original input data point. Therefore, VAEs are capable of generating new data by decoding low-dimensional points efficiently sampled from the latent space into meaningful outputs. This method is widely used to synthesise images and audio, as it allows for the generation of diverse and creative outputs by manipulating the latent representations. Diffusion models (DMs) have recently been widely used to generate new
data by mimicking training data through two steps: forward diffusion and reverse
diffusion. In forward diffusion, noise is progressively added to an image over
several steps, gradually transforming it into a completely noise-like picture. By
contrast, reverse diffusion learns to invert the forward diffusion process. Starting
from the noisy image, the model iteratively denoises it, to progressively reconstruct
the original image. The above data generation models effectively learn to capture
and replicate the sophisticated patterns present in the distribution of the training
data. These generators aim to generate data instances that are indistinguishable
from real data. Although these models have found widespread applications in
tasks such as image generation, question answering, or broadly the synthesis of
realistic data for various purposes, including deep learning model training, they
face many challenges, such as data bias, low reliability, poor quality, scalability,
ethical and privacy concerns, as well as interpretability and transparency.
• Self-supervised learning: It aims to autonomously discover the underlying structures in data without relying on manually annotated labels. Contrastive learning is a typical self-supervised learning method, which improves the feature representation capability of deep learning models by using large amounts of unlabelled data. By simultaneously maximizing intra-class similarity and inter-class dissimilarity, it improves the model's representation capability, ensuring that the feature representations of data in the same category are alike and those of data from
on the intricate details of the instances but instead aims to distinguish the data at
an abstract semantic feature space level. Consequently, the model is simplified, its
optimization is more efficient, and its generalization ability is improved. For exam-
ple, in the context of self-supervised learning like RotNet, images are artificially
augmented by rotating them with specific angles (e.g., 0 degrees, 90 degrees) to
create pairs of rotated and original images. The rotation angles themselves are used
as labels to train the model to predict the rotation applied to each image. While
these rotation labels may appear arbitrary or devoid of semantic meaning, the pro-
cess allows the model to learn rich and informative features that capture intrinsic
properties of the data, such as object shape, orientation, and texture. Representative
algorithms in contrastive learning include: SimCLR: Enhances feature representa-
tion through a simple contrastive learning framework by using data augmentation
and large-scale batch training; MoCo: Improves the efficiency of contrastive learn-
ing by storing sample pairs in a dynamic dictionary through a momentum contrast
mechanism; and BYOL: Enhances the stability of contrastive learning by intro-
ducing a self-supervised mechanism without the need for negative samples. In
order to define supervisory signals without human intervention, self-supervised
learning explores and exploits the inherent structure of the data. By generating
labels from data transformations or relationships (such as image rotations in Rot-
Net), self-supervised learning helps the model learn effective representations that
generalize well to downstream tasks, even in the absence of explicit labels. This
approach has been extensively applied to computer vision, natural language pro-
cessing, speech recognition, and bioinformatics, demonstrating its versatility and
scalability in learning from unlabeled data.
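The rotation pretext task mentioned above can be made concrete with a few lines of code. The following sketch (illustrative only, not from the book; the 4-way label encoding for 0, 90, 180, and 270 degrees is an assumption for this example) builds a self-supervised training batch in which the rotation index serves as the pseudo-label.

```python
import numpy as np

def make_rotation_batch(images, rng):
    """Create a self-supervised batch: each image is rotated by a random
    multiple of 90 degrees, and the rotation index serves as the label."""
    rotated, labels = [], []
    for img in images:
        k = rng.integers(0, 4)            # 0, 1, 2, 3 -> 0, 90, 180, 270 degrees
        rotated.append(np.rot90(img, k))  # rotate in the image plane
        labels.append(k)                  # pretext label, no human annotation
    return np.stack(rotated), np.array(labels)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_images = rng.random((8, 32, 32, 3))   # a toy batch of 8 RGB images
    x, y = make_rotation_batch(fake_images, rng)
    print(x.shape, y)   # (8, 32, 32, 3) and pseudo-labels in {0, 1, 2, 3}
```

A classifier trained to predict these pseudo-labels must learn shape and orientation cues, which is exactly the kind of transferable representation the text describes.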
Compared with deep neural networks, most conventional machine learning models face four major barriers to achieving comparable performance.
• Conventional machine learning models/algorithms usually have significantly
lower representation capacity. Their capacity is sufficient for modelling simple
data, such as what is necessary for income modelling and credit assessment, but
considerably limits their ability to fit complex data, such as images and videos. For
example, the vanilla support vector machine (SVM) algorithm can be used to train
only linear classifiers. In contrast, deep learning models have a universal approximation ability: they can approximate any continuous function, time series, or distribution to any accuracy as long as the networks are sufficiently wide. In addi-
tion, stochastic gradient-based optimizers, including stochastic gradient descent
and its variants, have excellent capabilities for optimizing neural networks. More-
over, representation learning proceeds in a data-driven manner, which significantly
reduces the human labour burden incurred.
• Conventional machine learning algorithms usually suffer from the curse of dimen-
sionality: to maintain the robustness of a learning model/algorithm, the required
training sample size increases rapidly as the dimensionality of the input data
increases. In contrast, existing empirical studies have not yet observed the curse of
dimensionality in deep learning. In fact, increasing the input dimensionality has
been shown to boost performance. For example, some papers have reported that
training on higher-resolution images can lead to considerably higher performance,
even as the input dimensionality significantly increases.
• Conventional machine learning algorithms usually rely on restrictive assumptions,
considerably limiting their potential application domains. For example, linear clas-
sifiers assume that the data are linearly separable, hidden Markov models assume
that the state sequence is a Markov chain, and kernel methods assume that the
data lie in the corresponding reproducing kernel Hilbert space. In contrast, deep
learning breaks free of most of these assumptions.
• Some conventional machine learning algorithms have limited capabilities in deal-
ing with large-scale data. A prime example is Markov chain Monte Carlo, a preva-
lent Bayesian inference method that faces significant difficulty in scaling to the
processing of big data.
Significant developments have emerged in the field of deep learning over the past
decade. The rapid progress in deep learning is empowering abundant developments
in many, possibly almost all, areas of both engineering and science and is further
driving a technological revolution in a variety of industry sectors.
• Computer vision (CV) is the field of technology that seeks to help machines “see”
and understand a digital form of the world, including images and videos captured
by cameras and point clouds generated by LiDAR sensors. The primary subar-
eas of CV include recognition, detection, segmentation, and tracking. Potential
applications involve a variety of industry sectors, including autonomous vehicles,
medical diagnosis, and entertainment. The outstanding approximation capabilities
and trainability of deep learning models enable deep learning to vastly enhance
performance in CV tasks, providing improved speed and accuracy.
• Natural language processing (NLP) is the field of training machines to read, under-
stand, and derive meaning from natural languages. Machines are good at dealing
with machine languages or programming languages, which follow clear and con-
cise logic. Unfortunately, this characteristic does not apply to natural languages.
Natural languages exhibit considerable ambiguity, vagueness, and violations of the
grammar rules as summarized in linguistics;1 this, however, is exactly the wisdom
and charm of natural languages. Deep learning models, particularly long short-
term memory (LSTM) networks and recurrent neural networks (RNNs), enable
data-driven approaches to the discovery of knowledge from natural languages.
Prevalent subareas of NLP include information retrieval, text mining, dialogue
1 Theoretically, it would be possible to summarize the grammar of a natural language in its entirety, since, for any finite sentence length, the set of potential sentences is finite. However, such a grammar book would be unacceptably large.
2 The lack of hypothesis tests is not merely due to negligence. It is honestly doubtful that current
deep learning algorithms could pass hypothesis tests if they were performed. Most deep learning
algorithms exhibit significant variability in performance, and partial reporting is common in current
experimental practice, which severely undermines any confidence in the reproducibility of deep
learning algorithms.
1.4 The Status Quo of Deep Learning Theory
• Optimization refers to the ability to search for the optimal solution to a problem.
Machine learning (including deep learning) algorithms are usually formulated
to solve optimization problems. Thus, their optimization performance is of vital
importance. Unfortunately, the optimization problems in deep learning tend to be
highly nonconvex. Nevertheless, they can be solved well in practice by means of stochastic-gradient-based methods, such as stochastic gradient descent (SGD), which was originally designed as a stochastic convex optimization method. Why does SGD, a convex optimization method, work for a nonconvex optimization problem?
• Generalization refers to the ability to perform well on unseen data. Good general-
izability ensures that a trained model can robustly handle unseen circumstances.
Generalizability is particularly important when long-tail events regularly appear
and can cause catastrophic failure. Statistical learning theory has established a
rigorous upper bound on the generalization error depending on the hypothesis
complexity; however, major difficulties arise in analysing deep learning.
Theoretical foundations have been established for conventional machine learning
algorithms. However, significant difficulties arise when we attempt to apply these
theoretical foundations to deep learning. These difficulties are listed as follows:
• Overparameterization. Statistical learning theory is grounded in Occam’s razor
principle:
Plurality should not be posited without necessity.
Remark 1.5 In the era of foundation models, state-of-the-art models often do not impose strict constraints on model complexity. On the contrary, larger models, whose parameter counts may exceed the size of the training set by a significant margin, have shown greater effectiveness on testing benchmarks. The underlying reason for this phenomenon
can be attributed to the capacity of these larger models to capture and learn intricate
patterns and nuances present in vast and diverse datasets. Additionally, larger mod-
els possess more parameters and computational power, enabling them to represent
complex relationships and dependencies within the data more comprehensively. This
enhanced capacity often leads to improved performance and generalization capabil-
ities, especially in tasks involving extensive and diverse data inputs. Surely, it is still
important to balance model size with considerations of efficiency, interpretability,
and practical deployment in real-world applications.
• Highly complex empirical risk landscape. The landscape of empirical risk in deep learning is characterized by significant complexity, posing unique challenges compared to traditional machine
learning paradigms. In conventional settings, learnability and optimizability are
typically ensured through regularization techniques that leverage properties like
convexity, Lipschitz continuity, and differentiability. However, these conventional
guarantees do not directly translate to deep learning due to the intrinsic nature
of neural networks. Deep neural networks are composed of multiple layers with
nonlinear activation functions, leading to loss surfaces that exhibit extreme non-
smoothness and nonconvexity. This inherent complexity renders traditional opti-
mization guarantees ineffective and raises significant obstacles in understanding
the underlying principles of deep learning. The complex and rugged nature of deep
learning loss surfaces has long been a barrier to developing comprehensive theories
and analytical frameworks for neural network behavior. Recent progress in deep
learning research has shifted the focus towards investigating the geometric prop-
erties of loss surfaces as a means to decode the behavior of neural networks. It is
increasingly recognized that the intricate geometry of loss surfaces directly reflects
the dynamics of learning and generalization in deep learning models. Exploring
and characterizing the geometric features of loss landscapes holds promise as a
pathway to gaining deeper insights into the working mechanisms of neural net-
works. By delving into the geometric intricacies of loss surfaces, researchers aim
to uncover fundamental principles governing optimization, generalization, and
model performance in deep learning. This evolving perspective highlights the
importance of advancing our understanding of loss surface geometry to enhance
the robustness, efficiency, and interpretability of deep learning systems.
1.5 Notations
Throughout this book, a deep neural network of depth D acting on an input x is written as
$$f(x) = \sigma_D\big(W_D\,\sigma_{D-1}(W_{D-1}\cdots\sigma_1(W_1 x))\big),$$
where D is the depth of the neural network, $W_j$ is the j-th weight matrix, and $\sigma_j$ is the j-th nonlinear activation function for any j = 1, ..., D. Usually, we assume that the activation function $\sigma_j$ is a continuous function between Euclidean spaces and that $\sigma_j(0) = 0$. Popular activation functions include the softmax, sigmoid, and tanh functions.
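As a concrete reading of this notation, the following sketch (not from the book; a minimal NumPy forward pass assuming the layered composition written above, with tanh activations, illustrative layer widths, and no bias terms) computes f(x) for a depth-D network.

```python
import numpy as np

def forward(x, weights, activations):
    """Forward pass f(x) = sigma_D(W_D sigma_{D-1}(... sigma_1(W_1 x))).
    `weights` is a list [W_1, ..., W_D]; `activations` is [sigma_1, ..., sigma_D]."""
    h = x
    for W, sigma in zip(weights, activations):
        h = sigma(W @ h)
    return h

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    widths = [10, 64, 64, 3]   # input dim 10, two hidden layers, output dim 3
    weights = [rng.standard_normal((widths[j + 1], widths[j])) / np.sqrt(widths[j])
               for j in range(len(widths) - 1)]
    activations = [np.tanh] * len(weights)   # tanh(0) = 0, matching sigma_j(0) = 0
    x = rng.standard_normal(10)
    print(forward(x, weights, activations).shape)   # (3,)
```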
The generalization bound measures the generalizability of an algorithm. For any
hypothesis h learned by an algorithm A, the expected risk R(h) and the empirical risk
R̂ S (h) with respect to the training sample set S are respectively defined as follows:
$$R(h) = \mathbb{E}_{z}\, l(h, z), \qquad \hat{R}_S(h) = \frac{1}{m}\sum_{i=1}^{m} l(h, z_i),$$
where l(·, ·) is the loss function. Furthermore, when the output hypothesis h learned by algorithm A is stochastic (i.e., exhibits randomness), we typically consider the expectations of both the expected risk R(h) and the empirical risk $\hat{R}_S(h)$, averaging over the randomness introduced by algorithm A.
Chapter 2
Statistical Learning Theory

This chapter introduces mathematical tools for establishing the foundations of deep learning and, more broadly, machine learning. We first introduce the Glivenko-Cantelli theorem and concentration inequalities, which characterize the convergence of an empirical process in the context of convergence in probability. The Glivenko-Cantelli
theorem and concentration inequalities also inspire the concept of the sample com-
plexity required to ensure a desirable level of generalizability on unseen data. We
then discuss the Probably Approximately Correct (PAC) learning framework, which
characterizes learning algorithms that can learn a target concept in an appropriate
amount of time given a sufficiently high sample complexity. The PAC framework
has become the foundation of statistical learning. Usually, PAC learning is formulated in the 'frequentist' sense, which has significant deficiencies in the study of stochastic algorithms. To address this issue, the PAC-Bayes framework is also discussed, which integrates the PAC framework with Bayesian methods to characterize randomness.
2.1 Glivenko-Cantelli Theorem and Concentration Inequalities

2.1.1 Glivenko-Cantelli Theorem

This section introduces the Glivenko-Cantelli theorem (sometimes called the fundamental theorem of statistics) based on measure theory. Measure theory offers a portfolio of rigorous tools for mathematically characterizing the convergence of a sequence of hypotheses. However, this rigour is usually accompanied by technical difficulty in practice. We thus also introduce concentration inequalities in the next section, which also characterize convergence and are often easier to apply.
We first recall some necessary definitions from measure theory. Readers can also consult Stein and Shakarchi (2009) for reference. We begin with the notion of a σ-field.
Definition 2.1 Let X be the sample space. A collection A of subsets of X is called
a σ-field if it satisfies the following conditions:
• ∅ ∈ A;
• if $A_1, A_2, \cdots \in \mathcal{A}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$;
• if A ∈ A, then $A^c \in \mathcal{A}$.
The tuple (X, A) is called a measurable space. A subset A of X is called measurable
if A ∈ A. Moreover, a function f : X → R is called measurable, if for all a ∈ R,
the set f −1 ([−∞, a)) = {x ∈ X : f (x) < a} is measurable.
We then may define probability measure.
Definition 2.2 A probability measure P on a measurable space (X, A) is a function
P : A → [0, 1] satisfying
• P(∅) = 0, P(X) = 1;
• if $A_1, A_2, \ldots$ is a collection of disjoint members of A, then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$
Given the definition of the Dirac measure, the empirical measure is usually represented as a linear combination of the Dirac measures at the observations, $P_m = m^{-1}\sum_{i=1}^{m} \delta_{x_i}$.
The empirical measure acts on a measurable function f as
$$f \mapsto P_m f = \frac{1}{m}\sum_{i=1}^{m} f(x_i).$$
Let P be the shared distribution of all $X_i$. We define the notation $\|P\|_{\mathcal{F}} = \sup\{|Pf| : f \in \mathcal{F}\}$. We now recall the law of large numbers.
$$\frac{1}{m}\sum_{i=1}^{m} X_i \to \mathbb{E}[X_1] \quad \text{almost surely, as } m \to \infty.$$
We can write the uniform version of the law of large numbers as follows:
$$\|P_m - P\|_{\mathcal{F}} \to 0.$$
The above convergence holds almost surely in the outer sense, i.e., the event $\{\|P_m - P\|_{\mathcal{F}} \to 0\}$ holds with probability 1. A Glivenko-Cantelli class is defined as a class F that satisfies this uniform law of large numbers. It is also referred to as a P-Glivenko-Cantelli class since the class is dependent on the underlying measure P.
Let l and u be real functions with finite norms on a measurable space X. We define a bracket [l, u] as the set of all functions f ∈ F satisfying l(x) ≤ f(x) ≤ u(x) for every x ∈ X. It should be noted that the functions l and u are not necessarily in the Glivenko-Cantelli class F. An ε-bracket is a bracket [l, u] that satisfies ‖u − l‖ ≤ ε, where ε is a positive real number and ‖·‖ is a given norm. The bracketing number $N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|)$ is the minimum number of ε-brackets needed to cover the class F. The bracketing entropy is the natural logarithm of the bracketing number. The occasional stars used as superscripts refer to outer measures in the first case and to minimal measurable envelopes in the second case.
Theorem 2.2 Let F be a class of measurable functions. If $N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|) < \infty$ for all ε > 0, then F is a P-Glivenko-Cantelli class. We have that
$$\|P_m - P\|_{\mathcal{F}}^{*} \to 0 \quad \text{almost surely}.$$
We present a brief proof of this theorem as follows. For any ε > 0, we choose finitely many ε-brackets $\{[l_i, u_i]\}_{i=1}^{m}$ and argue that, by selecting a bound on $|(P_m - P) f|$ (for each f) in terms of the $[l_i, u_i]$ that contains it, we obtain
$$\sup_{f \in \mathcal{F}} |(P_m - P) f| \le \max_{1 \le i \le m} (P_m - P)\, u_i \;\vee\; \max_{1 \le i \le m} (P - P_m)\, l_i + \epsilon.$$
To conclude, according to the strong law of large numbers, the right-hand side of the above inequality is eventually smaller than 2ε almost surely.
The following gives the Glivenko-Cantelli theorem for a continuous distribution function on the line. Let F be a continuous cumulative distribution function (CDF), and let P be the corresponding measure. By the uniform continuity of F on the line, for every ε > 0 we can find $-\infty = t_0 < t_1 < t_2 < \ldots < t_k < t_{k+1} = \infty$, where k is a positive integer, such that the union of the brackets $[I_{x \le t_i}, I_{x \le t_{i+1}}]$ for $i = 0, 1, \ldots, k$ contains $\{I_{x \le t} : t \in \mathbb{R}\}$ and satisfies $F(t_{i+1}) - F(t_i) \le \epsilon$. Then, the above theorem applies. Notably, the continuity of the distribution function F is essential for this bracketing argument, although the Glivenko-Cantelli theorem on the line holds for arbitrary distribution functions. This more general result will be treated as a subsequent corollary of the Glivenko-Cantelli theorem.
The following lemma characterizes a setting that guarantees a finite bracketing number for appropriate classes of functions and finds ready application in inference for parametric statistical models.

Lemma 2.1 Suppose that $\mathcal{F} = \{f(\cdot, t) : t \in T\}$, where T is a compact subset of a metric space (D, d) and the functions $f : \mathcal{X} \times T \to \mathbb{R}$ are continuous in t for P-almost every $x \in \mathcal{X}$. If the envelope function F defined by $F(x) = \sup_{t \in T} |f(x, t)|$ satisfies the condition $PF < \infty$, then $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$ for each ε > 0.
We will now derive consistency in parametric statistical models.

Consistency in parametric models: Let $\{p(x, \theta) : \theta \in \Theta\}$, $\Theta \subset \mathbb{R}^d$, be a class of parametric densities. Suppose that $X_1, X_2, \ldots$ are generated from some $p(x, \theta_0)$. Additionally, assume that Θ is compact and $p(x, \theta)$ is continuous w.r.t. θ for $P_{\theta_0}$-almost every x. We define $M(\theta) = \mathbb{E}_{\theta_0} l(X_1, \theta)$, where $l(x, \theta) = \log p(x, \theta)$. Finally, we assume that $\sup_{\theta \in \Theta} |l(x, \theta)| \le B(x)$ for some B with $\mathbb{E}_{\theta_0} B(X_1) < \infty$.

Note that M(θ) is continuous on Θ if it is finite for all θ. If $P_{\theta_0}$ denotes the measure corresponding to $\theta_0$, then $M(\theta) = P_{\theta_0} l(\cdot, \theta)$, and the maximum likelihood estimate (MLE) of θ is given by $\hat{\theta}_m = \arg\max_{\theta} M_m(\theta)$, where $M_m(\theta) = P_m l(\cdot, \theta)$. Under the assumption that the model is identifiable (i.e., the probability distributions corresponding to different θ's are different), it can be observed that M(θ) is uniquely (and globally) maximized at $\theta_0$.

Moreover, $\theta_0$ is a well-separated maximizer in the sense that for any η > 0, $\sup_{\theta \in \Theta \cap B_\eta(\theta_0)^c} M(\theta) < M(\theta_0)$, where $B_\eta(\theta_0)$ is the open ball of radius η centred at $\theta_0$. Let $\psi(\eta) = M(\theta_0) - \sup_{\theta \in \Theta \cap B_\eta(\theta_0)^c} M(\theta)$. Then, we have ψ(η) > 0.

Our goal is to show that $\hat{\theta}_m \to \theta_0$ in outer probability under $P_{\theta_0}$. Given ε > 0, if $\hat{\theta}_m \in B_\epsilon(\theta_0)^c$, then $M(\theta_0) - M(\hat{\theta}_m) \ge \psi(\epsilon)$; since $M_m(\hat{\theta}_m) \ge M_m(\theta_0)$, this forces $\sup_{\theta \in \Theta} |M_m(\theta) - M(\theta)| \ge \psi(\epsilon)/2$.
Thus,
$$P^*\big(\hat{\theta}_m \in B_\epsilon(\theta_0)^c\big) \le P^*\Big(\sup_{\theta \in \Theta} |M_m(\theta) - M(\theta)| \ge \psi(\epsilon)/2\Big) = P^*\Big(\sup_{\theta \in \Theta} \big|(P_m - P_{\theta_0})\, l(\cdot, \theta)\big| \ge \psi(\epsilon)/2\Big).$$
This probability converges to 0 because $\sup_{\theta \in \Theta} |(P_m - P_{\theta_0})\, l(\cdot, \theta)| \to 0$ almost surely; based on our assumptions on the parametric model, we can conclude from Lemma 2.1 that $N_{[\,]}\big(\eta, \{l(\cdot, \theta) : \theta \in \Theta\}, L_1(P_{\theta_0})\big) < \infty$ for every η > 0 and then invoke Theorem 2.2.
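The consistency statement above can be probed numerically. The sketch below (an illustration, not from the book) uses a Bernoulli(θ₀) model, for which the MLE is simply the sample mean, and shows the estimation error shrinking as m grows; the model choice and the sample sizes are assumptions made purely for this example.

```python
import numpy as np

def mle_bernoulli(sample):
    # For the Bernoulli model p(x, theta) = theta^x (1 - theta)^(1 - x),
    # the maximizer of M_m(theta) = P_m l(., theta) is the sample mean.
    return sample.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta0 = 0.3
    for m in [10, 100, 1_000, 10_000, 100_000]:
        sample = rng.binomial(1, theta0, size=m)
        print(m, abs(mle_bernoulli(sample) - theta0))  # error shrinks as m grows
```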
Next, we provide the necessary and sufficient conditions for a class of functions F to be a Glivenko-Cantelli class in terms of covering numbers (Wellner et al. 2013).
Theorem 2.3 Let F be a P-measurable class of measurable functions bounded in $L_1(P)$. Then, F is a P-Glivenko-Cantelli class if and only if
(a) $P^* F < \infty$, and
(b)
$$\lim_{m \to \infty} \frac{\mathbb{E}^* \log N(\epsilon, \mathcal{F}_M, L_2(P_m))}{m} = 0$$
and
$$\log N^*\big(\epsilon \|F\|_{P_m,1}, \mathcal{F}, L_1(P_m)\big) = o_p(m) \quad \text{for all } \epsilon > 0.$$
for an integer M ≥ 1 that depends solely on F and a constant $K_1$ that also depends solely on F, where the supremum is taken over all probability measures Q for which $\|F\|_{Q,r} > 0$. Thus, a VC class of functions with an integrable envelope F is a Glivenko-Cantelli class for any probability measure on the corresponding sample space. Fortunately, functions formed by combining VC classes of functions via various mathematical operations often satisfy entropy bounds similar to those illustrated above. Therefore, classes of such (greater) complexity remain Glivenko-Cantelli classes under the integrability hypothesis.
As a special case, consider $\mathcal{F} = \{f_t(x) = I_{-\infty < x \le t} : t \in \mathbb{R}^d\}$. Thus, $f_t(x)$ is the indicator of the infinite rectangle to the 'southwest' of the point t. For any probability measure Q on the d-dimensional Euclidean space, we have that
$$N(\epsilon, \mathcal{F}, L_1(Q)) \le M d \left(\frac{K}{\epsilon}\right)^{d}.$$
Given any ε > 0, an appropriate choice of M ensures that the second term is not larger than ε. It therefore suffices to show that, for this choice of M, the first term eventually becomes smaller than ε. To this end, we fix $X_1, X_2, \ldots, X_m$. An ε-net G (assumed to be of minimal size) over $\mathcal{F}_M$ in $L_2(P_m)$ is also an ε-net in $L_1(P_m)$. Then, we have
$$\mathbb{E}_{\epsilon}\left\|\frac{1}{m}\sum_{i=1}^{m} \epsilon_i f(X_i)\right\|_{\mathcal{F}_M} \le \mathbb{E}_{\epsilon}\left\|\frac{1}{m}\sum_{i=1}^{m} \epsilon_i f(X_i)\right\|_{\mathcal{G}} + \epsilon.$$
and thus,
$$B_m \le 6M\sqrt{\frac{1 + \log N(\epsilon, \mathcal{F}_M, L_2(P_m))}{m}} \to 0.$$
It follows that $\mathbb{E}^* \|P_m - P\|_{\mathcal{F}} \to 0$. However, our goal is to show almost sure convergence. This is deduced through a submartingale argument, a simplified version of which is presented at the end of these notes. The idea here is to show that $\|P_m - P\|_{\mathcal{F}}^{*}$ is a reverse submartingale with respect to a (decreasing) filtration that
almost sure limit, measurable with respect to the symmetric sigma field, must be
a nonnegative constant almost surely. The fact that the expectation converges to 0
forces this constant to become 0.
2.1.2 Concentration Inequality
Concentration inequality, which describes the deviation between the sum (or sample
mean) of a group of random variables and their expected value, is a very useful
concept in the analysis of PAC learning. In addition to characterizing the asymptotic
convergence of an iterative machine learning algorithm, concentration inequalities
also offer practical tools for evaluating the convergence rate. We will give a brief
summary of various concentration inequalities in this subsection.
Markov’s inequality. Let X be a nonnegative random variable, and a > 0. Then,
EX
P(X ≥ a) ≤ .
a
1
P(|X − μ| ≥ kσ ) ≤ .
k2
m
Chernoff bounds. Let X = X i , and t be a positive constant. Then,
i=1
P (X i > a) = P et X i > eta ≤ e−ta E et X i ,
1 1
m m
P( Xi − EX i ≥ ) ≤ exp(−2m 2 ),
m i=1 m i=1
1 m
1
m
P( Xi − EX i ≥ ) ≤ 2 exp(−2m 2 ).
m i=1 m i=1
−2 2
P(| f (X 1 , · · · , X m ) − E( f (X 1 , · · · , X m ))| ≥ ) ≤ 2 exp m 2 .
i=1 ci
Bennett's inequality. Let $X_1, \ldots, X_m$ be independent random variables with $|X_i - \mathbb{E}X_i| \le C$, and let σ² denote the variance of their sum. Then, with $h(x) = (1 + x)\log(1 + x) - x$,
$$P\big(\bar{X} - \mathbb{E}[\bar{X}] \ge t\big) \le \exp\left(-\frac{\sigma^2}{C^2}\, h\!\left(\frac{Ctm}{\sigma^2}\right)\right).$$

Bernstein's inequality. Under the same conditions,
$$P\big(\bar{X} - \mathbb{E}[\bar{X}] \ge t\big) \le \exp\left(-\frac{m^2 t^2 / 2}{\sigma^2 + Ctm/3}\right), \quad \forall t > 0.$$
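These inequalities can be checked empirically. The sketch below (illustrative, not from the book; it assumes Uniform[0, 1] variables so that the two-sided Hoeffding bound 2 exp(−2mε²) applies, and the sample sizes are arbitrary) compares a Monte Carlo estimate of the deviation probability of a sample mean with the Hoeffding bound.

```python
import numpy as np

def hoeffding_check(m=500, eps=0.05, trials=20_000, seed=0):
    """Compare P(|mean - E[mean]| >= eps) with 2 exp(-2 m eps^2)
    for i.i.d. Uniform[0, 1] variables (so each X_i lies in [0, 1])."""
    rng = np.random.default_rng(seed)
    samples = rng.random((trials, m))                 # trials x m uniform draws
    deviations = np.abs(samples.mean(axis=1) - 0.5)   # E[mean] = 0.5
    empirical = (deviations >= eps).mean()
    bound = 2.0 * np.exp(-2.0 * m * eps**2)
    return empirical, bound

if __name__ == "__main__":
    emp, bnd = hoeffding_check()
    print(f"empirical deviation probability: {emp:.5f}")
    print(f"Hoeffding bound:                 {bnd:.5f}")  # estimate stays below the bound
```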
2.2 Probably Approximately Correct (PAC) Learning

Before designing algorithms to learn from training data, we should consider the following issues. What is the most essential consideration in a learning task? How many examples are required to successfully train a model? Additionally, is there a
generalizable model that can be learned for the current problem? To formalize and
solve these issues, this section introduces the PAC learning framework. The PAC
learning framework can help determine an algorithm’s sample complexity (how many
examples are required to train a desirable approximator for the given problem) and
whether a problem is learnable. In this section, we first present the PAC framework,
then provide a theoretical assurance for learning algorithms, and eventually hope to
guide algorithm design in accordance with the results of our analysis.
2.2.1 The PAC Learning Model

We first present some relevant definitions and notations before introducing the PAC model. We consider a binary classification task, i.e., the label y ∈ {0, 1}. We express a concept c as a mapping from a feature x to the label y. We use C to denote a concept class, which represents the set of concepts we hope to learn. For instance, it may be the set of all triangles on a plane.
Suppose that all examples are i.i.d. according to a fixed but unknown distribution
D. Then, the framework of the learning problem is constructed as follows. The
learner considers a fixed set of possible concepts H, which is termed the hypothesis
set. The hypothesis set does not need to coincide with C. The learner receives a
sample set S : S = (x1 , . . . , xm ) with known labels (c (x1 ) , . . . , c (xm )) that depend
on the concept c ∈ C to be learned (here, c represents an optimal mapping). The
examples in S are independently sampled from the distribution D. The learning task
requires the selection of a hypothesis h S ∈ H based on the labelled sample S, which
has a small generalization error with respect to c. The generalization error (or risk, or
true error, or simply error) of a hypothesis h ∈ H is denoted by R(h) and is defined
as follows.
Definition 2.4 (Generalization risk) The generalization risk of a hypothesis h ∈ H
is defined as
$$R(h) = \Pr_{x \sim \mathcal{D}}[h(x) \ne c(x)], \qquad (2.1)$$
By linearity of expectation and the fact that the sample S is drawn in an i.i.d. manner, we obtain
$$\mathbb{E}_{S \sim \mathcal{D}^m}\big[\hat{R}_S(h)\big] = \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{S \sim \mathcal{D}^m}\big[I_{h(x_i) \ne c(x_i)}\big] = \mathbb{E}_{x \sim \mathcal{D}}\big[I_{h(x) \ne c(x)}\big] = R(h).$$
Next, we present the PAC learning framework. We use n to denote a number, size(c)
to denote the maximal cost of the computational representation of c ∈ C, and h S to
denote the hypothesis that is obtained by algorithm A when applied to the labelled
sample S. The computational cost of representing the element x ∈ X is at most O(n).
To simplify the notation, we do not explicitly express the dependence of h S on A.
Definition 2.6 (PAC learning) For any ε > 0, any δ > 0, any distribution D on X, and any target concept c ∈ C, the concept class C is PAC learnable if there exist an algorithm A and a polynomial function poly(·, ·, ·, ·) such that the following inequality holds for any sample size m ≥ poly(1/ε, 1/δ, n, size(c)):
$$\Pr_{S \sim \mathcal{D}^m}[R(h_S) \le \epsilon] \ge 1 - \delta. \qquad (2.4)$$
The PAC framework is suitable for addressing the learnability of a concept class C rather than of a particular concept. It is important to note that the target concept c ∈ C is unknown to the algorithm, whereas the concept class C is known. Hence, we can focus only on the sample complexity in the definition of PAC learning and omit the polynomial dependency on n and size(c) without explicitly discussing the concepts' computational representation.
2.2.2 Generalities
This general scenario is stochastic. Under such circumstances, the output label is a
probabilistic function of the input. This scenario can describe a number of real-world
problems. For instance, when the goal is to predict gender using input pairs consisting
of the weight and height of a person, the labels of the input sample generally may
not be unique. Both ‘man’ and ‘woman’ will be possible genders for most pairs,
although there may exist distinct probability distributions for the label being ‘man’
or ‘woman’ for each fixed pair. An extension of the PAC learning framework, known
as agnostic PAC learning, is able to deal with this setting.
Definition 2.7 (Agnostic PAC learning) Let H be a hypothesis set. The algorithm A is said to be an agnostic PAC learning algorithm if there exists a polynomial function poly(·, ·, ·, ·) such that for any ε > 0 and δ > 0, for any distribution D over X × Y, and for any sample size m ≥ poly(1/ε, 1/δ, n, size(c)), the following expression holds:
$$\Pr_{S \sim \mathcal{D}^m}\Big[R(h_S) - \min_{h \in \mathcal{H}} R(h) \le \epsilon\Big] \ge 1 - \delta.$$
The scenario is considered deterministic if the label of an input sample can be deter-
mined with probability one by a function f : x → y. In such a scenario, it is sufficient
to consider the distribution D in the input space. The training sample can be acquired
by drawing (x1 , . . . , xm ) according to D. The corresponding labels can be obtained
from f : yi = f (xi ) for all i ∈ [m].
In the deterministic case, there exists a target function f with no generalization error
(R(h) = 0), while in the stochastic case, there is a minimal nonzero error for any
hypothesis.
Definition 2.8 (Bayes error) Given a distribution D over X × Y, the Bayes error R* is defined as the infimum of the errors achieved by measurable functions h : X → Y:
$$R^* = \inf_{h\ \text{measurable}} R(h).$$
A hypothesis achieving this infimum, defined in the binary setting by $h_{\mathrm{Bayes}}(x) = \operatorname*{argmax}_{y \in \{0,1\}} P[y \mid x]$, is called a Bayes classifier. Therefore, the average error generated by $h_{\mathrm{Bayes}}$ on x ∈ X is min{P(0 | x), P(1 | x)}, the minimum possible error. On this basis, the definition of noise is given as follows.
We consider the case in which the learned hypotheses are consistent and the target concept c is contained in the finite hypothesis set H of cardinality |H|. In this section, we will present a general sample complexity bound for such consistent hypotheses.
Theorem (Learning bound, consistent case) Let H be a finite hypothesis set, and let A be an algorithm that, for any target concept c ∈ H and any i.i.d. sample S, returns a consistent hypothesis $h_S$ (i.e., $\hat{R}_S(h_S) = 0$). Then, for any ε, δ > 0, the inequality $\Pr_{S \sim \mathcal{D}^m}[R(h_S) \le \epsilon] \ge 1 - \delta$ holds whenever
$$m \ge \frac{1}{\epsilon}\left(\log |\mathcal{H}| + \log \frac{1}{\delta}\right). \qquad (2.9)$$
Equivalently, for any δ > 0, with probability at least 1 − δ,
$$R(h_S) \le \frac{1}{m}\left(\log |\mathcal{H}| + \log \frac{1}{\delta}\right). \qquad (2.10)$$
Proof For any ε > 0, we define $\mathcal{H}_\epsilon = \{h \in \mathcal{H} : R(h) > \epsilon\}$. The probability that a hypothesis $h \in \mathcal{H}_\epsilon$ is consistent on an i.i.d. sample set S, i.e., that it has no error on any point in S, can be bounded as follows:
$$P\big[\hat{R}_S(h) = 0\big] \le (1 - \epsilon)^m.$$
By the union bound, the probability that some hypothesis in $\mathcal{H}_\epsilon$ is consistent is at most $|\mathcal{H}_\epsilon|(1 - \epsilon)^m \le |\mathcal{H}|(1 - \epsilon)^m \le |\mathcal{H}| e^{-m\epsilon}$. Let the right-hand side be equal to δ; then, solving for ε completes the proof.
This theorem illustrates that a consistent algorithm A is a PAC learning algorithm when the hypothesis set H is finite, since the sample complexity in (2.9) is dominated by a polynomial in 1/ε and 1/δ. From expression (2.10), we know that the generalization error of consistent hypotheses has an upper bound that decreases with an increasing sample size m.
The cost of proposing a consistent algorithm is the necessity of using a larger
hypothesis set H that contains the target concepts. The upper bound of (2.10)
increases with increasing |H|, although the dependency is only logarithmic. The
term log |H|, which is not a constant factor, can be interpreted as the number of bits
needed to represent H. As a result, the ratio between the number of bits (log2 |H|)
and the sample size m determines this theorem’s generalization guarantee.
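To make the trade-off concrete, the following small helper (illustrative, not from the book) evaluates the sample-complexity expression m ≥ (1/ε)(log|H| + log(1/δ)) from the consistent-case bound above; the example values of |H|, ε, and δ are made up.

```python
import math

def consistent_case_sample_size(h_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * (log|H| + log(1/delta)),
    which guarantees R(h_S) <= eps with probability at least 1 - delta
    for a consistent hypothesis over a finite hypothesis set."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

if __name__ == "__main__":
    # Example: |H| = 2^20 hypotheses, eps = 0.05, delta = 0.01.
    print(consistent_case_sample_size(2**20, 0.05, 0.01))   # ~370 examples
```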
Difficult learning problems may arise in which there exist concept classes that are
more complex than the hypotheses used in the learning algorithm. In other words,
there may not exist a hypothesis in H that is consistent with the labelled training
sample. Nevertheless, such inconsistent hypotheses with minor errors in the training
sample can also be useful. Under certain assumptions, they may even benefit from
favourable guarantees. In this section, we introduce learning guarantees for finite
hypotheses in the inconsistent case by using either Hoeffding’s inequality or the
following corollary, which establishes the relationship between the empirical error
and generalization error of a single hypothesis.
Corollary 2.1 Let ε > 0 be fixed. Then, for any hypothesis h : X → {0, 1}, the following expressions hold:
$$P\big[\hat{R}_S(h) - R(h) \ge \epsilon\big] \le \exp(-2m\epsilon^2), \qquad (2.11)$$
$$P\big[\hat{R}_S(h) - R(h) \le -\epsilon\big] \le \exp(-2m\epsilon^2). \qquad (2.12)$$
Let the right-hand side of (2.13) be equal to δ; then, solving for ε immediately produces the following bound.

Corollary 2.2 (Generalization bound, single hypothesis) Consider a fixed hypothesis h : X → {0, 1}. For any δ > 0, the following expression holds with a probability of at least 1 − δ:
$$R(h) \le \hat{R}_S(h) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \qquad (2.14)$$
Proof We denote the elements of H by $h_1, \ldots, h_{|\mathcal{H}|}$. Applying the union bound and Corollary 2.1 to each hypothesis yields
$$P\big[\exists h \in \mathcal{H} : |R(h) - \hat{R}_S(h)| > \epsilon\big] = P\big[(|\hat{R}_S(h_1) - R(h_1)| > \epsilon) \vee \ldots \vee (|\hat{R}_S(h_{|\mathcal{H}|}) - R(h_{|\mathcal{H}|})| > \epsilon)\big]$$
$$\le \sum_{h \in \mathcal{H}} P\big[|\hat{R}_S(h) - R(h)| > \epsilon\big] \quad \text{(union bound)}$$
$$\le 2|\mathcal{H}| \exp(-2m\epsilon^2).$$
Setting the right-hand side equal to δ and solving for ε shows that, with probability at least 1 − δ, every h ∈ H simultaneously satisfies $R(h) \le \hat{R}_S(h) + \sqrt{(\log|\mathcal{H}| + \log\frac{2}{\delta})/(2m)}$.
A larger sample size m provides a better generalization guarantee, and the bound
increases logarithmically with |H|. However, the bound is a less favourable function
of log2 |H|/m, varying with the square root of log2 |H|/m. A quadratically larger
labelled sample is required for a given |H| to obtain the same guarantee as in the
consistent case. The values of the bounds show that a balance is sought between the
empirical error and the size of the hypothesis set. Although a larger hypothesis set is
penalized by the second term, increasing the size of the hypothesis set helps reduce
the first term, i.e., the empirical error. For a similar empirical error, however, the use
of a smaller hypothesis set is recommended; this can be regarded as an example of
the Occam’s razor principle, named after the theologian William of Occam, which
states that the simplest explanation is the best. In this context, this principle can be
understood to mean that when all other things are equal, a hypothesis set of a smaller
size is better.
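The Occam trade-off described above can be illustrated numerically. The sketch below (not from the book; all numbers are made up for illustration) evaluates the uniform finite-class bound R(h) ≤ R̂_S(h) + sqrt((log|H| + log(2/δ))/(2m)) for a small hypothesis set with a larger empirical error and for a much larger set that fits the sample better.

```python
import math

def finite_class_bound(emp_risk, h_size, m, delta):
    # Uniform bound for a finite hypothesis set in the inconsistent case:
    # R(h) <= empirical risk + sqrt((log|H| + log(2/delta)) / (2m)).
    return emp_risk + math.sqrt((math.log(h_size) + math.log(2.0 / delta)) / (2 * m))

if __name__ == "__main__":
    m, delta = 10_000, 0.05
    # A small hypothesis set with a slightly larger empirical error ...
    print(finite_class_bound(emp_risk=0.08, h_size=10**3, m=m, delta=delta))
    # ... versus a much larger hypothesis set that fits the sample better.
    print(finite_class_bound(emp_risk=0.05, h_size=10**9, m=m, delta=delta))
```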
2.2.3 Insights from Glivenko-Cantelli Theorem

A fundamental result of statistical learning theory states that a concept class is PAC learnable if and only if it is a uniform Glivenko-Cantelli class. However, the theorem is valid only under special assumptions regarding the class's measurability, in which case PAC learnability even becomes consistent. Otherwise, a classical counterexample, developed by Dudley and Durst, can be constructed under the continuum hypothesis.
Theorem 2.6 For a concept class F , the following two conditions are equivalent:
1. F is distribution-free PAC learnable; 2. F is a uniform Glivenko-Cantelli class.
2.3 PAC-Bayesian Learning

In the PAC-Bayesian framework, the learner outputs a distribution Q over the hypothesis space rather than a single hypothesis. The expected risk and the empirical risk of Q are defined as $R(Q) = \mathbb{E}_{\theta \sim Q}[R(\theta)]$ and
$$\hat{R}_S(Q) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{\theta \sim Q}\big[l(h_\theta, z_i)\big].$$
To establish the PAC-Bayesian bound, we introduce the following definition for the
Kullback-Leibler divergence.
Definition 2.11 Given two probability measures D, G ∈ Q, the Kullback-Leibler (KL) divergence between D and G is
$$\mathrm{KL}(D \| G) = \int D(\theta) \log\left(\frac{D(\theta)}{G(\theta)}\right) d\theta.$$
Note that for any probability measures D and G, KL(D‖G) ≥ 0, where the equality holds if and only if D = G.
In this section, we will focus on a bound as proposed in Catoni (2007). Let us fix a probability measure π ∈ Q, which is usually called the prior. Assume that the loss is bounded, $l(h_\theta, z) \in [0, C]$ for all $h_\theta \in \mathcal{H}$ and $z \in \mathcal{Z}$. Then, for any λ > 0 and any δ ∈ (0, 1), with probability at least 1 − δ over the draw of S, simultaneously for all Q ∈ Q,
$$\mathbb{E}_{\theta \sim Q}[R(\theta)] \le \mathbb{E}_{\theta \sim Q}[\hat{R}_S(\theta)] + \frac{\lambda C^2}{8m} + \frac{\mathrm{KL}(Q \| \pi) + \log\frac{1}{\delta}}{\lambda}.$$
Proof We first recall Donsker and Varadhan's variational formula (Donsker and Varadhan 1976). For any measurable and bounded function g : Θ → R, we have
$$\log \mathbb{E}_{\vartheta \sim \pi}\big[e^{g(\vartheta)}\big] = \sup_{Q \in \mathcal{Q}} \Big\{\mathbb{E}_{\theta \sim Q}[g(\theta)] - \mathrm{KL}(Q \| \pi)\Big\}.$$
Moreover, given a risk R(·), the supremum with respect to Q on the right-hand side is reached for the Gibbs measure $\pi_R$, whose density with respect to π is
$$\frac{d\pi_R}{d\pi}(\theta) = \frac{e^{g(\theta)}}{\mathbb{E}_{\vartheta \sim \pi}\big[e^{g(\vartheta)}\big]}.$$
With the help of Hoeffding’s lemma, for a fixed h θ ∈ H and any t > 0, we have
mt 2 C 2
E S [etm[R(θ)− R̂ S (θ)] ] ≤ e 8 .
λ2 C 2
E S [etm|R(θ)− R̂ S (θ)| ] ≤ e 8m .
λ2 C 2
Eθ∼π E S [etm|R(θ)− R̂ S (θ)| ] ≤ e 8m .
2.3 PAC-Bayesian Learning 35
λ2 C 2
E S Eθ∼π [etm|R(θ)− R̂ S (θ)| ] ≤ e 8m .
λ2 C 2
E S [esup Q∈Q λEθ ∈Q [R(θ)− R̂ S (θ)]−KL(Qπ) ] ≤ e 8m .
λ2 C 2
E S [esup Q∈Q λEθ ∈Q [R(θ)− R̂ S (θ)]−KL(Qπ)− 8m ] ≤ 1.
λ2 C 2
P sup λEθ∈Q [R(θ ) − R̂ S (θ )] − KL(Qπ ) − >s
Q∈Q 8m
λ2 C 2
≤E S esup Q∈Q λEθ ∈Q [R(θ)− R̂ S (θ)]− 8m e−s ≤ e−s .
λ2 C 2
P sup λEθ∈Q [R(θ ) − R̂ S (θ )] − KL(Qπ ) − > log 1/δ ≤ δ.
Q∈Q 8m
We will present two examples related to the general PAC-Bayes bound above.
Example 2.1 (Finite case) Let us consider a special case in which the parameter set Θ is finite. In such a case, the Gibbs posterior Q̂ is a probability distribution on the finite set given by
$$\hat{Q}(\theta) = \frac{\pi(\theta)\, e^{-\lambda \hat{R}_S(\theta)}}{\sum_{\vartheta \in \Theta} \pi(\vartheta)\, e^{-\lambda \hat{R}_S(\vartheta)}}.$$
The above upper bound holds for all Q ∈ Q. Thus, it holds in particular for the Q in the set of the Dirac masses $\{\delta_\theta : \theta \in \Theta\}$. Then, we have $\mathbb{E}_{\vartheta \sim \delta_\theta} \hat{R}_S(\vartheta) = \hat{R}_S(\theta)$ and
$$\mathrm{KL}(\delta_\theta \| \pi) = \sum_{\vartheta \in \Theta} \delta_\theta(\vartheta) \log\frac{\delta_\theta(\vartheta)}{\pi(\vartheta)} = \log\frac{1}{\pi(\theta)}.$$
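Following the bound discussed above, the sketch below (illustrative, not from the book) evaluates the resulting finite-case PAC-Bayes bound for a Dirac posterior on a single hypothesis under a uniform prior, so that KL(δ_θ‖π) = log|Θ|; the grid of λ values, the uniform prior, and all numbers are assumptions made for this example.

```python
import math

def pac_bayes_dirac_bound(emp_risk, n_hypotheses, m, C=1.0, delta=0.05):
    """Catoni-style bound for a Dirac posterior on one hypothesis with a
    uniform prior over n_hypotheses, so KL(delta_theta || pi) = log n.
    The bound holds for each fixed lambda; to pick lambda on a grid we
    split delta across the grid (a simple union bound)."""
    lambdas = [2.0 ** k for k in range(1, 17)]
    delta_grid = delta / len(lambdas)
    kl = math.log(n_hypotheses)
    best = float("inf")
    for lam in lambdas:
        bound = (emp_risk + lam * C ** 2 / (8 * m)
                 + (kl + math.log(1.0 / delta_grid)) / lam)
        best = min(best, bound)
    return best

if __name__ == "__main__":
    # A hypothetical finite class of one million hypotheses, 50,000 samples.
    print(pac_bayes_dirac_bound(emp_risk=0.10, n_hypotheses=10**6, m=50_000))
```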
Example 2.2 (Lipschitz loss and Gaussian priors) We assume that $l(h_\theta, z) \le C$ for all $h_\theta \in \mathcal{H}$ and all $z \in \mathcal{Z}$. Let the loss function l(h, z) be L-Lipschitz in θ for any z ∈ Z, and let the prior π be a centred Gaussian distribution $\pi = \mathcal{N}(0, \sigma^2 I_{\dim(\theta)})$. Then, for any δ ∈ (0, 1), with a probability of at least 1 − δ, it holds that
$$\mathbb{E}_{\theta \sim \hat{Q}}\, R(\theta) \le \inf_{\gamma \in \Theta,\, \|\gamma\| \le B}\left\{\hat{R}_S(\gamma) + L\sigma\sqrt{\frac{\dim(\gamma)}{m}} + C\sqrt{\frac{\frac{B^2}{2\sigma^2} + \frac{\dim(\gamma)}{2}\log m + \log\frac{1}{\delta}}{2m}}\right\}.$$
Proof Let the posterior Q̂ be a Gibbs measure. For any δ ∈ (0, 1), it holds that
$$\mathbb{E}_{\theta \sim \hat{Q}}\, R(\theta) \le \inf_{Q \in \mathcal{Q}}\left\{\mathbb{E}_{\vartheta \sim Q}\, \hat{R}_S(\vartheta) + \frac{\lambda C^2}{8m} + \frac{\mathrm{KL}(Q\|\pi) + \log\frac{1}{\delta}}{\lambda}\right\}.$$
Restricting the infimum to Gaussian posteriors $Q = \mathcal{N}(\gamma, s^2 I_{\dim(\theta)})$, the KL term becomes
$$\mathrm{KL}(Q\|\pi) = \frac{\|\gamma\|_2^2}{2\sigma^2} + \frac{\dim(\theta)}{2}\left[\frac{s^2}{\sigma^2} + \log\frac{\sigma^2}{s^2} - 1\right].$$
Moreover, according to the Lipschitz property and Jensen's inequality, we obtain
$$\mathbb{E}_{\vartheta \sim Q}\, \hat{R}_S(\vartheta) \le \hat{R}_S(\gamma) + L\, \mathbb{E}_{\theta \sim Q}\big[\|\theta - \gamma\|\big] \le \hat{R}_S(\gamma) + L\sqrt{\mathbb{E}_{\theta \sim Q}\big[\|\theta - \gamma\|^2\big]} = \hat{R}_S(\gamma) + L s \sqrt{\dim(\theta)}.$$
Taking $s = \sigma/\sqrt{m}$ then yields
$$\mathbb{E}_{\theta \sim \hat{Q}}\, R(\theta) \le \inf_{\gamma}\left\{\hat{R}_S(\gamma) + L\sigma\sqrt{\frac{\dim(\gamma)}{m}} + \frac{\lambda C^2}{8m} + \frac{\frac{\|\gamma\|_2^2}{2\sigma^2} + \frac{\dim(\gamma)}{2}\big[\frac{1}{m} - 1 + \log m\big] + \log\frac{1}{\delta}}{\lambda}\right\}.$$
By further assuming that ‖γ‖ ≤ B for some B > 0, we obtain
$$\mathbb{E}_{\theta \sim \hat{Q}}\, R(\theta) \le \inf_{\gamma : \|\gamma\| \le B}\left\{\hat{R}_S(\gamma) + L\sigma\sqrt{\frac{\dim(\gamma)}{m}} + \frac{\lambda C^2}{8m} + \frac{\frac{\|\gamma\|_2^2}{2\sigma^2} + \frac{\dim(\gamma)}{2}\big[\frac{1}{m} - 1 + \log m\big] + \log\frac{1}{\delta}}{\lambda}\right\}.$$
In this case, we can see that the optimal λ is $\frac{1}{C}\sqrt{8m\big(\frac{B^2}{2\sigma^2} + \frac{\dim(\gamma)}{2}\log m + \log\frac{1}{\delta}\big)}$, which indicates that, with a probability of at least 1 − δ, it holds that
$$\mathbb{E}_{\theta \sim \hat{Q}}\, R(\theta) \le \inf_{\gamma : \|\gamma\| \le B}\left\{\hat{R}_S(\gamma) + L\sigma\sqrt{\frac{\dim(\gamma)}{m}} + C\sqrt{\frac{\frac{B^2}{2\sigma^2} + \frac{\dim(\gamma)}{2}\log m + \log\frac{1}{\delta}}{2m}}\right\}.$$
Chapter 3
Hypothesis Complexity
This chapter presents the hypothesis complexity concept frequently used in sta-
tistical learning theory, which is important for deriving generalization bounds for
deep learning models. The hypothesis complexity characterizes the complexity
of a machine learning algorithm and can be measured in terms of the Vapnik-
Chervonenkis (VC) dimension, Rademacher complexity, and covering number. Intu-
itively, a more complex algorithm has worse generalizability. In this way, we may
study the generalizability via the hypothesis complexity.
3.1 Worst-Case Bounds Based on the Rademacher Complexity

Definition 3.1 (Rademacher complexity) Let H be a family of functions and $S = (x_1, \ldots, x_m)$ a sample of size m. The empirical Rademacher complexity of H with respect to S is
$$\hat{\mathcal{R}}_S(\mathcal{H}) = \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i h(x_i)\right],$$
where $\sigma_1, \ldots, \sigma_m$ are independent uniform random variables taking values in {−1, +1}. The Rademacher complexity is its expectation over samples, $\mathcal{R}(\mathcal{H}) = \mathbb{E}_{S \sim \mathcal{D}^m}\big[\hat{\mathcal{R}}_S(\mathcal{H})\big]$.
One can obtain uniform generalization bounds via the Rademacher complexity.
Theorem 3.1 (see Mohri et al. (2018)) Let H be a family of functions taking values
in {-1,+1}, and let D be the distribution over the input space X. Then, for any δ > 0,
with probability at least 1 − δ over samples S of size m drawn according to Dm , the
following holds for any h ∈ H:
$$R(h) \le \hat{R}_S(h) + \mathcal{R}(\mathcal{H}) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \qquad (3.3)$$
Then, by McDiarmid’s inequality, for any δ > 0, with probability at least 1 − δ, the
following holds:
log 1δ
D(S) ≤ E S [D(S)] + . (3.6)
2m
As a result,
log 1δ
R(h) ≤ R̂ S (h) + R(H) + . (3.8)
2m
Lemma 3.1 (Massart's lemma) Let $G \subseteq \mathbb{R}^m$ be a finite set of vectors, and let $r = \max_{x \in G} \|x\|_2$. Then,
$$\mathbb{E}_\sigma\left[\sup_{x \in G} \sum_{i=1}^{m} \sigma_i x_i\right] \le r\sqrt{2\log|G|},$$
where the $\sigma_i$ are independent uniform random variables taking values in {−1, +1} and $x_1, \ldots, x_m$ are the components of the vector x.
Proof Let t > 0 be a number to be chosen later. Then,
$$\exp\left(\mathbb{E}_\sigma\left[\sup_{x \in G}\sum_{i=1}^{m} t\sigma_i x_i\right]\right) \le \mathbb{E}_\sigma\left[\exp\left(\sup_{x \in G}\sum_{i=1}^{m} t\sigma_i x_i\right)\right] = \mathbb{E}_\sigma\left[\sup_{x \in G} \exp\left(\sum_{i=1}^{m} t\sigma_i x_i\right)\right]$$
$$\le \sum_{x \in G} \mathbb{E}_\sigma\left[\exp\left(\sum_{i=1}^{m} t\sigma_i x_i\right)\right] = \sum_{x \in G} \prod_{i=1}^{m} \mathbb{E}_{\sigma_i}\big[\exp(t\sigma_i x_i)\big]$$
$$\le \sum_{x \in G} \prod_{i=1}^{m} \exp\left(\frac{(t x_i)^2}{2}\right) = \sum_{x \in G} \exp\left(\sum_{i=1}^{m} \frac{(t x_i)^2}{2}\right) \le \sum_{x \in G} \exp\left(\frac{t^2 r^2}{2}\right) = |G| \exp\left(\frac{t^2 r^2}{2}\right).$$
The first inequality comes from Jensen's inequality, and the third inequality follows from Hoeffding's lemma. By the result above, we have
$$\mathbb{E}_\sigma\left[\sup_{x \in G}\sum_{i=1}^{m} \sigma_i x_i\right] \le \frac{\log|G|}{t} + \frac{t r^2}{2}. \qquad (3.10)$$
Let $t = \sqrt{2\log|G|/r^2}$. Then, we can obtain the result that we have claimed.
The proof is complete.
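Massart's lemma can be probed numerically. The sketch below (illustrative only; the random construction of G and all sizes are assumptions for this example) estimates E_σ[sup_{x∈G} Σ σ_i x_i] by Monte Carlo and compares it with r√(2 log|G|).

```python
import numpy as np

def massart_check(n_vectors=50, m=100, trials=20_000, seed=0):
    """Estimate E_sigma[ sup_{x in G} sum_i sigma_i x_i ] for a finite set G
    of n_vectors points in R^m and compare with r * sqrt(2 log |G|)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n_vectors, m))
    r = np.linalg.norm(G, axis=1).max()                 # r = max_{x in G} ||x||_2
    sigma = rng.choice([-1.0, 1.0], size=(trials, m))   # Rademacher variables
    sup_values = (sigma @ G.T).max(axis=1)              # sup over G for each draw
    estimate = sup_values.mean()
    bound = r * np.sqrt(2.0 * np.log(n_vectors))
    return estimate, bound

if __name__ == "__main__":
    est, bnd = massart_check()
    print(f"Monte Carlo estimate: {est:.2f}")
    print(f"Massart bound:        {bnd:.2f}")   # the estimate stays below the bound
```

Dividing the estimate by m gives a Monte Carlo approximation of the empirical Rademacher complexity of the finite set G.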
One can prove a margin bound on the basis of the covering number, defined as follows, or the Rademacher complexity of the hypothesis space.

Definition 3.2 (Covering number) The covering number $N(\mathcal{H}, \epsilon, \|\cdot\|)$ of a space H is defined as the minimal cardinality of any subset V ⊂ H that covers H at scale ε under the metric ‖·‖, i.e.,
$$\sup_{A \in \mathcal{H}} \min_{B \in V} \|A - B\| \le \epsilon. \qquad (3.11)$$
Intuitively, the covering number reflects the richness of the hypothesis space. A
larger covering number indicates a space with more balls needed to represent it,
suggesting a potentially more complex set of hypotheses.
The covering number indicates how many balls are needed to cover the hypoth-
esis space, whereas the Rademacher complexity measures the maximal correlation
between a hypothesis and noise and thus characterizes the “goodness-of-fit” of the
hypothesis space to noise. A higher Rademacher complexity suggests that hypothe-
ses in the class might be overly susceptible to noise, leading to poorer generalization
performance on unseen data.
The relationship between hypothesis complexity and these measures becomes
evident when we consider techniques like the Dudley entropy integral.
Theorem 3.2 (Dudley entropy integral) Let H be a real-valued function class taking values in [0, 1], and assume that 0 ∈ H. Then,
$$\mathcal{R}(\mathcal{H}) \le \inf_{\alpha > 0}\left\{\frac{4\alpha}{\sqrt{m}} + \frac{12}{m}\int_{\alpha}^{\sqrt{m}} \sqrt{\log N(\mathcal{H}, \epsilon, \|\cdot\|_2)}\, d\epsilon\right\}. \qquad (3.12)$$
√
Proof Let N be an arbitrary positive integer in N, and let i = m2−(i−1) for each
i ∈ [N ]. Let Bi denote the cover achieving N(H, , · 2 ). By the definition of the
covering number, for any f ∈ H, there exists a bi [ f ] ∈ Bi such that
f − bi [ f ]2 ≤ i .
Let b1 = 0; then, the third term becomes 0. For the first term, we can use the Cauchy-
Schwarz inequality to obtain
m
m
m
Eσ sup N
σt ( f (xt ) − bt [ f ]) ≤ Eσ σt Eσ sup
2
( f (xt ) − btN [ f ])2
f ∈H t=1 t=1 f ∈H t=1
√
≤ m N .
3.2 Worst-Case Bounds Based on the Vapnik-Chervonenkis (VC) Dimension 43
and
√
N
≤ m N + 12 (i − i+1 ) log N(H, i , · 2 )
i=1
√
√ m
≤ m N + 12 log N(H, , · 2 )d.
N +1
Finally, for any α > 0, let N +1 > α and N ≤ 4α for some N ; then,
√
m
4α 12
R(H) ≤ √ + log N(H, , · 2 )d, (3.13)
m m α
We can obtain the following theorem regarding the relationship between the
growth function and the VC dimension.
Theorem 3.3 (Sauer’s lemma; Shelah (1972); Sauer (1972)) Let H be a hypothesis
set with VCdim(H) = d. Then, for all m ∈ N, the following inequality holds:
d
m
H (m) ≤ . (3.15)
i=0
i
d
m−1
|G1 | ≤ G1 (m − 1) ≤ . (3.17)
i=0
i
d−1
m−1
|G2 | ≤ G2 (m − 1) ≤ . (3.19)
i=0
i
3.2 Worst-Case Bounds Based on the Vapnik-Chervonenkis (VC) Dimension 45
Finally, we have
d
m−1
d−1
m−1 d
m
|G| = |G1 | + |G2 | ≤ + = , (3.20)
i=0
i i=0
i i=0
i
Proof Lemma 3.2 follows from Massart’s lemma (Lemma 3.1). Let the set of vectors
of function values (h(x1 ), . . . , h(xm ))T be denoted by H|S ; then, we have
√
m 2 log |H|S | 2 log H (m)
R(H) ≤ ES ≤ . (3.23)
m m
The first inequality follows from Massart’s lemma, and the second inequality follows
from the definition of H (m).
The other lemma is a foundational lemma in combinatorial mathematics and
extremal set theory that was independently proven by (Shelah 1972) and (Sauer
1972).
We now may prove Theorem 3.4.
Proof of Theorem 3.4 Theorem 3.3 yields
m m d−i m d m d
d m m
m d
H (m) ≤ ≤ = 1+ ≤ ed .
i=0
i i=0
i d d m d
(3.24)
By combining this result with Lemma 3.2 above, we complete the proof of
Theorem 3.4.
46 3 Hypothesis Complexity
References
Mehryar, Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. Foundations of Machine
Learning. MIT press.
Nesterov, Yurii E. 1983. A method for solving the convex programming problem with convergence
rate o (1/k∧ 2). In Dokl. Akad. Nauk Sssr 269: 543–547.
Peter, L Bartlett, and Shahar Mendelson. 2002. Rademacher and Gaussian complexities: Risk
bounds and structural results. Journal of Machine Learning Research 3: 463–482.
Sauer, Norbert. 1972. On the density of families of sets. Journal of Combinatorial Theory, Series
A 13 (1): 145–147.
Shelah, Saharon. 1972. A combinatorial problem; stability and order for models and theories in
infinitary languages. Pacific Journal of Mathematics 41 (1): 247–261.
Vladimir, N Vapnik and A Ya Chervonenkis. 2015. On the uniform convergence of relative
frequencies of events to their probabilities. In Measures of complexity, 11–30. Springer.
Chapter 4
Algorithmic Stability
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025 47
F. He and D. Tao, Foundations of Deep Learning, Machine Learning: Foundations,
Methodologies, and Applications, https://doi.org/10.1007/978-981-16-8233-9_4
48 4 Algorithmic Stability
Definition 4.3 (Error stability) An algorithm A has error stability β with respect to
the loss function l if the following holds:
We can then prove the following polynomial generalization bound in terms of the
hypothesis stability (Bousquet and Elisseeff 2002).
Theorem 4.1 (Polynomial generalization bound in terms of hypothesis stability)
For a learning algorithm A with hypothesis stability β1 and pointwise hypothesis
4.2 Algorithmic Stability and Generalization Error Bounds 49
and
\i M 2 + 6Mmβ1
R(A(S)) ≤ R̂ S (A(S )) + . (4.6)
2mδ
Thus, we have
M2
E[(R(A(S)) − R̂ S (A(S))2 ] ≤ + 3MES,zi [|l(A(S), z i ) − l(A(S i ), z i )|]
2m
M2
= + 6Mβ2
2m
and
M2
E[(R(A(S)) − R̂ S (A(S \i )))2 ] ≤ + 3ME S,z [|l(A(S), z) − l(A(S \i ), z)|]
2m
M2
= + 3Mβ1 .
2m
and
X = R(A(S)) − R̂ S (A(S \i )), (4.9)
Then, we may prove the following theorem based on this new notion.
Proof The loss function lγ is bounded by M = 1, and the algorithm is β/γ -stable.
By using the fact that R(A(S)) ≤ R γ = Ez [lγ (A(S), z)] and applying Theorem 4.1,
we complete the proof.
Proof We adopt the notation D(A(S)) = R(A(S)) − R̂ S (A(S)) for notational sim-
plicity. In the rest of the proof, we will prove that D(A(S)) is close to its expectation
E S [D(A(S))] and that both have uniform β-stability:
1
m
E S [D(A(S))] = E S,z [l(A(S), z) − l(A(S), z i )]
m i=1
1
m
= E S,z [l(A(S), z ) − l(A(S i ), z )]
m i=1
≤ β,
4.2 Algorithmic Stability and Generalization Error Bounds 51
where the third equality is based on the “symmetry” of the expectation. Let us verify
the conditions required in McDiarmid’s inequality (Definition 4.6). Given that an
algorithm A with uniform stability β, for any m ∈ N and S, S i ∈ Z m , it holds that
Applying McDiarmid’s inequality to D(A(S)), we find that for any ∈ (0, 1),
m 2
P(|D(A(S)) − ED(A(S))| > ) ≤ 2 exp(− ).
2(mβ + M)2
Hence,
m 2
P(D(A(S)) > β + ) ≤ 2 exp(− ).
2(mβ + M)2
Eventually, we take
m 2
δ = 2 exp(− ),
2(mβ + M)2
i.e.,
2 ln(2/δ)
= (mβ + M) .
m
Deep learning algorithms usually have an extremely large “total” hypothesis com-
plexity, whereas optimizers may explore only a small part of the hypothesis space.
The following notion helps characterize the “effective” hypothesis complexity, which
depends on both the learning algorithm and the training data.
1
m
R(H) = E sup σi h, xi , (4.14)
h∈H m i=1
52 4 Algorithmic Stability
With the help of algorithmic stability, one may prove a generalization bound in
terms of the algorithmic Rademacher complexity (Liu et al. 2017).
and
Br = {h ∈ H|h − Eh S ≤ r (m, δ)}.
Our proof consists of two parts. First, we show that R(Br ) ≤ B 2 log(2/δ)α(m).
This is because
1
m
R(Br ) = E sup σi h, xi
h∈Br n i=1
1
m
= E sup σi ( h, xi − E[ h S , xi ])
h∈Br m i=1
m
1
≤ E sup h − E[h S ] σi xi
m
h∈Br i=1 ∗
m
r
≤ σi xi
m i=1 ∗
1 m 1/ p
≤ α(m)D 2m log(2/δ)C p xi ∗p
m i=1
≤ DC p B 2 log(2/δ)α(m)m −1/2+1/ p .
Note that B is a Hilbert space; thus, the first step is complete. Next, let us consider
the Doob-Meyer difference
4.2 Algorithmic Stability and Generalization Error Bounds 53
and further,
m
m
Dt 2∞ = E[h S |Z 1 , . . . , Z t ] − E[h S |Z 1 , . . . , Z t−1 ]2∞
t=1 t=1
m
= E[h S − h S t |Z 1 , . . . , Z t ]2∞
t=1
m 2
≤ E[h S − h S t ∞ |Z 1 , . . . , Z t ]
t=1
≤ mα(m)2 .
Lemma 4.1 (Theorem 2.1 in (Bartlett et al. 2005)) Let F be a class of functions that
map X to [0, M]. Assume that there exists some ρ > 0 such that for every f ∈ F,
var( f (X )) ≤ ρ. Then, with probability at least 1 − δ, we have
1
m
2ρ log(1/δ) 4M log(1/δ)
sup E[ f (x)] − f (xi ) ≤ 4R(F) + + .
f ∈F m i=1 m 3m
Proof Based on the concept of concentration inequality, let V = sup E[ f (x)] −
f ∈F
m
1
m
f (xi ) . Since sup var[ f (xi )] ≤ ρ, it holds, with probability at least 1 − e−x ,
i=1 f ∈F
2xρ 4x ME[V ] Mx
V ≤ E[V ] + + + .
m m 3m
√ √ √ √
By the inequalities x+y≤ x+ y and 2 x y ≤ αx + αy , with probability at
least 1 − e−x ,
2ρx 1 1 Mx
V ≤ inf (1 + α)E[V ] + + + .
α>0 m 3 α m
Finally, let α = 1 and consider E[V ] ≤ 2R(F); this completes the proof.
and
1
m
Vr = sup E[g(z)] − g(z i )
g∈Gr m i=1
be defined. For any r > 0 and a > 1, if Vr ≤ r/a, then every h ∈ Br satisfies
a 1
m
E[l(h, z)] ≤ l(h, z i ) + Vr .
a − 1 m i=1
Proof Let l(h, z) be denoted by f (z), and let g = r f /ω( f ). Then, by the definition
m
of Vr , we have E[g(z)] ≤ m1 g(z i ) + Vr . We note that when r ≥ E[ f (z)], g = f .
i=1
Otherwise, when r < E[ f (z)], we have
4.2 Algorithmic Stability and Generalization Error Bounds 55
1 1
m m
E[ f (z)] E[ f (z)]
E[ f (z)] ≤ f (z i ) + Vr ≤ f (z i ) + .
m i=1 r m i=1 a
Let the right-hand side be r/a; then, by applying Lemma 4.2, we obtain
a 1
m
r
E[l(h, z)] ≤ l(h, z i ) + (4.15)
a − 1 m i=1 a
a 1
m
2Ma log(1/δ) 8M log(1/δ)
≤ l(h, z i ) + + 8R(Gr ) + .
a − 1 m i=1 m 3m
(4.16)
Then, we have
a 1
m
2Ma log(1/δ) 8M log(1/δ)
sup E[l(h, z) − l(h, z i ) ≤ + 8R(l ◦ Br ) + .
h∈Br a−1m m 3m
i=1
In this section, we provide an example to illustrate the uniform stability of the kernel-
based Tikhonov regularization algorithm. Recall the definition of the Tikhonov
regularization scheme:
1
m
min{ l(h, z i ) + λ (h)},
h∈H m i=1
m
H K = { f : X → R| f (x) = αi K xi (x), αi ∈ R},
i=1
1
m
h S = arg min l(h, z i ) + λ f 2K ,
f ∈H K m i=1
where the loss function l is convex and satisfies the L-Lipschitz continuity condition.
Theorem 4.6 Suppose that the kernel function is bounded, i.e., K (x, x) ≤ κ 2 <
∞, ∀x ∈ X . The kernel-based Tikhonov regularization algorithm A is uniformly
β-stable, with probability at least 1 − δ, we have
2Lκ 2 2 ln(2/δ)
R(A(S)) ≤ R̂ S (A(S)) + + (2k + M) .
λm m
and
λ
D f 2K
2
= λ[A(S)2K + A(S i )2K − A(S) + t D f 2K − A(S i ) − t D f 2K ]
≤ [ R̂ Si (A(S i ) − t D f ) − R̂ S (A(S i ) − t D f )] + [ R̂ S (A(S i )) − R̂ Si (A(S i ))]
1 1
= [l(A(S i ) − t D f , z i ) − l(A(S i ) − t D f , z i )] + [l(A(S i ), z i ) − l(A(S i ), z i )]
m m
1 1
= [l(A(S i ) − t D f , z i ) − l(A(S i ), z i )] + [l(A(S i ), z i ) − l(A(S i ) − t D f , z i )]
m m
tL 2t Lκ
≤ (|D f (xi )| + |D f (xi )|) ≤ D f K
m m
Lκ 1
≤ D f K (by taking t = ),
m 2
where the last line follows from the reproducing property, the Cauchy-Schwartz
inequality, and the boundedness assumption on the kernel K , i.e., | f | ≤ κ f K for
any f in the RKHS. Moreover, by taking t = 1/2, we then have
58 4 Algorithmic Stability
2Lκ 2
|D f | ≤ κD f 2K ≤ .
λm
Finally, we show that the algorithm has a uniform stability of 2Lκ 2 /λm. We complete
the proof by combining this result with the result of Theorem 4.3.
References
Olivier, Bousquet, and André Elisseeff. 2002. Stability and generalization. Journal of Machine
Learning Research 2: 499–526.
Peter, L Bartlett, Olivier Bousquet, and Shahar Mendelson. 2005. Local rademacher complexities.
The Annals of Statistics 33 (4): 1497–1537.
Tongliang, Liu, Gábor Lugosi, Gergely Neu, and Dacheng Tao. 2017. Algorithmic stability and
hypothesis complexity. In International Conference on Machine Learning, 2159–2167.
Yurii, E Nesterov. 1983. A method for solving the convex programming problem with convergence
rate o (1/k 2 ). In Dokl. Akad. Nauk Sssr 269: 543–547.
Part II
Deep Learning Theory
Chapter 5
Capacity and Complexity
This chapter presents the results for deep learning theory from the perspective of
hypothesis complexity, including the Vapnik-Chervonenkis (VC) dimension, the
Rademacher complexity, and the covering number. By determining upper and lower
bounds for the VC dimension of neural networks, we can better understand their
generalizability. Additionally, we discuss margin bounds, which offer more robust
generalization assurances compared to worst-case bounds based on the VC dimen-
sion. These bounds ensure that trained models can achieve a small empirical margin
loss with high confidence. Furthermore, we examine the effect of residual connec-
tions on hypothesis complexity by analyzing the covering number of the hypothesis
space. We propose an upper bound for the covering number, providing insights into
how residual connections influence model complexity.
One can upper bound or lower bound the VC dimension of a neural network in order
to characterize its generalizability. Goldberg and Jerrum (1995) gave an O(W 2 ) upper
bound for the VC dimension of a neural network with W parameters. Bartlett et al.
(1999) improved this bound to O((W L log W + W L 2 ), where L is the number of
layers. The tightest bound thus far has been proven by Harvey et al. (2017) as follows.
Theorem 5.1 (see Harvey et al. (2017)) Consider a neural network with W parame-
ters and U units. These units have activation functions that are piecewise polynomials
with at most p pieces and a degree of at most d. Let F be the set of (real-valued)
functions computed by this network. Then, the VC dimension of the sign function
(sgn) applied to elements of F is upper bounded by O(W U log((d + 1) p)).
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025 61
F. He and D. Tao, Foundations of Deep Learning, Machine Learning: Foundations,
Methodologies, and Applications, https://doi.org/10.1007/978-981-16-8233-9_5
62 5 Capacity and Complexity
Proof The output signal of a neural network with piecewise activation can be
expressed as a Boolean formula. In detail, in each layer, the input to each com-
putation unit must lie in one of the p pieces of the activation function ψ, which can
be written as a Boolean formula. Accordingly, we can express the output signal of the
network as a Boolean function consisting of fewer than 2(1 + p)U atomic predicates,
each of which is a polynomial inequality with a degree of at most max{U + 1, 2d U }.
Harvey et al. (2017) also derived a lower bound for the VC dimension, presented
as the following theorem.
Theorem 5.3 (see (Harvey et al. 2017)) Let C be an universal constant. Given any W
and L satisfying W > C L > C 2 , there exists a rectified linear unit (ReLU) network
with fewer than L layers and fewer than W parameters that has a VC dimension
lower bounded by W L log(W/L)/C.
This theorem has a more general version. Theorem 5.3 is a special case in which
r = log2 (W/L)/2, m = r L/8, and n = W − 5m2r .
Theorem 5.4 Let r , m, and n be positive integers, and let k = m/r . There exists
a ReLU network with 3 + 5d layers, 2 + n + 4m + k((11 + r )2r + 2r + 2) param-
eters, m + n input nodes and m + 2 + k(5 × 2r + 1) computational nodes with VC
dimension ≥ mn.
Margin bounds (Vapnik 2006; Schapire et al. 1998; Bartlett and Shawe-Taylor 1999;
Koltchinskii and Panchenko 2002; Taskar et al. 2004) represent a distinct category of
generalization bounds that offer stronger assurances compared to worst-case bounds
5.2 Rademacher Complexity and Covering Number 63
based on the VC dimension. Unlike traditional bounds, margin bounds focus on the
concept of margin, which refers to the difference between the predicted score of the
correct label and the highest score of incorrect labels for a given example. These
bounds provide robust guarantees that trained models can achieve a small empirical
margin loss within a large confidence margin. Essentially, margin bounds quantify
the degree of separation between different classes in the model’s decision space,
reflecting the model’s ability to generalize beyond the training data. Furthermore,
similar generalization guarantees can be derived based on a measure of “luckiness”
(Schapire et al. 1998; Koltchinskii and Panchenko 2002), which captures the notion
of how likely a model is to perform well on unseen data due to favorable properties of
the training set. By incorporating margin-based metrics and measures of luckiness,
researchers gain valuable insights into the behavior and performance of machine
learning models, enhancing our understanding of their generalization capabilities
and reliability in real-world applications.
Margin bound
For any distribution D and margin γ > 0, we define the expected margin loss for a
hypothesis h as follows:
L γ (h) = P(x,y)∼D h(x)[y] ≤ γ + max h(x)[ j] ,
j=y
where h(x)[ j] is the j-th component of the vector h(x). The generalization bounds
obtained under the margin loss are called margin bounds.
Based on the covering number, the Dudley entropy integral, and the Rademacher
complexity, Bartlett et al. (2017) obtained the following spectrally normalized upper
bound for the generalization error:
⎛ 3/2 ⎞
X log Q
D D
Wi − Mi
2/3
1/δ ⎠
Õ ⎝
2 2,1
ρi Wi σ + ,
γm i=1 i=1 Wi
2/3
σ m
⎜ B D Q log(D Q) i=1 Wi
F
2 + log Dm
δ ⎟
O⎝ 2
⎠. (5.1)
γ 2m
where M F (1), ..., M F (D) mean the maximum of Frobenius norm of networks param-
eter matrices. Besides, Golowich et al. (2018) further improved the bound to a depth-
independent bound, as follows:
⎛ ⎛ ⎞ ⎧ ⎫⎞
⎪
⎨ ⎪
D ⎬⎟
D
D
log 1
M ( j)
⎜ j=1 F
Õ ⎝ B ⎝ M F ( j)⎠ min √ , ⎠, (5.3)
⎪
⎩ m m⎪⎭
j=1
where
Q is the network width. This bound suggests a possible tighter bound of
O m −1/2 .
5.2 Rademacher Complexity and Covering Number 65
The gradient measure plays a crucial role in refining generalization bounds and illu-
minating the intricate relationship between a model’s generalizability and its train-
ability. By analyzing the impact of gradients during training, researchers gain insights
into how noise can be strategically introduced into input data to enhance a model’s
ability to generalize to unseen examples. This observation underscores the value
of robust training strategies that incorporate regularization techniques like gradient
clipping (Merity et al. 2018; Gehring et al. 2017; Peters et al. 2018). Furthermore, the
gradient measure provides theoretical justification for gradient clipping, a technique
used to stabilize training by constraining the magnitude of gradients during optimiza-
tion. By controlling the scale of gradients, gradient clipping helps prevent exploding
gradients and facilitates more stable and effective learning. Overall, understanding
the implications of gradient measures enriches our understanding of model behavior
and informs practical strategies to improve model performance and generalization
capabilities.
(Novak et al. 2018) conducted comprehensive experiments to analyze the gener-
alization capacity of deep neural networks. Their empirical investigations revealed
pivotal factors influencing generalization, particularly focusing on the input-output
Jacobian norm and the linear region counting within the networks. The input-output
Jacobian norm measures the sensitivity of network outputs to changes in input, pro-
viding insights into how robustly the network behaves across different input varia-
tions. Meanwhile, the linear region counting quantifies the number of linear decision
boundaries that can be formed within the network’s architecture, offering clues about
the complexity and flexibility of its decision-making process. Furthermore, the study
highlighted the critical relationship between the output hypothesis of a model and the
underlying data manifold. The generalization bound, which defines the model’s abil-
ity to perform well on unseen data, is heavily influenced by how closely the model’s
output hypothesis aligns with the distribution of the training data in the input space.
This suggests the importance of designing models that not only fit the training data
well but also generalize effectively to new, unseen instances. Understanding these
fundamental aspects contributes to the development of more robust and reliable deep
learning models capable of adapting to diverse real-world scenarios.
66 5 Capacity and Complexity
In recent years, deep neural networks have repeatedly demonstrated their signifi-
cant value in practical applications and have led to breakthroughs in image recogni-
tion, language translation, and predictive analytics, driving innovation across various
industries (LeCun et al. 2015; Greff et al. 2016; Shi et al. 2018; Silver et al. 2016; Lit-
jens et al. 2017; Chang et al. 2018), particularly with the introduction of ResNet (He
et al. 2016a) and the wide adoption of residual connections to induce top performed
neural network architectures (He et al. 2016a; Huang et al. 2017; Xie et al. 2017).
These advancements have catalyzed transformations in computer vision (Krizhevsky
et al. 2012; Lin et al. 2017; He et al. 2017; Liu et al. 2019) and data analysis (Witten
et al. 2016). Residual connections link non-neighboring layers, diverging from the
traditional chain-like stacking of layers.
Empirical studies have consistently proven that incorporating residual connections
can facilitate the training of deep neural networks by largely alleviating issues related
to vanishing gradients and exploding gradients. Despite their effectiveness in training,
there remains a notable gap in theoretical analysis regarding the impact of residual
connections on the generalization ability of deep neural networks.
While residual connections have shown instrumental in improving training dynam-
ics and model convergence, their influence on generalization-i.e., a model’s perfor-
mance on unseen data in the inference stage-remains less explored from a theoretical
perspective. Comprehensive research is critical to elucidate the underlying mecha-
nisms by which residual connections impact on the generalization capabilities of deep
neural networks, thereby providing deeper insights into the theoretical foundations
of this powerful addon.
Residual connections introduce a novel approach to neural network architecture by
connecting non-neighboring layers, diverging from the traditional chain-like stack-
ing of layers. This departure from conventional network structures introduces loops
within the neural network, fundamentally altering its topology. Intuitively, the incor-
poration of residual connections significantly expands the hypothesis space of deep
neural networks. This increased complexity, however, raises concerns regarding the
generalization ability of the network, as per the principle of Occam’s razor, which
suggests a negative correlation between algorithmic generalization and hypothesis
complexity.
The lack of comprehensive analysis on this aspect poses challenges for deploying
residual-connected neural networks in safety-critical domains, such as autonomous
vehicles and medical diagnosis, where algorithmic errors could have severe conse-
quences. Without addressing this theoretical gap, leveraging the advancements made
possible by residual connections in critical applications, from autonomous vehicles
(Janai et al. 2017) to medical diagnose (Esteva et al. 2017), remains constrained.
Efforts to bridge this gap are crucial for ensuring the reliability and safety of neural
network-based systems in real-world scenarios where accuracy and generalization
are paramount.
5.3 Generalization Bound of ResNet 67
1The “stem” is defined to denote the chain-like part of the neural network besides all the residuals.
For more details, please refer to Sect. 5.3.1.
68 5 Capacity and Complexity
This section provides a notation system for deep neural networks with residual con-
nections. Motivated by the topological structure, we call it the stem-vine framework.
Deep neural networks are typically assembled by linking numerous weight matri-
ces with nonlinear operators, such as ReLU, sigmoid, and max-pooling functions. In
this discussion, we focus on a neural network design that integrates multiple resid-
ual connections into a traditional “chain-like” architecture, which sequentially stacks
layers of weight matrices and nonlinearities. Inspired by the topological arrangement
of this network, we refer to the sequential layers as the “stem” of the neural network
and designate the added residual connections as the “vines”.
Both the stem and vines consist of stacked layers of weight matrices and nonlinear
functions. The stem represents the core structure of the neural network, comprising
the primary sequence of layers, while the vines introduce additional connections that
bypass certain layers, enhancing the network’s depth and flexibility. This configu-
ration allows for more complex transformations and feature representations within
the neural network, leveraging both the traditional sequential architecture and the
introduced residual connections to enhance learning capabilities and model expres-
siveness.
We denote the weight matrices and the nonlinearities in the stem S respectively
as
S = (A1 , σ1 , A2 , σ2 , . . . , A L , σ L ) . (5.8)
2 If two weight matrices, Ai and Ai+1 , are connected directly without a nonlinearity between
them, we define a new weight matrix A = Ai · Ai+1 . The situations that nonlinearities are directly
connected are similar, as the composition of any two nonlinearities is still a nonlinearity.
Meanwhile, the number of weight matrices does not necessarily equal the number of nonlin-
earities. Sometimes, if a vine connects the stem at a vertex between two weight matrices (or two
nonlinearities), the number of the weight matrices (nonlinearities) would be larger than the number
of nonlinearities (weight matrices). Taken the 34-layer ResNet as an example, a vine connects the
stem between two nonlinearities σ33 and σ34 . In this situation, we cannot merge the two nonlinear-
ities, so the number of nonlinearities is larger than the number of weight matrices.
5.3 Generalization Bound of ResNet 69
For the brevity, we give an index j to each vertex between a weight matrix and
a nonlinearity and denote the j-th vertex as N ( j). Specifically, we give the index 1
to the vertex that receives the input data and L + L N + 1 to the vertex after the last
weight matrix/nonlinearity. Taken Eq. (5.8) as an example, the vertex between the
nonlinearity σi−1 and the weight matrix Ai is denoted as N (2i − 1) and the vertex
between the weight matrix Ai and the nonlinearity σi is denoted as N (2i).
Vines are constructed to connect the stem at two different vertexes. And there
could be over one vine connecting a same pair of the vertexes. Therefore, we use a
triple vector (s, t, i) to index the i-th vine connecting the vertexes N (s) and N (t)
and denote the vine as V (s, t, i). All triple vectors (s, t, i) constitute an index set I V ,
i.e., (s, t, i) ∈ I V . Similar to the stem, each vine V (s, t, i) is also constructed by a
series of weight matrices As,t,i s,t,i s,t,i
1 , . . . , A L s,t,i and nonlinearities σ1 , . . . , σ Ls,t,i
s,t,i , where
N
where · f and · x are respectively the norms defined on the spaces of f (x)
and x. Fortunately, almost all nonlinearities normally used in neural networks are
Lipschitz continuous, such as ReLU, max-pooling, and sigmoid (see (Bartlett et al.
2017)).
Many important tasks for deep neural networks can be categorized into multi-class
classification. Suppose input examples z 1 . . . , z m are given, where z i = (xi , yi ), xi ∈
Rn 0 is an instance, y ∈ {1, . . . , n L } is the corresponding label, and n L is the number of
the classes. Collect all instances x1 , . . . , xm as a matrix X = (x1 , . . . , xm )T ∈ Rm×n 0
that each row of X represents a data point. By employing optimization methods
(usually stochastic gradient decent, SGD), neural networks are trained to fit the
training data and then predict on test data. In mathematics, a trained deep neural
network with all parameters fixed computes a hypothesis function F : Rn 0 → Rn L .
And a natural way to convert F to a multi-class classifier is to select the coordinate
of F(x) with the largest magnitude. In other words, for an instance x, the classifier
is x → arg maxi F(x)i . Correspondingly, the margin for an instance x labelled as
yi is defined as F(x) y − maxi =y F(x)i . It quantitatively expresses the confidence of
assigning a label to an instance.
To express F, we first define the functions respectively computed by the stem and
vines. Specifically, we denote the function computed by a vine V (s, t, i) as:
70 5 Capacity and Complexity
Furthermore, we denote the output of the stem at the vertex N ( j) as the following
equation:
j
FS (X ) = σ j (A j σ j−1 (. . . σ1 (A1 X ) . . .)) . (5.12)
j
FS (X ) is also the input of the rest part of the stem. Eventually, with all residual
connections, the output hypothesis function F j (X ) at the vertex N ( j) is expressed
by the following equation:
j u, j,i
F j (X ) = FS (X ) + FV (X ) . (5.13)
(u, j,i)∈I V
Apparently,
FS (X ) = FSL (X ), F(X ) = F L (X ) . (5.14)
Naturally, we call this notation system as the stem-vine framework, and Fig. 5.1
gives an example.
Fig. 5.1 A deep neural network with Residual Connections under the Stem-Vine Framework
In this subsection, we give a covering bound generally for any deep neural network
with residual connections.
Theorem 5.6 (Covering Bound for Deep Neural Network) Suppose a deep neural
network is constituted by a stem and a series of vines.
For the stem, let (ε1 , . . . , ε L ) be given, along with L N fixed nonlinearities
(σ1 , . . . , σ L N ). Suppose the L weight matrices (A1 , . . . , A L ) lies in B1 × . . . × B L ,
where Bi is a ball centered at 0 with radius of si , i.e., Ai ≤ si . Suppose the vertex
that directly follows the weight matrix Ai is N (M(i)) (M(i) is the index of the ver-
tex). All M(i) constitute an index set I M . When the output FM( j−1) (X ) of the weight
matrix A j−1 is fixed, suppose all output hypotheses FM( j) (X ) of the weight matrix
A j constitute a hypothesis space H M( j) with an ε M( j) -cover W M( j) with covering
number N M( j) . Specifically, we define M(0) = 0 and F0 (X ) = X .
Each vine V (u, v, i), (u, v, i) ∈ IV is also a chain-like neural network that con-
structed by multiple weight matrices Au,v,i j , j ∈ {1, . . . , L u,v,i }, and nonlinearities
σ ju,v,i , j ∈ {1, . . . , L u,v,i
N }. Suppose for any weight matrix A j
u,v,i
, there is a s u,v,i
j >0
u,v,i u,v,i u,v,i
such that A j σ ≤ sj . Also, all nonlinearities σ j are Lipschitz continu-
ous. Similar to the stem, when the input of the vine Fu (X ) is fixed, suppose the
vine V (u, v, i) computes a hypothesis space HVu,v,i , constituted by all hypotheses
FVu,v,i (X ), has an εu,v,i -cover WVu,v,i with covering number NVu,v,i .
Eventually, we denote the hypothesis space computed by the neural network is H.
Then there exists an ε in terms of εi , i = {1, . . . , L} and εu,v,i , (u, v, i) ∈ I V , such
that the following inequality holds:
L
N(H, ε, · ) ≤ sup N M( j+1) sup NVu,v,i . (5.15)
j=1 FM( j) (u,v,i)∈I V Fu
Lemma 5.1 (Covering Bound for Chain-like Deep Neural Network; cf. Bartlett et al.
(2017), Lemma A.7) Suppose there are L weight matrices in a chain-like neural
network. Let (ε1 , . . . , ε L ) be given. Suppose the L weight matrices (A1 , . . . , A L )
lies in B1 × . . . × B L , where Bi is a ball centered at 0 with the radius of si , i.e.,
Bi = {Ai : Ai ≤ si }. Furthermore, suppose the input data matrix X is restricted
in a ball centred at 0 with the radius of B, i.e., X ≤ B. Suppose F is a hypothesis
5.3 Generalization Bound of ResNet 73
H = {F(X ) : Ai ∈ Bi } , (5.16)
L L
where i = 1, . . . , L and t ∈ {1, . . . , L u,v,s }. Let ε = j=1 εjρj l= j+1 ρl sl . Then
we have the following inequality:
L
N(H, ε, · ) ≤ sup Ni , (5.17)
i=1 Ai−1 ∈B i−1
Remark 5.1 The mapping induced by a chain-like neural network involves com-
posing a sequence of affine and nonlinear transformations. According to Lemma
5.1, the covering bound for a chain-like network can be decomposed into the product
of covering bounds for each layer, as demonstrated in Bartlett et al. (2017). How-
ever, when residual connections are introduced, parallel structures emerge within
the network, complicating the representation of the overall mapping as a series of
transformations. Instead, calculating a covering bound for the entire network requires
considering many additions of function spaces (as seen in Eq. (5.13)), which presents
challenges for applying previous methods directly. To address this, we introduce a
new proof in Sect. 5.3.3.3 to analyze the covering bound under the presence of
residual connections and parallel structures.
Contrary to the different proofs, the result for deep neural networks with residual
connections share similarities with the one for the chain-like network (see, respec-
tively, Eqs. (5.15) and (5.17)). The similarities lead to the property summarised as
follows.
The influences on the hypothesis complexity of weight matrices are in the same way, no
matter whether they are in the stem or the vines. Specifically, adding an identity vine could
not affect the hypothesis complexity of the deep neural network.
As indicated by Eq. (5.17) in Lemma 5.1, the covering number of the hypothesis
computed by a chain-like neural network (including the stem and all the vines) is
upper bounded by the product of the covering number of all single layers. Specifically,
the contribution
of the stem on the covering bound is the product of a series of covering
numbers, i.e., Lj=1 sup FM( j) N M( j+1) . In the meantime, applying Eq. (5.17) in Lemma
5.1, the contribution sup Fu NVu,v,i of the vine V (u, v, i) can also be decomposed as the
product of a series of covering numbers. Apparently, the contributions respectively
by the weight matrices in the stem and the ones in the vines have similar formulations.
This result gives an insight that residuals would not undermine the generalization
capability of deep neural networks. Also, if a vine V (u, v, i) is an identity mapping,
the term in Eq. (5.15) that relates to it is definitely 1, i.e., NVu,v,i = 1. This is because
74 5 Capacity and Complexity
there is no parameter to tune in an identity vine. This result gives an insight that adding
an identity vine to a neural network would not affect the hypothesis complexity.
However, it is worth noting that the vines could influence the part of the stem in the
covering bound, i.e., N M( j+1) in Eq. (5.15). The mechanism of the cross-influence
between the stem and the vines is an open problem.
Intuitively, introducing residual connections to a neural network may not change
the hypothesis space. Here, we discuss the following case as an example. Consider
a residual connection that links nodes i and j. Suppose the hypothesis functions
computed in the nodes i and j are Fi and F j , respectively. Also, we denote the
hypothesis functions computed by the bone and residual connection between nodes
i and j as Fi, j and Fi, j , respectively. See an illumination in Fig. 5.2. Then, we have
the following equation
F j = Fi, j ◦ Fi + Fi, j ◦ Fi .
Usually, the residual connection is a simpler sub-network of the bone part. Therefore,
the hypothesis space constituted by all Fi, j is a subspace of the one of Fi, j . Thus,
the hypothesis space constituted by all Fi, j ◦ Fi is a subspace of the one of Fi, j ◦ Fi .
In other words, the hypothesis space computed by a residual connection is only a
sub-space computed by the bone part. Introducing a residual connection is merging
the two spaces by addition operation, which obtains the larger space. This property
guarantees that introducing residual connections may not change the hypothesis
space.
From the vertex that receives the input data to the vertex that outputs classification
functions, there are 34 + 35 + 1 = 70 vertexes (34 is the number of weight matrices
and 35 is the number of nonlinearities). We denote them as N (1) to N (70). Addi-
tionally, we assume the norm of the the weight matrix Ai has an upper bound si , i.e.,
Ai σ ≤ si , while the Lipschitz constant of the nonlinearity σi is denoted as bi .
Under the stem-vine framework, the 16 vines in ResNet are respectively denoted as
V (3, 7, 1), V (7, 11, 1), . . . , V (63, 67, 1). Among these 16 vines, there are 3 vines,
V (15, 19, 1), V (31, 35, 1), and V (55, 59, 1), that respectively contains one weight
matrix, while all others are identity mappings. Let’s denote the weight matrices in the
vines V (15, 19, 1), V (31, 35, 1), and V (55, 59, 1) respectively as A15,19,1 1 , A31,35,1
1 ,
55,59,1 15,19,1 31,35,1 55,59,1
and A1 . Suppose the norms of A1 , A1 , and A1 are respectively
upper bounded by s115,19,1 , s131,35,1 , and s155,59,1 . Denote the reference matrices that
correspond to weight matrices (A1 , . . . , A34 ) as (M1 , . . . , M34 ). Suppose the distance
between each weight matrix Ai and the corresponding reference matrix Mi is upper
bounded by bi , i.e., AiT − MiT ≤ bi . Similarly, suppose there are reference matrices
M1s,t,1 , (s, t) ∈ {(15, 19), (31, 35), (55, 59)} respectively for weight matrices As,t,1 1 ,
and the distance between As,t 1 and M s,t,1
1 is upper bounded by b1
s,t,1
, i.e., (A s,t,1 T
i ) −
(Mis,t,1 )T ≤ b1s,t,1 . We then have the following lemma.
Lemma 5.2 (Covering Number Bound for ResNet) For a ResNet R satisfies all
conditions above, suppose the hypothesis space is H R . Then, we have
(b1u,u+4,1 )2 Fu (X T )T 2
log N(H R , ε, · ) ≤ 2
log(2W 2 )
u∈{15,31,55}
εu,u+4,1
2
34
b2j F2 j−1 (X T )T 2
2
+ log(2W 2 )
j=1
ε22 j+1
2
b34 F68 (X T )T 2
+ 2
log(2W 2 ) , (5.20)
ε70
2
and
2 2 2
F4 j+3 (X ) 2
2 ≤ X ρ1 s1
2 2 2
ρ2i s2i ρ2i+1 s2i+1
2
+1
1≤i≤ j
i ∈{4,8,14}
/
" #
ρ2i2 s2i2 ρ2i+1
2 2
s2i+1 + (s14i−1,4i+3,1 )2 , (5.22)
1≤i≤ j
i∈{4,8,14}
and specifically,
2 2 2
F68 (X T )T 2
2 ≤ X ρ1 s1 ρ34
2 2 2 2
ρ2i s2i ρ2i+1 s2i+1
2
+1
1≤i≤16
i ∈{4,8,14}
/
" #
ρ2i2 s2i2 ρ2i+1
2 2
s2i+1 + (s14i−1,4i+3,1 )2 . (5.23)
1≤i≤16
i∈{4,8,14}
and
" #
ε4 j+3 =(1 + s1 )ρ1 [(∗) + 1] (∗) + 1 + s14i−1,4i+3,1 , (5.25)
1≤i≤ j 1≤i≤ j
i ∈{4,8,14}
/ i∈{4,8,14}
In above equations/inequalities,
" #
ᾱ =(s1 + 1)ρ1 ρ34 (s34 + 1)ρ35 [(∗) + 1] (∗) + s14i−1,4i+3,1 + 1 ,
1≤i≤16 i∈{4,8,14}
i ∈{4,8,14}
/
(5.27)
and
(∗) = ρ2i (s2i + 1)ρ2i+1 (s2i+1 + 1) . (5.28)
Lemma 3.2 guarantees that when the covering number of a hypothesis space is upper
bounded, the corresponding generalization error is also upper bounded. This lemma
provides a theoretical foundation for establishing generalization bounds based on the
covering number of the hypothesis space. By leveraging Lemma 5.2, which presents a
specific covering bound for ResNet, we can derive a concrete generalization bound for
this architecture. In this subsection, we summarize and formalize the generalization
bound as Theorem 5.7, providing a clear theoretical link between the complexity of
the hypothesis space and the expected generalization performance of ResNet.
For the brevity, we rewrite the radius ε2 j+1 and εu,u+4,1 as follows:
R
log N(H, ε, · ) ≤ , (5.31)
ε2
5.3 Generalization Bound of ResNet 79
where
(b1u,u+4,1 )2 Fu (X T )T 2
R= 2
log(2W 2 )
u∈{15,31,55}
ε̂u,u+4,1
2
33
b2j F2 j−1 (X T )T 2
2
2
b34 F68 (X T )T 2
+ log(2W 2 ) + 2
log(2W 2 ) , (5.32)
j=1
ε̂22 j+1 ε̂70
2
Theorem 5.7 (Generalization Bound for ResNet) Suppose a ResNet satisfies all
conditions in Lemma 5.2. Suppose a given series of examples (x1 , y1 ), . . . , (xm , ym )
are arbitrary i.i.d. variables. Suppose hypothesis function FA : Rn 0 → Rn L is com-
puted by a ResNet with weight matrices A = (A1 , . . . , A34 , A15,19,1
1 , A31,35,1
1 ,
55,59,1
A1 ). Then for any margin λ > 0 and any real δ ∈ (0, 1), with probability at
least 1 − δ, we have the following inequality:
8 36 √ log(1/δ)
P{arg max F(x)i = y} ≤ R̂ Sλ (F) + 3 + R log m + 3 , (5.33)
i m 2 m 2m
Indicated by Theorem 5.7, the generalization bound of ResNet relies on its cov-
ering bound. Specifically, when the sample size n and the probability δ are fixed, the
generalization error satisfies that
√
P{arg max F(x)i = y} − R̂ Sλ (F) = O R , (5.34)
i
In addition to the sample size m, our generalization bound (Eq. (5.33)) highlights
a positive correlation with the norms of all weight matrices in the neural network.
Specifically, weight matrices with higher norms contribute to a higher generaliza-
tion bound, suggesting a potential trade-off with generalization performance. This
observation validates the importance of techniques such as weight decay, which aims
to regulate the norms of weight matrices to optimize generalization performance in
deep learning models. By controlling the norms of weight matrices, practitioners
can mitigate the risk of overfitting and enhance the model’s ability to generalize to
unseen data.
Weight decay, original introduced by Krogh and Hertz (1992), has become a
widely adopted regularization technique in the training of deep neural networks.
In the context of learning theory, weight decay involves augmenting the typical
training objective with an additional term that penalizes large weights. It involves
5.3 Generalization Bound of ResNet 81
adding the L 2 norm of all weights as a regularization term to the loss function during
training. This term penalizes large weight values, aiming to prevent overfitting by
discouraging complex model configurations that fit the training data too closely.
By constraining the magnitude of weight norms, weight decay promotes smoother
and more stable training dynamics, which often leads to improved generalization
performance on unseen data. This regularization method is particularly effective in
controlling model complexity and mitigating the risk of overfitting, contributing to
the robustness and stability of deep learning models.
Remark 5.2 The technique of weight decay can improve the generalization abil-
ity of deep neural networks. It refers to adding the L 2 norm of the weights
W = (W1 , . . . , W D ) to the objective function as a regularization term:
D
1
L (W ) = L(W ) + λ W2 ,
2 i=1 i
where λ is a tuneable parameter, L(W ) is the original objective function, and L (w)
is the objective function with weight decay.
D
The term 21 λ i=1 Wi2 can be easily re-expressed by the L 2 norms of all the weight
matrices. Therefore, using weight decay can control the magnitude of the norms of
all the weights matrices not to increase too much. Also, our generalization bound
(Eq. (5.33)) provides a positive correlation between the generalization bound and the
norms of all the weight matrices. Thus, our work gives a justification for why weight
decay leads to a better generalization ability.
A recent systematic experiment conducted by Li et al. (2018c) studies the influence
of weight decay on the loss surface of the deep neural networks (Li et al. 2018c). It
trains a 9-layer VGGNet (Chen et al. 2017) on the dataset CIFAR-10 (Krizhevsky
and Hinton 2009) by employing stochastic gradient descent with batch sizes of 128
(0.26% of the training set of CIFAR-10) and 8192 (16.28% of the training set of
CIFAR-10). The results demonstrate that by employing weight decay, SGD can find
flatter minima4 of the loss surface with lower test errors as shown in Fig. 5.4 (original
presented as Li et al. (2018c), p. 6, Fig. 3). Other technical advances and empirical
analysis include (Galloway et al. 2018; Zhang et al. 2018b; Chen et al. 2018a; Park
et al. 2019).
This appendix compiles several proofs that were omitted from Sect. 5.3.2. First, we
present a proof establishing the covering bound for an affine transformation induced
4The flatness (or equivalently sharpness) of the loss surface around the minima is considered as
an important index expressing the generalization ability. However, the mechanism remains elusive.
For more details, please refers to (Keskar et al. 2017) and (Dinh et al. 2017).
82 5 Capacity and Complexity
Fig. 5.4 Illustrations of the 1D and 2D visualization of the loss surface around the solutions
obtained with different weight decay and batch size. The numbers in the title of each subfigure
is respectively the parameter of weight decay, batch size, and test error. The data and figures are
originally presented in (Li et al. 2018c)
In this subsection, we provide an upper bound for the covering number of the hypoth-
esis space induced by a single weight matrix A. This covering bound relies on Maurey
sparsification lemma (Pisier 1981) and has been introduced in machine learning by
previous works (see, e.g.,(Zhang 2002; Bartlett et al. 2017)).
Suppose a data matrix X is the input of a weight matrix A. All possible values
of the output X A constitute a space. We use the following lemma to express the
complexity of all X A via the covering number.
Lemma 5.3 (Bartlett et al.; see (Bartlett et al. 2017), Lemma 3.2) Let conjugate
exponents ( p, q) and (r, s) be given with p ≤ 2, as well as positive reals (a, b, ε)
and positive integer m. Let matrix X ∈ Rn×d be given with X p ≤ b. Let H A denote
the family of matrices obtained by evaluating X with all choices of matrix A:
!
H A X A|A ∈ Rd×m , A q,s ≤a . (5.35)
5.3 Generalization Bound of ResNet 83
Then
a 2 b2 m 2/r
log N (H A , ε, · 2) ≤ ∗ log(2dm) . (5.36)
ε2
This subsection considers the upper bound for the covering number of the hypothesis
space induced by the stem of a deep neural network. Intuitively, following the stem
from the first vertex N (1) to the last one N (L), every weight matrices and nonlin-
earities increase the complexity of the hypothesis space that could be computed by
the stem. Following this intuition, we use an induction method to approach the upper
bound. The result is summarized as Lemma 5.1. This lemma is originally given in
the work by Bartlett et al. (2017). Here to make this work complete, we recall the
main part of the proof but omit the part for ε.
Proof of Lemma 5.1 We use an induction procedure to prove the lemma.
(1) The covering number of the hypothesis space computed by the first weight
matrix A1 can be straightly upper bounded by Lemma 5.3.
(2) The vertex after the j-th nonlinearity is N (2 j + 1). Suppose W2 j+1 is an
ε-cover of the hypothesis space H2 j+1 induced by the output hypotheses in the
vertex N (2 j + 1). Suppose there is a weight matrix A j+1 directly follows the vertex
N (2 j + 1). We then analyze the contribution of the weight matrix A j+1 . Assume that
there exists an upper bound s j+1 of the norm of A j+1 . For any F2 j+1 (X ) ∈ H2 j+1 ,
there exists a W (X ) ∈ W2 j+1 such that
Lemma 5.3 guarantees that for any W (X ) ∈ W2 j+1 there exists an ε2 j+1 -cover
W2 j+2 (W ) for the function space {W (X )A j+1 : W (X ) ∈ W2 j+1 , A j+1 ≤ s j+1 },
i.e., for any W (X ) ∈ Ĥ2 j+1 , there exists a V (X ) ∈ {W (X )A j+1 : W (X ) ∈ W2 j+1 ,
A j+1 ≤ s j+1 } such that
W (X ) − V (X ) ≤ ε2 j+1 . (5.38)
As for any F2 j+1 (X ) ∈ H2 j+2 {F2 j+1 (X )A j+1 : F2 j+1 (X ) ∈ H2 j+1 , A j+1 ≤
c}, there is a F2 j+1 (X ) ∈ H2 j+1 such that
Thus, applying Eqs. (5.37), (5.38), and (5.39), we prove the following inequality
84 5 Capacity and Complexity
where !
(∗∗) = A j+1 F2 j+1 (X ) : A j+1 ∈ B j+1 .
Thus, N(W2 j+1 , ε2 j+1 , · ) · N(W2 j+2 , ε2 j+2 , · ) is an upper bound for the
ε2 j+2 -covering number of the hypotheses space Hi+1 .
(3) The vertex after the j-th weight matrix is N (2 j − 1). Suppose W2 j−1 is an
ε2 j−1 -cover of the hypothesis space H2 j−1 induced by the output hypotheses in the
vertex N (2 j − 1). Suppose there is a nonlinearity σ j directly follows the vertex
N (2 j − 1). We then analyze the contribution of the nonlinearity σ j . Assume that the
nonlinearity σ j is ρ j -Lipschitz continuous. Apparently, σ j (W2 j−1 ) is a ρε2 j−1 -cover
of the hypothesis space σ j (H2 j−1 ). Specifically, for any F ∈ σ (H2 j−1 ), there exits a
F ∈ H2 j−1 that F = σ j (F). Since W2 j−1 is an ε2 j−1 -cover of the hypothesis space
H2 j−1 , there exists a W ∈ W2 j−1 such that
We thus prove that W2 j σ j (W2 j−1 ) is a ρ j ε2 j−1 -cover of the hypothesis space
σ j (H2 j−1 ). Additionally, the covering number remains the same while applying a
nonlinearity to the neural network.
By analyzing the influence of weight matrices and nonlinearities one by one, we
can prove Eq. (5.17). As for ε, the above part indeed gives an constructive method
to obtain ε from all εi and εu,v, j . Here we omit the explicit formulation of ε in terms
of εi and εu,v, j , since it could not benefit our theory. This completes the proof.
5.3 Generalization Bound of ResNet 85
In Sect. 5.3.2.1, we give a covering bound generally for all deep neural networks with
residual connections. The result is summarised as Theorem 5.6. In this subsection,
we give a detailed proof of Theorem 5.6.
Proof of Theorem 5.6. To approach the covering bound for the deep neural networks
with residuals, we first analyze the influence of adding a vine to a deep neural network,
and then use an induction method to obtain a covering bound for the whole network.
All vines are connected with the stem at two points that is respectively after a non-
linearity and before a weight matrix. When the input Fu (X ) of the vine V (u, v, i) is
fixed, suppose all the hypothesis functions FVu,v,i (X ) computed by the vine V (u, v, i)
constitute a hypothesis space HVu,v,i . As a vine is also a chain-like neural network con-
structed by stacking a series of weight matrices and nonlinearities, we can straightly
apply Lemma 5.1 to approach an upper bound for the covering number of the hypoth-
esis space HVu,v,i . It is worth noting that vines could be identity mappings. This sit-
uation is normal in ResNet–there are 13 out of all the 16 vines are identities. For
the circumstances that the vines are identities, the hypothesis space computed by the
vine only contains one element–an identity mapping. The covering number of the
hypothesis space for the identities are apparently 1.
Applying Lemmas 5.3 and 5.1, there exists an εv -cover Wv for the hypothesis
space Hv with a covering number N(Hv , εi , · ), as well as an εVu,v,i -cover WVu,v,i
for the hypothesis space HVu,v,i with a covering number N(HVu,v,i , εi , · ).
The hypotheses computed by the vine V (u, v, i) and the deep neural network
without V (u, v, i), i.e., respectively, Fv (X ) and FVu,v,i , are added element-wisely at
the vertex V (v). We denote the space constituted by all F Fv (X ) + FVu,v,i (X ) as
Hv .
Let’s define a function space as Wv {W S + WV : W S ∈ Wv , WV ∈ WVu,v,i }.
For any hypothesis F ∈ Hv , there must exist an FS ∈ Hv and FV ∈ HVu,v,i such that
F (X ) = FS (X ) + FV (X ) . (5.44)
FS (X ) − W FS (X ) ≤ εv . (5.45)
Similarly, as WVu,v,i is an εVu,v,i -cover of HVu,v,i , we can obtain a similar result. For
any FV (X ) ∈ HVu,v,i , there exists an element W FV (X ) ∈ WVu,v,i , such that
FV (X ) − W FV (X ) ≤ εVu,v,i . (5.46)
86 5 Capacity and Complexity
F (X ) − W (X ) = FV (X ) + FS (X ) − W FV (X ) − W FS (X )
≤ FV (X ) − W FV (X ) + FS (X ) − W FS (X )
≤εVu,v,i + εv . (5.47)
Therefore, the function space Wv is an (εVu,v,i + εv )-cover of the hypothesis space
Hv . An upper bound for the cardinality of the function space Wv is given as below
(it is also an εVu,v,i + εv -covering number of the hypothesis space Hv ):
where Nv and NVu,v,i can be obtained from Eq. (5.17) in Lemma 5.1, as the stem and
all the vines are chain-like neural networks.
By adding vines to the stem one by one, we can construct the whole deep neural
network. Combining Lemma 5.1 for the covering number of Fv−1 (X ) and Fu (X ),
we further prove the following inequality:
L
N(H, ε, · ) ≤ sup N M( j+1) sup NVu,v,i . (5.49)
j=1 FM( j) (u,v,i)∈I V Fu
Thus, we prove Eq. (5.15) of Theorem 5.6. As for ε, the above part indeed gives
an constructive method to obtain ε from all εi and εu,v, j . Here we omit the explicit
formulation of ε in terms of εi and εu,v, j , since it could be extremely complex and
does not benefit our theory. The proof is completed.
In Sect. 5.3.2.2, we give a covering bound for ResNet. The result is summarized as
Lemma 5.2. In this subsection, we give a detailed proof of Lemma 5.2.
Proof of Lemma 5.2. There are 34 weight matrices and 35 nonlinearities in the stem of
the 34-ResNet. Let’s denote the weight matrices respectively as A1 , ... ,A34 and denote
the nonlinearities respectively as σ1 , ... , σ35 . Apparently, there are 34 + 35 + 1 =
70 vertexes in the network, where 34 is the number of weight matrices and 35 is
the number of nonlinearities. We denote them respectively as N (1), ... , N (70).
Additionally, there are 16 vines which are respectively denoted as V (4i − 1, 4i +
5.3 Generalization Bound of ResNet 87
3, 1), i = {1, . . . , 16}, where 4i − 1 and 4i + 3 are the indexes of the vertexes that the
vine connected. Among all the 16 vines, there are 3, V (15, 19, 1), V (31, 35, 1), and
V (55, 59, 1), respectively contain one weight matrix, while all others are identities
mappings. For the vine V (4i − 1, 4i + 3, 1), i = 4, 8, 14, we denote the weight
matrix in the vine as A4i−1,4i+3,1
1 .
Applying Theorem 5.6, we straightly prove the following inequality:
34
log N(H, ε, · ) ≤ sup log N2 j+1 + sup log NVu,v,1 , (5.50)
j=1 F2 j−1 (X ) (u,v,i)∈I V Fu (X )
where N2 j+1 is the covering number of the hypothesis space constituted by all outputs
F2 j+1 (X ) at the vertex N (2 j + 1) when the input F2 j−1 (X ) of the vertex N (2 j − 1)
is fixed, NVu,v,1 is the covering number of the hypothesis space constituted by all
outputs FVu,v,i (X ) of the vine V (u, v, 1) when the input Fv (X ) is fixed, and I V is the
index set {(4i − 1, 4i + 3, 1), i = 1, . . . , 16}.
Applying Lemma 5.3, we can further prove an upper bound for the ε2 j+1 -covering
number N2 j+1 . The bound is expressed as the following inequality:
b2j+1 F2 j+1 (X T )T 2
2
log N2 j+1 ≤ log(2W 2 ) , (5.51)
ε22 j+1
where W is the maximum dimension among all features through the ResNet, i.e.,
W = maxi n i , i = 0, 1, . . . , L. Also, we can decompose F2 j+1 (X T )T 22 and utilize
an induction method to obtain an upper bound for it.
(1) If there is no vine connected with the stem at the vertex N (2 j − 1), we have
the following inequality:
≤ρ j A j F2 j−1 (X ) − 0 T T
2
≤ρ j A j σ · F2 j−1 (X ) T T
2 . (5.52)
F2 j−3 (X T )T 2 . (5.53)
88 5 Capacity and Complexity
Therefore, based on Eqs. (5.52) and (5.53), we can prove the norm of output of
ResNet as in the main text.
Similar with N2 j+1 , we can obtain an upper bound for the εu,v,1 -covering number
NVu,v,1 . Suppose the output computed at the vertex N (u) is Fu (X T ). Then, we can
prove the following inequality:
(b1u,v,1 )2 Fu (X T )T 2
log NVu,v,1 ≤ 2
log(2W 2 ) . (5.54)
εu,v,1
2
Applying Eqs. (5.51) and (5.54) to Eq. (5.50), we thus prove Eq. (5.20).
As for the formulation of the radiuses of the covers, we also employ an induction
method.
(1) Suppose the radius of the cover for the hypothesis space computed by the
weight matrix A1 and the nonlinearity σ1 is ε3 . Then, applying Eqs. (5.40) and
(5.43), after the weight matrix A2 and the nonlinearity σ2 , we prove the following
equation:
ε3 = (s2 + 1)ρ2 ε1 . (5.55)
(2) Suppose the radius of the cover for the hypothesis space computed by the
weight matrix A j−1 and the nonlinearity σ j−1 is ε2 j−1 . Assume there is no vine
connected around. Then, similarly, after the weight matrix A2 and the nonlinearity
σ j , we prove the following equation:
(3) Suppose the radius of the cover at the vertex N (i) is εi . Assume there is a vine
V (u, u + 4, 1) links the stem at the vertex N (u) and N (u + 4). Then, similarly, after
the weight matrix A2 and the nonlinearity σ j , we prove the following equation:
ε2 j+1 =εu+2 s u−1
2
+ 1 ρ u−1
2
+ εu su,u+4,1 + 1
=εu s u−1
2
+ 1 ρ u−1
2
s u−3 + 1 ρ u−3 + εu su,u+4,1 + 1
2 2
=εu s u−1
2
+ 1 s u−3 + 1 ρ u−1 ρ u−3 + εu su,u+4,1 + 1
2 2 2
. (5.57)
From Eqs. (5.55), (5.56), and (5.57), we can obtain the following equation
ε =ε1 ρ1 (s1 + 1)ρ34 (s34 + 1)ρ35 [(∗ ∗ ∗) + 1]
1≤i≤16
i ∈{4,8,14}
/
" #
(∗ ∗ ∗) + s14i−1,4i+3,1 + 1 , (5.58)
i∈{4,8,14}
where
5.3 Generalization Bound of ResNet 89
Applying Eqs. (5.55), (5.56), and (5.57), we can prove all ε2 j+1 and εu,u+4,1 .
The proof is completed.
Proof of Theorem 5.7 We prove this theorem in 2 steps: (1) We first apply Lemma
3.2 to Lemma 5.2 in order to prove an upper bound on the Rademacher complexity
of the hypothesis space computed by ResNet; and (2) We then apply the result of (1)
to Lemma 3.1 in order to prove a generalization bound.
(1) Upper bound on the Rademacher complexity.
Applying Eq. (3.12) of Lemma 3.2 to Eq. (5.31) of Lemma 5.2, we can prove the
following inequality:
' √
4α 12 m (
R(Hλ | D ) ≤ inf √ + log N(Hλ | D , ε, · |2 )dε
α>0 m m α
' √ √
m
4α 12 R
≤ inf √ + dε
α>0 mα m ε
√
4α 12 √ m
≤ inf √ + R log . (5.62)
α>0 m m α
)
Apparently, the infinimum is reached uniquely at α = 3 mR . Here, we use a simpler
and also widely used choice α = 1
m
, and prove the following inequality:
4 18 √
R(Hλ | D ) ≤ 3 + R log m . (5.63)
n 2 m
90 5 Capacity and Complexity
8 36 √ log(1/δ)
P(arg max F(x)i = y) ≤ R̂ Sλ (F) + 3 + R log m + 3 . (5.64)
i m 2 m 2m
Conventional statistical learning theory suggests that the hypotheses learned from
datasets of different sizes constitute a Glivenko-Cantelli class (Talagrand 1987; Dud-
ley et al. 1991): the learned hypothesis converges to the target hypothesis, i.e., the
generalization bound decreases, as the size of the training dataset increases. Surpris-
ingly, however, Nagarajan and Kolter (2019) reported that many classical uniform-
convergence generalization bounds may in fact increase with the training dataset size
for interpolators. This behavior is also echoed by the phenomenon of benign over-
fitting. In response to this work, Negrea et al. (2020) defended uniform convergence
by defining a new notion called the structural Glivenko-Cantelli property for learned
hypothesis sequences. They proved that (1) unregularized and overparameterized lin-
ear regression has a structural Glivenko-Cantelli surrogate hypothesis class and (2)
removing a few bits of information from a nonstructural Glivenko-Cantelli hypoth-
esis sequence can yield a sequence with the structural Glivenko-Cantelli property,
whose generalization bounds exhibit double descent. Zhou et al. (2020) proved that
consistency cannot be shown for any set by uniformly bounding the generalization
error in a norm ball. To address this, they proved that zero-error predictors exhibit
uniform convergence in a norm ball. Yang et al. (2021) presented an exact compar-
ison between the generalization error and the uniform-convergence generalization
bound in random feature models.
References
Alex, Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–
1105.
Anders, Krogh and John A Hertz. 1992. A simple weight decay can improve generalization. In
Advances in Neural Information Processing Systems, 950–957.
Andre, Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and
Sebastian Thrun. 2017. Dermatologist-level classification of skin cancer with deep neural net-
works. nature, 542 (7639): 115–118.
Angus, Galloway, Thomas Tanay, and Graham W Taylor. 2018. Adversarial training versus weight
decay. arXiv preprint arXiv:1804.03308.
Behnam, Neyshabur, Ryota Tomioka, and Nathan Srebro. 2015. Norm-based capacity control in
neural networks. In Conference on Learning Theory, 1376–1401.
Behnam, Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. 2017. A PAC-Bayesian approach
to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564
Ben, Taskar, Carlos Guestrin, and Daphne Koller. 2004. Max-margin Markov networks. In Advances
in Neural Information Processing Systems, 25–32.
Chenxi, Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and
Li Fei-Fei. 2019. Auto-deeplab: Hierarchical neural architecture search for semantic image seg-
mentation. In IEEE/CVF conference on computer vision and pattern recognition, 82–92.
David, Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess-
che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. 2016.
Mastering the game of go with deep neural networks and tree search. Nature 529 (7587): 484–489.
Eunchun, Park, B. Wade Brorsen, and Ardian Harri. 2019. Using bayesian kriging for spatial
smoothing in crop insurance rating. American Journal of Agricultural Economics 101 (1): 330–
351.
G Pisier. 1981. Remarques sur un résultat non publié de b. maurey. Séminaire Analyse fonctionnelle
(dit” Maurey-Schwartz”), 1–12.
Gao, Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely con-
nected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition,
4700–4708.
Geert, Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso. Setio, Francesco
Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der. Laak, Bram Van Ginneken, and Clara I.
Sánchez. 2017. A survey on deep learning in medical image analysis. Medical Image Analysis
42: 60–88.
Guodong, Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. 2018b. Three mechanisms of weight
decay regularization. arXiv preprint arXiv:1810.12281.
Hao, Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. 2018c. Visualizing the
loss landscape of neural nets. In Advances in Neural Information Processing Systems.
Ian, H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. 2016. Data Mining: Practical
machine learning tools and techniques. Morgan Kaufmann.
J, Janai, F Güney, A Behl, and A Geiger. 2017. Computer vision for autonomous vehicles: problems,
datasets and state of the art. arxiv e-prints. arXiv preprint arXiv:1704.05519.
Jeffrey, Negrea, Gintare Karolina Dziugaite, and Daniel Roy. 2020. In defense of uniform con-
vergence: Generalization via derandomization with an application to interpolating predictors. In
International Conference on Machine Learning, 7263–7272.
Jinghui, Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, and Quanquan Gu. 2018a. Closing
the generalization gap of adaptive gradient methods in training deep neural networks. arXiv
preprint arXiv:1806.06763.
Jonas, Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Con-
volutional sequence to sequence learning. In International Conference on Machine learning,
1243–1252.
Kaiming, He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In IEEE inter-
national conference on computer vision, 2961–2969.
Kaiming, He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. Deep residual learning for image
recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
Klaus, Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen. Schmidhuber.
2016. Lstm: A search space odyssey. IEEE transactions on neural networks and learning systems
28 (10): 2222–2232.
Koltchinskii, Vladimir, and Dmitry Panchenko. 2002. Empirical margin distributions and bounding
the generalization error of combined classifiers. The Annals of Statistics 30 (1): 1–50.
Krizhevsky, Alex, and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images.
Technical report, Citeseer.
Laurent, Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. 2017. Sharp minima can gen-
eralize for deep nets. In International Conference on Machine learning
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521 (7553): 436.
Liang-Chieh, Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille.
2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution,
and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4):
834–848.
Lijia, Zhou, Danica J Sutherland, and Nati Srebro. 2020. On uniform convergence and low-norm
interpolation learning. In Advances in Neural Information Processing Systems, 6867–6877.
Matthew, E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Annual Conference
of the North American Chapter of the Association for Computational Linguistics, 2227–2237.
Michael, B Chang, Abhishek Gupta, Sergey Levine, and Thomas L Griffiths. 2018. Automati-
cally composing representation transformations as a means for generalization. arXiv preprint
arXiv:1807.04640.
Michel Talagrand. 1987. The glivenko-cantelli problem. The Annals of Probability, 837–870.
Nick, Harvey, Christopher Liaw, and Abbas Mehrabian. 2017. Nearly-tight VC-dimension bounds
for piecewise linear neural networks. In Annual Conference on Learning Theory, 1064–1068.
Nitish Shirish, Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping
Tak Peter Tang. 2017. On large-batch training for deep learning: Generalization gap and sharp
minima. In International Conference on Learning Representations.
Noah, Golowich, Alexander Rakhlin, and Ohad Shamir. 2018. Size-independent sample complexity
of neural networks. In Annual Conference on Learning Theory, 297–299.
Paul, Goldberg, and Mark Jerrum. 1993. Bounding the vapnik-chervonenkis dimension of con-
cept classes parameterized by real numbers. In Proceedings of the sixth annual conference on
Computational learning theory, 361–369.
Paul, W Goldberg, and Mark R. Jerrum. 1995. Bounding the Vapnik-Chervonenkis dimension of
concept classes parameterized by real numbers. Machine Learning 18 (2–3): 131–148.
Peter, Bartlett and John Shawe-Taylor. 1999. Generalization performance of support vector
machines and other pattern classifiers. Advances in Kernel methods-Support Vector Learning,
43–54.
Peter, L Bartlett, Dylan J Foster, and Matus J Telgarsky. 2017. Spectrally-normalized margin bounds
for neural networks. In Advances in Neural Information Processing Systems, 6240–6249.
Peter, L Bartlett, Vitaly Maiorov, and Ron Meir. 1999. Almost linear VC dimension bounds for
piecewise polynomial networks. In Advances in Neural Information Processing Systems, 190–
196.
Richard, M Dudley, Evarist Giné, and Joel Zinn. 1991. Uniform and universal glivenko-cantelli
classes. Journal of Theoretical Probability 4 (3): 485–510.
Robert, E Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. 1998. Boosting the margin: A
new explanation for the effectiveness of voting methods. The Annals of Statistics 26 (5): 1651–
1686.
Roman, Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein.
2018. Sensitivity and generalization in neural networks: an empirical study. In International
Conference on Learning Representations.
Saining, Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual
transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern
Recognition, 1492–1500.
Stephen, Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing
lstm language models. In International Conference on Learning Representations.
Takeru, Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 2018. Spectral normal-
ization for generative adversarial networks. In International Conference on Learning Represen-
tations.
Tong Zhang. 2002. Covering number bounds of certain regularized linear function classes. Journal
of Machine Learning Research, 2 (Mar): 527–550.
Tsung-Yi, Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie.
2017. Feature pyramid networks for object detection. In IEEE conference on computer vision
and pattern recognition, 2117–2125.
Vaishnavh, Nagarajan and J Zico Kolter. 2019. Uniform convergence may be unable to explain
generalization in deep learning. In Advances in Neural Information Processing Systems.
Vladimir Vapnik. 2006. Estimation of Dependences based on Empirical Data. Springer Science &
Business Media.
Yichun, Shi, Charles Otto, and Anil K. Jain. 2018. Face clustering: representation and pairwise
constraints. IEEE Transactions on Information Forensics and Security 13 (7): 1626–1640.
Zhuozhuo, Tu, Fengxiang He, and Dacheng Tao. 2020. Understanding generalization in recurrent
neural networks. In International Conference on Learning Representations.
Zitong, Yang, Yu Bai, and Song Mei. 2021. Exact gap between generalization error and uniform con-
vergence in random feature models. In International Conference on Machine Learning, 11704–
11715.
Chapter 6
Stochastic Gradient Descent as an
Implicit Regularization
A natural tool for optimizing the expected risk is the gradient descent (GD) method.
Denote an estimator with parameter θ by F_θ. Specifically, the gradient of the empirical
risk with respect to the t-th iteration parameters θ(t) can be expressed as
g(\theta(t)) = \nabla_{\theta(t)} \hat R(\theta(t)) = \frac{1}{m}\sum_{n=1}^{m} \nabla_{\theta(t)}\, l\big(F_{\theta(t)}(x_n), y_n\big),

and the parameters are updated by GD as

\theta(t+1) = \theta(t) - \eta\, g(\theta(t)),
where θ (t) represents the parameters in iteration t and η > 0 is the learning rate.
In SGD, the gradient g(θ ) is estimated from minibatches of the training sample
set. Let S be the index set of a minibatch, in which all indices are drawn in an i.i.d.
manner from {1, 2, . . . , m}, where m is the number of training samples. Then, the
iterative process of SGD on minibatch S can be defined as follows:
\hat g_S(\theta(t)) = \nabla_{\theta(t)} \hat R_S(\theta(t)) = \frac{1}{|S|}\sum_{n \in S} \nabla_{\theta(t)}\, l\big(F_{\theta(t)}(x_n), y_n\big)   (6.1)

and

\theta(t+1) = \theta(t) - \eta\, \hat g_S(\theta(t)),   (6.2)

where

\hat R_S(\theta) = \frac{1}{|S|}\sum_{n \in S} l\big(F_\theta(x_n), y_n\big)

is the empirical risk on the minibatch and |S| is the cardinality of the set S. For brevity, we adopt the notation l(F_θ(x_n), y_n) = l_n(θ).
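As a concrete illustration of Eqs. (6.1) and (6.2), the following minimal NumPy sketch runs SGD on a least-squares problem; the model, data, batch size, and learning rate are all illustrative assumptions rather than settings from the text.

```python
import numpy as np

# Minimal SGD sketch of Eqs. (6.1)-(6.2) for least-squares regression,
# with F_theta(x) = x @ theta and l(F_theta(x), y) = 0.5 * (x @ theta - y)**2.
rng = np.random.default_rng(0)
m, d = 1000, 10
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

theta = np.zeros(d)
eta, batch_size = 0.05, 32

for t in range(2000):
    S = rng.integers(0, m, size=batch_size)      # i.i.d. minibatch indices
    residual = X[S] @ theta - y[S]
    g_hat = X[S].T @ residual / batch_size       # Eq. (6.1): minibatch gradient estimate
    theta = theta - eta * g_hat                  # Eq. (6.2): SGD update

print("final empirical risk:", 0.5 * np.mean((X @ theta - y) ** 2))
```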
Intensive studies have been conducted on the generalizability of SGMs under both
continuous and convex losses.
Ying and Pontil (2008) proved a generalization bound of the following order for a fixed learning rate \eta_t = m^{-\frac{2\zeta}{2\zeta+1}}:

O\left( m^{-\frac{2\zeta}{2\zeta+1}} \log m \right),

where ζ is a constant.
Later, Dieuleveut and Bach (2016) improved this result to the following average order when 2ζ + γ > 1:

O\left( m^{-\frac{2\min\{\zeta,1\}}{2\min\{\zeta,1\}+\gamma}} \right).
Other related works include (Lin et al. 2016; Lin and Rosasco 2016; Chen et al.
2016; Wei et al. 2017). However, in deep learning, the loss surface is usually highly
nonconvex, which invalidates the previous results.
Results for nonconvex loss surfaces also exist in the literature. Related works include
analyses obtaining generalization bounds based on algorithmic stability and Probably
Approximately Correct (PAC)-Bayesian theory.
Algorithmic stability. Bousquet and Elisseeff (2002) proposed employing algo-
rithmic stability to measure the stability of the output hypothesis when the training
sample set is disturbed. Many versions of algorithmic stability have been proposed;
a popular one is presented as follows.
Definition 6.1 (Uniform stability; cf. Bousquet and Elisseeff (2002), Definition 6)
A machine learning algorithm A is uniformly stable if, for any neighbouring sample pair S and S′ that differ by only one example, we have the following inequality:

\left| \mathbb{E}_{A(S)}\, l(A(S), z) - \mathbb{E}_{A(S')}\, l(A(S'), z) \right| \le \beta,

where z is an arbitrary example; A(S) and A(S′) are the output hypotheses learned on the training sets S and S′, respectively; and β is a positive real constant. The constant β is called the uniform stability of algorithm A.
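The following sketch illustrates the spirit of Definition 6.1 on a simple learner (ridge regression): it trains on a sample S and on a neighbouring S′ that differs in one example and compares the losses of the two hypotheses on arbitrary evaluation points. It is only a Monte-Carlo proxy for the supremum in the definition, and all data and constants are assumptions.

```python
import numpy as np

# Illustrative proxy for uniform stability: train on S and on a neighbouring S'
# that differs in exactly one example, then compare the losses of the two
# hypotheses on arbitrary examples z.
rng = np.random.default_rng(1)
m, d, lam = 200, 5, 1.0
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

def ridge(X, y, lam):
    # closed-form ridge solution, standing in for a generic learning algorithm A
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

theta_S = ridge(X, y, lam)
X_prime, y_prime = X.copy(), y.copy()
X_prime[0], y_prime[0] = rng.normal(size=d), rng.normal()   # replace one example
theta_S_prime = ridge(X_prime, y_prime, lam)

Z = rng.normal(size=(1000, d))            # arbitrary evaluation examples z
z_targets = Z @ rng.normal(size=d)
loss_S = 0.5 * (Z @ theta_S - z_targets) ** 2
loss_S_prime = 0.5 * (Z @ theta_S_prime - z_targets) ** 2
print("empirical stability proxy:", np.max(np.abs(loss_S - loss_S_prime)))
```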
When SGD is modelled as a stochastic process, the learned hypothesis can be regarded as being drawn from the steady distribution of this stochastic process. We can ultimately analyse the generalization and optimization of deep learning on the basis of this steady distribution.
An SGM is usually formulated as shown in Eqs. (6.1) and (6.2). The stochastic
gradient introduces extra noise into the parameter updating. When the gradient noise
can be modelled as a Gaussian distribution, the SGM reduces to stochastic gradient
Langevin dynamics (SGLD), expressed as follows (Mandt et al. 2017):
\Delta\theta(t) = \theta(t+1) - \theta(t) = -\eta\,\hat g_S(\theta(t)) = -\eta\, g(\theta) + \sqrt{\frac{2\eta}{\beta}}\, W, \qquad W \sim N(0, I),

where β is the inverse temperature parameter.
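A minimal sketch of the SGLD update above is given below; the quadratic loss, the data, and the choices of η and β are illustrative assumptions.

```python
import numpy as np

# Minimal SGLD sketch: the SGD step plus Gaussian noise of variance 2*eta/beta.
rng = np.random.default_rng(2)
m, d = 500, 3
X = rng.normal(size=(m, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)

theta = np.zeros(d)
eta, beta, batch_size = 1e-3, 100.0, 32

for t in range(5000):
    S = rng.integers(0, m, size=batch_size)
    g_hat = X[S].T @ (X[S] @ theta - y[S]) / batch_size        # minibatch gradient
    noise = np.sqrt(2.0 * eta / beta) * rng.normal(size=d)     # Langevin noise term
    theta = theta - eta * g_hat + noise

print("SGLD iterate after 5000 steps:", np.round(theta, 3))
```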
Theorem 6.2 Suppose that the loss function is smooth, L-Lipschitz, and no larger than 1 for all data. If we run an SGM with monotonically non-increasing learning rates α_t ≤ c/t for T steps, then the SGM is uniformly stable with

\beta \le \frac{1 + 1/c}{m - 1}\left(2 c L^2\right)^{\frac{1}{c+1}} T^{\frac{c}{c+1}}.
The proof of this theorem mimics a proof of convergence. Suppose that the loss function l is L-Lipschitz with respect to the weights for any example, i.e.,

\left| l(F_\theta(x), y) - l(F_{\theta'}(x), y) \right| \le L\,\|\theta - \theta'\|.

In other words, the stability of the loss can be measured in terms of the stability of the weights. Therefore, one can simply analyse how the weights diverge as a function of time t. Generalization bounds depending on the learning rates and the number of iterations can then be presented. Furthermore, we can easily derive a generalization bound based on this theorem.
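The following sketch illustrates this argument numerically: it runs SGD with shared sampling randomness on two training sets that differ in a single example and tracks how the weights diverge with t. The least-squares model, the constant c, and the data are assumptions made for illustration.

```python
import numpy as np

# Track how SGD weights diverge on two training sets differing in one example,
# using shared sampling randomness and decaying step sizes alpha_t <= c / t.
rng = np.random.default_rng(3)
m, d = 500, 5
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)
X_prime, y_prime = X.copy(), y.copy()
X_prime[0], y_prime[0] = rng.normal(size=d), rng.normal()   # neighbouring dataset

theta, theta_prime = np.zeros(d), np.zeros(d)
c = 1.0
sampler = np.random.default_rng(4)                          # shared sampling randomness
for t in range(1, 2001):
    i = sampler.integers(0, m)
    alpha_t = c / t                                          # decaying step size
    theta -= alpha_t * X[i] * (X[i] @ theta - y[i])
    theta_prime -= alpha_t * X_prime[i] * (X_prime[i] @ theta_prime - y_prime[i])
    if t % 500 == 0:
        print(f"t={t:4d}  ||theta - theta'|| = {np.linalg.norm(theta - theta_prime):.5f}")
```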
Under five additional assumptions, Raginsky et al. (2017) proved the following
upper bound for the expected excess risk:
\tilde O\left( \frac{\beta(\beta + U)^2}{\lambda_*}\,\delta^{1/4}\log\frac{1}{\delta} + \frac{(\beta + U)^2}{\lambda_* + m} \right) + \tilde O\left( \frac{U\log(\beta + 1)}{\beta} \right),

provided that

k = \tilde\Omega\left( \frac{\beta(\beta + U)}{\lambda_*}\log^5\frac{1}{\delta^4} \right),
\qquad \eta \le \frac{\delta^4}{\log(1/\delta)},
\qquad\text{and}\qquad
\delta \in \left( 0,\ \frac{\lambda_*}{4M^2} \wedge e^{-\tilde O\left(\beta(\beta + U)/\lambda_*\right)} \right).
\left| f(X, Y) - f(\bar X, \bar Y) \right| \le \sqrt{2\sigma^2 I(X; Y)},

\frac{1}{m}\sum_{i=1}^{m}\sqrt{2 R^2 I(\theta; z_i)},   (6.5)

P_{\tilde\theta, \tilde z} = P_\theta \otimes P_z.

Then, the generalization error has the same upper bound expressed in Eq. (6.5).
In contrast with Xu and Raginsky (2017), Bu et al. (2020) proposed an "individual" version of the metric measuring the information transferred from the data into the learned model; see Eq. (6.5), in which the mutual information with the whole training sample S is replaced by the mutual information with the individual data points, I(θ; z_i).
Mou et al. (2018a) modelled SGLD as Langevin diffusion dynamics and then proved an O(1/m) generalization bound and an O(1/√m) generalization bound for SGLD via algorithmic stability and PAC-Bayesian theory, respectively:
• Algorithmic stability: Assume that C is an upper bound of the loss function, the hypothesis is L-Lipschitz continuous, and the learning rate satisfies \eta_t \le \frac{\log 2}{\beta L^2} for all t; then, the generalization error has the following upper bound on average:

O\left( \frac{2LC}{m}\sqrt{\beta\sum_{t=1}^{N}\eta_t} \right),   (6.6)
• PAC-Bayesian theory: Suppose that the empirical risk has an l_2-norm regularization on the weights, i.e., \frac{\lambda}{2}\|\theta\|^2, and that the loss function is sub-Gaussian. Then, there is the following high-probability generalization bound:

O\left( \sqrt{\frac{\beta}{m}\sum_{k=1}^{N}\eta_k\, e^{-\frac{\lambda}{3}(T_N - T_k)}\,\mathbb{E}\left[\|g_k(\theta_k)\|^2\right]} \right),

where T_k = \sum_{i=1}^{k}\eta_i, β is the temperature parameter, and g_k(θ_k) is the gradient at iteration k.
He and Su (2020) empirically showed that neural networks are locally elastic: the prediction of an SGD-trained predictor at an instance x′ is not significantly perturbed when an update is performed on a dissimilar example x. This phenomenon inspired Deng et al. (2020) to propose a localized version of algorithmic stability.
Definition 6.2 (Locally elastic stability; cf. Deng et al. (2020), Definition 1) A machine learning algorithm A has locally elastic stability β_m(·,·) with respect to the loss function l if, for any sample S ∈ Z^m, z_i ∈ S, and z ∈ Z, the following inequality holds:

\left| l(A(S), z) - l(A(S^{\setminus i}), z) \right| \le \beta_m(z_i, z),

where S^{\setminus i} denotes the sample S with the i-th example z_i removed.
Theorem 6.3 Let algorithm A have β_m(·,·)-locally elastic stability. Suppose that ‖z‖ ≤ 1 for any z ∈ Z and that β_m(·,·) = β(·,·)/m, where β(·,·) is independent of the sample size m and β(·,·) ≤ M_β. Then, for any 0 < δ < 1 and η > 0, with probability at least 1 − δ, we have

R(A(S)) - \hat R_S(A(S)) \le \frac{2\sup_{z'\in\mathcal Z}\mathbb{E}\,\beta(z', z)}{m} + 2\left(2\sup_{z'\in\mathcal Z}\mathbb{E}\,\beta(z', z) + \eta + M_l\right)\sqrt{\frac{2\log(2/\delta)}{m}}.
Chen et al. (2018b) proved the existence of a trade-off between stability and convergence for all iterative algorithms under either the convex and smooth or the strongly convex and smooth assumption, leading to an O(1/m) generalization bound. Other advances
include the works of (He et al. 2019; Tzen et al. 2018; Zhou et al. 2018; Wen et al.
2019; Li et al. 2019).
In Bayesian statistics, prior distributions play a crucial role in encoding our beliefs or
expectations about model parameters before observing any data. Unlike conventional
machine learning algorithms that assume full understanding of the data, Bayesian
methods require us to specify priors independently of the training data. This ensures
that our models start with a neutral or uninformative stance regarding the underlying
data characteristics.
Common choices for priors in Bayesian inference include uniform distributions
(assigning equal probability to all possible values) or Gaussian distributions (assum-
ing a normal distribution of parameter values). These priors are often selected to
reflect minimal assumptions about the data, allowing the data itself to shape the
posterior distribution through Bayesian updating.
Importantly, Bayesian priors are designed to have diminishing influence as more
data is incorporated into the model during training. This property helps ensure that
the model’s performance improves based primarily on the observed data rather than
the initial prior assumptions.
Recent advances in probabilistic learning, such as PAC-Bayesian theory, aim to
extend traditional Bayesian principles to more complex scenarios, relaxing strict
assumptions about data-independent priors while preserving the theoretical under-
pinnings of Bayesian inference. This approach allows for a more flexible and nuanced
integration of prior knowledge and observed data in learning algorithms.
Distribution-dependent priors. In machine learning theory, distribution-
dependent priors are designed to leverage knowledge about the underlying data dis-
tribution without directly relying on the specific training dataset used. This concept
is rooted in the notion that the data-generating process and its distribution exist inde-
pendently of the actual training data collection (Lever et al. 2013). By incorporating
such priors, researchers aim to refine generalization bounds, which are fundamental
in assessing the theoretical performance and predictive ability of machine learning
models. Specifically, the utilization of distribution-dependent priors has been shown
to be able to considerably tighten the generalization bounds.
Following this line of reasoning, Lever et al. (2013) tightened the PAC-Bayesian bound (see Theorem 2.7) to the following bound. Given any positive constant C, with probability at least 1 − δ > 0 with respect to the drawn sample, the generalization error is upper bounded as follows:

R(Q(h)) - C^* \hat R_S(Q(h)) \le \frac{C^*}{Cm}\left( \lambda\sqrt{\frac{2}{m}\log\frac{2\xi(m)}{\delta}} + \frac{\gamma^2}{2m} + \log\frac{2}{\delta} \right),
where \sqrt{\sigma_t^2}\, I is assumed as the gradient noise, C > 0 is the upper bound of the loss function, T is the number of iterations, η_t is the step size at iteration t, and

g_e(t) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{W_{t-1}} \left\| \nabla F(W_{t-1}, z_i) \right\|_2.
Data-dependent priors. Negrea et al. (2019) further pushed the frontier of gen-
eralization bound analysis by constructing priors that are not independent of the data.
Suppose that set S J ⊂ S has a size of n < m. It is natural to design a prior exploiting
S J to derive a data-dependent forecast of the posterior Q. We denote this subset by
S J = {z j1 , . . . , z jn }, where all indices constitute a set J . The index set J is randomly
drawn from {1, . . . , m}. Suppose that we have the following generalization bound:
\mathbb{E}_F\left[ R(\theta) - \hat R_{S_J}(\theta) \right] \le B,
Further works following this approach include Haghifam et al. (2020), building on Steinke and Zakynthinou (2020), as well as Dziugaite et al. (2020) and Hellström and Durisi (2020), and their applications (Cherian et al. 2020).
The last decade has seen the dramatic success of deep neural networks (Goodfellow
et al. 2016) based on the stochastic gradient descent (SGD) optimization method (Bot-
tou et al. 1998; Sutskever et al. 2013). The task of fine tuning the hyper-parameters
of SGD to make neural networks generalize well is both critical and challenging.
Some works have addressed the strategies of tuning hyper-parameters (Dinh et al.
2017; Goyal et al. 2017; Keskar et al. 2017) and the generalization ability of SGD
(Chen et al. 2018b; Hardt et al. 2016b; Lin et al. 2016; Mou et al. 2018a; Pensia et al.
2018). However, there is still a lack of solid evidence to validate the effectiveness of
training strategies for tuning the hyper-parameters of neural networks.
In this section, we present both theoretical and empirical evidence for a more
effective training strategy for deep neural networks:
When employing SGD to train deep neural networks, we should ensure that the batch size
is not too large and the learning rate is not too low, to make the networks generalize well.
This strategy gives a guide to tune the hyper-parameters that helps neural networks
achieve good test performance when the training error is small. It is derived from the
following property:
The generalizability of deep neural networks has a negative correlation with the ratio of
batch size to learning rate.
Regarding theoretical evidence, we prove a novel PAC-Bayes (McAllester 1999a, b) upper bound for the generalization error of deep neural networks trained by SGD. The positive correlation of the proposed generalization bound with the ratio of batch size to learning rate indicates that a larger ratio leads to poorer generalizability of neural networks. This result builds the theoretical foundation of the training strategy.
From the empirical aspect, we conduct extensive systematic experiments while
strictly controlling unrelated variables to investigate the influence of batch size and
learning rate on generalizability. Specifically, we trained 1,600 neural networks
based on two popular architectures, ResNet-110 (He et al. 2016a, b) and VGG-19
(Simonyan and Zisserman 2015), on two standard datasets, CIFAR-10 and CIFAR-
100 (Krizhevsky and Hinton 2009). The accuracies on the test set of all the networks
are collected for analysis. Since the training error is almost the same across all the
networks (it is almost 0), the test accuracy is an informative index to express the
model generalizability. Evaluation is then performed on 164 groups of the collected
data. The Spearman's rank-order correlation coefficients and the corresponding p values are calculated to assess the statistical significance of the observed correlations.
Fig. 6.1 Scatter plots of test-set accuracy against the ratio of batch size to learning rate. Each point represents a model; 1,600 points are plotted in total.
In this section, we explore and develop the theoretical foundations for the training
strategy. The main ingredient is a PAC-Bayes generalization bound of deep neural
networks based on the SGD optimization method. The generalization bound has a
positive correlation with the ratio of batch size to learning rate. This correlation
validates the effectiveness of the presented training strategy.
Let l_n(θ) = l(F_θ(x_n), y_n) be the contribution to the overall loss from a single observation x_n, n = 1, ..., m. Apparently, both l_n(θ) and R̂_S(θ) are unbiased estimators of the expected risk R(θ), while ∇_θ l_n(θ) and ĝ_S(θ) are both unbiased estimators of the gradient g(θ) = ∇_θ R(θ):

\mathbb{E}\left[l_n(\theta)\right] = \mathbb{E}\left[\hat R_S(\theta)\right] = R(\theta),   (6.7)

\mathbb{E}\left[\nabla_\theta l_n(\theta)\right] = \mathbb{E}\left[\hat g_S(\theta)\right] = g(\theta) = \nabla_\theta R(\theta),   (6.8)
We further assume that the gradient calculated from a single sample point follows a Gaussian distribution centered at the true gradient g(θ):

\nabla_\theta l_n(\theta) \sim N\left(g(\theta), C\right),   (6.9)

where C is the covariance matrix and is a constant matrix for all θ. As covariance matrices are (semi) positive-definite, for brevity, we suppose that C can be factorized as C = BB^\top. This assumption can be justified by the central limit theorem when the sample size m is large enough compared with the batch size |S|.
Therefore, the stochastic gradient is also drawn from a Gaussian distribution centered at g(θ):

\hat g_S(\theta) = \frac{1}{|S|}\sum_{n \in S}\nabla_\theta l_n(\theta) \sim N\left(g(\theta), \frac{1}{|S|} C\right).   (6.10)
SGD uses the stochastic gradient ĝ_S(θ) to iteratively update the parameter θ to minimize the function R(θ):

\Delta\theta(t) = \theta(t+1) - \theta(t) = -\eta\,\hat g_S(\theta(t)) = -\eta\, g(\theta) + \frac{\eta}{\sqrt{|S|}}\, B W, \qquad W \sim N(0, I).   (6.11)
Equation (6.11) expresses a well-known stochastic process called the Ornstein-
Uhlenbeck process (Uhlenbeck and Ornstein 1930).
Furthermore, we assume that the loss function in the local region around the
minimum is convex and 2-order differentiable:
R(\theta) = \frac{1}{2}\theta^\top A\,\theta,   (6.12)
where A is the Hessian matrix around the minimum and is a (semi) positive-definite
matrix. This assumption has been primarily demonstrated by empirical works (see
Li et al. (2018c, p. 1, Figs. 1(a) and 1(b) and p. 6, Figs. 4(a) and 4(b))). Without loss
of generality, we assume that the global minimum of the objective function R(θ) is 0 and is achieved at θ = 0. General cases can be obtained by translation operations, which do not change the geometry of the objective function or the corresponding generalization ability.
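The following simulation sketches this Ornstein-Uhlenbeck view of SGD: it iterates Eq. (6.11) for a quadratic loss of the form (6.12) with Gaussian gradient noise and estimates the covariance of the resulting stationary distribution. The matrices A and B, the learning rate, and the batch size are arbitrary assumptions.

```python
import numpy as np

# Simulate SGD as an Ornstein-Uhlenbeck process: quadratic loss
# R(theta) = 0.5 * theta^T A theta with Gaussian gradient noise of covariance C/|S|.
rng = np.random.default_rng(5)
d = 3
A = np.diag([1.0, 2.0, 4.0])                    # Hessian around the minimum
B = rng.normal(size=(d, d)) / np.sqrt(d)
C = B @ B.T                                     # single-sample gradient covariance
eta, batch = 0.05, 32

theta = np.ones(d)
samples = []
for t in range(200_000):
    noise = B @ rng.normal(size=d) / np.sqrt(batch)
    theta = theta - eta * (A @ theta + noise)   # Eq. (6.11)
    if t > 50_000:                              # discard burn-in iterations
        samples.append(theta.copy())

Sigma_hat = np.cov(np.array(samples).T)
print("empirical stationary covariance:\n", np.round(Sigma_hat, 5))
```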
From the results on the Ornstein-Uhlenbeck process, Eq. (6.11) has an analytic stationary distribution:

q(\theta) = M \exp\left(-\frac{1}{2}\theta^\top \Sigma^{-1}\theta\right),   (6.13)

where M is the normalization constant. Let Q denote this stationary distribution of the parameters learned by SGD, and let P denote a prior distribution over the hypothesis space. We write

R(Q) = \mathbb{E}_{\theta\sim Q}\, R(\theta), \qquad \hat R_S(Q) = \mathbb{E}_{\theta\sim Q}\, \hat R_S(\theta).
Then, a classic result uniformly bounding the expected risk R(Q) in terms of the
empirical risk R̂ S (Q) and the KL divergence K L(Q||P) is as follows.
Lemma 6.1 (see (McAllester 1999a), Theorem 1) For any positive real δ ∈ (0, 1),
with probability at least 1 − δ over a sample of size m, we have the following inequal-
ity for all distributions Q:
R(Q) \le \hat R_S(Q) + \sqrt{\frac{KL(Q\|P) + \log\frac{1}{\delta} + \log m + 2}{2m - 1}}.   (6.14)
Theorem 6.5 For any positive real δ ∈ (0, 1), with probability at least 1 − δ over
a training sample set of size m, we have the following inequality for the distribution
Q of the output hypothesis function of SGD:
R(Q) \le \hat R_S(Q) + \sqrt{\frac{\frac{\eta}{2|S|}\mathrm{tr}(C A^{-1}) - \log(\det(\Sigma)) - d + 2\log\frac{1}{\delta} + 2\log m + 4}{4m - 2}},   (6.16)

and

\Sigma A + A\Sigma = \frac{\eta}{|S|} C,   (6.17)

where A is the Hessian matrix of the loss function around the local minimum, B is the factor of the covariance matrix C = BB^\top of the gradients calculated from single sample points, Σ is the covariance matrix of the stationary distribution of the SGD iterates, and d is the dimension of the parameter θ (the network size).
The proof of this generalization bound involves two key steps that draw upon con-
cepts from SDEs and the PAC-Bayes framework. (1) We use SDE results to identify
the stationary solution of the latent Ornstein-Uhlenbeck process described by Eq.
(6.11), which captures the iterative update process of SGD. This step helps establish
the behavior and stability of SGD over time. (2) We adapt the PAC-Bayes framework,
which is a theoretical framework for generalization analysis in machine learning, to
derive the generalization bound based on insights gained from the stationary dis-
tribution identified in the first step. The PAC-Bayes framework allows us to reason
about the performance and predictive ability of SGD in terms of its convergence to
a stable solution and its ability to generalize well.
To prove Theorem 6.5, we first present the following lemma.
Lemma 6.2 (cf. Mandt et al. (2017), pp. 27–28, Appendix B) Under the 2-order differentiable assumption (Eq. (6.12)), the stationary distribution of the Ornstein-Uhlenbeck process (Eq. (6.11)) is

q(\theta) = M \exp\left(-\frac{1}{2}\theta^\top \Sigma^{-1}\theta\right),   (6.18)

where the covariance matrix Σ satisfies

\Sigma A + A\Sigma = \frac{\eta}{|S|} C.   (6.19)
Proof Under the quadratic approximation (6.12), the SGD update (6.11) can be approximated by the continuous-time stochastic differential equation

d\theta(t) = -\eta A\,\theta(t)\, dt + \frac{\eta}{\sqrt{|S|}}\, B\, dW(t),   (6.20)

where W(t) is a white noise and follows N(0, I). From Eq. (6.18), we know that

\Sigma = \mathbb{E}_{\theta\sim Q}\left[\theta\theta^\top\right].   (6.21)
1 In scenarios where there is limited prior knowledge about latent model parameters, it is common
practice to set priors as distributions that convey minimal information, such as Gaussian or uniform
distributions. This choice is driven by the following two primary considerations. (1) Algorithms
based on Bayesian statistics are expected to converge to stationary distributions given sufficient
time and data. The existence and uniqueness of these stationary solutions are assumed based on the
latent stochastic differential equation governing the iterative process. This assumption provides a
theoretical foundation for the convergence of Bayesian algorithms, ensuring that the learned model
stabilizes over time. (2) Setting priors requires caution because we cannot assume prior knowledge
about the target hypothesis function before initiating model training. Therefore, the choice of non-
informative priors ensures that the learning process remains unbiased and adapts to the available
data without undue influence from assumed prior beliefs. This approach is fundamental in statistical
learning theory to avoid overfitting and to facilitate robust generalization to new, unseen data.
We choose the prior distribution P over the hypothesis space to be the standard Gaussian N(0, I). Then, the densities of the prior P and the stationary distribution Q are

p(\theta) = \frac{1}{\sqrt{(2\pi)^d \det(I)}} \exp\left(-\frac{1}{2}\theta^\top I\,\theta\right),   (6.23)

q(\theta) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\left(-\frac{1}{2}\theta^\top \Sigma^{-1}\theta\right),   (6.24)

where Eq. (6.24) comes from Eq. (6.18) by calculating the normalizer M. Therefore,

\log\frac{q(\theta)}{p(\theta)} = \frac{1}{2}\log\frac{1}{\det(\Sigma)} + \frac{1}{2}\theta^\top I\,\theta - \frac{1}{2}\theta^\top \Sigma^{-1}\theta.   (6.25)
Applying Eq. (6.25) to Eq. (6.15), we can calculate the KL divergence between the distributions Q and P (we assume Θ = R^d):

KL(Q\|P) = \int_{\theta\in\Theta} \log\frac{q(\theta)}{p(\theta)}\, q(\theta)\, d\theta
= \int_{\theta\in\Theta} \left( \frac{1}{2}\log\frac{1}{\det(\Sigma)} + \frac{1}{2}\theta^\top I\,\theta - \frac{1}{2}\theta^\top\Sigma^{-1}\theta \right) q(\theta)\, d\theta
= \frac{1}{2}\log\frac{1}{\det(\Sigma)} + \frac{1}{2}\int_{\theta\in\Theta}\theta^\top I\,\theta\, q(\theta)\, d\theta - \frac{1}{2}\int_{\theta\in\Theta}\theta^\top\Sigma^{-1}\theta\, q(\theta)\, d\theta
= \frac{1}{2}\log\frac{1}{\det(\Sigma)} + \frac{1}{2}\mathbb{E}_{\theta\sim N(0,\Sigma)}\left[\theta^\top I\,\theta\right] - \frac{1}{2}\mathbb{E}_{\theta\sim N(0,\Sigma)}\left[\theta^\top\Sigma^{-1}\theta\right]
= \frac{1}{2}\log\frac{1}{\det(\Sigma)} + \frac{1}{2}\mathrm{tr}(\Sigma - I).   (6.26)
Therefore, right-multiplying both sides of Eq. (6.19) by A^{-1} gives

A\Sigma A^{-1} + \Sigma = \frac{\eta}{|S|} C A^{-1}.   (6.28)

After calculating the trace of both sides, we have the following equation:

\mathrm{tr}\left(A\Sigma A^{-1} + \Sigma\right) = \frac{\eta}{|S|}\mathrm{tr}\left(C A^{-1}\right).   (6.29)

Since \mathrm{tr}(A\Sigma A^{-1}) = \mathrm{tr}(\Sigma), it follows that

2\,\mathrm{tr}(\Sigma) = \frac{\eta}{|S|}\mathrm{tr}\left(C A^{-1}\right).   (6.30)

Therefore,

\mathrm{tr}(\Sigma) = \frac{1}{2}\mathrm{tr}\left(\frac{\eta}{|S|} C A^{-1}\right) = \frac{1}{2}\frac{\eta}{|S|}\mathrm{tr}\left(C A^{-1}\right).   (6.31)

Noting that

\mathrm{tr}(I) = d,   (6.32)

and applying Eqs. (6.31) and (6.32) to Eq. (6.26), we obtain

KL(Q\|P) \le \frac{1}{4}\frac{\eta}{|S|}\mathrm{tr}(C A^{-1}) - \frac{1}{2}\log(\det(\Sigma)) - \frac{1}{2} d.   (6.33)
Equation (6.33) provides an upper bound on the KL divergence between the sta-
tionary distribution of SGD weights and the prior distribution over the hypothesis
space. This divergence quantifies how far the learned distribution is away from the
prior. By leveraging the monotonicity of the generalization bound with respect
to the KL divergence, we can extend this insight to derive a PAC-Bayesian general-
ization bound for SGD. This process involves incorporating the KL divergence bound
(Eq. (6.33)) into the PAC-Bayesian framework outlined in Eq. (6.14) of Lemma 6.1,
enabling us to quantify the trade-off between model complexity and generalization
performance in a probabilistic context.
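A short numerical sketch of this reasoning is given below: for assumed A, C, η, |S|, m, and δ, it solves the stationarity equation of Lemma 6.2 with SciPy's Lyapunov solver, evaluates the KL divergence of Eq. (6.26) together with the bound of Eq. (6.33), and plugs the result into the PAC-Bayesian bound of Eq. (6.14). All numbers are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Solve Sigma A + A Sigma = (eta/|S|) C for the stationary covariance (Lemma 6.2),
# compute KL(Q||P) for Q = N(0, Sigma) and P = N(0, I), compare with Eq. (6.33),
# and plug the result into the PAC-Bayesian bound of Eq. (6.14).
rng = np.random.default_rng(6)
d = 3
A = np.diag([1.0, 2.0, 4.0])
B = rng.normal(size=(d, d)) / np.sqrt(d)
C = B @ B.T
eta, batch, m, delta = 0.05, 32, 50_000, 0.01

# solve_continuous_lyapunov solves A X + X A^H = Q; with symmetric A this is Eq. (6.19)
Sigma = solve_continuous_lyapunov(A, (eta / batch) * C)

kl_exact = 0.5 * (np.log(1.0 / np.linalg.det(Sigma)) + np.trace(Sigma) - d)       # Eq. (6.26)
kl_bound = (eta / (4 * batch)) * np.trace(C @ np.linalg.inv(A)) \
           - 0.5 * np.log(np.linalg.det(Sigma)) - 0.5 * d                         # Eq. (6.33)
pac_bayes_gap = np.sqrt((kl_bound + np.log(1 / delta) + np.log(m) + 2) / (2 * m - 1))  # Eq. (6.14)

print(f"KL(Q||P) exact: {kl_exact:.6f}   bound (6.33): {kl_bound:.6f}")
print(f"PAC-Bayes complexity term: {pac_bayes_gap:.6f}")
```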
In this subsection, we study a special case with two more assumptions for a further understanding of the influence of the gradient fluctuation on our proposed generalization bound.
Assumption 6.6 The matrices A and Σ are symmetric.
Assumption 6.6 can be interpreted as stating that both the local geometry around the global minimum and the stationary distribution are homogeneous across every dimension of the parameter space. This assumption indicates that the product AΣ of the matrices A and Σ is also symmetric.
Based on Assumption 6.6, we can further prove the following theorem.
Theorem 6.7 When Assumption 6.6 holds, under all the conditions of Theorem 6.5, the stationary distribution of SGD has the following generalization bound:

R(Q) \le \hat R_S(Q) + \sqrt{\frac{\frac{\eta}{2|S|}\mathrm{tr}(C A^{-1}) + d\log\frac{2|S|}{\eta} - \log(\det(C A^{-1})) - d + 2\log\frac{1}{\delta} + 2\log m + 4}{4m - 2}}.   (6.34)
Lemma 6.3 When Assumption 6.6 holds, the KL divergence between the stationary distribution Q of SGD and the prior distribution P satisfies the following inequality:

KL(Q\|P) \le \frac{\eta}{4|S|}\mathrm{tr}(C A^{-1}) + \frac{1}{2} d\log\frac{2|S|}{\eta} - \frac{1}{2}\log(\det(C A^{-1})) - \frac{1}{2} d.   (6.35)
Lemma 6.3 gives an upper bound for the distance between the distribution of
the output hypothesis by SGD and the prior distribution of the hypothesis space. It
measures how far SGD can explore in the hypothesis space. Based on it, we can further prove Theorem 6.7, which controls the generalization error of the special case under Assumption 6.6.
Proof Applying Assumption 6.6 to Eq. (6.19), we can prove the following three equations:

\Sigma A + A\Sigma = \frac{\eta}{|S|} C, \qquad 2\Sigma A = \frac{\eta}{|S|} C, \qquad\text{and}\qquad \Sigma = \frac{\eta}{2|S|} C A^{-1}.   (6.36)

Therefore,

\det(\Sigma) = \det\left(\frac{\eta}{2|S|} C A^{-1}\right) = \left(\frac{\eta}{2|S|}\right)^{d}\det\left(C A^{-1}\right).   (6.37)

Thus,

\log(\det(\Sigma)) = \log\left(\left(\frac{\eta}{2|S|}\right)^{d}\det\left(C A^{-1}\right)\right) = -d\log\frac{2|S|}{\eta} + \log\left(\det\left(C A^{-1}\right)\right).   (6.38)

Applying Eqs. (6.36)–(6.38) and Eq. (6.32) to Eq. (6.26), we can obtain Eq. (6.35).
The proof is completed.
Local geometry. The Hessian matrix A characterizes the local geometry of the loss surface around the minimum. Sharp minima, which are characterized by high values of det(A), often correspond to models that overfit the
training data and exhibit poorer generalization to unseen data. Understanding these
local geometric properties is crucial for assessing the behavior and performance of
optimization algorithms in deep learning.
Gradient fluctuation. The covariance matrix C (or equivalently the matrix B)
characterizes how much the gradient estimates vary across different data points dur-
ing the training process. This variation represents the inherent noise in SGD. By intro-
ducing this noise into the gradient computation, SGD can explore a broader range
of solutions during optimization. This stochastic behavior allows SGD to navigate
away from sharp or narrow local minima, potentially leading to solutions with better
generalization performance on unseen data. The notion of injecting noise through
SGD has been influential in understanding its robustness and ability to escape poor
solutions during training.
Hyper-parameters. The relationship between batch size |S| and learning rate η
affects how gradients are computed and utilized during the training process. A larger
batch size typically provides a more accurate estimate of the gradient, reducing its
variance. On the other hand, a higher learning rate can lead to larger updates in
parameter space based on these gradients. The interplay between these two factors influences the stability and
convergence of the training process and can impact the generalization performance
of the model. By analyzing the generalization bound in relation to the ratio of batch
size to learning rate, we gain insights into how to optimize these hyperparameters
for better model performance and generalization.
Specifically, under the following assumption, our generalization bound has a positive correlation with the ratio of batch size to learning rate.

Assumption 6.8

d > \frac{\mathrm{tr}(C A^{-1})\,\eta}{2|S|}.   (6.39)

This assumption can be justified by the fact that the network sizes of neural networks are usually
extremely large. This property is also called overparametrization (Du et al. 2019b;
Brutzkus et al. 2018; Allen-Zhu et al. 2019b). We can obtain the following corollary
by combining Theorem 6.7 and Assumption 6.8.
Corollary 6.1 When all conditions of Theorem 6.7 and Assumption 6.8 hold, the
generalization bound of the network has a positive correlation with the ratio of
batch size to learning rate.
Proof Let k = |S|/η denote the ratio of batch size to learning rate, and let I denote the numerator inside the square root of Eq. (6.34), i.e.,

I = \frac{1}{2k}\mathrm{tr}(C A^{-1}) + d\log(2k) - \log(\det(C A^{-1})) - d + 2\log\frac{1}{\delta} + 2\log m + 4,

so that

R(Q) \le \hat R_S(Q) + \sqrt{\frac{I}{4m - 2}}.   (6.41)

Taking the partial derivative of I with respect to k gives \partial I/\partial k = -\mathrm{tr}(C A^{-1})/(2k^2) + d/k, which is positive whenever

d > \frac{\mathrm{tr}(C A^{-1})\,\eta}{2|S|} = \frac{1}{2k}\mathrm{tr}(C A^{-1}).   (6.43)

Thus, under Assumption 6.8,

\frac{\partial I}{\partial k} > 0.   (6.44)

So I, and hence the generalization bound, has a positive correlation with the ratio of batch size to learning rate.
The proof is completed.
Corollary 6.1 reveals the negative correlation between the generalizability and the ratio. This
property further derives the training strategy, which requires us to control the ratio
and ensure that it is not too large in order to achieve a good generalization when
training deep neural networks using SGD.
To evaluate the training strategy from the empirical aspect, we conduct extensive
systematic experiments to investigate the influence of the batch size and learning
rate on the generalizability of deep neural networks trained by SGD. To deliver rig-
orous results, our experiments strictly control all unrelated variables. The empirical
results show that there is a statistically significant negative correlation between the
generalizability of the networks and the ratio of the batch size to the learning rate,
which builds a solid empirical foundation for the training strategy.
To guarantee that the empirical results generally apply to any case, our experiments
are conducted based on two popular architectures, ResNet-110 (He et al. 2016a, b)
and VGG-19 (Simonyan and Zisserman 2015), on two standard datasets, CIFAR-
10 and CIFAR-100 (Krizhevsky and Hinton 2009), which can be downloaded from
https://www.cs.toronto.edu/~kriz/cifar.html. The separation of the training and test sets we used is the same as the official one.
We trained 1,600 models with 20 batch sizes, S B S = {16, 32, 48, 64, 80, 96, 112,
128, 144, 160, 176, 192, 208, 224, 240, 256, 272, 288, 304, 320}, and 20 learning
rates, SL R = {0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12,
0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20}. All SGD training techniques, such as
momentum, are disabled. Also, both batch size and learning rate are constant in
our experiments. Every model with a specific pair of batch size and learning rate
is trained for 200 epochs. The test accuracies of all 200 epochs are collected for
analysis. We select the highest accuracy on the test set to express the generalizability
of each model, since the training error is almost the same across all models (they are
all nearly 0).
The collected data is then utilized to investigate three correlations: (1) the correla-
tion between network generalizability and the batch size, (2) the correlation between
network generalizability and the learning rate, and (3) the correlation between net-
work generalizability and the ratio of batch size to learning rate, where the first two are
preparations for the final one. Specifically, we calculate the Spearman's rank-order correlation coefficients (SCCs) and the corresponding p values for the groups of collected data to investigate the statistical significance of the correlations. Almost all results indicate that the correlations are statistically significant (p < 0.005).2 The p values of
the correlation between the test accuracy and the ratio are all lower than 10−180 (see
Table 6.3).
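The statistical test used here can be reproduced with scipy.stats.spearmanr, as sketched below on synthetic accuracies; the real test uses the collected accuracies of the trained networks, and the accuracy model in the sketch is purely an assumption for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

# Spearman's rank-order correlation between the ratio of batch size to learning
# rate and the test accuracy, on synthetic stand-in accuracies.
rng = np.random.default_rng(7)
batch_sizes = np.arange(16, 321, 16)                          # the 20 batch sizes in S_BS
learning_rates = np.round(np.arange(0.01, 0.21, 0.01), 2)     # the 20 learning rates in S_LR

ratios, accuracies = [], []
for bs in batch_sizes:
    for lr in learning_rates:
        ratio = bs / lr
        acc = 0.92 - 0.02 * np.log(ratio) + 0.01 * rng.normal()   # assumed trend + noise
        ratios.append(ratio)
        accuracies.append(acc)

scc, p_value = spearmanr(ratios, accuracies)
print(f"SCC = {scc:.3f}, p = {p_value:.3g}")
```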
The architectures of our models are similar to a popular implementation of ResNet-
110 and VGG-19.3 Additionally, our experiments are conducted on a computing
cluster with GPUs of NVIDIA® Tesla™ V100 16GB and CPUs of Intel® Xeon®
Gold 6140 CPU @ 2.30GHz.
2 The definition of “statistically significant” has various versions, such as p < 0.05 and p < 0.01.
This section uses a more rigorous one ( p < 0.005).
3 See Wei Yang, https://github.com/bearpaw/pytorch-classification, 2017.
Fig. 6.3 Curves of test accuracy against batch size and learning rate. The four rows are respectively for (1) ResNet-110 trained on CIFAR-10, (2) ResNet-110 trained on CIFAR-100, (3) VGG-19 trained on CIFAR-10, and (4) VGG-19 trained on CIFAR-100. Each curve is based on 20 networks
Correlation between generalization ability and batch size. When the learning rate
is fixed as an element of SL R , we train ResNet-110 and VGG-19 on CIFAR10 and
CIFAR100 with 20 batch sizes of S B S . The plots of test accuracy to batch size are
illustrated in Fig. 6.3a. We list 1/4 of all plots due to space limitation. The rest of the
plots are in the supplementary materials. We then calculate the SCCs and the p values
as Table 6.1, where bold p values refer to the statistically significant observations,
while underlined ones refer to those that are not significant (This convention applies
to Table 6.2). The results clearly show that there is a statistically significant negative
correlation between generalization ability and batch size.
Correlation between generalization ability and learning rate. When the batch
size is fixed as an element of S B S , we train ResNet-110 and VGG-19 on CIFAR10
and CIFAR100 respectively with 20 learning rates SL R . The plot of the test accuracy
to the learning rate is illustrated in Fig. 6.3b, which include 1/4 of all plots due to
space limitation. The rest of the plots are in the supplementary materials. We then
calculate the SCC and the p values as Table 6.2 shows. The results clearly show that
there is a statistically significant positive correlation between the learning rate and
the generalization ability of SGD.
Correlation between generalization ability and ratio of batch size to learning
rate. We plot the test accuracies of ResNet-110 and VGG-19 on CIFAR-10 and
CIFAR-100 against the ratio of batch size to learning rate in Fig. 6.1. In total, 1,600 points are plotted. Additionally, we perform Spearman's rank-order correlation tests on all the accuracies of ResNet-110 and VGG-19 on CIFAR-10 and CIFAR-100. The SCC and p values show that the correlation between the ratio and the generalization ability is statistically significant, as Table 6.3 demonstrates. Each test is performed on
400 models. The results strongly support the training strategy.
Table 6.1 SCC and p values of batch size to test accuracy for different learning rate (LR)
LR ResNet-110 on CIFAR-10 ResNet-110 on CIFAR-100 VGG-19 on CIFAR-10 VGG-19 on CIFAR-100
SCC p SCC p SCC p SCC p
0.01 −0.96 2.6 × 10−11 −0.92 5.6 × 10−8 −0.98 3.7 × 10−14 −0.99 7.1 × 10−18
0.02 −0.96 1.2 × 10−11 −0.94 1.5 × 10−9 −0.99 3.6 × 10−17 −0.99 7.1 × 10−18
0.03 −0.96 3.4 × 10−11 −0.99 1.5 × 10−16 −0.99 7.1 × 10−18 −1.00 1.9 × 10−21
0.04 −0.98 1.8 × 10−14 −0.98 7.1 × 10−14 −0.99 9.6 × 10−19 −0.99 3.6 × 10−17
0.05 −0.98 3.7 × 10−14 −0.98 1.3 × 10−13 −0.99 7.1 × 10−18 −0.99 1.4 × 10−15
0.06 −0.96 1.8 × 10−11 −0.97 6.7 × 10−13 −1.00 1.9 × 10−21 −0.99 1.4 × 10−15
0.07 −0.98 5.9 × 10−15 −0.94 5.0 × 10−10 −0.98 8.3 × 10−15 −0.97 1.7 × 10−12
0.08 −0.97 1.7 × 10−12 −0.97 1.7 × 10−12 −0.98 2.4 × 10−13 −0.97 1.7 × 10−12
0.09 −0.97 4.0 × 10−13 −0.98 3.7 × 10−14 −0.98 1.8 × 10−14 −0.96 1.2 × 10−11
0.10 −0.97 1.9 × 10−12 −0.96 8.7 × 10−12 −0.98 8.3 × 10−15 −0.93 2.2 × 10−9
0.11 −0.97 1.1 × 10−12 −0.98 1.3 × 10−13 −0.99 2.2 × 10−16 −0.93 2.7 × 10−9
0.12 −0.97 4.4 × 10−12 −0.96 2.5 × 10−11 −0.98 7.1 × 10−13 −0.90 7.0 × 10−8
0.13 −0.94 1.5 × 10−9 −0.98 1.3 × 10−13 −0.97 1.7 × 10−12 −0.89 1.2 × 10−7
0.14 −0.97 2.6 × 10−12 −0.91 3.1 × 10−8 −0.97 6.7 × 10−13 −0.86 1.1 × 10−6
0.15 −0.96 4.6 × 10−11 −0.98 1.3 × 10−13 −0.95 8.3 × 10−11 −0.79 3.1 × 10−5
0.16 −0.95 3.1 × 10−10 −0.96 8.7 × 10−12 −0.95 1.4 × 10−10 −0.77 6.1 × 10−5
0.17 −0.95 2.4 × 10−10 −0.95 2.6 × 10−10 −0.91 2.3 × 10−8 −0.68 1.3 × 10−3
0.18 −0.97 4.0 × 10−12 −0.97 1.1 × 10−12 −0.93 2.6 × 10−9 −0.66 2.8 × 10−3
0.19 −0.94 6.3 × 10−10 −0.95 8.3 × 10−11 −0.90 8.0 × 10−8 −0.75 3.4 × 10−4
0.20 −0.91 3.6 × 10−8 −0.98 1.3 × 10−13 −0.95 6.2 × 10−11 −0.57 1.4 × 10−2
Table 6.2 SCC and p values of learning rate to test accuracy for different batch size (BS)
BS ResNet-110 on CIFAR-10 ResNet-110 on CIFAR-100 VGG-19 on CIFAR-10 VGG-19 on CIFAR-100
SCC p SCC p SCC p SCC p
16 0.60 5.3 × 10−3 0.84 3.2 × 10−6 0.62 3.4 × 10−3 −0.80 2.6 × 10−5
32 0.60 5.0 × 10−3 0.90 9.9 × 10−8 0.78 4.9 × 10−5 −0.14 5.5 × 10−1
48 0.84 3.2 × 10−6 0.89 1.8 × 10−7 0.87 4.9 × 10−7 0.37 1.1 × 10−1
64 0.67 1.2 × 10−3 0.89 1.0 × 10−7 0.91 2.0 × 10−8 0.91 1.1 × 10−6
80 0.80 2.0 × 10−5 0.99 4.8 × 10−16 0.95 2.4 × 10−10 0.87 4.5 × 10−6
96 0.79 3.3 × 10−5 0.89 2.4 × 10−7 0.94 5.2 × 10−9 0.94 1.5 × 10−9
112 0.90 8.8 × 10−8 0.91 2.7 × 10−8 0.97 2.6 × 10−12 0.95 1.2 × 10−10
128 0.95 8.3 × 10−11 0.92 1.1 × 10−8 0.98 2.2 × 10−14 0.99 4.8 × 10−16
144 0.85 2.1 × 10−6 0.98 7.7 × 10−14 0.90 6.2 × 10−8 0.98 3.5 × 10−15
160 0.90 4.3 × 10−8 0.94 5.0 × 10−10 0.95 3.3 × 10−10 0.99 7.1 × 10−18
176 0.94 5.0 × 10−10 0.99 3.6 × 10−17 0.91 2.3 × 10−8 0.98 1.8 × 10−14
192 0.94 6.7 × 10−10 0.94 5.0 × 10−10 0.95 6.2 × 10−11 0.97 2.6 × 10−12
208 0.91 3.6 × 10−8 0.97 6.7 × 10−12 0.98 6.1 × 10−14 0.99 2.5 × 10−17
224 0.90 9.0 × 10−8 0.98 3.7 × 10−14 0.93 2.2 × 10−9 0.98 1.3 × 10−13
240 0.78 4.6 × 10−5 0.95 2.4 × 10−10 0.98 8.3 × 10−15 0.99 9.6 × 10−19
256 0.83 4.8 × 10−6 0.94 5.0 × 10−10 0.99 4.8 × 10−16 0.97 5.4 × 10−12
272 0.95 2.4 × 10−10 0.96 2.5 × 10−11 0.97 4.0 × 10−13 0.99 1.5 × 10−16
288 0.94 9.8 × 10−10 0.92 1.5 × 10−18 0.95 8.3 × 10−11 0.99 1.5 × 10−16
304 0.81 1.5 × 10−5 0.97 4.0 × 10−13 0.95 6.2 × 10−11 1.00 3.7 × 10−24
320 0.94 1.4 × 10−9 0.95 8.3 × 10−11 0.97 2.6 × 10−12 1.00 7.2 × 10−20
Table 6.3 SCC and p values of the ratio of batch size to learning rate and test accuracy
ResNet-110 on CIFAR-10 ResNet-110 on CIFAR-100 VGG-19 on CIFAR-10 VGG-19 on CIFAR-100
SCC p SCC p SCC p SCC p
−0.97 3.3 × 10−235 −0.98 5.3 × 10−293 −0.98 6.2 × 10−291 −0.94 6.1 × 10−180
(2) Variational inference (Hoffman et al. 2013; Blei et al. 2017; Zhang et al.
2018a) employs a two-step process to infer the posterior:
1. A family of distributions is defined as
Q = \{q_\lambda \mid \lambda \in \Lambda\},
References
Alon, Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. 2018. SGD learns over-
parameterized networks that provably generalize on linearly separable data. In International
Conference on Learning Representations.
Ankit, Pensia, Varun Jog, and Po-Ling Loh. 2018. Generalization error bounds for noisy, iterative
algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), 546–550.
Aolin, Xu and Maxim Raginsky. 2017. Information-theoretic analysis of generalization capability
of learning algorithms. In Advances in Neural Information Processing Systems, 2524–2533.
Belinda, Tzen, Tengyuan Liang, and Maxim Raginsky. 2018. Local optimality and generalization
guarantees for the Langevin algorithm via empirical metastability. In Conference On Learning
Theory, 857–875.
Bottou, Léon., et al. 1998. Online learning and stochastic approximations. On-line learning in
neural networks 17 (9): 142.
Cheng, Xiang. 2020. The Interplay between Sampling and Optimization. PhD thesis, EECS Depart-
ment, University of California, Berkeley.
Christos, Louizos and Max Welling. 2017. Multiplicative normalizing flows for variational Bayesian
neural networks. In International Conference on Machine learning, 2218–2227.
Crispin, W Gardiner. 1985. Handbook of stochastic methods, vol. 3. Berlin: Springer.
David, A McAllester. 1999. PAC-Bayesian model averaging. In Annual Conference of Learning
Theory 99: 164–170.
David, A McAllester. 1999. Some PAC-Bayesian theorems. Machine Learning 37 (3): 355–363.
David, M Blei, Alp Kucukelbir, and Jon D. McAuliffe. 2017. Variational inference: A review for
statisticians. Journal of the American statistical Association 112 (518): 859–877.
Diederik, P Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In International
Conference on Learning Representations.
Dieuleveut, Aymeric, and Francis Bach. 2016. Nonparametric stochastic approximation with large
step-sizes. The Annals of Statistics 44 (4): 1363–1399.
Duane, Simon, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. 1987. Hybrid Monte
Carlo. Physics Letters B 195 (2): 216–222.
Fredrik, Hellström and Giuseppe Durisi. 2020. Generalization bounds via information density and
conditional information density. arXiv preprint arXiv:2005.08044.
Geman, Stuart, and Donald Geman. 1984. Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 6: 721–741.
George, E Uhlenbeck, and Leonard S. Ornstein. 1930. On the theory of the brownian motion.
Physical Review 36 (5): 823.
Gintare, Karolina Dziugaite, Kyle Hsu, Waseem Gharbieh, and Daniel M Roy. 2020. On the role
of data in pac-bayes bounds. arXiv preprint arXiv:2006.10929.
Hangfeng, He and Weijie Su. 2020. The local elasticity of neural networks. In International Con-
ference on Learning Representations.
Hao, Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. 2018c. Visualizing the
loss landscape of neural nets. In Advances in Neural Information Processing Systems.
Hao, Zhang, Bo Chen, Yulai Cong, Dandan Guo, Hongwei Liu, and Mingyuan Zhou. 2020. Deep
autoencoding topic model with scalable hybrid Bayesian inference. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
Harold, Kushner and G George Yin. 2003. Stochastic approximation and recursive algorithms and
applications, vol. 35. Springer Science & Business Media.
He, Fengxiang, Tongliang Liu, and Dacheng Tao. 2019. Control batch size and learning rate
to generalize well: Theoretical and empirical evidence. In Advances in Neural Information Pro-
cessing Systems.
Herbert, Robbins and Sutton Monro. 1951. A stochastic approximation method. The Annals of
Mathematical Statistics, 400–407.
Huan, Xu., Constantine Caramanis, and Shie Mannor. 2011. Sparse algorithms are not stable: A
no-free-lunch theorem. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1):
187–193.
Hugo, Larochelle and Stanislas Lauly. 2012. A neural autoregressive topic model. In Advances in
Neural Information Processing Systems, 2708–2716.
Ian, Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. 2014a. Generative adversarial nets. In Advances in Neural
Information Processing Systems, 2672–2680.
Ian, Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. 2016. Deep learning, vol.
1. MIT Press.
Ilya, Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of
initialization and momentum in deep learning. In International conference on machine learning,
1139–1147.
Ivan, Kobyzev, Simon Prince, and Marcus Brubaker. 2020. Normalizing flows: An introduction and
review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Jeffrey, Negrea, Mahdi Haghifam, Gintare Karolina Dziugaite, Ashish Khisti, and Daniel M Roy.
2019. Information-theoretic generalization bounds for sgld via data-dependent estimates. In
Advances in Neural Information Processing Systems, 11015–11025.
Jian, Li, Xuanyuan Luo, and Mingda Qiao. 2019. On generalization error bounds of noisy gradient
methods for non-convex learning. arXiv preprint arXiv:1902.00621.
John, J Cherian, Andrew G Taube, Robert T McGibbon, Panagiotis Angelikopoulos, Guy Blanc,
Michael Snarski, Daniel D Richman, John L Klepeis, and David E Shaw. 2020. Efficient hyperpa-
rameter optimization by way of pac-bayes bound minimization. arXiv preprint arXiv:2008.06431.
Junhong, Lin and Lorenzo Rosasco. 2016. Optimal learning for multi-pass stochastic gradient
methods. In Advances in Neural Information Processing Systems, 4556–4564.
Junhong, Lin, Raffaello Camoriano, and Lorenzo Rosasco. 2016. Generalization properties and
implicit regularization for multiple passes SGM. In International Conference on Machine learn-
ing, 2340–2348.
Kaiming, He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. Deep residual learning for image
recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
Kaiming, He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Identity mappings in deep
residual networks. In European Conference on Computer Vision.
Karen, Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale
image recognition. In International Conference on Learning Representations.
Krizhevsky, Alex, and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images.
Technical report, Citeseer.
Laurent, Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. 2017. Sharp minima can gen-
eralize for deep nets. In International Conference on Machine Learning.
Lennart, Ljung, Georg Pflug, and Harro Walk. 2012. Stochastic approximation and optimization of
random systems, volume 17. Birkhäuser.
Lever, Guy, François Laviolette, and John Shawe-Taylor. 2013. Tighter pac-bayes bounds through
distribution-dependent priors. Theoretical Computer Science 473: 4–28.
Mahdi, Haghifam, Jeffrey Negrea, Ashish Khisti, Daniel M Roy, and Gintare Karolina Dziugaite.
2020. Sharpened generalization bounds based on conditional mutual information and an appli-
cation to noisy, iterative algorithms. arXiv preprint arXiv:2004.12983.
Matthew, D Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational
inference. Journal of Machine Learning Research 14 (1): 1303–1347.
Max, Welling and Yee W Teh. 2011. Bayesian learning via stochastic gradient Langevin dynamics.
In International Conference on Machine learning, 681–688.
Maxim, Raginsky, Alexander Rakhlin, and Matus Telgarsky. 2017. Non-convex learning via stochas-
tic gradient Langevin dynamics: A nonasymptotic analysis. In Conference on Learning Theory,
1674–1703.
Moritz, Hardt, Ben Recht, and Yoram Singer. 2016b. Train faster, generalize better: Stability of
stochastic gradient descent. In International Conference on Machine learning, 1225–1234.
Nan, Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D Skeel, and Hartmut Neven.
2014. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Informa-
tion Processing Systems, 3203–3211.
Nitish, Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping
Tak Peter Tang. 2017. On large-batch training for deep learning: Generalization gap and sharp
minima. In International Conference on Learning Representations.
Noah, Golowich, Alexander Rakhlin, and Ohad Shamir. 2018. Size-independent sample complexity
of neural networks. In Annual Conference on Learning Theory, 297–299.
Olivier, Bousquet, and André Elisseeff. 2002. Stability and generalization. Journal of Machine
Learning Research 2: 499–526.
Priya, Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola,
Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch sgd: training
imagenet in 1 hour. arXiv preprint arXiv:1706.02677.
Roth, Wolfgang, and Franz Pernkopf. 2018. Bayesian neural networks with weight sharing using
Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (1):
246–252.
Sam, Patterson and Yee Whye Teh. 2013. Stochastic gradient Riemannian Langevin dynamics on
the probability simplex. In Advances in Neural Information Processing Systems, 3102–3110.
Simon, S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. 2019b. Gradient descent provably
optimizes over-parameterized neural networks. In International Conference on Learning Repre-
sentations.
Spearman, Charles. 1987. The proof and measurement of association between two things. The
American journal of psychology 100 (3/4): 441–471.
Stephan, Mandt, Matthew D. Hoffman, and David M. Blei. 2017. Stochastic gradient descent as
approximate Bayesian inference. Journal of Machine Learning Research 18 (1): 4873–4907.
Sungjin, Ahn, Anoop Korattikara, and Max Welling. 2012. Bayesian posterior sampling via stochas-
tic gradient Fisher scoring. arXiv preprint arXiv:1206.6380.
Thomas, Steinke and Lydia Zakynthinou. 2020. Reasoning about generalization via conditional
mutual information. arXiv preprint arXiv:2001.09122.
Tianqi, Chen, Emily Fox, and Carlos Guestrin. 2014. Stochastic gradient Hamiltonian Monte Carlo.
In International Conference on Machine learning, 1683–1691.
W Keith, Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applica-
tions. Biometrika 57 (1): 97–109.
Weinan, E. 2017. A proposal on machine learning via dynamical systems. Communications in
Mathematics and Statistics 5 (1): 1–11.
Wenlong, Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. 2018a. Generalization bounds of sgld for
non-convex learning: Two theoretical viewpoints. In Annual Conference On Learning Theory.
Xi, Chen, Jason D Lee, Xin T Tong, and Yichen Zhang. 2016. Statistical inference for model
parameters in stochastic gradient descent. arXiv preprint arXiv:1610.08637.
Yeming, Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, and Jimmy Ba. 2019.
Interplay between optimization and generalization of stochastic gradient descent with covariance
noise. arXiv preprint arXiv:1902.08234.
Yi, Zhou, Yingbin Liang, and Huishuai Zhang. 2018. Generalization error bounds with probabilistic
guarantee for SGD in nonconvex optimization. arXiv preprint arXiv:1802.06903.
Yi-An, Ma, Tianqi Chen, and Emily Fox. 2015. A complete recipe for stochastic gradient mcmc.
In Advances in Neural Information Processing Systems, 2917–2925.
Ying, Yiming, and Massimiliano Pontil. 2008. Online gradient descent learning algorithms. Foun-
dations of Computational Mathematics 8 (5): 561–596.
Yuansi, Chen, Chi Jin, and Bin Yu. 2018b. Stability and convergence trade-off of iterative optimiza-
tion algorithms. arXiv preprint arXiv:1804.01619.
Yuheng, Bu, Shaofeng Zou, and Venugopal V Veeravalli. 2020. Tightening mutual information
based bounds on generalization error. IEEE Journal on Selected Areas in Information Theory.
Yuting, Wei, Fanny Yang, and Martin J Wainwright. 2017. Early stopping for kernel boosting
algorithms: A general analysis with localized complexities. In Advances in Neural Information
Processing Systems, 6065–6075.
Zeyuan, Allen-Zhu., Yuanzhi Li, and Zhao Song. 2019. A convergence theory for deep learning via
over-parameterization. In .
Zhang, Cheng, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. 2018. Advances in varia-
tional inference. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8): 2008–
2026.
Zhun, Deng, Hangfeng He, and Weijie J Su. 2020. Toward better generalization bounds with locally
elastic stability. arXiv preprint arXiv:2010.13988.
Chapter 7
The Geometry of the Loss Surfaces
A major barrier recognized by the whole community is that the loss surfaces of deep
neural networks are extremely nonconvex and nonsmooth. Such nonconvexity and
nonsmoothness make the analysis of the optimization and generalization properties
of such networks prohibitively difficult. An intuitive approach is to bypass these geo-
metrical properties to seek a theoretical explanation. However, some papers argue that
these “intimidating” geometrical properties themselves are the major factors shaping
the properties of deep neural networks and the key to explaining deep learning.
Linear neural networks are neural networks whose activations are all linear functions.
For linear neural networks, the loss surfaces do not have any spurious local minima:
all local minima are equally good, i.e., they are all global minima.
Kawaguchi (2016) proved that linear neural networks do not have any spurious
local minima under several assumptions: (1) the loss functions are squared losses;
(2) X X T and X Y T are both of full rank, where X is the data matrix and Y is the
label matrix; (3) the dimensionality of the input layer is larger than that of the output
−1
layer; and (4) all eigenvalues of the matrix Y X X X T X Y T are different from
each other. Lu and Kawaguchi (2017) replaced these assumptions with a single, more
restrictive assumption, namely, that the data matrix X and the label matrix Y are both
of full rank. Later, Zhou and Liang (2018) proved that all critical points are global
minima when certain conditions hold. The authors proved this based on a new result
regarding the analytical formulation of the critical points.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025 123
F. He and D. Tao, Foundations of Deep Learning, Machine Learning: Foundations,
Methodologies, and Applications, https://doi.org/10.1007/978-981-16-8233-9_7
124 7 The Geometry of the Loss Surfaces
This result only relies on four mild assumptions that cover most practical circum-
stances: (1) the training sample set is linearly inseparable; (2) all training sample
points are distinct; (3) the output layer is narrower than the other hidden layers; and
(4) there exists some turning point in the piece-wise linear activations that the sum
of the slops on the two sides does not equal to 0.
Our result significantly extends the existing study on the existence of spurious
local minimum. For example, Zhou and Liang (2018) prove that one-hidden-layer
neural networks with two nodes in the hidden layer and two-piece linear (ReLU-
like) activations have spurious local minima; Swirszcz et al. (2016) prove that ReLU
networks have spurious local minima under the squared loss when most of the neurons
are not activated; Safran and Shamir (2018) present a computer-assisted proof that
two-layer ReLU networks have spurious local minima; a recent work Yun et al. (2019)
have proven that neural networks with two-piece linear activations have infinite
spurious local minima, but the results only apply to the networks with one hidden
layer and one-dimensional outputs; and a concurrent work Goldblum et al. (2020)
proves that for multi-layer perceptrons of any depth, the performance of every local
minimum on the training data equals to a linear model, which is also verified by
experiments.
The proposed theorem is proved in three stages: (1) we prove that neural networks
with one hidden layer and two-piece linear activations have spurious local minima;
(2) we extend the conditions to neural networks with arbitrary hidden layers and two-
piece linear activations; and (3) we further extend the conditions to neural networks
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 125
with arbitrary depth and arbitrary piecewise linear activations. Since some parameters
of the constructed spurious local minima are from continuous intervals, we have
obtained infinitely many spurious local minima. At each stage, the proof follows
a two-step strategy that: (a) constructs an infinite series of local minima; and (b)
constructs a point in the parameter space whose empirical risk is lower than the
constructed local minimum in Step (a). This strategy is inspired by Yun et al. (2019),
upon which we have made significant and non-trivial development.
Second, we draw a “big picture” for the loss surfaces of nonlinear neural networks.
Soudry and Hoffer (2018) highlight a smooth and multilinear partition of the loss
surfaces of neural networks. The nonlinearities in the piecewise linear activations
partition the loss surface of any nonlinear neural network into multiple smooth and
multilinear open cells. Specifically, every nonlinear point in the activation functions
creates a group of the non-differentiable boundaries between the cells, while the
linear parts of activations correspond to the smooth and multilinear interiors. Based
on the partition, we discover the degenerative nature of the large amounts of local
minima from the following aspects:
• Every local minimum is globally minimal within a cell. This property demon-
strates that the local geometry within every cell is similar to the global geometry of
linear networks, although technically, they are substantially different. It applies to
any one-hidden-layer neural network with two-piece linear activations for regres-
sion under convex loss. We rigorously prove this property in two stages: (1) we
prove that within every cell, the empirical risk R̂ S is convex with respect to a
variable Ŵ mapped from the weights W by a mapping Q. Therefore, the local
minima with respect to the variable Ŵ are also the global minima in the cell; and
then (2) we prove that the local optimality is maintained under the constructed
mapping. Specifically, the local minima of the empirical risk R̂ S with respect to
the parameter W are also the local minima with respect to the variable Ŵ . We
thereby prove this property by combining the convexity and the correspondence
of the minima. This proof is technically novel and non-trivial, despite the natural
intuitions.
• Equivalence classes and quotient space of local minimum valleys. All local
minima in a cell are concentrated as a local minimum valley: on a local minimum
valley, all local minima are connected with each other by a continuous path, on
which the empirical risk is invariant. Further, all these local minima constitute
an equivalence class. This local minima valley may have several parallel valleys
that are in the same equivalence class but do not appear because of the restraints
from cell boundaries. If such constraints are ignored, all the equivalence classes
constitute a quotient space. The constructed mapping Q is exactly the quotient
map. This result coincides with the property of mode connectivity that the minima
found by gradient-based methods are connected by a path in the parameter space
with almost invariant empirical risk (Garipov et al. 2018; Draxler et al. 2018;
Kuditipudi et al. 2019). Additionally, this property suggests that we would need
to study every local minimum valley as a whole.
126 7 The Geometry of the Loss Surfaces
• Linear collapse. Linear neural networks are covered by our theories as a simplified
case. When all activations are linear, the partitioned loss surface collapses to one
single cell, in which all local minima are globally optimal, as suggested by the
existing works on linear networks (Kawaguchi 2016; Baldi and Hornik 1989; Lu
and Kawaguchi 2017; Freeman and Bruna 2017; Zhou and Liang 2018; Laurent
and von Brecht 2018; Yun et al. 2018).
This section investigates the existence of spurious local minima on the loss surfaces
of neural networks. We find that almost all practical neural networks have infinitely
many spurious local minima. This result stands for any neural network with arbitrary
depth and arbitrary piecewise linear activations excluding linear functions under
arbitrary continuously differentiable loss.
7.2.1.1 Preliminaries
Also, we
define Y (0) = X , Y (L) = Ŷ , d0 = d X , and d L = dY . In some situations, we
L
use Ŷ [Wi ]i=1 , [bi ]i=1
L
to clarify the parameters, as well as Ỹ ( j) , Y ( j) , etc.
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 127
This section discusses neural networks with piecewise linear activations. A part
of the proof uses two-piece linear activations h s− ,s+ which are defined as follows,
One can also remove Assumption 7.4, if Assumption 7.3 is replaced by the
following assumption, which is mildly more restrictive (see a detailed proof in
Sect. 7.2.3.5–7.2.3.6).
d1 ≥ dY + 2,
di ≥ dY + 1, i = 2, . . . , L − 1.
This section presents the skeleton of the proof. Theorem 7.5 is proved in three stages.
We first prove a simplified version of Theorem 7.5 and then extend the conditions in
the last two stages.
Yun et al. (2019) and the method discussed in this section share a common strategy:
(a) creating a sequence of local minima using a linear classifier; and (b) showing the
existence of a constructed new point with lower empirical risk to demonstrate that
the constructed local minima are spurious. However, the specifics of how these local
minima are constructed vary significantly due to differences in the loss function used
and the dimensions of the output space. These distinctions highlight the nuanced
nature of the strategy and its application across different contexts.
We have extended the work of Yun et al. (2019) in three key ways: (1) Generalizing
from one hidden layer to arbitrary depth: Our approach aims to demonstrate that neu-
ral networks of arbitrary depth exhibit infinite spurious local minima. We introduced
a new strategy that employs transformation operations to direct data flow through the
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 129
same linear parts of the activations, facilitating the construction of spurious local min-
ima. (2) Extending from squared loss to arbitrary differentiable loss: Yun et al. (2019)
obtained the analytic derivatives of the loss function to construct and prove the exis-
tence of spurious local minima. However, this method cannot be directly applied to
arbitrary differentiable loss functions lacking analytic formulations. To establish that
a loss surface under an arbitrary differentiable loss has infinite spurious local minima,
we developed a new proof technique based on Taylor series expansions and a new
separation lemma. (3) Expanding from one-dimensional to arbitrary-dimensional
output: Our work investigates neural networks with arbitrary-dimensional output,
dealing with the calculus of functions whose domain and codomain are a matrix space
and a vector space, respectively. This extension contrasts with the one-dimensional
output scenario, which deals solely with the codomain of real number spaces. Explor-
ing higher-dimensional outputs significantly increases the complexity of the proof
process, requiring specialized mathematical techniques to demonstrate the presence
of infinite spurious local minima.
Stage (1): Neural networks with one hidden layer and two-piece linear
activations.
We first prove that nonlinear neural networks with one hidden layer and two-piece
linear activation functions (ReLU-like activations) have spurious local minima. The
proof in this stage further follows a two-step strategy:
(a) We first construct local minima of the empirical risk R̂ S (see Appendix 7.2.3.3,
Lemma 7.4). These local minimizers are constructed based on a linear neural network
which has the same network size (dimension of weight matrices) and evaluated under
the same loss. The design of the hidden layer guarantees that the components of the
output Ỹ (1) in the hidden layer before the activation are all positive. The activation
is thus effectively reduced to a linear function. Therefore, the local geometry around
the local minima with respect to the weights W is similar to those of linear neural
networks. Further, the design of the output layer guarantees that its output Ŷ is the
same as the linear neural network. This construction helps to utilize the results of
linear neural networks to solve the problems in nonlinear neural networks.
(b) We then prove that all the constructed local minima in Step (a) are spurious
(see Appendix 7.2.3.3, Theorem 7.9). Specifically, we assumed by Assumption 7.1
that the dataset cannot be fit by a linear model. Therefore, the gradient ∇Ŷ R̂ S of
the empirical risk R̂ S with respect to the prediction Ŷ is not zero. Suppose the i-th
row of the gradient ∇Ŷ R̂ S is not zero. Then, we use Taylor series and a preparation
lemma (see Appendix 7.2.3.6, Lemma 7.7) to construct another point in the parameter
space that has smaller empirical risk. Therefore, we prove that the constructed local
minima are spurious. Furthermore, the constructions involve some parameters that
are randomly picked from a continuous interval. Thus, we constructed infinitely
many spurious local minima.
130 7 The Geometry of the Loss Surfaces
Stage (2) - Neural networks with arbitrary hidden layers and two-piece linear
activations.
We extend the condition in Stage (1) to any neural network with arbitrary depth
and two-piece linear activations. The proof in this stage follows the same two-step
strategy but has different implementations:
(a) We first construct a series of local minima of the empirical risk R̂ S (see
Appendix 7.2.3.4, Lemma 7.5). The construction guarantees that every component
of the output Ỹ (i) in each layer before the activations is positive, which secure all the
input examples flow through the same part of the activations. Thereby, the nonlinear
activations are reduced to linear functions. Also, our construction guarantees that the
output Ŷ of the network is the same as a linear network with the same weight matrix
dimensions.
(b) We then prove that the constructed local minima are spurious (see Appendix
7.2.3.4, Theorem 7.10). The idea is to find a point in the parameter space that has
the same empirical risk R̂ S with the constructed point in Stage (1), Step (b).
Stage (3) - Neural networks with arbitrary hidden layer and piecewise linear
activations.
We further extend the conditions in Stage (2) to any neural network with arbitrary
depth and arbitrary piecewise linear activations. We continue to adapt the two-step
strategy in this stage:
(a) We first construct a local minimizer of the empirical risk R̂ S based on the
results in Stages (1) and (2) (see Appendix 7.2.3.5, Lemma 7.6). This construction
is based on Stage (2), Step (a). The difference of the construction in this stage is that
every linear part in the activations can be of a finite interval. The constructed weight
matrices apply several uniform scaling and translation operations on the outputs of
hidden layers in order to guarantee that all the input training sample points flow
through the same linear parts of the activations. We thereby effectively reduce the
nonlinear activations to linear functions. Also, our construction guarantees that the
output Ŷ of the neural network equals to that of the corresponding linear neural
network.
(b) We then prove that the constructed local minima are spurious (see Appendix
7.2.3.5). We use the same strategy in Stage (2), Step (b). Some adaptations are
implemented for the new conditions.
This section draws a big picture for the loss surfaces of neural networks. Based on
a recent result by Soudry and Hoffer (2018), we present four profound properties of
the loss surface that collectively characterize how the nonlinearities in activations
shape the loss surface.
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 131
7.2.2.1 Preliminaries
Remark 7.2 The definition of “multilinear” implies that the domain of any multi-
linear function f is a connective and convex set, such as the smooth and multilinear
cells below.
Definition 7.6 (Equivalence class, and quotient space) Suppose X is a linear space.
[x] = {v ∈ X : v ∼ x} is an equivalence class, if there is an equivalent relation ∼ on
[x], such that for any a, b, c ∈ [x], we have: (1) reflexivity: a ∼ a; (2) symmetry:
if a ∼ b, b ∼ a; and (3) transitivity: if a ∼ b and b ∼ c, a ∼ c. The quotient space
and quotient map are defined to be X/ ∼= {{v ∈ X : v ∼ x} : x ∈ X } and x → [x],
respectively.
In this section, the loss surface is defined under convex loss with respect to the pre-
diction Ŷ of the neural network. Convex loss covers many popular loss functions in
practice, such as the squared loss for the regression tasks and many others based on
norms. The triangle inequality of the norms secures the convexity of the correspond-
ing loss functions. The convexity of the squared loss is checked in the appendix (see
Appendix 7.3, Lemma 7.8).
132 7 The Geometry of the Loss Surfaces
We now present four propositions to express the loss surfaces of nonlinear neural
networks. These propositions give four major properties of the loss surface that
collectively draw a big picture for the loss surface.
We first recall a lemma by Soudry and Hoffer (2018). It proves that the loss
surfaces of neural networks have smooth and multilinear partitions.
Lemma 7.1 (Smooth and multilinear partition; cf. Soudry and Hoffer (2018)) The
loss surfaces of neural networks of arbitrary depth with piecewise linear functions
excluding linear functions are partitioned into multiple smooth and multilinear open
cells, while the boundaries are nondifferentiable.
Theorem 7.8 (Equivalence classes of local minimum valleys) Suppose all condi-
tions of Theorem 7.7 hold. Assume the loss function is strictly convex. Then, all local
minima in a cell are concentrated as a local minimum valley: they are connected
with each other by a continuous path and have the same empirical risk. Additionally,
all local minima in a cell constitute an equivalence class.
Corollary 7.1 (Quotient space of local minimum valleys) Suppose all conditions
of Theorem 7.8 hold. There might exist some “parallel” local minimum valleys in
the equivalence class of a local minimum valley. They do not appear because of the
constraints from the cell boundaries. If we ignore such constraints, all equivalence
classes of local minima valleys constitute a quotient space.
Corollary 7.2 (Linear collapse) The partitioned loss surface collapses to one single
smooth and multilinear cell, when all activations are linear.
1 1
m m
R̂ S (W1 , W2 ) = l (yi , W2 h(W1 xi )) = l yi , W2 diag A·,i W1 xi ,
m i=1 m i=1
(7.5)
where A·,i is the i-th column of matrix
⎡ ⎤
h s− ,s+ ((W1 )1,· x1 ) · · · h s− ,s+ ((W1 )1,· xm )
⎢ .. .. .. ⎥
A=⎣ . . . ⎦.
h s− ,s+ ((W1 )d1 ,· x1 )
· · · h s− ,s+ ((W1 )d1 ,· xm )
Applying Eq. (7.6) to Eq. (7.5), the empirical risk R̂ S equals to a formulation similar
to the linear neural networks,
1
m
R̂ S = l yi − A·,i
T
diag(W2 )W1 xi . (7.7)
m i=1
Afterwards, define Ŵ1 = diag(W2 )W1 and then straighten the matrix Ŵ1 to a vector
Ŵ ,
Ŵ = (Ŵ1 )1,· · · · (Ŵ1 )d1 ,· ,
T
A·,1 Ŵ1 x1 · · · A·,m
T
Ŵ1 xm =Ŵ X̂ .
Applying Eq. (7.7), the empirical risk is transferred to a convex function as follows,
1 T 1
m m
R̂ S = l yi , A·,i Ŵ1 xi = l(yi , Ŵ X̂ i ).
m i=1 m i=1
We then prove that the local optimality of the empirical risk R̂ S is maintained
when the weights W are mapped to the variable Ŵ . Specifically, the local minima
of the empirical risk R̂ S with respect to the weight W are also the local minima with
respect to the variable Ŵ . The maintenance of optimality is not surprising but the
proof is technically non-trivial (see a detailed proof in Sect. 7.3.3).
Equivalence classes and quotient space of local minimum valleys. The con-
structed mapping Q is a quotient map. Under the setting in the previous property,
all local minima in a cell is an equivalence class; they are concentrated as a local
minimum valley. However, there might exist some “parallel” local minimum valley
in the equivalence class, which do not appear because of the constraints from the cell
boundaries. Further for neural networks of arbitrary depth, we also constructed a local
minimum valley (the spurious local minima constructed in Sect. 7.2.1). This result
explains the property of mode connectivity that the minima found by gradient-based
methods are connected by a path in the parameter space with almost constant empir-
ical risk, which is proposed in two empirical works (Garipov et al. 2018; Draxler
et al. 2018). A recent theoretical work (Kuditipudi et al. 2019) proves that dropout
stability and noise stability guarantee the mode connectivity.
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 135
Linear collapse. Our theories also cover the case of linear neural networks. Linear
neural networks do not have any nonlinearity in their activations. Correspondingly,
the loss surface does not have any non-differentiable boundaries. In our theories,
when there is no nonlinearity in the activations, the partitioned loss surface collapses
to a single smooth, multilinear cell. All local minima wherein are equally good, and
also, they are all global minima as follows. This result unites the existing results on
linear neural networks (Kawaguchi 2016; Baldi and Hornik 1989; Lu and Kawaguchi
2017; Freeman and Bruna 2017; Zhou and Liang 2018; Laurent and von Brecht 2018;
Yun et al. 2018).
This section gives a detailed proof of Theorem 7.5. It follows the skeleton presented
in Sect. 7.2.1.3.
We first check whether squared loss and cross-entropy loss are covered by the
requirements of Theorem 7.5.
Lemma 7.2 The squared loss is continuously differentiable with respect to the pre-
diction of the model, whose gradient of loss equal to zero when the prediction and
the label are different.
Proof Apparently, the squared loss is differentiable with respect to Ŷ . Specifically,
the gradient with respect to Ŷ is as follows,
2
∇Ŷ Y − Ŷ = 2 Y − Ŷ ,
Proof For any i ∈ [1 : m], the cross-entropy loss is differentiable with respect to Ŷi .
The j-th component of the gradient with respect to the prediction Ŷi is as follows,
d
Y
eŶk,i d
∂ − Yk,i log dY Ŷk,i
k=1 k=1 e Y
eŶ j,i
= Yk,i − Y j,i . (7.8)
∂ Ŷ j,i k=1
dY
Ŷ
e k,i
k=1
which is continuous with respect to Ŷi . So, the cross-entropy loss is continuously
differentiable with respect to Ŷi .
Additionally, if the gradient (Eq. (7.8)) is zero, we have the following equations,
dY
dY
Yk,i e Ŷ j,i
− Y j,i eŶk,i = 0, j = 1, 2, . . . , m.
k=1 k=1
dY
Since Yk,i = 1, we can easily check the rank of the left matrix is dY − 1. So the
k=1
dimension of the solution space is one. Meanwhile, we have
⎡ ⎤
dY
⎢ Yk,i − Y1,i −Y1,i ··· −Y1,i ⎥⎡
⎢k=1 ⎥ ⎤
⎢ ⎥ Y1,i
⎢ dY
⎥⎢
⎢ −Y2,i Yk,i − Y2,i ··· −Y2,i ⎥ ⎢ Y2,i ⎥
⎢ ⎥⎢ . ⎥ = 0.
⎢ k=1
⎥⎣ . ⎥
⎢ .. .. .. .. ⎥ . ⎦
⎢ . . . . ⎥
⎢ ⎥ YdY ,i
⎣
dY ⎦
−YdY ,i ··· −YdY ,i Yk,i − YdY ,i
i=1
In Stage (1), we prove that deep neural networks with one hidden layer, two-piece
linear activation h s− ,s+ , and multi-dimensional outputs have infinite spurious local
minima.
This stage is organized as follows: (a) we construct a local minimizer by Lemma
7.4; and (b) we prove that the local minimizer is spurious in Theorem 7.9 by
constructing a set of parameters with smaller empirical risk.
Without loss of generality, we assume that s+ = 0. Otherwise, suppose that s+ =
0. From the definition of ReLU-like activation (Eq. (7.4)), we have s− = 0. Since
Under Assumption 7.3, any one-hidden-layer neural network has a local minimum
at
W̃ W̃ − η1dY
Ŵ1 = ·,[1:d X ] , b̂1 = ·,d X +1 , (7.10)
0(d1 −dY )×d X −η1d1 −dY
and
1
Ŵ2 = I
s+ dY
0dY ×(d1 −dY ) , b̂2 = η1dY , (7.11)
where Ŵ1 and b̂1 are respectively the weight matrix and the bias of the first layer,
Ŵ2 and b̂2 are respectively the weight matrix and the bias of the second layer, and η
is a negative constant with absolute value sufficiently large such that
Also, the loss in this lemma is continuously differentiable loss whose gradient
does not equals to 0 when the prediction is not the same as the ground-truth label.
Proof
We show that the empirical risk is higher in the neiborhood of
2 2 2 2
Ŵi , b̂i , in order to prove that Ŵi , b̂i is a local minimizer.
i=1 i=1 i=1 i=1
The output of the first layer before the activation is
(1) Ŵ X − η1dY 1mT
Ỹ = Ŵ1 X + b̂1 1mT = .
−η1d1 −dY 1mT
Because η is a negative constant with absolute value sufficiently large such that
Eq. (7.25)) holds, the output above is positive (element-wise), the output of the
neural network with parameters {Ŵ1 , Ŵ2 , b̂1 , b̂2 } is
where X̃ is defined as
X
X̃ = T . (7.13)
1m
Therefore, the empirical risk R̂ S in terms of parameters {Ŵ1 , Ŵ2 , b̂1 , b̂2 } is
m m
1 1 x
R̂ S Ŵ1 , Ŵ2 , b̂1 , b̂2 = l Yi , W̃ X̃ = l Yi , W̃ i = f (W̃ ).
m ·,i m 1
i=1 i=1
2
Then, we introduce a sufficiently small disturbance [δW i ]i=1 , [δbi ]i=1
2
into the
2 2
parameters Ŵi , b̂i . When the disturbance is sufficiently small, all com-
i=1 i=1
ponents of the output of the first layer remain positive. Therefore, the output after
the disturbance is
2 2
Ŷ Ŵi + δW i , b̂i + δbi
i=1 i=1
= s+ δ W 2 Ŵ1 + δW 1 X + b̂1 + δb1 1mT + s+ Ŵ2 δW 1 X + s+ Ŵ2 δb1 1mT + δb2 1mT
where Eq. (∗) is because all components of Ŵ1 + δW 1 X + b1 + δb1 1mT are
positive, and δ is defined as the following matrix
δ = s+ Ŵ2 δW 1 + δW 2 Ŵ1 + δW 2 δW 1 s+ Ŵ2 δb1 + δb2 + s+ δW 2 b̂1 + δb1 .
2 2
Therefore, the empirical risk R̂ S with respect to Ŵi + δW i , b̂i + δbi is
i=1 i=1
2 2
R̂ S Ŵi + δW i , b̂i + δbi
i=1 i=1
1
m
= l Yi , W̃ + δ X̃
m i=1 ·,i
1
m
xi
= l Yi , W̃ + δ
m i=1 1
= f (W̃ + δ).
δ approaches zero when the disturbances {δW 1 , δW 2 , δb1 , δb2 } approach zero (element-
wise). Since Ŵ is the local minimizer of f (W ), we have
2 2 2 2
R̂ S Ŵi , b̂i = f (Ŵ ) ≤ f (Ŵ + δ) = R̂ S Ŵi + δW i , b̂i + δbi .
i=1 i=1 i=1 i=1
(7.14)
Becausethe disturbances {δW 1 , δW 2 , δb1 , δb2 } are arbitrary, Eq. (7.14) demon-
2 2
strates that Ŵi , b̂i is a local minimizer.
i=1 i=1
The proof is completed.
Theorem 7.9 Under the same conditions of Lemma 7.4 and Assumptions 7.1, 7.2,
and 7.4, the constructed spurious local minima in Lemma 7.4 are spurious.
140 7 The Geometry of the Loss Surfaces
∇W f (W ) = 0.
Specifically, we have
∂ f W̃
= 0, i ∈ {1, . . . , dY }, j ∈ {1, . . . , d X },
∂ Wi, j
∂ f W̃
m m
x x x xi
= ∇Ŷi l Yi , W̃ i E k, j i = ∇Ŷi l Yi , W̃ i ,
∂ Wk, j 1 1 1 k
1 j
i=1 i=1
xi x
where Ŷi = W̃ , ∇Ŷi l Yi , W̃ i ∈ R1×dY . Since k, j are arbitrary in
1 1
{1, . . . , dY } and {1, . . . , d X }, respectively, we have
V X T 1m = 0, (7.15)
where
T T
x1 x
V= ∇Ŷ1 l Y1 , W̃ · · · ∇Ŷm l Ym , W̃ n .
1 1
Ỹ − Y = W̃ X̃ − Y = 0,
Thus, there exists some k-th row of Ỹ − Y that does not equal to 0.
We can rearrange the rows of W̃ and Y simultaneously, while W̃ is maintained
as the local minimizer of f (W ) and f (W̃ ) invariant.1 Without loss of generality,
we assume k = 1 (k is the index of the row). Set u = V 1,· and vi = Ỹ1,i in Lemma
7.7. There exists a non-empty separation I = [1 : l ] and J = [l + 1 : m] of S =
{1, 2, . . . , m} and a vector β ∈ Rd X , such that
(1.1) for any positive constant α small enough, and i ∈ I , j ∈ J , Ỹ1,i − αβ T xi <
Ỹ1, j − αβ T
xj;
(1.2) i∈I V 1,i = 0.
Define
1
η1 = Ỹ1,l − αβ T xl +
min Ỹ1,i − αβ T xi − Ỹ1,l − αβ T xl .
2 i∈{l +1,...,m}
Ỹ1,i − αβ T xi − η1
1
= Ỹ1,i − αβ T xi − Ỹ1,l + αβ T xl −
min Ỹ1,i − αβ T xi − Ỹ1,l − αβ T xl
2 i∈{l +1,...,m}
<0,
Ỹ1, j − αβ T x j − η1 > 0
1
= Ỹ1, j − αβ T xi − Ỹ1,l + αβ T xl − min Ỹ1,i − αβ T
x i − Ỹ1,l − αβ xl
T
2 i∈{l +1,...,m}
1
≥ min Ỹ1,i − αβ T
x i − Ỹ1,l − αβ xl
T
> 0.
2 i∈{l +1,...,m}
lim γ = 0,
α→0+
lim
min Ỹ1,i − αβ T xi − Ỹ1,l − αβ T xl = min Ỹ1,i − Ỹ1,l > 0.
α→0+ i∈{l +1,...,m} i∈{l +1,...,m}
Ỹ1,i − αβ T xi − η1 + |γ |
1
≤− min Ỹ1,i − αβ xi − Ỹ1,l − αβ xl
T T
+ |γ | < 0,
2 i∈{l +1,...,m}
Ỹ1, j − αβ T x j − η1 − |γ |
1
≥ min Ỹ1,i − αβ T
x i − Ỹ1,l − αβ T
x l − |γ | > 0.
2 i∈{l +1,...,m}
Ỹi, j − ηi > 0.
Now we construct a point in the parameter space whose empirical risk is smaller
than the proposed local minimum in Lemma 7.4 as follows
T
W̃1 = W̃1,[1:d
T
X]
− βα T , −W̃1,[1:d
T
X]
+ βα T , W̃2,[1:d
T
X]
, · · ·W̃dTY ,[1:d X ] , 0d X ×(d1 −dY −1) ,
(7.17)
T
b̃1 = W̃1,[d X +1] − η1 + γ , . . . , W̃2,[d X +1] − η2 , · · ·W̃dY ,[d X +1] − ηdY , 01×(d1 −dY −1) ,
(7.18)
⎡ ⎤
1
s+ +s−
− s+ +s1
−
0 0 ··· 0 0 ··· 0
⎢ . .. .. ⎥
⎢
⎢ 0 0 1
s+
· · · 0 .. . .⎥⎥
⎢ .. .. .. 1 .. ⎥
W̃2 = ⎢
⎢ . . . s+ · · · 0 . ⎥,
⎥ (7.19)
⎢ ⎥
⎢ .. .. .. .. . . .. .. .. ⎥
⎣ . . . . . . . .⎦
0 0 0 0 · · · s1+ 0 ··· 0
and T
b̃2 = η1 , η2 , · · ·, ηdY , (7.20)
where W̃i and b̃i are the weight matrix and the bias of the i-th layer, respectively.
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 143
After some
calculations, the network output of the first layer before the activation
2 2
in terms of W̃i , b̃i is
i=1 i=1
⎡ ⎤
W̃1,· X̃ − αβ T X − η1 1mT + γ 1mT
⎢ ⎥
⎢ −W̃1,· X̃ + αβ T X + η1 1mT + γ 1mT ⎥
⎢ ⎥
⎢ W̃2,· X̃ − η2 1mT ⎥
⎢ ⎥
Ỹ (1) = W̃1 X + b̃1 1m = ⎢
T
⎥.
⎢ .. ⎥
⎢ . ⎥
⎢ ⎥
⎣ W̃dY ,· X̃ − ηdY 1m
T ⎦
0(d1 −dY −1)×m
Therefore, the output of the whole neural network is
⎛⎡ ⎤⎞
W̃1,· X̃ − αβ T X − η1 1mT + γ 1mT
⎜⎢ ⎥⎟
⎜⎢ −W̃1,· X̃ + αβ T X + η1 1mT + γ 1mT ⎥⎟
⎜⎢ ⎥⎟
⎜⎢ W̃2,· X̃ − η2 1mT ⎥⎟
⎜⎢ ⎥⎟
Ŷ = W̃2 h s− ,s+ ⎜⎢ .. ⎥⎟ + b̃2 1mT .
⎜⎢ ⎥⎟
⎜⎢ . ⎥⎟
⎜⎢ ⎥⎟
⎝⎣ W̃dY ,· X̃ − ηdY 1m
T
⎦⎠
0(d1 −dY −1)×m
Specifically, if j ≤ l ,
2 2
xj
Ỹ (1) W̃i , b̃i = W̃1,· − αβ T x j − η1 + γ
i=1 i=1 1, j
1
= Ỹ1, j − αβ T x j − η1 + γ < 0,
2 2
x
Ỹ (1) W̃i , b̃i = − W̃1,· j + αβ T x j + η1 + γ
i=1 i=1 2, j
1
= − Ỹ1, j + αβ T x j + η1 + γ > 0.
2
2
Therefore, (1, j)-th component of Ŷ W̃i , b̃i is
i=1 i=1
2
2
Ŷ W̃i , b̃i
i=1 i=1 1, j
⎛⎡ ⎤⎞
W̃1,· X − αβ T X − η1 1m
T + γ 1T
m
⎜⎢ T + γ 1T ⎥⎟
⎜⎢ −W̃1,· X + αβ T X + η1 1m m ⎥⎟
⎜⎢ ⎥⎟
⎜⎢ W̃ X − η 1 T ⎥⎟
1 , − 1 , 0, . . . , 0 ⎜⎢ 2,· 2 m ⎥⎟
= s+ +s− s+ +s− h s ,s
− + ⎜ ⎢ .. ⎥⎟ + η1
⎜⎢ ⎥⎟
⎜⎢ . ⎥⎟
⎜⎢ ⎥⎟
⎝⎣ W̃dY ,· X − ηdY 1m T ⎦⎠
T
0d1 −dY −1 1m ·, j
144 7 The Geometry of the Loss Surfaces
s− s+
= (Ỹ1, j − αβ T x j − η1 + γ ) − (−Ỹ1, j + αβ T x j + η1 + γ ) + η1
s+ + s− s+ + s−
s− − s+
= Ỹ1, j − αβ T x j + γ; (7.21)
s+ + s−
and
2
2 s+
Ŷ W̃i , b̃i = (Ỹi, j − ηi ) + ηi = Ỹi, j , i ≥ 2. (7.23)
i=1 i=1 i, j s+
2 2
Thus, the empirical risk of the neural network with parameters W̃i , b̃i
i=1 i=1
is
2 2
R̂ S W̃i , b̃i
i=1 i=1
1
m
= l Yi , W̃2 W̃1 xi + b̃1 1m
T
+ b̃2 1m
T
m
i=1
m
1 x x x
= l Yi , W̃ i + ∇Ŷi l Yi , W̃ i W̃2 W̃1 xi + b̃1 1m
T
+ b̃2 1m
T
− W̃ i
m 1 1 1
i=1
m
x
+ o W̃2 W̃1 xi + b̃1 1m
T
+ b̃2 1m
T
− W̃ i . (7.24)
1
i=1
m
x
o W̃2 W̃1 xi + b̃1 + b̃2 − Ŵ i
1
i=1
⎛, ⎞
- 2
m
- n x
= o ⎝. W̃2 W̃1 xi + b̃1 + b̃2 − W̃ i ⎠
1 j
i=1 j=1
= o(γ ).
l
s+ −s−
Let α be sufficiently small while sgn(γ ) = −sgn V
s+ +s− 1,i
. We have
i=1
m
m
x
l Yi , W̃2 W̃1 xi + b̃1 + b̃2 − l Yi , Ŵ i
1
i=1 i=1
l
s+ − s− (∗∗)
= 2γ V 1,i + o(γ ) < 0,
i=1
s+ + s−
Stage (2) proves that neural networks with arbitrary hidden layers and two-piece
linear activation h s− ,s+ have spurious local minima. Here, we still assume s+ = 0.
We have justified this assumption in Stage (1).
146 7 The Geometry of the Loss Surfaces
This stage is organized similarly with Stage (1): (a) Lemma 7.5 constructs a local
minimum; and (b) Theorem 7.10 proves the minimum is spurious.
Step (a). Construct local minima of the loss surface.
Lemma 7.5 Suppose that all the conditions of Lemma 7.4 hold, while the neural
network has L − 1 hidden layers. Then, this network has a local minimum at
W̃ W̃ − η1dY
Ŵ1 = ·,[1:d X ] , b̂1 = ·,d X +1 ,
0(d1 −dY )×d X −η1d1 −dY
1 1
dY di
Ŵi = E j, j + E j,(dY +1) , b̂i = 0(i = 2, 3, ..., L − 1),
s+ j=1 s+ j=d +1
Y
and
Ŵ L = 1
I
s+ dY
0dY ×(dL−1 −dY ) , b̂L = η1dY ,
where Ŵi and b̂i are the weight matrix and the bias of the i-th layer, respectively,
and η is a negative constant with absolute value sufficiently large such that
Proof Recall the discussion in Lemma 7.4 that all components of Ŵ1 X + b̂1 1mT are
positive. Specifically,
⎡ ⎤
Ỹ − η1dY 1mT
Ŵ1 X + b̂1 1mT = ⎣ ⎦,
−η1d1 −dY 1mT
⎡ ⎤
Ỹ − η1dY 1mT
Ỹ (1) = Ŵ1 X + b̂1 1mT = ⎣ ⎦,
−η1d1 −dY 1mT
and
Ỹ (k+1) = Ŵk+1
Y (k) + b̂k+1
1mT
⎛ ⎞ ⎡ ⎤
1 ⎝
dY dk+1 Ỹ − η1dY 1mT
= E j, j + E j,(dY +1) ⎠ s+ ⎣ ⎦
s+ j=1 j=dY +1 −η1dk −dY 1mT
⎡ ⎤
Ỹ − η1dY 1m T
=⎣ ⎦.
−η1dk+1 −dY 1m T
We thereby prove Eqs. (7.28) and (7.29). Therefore, Y (L) can be calculated as
148 7 The Geometry of the Loss Surfaces
L
L
Then, we show the empirical risk is higher around Ŵi , b̂i in order
L
i=1 i=1
L
to prove that Ŵi , b̂i is a local minimizer.
i=1
L
i=1
L
Let Ŵi + δW
i , b̂i + δbi
be point in the parameter space
i=1 i=1
L
L
which is close enough to the point Ŵi , b̂i . Since the distur-
i=1 i=1
bances δW i and δbi are both close to 0 (element-wise), all components of
L L
Ỹ (i) Ŵi + δW
i , b̂i + δbi
remains positive. Therefore, the output of
i=1 i=1
L L
the neural network in terms of parameters Ŵi + δW
i , b̂i
+ δ
bi is
i=1 i=1
L L
Ŷ Ŵi +
δW i , b̂i +
δbi
i=1 i=1
= (Ŵ L +
δW L )s+ . . . s+ Ŵ1
+ δW
1 X + b̂1 + δb1 1m . . . + b̂ L + δbL 1m
T T
= M1 X + M2 1mT ,
L L
L L
where M1 and M2 can be obtained from Ŵi , b̂i and
δW i i=1 , δbi i=1
i=1 i=1
through several multiplication and summation operations2 .
Rewrite the output as
X
M1 X + M2 1mT = M1 M2 .
1nT
2 Since the exact form of M1 and M2 are not needed, we omit the exact formulations here.
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 149
L L L
L
R̂ S Ŵi + δW
i , b̂i + δbi
= f ( M1 M2 ) ≥ f (W̃ ) = R̂ S Ŵi , b̂i .
i=1 i=1 i=1 i=1
(7.30)
L L
Since Ŵi + δW
i , b̂i + δbi
are arbitrary within a sufficiently small
i=1 i=1
2
L L 2
neighbour of Ŵi , b̂i , Eq. (7.30) yields that Ŵi , b̂i is a
i=1 i=1 i=1 i=1
local minimizer.
Theorem 7.10 Under the same conditions of Lemma 7.5 and Assumptions 7.1, 7.2,
and 7.4, the constructed spurious local minima in Lemma 7.5 are spurious.
Proof We first construct the weight matrix and bias of the i-th layer as follows,
W̃1 =
W̃1 , b̃1 = b̃1 ,
W̃2 b̃2
W̃2 = , b̃2 = λ1d2 + ,
0(d2 −dY )×d1 0(d2 −dY )×1
1 dY
W̃i = E i,i , b̃i = 0(i = 3, 4, ..., L − 1),
s+ i=1
and
1
dY
W̃ L = E i,i , b̃L = −λ1dY ,
s+ i=1
where W̃1 , W̃2 , b̃1 and b̃2 are defined by Eqs. (7.17), (7.18), (7.19), and (7.20),
respectively, and λ is a sufficiently large positive real such that
2 2
Ŷ W̃i , b̃i + λ1d2 1mT > 0, (7.31)
i=1 i=1
Y (i)
W̃i , b̃i = s+ ⎢ ⎣
⎥.
⎦ (7.33)
i=1 i=1
0(di −dY )×m
= s+ ⎢⎣
⎥.
⎦
λ1d2 −dY 1mT
Meanwhile,
the output of the third layer before the activation is
L L L L
(3) (2)
Ỹ W̃i , b̃i can be calculated based on Y W̃i , b̃i :
i=1 i=1 i=1 i=1
L L L
L
Ỹ (3) W̃i , b̃i = W̃3 Y (2) W̃i , b̃i + b̃3 1mT
i=1 i=1 i=1 i=1
⎡ 2 ⎤
d 2
1 Y
⎢ Ŷ W̃i , b̃i + λ1 1 T
dY m ⎥
E i,i s+ ⎢ ⎥
i=1 i=1
= ⎣ ⎦
s+ i=1
λ1d2 −dY 1m T
⎡ 2 ⎤
2
⎢ Ŷ W̃i , b̃i + λ1 1 T
dY m ⎥
=⎢ ⎥.
i=1 i=1
⎣ ⎦
0(d3 −dY )×m
= s+ ⎢⎣
⎥.
⎦
0(d3 −dY )×m
152 7 The Geometry of the Loss Surfaces
= s+ ⎢⎣
⎥.
⎦
0(dk+1 −dY )×m
Therefore, Eqs. (7.32) and (7.33) hold for any i ∈ {3, 4, ..., L − 1}.
Finally, the output of the network is
L L L
L
Ŷ W̃i , b̃i =Y (L) W̃i , b̃i
i=1 i=1 i=1 i=1
L
L
(L−1)
= W̃ L Y W̃i , b̃i + b̃L 1mT
i=1 i=1
⎡ 2 ⎤
Ŷ W̃
2
, b̃ + λ1 1 T
1
d Y
⎢ i i dY m ⎥
E i,i s+ ⎢ ⎥ − λ1d 1T
i=1 i=1
= ⎣ ⎦ Y m
s+
i=1
0(d L−1 −dY )×m
2
2
= Ŷ W̃i , b̃i .
i=1 i=1
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 153
Finally, we prove Theorem 7.5. This stage also follows the two-step strategy.
Step (a). Construct local minima of the loss surface.
Lemma 7.6 Suppose t is a non-differentiable point for the piece-wise linear acti-
vation function h and σ is a constant such that the activation h is differentiable in
the intervals (t − σ, t) and (t, t + σ ). Assume that M is a sufficiently large positive
real such that
1
Ŵ1 X + b̂1 1mT < σ. (7.37)
M F
α1 = 1
0 < αi < 1, i = 2, . . . , L − 1. (7.38)
Then, under Assumption 7.3, any neural network with piecewise linear activations
and L − 1 hidden layers has local minima at
1 1
Ŵ1 =
Ŵ1 , b̂1 = b̂ + t1d1 ,
M M 1
j=2 α j
i
Ŵi = αi Ŵi , b̂i = −αi Ŵi h(t)1di−1 + t1di + b̂i ,(i = 2, 3, ..., L − 1),
M
and
1 M
Ŵ L = M Ŵ L , b̂L = − / L−1 Ŵ L h(t)1dL−1 + b̂L
j=2 α j
L
j=2 αj
L L
where Ŵi , b̂i is the local minimizer constructed in Lemma 7.5. Also,
i=1 i=1
the loss is continuously differentiable, whose derivative with respect to the prediction
Ŷi may equal to 0 only when the prediction Ŷi and label Yi are the same.
Proof Define s− = lim− h (θ ) and s+ = lim+ h (θ ).
θ→0 θ→0
We then prove by induction that forall i ∈ [1 : L − 1],all components of the i-th
L L
layer output before the activation Ỹ (i) Ŵi , b̂i are in interval (t, t + σ ),
i=1 i=1
154 7 The Geometry of the Loss Surfaces
and
L L L L
j=1 α j
i
(i)
Y Ŵi , b̂i = h(t)1di 1mT + Y (i)
Ŵi , b̂i .
i=1 i=1 M i=1 i=1
We proved in Lemma 7.5 that Ŵ1 X + b̂1 1mT is positive (element-wise). Since the
Frobenius norm of a matrix is no smaller than any component’s absolute value,
applying Eq. (7.37), we have that for all i ∈ [1, d1 ] and j ∈ [1 : n],
1
0< Ŵ1 X + b̂1 1mT < σ. (7.40)
M ij
Therefore, 1
M
Ŵ1 X + b̂1 1mT + t ∈ (t, t + σ ). So,
ij
L L L
L
(1)
Y Ŵi
, b̂i = h Ỹ (1)
Ŵi
, b̂i
i=1 i=1 i=1 i=1
(∗) 1 1 T
= h s− ,s+ Ŵ X + b̂1 1m + h(t)1d1 1mT
M 1 M
1 (1) L L
= Y Ŵi , b̂i + h(t)1d1 1mT ,
M i=1 i=1
L L
Lemma 7.5 has proved that all components of Ỹ (k+1)
Ŵi , b̂i are
L
i=1 i=1
L
contained in Ỹ (1) Ŵi , b̂i . Combining
i=1 i=1
1 (1) L L
t1d1 1mT < t1d1 1mT + Ỹ Ŵi , b̂i < (t + σ )1d1 1mT ,
M i=1 i=1
we have
L L (∗)
i=1 αi
k+1
(k+1)
t1dk+1 1mT < t1dk+1 1mT + Ỹ Ŵi , b̂i < (t + σ )1dk+1 1mT .
M i=1 i=1
Here < are all element-wise, and inequality (∗) comes from the property of αi
(Eq. (7.38)).
Furthermore, the (k + 1)-th layer output after the activation is
L L L
L
Y (k+1) Ŵi , b̂i = h Ỹ (k+1) Ŵi , b̂i
i=1 i=1 i=1 i=1
L
1 L
= h t1dk+1 1mT
+ Ỹ (k+1) Ŵi , b̂i
M i=1 i=1
k+1 L
i=1 αi (k+1)
(∗) L
=h(t)1dk+1 1mT
+ h s− ,s+ Ỹ Ŵi , b̂i
M i=1 i=1
k+1
αi L L
= h(t)1dk+1 1m
T
+ i=1 Y (k+1) Ŵi , b̂i ,
M i=1 i=1
where Eq. (∗) is because of Eq. (7.41). The above argument is proved for any index
k ∈ {1, . . . , L − 1}.
156 7 The Geometry of the Loss Surfaces
Therefore,
L L L
L
R̂ S Ŵi
, b̂i = R̂ S Ŵi
, b̂i = f (W̃ ).
i=1 i=1 i=1 i=1
L
L
We then introduce some small disturbances δW i i=1 , δbi i=1 into
L L
Ŵi , b̂i in order to check the local optimality.
i=1 i=1
Since all comonents of Y (i) are in interval (t, t + σ ), the activations in every
hidden layers is realized at linear parts. Therefore, the output of network is
L L
Ŷ Ŵi +
δW i , b̂i +
δbi
i=1 i=1
= Ŵ L +
δW L s+ · · · s+ Ŵ1
+ δW
1 X + b̂1 + δb1 1m + f (t)1d1 1m · · ·
T T
Proof of Theorem 7.5 Without loss of generality, we assume that all activations are
the same.
Let t be a non-differentiable point of the piece-wise linear activation function h
with
s− = lim− h (θ ),
θ→0
s+ = lim+ h (θ ).
θ→0
Then,
we prove by induction that for any i ∈ [2 : L − 1], all components of
L L
Ỹ (i) W̃i , b̃i are in interval (t, t + δ), and
i=1 i=1
L L L L
(i) 1
Y W̃i , b̃i = h(t)1di 1mT + Y (i)
W̃i , b̃i .
i=1 i=1 M̃ M i=1 i=1
First,
L L 1
Ỹ (1) W̃i , b̃i = W̃1 X + b̃1 1mT = (W̃ X + b̃1 1mT ) + t1dT1 1mT .
i=1 i=1 M 1
(7.44)
Thus,
1
(W̃ X + b̃1 1mT ) + t1dT1 1mT ∈ (t − σ, t + σ ). (7.45)
M 1 ij
where Eq. (∗) is from Eq. (7.41) for any x ∈ (t − δ, t + δ). Also,
L L L
L
(2)
Ỹ W̃i
, b̃i (1)
= W̃2 Y W̃i
, b̃i + b̃2 1mT
i=1 i=1 i=1 i=1
1 1 (1) L L
= W̃2 h(t)1d1 1m + YT
W̃i , b̃i
M̃ M i=1 i=1
1 1
+ t1d2 1mT − h(t)W̃2 1d1 1mT + b̃2 1mT
M̃ M M̃
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 159
L L
1
= Ỹ (2) W̃i , b̃i + t1d2 1mT .
M M̃ i=1 i=1
L L
Recall in Theorem 7.10 we prove all components of Ỹ (2)
W̃i , b̃i are
i=1 i=1
positive. Combining the definition of M̃ (Eq. (7.43)), we have
L L
(2)
t1d2 1mT < Ỹ W̃i , b̃i
i=1 i=1
L L
1 (2)
= Ỹ W̃i , b̃i + t1d2 1mT
M̃ M i=1 i=1
Therefore,
L L
L L
(2) (2)
Y W̃i , b̃i = h Ỹ W̃i , b̃i
i=1 i=1 i=1 i=1
L
1 (2)
L
=h Y W̃i , b̃i + t1d2 1m T
M̃ M i=1 i=1
L
1 (2)
L
= h(t)1d2 1m + h s− ,s+
T
Ỹ W̃i , b̃i
M̃ M i=1 i=1
1 L L
= h(t)1d2 1mT + h s− ,s+ Ỹ (2) W̃i , b̃i
M̃ M i=1 i=1
1 L L
= h(t)1d2 1mT + Y (2) W̃i , b̃i .
M̃ M i=1 i=1
L L
1 (k+1)
t1dk+1 1mT < Ỹ W̃i , b̃i + t1dk+1 1mT < (t + σ )1dk+1 1mT .
M M̃ i=1 i=1
Therefore,
L L
L L
Y (k+1) W̃i , b̃i = h Ỹ (k) W̃i , b̃i
i=1 i=1 i=1 i=1
L
1 L
=h Ỹ (k+1) W̃i , b̃i + t1dk+1 1mT
M M̃ i=1 i=1
L
1 L
= h(t)1dk+1 1mT + Y (k+1) W̃i , b̃i .
M M̃ i=1 i=1
+ b̃L 1m
T
− M M̃ W̃ L h(t)1d L−1 1m
T
L L
= Y (L) W̃i , b̃i .
i=1 i=1
Therefore,
L L L
L
R̂ S W̃i
, b̃i = R̂ S W̃i
, b̃i . (7.46)
i=1 i=1 i=1 i=1
Corollary 7.3 Suppose that Assumptions 7.1, 7.2, and 7.6 hold. Neural networks
with arbitrary depth and arbitrary piecewise linear activations (excluding linear
functions) have infinitely many spurious local minima under arbitrary continuously
differentiable loss whose derivative can equal 0 only when the prediction and label
are the same.
W̃1 = (W̃1,[1:d
T
X]
− βα T , W̃1,[1:d
T
X]
, −W̃1,[1:d
T
X]
+ βα T ,W̃2,[1:d
T
X]
, . . . , W̃dTY ,[1:d X ] , 0W̃1 )T ,
T
b̃1 = ν + γ , W̃1,[d X +1] − η, −ν + γ , W̃2,[d X +1] − η2 , · · ·W̃dY ,[d X +1] − ηdY , 0b̃1 ,
(7.47)
⎡ ⎤
1 1
2s+ s+
− 2s1+ 0 0 ··· 0 0 ··· 0
⎢ 1
··· ··· 0⎥
⎢ 0 0 0 s+
0 0 0 ⎥
⎢ ⎥
W̃2 = ⎢ 0 0 0 0 s1+ ··· 0 0 ··· 0⎥, (7.48)
⎢ ⎥
⎢ .. .. .. .. .. .. .. .. .. .. ⎥
⎣ . . . . . . . . . .⎦
0 0 0 0 0 ··· 1
s+
0 ··· 0
and T
b̃2 = η, η2 , · · ·, ηdY , (7.49)
where α, β, and ηi are defined the same as those in Theorem 7.9, and η is defined by
Eq. (7.25).
162 7 The Geometry of the Loss Surfaces
2 2
W̃2 Ỹ (1) W̃i , b̃i
1 i=1 i=1 j
1
= −s+ Ỹ1, j − αβ T x j − η1 + γ + 2s+ Ỹ1, j − η − s+ −Ỹ1 j + αβ T x j + η1 + γ
2s+
+η
1
= 2s+ Ỹ1, j − 2s+ η − 2s+ γ + η
2s+
= Ỹ1, j − γ .
2 2
Otherwise ( j > l ), the (1, j)-th component of Ŷ W̃i , b̃i is
i=1 i=1
2 2
W̃2 Ỹ (1) W̃i , b̃i
1 i=1 i=1 j
1
= s+ Ỹ1, j − αβ T x j − η1 + γ + 2s+ Ỹ1, j − η − s− −Ỹ1 j + αβ T x j + η1 + γ
2s+
+η
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 163
1
= s+ Ỹ1, j − αβ T x j − η1 + γ + 2s+ Ỹ1, j − η + s+ −Ỹ1 j + αβ T x j + η1 + γ
2s+
+η
1
= 2s+ Ỹ1, j − 2s+ η + 2s+ γ + η
2s+
= Ỹ1, j + γ ,
2 2
and the (i, j)-th (i > 1) component of Ŷ W̃i , b̃i is Ỹi, j .
i=1 i=1
Therefore, we have
⎧
⎨ − γ,
⎪ j = 1, i ≤ l;
x
W̃2 W̃1 xi + b̃1 + b̃2 − W̃ i = γ, j = 1, i > l;
1 j ⎪⎩
0, j ≥ 2.
n
u i = 0, (7.50)
i=1
while {x1 ,...,xm } is a set of vector ⊂ Rm×1 . Suppose index set S = {1, 2, . . . , m}.
Then for any series of real number {v1 , · · · , vm }, there exists a non-empty separation
I , J of S, which satisfies I ∪ J = S, I ∩ J = ∅ and both I and J are not empty, a
vector β ∈ Rm×1 ,such that,
(1.1) for any sufficiently small positive real α, i ∈ I , and j ∈ J , we have vi −
αβ T xi < v j − αβ T x j ;
(1.2) i∈I u i = 0.
Proof If there exists a non-empty separation I and J of the index set S, such that
when β = 0, (1.1) and (1.2) hold, the lemma is apparently correct.
Otherwise, suppose that there is no non-empty separation I and J of the index
set S such that (1.1) and (1.2) hold simultaneously when β = 0.
Some number vi in the sequence (v1 , v2 , . . . , vm ) are probably equal to each other.
We rerarrange the sequence by the increasing order as follows,
sj
u i = 0.
i=1
sj
u i = 0.
i=1
vi − αβ T xi = vi < v j = v j − αβ T x j ,
and
sj
ui = u i = 0,
i∈I i=1
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 165
which are exactly the arguments (1.1) and (1.2). Thereby we construct a contrary
example. Therefore, for any j ∈ {1, 2, . . . , k − 1}, we have
sj
u i = 0.
i=1
Since we assume that u = 0, there exists an index t ∈ {1, . . . , k − 1}, such that
there exists an index i ∈ {st + 1, ..., st+1 } that u i = 0.
Let l ∈ {st + 1, ..., st+1 } is the index such that xl has the largest norm while u l = 0:
We further rearrange the sequence (vst +1 , ..., vst+1 ) such that there is an index
l ∈ {st + 1, . . . , st+1 },
xl = max xj ,
j∈{st +1,...,st+1 },u j =0
and
It is worth noting that it is probably l = st+1 , but it is a trivial case that would not
influence the result of this lemma.
Let I = {1, ..., l }, J = {l + 1, ..., n}, and β = xl . We prove (1.1) and (1.2) as
follows.
Proof of argument (1.1) We argue that for any i ∈ I , vi − αβ T xi ≤ vl − αβ T xl and
for any j ∈ J , v j − αβ T x j > vl − αβ T xl .
There are three situations:
(A) i ∈ {1, . . . , st } and j ∈ {st+1 + 1, . . . , n}. Applying Eq. (7.51), for any i ∈
{1, . . . , st } and j ∈ {st+1 + 1, . . . , m}, we have that vi < vl and v j > vl . Therefore,
when α is sufficiently small, we have the following inequalities,
vi − αβ T xi < vl − αβ T xl ,
v j − αβ T x j > vl − αβ T xl .
−αβ, xi ≤ −α β 2
= −αβ, xl .
vi − αβ T xi ≤ vl − αβ T xl .
v j − αβ T x j > vl − αβ T xl ,
xl , xi ≤ xl x i ≤ xl 2
,
where the first inequality strictly holds if the vector xl and xi have the same direction,
while the second inequlity strictly holds when xi and xl have the same norm. Because
xl = xi , we have the following inequality,
xl , xi < xl 2
,
xl , xi ≥ xl 2
, ∀i ∈ {st + 1, . . . , l }.
Therefore,
st l −1
ui = ui + u i + u l = u l = 0.
i∈I i=1 i=st +1
while if l = st+1 ,
min vi − αβ T xi − vl − αβ T xl = min vi − vl + αβ T (xl − xi ).
i∈{l +1,...,m} i∈{l +1,...,m}
(7.58)
(7.57) and (7.58) make senses because l < m. Otherwise, from Lemma
Equations
m
7.7 we have i=1 u i = 0, which contradicts to the assumption.
This section gives the proofs of Theorems 7.7, 7.8, Corollaries 7.1, and 7.2 omitted
from Sect. 7.2.2.
We first check that the squared loss is strictly convex, which is even restrictive than
“convex”.
Lemma 7.8 The empirical risk R̂ S under squared loss is strictly convex with respect
to the prediction Ŷ .
Proof The second derivative of the empirical risk R̂ S under squared loss with respect
to the prediction Ŷ is
∂ 2 lce (Y, Ŷ ) ∂ 2 (y − Ŷ )2
= = 2 > 0.
∂ Ŷ 2 ∂ Ŷ 2
Therefore, the empirical risk R̂ S under squared loss is strictly convex with respect to
prediction Ŷ .
168 7 The Geometry of the Loss Surfaces
In the context where all activations are linear functions, the neural network simplifies
to a multilinear model characterized by a smooth and multilinear loss surface. How-
ever, when nonlinear activations are introduced, the landscape of the loss surface
becomes more complex due to the nonlinearity introduced by these activation func-
tions. When input data passes through the linear portions of activation functions, the
resulting output resides in a smooth and multilinear region of the loss surface. This
smoothness and linearity allow for predictable behavior under parameter changes,
ensuring that each region expands into an open cell with continuous and differentiable
characteristics.
Conversely, nonlinear points within the activations are non-differentiable, leading
to non-smooth empirical risk concerning the parameters. These nonlinear points
correspond to the non-differentiable boundaries between cells on the loss surface,
where the loss function exhibits abrupt changes due to the inherent nonlinearity of
the activations.
Proof of Theorem 7.7 In every cell, the input sample points flows through the same
linear parts of the activations no matter what values the parameters are.
(1) We first proves that the empirical risk R̂ S equals to a convex function with
respect to a variable Ŵ that is calculated from the parameters W .
Suppose (W1 , W2 ) is a local minimum within a cell. We argue that
m
m
l yi , W2 diag A·,i W1 xi = l yi , A·,i
T
diag(W2 )W1 xi , (7.59)
i=1 i=1
m
LHS = l yi , W2 diag A·,i W1 xi
i=1
7.3 Proofs of Theorems 7.7, 7.8, Corollaries 7.1, and 7.2 169
m
= l yi , (W2 )1,1 A1,i · · · (W2 )1,d1 Ad1 ,i W1 xi .
i=1
m
RHS = l yi , A·,i
T
diag(W2 )W1 xi ,
i=1
m
= l yi , (W2 )1,1 A1,i · · · (W2 )1,d1 Ad1 ,i W1 xi .
i=1
Also define
X̂ = A·,1 ⊗ x1 · · · A·,m ⊗ xm . (7.63)
1 T 1
m m
R̂ S (W1 , W2 ) = l yi , A·,i diag(W2 )W1 xi = l yi , Ŵ X̂ i .
m i=1 m i=1
(7.65)
The empirical risk is rearranged as a convex function in terms of Ŵ which unite the
two weight matrices W1 and W2 and the activation h are together as Ŵ .
Applying Eqs. (7.61) and (7.62), we have
Ŵ = (W2 )1 (W1 )1,· · · · (W2 )d1 (W1 )d1 ,· .
(2) We then prove that the local minima (including global minima) of the empirical
risk R̂ S with respect to the parameter W is also local minima with respect to the
corresponding variable Ŵ .
170 7 The Geometry of the Loss Surfaces
∂ R̂ S
0=
∂(W1 )i, j
n
∂ l Y·,k , Ŵ X̂
k=1 ·,k
=
∂(W1 )i, j
n
= 0 · · · 0 (W2 )i 0 · · · 0 X̂ k ∇ Ŵ X̂
l Y·,k , Ŵ X̂ ,
k=1 ·,k (7.66)
0 12 3 0 12 3 ·,k
(ed X (i−1)+ j X̂ )∇ = 0.
and
When ε is sufficiently small, W1 and W2 are also sufficiently small. Since
(W1 , W2 ) is a local minimum, we have
1
m
l Yk , Ŵ + X̂ = R̂ S (W1 + W1 , W2 + W2 )
m k=1 k
1
m
≥ R̂ S (W1 , W2 ) = l Yk , Ŵ X̂ , (7.67)
m k=1 k
7.3 Proofs of Theorems 7.7, 7.8, Corollaries 7.1, and 7.2 171
= [(W2 + W2 )1 (W1 + W1 )1 , . . . , (W2 + W2 )d1 (W1 + W1 )d1 ]
− (W2 )1 (W1 )1 , . . . , (W2 )d1 (W1 )d1
(∗)
=[00 ·12
· · 03 , ε2 u 2 (εu1 + (W1 )i ) , 0
0 ·12
· · 03 ]. (7.68)
d X (i−1) d1 d X −d Xi
Here, Eq. (∗) comes from (W2 )i = 0. Rearrange Eq. (7.67) and apply the Taylor’s
Theorem, we can get that
· X̂ ∇ + O · X̂ 2
≥ 0.
[0 · · · 0, ε2 u 2 (εu1 + (W1 )i ) , 0 · · · 0] X̂ ∇
+ ε4 O( [0 · · · 0, u 2 (εu1 + (W1 )i ) , 0 · · · 0] X̂ 2
)
(∗∗)
= [0 · · · 0, ε3 u 2 u1 , 0 · · · 0] X̂ ∇ + ε4 O( [0 · · · 0, u 2 (εu1 + (W1 )i ) , 0 · · · 0] X̂ 2
)
= ε [0 · · · 0, u 2 u1 , 0 · · · 0] X̂ ∇ + o(ε ) ≥ 0.
3 3
(7.69)
Here, Eq. (∗∗) can be obtained from follows. Because W2 is a local minimizer, for
any component (W2 )i of W2 ,
m
∂ k=1 l Yk , Ŵ X̂
k
=0,
∂(W2 )i
which leads to
[ 00 ·12
· · 03, (W1 )i , 00 ·12
· · 03 ] X̂ ∇ = 0.
d X (i−1) d X d1 −id X
· · 03, u 2 u1 , 00 ·12
[ 00 ·12 · · 03 ] X̂ ∇ ≥ 0.
d X (i−1) d X d1 −id X
Since u1 and u 2 are arbitrarily picked (while the norms equal 1), the inequality above
further leads to
(7.70)
0 · · · 0 e 0 · · · 0 X̂ ∇ = 0,
j
ed0 (i−1)+ j X̂ ∇ = 0,
m
R̂ S (W ) = l(Yi , W X̂ i ). (7.71)
i=1
Proof of Theorem 7.8 and Corollary 7.1 In the proof of Theorem 7.7, we constructed
a map Q: (W1 , W2 ) → Ŵ . Further, in any fixed cell, the represented hypothesis of a
neural network is uniquely determined by Ŵ .
We first prove that all local minima in a cell are concentrated as a local minimum
valley. Since the loss function l is strictly convex, the empirical risk has one unique
local minimum (which is also a global minimum) with respect to Ŵ in every cell,
if there exists some local minimum in the cell. Meanwhile, we have proved that all
local minima with respect to (W1 , W2 ) are also local minima with respect to the
corresponding Ŵ . Therefore, all local minima with respect to (W1 , W2 ) correspond
one unique Ŵ . Within a cell, when W1 expands by a positive real factor α to W1 and W2
shrinks by the same positive real factor α to W2 , we have Q(W1 , W2 ) = Q(W1 , W2 ),
i.e., the Ŵ remains invariant.
Further, we argue that all local minima in a cell are connected with each other by
a continuous path, on which the empirical risk is invariant. For every local minima
pair (W1 , W2 ) and (W1 , W2 ), we have
Since h s− ,s+ (W1 X ) = h s− ,s+ (W1 X ) (element-wise), for every i ∈ [1, d1 ],
sgn ((W2 )i ) = sgn W2 i .
7.3 Proofs of Theorems 7.7, 7.8, Corollaries 7.1, and 7.2 173
if
Q(W11 , W21 ) = Q(W12 , W22 ).
Q(W1 , W2 ) = Q(W1 , W2 ).
Therefore,
(W1 , W2 ) ∼ R (W1 , W2 ).
(2) Symmetry:
For any pair (W11 , W21 ) and (W12 , W22 ), Suppose that
Thus,
Q(W11 , W21 ) = Q(W12 , W22 ).
Apparently,
Q(W12 , W22 ) = Q(W11 , W21 ).
Therefore,
Q(W12 , W22 ) ∼ R Q(W11 , W21 ).
(3) Transitivity:
For any (W11 , W21 ), (W12 , W22 ), and (W13 , W23 ), suppose that
Then,
Apparently,
Q(W11 , W21 ) = Q(W12 , W22 ) = Q(W13 , W23 ).
Therefore,
(W11 , W21 ) ∼ R (W13 , W23 ).
the inverse of (W1 , W2 ) is defined to be (−W1 , W2 ) and the zero element is defined
to be (0, 11×d1 ).
Obviously, the following is a linear mapping:
if and only if
(W11 , W21 ) ⊕ (−W12 , W22 ) ∈ Ker(Q).
Therefore, the quotient space (Rd1 ×d X , R1×d1 )/Ker(Q) is a definition of the equiva-
lence relation ∼ R .
The proof is completed.
What does the loss surface actually look like? Understanding the geometric structure
of the loss surface could lead to significant advancements in our comprehension of
deep learning.
Linear partition of the loss surface. Soudry and Hoffer (2018) introduced the
concept of a smooth and multilinear partition within the loss surface of neural net-
works, highlighting the impact of nonlinearities in piecewise linear activations. These
nonlinearities effectively segment the loss surface into distinct regions characterized
7.4 Geometric Structure of the Loss Surface 175
by smooth and multilinear properties. Specifically, every nonlinear point within the
activation functions contributes to a set of non-differentiable boundaries between
cells, while the linear segments of the activations correspond to the smooth and mul-
tilinear interiors of these cells. This decomposition provides insights into the geo-
metric structure of the loss surface and its relationship with neural network behavior
and training dynamics.
He et al. (2020) demonstrated several intricate properties of mode connectivity:
(1) Within an open cell, if local minima exist, they are equally optimal in terms of
empirical risk, making all local minima global within that cell. This highlights a
uniformity of performance among local minima within the same region of the loss
surface. (2) The local minima within any open cell form an equivalence class and
are concentrated in a valley, suggesting a concentrating effect of optimal solutions in
specific regions of the loss landscape. (3) When all activations are linear, the partition
collapses into a single cell, including linear neural networks as a special case. This
observation suggests the value of nonlinear activations in shaping the complexity
and structure of the loss surface. These three findings provide important insights into
the behavior and landscape of neural network loss surfaces under different activation
regimes, shedding light on the nature of optimization and generalization in deep
learning models.
The property (2) introduced by He et al. (2020) elucidates the concept of mode
connectivity, which suggests that solutions discovered through SGD or its variants
are connected by a path in the weight space, where all points along the path exhibit
nearly identical empirical risk. This mode connectivity phenomenon has been empir-
ically observed in studies by Garipov et al. (2018) and Draxler et al. (2018). Recently,
Kuditipudi et al. (2019) provided theoretical support by demonstrating that mode con-
nectivity can be assured through dropout stability and noise stability. These findings
contribute to the understanding of how optimization algorithms explore the weight
space and the resilience of solutions under perturbations and dropout conditions.
Correspondence between the landscapes of the empirical risk and expected
risk: Two seminal works by Zhou and Feng (2018) and Mei et al. (2018) have theoret-
ically revealed a critical link between the landscapes of the empirical risk surface and
the expected counterpart. This correspondence suggests that studying the geometric
properties of empirical risk surfaces can provide insights into model generalizability,
which pertains to the gap between expected and empirical risks. Notably, these stud-
ies demonstrated that the gradient of the empirical risk, the stationary points of the
empirical risk, and the empirical risk itself can all be asymptotically approximated
by their population equivalents. Moreover, they delivered a generalization bound for
the nonconvex scenario: with probability at least 1 − δ,
$$\mathcal{O}\!\left(\tau\,\big[1 + c_r(D-1)\big]^{\,l-1}\sqrt{\frac{s\log(mU/D) + \log(4/\delta)}{m}}\right),$$
where all data $x$ are assumed to be $\tau$-sub-Gaussian, $c_r$ is a constant of order $r^2/16$ determined by the magnitude bound $r$ on the weights, and $s$ is the number of nonzero components of the weights. Later, Mei et al. (2018)
proved a similar correspondence between the Hessian matrices of the expected risk
and the empirical risk under the assumption that the sample size is greater than the
number of parameters.
Eigenvalues of the Hessian. Sagun et al. (2016, 2018) and Papyan (2018) conducted experiments to study the eigenvalues of the Hessian of the loss surface. Sagun et al. (2016, 2018) discovered that (1) a large bulk of the eigenvalues are centred close to zero and (2) several outliers are located far from this bulk. Papyan (2018) presented the full spectrum of the Hessian matrix; in Papyan (2018, p. 2, Figs. 1(a) and 1(c), and p. 3, Figs. 2(a) and 2(c)), the spectra of the Hessian matrices obtained when training and testing a VGG-11 network on the MNIST and CIFAR-10 datasets are compared.
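For readers who wish to reproduce such spectral observations on a small model, the following sketch estimates the largest eigenvalue of the loss Hessian by power iteration on Hessian-vector products; `loss` and `params` are placeholders for a scalar loss and the model parameters, and recovering a full spectral density (as in Papyan 2018) would additionally require stochastic Lanczos-type methods.

```python
import torch

def top_hessian_eigenvalue(loss, params, num_iters=50):
    """Estimate the largest Hessian eigenvalue of `loss` with respect to `params`
    by power iteration on Hessian-vector products."""
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eigenvalue = 0.0
    for _ in range(num_iters):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        # Hessian-vector product: differentiate (grad . v) with respect to the parameters.
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v with the normalized v.
        eigenvalue = sum((h * vi).sum() for h, vi in zip(hv, v)).item()
        v = [h.detach() for h in hv]
    return eigenvalue
```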
References
Baldi, Pierre, and Kurt Hornik. 1989. Neural networks and principal component analysis: learning
from examples without local minima. Neural Networks 2 (1): 53–58.
Choromanska, Anna, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. 2015.
The loss surfaces of multilayer networks. In International Conference on Artificial Intelligence
and Statistics.
Draxler, Felix, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. 2018. Essentially no
barriers in neural network energy landscape. In International Conference on Machine Learning.
Freeman, C Daniel, and Joan Bruna. 2017. Topology and geometry of half-rectified network
optimization. In International Conference on Learning Representations.
Garipov, Timur, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson.
2018. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural
Information Processing Systems.
Goldblum, Micah, Jonas Geiping, Avi Schwarzschild, Michael Moeller, and Tom Goldstein. 2020.
Truth or backpropaganda? An empirical investigation of deep learning theory. In International
Conference on Learning Representations.
Hanin, Boris, and David Rolnick. 2019. Complexity of linear regions in deep networks. In
International Conference on Machine Learning.
He, Fengxiang, Bohan Wang, and Dacheng Tao. 2020. Piecewise linear activations substan-
tially shape the loss surfaces of neural networks. In International Conference on Learning
Representations.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image
recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
Kawaguchi, Kenji. 2016. Deep learning without poor local minima. In Advances in Neural
Information Processing Systems.
Kuditipudi, Rohith, Xiang Wang, Holden Lee, Yi Zhang, Zhiyuan Li, Wei Hu, Sanjeev Arora, and
Rong Ge. 2019. Explaining landscape connectivity of low-cost solutions for multilayer nets. In
Advances in Neural Information Processing Systems.
Laurent, Thomas, and James von Brecht. 2018. The multilinear structure of ReLU networks. In
International Conference on Machine Learning.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521 (7553): 436.
Litjens, Geert, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco
Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I
Sánchez. 2017. A survey on deep learning in medical image analysis. Medical Image Analysis
42: 60–88.
Lu, Haihao, and Kenji Kawaguchi. 2017. Depth creates no bad local minima. arXiv preprint
arXiv:1702.08580.
Mei, Song, Yu Bai, and Andrea Montanari. 2018. The landscape of empirical risk for nonconvex
losses. The Annals of Statistics 46 (6A): 2747–2774.
Papyan, Vardan. 2018. The full spectrum of deep net hessians at scale: dynamics with sample size.
arXiv preprint arXiv:1811.07062.
Safran, Itay, and Ohad Shamir. 2018. Spurious local minima are common in two-layer ReLU neural
networks. In International Conference on Machine Learning.
Sagun, Levent, Léon Bottou, and Yann LeCun. 2016. Singularity of the hessian in deep learning.
arXiv preprint arXiv:1611.07476.
Sagun, Levent, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. 2018. Empirical analysis
of the hessian of over-parametrized neural networks. In International Conference on Learning
Representations Workshop.
Silver, David, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess-
che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. 2016.
Mastering the game of go with deep neural networks and tree search. Nature 529 (7587): 484–489.
Soudry, Daniel, and Elad Hoffer. 2018. Exponentially vanishing sub-optimal local minima in
multilayer neural networks. In International Conference on Learning Representations Workshop.
Swirszcz, Grzegorz, Wojciech Marian Czarnecki, and Razvan Pascanu. 2016. Local minima in
training of deep networks. arXiv preprint arXiv:1611.06310.
Witten, Ian H, Eibe Frank, Mark A Hall, and Christopher J Pal. 2016. Data Mining: Practical
Machine Learning Tools and Techniques. Morgan Kaufmann.
Yun, Chulhee, Suvrit Sra, and Ali Jadbabaie. 2018. Global optimality conditions for deep neural
networks. In International Conference on Learning Representations.
Yun, Chulhee, Suvrit Sra, and Ali Jadbabaie. 2019. Small nonlinearities in activation functions create
bad local minima in neural networks. In International Conference on Learning Representations.
Zhou, Pan, and Jiashi Feng. 2018. Empirical risk landscape analysis for understanding deep neural
networks. In International Conference on Learning Representations.
Zhou, Yi, and Yingbin Liang. 2018. Critical points of neural networks: analytical forms and
landscape properties. In International Conference on Learning Representations.
Chapter 8
Linear Partition in the Input Space
Recent research has demonstrated that the input space of a rectified linear unit (ReLU)
network, which exclusively uses ReLU-like (two-piece linear) activation functions,
is divided into linear regions by the nonlinear activations.
Specifically, within these linear regions, the mapping induced by a ReLU network
behaves linearly for input data. Conversely, at the boundaries between linear regions,
the mapping becomes nonlinear and nonsmooth. Intuitively, the linear regions corre-
spond to the linear segments of the ReLU activations, representing specific activation
patterns. Meanwhile, the boundaries are defined by transition points where the acti-
vation pattern changes. As a result, each input example is associated with a neural
code—a 0-1 matrix representing its activation pattern within the linear region it
occupies.
This chapter introduces the concept of a neural code, which serves as a param-
eterized representation of activation patterns induced by a neural network for input
examples. Extensive experiments demonstrate that, in most common scenarios of deep learning for classification tasks, the neural code exhibits interesting encoding properties that are shared by hash codes.
8.1 Preliminaries
Recent works have shown that the input space of a ReLU network N is partitioned
into multiple linear regions, each of which corresponds to a specific activation pat-
tern of the ReLU activation functions. In this section, we represent the activation
pattern as a matrix $P \in \mathcal{P} \subset \{0, 1\}^{l \times w}$, where $l$ and $w$ are the depth and the largest width, respectively, of this neural network $\mathcal{N}$. Specifically, the $(i, j)$-th component
characterizes the activation status of the j-th ReLU neuron in the i-th layer. If the
(i, j)-th component is equal to 1, this represents that this neuron is activated, while a
value of 0 means that this neuron is deactivated or invalid.1 The matrix P is termed
the neural code. We can also reformulate the neural code as a vector if there is no
possibility for confusion of the depth and width.
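The following sketch shows one way to extract such a neural code from a small ReLU MLP in PyTorch; the layer sizes are arbitrary placeholders, and padding with zeros mirrors the convention for invalid indices described above.

```python
import torch
import torch.nn as nn

def neural_code(mlp, x):
    """Return the 0-1 activation pattern (neural code) of a sequential ReLU MLP for input x.
    Rows index layers, columns index neurons; rows of narrower layers are padded with 0."""
    codes, widths, h = [], [], x
    for layer in mlp:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            codes.append((h > 0).int().flatten())
            widths.append(codes[-1].numel())
    w = max(widths)
    P = torch.zeros(len(codes), w, dtype=torch.int32)   # invalid indices stay 0
    for i, c in enumerate(codes):
        P[i, : c.numel()] = c
    return P

# Example with a toy two-hidden-layer ReLU MLP (hypothetical sizes).
mlp = nn.Sequential(nn.Linear(784, 50), nn.ReLU(),
                    nn.Linear(50, 30), nn.ReLU(),
                    nn.Linear(30, 10))
code = neural_code(mlp, torch.randn(784))
```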
It’s essential to note that the boundaries separating linear regions have no physical
width. This occurs because these boundaries align precisely with transition points in
the activations, effectively resembling infinitely thin lines. In more straightforward
terms, imagine these boundaries as razor-thin lines that separate different regions
within the model. Consequently, the likelihood of an example precisely landing on
one of these boundaries is virtually non-existent. Therefore, for practical analysis
purposes, we assume that no example resides within these boundaries. This simpli-
fication greatly aids in our understanding and interpretation of the model’s behavior.
Consequently, upon fixing the weights w of the model M, every example x ∈ X
can be indexed by the neural code P of the corresponding linear region. Note that
an example x can be seen in either the training sample or the test sample.
1 Different layers may have different numbers of neurons. Therefore, there might be some indices
(i, j) that are invalid. We represent the activation patterns of these neurons by 0 since they are never
activated.
8.2 Neural Networks Act as Hash Encoders
Definition 8.1 (Redundancy ratio) Suppose that there are m examples in a dataset S. If they are located in n activation regions, the redundancy ratio is defined as $(m - n)/m$.
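Given the neural codes of a dataset, the redundancy ratio of Definition 8.1 can be computed by counting distinct codes, for instance as in the minimal sketch below, where `neural_codes` is assumed to be a list of 0-1 arrays or tensors.

```python
def redundancy_ratio(neural_codes):
    """Redundancy ratio (m - n) / m from Definition 8.1, where m is the number of
    examples and n is the number of distinct neural codes (activation regions hit)."""
    m = len(neural_codes)
    n = len({tuple(code.flatten().tolist()) for code in neural_codes})
    return (m - n) / m
```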
Table 8.1 Accuracy of K -means and K -NN on the neural codes of convolutional neural networks
(CNNs) trained on MNIST
Architecture K -means acc (%) K -NN acc (%)
VGG-18 99.95 99.33
ResNet-18 98.96 99.32
ResNet-34 99.66 99.49
ResNeXt-26 98.31 99.24
DenseNet-28 69.87 98.59
Table 8.2 Accuracy of logistic regression (LR) on the neural codes of CNNs trained on CIFAR-10
Architecture LR acc (%) Test acc (%)
VGG-19 92.19 91.43
ResNet-18 89.55 90.42
ResNet-20 88.76 90.44
ResNet-32 89.05 90.45
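The K-means, K-NN, and logistic regression accuracies reported in Tables 8.1 and 8.2 come from fitting standard learners on the binary neural codes. A minimal scikit-learn sketch of the K-NN and logistic-regression variants is given below; `codes_train`, `codes_test`, `y_train`, and `y_test` are assumed to hold the flattened 0-1 codes and their class labels, and the hyperparameters shown are illustrative rather than those used in the reported experiments.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def categorization_accuracy(codes_train, y_train, codes_test, y_test):
    """Fit K-NN and logistic regression on neural codes and report test accuracies."""
    knn = KNeighborsClassifier(n_neighbors=5).fit(codes_train, y_train)
    lr = LogisticRegression(max_iter=1000).fit(codes_train, y_train)
    return {
        "knn_acc": knn.score(codes_test, y_test),
        "lr_acc": lr.score(codes_test, y_test),
    }
```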
8.3 Factors that Influence the Encoding Properties
The results presented in Tables 8.1 and 8.2 also suggest that the encoding properties
exhibit variability in different scenarios. Through comprehensive experiments, we
investigate which factors influence the encoding properties and how. The investi-
gated factors include the model size, training time, sample size, three different pop-
ular regularizers, random data, and noisy labels. Some details of the experimental
implementations are given in Sect. 8.4.
We first study how the model size influences the encoding properties. We trained
115 one-hidden-layer MLPs of different widths on the MNIST dataset and 200 five-
hidden-layer MLPs of different widths on the CIFAR-10 dataset (see Sect. 8.4 for
the full list of widths considered in our experiments), while all irrelevant variables
were strictly controlled. The experiments were repeated for 5 trials on MNIST and
10 trials on CIFAR-10.
Measurement of model size by width. When considering MLPs, a natural metric
for evaluating model complexity is the layer width.2 Using layer width as our measure
of model size, we computed both the redundancy ratio and categorization accuracy
across all scenarios, as illustrated in Fig. 8.2a, b. These plots reveal notable correla-
tions between encoding properties and layer width: (1) Redundancy Ratio: Initially,
2 Depth is also a natural measure of model size. However, the optimal training protocol (especially
training time) significantly differs for networks of different depths. Thus, it is difficult to conduct
experiments on depth while controlling other factors.
Fig. 8.2 a Plots of the redundancy ratio as a function of the layer width of MLPs on both the
training set (blue) and the test set (red). b Plots of the test accuracies of K -means (blue), K -NN
(red), and logistic regression (LR, orange) as functions of the MLP layer width. c Plots of the average
stochastic activation diameter (D) as a function of the MLP layer width on MNIST. d Histograms
of D calculated on MNIST for an MLP of width 50 trained on MNIST for 10 epochs (blue) and 500
epochs (red), respectively. e Histograms of D calculated on MNIST (blue) and randomly generated
data of the same dimensions (red) for an MLP of width 50 trained on MNIST. The two red histograms
are identical. f Plots of the redundancy ratio (R) calculated on MNIST (“True data”, solid lines)
and randomly generated data (dotted lines) as functions of the training time for MLPs of width
40 (blue), 50 (red), and 60 (orange). The dotted lines represent networks trained on the unaltered
data and evaluated on random data. The models were trained 5 times on MNIST and 10 times on
CIFAR-10 with different random seeds. The darker lines represent the averages over all seeds, and
the shaded areas show the standard deviations
the redundancy ratio starts at relatively high values (almost 1 on both the training and
test sets of MNIST, around 0.1 on the training set of CIFAR-10, and approximately
0.04 on the CIFAR-10 test set). However, it steadily decreases to nearly 0 across all
cases as the layer width increases. (2) Categorization Accuracy: Initially, the cate-
gorization accuracy commences at relatively low values (about 25% on MNIST and
32-45% on CIFAR-10). However, as the width increases, the accuracy consistently
improves across all scenarios, reaching relatively high values (approximately 70%
for K -means on both datasets, exceeding 90% for K -NN and logistic regression on
MNIST, and approximately 50% for K -NN and logistic regression on CIFAR-10,
akin to the test accuracy on the raw data).
Measurement of model capacity in terms of the diameters of the linear
regions. We devise a new measure to assess model capacity, termed the average
stochastic activation diameter. This metric is computed through the following three
steps: (1) we randomly sample a direction from a uniform distribution; (2) the stochas-
tic diameter of a linear region is defined as the length of the longest line segment
intersecting the linear region along the sampled direction; and (3) the average stochas-
tic activation diameter is defined as the mean of these stochastic diameters across
all linear regions containing data. In essence, a smaller average stochastic activa-
tion diameter indicates a finer division of the input space into smaller linear regions,
enabling the representation of more intricate data structures. Therefore, this metric is
an effective measure of the model capacity. Correspondingly, we observe a negative
correlation between layer width and average stochastic activation diameter, as shown
in Fig. 8.2c.
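A minimal sketch of this three-step estimate is given below: it bisects along a random direction for the farthest points at which the activation pattern of the input is unchanged, assuming the region boundary lies within a search range `t_max`; averaging the returned lengths over data points then approximates the average stochastic activation diameter.

```python
import torch
import torch.nn as nn

def activation_pattern(net, x):
    """Concatenated 0-1 activation pattern of a sequential ReLU network at input x."""
    bits, h = [], x
    for layer in net:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            bits.append((h > 0).flatten())
    return torch.cat(bits)

def stochastic_diameter(net, x, t_max=10.0, tol=1e-4):
    """Length of the longest segment through x, along a random direction,
    on which the activation pattern (and hence the linear region) stays constant."""
    direction = torch.randn_like(x)
    direction = direction / direction.norm()
    base = activation_pattern(net, x)

    def extent(sign):
        lo, hi = 0.0, t_max                      # assumes the boundary lies within t_max
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            same = torch.equal(activation_pattern(net, x + sign * mid * direction), base)
            lo, hi = (mid, hi) if same else (lo, mid)
        return lo

    return extent(+1.0) + extent(-1.0)
```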
Similarly, Hanin and Rolnick (2019) introduced a concept termed the "typical distance from a random input to the boundary of its linear region". In contrast, our diameter conceptually represents the longest distance between any two points within a linear region: while Hanin and Rolnick's (2019) metric measures how far a random input lies from the boundary of its linear region, our diameter measures the maximum possible separation within the linear region itself. When the linear region is an ideal ball, Hanin and Rolnick's (2019) typical distance is at most the radius of the ball, which is half of our diameter. In practice, however, linear regions are often irregular, as illustrated in Fig. 1 of Hanin and Rolnick (2019), so their distances typically turn out to be considerably smaller than our diameter. The discrepancy between the two definitions can therefore be significant, depending on the level of irregularity: one may remain relatively constant while the other varies substantially. Consequently, Hanin and Rolnick's (2019) distance metric may provide a lower bound on the extent of a linear region, whereas our diameter serves as an upper bound.
We further investigate the encoding properties beyond those of the data-generating
distribution. To this end, we generate a set of examples following a uniform distribu-
tion across the unit ball centered at the origin. Additionally, we normalize the
original data such that each pixel falls within the range [0, 1], ensuring comparable
scales between the randomly generated and original data. Notably, we observe that
the redundancy ratio exceeds 0.8 on the randomly generated data, as illustrated in Fig.
8.2f. This observation indicates that the determinism property is no longer upheld,
implying that unique neural codes cannot effectively represent randomly generated
data. Consequently, the categorization property becomes less discernible. To account
for these findings, we propose the following hypothesis.
Hypothesis 8.1 The diameters of the linear regions within the support of the data distribution decrease as training progresses, while the diameters of the linear regions far away from the support change little.
We then collected the average stochastic activation diameters for each scenario,
as illustrated in Fig. 8.2d, e. We observe that the stochastic diameters are more
concentrated when the training time is longer; see Fig. 8.2d. Moreover, we observe the
interesting result that the stochastic diameters for the true data are more concentrated
at lower values than the stochastic diameters for the random data. Figure 8.2e shows
the histograms of the stochastic diameters. The histograms for other scenarios are
given in Sect. 8.5. These results fully support our hypothesis.
We then explore how training time affects the encoding properties. Our experiments
involved one-hidden-layer MLPs of varying widths trained on the MNIST dataset
and five-hidden-layer MLPs of different widths trained on CIFAR-10. In total, we
evaluated 810 models. To ensure the reliability of our findings, the experiments
were repeated five times for MNIST and ten times for CIFAR-10, amounting to a
comprehensive analysis of the influence of training time on encoding properties.
Throughout our experiments, we meticulously tracked the evolution of both the
redundancy ratio and categorization accuracy at every epoch for all scenarios, doc-
umented in Fig. 8.3. These plots serve as a comprehensive visualization of the rela-
tionship between encoding properties and training time. Notably, our observations
reveal a clear positive correlation: as the training time increases, we observe (1) a
steady decline in the redundancy ratio and (2) a consistent improvement in catego-
rization accuracy. This trend underscores the importance of training time in shaping
the encoding capabilities of our models, shedding light on the dynamics of learning
processes within neural networks.
We also make a noteworthy observation regarding the redundancy ratio of an
untrained MLP on the MNIST dataset, which is nearly zero, as depicted in Fig.
8.3c. To provide a deeper understanding, let’s investigate the rationale behind this
finding: Upon random initialization, a neural network partitions the input space into
numerous activation regions. If these regions are sufficiently small, each training
data point effectively has its own dedicated activation region. However, it’s essen-
tial to recognize that during this random initialization phase, the mapping from input
data to output predictions lacks any meaningful structure. Consequently, the network
may exhibit erratic behavior, producing vastly different predictions for data points
originating from neighboring activation regions. As a consequence, categorization
accuracy is notably poor, aligning with intuitive expectations. This phenomenon val-
idates the notion that determinism alone is insufficient for adequately characterizing
encoding properties. It also resonates with the concept of reservoir effects (Jaeger
2001; Maass et al. 2002), which emphasizes the dynamic and complex nature of
neural network behavior during the initialization phase.
It’s important to highlight a key distinction between our findings and those
reported in Hanin and Rolnick (2019). While their research suggests that the count
of linear regions increases with the training progress, our observations present a
different perspective. We assert that the encoding properties we’ve examined do
not extend beyond the training data distribution, regardless of any variations in
the count of linear regions. This assertion is supported by the data presented in Fig. 8.2f. What this implies is that while there may indeed be an increase in the count of linear regions, this phenomenon is localized to only a small fraction of the expansive input space. Consequently, even with this observed increase, there's no guarantee of a corresponding decrease in the redundancy ratio. This understanding highlights the intricate relationship between training dynamics and encoding properties within neural networks.
Fig. 8.3 a The redundancy ratio (R ratio) as a function of the training time on CIFAR-10. b The test accuracies of K -means (left), K -NN (middle), and logistic regression (right) as functions of the training time. The models are MLPs of width 10 (blue), 20 (red), 40 (orange), and 80 (green). c The R ratio as a function of the training time on MNIST. d The test accuracies of K -means (left), K -NN (middle), and logistic regression (right) as functions of the training time. The models are MLPs of width 40 (blue), 50 (red), and 60 (orange) on MNIST. The dotted lines represent networks trained on the unaltered data and evaluated on random data. All models obtained from MNIST were trained 5 times with different random seeds. The darker lines represent the averages over all seeds, and the shaded areas show the standard deviations
We now study the influence of sample size on the encoding properties of neural
networks. To conduct this study, we trained a total of 210 one-hidden-layer MLPs,
each with three different widths, and 480 five-hidden-layer MLPs, each with four
different widths. These models were trained on training samples of varying sizes ran-
domly sampled from the MNIST and CIFAR-10 datasets, respectively. It’s essential
to highlight our stringent experimental controls, ensuring that all irrelevant vari-
ables are meticulously managed to maintain experimental integrity (i.e. accurate and
repeatable results). Additionally, instead of tracking epochs, we monitor the num-
ber of iterations. This decision is informed by the understanding that the number
of iterations in one epoch scales proportionally with the sample size. To ensure the
reliability and repeatability of our findings, each experiment is rigorously repeated
five times for MNIST and ten times for CIFAR-10 datasets.
We conducted an extensive analysis by computing both the redundancy ratio
and categorization accuracy across all scenarios, visually detailed in Fig. 8.4. Our
findings unveil two significant trends: (1) The redundancy ratio, whether assessed on
the training or test sample, exhibits a notable pattern. Upon initialization, it begins
at a relatively high value, gradually diminishing to near-zero levels as the training
sample size increases. This observation validates the diminishing redundancy as the
model learns from a larger and more diverse dataset. (2) We observe a distinct positive
correlation between the sample size and the test accuracies of all three algorithms.
Specifically, as the sample size expands, the accuracy of K -means escalates from 20
to 40%, K -NN accuracy experiences a surge from 10 to 45%, and logistic regression
accuracy shows a pronounced increase from 15 to 45%. These trends elucidate the
profound impact of sample size on both redundancy ratio and classification accuracy
and a negative correlation between redundancy ratio and categorization accuracy,
providing useful insights into model performance under varying data conditions.
Surprisingly, we observe that the encoding properties on the test set also become
stronger as the training sample size increases. Our hypothesis is as follows. Intu-
itively, a larger training sample size supports the neural network in attaining a higher
expressive power, i.e., a finer linear partition in the input space. Meanwhile, a sample
of larger size requires a finer linear partition to yield the same redundancy ratio. Our
experiments show that the first effect is stronger than the second one. Thus, a larger
sample size can help reduce the redundancy ratio.
We next study how different layers impact the encoding properties. We conducted a
layerwise ablation study on the CIFAR-10 dataset based on five-hidden-layer MLPs,
in which every layer was of width 40.
We conducted an exhaustive analysis, meticulously calculating both the redun-
dancy ratio and categorization accuracy across every epoch, as illustrated in Fig. 8.5.
This thorough examination yielded the following insights:
(1) The redundancy ratio of the neural code formed by the initial layer consis-
tently remains close to zero, indicating a notable absence of redundancy. However,
despite this, the corresponding categorization accuracy remains relatively poor. This
observation suggests that while redundancy is minimized, the initial layer may not
capture sufficient discriminative information for effective categorization.
Fig. 8.4 a The redundancy ratios (R ratios) on the training set (dotted lines) and test set (solid
lines) of CIFAR-10 as functions of the sample size. b The test accuracies of K -means (left), K -NN
(middle), and logistic regression (right) as functions of the sample size. The models are MLPs of
width 10 (blue), 20 (red), 40 (orange), and 80 (green). c The R ratios on the training set (dotted
lines) and test set (solid lines) of MNIST as functions of the sample size. d The test accuracies of
K -means (left), K -NN (middle), and logistic regression (right) as functions of the sample size. The
models are MLPs of width 40 (blue), 50 (red), and 60 (orange) on MNIST. The models were trained
10 times on CIFAR-10 or 5 times on MNIST, with a different random seed each time. The darker
lines represent the averages over all seeds, and the shaded areas show the standard deviations
(2) As we progress to higher single layers, the redundancy ratio gradually increases, indicating that the data are progressively concentrated into fewer linear regions. Correspondingly, there is a noticeable improvement in categorization accuracy, suggesting that deeper layers encode more discriminative features.
(3) The impact of training time on the encoding properties of a single layer mirrors
that of the entire network, emphasizing the importance of adequate training time in
shaping encoding capabilities.
(4) As the neural code incorporates more layers, the redundancy ratio steadily
decreases, indicating a more efficient encoding of data across multiple layers.
(5) Categorization accuracy exhibits a gradual enhancement as the neural code
evolves from the first layer to encompass the entire network. This progression high-
lights the iterative refinement of features and representations throughout the network
architecture.
(6) The observed pattern in (5) is disrupted when forming the neural code in
reverse, starting from the last layer and progressing backward. This reversal suggests
that the hierarchical organization of features may play a crucial role in information
representation.
Fig. 8.5 a Plots of the redundancy ratios of the neural codes formed by different single layers of
MLPs trained on CIFAR-10 as functions of the training time. b The test accuracies of K -means (left),
K -NN (middle), and logistic regression (right) as functions of the training time. c The redundancy
ratios of the neural codes formed by multiple MLP layers as functions of the sample size. d The
test accuracies of K -means (left), K -NN (middle), and logistic regression (right) as functions of
the sample size. The models are MLPs of width 40 on CIFAR-10
(7) The categorization accuracy of the neural code formed solely by the last layer
is comparable to that of the entire network, indicating that the final layer encapsulates
critical discriminative features necessary for accurate classification.
Insight (2), particularly regarding the redundancy ratio, offers insight into the interplay between hashing properties and the generalizability of deep learning. The gradual concentration of data into fewer cells from the initial to the final layer reflects the network's ability to extract increasingly informative and discriminative features, contributing to its overall effectiveness and generalization for categorization. This understanding sheds light on the intricate relationship between encoding properties and the broader principles governing deep learning architectures.
Fig. 8.6 First row: Scatter plots of the redundancy ratios and test accuracies of K -means (blue), K -NN (violet), and logistic regression (LR, orange) for MLPs with widths ranging from 3 to 100 with batch normalization (y-axis) and without batch normalization (x-axis). A smaller ȳ − x̄ is preferred for the redundancy ratio, and a larger one is preferred for the test accuracy. In total, 115 models are represented in one scatter plot. Second row: Scatter plots of the redundancy ratios and test accuracies of K -means (blue), K -NN (violet), and LR (orange) for MLPs with widths ranging from 3 to 100 with gradient clipping (y-axis) and without gradient clipping (x-axis). A smaller ȳ − x̄ is preferred for the redundancy ratio, and a larger one is preferred for the test accuracy. In total, 115 models are represented in one scatter plot. Third row: Scatter plots of the redundancy ratios and test accuracies of K -means (blue), K -NN (violet), and LR (orange) for MLPs with widths ranging from 3 to 100 with weight decay (y-axis) and without weight decay (x-axis). A smaller ȳ − x̄ is preferred for the redundancy ratio, and a larger one is preferred for the test accuracy. In total, 115 models are represented in one scatter plot
We also generated random data in which every pixel was generated from the
uniform distribution U (0, 1). We then trained MLPs and convolutional neural net-
works (CNNs) on the generated data. Unfortunately, the training process did not
converge. We then added label noise to MNIST at different noise rates (0.1, 0.2,
0.3). The encoding properties still showed the same general trends, although they
become somewhat worse. Our results suggest that the structure of the input data can
influence the organization of the hash space (Table 8.3).
Fig. 8.7 a The redundancy ratios for MNIST with different levels of label noise as functions of the
layer width. b The test accuracies of K -means (left), K -NN (middle), and logistic regression (right)
with different levels of label noise as functions of the layer width. The models are MLPs trained
on MNIST at different label noise rates of 0 (blue), 0.1 (red), 0.2 (orange) and 0.3 (green). All
models were trained on MNIST with noisy labels for classification 5 times with different random
seeds. The darker lines represent the averages over all seeds, and the shaded areas show the standard
deviations
Table 8.3 Training accuracy and training loss of a one-hidden-layer MLP of width 100 on random
data
Epoch 0 100 300 500
Training acc (%) 10.92 11.24 11.24 11.24
Loss 230.56 230.13 230.13 230.13
Finally, we can define an activation hash phase chart that characterizes the space
formed by the redundancy ratio, categorization accuracy, model size, training time,
and sample size. Summarizing the relationships discovered above, the activation
hash phase chart is divided into three canonical regions, corresponding to the under-
expressive regime, the critically expressive regime, and the sufficiently expressive
regime. This chart can provide guidance in hyperparameter tuning, the design of
novel algorithms, and algorithm diagnosis. We note that the thresholds between the
three regimes are currently unknown. Exploring them is a promising direction for
future research.
8.4 Additional Experimental Implementation Details
Datasets. Our experiments are based on the MNIST dataset (LeCun et al. 1998) and
the CIFAR-10 dataset (Krizhevsky and Hinton 2009): (1) MNIST contains 60, 000
training examples and 10, 000 test examples from 10 classes. This dataset can be
downloaded at http://yann.lecun.com/exdb/mnist/. (2) CIFAR-10 consists of 50, 000
training images and 10, 000 test images that belong to 10 classes. CIFAR-10 can be
downloaded at https://www.cs.toronto.edu/∼kriz/cifar.html. The splits of the training
and test sets follow the official versions. All images are normalized such that every
pixel value is in the range [0, 1].
Training settings. (1) For MNIST, MLPs were trained with Adam for 2, 000
epochs with a batch size of 128 and a constant learning rate. The VGG, ResNet, ResNeXt, and DenseNet models were trained with Adam for 500 epochs with a batch
size of 128. The learning rate was initialized as 0.01 and decayed to 1/10 of the
previous value every 100 epochs. For all models, the hyperparameter β1 was set to
0.9, and the hyperparameter β2 was set to 0.999. (2) For CIFAR-10, MLPs with 5
hidden layers were trained with Adam for 200 epochs with a batch size of 64. The
learning rate was initialized as 0.01 and decayed to 1/10 of the previous value every
20 epochs. VGG and ResNet models were trained with stochastic gradient descent
(SGD) for 200 epochs with a batch size of 64. The learning rate was initialized as
0.01 and decayed to 1/10 of the previous value every 50 epochs. On MNIST, the
MLPs were trained five times with different random seeds. On CIFAR-10, the MLPs
were trained ten times with different random seeds.
Average stochastic diameter. We first trained MLPs with widths of {5, 10, 15, 20,
25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 90, 100} on MNIST. Then, we ran-
domly selected 600 examples from the test set and calculated the mean of their
corresponding stochastic diameters.
Network architectures. The details of the network architectures investigated in
Sect. 8.2 are shown in Tables 8.4 and 8.5.
Experimental design for K -means. The pipeline for the experiments on
K -means was designed as follows: (1) We set K equal to the number of classes.
(2) We ran K -means on the neural codes to obtain K clusters. (3) Every cluster was
assigned a label from {1, 2, . . . , 10}. Thus, 90 (cluster, label) pairs were obtained.
(4) For every (cluster, label) pair, we assigned the label to all data in the cluster and
calculated the accuracy. (5) We selected the highest accuracy as the accuracy of the
K -means algorithm.
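A minimal sketch of this evaluation with scikit-learn is shown below; for simplicity it scores each cluster by its majority label, which is a common simplification of the cluster-label matching procedure described above rather than a verbatim reimplementation of it.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_code_accuracy(codes, labels, n_classes=10, seed=0):
    """Cluster neural codes with K-means (K = number of classes) and score the
    clustering by assigning each cluster its majority label."""
    clusters = KMeans(n_clusters=n_classes, random_state=seed, n_init=10).fit_predict(codes)
    correct = 0
    for c in range(n_classes):
        members = labels[clusters == c]
        if members.size:
            correct += np.bincount(members, minlength=n_classes).max()
    return correct / labels.size
```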
Table 8.4 The detailed architectures of the neural networks trained on MNIST
VGG-18: 3 × 3, 32, stride 2; maxpool, 3 × 3; (3 × 3, 32) × 4; (3 × 3, 64) × 4; (3 × 3, 128) × 4; (3 × 3, 256) × 4
ResNet-18: 3 × 3, 32, stride 2; maxpool, 3 × 3; [3 × 3, 32; 3 × 3, 32] × 2; [3 × 3, 64; 3 × 3, 64] × 2; [3 × 3, 128; 3 × 3, 128] × 2; [3 × 3, 256; 3 × 3, 256] × 2
ResNet-34: 3 × 3, 32, stride 2; maxpool, 3 × 3; [3 × 3, 32; 3 × 3, 32] × 3; [3 × 3, 64; 3 × 3, 64] × 4; [3 × 3, 128; 3 × 3, 128] × 6; [3 × 3, 256; 3 × 3, 256] × 3
ResNeXt-26: 3 × 3, 32, stride 2; maxpool, 3 × 3; [1 × 1, 32; 3 × 3, 32, C = 8; 1 × 1, 64] × 2; [1 × 1, 64; 3 × 3, 64, C = 8; 1 × 1, 128] × 3; [1 × 1, 128; 3 × 3, 128, C = 8; 1 × 1, 256] × 3
DenseNet-28: 3 × 3, 6, stride 2; maxpool, 3 × 3; [1 × 1, 12; 3 × 3, 3] × 4; conv, 1 × 1; avgpool, 2 × 2; [1 × 1, 12; 3 × 3, 3] × 4; conv, 1 × 1; avgpool, 2 × 2; [1 × 1, 12; 3 × 3, 3] × 4
Table 8.5 The detailed architectures of the neural networks trained on CIFAR-10
VGG-19: (3 × 3, 32) × 2; maxpool, 2 × 2; (3 × 3, 128) × 2; maxpool, 2 × 2; (3 × 3, 256) × 4; maxpool, 2 × 2; (3 × 3, 512) × 4; maxpool, 2 × 2; (3 × 3, 512) × 4; maxpool, 2 × 2; fc-4096; fc-4096; fc-10, softmax
ResNet-18: 3 × 3, 64; [3 × 3, 64; 3 × 3, 64] × 2; [3 × 3, 128; 3 × 3, 128] × 2; [3 × 3, 256; 3 × 3, 256] × 2; [3 × 3, 512; 3 × 3, 512] × 2; avgpool; fc-10, softmax
ResNet-20: 3 × 3, 16; [3 × 3, 16; 3 × 3, 16] × 3; [3 × 3, 32; 3 × 3, 32] × 3; [3 × 3, 64; 3 × 3, 64] × 3; avgpool; fc-10, softmax
ResNet-32: 3 × 3, 16; [3 × 3, 16; 3 × 3, 16] × 5; [3 × 3, 32; 3 × 3, 32] × 5; [3 × 3, 64; 3 × 3, 64] × 5; avgpool; fc-10, softmax
Experiments on the relationship between the model size and the encoding
properties. We trained MLPs with widths of {3, 7, 10, 15, 20, 23, 27, 30, 33, 37, 40,
43, 47, 50, 53, 57, 60, 65, 70, 75, 80, 90, 100} on MNIST and {10, 20, 30, 40, 50,
60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190,200} on CIFAR-10.
Experiments concerning the relationship between the training process and
the encoding properties. (1) For MNIST, we trained MLPs with widths of {40, 50, 60}.
The redundancy ratios and test accuracies of K -means, K -NN, and logistic regres-
sion were calculated for the following training epochs: {1, 3, 6, 10, 30, 60, 100, 300,
600, 1000, 1200, 1500, 1800, 2000}. (2) For CIFAR-10, we trained MLPs with
widths of {10, 20, 40, 80}. The redundancy ratios and test accuracies of K -means,
K -NN, and logistic regression were calculated for the following training epochs:
{1, 3, 6, 10, 20, 30, 40, 60, 80, 100, 120, 140, 160, 180, 200}.
Experiments concerning the relationship between the sample size and the
encoding properties. (1) For MNIST, we trained MLPs with widths of {40, 50, 60} on training
samples with sizes of {10, 30, 60, 100, 300, 600, 1000, 2000, 3000, 6000, 10000,
20000, 30000, 60000} randomly drawn from the training set. (2) For CIFAR-10,
we trained MLPs with widths of {10, 20, 40, 80} on training samples with sizes
of {10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 40000} randomly
drawn from the training set.
Experiments concerning the relationship between regularization and the
encoding properties. Three regularizers were tested in our experiments (a configuration sketch follows the list):
• Batch normalization: Adding a batch normalization layer before every ReLU layer.
• Weight decay: Utilizing the L 2 weight regularizer with hyperparameter λ = 0.01.
• Gradient clipping: Setting the clip norm to 1.
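For concreteness, the sketch below wires these three regularizers into a standard PyTorch training step; the architecture, learning rate, and any hyperparameters other than those named above are placeholders, and weight decay is applied through the optimizer's L2 penalty.

```python
import torch
import torch.nn as nn

def make_mlp(width, depth=1, in_dim=784, n_classes=10, batch_norm=True):
    layers, d = [], in_dim
    for _ in range(depth):
        layers.append(nn.Linear(d, width))
        if batch_norm:
            layers.append(nn.BatchNorm1d(width))   # batch normalization before every ReLU
        layers.append(nn.ReLU())
        d = width
    layers.append(nn.Linear(d, n_classes))
    return nn.Sequential(*layers)

model = make_mlp(width=50)
# Weight decay: L2 regularization with lambda = 0.01 via the optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Gradient clipping: clip norm set to 1.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```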
Layerwise ablation study. We trained MLPs of width 40 on CIFAR-10. The train-
ing strategy was the same as that previously used for MLPs trained on
CIFAR-10.
Experiments concerning random data. All pixels of the random data were
individually generated from the uniform distribution U (0, 1). Each random example
had dimensions of 28 × 28, i.e., the same as the MNIST images.
Experiments concerning noisy labels. Specified numbers of training exam-
ples from MNIST were assigned random labels in accordance with label noise
ratios of 0.1, 0.2, and 0.3. Then, we trained one-hidden-layer MLPs with widths
of {3, 7, 10, 15, 20, 23, 27, 30, 33, 37, 40, 43, 47, 50, 53, 57, 60, 65, 70, 75, 80, 90,
100} on each noisy training set.
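The label-noise injection itself can be sketched as follows; labels selected for corruption are redrawn uniformly over the classes, which is one common reading of "assigned random labels" (a small fraction may retain their original label by chance).

```python
import numpy as np

def add_label_noise(labels, noise_rate, n_classes=10, seed=0):
    """Randomly reassign a `noise_rate` fraction of integer labels, as in the noisy-label experiments."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    idx = rng.choice(len(labels), size=int(noise_rate * len(labels)), replace=False)
    noisy[idx] = rng.integers(0, n_classes, size=len(idx))
    return noisy
```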
This section collects experimental results omitted from the main analysis section for
brevity. The following figure corresponds to the study of the diameters. Please refer
to Sect. 8.3.1.
References
Hanin, Boris, and David Rolnick. 2019. Deep relu networks have surprisingly few activation pat-
terns. In Advances in Neural Information Processing Systems, 361–370.
Jaeger, Herbert. 2001. The "echo state" approach to analysing and training recurrent neural networks - with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report 148 (34): 13.
Krizhevsky, Alex, and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images.
Technical report, Citeseer.
LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning
applied to document recognition. Proceedings of the IEEE 86 (11): 2278–2324.
Maass, Wolfgang, Thomas Natschläger, and Henry Markram. 2002. Real-time computing without
stable states: a new framework for neural computation based on perturbations. Neural Computa-
tion 14 (11): 2531–2560.
Nesterov, Yurii E. 1983. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, vol. 269, 543–547.
van der Maaten, Laurens, and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of
Machine Learning Research 9 (Nov): 2579–2605.
Chapter 9
Reflecting on the Role of
Overparameterization: Is it Solely
Harmful?
One might blame the overparameterized nature of deep learning for its lack of theo-
retical foundations. However, recent works have discovered that this overparameter-
ization also contributes to the success of deep learning. Notably, there is currently
no standard definition of “overparameterization”. In many cases, the definition is
quite restrictive, such as requiring an infinite network width. A promising direction
of future research is to relax the conditions of overparameterization.
This line of research has also led to research on benign overfitting in linear regres-
sion (Bartlett et al. 2020), ridge regression (Tsigler and Bartlett 2020), the large-
deviation regime (Chinot and Lerasle 2020), constant-step-size stochastic gradient
descent (SGD) for regression (Zou et al. 2021), and neural networks (Li et al. 2021).
Bartlett et al. (2020) characterized linear regression problems in which the minimum-
norm interpolating prediction rule has near-optimal prediction accuracy. The authors
demonstrated that overparameterization is crucial for benign overfitting: the number of directions in the parameter space must be substantially larger than the sample size. They also derived nearly matching lower and upper bounds for the risk of the minimum-norm interpolating estimator (Fig. 9.1 illustrates the closely related double descent phenomenon).
Fig. 9.1 Figure illustrating test loss and training loss versus parameter size for ResNet-18 on
CIFAR-10. See Nakkiran et al. (2021)
For two-layer convolutional neural networks (CNNs), Cao et al. (2022) characterized when overfitting is benign in terms of a signal-noise decomposition of the learned weights, as formalized in the following results.
Theorem 9.2 Suppose that $d = \tilde{\Omega}(n^8 m^4)$, where $n, m = \Omega(\mathrm{polylog}(d))$; the learning rate satisfies $\eta \le \tilde{O}\big(n^{-2/q}\min\{\|\mu\|_2^{-2},\ (4\sigma_p^2 d)^{-1}\}\big)$; and the initialization level $\sigma_0$ satisfies
$$\tilde{\Omega}(m d^{-1/2}) \cdot \min\{(\sigma_p\sqrt{d})^{-1}, \|\mu\|_2^{-1}\} \;\le\; \sigma_0 \;\le\; \tilde{O}(n^{-4} m^{-1}) \cdot \min\{(\sigma_p\sqrt{d})^{-1}, \|\mu\|_2^{-1}\}.$$
For any $\epsilon > 0$, if $n \cdot \|\mu\|_2^q = \tilde{\Omega}\big(\sigma_p^q (\sqrt{d})^q\big)$, then there exists a $T = \tilde{O}\big(\eta^{-1}\sigma_0^{-(q-2)}\|\mu\|_2^{-q} + \eta^{-1}\epsilon^{-1} n^4 \|\mu\|_2^{-2}\big)$ such that, with probability at least $1 - d^{-1}$: 1) the training loss converges to $\epsilon$, i.e., $\hat{R}_S(W^{(T)}) \le \epsilon$; 2) the CNN learns the signal: $\max_r \gamma_{j,r}^{(T)} \ge \tilde{\Omega}(n^{-1/q})$ for $j \in \{-1, +1\}$; and 3) the CNN does not memorize the noise in the training data: $\max_{j,r,i} \zeta_{j,r,i}^{(T)} = \tilde{O}(\sigma_0\sigma_p\sqrt{d})$ and $\max_{j,r,i} |\omega_{j,r,i}^{(T)}| = \tilde{O}(\sigma_0\sigma_p\sqrt{d})$. Here, $\|\mu\|_2$ and $\sigma_p$ denote the signal strength and the noise level, respectively, and $\sigma_0$ is the scale of the random initialization.
This theorem illustrates that if the condition $n \cdot \|\mu\|_2^q = \tilde{\Omega}\big(\sigma_p^q (\sqrt{d})^q\big)$ holds, then a two-layer CNN trained by gradient descent fits the training data, learns the signal, and does not memorize the noise.
Theorem 9.3 Suppose that $d = \tilde{\Omega}(n^8 m^4)$, where $n, m = \Omega(\mathrm{polylog}(d))$; the learning rate satisfies $\eta \le \tilde{O}\big(n^{-2/q}\min\{\|\mu\|_2^{-2},\ (4\sigma_p^2 d)^{-1}\}\big)$; and the initialization level $\sigma_0$ satisfies
$$\tilde{\Omega}(m d^{-1/2}) \cdot \min\{(\sigma_p\sqrt{d})^{-1}, \|\mu\|_2^{-1}\} \;\le\; \sigma_0 \;\le\; \tilde{O}(n^{-4} m^{-1}) \cdot \min\{(\sigma_p\sqrt{d})^{-1}, \|\mu\|_2^{-1}\}.$$
For any $\epsilon > 0$, if $\sigma_p^q (\sqrt{d})^q = \tilde{\Omega}\big(m \cdot \|\mu\|_2^q\big)$, then there exists a $T = \tilde{O}\big(\eta^{-1} m (\sigma_p\sqrt{d})^{-q}\sigma_0^{-(q-2)} + \eta^{-1}\epsilon^{-1} m n^4 d^{-1}\sigma_p^{-2}\big)$ such that, with probability at least $1 - d^{-1}$: 1) the training loss converges to $\epsilon$, i.e., $\hat{R}_S(W^{(T)}) \le \epsilon$; and 2) the CNN memorizes the noise in the training data: $\max_r \zeta_{y_i,r,i}^{(T)} \ge \tilde{\Omega}(n^{-1/q})$.
Lower and upper bounds on the population loss achieved by a CNN can be obtained based on the bounds on $\gamma_{j,r}^{(T)}$, $\zeta_{j,r,i}^{(T)}$, and $\omega_{j,r,i}^{(T)}$ in Theorems 9.2 and 9.3.
Theorem 9.4 Suppose that $d = \tilde{\Omega}(n^8 m^4)$, where $n, m = \Omega(\mathrm{polylog}(d))$; the learning rate satisfies $\eta \le \tilde{O}\big(n^{-2/q}\min\{\|\mu\|_2^{-2},\ (4\sigma_p^2 d)^{-1}\}\big)$; and the initialization level $\sigma_0$ satisfies
$$\tilde{\Omega}(m d^{-1/2}) \cdot \min\{(\sigma_p\sqrt{d})^{-1}, \|\mu\|_2^{-1}\} \;\le\; \sigma_0 \;\le\; \tilde{O}(n^{-4} m^{-1}) \cdot \min\{(\sigma_p\sqrt{d})^{-1}, \|\mu\|_2^{-1}\}.$$
For any $\epsilon > 0$: 1) if $n \cdot \|\mu\|_2^q = \tilde{\Omega}\big(\sigma_p^q (\sqrt{d})^q\big)$, then gradient descent will yield a CNN $\widetilde{W}$ such that $\hat{R}_S(\widetilde{W}) < \epsilon$ and the population risk $R(\widetilde{W})$ is correspondingly small; 2) conversely, if $\sigma_p^q (\sqrt{d})^q = \tilde{\Omega}\big(m \cdot \|\mu\|_2^q\big)$, gradient descent still yields a CNN with $\hat{R}_S(\widetilde{W}) < \epsilon$, but $R(\widetilde{W}) = \Theta(1)$. Here, $R(W) := \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell\big(y \cdot f(W, x)\big)\big]$ denotes the population risk.
This theorem further illustrates the phenomenon of phase transition that is revealed
in Theorems 9.2 and 9.3 with respect to the population loss.
Neal (1995, 1996), Williams (1996), and Hazan and Jaakkola (2015) laid the ground-
work by proving the equivalence between an infinite-width shallow neural network
and a Gaussian process. This fundamental insight provided a solid foundation for the study of infinitely wide networks through the neural tangent kernel (NTK), which, for a depth-$L$ network $F^{(L)}$ with parameters $\theta = (\theta_1, \ldots, \theta_P)$, is defined as
$$\Theta^{(L)}(\theta) = \sum_{p=1}^{P} \partial_{\theta_p} F^{(L)}(\theta) \otimes \partial_{\theta_p} F^{(L)}(\theta);$$
equivalently, its entries can be written as $\Theta_{ij}(x, x') = \sum_{\alpha} J_{i\alpha}(x) J_{j\alpha}(x')$, where $J_{i\alpha}(x) = \partial_{\theta_\alpha} z_i^{L}(x)$ is the Jacobian evaluated at a point $x$ for parameter $\alpha$ and $z_i^{L}(x)$ is the $i$-th output of the last output layer. Jacot et al. (2018) has shown that
the NTK converges to a deterministic kernel and remains constant over the course of
training. Subsequently, the authors proved that the expected outputs of an infinitely
wide network can be solved for by means of an ordinary differential equation (ODE)
as follows:
$$\mu_t(X_{\mathrm{train}}) = \big(\mathrm{Id} - e^{-\eta H_{\mathrm{train,\,train}}\, t}\big)\, Y_{\mathrm{train}}. \qquad (9.3)$$
Here, Htrain, train denotes the NTK between the training inputs. As the number of
steps t approaches infinity, Eq. 9.3 reduces to μt (X train ) = Ytrain . Equation 9.3 can
be further written as
$$\tilde{\mu}_t(X_{\mathrm{train}})_i = \big(\mathrm{Id} - e^{-\eta \lambda_i t}\big)\, \tilde{Y}_{\mathrm{train},\, i}, \qquad (9.4)$$
where the λi are the eigenvalues of Htrain, train and μ̃t (X train ) and Ỹtrain are the mean
predictions and labels, respectively. Given the ordered eigenvalues $\lambda_0 \ge \cdots \ge \lambda_m$,
Lee et al. (2019) assumed that the maximum feasible learning rate scales as η ∼ 2/λ0 ,
which has been empirically verified by Xiao et al. (2020). Plugging η ∼ 2/λ0 into Eq. 9.3, we find that the slowest mode, associated with λm, converges exponentially at a rate governed by κ = λ0/λm, i.e., the condition number. This shows that if the condition number of the NTK associated with
a neural network diverges, then the network will become untrainable. Therefore, we
can use κ as a metric for trainability. Investigation has shown that κ is inversely
related to the performance of an architecture. Therefore, we generally wish to find a
network with a smaller κ.
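The sketch below computes a finite-width empirical NTK, H = J Jᵀ, for a small scalar-output MLP and its condition number κ; this is only an approximation of the infinite-width kernel discussed above, and the network and inputs are toy placeholders (a model without batch normalization, so single inputs can be passed directly).

```python
import torch
import torch.nn as nn

def empirical_ntk(net, X):
    """Empirical NTK H = J J^T, where each row of J is the gradient of the
    scalar network output with respect to all parameters at one input."""
    params = [p for p in net.parameters() if p.requires_grad]
    rows = []
    for x in X:
        grads = torch.autograd.grad(net(x).sum(), params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)                      # shape: (num_inputs, num_parameters)
    return J @ J.T

# Toy example: a small MLP and random inputs.
net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
X_train = torch.randn(20, 10)
H = empirical_ntk(net, X_train)
eigvals = torch.linalg.eigvalsh(H)             # ascending eigenvalues
kappa = (eigvals[-1] / eigvals[0]).item()      # condition number lambda_0 / lambda_m
```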
Jacot et al. (2018) further proved that, for a least-squares regression problem, the model learned in the infinite-width regime is characterized by a linear differential equation during the training process. Relatedly, Du et al. (2019b) proved that, for a sufficiently wide one-hidden-layer ReLU network trained by gradient descent,
$$\mathbb{P}\Big[\|\hat{y}(t) - y\|_2^2 \le \exp(-\lambda_0 t)\,\|\hat{y}(0) - y\|_2^2\Big] \ge 1 - \delta,$$
where $\hat{y}(t)$ is the prediction at iteration $t$, $\lambda_0 > 0$ is a constant real number, and $0 < \delta < 1$ is a real number. This result relies on the following overparameterization condition: the number of hidden units must satisfy $n = \Omega\big(m^6 \lambda_0^{-4} \delta^{-3}\big)$. The authors also show that overparameterization can restrict the weights to be close to the random
initialization. In Chizat and Bach (2018), the optimization of a one-hidden-layer network was modelled as the minimization of a convex function of a measure, discretized into a mixture of particles, via continuous-time gradient descent. They proved that the gradient flow characterizing the optimization converges to a global minimizer.
Going beyond a single hidden layer, Allen-Zhu et al. (2019b), Du et al. (2019a),
and Zou et al. (2020) concurrently proved that SGD converges to a globally optimal
solution for an overparameterized deep neural network in polynomial time under
slightly different assumptions on the network architecture and training data. Chen
et al. (2019) proved that the convergence of optimization is guaranteed if the network
width is polylogarithmic in the sample size m and 1/ε, where ε is the target error. A sufficiently large training sample is needed to ensure that, with probability at least 1 − δ, SGD with a specific learning
rate achieves an error no larger than
$$\frac{8L^2R^2}{m} + \frac{8\log(2/\delta)}{m} + 24\,\epsilon_{\mathrm{NTRF}},$$
where $\epsilon_{\mathrm{NTRF}}$ is the minimum achievable training loss, L is the network depth, R is
the radius of the neural tangent random feature (NTRF) function class (Cao and Gu
2019) and n(δ, R, L) is network width.
Under the assumption that the nodes in the first layer all have linear functions while
the hidden nodes in the last layer are nonlinear, Andoni et al. (2014) studied two-
layer neural networks. These authors proved that if a sufficiently wide neural network
is trained with a generic gradient descent algorithm and all weights are randomly
initialized, it can learn any low-degree polynomial function. Brutzkus et al. (2018)
also gave generalization guarantees for two-layer neural networks under specific
assumptions. Arora et al. (2019) and Allen-Zhu et al. (2019a) proved that for neural
networks with two or three layers, the sample complexity is almost independent of
the parameter size. Arora et al. (2019) proved that any one-hidden-layer rectified
linear unit (ReLU) network trained via gradient descent under a 1-Lipschitz loss has
a generalization error of at most
$$\sqrt{\frac{2\, y^{\top} (H^{\infty})^{-1} y}{m}},$$
where
$$H^{\infty}_{ij} = \mathbb{E}_{w \sim \mathcal{N}(0, I)}\Big[x_i^{\top} x_j\, \mathbb{I}\{w^{\top} x_i \ge 0,\ w^{\top} x_j \ge 0\}\Big] = \frac{x_i^{\top} x_j\,\big(\pi - \arccos(x_i^{\top} x_j)\big)}{2\pi}.$$
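This bound is straightforward to evaluate numerically; the sketch below forms H∞ for unit-norm inputs and computes the quantity above for ±1 labels, adding a small ridge term for numerical stability (an implementation choice, not part of the original result).

```python
import numpy as np

def h_infinity(X):
    """The matrix H^infty for unit-norm input rows x_i, following the closed form above."""
    G = np.clip(X @ X.T, -1.0, 1.0)                 # inner products x_i^T x_j
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

def arora_bound(X, y):
    """Evaluate sqrt(2 y^T (H^infty)^{-1} y / m) for labels y in {-1, +1}."""
    m = len(y)
    H = h_infinity(X) + 1e-8 * np.eye(m)            # small ridge for numerical stability
    return np.sqrt(2.0 * y @ np.linalg.solve(H, y) / m)
```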
Chen et al. (2019) showed that generalization is guaranteed if the network width is
polylogarithmic in the sample size m and 1/ε, where ε is the target error. Cao and Gu
(2019) also proved the generalization bounds for wide and deep neural networks. Wei
et al. (2019) proved that regularizers can significantly influence the generalization
and optimization properties.
Influence of the network depth on generalizability. Summarizing various practical applications of neural networks, Canziani et al. (2016) suggested that deeper neural networks could have better generalizability. This understanding plays
a major part in the reconciliation between the overparameterization of neural net-
works and their excellent generalizability. In addition to the previous results, there
is another possible explanation that originates from information theory. Zhang et al.
(2018) adopted the techniques developed by Xu and Raginsky (2017) and proved a
generalization bound to characterize how the generalizability evolves as a network
becomes deeper. The expectation of the generalization error, E[R − R̂ S ], has the
following upper bound:
$$\exp\!\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^{2}}{m}\, I(S, W)},$$
where L is the network depth, 0 < η < 1 is a real constant, m is the sample size, and I(S, W) is the mutual information between the training sample S and the parameters W of the learned hypothesis.
References
Allen-Zhu, Zeyuan, Yuanzhi Li, and Yingyu Liang. 2019a. Learning and generalization in over-
parameterized neural networks, going beyond two layers. In Advances in Neural Information
Processing Systems, 6158–6169.
Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2019b. A convergence theory for deep learning
via over-parameterization. In International Conference on Machine Learning.
Andoni, Alexandr, Rina Panigrahy, Gregory Valiant, and Li Zhang. 2014. Learning polynomials
with neural networks. In International Conference on Machine Learning, 1908–1916.
Arora, Sanjeev, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. 2019. Fine-grained analysis of
optimization and generalization for overparameterized two-layer neural networks. arXiv preprint
arXiv:1901.08584.
Bartlett, Peter L, Philip M Long, Gábor Lugosi, and Alexander Tsigler. 2020. Benign overfitting in
linear regression. Proceedings of the National Academy of Sciences 117 (48): 30063–30070.
Belkin, Mikhail, Daniel J Hsu, and Partha Mitra. 2018a. Overfitting or perfect fitting? risk bounds
for classification and regression rules that interpolate. Advances in Neural Information Processing
Systems 31.
Belkin, Mikhail, Siyuan Ma, and Soumik Mandal. 2018b. To understand deep learning we need to
understand kernel learning. In International Conference on Machine Learning, 541–549.
Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. Reconciling modern machine-
learning practice and the classical bias-variance trade-off. Proceedings of the National Academy
of Sciences 116 (32): 15849–15854.
Belkin, Mikhail, Daniel Hsu, and Ji Xu. 2020. Two models of double descent for weak features.
SIAM Journal on Mathematics of Data Science 2 (4): 1167–1180.
Brutzkus, Alon, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. 2018. SGD learns over-
parameterized networks that provably generalize on linearly separable data. In International
Conference on Learning Representations.
Bui, Thang, Daniel Hernández-Lobato, Jose Hernandez-Lobato, Yingzhen Li, and Richard Turner.
2016. Deep Gaussian processes for regression using approximate expectation propagation. In
International Conference on Machine Learning, 1472–1481.
Canziani, Alfredo, Adam Paszke, and Eugenio Culurciello. 2016. An analysis of deep neural network
models for practical applications. arXiv preprint arXiv:1605.07678.
Cao, Yuan, and Quanquan Gu. 2019. Generalization bounds of stochastic gradient descent for wide
and deep neural networks. In Advances in Neural Information Processing Systems, 10836–10846.
Cao, Yuan, Zixiang Chen, Mikhail Belkin, and Quanquan Gu. 2022. Benign overfitting in two-layer
convolutional neural networks. arXiv preprint arXiv:2202.06526.
Chen, Zixiang, Yuan Cao, Difan Zou, and Quanquan Gu. 2019. How much over-parameterization
is sufficient to learn deep ReLU networks? arXiv preprint arXiv:1911.12360.
Chinot, Geoffrey, and Matthieu Lerasle. 2020. Benign overfitting in the large deviation regime.
arXiv preprint arXiv:2003.05838, 1(5).
Chizat, Lenaic, and Francis Bach. 2018. On the global convergence of gradient descent for over-
parameterized models using optimal transport. In Advances in Neural Information Processing
Systems, 3036–3046.
Choromanska, Anna, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. 2015.
The loss surfaces of multilayer networks. In International Conference on Artificial Intelligence
and Statistics.
Damianou, Andreas, and Neil D Lawrence. 2013. Deep Gaussian processes. In Artificial Intelligence
and Statistics, 207–215.
Du, Simon, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. 2019a. Gradient descent finds
global minima of deep neural networks. In International Conference on Machine Learning,
1675–1685.
Du, Simon S., Xiyu Zhai, Barnabas Poczos, and Aarti Singh. 2019b. Gradient descent prov-
ably optimizes over-parameterized neural networks. In International Conference on Learning
Representations.
Duvenaud, David, Oren Rippel, Ryan Adams, and Zoubin Ghahramani. 2014. Avoiding pathologies
in very deep networks. In Artificial Intelligence and Statistics, 202–210.
Hastie, Trevor, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. 2019. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560.
Hazan, Tamir, and Tommi Jaakkola. 2015. Steps toward deep kernel methods from infinite neural
networks. arXiv preprint arXiv:1508.05133.
Hensman, James, and Neil D Lawrence. 2014. Nested variational compression in deep Gaussian
processes. arXiv preprint arXiv:1412.1370.
Jacot, Arthur, Franck Gabriel, and Clément Hongler. 2018. Neural tangent kernel: convergence
and generalization in neural networks. In Advances in Neural Information Processing Systems,
8571–8580.
Lawrence, Neil D, and Andrew J Moore. 2007. Hierarchical Gaussian process latent variable models.
In International Conference on Machine Learning, 481–488.
Lee, Jaehoon, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-
Dickstein, and Jeffrey Pennington. 2019. Wide neural networks of any depth evolve as linear
models under gradient descent. In Advances in Neural Information Processing Systems, 8572–
8583.
Lee, Jaehoon, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha
Sohl-Dickstein. 2018. Deep neural networks as Gaussian processes. In International Conference
on Learning Representations.
Li, Dawei, Tian Ding, and Ruoyu Sun. 2018. Over-parameterized deep neural networks have no
strict local minima for any continuous activations. arXiv preprint arXiv:1812.11039.
Li, Yaopeng, Ming Jia, Xu Han, and Xue-Song Bai. 2021. Towards a comprehensive optimization of
engine efficiency and emissions by coupling artificial neural network (ann) with genetic algorithm
(ga). Energy 225: 120331.
Li, Yuanzhi, and Yingyu Liang. 2018. Learning overparameterized neural networks via stochastic
gradient descent on structured data. Advances in Neural Information Processing Systems 31:
8157–8166.
Liang, Tengyuan, and Alexander Rakhlin. 2020. Just interpolate: Kernel “ridgeless” regression can
generalize. The Annals of Statistics 48 (3): 1329–1347.
Muthukumar, Vidya, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and
Anant Sahai. 2020. Classification vs regression in overparameterized regimes: does the loss
function matter? arXiv preprint arXiv:2005.08054.
Nakkiran, Preetum, Behnam Neyshabur, and Hanie Sedghi. 2020. The deep bootstrap framework:
Good online learners are good offline generalizers. arXiv preprint arXiv:2010.08127.
Nakkiran, Preetum, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. 2021.
Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics:
Theory and Experiment 2021 (12): 124003.
Neal, Radford M. 1995. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto.
Neal, Radford M. 1996. Priors for infinite networks. In Bayesian Learning for Neural Networks,
29–53. Springer.
Nguyen, Quynh, Mahesh Chandra Mukkamala, and Matthias Hein. 2019. On the loss landscape of a
class of deep neural networks with no bad local valleys. In International Conference on Learning
Representations.
Nguyen, Quynh. 2019. On connected sublevel sets in deep learning. In International Conference
on Machine Learning.
Tsigler, Alexander, and Peter L Bartlett. 2020. Benign overfitting in ridge regression. arXiv preprint
arXiv:2009.14286.
Wei, Colin, Jason D Lee, Qiang Liu, and Tengyu Ma. 2019. Regularization matters: generalization
and optimization of neural nets vs their induced kernel. In Advances in Neural Information
Processing Systems, 9712–9724.
Williams, Christopher. 1996. Computing with infinite networks. Advances in Neural Information
Processing Systems 9: 295–301.
Xiao, Lechao, Jeffrey Pennington, and Samuel Schoenholz. 2020. Disentangling trainability and
generalization in deep neural networks. In International Conference on Machine Learning,
10462–10472.
Xu, Aolin, and Maxim Raginsky. 2017. Information-theoretic analysis of generalization capability
of learning algorithms. In Advances in Neural Information Processing Systems, 2524–2533.
Zhang, Jingwei, Tongliang Liu, and Dacheng Tao. 2018. An information-theoretic view for deep
learning. arXiv preprint arXiv:1804.09060.
Zou, Difan, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, and Sham Kakade. 2021. Benign
overfitting of constant-stepsize sgd for linear regression. In Conference on Learning Theory,
4633–4635.
Zou, Difan, Yuan Cao, Dongruo Zhou, and Quanquan Gu. 2020. Gradient descent optimizes over-
parameterized deep ReLU networks. Machine Learning 109 (3): 467–492.
Chapter 10
Theoretical Foundations for Specific
Architectures
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025 207
F. He and D. Tao, Foundations of Deep Learning, Machine Learning: Foundations,
Methodologies, and Applications, https://doi.org/10.1007/978-981-16-8233-9_10
208 10 Theoretical Foundations for Specific Architectures
population counterparts. Lin and Zhang (2019) studied influence of the sparsity and
permutation of convolutional layers on the spectral norm and generalizability. Other
works have also studied the generalizability of CNNs with residual networks (He
et al. 2016), including He et al. (2020) and Chen et al. (2019a).
Recurrent neural networks. Recurrent neural networks (RNNs) possess a recur-
rent structure and shows its promising performance in the processing and analysis
of sequential data. They have been applied to many real-world scenarios, including
NLP (Bahdanau et al. 2015; Sutskever et al. 2014) and speech recognition (Graves
et al. 2006, 2013). Chen et al. (2019b) developed a generalization bound for RNNs
based on works by Neyshabur et al. (2017) and Bartlett et al. (2017). Allen-Zhu and
Li (2019) also analysed the generalizability of RNNs.
Based on the Fisher-Rao norm, Tu et al. (2020) developed a gradient measure and
then derived a generalization bound as follows.
Theorem 10.1 Let us fix a margin parameter α; then, for any δ > 0, with prob-
ability at least 1 − δ, the following holds for every RNN whose weight matrices
θ = (U, W, V ) satisfy ||V T ||1 ≤ βV , ||W T ||1 ≤ βW , ||U T ||1 ≤ βU and ||θ || f s ≤ r :
4k r ||X || F 1 1
E[IM yL (x,y)≤0 ] ≤ + βV βU ||X T ||1
α 2m λmin (E(X X T )) m
1 log 2δ
+ IM yL (xi ,yi )≤a + 3 . (10.1)
m 2m
Recently, equivariant neural networks have shown its great successes in various areas,
including 3D point cloud processing (Li et al. 2018), chemistry (Faber et al. 2016),
and astronomy (Ntampaka et al. 2016; Ravanbakhsh et al. 2016). The intuition is that
when the prior symmetry in the data can be maintained, networks can achieve better
performance. As an example, objects in images of different classes can be oblique,
while all rotated versions of the same image are expected to belong to the same class.
Hence, there is some prior symmetry in such data, and if a model can preserve this
symmetry, it will enjoy better performance.
This section is organized as follows. First, there are two main ways to design a new
equivariant network. One way is to modify a traditional network; as an example, we
consider group CNNs, which are impressive traditional models of this type. The other
way is to limit the layers such that they always remain equivariant. Subsequently,
10.2 Equivariant Neural Networks 209
An equivariant neural network was generalized and developed by Cohen and Welling
(2016a), who noted the following shift equivariance in CNNs:
Tgh = Tg Th . (10.3)
For a regular representation L such that L h f (g) = f (h −1 g), it can be proven that
Despite the application of convolution, the equivariance of the nonlinearity and some
operations should be maintained. For any pointwise nonlinearity Cv , it can be verified
that
Cv L h f = v ◦ f ◦ h −1 = (v ◦ f ) ◦ h −1 = L h (v ◦ f ) = L h Cv f. (10.6)
In Cohen and Welling (2016b), the authors showed another way to achieve equivari-
ance, that is, constraining the filter such that the corresponding layer is equivariant.
A subgroup H of G is often considered first; for any h ∈ H , the filter satisfies
for some linear representation ψ of H acting on the output of the layer and ρ acting
on the input of the layer. Note that if the constraint is linear, then the solution space
will be linear. The space of all solutions is denoted by Hom H (ρ, ψ) because
an equivariant map is a “homomorphism of group representations”. Equivariant
maps are also called intertwiners (Serre et al. 1977). Before training, a basis
ψ1 , . . . , ψm of the solution space can be calculated, and all admitted weights can be
represented as linear combinations of this basis of the form = i αi ψi . For any
G, there are many choices of different linear representations, and thus, a steerable
neural network is flexible and efficient. By training the coefficients of the linear
combination, steerable neural networks can achieve excellent performance in various
tasks; examples include graph neural networks (Maron et al. 2018), E 2 steerable
CNNs (Weiler and Cesa 2019), and rotation-equivariant steerable CNNs (Weiler
et al. 2018b). In the following, we will present some theoretical implementations of
steerable neural networks.
A linear representation of G can be defined as an induced representation of the
linear representation H . The following example comes from Cohen and Welling
(2016b). The correlation f can be computed as
where Wk satisfies
ψ k ◦ Wk = Wk ◦ ψ k−1 (10.15)
ψk ◦ σ = σ ◦ ψk. (10.16)
One can verify that networks satisfying these two constraints are equivariant. We first
assume that σ is a pointwise nonlinearity. Constraint (10.16) will be discussed further
in the following section. We can prove that the image of the linear representation
satisfying constraint (10.16) must be a set of generalized permutation matrices, where
each ψg is a permutation matrix but the value of each nonzero entry need not be 1.
In addition, the network can be expressed as
F(x) = W L−1 σ (. . . σ (W
L σ (W 1 x̃))), (10.17)
L = [ψ L−1 W L . . . ψ L−1 W L ],
where W g 1 g t
212 10 Theoretical Foundations for Specific Architectures
⎡ ⎤ ⎡ k ⎤
ψg01 ψg g−1 Wk . . . ψgk g−1 Wk
⎢ψ k 1 W . . . ψgk g−1 Wk ⎥
1 1 t
⎢ψg0 ⎥ ⎢ ⎥
⎢ 2⎥ k = ⎢ g2 g1.
−1 k
⎥,
x̃ = ⎢ . ⎥ x, and W
2 t
⎣ .. ⎦ ⎢ .. .. .. ⎥
⎣ . . ⎦
ψgt0
ψg g−1 Wk
k
. . . ψgk g−1 Wk
t 1 t t
with any k < L and any arbitrary Wk . Applications of steerable neural networks have
also been reported. For example, Maron et al. (2018) considered equivariant graph
neural networks and solved the solution space HomG (ρ, ψ).
and in turn,
1
dimHomG (ρ, ψ) = tr(P) = χρ (g), (10.20)
|G| g∈G
where χρ (g) = tr(ρg ) is the character. Equation (10.18) implies that χρ (g) =
χρ T ⊕ψ −1 (g) = χρ (g)χψ (g −1 ). Finally, we know that
1
dimHomG (ρ, ψ) = χρ (g)χψ (g −1 ) = χρ , χψ . (10.21)
|G| g∈G
n
dimHomG (ρ, ψ) = χρ , χψ = m i m i . (10.22)
i=1
From this result, we know that the dimensionality of HomG (ρ, ψ) can be modified to
any desired value. In addition, in practice, we need only to compute HomG ψi , ψ j for
all irreducible representations ψi and ψ j . However, the linear representation should
commute with the nonlinearity, and thus, the choice of the invertible matrix P is also
important. We discuss this topic in the following section.
In this section, we present two results to show the benefits of equivariant neural
networks. First, however, let us begin with some generalization bounds. Intuitively,
equivariant neural networks are relatively constrained in comparison to all possible
neural networks. Specifically, the hypothesis set is smaller, implying better gen-
eralization. Some works have attempted to provide new generalization bounds for
equivariant neural networks, including Kondor and Trivedi (2018), Sokolic et al.
(2017), Sannai et al. (2021). However, generalization bounds cover only the worst
case, and the results of the above works do not explicitly consider the benefits of
equivariant models. To address this gap, we introduce the benefits of equivariant
neural networks (Elesedy and Zaidi 2021; Lyle et al. 2020).
Some works have attempted to present generalization bounds for equivariant net-
works, such as Kondor and Trivedi (2018), Sannai et al. (2021). Here, we discuss
the result in Sannai et al. (2021), which is the newest generalization bound. We first
introduce the generalization bounds for invariant neural networks and equivariant
neural networks, and we then share the method used to obtain them. Let R( f ) =
m
E(x,y)∼D [l( f (x), y)] be the expected risk, and let R̂ S ( f ) = m1 i=1 l( f (xi ), yi ) be
the empirical risk. Then, for any invariant neural network f uniformly bounded by
1, the following inequality holds with probability at least 1 − 2:
C 2 log(1/2)
R( f ) − R̂ S ( f ) ≤ + . (10.23)
|G|m 2/n m
can be defined as φG (x) = {gx : g ∈ G}, and the hypothesis set { f : [0, 1]n →
R M such that f is equivariant} can be covered by { f : [0, 1]n /G → R M }. The cov-
ering number of [0, 1]n /G can be bounded by C/|G| n . The results above can be
proven in accordance with the Rademacher complexity bound.
In addition, note that |G| is sometimes not independent of n. For example, when
G = Sn , the generalization bound becomes
C 2 log(1/2)
+ . (10.25)
n!m 2/n m
Beyond traditional generalization bounds, some works have focused on the bene-
fits of equivariant networks. In these works, the authors have shown the benefits
of equivariant maps but not those of equivariant networks in practice. This means
that these analyses offer no suggestions regarding how to design a new equivariant
neural network. Thus, the theoretical analysis of equivariant neural networks is still
immature.
Lyle et al. (2020) compared data augmentation and feature averaging for invariant
targets. For a given dataset {(xi , yi )}i=1
m
and a linear representation ρ transforming
the data, suppose that the data have an invariant distribution, that is, PD (x, y) =
PD (ρg x, y) for all x, y, and g ∈ G. For data augmentation, the empirical risk is
considered to be
m m n
1 1
R̂ S ( f ) = Eg∼λl( f (gxi ), yi ) ≈ l( f (g j xi ), yi ), (10.26)
m i=1
mn i=1 j=1
Qf = ψg−1 ◦ f ◦ ρg (10.27)
g∈G
for compact groups and those with a Haar measure λ over G. Note that
−1
Q f ◦ ρh = ψh ψgh ◦ f ◦ ρgh dλ(g) = ψh ◦ Q f, (10.29)
and thus, Q f is equivariant. Furthermore, Q is a projection map from the space of all
functions to the space of all equivariant functions, and Q2 = Q. When we consider
the expected loss
In particular, when f is not equivariant, the gap satisfies R( f ) − R(Q f ) > 0. Finally,
if f is the best predictor with the least expected risk, then f must be equivariant.
If not, then Q f is a better predictor. However, a problem arises in this case: Q f
may not be a neural network of the same width. If we use the predictor Q f , then
the process will require more calculation, which is not desirable. In practice, an
equivariant neural network is layerwise equivariant, and the linear representations in
each of the layers are chosen prior to training. Thus, a real equivariant neural network
has additional constraints.
two-layer neural networks have uniform approximation properties. That is, for a
given compact set C, any continuous function f supported on C can be uniformly
approximated by a two-layer neural network F with ReLU nonlinearity. Any equiv-
ariant map f defined on C can be extended to be defined on C̃ = {gx : x ∈ C} as
f˜(gx) = g f (x) for any x ∈ C. Hence, without loss of generality, we can assume
that C = C̃.
Now, there are two claims to be proven. On the one hand, when F is a two-layer
universal approximator such that f (x) − F(x) < for any x ∈ C, then QF is
also, and QF can be modified into a two-layer neural network. Note that
1
QF(x) − f (x) = g −1 [F(gx) − f (gx)] (10.31)
|G| g∈G
1
≤ g −1 [F(x) − f (x)] (10.32)
|G| g∈G
≤ sup g , (10.33)
g∈G
which implies that QF is also a uniform approximator. On the other hand, for a
two-layer neural network F(x) = W2 σ W1 · x, QF can be written as
⎡ ⎤
W1 g1
⎢ ⎥
g1 −1 W2 . . . gt −1 W2 σ ⎣ ... ⎦ x,
1
QF(x) = (10.34)
|G|
W1 gt
References
Allen-Zhu, Zeyuan, and Yuanzhi Li. 2019. Can SGD learn recurrent neural networks with provable
generalization? In Advances in Neural Information Processing Systems, 10331–10341.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In International Conference on Learning Representations.
Bartlett, Peter L, Dylan J Foster, and Matus J Telgarsky. 2017. Spectrally-normalized margin bounds
for neural networks. In Advances in Neural Information Processing Systems, 6240–6249.
Chen, Hao, Zhanfeng Mo, Zhouwang Yang, and Xiao Wang. 2019a. Theoretical investigation
of generalization bound for residual networks. In International Joint Conference on Artificial
Intelligence, 2081–2087.
Chen, Minshuo, Xingguo Li, and Tuo Zhao. 2019b. On generalization bounds of a family of recurrent
neural networks. arXiv preprint arXiv:1910.12947.
Cohen, Taco, and Max Welling. 2016a. Group equivariant convolutional networks. In International
Conference on Machine Learning, 2990–2999.
Cohen, Taco S, and Max Welling. 2016b. Steerable cnns. arXiv preprint arXiv:1612.08498.
218 10 Theoretical Foundations for Specific Architectures
Du, Simon S, Yining Wang, Xiyu Zhai, Sivaraman Balakrishnan, Russ R Salakhutdinov, and Aarti
Singh. 2018. How many samples are needed to estimate a convolutional neural network? In
Advances in Neural Information Processing Systems, 373–383.
Elesedy, Bryn, and Sheheryar Zaidi. 2021. Provably strict generalisation benefit for equivariant
models. In International Conference on Machine Learning, 2959–2969.
Faber, Felix A, Alexander Lindmaa, O Anatole Von Lilienfeld, and Rickard Armiento. 2016.
Machine learning energies of 2 million elpasolite (a b c 2 d 6) crystals. Physical Review Letters
117 (13): 135502.
Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Con-
volutional sequence to sequence learning. In International Conference on Machine Learning,
1243–1252.
Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep
recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal
Processing, 6645–6649.
Graves, Alex, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist
temporal classification: labelling unsegmented sequence data with recurrent neural networks. In
International Conference on Machine Learning, 369–376.
He, Fengxiang, Tongliang Liu, and Dacheng Tao. 2020. Why resnet works? residuals generalize.
IEEE Transactions on Neural Networks and Learning Systems.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image
recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
Kondor, Risi, and Shubhendu Trivedi. 2018. On the generalization of equivariance and convolution
in neural networks to the action of compact groups. In International Conference on Machine
Learning, 2747–2755.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–
1105.
Li, Yangyan, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. 2018. Pointcnn:
convolution on x-transformed points. In Advances in Neural Information Processing Systems,
820–830.
Lin, Shan, and Jingwei Zhang. 2019. Generalization bounds for convolutional neural networks.
arXiv preprint arXiv:1910.01487.
Lyle, Clare, Mark van der Wilk, Marta Kwiatkowska, Yarin Gal, and Benjamin Bloem-Reddy. 2020.
On the benefits of invariance in neural networks. arXiv preprint arXiv:2005.00178.
Maron, Haggai, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. 2018. Invariant and equivariant
graph networks. arXiv preprint arXiv:1812.09902.
Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing
lstm language models. In International Conference on Learning Representations.
Neyshabur, Behnam, Srinadh Bhojanapalli, and Nathan Srebro. 2017. A PAC-Bayesian approach
to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564.
Ntampaka, Michelle, Hy Trac, Dougal J Sutherland, Sebastian Fromenteau, Barnabás Póczos, and
Jeff Schneider. 2016. Dynamical mass measurements of contaminated galaxy clusters using
machine learning. The Astrophysical Journal 831 (2): 135.
Peters, Matthew E, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Annual Conference
of the North American Chapter of the Association for Computational Linguistics, 2227–2237.
Ravanbakhsh, Siamak, Junier Oliva, Sebastian Fromenteau, Layne Price, Shirley Ho, Jeff Schnei-
der, and Barnabás Póczos. 2016. Estimating cosmological parameters from the dark matter
distribution. In International Conference on Machine Learning, 2407–2416.
Ravanbakhsh, Siamak. 2020. Universal equivariant multilayer perceptrons. In International
Conference on Machine Learning.
Reeder, Mark. 2014. Notes on group theory.
References 219
Sabour, Sara, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules.
Advances in Neural Information Processing Systems, 30.
Sannai, Akiyoshi, Masaaki Imaizumi, and Makoto Kawano. 2021. Improved generalization bounds
of group invariant/equivariant deep networks via quotient feature spaces. In Uncertainty in
Artificial Intelligence, 771–780.
Serre, Jean-Pierre, et al. 1977. Linear Representations of Finite Groups, vol. 42. Springer.
Silver, David, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess-
che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. 2016.
Mastering the game of go with deep neural networks and tree search. Nature 529 (7587): 484–489.
Sokolic, Jure, Raja Giryes, Guillermo Sapiro, and Miguel Rodrigues. 2017. Generalization error of
invariant classifiers. In Artificial Intelligence and Statistics, 1094–1103.
Sutskever, Ilya, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural
networks. In Advances in Neural Information Processing Systems, 3104–3112.
Tu, Zhuozhuo, Fengxiang He, and Dacheng Tao. 2020. Understanding generalization in recurrent
neural networks. In International Conference on Learning Representations.
Weiler, Maurice, and Gabriele Cesa. 2019. General e (2)-equivariant steerable cnns. Advances in
Neural Information Processing Systems, 32.
Weiler, Maurice, Mario Geiger, Max Welling, Wouter Boomsma, and Taco S Cohen. 2018a. 3d
steerable cnns: learning rotationally equivariant features in volumetric data. Advances in Neural
Information Processing Systems, 31.
Weiler, Maurice, Fred A Hamprecht, and Martin Storath. 2018b. Learning steerable filters for
rotation equivariant cnns. In IEEE Conference on Computer Vision and Pattern Recognition,
849–858.
Worrall, Daniel E, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. 2017.
Harmonic networks: deep translation and rotation equivariance. In IEEE Conference on Computer
Vision and Pattern Recognition, 5028–5037.
Yu, Adams Wei, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi,
and Quoc V Le. 2018. Qanet: combining local convolution with global self-attention for reading
comprehension. In International Conference on Learning Representations.
Zhou, Pan, and Jiashi Feng. 2018. Understanding generalization and optimization performance of
deep cnns. In International Conference on Machine Learning, 5960–5969.
Part III
Deep Learning Theory form
the Trust-worthy Facet
Chapter 11
Privacy Preservation
where B is an arbitrary subset of the hypothesis space and (S, S ) is a pair of neigh-
bouring sample sets in which S and S differ by only one example. The left-hand
side of Eq. (11.1) is also called the privacy loss.
In Eq. (11.1), a division operation is employed to measure changes in the out-
put hypothesis. In recent years, numerous variants of the privacy-preserving abil-
ity metrics have been devised by tweaking this division operation or introducing
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025 223
F. He and D. Tao, Foundations of Deep Learning, Machine Learning: Foundations,
Methodologies, and Applications, https://doi.org/10.1007/978-981-16-8233-9_11
224 11 Privacy Preservation
and
5m 2
ε + 1/m > 1 − 3e−mε ,
2
P Diff R < 4V̂S (A(S))ε̂ +
m−1
where
1 2
V̂S (A(S)) = (A(S), z i ) − A(S), z j .
2m(m − 1) i= j
To date, the tightest generalization bound by far has been given by He et al. (2020)
as follows:
e−ε δ 2
P R̂ S (A(S)) − R(A(S)) < 9ε > 1 − ln . (11.2)
ε ε
This generalization bound was derived in three stages. The authors first proved an
on-average generalization bound for any (ε, δ)-differentially private multidatabase
learning algorithm
à : S → H × {1, . . . , k}
as follows:
E E R̂ h − E R h A(S) ≤ e−ε kδ + 1 − e−ε , (11.3)
S∼D m A(S)
SiA(S) A(S)
A(S)
R(A(S)) − R̂ S (A(S)),
where A(S) is the hypothesis learned by algorithm A on the training sample set S,
R(A(S)) is the expected risk, and R̂ S (A(S)) is the empirical risk.
To derive this bound, we employ a new on-average generalization bound for (ε, δ)-
differentially private multi-database learning algorithms. Moreover, this bound can
also be deduced from the proven high-probability generalization bound, indicating
that differentially private machine learning algorithms are likely to be approximately
correct (PAC)-learnable. These findings indicate a notable connection between an
algorithm’s privacy-preserving capabilities and its ability to generalize effectively.
In essence, algorithms exhibiting strong privacy preservation also tend to demon-
strate robust generalizability. Consequently, there exists an opportunity to enhance
the generalizability of learning algorithms by bolstering their privacy-preserving
mechanisms.
We then investigate the impact of the iterative nature inherent in learning algo-
rithms on both their privacy-preserving capabilities and generalizability. Typically,
the privacy-preserving effectiveness of an iterative algorithm diminishes over the
course of training. This decline occurs due to the accumulation of leaked informa-
tion as the algorithm iteratively processes data. To investigate this phenomenon, we
further derived three composition theorems that measure the differential privacy of
any iterative algorithm by assessing the privacy of each individual iteration. By inter-
twining these theorems with the established correlation between generalizability and
privacy preservation, we gained insights into the generalizability of iterative learn-
ing algorithms. These composition theorems provide a framework for understanding
how the iterative nature of algorithms influences both their privacy and their ability
to generalize effectively.
These results provide an insight of the relationship between generalizability and
privacy-preserving ability in iterative learning algorithms.
Existing works (Dwork et al. 2015; Nissim and Stemmer 2015; Oneto et al. 2017)
have already proved some high-probability generalization bounds in the following
form,
P R(A(S)) − R̂ S (A(S)) > a < b,
where a and b are two positive constant real numbers. The high-probability bound
provided in this section is strictly tighter than the current best results proved in
Nissim and Stemmer (2015) from two aspects: (1) our bound improves the term
a from 13ε to 9ε; and (2) our bound improves the term b from 2δε−1 log (2/ε) to
2e−ε δε−1 log (2ε). Besides, based on this bound we further derived a PAC-learnable
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 227
guarantee for differentially private machine learning algorithms. Nissim and Stem-
mer have proved an on-average multi-database generalization bound, which is looser
than ours by a factor of eε . Such improvements are significant in practice because ε
can always be as large as 10 (Abadi et al. 2016). Besides, the bounds in Nissim and
Stemmer (2015) are only applicable for binary classification, while ours are suitable
for any differentially private learning algorithms.
Some works have also proved composition theorems (Dwork and Roth 2014;
Kairouz et al. 2017). The approximation of factor δ in our composition theorems is
tighter than the best existing result (Kairouz et al. 2017) by
eε − 1 ε
δ T − ,
eε + 1 ε
where T is the number of iterations, while the estimate of ε remains the same. This
improvement is significant in practice, where T is always considerably large and
is helpful for further tightening our the generalization bounds for iterative learning
algorithms considerably.
Our results are applicable to a wide range of machine learning algorithms. In
this book, we consider the stochastic gradient MCMC scheme (Ma et al. 2015) and
agnostic federated learning (Geyer et al. 2017) and take stochastic gradient Langevin
dynamics (Welling and Teh 2011) as an example of application. Our results deliver
generalization bounds for SGLD and agnostic federated learning. The obtained gener-
alization bounds are not explicitly relied on the model size, which can be prohibitively
large in modern methods, such as deep neural networks.
11.2.1 Preliminaries
Here, we slightly abuse the notations of distribution and its cumulative distribu-
tion function when no ambiguity is introduced because there is a one-one mapping
between them if we ignore zero-probability events.
Definition 11.1 (Max Divergence; cf. Dwork and Roth (2014), Definition 3.6) For
any random variables X and Y , the max divergence between X and Y is defined as
P(X ∈ S)
D∞ (X Y ) = max log .
S⊆Supp(X ) P(Y ∈ S)
Definition 11.2 (δ-Approximate Max Divergence; cf. (Dwork and Roth 2014), Def-
inition 3.6) For any random variables X and Y , the δ-approximate max divergence
between X to Y is defined as
δ P(X ∈ S) − δ
D∞ (X Y ) = max log .
S⊆Supp(X ):P(Y ∈S)≥δ P(Y ∈ S)
228 11 Privacy Preservation
Definition 11.3 (Statistical Distance; cf. (Dwork and Roth 2014)) For any random
variables X and Y , the statistical distance between X and Y is defined as
Lemma 11.1 (cf. (Dwork and Rothblum 2016), Lemmas 3.9 and 3.10) For any two
distributions D and D , there exist distributions M and M such that
and
D K L (DD ) ≤ D K L (MM ) = D K L (M M).
Lemma 11.2 (cf. Dwork and Roth (2014), Theorem 3.17) For any random variables
Y and Z , we have
δ δ
D∞ (Y Z ) ≤ ε, D∞ (Z Y ) ≤ ε,
δ δ
(Y Y ) ≤ , (Z Z ) ≤ ,
eε +1 1 + eε
D∞ (Y Z ) ≤ ε, D∞ (Z Y ) ≤ ε.
Lemma 11.3 (Azuma Lemma; cf. (Mohri et al. 2018), p. 371) Suppose {Yi }i=1 T
is
a sequence of random variables, where Yi ∈ [−ai , ai ]. Let {X i }i=1 be a sequence of
T
Definition 11.4 (PAC-Learnability; cf. (Mohri et al. 2018), Definition 2.4) A concept
class C is said to be PAC-learnable if there exists an algorithm A and a polynomial
function poly(·, ·, ·, ·) such that for any s > 0 and t > 0, for all distributions D on
the training example Z , any target concept c ∈ C, and any sample size
Proof Skeleton
We now give the proof skeleton for Theorem 11.1. The proofs have three stages:
(1) we first prove an on-average generalization bound for multi-database learning
algorithms; (2) we then obtain a high-probability generalization bound for multi-
database algorithms; and (3) we eventually prove Theorem 11.1 by reduction to
absurdity.
Stage 1: Prove an on-average generalization bound for multi-database
learning algorithms.
We first prove the following on-average generalization bound for multi-database
learning algorithms which are defined as follows.
à : S → H × {1, . . . , k},
be (ε, δ)-differentially private and the loss function l∞ ≤ 1. Then, for any data
distribution D over data space Z, we have the following inequality,
E E R̂ SiA(S) h A(S) − E R h A(S) ≤ e−ε kδ + 1 − e−ε . (11.5)
S∼Dm A(S) A(S)
Let S = S ∪ {z 0 } and z = z i (it is without less of generality since all z i is i.i.d. drawn
from D). Since S ∪ {z 0 } ∼ Dkm , we have
k 1
eε E E P l h A(S ∪{z0 }) , z i ≤ t, i A(S ∪{z0 }) = i dt + kδ
S ∼Dkm−1 z i ∼D,z 0 ∼D 0
i=1
k 1
= eε E E P l h A(S) , z ≤ t, i A(S) = i dt + kδ
S∼Dkm z∼D 0
i=1
1
= eε E E
P l h A(S) , z ≤ t dt + kδ
S∼Dkm z∼D 0
= eε E E EA(S) l h A(S) , z + kδ.
S∼Dkm z∼D
Therefore, we have
E E R̂ SiA(S) (h A(S) ) ≤ kδ + eε E [ E [R h A(S) ]]. (11.6)
S∼Dkm A(S) S∼Dkm A(S)
which leads to
E E R̂ SiA(S) (h A(S) ) − E [ E [R h A(S) ]]
S∼Dkm A(S) S∼Dkm A(S)
≤e−ε kδ − e−ε E E R̂ SiA(S) (h A(S) ) + E E R̂ SiA(S) (h A(S) )
S∼Dkm A(S) S∼Dkm A(S)
−ε −ε
≤1 − e + e kδ.
be (ε, δ)-differential private, where km is the size of the whole dataset S and YX =
{ f : X → Y}. Then, for any data distribution D over data space Z, any database
set S = {Si }i=1
k
, where Si is a database contains m i.i.d. examples drawn from D, we
have the following generalization bound,
P R̂ SiA(S) h A(S) ≤ R h A(S) + ke−ε δ + 3ε ≥ ε. (11.7)
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 233
Since R̂ SiA(S) h A(S) ≥ 0, we have that for any α > 0,
E E R̂ SiA(S) (h A(S) )
S∼Dkm A(S)
≥ E E R̂ SiA(S) (h A(S) )I R̂ S (h A(S) )≥R (h A(S) )+α
S∼Dkm A(S) i A(S)
≥ E E (α + R h A(S) )I R̂ S (h A(S) )≥R (h A(S) )+α .
S∼Dkm A(S) i A(S)
Furthermore, by splitting E [ E [R h A(S) ]] into two parts, we have
S∼Dkm A(S)
E E R̂ SiA(S) (h A(S) ) − E E R h A(S)
S∼Dkm A(S) S∼Dkm A(S)
≥ E E (α + R h A(S) )I R̂ S (h A(S) )≥R (h A(S) )+α
S∼Dkm A(S) i A(S)
− E E R h A(S) I R̂ S (h A(S) )≥R (h A(S) )+α
S∼Dkm A(S) i A(S)
+ E E R h A(S) I R̂ S (h A(S) )<R (h A(S) )+α
S∼Dkm A(S) i A(S)
≥ E E αI R̂ S (h A(S) )≥R (h A(S) )+α
S∼Dkm A(S) i A(S)
− E E R h A(S) I R̂ S (h A(S) )<R (h A(S) )+α
S∼Dkm A(S) i A(S)
≥αP R̂ SiA(S) (h A(S) ) > R h A(S) + α − P R̂ SiA(S) (h A(S) ) ≤ R h A(S) + α .
Definition 11.6 (Exponential Mechanism; cf. (Nissim and Stemmer 2015), p. 3, and
(McSherry and Talwar 2007)) Suppose that S is a sample set, u : (S, r ) → R+ is the
utility function, R is an index set, ε is the privacy parameter, and u is the sensitivity
of u defined by
u = max max
|u(S, r ) − u(S , r )|.
r ∈R S,S adjacent
If we have
e−ε δ 2
P R̂ S (A(S)) ≤ e−ε kδ + 8ε + R(A(S)) < 1 − ln , (11.8)
ε
e−ε δ 2 k
P ∀i, R̂ Si (A(Si )) ≤ e−ε kδ + 8 + R(A(Si )) ≤ 1 − ln ,
ε
which leads to
k
−ε 1 2 ε
P ∃i, R̂ Si (A(Si )) > e kδ + 8 + R(A(Si )) > 1 − 1 − ln ≥1− .
k 2
(11.10)
Furthermore, since T is independent with S, by using the Hoeffding inequality, we
have
ε ε
≥ (1 − e− /2m )k ≥ 1 − .
2
P ∀i, |l(h i , T ) − R(h i )| ≤ (11.11)
2 8
Therefore, by combining Eqs. (11.10) and (11.11), we obtain
−ε15 5ε
P ∃i, R̂ Si (A(Si )) > e kδ + + l(h i , T ) > 1 − .
2 8
The Eq. (11.9) in Lemma 11.4 conflicts with Eq. (11.7). Thus, we proved Theorem
11.1.
We then compares our results with the existing works.
Comparison of Theorem 11.1.
There have been several high-probability generalization bounds for (ε, δ)-
differentially private machine learning algorithms.
Dwork et al. (2015) proved that
P R(A(S)) − R̂ S (A(S)) < 4ε > 1 − 8δ ε .
236 11 Privacy Preservation
and
5N 2
ε + 1/m > 1 − 3e−mε ,
2
P Diff R < 4V̂S (A(S))ε̂ +
N −1
where
1 2
V̂S (A(S)) = (A(S), z i ) − A(S), z j .
2m(m − 1) i= j
where a and b are two positive constant real numbers. Apparently, a smaller a and
a smaller b imply a tighter generalization bound. Our bound improves the current
tightest result from two aspects:
• Our bound tightens the term a from 13ε to 9ε.
• Our bound tightens the term b from (2δ/ε) log (2/ε) to (2e−ε δ/ε) log (2/ε).
These improvements are significant. Abadi et al. (2016) conducted experiments on
the differential privacy in deep learning. Their empirical results demonstrate that the
factor ε can be as large as 10.
A comparison of Theorem 11.2.
There is only one related work in the literature that presents an on-average gener-
alization bound for multi-database algorithms. Nissim and Stemmer (2015) proved
that,
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 237
E E R̂ SiA(S) (h A(S) ) − E R h A(S) ≤ kδ + 2ε.
S∼Dkm A(S) A(S)
Most machine learning algorithms are iterative, which may degenerate the privacy-
preserving ability as the number of iterations. This section studies the degenerative
nature of the privacy preservation in iterative machine learning algorithms and its
influence on the generalization.
We have the following composition theorem.
Mi : (Yi−1 , S) → Yi
where
T
ε1 = εi , (11.13)
i=1
⎛ ⎞
T εi T T
ε 2
(e − 1) ε
εi2 log ⎝e + ⎠,
i=1 i
ε2 = + 2
i
εi + 1
i=1
e i=1
δ̃
T
(e − 1) εi
T εi
1
ε3 = εi + 1
+ 2 log εi2 , (11.14)
i=1
e δ̃ i=1
!
T !
T
δi δi
1− 1 − eαi +1− 1− + δ̃, (11.15)
i=1
1 + eεi i=1
1 + eεi
where " T
#
I = {αi }i=1
T
: αi = ε , |{i : αi = εi , αi = 0}| ≤ 1 ,
i=1
Proof Skeleton
eε − 1
D K L (A(S)A(S )) ≤ ε .
eε + 1
This lemma is new, and its proof involves non-trivial technical challenges. Lemma
11.1 is a crucial component in establishing the proof of Lemma 11.5. Notably, there
are two related results in the literature, albeit considerably less stringent than ours.
Dwork et al. (2010) demonstrated an inequality of the KL divergence as follows,
1 ε
D K L (A(S)A(S )) ≤ ε(e − 1). (11.16)
2
Compared with ours, Eq. (11.16) is larger by a factor of (1 + eε )/2, which can be
very large in practice.
Proof By Lemma 11.1, we have a random variable M(S) and M(S ), which satisfies
and
dkl (A(S)|A(S )) ≤ dkl (M(S)|M(S )) = dkl (M(S )|M(S)). (11.17)
dkl (M(S)|M(S ))
(∗) 1
= dkl (M(S)|M(S )) + dkl (M(S )|M(S))
2
1 dP(M(S)) 1 dP(M(S ))
= log dP(M(S)) + log dP(M(S ))
2 dP(M(S )) 2 dP(M(S))
1 dP(M(S))
= log d P(M(S)) − P(M(S ))
2 dP(M(S ))
1 dP(M(S )) dP(M(S))
+ log + log dP(M(S ))
2 dP(M(S)) dP(M(S ))
1 dP(M(S)) 1
= log d P(M(S)) − P(M(S )) + log 1dP(M(S ))
2 dP(M(S )) 2
1 dP(M(S))
= log d P(M(S)) − P(M(S )) , (11.18)
2 dP(M(S ))
dP(M(S) = y)
k(y) = − 1. (11.19)
dP(M(S ) = y)
Therefore,
Additionally,
M(S ) k(M(S ) = k(y)dP(M(S ) = y)
y∈
= d P(M(S) = y) − dP(M(S ) = y)
y∈
=0. (11.21)
Also, combined with the definition of k(y) (see Eq. 11.19), the right-hand side
(rhs) of Eq. (11.18) becomes
rhs = M(S ) k(M(S )) log(k(M(S )) + 1). (11.22)
We now calculate the maximum of Eq. (11.22) subject to Eqs. (11.21) and (11.23).
First, we argue that the maximum is achieved when k(M(S )) ∈ {e−ε − 1, eε − 1}
with a probability of 1 (almost surely). when k(M(S )) ∈ {e−ε − 1, eε − 1}, almost
surely, the distribution for k(M(S )) is as follows,
1
P∗ (k(M(S )) = eε − 1) = ,
1 + eε
eε
P∗ (k(M(S )) = e−ε − 1) = .
1 + eε
Note that
Also,
e M(S )∼q (k(M(S ) log(k(M(S )) + 1)i k(M(S ))>0 ) ≤ Pq (k(M(S )) ≥ 0) · ε(eε − 1).
(11.25)
Therefore, by combining the inequalities in Eqs. (11.24) and (11.25), we have
e M(S )∼q (k(M(S ) log(k(M(S )) + 1)) ≤ e M(S )∼q ∗ (k(M(S ) log(k(M(S )) + 1)).
Therefore,
Pq (k(M(S )) ≤ 0) < P∗ (k(M(S )) = e−ε − 1).
242 11 Privacy Preservation
and
Pq (0 < k(M(S )) < eε − 1) ≤ Pq (0 < k(M(S )) < eε − 1),
Since x log(x + 1) increases when x > 0 and decreases when x < 0, we have
q (k(M(S ) log(k(M(S )) + 1)) > q (k(M(S ) log(k(M(S )) + 1)).
where
q (k(M(S ) log(k(M(S )) + 1)i k(M(S ))<0 ) = ε(1 − e−ε )Pq (k(M(S )) < 0).
By applying Jensen’s inequality to bound the q (k(m(s ) log(k(m(s )) +
1)i k(m(s ))≥0 ), we have
q (k(M(S )) log(k(M(S )) + 1)i k(M(S ))≥0 )
= Pq (M(S ) ≥ 0)q (k(M(S )) log(k(M(S )) + 1)|k(M(S )) ≥ 0)
(∗)
≤ Pq (M(S ) ≥ 0)q k(M(S ))|k(M(S )) ≥ 0 · log(q k(M(S ))|k(M(S )) ≥ 0 + 1),
(11.26)
where the inequality (∗) uses jensen’s inequality (x log(1 + x) is convex with respect
to x when x > 0). The upper bound in Eq. (11.26) is achieved as long as
Furthermore,
subject to
q
≤ eε , (11.27)
1−q
where g(q) is the maximum of Eq. (11.22) subject to Pq (k(M(S )) < 0) = q, and the
condition Eq. (11.27) comes from the Pq (k(M(S )) ≥ 0) > P∗ (k(M(S )) = eε − 1)
(the assumption of case 2).
Additionally, g(q) can be represented as follows,
q
q(1 − e−ε ) log (eε − 1) + ε .
1−q
We assume that Y0 is the initial hypothesis (which does not depend on S). If for any
fixed observation yi−1 of the variable Yi−1 , Mi (yi−1 , S) is εi -differentially private,
then {Yi (S)}i=0
T
is (ε , δ )-differentially private that
$ T
% T
1 eεi − 1
ε = 2 log
εi2 + εi .
δ i=1 i=1
eεi + 1
Based on Lemma 11.5, we can prove the following composition theorem for
ε-differential privacy as a preparation theorem of the general case.
244 11 Privacy Preservation
P {Yi (S) = yi }i=0 T
log
P {Yi (S ) = yi }i=0
T
$T %
! P (Yi (S) = yi |Yi−1 (S) = yi−1 , ..., Y0 (S) = y0 )
= log
i=0
P (Yi (S ) = yi |Yi−1 (S ) = yi−1 , ..., Y0 (S ) = y0 )
T
P (Yi (S) = yi |Yi−1 (S) = yi−1 , ..., Y0 (S) = y0 )
= log
i=0
P (Yi (S ) = yi |Yi−1 (S ) = yi−1 , ..., Y0 (S ) = y0 )
T
(∗) P (Yi (S) = yi |Yi−1 (S) = yi−1 , ..., Y0 (S) = y0 )
= log
i=1
P (Yi (S ) = yi |Yi−1 (S ) = yi−1 , ..., Y0 (S ) = y0 )
T
P (Mi (yi−1 , S) = yi |Yi−1 (S) = yi−1 , ..., Y0 (S) = y0 )
= log
i=1
P (Mi (yi−1 , S ) = yi |Yi−1 (S ) = yi−1 , ..., Y0 (S ) = y0 )
T
(∗∗) P (Mi (yi−1 , S) = yi )
= log ,
i=1
P (Mi (yi−1 , S ) = yi )
where Eq. (∗) comes from the independence of Y0 with respect to S and Eq. (∗∗) is
due the independence of Mi to Yk (k < i) when the Yi−1 is fixed.
By the definition of ε-differential privacy, one has for arbitrary yi−1 as the
observation of Yi−1 ,
D∞ Mi (yi−1 , S)Mi (yi−1 , S ) < εi , D∞ Mi (yi−1 , S )Mi (yi−1 , S) < εi .
Combining Azuma Lemma (Lemma 11.3), Eq. (11.29) derives the following
equation $ %
P {Yi (S) = yi }i=0T
ε
P {Yi (S) = yi }i=0 :
T
>e < δ,
P {Yi (S ) = yi }i=0
T
!
T
f {α }
i i=1 = 1 −
T
(1 − αi Ai ), (11.30)
i=1
Lemma 11.6 The maximum of function (11.30) when Ai is positive real such that,
!
T
1 ≤ αi ≤ ci , (here ci Ai ≤ 1), and αi = c,
i=1
Based on Lemmas 11.2 and 11.6, we can prove the following composition theorem
whose estimate of ε is somewhat looser than our main results.
We assume that Y0 is the initial hypothesis (which does not depend on S). If for
any fixed observation yi−1 of the variable Yi−1 , Mi (yi−1 , S) is (εi , δi )-differentially
private (i ≥ 1), then {Yi (S)}i=0
T
is (ε , δ )-differentially private where
$ T % T
1 eεi − 1
ε =2 log
εi2 + εi ,
δ̃ i=1 i=1
eεi + 1
!
T !
T
δi
αi δi
δ = max 1 − 1−e +1− 1− + δ̃,
{αi }i=1
T
∈I
i=1
1 + eεi i=1
1 + eεi
T
αi = ε , |{i : αi = εi and αi = 0}| ≤ 1.
i=1
246 11 Privacy Preservation
Now, we can prove our composition theorems for (ε, δ)-differential privacy. We
first prove a composition algorithm of (ε, δ)-differential privacy whose estimate of ε
is somewhat looser than the existing results. Then, we tighten the results and obtain
a composition theorem that is strictly tighter than the current estimate.
Proof of Theorem 11.6 It has been proved that the optimal privacy preservation
can be achieved by a sequence of independent iterations (see Kairouz et al. (2017),
Theorem 3.5). Therefore, without loss of generality, we assume that the iterations in
our theorem are independent with each other.
Rewrite Yi (S) as Yi0 , and Yi (S ) as Yi1 (i ≥ 1). Then, by Lemma 11.2, there exist
random variables Ỹi0 and Ỹi1 , such that
δi
Yi0 Ỹi0 ≤ , (11.33)
1 + eεi
δi
Yi1 Ỹi1 ≤ , (11.34)
1 + eεi
D∞ Ỹi0 Ỹi1 ≤εi , (11.35)
D∞ Ỹi1 Ỹi0 ≤εi . (11.36)
Apparently,
& '
δi
P(Yi0 ∈ Bi ) − min , P(Yi ∈ Bi ) ≥ 0.
0
1 + eεi
Therefore,
" #
!
T
1
P(Ỹ00 ∈ B0 ) · · · P(Ỹn0 ∈ BT ) ≤ min eεi , P(Ỹ01 ∈ B0 ) · · · P(ỸT1 ∈ BT ) + δ̃.
i=1
P(Ỹi1 ∈ Bi )
Apparently, " #
εi 1
min e , P(Ỹ0i ∈ Bi ) ≤ 1,
P(Ỹi1 ∈ Bi )
Therefore, we have
" #
!
T
1
εi
min e , P(Ỹ01 ∈ B0 ) · · · P(ỸT1 ∈ BT )
i=1
P(Ỹi1 ∈ Bi )
" #
!
T
1
εi
− min e , P(Ỹ01 ∈ B0 )
i=1
P(Ỹi1 ∈ Bi )
δ1 δT
P(Ỹ11 ∈ B1 ) − · · · P( Ỹ 1
∈ B T ) −
1 + eε1 T
1 + e εT
$ " # %
!T
1 δi
≤1 − 1 − min eεi , .
i=1
P(Ỹi ∈ Bi ) 1 + eεi
1
(T ) *
εi
Case 2- i=1 min e , 1
P(Ỹ 1 ∈B )
> eε :
i i
!
T !
T
δiαi δi
δ ≤1− 1−e +1− 1− .
i=1
1 + eεi i=1
1 + eεi
!
T !
T
δi δi
δ ≤ 1 − 1 − eαi +1− 1− ,
i=1
1 + eεi i=1
1 + eεi
where i=1 T
αi ≤ ε and αi ≤ εi .
From Lemma 11.6, the minimum is realized on the boundary, which is exactly as
this theorem claims.
The proof is completed.
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 249
where Yi (S) and Yi (S ) are the mechanisms achieving the largest privacy area.
Thus, we can deliver an approximation of the differential privacy via this moment
generating function.
Proof of Theorem 11.4 By applying Theorem 3.5 proposed in Kairouz et al. (2017)
and replacing ε in the proof of Theorem 11.6 as
ε = min {I1 , I2 , I3 } ,
where
T
I1 = εi ,
i=1
⎛ ⎞
T εi T T
ε 2
(e − 1) εi
2εi2 log ⎝e + ⎠,
i=1 i
I2 = εi + 1
+
i=1
e i=1
δ̃
T
(eεi − 1) εi
T
1
I3 = + 2
2εi log .
i=1
eεi + 1 i=1
δ̃
where δ̃ is an arbitrary positive real number, (ε , δ ) is the differential privacy of the
whole algorithm, and (εi , δi ) is the differential privacy of the i-th iteration.
250 11 Privacy Preservation
where
ε1 = T ε,
$ %
√
(e − ε
1) εT T ε 2
ε2 = + ε2T log e + ,
eε + 1 δ̃
+
(eε − 1) εT 1
ε3 = + ε 2T log .
eε + 1 δ̃
Corollary 11.2 (Composition Theorem II) When all the iterations are (ε, δ)-
differential private, δ is
, - , -
εε T − εε T
ε δ δ δ
δ =1 − 1 − e 1 − + 1 − 1 − + δ̃
1 + eε 1 + eε 1 + eε
$ 2 %
ε 2δ ε δ
= T− + δ + δ̃ + O .
ε 1 + eε ε 1 + eε
, -
ε
Proof The maximum of δ is achieved when at most T − ε
elements αi = 0. We
note that (1 − x) = 1 − nx + O(x ). Then, the proof is completed by estimating
n 2
, - , -
ε T − ε T
δ ε
ε δ ε δ
δ =1 − 1 − e 1− + 1 − 1 − + δ̃
1 + eε 1 + eε 1 + eε
$ 2 %
δ δ
=1 + T + δ̃ + O
1 + eε 1 + eε
⎛ , - $ ⎞
ε 2 % $ $ 2 %%
ε δ δ ε δ δ
− ⎝1 − +O ⎠ 1− T − +O
1 + eε 1 + eε ε 1 + eε 1 + eε
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 251
$ 2 %
εδ ε δ δ δ
= + T − + T + O + δ̃
1 + eε
ε ε 1 + eε 1 + eε 1 + eε
ε 2δ ε
≈ T− + δ + δ̃.
ε 1 + eε ε
When all the iterators Mi are ε-differentially private, we can further tighten the
third estimation of ε in Theorem 11.4, Eq. (11.12) as the following theorem.
Corollary 11.3 (Composition Theorem III) Suppose all the iterators Mi are
ε-differentially private and all the other conditions in Theorem 11.4 hold. Then,
the algorithm A is (ε , δ )-differentially private that
+
(eε − 1) ε 1
ε =T ε + 2 log T ε2 ,
e +1 δ̃
, - , -
ε T − ε T
δ ε δ ε δ
δ =1 − 1 − eε 1− +1− 1− + δ ,
1 + eε 1 + eε 1 + eε
− ε +T ε 1 2T ε T ε + ε 2ε
δ =e 2 .
1 + eε T ε − ε T ε − ε
and ⎧
⎪
⎪ 0, x =0
⎪
⎪
⎪
⎪ (1 − δ)eε
⎪
⎨ , x =1
1 + eε
P1 (x) = .
⎪
⎪ 1−δ
⎪
⎪ , x =2
⎪
⎪ 1 + eε
⎪
⎩
δ, x =3
According to Theorem 3.4 by Kairouz et al. (2017), the largest magnitude of the
(ε , δ )-differential privacy can be calculated from the P0T and P1T .
252 11 Privacy Preservation
Construct P̃0 and P̃1 , whose cumulative distribution functions are as follows,
⎧ ε
⎪ e δ
⎪
⎪ , x =0
⎪ 1 + eε
⎪
⎪
⎪
⎪ (1 − δ)eε
⎪
⎪
⎨ , x =1
1 + eε
P̃0 (x) = ,
⎪
⎪ 1−δ
⎪
⎪ , x =2
⎪
⎪ 1 + eε
⎪
⎪
⎪
⎩ δ ,
⎪
x =3
1 + eε
and ⎧
⎪ δ
⎪
⎪ , x =0
⎪
⎪ 1 + eε
⎪
⎪ (1 − δ)eε
⎪
⎪
⎪
⎨ , x =1
1 + eε
P̃1 (x) .
⎪ 1−δ ,
⎪
⎪ x =2
⎪
⎪
⎪ 1 + eε
⎪
⎪
⎪ ε
⎩ e δ ,
⎪
x =3
1 + eε
δ
(P0 P̃0 ) ≤ ,
1 + eε
δ
(P1 P̃1 ) ≤ ,
1 + eε
D∞ (P̃0 P̃1 ) ≤ ε,
D∞ (P̃1 P̃0 ) ≤ εi .
P̃0 (xi ) T
Let Vi (xi ) = log P̃1 (xi )
and S(x1 , . . . , x T ) = i=1 Vi (xi ). We have that for any
t > 0,
PP̃0T ({xi } : S({xi }) > ε ) ≤ e−ε t EP̃0T (et S )
T
−ε t etε+ε e−tε
=e +
1 + eε 1 + eε
T
−ε t−T tε e2tε+ε 1
=e + . (11.39)
1 + eε 1 + eε
By calculating the derivative, we can deduce that the minimum of the RHS of
Eq. (11.39) is achieved at
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 253
T ε + ε
e2εt = e−ε . (11.40)
T ε − ε
ε
Since ε ≥ T eeε −1
+1
,
T ε + ε
e−ε > 1.
T ε − ε
Therefore, by applying Eq. (11.40) into the RHS of Eq. (11.39), we have
T − ε +T ε
− ε +T ε 1 2T ε T ε + ε 2ε
PP̃0T ({xi } : S({xi }) > ε ) ≤ e 2 .
1 + eε T ε − ε T ε − ε
(11.41)
eε −1
≤ e−t ( −T eε +1 ) e T t ε /2
2 2
,
P(Yi (S)) P(Yi (S))
where Vi is defined as log P(Y
i (S ))
− E log
P(Yi (S ))
|Y1 (S), . . . , Yi−1 (S) and S j is
j
defined as i=1 Vi .
ε
Since P ST ≥ ε − T eeε −1
+1
does not depend on t,
ε
eε − 1
−
( −T eε −1 )2
e +1
P ST ≥ ε − T ε ≤ min e 2T ε 2
= δ,
e +1 t>0
By contrary, the approach here directly calculates E[et ST ], without the shrinkage
in the proof of Theorem 11.4. Specifically,
T
e2tε+ε 1 eε −1
e−ε t−T tε = e−t E et ST ≤ e−t ( −T eε +1 ) e T t ε /2
2 2
+ .
1 + eε 1 + eε
254 11 Privacy Preservation
Therefore,
T
e2tε+ε 1 eε −1
min e−ε t−T tε ≤ min e−t ( −T eε +1 ) e T t ε /2
2 2
ε
+ ,
t>0 1+e 1 + eε t>0
which leads to
T − ε +T ε
− ε +T ε 1 2T ε T ε + ε 2ε
e 2 ≤ δ.
1 + eε T ε − ε T ε − ε
It ensures that this estimate further tightens δ than Theorem 11.4 if the ε is the same.
The trio of composition theorems expands upon the established connection
between generalizability and privacy-preserving capabilities to encompass itera-
tive machine learning algorithms. With these theorems, we establish the theoretical
groundwork for understanding the generalizability of iterative differentially private
machine learning algorithms.
11.2.3 Applications
Our theories apply to a wide spectrum of machine learning algorithms. This section
implements them on two popular regimes as examples: (1) stochastic gradient
Langevin dynamics (Welling and Teh 2011) as an example of the stochastic gra-
dient Markov chain Monte Carlo scheme (Ma et al. 2015; Zhang et al. 2018); and (2)
agnostic federated
√ learning (Geyer et al. 2017; Mohri et al. 2019). Our results help
deliver O( log m/m) high-probability generalization bounds and PAC-learnability
guarantees for the two schemes.
a wide range of domains, including topic modeling (Larochelle and Lauly 2012;
Zhang et al. 2020), Bayesian neural networks (Louizos and Welling 2017; Roth and
Pernkopf 2018; Ban et al. 2019; Ye et al. 2020), and generative models (Wang et al.
2019; Kobyzev et al. 2020). In this section, we investigate the analysis of SGLD as
an example of the SGMCMC framework. SGLD is illustrated as the following table.
where
√
2 2Lσ τ1 log 1δ + 4
τ2
L2
ε̃ = ,
2σ 2
and
T − mε2τ+τε̃ T ε̃
−
τ T ε̃
ε + m 1 2 mτ T ε̃ τ
ε̃T + ε
δ̃ = e 2
τ τ
m
τ .
1 + e m ε̃ m
T ε̃ − ε m
T ε̃ − ε
Some existing works have also studied the privacy-preservation and generalization
of SGLD. Wang et al. (2015) proved that SGLD has “privacy for free” without
injecting noise. Specifically, the authors proved that SGLD is (ε, δ)-differentially
private if
256 11 Privacy Preservation
ε2 m
T > .
32τ log(2/δ)
Pensia et al. (2018) analyzed the generalizability of SGLD via information theory.
Some works also deliver generalization bounds via algorithmic stability or the PAC-
Bayesian framework (Hardt et al. 2016; Raginsky et al. 2017; Mou et al. 2018).
Our Theorem 11.7 also demonstrates that SGLD is PAC-learnable under the
following assumption.
Assumption 11.8 There exist constants K 1 > 0, K 2 , T0 , and m 0 , such that, for T >
T0 and any m > m 0 , we have
R̂ S (A(S)) ≤ exp(−K 1 T + K 2 ).
This assumption can be easily justified: the training error is possible to achieve a
near-0 training error in machine learning. Then, we have the following remark.
Proof of Theorem 11.7 We first calculate the differential privacy of each step. We
assume mini-batch B has been selected and define ∇ R̂ Sτ (θ ) as following:
∇ R̂ Sτ (θ ) = ∇r (θ ) + ∇l(z|θ ).
z∈B
For any two neighboring sample sets S and S and fixed θi−1 , we have
p(θiS = θi |θi−1
S =θ
i−1 ) p(ηi (− τ1 ∇ R̂ τS (θi−1 ) + N (0, σ 2 I)) = θi − θi−1 )
max = max
θi S S
p(θi = θi |θi−1 = θi−1 ) θi p(ηi (− τ1 ∇ R̂ τS (θi−1 ) + N (0, σ 2 I)) = θi − θi−1 )
p(ηi (− τ1 ∇ R̂ τS (θi−1 ) + N (0, σ 2 I)) = θi )
= max .
θi p(ηi (− τ1 ∇ R̂ τS (θi−1 ) + N (0, σ 2 I)) = θi )
We define
where θ = 1
θ
ηi i
obeys − τ1 ∇ R̂ Sτ (θi−1 ) + N (0, σ 2 I).
Let θ = θ + τ1 ∇ R̂ Sτ (θi−1 ) and we rewrite D(θ ) as:
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 257
θ + τ1 ∇ R̂ τS (θi−1 ))2
e− 2σ 2
D(θ ) = log θ + τ1 ∇ R̂ τ (θi−1 )2
e−
S
2σ 2
θ + τ ∇ R̂ Sτ (θi−1 ))2
1
θ + τ1 ∇ R̂ Sτ (θi−1 )2
=− +
2σ 2 2σ 2
θ 2 θ + τ ∇ R̂ S (θi−1 ) − τ ∇ R̂ Sτ (θi−1 ))2
1 τ 1
=− +
2σ 2 2σ 2
2θ τ (∇ R̂ S (θi−1 ) − ∇ R̂ S (θi−1 )) + τ12 (∇ R̂ Sτ (θi−1 ) − ∇ R̂ Sτ (θi−1 )2 )
T 1 τ τ
= .
2σ 2
v < 2L .
where
4σ τ1 log 1δ + 1
τ2
ε̃ = ,
2σ 2
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 259
and
T − mε2τ+τε̃ T ε̃
−
τ T ε̃
ε + m 1 2 mτ T ε̃ τ
ε̃T + ε
δ̃ =e 2
τ τ
m
τ .
1 + e m ε̃ m
T ε̃ − ε m
T ε̃ − ε
To prove Theorem 11.9, we only need to prove the differential privacy part of
Theorem 11.9, while the rest of the proof is similar to that of Theorem 11.7.
Proof of Theorem 11.9 The proof bears resemblance to the proof of Theorem 11.7.
One only has to notice that each update is still a Gauss mechanism, while
3 3
3 h it 3
3 3
3 3 ≤ L.
3 max(1, h i 2 ) 3
t
References
Abadi, Martin, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar,
and Li Zhang. 2016. Deep learning with differential privacy. In ACM SIGSAC Conference on
Computer and Communications Security, 308–318.
Ban, Yutong, Xavier Alameda-Pineda, Laurent Girin, and Radu Horaud. Variational bayesian infer-
ence for audio-visual tracking of multiple speakers. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2019.
Beimel, Amos, Shiva Prasad Kasiviswanathan, and Kobbi Nissim. 2010. Bounds on the sample
complexity for private learning and private data release. In Theory of Cryptography Conference,
437–454. Springer.
Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. Concentration inequalities: a
nonasymptotic theory of independence. Oxford University Press.
Bun, Mark, and Thomas Steinke. 2016. Concentrated differential privacy: simplifications, exten-
sions, and lower bounds. In Theory of Cryptography Conference, 635–658.
Chaudhuri, Kamalika, Jacob Imola, and Ashwin Machanavajjhala. 2019. Capacity bounded
differential privacy. arXiv preprint arXiv:1907.02159.
Cuff, Paul, and Lanqing Yu. 2016. Differential privacy as a mutual information constraint. In ACM
SIGSAC Conference on Computer and Communications Security, 43–54.
Duane, Simon, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. 1987. Hybrid Monte
Carlo. Physics Letters B 195 (2): 216–222.
Dwork, Cynthia, and Aaron Roth. 2014. The algorithmic foundations of differential privacy.
Foundations and Trends® in Theoretical Computer Science 9 (3–4): 211–407.
Dwork, Cynthia, and Deirdre K Mulligan. 2013. It’s not privacy, and it’s not fair. Stanford Law
Review Online 66: 35.
Dwork, Cynthia, and Guy N Rothblum. 2016. Concentrated differential privacy. arXiv preprint
arXiv:1603.01887.
Dwork, Cynthia, Guy N Rothblum, and Salil Vadhan. 2010. Boosting and differential privacy. In
IEEE Annual Symposium on Foundations of Computer Science, 51–60.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon
Roth. 2015. Preserving statistical validity in adaptive data analysis. In Annual ACM Symposium
on Theory of Computing, 117–126.
Geumlek, Joseph, Shuang Song, and Kamalika Chaudhuri. 2017. Renyi differential privacy
mechanisms for posterior sampling. In Advances in Neural Information Processing Systems,
5289–5298.
Geyer, Robin C, Tassilo Klein, and Moin Nabi. 2017. Differentially private federated learning: a
client level perspective. In Advances in Neural Information Processing Systems.
Hardt, Moritz, Ben Recht, and Yoram Singer. 2016. Train faster, generalize better: Stability of
stochastic gradient descent. In International Conference on Machine learning, 1225–1234.
Hastings, W Keith. 1970. Monte Carlo sampling methods using Markov chains and their
applications.
He, Fengxiang, Bohan Wang, and Dacheng Tao. 2020. Tighter generalization bounds for iterative
differentially private learning algorithms. arXiv preprint arXiv:2007.09371.
Kairouz, Peter, Sewoong Oh, and Pramod Viswanath. 2015. The composition theorem for
differential privacy. In International Conference on Machine Learning, 1376–1385.
Kairouz, Peter, Oh. Sewoong, and Pramod Viswanath. 2017. The composition theorem for
differential privacy. IEEE Transactions on Information Theory 63 (6): 4037–4049.
Kobyzev, Ivan, Simon Prince, and Marcus Brubaker. 2020. Normalizing flows: an introduction and
review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Konečnỳ, Jakub, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and
Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. In
Advances in Neural Information Processing Systems Workshop on Private Multi-Party Machine
Learning.
References 261
Larochelle, Hugo, and Stanislas Lauly. 2012. A neural autoregressive topic model. In Advances in
Neural Information Processing Systems, 2708–2716.
Liao, Jiachun, Lalitha Sankar, Vincent YF Tan, and Flavio du Pin Calmon. 2017. Hypothesis testing
under mutual information privacy constraints in the high privacy regime. IEEE Transactions on
Information Forensics and Security 13 (4): 1058–1071.
Louizos, Christos, and Max Welling. 2017. Multiplicative normalizing flows for variational
Bayesian neural networks. In International Conference on Machine Learning, 2218–2227.
Ma, Yi-An, Tianqi Chen, and Emily Fox. 2015. A complete recipe for stochastic gradient mcmc.
In Advances in Neural Information Processing Systems, 2917–2925.
McMahan, H Brendan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y
Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In
International Conference on Artificial Intelligence and Statistics.
McSherry, Frank, and Kunal Talwar. 2007. Mechanism design via differential privacy. In 48th
Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), 94–103. IEEE.
Mironov, Ilya. 2017. Rényi differential privacy. In IEEE Computer Security Foundations Sympo-
sium, 263–275.
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. Foundations of Machine
Learning. MIT Press.
Mohri, Mehryar, Gary Sivek, and Ananda Theertha Suresh. 2019. Agnostic federated learning. In
International Conference on Machine Learning, 4615–4625.
Mou, Wenlong, Liwei Wang, Xiyu Zhai, and Kai Zheng. 2018. Generalization bounds of sgld for
non-convex learning: two theoretical viewpoints. In Annual Conference on Learning Theory,
605–638.
Nissim, Kobbi, and Uri Stemmer. 2015. On the generalization properties of differential privacy.
CoRR, abs/1504.05800.
Oneto, Luca, Sandro Ridella, and Davide Anguita. 2017. Differential privacy and generalization:
Sharper bounds with applications. Pattern Recognition Letters 89: 31–38.
Pensia, Ankit, Varun Jog, and Po-Ling Loh. 2018. Generalization error bounds for noisy, iterative
algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), 546–550.
Pittaluga, Francesco, and Sanjeev Jagannatha Koppal. 2016. Pre-capture privacy for small vision
sensors. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (11): 2215–2226.
Raginsky, Maxim, Alexander Rakhlin, and Matus Telgarsky. 2017. Non-convex learning via stochas-
tic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory,
1674–1703.
Robbins, Herbert, and Sutton Monro. 1951. A stochastic approximation method. The Annals of
Mathematical Statistics, 400–407.
Roth, Wolfgang, and Franz Pernkopf. 2018. Bayesian neural networks with weight sharing using
Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (1):
246–252.
Shokri, Reza, and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In ACM SIGSAC
Conference on Computer and Communications Security, 1310–1321.
Vapnik, Vladimir. 2013. The Nature of Statistical Learning Theory. Springer Science & Business
Media.
Wang, Yu-Xiang, Stephen Fienberg, and Alex Smola. 2015. Privacy for free: posterior sampling and
stochastic gradient monte carlo. In International Conference on Machine Learning, 2493–2502.
Wang, Weina, Lei Ying, and Junshan Zhang. 2016. On the relation between identifiability, differen-
tial privacy, and mutual-information privacy. IEEE Transactions on Information Theory 62 (9):
5018–5029.
Wang, Hongwei, Jialin Wang, Jia Wang, Miao Zhao, Weinan Zhang, Fuzheng Zhang, Wenjie Li,
Xing Xie, and Minyi Guo. 2019. Learning graph representation with generative adversarial nets.
IEEE Transactions on Knowledge and Data Engineering 33 (8): 3090–3103.
Welling, Max, and Yee W Teh. 2011. Bayesian learning via stochastic gradient Langevin dynamics.
In International Conference on Machine Learning, 681–688.
Yang, Qiang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated machine learning:
concept and applications. ACM Transactions on Intelligent Systems and Technology 10 (2): 12:1–
12:19. ISSN 2157-6904.
Ye, Qiaoling, Arash A Amini, and Qing Zhou. 2020. Optimizing regularized Cholesky score for
order-based learning of Bayesian networks. IEEE Transactions on Pattern Analysis and Machine
Intelligence 43 (10): 3555–3572.
Zhang, Hao, Bo Chen, Yulai Cong, Dandan Guo, Hongwei Liu, and Mingyuan Zhou. 2020. Deep
autoencoding topic model with scalable hybrid Bayesian inference. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
Zhang, Cheng, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. 2018. Advances in vari-
ational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8):
2008–2026.
Chapter 12
Algorithmic Fairness
Deep learning technology has been widely deployed in increasingly critical decision-
making tasks, such as mortgage approval, credit card assessment, college admissions,
employee selection, and recidivism prediction. However, these areas are observed
to be subject to long-standing systematic discrimination against certain people on
the basis of diverse background traits, including race, gender, nationality, age, and
religion. Unfortunately, the introduction of intelligent algorithms into these areas
has failed to relieve the discrimination conundrum because people with minority
backgrounds are institutionally underrepresented in the historical data that fuel the
algorithmic systems. Thus, the unfairness residing in biased historical data is inher-
ited and sometimes intensified by a machine learning (ML) model that is trained on
such data. The consequent fairness concerns are particularly severe due to the black-
box nature of ML algorithms. Therefore, mitigation of the fairness issues arising in
ML applications is both urgent and important.
In general, there are two types of fairness related to ML: individual fairness and
group fairness. The concept of individual fairness was first proposed by Dwork et al.
(2012), based on the core principle that similar individuals should be treated similarly.
Since then, many other individual fairness measures have been developed, including
but not limited to average individual fairness (Kearns et al. 2019), counterfactual
fairness (Kusner et al. 2017), meritocratic fairness (Kearns et al. 2017), and others
(Yurochkin and Sun 2021). On the other hand, group fairness measures the level
of bias across different groups of individuals. The most commonly used measures
include demographic parity (Calders et al. 2009), equalized odds (Hardt et al. 2016),
and equalized opportunity (Hardt et al. 2016), among others (Zafar et al. 2017; Choi
et al. 2020).
We consider a binary classification task. Each sample has the form z = (x, a, y),
where x ∈ X is the input feature, a ∈ A = {0, 1} represents one or more sensitive
attributes (e.g., gender, race, and age), and y ∈ {0, 1} is the prediction target. Let Z ,
X , A, and Y denote the random variables corresponding to z, x, a, and y, respec-
tively. Then, the goal of fair ML is to learn a binary classifier h : X × A → {0, 1}
while ensuring a specific notion of fairness concerning the sensitive attributes A.
For simplicity, we define Ŷ := h(X, A) to denote the prediction of the classifier h
for variable Z = (X, A, Y ).
Definition 12.1 (Group fairness) For a classifier $h$ with prediction $\hat{Y} = h(X, A)$,
• the demographic parity gap is defined as
$$\Gamma_{DP} = \big| \mathbb{P}(\hat{Y} = 1 \mid A = 0) - \mathbb{P}(\hat{Y} = 1 \mid A = 1) \big|;$$
• the equalized odds gap is defined as
$$\Gamma_{EO} = \max_{y \in \{0, 1\}} \big| \mathbb{P}(\hat{Y} = 1 \mid A = 0, Y = y) - \mathbb{P}(\hat{Y} = 1 \mid A = 1, Y = y) \big|;$$
and
• the equal opportunity gap is defined as
$$\Gamma_{EOP} = \big| \mathbb{P}(\hat{Y} = 1 \mid A = 0, Y = 1) - \mathbb{P}(\hat{Y} = 1 \mid A = 1, Y = 1) \big|.$$
Given Definition 12.1, a small value of any of these fairness measures indicates that the given classifier is largely nondiscriminatory, and vice versa. When the metric $\Gamma_{DP}$, $\Gamma_{EO}$, or $\Gamma_{EOP}$ is equal to zero, the classifier $h$ perfectly satisfies demographic parity, equalized odds, or equal opportunity, respectively.
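To make these group-fairness measures concrete, the following minimal sketch (the helper names are ours, not part of any standard library) estimates the three gaps from finite samples of predictions, sensitive attributes, and labels, assuming a single binary sensitive attribute as in the setup above.

```python
import numpy as np

def rate(y_hat, mask):
    """Empirical P(Y_hat = 1) restricted to the samples selected by mask."""
    return y_hat[mask].mean() if mask.any() else 0.0

def group_fairness_gaps(y_hat, a, y):
    """Empirical demographic parity, equalized odds, and equal opportunity gaps
    for binary predictions y_hat, sensitive attribute a, and labels y."""
    y_hat, a, y = (np.asarray(v) for v in (y_hat, a, y))
    dp = abs(rate(y_hat, a == 0) - rate(y_hat, a == 1))
    eo = max(
        abs(rate(y_hat, (a == 0) & (y == yy)) - rate(y_hat, (a == 1) & (y == yy)))
        for yy in (0, 1)
    )
    eop = abs(rate(y_hat, (a == 0) & (y == 1)) - rate(y_hat, (a == 1) & (y == 1)))
    return {"demographic_parity": dp, "equalized_odds": eo, "equal_opportunity": eop}

# A random classifier on random data should give all three gaps close to zero.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, 100000)
y = rng.integers(0, 2, 100000)
y_hat = rng.integers(0, 2, 100000)
print(group_fairness_gaps(y_hat, a, y))
```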
Preprocessing methods aim to eliminate the inherent bias in data before those data
are fed into learning algorithms (Calders et al. 2009; Feldman et al. 2014; Calmon
et al. 2017; Louizos et al. 2015; Choi et al. 2020; Kamiran and Calders 2012; Zemel
et al. 2013; Zhao et al. 2020). For example, Kamiran and Calders (2012) reviewed
several data preprocessing techniques, including (1) removing attributes that are most
relevant to the target sensitive attribute, (2) changing data labels, and (3) reweighting
or resampling the data. Zemel et al. (2013) proposed learning fair representations
by learning a linear transform to encode all information in the input features except
information that could lead to biased decision-making. This method can enhance both
individual fairness and group fairness. Zhao et al. (2020) focused on group fairness
and further proposed a novel fair representation learning technique that can mitigate
unfairness in terms of both accuracy parity and equalized odds simultaneously while
achieving a better utility-fairness trade-off.
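As one concrete reading of the reweighting idea above, the following sketch computes instance weights in the spirit of the reweighing scheme of Kamiran and Calders (2012); the function name and the exact formula shown here are a minimal illustration under that interpretation, not the book's implementation.

```python
import numpy as np

def reweighing_weights(a, y):
    """Instance weights w(a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y); under these
    weights the sensitive attribute and the label are statistically independent
    in the reweighted sample."""
    a, y = np.asarray(a), np.asarray(y)
    w = np.ones(len(y), dtype=float)
    for av in np.unique(a):
        for yv in np.unique(y):
            mask = (a == av) & (y == yv)
            if mask.any():
                w[mask] = (a == av).mean() * (y == yv).mean() / mask.mean()
    return w

# The weights can be passed to any learner that accepts per-sample weights,
# e.g. sklearn's LogisticRegression().fit(X, y, sample_weight=w).
```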
To solve this optimization problem, they used the Lagrange method to minimize the following loss function with Lagrange multipliers $\lambda \in \mathbb{R}_+^K$:
$$L(Q, \lambda) = \mathbb{P}_{S_1}\big(Q(X) \neq Y\big) + \sum_{k} \lambda_k \left( \big| \gamma_{y,a}^{S_1}(Q) - \gamma_{y,0}^{S_1}(Q) \big| - \alpha_n \right),$$
where each of the $K$ constraints, indexed by $k$, corresponds to a pair $(y, a)$.
The transition probability $p$ learned in this way is the optimal solution for making $\tilde{Y}$ closer to being nondiscriminatory. Based on these two steps, the authors obtained a predictor $\tilde{Y}$ that is both near-optimal and near-nondiscriminatory.
Moreover, Zafar et al. (2017) studied binary classification and proposed a fair learning algorithm that restricts the covariance between the sensitive attributes and the signed distance from the data features to the decision boundary. Baharlouei et al. (2020) employed Rényi correlation as a regularizer and proposed a min-max optimization algorithm that can reduce arbitrary dependence between the predictions of a model and the sensitive attributes of the input data. Yurochkin and Sun (2021) further defined the concept of distributional individual fairness and designed an approximation algorithm that enforces such a fairness restriction by means of regularization.
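The following sketch illustrates the regularization idea in a simplified form: a logistic loss plus a penalty on the covariance between the sensitive attribute and the signed distance to the decision boundary, loosely following the constraint of Zafar et al. (2017); the soft-penalty formulation and the weight lam are our own simplifications.

```python
import torch
import torch.nn.functional as F

def fair_logistic_loss(theta, X, y, a, lam=1.0):
    """Logistic loss plus a penalty on the empirical covariance between the
    sensitive attribute and the signed distance to the decision boundary,
    in the spirit of the decision-boundary covariance constraint of
    Zafar et al. (2017); here the constraint is relaxed into a soft penalty."""
    scores = X @ theta                      # signed distance to the boundary (up to scaling)
    bce = F.binary_cross_entropy_with_logits(scores, y)
    cov = torch.mean((a - a.mean()) * scores)
    return bce + lam * cov.abs()

# Minimal usage on synthetic data; lam trades accuracy against fairness.
torch.manual_seed(0)
X = torch.randn(256, 5)
y = torch.randint(0, 2, (256,)).float()
a = torch.randint(0, 2, (256,)).float()
theta = torch.zeros(5, requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    fair_logistic_loss(theta, X, y, a, lam=2.0).backward()
    opt.step()
```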
References
Agarwal, Alekh, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. 2018.
A reductions approach to fair classification. arXiv preprint arXiv:1803.02453.
Baharlouei, Sina, Maher Nouiehed, Ahmad Beirami, and Meisam Razaviyayn. 2020. Rényi fair
inference. In International Conference on Learning Representations.
Calders, Toon, Faisal Kamiran, and Mykola Pechenizkiy. 2009. Building classifiers with indepen-
dency constraints. In IEEE International Conference on Data Mining Workshops, 13–18.
Calmon, Flavio, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and
Kush R Varshney. 2017. Optimized pre-processing for discrimination prevention. Advances in
Neural Information Processing Systems 30: 3992–4001.
Choi, Kristy, Aditya Grover, Trisha Singh, Rui Shu, and Stefano Ermon. 2020. Fair generative
modeling via weak supervision. In International Conference on Machine Learning, 1887–1898.
Dwork, Cynthia, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness
through awareness. In Innovations in Theoretical Computer Science Conference, 214–226.
Dwork, Cynthia, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson. 2018. Decoupled
classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability
and Transparency, 119–133.
Feldman, Michael, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubra-
manian. 2014. Certifying and removing disparate impact.
Fish, Benjamin, Jeremy Kun, and Ádám D Lelkes. 2016. A confidence-based approach for balancing
fairness and accuracy. In 2016 SIAM International Conference on Data Mining, 144–152. SIAM.
Hardt, Moritz, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. In
Advances in Neural Information Processing Systems, 3315–3323.
Kamiran, Faisal, and Toon Calders. 2012. Data preprocessing techniques for classification without
discrimination. Knowledge and Information Systems 33 (1): 1–33.
Kamishima, Toshihiro, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. 2012. Fairness-aware clas-
sifier with prejudice remover regularizer. In Joint European Conference on Machine Learning
and Knowledge Discovery in Databases, 35–50.
Kearns, Michael, Aaron Roth, and Saeed Sharifi-Malvajerdi. 2019. Average individual fairness:
Algorithms, generalization and experiments. In Advances in Neural Information Processing
Systems, vol. 32.
Kearns, Michael, Aaron Roth, and Zhiwei Steven Wu. 2017. Meritocratic fairness for cross-
population selection. In International Conference on Machine Learning, 1828–1836.
Kearns, Michael, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. 2018. Preventing fairness gerry-
mandering: Auditing and learning for subgroup fairness. In International Conference on Machine
learning, 2564–2572.
Kim, Michael P, Amirata Ghorbani, and James Zou. 2019. Multiaccuracy: Black-box post-
processing for fairness in classification. In 2019 AAAI/ACM Conference on AI, Ethics, and Society,
247–254.
Kusner, Matt J, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. In
Advances in Neural Information Processing Systems, vol. 30.
Lahoti, Preethi, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang,
and Ed Chi. 2020. Fairness without demographics through adversarially reweighted learning. In
Advances in Neural Information Processing Systems, vol. 33, 728–740.
Louizos, Christos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. 2015. The variational
fair autoencoder. arXiv preprint arXiv:1511.00830.
Martinez, Natalia, Martin Bertran, and Guillermo Sapiro. 2020. Minimax Pareto fairness: A multi
objective perspective. In International Conference on Machine Learning, 6755–6764.
Menon, Aditya Krishna, and Robert C Williamson. 2018. The cost of fairness in binary classification.
In Conference on Fairness, Accountability and Transparency, 107–118.
Mozannar, Hussein, Mesrob Ohannessian, and Nathan Srebro. 2020. Fair learning with private
demographic data. In International Conference on Machine Learning, 7066–7075.
Nesterov, Yurii E. 1983. A method for solving the convex programming problem with convergence
rate O(1/k²). Dokl. Akad. Nauk SSSR 269: 543–547.
Woodworth, Blake, Suriya Gunasekar, Mesrob I Ohannessian, and Nathan Srebro. 2017. Learning
non-discriminatory predictors. arXiv preprint arXiv:1702.06081.
Yurochkin, Mikhail, and Yuekai Sun. 2021. SenSeI: Sensitive set invariance for enforcing individual
fairness. In International Conference on Learning Representations.
Zafar, Muhammad Bilal, Isabel Valera, Manuel Gomez Rogriguez, and Krishna P Gummadi. 2017.
Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics,
962–970.
Zemel, Rich, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. 2013. Learning fair
representations. In International Conference on Machine Learning, 325–333.
Zhao, Han, Amanda Coston, Tameem Adel, and Geoffrey J. Gordon. 2020. Conditional learning of
fair representations. In International Conference on Learning Representations.
Chapter 13
Adversarial Robustness
Szegedy et al. (2014) discovered that neural networks are vulnerable to adversarial examples, which are genuine examples perturbed with small artificial noise crafted to fool the classifiers. Since that finding was reported, many methods have been proposed to study attacks on neural networks using adversarial examples and defences against such adversarial attacks. This chapter discusses the theory of adversarial robustness and its relation to generalizability and privacy preservation.
robust PAC learnability. They considered both agnostic robust PAC learnability and realizable robust PAC learnability. A learning rule is called improper if it may output a predictor outside the hypothesis space $\mathcal{H}$. They then showed that, under an improper learning rule, any hypothesis space $\mathcal{H}$ is robustly PAC learnable if its VC dimension is finite.
Generalizability of adversarial training. Among the proposed defence approaches, adversarial training (Dai et al. 2018; Li et al. 2018a; Baluja and Fischer 2018; Zheng et al. 2019) has shown promising performance in improving the adversarial robustness of deep neural networks against adversarial examples. Mathematically, adversarial training can be formulated as solving the following minimax loss problem (Tu et al. 2019; Yin et al. 2019; Khim and Loh 2018):
$$\min_\theta \frac{1}{m} \sum_{i=1}^{m} \max_{\|x_i' - x_i\| \le \rho} l\big(h_\theta(x_i'), y_i\big),$$
Adversarial training (Dai et al. 2018; Li et al. 2018a; Baluja and Fischer 2018; Zheng et al. 2019) has shown promising performance in improving the adversarial robustness of deep neural networks against adversarial examples (Biggio et al. 2013; Szegedy et al. 2013; Goodfellow et al. 2014; Papernot et al. 2016). Specifically, adversarial training can be formulated as solving the following minimax loss problem:
$$\min_\theta \frac{1}{m} \sum_{i=1}^{m} \max_{\|x_i' - x_i\| \le \rho} l\big(h_\theta(x_i'), y_i\big),$$
Fig. 13.1 The first three subfigures are plots of (a) adversarial accuracy versus radius, (b) robustified intensity versus radius, and (c) adversarial accuracy versus robustified intensity. The green and blue curves correspond to CIFAR-10 and CIFAR-100, respectively. Subfigure (d) shows a histogram of 8,000,000 gradient noise instances on Tiny ImageNet versus the probability density function of Lap(0, 0.48)
$$\min_\theta \hat{R}_S(\theta) = \min_\theta \frac{1}{m} \sum_{i=1}^{m} l\big(h_\theta(x_i), y_i\big).$$
In standard ERM, SGD estimates the gradient on a minibatch $B$ and updates the weights as
$$\hat{g}(\theta) = \frac{1}{|B|} \sum_{(x_i, y_i) \in B} \nabla_\theta l\big(h_\theta(x_i), y_i\big), \qquad \theta_{t+1} = \theta_t - \eta_t \hat{g}(\theta_t),$$
where $\theta_t$ is the weight vector in the $t$-th iteration and $\eta_t$ is the corresponding learning rate.
In adversarial training, SGD is employed to solve the following minimax loss problem:
$$\min_\theta \hat{R}_S^A(\theta) = \min_\theta \frac{1}{m} \sum_{i=1}^{m} \max_{\|x_i' - x_i\| \le \rho} l\big(h_\theta(x_i'), y_i\big), \tag{13.2}$$
where $\rho$ is the radius of a ball centred at example $(x_i, y_i)$. Here, we call $\hat{R}_S^A(\theta)$ the adversarial empirical risk. Correspondingly, the stochastic gradient on a minibatch $B$ and the weight update are calculated as follows:
$$\hat{g}^A(\theta) = \frac{1}{|B|} \sum_{(x_i, y_i) \in B} \nabla_\theta \max_{\|x_i' - x_i\| \le \rho} l\big(h_\theta(x_i'), y_i\big), \qquad \theta_{t+1} = \theta_t - \eta_t \hat{g}^A(\theta_t). \tag{13.3}$$
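The inner maximization in Eqs. (13.2) and (13.3) is intractable in general and is approximated in practice, e.g., by projected gradient descent (PGD), as in the experiments reported later in this chapter. The following minimal PyTorch sketch shows one adversarial training step under an $\ell_\infty$ constraint; the model, step sizes, and number of PGD steps are placeholder choices, not the exact configuration used in the book.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, rho=8/255, alpha=2/255, steps=10):
    """Approximate the inner maximization max_{||x'-x||_inf <= rho} l(h_theta(x'), y)
    by projected gradient ascent (PGD) under an l_inf constraint."""
    x_adv = x.detach() + torch.empty_like(x).uniform_(-rho, rho)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x.detach() + (x_adv - x.detach()).clamp(-rho, rho)  # project back into the rho-ball
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y, rho=8/255):
    """One SGD step on the adversarial empirical risk, mirroring Eq. (13.3)."""
    x_adv = pgd_attack(model, x, y, rho=rho)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()              # minibatch adversarial gradient g^A(theta)
    optimizer.step()             # theta_{t+1} = theta_t - eta_t * g^A(theta_t)
    return loss.item()

# Toy usage with a linear model on random data.
model = torch.nn.Linear(32, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.rand(16, 32), torch.randint(0, 10, (16,))
print(adversarial_training_step(model, opt, x, y))
```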
We then define the robustified intensity based on the gradient norms as follows.
Definition 13.1 (Robustified intensity) For adversarial training (Eq. 13.2), the robustified intensity is defined as
$$I = \frac{\max_{\theta, x, y} \left\| \nabla_\theta \max_{\|x' - x\| \le \rho} l\big(h_\theta(x'), y\big) \right\|}{\max_{\theta, x, y} \left\| \nabla_\theta l\big(h_\theta(x), y\big) \right\|}. \tag{13.4}$$
Rigorously searching for the maximal gradient norm over all weights and examples, for either adversarial training or ERM, within every ball of radius $\rho$ in Euclidean space is computationally infeasible. Therefore, for practical use, we define an empirical robustified intensity that estimates the robustified intensity, as follows.
Definition 13.2 (Empirical robustified intensity) For adversarial training (Eq. 13.2), the empirical robustified intensity is defined as
$$\hat{I} = \frac{\max_{(x_i, y_i) \in B,\, \theta} \left\| \nabla_\theta \max_{\|x_i' - x_i\| \le \rho} l\big(h_\theta(x_i'), y_i\big) \right\|}{\max_{(x_i, y_i) \in B,\, \theta} \left\| \nabla_\theta l\big(h_\theta(x_i), y_i\big) \right\|}. \tag{13.5}$$
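A rough recipe for estimating this quantity is sketched below, reusing the pgd_attack helper from the previous sketch: it takes the ratio of the largest per-example adversarial gradient norm to the largest clean gradient norm at the current weights. The maximization over $\theta$ in Definition 13.2 would additionally track this ratio along the training trajectory, which is omitted here for brevity; the model and data are placeholders.

```python
import torch
import torch.nn.functional as F
# pgd_attack is the inner-maximization helper from the previous sketch.

def gradient_norm(model, x, y):
    """l2 norm of nabla_theta l(h_theta(x), y) for a single example or small batch."""
    model.zero_grad()
    F.cross_entropy(model(x), y).backward()
    return torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None))

def empirical_robustified_intensity(model, samples, rho=8/255):
    """Rough estimate of I_hat in Definition 13.2 at the current weights: ratio of
    the largest adversarial gradient norm to the largest clean gradient norm
    over the given (x, y) pairs."""
    max_adv, max_erm = 0.0, 0.0
    for x, y in samples:
        x_adv = pgd_attack(model, x, y, rho=rho)
        max_adv = max(max_adv, gradient_norm(model, x_adv, y).item())
        max_erm = max(max_erm, gradient_norm(model, x, y).item())
    return max_adv / max_erm

# Toy usage on a handful of single-example "minibatches".
model = torch.nn.Linear(32, 10)
samples = [(torch.rand(1, 32), torch.randint(0, 10, (1,))) for _ in range(8)]
print(empirical_robustified_intensity(model, samples))
```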
Assumption 13.2 The gradient of the loss function satisfies $\nabla_\theta l(h_\theta(x), y) \in C^0(\mathcal{Z})$; i.e., for any hypothesis $h_\theta \in \mathcal{H}$, $\nabla_\theta l(h_\theta(x), y)$ is continuous with respect to $z$.
To prove Theorem 13.1, we first recall additional preliminaries that are necessary
for the proofs.
Suppose that every example z is sampled in an independent and identically dis-
tributed (i.i.d.) manner from the data distribution D, i.e., z ∼ D. Thus, the training
sample set satisfies S ∼ Dm , where m is the number of training samples.
To prove our theorem, we establish the following definitions.
Definition 13.3 (Ball and sphere) The ball of radius $r > 0$ in terms of norm $\|\cdot\|$ in space $\mathcal{H}$, centred at a point $h \in \mathcal{H}$, is defined as
$$B_r(h) = \{x : \|x - h\| \le r\},$$
and the corresponding sphere is defined as
$$\partial B_r(h) = \{x : \|x - h\| = r\}.$$
The complement of a set $A \subseteq \mathcal{H}$ is defined as
$$A^c = \{h : h \in \mathcal{H}, h \notin A\}.$$
almost surely.
We first prove that, for any positive real $\rho > 0$,
$$g(\theta, z) = \left\| \nabla_\theta \max_{x' \in B_x(\rho)} l\big(h_\theta(x'), y\big) \right\|$$
is continuous with respect to $z$. Let $T_y(x) \in \arg\max_{x' \in B_x(\rho)} l\big(h_\theta(x'), y\big)$ denote a maximizer of the inner problem, and consider a sequence $\{z_i = (x_i, y_i)\}_{i=1}^{\infty}$ with $\lim_{i \to \infty} z_i = z_0$. Since $\{T_{y_i}(x_i)\}_{i=1}^{\infty}$ is a bounded set, there exists an increasing subsequence $\{k_i\}_{i=1}^{\infty} \subseteq \mathbb{Z}^+$ such that $\{T_{y_{k_i}}(x_{k_i})\}_{i=1}^{\infty}$ converges to some point $T_\infty$. Then, we have
$$T_\infty \in \bigcap_{i=1}^{\infty} B_{x_{k_i}}(\rho) \subset B_{x_0}(\rho).$$
Furthermore, for any $\varepsilon > 0$, there exists a $\delta > 0$ such that for any $x \in B_{T_{y_0}(x_0)}(\delta)$, $l(h_\theta(x), y_0) \ge l\big(h_\theta(T_{y_0}(x_0)), y_0\big) - \varepsilon$. In the case that $T_{y_0}(x_0) \in \partial B_{x_0}(\rho)$ and $T_{y_0}(x_0) \notin \bigcap_{i=1}^{\infty} B_{x_{k_i}}(\rho)$, let $x \in B_{T_{y_0}(x_0)}(\delta)$ be an inner point of $B_{x_0}(\rho)$. When $i$ is sufficiently large, we have $x \in B_{x_{k_i}}(\rho)$, which yields
$$l\big(h_\theta(T_\infty), y_0\big) \ge l\big(h_\theta(x), y_0\big) \ge l\big(h_\theta(T_{y_0}(x_0)), y_0\big) - \varepsilon.$$
Therefore, $T_\infty = T_{y_0}(x_0)$, which leads to a contradiction, since $\|T_{y_i}(x_i) - T_{y_0}(x_0)\| \ge \varepsilon_0$.
Since $g(\theta, z)$ can be rewritten as
$$g(\theta, z) = \left\| \nabla_\theta \max_{x' \in B_x(\rho)} l\big(h_\theta(x'), y\big) \right\| = \left\| \nabla_\theta l\big(h_\theta(T_y(x)), y\big) \right\|,$$
For any $\varepsilon > 0$, since $g(\theta_0, z)$ is continuous with respect to $z$, there exists a $\delta > 0$ such that for any $(x', y') \in B_{(x_0, y_0)}(\delta)$,
$$g\big(\theta_0, (x', y')\big) \ge g\big(\theta_0, (x_0, y_0)\big) - \varepsilon.$$
Therefore,
$$\big\{(x, y) : g(\theta_0, (x, y)) < g(\theta_0, (x_0, y_0)) - \varepsilon\big\} \subset \big(B_{(x_0, y_0)}(\delta)\big)^c,$$
and we have
$$\begin{aligned}
\mathbb{P}_{S \sim \mathcal{D}^m}\left[ \max_{\theta, z \in S} g(\theta, z) \le \max_{\theta, z} g(\theta, z) - \varepsilon \right]
&\le \mathbb{P}_{S \sim \mathcal{D}^m}\left[ \max_{z \in S} g(\theta_0, z) \le g(\theta_0, z_0) - \varepsilon \right] \\
&\le \mathbb{P}_{S \sim \mathcal{D}^m}\left[ S \cap B_{(x_0, y_0)}(\delta) = \emptyset \right] \\
&= \left( 1 - \mathbb{P}_{z \sim \mathcal{D}}\left[ z \in B_{(x_0, y_0)}(\delta) \right] \right)^m.
\end{aligned}$$
As $m \to \infty$, we have
$$\lim_{m \to \infty} \mathbb{P}_{S \sim \mathcal{D}^m}\left[ \max_{\theta, z \in S} g(\theta, z) \le \max_{\theta, z} g(\theta, z) - \varepsilon \right] = 0.$$
This section studies the relationship between privacy preservation and robustness in
adversarial training. We prove that adversarial training is (ε, δ)-differentially private.
Theorem 13.4 Suppose that SGD is employed for adversarial training, that the whole training procedure consists of $T$ iterations, and that the stochastic gradient on a minibatch $B$ of size $\tau$ follows a Laplace distribution,
$$\frac{1}{\tau} \sum_{(x, y) \in B} \nabla_\theta \max_{\|x' - x\| \le \rho} l\big(h_\theta(x'), y\big) \sim \mathrm{Lap}\big(\nabla_\theta \hat{R}_S^A(\theta), b\big).$$
Then, the adversarial training process is $(\varepsilon, \delta)$-differentially private, where
$$\varepsilon = \varepsilon_0 \sqrt{2 T \log \frac{m}{\delta'}} + T \varepsilon_0 \big(e^{\varepsilon_0} - 1\big), \qquad \delta = \frac{\delta'}{m},$$
with
$$\varepsilon_0 = \frac{2 L_{ERM}}{m b} I.$$
Here, $\delta'$ is a positive real number, $\tau$ is the batch size, $I$ is the robustified intensity, and $b$ is the Laplace parameter.
Remark 13.1 The approximation of the differential privacy given by Theorem 13.4 is $\big(O(\sqrt{\log m}/m), O(1/m)\big)$.
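As a quick numerical illustration of Theorem 13.4, the following sketch evaluates the per-iteration budget $\varepsilon_0$ and the composed $(\varepsilon, \delta)$ for a given robustified intensity $I$; the values of $L_{ERM}$, $b$, $m$, $T$, and $\delta'$ are arbitrary placeholders chosen only for illustration.

```python
import math

def adversarial_training_privacy(L_erm, I, b, m, T, delta_prime=1e-3):
    """Evaluate the (epsilon, delta) pair of Theorem 13.4: per-iteration budget
    eps0 = 2 * L_erm * I / (m * b), composed over T iterations via advanced
    composition (Lemma 13.2)."""
    eps0 = 2.0 * L_erm * I / (m * b)
    eps = eps0 * math.sqrt(2.0 * T * math.log(m / delta_prime)) + T * eps0 * (math.exp(eps0) - 1.0)
    return eps, delta_prime / m

# A larger robustified intensity I gives a larger epsilon, i.e. a weaker privacy
# guarantee, which is the robustness-privacy trade-off discussed below.
for I in (1.0, 2.0, 5.0):
    print(I, adversarial_training_privacy(L_erm=1.0, I=I, b=0.5, m=50_000, T=10_000))
```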
To prove Theorem 13.4, we will first prove two lemmas (Lemma 13.1 and Lemma 13.2 below).
In practice, it is easier to obtain high-probability approximations of ε-differential privacy from concentration inequalities. Lemma 13.1 relates such high-probability approximations of ε-differential privacy to approximations of (ε, δ)-differential privacy. Similar arguments have been used in related works; see, for example, the proof of Theorem 3.20 in Dwork and Roth (2014). Here, we present a detailed proof for completeness.
Lemma 13.1 Suppose that $\mathcal{A} : \mathcal{Z}^m \to \mathcal{H}$ is a stochastic algorithm whose output hypothesis learned on training sample set $S$ is $\mathcal{A}(S)$. For any hypothesis $h \in \mathcal{H}$, if the condition
$$\log \frac{\mathbb{P}[\mathcal{A}(S) = h]}{\mathbb{P}[\mathcal{A}(S') = h]} \le \varepsilon \tag{13.8}$$
Additionally, we define
$$B_0^c = \mathcal{H} - B_0.$$
Lemma 13.2 (Advanced composition; cf. Dwork and Roth (2014), Theorem 3.20) Suppose that an $(\varepsilon_0, \delta_0)$-differentially private process is run repeatedly $T$ times. Then, the whole algorithm is $(\varepsilon, \delta)$-differentially private, where
$$\varepsilon = \sqrt{2 T \log \frac{1}{\delta'}} \, \varepsilon_0 + T \varepsilon_0 \big(e^{\varepsilon_0} - 1\big), \qquad \delta = T \delta_0 + \delta',$$
Proof of Theorem 13.4 We assume that the gradients calculated from a randomly sampled minibatch $B$ with a size of $\tau$ are random variables drawn from a Laplace distribution, as justified previously:
$$\frac{1}{\tau} \sum_{z \in B} \nabla_\theta \max_{\|x' - x\| \le \rho} l(\theta, x', y) \sim \mathrm{Lap}\big(\nabla_\theta \hat{R}_S^A(\theta), b\big).$$
We define
$$L_A = \max_{\theta_t, x, y} \left\| \nabla_\theta \max_{\|x' - x\| \le \rho} l(\theta_t, x', y) \right\|$$
and
$$v = \nabla_\theta \hat{R}_S^A(\theta_t) - \nabla_\theta \hat{R}_{S'}^A(\theta_t).$$
Since the training sample sets $S$ and $S'$ differ by only one pair of examples, we have
$$\|v\| \le \frac{2 L_A}{m}. \tag{13.13}$$
Combining Eqs. (13.12) and (13.13), we obtain
$$\log \frac{p\big(\mathrm{Lap}(\nabla_\theta \hat{R}_S^A(\theta_t), b) = \hat{g}_t^A\big)}{p\big(\mathrm{Lap}(\nabla_\theta \hat{R}_{S'}^A(\theta_t), b) = \hat{g}_t^A\big)}
= \frac{1}{b} \left( - \left\| \nabla_\theta \hat{R}_S^A(\theta_t) - \hat{g}_t^A \right\| + \left\| \nabla_\theta \hat{R}_{S'}^A(\theta_t) - \hat{g}_t^A \right\| \right)
\le \frac{2 L_A}{m b}. \tag{13.14}$$
Since $L_A = I \, L_{ERM}$, we have
$$\log \frac{p\big(\mathrm{Lap}(\nabla_\theta \hat{R}_S^A(\theta_t), b) = \hat{g}_t^A\big)}{p\big(\mathrm{Lap}(\nabla_\theta \hat{R}_{S'}^A(\theta_t), b) = \hat{g}_t^A\big)} \le \frac{2 L_{ERM}}{m b} I.$$
We define
$$\varepsilon_0 = \frac{2 L_{ERM}}{m b} I.$$
Corollary 13.1 Suppose that SGD is employed for ERM and that the whole training procedure consists of $T$ iterations. Then, the ERM process is $(\varepsilon, \delta)$-differentially private, where
$$\varepsilon = \varepsilon_0^{ERM} \sqrt{2 T \log \frac{m}{\delta'}} + T \varepsilon_0^{ERM} \big(e^{\varepsilon_0^{ERM}} - 1\big), \qquad \delta = \frac{\delta'}{m},$$
with
$$\varepsilon_0^{ERM} = \frac{2 L_{ERM}}{m b}.$$
By comparing the results for adversarial training and ERM, we can conclude that $\varepsilon_0 = I \cdot \varepsilon_0^{ERM}$.
Theorem 13.4 and Corollary 13.1 show that the privacy parameter $\varepsilon$ is positively correlated with the robustified intensity, which suggests a trade-off between privacy preservation and adversarial robustness.
Fig. 13.2 Plots of membership inference attack accuracy versus empirical robustified intensity. For
the four plots, the datasets and norms used in projected gradient descent (PGD) are (1) CIFAR-10
and the L ∞ norm, (2) CIFAR-100 and the L ∞ norm, (3) CIFAR-100 and the L 2 norm, and (4) Tiny
ImageNet and the L ∞ norm
The more resistant a model is to adversarial attacks (reflected in its robustness), the more vulnerable it becomes to privacy breaches. This observation suggests the intricate balance required in developing learning models that prioritize both privacy and robustness.
Proof of Theorem 13.5 By combining Lemmas 13.5 and 13.3, we can directly prove
Theorem 13.5.
Remark 13.2 By the postprocessing property of differential privacy, since $B : h \mapsto \max_{x' \in B_{\ast}(\rho)} l(h, (x', \ast))$ is a one-to-one mapping, $\max_{x' \in B_{\ast}(\rho)} l(\mathcal{A}, (x', \ast))$ is $(\varepsilon, \delta)$-differentially private. Therefore, Theorems 13.5 and 13.6 hold for adversarial learning algorithms.
In our endeavor to quantify the generalizability of a learning model, we establish not only a high-probability generalization bound but also an on-average generalization bound, which characterizes the expected performance of the learning model. Although an on-average bound can, in principle, be derived from a high-probability bound through integration, such a calculation is prohibitively difficult in practice. As a result, we opt for a more feasible, independent approach to obtain the on-average bound.
Theorem 13.6 (On-average generalization bound in terms of differential privacy) Suppose that all conditions of Theorem 13.4 hold. Then, the on-average generalization error of algorithm $\mathcal{A}$ is upper bounded by
$$\mathbb{E}_{S, \mathcal{A}}\left[ R(\mathcal{A}(S)) - \hat{R}_S(\mathcal{A}(S)) \right] \le M \delta e^{-\varepsilon} + M (1 - e^{-\varepsilon}).$$
The proof of Theorem 13.6 relies on the following lemma from Shalev-Shwartz et al. (2010).
Lemma 13.4 (Lemma 11, cf. Shalev-Shwartz et al. (2010)) Suppose that the loss function is upper bounded. For any machine learning algorithm with replace-one stability $\beta$, its generalization error is upper bounded as follows:
$$\mathbb{E}_{S, \mathcal{A}}\left[ R(\mathcal{A}(S)) - \hat{R}_S(\mathcal{A}(S)) \right] \le \beta.$$
Proof of Theorem 13.6 By combining Lemma 13.5 and Lemma 13.4, we can
directly prove Theorem 13.6.
By combining Theorem 13.4, Corollary 13.1, Theorem 13.5, and Theorem 13.6,
we can obtain generalization bounds for both adversarial training and conventional
ERM.
Both generalization bounds are positively correlated with the differential privacy parameter ε, which, in turn, is positively correlated with the adversarial robustness. This leads to the following corollary.
Corollary 13.2 A trade-off exists between generalizability and adversarial robust-
ness (as measured by the robustified intensity) in adversarial training.
Theorems 13.5 and 13.6 are established via algorithmic stability, which measures
how stable an algorithm is when the training sample is exposed to disturbance (Rogers
and Wagner 1978; Kearns and Ron 1999; Bousquet and Elisseeff 2002). While algo-
rithmic stability has many different definitions, this work mainly discusses uniform
stability.
Definition 13.5 (Uniform stability; cf. Bousquet and Elisseeff (2002)) A machine learning algorithm $\mathcal{A}$ is uniformly stable if, for any neighbouring sample pair $S$ and $S'$ that differ by only one example, the following inequality holds:
$$\left| \mathbb{E}_{\mathcal{A}} l(\mathcal{A}(S), z) - \mathbb{E}_{\mathcal{A}} l(\mathcal{A}(S'), z) \right| \le \beta,$$
where $z$ is an arbitrary example; $\mathcal{A}(S)$ and $\mathcal{A}(S')$ are the output hypotheses learned on training sets $S$ and $S'$, respectively; and $\beta$ is a positive real constant. The constant $\beta$ is called the uniform stability of $\mathcal{A}$.
To prove Lemma 13.5, we first prove a weaker version of it that holds when
algorithm A has ε-pure differential privacy.
Equation (13.17) holds for every pair of neighbouring sample sets $S$ and $S'$. Therefore,
$$\max_{S, S' \text{ neighbouring}} \left( \mathbb{P}_{\mathcal{A}(S')}\big[\mathcal{A}(S') \in B\big] - \mathbb{P}_{\mathcal{A}(S)}\big[\mathcal{A}(S) \in B\big] \right) \le 1 - e^{-\varepsilon}.$$
Thus,
$$(\ast) \le \max\{I_1, I_2\} \le M (1 - e^{-\varepsilon}),$$
and therefore
$$\left| \mathbb{E}_{\mathcal{A}(S)} l(\mathcal{A}(S), Z) - \mathbb{E}_{\mathcal{A}(S')} l(\mathcal{A}(S'), Z) \right| \le M \delta e^{-\varepsilon} + M (1 - e^{-\varepsilon}).$$
Theorems 13.5 and 13.6 are then established by combining Lemma 13.5 with the results of Feldman and Vondrak (2019) and Bousquet and Elisseeff (2002), respectively.
Remark 13.3 The high-probability generalization bound given by Theorem 13.5 is $O(1/\sqrt{m})$.
Remark 13.4 The on-average generalization bound given by Theorem 13.6 is $O(\sqrt{\log m}/m)$.
Our investigation into the trade-off between generalization and robustness is further grounded in empirical analysis, leveraging the ResNet architecture and the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets.
Fig. 13.3 Box plots of the gradient norms in ERM and adversarial training on CIFAR-100 and Tiny ImageNet with six different ρ values and two different norms in PGD. The sample sizes are 55,000 and 120,000, respectively. The last two plots correspond to experimental settings A and B, respectively
Fig. 13.4 Plots of the generalization gap versus the empirical robustified intensity. For the four
plots, the datasets and norms used in PGD are (1) CIFAR-10 and the L ∞ norm, (2) CIFAR-100 and
the L ∞ norm, (3) CIFAR-100 and the L 2 norm, and (4) Tiny ImageNet and the L ∞ norm
References
Attias, Idan, Aryeh Kontorovich, and Yishay Mansour. 2019. Improved generalization bounds for
robust learning. In Algorithmic Learning Theory, 162–183.
Baluja, Shumeet, and Ian Fischer. 2018. Learning to attack: adversarial transformation networks.
In AAAI Conference on Artificial Intelligence, vol. 1, 3.
Bartlett, Peter L, Dylan J Foster, and Matus J Telgarsky. 2017. Spectrally-normalized margin bounds
for neural networks. In Advances in Neural Information Processing Systems, 6240–6249.
Bhagoji, Arjun Nitin, Daniel Cullina, and Prateek Mittal. 2019. Lower bounds on adversarial
robustness from optimal transport. In Advances in Neural Information Processing Systems,
7498–7510.
Biggio, Battista, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Gior-
gio Giacinto, and Fabio Roli. 2013. Evasion attacks against machine learning at test time. In
European Conference on Machine Learning.
Bousquet, Olivier, and André Elisseeff. 2002. Stability and generalization. Journal of Machine
Learning Research 2 (Mar): 499–526.
Chen, Lin, Yifei Min, Mingrui Zhang, and Amin Karbasi. 2020. More data can expand the general-
ization gap between adversarially robust and standard models. arXiv preprint arXiv:2002.04725.
Cullina, Daniel, Arjun Nitin Bhagoji, and Prateek Mittal. 2018. PAC-learning in the presence of
adversaries. In Advances in Neural Information Processing Systems, 230–241.
Dai, Hanjun, Hui Li, Tian Tian, Xin Huang, Lin Wang, Jun Zhu, and Le Song. 2018. Adversarial
attack on graph structured data. arXiv preprint arXiv:1806.02371.
Dwork, Cynthia, and Aaron Roth. 2014. The algorithmic foundations of differential privacy.
Foundations and Trends® in Theoretical Computer Science 9 (3–4): 211–407.
Dwork, Cynthia, and Deirdre K Mulligan. 2013. It’s not privacy, and it’s not fair. Stanford Law
Review Online 66: 35.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toni Pitassi, Omer Reingold, and Aaron Roth. 2015.
Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information
Processing Systems, 2350–2358.
Feldman, Vitaly, and Jan Vondrak. 2019. High probability generalization bounds for uniformly
stable algorithms with nearly optimal rate. arXiv preprint arXiv:1902.10710.
Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. 2018. Size-independent sample complexity
of neural networks. In Annual Conference on Learning Theory, 297–299.
Goodfellow, Ian J, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing
adversarial examples. arXiv preprint arXiv:1412.6572.
Kearns, Michael, and Dana Ron. 1999. Algorithmic stability and sanity-check bounds for leave-
one-out cross-validation. Neural Computation 11 (6): 1427–1453.
Khim, Justin, and Po-Ling Loh. 2018. Adversarial risk bounds via function transformation. arXiv
preprint arXiv:1810.09519.
Kingma, Diederik P, and Jimmy Ba. 2015. Adam: a method for stochastic optimization. In
International Conference on Learning Representations.
Krizhevsky, Alex, and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images.
Technical report, Citeseer.
Kushner, Harold, and G George Yin. 2003. Stochastic Approximation and Recursive Algorithms
and Applications, vol. 35. Springer Science & Business Media.
Le, Ya, and Xuan Yang. 2015. Tiny imagenet visual recognition challenge. CS 231N 7 (7): 3.
Li, Bai, Changyou Chen, Wenlin Wang, and Lawrence Carin. 2018. Second-order adversarial attack
and certifiable robustness.
Ljung, Lennart, Georg Pflug, and Harro Walk. 2012. Stochastic Approximation and Optimization
of Random Systems, vol. 17. Birkhäuser.
Mandt, Stephan, Matthew D Hoffman, and David M Blei. 2017. Stochastic gradient descent as
approximate Bayesian inference. Journal of Machine Learning Research 18 (1): 4873–4907.
Min, Yifei, Lin Chen, and Amin Karbasi. 2020. The curious case of adversarially robust models:
More data can help, double descend, or hurt generalization. arXiv preprint arXiv:2002.11080.
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. Foundations of Machine
Learning. MIT Press.
Montasser, Omar, Steve Hanneke, and Nathan Srebro. 2019. VC classes are adversarially robustly
learnable, but only improperly. In Conference on Learning Theory, 2512–2530.
Nesterov, Yurii E. 1983. A method for solving the convex programming problem with convergence
rate O(1/k²). Dokl. Akad. Nauk SSSR 269: 543–547.
Oneto, Luca, Sandro Ridella, and Davide Anguita. 2017. Differential privacy and generalization:
Sharper bounds with applications. Pattern Recognition Letters 89: 31–38.
Papernot, Nicolas, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Anan-
thram Swami. 2016. The limitations of deep learning in adversarial settings. In IEEE European
Symposium on Security and Privacy, 372–387.
Pydi, Muni Sreenivas, and Varun Jog. 2020. Adversarial risk via optimal transport and optimal
couplings. In International Conference on Machine Learning, 7814–7823.
Raghunathan, Aditi, Jacob Steinhardt, and Percy Liang. 2018. Certified defenses against adversarial
examples. In International Conference on Learning Representations.
Robbins, Herbert, and Sutton Monro. 1951. A stochastic approximation method. The Annals of
Mathematical Statistics, 400–407.
Rogers, William H, and Terry J Wagner. 1978. A finite sample distribution-free performance bound
for local discrimination rules. The Annals of Statistics, 506–514.
Schmidt, Ludwig, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry.
2018. Adversarially robust generalization requires more data. In Advances in Neural Information
Processing Systems, 5014–5026.
Shalev-Shwartz, Shai, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. 2010. Learnability,
stability and uniform convergence. Journal of Machine Learning Research 11 (Oct): 2635–2670.
Shokri, Reza, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference
attacks against machine learning models. In IEEE Symposium on Security and Privacy (SP), 3–18.
Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow,
and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow,
and Rob Fergus. 2014. Intriguing properties of neural networks. In International Conference on
Learning Representations.
Tseng, Paul. 1998. An incremental gradient (-projection) method with momentum term and adaptive
stepsize rule. SIAM Journal on Optimization 8 (2): 506–531.
Tu, Zhuozhuo, Jingwei Zhang, and Dacheng Tao. 2019. Theoretical analysis of adversarial learning:
a minimax approach. In Advances in Neural Information Processing Systems, 12280–12290.
Vapnik, Vladimir. 2013. The Nature of Statistical Learning Theory. Springer Science & Business
Media.
Yin, Dong, Ramchandran Kannan, and Peter Bartlett. 2019. Rademacher complexity for adversar-
ially robust generalization. In International Conference on Machine Learning, 7085–7094.
Zhang, Hongyang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I
Jordan. 2019. Theoretically principled trade-off between robustness and accuracy. arXiv preprint
arXiv:1901.08573.
Zheng, Tianhang, Changyou Chen, and Kui Ren. 2019. Distributionally adversarial attack. In AAAI
Conference on Artificial Intelligence, vol. 33, 2253–2260.