
Machine Learning: Foundations, Methodologies,

and Applications

Fengxiang He
Dacheng Tao

Foundations of
Deep Learning
Machine Learning: Foundations, Methodologies,
and Applications

Series Editors
Kay Chen Tan, Department of Computing, Hong Kong Polytechnic University,
Hong Kong, China
Dacheng Tao, Nanyang Technological University, Singapore, Singapore
Books published in this series focus on the theory and computational foundations,
advanced methodologies and practical applications of machine learning, ideally
combining mathematically rigorous treatments of contemporary topics in machine
learning with specific illustrations in relevant algorithm designs and demonstrations
in real-world applications. The intended readership includes research students and
researchers in computer science, computer engineering, electrical engineering, data
science, and related areas seeking a convenient medium to track the progress made
in the foundations, methodologies, and applications of machine learning.
Topics considered include all areas of machine learning, including but not limited
to:
• Decision tree
• Artificial neural networks
• Kernel learning
• Bayesian learning
• Ensemble methods
• Dimension reduction and metric learning
• Reinforcement learning
• Meta learning and learning to learn
• Imitation learning
• Computational learning theory
• Probabilistic graphical models
• Transfer learning
• Multi-view and multi-task learning
• Graph neural networks
• Generative adversarial networks
• Federated learning
This series includes monographs, introductory and advanced textbooks, and state-
of-the-art collections. Furthermore, it supports Open Access publication mode.
Fengxiang He · Dacheng Tao

Foundations of Deep
Learning
Fengxiang He
School of Informatics
University of Edinburgh
Edinburgh, United Kingdom

Dacheng Tao
College of Computing and Data Science
Nanyang Technological University
Singapore, Singapore

ISSN 2730-9908 ISSN 2730-9916 (electronic)


Machine Learning: Foundations, Methodologies, and Applications
ISBN 978-981-16-8232-2 ISBN 978-981-16-8233-9 (eBook)
https://doi.org/10.1007/978-981-16-8233-9

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2025

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore

If disposing of this product, please recycle the paper.


Preface

In the ever-evolving research field of artificial intelligence, deep learning has reigned
supreme for more than a decade, casting a spell of awe and wonder with its
remarkable feats of intelligence. Imagine a world where machines can perceive,
learn, reason, and behave like never before, unraveling the mysteries of data (espe-
cially broad data) with an insatiable appetite for knowledge. This is the promise
of deep learning—a journey that has captivated the imagination of researchers and
practitioners alike, propelling us into a future where the boundaries of possibility are
constantly redrawn.
The concept of deep learning was first introduced by Dechter in 1986 and later
expanded to brain-inspired algorithms by Aizenberg et al. in 2000. However, the story
of deep learning originates from the humble beginnings of neural networks and boasts
a rich, multifaceted history spanning several decades. It has gained significant traction
in contemporary times, thanks to the visionary individuals who dared to dream of
machines capable of learning from experience. Its origins can be traced back to the
early 1940s when Warren McCulloch and Walter Pitts proposed a computational
model of neural networks, laying the groundwork for what would eventually be
recognized as artificial neural networks.
During the 1940s–1960s, the first wave of interest in deep learning, often referred
to as “cybernetics”, emerged. This period was characterized by pioneering efforts
to understand the computational principles underlying biological neural networks.
Notably, Frank Rosenblatt’s development of the perceptron in 1957 marked a signif-
icant milestone, representing the earliest attempt to create a machine capable of
learning from its environment.
However, the initial enthusiasm for neural networks waned in the late 1960s.
Specifically, Minsky and Papert published the influential work “Perceptrons: An
Introduction to Computational Geometry” in 1969, which had a significant impact
and argued that Rosenblatt’s single-layer neural network could not solve funda-
mental logic problems such as the “exclusive or” (XOR) problem. This marked the
beginning of a period of decline in interest and research activity in the field, often
referred to as the “AI winter”. Technically, we can also attribute Rosenblatt’s failure


to the limitations in computing power and the inability to train deeper architectures
effectively.
The resurgence of interest in neural networks came about in the 1980s, marking
the second wave of deep learning research. This period, characterized as “connec-
tionist”, saw the development of more sophisticated neural network architectures
and learning algorithms. Notably, the introduction of the backpropagation algorithm
by Rumelhart, Hinton, and Williams in 1986 revolutionized the field by enabling
efficient training of deep neural networks. More importantly, the significance of
this paper lies not only in the proposal of an algorithm but also in a major turning
point where neural networks moved from psychology and physiology to the field
of machine learning. Despite significant advancements, the limitations of computa-
tional resources and the lack of large-scale datasets led to another decline in interest
in neural networks during the 1990s, marking the end of the second wave.
The third wave of deep learning, which dawned around 2010, emerged from
obscurity with a surge of game-changing advances, shaking the very foundations of
artificial intelligence. Leading this charge was Hinton’s inspiring research on deep
belief networks and unsupervised pretraining in 2006, revolutionizing the landscape
of deep neural network training. This momentum culminated in the remarkable emer-
gence of AlexNet, a successor to LeNet, which swept the field with its triumphant
victory in the 2012 ImageNet challenge, demonstrating the unparalleled capabilities
of deep learning in computer vision. The pioneering efforts by luminaries sparked a
revolution, unleashing a torrent of innovation that transcended disciplinary bound-
aries and reshaped societal norms. The pinnacle of recognition arrived with the 2018
ACM Turing Award, honoring Yoshua Bengio, Geoffrey Hinton, and Yann LeCun for
their monumental contributions, firmly establishing deep learning as the vanguard
of artificial intelligence innovation.
The success of deep learning is not only attributed to advancements in algorithms
but also relies significantly on two critical pillars: the availability of vast amounts
of data and the remarkable computational power of GPUs. Deep learning models
thrive on large datasets, which provide the diverse and abundant examples neces-
sary for training robust models. Additionally, the emergence of GPUs, particularly
those developed by NVIDIA, has revolutionized the field by enabling accelerated
training and inference processes. These GPUs are optimized for parallel processing,
allowing deep learning algorithms to harness their immense computational capa-
bilities for training complex models with unprecedented speed and efficiency. As
a result, researchers and practitioners can explore more sophisticated architectures
and tackle increasingly complex problems across various domains, from computer
vision to natural language processing.
The impact of deep learning has been nothing short of transformative, permeating
every facet of our lives with its boundless potential. From personalized recommen-
dations and intelligent assistants to breakthroughs in healthcare and autonomous
vehicles, deep learning has become the driving force behind some of the most
awe-inspiring advancements of our time. Yet, amid the dazzling array of practical
successes lies a profound challenge—the quest to unravel the mysteries of deep
learning’s inner working mechanisms.

Therefore, many researchers, drawing parallels to the ancient pursuit of alchemy,


view deep learning with skepticism and fascination alike. Like alchemists seeking
to transform base metals into gold or discover the elixir of life, practitioners of
deep learning aim to discover efficient deep learning models by extensively tuning
hyper-parameters. This comparison highlights the mystical nature of deep learning
research, which, much like alchemy, involves a blend of experimentation, intuition,
and a quest for transformative breakthroughs.
While the practical triumphs of deep learning have dazzled the world, the theo-
retical foundations upon which they stand remain shrouded in mystery. Many prac-
titioners, enamored by the sheer power of deep learning algorithms, may overlook
the critical importance of theoretical studies, dismissing them as mere abstractions
divorced from reality. However, the truth is far more tantalizing—for it is through
the lens of theory that we gain deeper insights into the workings of these enigmatic
networks, unlocking new frontiers of understanding and paving the way for even
greater innovations.
As the tide of deep learning continues to rise, fueled by the emergence of founda-
tion models trained on broad data, the need for a solid theoretical foundation becomes
more pressing than ever. A theoretical understanding of deep learning networks not
only provides insights into the mechanisms governing these networks but also
ensures their safe and responsible deployment in real-world applications. Moreover,
theoretical studies serve as the crucible in which new ideas are forged, leading to
breakthroughs that push the boundaries of what is possible in artificial intelligence.
It is within this crucible that we present this monograph—a testament to the power
of theory in unlocking the secrets of deep learning. Drawing upon a rich tapestry of
research spanning our relevant journal papers, conference papers, ArXiv technical
reports, and Dr. Fengxiang He’s doctoral thesis, this monograph endeavors to weave
together a narrative that illuminates the theoretical landscape of deep learning. Our
goal is to provide readers with a roadmap to understanding the intricate workings of
deep learning networks, empowering them to harness the full potential of AI while
mitigating the risks inherent in its deployment.
Join us on this exhilarating journey into deep learning theory, where mysteries
await to be unraveled and discoveries lie just beyond the horizon. Whether you
are a seasoned researcher seeking to deepen your theoretical understanding or a
practitioner eager to harness the power of deep learning in your applications, we
invite you to embark on this adventure with us. Together, let us chart a course towards
a future where artificial intelligence knows no bounds, guided by the light of theory
and fueled by the fires of imagination.
Before introducing the organizational structure of this monograph, it’s necessary
to touch upon the concept of explainable deep learning. This field represents a fusion
of empirical strategies and theoretical rigor, fostering a space for enlightenment and
discovery. While empirical methods have driven significant advancements in under-
standing and interpreting deep learning models, their focus lies predominantly on
elucidating and dissecting the performance of the output and intermediate layers,
and their correlations with input data. Techniques for studying explainable deep

learning typically encompass feature importance analysis and visualization method-


ologies, aiming to achieve both local and global interpretability. By integrating these
approaches to offer multi-layered explanations, users gain a deeper understanding
of the behavior and predictive outcomes of deep learning models. This monograph
primarily seeks to explore theoretical underpinnings and enhance interpretability
within the realm of deep learning. Through the crucible of theory, we aim to provide
readers with a thorough understanding of the principles that drive the
success of deep learning models.
Specifically, this monograph is a general introduction to deep learning theory that
can serve as a textbook for students and researchers in machine learning, statistics,
and other related areas. The reader is assumed to be familiar with basic concepts in
linear algebra, probability, and analysis of algorithms. It covers fundamental modern
topics in deep learning theory while providing the theoretical basis and conceptual
tools needed for the discussion and justification of algorithms. It also describes several
key aspects of the application of these algorithms.
The journey of this monograph begins with an introduction to the current status
of deep learning and its theory in Chap. 1. Chapters 2–4, forming Part I, review the
previous developments in statistical learning theory, which may differ significantly
from the requirements of developing deep learning theory.
In Part II, Chaps. 5–10, we present recent advances in deep learning theory from
six facets: (1) complexity and capacity-based approaches for analyzing the gener-
alizability of deep learning; (2) stochastic differential equations and their dynamic
systems for modeling stochastic gradient descent and its variants, which characterize
the optimization and generalization of deep learning, partially inspired by Bayesian
inference; (3) the geometrical structures of the loss landscape that drive the trajec-
tories of the dynamic systems; (4) the geometrical structures of the input space,
shaped by the nonlinear activations; (5) the roles of over-parameterization of deep
neural networks from both positive and negative perspectives; and (6) theoretical
foundations of several special structures in network architectures.
In Part III, Chaps. 11–13, we present some emerging directions beyond the
generalizability that is the focus of the previous chapters. We first explore trust-
worthy deep learning, encompassing the growing concerns in ethics and security, and
their relationship to generalizability. Finally, the emerging supermodel paradigm or
foundation model trained on broad data is discussed.
Throughout this monograph, we aim to provide readers with a comprehensive
understanding of deep learning theory, from its foundational principles to its latest
advancements and ethical considerations. We hope that this exploration will inspire
further research and innovation in the field of deep learning, driving progress towards
a future where intelligent systems enhance our lives in guaranteed ways.
While we have strived for accuracy, this monograph might still not be perfect.
If you encounter any typos, factual errors, citation inconsistencies, or missing references,
please don’t hesitate to bring them to our attention. Your feedback is
important to ensure the accuracy and clarity of this work for future readers.

From Sydney, Beijing to Singapore and Edinburgh
2019–2024

Fengxiang He
Dacheng Tao
Contents

1 Deep Learning: A (Currently) Black-Box Model . . . . . . . . . . . . . . . . . 1


1.1 Definitions and Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Advances in Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Applications of Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 The Status Quo of Deep Learning Theory . . . . . . . . . . . . . . . . . . . . 10
1.5 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

Part I Background
2 Statistical Learning Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Glivenko-Cantelli Theorem and Concentration Inequalities . . . . . 17
2.1.1 Glivenko-Cantelli Theorem . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.2 Concentration Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Probably Approximately Correct (PAC) Learning . . . . . . . . . . . . . 26
2.2.1 The PAC Learning Model . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.2 Generalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.3 Insights from Glivenko-Cantelli Theorem . . . . . . . . . . . . 32
2.3 PAC-Bayesian Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3 Hypothesis Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1 Worst-Case Bounds Based on the Rademacher Complexity . . . . . 39
3.2 Worst-Case Bounds Based on the Vapnik-Chervonenkis
(VC) Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Algorithmic Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1 Definition of Algorithmic Stability . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Algorithmic Stability and Generalization Error Bounds . . . . . . . . 48
4.3 Uniform Stability of Regularized Learning . . . . . . . . . . . . . . . . . . . 56
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58


Part II Deep Learning Theory


5 Capacity and Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.1 Worst-Case Bounds Based on the VC Dimension . . . . . . . . . . . . . 61
5.2 Rademacher Complexity and Covering Number . . . . . . . . . . . . . . 62
5.3 Generalization Bound of ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.1 Stem-Vine Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3.2 Generalization Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3.3 Proofs of Section 5.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Vacuous Generalization Guarantee in Deep Learning . . . . . . . . . . 90
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6 Stochastic Gradient Descent as an Implicit Regularization . . . . . . . . 95
6.1 Stochastic Gradient Methods (SGMs) . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Generalization Bounds on Convex Loss Surfaces . . . . . . . . . . . . . 96
6.3 Generalization Bounds on Nonconvex Loss Surfaces . . . . . . . . . . 97
6.4 Generalization Bounds Relying on Data-Dependent Priors . . . . . 102
6.5 The Role of Learning Rate and Batch Size in Shaping
Generalizability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.5.1 Theoretical Evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.5.2 Empirical Evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.6 Interplay of Optimization and Bayesian Inference . . . . . . . . . . . . . 117
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7 The Geometry of the Loss Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.1 Linear Networks Have No Spurious Local Minima . . . . . . . . . . . . 123
7.2 Nonlinear Activations Bring Infinite Spurious Local
Minima . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.2.1 Neural Networks Have Infinite Spurious Local
Minima . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.2.2 A Big Picture of the Loss Surface . . . . . . . . . . . . . . . . . . . 130
7.2.3 Proofs of Sect. 7.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.3 Proofs of Theorems 7.7, 7.8, Corollaries 7.1, and 7.2 . . . . . . . . . . 167
7.3.1 Squared Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.3.2 Smooth and Multilinear Partition . . . . . . . . . . . . . . . . . . . . 168
7.3.3 Every Local Minimum Is Globally Minimal
Within a Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.3.4 Equivalence Classes of Local Minimum Valleys
in Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
7.4 Geometric Structure of the Loss Surface . . . . . . . . . . . . . . . . . . . . . 174
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
8 Linear Partition in the Input Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
8.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
8.2 Neural Networks Act as Hash Encoders . . . . . . . . . . . . . . . . . . . . . 180
8.3 Factors that Influence the Encoding Properties . . . . . . . . . . . . . . . . 182
8.3.1 Relationship Between Model Size and Encoding
Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
8.3.2 Relationship Between Training Time
and Encoding Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
8.3.3 Relationship Between Sample Size and Encoding
Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8.3.4 Layerwise Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 187
8.3.5 Impacts of Regularization, Random Data,
and Random Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8.3.6 Activation Hash Phase Chart . . . . . . . . . . . . . . . . . . . . . . . 191
8.4 Additional Experimental Implementation Details . . . . . . . . . . . . . 192
8.5 Additional Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
9 Reflecting on the Role of Overparameterization: Is it Solely
Harmful? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
9.1 Double Descent and Benign Overfitting . . . . . . . . . . . . . . . . . . . . . 197
9.2 Neural Tangent Kernels (NTKs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
9.3 Loss Surface and Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
9.4 Generalization and Learnability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
10 Theoretical Foundations for Specific Architectures . . . . . . . . . . . . . . . 207
10.1 Convolutional Neural Networks (CNNs) and Recurrent
Neural Networks (RNNs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
10.2 Equivariant Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
10.2.1 Group CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
10.2.2 Steerable Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 210
10.2.3 Nonlinearities in Equivariant Networks . . . . . . . . . . . . . . 213
10.2.4 Generalization of Equivariant Networks . . . . . . . . . . . . . . 214
10.2.5 Generalization Bounds of Equivariant Networks . . . . . . . 214
10.2.6 Approximation of Equivariant Networks . . . . . . . . . . . . . 216
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

Part III Deep Learning Theory from the Trustworthy Facet


11 Privacy Preservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
11.1 Differential Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
11.2 The Interplay of Generalizability and Privacy-Preserving
Ability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
11.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
11.2.2 Generalization Bounds for Iterative Differentially
Private Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
11.2.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

12 Algorithmic Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263


12.1 Definitions of Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
12.2 Fairness-Aware Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.2.1 Preprocessing Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.2.2 In-processing Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.2.3 Postprocessing Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
13 Adversarial Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
13.1 Adversarial Attacks and Defences . . . . . . . . . . . . . . . . . . . . . . . . . . 269
13.2 Interplay Between Adversarial Robustness, Privacy
Preservation, and Generalizability . . . . . . . . . . . . . . . . . . . . . . . . . . 271
13.2.1 Measurement of Robustness . . . . . . . . . . . . . . . . . . . . . . . . 273
13.2.2 Privacy–Robustness Trade-Off . . . . . . . . . . . . . . . . . . . . . . 279
13.2.3 Generalization–Robustness Trade-Off . . . . . . . . . . . . . . . 284
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
Chapter 1
Deep Learning: A (Currently) Black-Box Model

Deep learning refers to a subset of machine learning techniques that employ artificial
neural networks characterized by multiple layers of interconnected nodes (or neu-
rons) followed by nonlinear activation functions. Such networks are of significant
size. These deep neural networks are designed to process and learn from experience,
extracting complex patterns and features through successive layers of computation.
Such experience can be obtained either from human-annotated electronic records
such as datasets or from the learner’s own interactions with its perceived environment.
By leveraging these layered architectures, deep learning models can autonomously
discover and represent intricate relationships within the data, enabling them to make
sophisticated predictions and informed decisions based on learned knowledge.
To date, however, the development of deep learning has relied heavily on experi-
ments that have not thoroughly explored or sought an understanding of its theoretical
foundations. The black-box nature of deep learning introduces unmanageable risk
into its applications since we do not know why deep learning works, when it may
fail, or how to prevent failures.

1.1 Definitions and Terminology

A prevalent application of deep learning is object recognition, which aims to cate-


gorize images based on their characteristics. This section takes object recognition as
an example to introduce the definitions and terminology used in this book.
• Example: Let X and Y be the input and output space, respectively. An example is a
datum used in the process of training and evaluation/inference, and it is usually defined
as a feature–label pair
(x, y) ∈ X × Y,


where x is the feature and y is the corresponding label. For object recognition,
images and their representations are the features, and the labels are the concepts
depicted such as ‘tiger’, ‘horse’, and ‘cheetah’.
• Feature: A feature is a collection of attributes that represents an example and is
typically structured as a vector or matrix denoted as x ∈ X. Features are often
obtained through sensor measurements, such as those from optical cameras or
light detection and ranging (LiDAR) devices. Certain deep learning models are
specifically engineered for feature extraction, as they excel at producing highly
informative feature representations.
• Label: An annotation of an example. It can be a categorical value, a continuous
real value, or a combined set of such values. Deep learning models learn mappings
between features and labels. In this book, we denote a label by y ∈ Y.
• Training sample: A set of examples used to train a deep learning model:

S = {z_1, . . . , z_m},

where z_i = (x_i, y_i) ∈ Z := X × Y and m is the training sample size. Usually,
deep learning uses large-scale training samples.
• Validation sample: A set of examples used to validate various hyperparameters
and techniques when training a deep learning model.
• Test sample: A set of examples used to evaluate a deep learning model.
• Data distribution: In typical scenarios, we make the assumption that all examples
within the training, validation, and test samples are independent and identically dis-
tributed (i.i.d.) random variables drawn from the same underlying data distribution,
denoted as D, i.e.,

Z ∼ D, S ∼ D^m.

For clarity, we use upper-case and lower-case letters to represent random variables
and their corresponding observations, respectively.

Remark 1.1 It’s crucial to acknowledge that the assumption of i.i.d. data, where
each data point is assumed to be drawn from the same distribution and is unrelated
to other data points, may not hold in many real-world scenarios. A notable coun-
terexample is observed in time series data, such as financial data, where observations
can exhibit significant temporal dependencies and variability over time. Despite this
departure from the i.i.d. assumption, deep learning models can still effectively capture
complex patterns and trends in such data due to their ability to learn from sequential
information. Although the temporal variability in time series data challenges the i.i.d.
assumption, it typically does not critically impede the performance of deep learning
models. Therefore, in this book, we maintain the use of the i.i.d. assumption for
simplicity and to establish the fundamental principles and theoretical foundations of
deep learning, recognizing that the flexibility and adaptability of these models allow
them to handle real-world data complexities effectively.
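
As a concrete, purely illustrative companion to the definitions above, the following minimal sketch draws i.i.d. feature–label examples from a toy distribution and partitions them into training, validation, and test samples. The two-dimensional Gaussian features, the noisy linear labelling rule, and the 700/150/150 split are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw m i.i.d. examples z_i = (x_i, y_i) from a toy distribution D:
# 2-D Gaussian features with labels given by a noisy linear rule.
m = 1000
X = rng.normal(size=(m, 2))                                   # features x_i in R^2
y = (X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=m) > 0).astype(int)

# Partition the examples into training, validation, and test samples.
idx = rng.permutation(m)
train_idx, val_idx, test_idx = idx[:700], idx[700:850], idx[850:]
S_train = (X[train_idx], y[train_idx])   # used to fit the model
S_val = (X[val_idx], y[val_idx])         # used to tune hyperparameters and techniques
S_test = (X[test_idx], y[test_idx])      # used only for the final evaluation
```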

Deep learning applications can be broadly categorized as follows.

• Supervised learning: In supervised learning, every training example is annotated


or labelled. The form of annotation varies in different scenarios. In image classi-
fication, the annotations are image categories; in object detection, the annotations
are the bounding boxes around different objects; and in semantic segmentation,
all pixels are individually annotated.
• Unsupervised learning: In contrast, the training data remain unannotated in unsu-
pervised learning, and the corresponding deep learning methods are designed for
clustering, feature extraction, and reconstruction. Self-supervised learning is a
special form of unsupervised learning in which knowledge, or, more specifically,
label concepts, is learned through self-training without external interference. This
learning regime is becoming increasingly popular, especially for training large-
scale deep networks in super deep learning, an advanced era of deep learning.
Unsupervised learning is attractive because of the often high expense of data
labelling.
• Semisupervised learning and weakly supervised learning: These two learning
regimes represent intermediate approaches that bridge the gap between fully super-
vised and unsupervised learning paradigms. Semisupervised learning (SSL) com-
bines both labeled and unlabeled data for training predictive models. Most SSL
approaches are based upon one or more of the following key assumptions: smooth-
ness (similar inputs produce similar outputs), clustering (samples in the same
cluster have the same label), and the manifold assumption (data points lie on a
lower-dimensional manifold where neighboring points have similar labels). Popu-
lar SSL techniques include pseudo-labeling, self-training, co-training, and graph-
based models. On the other hand, weakly supervised learning (WSL) trains models
using limited yet imprecise labels, which helps reduce the cost of exact and exten-
sive data labeling. WSL approaches can be categorized into three main types:
inexact supervision (coarse-grained labels), incomplete supervision (only part of
the data is labeled), and inaccurate supervision (labels may be incorrect). Classical
learning techniques such as multi-instance learning, active learning, and transduc-
tive learning have been integrated with the above WSL types. These methods lever-
age available data to enhance model performance despite the weak supervision.
Both learning techniques play crucial roles in practical machine learning applica-
tions, offering solutions for leveraging partially labeled or imperfectly annotated
data to train effective models with reduced human annotation effort.
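
To make the pseudo-labeling technique mentioned above concrete, here is a minimal semisupervised learning sketch. The toy data, the logistic-regression base learner, and the 0.9 confidence threshold are illustrative assumptions, not prescriptions from this book.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: a small labeled sample and a larger unlabeled sample (features only).
X_labeled = rng.normal(size=(50, 2))
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_unlabeled = rng.normal(size=(500, 2))

# Step 1: fit an initial model on the labeled data.
model = LogisticRegression().fit(X_labeled, y_labeled)

# Step 2: predict on the unlabeled data and keep only confident predictions
# as pseudo-labels (the 0.9 threshold is an arbitrary illustrative choice).
proba = model.predict_proba(X_unlabeled)
confident = proba.max(axis=1) > 0.9
X_pseudo = X_unlabeled[confident]
y_pseudo = proba[confident].argmax(axis=1)

# Step 3: retrain on the union of labeled and pseudo-labeled examples.
model = LogisticRegression().fit(
    np.vstack([X_labeled, X_pseudo]),
    np.concatenate([y_labeled, y_pseudo]),
)
```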

Remark 1.2 In either supervised or self-supervised learning, a deep learning method


learns a mapping from the feature space X to the output space Y. This idea is inherited
from conventional machine learning algorithms.

Remark 1.3 Deep learning has shown significant value in dealing with the chal-
lenges arising in conventional reinforcement learning (RL), which is an important
machine learning paradigm. In RL, a learner determines what action an agent needs
to take in each specific state of the environment. Deep RL exploits deep learning for

efficient feature extraction to boost the policy approximation ability in RL, which is
a considerable barrier for conventional methods.

Supervised learning plays a significant role in deep learning applications. It has


shown promising performance in the following tasks.

• Classification: The goal of classification is to assign an unordered categorical


value to each example; i.e., the labels can be represented by integers Y ⊂ N. All
categories are equal in status, with no preference among them. Categories are
commonly defined using integers for convenience. Sometimes, a single example
is simultaneously associated with multiple categories; we refer to this learning
regime as multilabel classification. In this regime, Y ⊂ N^l, where l is the number of
possible categories for each example.
• Ranking: The goal of ranking is to assign a specific ordered categorical value to
each example, where Y ⊂ N represents the set of possible ordered categories.
In contrast to classification, which assigns discrete labels to instances, ranking
involves organizing a sample into a partially ordered sequence or hierarchy based
on their attributes or features. This distinction reflects a fundamental difference in
the objectives of ranking versus classification tasks. In ranking tasks, the emphasis
is on establishing a relative order or preference among examples rather than assign-
ing definitive class labels. This approach is common in information retrieval sys-
tems, recommendation systems, and personalized search engines, where the goal
is to present items or documents in an order that maximizes relevance or user satis-
faction. By considering partial orderings, ranking methods allow for more nuanced
and flexible representations of data compared to strict classification schemes.
• Regression: The goal of regression is to predict continuous values or vectors for a
given example: Y ⊂ R^l, where l is the dimensionality of the labels. In most appli-
cation scenarios, these continuous values or vectors are real. However, complex-
valued data can also be encountered. While traditionally used for regression tasks,
regression analysis has also played a significant role in classification problems.
Logistic regression exemplifies this, by formulating classification as a process of
fitting a logistic function, which is technically a regression problem. The logistic
function, also known as the sigmoid function, is defined as follows:

f(x) = L / (1 + e^{-k(x - x_0)}),

where x ∈ R is a real number, L, k ∈ R^+ are two positive real constants, and


x_0 ∈ R is a real constant. The high-dimensional version is termed the softmax
function, which is defined as follows:

f(b) = e^b / (Σ_{j=1}^{l} e^{b_j}),

where b = (b_1, . . . , b_l) ∈ R^l is a real vector and l ∈ N^+ is the length of b. Deep


learning models for classification tasks usually adopt the softmax function for
activation in the output layer.
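
As a small numerical illustration of the two activation functions defined above, the following sketch evaluates the logistic function and the softmax function. The constants are arbitrary, and subtracting the maximum inside the softmax is a standard numerical-stability trick rather than part of the definition.

```python
import numpy as np

def logistic(x, L=1.0, k=1.0, x0=0.0):
    """Logistic (sigmoid) function f(x) = L / (1 + exp(-k * (x - x0)))."""
    return L / (1.0 + np.exp(-k * (x - x0)))

def softmax(b):
    """Softmax of a real vector b; each output entry lies in (0, 1) and they sum to 1."""
    e = np.exp(b - np.max(b))   # max-subtraction avoids overflow; the result is unchanged
    return e / e.sum()

print(logistic(0.0))                        # 0.5
print(softmax(np.array([1.0, 2.0, 3.0])))   # approximately [0.09, 0.245, 0.665]
```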

Notably, deep learning breaks free of the framework of learning a mapping from
features to labels.
Unsupervised learning is also of significant interest for deep learning theory.
Prevalent scenarios are listed as follows.

• Feature extraction: The goal of feature extraction is to transform the original


data into a more compact and informative feature space that captures essential
characteristics for subsequent analysis or modeling tasks. One effective heuristic
approach to feature extraction involves utilizing the outputs of intermediate layers
within a neural network as extracted features. These intermediate outputs, often
referred to as mid-products, represent the activations and transformations applied
to the input data as it passes through the network. By extracting features from
these intermediate representations, we can leverage the hierarchical and abstract
information learned by the neural network during training. Using mid-products as
extracted features offers several advantages. First, these features encapsulate com-
plex patterns and relationships present in the data, learned through the network’s
layers. Second, they provide a more meaningful and compact representation com-
pared to raw input data, which can be noisy or high-dimensional. Finally, extracted
features can be used as inputs to downstream tasks such as classification, regres-
sion, or clustering, facilitating more effective and efficient model training and
inference. By and large, feature extraction using mid-products from neural net-
work layers is powerful to capture and leverage hierarchical information encoded
in data, enabling improved performance and interpretability in machine learning
applications.
• Data generation: It aims to produce new examples that share the same distribution
learned from the training data. Recent advancements, such as large language models
(LLMs), diffusion models (DMs), generative adversarial networks (GANs), and
variational autoencoders (VAEs),
have proven their superior capabilities to generate new data from the given training
data. LLMs understand text rules and patterns by integrating transformer blocks
and training on vast amounts of text data. As the training data and model parameters
increase, the model’s language capabilities continually improve, even exhibiting
emergent capabilities. LLMs possess unique in-context learning capabilities and
can handle novel tasks efficiently and cost-effectively through techniques like
supervised fine-tuning and prompt engineering. GPT is a recently emerged and
widely used type of LLM. GANs are constructed by a generator and a discrimina-
tor. The generator produces synthetic data by emulating the patterns it has learned
from the training data, while the discriminator performs as an evaluator, validating
the authenticity of the generated data by comparing with real data. The training
process progressively teaches the generator to produce new data with distributions
approaching that of the training data, while the discriminator continually improves
its capability to separate synthetic data from real data. VAEs learn to encode and

decode training data. The encoder maps input data points to a latent space repre-
sentation, while the decoder takes a point from the latent space and reconstructs
the original input data point. Therefore, VAEs are capable of generating new data
by decoding low-dimensional points efficiently sampled from the latent space into
meaningful outputs. This method is widely used to synthesize images and audio
waves, as it allows for the generation of diverse and creative outputs by manipulat-
ing the latent representations. DMs have recently been widely used to generate new
data by mimicking training data through two steps: forward diffusion and reverse
diffusion. In forward diffusion, noise is progressively added to an image over
several steps, gradually transforming it into a completely noise-like picture. By
contrast, reverse diffusion learns to invert the forward diffusion process. Starting
from the noisy image, the model iteratively denoises it, to progressively reconstruct
the original image. The above data generation models effectively learn to capture
and replicate the sophisticated patterns present in the distribution of the training
data. These generators aim to generate data instances that are indistinguishable
from real data. Although these models have found widespread applications in
tasks such as image generation, question answering, or broadly the synthesis of
realistic data for various purposes, including deep learning model training, they
face many challenges, such as data bias, low reliability, poor quality, scalability,
ethical and privacy concerns, as well as interpretability and transparency.
• Self-supervised learning: It aims to autonomously discover the underlying struc-
tures in data without relying on manually annotated labels. Contrastive learning
is a typical self-supervised learning method, which improves the feature representation
capability of deep learning models by using large amounts of unlabelled data.
By simultaneously maximizing intra-class similarity and minimizing inter-class
similarity, it improves the model’s representation capability, ensuring that the
feature representations of data in the same category are alike and those of data from
different categories are as distinct as possible. In general, it does not primarily focus
on the intricate details of the instances but instead aims to distinguish the data at
an abstract semantic feature space level. Consequently, the model is simplified, its
optimization is more efficient, and its generalization ability is improved. For exam-
ple, in the context of self-supervised learning like RotNet, images are artificially
augmented by rotating them with specific angles (e.g., 0 degrees, 90 degrees) to
create pairs of rotated and original images. The rotation angles themselves are used
as labels to train the model to predict the rotation applied to each image. While
these rotation labels may appear arbitrary or devoid of semantic meaning, the pro-
cess allows the model to learn rich and informative features that capture intrinsic
properties of the data, such as object shape, orientation, and texture. Representative
algorithms in contrastive learning include: SimCLR: Enhances feature representa-
tion through a simple contrastive learning framework by using data augmentation
and large-scale batch training; MoCo: Improves the efficiency of contrastive learn-
ing by storing sample pairs in a dynamic dictionary through a momentum contrast
mechanism; and BYOL: Enhances the stability of contrastive learning by intro-
ducing a self-supervised mechanism without the need for negative samples. In
order to define supervisory signals without human intervention, self-supervised

learning explores and exploits the inherent structure of the data. By generating
labels from data transformations or relationships (such as image rotations in Rot-
Net), self-supervised learning helps the model learn effective representations that
generalize well to downstream tasks, even in the absence of explicit labels. This
approach has been extensively applied to computer vision, natural language pro-
cessing, speech recognition, and bioinformatics, demonstrating its versatility and
scalability in learning from unlabeled data.
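
As a minimal illustration of feature extraction from intermediate layers (the “mid-products” discussed earlier in this list), the following PyTorch-style sketch records the output of a hidden layer with a forward hook. The architecture, the chosen layer, and the input shape are hypothetical and serve only as an example.

```python
import torch
import torch.nn as nn

# A small illustrative classifier; the architecture is hypothetical.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),   # the output of this ReLU is used as the extracted feature
    nn.Linear(64, 10),
)

features = {}

def save_feature(module, inputs, output):
    # Store the intermediate activation ("mid-product") for later use.
    features["penultimate"] = output.detach()

# Register a forward hook on the second ReLU (index 3 in the Sequential container).
model[3].register_forward_hook(save_feature)

x = torch.randn(32, 784)                 # a batch of raw inputs
logits = model(x)                        # the forward pass also fills `features`
extracted = features["penultimate"]      # shape (32, 64); usable for clustering, retrieval, etc.
```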

1.2 Advances in Deep Learning

Compared with deep neural networks, most conventional machine learning models
face four major barriers that prevent them from achieving performance comparable to that of deep learning.
• Conventional machine learning models/algorithms usually have significantly
lower representation capacity. Their capacity is sufficient for modelling simple
data, such as what is necessary for income modelling and credit assessment, but
considerably limits their ability to fit complex data, such as images and videos. For
example, the vanilla support vector machine (SVM) algorithm can be used to train
only linear classifiers. In contrast, deep learning models have a universal approx-
imation ability: they can approximate any continuous function, time series, or
distribution at any accuracy as long as the networks are sufficiently wide. In addi-
tion, stochastic gradient-based optimizers, including stochastic gradient descent
and its variants, have excellent capabilities for optimizing neural networks. More-
over, representation learning proceeds in a data-driven manner, which significantly
reduces the human labour burden incurred.
• Conventional machine learning algorithms usually suffer from the curse of dimen-
sionality: to maintain the robustness of a learning model/algorithm, the required
training sample size increases rapidly as the dimensionality of the input data
increases. In contrast, existing empirical studies have not yet observed the curse of
dimensionality in deep learning. In fact, increasing the input dimensionality has
been shown to boost performance. For example, some papers have reported that
training on higher-resolution images can lead to considerably higher performance,
even as the input dimensionality significantly increases.
• Conventional machine learning algorithms usually rely on restrictive assumptions,
considerably limiting their potential application domains. For example, linear clas-
sifiers assume that the data are linearly separable, hidden Markov models assume
that the state sequence is a Markov chain, and kernel methods assume that the
data lie in the corresponding reproducing kernel Hilbert space. In contrast, deep
learning breaks free of most of these assumptions.
• Some conventional machine learning algorithms have limited capabilities in deal-
ing with large-scale data. A prime example is Markov chain Monte Carlo, a preva-
lent Bayesian inference method that faces significant difficulty in scaling to the
processing of big data.

• Deep learning algorithms usually operate in an end-to-end manner. Compared with


their multistage counterparts, end-to-end methods have two major advantages: (1)
End-to-end models are usually more reliable. The risks from each single stage
accumulate in multistage systems, which can eventually become unmanageable.
(2) End-to-end models are usually easy to train. Their end-to-end nature enables
training of the whole system in accordance with a single objective function. In con-
trast, the optimal protocols for the individual modules in a multistage system may
vary significantly. Tuning a multistage system is thus significantly more difficult
than tuning an end-to-end system.
• As a family of representation learners, deep learning models can be easily inte-
grated with other approaches. The raw data can be easily replaced with their rep-
resentations learned from a deep neural network. This algorithm design approach
has been extensively utilized in computer vision and natural language processing
to considerably accelerate the pace of innovation.

1.3 Applications of Deep Learning

Significant developments have emerged in the field of deep learning over the past
decade. The rapid progress in deep learning is empowering abundant developments
in many, possibly almost all, areas of both engineering and science and is further
driving a technological revolution in a variety of industry sectors.
• Computer vision (CV) is the field of technology that seeks to help machines “see”
and understand a digital form of the world, including images and videos captured
by cameras and point clouds generated by LiDAR sensors. The primary subar-
eas of CV include recognition, detection, segmentation, and tracking. Potential
applications involve a variety of industry sectors, including autonomous vehicles,
medical diagnosis, and entertainment. The outstanding approximation capabilities
and trainability of deep learning models enable deep learning to vastly enhance
performance in CV tasks, providing improved speed and accuracy.
• Natural language processing (NLP) is the field of training machines to read, under-
stand, and derive meaning from natural languages. Machines are good at dealing
with machine languages or programming languages, which follow clear and con-
cise logic. Unfortunately, this characteristic does not apply to natural languages.
Natural languages exhibit considerable ambiguity, vagueness, and violations of the
grammar rules as summarized in linguistics;1 this, however, is exactly the wisdom
and charm of natural languages. Deep learning models, particularly long short-
term memory (LSTM) networks and recurrent neural networks (RNNs), enable
data-driven approaches to the discovery of knowledge from natural languages.
Prevalent subareas of NLP include information retrieval, text mining, dialogue


and interactive systems, machine translation, question answering, syntax analy-


sis (tagging, chunking, and parsing), sentiment analysis, stylistic analysis, and
argument mining.

1 Theoretically, it would be possible to summarize the grammar of a natural language in its entirety,
since sentences are of finite length and the set of all potential sentences is therefore finite. However,
such a grammar book would be unacceptably large.
• Speech processing is the field of understanding the contents of speech and translat-
ing speech into text or other forms. One conventional method is a hidden Markov
model, which models every spoken sentence as a sequence of observations. This
method assumes the presence of a state sequence, which is assumed to be a Markov
chain, underlying the observation sequence and carrying the desired information.
However, the Markov chain assumption is rather stringent and rarely flexible.
Fortunately, deep learning models, such as LSTM and RNN models, do not rely on
such assumptions and have therefore emerged as essential techniques in dealing
with speech processing tasks. Prevalent applications include automatic transla-
tion, real-time speech writing, real-time captioning, and virtual assistants (such as
Apple’s Siri).
• Recommender systems seek to rank products or service options in accordance
with consumer preference. Such systems have been extensively deployed by advertising
providers and e-commerce corporations, including Google, Amazon, and JD.com.
Formally, recommender systems aim to solve ranking problems.
• Autonomous vehicles are automobiles that can travel either under limited human
control or entirely without human control. It is estimated that by 2040, autonomous
vehicles will occupy over 40% of the entire automobile market. Based on the level
of autonomy, a hierarchy has been introduced: Level 0: humans take full responsi-
bility for driving; Level 1, “hands on”: machines and humans share vehicle control;
Level 2, “hands off”: machines normally have complete control, but humans need
to oversee the driving process and constantly remain ready to intervene; Level 3,
“eyes off”: machines can override human control in response to emergencies, while
humans need to oversee the driving process and “advise” the machines in taking
action; Level 4, “mind off”: machines take complete responsibility for vehicle con-
trol without any human oversight in limited scenarios; Level 5, “steering wheel
optional”: machines have complete vehicle control at all times and human interven-
tion is not needed. Autonomous vehicles seamlessly integrate technologies from
several areas, including manufacturing, recognition, reasoning, and cybernetics.
Deep learning contributes significantly to the last three of these areas. Specifically,
techniques for various CV tasks are needed, such as segmentation, object detec-
tion, tracking, and depth estimation. In addition, deep learning has emerged as an
efficient representation learning tool that enables the deployment of RL to control
vehicles.
• Autonomous medical diagnosis aims to teach machines to diagnose disease and
thereby improve the affordability and accuracy of medical diagnosis. The methods
involved heavily rely on many key CV and NLP technologies, including classifi-
cation, detection, and segmentation. The extensive applications of deep learning in
CV have significantly accelerated the development of autonomous medical diag-
nosis. However, the U.S. Food and Drug Administration (FDA) has yet to authorize
any autonomous diagnosis system for direct medical use. This lack of approval
can be attributed to the fact that in the current state of research on deep learning,

theoretical foundations and rigorous empirical verifications, such as hypothesis


tests, are largely missing.2 This situation highlights that investigations of deep
learning theory and rigorous empirical verifications are essential.
• Drug discovery refers to the discovery of novel medications. Neural networks,
particularly graph neural networks (GNNs), show promise in the discovery of
knowledge from data expressed in a graph-style format, which is closely analogous
in form to the molecular formulas of medications.

1.4 The Status Quo of Deep Learning Theory

Many applications of deep learning lie in extremely security-critical areas, such as


autonomous vehicles and medical diagnosis. In these areas, even a small algorith-
mic error can lead to fatal disasters. Consequently, deep learning is required to be
transparent and accountable. Intuitively, the theoretical foundations of deep learning
algorithms should serve as a major source of transparency and accountability.
To date, however, the development of deep learning has relied heavily on experi-
ments that have not thoroughly explored or sought an understanding of its theoretical
foundations. Although unstable at times, heuristic approaches in deep learning have
achieved significant performance in many areas. The success of deep learning has
been both astonishing and baffling, as researchers have yet to comprehend the mech-
anism that governs deep learning’s behavior. The black-box nature of deep learning
introduces unmanageable risk into its applications since we do not know why deep
learning works, when it may fail, or how to prevent failures. This knowledge gap
considerably compromises our ability to identify, manage, and prevent algorithm-led
disasters, severely undermines confidence in the application of recent advances in
many industrial sectors, and restricts the further development of innovative machine
learning algorithms. In particular, deep learning faces three major problems:
• Approximation refers to the ability of a machine learning model to fit the training
data. Conventional machine learning models exhibit a relatively limited ability
to approximate data. For example, a vanilla SVM can fit only linearly separable
data. Nevertheless, several works in the 1980s and 1990s have proven universal
approximation theorems for neural networks: if the width is sufficiently large, even
a neural network with only one hidden layer can approximate any continuous
function at any precision. What is the bound on the approximation ability of a
feed-forward neural network if the network is of regular width? What are the
approximation bounds for RNNs and GANs?

2 The lack of hypothesis tests is not merely due to negligence. It is honestly doubtful that current
deep learning algorithms could pass hypothesis tests if they were performed. Most deep learning
algorithms exhibit significant variability in performance, and partial reporting is common in current
experimental practice, which severely undermines any confidence in the reproducibility of deep
learning algorithms.

• Optimization refers to the ability to search for the optimal solution to a problem.
Machine learning (including deep learning) algorithms are usually formulated
to solve optimization problems. Thus, their optimization performance is of vital
importance. Unfortunately, the optimization problems in deep learning tend to be
highly nonconvex. Nevertheless, they can be well solved in practice by means
of stochastic-gradient-based methods, such as stochastic gradient descent (SGD),
which is apparently a stochastic convex optimization method. Why does SGD, a
convex optimization method, work for a nonconvex optimization problem?
• Generalization refers to the ability to perform well on unseen data. Good general-
izability ensures that a trained model can robustly handle unseen circumstances.
Generalizability is particularly important when long-tail events regularly appear
and can cause catastrophic failure. Statistical learning theory has established a
rigorous upper bound on the generalization error depending on the hypothesis
complexity; however, major difficulties arise in analysing deep learning.
Theoretical foundations have been established for conventional machine learning
algorithms. However, significant difficulties arise when we attempt to apply these
theoretical foundations to deep learning. These difficulties are listed as follows:
• Overparameterization. Statistical learning theory is grounded in Occam’s razor
principle:
Plurality should not be posited without necessity.

This principle emphasizes the importance of simplicity in model design to prevent


unnecessary complexity. According to Occam’s razor, a model should be as sim-
ple as possible while still effectively capturing the patterns present in the training
data. This principle ensures that generalization bounds in statistical learning are
tied to the complexity of the hypothesis space used in machine learning scenar-
ios. However, deep learning models often appear to diverge from Occam’s razor
due to their inherent complexity. Deep neural networks typically comprise numer-
ous trainable parameters, sometimes exceeding the size of the training dataset. This
phenomenon challenges conventional generalization guarantees because excessively
complex models can memorize specific examples from the training data rather than
learning generalizable patterns. As a result, deep learning models are suspected to
be susceptible to overfitting, where they perform well on training data but fail to
generalize to unseen examples.
Remark 1.4 Before the era of foundation models (large-scale deep neural networks
trained on broad data), the complexity of deep learning models necessitated careful
regularization techniques and model selection strategies to mitigate overfitting and
promote generalization. Techniques such as dropout, weight regularization, and early
stopping are employed to simplify and regularize deep neural networks, encourag-
ing them to learn essential features while suppressing noise and irrelevant details.
Balancing model complexity with generalization capability remains a fundamental
challenge in deep learning research, highlighting the need for continued exploration
of regularization methods and theoretical insights to improve model robustness and
performance.

Remark 1.5 In the era of foundation models, state-of-the-art models often do not
impose strict constraints on model complexity. On the contrary, larger models have
shown greater effectiveness on testing benchmarks, which may exceed the size of
the training set by a significant margin. The underlying reason for this phenomenon
can be attributed to the capacity of these larger models to capture and learn intricate
patterns and nuances present in vast and diverse datasets. Additionally, larger mod-
els possess more parameters and computational power, enabling them to represent
complex relationships and dependencies within the data more comprehensively. This
enhanced capacity often leads to improved performance and generalization capabil-
ities, especially in tasks involving extensive and diverse data inputs. Surely, it is still
important to balance model size with considerations of efficiency, interpretability,
and practical deployment in real-world applications.
• Highly complex empirical risk landscape. The landscape of empirical risk in deep learning is characterized by sig-
nificant complexity, posing unique challenges compared to traditional machine
learning paradigms. In conventional settings, learnability and optimizability are
typically ensured through regularization techniques that leverage properties like
convexity, Lipschitz continuity, and differentiability. However, these conventional
guarantees do not directly translate to deep learning due to the intrinsic nature
of neural networks. Deep neural networks are composed of multiple layers with
nonlinear activation functions, leading to loss surfaces that exhibit extreme non-
smoothness and nonconvexity. This inherent complexity renders traditional opti-
mization guarantees ineffective and raises significant obstacles in understanding
the underlying principles of deep learning. The complex and rugged nature of deep
learning loss surfaces has long been a barrier to developing comprehensive theories
and analytical frameworks for neural network behavior. Recent progress in deep
learning research has shifted the focus towards investigating the geometric prop-
erties of loss surfaces as a means to decode the behavior of neural networks. It is
increasingly recognized that the intricate geometry of loss surfaces directly reflects
the dynamics of learning and generalization in deep learning models. Exploring
and characterizing the geometric features of loss landscapes holds promise as a
pathway to gaining deeper insights into the working mechanisms of neural net-
works. By delving into the geometric intricacies of loss surfaces, researchers aim
to uncover fundamental principles governing optimization, generalization, and
model performance in deep learning. This evolving perspective highlights the
importance of advancing our understanding of loss surface geometry to enhance
the robustness, efficiency, and interpretability of deep learning systems.

1.5 Notations

Let $S = \{(x_1, y_1), \ldots, (x_m, y_m) \mid x_i \in \mathcal{X} \subset \mathbb{R}^{d_X}, y_i \in \mathcal{Y} \subset \mathbb{R}^{d_Y}, i = 1, \ldots, m\}$ be a
training sample set, where $d_X$ and $d_Y$ are the dimensionalities of the feature $X$ and
the label $Y$, respectively. All $\{z_i\}_{i=1}^{m}$ are i.i.d. observations of the random variable

$Z = (X, Y) \in \mathcal{Z}$. Deep learning algorithms aim to learn a hypothesis (denoted by
$h$) from training data. All possible such hypotheses form the hypothesis space,
denoted by $\mathcal{H}$. The mathematical structure of a neural network is usually defined in
the following form:
$$x \mapsto W_D \sigma_{D-1}(W_{D-1} \sigma_{D-2}(\cdots \sigma_1(W_1 x))), \quad x \in \mathcal{X},$$
where $D$ is the depth of the neural network, $W_j$ is the $j$-th weight matrix, and $\sigma_j$ is
the $j$-th nonlinear activation function for any $j = 1, \ldots, D$. Usually, we assume that
the activation function $\sigma_j$ is a continuous function between Euclidean spaces and
that $\sigma_j(0) = 0$. Popular activation functions include the softmax, sigmoid, and tanh
functions.
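To make the notation concrete, the following minimal sketch (not from the original text) evaluates the mapping $x \mapsto W_D \sigma_{D-1}(\cdots \sigma_1(W_1 x))$ with tanh activations; the layer widths and random weights are illustrative assumptions.

```python
import numpy as np

def forward(x, weights, activation=np.tanh):
    """Evaluate x -> W_D sigma_{D-1}( ... sigma_1(W_1 x) ... ).

    `weights` is a list [W_1, ..., W_D]; the activation is applied after
    every layer except the last, matching the definition above.
    Note that activation(0) = 0 holds for tanh, as assumed in the text.
    """
    h = x
    for W in weights[:-1]:
        h = activation(W @ h)
    return weights[-1] @ h

# Illustrative example: d_X = 4, two hidden layers of width 8, d_Y = 3.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 4)),
           rng.standard_normal((8, 8)),
           rng.standard_normal((3, 8))]
x = rng.standard_normal(4)
print(forward(x, weights))  # a vector in R^{d_Y}
```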
The generalization bound measures the generalizability of an algorithm. For any
hypothesis $h$ learned by an algorithm $\mathcal{A}$, the expected risk $R(h)$ and the empirical risk
$\hat{R}_S(h)$ with respect to the training sample set $S$ are respectively defined as follows:
$$R(h) = \mathbb{E}_z\, l(h, z), \qquad \hat{R}_S(h) = \frac{1}{m}\sum_{i=1}^{m} l(h, z_i),$$
where $l(\cdot)$ is the loss function. Furthermore, when the output hypothesis $h$ learned
by algorithm $\mathcal{A}$ is stochastic (i.e., exhibits randomness), we typically calculate the
expectations of both the expected risk $R(h)$ and the empirical risk $\hat{R}_S(h)$. This
involves averaging over the randomness introduced by algorithm $\mathcal{A}$:
$$R(\mathcal{A}) = \mathbb{E}_{\mathcal{A}(S)} R(\mathcal{A}(S)), \qquad \hat{R}_S(\mathcal{A}) = \mathbb{E}_{\mathcal{A}(S)} \hat{R}_S(\mathcal{A}(S)),$$
where $\mathcal{A}(S)$ is the hypothesis learned by algorithm $\mathcal{A}$ on the sample $S$.
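As a quick illustration of the gap between $\hat{R}_S(h)$ and $R(h)$, the following toy sketch (an assumption-laden example, not part of the text) estimates both risks for a fixed threshold classifier under a synthetic distribution; the data-generating process and the 0-1 loss are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def h(x):
    """A fixed threshold hypothesis on scalar features."""
    return (x > 0.0).astype(int)

def zero_one_loss(y_pred, y_true):
    return (y_pred != y_true).astype(float)

def sample(n):
    """Synthetic D: x ~ N(0, 1), y = 1{x > 0.2} with 10% label noise."""
    x = rng.standard_normal(n)
    y = (x > 0.2).astype(int)
    flip = rng.random(n) < 0.1
    y[flip] = 1 - y[flip]
    return x, y

# Empirical risk on a small training sample S.
x_train, y_train = sample(50)
emp_risk = zero_one_loss(h(x_train), y_train).mean()

# Monte Carlo estimate of the expected risk R(h) over fresh draws from D.
x_test, y_test = sample(1_000_000)
exp_risk = zero_one_loss(h(x_test), y_test).mean()

print(f"empirical risk ~ {emp_risk:.3f}, expected risk ~ {exp_risk:.3f}")
```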


Part I
Background
Chapter 2
Statistical Learning Theory

This chapter introduces mathematical tools for establishing the foundations of deep
learning and, more broadly, machine learning. We first introduce the Glivenko-Cantelli
theorem and concentration inequalities, which characterize the convergence of an
empirical process in the context of convergence in probability. The Glivenko-Cantelli
theorem and concentration inequalities also inspire the concept of the sample com-
plexity required to ensure a desirable level of generalizability on unseen data. We
then discuss the Probably Approximately Correct (PAC) learning framework, which
characterizes learning algorithms that can learn a target concept in an appropriate
amount of time given a sufficiently high sample complexity. The PAC framework
has become the foundation of statistical learning. Usually, PAC is categorized as a
‘frequentist’ framework, which suffers from significant deficiencies in studying stochastic
algorithms. To address this issue, the PAC-Bayes framework is also discussed, which
integrates the PAC framework with Bayesian methods to characterize randomness.

2.1 Glivenko-Cantelli Theorem and Concentration


Inequalities

This section discusses the convergence of an empirical process. We first introduce


the Glivenko-Cantelli theorem based on measure theory, which characterizes whether
an empirical process asymptotically converges. Then, concentration inequalities are
introduced as a tool for evaluating the convergence speed. Please note that some
concentration inequalities also imply asymptotic convergence of a sequence.


2.1.1 Glivenko-Cantelli Theorem

This section introduces the Glivenko-Cantelli theorem (sometimes called the fundamental
theorem of statistics) based on measure theory. Measure theory offers a portfolio of rigorous
tools for mathematically characterizing the convergence of a sequence of hypotheses. However,
this rigour is usually accompanied by technical difficulty in practice. We thus also
introduce concentration inequalities in the next section, which may also characterize the
convergence while being easier to apply.
We first recall some necessary definitions in the measure theory. Readers can also
consult Stein and Shakarchi (2009) for reference. Measure is defined as follows.
Definition 2.1 Let $\mathcal{X}$ be the sample space. A collection $\mathcal{A}$ of subsets of $\mathcal{X}$ is called
a $\sigma$-field if it satisfies the following conditions:
• $\emptyset \in \mathcal{A}$;
• if $A_1, A_2, \cdots \in \mathcal{A}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$;
• if $A \in \mathcal{A}$, then $A^c \in \mathcal{A}$.
The tuple $(\mathcal{X}, \mathcal{A})$ is called a measurable space. A subset $A$ of $\mathcal{X}$ is called measurable
if $A \in \mathcal{A}$. Moreover, a function $f: \mathcal{X} \to \mathbb{R}$ is called measurable if, for all $a \in \mathbb{R}$,
the set $f^{-1}([-\infty, a)) = \{x \in \mathcal{X} : f(x) < a\}$ is measurable.
We then may define probability measure.
We then may define probability measure.
Definition 2.2 A probability measure $P$ on a measurable space $(\mathcal{X}, \mathcal{A})$ is a function
$P: \mathcal{A} \to [0, 1]$ satisfying
• $P(\emptyset) = 0$, $P(\mathcal{X}) = 1$;
• if $A_1, A_2, \ldots$ is a collection of disjoint members of $\mathcal{A}$, then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$
The triple $(\mathcal{X}, \mathcal{A}, P)$ is called a probability space.


The empirical measure $P_m$ of a sample of random elements $X_1, \ldots, X_m$ on a measurable
space $(\mathcal{X}, \mathcal{A})$ is defined as $P_m(C) = m^{-1}\#\{1 \le i \le m : X_i \in C\}$. Alternatively,
if the points are measurable, this measure is characterized as a random measure
that imposes a mass $1/m$ at each point.
The following defined Dirac measure is critical in probability theory.
Definition 2.3 The Dirac measure $\delta_x$ is defined by
$$\delta_x(A) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases}$$
Given the definition of the Dirac measure, the empirical measure is usually represented
as a linear combination of the Dirac measures at the observations, $P_m = m^{-1}\sum_{i=1}^{m} \delta_{x_i}$.
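To ground the notation, here is a minimal sketch (illustrative only; the interval $C$ and the sample distribution are assumptions) of the empirical measure $P_m(C)$, which assigns a Dirac mass of $1/m$ to each observation:

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.standard_normal(1000)  # X_1, ..., X_m drawn i.i.d. from P = N(0, 1)

def empirical_measure(points, C):
    """P_m(C) = (1/m) * #{i : X_i in C}, with C given as a membership test."""
    return np.mean([C(x) for x in points])

# Example set C = (-1, 1); each point contributes a Dirac mass of 1/m.
C = lambda x: -1.0 < x < 1.0
print("P_m((-1, 1)) =", empirical_measure(sample, C))
# For N(0, 1), P((-1, 1)) ~ 0.683, so P_m(C) should be close to that value.
```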

Given a collection $\mathcal{F}$ of measurable functions $f: \mathcal{X} \to \mathbb{R}$, the empirical measure
induces a map from $\mathcal{F}$ to $\mathbb{R}$ given by
$$f \mapsto P_m f = \frac{1}{m}\sum_{i=1}^{m} f(x_i).$$
Let $P$ be the shared distribution of all $X_i$. We define the notation $\|P\|_{\mathcal{F}}$ as
$\|P\|_{\mathcal{F}} = \sup\{|Pf| : f \in \mathcal{F}\}$. We now recall the law of large numbers.

Theorem 2.1 (Law of large numbers) Let $X_1, X_2, \ldots, X_m$ be independently identically
distributed random variables with $\mathbb{E}|X_1| < \infty$. Then
$$\frac{1}{m}\sum_{i=1}^{m} X_i \to \mathbb{E}[X_1] \quad \text{almost surely, as } m \to \infty.$$

We can write the uniform version of the law of large numbers as follows:
$$\|P_m - P\|_{\mathcal{F}} \to 0.$$
The above convergence is guaranteed to hold almost surely with respect to the outer measure, i.e.,
the event $\{\|P_m - P\|_{\mathcal{F}} \to 0\}$ holds with probability 1. A Glivenko-Cantelli class
is defined as a class $\mathcal{F}$ that satisfies this uniform law of large numbers. It is also referred
to as a $P$-Glivenko-Cantelli class since the class is dependent on the underlying
measure $P$.
Let $l$ and $u$ be real functions with finite norms on a measurable space $\mathcal{X}$. We define
a bracket $[l, u]$ as the set of all functions $f \in \mathcal{F}$ satisfying $l(x) \le f(x) \le u(x)$ for
every $x \in \mathcal{X}$. It should be noted that the functions $l$ and $u$ are not necessarily in the
Glivenko-Cantelli class $\mathcal{F}$. An $\epsilon$-bracket is a bracket $[l, u]$ that satisfies $\|u - l\| \le \epsilon$,
where $\epsilon$ is a positive real number and $\|\cdot\|$ is a given norm. The bracketing number
$N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|)$ is the minimum number of $\epsilon$-brackets needed to cover the class
$\mathcal{F}$. The bracketing entropy is the natural logarithm of the bracketing number. The
occasional stars as superscripts refer to outer measures in the first case and minimal
measurable envelopes in the second case.

Theorem 2.2 Let $\mathcal{F}$ be a class of measurable functions. If $N_{[\,]}(\epsilon, \mathcal{F}, \|\cdot\|) < \infty$ for
all $\epsilon > 0$, then $\mathcal{F}$ is a $P$-Glivenko-Cantelli class. We have that
$$\|P_m - P\|_{\mathcal{F}}^{*} \xrightarrow{\text{a.s.}} 0,$$
where $P$ is a shared distribution and $P_m$ is the empirical measure.

We present a brief proof of this theorem as follows. For any $\epsilon > 0$, we choose
finitely many $\epsilon$-brackets $\{[l_i, u_i]\}_{i=1}^{k}$ and argue that, by selecting a bound on
$|(P_m - P) f|$ (for each $f$) in terms of the bracket $[l_i, u_i]$ that contains it, we obtain
$$\sup_{f \in \mathcal{F}} |(P_m - P) f| \le \max_{1 \le i \le k} (P_m - P) u_i \vee \max_{1 \le i \le k} (P - P_m) l_i + \epsilon.$$
To conclude, according to the strong law of large numbers for random variables, the right-hand side
of the above inequality is eventually less than $2\epsilon$ almost surely.
The following is the Glivenko-Cantelli theorem for a continuous distribution function
on the line. Let $F$ be a continuous cumulative distribution function (CDF), and let $P$ be
the corresponding measure. Since $F$ is continuous on the line, it is uniformly continuous,
so for every $\epsilon > 0$ we can find $-\infty = t_0 < t_1 < t_2 < \ldots < t_k < t_{k+1} = \infty$, where
$k$ is a positive integer, such that the brackets $[I_{x \le t_i}, I_{x \le t_{i+1}}]$ for $i = 0, 1, \ldots, k$
together contain $\{I_{x \le t} : t \in \mathbb{R}\}$ and satisfy $F(t_{i+1}) - F(t_i) \le \epsilon$. Then, the above
theorem applies. Notably, the continuity of the distribution function $F$ is essential for
this argument, although the Glivenko-Cantelli theorem on the line holds for arbitrary
distribution functions. This more general result will be considered a subsequent corollary
of the Glivenko-Cantelli theorem.
The following lemma characterizes a setting that guarantees a finite bracketing
number for appropriate classes of functions and finds ready application in
inference for parametric statistical models.
Lemma 2.1 Suppose that $\mathcal{F} = \{f(\cdot, t) : t \in T\}$, where $T$ is a compact subset of a
metric space $(D, d)$ and the functions $f: \mathcal{X} \times T \to \mathbb{R}$ are continuous in $t$ for $P$-almost
every $x \in \mathcal{X}$. If the envelope function $F$ defined by $F(x) = \sup_{t \in T} |f(x, t)|$ satisfies
the condition $PF < \infty$, then $N_{[\,]}(\epsilon, \mathcal{F}, L_1(P)) < \infty$ for each $\epsilon > 0$.
We will now derive consistency in parametric statistical models.
Consistency in parametric models: Let $\{p(x, \theta) : \theta \in \Theta\}$, $\Theta \subset \mathbb{R}^d$, be a class
of parametric densities. Suppose that $X_1, X_2, \ldots$ are generated from some $p(x, \theta_0)$.
Additionally, assume that $\Theta$ is compact and $p(x, \theta)$ is continuous w.r.t. $\theta$ for $P_{\theta_0}$-almost
every $x$. We define $M(\theta) = \mathbb{E}_{\theta_0} l(X_1, \theta)$, where $l(x, \theta) = \log p(x, \theta)$. Eventually,
we assume that $\sup_{\theta \in \Theta} |l(x, \theta)| \le B(x)$ for some $B$ with $\mathbb{E}_{\theta_0} B(X_1) < \infty$.
Note that $M(\theta)$ is continuous on $\Theta$ if it is finite for all $\theta$. If $P_{\theta_0}$ denotes the
measure corresponding to $\theta_0$, then $M(\theta) = P_{\theta_0} l(\cdot, \theta)$, and the maximum likelihood
estimate (MLE) of $\theta$ is given by $\hat{\theta}_m = \operatorname{argmax}_\theta M_m(\theta)$, where $M_m(\theta) = P_m l(\cdot, \theta)$.
Under the assumption that the model is identifiable (i.e., the probability distributions
corresponding to different $\theta$s are different), it can be observed that $M(\theta)$ is uniquely
(and globally) maximized at $\theta_0$.
Eventually, $\theta_0$ is a well-separated maximizer in the sense that for any $\eta > 0$,
$\sup_{\theta \in \Theta \cap B_\eta(\theta_0)^c} M(\theta) < M(\theta_0)$, where $B_\eta(\theta_0)$ is the open ball of radius $\eta$ centred at
$\theta_0$. Let $\psi(\eta) = M(\theta_0) - \sup_{\theta \in \Theta \cap B_\eta(\theta_0)^c} M(\theta)$. Then, we have $\psi(\eta) > 0$.
Our goal is to show that $\hat{\theta}_m \to \theta_0$ in outer probability $P_{\theta_0}^*$. Given $\epsilon > 0$, we have that
$$\hat{\theta}_m \in B_\epsilon(\theta_0)^c \;\Rightarrow\; M(\hat{\theta}_m) \le \sup_{\theta \in \Theta \cap B_\epsilon(\theta_0)^c} M(\theta)$$
$$\Rightarrow\; M(\hat{\theta}_m) - M(\theta_0) \le -\psi(\epsilon)$$
$$\Rightarrow\; M(\hat{\theta}_m) - M(\theta_0) + M_m(\theta_0) - M_m(\hat{\theta}_m) \le -\psi(\epsilon) \quad (\text{since } M_m(\theta_0) \le M_m(\hat{\theta}_m))$$
$$\Rightarrow\; 2 \sup_{\theta \in \Theta} |M_m(\theta) - M(\theta)| \ge \psi(\epsilon).$$

Thus,
$$P^*\left(\hat{\theta}_m \in B_\epsilon(\theta_0)^c\right) \le P^*\left(\sup_{\theta \in \Theta} |M_m(\theta) - M(\theta)| \ge \psi(\epsilon)/2\right)
= P^*\left(\sup_{\theta \in \Theta} \left|(P_m - P_{\theta_0}) l(\cdot, \theta)\right| \ge \psi(\epsilon)/2\right).$$
This probability converges to 0 because $\sup_{\theta \in \Theta} |(P_m - P_{\theta_0}) l(\cdot, \theta)|^* \to 0$ a.s.;
based on our assumptions on the parametric model, we can conclude from Lemma
2.1 that $N_{[\,]}(\eta, \{l(\cdot, \theta) : \theta \in \Theta\}, L_1(P_{\theta_0})) < \infty$ for every $\eta > 0$ and then invoke
Theorem 2.2.
Next, we provide the necessary and sufficient conditions for a class of functions
$\mathcal{F}$ to be a Glivenko-Cantelli class in terms of covering numbers (Wellner et al. 2013).
Theorem 2.3 Let $\mathcal{F}$ be a $P$-measurable class of measurable functions bounded in
$L_1(P)$. Then, $\mathcal{F}$ is a $P$-Glivenko-Cantelli class if and only if
(a) $P^* F < \infty$ and
(b)
$$\lim_{m \to \infty} \frac{\mathbb{E}^* \log N(\epsilon, \mathcal{F}_M, L_2(P_m))}{m} = 0$$
for all $M < \infty$ and $\epsilon > 0$. Here, $\mathcal{F}_M = \{f\, I_{F \le M} : f \in \mathcal{F}\}$.


We will only consider the ‘if’ part of the proof here due to space limitations.
We first note that $L_2$ can be replaced by any $L_r$ with $r \ge 1$. Then, for
the ‘if’ part, the second condition can be replaced by the weaker condition that
$\log N(\epsilon, \mathcal{F}_M, L_2(P_m))/m \to 0$ in outer probability. Since $N(\epsilon, \mathcal{F}_M, L_2(P_m)) \le N(\epsilon, \mathcal{F}, L_2(P_m))$
for all $M > 0$, condition (b) in the theorem can be replaced by the alternative condition
that $\mathbb{E}^*[\log N(\epsilon, \mathcal{F}, L_2(P_m))/m] \to 0$ (or a condition involving convergence
in probability for the ‘if’ part). Eventually, if $\mathcal{F}$ has a measurable and integrable
envelope $F$, then $P_m F$ is almost surely finite (as per the simple strong law). We then
may argue the equivalence relation between
$$\forall \epsilon > 0, \quad \left(\log N(\epsilon, \mathcal{F}, L_1(P_m))\right)^* = o_p(m)$$
and
$$\forall \epsilon > 0, \quad \left(\log N(\epsilon \|F\|_{P_m,1}, \mathcal{F}, L_1(P_m))\right)^* = o_p(m).$$
We can use the characterization of in-probability convergence via almost sure convergence
along subsequences to verify this. Thus, there is a large class of functions,
called a Vapnik-Chervonenkis (VC) class of functions, for which the quantity
 
$\log N(\epsilon \|F\|_{P_m,1}, \mathcal{F}, L_1(P_m))$ is bounded; in fact, for such a class $\mathcal{F}$ of functions,
we have that
$$\sup_{Q} N(\epsilon \|F\|_{Q,r}, \mathcal{F}, L_r(Q)) \le K_1 \left(\frac{1}{\epsilon}\right)^{rM}$$
for an integer $M \ge 1$ that depends solely on $\mathcal{F}$ and a constant $K_1$ that also depends
solely on $\mathcal{F}$, where the supremum is taken over all probability measures $Q$ for
which $\|F\|_{Q,r} > 0$. Thus, a VC class of functions with an integrable envelope $F$ is
a Glivenko-Cantelli class for any probability measure on the corresponding sample
space. Fortunately, functions formed by combining VC classes of functions via various
mathematical operations often satisfy entropy bounds similar to those illustrated
above. Therefore, classes of such (greater) complexity remain Glivenko-Cantelli
classes under the integrability hypothesis.
 
As a special case, consider $\mathcal{F} = \{f_t(x) = I_{-\infty < x \le t} : t \in \mathbb{R}^d\}$. Thus, $f_t(x)$ is the
indicator of the infinite rectangle to the ‘southwest’ of the point $t$. For any probability
measure $Q$ on the $d$-dimensional Euclidean space, we have that
$$N(\epsilon, \mathcal{F}, L_1(Q)) \le M_d \left(\frac{K}{\epsilon}\right)^d,$$
which immediately implies the classical Glivenko-Cantelli theorem in $\mathbb{R}^d$.


Proof of Theorem 2.2 We now prove the ‘if’ part. Applying the $P$-measurability of
the class $\mathcal{F}$ and Corollary 1.1 of the symmetrization notes, we find that
$$\mathbb{E}^* \|P_m - P\|_{\mathcal{F}} \le 2\,\mathbb{E}\left\|\frac{1}{m}\sum_{i=1}^{m} \epsilon_i f(X_i)\right\|_{\mathcal{F}}
= 2\,\mathbb{E}_X \mathbb{E}_\epsilon \left\|\frac{1}{m}\sum_{i=1}^{m} \epsilon_i f(X_i)\right\|_{\mathcal{F}}
\le 2\,\mathbb{E}_X \mathbb{E}_\epsilon \left\|\frac{1}{m}\sum_{i=1}^{m} \epsilon_i f(X_i)\right\|_{\mathcal{F}_M} + 2P^*(F\, I_{F > M}).$$

Given any $\epsilon > 0$, an appropriate choice of $M$ ensures that the second term is not
larger than $\epsilon$. It thus suffices to show that, for this choice of $M$, the first term eventually
becomes smaller than $\epsilon$. To this end, we fix $X_1, X_2, \ldots, X_m$. An $\epsilon$-net $G$ (assumed to be
of minimal size) over $\mathcal{F}_M$ in $L_2(P_m)$ is also an $\epsilon$-net in $L_1(P_m)$. Then, we have
$$\mathbb{E}_\epsilon \left\|\frac{1}{m}\sum_{i=1}^{m} \epsilon_i f(X_i)\right\|_{\mathcal{F}_M}
\le \mathbb{E}_\epsilon \left\|\frac{1}{m}\sum_{i=1}^{m} \epsilon_i f(X_i)\right\|_{G} + \epsilon.$$

Each $g \in G$ may be assumed to be uniformly bounded in absolute value by $M$, since
each $f$ in $\mathcal{F}_M$ is bounded in absolute value by $M$. Indeed, given an arbitrary $\epsilon$-net
$G$, we perturb each $g$ to a $\tilde{g}$ that coincides with $g$ whenever $|g| \le M$ and equals
$\mathrm{sign}(g) \cdot M$ on the complement of this set. These perturbed functions constitute an
$\epsilon$-net over $\mathcal{F}_M$.
Let us analyze the first term on the right-hand side of the equation. Since the $L_1$
norm is bounded up to a constant by the $\psi_1$ Orlicz norm, which is bounded up to
a constant by the $\psi_2$ Orlicz norm, we can use Lemma 2.1 in the chaining notes to
bound the first term up to a constant by
$$B_m = \left(1 + \sqrt{\log N(\epsilon, \mathcal{F}_M, L_2(P_m))}\right) \max_{f \in G}\left\|\frac{1}{m}\sum_{i=1}^{m} \epsilon_i f(X_i)\right\|_{\psi_2 \mid X}.$$

Applying Hoeffding’s inequality, we obtain
$$\left\|\frac{1}{m}\sum_{i=1}^{m} \epsilon_i f(X_i)\right\|_{\psi_2 \mid X} \le \sqrt{6}\,\frac{1}{\sqrt{m}}\left(P_m f^2\right)^{1/2} \le \sqrt{6}\,\frac{M}{\sqrt{m}},$$
and thus,
$$B_m \le \sqrt{6}\,M\,\sqrt{\frac{1 + \log N(\epsilon, \mathcal{F}_M, L_2(P_m))}{m}} \to 0$$

by condition (b) of the theorem. We conclude that


 
$$\mathbb{E}_\epsilon \left\|\frac{1}{m}\sum_{i=1}^{m} \epsilon_i f(X_i)\right\|_{\mathcal{F}_M} \to 0 \ \text{in probability}.$$
Since the above random variable is bounded, we have
$$\mathbb{E}_X \mathbb{E}_\epsilon \left\|\frac{1}{m}\sum_{i=1}^{m} \epsilon_i f(X_i)\right\|_{\mathcal{F}_M} \to 0.$$

It follows that $\mathbb{E}^* \|P_m - P\|_{\mathcal{F}} \to 0$. However, our goal is to show almost sure
convergence. This is deduced through a submartingale argument, a simplified version
of which is presented at the end of these notes. The idea here is to show that
$\|P_m - P\|_{\mathcal{F}}^*$ is a reverse submartingale with respect to a (decreasing) filtration that
converges to the symmetric sigma field and therefore has an almost sure limit. This
almost sure limit, measurable with respect to the symmetric sigma field, must be
a nonnegative constant almost surely. The fact that the expectation converges to 0
forces this constant to be 0.

2.1.2 Concentration Inequality

Concentration inequality, which describes the deviation between the sum (or sample
mean) of a group of random variables and their expected value, is a very useful
concept in the analysis of PAC learning. In addition to characterizing the asymptotic
convergence of an iterative machine learning algorithm, concentration inequalities
also offer practical tools for evaluating the convergence rate. We will give a brief
summary of various concentration inequalities in this subsection.
Markov’s inequality. Let $X$ be a nonnegative random variable, and $a > 0$. Then,
$$P(X \ge a) \le \frac{\mathbb{E}X}{a}.$$

Chebyshev’s inequality. Let $X$ be a random variable with mean $\mu$ and variance $\sigma^2$.
Then, for any constant $k > 0$,
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$

Chernoff bounds. Let $X = \sum_{i=1}^{m} X_i$, and let $t$ be a positive constant. Then,
$$P(X_i > a) = P\left(e^{tX_i} > e^{ta}\right) \le e^{-ta}\,\mathbb{E}\,e^{tX_i},$$
$$P(X > a) \le e^{-ta}\,\mathbb{E}\prod_{i=1}^{m} e^{tX_i},$$
$$P(X > a) \le \min_{t > 0}\; e^{-ta}\,\mathbb{E}\prod_{i=1}^{m} e^{tX_i}.$$

Hoeffding’s inequality. Let $X_1, X_2, \cdots, X_m$ be a family of independent random
variables with $0 \le X_i \le 1$. Then,
$$P\left(\frac{1}{m}\sum_{i=1}^{m} X_i - \frac{1}{m}\sum_{i=1}^{m}\mathbb{E}X_i \ge \epsilon\right) \le \exp\left(-2m\epsilon^2\right),$$
$$P\left(\left|\frac{1}{m}\sum_{i=1}^{m} X_i - \frac{1}{m}\sum_{i=1}^{m}\mathbb{E}X_i\right| \ge \epsilon\right) \le 2\exp\left(-2m\epsilon^2\right).$$
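A quick numerical sanity check of Hoeffding's inequality (a toy simulation with assumed Bernoulli(0.5) variables, not part of the text): the empirical deviation probability should stay below the bound $2\exp(-2m\epsilon^2)$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, eps, trials = 100, 0.1, 20_000

# X_i ~ Bernoulli(0.5), so E[X_i] = 0.5 and 0 <= X_i <= 1 as required.
samples = rng.integers(0, 2, size=(trials, m))
deviations = np.abs(samples.mean(axis=1) - 0.5)

empirical = np.mean(deviations >= eps)
bound = 2 * np.exp(-2 * m * eps**2)
print(f"empirical P(|mean - 0.5| >= {eps}) = {empirical:.4f}")
print(f"Hoeffding bound                   = {bound:.4f}")
```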

McDiarmid’s inequality. Let $X_1, X_2, \cdots, X_m \in \mathcal{X}$ be a family of independent
random variables, and for any $1 \le i \le m$, let the function $f$ satisfy
$$\sup_{X_1, \ldots, X_m, X_i' \in \mathcal{X}} \left|f(X_1, \cdots, X_m) - f(X_1, \cdots, X_{i-1}, X_i', X_{i+1}, \cdots, X_m)\right| \le c_i.$$
Then, for any $\epsilon > 0$,
$$P\left(f(X_1, \cdots, X_m) - \mathbb{E}[f(X_1, \cdots, X_m)] \ge \epsilon\right) \le \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\right),$$
$$P\left(\left|f(X_1, \cdots, X_m) - \mathbb{E}[f(X_1, \cdots, X_m)]\right| \ge \epsilon\right) \le 2\exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\right).$$

Bennett’s inequality. Let $X_1, X_2, \cdots, X_m$ be a family of independent random
variables, and suppose the following relations hold: $a_i < X_i < b_i$, $b_i - a_i = c_i \le C$,
$\bar{X} = \frac{1}{m}(X_1 + \cdots + X_m)$, and $\sigma^2 = \sum_{i=1}^{m}\left(\mathbb{E}[X_i^2] - \mathbb{E}[X_i]^2\right)$. Then, for any $t > 0$,
$$P\left(\bar{X} - \mathbb{E}[\bar{X}] \ge t\right) \le \exp\left(-\frac{\sigma^2}{C^2}\, h\!\left(\frac{Cmt}{\sigma^2}\right)\right),$$
where $h(u) = (1 + u)\log(1 + u) - u$.


Bernstein’s inequality. Let $X_1, X_2, \cdots, X_m$ be a family of independent random
variables, and suppose the following relations hold: $a_i < X_i < b_i$, $b_i - a_i = c_i \le C$,
$\bar{X} = \frac{1}{m}(X_1 + \cdots + X_m)$, and $\sigma^2 = \sum_{i=1}^{m}\left(\mathbb{E}[X_i^2] - \mathbb{E}[X_i]^2\right)$. Then, for any $t > 0$, we have
$$P\left(\bar{X} - \mathbb{E}[\bar{X}] \ge t\right) \le \exp\left(-\frac{m^2 t^2/2}{\sigma^2 + Cmt/3}\right).$$

The inequalities discussed above require the random variables to be independent.
In contrast, for Azuma’s inequality below, the sequence of random variables can be
correlated.
Azuma’s inequality. Let $X_1, X_2, \cdots, X_m$ be a martingale sequence, i.e., $\mathbb{E}[|X_m|] < \infty$
and $\mathbb{E}[X_{m+1} \mid X_1, \cdots, X_m] = X_m$. Then, under the assumption that $|X_i - X_{i-1}| < c_i$,
for any $t > 0$, we have
$$P(X_m - X_0 \ge t) \le \exp\left(-\frac{t^2}{2\sum_{i=1}^{m} c_i^2}\right).$$

Doob’s martingale inequality. Let $\{X_t, \mathcal{F}_t, 0 \le t < \infty\}$ be a submartingale, i.e.,
for every $0 \le s < t < \infty$, $\mathbb{E}(X_t \mid \mathcal{F}_s) \ge X_s$. Then, for any positive constant $C$,
$$P\left(\sup_{0 \le t \le T} X_t \ge C\right) \le \frac{\mathbb{E}[\max(X_T, 0)]}{C}.$$

2.2 Probably Approximately Correct (PAC) Learning

Before designing algorithms to learn from training data, we should consider the
following issues. What is the most essential consideration in a learning task? How
many examples are required to successfully train a model? Additionally, is there a
generalizable model that can be learned for the current problem? To formalize and
solve these issues, this section introduces the PAC learning framework. The PAC
learning framework can help determine an algorithm’s sample complexity (how many
examples are required to train a desirable approximator for the given problem) and
whether a problem is learnable. In this section, we first present the PAC framework,
then provide a theoretical assurance for learning algorithms, and eventually hope to
guide algorithm design in accordance with the results of our analysis.

2.2.1 The PAC Learning Model

We first present some relevant definitions and notations before introducing the PAC
model. We consider a binary classification task, i.e., the label y ∈ {0, 1}. We express
a concept c as a mapping from a feature x to the label y. We use C to denote a concept
class, which represents the set of concepts we hope to learn. For instance, it may be
the set of all triangles on a plane.
Suppose that all examples are i.i.d. according to a fixed but unknown distribution
D. Then, the framework of the learning problem is constructed as follows. The
learner considers a fixed set of possible concepts H, which is termed the hypothesis
set. The hypothesis set does not need to coincide with C. The learner receives a
sample set S : S = (x1 , . . . , xm ) with known labels (c (x1 ) , . . . , c (xm )) that depend
on the concept c ∈ C to be learned (here, c represents an optimal mapping). The
examples in S are independently sampled from the distribution D. The learning task
requires the selection of a hypothesis h S ∈ H based on the labelled sample S, which
has a small generalization error with respect to c. The generalization error (or risk, or
true error, or simply error) of a hypothesis h ∈ H is denoted by R(h) and is defined
as follows.
Definition 2.4 (Generalization risk) The generalization risk of a hypothesis $h \in \mathcal{H}$
is defined as
$$R(h) = \Pr_{x \sim \mathcal{D}}[h(x) \ne c(x)], \qquad (2.1)$$
where $c \in \mathcal{C}$ is the target concept.


The generalization error of h cannot be obtained directly by the learner because both
the distribution D and the target concept c are unknown. Nevertheless, the learner
can learn the empirical error of a mapping on the labelled sample S to approximate
the generalization error.

Definition 2.5 (Empirical error) The empirical risk of a hypothesis $h \in \mathcal{H}$ is defined
as
$$\hat{R}_S(h) = \frac{1}{m}\sum_{i=1}^{m} I_{h(x_i) \ne c(x_i)}, \qquad (2.2)$$

where $I_w$ is the indicator function of event $w$. The generalization error of $h \in \mathcal{H}$ is
the expected error based on the distribution $\mathcal{D}$. The empirical error is the average
error over the sample $S$. Under certain general assumptions, there exist a series of
guarantees with high probability for these two quantities. It is also important to note
that the expected empirical error based on the i.i.d. sample set $S$ is equivalent to the
generalization error for a fixed $h \in \mathcal{H}$:
$$\mathop{\mathbb{E}}_{S \sim \mathcal{D}^m}\left[\hat{R}_S(h)\right] = R(h). \qquad (2.3)$$
By the linearity of expectation and the fact that the sample $S$ is drawn in an i.i.d.
manner, we obtain
$$\mathop{\mathbb{E}}_{S \sim \mathcal{D}^m}\left[\hat{R}_S(h)\right] = \frac{1}{m}\sum_{i=1}^{m}\mathop{\mathbb{E}}_{S \sim \mathcal{D}^m}\left[I_{h(x_i) \ne c(x_i)}\right] = \mathop{\mathbb{E}}_{x \sim \mathcal{D}}\left[I_{h(x) \ne c(x)}\right] = R(h).$$

Next, we present the PAC learning framework. We use n to denote a number, size(c)
to denote the maximal cost of the computational representation of c ∈ C, and h S to
denote the hypothesis that is obtained by algorithm A when applied to the labelled
sample S. The computational cost of representing the element x ∈ X is at most O(n).
To simplify the notation, we do not explicitly express the dependence of h S on A.
Definition 2.6 (PAC learning) For any $\epsilon > 0$, any $\delta > 0$, any distribution $\mathcal{D}$ on $\mathcal{X}$,
and any target concept $c \in \mathcal{C}$, the concept class $\mathcal{C}$ is PAC learnable if there exist
an algorithm $\mathcal{A}$ and a polynomial function $\mathrm{poly}(\cdot, \cdot, \cdot, \cdot)$ such that the following
inequality holds for any sample size $m \ge \mathrm{poly}(1/\epsilon, 1/\delta, n, \mathrm{size}(c))$:
$$\Pr_{S \sim \mathcal{D}^m}[R(h_S) \le \epsilon] \ge 1 - \delta. \qquad (2.4)$$
Furthermore, $\mathcal{C}$ is efficiently PAC learnable if $\mathcal{A}$ runs in $\mathrm{poly}(1/\epsilon, 1/\delta, n, \mathrm{size}(c))$
time. The algorithm $\mathcal{A}$ is said to be a PAC learning algorithm for $\mathcal{C}$.
Hence, if the hypothesis acquired by the algorithm after receiving a number of points
that is polynomial in $1/\epsilon$ and $1/\delta$ is approximately correct with high probability,
then the concept class $\mathcal{C}$ is PAC learnable. The parameter $\epsilon > 0$ defines the accuracy
$1 - \epsilon$, while $\delta > 0$ defines the confidence $1 - \delta$.
We need to emphasize a certain point regarding the definition of PAC learning.
No particular assumption is made about the distribution D from which examples are
drawn in the PAC framework. Therefore, it is a distribution-free model. Additionally,
the training and test examples are drawn from the same distribution D when defining
the error. This is a natural assumption to make generalization possible. Furthermore,

the PAC framework is suitable for solving the learnability question for a concept
class C rather than for a particular concept. It is important to note that the target
concept c ∈ C is unknown to the algorithm, whereas the concept class C is known.
Hence, we can focus only on the sample complexity in the definition of PAC learning
and omit the polynomial dependency on n and size(c) without explicitly discussing
the concepts’ computational representation.

2.2.2 Generalities

Next, we discuss some general aspects of learning scenarios.

2.2.2.1 Deterministic Versus Stochastic Scenarios

In supervised learning, the distribution D is usually defined over X × Y, and the


training data consist of a labelled sample S drawn in an i.i.d. manner according to
D:
S = ((x1 , y1 ) , . . . , (xm , ym )) .

The target of learning is to find a hypothesis $h \in \mathcal{H}$ with a small generalization error
$$R(h) = \Pr_{(x,y) \sim \mathcal{D}}[h(x) \ne y] = \mathop{\mathbb{E}}_{(x,y) \sim \mathcal{D}}\left[I_{h(x) \ne y}\right].$$

This general scenario is stochastic. Under such circumstances, the output label is a
probabilistic function of the input. This scenario can describe a number of real-world
problems. For instance, when the goal is to predict gender using input pairs consisting
of the weight and height of a person, the labels of the input sample generally may
not be unique. Both ‘man’ and ‘woman’ will be possible genders for most pairs,
although there may exist distinct probability distributions for the label being ‘man’
or ‘woman’ for each fixed pair. An extension of the PAC learning framework, known
as agnostic PAC learning, is able to deal with this setting.

Definition 2.7 (Agnostic PAC learning) Let $\mathcal{H}$ be a hypothesis set. The algorithm $\mathcal{A}$
is said to be an agnostic PAC learning algorithm if there exists a polynomial function
$\mathrm{poly}(\cdot, \cdot, \cdot, \cdot)$ such that for any $\epsilon > 0$ and $\delta > 0$, for any distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, and
for any sample size $m \ge \mathrm{poly}(1/\epsilon, 1/\delta, n, \mathrm{size}(c))$, the following expression holds:
$$\Pr_{S \sim \mathcal{D}^m}\left[R(h_S) - \min_{h \in \mathcal{H}} R(h) \le \epsilon\right] \ge 1 - \delta. \qquad (2.5)$$
Furthermore, if $\mathcal{A}$ runs in $\mathrm{poly}(1/\epsilon, 1/\delta, n)$ time, then it is called an efficient agnostic
PAC learning algorithm.

The scenario is considered deterministic if the label of an input sample can be determined
with probability one by a function $f: \mathcal{X} \to \mathcal{Y}$. In such a scenario, it is sufficient
to consider the distribution $\mathcal{D}$ over the input space. The training sample can be acquired
by drawing $(x_1, \ldots, x_m)$ according to $\mathcal{D}$, and the corresponding labels can be obtained
from $f$: $y_i = f(x_i)$ for all $i \in [m]$.

2.2.2.2 Bayes Error and Noise

In the deterministic case, there exists a target function f with no generalization error
($R(f) = 0$), while in the stochastic case, there is a minimal nonzero error for any
hypothesis.
Definition 2.8 (Bayes error) Given a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, the Bayes error $R^*$
is defined as the infimum of the errors achieved by measurable functions $h: \mathcal{X} \to \mathcal{Y}$:
$$R^* = \inf_{h} R(h). \qquad (2.6)$$
A hypothesis $h$ with $R(h) = R^*$ is called a Bayes hypothesis or Bayes classifier.


$R^* = 0$ in the deterministic case, while $R^* \ne 0$ in the stochastic case. The Bayes
classifier $h_{\mathrm{Bayes}}$ can be defined in terms of conditional probabilities as follows:
$$\forall x \in \mathcal{X}, \quad h_{\mathrm{Bayes}}(x) = \operatorname*{argmax}_{y \in \{0, 1\}} P(y \mid x). \qquad (2.7)$$
Therefore, the average error generated by $h_{\mathrm{Bayes}}$ on $x \in \mathcal{X}$ is $\min\{P(0 \mid x), P(1 \mid x)\}$,
the minimum possible error. On this basis, the definition of noise is given as follows.

Definition 2.9 (Noise) Given a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, the noise at point $x \in \mathcal{X}$
is defined as
$$\mathrm{noise}(x) = \min\{P(1 \mid x), P(0 \mid x)\}. \qquad (2.8)$$
The average noise, or the noise associated with $\mathcal{D}$, is $\mathbb{E}[\mathrm{noise}(x)]$.


It is easy to see that the average noise is precisely the Bayes error: noise =
E[noise(x)] = R ∗ . A point x ∈ X with noise(x) close to 1/2 is sometimes called
noisy and is a challenge to predict accurately.

2.2.2.3 Guarantees for Finite Hypothesis Sets–Consistent Case

We consider the case in which the hypotheses are consistent and the target concept $c$
is contained in the finite hypothesis set $\mathcal{H}$ of cardinality $|\mathcal{H}|$. In this section, we will present a
general sample complexity bound for such consistent hypotheses.

Theorem 2.4 (Learning bound–finite $\mathcal{H}$, consistent case) Let $\mathcal{A}$ be an algorithm
that, for any target concept $c \in \mathcal{H}$ and i.i.d. sample $S$, returns a consistent hypothesis
$h_S$: $\hat{R}_S(h_S) = 0$, where $\mathcal{H}$ is a finite set of functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. For any
$\epsilon, \delta > 0$, if
$$m \ge \frac{1}{\epsilon}\left(\log |\mathcal{H}| + \log\frac{1}{\delta}\right), \qquad (2.9)$$
then the inequality $\Pr_{S \sim \mathcal{D}^m}[R(h_S) \le \epsilon] \ge 1 - \delta$ holds. The following generalization
bound can be regarded as the equivalent statement according to this sample
complexity. That is, for any $\epsilon, \delta > 0$, with a probability of at least $1 - \delta$,
$$R(h_S) \le \frac{1}{m}\left(\log |\mathcal{H}| + \log\frac{1}{\delta}\right). \qquad (2.10)$$
Proof For any $\epsilon > 0$, we define $\mathcal{H}_\epsilon = \{h \in \mathcal{H} : R(h) > \epsilon\}$. The probability
that a hypothesis $h \in \mathcal{H}_\epsilon$ is consistent on an i.i.d. sample set $S$, i.e., that it has no error on
any point in $S$, can be bounded as follows:
$$\Pr\left[\hat{R}_S(h) = 0\right] \le (1 - \epsilon)^m.$$
Applying the union bound yields the following expression:
$$\Pr\left[\exists h \in \mathcal{H}_\epsilon : \hat{R}_S(h) = 0\right] = \Pr\left[\hat{R}_S(h_1) = 0 \vee \cdots \vee \hat{R}_S(h_{|\mathcal{H}_\epsilon|}) = 0\right]
\le \sum_{h \in \mathcal{H}_\epsilon} \Pr\left[\hat{R}_S(h) = 0\right] \le \sum_{h \in \mathcal{H}_\epsilon} (1 - \epsilon)^m \le |\mathcal{H}| e^{-m\epsilon}.$$
Let the right-hand side be equal to $\delta$; then, solving for $\epsilon$ completes the proof. $\square$

This theorem illustrates that a consistent algorithm $\mathcal{A}$ is a PAC learning algorithm when the
hypothesis set $\mathcal{H}$ is finite, since the sample complexity in (2.9) is dominated by a
polynomial in $1/\epsilon$ and $1/\delta$. From expression (2.10), we know that the generalization
error of consistent hypotheses has an upper bound that decreases with an
increasing sample size $m$.
The cost of proposing a consistent algorithm is the necessity of using a larger
hypothesis set $\mathcal{H}$ that contains the target concepts. The upper bound of (2.10)
increases with increasing $|\mathcal{H}|$, although the dependency is only logarithmic. The
term $\log |\mathcal{H}|$, up to a constant factor, can be interpreted as the number of bits
needed to represent $\mathcal{H}$. As a result, the ratio between the number of bits ($\log_2 |\mathcal{H}|$)
and the sample size $m$ determines this theorem’s generalization guarantee.
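As a small worked example (an illustrative calculation, not from the text), the sample complexity in (2.9) can be computed directly; the hypothesis-set size, accuracy, and confidence values below are arbitrary choices.

```python
import math

def pac_sample_complexity(h_size, eps, delta):
    """Consistent finite-hypothesis-set bound: m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Example: |H| = 2**20 hypotheses, accuracy eps = 0.05, confidence 1 - delta = 0.99.
m = pac_sample_complexity(h_size=2**20, eps=0.05, delta=0.01)
print(f"sufficient sample size m = {m}")  # roughly (13.9 + 4.6) / 0.05 ~ 370
```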

2.2.2.4 Guarantees for Finite Hypothesis Sets–Inconsistent Case

Difficult learning problems may arise in which there exist concept classes that are
more complex than the hypotheses used in the learning algorithm. In other words,
there may not exist a hypothesis in H that is consistent with the labelled training
sample. Nevertheless, such inconsistent hypotheses with minor errors in the training
sample can also be useful. Under certain assumptions, they may even benefit from
favourable guarantees. In this section, we introduce learning guarantees for finite
hypotheses in the inconsistent case by using either Hoeffding’s inequality or the
following corollary, which establishes the relationship between the empirical error
and generalization error of a single hypothesis.

Corollary 2.1 Let $\epsilon > 0$ be fixed. Then, for any hypothesis $h: \mathcal{X} \to \{0, 1\}$, the
following expressions hold:
$$\Pr\left[\hat{R}_S(h) - R(h) \ge \epsilon\right] \le \exp\left(-2m\epsilon^2\right), \qquad (2.11)$$
$$\Pr\left[\hat{R}_S(h) - R(h) \le -\epsilon\right] \le \exp\left(-2m\epsilon^2\right). \qquad (2.12)$$
Applying the union bound yields the following two-sided inequality:
$$\Pr\left[\left|\hat{R}_S(h) - R(h)\right| \ge \epsilon\right] \le 2\exp\left(-2m\epsilon^2\right). \qquad (2.13)$$
Let the right-hand side of (2.13) be equal to $\delta$; then, solving for $\epsilon$ immediately
produces the following bound.
Corollary 2.2 (Generalization bound–single hypothesis) Consider a fixed
hypothesis $h: \mathcal{X} \to \{0, 1\}$. For any $\delta > 0$, the following expression holds with a
probability of at least $1 - \delta$:
$$R(h) \le \hat{R}_S(h) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \qquad (2.14)$$

Because $h_S$ is a random variable that depends on the training sample $S$, we cannot
utilize Corollary 2.2 to bound its generalization error. Additionally, the generalization
error $R(h_S)$ is a random variable and differs from the expectation $\mathbb{E}[\hat{R}_S(h_S)]$,
which is a constant. This is distinct from the case of a fixed hypothesis, for which
the expectation of the empirical error is the generalization error (Eq. 2.3). Thus, as in
the consistent case, it is necessary to derive a uniform convergence bound that
holds for all hypotheses $h \in \mathcal{H}$ with high probability.

Theorem 2.5 (Learning bound–finite $\mathcal{H}$, inconsistent case) Let $\mathcal{H}$ be a finite hypothesis
set. Then, for any $\delta > 0$, the following expression holds with a probability of at
least $1 - \delta$:
$$\forall h \in \mathcal{H}, \quad R(h) \le \hat{R}_S(h) + \sqrt{\frac{\log |\mathcal{H}| + \log\frac{2}{\delta}}{2m}}. \qquad (2.15)$$

Proof We denote the elements of $\mathcal{H}$ by $h_1, \ldots, h_{|\mathcal{H}|}$. Applying the union bound and
Corollary 2.2 to each hypothesis yields
$$\Pr\left[\exists h \in \mathcal{H}: \left|R(h) - \hat{R}_S(h)\right| > \epsilon\right]
= \Pr\left[\left|\hat{R}_S(h_1) - R(h_1)\right| > \epsilon \vee \cdots \vee \left|\hat{R}_S(h_{|\mathcal{H}|}) - R(h_{|\mathcal{H}|})\right| > \epsilon\right]$$
$$\le \sum_{h \in \mathcal{H}} \Pr\left[\left|\hat{R}_S(h) - R(h)\right| > \epsilon\right] \le 2|\mathcal{H}|\exp\left(-2m\epsilon^2\right).$$
Setting the right-hand side equal to $\delta$ completes the proof. $\square$

Therefore, for a finite hypothesis set $\mathcal{H}$, we have
$$R(h) - \hat{R}_S(h) \le O\left(\sqrt{\frac{\log_2 |\mathcal{H}|}{m}}\right).$$

A larger sample size m provides a better generalization guarantee, and the bound
increases logarithmically with |H|. However, the bound is a less favourable function
of log2 |H|/m, varying with the square root of log2 |H|/m. A quadratically larger
labelled sample is required for a given |H| to obtain the same guarantee as in the
consistent case. The values of the bounds show that a balance is sought between the
empirical error and the size of the hypothesis set. Although a larger hypothesis set is
penalized by the second term, increasing the size of the hypothesis set helps reduce
the first term, i.e., the empirical error. For a similar empirical error, however, the use
of a smaller hypothesis set is recommended; this can be regarded as an example of
the Occam’s razor principle, named after the theologian William of Occam, which
states that the simplest explanation is the best. In this context, this principle can be
understood to mean that when all other things are equal, a hypothesis set of a smaller
size is better.
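To see the quadratic gap numerically (an illustrative calculation with assumed values, not from the text), the sketch below evaluates the inconsistent-case deviation term of (2.15) and the sample size needed to reach a target gap:

```python
import math

def inconsistent_bound(h_size, m, delta):
    """Right-hand side deviation term of (2.15): sqrt((log|H| + log(2/delta)) / (2m))."""
    return math.sqrt((math.log(h_size) + math.log(2.0 / delta)) / (2 * m))

def sample_size_for_gap(h_size, gap, delta):
    """Smallest m making the deviation term at most `gap` (note the 1/gap**2 dependence)."""
    return math.ceil((math.log(h_size) + math.log(2.0 / delta)) / (2 * gap**2))

h_size, delta = 2**20, 0.01
print(inconsistent_bound(h_size, m=1000, delta=delta))     # deviation with m = 1000
print(sample_size_for_gap(h_size, gap=0.05, delta=delta))  # m grows like 1/0.05**2
```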

2.2.3 Insights from Glivenko-Cantelli Theorem

A fundamental result of statistical learning theory states that a concept class is PAC
learnable if and only if it is a uniform Glivenko-Cantelli class. However, the theorem
is valid only under special assumptions regarding the class’s measurability, under which
PAC learnability moreover coincides with consistent PAC learnability. Otherwise, a classical example
can be constructed, under the continuum hypothesis developed by Dudley and Durst

and further adapted by Blumer, Ehrenfeucht, Haussler, and Warmuth, of a concept


class of VC dimension one that is neither uniformly Glivenko-Cantelli nor consis-
tently PAC learnable. We show that, under an additional set-theoretic hypothesis that
is much milder than the continuum hypothesis (Martin’s axiom), PAC learnability is
equivalent to a finite VC dimension for every concept class.
A class $\mathcal{F}$ of measurable functions on a measurable space $(\mathcal{X}, \mathcal{A})$ is uniformly
Glivenko-Cantelli in $P \in \mathcal{P}$ for a class $\mathcal{P}$ of probability measures on $(\mathcal{X}, \mathcal{A})$ if
$$\sup_{P \in \mathcal{P}} P\left(\sup_{m \ge a} \|P_m - P\|_{\mathcal{F}} > \varepsilon\right) \to 0, \quad \forall \varepsilon > 0, \ \text{as } a \to \infty.$$

Theorem 2.6 For a concept class $\mathcal{F}$, the following two conditions are equivalent:
1. $\mathcal{F}$ is distribution-free PAC learnable;
2. $\mathcal{F}$ is a uniform Glivenko-Cantelli class.

2.3 PAC-Bayesian Learning

PAC-Bayesian theory was initially developed by McAllester as an attempt to explain


Bayesian learning from a learning theory perspective (McAllester 1999a).
Let $\mathcal{X} \subset \mathbb{R}^p$ be the input space and $\mathcal{Y} = \{-1, 1\}$ be the response space, respectively.
We use $\mathcal{D}$ to denote the unknown joint distribution on $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$. Suppose
that the estimators in the hypothesis space $\mathcal{H}$ can be indexed by a parameter set $\Theta$,
i.e., $\mathcal{H} = \{h_\theta : \theta \in \Theta\}$. Let $\mathcal{Q}$ be the space of probability distributions on $\Theta$. The
output of an algorithm here refers to a distribution $Q \in \mathcal{Q}$ (named the posterior distribution)
instead of a single estimator $h_\theta \in \mathcal{H}$. Here, we consider only a $\{0, 1\}$-valued
loss function $l(h_\theta, z)$, i.e., $l(h_\theta, z) = 1$ if $h_\theta(x) \ne y$ and $l(h_\theta, z) = 0$ otherwise. We
present the definitions of the expected risk and the empirical risk as follows:
Definition 2.10 Suppose that there exists a probability distribution $Q \in \mathcal{Q}$ on $\Theta$.
Given a sample $S = \{(x_i, y_i)\}_{i=1}^{m}$, the expected and empirical risks are
$$R(Q) = \mathbb{E}_{z \sim \mathcal{D}}\,\mathbb{E}_{\theta \sim Q}[l(h_\theta, z)]$$
and
$$\hat{R}_S(Q) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{\theta \sim Q}[l(h_\theta, z_i)].$$

To establish the PAC-Bayesian bound, we introduce the following definition for the
Kullback-Leibler divergence.
Definition 2.11 Given two probability measures $D, G \in \mathcal{Q}$, the Kullback-Leibler
(KL) divergence between $D$ and $G$ is
$$\mathrm{KL}(D \| G) = \int D(\theta) \log\left(\frac{D(\theta)}{G(\theta)}\right) d\theta.$$
In particular, if the space $\Theta$ is finite, then we have
$$\mathrm{KL}(D \| G) = \sum_{\theta \in \Theta} D(\theta) \log\left(\frac{D(\theta)}{G(\theta)}\right).$$
Note that for any probability measures $D$ and $G$, $\mathrm{KL}(D \| G) \ge 0$, where the equality
holds if and only if $D = G$.
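A minimal sketch of the finite-case KL divergence (illustrative distributions; not part of the text):

```python
import math

def kl_divergence(d, g):
    """KL(D || G) = sum_theta D(theta) * log(D(theta) / G(theta)) over a finite Theta."""
    return sum(p * math.log(p / q) for p, q in zip(d, g) if p > 0)

D = [0.5, 0.3, 0.2]
G = [1/3, 1/3, 1/3]
print(kl_divergence(D, G))  # > 0
print(kl_divergence(D, D))  # = 0, attained iff the two distributions coincide
```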
In this section, we will focus on a bound as proposed in (Catoni 2007). Let us fix
a probability measure π ∈ Q, which is usually called the prior.

Theorem 2.7 We assume that $l(h_\theta, z) \le C$, $\forall h_\theta \in \mathcal{H}$ and $\forall z \in \mathcal{Z}$. For any $\lambda > 0$
and $\delta \in (0, 1)$, it holds that
$$P\left[\mathbb{E}_{\theta \sim Q} R(\theta) \le \mathbb{E}_{\theta \sim Q} \hat{R}_S(\theta) + \frac{\lambda C^2}{8m} + \frac{\mathrm{KL}(Q \| \pi) + \log(1/\delta)}{\lambda}\right] \ge 1 - \delta, \quad \forall Q \in \mathcal{Q}.$$

Proof We first recall Donsker and Varadhan’s variational formula (Donsker and
Varadhan 1976). For any measurable and bounded function $g: \Theta \to \mathbb{R}$, we have
$$\log \mathbb{E}_{\theta \sim \pi}\left[e^{g(\theta)}\right] = \sup_{Q \in \mathcal{Q}}\left[\mathbb{E}_{\vartheta \sim Q}[g(\vartheta)] - \mathrm{KL}(Q \| \pi)\right]. \qquad (2.16)$$
Moreover, given a function $g$, the supremum with respect to $Q$ on the right-hand side
is reached by the Gibbs measure $\pi_g$, whose density with respect to $\pi$ is
$$\frac{d\pi_g}{d\pi}(\theta) = \frac{e^{g(\theta)}}{\mathbb{E}_{\vartheta \sim \pi}\left[e^{g(\vartheta)}\right]}.$$

With the help of Hoeffding’s lemma, for a fixed $h_\theta \in \mathcal{H}$ and any $t > 0$, we have
$$\mathbb{E}_S\left[e^{tm[R(\theta) - \hat{R}_S(\theta)]}\right] \le e^{\frac{mt^2C^2}{8}}.$$
By taking $t = \lambda/m$, we obtain
$$\mathbb{E}_S\left[e^{\lambda[R(\theta) - \hat{R}_S(\theta)]}\right] \le e^{\frac{\lambda^2 C^2}{8m}}.$$
Clearly, integrating this bound with respect to $\pi$ yields
$$\mathbb{E}_{\theta \sim \pi}\,\mathbb{E}_S\left[e^{\lambda[R(\theta) - \hat{R}_S(\theta)]}\right] \le e^{\frac{\lambda^2 C^2}{8m}}.$$

We can exchange the integrations with respect to $\pi$ and the sample $S$:
$$\mathbb{E}_S\,\mathbb{E}_{\theta \sim \pi}\left[e^{\lambda[R(\theta) - \hat{R}_S(\theta)]}\right] \le e^{\frac{\lambda^2 C^2}{8m}}.$$
Donsker and Varadhan’s variational formula gives
$$\mathbb{E}_S\left[e^{\sup_{Q \in \mathcal{Q}} \lambda\mathbb{E}_{\theta \sim Q}[R(\theta) - \hat{R}_S(\theta)] - \mathrm{KL}(Q \| \pi)}\right] \le e^{\frac{\lambda^2 C^2}{8m}}.$$
Direct computation shows that
$$\mathbb{E}_S\left[e^{\sup_{Q \in \mathcal{Q}} \lambda\mathbb{E}_{\theta \sim Q}[R(\theta) - \hat{R}_S(\theta)] - \mathrm{KL}(Q \| \pi) - \frac{\lambda^2 C^2}{8m}}\right] \le 1.$$
The following result is obtained using the Chernoff bound:
$$P\left[\sup_{Q \in \mathcal{Q}} \lambda\mathbb{E}_{\theta \sim Q}[R(\theta) - \hat{R}_S(\theta)] - \mathrm{KL}(Q \| \pi) - \frac{\lambda^2 C^2}{8m} > s\right]
\le \mathbb{E}_S\left[e^{\sup_{Q \in \mathcal{Q}} \lambda\mathbb{E}_{\theta \sim Q}[R(\theta) - \hat{R}_S(\theta)] - \mathrm{KL}(Q \| \pi) - \frac{\lambda^2 C^2}{8m}}\right] e^{-s} \le e^{-s}.$$
By taking $s = \log(1/\delta)$, we deduce that
$$P\left[\sup_{Q \in \mathcal{Q}} \lambda\mathbb{E}_{\theta \sim Q}[R(\theta) - \hat{R}_S(\theta)] - \mathrm{KL}(Q \| \pi) - \frac{\lambda^2 C^2}{8m} > \log(1/\delta)\right] \le \delta.$$
We complete the proof by rearranging the above expression. $\square$


The above error bound motivates the study of a data-dependent learning problem
defined as
$$\hat{Q} = \operatorname*{arg\,min}_{Q \in \mathcal{Q}}\left\{\mathbb{E}_{\theta \sim Q}\hat{R}_S(\theta) + \frac{\mathrm{KL}(Q \| \pi)}{\lambda}\right\}.$$
Equivalently, this optimization problem can be rewritten as
$$\hat{Q} = \operatorname*{arg\,max}_{Q \in \mathcal{Q}}\left\{-\lambda\mathbb{E}_{\theta \sim Q}\hat{R}_S(\theta) - \mathrm{KL}(Q \| \pi)\right\}.$$

According to Donsker and Varadhan’s variational formula (2.16), the minimum is
reached by the Gibbs measure $\hat{Q} = \pi_{-\lambda\hat{R}_S}$.
Corollary 2.3 Let $\hat{Q}$ be the Gibbs measure $\hat{Q} = \pi_{-\lambda\hat{R}_S}$. For any $\lambda > 0$ and $\delta \in (0, 1)$,
$$P\left[\mathbb{E}_{\theta \sim \hat{Q}} R(\theta) \le \inf_{Q \in \mathcal{Q}}\left\{\mathbb{E}_{\theta \sim Q}\hat{R}_S(\theta) + \frac{\lambda C^2}{8m} + \frac{\mathrm{KL}(Q \| \pi) + \log(1/\delta)}{\lambda}\right\}\right] \ge 1 - \delta.$$

We will present two examples related to the general PAC-Bayes bound above.

Example 2.1 (Finite case) Let us consider a special case in which the set $\Theta$ is finite.
In such a case, the Gibbs posterior $\hat{Q}$ is a probability distribution on the finite set $\Theta$ given by
$$\hat{Q}(\theta) = \frac{e^{-\lambda\hat{R}_S(\theta)}\pi(\theta)}{\sum_{\vartheta \in \Theta} e^{-\lambda\hat{R}_S(\vartheta)}\pi(\vartheta)},$$
and, with a probability of at least $1 - \delta$, we have
$$\mathbb{E}_{\theta \sim \hat{Q}} R(\theta) \le \inf_{Q \in \mathcal{Q}}\left\{\mathbb{E}_{\vartheta \sim Q}\hat{R}_S(\vartheta) + \frac{\lambda C^2}{8m} + \frac{\mathrm{KL}(Q \| \pi) + \log(1/\delta)}{\lambda}\right\}.$$
The above upper bound holds for all $Q \in \mathcal{Q}$. Thus, it holds for the $Q$ in the set of
Dirac masses $\{\delta_\theta, \theta \in \Theta\}$. Then, we have $\mathbb{E}_{\vartheta \sim \delta_\theta}\hat{R}_S(\vartheta) = \hat{R}_S(\theta)$ and
$$\mathrm{KL}(\delta_\theta \| \pi) = \sum_{\vartheta \in \Theta} \delta_\theta(\vartheta)\log\frac{\delta_\theta(\vartheta)}{\pi(\vartheta)} = \log\frac{1}{\pi(\theta)}.$$

Furthermore, if the prior $\pi$ is the uniform distribution on $\Theta$, then
$\log(1/\pi(\theta)) = \log(|\Theta|)$, and the bound becomes
$$P\left[\mathbb{E}_{\theta \sim \hat{Q}} R(\theta) \le \inf_{\theta \in \Theta}\hat{R}_S(\theta) + \frac{\lambda C^2}{8m} + \frac{\log|\Theta| + \log(1/\delta)}{\lambda}\right] \ge 1 - \delta.$$
By taking $\lambda = \sqrt{8m\left(\log|\Theta| + \log(1/\delta)\right)}/C$, we further obtain
$$P\left[\mathbb{E}_{\theta \sim \hat{Q}} R(\theta) \le \inf_{\theta \in \Theta}\hat{R}_S(\theta) + C\sqrt{\frac{\log|\Theta| + \log(1/\delta)}{2m}}\right] \ge 1 - \delta.$$
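A small sketch of the finite-case Gibbs posterior $\hat{Q}(\theta) \propto e^{-\lambda \hat{R}_S(\theta)}\pi(\theta)$ (the empirical risks, prior, and $\lambda$ below are illustrative assumptions):

```python
import numpy as np

def gibbs_posterior(empirical_risks, prior, lam):
    """Q_hat(theta) proportional to exp(-lam * R_hat_S(theta)) * pi(theta) on a finite Theta."""
    weights = np.exp(-lam * np.asarray(empirical_risks)) * np.asarray(prior)
    return weights / weights.sum()

# Four hypotheses with a uniform prior; lower empirical risk gets more posterior mass.
risks = [0.40, 0.25, 0.10, 0.35]
prior = [0.25, 0.25, 0.25, 0.25]
print(gibbs_posterior(risks, prior, lam=10.0))
# As lam grows, the posterior concentrates on argmin R_hat_S; as lam -> 0 it returns the prior.
```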

Example 2.2 (Lipschitz loss and Gaussian priors) We assume that $l(h_\theta, z) \le C$,
$\forall h_\theta \in \mathcal{H}$ and $\forall z \in \mathcal{Z}$. Let the loss function $l(h_\theta, z)$ be $L$-Lipschitz in $\theta$ for any $z \in \mathcal{Z}$,
and let the prior $\pi$ be the centred Gaussian distribution $\pi = \mathcal{N}(0, \sigma^2 I_{\dim(\theta)})$.
Then, for any $\delta \in (0, 1)$, with a probability of at least $1 - \delta$, it holds that
$$\mathbb{E}_{\theta \sim \hat{Q}} R(\theta) \le \inf_{\gamma \in \Theta:\,\|\gamma\| \le B}\left\{\hat{R}_S(\gamma) + L\sigma\sqrt{\frac{\dim(\gamma)}{m}} + C\sqrt{\frac{\frac{B^2}{2\sigma^2} + \frac{\dim(\gamma)}{2}\log m + \log\frac{1}{\delta}}{2m}}\right\}.$$

Proof Let the posterior $\hat{Q}$ be a Gibbs measure. For any $\delta \in (0, 1)$, it holds that
$$\mathbb{E}_{\theta \sim \hat{Q}} R(\theta) \le \inf_{Q \in \mathcal{Q}}\left\{\mathbb{E}_{\vartheta \sim Q}\hat{R}_S(\vartheta) + \frac{\lambda C^2}{8m} + \frac{\mathrm{KL}(Q \| \pi) + \log(1/\delta)}{\lambda}\right\}$$
with a probability of at least $1 - \delta$. If the probability distribution $Q \in \mathcal{Q}$ is restricted
to Gaussian distributions of the form $\mathcal{N}(\gamma, s^2 I_{\dim(\theta)})$, where $\gamma \in \mathbb{R}^{\dim(\theta)}$, we have
$$\mathbb{E}_{\theta \sim \hat{Q}} R(\theta) \le \inf_{Q = \mathcal{N}(\gamma, s^2 I_{\dim(\theta)})}\left\{\mathbb{E}_{\vartheta \sim Q}\hat{R}_S(\vartheta) + \frac{\lambda C^2}{8m} + \frac{\mathrm{KL}(Q \| \pi) + \log(1/\delta)}{\lambda}\right\}.$$
When $Q = \mathcal{N}(\gamma, s^2 I_{\dim(\theta)})$ and $\pi = \mathcal{N}(0, \sigma^2 I_{\dim(\theta)})$, we obtain
$$\mathrm{KL}(Q \| \pi) = \frac{\|\gamma\|_2^2}{2\sigma^2} + \frac{\dim(\theta)}{2}\left[\frac{s^2}{\sigma^2} + \log\frac{\sigma^2}{s^2} - 1\right].$$
Moreover, according to Jensen’s inequality, we obtain
$$\mathbb{E}_{\vartheta \sim Q}\hat{R}_S(\vartheta) \le \hat{R}_S(\gamma) + L\,\mathbb{E}_{\theta \sim Q}[\|\theta - \gamma\|] \le \hat{R}_S(\gamma) + L\sqrt{\mathbb{E}_{\theta \sim Q}[\|\theta - \gamma\|^2]} = \hat{R}_S(\gamma) + Ls\sqrt{\dim(\theta)}.$$
Combining the above results, we have
$$\mathbb{E}_{\theta \sim \hat{Q}} R(\theta) \le \inf_{\gamma}\left\{\hat{R}_S(\gamma) + Ls\sqrt{\dim(\gamma)} + \frac{\lambda C^2}{8m} + \frac{\frac{\|\gamma\|_2^2}{2\sigma^2} + \frac{\dim(\gamma)}{2}\left[\frac{s^2}{\sigma^2} + \log\frac{\sigma^2}{s^2} - 1\right] + \log\frac{1}{\delta}}{\lambda}\right\}.$$
Taking $s = \sigma/\sqrt{m}$ yields
$$\mathbb{E}_{\theta \sim \hat{Q}} R(\theta) \le \inf_{\gamma}\left\{\hat{R}_S(\gamma) + L\sigma\sqrt{\frac{\dim(\gamma)}{m}} + \frac{\lambda C^2}{8m} + \frac{\frac{\|\gamma\|_2^2}{2\sigma^2} + \frac{\dim(\gamma)}{2}\left[\frac{1}{m} - 1 + \log m\right] + \log\frac{1}{\delta}}{\lambda}\right\}.$$
By further assuming that $\|\gamma\| \le B$ for some $B > 0$, we obtain
$$\mathbb{E}_{\theta \sim \hat{Q}} R(\theta) \le \inf_{\gamma:\,\|\gamma\| \le B}\left\{\hat{R}_S(\gamma) + L\sigma\sqrt{\frac{\dim(\gamma)}{m}} + \frac{\lambda C^2}{8m} + \frac{\frac{B^2}{2\sigma^2} + \frac{\dim(\gamma)}{2}\left[\frac{1}{m} - 1 + \log m\right] + \log\frac{1}{\delta}}{\lambda}\right\}.$$
In this case, we can see that the optimal $\lambda$ is $\frac{1}{C}\sqrt{8m\left(\frac{B^2}{2\sigma^2} + \frac{\dim(\gamma)}{2}\log m + \log\frac{1}{\delta}\right)}$,
which indicates that, with a probability of at least $1 - \delta$, it holds that
$$\mathbb{E}_{\theta \sim \hat{Q}} R(\theta) \le \inf_{\gamma:\,\|\gamma\| \le B}\left\{\hat{R}_S(\gamma) + L\sigma\sqrt{\frac{\dim(\gamma)}{m}} + C\sqrt{\frac{\frac{B^2}{2\sigma^2} + \frac{\dim(\gamma)}{2}\log m + \log\frac{1}{\delta}}{2m}}\right\}.$$
This completes the proof. $\square$



References

McAllester, David A. 1999. PAC-Bayesian model averaging. In Annual Conference on Learning Theory, 164–170.
Donsker, M. D., and S. R. S. Varadhan. 1976. On the principal eigenvalue of second-order elliptic differential operators. Communications on Pure and Applied Mathematics 29 (6): 595–621.
Stein, Elias M., and Rami Shakarchi. 2009. Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton University Press.
Wellner, Jon, et al. 2013. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media.
Catoni, Olivier. 2007. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. arXiv preprint arXiv:0712.0248.
Chapter 3
Hypothesis Complexity

This chapter presents the hypothesis complexity concept frequently used in sta-
tistical learning theory, which is important for deriving generalization bounds for
deep learning models. The hypothesis complexity characterizes the complexity
of a machine learning algorithm and can be measured in terms of the Vapnik-
Chervonenkis (VC) dimension, Rademacher complexity, and covering number. Intu-
itively, a more complex algorithm has worse generalizability. In this way, we may
study the generalizability via the hypothesis complexity.

3.1 Worst-Case Bounds Based on the Rademacher


Complexity

A major measure for evaluating the hypothesis complexity in conventional statisti-


cal learning theory is the Rademacher complexity (Bartlett and Mendelson 2002),
defined below.

Definition 3.1 (Empirical Rademacher complexity and Rademacher complexity)
For a real-valued function class denoted by $\mathcal{H}$ and a dataset $S$, the empirical
Rademacher complexity is defined as follows:
$$\hat{R}(\mathcal{H}) = \mathbb{E}_{\epsilon}\left[\sup_{h \in \mathcal{H}}\frac{1}{m}\sum_{i=1}^{m}\epsilon_i h(x_i)\right], \qquad (3.1)$$
where $\epsilon = \{\epsilon_1, \ldots, \epsilon_m\}$ and the $\epsilon_i$, $i = 1, \ldots, m$, are independently and uniformly
drawn from $\{-1, +1\}$. In turn, the Rademacher complexity can be defined as follows:
$$R(\mathcal{H}) = \mathbb{E}_S\,\hat{R}(\mathcal{H}). \qquad (3.2)$$
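The empirical Rademacher complexity of a small, finite hypothesis class can be estimated directly by Monte Carlo (a toy illustration with assumed hypotheses and data, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(4)

def empirical_rademacher(hypothesis_outputs, n_draws=5000):
    """Estimate E_eps[ sup_h (1/m) sum_i eps_i * h(x_i) ] for a finite class.

    `hypothesis_outputs` has shape (num_hypotheses, m): row k holds h_k(x_1..x_m).
    """
    H = np.asarray(hypothesis_outputs)
    m = H.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, m))  # Rademacher variables
    correlations = eps @ H.T / m                       # shape (n_draws, num_hypotheses)
    return correlations.max(axis=1).mean()             # sup over h, then average over eps

# Three threshold classifiers evaluated on m = 20 scalar points, outputs in {-1, +1}.
x = rng.standard_normal(20)
outputs = np.stack([np.sign(x - t) for t in (-0.5, 0.0, 0.5)])
print(empirical_rademacher(outputs))
```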


One can obtain uniform generalization bounds via the Rademacher complexity.
Theorem 3.1 (see Mohri et al. (2018)) Let $\mathcal{H}$ be a family of functions taking values
in $\{-1, +1\}$, and let $\mathcal{D}$ be the distribution over the input space $\mathcal{X}$. Then, for any $\delta > 0$,
with probability at least $1 - \delta$ over samples $S$ of size $m$ drawn according to $\mathcal{D}^m$, the
following holds for any $h \in \mathcal{H}$:
$$R(h) \le \hat{R}_S(h) + R(\mathcal{H}) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \qquad (3.3)$$
Proof For any sample $S = (x_1, \ldots, x_m)$, we define a function $D$ as follows:
$$D(S) = \sup_{h \in \mathcal{H}}\left(R(h) - \hat{R}_S(h)\right). \qquad (3.4)$$

If $S$ and $S'$ differ in only one point, we have
$$D(S) - D(S') \le \sup_{h \in \mathcal{H}}\left(\hat{R}_{S'}(h) - \hat{R}_S(h)\right) \le \frac{1}{m}. \qquad (3.5)$$
Then, by McDiarmid’s inequality, for any $\delta > 0$, with probability at least $1 - \delta$, the
following holds:
$$D(S) \le \mathbb{E}_S[D(S)] + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \qquad (3.6)$$

In addition, the expectation on the right-hand side can be bounded as follows:
$$\mathbb{E}_S[D(S)] = \mathbb{E}_S\left[\sup_{h \in \mathcal{H}}\left(R(h) - \hat{R}_S(h)\right)\right]
\le \mathbb{E}_{S, S'}\left[\sup_{h \in \mathcal{H}}\left(\hat{R}_{S'}(h) - \hat{R}_S(h)\right)\right]$$
$$= \mathbb{E}_{\sigma, S, S'}\left[\sup_{h \in \mathcal{H}}\frac{1}{m}\sum_{i=1}^{m}\sigma_i\left(\frac{1 - y_i' h(x_i')}{2} - \frac{1 - y_i h(x_i)}{2}\right)\right]$$
$$\le \mathbb{E}_{\sigma, S'}\left[\sup_{h \in \mathcal{H}}\frac{1}{m}\sum_{i=1}^{m}\sigma_i\frac{h(x_i')}{2}\right] + \mathbb{E}_{\sigma, S}\left[\sup_{h \in \mathcal{H}}\frac{1}{m}\sum_{i=1}^{m}\sigma_i\frac{h(x_i)}{2}\right] = R(\mathcal{H}). \qquad (3.7)$$
As a result,
$$R(h) \le \hat{R}_S(h) + R(\mathcal{H}) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \qquad (3.8)$$
The proof is complete. $\square$



Moreover, an important bound on $R(\mathcal{H})$ is given by Massart’s lemma.
Lemma 3.1 (Massart’s lemma) Let $G \subset \mathbb{R}^m$ be a finite set, with $r = \max_{x \in G}\|x\|_2$;
then, the following holds:
$$\mathbb{E}_{\sigma}\left[\frac{1}{m}\sup_{x \in G}\sum_{i=1}^{m}\sigma_i x_i\right] \le \frac{r\sqrt{2\log|G|}}{m}, \qquad (3.9)$$
where the $\sigma_i$ are independent uniform random variables taking values in $\{-1, +1\}$
and $x_1, \ldots, x_m$ are the components of the vector $x$.
Proof Let $t > 0$ be a number to be chosen later. Then,
$$\exp\left(\mathbb{E}_{\sigma}\sup_{x \in G}\sum_{i=1}^{m}t\sigma_i x_i\right) \le \mathbb{E}_{\sigma}\exp\left(\sup_{x \in G}\sum_{i=1}^{m}t\sigma_i x_i\right) = \mathbb{E}_{\sigma}\sup_{x \in G}\exp\left(\sum_{i=1}^{m}t\sigma_i x_i\right)$$
$$\le \sum_{x \in G}\mathbb{E}_{\sigma}\exp\left(\sum_{i=1}^{m}t\sigma_i x_i\right) = \sum_{x \in G}\prod_{i=1}^{m}\mathbb{E}_{\sigma}\left[\exp(t\sigma_i x_i)\right]$$
$$\le \sum_{x \in G}\prod_{i=1}^{m}\exp\left(\frac{(t x_i)^2}{2}\right) = \sum_{x \in G}\exp\left(\sum_{i=1}^{m}\frac{(t x_i)^2}{2}\right) \le \sum_{x \in G}\exp\left(\frac{t^2 r^2}{2}\right) = |G|\exp\left(\frac{t^2 r^2}{2}\right).$$
The first inequality comes from Jensen’s inequality, and the third inequality follows from
Hoeffding’s lemma. By the result above, we have
$$\mathbb{E}_{\sigma}\sup_{x \in G}\sum_{i=1}^{m}\sigma_i x_i \le \frac{\log|G|}{t} + \frac{t r^2}{2}. \qquad (3.10)$$
Letting $t = \sqrt{2\log|G|}/r$, we obtain the claimed result.
The proof is complete. $\square$
One can prove a margin bound on the basis of the covering number, defined as
follow, or the Rademacher complexity of the hypothesis space.
Definition 3.2 (Covering number) The covering number N(H, ,  · ) of a space
H is defined as the minimal cardinality of any subset V ⊂ H that covers H at scale
 under metric  · , i.e.,
sup min A − B ≤ . (3.11)
A∈H B∈V

Intuitively, the covering number reflects the richness of the hypothesis space. A
larger covering number indicates a space with more balls needed to represent it,
suggesting a potentially more complex set of hypotheses.
42 3 Hypothesis Complexity

The covering number indicates how many balls are needed to cover the hypoth-
esis space, whereas the Rademacher complexity measures the maximal correlation
between a hypothesis and noise and thus characterizes the “goodness-of-fit” of the
hypothesis space to noise. A higher Rademacher complexity suggests that hypothe-
ses in the class might be overly susceptible to noise, leading to poorer generalization
performance on unseen data.
The relationship between hypothesis complexity and these measures becomes
evident when we consider techniques like the Dudley entropy integral.

Theorem 3.2 (Dudley entropy integral) Let H be a real-valued function class taking
values in [0, 1], and assume that 0 ∈ H. Then,
  √ 
4α 12 m 
R(H) ≤ inf √ + log N(H, ,  · 2 )d . (3.12)
α>0 m m α


Proof Let N be an arbitrary positive integer in N, and let i = m2−(i−1) for each
i ∈ [N ]. Let Bi denote the cover achieving N(H, ,  · 2 ). By the definition of the
covering number, for any f ∈ H, there exists a bi [ f ] ∈ Bi such that

 f − bi [ f ]2 ≤ i .

Moreover, we have the following:


 
m
Eσ sup σt f (xt )
f ∈H t=1

m N −1 
 m 
m
=Eσ sup σt ( f (xt ) − btN [ f ]) + σt (bti+1 [ f]− bti [ f ]) + σt bt1 [ f ]
f ∈H t=1 i=1 t=1 t=1

m N −1
 
m
≤Eσ sup σt ( f (xt ) − btN [ f ]) + Eσ sup σt (bti+1 [ f ] − bti [ f ])
f ∈H t=1 i=1 f ∈H t=1

m
+ Eσ sup σt bt1 [ f ] . (Triangle inequality)
f ∈H t=1

Let b1 = 0; then, the third term becomes 0. For the first term, we can use the Cauchy-
Schwarz inequality to obtain

   m 
m
  
m
Eσ sup N 
σt ( f (xt ) − bt [ f ]) ≤ Eσ σt Eσ sup
2
( f (xt ) − btN [ f ])2
f ∈H t=1 t=1 f ∈H t=1

≤ m N .
3.2 Worst-Case Bounds Based on the Vapnik-Chervonenkis (VC) Dimension 43

Let Wi = {bti+1 [ f ] − bti [ f ] : f ∈ H}. Then, we have

|Wi | ≤ |Bi ||Bi+1 | ≤ |Bi+1 |2

and

sup wi 2 ≤ sup  f − bi [ f ]2 + sup  f − bi+1 [ f ]2 ≤ i + i+1 ≤ 3i+1 .


wi ∈Wi f ∈H f ∈H

Thus, by Massart’s lemma (Lemma 3.1), we have


 
m m

Eσ sup σt (bti+1 [ f]− bti [ f ]) ≤ Eσ sup σt wt ≤ 6 log |Bi+1 |i+1 .
f ∈H t=1 wi ∈Wi t=1

Collecting all terms, we have


  N −1

m
√ 
Eσ sup σt f (xt ) ≤ m N + 6 log |Bi+1 |i+1
f ∈H t=1 i=1

√ 
N

≤ m N + 12 (i − i+1 ) log N(H, i ,  · 2 )
i=1
 √

√ m
≤ m N + 12 log N(H, ,  · 2 )d.
 N +1

Finally, for any α > 0, let  N +1 > α and  N ≤ 4α for some N ; then,
 √
m 
4α 12
R(H) ≤ √ + log N(H, ,  · 2 )d, (3.13)
m m α

which completes the proof. 

3.2 Worst-Case Bounds Based on the Vapnik-Chervonenkis


(VC) Dimension

Another major measure of hypothesis complexity is the VC dimension (Vapnik and


Chervonenkis 2015), which is defined as follows.

Definition 3.3 (Growth function, shattering, and Vapnik-Chervonenkis (VC)


dimension) For any nonnegative integer m, the growth function of a hypothesis
space H is defined as follows:

H (m) = max |{(h(x1 ), . . . , h(xm )) : h ∈ H}|. (3.14)


x1 ,...,xm ∈X
44 3 Hypothesis Complexity

If H (m) = 2m , we say that H shatters the dataset {x1 , . . . , xm }. We define the VC


dimension VCdim(H) as the size of the largest shattered set.

We can obtain the following theorem regarding the relationship between the
growth function and the VC dimension.

Theorem 3.3 (Sauer’s lemma; Shelah (1972); Sauer (1972)) Let H be a hypothesis
set with VCdim(H) = d. Then, for all m ∈ N, the following inequality holds:

d
m
H (m) ≤ . (3.15)
i=0
i

Proof We will prove this theorem by induction on m + d.


First, the statement obviously holds for m = 1 and d = 0 or d = 1.
Second, we assume that the statement holds for (m − 1, d − 1) and (m − 1, d).
Then, we will show that the statement is also true for (m, d). Let S = {x1 , . . . , xm }
be a set with H (m) dichotomies, and let G = H|S be the set of concepts H induced
by restriction to S, that is, G = {{xi : h(xi ) = 1} : h ∈ H}. Thus, H (m) = |G|.
To use our assumptions, similarly, we consider S = {x1 , . . . , xm−1 } and G1 =
H|S1 as the set of concepts induced by restriction to S1 .
Now, we consider the gap between G and G1 . That is, in S , we omit only the
value h(xm ). Based on this observation, we define G2 as

G2 = {g  ⊂ S  : (g  ∈ G) ∧ (g  ∪ {xm } ∈ G)}. (3.16)

This implies that |G| = |G1 | + |G2 |.


Next, we will calculate the scales of G1 and G2 . We can view each point g in G1
as a concept satisfying h(xi ) = 1 if and only if xi ∈ g. In this sense, G1 and G2 can
be viewed as consisting of concepts induced by restriction to S1 .
Note that VCdim(G1 ) ≤ VCdim(G) ≤ d; then, from the assumption for (m −
1, d), we know that

d
m−1
|G1 | ≤ G1 (m − 1) ≤ . (3.17)
i=0
i

Moreover, if Z ⊂ S is shattered by G2 , then the set Z ∪ {xm } is shattered by G.


Hence, we have
VCdim(G2 ) ≤ VCdim(G) − 1 = d − 1, (3.18)

and from the assumption for (m − 1, d − 1),


d−1
m−1
|G2 | ≤ G2 (m − 1) ≤ . (3.19)
i=0
i
3.2 Worst-Case Bounds Based on the Vapnik-Chervonenkis (VC) Dimension 45

Finally, we have

d
m−1 
d−1
m−1 d
m
|G| = |G1 | + |G2 | ≤ + = , (3.20)
i=0
i i=0
i i=0
i

which completes the inductive proof. 


One can obtain uniform generalization bounds in terms of the VC dimension as
follows (Mohri et al. 2018).
Theorem 3.4 (see (Mohri et al. 2018)) Suppose that a hypothesis space H has VC
dimension VCdim(H). Then, for any δ > 0, with probability 1 − δ, the following
inequality holds for any h ∈ H:
 
em
2VCdim(H) log VCdim(H) log 1δ
R(h) ≤ R̂ S (h) + + . (3.21)
m 2m

This theorem is based on two lemmas, as follows.


Lemma 3.2 Let H be a family of functions taking values in {−1, +1}. Then, the
following inequality holds:

2 log H (m)
R(H) ≤ . (3.22)
m

Proof Lemma 3.2 follows from Massart’s lemma (Lemma 3.1). Let the set of vectors
of function values (h(x1 ), . . . , h(xm ))T be denoted by H|S ; then, we have
√  
m 2 log |H|S | 2 log H (m)
R(H) ≤ ES ≤ . (3.23)
m m

The first inequality follows from Massart’s lemma, and the second inequality follows
from the definition of H (m). 
The other lemma is a foundational lemma in combinatorial mathematics and
extremal set theory that was independently proven by (Shelah 1972) and (Sauer
1972).
We now may prove Theorem 3.4.
Proof of Theorem 3.4 Theorem 3.3 yields

  m  m d−i  m d  m d
d m m
m d
H (m) ≤ ≤ = 1+ ≤ ed .
i=0
i i=0
i d d m d
(3.24)
By combining this result with Lemma 3.2 above, we complete the proof of
Theorem 3.4.
46 3 Hypothesis Complexity

References

Mehryar, Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. Foundations of Machine
Learning. MIT press.
Nesterov, Yurii E. 1983. A method for solving the convex programming problem with convergence
rate o (1/k∧ 2). In Dokl. Akad. Nauk Sssr 269: 543–547.
Peter, L Bartlett, and Shahar Mendelson. 2002. Rademacher and Gaussian complexities: Risk
bounds and structural results. Journal of Machine Learning Research 3: 463–482.
Sauer, Norbert. 1972. On the density of families of sets. Journal of Combinatorial Theory, Series
A 13 (1): 145–147.
Shelah, Saharon. 1972. A combinatorial problem; stability and order for models and theories in
infinitary languages. Pacific Journal of Mathematics 41 (1): 247–261.
Vladimir, N Vapnik and A Ya Chervonenkis. 2015. On the uniform convergence of relative
frequencies of events to their probabilities. In Measures of complexity, 11–30. Springer.
Chapter 4
Algorithmic Stability

Recall the classical uniform generalization bounds, which essentially depend on a


measure of the capacity of the hypothesis space H (e.g., the Rademacher complexity
or covering number). This type of upper bound is independent of any specific algo-
rithm because it provides a guarantee for all hypotheses h ∈ H simultaneously. In
contrast to the above approaches, generalization bounds associated with a specific
learning algorithm can be established based on the concept of algorithmic stability,
which refers to the property that the hypothesis output by an algorithm does not
change significantly when one training point is perturbed. In this chapter, we intro-
duce algorithmic stability to characterize the impact of randomness in the training
process on the generalizability of an algorithm.

4.1 Definition of Algorithmic Stability

Algorithm stability is defined in terms of the difference in the output of an algorithm


on a training set S and a perturbed version S i , where S i = (z 1 , ..., z i , ..., z m ) is
obtained by replacing the i-th point of S with a new point z i ∈ Z. We regard a
learning algorithm as a function A(S) : Z m → H that takes the training sample S
as input and produces the hypothesis h ∈ H as output. In this section, we consider
only a deterministic learning algorithm A, which does not depend on the ordering
of the points in the training set.
We introduce several widely used definitions of algorithmic stability below
(Bousquet and Elisseeff 2002).

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025 47
F. He and D. Tao, Foundations of Deep Learning, Machine Learning: Foundations,
Methodologies, and Applications, https://doi.org/10.1007/978-981-16-8233-9_4
48 4 Algorithmic Stability

Definition 4.1 (Hypothesis stability) An algorithm A has hypothesis stability β with


respect to the loss function l if the following holds:
∀i ∈ {1, . . . , m}, E S [|l(A(S), z i ) − l(A(S \i ), z i )|] ≤ β, (4.1)

where S \i = {z 1 , .., z i−1 , z i+1 , ..., z m }.


Definition 4.2 (Pointwise hypothesis stability) An algorithm A has pointwise
hypothesis stability β with respect to the loss function l if the following holds:

∀i ∈ {1, . . . , m}, E S [|l(A(S), z i ) − l(A(S \i ), z i )|] ≤ β. (4.2)

Definition 4.3 (Error stability) An algorithm A has error stability β with respect to
the loss function l if the following holds:

∀S ∈ Z m , ∀i ∈ {1, . . . , m}, |Ez [l(A(S), z)] − Ez [l(A(S \i ), z)]| ≤ β. (4.3)

Definition 4.4 (Uniform stability) Given a training sample S = {(xi , yi )}i=1


m
of size
m 
m, an algorithm A is β-uniformly stable if ∀m ∈ N, i ∈ [m], S ∈ Z , z i ∈ Z, we
have

sup[|l(A(S), z) − l(A(S i ), z)|] ≤ β.


z∈Z

Definition 4.5 (Classification stability) A real-valued classification algorithm A has


classification stability β if the following holds:

∀S ∈ Z m , ∀i ∈ {1, . . . , m}, A(S)(z) − A(S \i )(z) ≤ β. (4.4)

4.2 Algorithmic Stability and Generalization Error Bounds

Given that an algorithm A has uniform stability β, an upper bound on its


generalization error can be established with the help of McDiarmid’s inequality.
Definition 4.6 (McDiarmid’s inequality) Suppose that a function f : Z m → R
satisfies sup S,Si | f (S) − f (S i )| ≤ ci , i = 1, ..., m; then, the following statement
holds:
2 2
P(| f (S) − E S [ f (S)]| > ) ≤ 2 exp(− m 2 ).
i=1 ci

We can then prove the following polynomial generalization bound in terms of the
hypothesis stability (Bousquet and Elisseeff 2002).
Theorem 4.1 (Polynomial generalization bound in terms of hypothesis stability)
For a learning algorithm A with hypothesis stability β1 and pointwise hypothesis
4.2 Algorithmic Stability and Generalization Error Bounds 49

stability β2 with respect to a loss function l such that 0 ≤ l(y, y  ) ≤ M, we have,


with probability 1 − δ,

M 2 + 12Mmβ2
R(A(S)) ≤ R̂ S (A(S)) + (4.5)
2mδ

and 
\i M 2 + 6Mmβ1
R(A(S)) ≤ R̂ S (A(S )) + . (4.6)
2mδ

Proof Note that

ES,zi [|l(A(S), z i ) − l(A(S i ), z i )|] ≤ES [|l(A(S), z i ) − l(A(S \i ), z i )|]


+ E[|l(A(S \i ), z i ) − l(A(S i ), z i )|]
≤2β2 .

Thus, we have

M2
E[(R(A(S)) − R̂ S (A(S))2 ] ≤ + 3MES,zi [|l(A(S), z i ) − l(A(S i ), z i )|]
2m
M2
= + 6Mβ2
2m
and

M2
E[(R(A(S)) − R̂ S (A(S \i )))2 ] ≤ + 3ME S,z [|l(A(S), z) − l(A(S \i ), z)|]
2m
M2
= + 3Mβ1 .
2m

Finally, by Chebyshev’s inequality P[X ≥ δ] ≤ E[X 2 ]/δ 2 , we have that, with


probability at least 1 − δ, 
E[X 2 ]
X≤ . (4.7)
δ
By combining this result with

X = R(A(S)) − R̂ S (A(S)) (4.8)

and
X = R(A(S)) − R̂ S (A(S \i )), (4.9)

we obtain Theorem (4.1).


The proof is complete. 
50 4 Algorithmic Stability

We introduce a modified cost function




⎨1 if f (x)y < 0
lγ ( f, z) = 1 − f (x)y/γ if 0 ≤ f (x)y ≤ γ (4.10)


0 if f (x)y ≥ γ

and then define


1 
m
γ
R̂ S (A(S)) = lγ (A(S), z i ). (4.11)
m i=1

Then, we may prove the following theorem based on this new notion.

Theorem 4.2 Let A be a real-valued classification algorithm with stability β. Then,


for all γ > 0, any m ≥ 1, and any δ ∈ (0, 1), with probability at least 1 − δ over the
random draw of sample S,

γ β β ln(1/δ)
R(A(S)) ≤ R̂ S (A(S)) + 2 + 4m + 1 (4.12)
γ γ 2m
and 
γ β β ln(1/δ)
R(A(S)) ≤ R̂ S (A(S \i )) + + 4m + 1 . (4.13)
γ γ 2m

Proof The loss function lγ is bounded by M = 1, and the algorithm is β/γ -stable.
By using the fact that R(A(S)) ≤ R γ = Ez [lγ (A(S), z)] and applying Theorem 4.1,
we complete the proof. 

Based on the uniform stability, we give an exponential generalization bound.

Theorem 4.3 (Exponential generalization bound in terms of uniform stability) Let


algorithm A be β-uniformly stable, and let the loss function satisfy l(h, z) ≤ M,
∀h ∈ H and ∀z ∈ Z. Given a training sample S = {(xi , yi )}i=1
m
, for any δ ∈ (0, 1),
with probability at least 1 − δ, it holds that

2 ln(2/δ)
R(A(S)) ≤ R̂ S (A(S)) + β + (mβ + M) .
m

Proof We adopt the notation D(A(S)) = R(A(S)) − R̂ S (A(S)) for notational sim-
plicity. In the rest of the proof, we will prove that D(A(S)) is close to its expectation
E S [D(A(S))] and that both have uniform β-stability:

1 
m
E S [D(A(S))] = E S,z [l(A(S), z) − l(A(S), z i )]
m i=1
1 
m
= E S,z  [l(A(S), z  ) − l(A(S i ), z  )]
m i=1
≤ β,
4.2 Algorithmic Stability and Generalization Error Bounds 51

where the third equality is based on the “symmetry” of the expectation. Let us verify
the conditions required in McDiarmid’s inequality (Definition 4.6). Given that an
algorithm A with uniform stability β, for any m ∈ N and S, S i ∈ Z m , it holds that

|D(A(S)) − D(A(S i ))| ≤ |R(A(S)) − R(A(S i ))| + | R̂ S (A(S)) − R̂ Si (A(S i ))|


1
≤ β + |l(A(S), z i ) − l(A(S i ), z i )|
m
1 
+ |l(A(S), z j ) − l(A(S i ), z j )|
m j =i
M
≤ 2β + .
m

Applying McDiarmid’s inequality to D(A(S)), we find that for any  ∈ (0, 1),

m 2
P(|D(A(S)) − ED(A(S))| > ) ≤ 2 exp(− ).
2(mβ + M)2

Recall that ED(A(S)) ≤ β; then, we have

P(D(A(S)) > β + ) ≤ P(|D(A(S)) − ED(A(S))| > ).

Hence,
m 2
P(D(A(S)) > β + ) ≤ 2 exp(− ).
2(mβ + M)2

Eventually, we take
m 2
δ = 2 exp(− ),
2(mβ + M)2

i.e., 
2 ln(2/δ)
 = (mβ + M) .
m

Thus, we obtain the claimed result. 

Deep learning algorithms usually have an extremely large “total” hypothesis com-
plexity, whereas optimizers may explore only a small part of the hypothesis space.
The following notion helps characterize the “effective” hypothesis complexity, which
depends on both the learning algorithm and the training data.

Definition 4.7 (Algorithmic Rademacher complexity) The Rademacher complexity


of a hypothesis class H on the feature space X is defined as

1 
m
R(H) = E sup σi h, xi , (4.14)
h∈H m i=1
52 4 Algorithmic Stability

where σ1 , . . . , σm are i.i.d. Rademacher variables that are uniformly distributed in


{−1, +1}.

With the help of algorithmic stability, one may prove a generalization bound in
terms of the algorithmic Rademacher complexity (Liu et al. 2017).

Theorem 4.4 (Generalization bound in terms of algorithmic Rademacher complex-


ity) Let B be a separable Hilbert space. Suppose that xi ∗ ≤ B with probability one
for some B > 0 and that the loss function is bounded and Lipschitz, that is, l(h, z) ≤
M with probability one for some M > 0 and |l(h, z) − l(h  , z)| ≤ L| h, x − h  , x |
for all z ∈ Z and h, h  ∈ H (H is a subset of the separable Hilbert space B). If a
learning algorithm is α(m)-uniformly argument stable, then its generalization error
is bounded as follows. With probability at least 1 − 2δ,

log(1/δ)
R(h S ) − R̂ S (h S ) ≤ 2L B 2 log(2/δ)α(m) + M .
2m

Proof For a sample size m and confidence δ > 0, let

r := r (m, δ) = Dα(m) 2m log(2/δ), D > 0

and
Br = {h ∈ H|h − Eh S  ≤ r (m, δ)}.

Our proof consists of two parts. First, we show that R(Br ) ≤ B 2 log(2/δ)α(m).
This is because

1
m
R(Br ) = E sup σi h, xi
h∈Br n i=1
1 
m
= E sup σi ( h, xi − E[ h S , xi ])
h∈Br m i=1
 m 
1  
≤ E sup h − E[h S ]  σi xi 
m 
h∈Br i=1 ∗
 m 
r  
≤   σi xi 

m i=1 ∗

1 m 1/ p
≤ α(m)D 2m log(2/δ)C p xi ∗p
m i=1

≤ DC p B 2 log(2/δ)α(m)m −1/2+1/ p .

Note that B is a Hilbert space; thus, the first step is complete. Next, let us consider
the Doob-Meyer difference
4.2 Algorithmic Stability and Generalization Error Bounds 53

Dt = E[h S |Z 1 , . . . , Z t ] − E[h S |Z 1 , . . . , Z t−1 ],

and further,


m 
m
Dt 2∞ = E[h S |Z 1 , . . . , Z t ] − E[h S |Z 1 , . . . , Z t−1 ]2∞
t=1 t=1

m
= E[h S − h S t |Z 1 , . . . , Z t ]2∞
t=1
m  2
≤ E[h S − h S t ∞ |Z 1 , . . . , Z t ]
t=1
≤ mα(m)2 .

Then, we know that with probability at least 1 − δ,

R(h S ) − R̂ S (h S ) ≤ sup (R(h) − R̂ S (h)).


h∈Br

Since the loss function is bounded by M, with probability at least 1 − δ, it holds



log(1/δ)
sup (R(h) − R̂ S (h)) ≤ E sup (R(h) − R̂ S (h)) + M
h∈Br h∈Br 2m

log(1/δ)
≤ 2LR(Br ) + M .
2m

By combining the two results above, we complete the proof. 

Furthermore, we may prove a generalization bound in terms of the uniform


argument stability as follows.

Theorem 4.5 (Generalization bound in terms of uniform argument stability; see


(Liu et al. 2017)) Let B be a separable Hilbert space. Suppose that the marginal
distribution of the X i is such that X i ∗ ≤ B with probability one for some B > 0 and
that the loss function is bounded and Lipschitz, that is, l(h, z) ≤ M with probability
one for some M > 0 and |l(h, z) − l(h  , z)| ≤ L| h, x − h  , x | for all z ∈ Z and
h, h  ∈ H. Let a > 1. If a learning algorithm is α(m)-uniformly argument stable,
then with probability at least 1 − 2δ,

a (6a + 8)M log(1/δ)


R(h S ) − R̂ S (h S ) ≤ 8L B 2 log(2/δ)α(m) + .
a−1 3m

This theorem relies on the following two lemmas.


54 4 Algorithmic Stability

Lemma 4.1 (Theorem 2.1 in (Bartlett et al. 2005)) Let F be a class of functions that
map X to [0, M]. Assume that there exists some ρ > 0 such that for every f ∈ F,
var( f (X )) ≤ ρ. Then, with probability at least 1 − δ, we have

1 
m
2ρ log(1/δ) 4M log(1/δ)
sup E[ f (x)] − f (xi ) ≤ 4R(F) + + .
f ∈F m i=1 m 3m

Proof Based on the concept of concentration inequality, let V = sup E[ f (x)] −
f ∈F

m 
1
m
f (xi ) . Since sup var[ f (xi )] ≤ ρ, it holds, with probability at least 1 − e−x ,
i=1 f ∈F


2xρ 4x ME[V ] Mx
V ≤ E[V ] + + + .
m m 3m
√ √ √ √
By the inequalities x+y≤ x+ y and 2 x y ≤ αx + αy , with probability at
least 1 − e−x ,

2ρx  1 1  Mx
V ≤ inf (1 + α)E[V ] + + + .
α>0 m 3 α m

Finally, let α = 1 and consider E[V ] ≤ 2R(F); this completes the proof. 

Lemma 4.2 (Bartlett et al. 2005) Let


 
r
Gr (z) = l(h, z)|h ∈ Br
max{r, El(h, z)}

and
1 
m
Vr = sup E[g(z)] − g(z i )
g∈Gr m i=1

be defined. For any r > 0 and a > 1, if Vr ≤ r/a, then every h ∈ Br satisfies

a 1 
m
E[l(h, z)] ≤ l(h, z i ) + Vr .
a − 1 m i=1

Proof Let l(h, z) be denoted by f (z), and let g = r f /ω( f ). Then, by the definition

m
of Vr , we have E[g(z)] ≤ m1 g(z i ) + Vr . We note that when r ≥ E[ f (z)], g = f .
i=1
Otherwise, when r < E[ f (z)], we have
4.2 Algorithmic Stability and Generalization Error Bounds 55

1  1 
m m
E[ f (z)] E[ f (z)]
E[ f (z)] ≤ f (z i ) + Vr ≤ f (z i ) + .
m i=1 r m i=1 a

Finally, the statement is true under both conditions, as shown. 

Now, we will prove Theorem 4.5.


Proof First, with probability at least 1 − δ, we have
a  a 
R(h S ) − R̂ S (h S ) ≤ sup R(h) − R̂ S (h) .
a−1 h∈Br a−1

Furthermore, we note that

var(g(z)) ≤ E[g(z)2 ] ≤ ME[g(z)] ≤ Mr.

Then, by Lemma 4.1, we have



2Mr log(1/δ) 4M log(1/δ)
Vr ≤ 4R(Gr ) + + .
m 3m

Let the right-hand side be r/a; then, by applying Lemma 4.2, we obtain

a 1 
m
r
E[l(h, z)] ≤ l(h, z i ) + (4.15)
a − 1 m i=1 a
a 1 
m
2Ma log(1/δ) 8M log(1/δ)
≤ l(h, z i ) + + 8R(Gr ) + .
a − 1 m i=1 m 3m
(4.16)

Then, we have

a 1 
m
2Ma log(1/δ) 8M log(1/δ)
sup E[l(h, z) − l(h, z i ) ≤ + 8R(l ◦ Br ) + .
h∈Br a−1m m 3m
i=1

By combining the two results above, the proof is completed. 


In general, the stability β is a function of the sample size m. For instance, if β = k
m
for some positive constant k, then we have that, for any δ ∈ (0, 1),

k 2 ln(2/δ)
R(A(S)) ≤ R̂ S (A(S)) + + (2k + M)
m m

with probability at least 1 − δ.


56 4 Algorithmic Stability

We have used McDiarmid’s inequality to prove a generalization bound for a uni-


formly β-stable algorithm. This generalization bound tells us that with a high prob-
ability, the expected error will be close to the empirical error. We thus conclude that
a low empirical error serves as evidence of a low expected error.
In the next section, we will show that the Tikhonov regularization algorithm
possesses this property.

4.3 Uniform Stability of Regularized Learning

In this section, we provide an example to illustrate the uniform stability of the kernel-
based Tikhonov regularization algorithm. Recall the definition of the Tikhonov
regularization scheme:

1 
m
min{ l(h, z i ) + λ (h)},
h∈H m i=1

where (h) is a convex and differentiable function mapping from H to R and λ is


a positive regularization parameter. We discuss the stability properties and general-
ization error bound of a regularized algorithm that has a reproducing kernel Hilbert
space (RKHS) and a kernel-based norm regularizer.
Let K : X × X → R be a symmetric and positive definite kernel function. For
each u, x ∈ X , we adopt the notation K u (v) = K (u, v). The RKHS is defined as


m
H K = { f : X → R| f (x) = αi K xi (x), αi ∈ R},
i=1

with inner product


 

m 
m 
m 
m
αi K xi , β j Kx j = αi β j K (xi , x j ).
i=1 j=1 i=1 j=1

According to the reproducing property, we have that, for any h ∈ H K and x ∈ X ,


 

m 
m
f, K x = αi K xi , K x = αi K xi (x) := f (x).
i=1 i=1

Let A be a kernel-based regularized learning algorithm that outputs a function


h S ∈ H K such that
4.3 Uniform Stability of Regularized Learning 57

1 
m
h S = arg min l(h, z i ) + λ f 2K ,
f ∈H K m i=1

where the loss function l is convex and satisfies the L-Lipschitz continuity condition.

Theorem 4.6 Suppose that the kernel function is bounded, i.e., K (x, x) ≤ κ 2 <
∞, ∀x ∈ X . The kernel-based Tikhonov regularization algorithm A is uniformly
β-stable, with probability at least 1 − δ, we have

2Lκ 2 2 ln(2/δ)
R(A(S)) ≤ R̂ S (A(S)) + + (2k + M) .
λm m

Proof Let D f = A(S i ) − A(S). According to the property of a convex function, we


have that ∀t ∈ [0, 1], it is easy to obtain

R̂ S (A(S) + t D f ) + R̂ S (A(S i ) − t D f ) ≤ R̂ S (A(S)) + R̂ S (A(S i )).

The definitions of A(S) and A(S i ) indicate that

R̂ S (A(S)) + λA(S)2K ≤ R̂ S (A(S) + t D f ) + λA(S) + t D f 2K

and

R̂ Si (A(S i )) + λA(S i )2K ≤ R̂ Si (A(S i ) − t D f ) + λA(S i ) − t D f 2K .

Furthermore, we can deduce that

λ
D f 2K
2
= λ[A(S)2K + A(S i )2K − A(S) + t D f 2K − A(S i ) − t D f 2K ]
≤ [ R̂ Si (A(S i ) − t D f ) − R̂ S (A(S i ) − t D f )] + [ R̂ S (A(S i )) − R̂ Si (A(S i ))]
1 1
= [l(A(S i ) − t D f , z i ) − l(A(S i ) − t D f , z i )] + [l(A(S i ), z i ) − l(A(S i ), z i )]
m m
1 1
= [l(A(S i ) − t D f , z i ) − l(A(S i ), z i )] + [l(A(S i ), z i ) − l(A(S i ) − t D f , z i )]
m m
tL 2t Lκ
≤ (|D f (xi )| + |D f (xi )|) ≤ D f  K
m m
Lκ 1
≤ D f  K (by taking t = ),
m 2
where the last line follows from the reproducing property, the Cauchy-Schwartz
inequality, and the boundedness assumption on the kernel K , i.e., | f | ≤ κ f  K for
any f in the RKHS. Moreover, by taking t = 1/2, we then have
58 4 Algorithmic Stability

2Lκ 2
|D f | ≤ κD f 2K ≤ .
λm

Finally, we show that the algorithm has a uniform stability of 2Lκ 2 /λm. We complete
the proof by combining this result with the result of Theorem 4.3. 

References

Olivier, Bousquet, and André Elisseeff. 2002. Stability and generalization. Journal of Machine
Learning Research 2: 499–526.
Peter, L Bartlett, Olivier Bousquet, and Shahar Mendelson. 2005. Local rademacher complexities.
The Annals of Statistics 33 (4): 1497–1537.
Tongliang, Liu, Gábor Lugosi, Gergely Neu, and Dacheng Tao. 2017. Algorithmic stability and
hypothesis complexity. In International Conference on Machine Learning, 2159–2167.
Yurii, E Nesterov. 1983. A method for solving the convex programming problem with convergence
rate o (1/k 2 ). In Dokl. Akad. Nauk Sssr 269: 543–547.
Part II
Deep Learning Theory
Chapter 5
Capacity and Complexity

This chapter presents the results for deep learning theory from the perspective of
hypothesis complexity, including the Vapnik-Chervonenkis (VC) dimension, the
Rademacher complexity, and the covering number. By determining upper and lower
bounds for the VC dimension of neural networks, we can better understand their
generalizability. Additionally, we discuss margin bounds, which offer more robust
generalization assurances compared to worst-case bounds based on the VC dimen-
sion. These bounds ensure that trained models can achieve a small empirical margin
loss with high confidence. Furthermore, we examine the effect of residual connec-
tions on hypothesis complexity by analyzing the covering number of the hypothesis
space. We propose an upper bound for the covering number, providing insights into
how residual connections influence model complexity.

5.1 Worst-Case Bounds Based on the VC Dimension

One can upper bound or lower bound the VC dimension of a neural network in order
to characterize its generalizability. Goldberg and Jerrum (1995) gave an O(W 2 ) upper
bound for the VC dimension of a neural network with W parameters. Bartlett et al.
(1999) improved this bound to O((W L log W + W L 2 ), where L is the number of
layers. The tightest bound thus far has been proven by Harvey et al. (2017) as follows.

Theorem 5.1 (see Harvey et al. (2017)) Consider a neural network with W parame-
ters and U units. These units have activation functions that are piecewise polynomials
with at most p pieces and a degree of at most d. Let F be the set of (real-valued)
functions computed by this network. Then, the VC dimension of the sign function
(sgn) applied to elements of F is upper bounded by O(W U log((d + 1) p)).

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025 61
F. He and D. Tao, Foundations of Deep Learning, Machine Learning: Foundations,
Methodologies, and Applications, https://doi.org/10.1007/978-981-16-8233-9_5
62 5 Capacity and Complexity

The following theorem is based on a theorem presented by Goldberg and Jerrum


(1993), which states that any class of functions that can be expressed using a small
number of distinct polynomial inequalities has a smaller VC dimension.

Theorem 5.2 Let k and n be positive integers, and let f : Rn × Rk → {0, 1} be a


function that can be expressed as a Boolean formula containing s distinct atomic
predicates, where each atomic predicate is a polynomial inequality or equality in
k + n variables with a degree of at most d. Let F = { f (·, ω) : ω ∈ Rk }. Then,
V Cdim(F ) ≤ 2k log2 (8eds).

Proof The output signal of a neural network with piecewise activation can be
expressed as a Boolean formula. In detail, in each layer, the input to each com-
putation unit must lie in one of the p pieces of the activation function ψ, which can
be written as a Boolean formula. Accordingly, we can express the output signal of the
network as a Boolean function consisting of fewer than 2(1 + p)U atomic predicates,
each of which is a polynomial inequality with a degree of at most max{U + 1, 2d U }.


We can now prove Theorem 5.1.

Proof By Theorem 5.2, we prove an upper bound of 2W log(16e · max{U +


1, 2d U } · (1 + p)U ) = O(W U log(1 + d) p) for the VC dimension. This completes
the proof. 

Harvey et al. (2017) also derived a lower bound for the VC dimension, presented
as the following theorem.

Theorem 5.3 (see (Harvey et al. 2017)) Let C be an universal constant. Given any W
and L satisfying W > C L > C 2 , there exists a rectified linear unit (ReLU) network
with fewer than L layers and fewer than W parameters that has a VC dimension
lower bounded by W L log(W/L)/C.

This theorem has a more general version. Theorem 5.3 is a special case in which
r = log2 (W/L)/2, m = r L/8, and n = W − 5m2r .

Theorem 5.4 Let r , m, and n be positive integers, and let k = m/r . There exists
a ReLU network with 3 + 5d layers, 2 + n + 4m + k((11 + r )2r + 2r + 2) param-
eters, m + n input nodes and m + 2 + k(5 × 2r + 1) computational nodes with VC
dimension ≥ mn.

5.2 Rademacher Complexity and Covering Number

Margin bounds (Vapnik 2006; Schapire et al. 1998; Bartlett and Shawe-Taylor 1999;
Koltchinskii and Panchenko 2002; Taskar et al. 2004) represent a distinct category of
generalization bounds that offer stronger assurances compared to worst-case bounds
5.2 Rademacher Complexity and Covering Number 63

based on the VC dimension. Unlike traditional bounds, margin bounds focus on the
concept of margin, which refers to the difference between the predicted score of the
correct label and the highest score of incorrect labels for a given example. These
bounds provide robust guarantees that trained models can achieve a small empirical
margin loss within a large confidence margin. Essentially, margin bounds quantify
the degree of separation between different classes in the model’s decision space,
reflecting the model’s ability to generalize beyond the training data. Furthermore,
similar generalization guarantees can be derived based on a measure of “luckiness”
(Schapire et al. 1998; Koltchinskii and Panchenko 2002), which captures the notion
of how likely a model is to perform well on unseen data due to favorable properties of
the training set. By incorporating margin-based metrics and measures of luckiness,
researchers gain valuable insights into the behavior and performance of machine
learning models, enhancing our understanding of their generalization capabilities
and reliability in real-world applications.

Margin bound

For any distribution D and margin γ > 0, we define the expected margin loss for a
hypothesis h as follows:
 
L γ (h) = P(x,y)∼D h(x)[y] ≤ γ + max h(x)[ j] ,
j=y

where h(x)[ j] is the j-th component of the vector h(x). The generalization bounds
obtained under the margin loss are called margin bounds.
Based on the covering number, the Dudley entropy integral, and the Rademacher
complexity, Bartlett et al. (2017) obtained the following spectrally normalized upper
bound for the generalization error:
⎛   3/2 ⎞
X log Q 
D D
Wi − Mi
2/3
1/δ ⎠
Õ ⎝
2 2,1
ρi Wi σ + ,
γm i=1 i=1 Wi
2/3
σ m

where Wi , i = 1, ..., D, is the weight matrix, Q is an upper bound on the number


of output units in each layer, · σ is the spectral norm, ρi > 0 is the Lipschitz
constant and Mi , i = 1, ..., D, is a reference matrix. This bound is strictly tighter
than that of Neyshabur et al. (2015). Moreover, this bound suggests that one can boost
the generalizability of a network by controlling its spectral norm. This is consistent
with spectral normalization (Miyato et al. 2018), an effective regularization approach
mainly used in generative adversarial networks.
Concurrently, Neyshabur et al. (2017) proved a similar generalization bound via
Probably Approximately Correct (PAC)-Bayesian theory, as shown below:
64 5 Capacity and Complexity
⎛ ⎞
 2 2 D Wi 2

⎜ B D Q log(D Q) i=1 Wi
F
2 + log Dm
δ ⎟
O⎝ 2
⎠. (5.1)
γ 2m

The authors of (Neyshabur et al. 2017) pursued a two-pronged approach in their


research endeavor. First, they developed a perturbation bound aimed at controlling
the changes in the network’s output resulting from perturbations introduced into the
model’s weights. This perturbation bound allowed them to effectively quantify the
sensitivity or sharpness of the neural network’s response to variations in its parame-
ters. Building upon the perturbation analysis, Neyshabur et al. (2017) then employed
PAC-Bayesian theory to derive a generalization bound. This bound was intended to
provide insights into the model’s ability to generalize from the training data to unseen
examples. However, it was noted that this derived generalization bound proved to be
weaker compared to a similar bound proposed by Bartlett et al. (2017). This obser-
vation underscores the importance of exploring different theoretical frameworks and
bounds in assessing the robustness and performance of machine learning models in
various contexts. Further investigations may be warranted to elucidate the factors
contributing to the variance in generalization guarantees across different theoretical
approaches.
By introducing some norm constraints on the weights, Golowich et al. (2018)
improved the exponential depth dependence of Eq. (5.1) in (Neyshabur et al. 2015)
to a polynomial dependence, as follows:
 √ D   √  
B D j=1 M F ( j) B D + log m Dj=1 M F ( j)
O √ or O √ , (5.2)
m m

where M F (1), ..., M F (D) mean the maximum of Frobenius norm of networks param-
eter matrices. Besides, Golowich et al. (2018) further improved the bound to a depth-
independent bound, as follows:
⎛ ⎛ ⎞ ⎧    ⎫⎞

⎨ ⎪
D ⎬⎟
D
D
 log 1
M ( j)
⎜  j=1 F
Õ ⎝ B ⎝ M F ( j)⎠ min √ , ⎠, (5.3)

⎩ m m⎪⎭
j=1

where  is an upper bound on the spectral norm of all weight matrices.


Moreover, Golowich et al. (2018) also proved a lower bound, as follows:
⎛ D ⎞
max{0, 21 − 1p }
B M F ( j)Q
⎝ j=1
√ ⎠, (5.4)
m

where
 Q is the network width. This bound suggests a possible tighter bound of
O m −1/2 .
5.2 Rademacher Complexity and Covering Number 65

Tu et al. (2020) developed a gradient measure based on the Fisher-Rao norm,


which gives rise to a generalization bound as follows.
Theorem 5.5 (see Tu et al. (2020)) If the margin parameter α is fixed, then for
any δ > 0, with probability at least 1 − δ, the following holds for every recurrent
neural network (RNN) whose weight matrices θ = (U, W, V ) satisfy ||V T ||1 ≤ βV ,
||W T ||1 ≤ βW , ||U T ||1 ≤ βU and ||θ || f s ≤ r :
  
4k r ||X || F 1 1
E[IM yL (x,y)≤0 ] ≤ + βV βU ||X ||1 
T
α 2m λmin (E(X X T )) n

1 log 2δ
+ IM yL (xi ,yi )≤a + 3 . (5.5)
m 2m

The gradient measure plays a crucial role in refining generalization bounds and illu-
minating the intricate relationship between a model’s generalizability and its train-
ability. By analyzing the impact of gradients during training, researchers gain insights
into how noise can be strategically introduced into input data to enhance a model’s
ability to generalize to unseen examples. This observation underscores the value
of robust training strategies that incorporate regularization techniques like gradient
clipping (Merity et al. 2018; Gehring et al. 2017; Peters et al. 2018). Furthermore, the
gradient measure provides theoretical justification for gradient clipping, a technique
used to stabilize training by constraining the magnitude of gradients during optimiza-
tion. By controlling the scale of gradients, gradient clipping helps prevent exploding
gradients and facilitates more stable and effective learning. Overall, understanding
the implications of gradient measures enriches our understanding of model behavior
and informs practical strategies to improve model performance and generalization
capabilities.
(Novak et al. 2018) conducted comprehensive experiments to analyze the gener-
alization capacity of deep neural networks. Their empirical investigations revealed
pivotal factors influencing generalization, particularly focusing on the input-output
Jacobian norm and the linear region counting within the networks. The input-output
Jacobian norm measures the sensitivity of network outputs to changes in input, pro-
viding insights into how robustly the network behaves across different input varia-
tions. Meanwhile, the linear region counting quantifies the number of linear decision
boundaries that can be formed within the network’s architecture, offering clues about
the complexity and flexibility of its decision-making process. Furthermore, the study
highlighted the critical relationship between the output hypothesis of a model and the
underlying data manifold. The generalization bound, which defines the model’s abil-
ity to perform well on unseen data, is heavily influenced by how closely the model’s
output hypothesis aligns with the distribution of the training data in the input space.
This suggests the importance of designing models that not only fit the training data
well but also generalize effectively to new, unseen instances. Understanding these
fundamental aspects contributes to the development of more robust and reliable deep
learning models capable of adapting to diverse real-world scenarios.
66 5 Capacity and Complexity

5.3 Generalization Bound of ResNet

In recent years, deep neural networks have repeatedly demonstrated their signifi-
cant value in practical applications and have led to breakthroughs in image recogni-
tion, language translation, and predictive analytics, driving innovation across various
industries (LeCun et al. 2015; Greff et al. 2016; Shi et al. 2018; Silver et al. 2016; Lit-
jens et al. 2017; Chang et al. 2018), particularly with the introduction of ResNet (He
et al. 2016a) and the wide adoption of residual connections to induce top performed
neural network architectures (He et al. 2016a; Huang et al. 2017; Xie et al. 2017).
These advancements have catalyzed transformations in computer vision (Krizhevsky
et al. 2012; Lin et al. 2017; He et al. 2017; Liu et al. 2019) and data analysis (Witten
et al. 2016). Residual connections link non-neighboring layers, diverging from the
traditional chain-like stacking of layers.
Empirical studies have consistently proven that incorporating residual connections
can facilitate the training of deep neural networks by largely alleviating issues related
to vanishing gradients and exploding gradients. Despite their effectiveness in training,
there remains a notable gap in theoretical analysis regarding the impact of residual
connections on the generalization ability of deep neural networks.
While residual connections have shown instrumental in improving training dynam-
ics and model convergence, their influence on generalization-i.e., a model’s perfor-
mance on unseen data in the inference stage-remains less explored from a theoretical
perspective. Comprehensive research is critical to elucidate the underlying mecha-
nisms by which residual connections impact on the generalization capabilities of deep
neural networks, thereby providing deeper insights into the theoretical foundations
of this powerful addon.
Residual connections introduce a novel approach to neural network architecture by
connecting non-neighboring layers, diverging from the traditional chain-like stack-
ing of layers. This departure from conventional network structures introduces loops
within the neural network, fundamentally altering its topology. Intuitively, the incor-
poration of residual connections significantly expands the hypothesis space of deep
neural networks. This increased complexity, however, raises concerns regarding the
generalization ability of the network, as per the principle of Occam’s razor, which
suggests a negative correlation between algorithmic generalization and hypothesis
complexity.
The lack of comprehensive analysis on this aspect poses challenges for deploying
residual-connected neural networks in safety-critical domains, such as autonomous
vehicles and medical diagnosis, where algorithmic errors could have severe conse-
quences. Without addressing this theoretical gap, leveraging the advancements made
possible by residual connections in critical applications, from autonomous vehicles
(Janai et al. 2017) to medical diagnose (Esteva et al. 2017), remains constrained.
Efforts to bridge this gap are crucial for ensuring the reliability and safety of neural
network-based systems in real-world scenarios where accuracy and generalization
are paramount.
5.3 Generalization Bound of ResNet 67

In this section, we investigate the impact of residual connections on hypothesis


complexity by examining the covering number of the hypothesis space. We propose
an upper bound for the covering number that reveals insights into the influence of
residual connections on model complexity. Our analysis demonstrates that when the
total number of weight matrices in a neural network is fixed, the upper bound on the
covering number remains consistent regardless of whether the weight matrices are
part of the residual connections or the main “stem”1 of the network. This finding
suggests that residual connections do not necessarily increase the hypothesis space
complexity compared to a traditional chain-like neural network, provided that the
total number of weight matrices and non-linearities
√ is held constant. Leveraging
this understanding, we establish an O(1/ m) generalization bound for ResNet as a
representative case of neural networks with residual connections, where m represents
the training sample size.
Furthermore, our framework enables the derivation of generalization bounds for
similar architectures created by incorporating residual connections into chain-like
neural networks. This approach facilitates the assessment and comparison of hypoth-
esis complexity across different network configurations, offering valuable insights
into the theoretical underpinnings of deep learning models with residual connec-
tions. Our generalization bound is intricately linked to the product of norms of all
weight matrices within the neural network. Specifically, as the product of these norms
increases, there tends to be a detrimental impact on the network’s ability to generalize
effectively. This observation validates the importance of employing regularization
techniques aimed at controlling the magnitudes of these weight matrix norms to
enhance generalization performance.
To achieve robust generalization, it is common practice to incorporate regulariza-
tion methods such as weight decay and spectral normalization during the training
of deep neural networks. Weight decay, also known as L 2 regularization, imposes
a penalty based on the L 2 norm of the weights, encouraging smaller and more sta-
ble weight values. Similarly, spectral normalization constrains the spectral norm of
weight matrices, leading to more controlled and stable learning dynamics.
These regularization techniques (Krogh and Hertz 1992) play a critical role in
managing model complexity, mitigating overfitting, and improving generalization
capabilities across a variety of deep learning tasks. The endorsement of these meth-
ods by Bartlett et al. (2017) suggests their effectiveness in promoting robust and
reliable performance of neural networks, particularly in scenarios where general-
ization is paramount. By incorporating these strategies, practitioners can optimize
neural network architectures for improved performance and reliability in real-world
applications.

1The “stem” is defined to denote the chain-like part of the neural network besides all the residuals.
For more details, please refer to Sect. 5.3.1.
68 5 Capacity and Complexity

5.3.1 Stem-Vine Framework

This section provides a notation system for deep neural networks with residual con-
nections. Motivated by the topological structure, we call it the stem-vine framework.
Deep neural networks are typically assembled by linking numerous weight matri-
ces with nonlinear operators, such as ReLU, sigmoid, and max-pooling functions. In
this discussion, we focus on a neural network design that integrates multiple resid-
ual connections into a traditional “chain-like” architecture, which sequentially stacks
layers of weight matrices and nonlinearities. Inspired by the topological arrangement
of this network, we refer to the sequential layers as the “stem” of the neural network
and designate the added residual connections as the “vines”.
Both the stem and vines consist of stacked layers of weight matrices and nonlinear
functions. The stem represents the core structure of the neural network, comprising
the primary sequence of layers, while the vines introduce additional connections that
bypass certain layers, enhancing the network’s depth and flexibility. This configu-
ration allows for more complex transformations and feature representations within
the neural network, leveraging both the traditional sequential architecture and the
introduced residual connections to enhance learning capabilities and model expres-
siveness.
We denote the weight matrices and the nonlinearities in the stem S respectively
as

Ai ∈ Rni−1 ×ni , (5.6)


σ j : Rn j → Rn j , (5.7)

where i = 1, . . . , L, L is the number of weight matrices in the stem, j = 1, . . . , L N ,


L N is the number of nonlinearities in the stem, n i is the dimension of the output of
the i-th weight matrix, n 0 is the dimension of the input data to the network, and n L
is the dimension of the output of the network. Thus we can write the stem S as a
vector to express the chain-like structure. Here for the simplicity and without any
loss of the generality, we give an example that the numbers of weight matrices and
nonlinearities are equal2 , i.e., L N = L, as the following equation,

S = (A1 , σ1 , A2 , σ2 , . . . , A L , σ L ) . (5.8)

2 If two weight matrices, Ai and Ai+1 , are connected directly without a nonlinearity between
them, we define a new weight matrix A = Ai · Ai+1 . The situations that nonlinearities are directly
connected are similar, as the composition of any two nonlinearities is still a nonlinearity.
Meanwhile, the number of weight matrices does not necessarily equal the number of nonlin-
earities. Sometimes, if a vine connects the stem at a vertex between two weight matrices (or two
nonlinearities), the number of the weight matrices (nonlinearities) would be larger than the number
of nonlinearities (weight matrices). Taken the 34-layer ResNet as an example, a vine connects the
stem between two nonlinearities σ33 and σ34 . In this situation, we cannot merge the two nonlinear-
ities, so the number of nonlinearities is larger than the number of weight matrices.
5.3 Generalization Bound of ResNet 69

For the brevity, we give an index j to each vertex between a weight matrix and
a nonlinearity and denote the j-th vertex as N ( j). Specifically, we give the index 1
to the vertex that receives the input data and L + L N + 1 to the vertex after the last
weight matrix/nonlinearity. Taken Eq. (5.8) as an example, the vertex between the
nonlinearity σi−1 and the weight matrix Ai is denoted as N (2i − 1) and the vertex
between the weight matrix Ai and the nonlinearity σi is denoted as N (2i).
Vines are constructed to connect the stem at two different vertexes. And there
could be over one vine connecting a same pair of the vertexes. Therefore, we use a
triple vector (s, t, i) to index the i-th vine connecting the vertexes N (s) and N (t)
and denote the vine as V (s, t, i). All triple vectors (s, t, i) constitute an index set I V ,
i.e., (s, t, i) ∈ I V . Similar to the stem, each vine V (s, t, i) is also constructed by a
series of weight matrices As,t,i s,t,i s,t,i
1 , . . . , A L s,t,i and nonlinearities σ1 , . . . , σ Ls,t,i
s,t,i , where
N

L s,t,i is the number of weight matrices in the vine, while L u,v,i


N is the number of the
nonlinearities.
Multiplying by a weight matrix corresponds to an affine transformation on the
data matrix. Also, nonlinearities induce nonlinear transformations. Through a series
of affine transformations and nonlinear transformations, hierarchical features are
extracted from the input data by neural networks. Usually, we use the spectrum
norms of weight matrices and the Lipschitz constants of nonlinearities to express the
intensities respectively of the affine transformations and the nonlinear transforma-
tions. We call a function f (x) is ρ-Lipschitz continuous if for any x1 and x2 in the
support domain of f (x), it holds that

f (x1 ) − f (x2 ) f ≤ ρ x1 − x2 x , (5.9)

where · f and · x are respectively the norms defined on the spaces of f (x)
and x. Fortunately, almost all nonlinearities normally used in neural networks are
Lipschitz continuous, such as ReLU, max-pooling, and sigmoid (see (Bartlett et al.
2017)).
Many important tasks for deep neural networks can be categorized into multi-class
classification. Suppose input examples z 1 . . . , z m are given, where z i = (xi , yi ), xi ∈
Rn 0 is an instance, y ∈ {1, . . . , n L } is the corresponding label, and n L is the number of
the classes. Collect all instances x1 , . . . , xm as a matrix X = (x1 , . . . , xm )T ∈ Rm×n 0
that each row of X represents a data point. By employing optimization methods
(usually stochastic gradient decent, SGD), neural networks are trained to fit the
training data and then predict on test data. In mathematics, a trained deep neural
network with all parameters fixed computes a hypothesis function F : Rn 0 → Rn L .
And a natural way to convert F to a multi-class classifier is to select the coordinate
of F(x) with the largest magnitude. In other words, for an instance x, the classifier
is x → arg maxi F(x)i . Correspondingly, the margin for an instance x labelled as
yi is defined as F(x) y − maxi =y F(x)i . It quantitatively expresses the confidence of
assigning a label to an instance.
To express F, we first define the functions respectively computed by the stem and
vines. Specifically, we denote the function computed by a vine V (s, t, i) as:
70 5 Capacity and Complexity

FVs,t,i (X ) = σ Lu,v,i u,v,i u,v,i u,v,i


u,v,i (A L u,v,i σ L u,v,i −1 (. . . σ1 (A 1 X ) . . .)) . (5.10)

Similarly, the stem computes a function as the following equation:

FS (X ) = σ L (A L σ L−1 (. . . σ1 (A1 X ) . . .)) . (5.11)

Furthermore, we denote the output of the stem at the vertex N ( j) as the following
equation:
j
FS (X ) = σ j (A j σ j−1 (. . . σ1 (A1 X ) . . .)) . (5.12)

j
FS (X ) is also the input of the rest part of the stem. Eventually, with all residual
connections, the output hypothesis function F j (X ) at the vertex N ( j) is expressed
by the following equation:

j u, j,i
F j (X ) = FS (X ) + FV (X ) . (5.13)
(u, j,i)∈I V

Apparently,
FS (X ) = FSL (X ), F(X ) = F L (X ) . (5.14)

Naturally, we call this notation system as the stem-vine framework, and Fig. 5.1
gives an example.

5.3.2 Generalization Bound

In this section, we investigate the generalization capabilities of deep neural networks


with residual connections and establish a specific generalization bound for ResNet
as a case study. This bound is derived from the margin-based multi-class bound
detailed in Lemmas 3.1 and 3.2. Guided by these lemmas, which highlight a natural
approach to deriving the generalization bound through exploration of the covering
number of the corresponding hypothesis space, we first propose an upper bound for
the covering number (referred to as a covering bound) applicable to any deep neural
network within the stem-vine framework. We then apply this approach to ResNet,
providing a specific calculation of its covering bound. By leveraging Lemmas 3.1
and 3.2, we subsequently present a detailed generalization bound tailored specifi-
cally for ResNet. The proofs supporting these covering bounds will be provided in
Sect. 5.3.3, enhancing our understanding of the theoretical underpinnings behind the
generalization capabilities of neural networks with residual connections.
In practice, when introducing new structures to improve training performance,
such as boosting accuracy or reducing training time, it’s essential to proceed with cau-
tion to avoid overfitting, which can lead to significant generalization errors. ResNet,
with its introduction of “loops” via residual connections in traditional chain-like neu-
5.3 Generalization Bound of ResNet 71

Fig. 5.1 A deep neural network with Residual Connections under the Stem-Vine Framework

ral networks, represents a more sophisticated model. Empirical findings demonstrate


that these residual connections effectively reduce training error and expedite train-
ing, while also maintaining strong generalization capabilities. However, the absence
of theoretical underpinnings to explain or support these empirical results remains a
notable gap in the understanding of such architectures.
Our analysis on covering bounds reveals that the complexity of hypothesis spaces
computed by deep neural networks remains consistent, regardless of the distribution
of weight matrices-whether they reside in the stem, the vines, or even without any
vines at all-when the total number of weight matrices is fixed. This observation sug-
gests that the presence or absence of residual connections does not inherently alter
the overall hypothesis complexity. By integrating established results from statistical
learning theories (Lemmas 3.1 and 3.2), our findings suggest that the generalization
capability of deep neural networks with residual connections can be comparably
effective to those without, particularly in challenging scenarios where model com-
plexity plays a critical role. This theoretical insight sheds light on why deep neural
networks incorporating residual connections exhibit strong generalization capabili-
72 5 Capacity and Complexity

ties comparable to traditional chain-like architectures, while maintaining competitive


training performance.

5.3.2.1 Covering Bound for Deep Neural Networks with Residuals

In this subsection, we give a covering bound generally for any deep neural network
with residual connections.

Theorem 5.6 (Covering Bound for Deep Neural Network) Suppose a deep neural
network is constituted by a stem and a series of vines.
For the stem, let (ε1 , . . . , ε L ) be given, along with L N fixed nonlinearities
(σ1 , . . . , σ L N ). Suppose the L weight matrices (A1 , . . . , A L ) lies in B1 × . . . × B L ,
where Bi is a ball centered at 0 with radius of si , i.e., Ai ≤ si . Suppose the vertex
that directly follows the weight matrix Ai is N (M(i)) (M(i) is the index of the ver-
tex). All M(i) constitute an index set I M . When the output FM( j−1) (X ) of the weight
matrix A j−1 is fixed, suppose all output hypotheses FM( j) (X ) of the weight matrix
A j constitute a hypothesis space H M( j) with an ε M( j) -cover W M( j) with covering
number N M( j) . Specifically, we define M(0) = 0 and F0 (X ) = X .
Each vine V (u, v, i), (u, v, i) ∈ IV is also a chain-like neural network that con-
structed by multiple weight matrices Au,v,i j , j ∈ {1, . . . , L u,v,i }, and nonlinearities
σ ju,v,i , j ∈ {1, . . . , L u,v,i
N }. Suppose for any weight matrix A j
u,v,i
, there is a s u,v,i
j >0
u,v,i u,v,i u,v,i
such that A j σ ≤ sj . Also, all nonlinearities σ j are Lipschitz continu-
ous. Similar to the stem, when the input of the vine Fu (X ) is fixed, suppose the
vine V (u, v, i) computes a hypothesis space HVu,v,i , constituted by all hypotheses
FVu,v,i (X ), has an εu,v,i -cover WVu,v,i with covering number NVu,v,i .
Eventually, we denote the hypothesis space computed by the neural network is H.
Then there exists an ε in terms of εi , i = {1, . . . , L} and εu,v,i , (u, v, i) ∈ I V , such
that the following inequality holds:


L 
N(H, ε, · ) ≤ sup N M( j+1) sup NVu,v,i . (5.15)
j=1 FM( j) (u,v,i)∈I V Fu

A detailed proof will be given in Sect. 5.3.3.3.


As vines are chain-like neural networks, we can further obtain an upper bound for
sup Fu NVu,v,i via a lemma slightly modified from Bartlett et al. (2017). The lemma is
summarised as follows.

Lemma 5.1 (Covering Bound for Chain-like Deep Neural Network; cf. Bartlett et al. (2017), Lemma A.7) Suppose there are $L$ weight matrices in a chain-like neural network. Let $(\varepsilon_1,\dots,\varepsilon_L)$ be given. Suppose the $L$ weight matrices $(A_1,\dots,A_L)$ lie in $\mathcal{B}_1\times\dots\times\mathcal{B}_L$, where $\mathcal{B}_i$ is a ball centered at $0$ with radius $s_i$, i.e., $\mathcal{B}_i=\{A_i:\|A_i\|\le s_i\}$. Furthermore, suppose the input data matrix $X$ is restricted to a ball centred at $0$ with radius $B$, i.e., $\|X\|\le B$. Suppose $F$ is a hypothesis function computed by the neural network. If we define

$$\mathcal{H} = \{F(X): A_i\in\mathcal{B}_i,\ i=1,\dots,L\}, \tag{5.16}$$

and let $\varepsilon = \sum_{j=1}^{L}\varepsilon_j\rho_j\prod_{l=j+1}^{L}\rho_l s_l$, then we have the following inequality:

$$\mathcal{N}(\mathcal{H},\varepsilon,\|\cdot\|) \le \prod_{i=1}^{L}\sup_{A^{i-1}\in\mathcal{B}^{i-1}}\mathcal{N}_i, \tag{5.17}$$

where $A^{i-1}=(A_1,\dots,A_{i-1})$, $\mathcal{B}^{i-1}=\mathcal{B}_1\times\dots\times\mathcal{B}_{i-1}$, and

$$\mathcal{N}_i = \mathcal{N}\left(\left\{A_i F_{A^{i-1}}(X): A_i\in\mathcal{B}_i\right\},\varepsilon_i,\|\cdot\|\right). \tag{5.18}$$

Remark 5.1 The mapping induced by a chain-like neural network involves com-
posing a sequence of affine and nonlinear transformations. According to Lemma
5.1, the covering bound for a chain-like network can be decomposed into the product
of covering bounds for each layer, as demonstrated in Bartlett et al. (2017). How-
ever, when residual connections are introduced, parallel structures emerge within
the network, complicating the representation of the overall mapping as a series of
transformations. Instead, calculating a covering bound for the entire network requires
considering many additions of function spaces (as seen in Eq. (5.13)), which presents
challenges for applying previous methods directly. To address this, we introduce a
new proof in Sect. 5.3.3.3 to analyze the covering bound under the presence of
residual connections and parallel structures.
Despite the different proofs, the result for deep neural networks with residual connections shares similarities with the one for chain-like networks (see, respectively, Eqs. (5.15) and (5.17)). The similarities lead to the property summarised as follows.
Weight matrices influence the hypothesis complexity in the same way, no matter whether they are located in the stem or in the vines. In particular, adding an identity vine does not affect the hypothesis complexity of the deep neural network.

As indicated by Eq. (5.17) in Lemma 5.1, the covering number of the hypothesis space computed by a chain-like neural network (including the stem and all the vines) is upper bounded by the product of the covering numbers of all single layers. Specifically, the contribution of the stem to the covering bound is the product of a series of covering numbers, i.e., $\prod_{j=1}^{L}\sup_{F_{M(j)}}\mathcal{N}_{M(j+1)}$. In the meantime, applying Eq. (5.17) in Lemma 5.1, the contribution $\sup_{F_u}\mathcal{N}_V^{u,v,i}$ of the vine $V(u,v,i)$ can also be decomposed as the product of a series of covering numbers. Therefore, the contributions of the weight matrices in the stem and of those in the vines have similar formulations. This result suggests that residuals do not undermine the generalization capability of deep neural networks. Also, if a vine $V(u,v,i)$ is an identity mapping, the term in Eq. (5.15) that relates to it equals 1, i.e., $\mathcal{N}_V^{u,v,i}=1$, because there is no parameter to tune in an identity vine. This suggests that adding an identity vine to a neural network does not affect the hypothesis complexity. However, it is worth noting that the vines can influence the stem's part of the covering bound, i.e., $\mathcal{N}_{M(j+1)}$ in Eq. (5.15). The mechanism of the cross-influence between the stem and the vines remains an open problem.
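To make this multiplicative structure concrete, the following minimal Python sketch (our own illustration, not code from the original analysis) combines hypothetical per-layer log covering numbers for a stem and a few vines; an identity vine contributes a factor of 1 to the product, i.e., a zero term in the logarithm.

```python
# A minimal sketch with made-up per-layer log covering numbers: the covering
# bound of the whole stem-vine network is the product of per-component bounds,
# so its logarithm is the sum of per-component log bounds.

def log_covering_bound(stem_log_terms, vine_log_terms):
    """Sum per-layer log covering numbers (a product in the original scale)."""
    return sum(stem_log_terms) + sum(vine_log_terms)

stem = [2.3, 1.7, 2.0, 1.1]    # hypothetical per-layer log covering numbers
vines = [0.0, 0.0, 0.0, 0.9]   # three identity vines (log N = 0) and one with a weight matrix

with_vines = log_covering_bound(stem, vines)
identity_only = log_covering_bound(stem, [0.0] * 4)
print(with_vines, identity_only)   # identity vines add nothing to the bound
```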
Intuitively, introducing residual connections to a neural network may not change the hypothesis space. Here, we discuss the following case as an example. Consider a residual connection that links nodes $i$ and $j$. Suppose the hypothesis functions computed at the nodes $i$ and $j$ are $F_i$ and $F_j$, respectively. Also, we denote the hypothesis functions computed by the bone part and by the residual connection between nodes $i$ and $j$ as $F_{i,j}$ and $F'_{i,j}$, respectively. See the illustration in Fig. 5.2. Then, we have the following equation:

$$F_j = F_{i,j}\circ F_i + F'_{i,j}\circ F_i.$$

Usually, the residual connection is a simpler sub-network than the bone part. Therefore, the hypothesis space constituted by all $F'_{i,j}$ is a subspace of the one constituted by all $F_{i,j}$. Thus, the hypothesis space constituted by all $F'_{i,j}\circ F_i$ is a subspace of the one constituted by all $F_{i,j}\circ F_i$. In other words, the hypothesis space computed by a residual connection is only a subspace of the one computed by the bone part. Introducing a residual connection merges the two spaces by an addition operation, which yields the larger space. This property suggests that introducing residual connections may not change the hypothesis space.
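As a minimal illustration (our own sketch, with made-up toy maps rather than the book's networks), the following Python snippet shows how the hypotheses of the bone part and the residual branch are added element-wise at node $j$:

```python
import numpy as np

def bone(h):
    """A toy 'bone' sub-network between nodes i and j: affine map plus ReLU."""
    W = np.array([[0.5, -0.2], [0.1, 0.8]])
    return np.maximum(W @ h, 0.0)

def vine(h):
    """A toy residual branch; here simply the identity mapping."""
    return h

def residual_block(h):
    """F_j = F_{i,j}(F_i(x)) + F'_{i,j}(F_i(x))."""
    return bone(h) + vine(h)

h_i = np.array([1.0, -2.0])   # F_i(x) at node i
print(residual_block(h_i))
```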

Fig. 5.2 An illustration of the influence of a residual connection on the hypothesis space

5.3.2.2 Covering Bound for ResNet

As an example, we analyze the generalization capability of the 34-layer ResNet. Analysis of other deep neural networks under the stem-vine framework is similar. For convenience, we give a detailed illustration of the 34-layer ResNet under the stem-vine framework in Fig. 5.3.
There are one 34-layer stem and 16 vines in the 34-layer ResNet. Each layer in the stem contains one weight matrix and several Lipschitz-continuous nonlinearities. For most layers with more than one nonlinearity, the multiple nonlinearities are connected one by one directly; we merge these nonlinearities into one single nonlinearity.3 However, one vine links the stem at a vertex between two nonlinearities after the 33rd weight matrix, and thus we cannot merge those two nonlinearities. Hence, the stem of ResNet can be expressed as follows:

$$S_{res} = (A_1,\sigma_1,\dots,A_{33},\sigma_{33},\sigma_{34},A_{34},\sigma_{35}). \tag{5.19}$$

From the vertex that receives the input data to the vertex that outputs classification functions, there are $34+35+1=70$ vertexes (34 is the number of weight matrices and 35 is the number of nonlinearities). We denote them as $N(1)$ to $N(70)$. Additionally, we assume the norm of the weight matrix $A_i$ has an upper bound $s_i$, i.e., $\|A_i\|_\sigma\le s_i$, while the Lipschitz constant of the nonlinearity $\sigma_i$ is denoted by $\rho_i$. Under the stem-vine framework, the 16 vines in ResNet are respectively denoted as $V(3,7,1), V(7,11,1),\dots,V(63,67,1)$. Among these 16 vines, there are 3 vines, $V(15,19,1)$, $V(31,35,1)$, and $V(55,59,1)$, that each contain one weight matrix, while all others are identity mappings. Let's denote the weight matrices in the vines $V(15,19,1)$, $V(31,35,1)$, and $V(55,59,1)$ respectively as $A_1^{15,19,1}$, $A_1^{31,35,1}$, and $A_1^{55,59,1}$. Suppose the norms of $A_1^{15,19,1}$, $A_1^{31,35,1}$, and $A_1^{55,59,1}$ are respectively upper bounded by $s_1^{15,19,1}$, $s_1^{31,35,1}$, and $s_1^{55,59,1}$. Denote the reference matrices that correspond to the weight matrices $(A_1,\dots,A_{34})$ as $(M_1,\dots,M_{34})$. Suppose the distance between each weight matrix $A_i$ and the corresponding reference matrix $M_i$ is upper bounded by $b_i$, i.e., $\|A_i^T-M_i^T\|\le b_i$. Similarly, suppose there are reference matrices $M_1^{s,t,1}$, $(s,t)\in\{(15,19),(31,35),(55,59)\}$, respectively for the weight matrices $A_1^{s,t,1}$, and the distance between $A_1^{s,t,1}$ and $M_1^{s,t,1}$ is upper bounded by $b_1^{s,t,1}$, i.e., $\|(A_1^{s,t,1})^T-(M_1^{s,t,1})^T\|\le b_1^{s,t,1}$. We then have the following lemma.

3 Specifically, if there are two Lipschitz-continuous nonlinearities connected directly somewhere


in the stem, such as one max-pooling and one ReLU, we compose the two nonlinearities as one
single nonlinearity. The composition is well-defined, as the composition of two Lipschitz-continuous
nonlinearities is still a Lipschitz-continuous nonlinearity. The Lipschitz constant of the composition
function is the product of the Lipschitz constants respectively of the two nonlinearities. Additionally,
the composition would not limit any generality, as in our theory, different nonlinearities with the
same Lipschitz constant have the same influence on the generalization bound (this argument is
supported by Eq. (5.20) of Lemma 5.2, Theorem 5.7, etc.).
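To make the bookkeeping above concrete, the following minimal Python sketch (our own illustration, not the book's code) enumerates the stem-vine structure of the 34-layer ResNet: the 70 vertexes and the 16 vines $V(4i-1,4i+3,1)$, only three of which carry a weight matrix.

```python
# Counts follow the description of the 34-layer ResNet in this section.
num_weight_matrices = 34
num_nonlinearities = 35
num_vertices = num_weight_matrices + num_nonlinearities + 1   # 70

vines = [(4 * i - 1, 4 * i + 3, 1) for i in range(1, 17)]      # 16 vines
vines_with_weights = [(4 * i - 1, 4 * i + 3, 1) for i in (4, 8, 14)]
identity_vines = [v for v in vines if v not in vines_with_weights]

print(num_vertices)          # 70
print(vines_with_weights)    # [(15, 19, 1), (31, 35, 1), (55, 59, 1)]
print(len(identity_vines))   # 13
```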

Fig. 5.3 The 34-layer ResNet under the stem-vine framework

Lemma 5.2 (Covering Number Bound for ResNet) For a ResNet R satisfying all the conditions above, suppose the hypothesis space is $\mathcal{H}_R$. Then, we have

$$\log\mathcal{N}(\mathcal{H}_R,\varepsilon,\|\cdot\|) \le \sum_{u\in\{15,31,55\}}\frac{(b_1^{u,u+4,1})^2\,\|F_u(X^T)^T\|_2^2}{\varepsilon_{u,u+4,1}^2}\log(2W^2) + \sum_{j=1}^{34}\frac{b_j^2\,\|F_{2j-1}(X^T)^T\|_2^2}{\varepsilon_{2j+1}^2}\log(2W^2) + \frac{b_{34}^2\,\|F_{68}(X^T)^T\|_2^2}{\varepsilon_{70}^2}\log(2W^2), \tag{5.20}$$

where $\mathcal{N}(\mathcal{H}_R,\varepsilon,\|\cdot\|)$ is the $\varepsilon$-covering number of $\mathcal{H}_R$. When $j=1,\dots,16$,

$$\|F_{4j+1}(X)\|_2^2 \le \|X\|_2^2\,\rho_1^2 s_1^2\,\rho_{2j}^2 s_{2j}^2 \prod_{\substack{1\le i\le j-1\\ i\notin\{4,8,14\}}}\left(\rho_{2i}^2 s_{2i}^2\rho_{2i+1}^2 s_{2i+1}^2+1\right) \prod_{\substack{1\le i\le j-1\\ i\in\{4,8,14\}}}\left[\rho_{2i}^2 s_{2i}^2\rho_{2i+1}^2 s_{2i+1}^2+\big(s_1^{4i-1,4i+3,1}\big)^2\right], \tag{5.21}$$

and

$$\|F_{4j+3}(X)\|_2^2 \le \|X\|_2^2\,\rho_1^2 s_1^2 \prod_{\substack{1\le i\le j\\ i\notin\{4,8,14\}}}\left(\rho_{2i}^2 s_{2i}^2\rho_{2i+1}^2 s_{2i+1}^2+1\right) \prod_{\substack{1\le i\le j\\ i\in\{4,8,14\}}}\left[\rho_{2i}^2 s_{2i}^2\rho_{2i+1}^2 s_{2i+1}^2+\big(s_1^{4i-1,4i+3,1}\big)^2\right], \tag{5.22}$$

and specifically,

$$\|F_{68}(X^T)^T\|_2^2 \le \|X\|_2^2\,\rho_1^2 s_1^2\,\rho_{34}^2 \prod_{\substack{1\le i\le 16\\ i\notin\{4,8,14\}}}\left(\rho_{2i}^2 s_{2i}^2\rho_{2i+1}^2 s_{2i+1}^2+1\right) \prod_{\substack{1\le i\le 16\\ i\in\{4,8,14\}}}\left[\rho_{2i}^2 s_{2i}^2\rho_{2i+1}^2 s_{2i+1}^2+\big(s_1^{4i-1,4i+3,1}\big)^2\right]. \tag{5.23}$$

Also, when $j=1,\dots,16$,

$$\varepsilon_{4j+1} = (1+s_1)\rho_1(1+s_{2j})\rho_{2j} \prod_{\substack{1\le i\le j-1\\ i\notin\{4,8,14\}}}\left[(*)+1\right] \prod_{\substack{1\le i\le j-1\\ i\in\{4,8,14\}}}\left[(*)+1+s_1^{4i-1,4i+3,1}\right], \tag{5.24}$$

and

$$\varepsilon_{4j+3} = (1+s_1)\rho_1 \prod_{\substack{1\le i\le j\\ i\notin\{4,8,14\}}}\left[(*)+1\right] \prod_{\substack{1\le i\le j\\ i\in\{4,8,14\}}}\left[(*)+1+s_1^{4i-1,4i+3,1}\right], \tag{5.25}$$

and for $u=15,31,55$,

$$\varepsilon_{u,u+4,1} = \varepsilon_u\left(1+s_1^{u,u+4,1}\right). \tag{5.26}$$

In the above equations/inequalities,

$$\bar{\alpha} = (s_1+1)\rho_1\,\rho_{34}(s_{34}+1)\rho_{35} \prod_{\substack{1\le i\le 16\\ i\notin\{4,8,14\}}}\left[(*)+1\right] \prod_{i\in\{4,8,14\}}\left[(*)+s_1^{4i-1,4i+3,1}+1\right], \tag{5.27}$$

and

$$(*) = \rho_{2i}(s_{2i}+1)\rho_{2i+1}(s_{2i+1}+1). \tag{5.28}$$

A detailed proof is given in Sect. 5.3.3.4.

5.3.2.3 Generalization Bound for ResNet

Lemma 3.2 guarantees that when the covering number of a hypothesis space is upper
bounded, the corresponding generalization error is also upper bounded. This lemma
provides a theoretical foundation for establishing generalization bounds based on the
covering number of the hypothesis space. By leveraging Lemma 5.2, which presents a
specific covering bound for ResNet, we can derive a concrete generalization bound for
this architecture. In this subsection, we summarize and formalize the generalization
bound as Theorem 5.7, providing a clear theoretical link between the complexity of
the hypothesis space and the expected generalization performance of ResNet.
For brevity, we rewrite the radii $\varepsilon_{2j+1}$ and $\varepsilon_{u,u+4,1}$ as follows:

$$\varepsilon_{2j+1} := \hat{\varepsilon}_{2j+1}\,\varepsilon, \tag{5.29}$$

$$\varepsilon_{u,u+4,1} := \hat{\varepsilon}_{u,u+4,1}\,\varepsilon. \tag{5.30}$$

Additionally, we rewrite Eq. (5.20) of Lemma 5.2 as the following inequality:

$$\log\mathcal{N}(\mathcal{H},\varepsilon,\|\cdot\|) \le \frac{R}{\varepsilon^2}, \tag{5.31}$$
where

$$R = \sum_{u\in\{15,31,55\}}\frac{(b_1^{u,u+4,1})^2\,\|F_u(X^T)^T\|_2^2}{\hat{\varepsilon}_{u,u+4,1}^2}\log(2W^2) + \sum_{j=1}^{33}\frac{b_j^2\,\|F_{2j-1}(X^T)^T\|_2^2}{\hat{\varepsilon}_{2j+1}^2}\log(2W^2) + \frac{b_{34}^2\,\|F_{68}(X^T)^T\|_2^2}{\hat{\varepsilon}_{70}^2}\log(2W^2). \tag{5.32}$$

Then, we can obtain the following theorem.


We first define the margin operator $\mathcal{M}$ for the $k$-class classification task as

$$\mathcal{M}: \mathbb{R}^k\times\{1,\dots,k\}\to\mathbb{R},\quad (v,y)\mapsto v_y - \max_{i\neq y} v_i.$$

Then, the ramp loss $l_\lambda$ is defined as

$$l_\lambda(x) = \begin{cases} 0, & x < -\lambda,\\ 1 + x/\lambda, & -\lambda\le x < 0,\\ 1, & x \ge 0. \end{cases}$$

Furthermore, the empirical ramp risk is defined as

$$\hat{R}^\lambda_S = \frac{1}{m}\sum_{i=1}^{m} l_\lambda\big(-\mathcal{M}(F(x_i),y_i)\big).$$
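These definitions translate directly into code. The following minimal Python sketch (our own illustration; the toy logits, labels, and margin parameter are made up) computes the margin operator, the ramp loss, and the empirical ramp risk:

```python
import numpy as np

def margin(logits, y):
    """M(v, y) = v_y - max_{i != y} v_i for one logit vector v."""
    v = np.asarray(logits, dtype=float)
    return v[y] - np.delete(v, y).max()

def ramp_loss(x, lam):
    """Ramp loss l_lambda: 0 below -lambda, linear on [-lambda, 0), 1 above."""
    if x < -lam:
        return 0.0
    if x < 0:
        return 1.0 + x / lam
    return 1.0

def empirical_ramp_risk(all_logits, labels, lam):
    """(1/m) * sum_i l_lambda(-M(F(x_i), y_i))."""
    losses = [ramp_loss(-margin(v, y), lam) for v, y in zip(all_logits, labels)]
    return float(np.mean(losses))

# Toy usage: two examples, three classes, margin parameter lambda = 1.
logits = [[2.0, 0.5, -1.0], [0.2, 0.1, 0.3]]
labels = [0, 1]
print(empirical_ramp_risk(logits, labels, lam=1.0))
```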

Theorem 5.7 (Generalization Bound for ResNet) Suppose a ResNet satisfies all the conditions in Lemma 5.2. Suppose the examples $(x_1,y_1),\dots,(x_m,y_m)$ are i.i.d. random variables. Suppose the hypothesis function $F_\mathcal{A}:\mathbb{R}^{n_0}\to\mathbb{R}^{n_L}$ is computed by a ResNet with weight matrices $\mathcal{A}=(A_1,\dots,A_{34},A_1^{15,19,1},A_1^{31,35,1},A_1^{55,59,1})$. Then for any margin $\lambda>0$ and any real $\delta\in(0,1)$, with probability at least $1-\delta$, we have the following inequality:

$$\Pr\left\{\arg\max_i F(x)_i \neq y\right\} \le \hat{R}^\lambda_S(F) + \frac{8}{m^{3/2}} + \frac{36}{m}\sqrt{R}\log m + 3\sqrt{\frac{\log(1/\delta)}{2m}}, \tag{5.33}$$

where $R$ is defined as in Eq. (5.32).
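For intuition, the right-hand side of Eq. (5.33) can be evaluated numerically once the empirical ramp risk, the covering-number magnitude $R$ of Eq. (5.32), the sample size $m$, and the confidence parameter $\delta$ are given. The following minimal Python sketch (our own illustration; the values of $R$ and the empirical risk are made up) shows how the bound shrinks as $m$ grows:

```python
import math

def resnet_generalization_bound(empirical_ramp_risk, R, m, delta):
    """Right-hand side of the margin-based bound in Eq. (5.33)."""
    complexity_term = 8.0 / m ** 1.5 + (36.0 / m) * math.sqrt(R) * math.log(m)
    confidence_term = 3.0 * math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return empirical_ramp_risk + complexity_term + confidence_term

# The bound decreases as the sample size m grows (for fixed R and delta).
for m in (10_000, 100_000, 1_000_000):
    print(m, resnet_generalization_bound(0.05, R=50.0, m=m, delta=0.01))
```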


This generalization bound is established via the hypothesis complexity which
is measured by the covering number of the hypothesis space. This relationship is
characterised by Lemma 3.1. Intuitively, Lemma 3.1 follows the principle of Occam’s
razor, which demonstrates a negative correlation between the generalization ability
of an algorithm and its hypothesis complexity. Directly applying Lemma 3.1 with
the covering number bound (Theorem 5.6), we can obtain Theorem 5.7. A proof is
omitted here and will be given in Sect. 5.3.3.5.

As indicated by Theorem 5.7, the generalization bound of ResNet relies on its covering bound. Specifically, when the sample size $m$ and the confidence parameter $\delta$ are fixed, the generalization error satisfies

$$\Pr\left\{\arg\max_i F(x)_i\neq y\right\} - \hat{R}^\lambda_S(F) = O\big(\sqrt{R}\big), \tag{5.34}$$

where $R$ expresses the magnitude of the covering number ($R/\varepsilon^2$ is an $\varepsilon$-covering bound).
Incorporating the characteristics applicable to any neural network within the stem-
vine framework, Eq. (5.34) provides two insights into the impact of residual con-
nections on neural network generalization capability: (1) The influence of weight
matrices on generalization remains invariant regardless of their location (stem or
vines); (2) The addition of an identity vine does not affect generalization. These
findings offer a theoretical rationale for ResNet’s equivalent generalization capabil-
ity compared to chain-like neural networks. This implies that the placement of weight
matrices within the network structure, whether in the stem or the vines, does not fun-
damentally alter the network’s capability to generalize. Additionally, the inclusion
of identity vines, which do not introduce additional transformations, maintains this
generalization capability without detriment.
As indicated in Eq. (5.33), the expected risk of ResNet, which corresponds to the
expectation of the test error, is composed of the sum of the empirical risk (training
error) and the generalization error. Notably, residual connections have demonstrated
a remarkable capability to reduce the training error of neural networks across various
tasks. Our findings provide a theoretical basis for understanding why ResNet exhibits
substantially lower test errors in these scenarios. This elucidates the underlying mech-
anism by which ResNet achieves superior performance by effectively mitigating
training error, leading to enhanced generalization ability in practical applications.

5.3.2.4 Practical Implementation

In addition to the sample size m, our generalization bound (Eq. (5.33)) highlights
a positive correlation with the norms of all weight matrices in the neural network.
Specifically, weight matrices with higher norms contribute to a higher generaliza-
tion bound, suggesting a potential trade-off with generalization performance. This
observation validates the importance of techniques such as weight decay, which aims
to regulate the norms of weight matrices to optimize generalization performance in
deep learning models. By controlling the norms of weight matrices, practitioners
can mitigate the risk of overfitting and enhance the model’s ability to generalize to
unseen data.
Weight decay, originally introduced by Krogh and Hertz (1992), has become a widely adopted regularization technique in the training of deep neural networks. In the context of learning theory, weight decay augments the typical training objective by adding the $L_2$ norm of all weights as a regularization term to the loss function during training. This term penalizes large weight values, aiming to prevent overfitting by discouraging complex model configurations that fit the training data too closely.
By constraining the magnitude of weight norms, weight decay promotes smoother
and more stable training dynamics, which often leads to improved generalization
performance on unseen data. This regularization method is particularly effective in
controlling model complexity and mitigating the risk of overfitting, contributing to
the robustness and stability of deep learning models.
Remark 5.2 The technique of weight decay can improve the generalization ability of deep neural networks. It refers to adding the $L_2$ norm of the weights $W=(W_1,\dots,W_D)$ to the objective function as a regularization term:

$$\mathcal{L}'(W) = \mathcal{L}(W) + \frac{1}{2}\lambda\sum_{i=1}^{D}\|W_i\|^2,$$

where $\lambda$ is a tuneable parameter, $\mathcal{L}(W)$ is the original objective function, and $\mathcal{L}'(W)$ is the objective function with weight decay.

The term $\frac{1}{2}\lambda\sum_{i=1}^{D}\|W_i\|^2$ can be easily re-expressed via the $L_2$ norms of all the weight matrices. Therefore, using weight decay keeps the magnitudes of the norms of all the weight matrices from growing too large. Also, our generalization bound (Eq. (5.33)) shows a positive correlation between the generalization bound and the norms of all the weight matrices. Thus, our work gives a justification for why weight decay leads to better generalization ability.
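The following minimal Python sketch (our own illustration, with made-up weights and gradients) makes Remark 5.2 concrete: the regularized objective adds $(\lambda/2)\sum_i\|W_i\|^2$, so each gradient step additionally shrinks the weights by $\lambda W_i$:

```python
import numpy as np

def regularized_loss(data_loss, weights, lam):
    """L'(W) = L(W) + (lambda/2) * sum_i ||W_i||_2^2."""
    penalty = 0.5 * lam * sum(float(np.sum(W ** 2)) for W in weights)
    return data_loss + penalty

def sgd_step_with_weight_decay(weights, grads, lr, lam):
    """One SGD step on the regularized objective; the penalty gradient is lambda * W."""
    return [W - lr * (g + lam * W) for W, g in zip(weights, grads)]

# Toy usage with two weight matrices and made-up gradients.
weights = [np.ones((2, 2)), 0.5 * np.ones((3, 2))]
grads = [0.1 * np.ones((2, 2)), 0.2 * np.ones((3, 2))]
weights = sgd_step_with_weight_decay(weights, grads, lr=0.1, lam=1e-2)
print(regularized_loss(0.0, weights, lam=1e-2))
```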
A recent systematic experiment conducted by Li et al. (2018c) studies the influence of weight decay on the loss surface of deep neural networks. It trains a 9-layer VGGNet (Chen et al. 2017) on the CIFAR-10 dataset (Krizhevsky and Hinton 2009) by employing stochastic gradient descent with batch sizes of 128 (0.26% of the CIFAR-10 training set) and 8192 (16.28% of the CIFAR-10 training set). The results demonstrate that by employing weight decay, SGD can find flatter minima4 of the loss surface with lower test errors, as shown in Fig. 5.4 (originally presented as Li et al. (2018c), p. 6, Fig. 3). Other technical advances and empirical analyses include Galloway et al. (2018), Zhang et al. (2018b), Chen et al. (2018a), and Park et al. (2019).

Fig. 5.4 Illustrations of the 1D and 2D visualizations of the loss surface around the solutions obtained with different weight decay and batch sizes. The numbers in the title of each subfigure are, respectively, the weight decay parameter, the batch size, and the test error. The data and figures are originally presented in Li et al. (2018c)

4 The flatness (or, equivalently, sharpness) of the loss surface around a minimum is considered an important index of the generalization ability, but the underlying mechanism remains elusive. For more details, please refer to Keskar et al. (2017) and Dinh et al. (2017).

5.3.3 Proofs of Section 5.3

This appendix compiles several proofs that were omitted from Sect. 5.3.2. First, we
present a proof establishing the covering bound for an affine transformation induced by a single weight matrix, which serves as a foundation for subsequent analyses.


Next, we provide a detailed proof of the covering bound for deep neural networks
within the stem-vine framework (Theorem 5.6). Additionally, we include a proof
demonstrating the covering bound specific to ResNet architectures (Lemma 5.2).
Finally, we provide a comprehensive proof of the generalization bound for ResNet
(Theorem 5.7). These proofs collectively contribute to a deeper understanding of the
theoretical underpinnings discussed in the main section.

5.3.3.1 Proof of the Covering Bound for the Hypothesis Space of a Single Weight Matrix

In this subsection, we provide an upper bound for the covering number of the hypoth-
esis space induced by a single weight matrix A. This covering bound relies on Maurey
sparsification lemma (Pisier 1981) and has been introduced in machine learning by
previous works (see, e.g.,(Zhang 2002; Bartlett et al. 2017)).
Suppose a data matrix X is the input of a weight matrix A. All possible values
of the output X A constitute a space. We use the following lemma to express the
complexity of all X A via the covering number.

Lemma 5.3 (Bartlett et al.; see (Bartlett et al. 2017), Lemma 3.2) Let conjugate exponents $(p,q)$ and $(r,s)$ be given with $p\le 2$, as well as positive reals $(a,b,\varepsilon)$ and positive integer $m$. Let matrix $X\in\mathbb{R}^{n\times d}$ be given with $\|X\|_p\le b$. Let $\mathcal{H}_A$ denote the family of matrices obtained by evaluating $X$ with all choices of matrix $A$:

$$\mathcal{H}_A \triangleq \left\{XA \mid A\in\mathbb{R}^{d\times m},\ \|A\|_{q,s}\le a\right\}. \tag{5.35}$$

Then

$$\log\mathcal{N}(\mathcal{H}_A,\varepsilon,\|\cdot\|_2) \le \left\lceil\frac{a^2 b^2 m^{2/r}}{\varepsilon^2}\right\rceil\log(2dm). \tag{5.36}$$
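For intuition, the bound of Lemma 5.3 is easy to evaluate numerically. The following minimal Python sketch (our own illustration; all constants are made-up example values) computes the right-hand side of Eq. (5.36):

```python
import math

def single_matrix_log_covering_bound(a, b, d, m, r, eps):
    """log N <= ceil(a^2 * b^2 * m^(2/r) / eps^2) * log(2 * d * m)."""
    return math.ceil(a ** 2 * b ** 2 * m ** (2.0 / r) / eps ** 2) * math.log(2 * d * m)

# Example: norm constraint a = 2 on A, data norm b = 5, input dimension d = 100,
# output dimension m = 50, conjugate exponent r = 2, cover radius eps = 0.5.
print(single_matrix_log_covering_bound(a=2.0, b=5.0, d=100, m=50, r=2, eps=0.5))
```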

5.3.3.2 Covering Bound for the Hypothesis Space of Chain-like Neural Network

This subsection considers the upper bound for the covering number of the hypothesis space induced by the stem of a deep neural network. Intuitively, following the stem from the first vertex $N(1)$ to the last one $N(L)$, every weight matrix and nonlinearity increases the complexity of the hypothesis space that can be computed by the stem. Following this intuition, we use an induction method to approach the upper bound. The result is summarized as Lemma 5.1. This lemma was originally given in the work by Bartlett et al. (2017). To make this work complete, we recall the main part of the proof but omit the part for $\varepsilon$.
Proof of Lemma 5.1 We use an induction procedure to prove the lemma.

(1) The covering number of the hypothesis space computed by the first weight matrix $A_1$ can be directly upper bounded by Lemma 5.3.

(2) The vertex after the $j$-th nonlinearity is $N(2j+1)$. Suppose $\mathcal{W}_{2j+1}$ is an $\varepsilon_{2j+1}$-cover of the hypothesis space $\mathcal{H}_{2j+1}$ induced by the output hypotheses at the vertex $N(2j+1)$. Suppose there is a weight matrix $A_{j+1}$ that directly follows the vertex $N(2j+1)$. We then analyze the contribution of the weight matrix $A_{j+1}$. Assume that there exists an upper bound $s_{j+1}$ on the norm of $A_{j+1}$. For any $F_{2j+1}(X)\in\mathcal{H}_{2j+1}$, there exists a $W(X)\in\mathcal{W}_{2j+1}$ such that

$$\|F_{2j+1}(X) - W(X)\| \le \varepsilon_{2j+1}. \tag{5.37}$$

Lemma 5.3 guarantees that for any $W(X)\in\mathcal{W}_{2j+1}$ there exists an $\varepsilon_{2j+1}$-cover $\mathcal{W}_{2j+2}(W)$ for the function space $\{W(X)A_{j+1}: \|A_{j+1}\|\le s_{j+1}\}$, i.e., for any $W'(X)$ in this function space, there exists a $V(X)\in\mathcal{W}_{2j+2}(W)$ such that

$$\|W'(X) - V(X)\| \le \varepsilon_{2j+1}. \tag{5.38}$$

For any $F'_{2j+1}(X)\in\mathcal{H}_{2j+2}\triangleq\{F_{2j+1}(X)A_{j+1}: F_{2j+1}(X)\in\mathcal{H}_{2j+1},\ \|A_{j+1}\|\le s_{j+1}\}$, there is an $F_{2j+1}(X)\in\mathcal{H}_{2j+1}$ such that

$$F'_{2j+1}(X) = F_{2j+1}(X)A_{j+1}. \tag{5.39}$$

Thus, applying Eqs. (5.37), (5.38), and (5.39), we prove the following inequality:

$$\begin{aligned} \|F'_{2j+1}(X) - V(X)\| &= \|F_{2j+1}(X)A_{j+1} - V(X)\|\\ &\le \|F_{2j+1}(X)A_{j+1} - W(X)A_{j+1}\| + \|W(X)A_{j+1} - V(X)\|\\ &\le \|F_{2j+1}(X) - W(X)\|\,\|A_{j+1}\| + \varepsilon_{2j+1}\\ &\le s_{j+1}\varepsilon_{2j+1} + \varepsilon_{2j+1} = (s_{j+1}+1)\varepsilon_{2j+1}. \end{aligned} \tag{5.40}$$

Therefore, $\bigcup_{W\in\mathcal{W}_{2j+1}}\mathcal{W}_{2j+2}(W)$ is an $(s_{j+1}+1)\varepsilon_{2j+1}$-cover of $\mathcal{H}_{2j+2}$. Let's denote $(s_{j+1}+1)\varepsilon_{2j+1}$ by $\varepsilon_{2j+2}$. Apparently,

$$\begin{aligned} \mathcal{N}(\mathcal{H}_{2j+2},\varepsilon_{2j+2},\|\cdot\|) &\le \left|\bigcup_{W\in\mathcal{W}_{2j+1}}\mathcal{W}_{2j+2}(W)\right| \le |\mathcal{W}_{2j+1}|\cdot\sup_{W\in\mathcal{W}_{2j+1}}|\mathcal{W}_{2j+2}(W)|\\ &\le \mathcal{N}(\mathcal{H}_{2j+1},\varepsilon_{2j+1},\|\cdot\|)\sup_{\substack{(A_1,\dots,A_j)\\ \forall j'\le j,\ A_{j'}\in\mathcal{B}_{j'}}}\mathcal{N}\big((**),\varepsilon_{2j+1},\|\cdot\|\big), \end{aligned} \tag{5.41}$$

where

$$(**) = \{A_{j+1}F_{2j+1}(X): A_{j+1}\in\mathcal{B}_{j+1}\}.$$

Thus, the product of the two covering numbers on the right-hand side is an upper bound for the $\varepsilon_{2j+2}$-covering number of the hypothesis space $\mathcal{H}_{2j+2}$.

(3) The vertex after the $j$-th weight matrix is $N(2j-1)$. Suppose $\mathcal{W}_{2j-1}$ is an $\varepsilon_{2j-1}$-cover of the hypothesis space $\mathcal{H}_{2j-1}$ induced by the output hypotheses at the vertex $N(2j-1)$. Suppose there is a nonlinearity $\sigma_j$ that directly follows the vertex $N(2j-1)$. We then analyze the contribution of the nonlinearity $\sigma_j$. Assume that the nonlinearity $\sigma_j$ is $\rho_j$-Lipschitz continuous. Apparently, $\sigma_j(\mathcal{W}_{2j-1})$ is a $\rho_j\varepsilon_{2j-1}$-cover of the hypothesis space $\sigma_j(\mathcal{H}_{2j-1})$. Specifically, for any $F'\in\sigma_j(\mathcal{H}_{2j-1})$, there exists an $F\in\mathcal{H}_{2j-1}$ such that $F'=\sigma_j(F)$. Since $\mathcal{W}_{2j-1}$ is an $\varepsilon_{2j-1}$-cover of the hypothesis space $\mathcal{H}_{2j-1}$, there exists a $W\in\mathcal{W}_{2j-1}$ such that

$$\|F - W\| \le \varepsilon_{2j-1}. \tag{5.42}$$

Therefore, we have the following inequality:

$$\|F' - \sigma_j(W)\| = \|\sigma_j(F) - \sigma_j(W)\| \le \rho_j\|F - W\| \le \rho_j\varepsilon_{2j-1}. \tag{5.43}$$

We thus prove that $\mathcal{W}_{2j}\triangleq\sigma_j(\mathcal{W}_{2j-1})$ is a $\rho_j\varepsilon_{2j-1}$-cover of the hypothesis space $\sigma_j(\mathcal{H}_{2j-1})$. In particular, the covering number remains the same when a nonlinearity is applied to the neural network.

By analyzing the influence of the weight matrices and nonlinearities one by one, we can prove Eq. (5.17). As for $\varepsilon$, the argument above indeed gives a constructive method to obtain $\varepsilon$ from all the $\varepsilon_i$. Here we omit the explicit formulation of $\varepsilon$ in terms of the $\varepsilon_i$, since it would not benefit our theory. This completes the proof.
5.3.3.3 Covering Bound for the Hypothesis Space of Deep Neural Networks with Residual Connections

In Sect. 5.3.2.1, we give a covering bound generally for all deep neural networks with
residual connections. The result is summarised as Theorem 5.6. In this subsection,
we give a detailed proof of Theorem 5.6.
Proof of Theorem 5.6 To approach the covering bound for deep neural networks with residuals, we first analyze the influence of adding a vine to a deep neural network, and then use an induction method to obtain a covering bound for the whole network.
All vines are connected to the stem at two points, located respectively after a nonlinearity and before a weight matrix. When the input $F_u(X)$ of the vine $V(u,v,i)$ is fixed, suppose all the hypothesis functions $F_V^{u,v,i}(X)$ computed by the vine $V(u,v,i)$ constitute a hypothesis space $\mathcal{H}_V^{u,v,i}$. As a vine is also a chain-like neural network constructed by stacking a series of weight matrices and nonlinearities, we can directly apply Lemma 5.1 to obtain an upper bound for the covering number of the hypothesis space $\mathcal{H}_V^{u,v,i}$. It is worth noting that vines can be identity mappings. This situation is common in ResNet: 13 of the 16 vines are identities. When a vine is an identity, the hypothesis space computed by the vine contains only one element, an identity mapping, and its covering number is apparently 1.
Applying Lemmas 5.3 and 5.1, there exists an $\varepsilon_v$-cover $\mathcal{W}_v$ for the hypothesis space $\mathcal{H}_v$ with covering number $\mathcal{N}(\mathcal{H}_v,\varepsilon_v,\|\cdot\|)$, as well as an $\varepsilon_V^{u,v,i}$-cover $\mathcal{W}_V^{u,v,i}$ for the hypothesis space $\mathcal{H}_V^{u,v,i}$ with covering number $\mathcal{N}(\mathcal{H}_V^{u,v,i},\varepsilon_V^{u,v,i},\|\cdot\|)$.
The hypotheses computed by the deep neural network without the vine $V(u,v,i)$ and by the vine itself, i.e., $F_v(X)$ and $F_V^{u,v,i}(X)$, respectively, are added element-wise at the vertex $N(v)$. We denote the space constituted by all $F'\triangleq F_v(X)+F_V^{u,v,i}(X)$ by $\mathcal{H}'_v$.
Let's define a function space $\mathcal{W}'_v\triangleq\{W_S+W_V: W_S\in\mathcal{W}_v,\ W_V\in\mathcal{W}_V^{u,v,i}\}$. For any hypothesis $F'\in\mathcal{H}'_v$, there must exist $F_S\in\mathcal{H}_v$ and $F_V\in\mathcal{H}_V^{u,v,i}$ such that

$$F'(X) = F_S(X) + F_V(X). \tag{5.44}$$

Because $\mathcal{W}_v$ is an $\varepsilon_v$-cover of the hypothesis space $\mathcal{H}_v$, for any $F_S\in\mathcal{H}_v$ there exists an element $W_{F_S}(X)\in\mathcal{W}_v$ such that

$$\|F_S(X) - W_{F_S}(X)\| \le \varepsilon_v. \tag{5.45}$$

Similarly, as $\mathcal{W}_V^{u,v,i}$ is an $\varepsilon_V^{u,v,i}$-cover of $\mathcal{H}_V^{u,v,i}$, for any $F_V(X)\in\mathcal{H}_V^{u,v,i}$ there exists an element $W_{F_V}(X)\in\mathcal{W}_V^{u,v,i}$ such that

$$\|F_V(X) - W_{F_V}(X)\| \le \varepsilon_V^{u,v,i}. \tag{5.46}$$
Therefore, for any hypothesis $F'(X)\in\mathcal{H}'_v$, there exists an element $W(X)\in\mathcal{W}'_v$ with $W(X)=W_{F_S}(X)+W_{F_V}(X)$ satisfying Eqs. (5.45) and (5.46), and

$$\|F'(X)-W(X)\| = \|F_V(X)+F_S(X)-W_{F_V}(X)-W_{F_S}(X)\| \le \|F_V(X)-W_{F_V}(X)\| + \|F_S(X)-W_{F_S}(X)\| \le \varepsilon_V^{u,v,i}+\varepsilon_v. \tag{5.47}$$

Therefore, the function space $\mathcal{W}'_v$ is an $(\varepsilon_V^{u,v,i}+\varepsilon_v)$-cover of the hypothesis space $\mathcal{H}'_v$. An upper bound for the cardinality of the function space $\mathcal{W}'_v$ (which is also an $(\varepsilon_V^{u,v,i}+\varepsilon_v)$-covering number of the hypothesis space $\mathcal{H}'_v$) is given below:

$$\mathcal{N}(\mathcal{H}'_v,\varepsilon_V^{u,v,i}+\varepsilon_v,\|\cdot\|) \le |\mathcal{W}'_v| \le |\mathcal{W}_v|\cdot|\mathcal{W}_V^{u,v,i}| \le \sup_{F_{v-2}}\mathcal{N}(\mathcal{H}_v,\varepsilon_v,\|\cdot\|)\cdot\sup_{F_u}\mathcal{N}(\mathcal{H}_V^{u,v,i},\varepsilon_V^{u,v,i},\|\cdot\|) \le \sup_{F_{v-2}}\mathcal{N}_v\cdot\sup_{F_u}\mathcal{N}_V^{u,v,i}, \tag{5.48}$$

where $\mathcal{N}_v$ and $\mathcal{N}_V^{u,v,i}$ can be obtained from Eq. (5.17) in Lemma 5.1, as the stem and all the vines are chain-like neural networks.
By adding vines to the stem one by one, we can construct the whole deep neural network. Combining Lemma 5.1 for the covering numbers of $F_{v-1}(X)$ and $F_u(X)$, we further prove the following inequality:

$$\mathcal{N}(\mathcal{H},\varepsilon,\|\cdot\|) \le \prod_{j=1}^{L}\sup_{F_{M(j)}}\mathcal{N}_{M(j+1)}\prod_{(u,v,i)\in I_V}\sup_{F_u}\mathcal{N}_V^{u,v,i}. \tag{5.49}$$

Thus, we prove Eq. (5.15) of Theorem 5.6. As for $\varepsilon$, the argument above indeed gives a constructive method to obtain $\varepsilon$ from all $\varepsilon_i$ and $\varepsilon_{u,v,j}$. Here we omit the explicit formulation of $\varepsilon$ in terms of $\varepsilon_i$ and $\varepsilon_{u,v,j}$, since it would be extremely complex and does not benefit our theory. The proof is completed.

5.3.3.4 Covering Bound for the Hypothesis Space of ResNet

In Sect. 5.3.2.2, we give a covering bound for ResNet. The result is summarized as
Lemma 5.2. In this subsection, we give a detailed proof of Lemma 5.2.
Proof of Lemma 5.2 There are 34 weight matrices and 35 nonlinearities in the stem of the 34-layer ResNet. Let's denote the weight matrices by $A_1,\dots,A_{34}$ and the nonlinearities by $\sigma_1,\dots,\sigma_{35}$. Apparently, there are $34+35+1=70$ vertexes in the network, where 34 is the number of weight matrices and 35 is the number of nonlinearities. We denote them by $N(1),\dots,N(70)$. Additionally, there are 16 vines, denoted by $V(4i-1,4i+3,1)$, $i\in\{1,\dots,16\}$, where $4i-1$ and $4i+3$ are the indexes of the vertexes that the vine connects. Among all the 16 vines, three of them, $V(15,19,1)$, $V(31,35,1)$, and $V(55,59,1)$, contain one weight matrix each, while all others are identity mappings. For the vine $V(4i-1,4i+3,1)$, $i=4,8,14$, we denote the weight matrix in the vine by $A_1^{4i-1,4i+3,1}$.
Applying Theorem 5.6, we directly prove the following inequality:

$$\log\mathcal{N}(\mathcal{H},\varepsilon,\|\cdot\|) \le \sum_{j=1}^{34}\sup_{F_{2j-1}(X)}\log\mathcal{N}_{2j+1} + \sum_{(u,v,i)\in I_V}\sup_{F_u(X)}\log\mathcal{N}_V^{u,v,1}, \tag{5.50}$$

where $\mathcal{N}_{2j+1}$ is the covering number of the hypothesis space constituted by all outputs $F_{2j+1}(X)$ at the vertex $N(2j+1)$ when the input $F_{2j-1}(X)$ at the vertex $N(2j-1)$ is fixed, $\mathcal{N}_V^{u,v,1}$ is the covering number of the hypothesis space constituted by all outputs $F_V^{u,v,1}(X)$ of the vine $V(u,v,1)$ when the input $F_u(X)$ is fixed, and $I_V$ is the index set $\{(4i-1,4i+3,1),\ i=1,\dots,16\}$.
Applying Lemma 5.3, we can further prove an upper bound for the $\varepsilon_{2j+1}$-covering number $\mathcal{N}_{2j+1}$. The bound is expressed as the following inequality:

$$\log\mathcal{N}_{2j+1} \le \frac{b_{j+1}^2\,\|F_{2j+1}(X^T)^T\|_2^2}{\varepsilon_{2j+1}^2}\log(2W^2), \tag{5.51}$$

where $W$ is the maximum dimension among all features through the ResNet, i.e., $W=\max_i n_i$, $i=0,1,\dots,L$. Also, we can decompose $\|F_{2j+1}(X^T)^T\|_2^2$ and utilize an induction method to obtain an upper bound for it.

(1) If there is no vine connected with the stem at the vertex $N(2j-1)$, we have the following inequality:

$$\|F_{2j+1}(X^T)^T\|_2 = \|\sigma_j(A_j F_{2j-1}(X^T))^T - \sigma_j(0)\|_2 \le \rho_j\,\|A_j F_{2j-1}(X^T)^T - 0\|_2 \le \rho_j\,\|A_j\|_\sigma\cdot\|F_{2j-1}(X^T)^T\|_2. \tag{5.52}$$

(2) If there is a vine $V(2j-3,2j+1,1)$ connected at the vertex $N(2j+1)$, then we prove the following inequality:

$$\begin{aligned} \|F_{2j+1}(X^T)^T\|_2 &= \big\|\sigma_j\big(A_j\sigma_{j-1}(A_{j-1}F_{2j-3}(X^T))\big)^T + A_1^{2j-3,2j+1,1}F_{2j-3}(X^T)^T\big\|_2\\ &\le \big\|\sigma_j\big(A_j\sigma_{j-1}(A_{j-1}F_{2j-3}(X^T))\big)^T\big\|_2 + \big\|A_1^{2j-3,2j+1,1}F_{2j-3}(X^T)^T\big\|_2\\ &\le \rho_j\|A_j\|_\sigma\,\rho_{j-1}\|A_{j-1}\|_\sigma\cdot\|F_{2j-3}(X^T)^T\|_2 + \|A_1^{2j-3,2j+1,1}\|_\sigma\cdot\|F_{2j-3}(X^T)^T\|_2\\ &= \left(\rho_j\rho_{j-1}\|A_j\|_\sigma\|A_{j-1}\|_\sigma + \|A_1^{2j-3,2j+1,1}\|_\sigma\right)\|F_{2j-3}(X^T)^T\|_2. \end{aligned} \tag{5.53}$$

Therefore, based on Eqs. (5.52) and (5.53), we can bound the norms of the outputs of the ResNet as stated in the main text.
Similar to $\mathcal{N}_{2j+1}$, we can obtain an upper bound for the $\varepsilon_{u,v,1}$-covering number $\mathcal{N}_V^{u,v,1}$. Suppose the output computed at the vertex $N(u)$ is $F_u(X^T)$. Then, we can prove the following inequality:

$$\log\mathcal{N}_V^{u,v,1} \le \frac{(b_1^{u,v,1})^2\,\|F_u(X^T)^T\|_2^2}{\varepsilon_{u,v,1}^2}\log(2W^2). \tag{5.54}$$

Applying Eqs. (5.51) and (5.54) to Eq. (5.50), we thus prove Eq. (5.20).
As for the formulation of the radii of the covers, we also employ an induction method.

(1) Suppose the radius of the cover for the hypothesis space computed by the weight matrix $A_1$ and the nonlinearity $\sigma_1$ is $\varepsilon_1$. Then, applying Eqs. (5.40) and (5.43), after the weight matrix $A_2$ and the nonlinearity $\sigma_2$, we prove the following equation:

$$\varepsilon_3 = (s_2+1)\rho_2\,\varepsilon_1. \tag{5.55}$$

(2) Suppose the radius of the cover for the hypothesis space computed by the weight matrix $A_{j-1}$ and the nonlinearity $\sigma_{j-1}$ is $\varepsilon_{2j-1}$. Assume there is no vine connected nearby. Then, similarly, after the weight matrix $A_j$ and the nonlinearity $\sigma_j$, we prove the following equation:

$$\varepsilon_{2j+1} = \rho_j(s_j+1)\,\varepsilon_{2j-1}. \tag{5.56}$$

(3) Suppose the radius of the cover at the vertex $N(i)$ is $\varepsilon_i$. Assume there is a vine $V(u,u+4,1)$ that links the stem at the vertexes $N(u)$ and $N(u+4)$. Then, similarly, after the corresponding weight matrices and nonlinearities, we prove the following equation:

$$\begin{aligned} \varepsilon_{2j+1} &= \varepsilon_{u+2}\left(s_{\frac{u-1}{2}}+1\right)\rho_{\frac{u-1}{2}} + \varepsilon_u\left(s_1^{u,u+4,1}+1\right)\\ &= \varepsilon_u\left(s_{\frac{u-1}{2}}+1\right)\rho_{\frac{u-1}{2}}\left(s_{\frac{u-3}{2}}+1\right)\rho_{\frac{u-3}{2}} + \varepsilon_u\left(s_1^{u,u+4,1}+1\right)\\ &= \varepsilon_u\left[\left(s_{\frac{u-1}{2}}+1\right)\left(s_{\frac{u-3}{2}}+1\right)\rho_{\frac{u-1}{2}}\rho_{\frac{u-3}{2}} + \left(s_1^{u,u+4,1}+1\right)\right]. \end{aligned} \tag{5.57}$$

From Eqs. (5.55), (5.56), and (5.57), we can obtain the following equation:

$$\varepsilon = \varepsilon_1\,\rho_1(s_1+1)\,\rho_{34}(s_{34}+1)\,\rho_{35} \prod_{\substack{1\le i\le 16\\ i\notin\{4,8,14\}}}\left[(***)+1\right] \prod_{i\in\{4,8,14\}}\left[(***)+s_1^{4i-1,4i+3,1}+1\right], \tag{5.58}$$

where

$$(***) = \rho_{2i}(s_{2i}+1)\rho_{2i+1}(s_{2i+1}+1). \tag{5.59}$$

Combining the definition of $\bar{\alpha}$:

$$\bar{\alpha} = \rho_1(s_1+1)\,\rho_{34}(s_{34}+1)\,\rho_{35} \prod_{\substack{1\le i\le 16\\ i\notin\{4,8,14\}}}\left[(***)+1\right] \prod_{i\in\{4,8,14\}}\left[(***)+s_1^{4i-1,4i+3,1}+1\right], \tag{5.60}$$

we can obtain that

$$\varepsilon_1 = \frac{\varepsilon}{\bar{\alpha}}. \tag{5.61}$$

Applying Eqs. (5.55), (5.56), and (5.57), we can then derive all $\varepsilon_{2j+1}$ and $\varepsilon_{u,u+4,1}$.
The proof is completed.

5.3.3.5 Generalization Bound for ResNet

Proof of Theorem 5.7 We prove this theorem in 2 steps: (1) We first apply Lemma
3.2 to Lemma 5.2 in order to prove an upper bound on the Rademacher complexity
of the hypothesis space computed by ResNet; and (2) We then apply the result of (1)
to Lemma 3.1 in order to prove a generalization bound.
(1) Upper bound on the Rademacher complexity.
Applying Eq. (3.12) of Lemma 3.2 to Eq. (5.31) of Lemma 5.2, we can prove the
following inequality:

$$\begin{aligned} \mathcal{R}(\mathcal{H}_\lambda|_D) &\le \inf_{\alpha>0}\left(\frac{4\alpha}{\sqrt{m}} + \frac{12}{m}\int_\alpha^{\sqrt{m}}\sqrt{\log\mathcal{N}(\mathcal{H}_\lambda|_D,\varepsilon,\|\cdot\|_2)}\,\mathrm{d}\varepsilon\right)\\ &\le \inf_{\alpha>0}\left(\frac{4\alpha}{\sqrt{m}} + \frac{12}{m}\int_\alpha^{\sqrt{m}}\frac{\sqrt{R}}{\varepsilon}\,\mathrm{d}\varepsilon\right)\\ &\le \inf_{\alpha>0}\left(\frac{4\alpha}{\sqrt{m}} + \frac{12}{m}\sqrt{R}\log\frac{\sqrt{m}}{\alpha}\right). \end{aligned} \tag{5.62}$$

Apparently, the infimum is reached uniquely at $\alpha = 3\sqrt{R/m}$. Here, we use the simpler and also widely used choice $\alpha = \frac{1}{m}$, and prove the following inequality:

$$\mathcal{R}(\mathcal{H}_\lambda|_D) \le \frac{4}{m^{3/2}} + \frac{18}{m}\sqrt{R}\log m. \tag{5.63}$$
(2) Upper bound on the generalization error.
Combining with Lemma 3.1, we prove the following inequality:

$$\Pr\left\{\arg\max_i F(x)_i \neq y\right\} \le \hat{R}^\lambda_S(F) + \frac{8}{m^{3/2}} + \frac{36}{m}\sqrt{R}\log m + 3\sqrt{\frac{\log(1/\delta)}{2m}}. \tag{5.64}$$

The proof is completed.

5.4 Vacuous Generalization Guarantee in Deep Learning

Conventional statistical learning theory suggests that the hypotheses learned from
datasets of different sizes constitute a Glivenko-Cantelli class (Talagrand 1987; Dud-
ley et al. 1991): the learned hypothesis converges to the target hypothesis, i.e., the
generalization bound decreases, as the size of the training dataset increases. Surpris-
ingly, however, Nagarajan and Kolter (2019) reported that many classical uniform-
convergence generalization bounds may in fact increase with the training dataset size
for interpolators. This behavior is also echoed by the phenomenon of benign over-
fitting. In response to this work, Negrea et al. (2020) defended uniform convergence
by defining a new notion called the structural Glivenko-Cantelli property for learned
hypothesis sequences. They proved that (1) unregularized and overparameterized lin-
ear regression has a structural Glivenko-Cantelli surrogate hypothesis class and (2)
removing a few bits of information from a nonstructural Glivenko-Cantelli hypoth-
esis sequence can yield a sequence with the structural Glivenko-Cantelli property,
whose generalization bounds exhibit double descent. Zhou et al. (2020) proved that
consistency cannot be shown for any set by uniformly bounding the generalization
error in a norm ball. To address this, they proved that zero-error predictors exhibit
uniform convergence in a norm ball. Yang et al. (2021) presented an exact compar-
ison between the generalization error and the uniform-convergence generalization
bound in random feature models.

References

Alex, Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–
1105.
Anders, Krogh and John A Hertz. 1992. A simple weight decay can improve generalization. In
Advances in Neural Information Processing Systems, 950–957.
Andre, Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and
Sebastian Thrun. 2017. Dermatologist-level classification of skin cancer with deep neural net-
works. nature, 542 (7639): 115–118.
Angus, Galloway, Thomas Tanay, and Graham W Taylor. 2018. Adversarial training versus weight
decay. arXiv preprint arXiv:1804.03308.

Behnam, Neyshabur, Ryota Tomioka, and Nathan Srebro. 2015. Norm-based capacity control in
neural networks. In Conference on Learning Theory, 1376–1401.
Behnam, Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. 2017. A PAC-Bayesian approach
to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564
Ben, Taskar, Carlos Guestrin, and Daphne Koller. 2004. Max-margin Markov networks. In Advances
in Neural Information Processing Systems, 25–32.
Chenxi, Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and
Li Fei-Fei. 2019. Auto-deeplab: Hierarchical neural architecture search for semantic image seg-
mentation. In IEEE/CVF conference on computer vision and pattern recognition, 82–92.
David, Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess-
che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. 2016.
Mastering the game of go with deep neural networks and tree search. Nature 529 (7587): 484–489.
Eunchun, Park, B. Wade Brorsen, and Ardian Harri. 2019. Using bayesian kriging for spatial
smoothing in crop insurance rating. American Journal of Agricultural Economics 101 (1): 330–
351.
G Pisier. 1981. Remarques sur un résultat non publié de b. maurey. Séminaire Analyse fonctionnelle
(dit” Maurey-Schwartz”), 1–12.
Gao, Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely con-
nected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition,
4700–4708.
Geert, Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso. Setio, Francesco
Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der. Laak, Bram Van Ginneken, and Clara I.
Sánchez. 2017. A survey on deep learning in medical image analysis. Medical Image Analysis
42: 60–88.
Guodong, Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. 2018b. Three mechanisms of weight
decay regularization. arXiv preprint arXiv:1810.12281.
Hao, Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. 2018c. Visualizing the
loss landscape of neural nets. In Advances in Neural Information Processing Systems.
Ian, H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. 2016. Data Mining: Practical
machine learning tools and techniques. Morgan Kaufmann.
J, Janai, F Güney, A Behl, and A Geiger. 2017. Computer vision for autonomous vehicles: problems,
datasets and state of the art. arxiv e-prints. arXiv preprint arXiv:1704.05519.
Jeffrey, Negrea, Gintare Karolina Dziugaite, and Daniel Roy. 2020. In defense of uniform con-
vergence: Generalization via derandomization with an application to interpolating predictors. In
International Conference on Machine Learning, 7263–7272.
Jinghui, Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, and Quanquan Gu. 2018a. Closing
the generalization gap of adaptive gradient methods in training deep neural networks. arXiv
preprint arXiv:1806.06763.
Jonas, Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Con-
volutional sequence to sequence learning. In International Conference on Machine learning,
1243–1252.
Kaiming, He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In IEEE inter-
national conference on computer vision, 2961–2969.
Kaiming, He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. Deep residual learning for image
recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
Klaus, Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen. Schmidhuber.
2016. Lstm: A search space odyssey. IEEE transactions on neural networks and learning systems
28 (10): 2222–2232.
Koltchinskii, Vladimir, and Dmitry Panchenko. 2002. Empirical margin distributions and bounding
the generalization error of combined classifiers. The Annals of Statistics 30 (1): 1–50.
Krizhevsky, Alex, and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images.
Citeseer: Technical report.

Laurent, Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. 2017. Sharp minima can gen-
eralize for deep nets. In International Conference on Machine learning
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521 (7553): 436.
Liang-Chieh, Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille.
2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution,
and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4):
834–848.
Lijia, Zhou, Danica J Sutherland, and Nati Srebro. 2020. On uniform convergence and low-norm
interpolation learning. In Advances in Neural Information Processing Systems, 6867–6877.
Matthew, E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Annual Conference
of the North American Chapter of the Association for Computational Linguistics, 2227–2237.
Michael, B Chang, Abhishek Gupta, Sergey Levine, and Thomas L Griffiths. 2018. Automati-
cally composing representation transformations as a means for generalization. arXiv preprint
arXiv:1807.04640.
Michel Talagrand. 1987. The glivenko-cantelli problem. The Annals of Probability, 837–870.
Nick, Harvey, Christopher Liaw, and Abbas Mehrabian. 2017. Nearly-tight VC-dimension bounds
for piecewise linear neural networks. In Annual Conference on Learning Theory, 1064–1068.
Nitish Shirish, Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping
Tak Peter Tang. 2017. On large-batch training for deep learning: Generalization gap and sharp
minima. In International Conference on Learning Representations.
Noah, Golowich, Alexander Rakhlin, and Ohad Shamir. 2018. Size-independent sample complexity
of neural networks. In Annual Conference on Learning Theory, 297–299.
Paul, Goldberg, and Mark Jerrum. 1993. Bounding the vapnik-chervonenkis dimension of con-
cept classes parameterized by real numbers. In Proceedings of the sixth annual conference on
Computational learning theory, 361–369.
Paul, W Goldberg, and Mark R. Jerrum. 1995. Bounding the Vapnik-Chervonenkis dimension of
concept classes parameterized by real numbers. Machine Learning 18 (2–3): 131–148.
Peter, Bartlett and John Shawe-Taylor. 1999. Generalization performance of support vector
machines and other pattern classifiers. Advances in Kernel methods-Support Vector Learning,
43–54.
Peter, L Bartlett, Dylan J Foster, and Matus J Telgarsky. 2017. Spectrally-normalized margin bounds
for neural networks. In Advances in Neural Information Processing Systems, 6240–6249.
Peter, L Bartlett, Vitaly Maiorov, and Ron Meir. 1999. Almost linear VC dimension bounds for
piecewise polynomial networks. In Advances in Neural Information Processing Systems, 190–
196.
Richard, M Dudley, Evarist Giné, and Joel Zinn. 1991. Uniform and universal glivenko-cantelli
classes. Journal of Theoretical Probability 4 (3): 485–510.
Robert, E Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. 1998. Boosting the margin: A
new explanation for the effectiveness of voting methods. The Annals of Statistics 26 (5): 1651–
1686.
Roman, Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein.
2018. Sensitivity and generalization in neural networks: an empirical study. In International
Conference on Learning Representations.
Saining, Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual
transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern
Recognition, 1492–1500.
Stephen, Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing
lstm language models. In International Conference on Learning Representations.
Takeru, Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 2018. Spectral normal-
ization for generative adversarial networks. In International Conference on Learning Represen-
tations.

Tong Zhang. 2002. Covering number bounds of certain regularized linear function classes. Journal
of Machine Learning Research, 2 (Mar): 527–550.
Tsung-Yi, Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie.
2017. Feature pyramid networks for object detection. In IEEE conference on computer vision
and pattern recognition, 2117–2125.
Vaishnavh, Nagarajan and J Zico Kolter. 2019. Uniform convergence may be unable to explain
generalization in deep learning. In Advances in Neural Information Processing Systems.
Vladimir Vapnik. 2006. Estimation of Dependences based on Empirical Data. Springer Science &
Business Media.
Yichun, Shi, Charles Otto, and Anil K. Jain. 2018. Face clustering: representation and pairwise
constraints. IEEE Transactions on Information Forensics and Security 13 (7): 1626–1640.
Zhuozhuo, Tu, Fengxiang He, and Dacheng Tao. 2020. Understanding generalization in recurrent
neural networks. In International Conference on Learning Representations.
Zitong, Yang, Yu Bai, and Song Mei. 2021. Exact gap between generalization error and uniform con-
vergence in random feature models. In International Conference on Machine Learning, 11704–
11715.
Chapter 6
Stochastic Gradient Descent as an Implicit Regularization

Overparameterization creates a colossal hypothesis space. Although Golowich et al. (2018) have proved generalization bounds that are explicitly independent of the network size, some implicit dependence may be observed.
Stochastic gradient methods (SGMs) have been widely used for training deep
learning models. However, a key limitation of SGMs is their tendency to explore
only a limited portion of the entire hypothesis space. This limitation can be seen as a
form of implicit regularization, which constrains the effective capacity of the learned
model. Importantly, this constraint is influenced by both the specific SGM algorithm
and the training data.
By leveraging this inherent regularization of SGMs, researchers are exploring
the possibility of establishing algorithm-dependent and data-dependent generaliza-
tion bounds. This approach holds promise for surpassing the limitations of conven-
tional statistical learning theory, which often relies on bounds based on the Vapnik-
Chervonenkis (VC) dimension or Rademacher complexity. In essence, these new
bounds aim to capture the interplay between the chosen SGM, the training data, and
the model’s generalization ability.

6.1 Stochastic Gradient Methods (SGMs)

A natural tool for optimizing the expected risk is gradient descent (GD) methods.
Denote an estimator with parameter $\theta$ by $F_\theta$. Specifically, the gradient of the expected risk with respect to the parameters $\theta(t)$ at the $t$-th iteration can be expressed as

$$g(\theta(t)) = \nabla_{\theta(t)}\mathcal{R}(\theta(t)) = \nabla_{\theta(t)}\,\mathbb{E}\big[l(F_{\theta(t)}(X),Y)\big],$$

and the corresponding update equation is defined as follows:

θ (t + 1) = θ (t) − ηg(θ (t)),

where θ (t) represents the parameters in iteration t and η > 0 is the learning rate.
In SGD, the gradient g(θ ) is estimated from minibatches of the training sample
set. Let S be the index set of a minibatch, in which all indices are drawn in an i.i.d.
manner from {1, 2, . . . , m}, where m is the number of training samples. Then, the
iterative process of SGD on minibatch $S$ can be defined as follows:

$$\hat{g}_S(\theta(t)) = \nabla_{\theta(t)}\hat{\mathcal{R}}_S(\theta(t)) = \frac{1}{|S|}\sum_{n\in S}\nabla_{\theta(t)}l\big(F_{\theta(t)}(x_n),y_n\big) \tag{6.1}$$

and

$$\theta(t+1) = \theta(t) - \eta\,\hat{g}_S(\theta(t)), \tag{6.2}$$

where

$$\hat{\mathcal{R}}_S(\theta) = \frac{1}{|S|}\sum_{n\in S}l\big(F_\theta(x_n),y_n\big)$$

is the empirical risk on the minibatch and $|S|$ is the cardinality of the set $S$. For brevity, we adopt the notation $l(F_\theta(X_n),Y_n)=l_n(\theta)$

in the rest of this section.
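The following minimal Python sketch (our own illustration on a toy least-squares problem; the data, model, and hyperparameters are made up) implements the minibatch SGD update of Eqs. (6.1) and (6.2):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))           # m = 1000 training examples
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

def minibatch_gradient(theta, batch_idx):
    """hat{g}_S(theta): average gradient of the squared loss over minibatch S."""
    Xb, yb = X[batch_idx], y[batch_idx]
    return Xb.T @ (Xb @ theta - yb) / len(batch_idx)

theta = np.zeros(5)
eta = 0.1                                 # learning rate
for t in range(500):
    S = rng.integers(0, len(X), size=32)  # indices drawn i.i.d. from {1, ..., m}
    theta = theta - eta * minibatch_gradient(theta, S)   # Eq. (6.2)

print(theta)                              # approaches the true coefficients
```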


Additionally, suppose that in step i, we represent the distribution of a parameter
by Q i , with the initial distribution being Q 0 and the convergent distribution being Q.
Then, SGD is the process of finding Q by starting from Q 0 and proceeding through
a series of Q i .

6.2 Generalization Bounds on Convex Loss Surfaces

Intensive studies have been conducted on the generalizability of SGMs under both
continuous and convex losses.
Ying and Pontil (2008) proved a generalization bound of the following order for a fixed learning rate $\eta_t \asymp m^{-\frac{2\zeta}{2\zeta+1}}$:

$$O\!\left(m^{-\frac{2\zeta}{2\zeta+1}}\log m\right),$$

where $\zeta$ is a constant.

Later, Dieuleveut and Bach (2016) improved this result to the following average order when $2\zeta+\gamma>1$:

$$O\!\left(m^{-\frac{2\min\{\zeta,1\}}{2\min\{\zeta,1\}+\gamma}}\right).$$

Other related works include (Lin et al. 2016; Lin and Rosasco 2016; Chen et al.
2016; Wei et al. 2017). However, in deep learning, the loss surface is usually highly
nonconvex, which invalidates the previous results.

6.3 Generalization Bounds on Nonconvex Loss Surfaces

Results for nonconvex loss surfaces also exist in the literature. Related works include
analyses obtaining generalization bounds based on algorithmic stability and Probably
Approximately Correct (PAC)-Bayesian theory.
Algorithmic stability. Bousquet and Elisseeff (2002) proposed employing algo-
rithmic stability to measure the stability of the output hypothesis when the training
sample set is disturbed. Many versions of algorithmic stability have been proposed;
a popular one is presented as follows.

Definition 6.1 (Uniform stability; cf. Bousquet and Elisseeff (2002), Definition 6) A machine learning algorithm $\mathcal{A}$ is uniformly stable if for any neighbouring sample pair $S$ and $S'$ that differ by only one example, we have the following inequality:

$$\left|\mathbb{E}_{\mathcal{A}(S)}\,l(\mathcal{A}(S),z) - \mathbb{E}_{\mathcal{A}(S')}\,l(\mathcal{A}(S'),z)\right| \le \beta,$$

where $z$ is an arbitrary example; $\mathcal{A}(S)$ and $\mathcal{A}(S')$ are the output hypotheses learned on the training sets $S$ and $S'$, respectively; and $\beta$ is a positive real constant. The constant $\beta$ is called the uniform stability of algorithm $\mathcal{A}$.

It is natural for an algorithm that is insensitive to disturbances in the training data


to have good generalizability. Following this intuition, several generalization bounds
have been proven based on algorithmic stability (Bousquet and Elisseeff 2002; Xu
et al. 2011). For example, based on uniform stability, (Bousquet and Elisseeff 2002)
proved the following theorem in relation to generalization in expectation.

Theorem 6.1 Let algorithm A be β-uniformly stable. Then,

|E S,A [ R̂ S [A(S)] − R[A(S)]]| ≤ β.

In recent works, SGMs have commonly been modelled in terms of stochastic


differential equations (SDEs). Minibatch stochastic gradients introduce randomness
into an SGM (i.e., a stochastic gradient can be decomposed into the stochastic esti-
mate of the full gradient plus noise). Therefore, the parameter updating in an SGM
can be formulated as a stochastic process. Thus, the output hypothesis is indeed

drawn from the steady distribution of this stochastic process. We can ultimately
analyse the generalization and optimization of deep learning on the basis of this
steady distribution.
An SGM is usually formulated as shown in Eqs. (6.1) and (6.2). The stochastic
gradient introduces extra noise into the parameter updating. When the gradient noise
can be modelled as a Gaussian distribution, the SGM reduces to stochastic gradient
Langevin dynamics (SGLD), expressed as follows (Mandt et al. 2017):


$$\Delta\theta(t) = \theta(t+1) - \theta(t) = -\eta\,\hat{g}_S(\theta(t)) = -\eta\,g(\theta(t)) + \sqrt{\frac{2\eta}{\beta}}\,W,\qquad W\sim\mathcal{N}(0,I),$$

where $\eta$ is the learning rate and $\beta>0$ is an inverse temperature parameter.
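The following minimal Python sketch (our own illustration on a toy quadratic loss; all constants are made up) implements the SGLD update above, where the injected Gaussian noise is scaled by $\sqrt{2\eta/\beta}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(theta):
    """Gradient of a toy quadratic loss 0.5 * ||theta||^2."""
    return theta

theta = np.array([3.0, -2.0])
eta, beta = 0.01, 100.0                   # learning rate and inverse temperature
for t in range(5000):
    noise = rng.normal(size=theta.shape)
    theta = theta - eta * grad(theta) + np.sqrt(2 * eta / beta) * noise

# theta fluctuates around the minimizer 0 with a spread controlled by beta
# (larger beta means less injected noise).
print(theta)
```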


Hardt et al. (2016b) proved the following upper bound for the uniform stability, which is exponentially correlated with the aggregated step size and the smoothness parameter.

Theorem 6.2 Suppose that the loss function is smooth, L-Lipschitz, and no larger than 1 for all data. If we run an SGM with monotonically non-increasing learning rates $\alpha_t\le c/t$ for $T$ steps, then the SGM is uniformly stable with

$$\beta \le \frac{1+1/c}{m-1}\,(2cL^2)^{\frac{1}{c+1}}\,T^{\frac{c}{c+1}}.$$

The proof of this theorem mimics a proof of convergence. Suppose that the loss function $l$ is $L$-Lipschitz with respect to the weights for any example, i.e.,

$$|l(w,z) - l(w',z)| \le L\|w-w'\|.$$

In other words, the stability can be measured in terms of the stability of the weights. Therefore, one can simply analyse how the weights diverge as a function of time $t$. Generalization bounds depending on the learning rates and the number of iterations can then be presented. Furthermore, we can easily derive a generalization bound based on this theorem.
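For intuition, the stability bound of Theorem 6.2 can be evaluated numerically. The following minimal Python sketch (our own illustration; c and L are made-up constants, and the formula is the bound as stated above) shows how $\beta$ grows sublinearly with the number of steps T and shrinks with the sample size m:

```python
def uniform_stability_bound(T, m, c=1.0, L=1.0):
    """Evaluate (1 + 1/c) / (m - 1) * (2 c L^2)^(1/(c+1)) * T^(c/(c+1))."""
    return (1.0 + 1.0 / c) / (m - 1) * (2 * c * L ** 2) ** (1 / (c + 1)) * T ** (c / (c + 1))

for T in (1_000, 10_000, 100_000):
    print(T, uniform_stability_bound(T, m=50_000))   # grows like sqrt(T) for c = 1
```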
Under five additional assumptions, Raginsky et al. (2017) proved the following upper bound for the expected excess risk:

$$\tilde{O}\!\left(\frac{\beta(\beta+U)^2}{\lambda_*}\,\delta^{1/4}\log\frac{1}{\epsilon} + \epsilon\right) + \tilde{O}\!\left(\frac{(\beta+U)^2}{\lambda_*\,m} + \frac{U\log(\beta+1)}{\beta}\right),$$

provided that

$$k = \tilde{\Omega}\!\left(\frac{\beta(\beta+U)}{\lambda_*\,\epsilon^4}\log^5\frac{1}{\epsilon}\right)$$

and

$$\eta \le \frac{\epsilon^4}{\log(1/\epsilon)},$$

where $U$ is the number of parameters, $m$ is the number of training samples, the inverse temperature parameter satisfies $\beta>1\vee 2/m$, $\lambda_*$ is a factor that controls the exponential rate of convergence of the SGM dynamics to the stationary distribution, and $k$ is the number of iterations. Here, the hypothesis is assumed to be dissipative, i.e., for any datum $z$ and any parameter $w$, there exist $m'>0$ and $b>0$ such that

$$\langle w, \nabla h(w,z)\rangle \ge m'\|w\|^2 - b,$$

and

$$\epsilon \in \left(0,\ \frac{m'}{4M^2}\wedge e^{-\tilde{\Omega}\left(\frac{\beta(\beta+U)}{\lambda_*}\right)}\right).$$

The proof proceeds through three important steps:


(1) Convergence to Langevin Dynamics: Initially, we establish that with a suffi-
ciently small learning rate, the behavior of SGLD resembles that of a continuous-
time Langevin dynamic process under the 2-Wasserstein distance. This convergence
conceptually links discrete updates in optimization with continuous-time diffusion
processes, offering insights into the stability of parameter updates.
(2) Transition to Gibbs Distribution: As training progresses across numerous
epochs, the Langevin dynamics are shown to converge to a Gibbs distribution. This
transition signifies the establishment of a probabilistic equilibrium, providing a prob-
abilistic view of the model’s parameter space.
(3) Gibbs Algorithm Construction: Building upon these convergences, we con-
struct a Gibbs sampling algorithm designed to approximate an empirical risk mini-
mizer. This algorithmic approach encapsulates the theoretical insights into a practical
framework for model optimization.
Understanding these dynamics and their convergence to probabilistic distributions is essential for grasping the optimization and generalization aspects of deep learning. However, it is important to note that achieving stable convergence, particularly towards stationary distributions, can be challenging and depends on factors such as model complexity and model dimensionality (the rate of convergence can be exponential in the dimension).
A canonical mutual-information-based generalization bound has been given in Xu and Raginsky (2017):

$$\sqrt{\frac{2\sigma^2}{m}\,I(S;\theta)}, \tag{6.3}$$

where $S$ is the training sample, $\theta$ represents the model parameters, and the loss is $\sigma$-sub-Gaussian under the data distribution. This result arises from the following result on the information-theoretic "stability" of a two-input function $f(X,Y)$:

$$|f(X,Y) - f(\bar{X},\bar{Y})| \le \sqrt{2\sigma^2\,I(X;Y)},$$

if $f(\bar{X},\bar{Y})$ is sub-Gaussian under the distribution $P_{\bar{X},\bar{Y}} = P_X\times P_Y$, where $P_X$ and $P_Y$ are the distributions of $X$ and $Y$, respectively. In particular, note that the mutual information is calculated between the learned weights and all the training data; it is an "on-average" estimate of how much information has been transferred from the data into the learned model.
Based on this result, Pensia et al. (2018) gave a generalization error bound for SGLD,

$$\sqrt{\frac{R^2}{m}\sum_{t=1}^{T}\frac{\eta_t^2 L^2}{\sigma_t^2}}, \tag{6.4}$$

where the hypothesis is assumed to be $R$-sub-Gaussian, $\eta_t$ is the learning rate in the $t$-th iteration, $L$ is the upper bound on the weight updates, and the gradient noise is assumed to be $\mathcal{N}(0,\sigma_t^2 I)$. This result was later extended by Bu et al. (2020):
• Suppose that the loss $l(\theta,z)$ is $R$-sub-Gaussian under $z\sim\mu$, where $\theta$ represents the model parameters and $\mu$ is the data-generating distribution. Then, the generalization error has the following upper bound:

$$\frac{1}{m}\sum_{i=1}^{m}\sqrt{2R^2\,I(\theta;z_i)}, \tag{6.5}$$

where $z_i$ is the $i$-th datum and $m$ is the sample size.


• Suppose that the loss l(θ̃ , z̃) is R-sub-Gaussian under the following distribution:

Pθ̃ ,z̃ = Pθ ⊗ Pz .

Then, the generalization error has the same upper bound expressed in Eq. (6.5).
In contrast with Xu and Raginsky (2017), Bu et al. (2020) proposed an “individual” version of the metric measuring the information transferred from data into the learned model; see Eq. (6.5), where the mutual information with the whole training sample S is replaced by the mutual information with the individual data points, $I(\theta; z_i)$.
Mou et al. (2018a) modelled SGLD as a Langevin diffusion dynamic and then proved an $\mathcal{O}(1/m)$ generalization bound and an $\mathcal{O}(1/\sqrt{m})$ generalization bound for SGLD via algorithmic stability and PAC-Bayesian theory, respectively:
• Algorithmic stability: Assume that C is an upper bound of the loss function, the hypothesis is L-Lipschitz continuous, and the learning rate satisfies $\eta_t \le \frac{\log 2}{\beta L^2}$ for all t; then, the generalization error has the following upper bound on average:

$$\mathcal{O}\left(\frac{2LC}{m}\sqrt{\beta\sum_{t=1}^{N}\eta_t}\right), \qquad (6.6)$$

where N is the number of iterations.



• PAC-Bayesian theory: Suppose that the empirical risk has an $l_2$-norm regularization on the weights, i.e., $\frac{\lambda}{2}\|\theta\|^2$, and that the loss function is sub-Gaussian. Then, there is the following high-probability generalization bound:

$$\mathcal{O}\left(\sqrt{\frac{\beta}{m}\sum_{k=1}^{N}\eta_k e^{-\frac{\lambda}{3}(T_N - T_k)}\,\mathbb{E}\big[\|g_k(\theta_k)\|^2\big]}\right),$$

where $T_k = \sum_{i=1}^{k}\eta_i$, β is the inverse temperature parameter, and $g_k(\theta_k)$ is the gradient at iteration k.
He and Su (2020) empirically showed that neural networks are locally elastic: the prediction at an instance x′ made by a predictor learned with SGD is not significantly perturbed by an SGD update on a dissimilar example x. This phenomenon inspired Deng et al. (2020) to propose a localized version of algorithmic stability.

Definition 6.2 (Locally elastic stability; cf. Deng et al. (2020), Definition 1) A
machine learning algorithm A has locally elastic stability βm (·, ·) with respect to the
loss function l if, for any sample S ∈ Zm , z i ∈ S, and z ∈ Z, the following inequality
holds:

$$\left| l(\mathcal{A}(S), z) - l(\mathcal{A}(S^{\setminus i}), z) \right| \le \beta_m(z_i, z).$$
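As a purely illustrative (hypothetical) probe of Definition 6.2, the sketch below computes the left-hand side $|l(\mathcal{A}(S), z) - l(\mathcal{A}(S^{\setminus i}), z)|$ for ridge regression, whose minimizer is available in closed form; the data, regularization strength, and test points are placeholders and are not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, lam = 100, 3, 0.1
w_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(m, d))
y = X @ w_true + 0.1 * rng.normal(size=m)

def ridge_fit(Xs, ys):
    """Closed-form ridge regression minimizer."""
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ ys)

def sq_loss(w, x, t):
    return (x @ w - t) ** 2

w_full = ridge_fit(X, y)
i = 0  # remove the i-th training point z_i
w_loo = ridge_fit(np.delete(X, i, axis=0), np.delete(y, i))

# Loss change at several test points z: in the locally elastic picture, the
# change depends on how similar z is to the removed point z_i.
for _ in range(3):
    xz = rng.normal(size=d)
    tz = xz @ w_true
    print(abs(sq_loss(w_full, xz, tz) - sq_loss(w_loo, xz, tz)))
```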

This localized algorithmic stability leads to a tighter generalization bound.

Theorem 6.3 Let algorithm $\mathcal{A}$ have $\beta_m(\cdot,\cdot)$-locally elastic stability. Suppose that $\|z\| \le 1$ for any $z \in \mathcal{Z}$ and that $\beta_m(\cdot,\cdot) = \frac{\beta(\cdot,\cdot)}{m}$, where $\beta(\cdot,\cdot)$ is independent of the sample size m and $\beta(\cdot,\cdot) \le M_\beta$. Then, for any $0 < \delta < 1$ and $\eta > 0$, with probability at least $1 - \delta$, we have

$$\left| R(\mathcal{A}(S)) - \hat{R}_S(\mathcal{A}(S)) \right| \le \frac{2\sup_{z'\in\mathcal{Z}}\mathbb{E}\,\beta(z', z)}{m} + 2\left(2\sup_{z'\in\mathcal{Z}}\mathbb{E}\,\beta(z', z) + \eta + M_l\right)\sqrt{\frac{2\log(2/\delta)}{m}}.$$

Chen et al. (2018b) proved the existence of a trade-off between stability and convergence for all iterative algorithms under either the convex smooth or the strongly convex smooth assumption, leading to an $\mathcal{O}(1/m)$ generalization bound. Other advances include the works of He et al. (2019), Tzen et al. (2018), Zhou et al. (2018), Wen et al. (2019), and Li et al. (2019).

6.4 Generalization Bounds Relying on Data-Dependent Priors

In Bayesian statistics, prior distributions play a crucial role in encoding our beliefs or
expectations about model parameters before observing any data. Unlike conventional
machine learning algorithms that assume full understanding of the data, Bayesian
methods require us to specify priors independently of the training data. This ensures
that our models start with a neutral or uninformative stance regarding the underlying
data characteristics.
Common choices for priors in Bayesian inference include uniform distributions
(assigning equal probability to all possible values) or Gaussian distributions (assum-
ing a normal distribution of parameter values). These priors are often selected to
reflect minimal assumptions about the data, allowing the data itself to shape the
posterior distribution through Bayesian updating.
Importantly, Bayesian priors are designed to have diminishing influence as more
data is incorporated into the model during training. This property helps ensure that
the model’s performance improves based primarily on the observed data rather than
the initial prior assumptions.
Recent advances in probabilistic learning, such as PAC-Bayesian theory, aim to
extend traditional Bayesian principles to more complex scenarios, relaxing strict
assumptions about data-independent priors while preserving the theoretical under-
pinnings of Bayesian inference. This approach allows for a more flexible and nuanced
integration of prior knowledge and observed data in learning algorithms.
Distribution-dependent priors. In machine learning theory, distribution-
dependent priors are designed to leverage knowledge about the underlying data dis-
tribution without directly relying on the specific training dataset used. This concept
is rooted in the notion that the data-generating process and its distribution exist inde-
pendently of the actual training data collection (Lever et al. 2013). By incorporating
such priors, researchers aim to refine generalization bounds, which are fundamental
in assessing the theoretical performance and predictive ability of machine learning
models. Specifically, the utilization of distribution-dependent priors has been shown
to considerably tighten the generalization bounds.
Following this line of reasoning, Lever et al. (2013) tightened the PAC-Bayesian bound (see Theorem 2.7) as follows. Given any positive constant C, with probability at least 1 − δ over the drawn sample, the generalization error is upper bounded as:

$$R(Q(h)) - C^* \hat{R}_S(Q(h)) \le \frac{C^*}{Cm}\left[\lambda\sqrt{\frac{2}{m}\log\frac{2\xi(m)}{\delta}} + \frac{\gamma^2}{2m} + \log\frac{2}{\delta}\right],$$

where Q(h) is the posterior of the learned model, with density

$$q(h) \propto e^{-\gamma \hat{R}_S(h)}, \qquad \xi(m) = \mathcal{O}(\sqrt{m}), \qquad C^* = \frac{C}{1 - e^{-C}},$$

and γ is a positive constant.

Li et al. (2019) further designed a Bayes-stability framework to derive general-


ization bounds based on distribution-dependent priors. They tightened the result of
Mou et al. (2018a) (Eq. 6.6) to the following formula:
$$\mathcal{O}\left(\frac{C}{m}\sqrt{\mathbb{E}_S\left[\sum_{t=1}^{T}\frac{\eta_t^2}{\sigma_t^2}\, g_e(t)\right]}\right),$$

where the gradient noise is assumed to be $\mathcal{N}(0, \sigma_t^2 I)$, C > 0 is the upper bound of the loss function, T is the number of iterations, $\eta_t$ is the step size at iteration t, and

$$g_e(t) = \mathbb{E}_{W_{t-1}}\left[\frac{1}{m}\sum_{i=1}^{m}\left\|\nabla F(W_{t-1}, z_i)\right\|^2\right].$$

Data-dependent priors. Negrea et al. (2019) further pushed the frontier of gen-
eralization bound analysis by constructing priors that are not independent of the data.
Suppose that set S J ⊂ S has a size of n < m. It is natural to design a prior exploiting
S J to derive a data-dependent forecast of the posterior Q. We denote this subset by
S J = {z j1 , . . . , z jn }, where all indices constitute a set J . The index set J is randomly
drawn from {1, . . . , m}. Suppose that we have the following generalization bound:
 
$$\mathbb{E}^{\mathcal{F}}\left[ R(\theta) - \hat{R}_{S_J}(\theta) \right] \le B,$$

where $\hat{R}_{S_J}$ is the empirical risk on the set $S_J$, B > 0 is $\mathcal{F}$-measurable, and $\mathcal{F}$ is a σ-field such that $\sigma(S_J) \subset \mathcal{F} \perp \sigma(S_J^c)$. Additionally, we denote by U the uncertainty
introduced by the machine learning algorithm. Negrea et al. (2019) proved the fol-
lowing theorem, which shows considerable improvement over the one given by Xu
and Raginsky (2017) (Eq. 6.3).
Theorem 6.4 Let J ⊂ [m], |J | = n be uniformly distributed and independent of the
training sample set S and the model parameters θ . Suppose that the loss is σ -sub-
Gaussian under the data distribution. When the posterior Q = P(θ |S) and the prior
P are $\sigma(S_J)$-measurable data-dependent distributions on the data space,

$$\mathbb{E}\left[ R(\theta) - \hat{R}_S(\theta) \right] \le \sqrt{\frac{2\sigma^2}{m-n}\, I(\theta; S_J^c)} \le \sqrt{\frac{2\sigma^2}{m-n}\, \mathbb{E}\left[\mathrm{KL}(Q \| P)\right]}.$$

Moreover, let U be independent of S and θ . When the posterior Q = P(θ |S, U )


and the prior P are σ (S J , U )-measurable data-dependent distributions on the data
space,

$$\mathbb{E}\left[ R(\theta) - \hat{R}_S(\theta) \right] \le \sqrt{\frac{2\sigma^2}{m-n}\, I^{S_J, U}(\theta; S_J^c)} \le \sqrt{\frac{2\sigma^2}{m-n}\, \mathbb{E}^{S_J, U}\left[\mathrm{KL}(Q \| P)\right]}.$$

Further works following this approach include Haghifam et al. (2020), building on Steinke and Zakynthinou (2020), as well as Dziugaite et al. (2020) and Hellström and Durisi (2020), and their applications (Cherian et al. 2020).

6.5 The Role of Learning Rate and Batch Size in Shaping Generalizability

The last decade has seen the dramatic success of deep neural networks (Goodfellow
et al. 2016) based on the stochastic gradient descent (SGD) optimization method (Bot-
tou et al. 1998; Sutskever et al. 2013). The task of fine tuning the hyper-parameters
of SGD to make neural networks generalize well is both critical and challenging.
Some works have addressed the strategies of tuning hyper-parameters (Dinh et al.
2017; Goyal et al. 2017; Keskar et al. 2017) and the generalization ability of SGD
(Chen et al. 2018b; Hardt et al. 2016b; Lin et al. 2016; Mou et al. 2018a; Pensia et al.
2018). However, there is still a lack of solid evidence to validate the effectiveness of
training strategies for tuning the hyper-parameters of neural networks.
In this section, we present both theoretical and empirical evidence for a more
effective training strategy for deep neural networks:
When employing SGD to train deep neural networks, we should ensure that the batch size
is not too large and the learning rate is not too low, to make the networks generalize well.

This strategy gives a guide to tune the hyper-parameters that helps neural networks
achieve good test performance when the training error is small. It is derived from the
following property:
The generalizability of deep neural networks has a negative correlation with the ratio of
batch size to learning rate.

Regarding theoretical evidence, we prove a novel PAC-Bayes (McAllester 1999a, b) upper bound for the generalization error of deep neural networks trained by SGD. The positive correlation of the proposed generalization bound with the ratio of batch size to learning rate indicates that a large ratio leads to poor generalizability of neural networks. This result builds the theoretical foundation of the training strategy.
From the empirical aspect, we conduct extensive systematic experiments while
strictly controlling unrelated variables to investigate the influence of batch size and
learning rate on generalizability. Specifically, we trained 1,600 neural networks
based on two popular architectures, ResNet-110 (He et al. 2016a, b) and VGG-19
(Simonyan and Zisserman 2015), on two standard datasets, CIFAR-10 and CIFAR-
100 (Krizhevsky and Hinton 2009). The accuracies on the test set of all the networks
are collected for analysis. Since the training error is almost the same across all the
networks (it is almost 0), the test accuracy is an informative index to express the
model generalizability. Evaluation is then performed on 164 groups of the collected
data. The Spearman’s rank-order correlation coefficients and the corresponding p
values (Spearman 1987) demonstrate that the correlation is statistically significant (at the p < 0.005 level), which fully supports the training strategy (Fig. 6.1).

Fig. 6.1 Scatter plots of test-set accuracy against the ratio of batch size to learning rate. Each point represents a model; in total, 1,600 points are plotted

6.5.1 Theoretical Evidence

In this section, we explore and develop the theoretical foundations for the training
strategy. The main ingredient is a PAC-Bayes generalization bound of deep neural
networks based on the SGD optimization method. The generalization bound has a
positive correlation with the ratio of batch size to learning rate. This correlation
validates the effectiveness of the presented training strategy.

6.5.1.1 A Generalization Bound for SGD

Let ln (θ ) = l(θ, xn ) be the contribution to the overall loss from a single observation
$x_n$, n = 1, ..., m. Clearly, both $l_n(\theta)$ and $\hat{R}_S(\theta)$ are unbiased estimators of the expected risk R(θ), while $\nabla_\theta l_n(\theta)$ and $\hat{g}_S(\theta)$ are both unbiased estimators of the gradient $g(\theta) = \nabla_\theta R(\theta)$:

$$\mathbb{E}\left[ l_n(\theta) \right] = \mathbb{E}\left[ \hat{R}_S(\theta) \right] = R(\theta), \qquad (6.7)$$

$$\mathbb{E}\left[ \nabla_\theta l_n(\theta) \right] = \mathbb{E}\left[ \hat{g}_S(\theta) \right] = g(\theta) = \nabla_\theta R(\theta), \qquad (6.8)$$

where the expectations are in terms of the corresponding examples (X, Y ).


An assumption (see, e.g., Weinan (2017); Mandt et al. (2017)) for the estimations
is that all the gradients {∇θ ln (θ )} calculated from individual data points are i.i.d. and
drawn from a Gaussian distribution centered at g(θ ) = ∇θ R(θ ):

∇θ ln (θ ) ∼ N (g(θ ), C). (6.9)



where C is the covariance matrix and is assumed constant for all θ. As covariance matrices are (semi-)positive definite, we suppose for brevity that C can be factorized as $C = BB^\top$. This assumption can be justified by the central limit theorem when the sample size m is large enough compared to the batch size |G|.
Therefore, the stochastic gradient is also drawn from a Gaussian distribution
centered at g(θ ):
 
$$\hat{g}_G(\theta) = \frac{1}{|G|}\sum_{n\in G}\nabla_\theta l_n(\theta) \sim \mathcal{N}\left( g(\theta), \frac{1}{|G|}C \right). \qquad (6.10)$$

SGD uses the stochastic gradient ĝG (θ ) to iteratively update the parameter θ to
minimize the function R(θ ):
$$\Delta\theta(t) = \theta(t+1) - \theta(t) = -\eta\,\hat{g}_G(\theta(t)) = -\eta g(\theta) + \frac{\eta}{\sqrt{|G|}} B W, \quad W \sim \mathcal{N}(0, I). \qquad (6.11)$$
Equation (6.11) expresses a well-known stochastic process called the Ornstein-
Uhlenbeck process (Uhlenbeck and Ornstein 1930).
Furthermore, we assume that the loss function in the local region around the
minimum is convex and twice differentiable:

$$R(\theta) = \frac{1}{2}\theta^\top A\theta, \qquad (6.12)$$
where A is the Hessian matrix around the minimum and is a (semi) positive-definite
matrix. This assumption has been primarily demonstrated by empirical works (see
Li et al. (2018c, p. 1, Figs. 1(a) and 1(b) and p. 6, Figs. 4(a) and 4(b))). Without loss
of generality, we assume that the global minimum of the objective function R(θ) is 0 and is achieved at θ = 0. General cases can be obtained by translation, which does not change the geometry of the objective function or the corresponding generalization ability.
From the theory of the Ornstein-Uhlenbeck process, Eq. (6.11) has an analytic stationary distribution:

$$q(\theta) = M\exp\left( -\frac{1}{2}\theta^\top\Sigma^{-1}\theta \right), \qquad (6.13)$$

where M is the normalizer (Gardiner 1985).


This convexity and twice-differentiability assumption around local minima has been primarily supported by empirical works; see Fig. 6.2, originally presented by Li et al. (2018c, p. 1, Figs. 1(a) and 1(b) and p. 6, Figs. 4(a) and 4(b)).
Approximating SGD by a continuous-time stochastic process dates back to the works of Kushner and Yin (2003) and Ljung et al. (2012). For a detailed justification, please refer to Mandt et al. (2017, pp. 6–8, Sect. 3.2).
The generalization bound we obtained is based on the PAC-Bayesian framework, which dates back to the works of McAllester (1999a, b). From the PAC-Bayesian perspective, the hypothesis function learned by a stochastic machine learning algorithm is drawn randomly (but still governed by several “laws”) from the hypothesis class.

Fig. 6.2 The risk surfaces with/without skips (Li et al. 2018c)
The generalizability of an algorithm refers to its ability to perform well on unseen
data beyond the training set. In the context of machine learning, this property is
negatively impacted by the divergence (measured by Kullback-Leibler divergence)
between the distribution of the output hypothesis and the prior distribution used in
Bayesian methods. Essentially, when the output distribution diverges significantly
from the prior distribution (which is often assumed to be Gaussian or uniform), the
algorithm may struggle to generalize effectively to new, unseen data. This trade-off
highlights the challenge of balancing empirical risk minimization with the need to
explore diverse areas of the hypothesis space defined by the prior distribution.
Suppose the prior distribution over the parameter space Θ is P. Let Q represent the distribution on the parameter space Θ expressing the learned hypothesis function. The expected risk with respect to the distribution Q is then defined as

$$R(Q) = \mathbb{E}_{\theta\sim Q} R(\theta).$$

Similarly, the empirical risk with respect to the distribution Q is defined as

$$\hat{R}_S(Q) = \mathbb{E}_{\theta\sim Q} \hat{R}_S(\theta).$$

Then, a classic result uniformly bounding the expected risk R(Q) in terms of the
empirical risk R̂ S (Q) and the KL divergence K L(Q||P) is as follows.

Lemma 6.1 (see (McAllester 1999a), Theorem 1) For any positive real δ ∈ (0, 1),
with probability at least 1 − δ over a sample of size m, we have the following inequal-
ity for all distributions Q:

$$R(Q) \le \hat{R}_S(Q) + \sqrt{\frac{\mathrm{KL}(Q \| P) + \log\frac{1}{\delta} + \log m + 2}{2m - 1}}, \qquad (6.14)$$

where KL(Q||P) is the KL divergence between the distributions Q and P, defined as

$$\mathrm{KL}(Q \| P) = \mathbb{E}_{\theta\sim Q}\log\frac{Q(\theta)}{P(\theta)}. \qquad (6.15)$$

We derive a generalization bound for SGD with considerations to simplify tech-


nical aspects for clarity. In this context, we avoid delving into complex issues related
to measurability and integrability to keep the discussion succinct and accessible.
We also rely on Fubini’s theorem to facilitate integration across multiple variables,
assuming that the order of integration can be interchanged. Additionally, we assume
the existence and uniqueness of stable (stationary) solutions for all stochastic differ-
ential equations involved.

Theorem 6.5 For any positive real δ ∈ (0, 1), with probability at least 1 − δ over
a training sample set of size m, we have the following inequality for the distribution
Q of the output hypothesis function of SGD:

$$R(Q) \le \hat{R}_S(Q) + \sqrt{\frac{\frac{\eta}{2|G|}\operatorname{tr}(C A^{-1}) - \log(\det(\Sigma)) - d + 2\log\frac{1}{\delta} + 2\log m + 4}{4m - 2}}, \qquad (6.16)$$

where Σ satisfies

$$\Sigma A + A\Sigma = \frac{\eta}{|G|} C, \qquad (6.17)$$

A is the Hessian matrix of the loss function around the local minimum, $C = BB^\top$ is the covariance matrix of the gradients calculated from single sample points, and d is the dimension of the parameter θ (the network size).

The proof of this generalization bound involves two key steps that draw upon con-
cepts from SDEs and the PAC-Bayes framework. (1) We use SDE results to identify
the stationary solution of the latent Ornstein-Uhlenbeck process described by Eq.
(6.11), which captures the iterative update process of SGD. This step helps establish
the behavior and stability of SGD over time. (2) We adapt the PAC-Bayes framework,
which is a theoretical framework for generalization analysis in machine learning, to
derive the generalization bound based on insights gained from the stationary dis-
tribution identified in the first step. The PAC-Bayes framework allows us to reason
about the performance and predictive ability of SGD in terms of its convergence to
a stable solution and its ability to generalize well.
To prove Theorem 6.5, we first present the following lemma.
Lemma 6.2 (cf. Mandt et al. (2017), Appendix B) Under the twice-differentiability assumption (Eq. 6.12), the stationary distribution of the Ornstein-Uhlenbeck process (Eq. 6.11),

$$q(\theta) = M\exp\left( -\frac{1}{2}\theta^\top\Sigma^{-1}\theta \right), \qquad (6.18)$$

has the following property,

$$A\Sigma + \Sigma A = \frac{\eta}{|G|} C. \qquad (6.19)$$

Here, we recall the proof to make this section complete.

Proof From a result on the Ornstein-Uhlenbeck process (Gardiner 1985), we know that the parameter θ has the following analytic solution,

$$\theta(t) = \theta(0)e^{-At} + \sqrt{\frac{\eta}{|G|}}\int_{0}^{t} e^{-A(t-t')} B\, dW(t'), \qquad (6.20)$$

where W(t') is a white noise process and follows N(0, I). From Eq. (6.18), we know that

$$\Sigma = \mathbb{E}_{\theta\sim Q}\left[ \theta\theta^\top \right]. \qquad (6.21)$$

Therefore, we complete the proof by the following equation,

$$A\Sigma + \Sigma A = \frac{\eta}{|G|}\left( A\int_{-\infty}^{t} e^{-A(t-t')} C e^{-A(t-t')}\,dt' + \int_{-\infty}^{t} e^{-A(t-t')} C e^{-A(t-t')}\,dt'\, A \right)$$
$$= \frac{\eta}{|G|}\int_{-\infty}^{t} \frac{d}{dt'}\left( e^{-A(t-t')} C e^{-A(t-t')} \right)dt' = \frac{\eta}{|G|} C. \qquad (6.22)$$

Proof of Theorem 6.5 In the PAC-Bayesian framework (Lemma 6.1), an essential


part is the KL divergence between the distribution of the learned hypothesis and
the prior on the hypothesis space. The prior distribution can be interpreted as the distribution of the initial parameters, which are usually set according to Gaussian or uniform distributions.1 Here, we use a standard Gaussian distribution N(0, I) as the prior. Suppose the densities of the stationary distribution Q and the prior distribution P are q(θ) and p(θ), respectively, in terms of the parameter θ, as given in the following equations:

1 In scenarios where there is limited prior knowledge about latent model parameters, it is common
practice to set priors as distributions that convey minimal information, such as Gaussian or uniform
distributions. This choice is driven by the following two primary considerations. (1) Algorithms
based on Bayesian statistics are expected to converge to stationary distributions given sufficient
time and data. The existence and uniqueness of these stationary solutions are assumed based on the
latent stochastic differential equation governing the iterative process. This assumption provides a
theoretical foundation for the convergence of Bayesian algorithms, ensuring that the learned model
stabilizes over time. (2) Setting priors requires caution because we cannot assume prior knowledge
about the target hypothesis function before initiating model training. Therefore, the choice of non-
informative priors ensures that the learning process remains unbiased and adapts to the available
data without undue influence from assumed prior beliefs. This approach is fundamental in statistical
learning theory to avoid overfitting and to facilitate robust generalization to new, unseen data.
 
$$p(\theta) = \frac{1}{\sqrt{(2\pi)^d \det(I)}}\exp\left( -\frac{1}{2}\theta^\top I\theta \right), \qquad (6.23)$$

$$q(\theta) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}}\exp\left( -\frac{1}{2}\theta^\top\Sigma^{-1}\theta \right), \qquad (6.24)$$

where Eq. (6.24) comes from Eq. (6.18) by calculating the normalizer M. Therefore,

$$\log\frac{q(\theta)}{p(\theta)} = \frac{1}{2}\log\frac{1}{\det(\Sigma)} + \frac{1}{2}\left( \theta^\top I\theta - \theta^\top\Sigma^{-1}\theta \right). \qquad (6.25)$$

Applying Eq. (6.25) to Eq. (6.15), we can calculate the KL divergence between the distributions Q and P (we assume $\Theta = \mathbb{R}^d$):

$$\mathrm{KL}(Q \| P) = \int_{\theta\in\Theta}\log\left( \frac{q(\theta)}{p(\theta)} \right) q(\theta)\,d\theta$$
$$= \int_{\theta\in\Theta}\left[ \frac{1}{2}\log\frac{1}{\det(\Sigma)} + \frac{1}{2}\left( \theta^\top I\theta - \theta^\top\Sigma^{-1}\theta \right) \right] q(\theta)\,d\theta$$
$$= \frac{1}{2}\log\frac{1}{\det(\Sigma)} + \frac{1}{2}\int_{\theta\in\Theta}\theta^\top I\theta\, q(\theta)\,d\theta - \frac{1}{2}\int_{\theta\in\Theta}\theta^\top\Sigma^{-1}\theta\, q(\theta)\,d\theta$$
$$= \frac{1}{2}\log\frac{1}{\det(\Sigma)} + \frac{1}{2}\mathbb{E}_{\theta\sim\mathcal{N}(0,\Sigma)}\theta^\top I\theta - \frac{1}{2}\mathbb{E}_{\theta\sim\mathcal{N}(0,\Sigma)}\theta^\top\Sigma^{-1}\theta$$
$$= \frac{1}{2}\log\frac{1}{\det(\Sigma)} + \frac{1}{2}\operatorname{tr}(\Sigma - I). \qquad (6.26)$$

From Eq. (6.19), we have that

$$A\Sigma + \Sigma A = \frac{\eta}{|G|} C. \qquad (6.27)$$

Therefore,

$$A\Sigma A^{-1} + \Sigma = \frac{\eta}{|G|} C A^{-1}. \qquad (6.28)$$

After calculating the trace of both sides, we have the following equation,

$$\operatorname{tr}\left( A\Sigma A^{-1} + \Sigma \right) = \frac{\eta}{|G|}\operatorname{tr}\left( C A^{-1} \right). \qquad (6.29)$$

The left-hand side (LHS) is as follows,

$$\mathrm{LHS} = \operatorname{tr}\left( A\Sigma A^{-1} + \Sigma \right) = \operatorname{tr}\left( A\Sigma A^{-1} \right) + \operatorname{tr}(\Sigma) = \operatorname{tr}(\Sigma) + \operatorname{tr}(\Sigma) = 2\operatorname{tr}(\Sigma). \qquad (6.30)$$

Therefore,

$$\operatorname{tr}(\Sigma) = \frac{1}{2}\frac{\eta}{|G|}\operatorname{tr}\left( C A^{-1} \right) = \frac{\eta}{2|G|}\operatorname{tr}\left( C A^{-1} \right). \qquad (6.31)$$

At the same time, we can easily calculate that

$$\operatorname{tr}(I) = d, \qquad (6.32)$$

as $I \in \mathbb{R}^{d\times d}$, where d is the dimension of the parameter θ.

Inserting Eqs. (6.31) and (6.32) into Eq. (6.26), we can prove the following inequality,

$$\mathrm{KL}(Q \| P) \le \frac{\eta}{4|G|}\operatorname{tr}(C A^{-1}) - \frac{1}{2}\log(\det(\Sigma)) - \frac{1}{2} d. \qquad (6.33)$$

Equation (6.33) provides an upper bound on the KL divergence between the sta-
tionary distribution of SGD weights and the prior distribution over the hypothesis
space. This divergence quantifies how far the learned distribution is away from the
prior. By leveraging the monotonic nature of the generalization bound with respect
to the KL divergence, we can extend this insight to derive a PAC-Bayesian general-
ization bound for SGD. This process involves incorporating the KL divergence bound
(Eq. (6.33)) into the PAC-Bayesian framework outlined in Eq. (6.14) of Lemma 6.1,
enabling us to quantify the trade-off between model complexity and generalization
performance in a probabilistic context.
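A minimal numerical sanity check of this analysis (a sketch, not the original experiments) is to simulate the update in Eq. (6.11) on a quadratic loss with known A and C, estimate the stationary covariance Σ empirically, and compare tr(Σ) with the prediction (η/2|G|) tr(CA⁻¹) from Eq. (6.31). The diagonal A and B below are placeholders, and the agreement is only up to the discretization error of the continuous-time approximation.

```python
import numpy as np

rng = np.random.default_rng(2)
d, eta, G = 4, 0.01, 16                        # dimension, learning rate, batch size (placeholders)
A = np.diag([1.0, 2.0, 3.0, 4.0])              # Hessian of the quadratic loss
B = np.diag([0.5, 1.0, 1.5, 2.0])              # gradient-noise factor, C = B B^T
C = B @ B.T

theta = np.zeros(d)
samples = []
for t in range(200000):
    w = rng.normal(size=d)
    theta = theta - eta * (A @ theta) + (eta / np.sqrt(G)) * (B @ w)   # the update in Eq. (6.11)
    if t > 20000:                              # discard burn-in iterations
        samples.append(theta.copy())

Sigma_hat = np.cov(np.array(samples), rowvar=False)
# For these commuting (diagonal) A and C, Sigma = (eta / 2|G|) C A^{-1} solves Eq. (6.19).
Sigma_theory = (eta / (2 * G)) * C @ np.linalg.inv(A)

# The two traces should roughly agree (to within a few percent for this small eta).
print(np.trace(Sigma_hat), np.trace(Sigma_theory))
```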

6.5.1.2 A Special Case of the Generalization Bound

In this subsection, we study a special case with two more assumptions for further
understanding the influence of the gradient fluctuation on our proposed generalization
bound.
Assumption 6.6 The matrices A and Σ are symmetric.

Assumption 6.6 can be interpreted as meaning that both the local geometry around the global minima and the stationary distribution are homogeneous across the dimensions of the parameter space. This assumption indicates that the product ΣA of the matrices A and Σ is also symmetric.
Based on Assumptions 6.6, we can further prove the following theorem.
Theorem 6.7 When Assumption 6.6 holds, under all the conditions of Theorem 6.5, the stationary distribution of SGD has the following generalization bound,

$$R(Q) \le \hat{R}_S(Q) + \sqrt{\frac{\frac{\eta}{2|G|}\operatorname{tr}(C A^{-1}) + d\log\frac{2|G|}{\eta} - \log(\det(C A^{-1})) - d + 2\log\frac{1}{\delta} + 2\log m + 4}{4m - 2}}. \qquad (6.34)$$

To prove Theorem 6.7, we first present a Lemma.

Lemma 6.3 When Assumption 6.6 holds, the KL divergence between the stationary distribution Q of SGD and the prior distribution P satisfies the following inequality,

$$\mathrm{KL}(Q \| P) \le \frac{\eta}{4|G|}\operatorname{tr}(C A^{-1}) + \frac{1}{2} d\log\frac{2|G|}{\eta} - \frac{1}{2}\log(\det(C A^{-1})) - \frac{1}{2} d. \qquad (6.35)$$

Lemma 6.3 gives an upper bound for the distance between the distribution of
the output hypothesis by SGD and the prior distribution of the hypothesis space. It
measures how far SGD can explore in the hypothesis space. Based on it, we can
further prove the following theorem that controls the generalization error of the
special case under Assumption 6.6.

Proof Applying Assumption 6.6 to Eq. (6.19), we can prove the following three equations:

$$\Sigma A + A\Sigma = \frac{\eta}{|G|} C, \qquad 2\Sigma A = \frac{\eta}{|G|} C, \qquad \text{and} \qquad \Sigma = \frac{\eta}{2|G|} C A^{-1}. \qquad (6.36)$$

Therefore,

$$\det(\Sigma) = \det\left( \frac{\eta}{2|G|} C A^{-1} \right) = \left( \frac{\eta}{2|G|} \right)^d \det\left( C A^{-1} \right). \qquad (6.37)$$

Thus,

$$\log\left( \det(\Sigma) \right) = \log\left( \left( \frac{\eta}{2|G|} \right)^d \det\left( C A^{-1} \right) \right) = -d\log\frac{2|G|}{\eta} + \log\left( \det\left( C A^{-1} \right) \right). \qquad (6.38)$$

Applying Eqs. (6.36) and (6.32) to Eq. (6.26), we can obtain Eq. (6.35). The proof is completed. □

Then, we can directly obtain Theorem 6.7.


Proof of Theorem 6.7 Applying Eq. (6.35) of Lemma 6.3 to Eq. (6.14) of Lemma 6.1, we can directly obtain Eq. (6.34). The proof is completed. □
Intuitively, our generalization bound links the generalization ability of the deep
neural networks trained by SGD with three factors:
Local geometry around minimum. The determinant of the Hessian matrix A
provides insights into the curvature and shape of the objective function’s surface
around a local minimum. A larger determinant det (A) indicates a sharper or nar-
rower minimum, suggesting more abrupt changes in the function’s values around that
point. Research (Keskar et al. 2017; Goyal et al. 2017) indicates that sharp minima,

characterized by high values of det (A), often correspond to models that overfit the
training data and exhibit poorer generalization to unseen data. Understanding these
local geometric properties is crucial for assessing the behavior and performance of
optimization algorithms in deep learning.
Gradient fluctuation. The covariance matrix C (or equivalently the matrix B)
characterizes how much the gradient estimates vary across different data points dur-
ing the training process. This variation represents the inherent noise in SGD. By intro-
ducing this noise into the gradient computation, SGD can explore a broader range
of solutions during optimization. This stochastic behavior allows SGD to navigate
away from sharp or narrow local minima, potentially leading to solutions with better
generalization performance on unseen data. The notion of injecting noise through
SGD has been influential in understanding its robustness and ability to escape poor
solutions during training.
Hyper-parameters. The relationship between the batch size |G| and the learning rate η affects how gradients are computed and utilized during the training process. A larger batch size typically provides a more accurate estimate of the gradient, reducing its variance. On the other hand, a higher learning rate leads to larger updates in parameter space based on these gradients. The interplay between these two factors influences the stability and
convergence of the training process and can impact the generalization performance
of the model. By analyzing the generalization bound in relation to the ratio of batch
size to learning rate, we gain insights into how to optimize these hyperparameters
for better model performance and generalization.
Specifically, under the following assumption, our generalization bound has a pos-
itive correlation with the ratio of batch size to learning rate.

Assumption 6.8 The network size is large enough:

$$d > \frac{\eta\operatorname{tr}(C A^{-1})}{2|G|}, \qquad (6.39)$$

where d is the number of parameters, C expresses the magnitude of the individual gradient noise, A is the Hessian matrix around the global minima, η is the learning rate, and |G| is the batch size.

This assumption can be justified by the fact that the network sizes of neural networks are usually
extremely large. This property is also called overparametrization (Du et al. 2019b;
Brutzkus et al. 2018; Allen-Zhu et al. 2019b). We can obtain the following corollary
by combining Theorem 6.7 and Assumption 6.8.

Corollary 6.1 When all conditions of Theorem 6.7 and Assumption 6.8 hold, the
generalization bound of the network has a positive correlation with the ratio of
batch size to learning rate.

Proof We first define

$$I = \frac{\eta}{2|G|}\operatorname{tr}(C A^{-1}) + d\log\frac{2|G|}{\eta} - \log(\det(C A^{-1})) - d + 2\log\frac{1}{\delta} + 2\log m + 4. \qquad (6.40)$$

Then the generalization bound in Eq. (6.34) becomes

$$R(Q) \le \hat{R}_S(Q) + \sqrt{\frac{I}{4m - 2}}. \qquad (6.41)$$

We thus calculate the derivative of I with respect to the ratio |G|/η in order to check whether the generalization bound has a positive correlation with the ratio. For brevity, we define k = |G|/η. Then,

$$\frac{\partial I}{\partial k} = \frac{\partial}{\partial k}\left[ \frac{1}{2k}\operatorname{tr}(C A^{-1}) + d\log(2k) - \log(\det(C A^{-1})) - d + 2\log\frac{1}{\delta} + 2\log m + 4 \right]$$
$$= -\frac{1}{2k^2}\operatorname{tr}(C A^{-1}) + \frac{d}{k}. \qquad (6.42)$$

Therefore, when Assumption 6.8 holds, we have

$$d > \frac{\eta\operatorname{tr}(C A^{-1})}{2|G|} = \frac{1}{2k}\operatorname{tr}(C A^{-1}). \qquad (6.43)$$

Thus,

$$\frac{\partial I}{\partial k} > 0. \qquad (6.44)$$

Hence I, and therefore the generalization bound, has a positive correlation with the ratio of batch size to learning rate. The proof is completed. □

It reveals a negative correlation between the generalizability and the ratio. This property further motivates the training strategy, which requires us to control the ratio and ensure that it is not too large in order to achieve good generalization when training deep neural networks using SGD.
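As a quick numerical illustration of Corollary 6.1 (a sketch with placeholder values for tr(CA⁻¹) and d, not quantities measured from any network), one can evaluate the k-dependent part of I in Eq. (6.40) on a grid of ratios k = |G|/η and confirm that it is increasing whenever Assumption 6.8 holds:

```python
import numpy as np

tr_CAinv = 50.0          # placeholder value of tr(C A^{-1})
d = 1_000_000            # number of parameters; Assumption 6.8 holds for all k below

def I_variable_part(k):
    """The k-dependent part of I in Eq. (6.40), with k = |G|/eta.
    Constant terms are dropped because they do not affect monotonicity in k."""
    return tr_CAinv / (2.0 * k) + d * np.log(2.0 * k)

ks = np.linspace(10.0, 2000.0, 200)
vals = I_variable_part(ks)
print(bool(np.all(np.diff(vals) > 0)))   # True: the bound grows with the ratio
```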

6.5.2 Empirical Evidence

To evaluate the training strategy from the empirical aspect, we conduct extensive
systematic experiments to investigate the influence of the batch size and learning
rate on the generalizability of deep neural networks trained by SGD. To deliver rig-
6.5 The Role of Learning Rate and Batch Size in Shaping Generalizability 115

orous results, our experiments strictly control all unrelated variables. The empirical
results show that there is a statistically significant negative correlation between the
generalizability of the networks and the ratio of the batch size to the learning rate,
which builds a solid empirical foundation for the training strategy.

6.5.2.1 Implementation Details

To guarantee that the empirical results generally apply to any case, our experiments
are conducted based on two popular architectures, ResNet-110 (He et al. 2016a, b)
and VGG-19 (Simonyan and Zisserman 2015), on two standard datasets, CIFAR-
10 and CIFAR-100 (Krizhevsky and Hinton 2009), which can be downloaded from
https://www.cs.toronto.edu/~kriz/cifar.html. The separation of the training and test sets we used is the same as the official version.
We trained 1,600 models with 20 batch sizes, $S_{BS}$ = {16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 272, 288, 304, 320}, and 20 learning rates, $S_{LR}$ = {0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12,
0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20}. All SGD training techniques, such as
momentum, are disabled. Also, both batch size and learning rate are constant in
our experiments. Every model with a specific pair of batch size and learning rate
is trained for 200 epochs. The test accuracies of all 200 epochs are collected for
analysis. We select the highest accuracy on the test set to express the generalizability
of each model, since the training error is almost the same across all models (they are
all nearly 0).
The collected data is then utilized to investigate three correlations: (1) the correla-
tion between network generalizability and the batch size, (2) the correlation between
network generalizability and the learning rate, and (3) the correlation between net-
work generalizability and the ratio of batch size to learning rate, where the first two are
preparations for the final one. Specifically, we calculate the Spearman's rank-order correlation coefficients (SCCs) and the corresponding p values for groups of the collected data to investigate the statistical significance of the correlations. Almost all results indicate that the correlations are statistically significant (p < 0.005).2 The p values of the correlation between the test accuracy and the ratio are all lower than $10^{-180}$ (see Table 6.3).
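The correlation analysis itself is straightforward to reproduce. The following sketch shows how the SCC and p value for the ratio can be computed with scipy.stats.spearmanr; the accuracies below are synthetic stand-ins for the collected results, not our experimental data.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)

batch_sizes = np.arange(16, 321, 16)                          # the 20 batch sizes
learning_rates = np.round(np.arange(0.01, 0.201, 0.01), 2)    # the 20 learning rates

ratios, accuracies = [], []
for bs in batch_sizes:
    for lr in learning_rates:
        ratio = bs / lr
        # Synthetic stand-in for the best test accuracy of one trained model;
        # in the experiments this value comes from the 200-epoch training run.
        acc = 0.93 - 0.02 * np.log(ratio) / np.log(32000) + 0.005 * rng.normal()
        ratios.append(ratio)
        accuracies.append(acc)

scc, p = spearmanr(ratios, accuracies)
print(f"SCC = {scc:.3f}, p = {p:.3g}")
```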
The architectures of our models are similar to a popular implementation of ResNet-
110 and VGG-19.3 Additionally, our experiments are conducted on a computing
cluster with GPUs of NVIDIA® Tesla™ V100 16GB and CPUs of Intel® Xeon®
Gold 6140 CPU @ 2.30GHz.

2 The definition of “statistically significant” has various versions, such as p < 0.05 and p < 0.01.
This section uses a more rigorous one ( p < 0.005).
3 See Wei Yang, https://github.com/bearpaw/pytorch-classification, 2017.

(a) Test accuracy-batch size (b) Test accuracy-learning rate

Fig. 6.3 Curves of test accuracy to batch size and learning rate. The four rows are respectively for
(1) ResNet-110 trained on CIFAR-10, (2) ResNet-110 trained on CIFAR-100, (3) VGG-19 trained
on CIFAR-10, and (4) VGG-19 trained on CIFAR-100. Each curve is based on 20 networks

6.5.2.2 Empirical Results on the Correlation

Correlation between generalization ability and batch size. When the learning rate is fixed as an element of $S_{LR}$, we train ResNet-110 and VGG-19 on CIFAR-10 and CIFAR-100 with the 20 batch sizes in $S_{BS}$. The plots of test accuracy to batch size are illustrated in Fig. 6.3a. We list 1/4 of all plots due to space limitations; the rest of the plots are in the supplementary materials. We then calculate the SCCs and the p values, as shown in Table 6.1, where bold p values refer to statistically significant observations, while underlined ones refer to those that are not significant (this convention also applies to Table 6.2). The results clearly show that there is a statistically significant negative correlation between generalization ability and batch size.
Correlation between generalization ability and learning rate. When the batch size is fixed as an element of $S_{BS}$, we train ResNet-110 and VGG-19 on CIFAR-10 and CIFAR-100 with the 20 learning rates in $S_{LR}$. The plots of test accuracy to learning rate are illustrated in Fig. 6.3b, which includes 1/4 of all plots due to space limitations; the rest of the plots are in the supplementary materials. We then calculate the SCCs and the p values, as shown in Table 6.2. The results clearly show that there is a statistically significant positive correlation between the learning rate and the generalization ability of SGD.
Correlation between generalization ability and the ratio of batch size to learning rate. We plot the test accuracies of ResNet-110 and VGG-19 on CIFAR-10 and CIFAR-100 against the ratio of batch size to learning rate in Fig. 6.1; in total, 1,600 points are plotted. Additionally, we perform Spearman's rank-order correlation tests on all the accuracies of ResNet-110 and VGG-19 on CIFAR-10 and CIFAR-100. The SCC and p values show that the correlation between the ratio and the generalization ability is statistically significant, as Table 6.3 demonstrates. Each test is performed on 400 models. The results strongly support the training strategy.

Table 6.1 SCC and p values of batch size to test accuracy for different learning rate (LR)
LR ResNet-110 on CIFAR-10 ResNet-110 on CIFAR-100 VGG-19 on CIFAR-10 VGG-19 on CIFAR-100
SCC p SCC p SCC p SCC p
0.01 −0.96 2.6 × 10−11 −0.92 5.6 × 10−8 −0.98 3.7 × 10−14 −0.99 7.1 × 10−18
0.02 −0.96 1.2 × 10−11 −0.94 1.5 × 10−9 −0.99 3.6 × 10−17 −0.99 7.1 × 10−18
0.03 −0.96 3.4 × 10−11 −0.99 1.5 × 10−16 −0.99 7.1 × 10−18 −1.00 1.9 × 10−21
0.04 −0.98 1.8 × 10−14 −0.98 7.1 × 10−14 −0.99 9.6 × 10−19 −0.99 3.6 × 10−17
0.05 −0.98 3.7 × 10−14 −0.98 1.3 × 10−13 −0.99 7.1 × 10−18 −0.99 1.4 × 10−15
0.06 −0.96 1.8 × 10−11 −0.97 6.7 × 10−13 −1.00 1.9 × 10−21 −0.99 1.4 × 10−15
0.07 −0.98 5.9 × 10−15 −0.94 5.0 × 10−10 −0.98 8.3 × 10−15 −0.97 1.7 × 10−12
0.08 −0.97 1.7 × 10−12 −0.97 1.7 × 10−12 −0.98 2.4 × 10−13 −0.97 1.7 × 10−12
0.09 −0.97 4.0 × 10−13 −0.98 3.7 × 10−14 −0.98 1.8 × 10−14 −0.96 1.2 × 10−11
0.10 −0.97 1.9 × 10−12 −0.96 8.7 × 10−12 −0.98 8.3 × 10−15 −0.93 2.2 × 10−9
0.11 −0.97 1.1 × 10−12 −0.98 1.3 × 10−13 −0.99 2.2 × 10−16 −0.93 2.7 × 10−9
0.12 −0.97 4.4 × 10−12 −0.96 2.5 × 10−11 −0.98 7.1 × 10−13 −0.90 7.0 × 10−8
0.13 −0.94 1.5 × 10−9 −0.98 1.3 × 10−13 −0.97 1.7 × 10−12 −0.89 1.2 × 10−7
0.14 −0.97 2.6 × 10−12 −0.91 3.1 × 10−8 −0.97 6.7 × 10−13 −0.86 1.1 × 10−6
0.15 −0.96 4.6 × 10−11 −0.98 1.3 × 10−13 −0.95 8.3 × 10−11 −0.79 3.1 × 10−5
0.16 −0.95 3.1 × 10−10 −0.96 8.7 × 10−12 −0.95 1.4 × 10−10 −0.77 6.1 × 10−5
0.17 −0.95 2.4 × 10−10 −0.95 2.6 × 10−10 −0.91 2.3 × 10−8 −0.68 1.3 × 10−3
0.18 −0.97 4.0 × 10−12 −0.97 1.1 × 10−12 −0.93 2.6 × 10−9 −0.66 2.8 × 10−3
0.19 −0.94 6.3 × 10−10 −0.95 8.3 × 10−11 −0.90 8.0 × 10−8 −0.75 3.4 × 10−4
0.20 −0.91 3.6 × 10−8 −0.98 1.3 × 10−13 −0.95 6.2 × 10−11 −0.57 1.4 × 10−2

6.6 Interplay of Optimization and Bayesian Inference

An interesting interaction can be found between optimization and Bayesian inference


(Xiang 2020): (1) generally, Bayesian inference (or sampling) can be viewed as opti-
mization in the probability space (e.g., in stochastic gradient Markov chain Monte
Carlo, stochastic gradient MCMC) and can be scaled through optimization (e.g., in
variational inference), and (2) stochastic gradient MCMC also involves performing
Bayesian approximation.
Bayesian inference. Bayesian learning for parametric machine learning models aims to obtain the posterior of the model parameters and then seek the best parameter settings. However, the analytical expression for the posterior is unknown in most real-world applications. To address this problem, classical methods, such as the Metropolis-Hastings algorithm (Hastings 1970; Geman and Geman 1984) and hybrid Monte Carlo (HMC; Duane et al. 1987), adopt MCMC methods to infer the
posterior. Due to the promising performance, Bayesian inference has been applied to
many applications, including topic models (Larochelle and Lauly 2012; Zhang et al.
2020), Bayesian neural networks (Louizos and Welling 2017; Roth and Pernkopf
2018), and generative models (Kingma and Welling 2014; Goodfellow et al. 2014a;
Kobyzev et al. 2020).

Table 6.2 SCC and p values of learning rate to test accuracy for different batch size (BS)
BS ResNet-110 on CIFAR-10 ResNet-110 on CIFAR-100 VGG-19 on CIFAR-10 VGG-19 on CIFAR-100
SCC p SCC p SCC p SCC p
16 0.60 5.3 × 10−3 0.84 3.2 × 10−6 0.62 3.4 × 10−3 −0.80 2.6 × 10−5
32 0.60 5.0 × 10−3 0.90 9.9 × 10−8 0.78 4.9 × 10−5 −0.14 5.5 × 10−1
48 0.84 3.2 × 10−6 0.89 1.8 × 10−7 0.87 4.9 × 10−7 0.37 1.1 × 10−1
64 0.67 1.2 × 10−3 0.89 1.0 × 10−7 0.91 2.0 × 10−8 0.91 1.1 × 10−6
80 0.80 2.0 × 10−5 0.99 4.8 × 10−16 0.95 2.4 × 10−10 0.87 4.5 × 10−6
96 0.79 3.3 × 10−5 0.89 2.4 × 10−7 0.94 5.2 × 10−9 0.94 1.5 × 10−9
112 0.90 8.8 × 10−8 0.91 2.7 × 10−8 0.97 2.6 × 10−12 0.95 1.2 × 10−10
128 0.95 8.3 × 10−11 0.92 1.1 × 10−8 0.98 2.2 × 10−14 0.99 4.8 × 10−16
144 0.85 2.1 × 10−6 0.98 7.7 × 10−14 0.90 6.2 × 10−8 0.98 3.5 × 10−15
160 0.90 4.3 × 10−8 0.94 5.0 × 10−10 0.95 3.3 × 10−10 0.99 7.1 × 10−18
176 0.94 5.0 × 10−10 0.99 3.6 × 10−17 0.91 2.3 × 10−8 0.98 1.8 × 10−14
192 0.94 6.7 × 10−10 0.94 5.0 × 10−10 0.95 6.2 × 10−11 0.97 2.6 × 10−12
208 0.91 3.6 × 10−8 0.97 6.7 × 10−12 0.98 6.1 × 10−14 0.99 2.5 × 10−17
224 0.90 9.0 × 10−8 0.98 3.7 × 10−14 0.93 2.2 × 10−9 0.98 1.3 × 10−13
240 0.78 4.6 × 10−5 0.95 2.4 × 10−10 0.98 8.3 × 10−15 0.99 9.6 × 10−19
256 0.83 4.8 × 10−6 0.94 5.0 × 10−10 0.99 4.8 × 10−16 0.97 5.4 × 10−12
272 0.95 2.4 × 10−10 0.96 2.5 × 10−11 0.97 4.0 × 10−13 0.99 1.5 × 10−16
288 0.94 9.8 × 10−10 0.92 1.5 × 10−18 0.95 8.3 × 10−11 0.99 1.5 × 10−16
304 0.81 1.5 × 10−5 0.97 4.0 × 10−13 0.95 6.2 × 10−11 1.00 3.7 × 10−24
320 0.94 1.4 × 10−9 0.95 8.3 × 10−11 0.97 2.6 × 10−12 1.00 7.2 × 10−20

Table 6.3 SCC and p values of the ratio of batch size to learning rate and test accuracy
ResNet-110 on CIFAR-10 ResNet-110 on CIFAR-100 VGG-19 on CIFAR-10 VGG-19 on CIFAR-100
SCC p SCC p SCC p SCC p
−0.97 3.3 × 10−235 −0.98 5.3 × 10−293 −0.98 6.2 × 10−291 −0.94 6.1 × 10−180

Scaling Bayesian inference via optimization. Bayesian inference has a pro-


hibitive computational complexity and is thus time consuming on large-scale data.
Several approaches have been proposed to address this challenge:
(1) Stochastic gradient Markov chain Monte Carlo (SGMCMC) (Ma et al. 2015)
introduces stochastic gradient estimation (Robbins and Monro 1951) into MCMC.
The family of SGMCMC algorithms includes SGLD (Welling and Teh 2011),
stochastic gradient Riemannian Langevin dynamics (SGRLD) (Patterson and Teh
2013), stochastic gradient Fisher scoring (SGFS) (Ahn et al. 2012), stochastic gra-
dient Hamiltonian Monte Carlo (SGHMC) (Chen et al. 2014), and the stochastic
gradient Nosé-Hoover thermostat (SGNHT) (Ding et al. 2014). This section analy-
ses SGLD as an example of an SGMCMC scheme.

(2) Variational inference (Hoffman et al. 2013; Blei et al. 2017; Zhang et al.
2018a) employs a two-step process to infer the posterior:
1. A family of distributions is defined as

$$\mathcal{Q} = \{ q_\lambda \mid \lambda \in \Lambda \},$$

where λ represents the distribution parameter(s).


2. The member of the distribution family Q that is the closest to the posterior p(θ |S)
under the Kullback-Leibler (KL) divergence is sought, i.e.,

$$\min_{\lambda}\ \mathrm{KL}\left( q_\lambda(\theta) \,\|\, p(\theta|S) \right).$$

Minimizing the KL divergence is equivalent to maximizing the evidence lower bound


(ELBO), defined as follows:

$$\mathrm{ELBO}(\lambda, S) = \mathbb{E}_{q_\lambda}\log p(\theta, S) - \mathbb{E}_{q_\lambda}\log q_\lambda(\theta).$$
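The following is a minimal sketch of this two-step recipe (illustrative only, not drawn from the cited works): a mean-field Gaussian $q_\lambda$ is fitted to the posterior of a Gaussian mean by stochastic gradient ascent on a reparameterized Monte Carlo estimate of the ELBO, and the result is compared with the exact conjugate posterior. The data, step size, and number of iterations are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(loc=2.0, scale=1.0, size=50)          # observed data (placeholder)
m = len(z)

def grad_log_joint(theta):
    """d/dtheta [log N(theta; 0, 1) + sum_i log N(z_i; theta, 1)]."""
    return -theta + np.sum(z - theta)

mu, rho = 0.0, 0.0                                    # q_lambda = N(mu, exp(rho)^2)
lr, n_mc = 0.005, 32
for _ in range(3000):
    s = np.exp(rho)
    eps = rng.normal(size=n_mc)
    theta = mu + s * eps                              # reparameterization trick
    g = np.array([grad_log_joint(t) for t in theta])
    grad_mu = g.mean()                                # d ELBO / d mu
    grad_rho = (g * eps * s).mean() + 1.0             # d ELBO / d rho (entropy term = rho + const)
    mu, rho = mu + lr * grad_mu, rho + lr * grad_rho

post_mean, post_var = z.sum() / (m + 1), 1.0 / (m + 1)   # exact conjugate posterior
print(mu, np.exp(rho) ** 2, "vs exact", post_mean, post_var)
```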

Understanding stochastic-gradient-based optimization via Bayesian infer-


ence. Weinan (2017) suggested seeking to demystify the process of optimization in deep learning via SDEs. Mandt et al. (2017) suggested gaining an understanding of SGMs via approximate Bayesian inference. Following this line of reasoning, extensive works on SGMs have been produced, as discussed in the previous
subsections.

References

Alon, Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. 2018. SGD learns over-
parameterized networks that provably generalize on linearly separable data. In International
Conference on Learning Representations.
Ankit, Pensia, Varun Jog, and Po-Ling Loh. 2018. Generalization error bounds for noisy, iterative
algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), 546–550.
Aolin, Xu and Maxim Raginsky. 2017. Information-theoretic analysis of generalization capability
of learning algorithms. In Advances in Neural Information Processing Systems, 2524–2533.
Belinda, Tzen, Tengyuan Liang, and Maxim Raginsky. 2018. Local optimality and generalization
guarantees for the Langevin algorithm via empirical metastability. In Conference On Learning
Theory, 857–875.
Bottou, Léon., et al. 1998. Online learning and stochastic approximations. On-line learning in
neural networks 17 (9): 142.
Cheng, Xiang. 2020. The Interplay between Sampling and Optimization. PhD thesis, EECS Depart-
ment, University of California, Berkeley.
Christos, Louizos and Max Welling. 2017. Multiplicative normalizing flows for variational Bayesian
neural networks. In International Conference on Machine learning, 2218–2227.
Crispin, W Gardiner. 1985. Handbook of stochastic methods, vol. 3. Berlin: Springer.
David, A McAllester. 1999. PAC-Bayesian model averaging. In Annual Conference of Learning
Theory 99: 164–170.

David, A McAllester. 1999. Some PAC-Bayesian theorems. Machine Learning 37 (3): 355–363.
David, M Blei, Alp Kucukelbir, and Jon D. McAuliffe. 2017. Variational inference: A review for
statisticians. Journal of the American statistical Association 112 (518): 859–877.
Diederik, P Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In International
Conference on Learning Representations.
Dieuleveut, Aymeric, and Francis Bach. 2016. Nonparametric stochastic approximation with large
step-sizes. The Annals of Statistics 44 (4): 1363–1399.
Duane, Simon. 1987. Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte
Carlo. Physics Letters B 195 (2): 216–222.
Fredrik, Hellström and Giuseppe Durisi. 2020. Generalization bounds via information density and
conditional information density. arXiv preprint arXiv:2005.08044.
Geman, Stuart, and Donald Geman. 1984. Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 6: 721–741.
George, E Uhlenbeck, and Leonard S. Ornstein. 1930. On the theory of the brownian motion.
Physical Review 36 (5): 823.
Gintare, Karolina Dziugaite, Kyle Hsu, Waseem Gharbieh, and Daniel M Roy. 2020. On the role
of data in pac-bayes bounds. arXiv preprint arXiv:2006.10929.
Hangfeng, He and Weijie Su. 2020. The local elasticity of neural networks. In International Con-
ference on Learning Representations.
Hao, Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. 2018c. Visualizing the
loss landscape of neural nets. In Advances in Neural Information Processing Systems.
Hao, Zhang, Bo Chen, Yulai Cong, Dandan Guo, Hongwei Liu, and Mingyuan Zhou. 2020. Deep
autoencoding topic model with scalable hybrid Bayesian inference. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and appli-
cations, volume 35. Springer Science & Business Media, 2003.
He, Fengxiang, and Tongliang Liu. 2019. and Dacheng Tao. Control batch size and learning rate
to generalize well: Theoretical and empirical evidence. In Advances in Neural Information Pro-
cessing Systems.
Herbert, Robbins and Sutton Monro. 1951. A stochastic approximation method. The Annals of
Mathematical Statistics, 400–407.
Huan, Xu., Constantine Caramanis, and Shie Mannor. 2011. Sparse algorithms are not stable: A
no-free-lunch theorem. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1):
187–193.
Hugo, Larochelle and Stanislas Lauly. 2012. A neural autoregressive topic model. In Advances in
Neural Information Processing Systems, 2708–2716.
Ian, Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. 2014a. Generative adversarial nets. In Advances in Neural
Information Processing Systems, 2672–2680.
Ian, Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. 2016. Deep learning, vol.
1. MIT Press.
Ilya, Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of
initialization and momentum in deep learning. In International conference on machine learning,
1139–1147.
Ivan, Kobyzev, Simon Prince, and Marcus Brubaker. 2020. Normalizing flows: An introduction and
review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Jeffrey, Negrea, Mahdi Haghifam, Gintare Karolina Dziugaite, Ashish Khisti, and Daniel M Roy.
2019. Information-theoretic generalization bounds for sgld via data-dependent estimates. In
Advances in Neural Information Processing Systems, 11015–11025.
Jian, Li, Xuanyuan Luo, and Mingda Qiao. 2019. On generalization error bounds of noisy gradient
methods for non-convex learning. arXiv preprint arXiv:1902.00621.

John, J Cherian, Andrew G Taube, Robert T McGibbon, Panagiotis Angelikopoulos, Guy Blanc,
Michael Snarski, Daniel D Richman, John L Klepeis, and David E Shaw. 2020. Efficient hyperpa-
rameter optimization by way of pac-bayes bound minimization. arXiv preprint arXiv:2008.06431.
Junhong, Lin and Lorenzo Rosasco. 2016. Optimal learning for multi-pass stochastic gradient
methods. In Advances in Neural Information Processing Systems, 4556–4564.
Junhong, Lin, Raffaello Camoriano, and Lorenzo Rosasco. 2016. Generalization properties and
implicit regularization for multiple passes SGM. In International Conference on Machine learn-
ing, 2340–2348.
Kaiming, He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. Deep residual learning for image
recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
Kaiming, He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Identity mappings in deep
residual networks. In European Conference on Computer Vision.
Karen, Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale
image recognition. In International Conference on Learning Representations.
Krizhevsky, Alex, and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images.
Citeseer: Technical report.
Laurent, Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. 2017. Sharp minima can generalize for deep nets. In International Conference on Machine Learning.
Lennart, Ljung, Georg Pflug, and Harro Walk. 2012. Stochastic approximation and optimization of
random systems, volume 17. Birkhäuser.
Lever, Guy, François Laviolette, and John Shawe-Taylor. 2013. Tighter pac-bayes bounds through
distribution-dependent priors. Theoretical Computer Science 473: 4–28.
Mahdi, Haghifam, Jeffrey Negrea, Ashish Khisti, Daniel M Roy, and Gintare Karolina Dziugaite.
2020. Sharpened generalization bounds based on conditional mutual information and an appli-
cation to noisy, iterative algorithms. arXiv preprint arXiv:2004.12983.
Matthew, D Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational
inference. Journal of Machine Learning Research 14 (1): 1303–1347.
Max, Welling and Yee W Teh. 2011. Bayesian learning via stochastic gradient Langevin dynamics.
In International Conference on Machine learning, 681–688.
Maxim, Raginsky, Alexander Rakhlin, and Matus Telgarsky. 2017. Non-convex learning via stochas-
tic gradient Langevin dynamics: A nonasymptotic analysis. In Conference on Learning Theory,
1674–1703.
Moritz, Hardt, Ben Recht, and Yoram Singer. 2016b. Train faster, generalize better: Stability of
stochastic gradient descent. In International Conference on Machine learning, 1225–1234.
Nan, Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D Skeel, and Hartmut Neven.
2014. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Informa-
tion Processing Systems, 3203–3211.
Nitish, Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping
Tak Peter Tang. 2017. On large-batch training for deep learning: Generalization gap and sharp
minima. In International Conference on Learning Representations.
Noah, Golowich, Alexander Rakhlin, and Ohad Shamir. 2018. Size-independent sample complexity
of neural networks. In Annual Conference on Learning Theory, 297–299.
Olivier, Bousquet, and André Elisseeff. 2002. Stability and generalization. Journal of Machine
Learning Research 2: 499–526.
Priya, Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola,
Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch sgd: training
imagenet in 1 hour. arXiv preprint arXiv:1706.02677.
Roth, Wolfgang, and Franz Pernkopf. 2018. Bayesian neural networks with weight sharing using
Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (1):
246–252.
Sam, Patterson and Yee Whye Teh. 2013. Stochastic gradient Riemannian Langevin dynamics on
the probability simplex. In Advances in Neural Information Processing Systems, 3102–3110.

Simon, S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. 2019b. Gradient descent provably
optimizes over-parameterized neural networks. In International Conference on Learning Repre-
sentations.
Spearman, Charles. 1987. The proof and measurement of association between two things. The
American journal of psychology 100 (3/4): 441–471.
Stephan, Mandt, Matthew D. Hoffman, and David M. Blei. 2017. Stochastic gradient descent as
approximate Bayesian inference. Journal of Machine Learning Research 18 (1): 4873–4907.
Sungjin, Ahn, Anoop Korattikara, and Max Welling. 2012. Bayesian posterior sampling via stochas-
tic gradient Fisher scoring. arXiv preprint arXiv:1206.6380.
Thomas, Steinke and Lydia Zakynthinou. 2020. Reasoning about generalization via conditional
mutual information. arXiv preprint arXiv:2001.09122.
Tianqi, Chen, Emily Fox, and Carlos Guestrin. 2014. Stochastic gradient Hamiltonian Monte Carlo.
In International Conference on Machine learning, 1683–1691.
W Keith, Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applica-
tions.
Weinan, E. 2017. A proposal on machine learning via dynamical systems. Communications in
Mathematics and Statistics 5 (1): 1–11.
Wenlong, Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. 2018a. Generalization bounds of sgld for
non-convex learning: Two theoretical viewpoints. In Annual Conference On Learning Theory.
Xi, Chen, Jason D Lee, Xin T Tong, and Yichen Zhang. 2016. Statistical inference for model
parameters in stochastic gradient descent. arXiv preprint arXiv:1610.08637.
Yeming, Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, and Jimmy Ba. 2019.
Interplay between optimization and generalization of stochastic gradient descent with covariance
noise. arXiv preprint arXiv:1902.08234.
Yi, Zhou, Yingbin Liang, and Huishuai Zhang. 2018. Generalization error bounds with probabilistic
guarantee for SGD in nonconvex optimization. arXiv preprint arXiv:1802.06903.
Yi-An, Ma, Tianqi Chen, and Emily Fox. 2015. A complete recipe for stochastic gradient mcmc.
In Advances in Neural Information Processing Systems, 2917–2925.
Ying, Yiming, and Massimiliano Pontil. 2008. Online gradient descent learning algorithms. Foun-
dations of Computational Mathematics 8 (5): 561–596.
Yuansi, Chen, Chi Jin, and Bin Yu. 2018b. Stability and convergence trade-off of iterative optimiza-
tion algorithms. arXiv preprint arXiv:1804.01619.
Yuheng, Bu, Shaofeng Zou, and Venugopal V Veeravalli. 2020. Tightening mutual information
based bounds on generalization error. IEEE Journal on Selected Areas in Information Theory.
Yuting, Wei, Fanny Yang, and Martin J Wainwright. 2017. Early stopping for kernel boosting
algorithms: A general analysis with localized complexities. In Advances in Neural Information
Processing Systems, 6065–6075.
Zeyuan, Allen-Zhu, Yuanzhi Li, and Zhao Song. 2019. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning.
Zhang, Cheng, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. 2018. Advances in varia-
tional inference. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8): 2008–
2026.
Zhun, Deng, Hangfeng He, and Weijie J Su. 2020. Toward better generalization bounds with locally
elastic stability. arXiv preprint arXiv:2010.13988.
Chapter 7
The Geometry of the Loss Surfaces

A major barrier recognized by the whole community is that the loss surfaces of deep
neural networks are extremely nonconvex and nonsmooth. Such nonconvexity and
nonsmoothness make the analysis of the optimization and generalization properties
of such networks prohibitively difficult. An intuitive approach is to bypass these geo-
metrical properties to seek a theoretical explanation. However, some papers argue that
these “intimidating” geometrical properties themselves are the major factors shaping
the properties of deep neural networks and the key to explaining deep learning.

7.1 Linear Networks Have No Spurious Local Minima

Linear neural networks are neural networks whose activations are all linear functions.
For linear neural networks, the loss surfaces do not have any spurious local minima:
all local minima are equally good, i.e., they are all global minima.
Kawaguchi (2016) proved that linear neural networks do not have any spurious
local minima under several assumptions: (1) the loss functions are squared losses;
(2) X X T and X Y T are both of full rank, where X is the data matrix and Y is the
label matrix; (3) the dimensionality of the input layer is larger than that of the output
layer; and (4) all eigenvalues of the matrix $Y X^\top (X X^\top)^{-1} X Y^\top$ are different from
each other. Lu and Kawaguchi (2017) replaced these assumptions with a single, more
restrictive assumption, namely, that the data matrix X and the label matrix Y are both
of full rank. Later, Zhou and Liang (2018) proved that all critical points are global
minima when certain conditions hold. The authors proved this based on a new result
regarding the analytical formulation of the critical points.
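The absence of spurious local minima in linear networks can be illustrated numerically. The following sketch (a minimal illustration, not part of the formal results; the synthetic data, network sizes, and learning rate are all assumptions made for the example) runs gradient descent on a two-layer linear network from several random initializations and compares the attained squared loss with the global optimum given by the least-squares linear fit.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, d_hidden, m = 5, 2, 8, 100

# Synthetic data; X X^T and X Y^T are full rank with probability one.
X = rng.normal(size=(d_x, m))
Y = rng.normal(size=(d_y, d_x)) @ X + 0.1 * rng.normal(size=(d_y, m))

# Global optimum over all linear maps (ordinary least squares).
W_ls = Y @ X.T @ np.linalg.inv(X @ X.T)
best_loss = np.mean(np.sum((Y - W_ls @ X) ** 2, axis=0))

def loss(W1, W2):
    return np.mean(np.sum((Y - W2 @ W1 @ X) ** 2, axis=0))

for trial in range(5):
    W1 = rng.normal(scale=0.5, size=(d_hidden, d_x))
    W2 = rng.normal(scale=0.5, size=(d_y, d_hidden))
    lr = 1e-3
    for _ in range(20000):
        R = W2 @ W1 @ X - Y                 # residual, shape (d_y, m)
        g2 = 2 * R @ (W1 @ X).T / m         # gradient w.r.t. W2
        g1 = 2 * W2.T @ R @ X.T / m         # gradient w.r.t. W1
        W2 -= lr * g2
        W1 -= lr * g1
    # Every run ends close to the global optimum, consistent with the theory above.
    print(trial, loss(W1, W2), best_loss)
```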


7.2 Nonlinear Activations Bring Infinite Spurious Local Minima

Neural networks have been successfully deployed in many real-world applications


(LeCun et al. 2015; Witten et al. 2016; Silver et al. 2016; He et al. 2016; Litjens et al.
2017). Despite this success, the theoretical foundations of neural networks remain somewhat immature. To address the many gaps in our knowledge of deep learning theory, investigating the loss surfaces of neural networks is of fundamental importance. Understanding the loss surface would be helpful in several related research areas, such as estimating data distributions, optimizing neural networks, and generalizing to unseen data.
This section studies the role of the nonlinearities in activation functions, which in turn shape the loss surfaces of neural networks. Our results demonstrate that the impact of these nonlinearities is profound.
First, we prove that the loss surfaces of nonlinear neural networks are substantially different from those of linear neural networks, in which all local minima are equally good and are all global minima (Kawaguchi 2016; Baldi and Hornik 1989; Lu and Kawaguchi 2017; Freeman and Bruna 2017; Zhou and Liang 2018; Laurent and von Brecht 2018; Yun et al. 2018). By contrast,
Neural networks with arbitrary depth and arbitrary piecewise linear activations (excluding
linear functions) have infinitely many spurious local minima under arbitrary continuously
differentiable loss functions.

This result relies on only four mild assumptions that cover most practical circumstances: (1) the training sample set cannot be fit by a linear model; (2) all training sample points are distinct; (3) the output layer is narrower than the hidden layers; and (4) there exists some turning point in the piecewise linear activations at which the sum of the slopes on the two sides is nonzero.
Our result significantly extends the existing studies on the existence of spurious local minima. For example, Zhou and Liang (2018) prove that one-hidden-layer neural networks with two nodes in the hidden layer and two-piece linear (ReLU-like) activations have spurious local minima; Swirszcz et al. (2016) prove that ReLU networks have spurious local minima under the squared loss when most of the neurons are not activated; Safran and Shamir (2018) present a computer-assisted proof that two-layer ReLU networks have spurious local minima; a recent work by Yun et al. (2019) proves that neural networks with two-piece linear activations have infinitely many spurious local minima, but the result only applies to networks with one hidden layer and one-dimensional outputs; and a concurrent work by Goldblum et al. (2020) proves that, for multi-layer perceptrons of any depth, the training performance of every local minimum equals that of a linear model, which is also verified by experiments.
The proposed theorem is proved in three stages: (1) we prove that neural networks
with one hidden layer and two-piece linear activations have spurious local minima;
(2) we extend the conditions to neural networks with arbitrary hidden layers and two-
piece linear activations; and (3) we further extend the conditions to neural networks
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 125

with arbitrary depth and arbitrary piecewise linear activations. Since some parameters
of the constructed spurious local minima are from continuous intervals, we have
obtained infinitely many spurious local minima. At each stage, the proof follows
a two-step strategy that: (a) constructs an infinite series of local minima; and (b)
constructs a point in the parameter space whose empirical risk is lower than the
constructed local minimum in Step (a). This strategy is inspired by Yun et al. (2019),
upon which we have made significant and non-trivial development.
Second, we draw a “big picture” for the loss surfaces of nonlinear neural networks.
Soudry and Hoffer (2018) highlight a smooth and multilinear partition of the loss
surfaces of neural networks. The nonlinearities in the piecewise linear activations
partition the loss surface of any nonlinear neural network into multiple smooth and
multilinear open cells. Specifically, every nonlinear point in the activation functions
creates a group of the non-differentiable boundaries between the cells, while the
linear parts of the activations correspond to the smooth and multilinear interiors. Based on this partition, we discover the degenerate nature of the large number of local minima from the following aspects:

• Every local minimum is globally minimal within a cell. This property demon-
strates that the local geometry within every cell is similar to the global geometry of
linear networks, although technically, they are substantially different. It applies to
any one-hidden-layer neural network with two-piece linear activations for regres-
sion under convex loss. We rigorously prove this property in two stages: (1) we
prove that within every cell, the empirical risk R̂ S is convex with respect to a
variable Ŵ mapped from the weights W by a mapping Q. Therefore, the local
minima with respect to the variable Ŵ are also the global minima in the cell; and
then (2) we prove that the local optimality is maintained under the constructed
mapping. Specifically, the local minima of the empirical risk R̂ S with respect to
the parameter W are also the local minima with respect to the variable Ŵ . We
thereby prove this property by combining the convexity and the correspondence
of the minima. This proof is technically novel and non-trivial, despite the natural
intuitions.
• Equivalence classes and quotient space of local minimum valleys. All local
minima in a cell are concentrated as a local minimum valley: on a local minimum
valley, all local minima are connected with each other by a continuous path, on
which the empirical risk is invariant. Further, all these local minima constitute
an equivalence class. This local minimum valley may have several parallel valleys that are in the same equivalence class but do not appear because of the constraints imposed by the cell boundaries. If such constraints are ignored, all the equivalence classes
constitute a quotient space. The constructed mapping Q is exactly the quotient
map. This result coincides with the property of mode connectivity that the minima
found by gradient-based methods are connected by a path in the parameter space
with almost invariant empirical risk (Garipov et al. 2018; Draxler et al. 2018;
Kuditipudi et al. 2019). Additionally, this property suggests that we would need
to study every local minimum valley as a whole.

• Linear collapse. Linear neural networks are covered by our theories as a simplified
case. When all activations are linear, the partitioned loss surface collapses to one
single cell, in which all local minima are globally optimal, as suggested by the
existing works on linear networks (Kawaguchi 2016; Baldi and Hornik 1989; Lu
and Kawaguchi 2017; Freeman and Bruna 2017; Zhou and Liang 2018; Laurent
and von Brecht 2018; Yun et al. 2018).

7.2.1 Neural Networks Have Infinite Spurious Local Minima

This section investigates the existence of spurious local minima on the loss surfaces
of neural networks. We find that almost all practical neural networks have infinitely
many spurious local minima. This result holds for any neural network with arbitrary depth and arbitrary piecewise linear activations (excluding linear functions) under an arbitrary continuously differentiable loss.

7.2.1.1 Preliminaries

Consider a training sample set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ of size m. Suppose the dimensions of the features $x_i$ and the labels $y_i$ are $d_X$ and $d_Y$, respectively. By aggregating the training sample set, we obtain the feature matrix $X \in \mathbb{R}^{d_X \times m}$ and the label matrix $Y \in \mathbb{R}^{d_Y \times m}$.
Suppose a neural network has L layers. Denote the weight matrix, bias, and activation in the j-th layer by $W_j \in \mathbb{R}^{d_j \times d_{j-1}}$, $b_j \in \mathbb{R}^{d_j}$, and $h_j : \mathbb{R}^{d_j \times m} \to \mathbb{R}^{d_j \times m}$, respectively, where $d_j$ is the dimension of the output of the j-th layer. Also, for the input matrix X, the output of the j-th layer is denoted by $Y^{(j)}$ and the output of the j-th layer before the activation is denoted by $\tilde Y^{(j)}$,

$$\tilde Y^{(j)} = W_j Y^{(j-1)} + b_j 1_m^T, \tag{7.1}$$

$$Y^{(j)} = h_j\big(W_j Y^{(j-1)} + b_j 1_m^T\big). \tag{7.2}$$

The output of the network is defined as follows,

$$\hat Y = h_L\Big(W_L\, h_{L-1}\big(W_{L-1}\, h_{L-2}(\cdots h_1(W_1 X + b_1 1_m^T)\cdots) + b_{L-1} 1_m^T\big) + b_L 1_m^T\Big). \tag{7.3}$$

Also, we define $Y^{(0)} = X$, $Y^{(L)} = \hat Y$, $d_0 = d_X$, and $d_L = d_Y$. In some situations, we write $\hat Y\big([W_i]_{i=1}^L, [b_i]_{i=1}^L\big)$ to clarify the parameters, and similarly for $\tilde Y^{(j)}$, $Y^{(j)}$, etc.
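To make the notation concrete, the following sketch (a minimal illustration, not part of the book's formal development) implements the layer recursion of Eqs. (7.1)–(7.3) with NumPy; the activations are passed in as functions, and the bias broadcast plays the role of $b_j 1_m^T$.

```python
import numpy as np

def forward(X, weights, biases, activations):
    """Compute the network output per Eqs. (7.1)-(7.3).

    X:           feature matrix of shape (d_X, m)
    weights:     list [W_1, ..., W_L], W_j of shape (d_j, d_{j-1})
    biases:      list [b_1, ..., b_L], b_j of shape (d_j,)
    activations: list [h_1, ..., h_L], each applied elementwise
    """
    Y = X                                   # Y^(0) = X
    for W, b, h in zip(weights, biases, activations):
        Y_pre = W @ Y + b[:, None]          # Eq. (7.1): pre-activation output
        Y = h(Y_pre)                        # Eq. (7.2): post-activation output
    return Y                                # Y^(L) = Y_hat, Eq. (7.3)
```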

This section discusses neural networks with piecewise linear activations. A part of the proof uses two-piece linear activations $h_{s_-,s_+}$, which are defined as follows,

$$h_{s_-,s_+}(x) = \mathbb{I}_{\{x \le 0\}}\, s_- x + \mathbb{I}_{\{x > 0\}}\, s_+ x, \tag{7.4}$$

where $s_+ \neq s_-$ and $\mathbb{I}_{\{\cdot\}}$ is the indicator function.


Remark 7.1 Piecewise linear functions are dense in the space of continuous functions. In other words, for any continuous function, we can always find a piecewise linear function that approximates it with arbitrarily small error.
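As a concrete example, Eq. (7.4) can be implemented in a few lines; setting $s_- = 0$, $s_+ = 1$ recovers ReLU, and $s_- = 0.01$, $s_+ = 1$ recovers a leaky ReLU (a minimal sketch, not from the book).

```python
import numpy as np

def h_two_piece(x, s_minus, s_plus):
    """Two-piece linear activation of Eq. (7.4): slope s_minus for x <= 0, s_plus for x > 0."""
    return np.where(x > 0, s_plus * x, s_minus * x)

x = np.linspace(-2.0, 2.0, 5)
print(h_two_piece(x, 0.0, 1.0))    # ReLU
print(h_two_piece(x, 0.01, 1.0))   # leaky ReLU
```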
This section uses continuously differentiable loss to evaluate the performance of
neural networks. Continuous differentiability is defined as follows.
Definition 7.1 (Continuously differentiable) We call a function f : Rd X → R con-
tinuously differentiable with respect to the variable X if: (1) the function f is differ-
entiable with respect to X ; and (2) the gradient ∇ f (X ) of the function f is continuous
with respect to the variable X .

7.2.1.2 Main Result

The theorem in this section relies on the following assumptions.


Assumption 7.1 The training data cannot be fit by a linear model. 
Assumption 7.2 All data points are distinct. 
Assumption 7.3 All hidden layers are wider than the output layer. 
Assumption 7.4 For the piecewise linear activations, there exists some turning point at which the sum of the slopes on the two sides is nonzero. □
These assumptions are respectively justified as follows: (1) most real-world datasets are extremely complex and cannot be fit by simple linear models; (2) it is easy to guarantee that the data points are distinct by employing data cleansing methods; (3) for regression and many classification tasks, the width of the output layer is limited and narrower than that of the hidden layers; and (4) this assumption is violated only by activations like f(x) = a|x|.
Based on these four assumptions, we can prove the following theorem.
Theorem 7.5 Neural networks with arbitrary depth and arbitrary piecewise linear
activations (excluding linear functions) have infinitely many spurious local minima
under arbitrary continuously differentiable loss whose derivative can equal 0 only
when the prediction and label are the same.
In practice, most loss functions are continuously differentiable and the derivative
can equal 0 only when the prediction and label are the same, such as squared loss
and cross-entropy loss (see Appendix 7.2.3.2, Lemmas 7.2 and 7.3).

One can also remove Assumption 7.4, if Assumption 7.3 is replaced by the
following assumption, which is mildly more restrictive (see a detailed proof in
Sect. 7.2.3.5–7.2.3.6).

Assumption 7.6 The dimensions of the layers satisfy that:

d1 ≥ dY + 2,
di ≥ dY + 1, i = 2, . . . , L − 1.

The above result demonstrates that introducing nonlinearities into the activations substantially reshapes the loss surface: they bring infinitely many spurious local minima into the loss surface. This highlights a substantial difference from linear neural networks, in which all local minima are equally good and, therefore, are all global minima (Kawaguchi 2016; Baldi and Hornik 1989; Lu and Kawaguchi 2017; Freeman and Bruna 2017; Zhou and Liang 2018; Laurent and von Brecht 2018; Yun et al. 2018).
Some works have noticed the existence of spurious local minima on the loss surfaces of nonlinear neural networks, but their results have limited applicable domains (Choromanska et al. 2015; Swirszcz et al. 2016; Safran and Shamir 2018; Yun et al.
2019). A notable work by Yun et al. (2019) proves that one-hidden-layer neural net-
works with two-piece linear (ReLU-like) activations for one-dimensional regression
have infinitely many spurious local minima under squared loss. This work first con-
structs a series of local minima and then proves that they are spurious. This idea
inspires some of the results in Theorem 7.5. However, the result in Theorem 7.5
makes a significant and non-trivial development that extends the conditions to arbi-
trary depth, piecewise linear activations excluding linear functions, and continuously
differentiable loss.

7.2.1.3 Proof Skeleton

This section presents the skeleton of the proof. Theorem 7.5 is proved in three stages.
We first prove a simplified version of Theorem 7.5 and then extend the conditions in
the last two stages.
Yun et al. (2019) and the method discussed in this section share a common strategy:
(a) creating a sequence of local minima using a linear classifier; and (b) showing the
existence of a constructed new point with lower empirical risk to demonstrate that
the constructed local minima are spurious. However, the specifics of how these local
minima are constructed vary significantly due to differences in the loss function used
and the dimensions of the output space. These distinctions highlight the nuanced
nature of the strategy and its application across different contexts.
We have extended the work of Yun et al. (2019) in three key ways: (1) Generalizing
from one hidden layer to arbitrary depth: Our approach aims to demonstrate that neu-
ral networks of arbitrary depth exhibit infinite spurious local minima. We introduced
a new strategy that employs transformation operations to direct data flow through the

same linear parts of the activations, facilitating the construction of spurious local min-
ima. (2) Extending from squared loss to arbitrary differentiable loss: Yun et al. (2019)
obtained the analytic derivatives of the loss function to construct and prove the exis-
tence of spurious local minima. However, this method cannot be directly applied to
arbitrary differentiable loss functions lacking analytic formulations. To establish that
a loss surface under an arbitrary differentiable loss has infinite spurious local minima,
we developed a new proof technique based on Taylor series expansions and a new
separation lemma. (3) Expanding from one-dimensional to arbitrary-dimensional
output: Our work investigates neural networks with arbitrary-dimensional output,
dealing with the calculus of functions whose domain and codomain are a matrix space
and a vector space, respectively. This extension contrasts with the one-dimensional
output scenario, which deals solely with the codomain of real number spaces. Explor-
ing higher-dimensional outputs significantly increases the complexity of the proof
process, requiring specialized mathematical techniques to demonstrate the presence
of infinite spurious local minima.
Stage (1): Neural networks with one hidden layer and two-piece linear
activations.
We first prove that nonlinear neural networks with one hidden layer and two-piece
linear activation functions (ReLU-like activations) have spurious local minima. The
proof in this stage further follows a two-step strategy:
(a) We first construct local minima of the empirical risk R̂ S (see Appendix 7.2.3.3,
Lemma 7.4). These local minimizers are constructed based on a linear neural network
which has the same network size (dimension of weight matrices) and evaluated under
the same loss. The design of the hidden layer guarantees that the components of the
output Ỹ (1) in the hidden layer before the activation are all positive. The activation
is thus effectively reduced to a linear function. Therefore, the local geometry around
the local minima with respect to the weights W is similar to those of linear neural
networks. Further, the design of the output layer guarantees that its output Ŷ is the
same as the linear neural network. This construction helps to utilize the results of
linear neural networks to solve the problems in nonlinear neural networks.
(b) We then prove that all the constructed local minima in Step (a) are spurious (see Appendix 7.2.3.3, Theorem 7.9). Specifically, Assumption 7.1 states that the dataset cannot be fit by a linear model. Therefore, the gradient ∇_Ŷ R̂_S of the empirical risk R̂_S with respect to the prediction Ŷ is not zero. Suppose the i-th row of the gradient ∇_Ŷ R̂_S is not zero. Then, we use a Taylor series expansion and a preparation lemma (see Appendix 7.2.3.6, Lemma 7.7) to construct another point in the parameter space that has a smaller empirical risk. Therefore, we prove that the constructed local minima are spurious. Furthermore, the constructions involve some parameters that are picked from a continuous interval. Thus, we construct infinitely many spurious local minima.

Stage (2) - Neural networks with arbitrary hidden layers and two-piece linear
activations.
We extend the condition in Stage (1) to any neural network with arbitrary depth
and two-piece linear activations. The proof in this stage follows the same two-step
strategy but has different implementations:
(a) We first construct a series of local minima of the empirical risk R̂ S (see
Appendix 7.2.3.4, Lemma 7.5). The construction guarantees that every component
of the pre-activation output Ỹ^(i) of each layer is positive, which ensures that all the input examples flow through the same part of the activations. Thereby, the nonlinear
activations are reduced to linear functions. Also, our construction guarantees that the
output Ŷ of the network is the same as a linear network with the same weight matrix
dimensions.
(b) We then prove that the constructed local minima are spurious (see Appendix
7.2.3.4, Theorem 7.10). The idea is to find a point in the parameter space that has
the same empirical risk R̂ S with the constructed point in Stage (1), Step (b).
Stage (3) - Neural networks with arbitrary hidden layer and piecewise linear
activations.
We further extend the conditions in Stage (2) to any neural network with arbitrary
depth and arbitrary piecewise linear activations. We continue to adapt the two-step
strategy in this stage:
(a) We first construct a local minimizer of the empirical risk R̂ S based on the
results in Stages (1) and (2) (see Appendix 7.2.3.5, Lemma 7.6). This construction
is based on Stage (2), Step (a). The difference of the construction in this stage is that
every linear part in the activations can be of a finite interval. The constructed weight
matrices apply several uniform scaling and translation operations on the outputs of
hidden layers in order to guarantee that all the input training sample points flow
through the same linear parts of the activations. We thereby effectively reduce the
nonlinear activations to linear functions. Also, our construction guarantees that the
output Ŷ of the neural network equals to that of the corresponding linear neural
network.
(b) We then prove that the constructed local minima are spurious (see Appendix
7.2.3.5). We use the same strategy in Stage (2), Step (b). Some adaptations are
implemented for the new conditions.

7.2.2 A Big Picture of the Loss Surface

This section draws a big picture for the loss surfaces of neural networks. Based on
a recent result by Soudry and Hoffer (2018), we present four profound properties of
the loss surface that collectively characterize how the nonlinearities in activations
shape the loss surface.

7.2.2.1 Preliminaries

The discussions in this section use the following concepts.


Definition 7.2 (Open ball and open set) The open ball in H centered at h ∈ H and of radius r > 0 is defined by $B_r(h) = \{x : \|x - h\| < r\}$. A subset A ⊂ H of a space H is called an open set if, for every point h ∈ A, there exists a positive real r > 0 such that the open ball $B_r(h)$ with center h and radius r is contained in A: $B_r(h) \subset A$.
Definition 7.3 (Interior point and interior) For a subset A ⊂ H of a space H, a
point h ∈ A is called an interior point of A, if there exists a positive real r > 0, such
that the open ball Br (h) with center h and radius r is in the subset A: Br (h) ⊂ A.
The set of all the interior points of the set A is called the interior of the set A.
Definition 7.4 (Limit point, closure, and boundary) For a subset A ⊂ H of a space H, a point h ∈ H is called a limit point of A if, for every r > 0, the open ball $B_r(h)$ with center h and radius r contains some point of A: $B_r(h) \cap A \neq \emptyset$. The closure Ā of the set A consists of the union of the set A and all its limit points. The boundary ∂A is defined as the set of points which are in the closure of A but not in the interior of A.
Definition 7.5 (Multilinear) A function $f : X_1 \times X_2 \to Y$ is called multilinear if, for arbitrary $x_{11}, x_{12} \in X_1$, $x_{21}, x_{22} \in X_2$, and constants $\lambda_1, \lambda_2, \mu_1$, and $\mu_2$, we have

f (λ1 x11 + λ2 x12 , μ1 x21 + μ2 x22 )


= λ1 μ1 f (x11 , x21 ) + λ1 μ2 f (x11 , x22 ) + λ2 μ1 f (x12 , x21 ) + λ2 μ2 f (x12 , x22 ).

Remark 7.2 The definition of “multilinear” implies that the domain of any multilinear function f is a connected and convex set, such as the smooth and multilinear cells below.
Definition 7.6 (Equivalence class and quotient space) Suppose X is a linear space. $[x] = \{v \in X : v \sim x\}$ is an equivalence class if ∼ is an equivalence relation on X, i.e., for any a, b, c ∈ X we have: (1) reflexivity: a ∼ a; (2) symmetry: if a ∼ b, then b ∼ a; and (3) transitivity: if a ∼ b and b ∼ c, then a ∼ c. The quotient space and the quotient map are defined as $X/\!\sim\, = \{\{v \in X : v \sim x\} : x \in X\}$ and $x \mapsto [x]$, respectively.

7.2.2.2 Main Results

In this section, the loss surface is defined under convex loss with respect to the pre-
diction Ŷ of the neural network. Convex loss covers many popular loss functions in
practice, such as the squared loss for the regression tasks and many others based on
norms. The triangle inequality of the norms secures the convexity of the correspond-
ing loss functions. The convexity of the squared loss is checked in the appendix (see
Appendix 7.3, Lemma 7.8).

We now present four propositions characterizing the loss surfaces of nonlinear neural networks. These propositions give four major properties of the loss surface that collectively draw a big picture of the loss surface.
We first recall a lemma by Soudry and Hoffer (2018). It proves that the loss
surfaces of neural networks have smooth and multilinear partitions.

Lemma 7.1 (Smooth and multilinear partition; cf. Soudry and Hoffer (2018)) The
loss surfaces of neural networks of arbitrary depth with piecewise linear functions
excluding linear functions are partitioned into multiple smooth and multilinear open
cells, while the boundaries are nondifferentiable.

Based on the smooth and multilinear partition, we prove four propositions as


follows.

Theorem 7.7 (Analogous convexity) For one-hidden-layer neural networks with


two-piece linear activation for regression under convex loss, within every cell, all
local minima are equally good, and also, they are all global minima in the cell.

Theorem 7.8 (Equivalence classes of local minimum valleys) Suppose all condi-
tions of Theorem 7.7 hold. Assume the loss function is strictly convex. Then, all local
minima in a cell are concentrated as a local minimum valley: they are connected
with each other by a continuous path and have the same empirical risk. Additionally,
all local minima in a cell constitute an equivalence class.

Corollary 7.1 (Quotient space of local minimum valleys) Suppose all conditions
of Theorem 7.8 hold. There might exist some “parallel” local minimum valleys in
the equivalence class of a local minimum valley. They do not appear because of the
constraints from the cell boundaries. If we ignore such constraints, all equivalence
classes of local minima valleys constitute a quotient space.

Corollary 7.2 (Linear collapse) The partitioned loss surface collapses to one single
smooth and multilinear cell, when all activations are linear.
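The cell structure of Lemma 7.1 above can be made tangible on a toy example: each cell corresponds to a fixed activation pattern, i.e., a fixed sign pattern of the hidden pre-activations over the training set. The sketch below (an illustration with assumed synthetic data, not a construction from the book) identifies each first-layer parameter setting of a one-hidden-layer network with its activation pattern; parameters sharing a pattern lie in the same smooth and multilinear cell.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_1, m = 2, 3, 10
X = rng.normal(size=(d_x, m))               # toy training inputs

def activation_pattern(W1, b1):
    """Sign pattern of hidden pre-activations; constant within one cell of the partition."""
    pre = W1 @ X + b1[:, None]               # shape (d_1, m)
    return tuple((pre > 0).ravel())

patterns = set()
for _ in range(5000):
    W1 = rng.normal(size=(d_1, d_x))
    b1 = rng.normal(size=(d_1,))
    patterns.add(activation_pattern(W1, b1))

print(len(patterns), "cells visited out of at most", 2 ** (d_1 * m))
```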

7.2.2.3 Discussions and Proof Techniques

The four propositions collectively characterize how the nonlinearities in activations


shape the loss surfaces of neural networks. This section discusses the results and the
structure of the proofs. A detailed proof is omitted here and given in Appendix 7.3.
Smooth and multilinear partition. Intuitively, the nonlinearities in the piecewise
linear activation functions partition the surface into multiple smooth and multilinear
cells. Zhou and Liang (2018), Soudry and Hoffer (2018) highlight the partition of
the loss surface. We restate it here to make the picture self-contained. A similar but markedly different notion recently proposed by Hanin and Rolnick (2019) demonstrates that the input data space is partitioned into multiple linear regions, while
our work focuses on the partition in the parameter space.

Every local minimum is globally minimal within a cell. In convex optimization,


convexity guarantees that all the local minima are global minima. This theorem
proves that the local minima within a cell are equally good, and also, they are all
global minima in the cell. This result is not surprising given the excellent training performance of deep learning algorithms. However, the proof is technically non-trivial.
Soudry and Hoffer (2018) proved that the local minima in a cell are the same.
However, some points near the boundary that have a smaller empirical risk and
are not locally minimal may exist. Unfortunately, the proof by Soudry and Hoffer
(2018) cannot exclude this possibility. By contrast, our proof completely solves this
problem. Furthermore, our proof holds for any convex loss, including squared loss
and cross-entropy loss, while the proof by Soudry and Hoffer (2018) only holds for
squared loss.
Developing our proof presented notable challenges due to the limitations of apply-
ing techniques used for linear networks to nonlinear contexts. Linear networks, char-
acterized by sequential weight matrix products, possess straightforward geometrical
properties where each linear activation function scales the output by a constant. In
contrast, the loss surface in a cell of a nonlinear neural network lacks this simplic-
ity owing to the intricate behaviors induced by nonlinear activations. Our approach
involves addressing these complexities within the proof framework.
We first prove that the empirical risk R̂ S is a convex function within every cell
with respect to a variable Ŵ which is calculated from the weights W . Therefore, all
local minima of the empirical risk R̂ S with respect to Ŵ are also globally optimal in
the cell. Every cell corresponds to a specific series of linear parts of the activations.
Therefore, in any fixed cell, the activation h s− ,s+ can be expressed by the slopes of
the corresponding linear parts as the following equations,

$$\hat R_S(W_1, W_2) = \frac{1}{m}\sum_{i=1}^m l\big(y_i,\, W_2 h(W_1 x_i)\big) = \frac{1}{m}\sum_{i=1}^m l\big(y_i,\, W_2\,\mathrm{diag}(A_{\cdot,i})\, W_1 x_i\big), \tag{7.5}$$

where $A_{\cdot,i}$ is the i-th column of the matrix

$$A = \begin{bmatrix} h'_{s_-,s_+}\big((W_1)_{1,\cdot}\, x_1\big) & \cdots & h'_{s_-,s_+}\big((W_1)_{1,\cdot}\, x_m\big) \\ \vdots & \ddots & \vdots \\ h'_{s_-,s_+}\big((W_1)_{d_1,\cdot}\, x_1\big) & \cdots & h'_{s_-,s_+}\big((W_1)_{d_1,\cdot}\, x_m\big) \end{bmatrix}.$$

Matrix A is constituted by collecting the slopes of the activation h at every point $(W_1)_{i,\cdot}\, x_j$. Different elements of the matrix A can take either value in $\{s_-, s_+\}$. Therefore, we cannot use a single constant to express the effect of this activation, and thus, even within a cell, a nonlinear network cannot be expressed as the product of a sequence of weight matrices. This difference means that the proofs for deep linear neural networks cannot be directly transplanted here.
Then, we prove that (see Sect. 7.3.3)

$$W_2\,\mathrm{diag}(A_{\cdot,i})\, W_1 x_i = A_{\cdot,i}^{T}\,\mathrm{diag}(W_2)\, W_1 x_i. \tag{7.6}$$

Applying Eq. (7.6) to Eq. (7.5), the empirical risk $\hat R_S$ takes a formulation similar to that of linear neural networks,

$$\hat R_S = \frac{1}{m}\sum_{i=1}^m l\big(y_i,\, A_{\cdot,i}^{T}\,\mathrm{diag}(W_2)\, W_1 x_i\big). \tag{7.7}$$

Afterwards, define $\hat W_1 = \mathrm{diag}(W_2)\, W_1$ and then straighten the matrix $\hat W_1$ into a vector $\hat W$,

$$\hat W = \big[(\hat W_1)_{1,\cdot}\ \cdots\ (\hat W_1)_{d_1,\cdot}\big].$$

Define $Q : (W_1, W_2) \mapsto \hat W$, and also define

$$\hat X = \big[A_{\cdot,1} \otimes x_1\ \cdots\ A_{\cdot,m} \otimes x_m\big].$$

We can prove the following equation (see Sect. 7.3.3),

$$\big[A_{\cdot,1}^{T} \hat W_1 x_1\ \cdots\ A_{\cdot,m}^{T} \hat W_1 x_m\big] = \hat W \hat X.$$

Applying Eq. (7.7), the empirical risk is transformed into a convex function as follows,

$$\hat R_S = \frac{1}{m}\sum_{i=1}^m l\big(y_i,\, A_{\cdot,i}^{T} \hat W_1 x_i\big) = \frac{1}{m}\sum_{i=1}^m l\big(y_i,\, \hat W \hat X_i\big).$$

We then prove that the local optimality of the empirical risk R̂ S is maintained
when the weights W are mapped to the variable Ŵ . Specifically, the local minima
of the empirical risk R̂ S with respect to the weight W are also the local minima with
respect to the variable Ŵ . The maintenance of optimality is not surprising but the
proof is technically non-trivial (see a detailed proof in Sect. 7.3.3).
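The algebra behind Eqs. (7.6)–(7.7) can be checked numerically. The sketch below (a minimal check on assumed random data, with $d_Y = 1$ as in Theorem 7.7 and biases omitted for brevity) verifies the identity $W_2\,\mathrm{diag}(A_{\cdot,i})\, W_1 x_i = A_{\cdot,i}^T\,\mathrm{diag}(W_2)\, W_1 x_i$ and that the reparameterization $\hat W = Q(W_1, W_2)$ together with $\hat X$ reproduces the network output.

```python
import numpy as np

rng = np.random.default_rng(2)
d_x, d_1, m = 4, 6, 8
s_minus, s_plus = 0.2, 1.0

X = rng.normal(size=(d_x, m))
W1 = rng.normal(size=(d_1, d_x))
W2 = rng.normal(size=(1, d_1))               # d_Y = 1 (regression)

pre = W1 @ X                                  # hidden pre-activations
A = np.where(pre > 0, s_plus, s_minus)        # slope matrix A of Eq. (7.5)

for i in range(m):
    lhs = W2 @ (np.diag(A[:, i]) @ (W1 @ X[:, i]))            # W2 diag(A_{.,i}) W1 x_i
    rhs = A[:, i] @ (np.diag(W2.ravel()) @ (W1 @ X[:, i]))     # A_{.,i}^T diag(W2) W1 x_i
    assert np.allclose(lhs, rhs)                               # identity (7.6)

# Reparameterization Q: (W1, W2) -> W_hat, and the lifted features X_hat.
W1_hat = np.diag(W2.ravel()) @ W1
W_hat = W1_hat.reshape(-1)                                     # rows of W1_hat concatenated
X_hat = np.stack([np.kron(A[:, i], X[:, i]) for i in range(m)], axis=1)
net_out = W2 @ np.where(pre > 0, s_plus * pre, s_minus * pre)  # W2 h(W1 X)
assert np.allclose(W_hat @ X_hat, net_out.ravel())
print("Identity (7.6) and the reparameterization check out on this toy example.")
```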
Equivalence classes and quotient space of local minimum valleys. The con-
structed mapping Q is a quotient map. Under the setting in the previous property,
all local minima in a cell constitute an equivalence class; they are concentrated as a local minimum valley. However, there might exist some “parallel” local minimum valleys in the equivalence class, which do not appear because of the constraints from the cell boundaries. Further, for neural networks of arbitrary depth, we also construct a local minimum valley (the spurious local minima constructed in Sect. 7.2.1). This result
explains the property of mode connectivity that the minima found by gradient-based
methods are connected by a path in the parameter space with almost constant empir-
ical risk, which is proposed in two empirical works (Garipov et al. 2018; Draxler
et al. 2018). A recent theoretical work (Kuditipudi et al. 2019) proves that dropout
stability and noise stability guarantee the mode connectivity.

Linear collapse. Our theories also cover the case of linear neural networks. Linear
neural networks do not have any nonlinearity in their activations. Correspondingly,
the loss surface does not have any non-differentiable boundaries. In our theories,
when there is no nonlinearity in the activations, the partitioned loss surface collapses
to a single smooth, multilinear cell. All local minima therein are equally good and are all global minima. This result is consistent with the existing results on
linear neural networks (Kawaguchi 2016; Baldi and Hornik 1989; Lu and Kawaguchi
2017; Freeman and Bruna 2017; Zhou and Liang 2018; Laurent and von Brecht 2018;
Yun et al. 2018).

7.2.3 Proofs of Sect. 7.2

7.2.3.1 Proof of Theorem 7.5

This section gives a detailed proof of Theorem 7.5. It follows the skeleton presented
in Sect. 7.2.1.3.

7.2.3.2 Squared Loss and Cross-Entropy Loss

We first check whether squared loss and cross-entropy loss are covered by the
requirements of Theorem 7.5.
Lemma 7.2 The squared loss is continuously differentiable with respect to the prediction of the model, and its gradient equals zero only when the prediction and the label are the same.
Proof The squared loss is clearly differentiable with respect to Ŷ. Specifically, the gradient with respect to Ŷ is

$$\nabla_{\hat Y}\, \|Y - \hat Y\|^2 = 2\,(\hat Y - Y),$$

which is continuous with respect to Ŷ.
Also, when the prediction Ŷ does not equal the label Y, we have

$$\nabla_{\hat Y}\, \|Y - \hat Y\|^2 \neq 0.$$
The proof is completed. 
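A quick finite-difference check (a sketch with assumed random data, not part of the proof) confirms the gradient computed in Lemma 7.2.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(size=(3, 5))
Y_hat = rng.normal(size=(3, 5))

grad_analytic = 2.0 * (Y_hat - Y)            # gradient of ||Y - Y_hat||^2 w.r.t. Y_hat

eps = 1e-6
grad_numeric = np.zeros_like(Y_hat)
for idx in np.ndindex(Y_hat.shape):
    E = np.zeros_like(Y_hat); E[idx] = eps
    grad_numeric[idx] = (np.sum((Y - (Y_hat + E)) ** 2)
                         - np.sum((Y - (Y_hat - E)) ** 2)) / (2 * eps)

assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)
print("Analytic gradient 2(Y_hat - Y) matches the finite-difference estimate.")
```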


Lemma 7.3 The cross-entropy loss is continuously differentiable with respect to the prediction of the model, and its gradient equals zero only when the prediction and the label are the same. Here, we assume that the ground-truth label is a one-hot vector.

Proof For any i ∈ [1 : m], the cross-entropy loss is differentiable with respect to $\hat Y_i$. The j-th component of the gradient with respect to the prediction $\hat Y_i$ is

$$\frac{\partial}{\partial \hat Y_{j,i}}\left(-\sum_{k=1}^{d_Y} Y_{k,i}\,\log\frac{e^{\hat Y_{k,i}}}{\sum_{l=1}^{d_Y} e^{\hat Y_{l,i}}}\right) = \left(\sum_{k=1}^{d_Y} Y_{k,i}\right)\frac{e^{\hat Y_{j,i}}}{\sum_{k=1}^{d_Y} e^{\hat Y_{k,i}}} - Y_{j,i}, \tag{7.8}$$

which is continuous with respect to $\hat Y_i$. So, the cross-entropy loss is continuously differentiable with respect to $\hat Y_i$.
Additionally, if the gradient (Eq. (7.8)) is zero, we have the following equations,

$$\left(\sum_{k=1}^{d_Y} Y_{k,i}\right) e^{\hat Y_{j,i}} - Y_{j,i} \sum_{k=1}^{d_Y} e^{\hat Y_{k,i}} = 0, \qquad j = 1, 2, \ldots, d_Y.$$

Rewriting this in matrix form, we have

$$\begin{bmatrix} \sum_{k=1}^{d_Y} Y_{k,i} - Y_{1,i} & -Y_{1,i} & \cdots & -Y_{1,i} \\ -Y_{2,i} & \sum_{k=1}^{d_Y} Y_{k,i} - Y_{2,i} & \cdots & -Y_{2,i} \\ \vdots & \vdots & \ddots & \vdots \\ -Y_{d_Y,i} & -Y_{d_Y,i} & \cdots & \sum_{k=1}^{d_Y} Y_{k,i} - Y_{d_Y,i} \end{bmatrix}\begin{bmatrix} e^{\hat Y_{1,i}} \\ e^{\hat Y_{2,i}} \\ \vdots \\ e^{\hat Y_{d_Y,i}} \end{bmatrix} = 0.$$

Since $\sum_{k=1}^{d_Y} Y_{k,i} = 1$, we can easily check that the rank of the matrix on the left is $d_Y - 1$. So the dimension of the solution space is one. Meanwhile, we have

$$\begin{bmatrix} \sum_{k=1}^{d_Y} Y_{k,i} - Y_{1,i} & -Y_{1,i} & \cdots & -Y_{1,i} \\ -Y_{2,i} & \sum_{k=1}^{d_Y} Y_{k,i} - Y_{2,i} & \cdots & -Y_{2,i} \\ \vdots & \vdots & \ddots & \vdots \\ -Y_{d_Y,i} & -Y_{d_Y,i} & \cdots & \sum_{k=1}^{d_Y} Y_{k,i} - Y_{d_Y,i} \end{bmatrix}\begin{bmatrix} Y_{1,i} \\ Y_{2,i} \\ \vdots \\ Y_{d_Y,i} \end{bmatrix} = 0.$$

Therefore, $e^{\hat Y_{k,i}} = \lambda Y_{k,i}$ for some $\lambda \in \mathbb{R}$ and all k, which contradicts the fact that some components of $Y_{\cdot,i}$ are 0 ($Y_{\cdot,i}$ is a one-hot vector) while every $e^{\hat Y_{k,i}}$ is strictly positive.
The proof is completed. 

7.2.3.3 Stage (1)

In Stage (1), we prove that deep neural networks with one hidden layer, two-piece
linear activation h s− ,s+ , and multi-dimensional outputs have infinite spurious local
minima.
This stage is organized as follows: (a) we construct a local minimizer by Lemma
7.4; and (b) we prove that the local minimizer is spurious in Theorem 7.9 by
constructing a set of parameters with smaller empirical risk.
Without loss of generality, we assume that $s_+ \neq 0$. Otherwise, suppose that $s_+ = 0$; from the definition of the ReLU-like activation (Eq. (7.4)), we then have $s_- \neq 0$. Since

$$h_{s_-,s_+}(x) = h_{-s_+,-s_-}(-x),$$

the output of the neural network with parameters $\big([W_i]_{i=1}^L, [b_i]_{i=1}^L\big)$ and activation $h_{s_-,s_+}$ equals that of the neural network with parameters $\big([W_i']_{i=1}^L, [b_i']_{i=1}^L\big)$ and activation $h_{-s_+,-s_-}$, where $W_i' = -W_i$, $b_i' = -b_i$ for $i = 1, 2, \ldots, L-1$, and $W_L' = W_L$, $b_L' = b_L$. Since $\big([W_i]_{i=1}^L, [b_i]_{i=1}^L\big) \mapsto \big([W_i']_{i=1}^L, [b_i']_{i=1}^L\big)$ is a one-to-one map, it is equivalent to consider either of the two networks, and $h_{-s_+,-s_-}$ has a non-zero slope when $x > 0$.
Step (a). Construct local minima of the loss surface.

Lemma 7.4 Suppose that W̃ is a local minimizer of

$$f(W) = \frac{1}{m}\sum_{i=1}^m l\left(Y_i,\; W\begin{bmatrix} x_i \\ 1 \end{bmatrix}\right). \tag{7.9}$$

Under Assumption 7.3, any one-hidden-layer neural network has a local minimum at

$$\hat W_1 = \begin{bmatrix} \tilde W_{\cdot,[1:d_X]} \\ 0_{(d_1-d_Y)\times d_X} \end{bmatrix},\qquad \hat b_1 = \begin{bmatrix} \tilde W_{\cdot,d_X+1} - \eta 1_{d_Y} \\ -\eta 1_{d_1-d_Y} \end{bmatrix}, \tag{7.10}$$

and

$$\hat W_2 = \begin{bmatrix} \frac{1}{s_+} I_{d_Y} & 0_{d_Y\times(d_1-d_Y)} \end{bmatrix},\qquad \hat b_2 = \eta 1_{d_Y}, \tag{7.11}$$

where $\hat W_1$ and $\hat b_1$ are respectively the weight matrix and the bias of the first layer, $\hat W_2$ and $\hat b_2$ are respectively the weight matrix and the bias of the second layer, and η is a negative constant with an absolute value sufficiently large such that

$$\tilde W \tilde X - \eta 1_{d_Y} 1_m^T > 0, \tag{7.12}$$

where > is element-wise.



Also, the loss in this lemma is a continuously differentiable loss whose gradient does not equal 0 when the prediction is not the same as the ground-truth label.

Proof We show that the empirical risk is no lower in the neighborhood of $\big(\{\hat W_i\}_{i=1}^2, \{\hat b_i\}_{i=1}^2\big)$, in order to prove that $\big(\{\hat W_i\}_{i=1}^2, \{\hat b_i\}_{i=1}^2\big)$ is a local minimizer.
The output of the first layer before the activation is
 
(1) Ŵ X − η1dY 1mT
Ỹ = Ŵ1 X + b̂1 1mT = .
−η1d1 −dY 1mT

Because η is a negative constant with an absolute value sufficiently large that Eq. (7.12) holds, the output above is positive (element-wise). Hence, the output of the neural network with parameters {Ŵ1, Ŵ2, b̂1, b̂2} is

Ŷ = Ŵ2 h s− ,s+ Ŵ1 X + b̂1 1mT + b̂2 1mT

= s+ Ŵ2 Ŵ1 X + b̂1 1mT + b̂2 1mT


  
Ŵ X − η1dY 1mT
= s+ s1+ IdY 0dY ×(d1 −dY ) + η1dY 1mT
−η1d1 −dY 1mT
= W̃ X̃ ,

where X̃ is defined as 
X
X̃ = T . (7.13)
1m

Therefore, the empirical risk R̂ S in terms of parameters {Ŵ1 , Ŵ2 , b̂1 , b̂2 } is
m   m   
1  1  x
R̂ S Ŵ1 , Ŵ2 , b̂1 , b̂2 = l Yi , W̃ X̃ = l Yi , W̃ i = f (W̃ ).
m ·,i m 1
i=1 i=1

 
2
Then, we introduce a sufficiently small disturbance [δW i ]i=1 , [δbi ]i=1
2
into the
 2  2
parameters Ŵi , b̂i . When the disturbance is sufficiently small, all com-
i=1 i=1
ponents of the output of the first layer remain positive. Therefore, the output after
the disturbance is
 2  2 
Ŷ Ŵi + δW i , b̂i + δbi
i=1 i=1

= Ŵ2 + δW 2 h s− ,s+ Ŵ1 + δW 1 X + b̂1 + δb1 1mT + b̂2 + δb2 1mT


(∗)
= Ŵ2 + δW 2 s+ Ŵ1 + δW 1 X + b̂1 + δb1 1mT + b̂2 + δb2 1mT

= s+ δ W 2 Ŵ1 + δW 1 X + b̂1 + δb1 1mT + s+ Ŵ2 δW 1 X + s+ Ŵ2 δb1 1mT + δb2 1mT

+ Ŵ2 s+ (Ŵ1 X + b̂1 1mT ) + b̂2 1mT


= s+ δW 2 Ŵ1 + δW 1 + s+ Ŵ2 δW 1 X + s+ Ŵ2 δb1 + δb2 + s+ δW 2 b̂1 + δb1 1mT

+ Ŵ2 h s− ,s+ (Ŵ1 X + b̂1 1mT ) + b̂2 1mT


 
X
= (W̃ + δ) T ,
1m

 
where Eq. (∗) is because all components of Ŵ1 + δW 1 X + b1 + δb1 1mT are
positive, and δ is defined as the following matrix
 
δ = s+ Ŵ2 δW 1 + δW 2 Ŵ1 + δW 2 δW 1 s+ Ŵ2 δb1 + δb2 + s+ δW 2 b̂1 + δb1 .
 2  2
Therefore, the empirical risk R̂ S with respect to Ŵi + δW i , b̂i + δbi is
i=1 i=1

 2  2 
R̂ S Ŵi + δW i , b̂i + δbi
i=1 i=1
 
1 
m
= l Yi , W̃ + δ X̃
m i=1 ·,i
  
1 
m
xi
= l Yi , W̃ + δ
m i=1 1

= f (W̃ + δ).

δ approaches zero when the disturbances $\{\delta_{W_1}, \delta_{W_2}, \delta_{b_1}, \delta_{b_2}\}$ approach zero (element-wise). Since W̃ is a local minimizer of f(W), we have

$$\hat R_S\big(\{\hat W_i\}_{i=1}^2, \{\hat b_i\}_{i=1}^2\big) = f(\tilde W) \le f(\tilde W + \delta) = \hat R_S\big(\{\hat W_i + \delta_{W_i}\}_{i=1}^2, \{\hat b_i + \delta_{b_i}\}_{i=1}^2\big). \tag{7.14}$$

Because the disturbances $\{\delta_{W_1}, \delta_{W_2}, \delta_{b_1}, \delta_{b_2}\}$ are arbitrary within a sufficiently small neighborhood, Eq. (7.14) demonstrates that $\big(\{\hat W_i\}_{i=1}^2, \{\hat b_i\}_{i=1}^2\big)$ is a local minimizer.
The proof is completed. 
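The construction of Lemma 7.4 is easy to check numerically. The sketch below (with assumed toy data; here W̃ is simply taken as the least-squares linear fit, a particular minimizer of Eq. (7.9) under squared loss) assembles the parameters of Eqs. (7.10)–(7.11) and verifies that all hidden pre-activations are positive and that the network output coincides with W̃ X̃.

```python
import numpy as np

rng = np.random.default_rng(4)
d_x, d_y, d_1, m = 3, 2, 5, 20            # hidden layer wider than output (Assumption 7.3)
s_minus, s_plus = 0.1, 1.0

X = rng.normal(size=(d_x, m))
Y = rng.normal(size=(d_y, m))
X_tilde = np.vstack([X, np.ones((1, m))])

# W_tilde: a minimizer of the linear model f(W) of Eq. (7.9) under squared loss.
W_tilde = Y @ X_tilde.T @ np.linalg.inv(X_tilde @ X_tilde.T)

eta = -(np.abs(W_tilde @ X_tilde).max() + 1.0)     # negative and large enough for Eq. (7.12)

W1_hat = np.vstack([W_tilde[:, :d_x], np.zeros((d_1 - d_y, d_x))])
b1_hat = np.concatenate([W_tilde[:, d_x] - eta, -eta * np.ones(d_1 - d_y)])
W2_hat = np.hstack([np.eye(d_y) / s_plus, np.zeros((d_y, d_1 - d_y))])
b2_hat = eta * np.ones(d_y)

pre = W1_hat @ X + b1_hat[:, None]
assert (pre > 0).all()                              # activation stays on its positive piece
Y_hat = W2_hat @ np.where(pre > 0, s_plus * pre, s_minus * pre) + b2_hat[:, None]
assert np.allclose(Y_hat, W_tilde @ X_tilde)        # output equals the linear model
print("Constructed parameters reproduce the linear model exactly.")
```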

Step (b). Prove the constructed local minima are spurious.

Theorem 7.9 Under the same conditions as Lemma 7.4 and Assumptions 7.1, 7.2, and 7.4, the local minima constructed in Lemma 7.4 are spurious.

Proof The minimizer W̃ is the solution of the following equation

∇W f (W ) = 0.

Specifically, we have

∂ f W̃
= 0, i ∈ {1, . . . , dY }, j ∈ {1, . . . , d X },
∂ Wi, j

Applying the definition of f (W ) (Eq. (7.9)),

∂ f W̃ 
m      m      
x x x xi
= ∇Ŷi l Yi , W̃ i E k, j i = ∇Ŷi l Yi , W̃ i ,
∂ Wk, j 1 1 1 k
1 j
i=1 i=1

    
xi x
where Ŷi = W̃ , ∇Ŷi l Yi , W̃ i ∈ R1×dY . Since k, j are arbitrary in
1 1
{1, . . . , dY } and {1, . . . , d X }, respectively, we have
 
V X T 1m = 0, (7.15)

where
   T    T 
x1 x
V= ∇Ŷ1 l Y1 , W̃ · · · ∇Ŷm l Ym , W̃ n .
1 1

We then define Ỹ = W̃ X̃ . Applying Assumption 7.1, we have

Ỹ − Y = W̃ X̃ − Y = 0,

Thus, there exists some k-th row of Ỹ − Y that does not equal to 0.
We can rearrange the rows of W̃ and Y simultaneously, while W̃ is maintained
as the local minimizer of f (W ) and f (W̃ ) invariant.1 Without loss of generality,
we assume k = 1 (k is the index of the row). Set u = V 1,· and vi = Ỹ1,i in Lemma
7.7. There exists a non-empty separation I = [1 : l  ] and J = [l  + 1 : m] of S =
{1, 2, . . . , m} and a vector β ∈ Rd X , such that
(1.1) for any positive constant α small enough, and i ∈ I , j ∈ J , Ỹ1,i − αβ T xi <
Ỹ1, j − αβ T
xj;
(1.2) i∈I V 1,i = 0.

1 Here f is also regarded as a function of Y.



Define
 
1
η1 = Ỹ1,l  − αβ T xl  + 
min Ỹ1,i − αβ T xi − Ỹ1,l  − αβ T xl  .
2 i∈{l +1,...,m}

Applying (1.1), for any i ∈ I

Ỹ1,i − αβ T xi − η1
 
1
= Ỹ1,i − αβ T xi − Ỹ1,l  + αβ T xl  − 
min Ỹ1,i − αβ T xi − Ỹ1,l  − αβ T xl 
2 i∈{l +1,...,m}

<0,

while for any j ∈ J ,

Ỹ1, j − αβ T x j − η1 > 0
 
1
= Ỹ1, j − αβ T xi − Ỹ1,l  + αβ T xl  − min Ỹ1,i − αβ T
x i − Ỹ1,l  − αβ xl 
T
2 i∈{l  +1,...,m}
 
1
≥ min Ỹ1,i − αβ T
x i − Ỹ1,l  − αβ xl 
T
> 0.
2 i∈{l  +1,...,m}

Define γ ∈ R which satisfies



⎨1 min αβ T (xl − xi ), l  < st+1
|γ | = 2 i∈{l  +1,...,st+1 } ,

α, l  = st+1

where st+1 is defined in Lemma 7.7.


We argue that
$  $
$1 $
$ min Ỹ − αβ T
x − Ỹ − αβ T
x $ − |γ | > 0. (7.16)
$ 2 i∈{l  +1,...,m} 1,i i 1,l  l 
$

When l  = st+1 , Eq. (7.58) stands. Also,

lim γ = 0,
α→0+
 
lim 
min Ỹ1,i − αβ T xi − Ỹ1,l  − αβ T xl  = min Ỹ1,i − Ỹ1,l  > 0.
α→0+ i∈{l +1,...,m} i∈{l  +1,...,m}

Therefore, we get Eq. (7.16) when α is small enough.


When l  < st+1 , Eq. (7.57) stands. Therefore,
$  $
1 $$ 1 $
$,
|γ | = $ 
min Ỹ1,i − αβ T
x i − Ỹ1,l  − αβ xl 
T
$
2 2 i∈{l +1,...,m}

which apparently leads to Eq. (7.16).



Therefore, for any i ∈ I , we have that

Ỹ1,i − αβ T xi − η1 + |γ |
 
1
≤− min Ỹ1,i − αβ xi − Ỹ1,l  − αβ xl 
T T
+ |γ | < 0,
2 i∈{l  +1,...,m}

while for any j ∈ J ,

Ỹ1, j − αβ T x j − η1 − |γ |
 
1
≥ min Ỹ1,i − αβ T
x i − Ỹ1,l  − αβ T
x l  − |γ | > 0.
2 i∈{l  +1,...,m}

Furthermore, define ηi (2 ≤ i ≤ dY) as negative reals with absolute values sufficiently large, such that for any i ∈ [2 : dY] and any j ∈ [1 : m],

Ỹi, j − ηi > 0.

Now we construct a point in the parameter space whose empirical risk is smaller
than the proposed local minimum in Lemma 7.4 as follows
T
W̃1 = W̃1,[1:d
T
X]
− βα T , −W̃1,[1:d
T
X]
+ βα T , W̃2,[1:d
T
X]
, · · ·W̃dTY ,[1:d X ] , 0d X ×(d1 −dY −1) ,
(7.17)

T
b̃1 = W̃1,[d X +1] − η1 + γ , . . . , W̃2,[d X +1] − η2 , · · ·W̃dY ,[d X +1] − ηdY , 01×(d1 −dY −1) ,
(7.18)

⎡ ⎤
1
s+ +s−
− s+ +s1

0 0 ··· 0 0 ··· 0
⎢ . .. .. ⎥

⎢ 0 0 1
s+
· · · 0 .. . .⎥⎥
⎢ .. .. .. 1 .. ⎥
W̃2 = ⎢
⎢ . . . s+ · · · 0 . ⎥,
⎥ (7.19)
⎢ ⎥
⎢ .. .. .. .. . . .. .. .. ⎥
⎣ . . . . . . . .⎦
0 0 0 0 · · · s1+ 0 ··· 0

and  T
b̃2 = η1 , η2 , · · ·, ηdY , (7.20)

where W̃i and b̃i are the weight matrix and the bias of the i-th layer, respectively.

After some
 calculations, the network output of the first layer before the activation
2  2
in terms of W̃i , b̃i is
i=1 i=1
⎡ ⎤
W̃1,· X̃ − αβ T X − η1 1mT + γ 1mT
⎢ ⎥
⎢ −W̃1,· X̃ + αβ T X + η1 1mT + γ 1mT ⎥
⎢ ⎥
⎢ W̃2,· X̃ − η2 1mT ⎥
⎢ ⎥
Ỹ (1) = W̃1 X + b̃1 1m = ⎢
T
⎥.
⎢ .. ⎥
⎢ . ⎥
⎢ ⎥
⎣ W̃dY ,· X̃ − ηdY 1m
T ⎦
0(d1 −dY −1)×m
Therefore, the output of the whole neural network is
⎛⎡ ⎤⎞
W̃1,· X̃ − αβ T X − η1 1mT + γ 1mT
⎜⎢ ⎥⎟
⎜⎢ −W̃1,· X̃ + αβ T X + η1 1mT + γ 1mT ⎥⎟
⎜⎢ ⎥⎟
⎜⎢ W̃2,· X̃ − η2 1mT ⎥⎟
⎜⎢ ⎥⎟
Ŷ = W̃2 h s− ,s+ ⎜⎢ .. ⎥⎟ + b̃2 1mT .
⎜⎢ ⎥⎟
⎜⎢ . ⎥⎟
⎜⎢ ⎥⎟
⎝⎣ W̃dY ,· X̃ − ηdY 1m
T
⎦⎠
0(d1 −dY −1)×m
Specifically, if j ≤ l  ,
  2  2   
xj
Ỹ (1) W̃i , b̃i = W̃1,· − αβ T x j − η1 + γ
i=1 i=1 1, j
1
= Ỹ1, j − αβ T x j − η1 + γ < 0,
  2  2   
x
Ỹ (1) W̃i , b̃i = − W̃1,· j + αβ T x j + η1 + γ
i=1 i=1 2, j
1
= − Ỹ1, j + αβ T x j + η1 + γ > 0.
   2 
2
Therefore, (1, j)-th component of Ŷ W̃i , b̃i is
i=1 i=1

    2 
2
Ŷ W̃i , b̃i
i=1 i=1 1, j
⎛⎡ ⎤⎞
W̃1,· X − αβ T X − η1 1m
T + γ 1T
m
⎜⎢ T + γ 1T ⎥⎟
⎜⎢ −W̃1,· X + αβ T X + η1 1m m ⎥⎟
⎜⎢ ⎥⎟
⎜⎢ W̃ X − η 1 T ⎥⎟
1 , − 1 , 0, . . . , 0 ⎜⎢ 2,· 2 m ⎥⎟
= s+ +s− s+ +s− h s ,s
− + ⎜ ⎢ .. ⎥⎟ + η1
⎜⎢ ⎥⎟
⎜⎢ . ⎥⎟
⎜⎢ ⎥⎟
⎝⎣ W̃dY ,· X − ηdY 1m T ⎦⎠
T
0d1 −dY −1 1m ·, j

s− s+
= (Ỹ1, j − αβ T x j − η1 + γ ) − (−Ỹ1, j + αβ T x j + η1 + γ ) + η1
s+ + s− s+ + s−
s− − s+
= Ỹ1, j − αβ T x j + γ; (7.21)
s+ + s−

Similarly, when j > l  , the (1, j)-th component is


    2 
2
Ŷ W̃i , b̃i
i=1 i=1 1, j
s+ s−
= (Ỹ1, j − αβ T x j − η1 + γ ) − (−Ỹ1, j + αβ T x j + η1 + γ ) + η1
s+ + s− s+ + s−
s+ − s−
= Ỹ1, j − αβ T x j + γ, (7.22)
s+ + s−

and
    2 
2 s+
Ŷ W̃i , b̃i = (Ỹi, j − ηi ) + ηi = Ỹi, j , i ≥ 2. (7.23)
i=1 i=1 i, j s+
 2  2
Thus, the empirical risk of the neural network with parameters W̃i , b̃i
i=1 i=1
is
 2  2 
R̂ S W̃i , b̃i
i=1 i=1

1 
m
= l Yi , W̃2 W̃1 xi + b̃1 1m
T
+ b̃2 1m
T
m
i=1
m          
1 x x x
= l Yi , W̃ i + ∇Ŷi l Yi , W̃ i W̃2 W̃1 xi + b̃1 1m
T
+ b̃2 1m
T
− W̃ i
m 1 1 1
i=1

m    
x
+ o W̃2 W̃1 xi + b̃1 1m
T
+ b̃2 1m
T
− W̃ i . (7.24)
1
i=1

Applying Eqs. (7.21), (7.22), and (7.23), we have


m 
  T   T
xi x
W̃2 W̃1 xi + b̃1 + b̃2 − W̃ ∇Ŷi l Yi , W̃ i
1 1
i=1
l
(∗)  m
s+ − s− s+ − s−
= V 1,i (−αβ xi + T
γ) + V 1,i (−αβ T xi − γ)
i=1
s+ + s− i=l  +1
s + + s−


l
s+ − s−
= 2γ V 1,i ,
i=1
s+ + s−

where Eq. (∗) is because


⎧ s− − s+

⎪ − αβ T x j + γ, j = 1, i ≤ l 
   ⎪
⎪ s+ + s−

x
W̃2 W̃1 xi + b̃1 + b̃2 − W̃ i = − αβ T x − s− − s+ γ , j = 1, i > l  .
1 j ⎪⎪ j

⎪ s+ + s−

0, j ≥2

Furthermore, note that α = O(γ ) (from the definition of γ ). We have

m    
x
o W̃2 W̃1 xi + b̃1 + b̃2 − Ŵ i
1
i=1
⎛, ⎞
 -   2
m
- n x
= o ⎝. W̃2 W̃1 xi + b̃1 + b̃2 − W̃ i ⎠
1 j
i=1 j=1

= o(γ ).
 

l
s+ −s−
Let α be sufficiently small while sgn(γ ) = −sgn V
s+ +s− 1,i
. We have
i=1


m 
m   
x
l Yi , W̃2 W̃1 xi + b̃1 + b̃2 − l Yi , Ŵ i
1
i=1 i=1
l
 s+ − s− (∗∗)
= 2γ V 1,i + o(γ ) < 0,
i=1
s+ + s−

where inequality (∗∗) comes from (1.2) (see Sect. 7.2.3.3).


 
2  2
From Lemma 7.4, there exists a local minimizer Ŵi , b̂i with empir-
i=1 i=1
ical risk that equals to f (W̃ ). Meanwhile, we just construct a point in the parameter
space with empirical
  risk smaller than f (W̃ ).
2 2
Therefore, Ŵi , b̂i is a spurious local minimum.
i=1 i=1
The proof is completed. 

7.2.3.4 Stage (2)

Stage (2) proves that neural networks with arbitrarily many hidden layers and the two-piece linear activation $h_{s_-,s_+}$ have spurious local minima. Here, we still assume $s_+ \neq 0$; we have justified this assumption in Stage (1).

This stage is organized similarly to Stage (1): (a) Lemma 7.5 constructs a local minimum; and (b) Theorem 7.10 proves that the minimum is spurious.
Step (a). Construct local minima of the loss surface.

Lemma 7.5 Suppose that all the conditions of Lemma 7.4 hold, while the neural network has L − 1 hidden layers. Then, this network has a local minimum at

$$\hat W_1 = \begin{bmatrix} \tilde W_{\cdot,[1:d_X]} \\ 0_{(d_1-d_Y)\times d_X} \end{bmatrix},\qquad \hat b_1 = \begin{bmatrix} \tilde W_{\cdot,d_X+1} - \eta 1_{d_Y} \\ -\eta 1_{d_1-d_Y} \end{bmatrix},$$

$$\hat W_i = \frac{1}{s_+}\sum_{j=1}^{d_Y} E_{j,j} + \frac{1}{s_+}\sum_{j=d_Y+1}^{d_i} E_{j,(d_Y+1)},\qquad \hat b_i = 0 \quad (i = 2, 3, \ldots, L-1),$$

and

$$\hat W_L = \begin{bmatrix} \frac{1}{s_+} I_{d_Y} & 0_{d_Y\times(d_{L-1}-d_Y)} \end{bmatrix},\qquad \hat b_L = \eta 1_{d_Y},$$

where $\hat W_i$ and $\hat b_i$ are the weight matrix and the bias of the i-th layer, respectively, and η is a negative constant with an absolute value sufficiently large such that

$$\tilde W \tilde X - \eta 1_{d_Y} 1_m^T > 0, \tag{7.25}$$

where > is element-wise.

Proof Recall the discussion in Lemma 7.4 that all components of Ŵ1 X + b̂1 1mT are
positive. Specifically,
⎡ ⎤
Ỹ − η1dY 1mT
Ŵ1 X + b̂1 1mT = ⎣ ⎦,
−η1d1 −dY 1mT

where Ỹ is defined in Lemma 7.4.


 Similar to the discussions in Lemma 7.4, when the parameters equal to
L  L
Ŵi , b̂i , the output of the first layer before the activation function is
i=1 i=1

⎡ ⎤
Ỹ − η1dY 1mT
Ỹ (1) = Ŵ1 X + b̂1 1mT = ⎣ ⎦,
−η1d1 −dY 1mT

and

Ỹ − η1dY 1mT > 0, (7.26)


− η1d1 −dY 1mT > 0. (7.27)

Here > is defined element-wise.


After the activation function, the output of the first layer is
⎡ ⎤
Ỹ − η1dY 1mT
Y (1) = h s− ,s+ (Ŵ1 X + b̂1 1mT ) = s+ (Ŵ1 X + b̂1 1mT ) = s+ ⎣ ⎦.
−η1d1 −dY 1mT

We prove by induction that for all i ∈ [1 : L − 1] that

Ỹ (i) > 0 , element-wise, (7.28)


⎡ ⎤
Ỹ − η1dY 1mT
Y (i) = s+ ⎣ ⎦. (7.29)
−η1di −dY 1mT

Suppose that for 1 ≤ k ≤ L − 2, Ỹ (k) is positive (element-wise) and


⎡ ⎤
Ỹ − η1dY 1mT
Y (k) = s+ ⎣ ⎦.
−η1dk −dY 1mT

Then the output of the (k + 1)-th layer before the activation is

Ỹ (k+1) = Ŵk+1

Y (k) + b̂k+1

1mT
⎛ ⎞ ⎡ ⎤
1 ⎝
dY dk+1 Ỹ − η1dY 1mT
= E j, j + E j,(dY +1) ⎠ s+ ⎣ ⎦
s+ j=1 j=dY +1 −η1dk −dY 1mT
⎡ ⎤
Ỹ − η1dY 1m T

=⎣ ⎦.
−η1dk+1 −dY 1m T

Applying Eqs. (7.26) and (7.27), we have


⎡ ⎤
Ỹ − η1dY 1mT
Ỹ (k+1) = ⎣ ⎦ > 0,
−η1dk+1 −dY 1mT

where > is defined element-wise. Therefore,


⎡ ⎤
Ỹ − η1dY 1mT
Y (k+1) = h s− ,s+ Ỹ (k+1) = s+ Ỹ (k+1) = s+ ⎣ ⎦.
−η1dk+1 −dY 1mT

We thereby prove Eqs. (7.28) and (7.29). Therefore, Y (L) can be calculated as

Ŷ = Y (L) = Ŵ L Y (L−1) + b̂L 1mT


⎡ ⎤
1   Ỹ − η1d Y
1m
T

= IdY 0dY ×(dL−1 −dY ) s+ ⎣ ⎦ + η1dY 1mT = Ỹ .


s+
−η1di −dY 1m T

   L
L
Then, we show the empirical risk is higher around Ŵi , b̂i in order
   L
i=1 i=1
L
to prove that Ŵi , b̂i is a local minimizer.
 i=1
L 
i=1
L
Let Ŵi + δW

i , b̂i + δbi

be point in the parameter space
i=1 i=1
   L
L
which is close enough to the point Ŵi , b̂i . Since the distur-
i=1 i=1
 
bances δW i and δbi are both close  to 0 (element-wise), all components of
L L
Ỹ (i) Ŵi + δW

i , b̂i + δbi

remains positive. Therefore, the output of
i=1 i=1
 L  L
the neural network in terms of parameters Ŵi + δW 
i , b̂i

+ δ 
bi is
i=1 i=1

 L  L 
Ŷ Ŵi + 
δW i , b̂i + 
δbi
i=1 i=1

= (Ŵ L + 
δW L )s+ . . . s+ Ŵ1 
+ δW    
1 X + b̂1 + δb1 1m . . . + b̂ L + δbL 1m
T T

= M1 X + M2 1mT ,
 L  L  
L   L 
where M1 and M2 can be obtained from Ŵi , b̂i and 
δW i i=1 , δbi i=1
i=1 i=1
through several multiplication and summation operations2 .
Rewrite the output as
 
  X
M1 X + M2 1mT = M1 M2 .
1nT

 risk R̂ before and after the disturbance can be expressed as


Therefore, theempirical
f (W̃ ) and f M1 M2 , respectively.
 
 L   L   
When the disturbances δW i i=1 , δbi i=1 approach 0 (element-wise), M1 M2
    L 
 L
approaches W̃ . Therefore, when δW i i=1 , δbi i=1 are all small enough, we have

2 Since the exact form of M1 and M2 are not needed, we omit the exact formulations here.
 L  L     L 
  L
R̂ S Ŵi + δW

i , b̂i + δbi

= f ( M1 M2 ) ≥ f (W̃ ) = R̂ S Ŵi , b̂i .
i=1 i=1 i=1 i=1
(7.30)
 L  L
Since Ŵi + δW
i , b̂i + δbi

are arbitrary within a sufficiently small
  i=1  i=1
   2
L L 2
 
neighbour of Ŵi , b̂i , Eq. (7.30) yields that Ŵi , b̂i is a
i=1 i=1 i=1 i=1
local minimizer. 
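The deep-network construction of Lemma 7.5 can be checked in the same way as Lemma 7.4. The sketch below (toy data again; W̃ is taken as the least-squares linear fit, an assumption made only for the illustration) builds the middle layers $\hat W_i = \frac{1}{s_+}\big(\sum_{j\le d_Y} E_{j,j} + \sum_{j>d_Y} E_{j,(d_Y+1)}\big)$ and verifies that every hidden pre-activation stays positive and the final output still equals W̃ X̃.

```python
import numpy as np

rng = np.random.default_rng(5)
d_x, d_y, m, s_plus, s_minus = 3, 2, 20, 1.0, 0.1
dims = [d_x, 6, 5, 4, d_y]                     # L = 4 layers; every hidden layer wider than d_y
L = len(dims) - 1

X = rng.normal(size=(d_x, m))
Y = rng.normal(size=(d_y, m))
X_tilde = np.vstack([X, np.ones((1, m))])
W_tilde = Y @ X_tilde.T @ np.linalg.inv(X_tilde @ X_tilde.T)
eta = -(np.abs(W_tilde @ X_tilde).max() + 1.0)

def E(i, j, rows, cols):
    M = np.zeros((rows, cols)); M[i, j] = 1.0; return M

Ws = [np.vstack([W_tilde[:, :d_x], np.zeros((dims[1] - d_y, d_x))])]
bs = [np.concatenate([W_tilde[:, d_x] - eta, -eta * np.ones(dims[1] - d_y)])]
for i in range(2, L):                          # middle layers i = 2, ..., L-1
    M = sum(E(j, j, dims[i], dims[i - 1]) for j in range(d_y))
    M += sum(E(j, d_y, dims[i], dims[i - 1]) for j in range(d_y, dims[i]))
    Ws.append(M / s_plus); bs.append(np.zeros(dims[i]))
Ws.append(np.hstack([np.eye(d_y) / s_plus, np.zeros((d_y, dims[L - 1] - d_y))]))
bs.append(eta * np.ones(d_y))

Z = X
for k, (W, b) in enumerate(zip(Ws, bs)):
    pre = W @ Z + b[:, None]
    if k < L - 1:
        assert (pre > 0).all()                 # every hidden pre-activation is positive
        Z = np.where(pre > 0, s_plus * pre, s_minus * pre)
    else:
        Z = pre                                # output layer is affine
assert np.allclose(Z, W_tilde @ X_tilde)
print("Deep construction reproduces the linear model output.")
```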

Step (b). Prove the constructed local minima are spurious.

Theorem 7.10 Under the same conditions as Lemma 7.5 and Assumptions 7.1, 7.2, and 7.4, the local minima constructed in Lemma 7.5 are spurious.

Proof We first construct the weight matrix and bias of the i-th layer as follows,

 W̃1 = 
 W̃1 , b̃1 = b̃1 ,  
W̃2 b̃2
W̃2 = , b̃2 = λ1d2 + ,
0(d2 −dY )×d1 0(d2 −dY )×1
1 dY
W̃i = E i,i , b̃i = 0(i = 3, 4, ..., L − 1),
s+ i=1

and
1 
dY
W̃ L = E i,i , b̃L = −λ1dY ,
s+ i=1

where W̃1 , W̃2 , b̃1 and b̃2 are defined by Eqs. (7.17), (7.18), (7.19), and (7.20),
respectively, and λ is a sufficiently large positive real such that
 2  2 
Ŷ W̃i , b̃i + λ1d2 1mT > 0, (7.31)
i=1 i=1

where > is defined


element-wise.
   L L
We argue that W̃i , b̃i corresponds to a smaller empirical risk than
i=1 i=1
f (W̃ ) which is defined in Lemma 7.4.  
First, Theorem 7.9 has proved that the point W̃1 , W̃2 , b̃1 , b̃2 corresponds to a
smaller empirical risk than f (W̃ ).

We prove by induction that for any i ∈ {3, 4, ..., L − 1},


 L  L 
(i)
Ỹ W̃i , b̃i ≥ 0 , element-wise, (7.32)
i=1 i=1
⎡    2  ⎤
2
 
L  L ⎢ Ŷ W̃i i=1 , b̃i i=1 + λ1dY 1m ⎥
T

Y (i)
W̃i , b̃i = s+ ⎢ ⎣
⎥.
⎦ (7.33)
i=1 i=1
0(di −dY )×m

Apparently the output of the first layer before the activation is


 L  L     2 
2
Ỹ (1) W̃i , b̃i = W̃1 X + b̃1 1m
T
= W̃1 X + b̃1 1m
T
= Ỹ (1) W̃i , b̃i .
i=1 i=1 i=1 i=1

Therefore, the output of the first layer after the activation is


 L  L       L 
L
(1)
Y W̃i 
, b̃i = h s− ,s+ Ỹ (1)
W̃i
, b̃i
i=1 i=1 i=1 i=1
     L 
L
= h s− ,s+ Ỹ (1) W̃i , b̃i
i=1 i=1
   2 
2
= Y (1) W̃i , b̃i .
i=1 i=1

Thus, the output of the second layer before the activation is


 L  L     L 
L
Ỹ (2) W̃i , b̃i = W̃2 Y (1) W̃i , b̃i + b̃2 1m
T
i=1 i=1 i=1 i=1
   L  L   
W̃2 b̃2
= Y (1) W̃i , b̃i + T
1m
0(d2 −dY )×d1 i=1 i=1 0(d2 −dY )×1
+ λ1d2 1m
T
⎡    2  ⎤
2
⎢ Ŷ W̃i , b̃i + λ1d 1 T
Y m ⎥
=⎢ ⎥.
i=1 i=1
⎣ ⎦
λ1d2 −dY 1mT

Applying the definition of λ (Eq. (7.31)),


 L  L 
Ỹ (2) W̃i , b̃i > 0 , element-wise. (7.34)
i=1 i=1

Therefore, the output of the second layer after the activation is


 L  L       L 
L
(2)
Y W̃i 
, b̃i = h s− ,s+ Ỹ (2)
W̃i
, b̃i
i=1 i=1 i=1 i=1
⎡    2  ⎤
2
⎢ Ŷ W̃i i=1 , b̃i i=1 + λ1dY 1m ⎥
T

= s+ ⎢⎣
⎥.

λ1d2 −dY 1mT

Meanwhile,
  the output of the third layer before  the activation is
L  L  L L 
(3)   (2)  
Ỹ W̃i , b̃i can be calculated based on Y W̃i , b̃i :
i=1 i=1 i=1 i=1

 L  L     L 
L
Ỹ (3) W̃i , b̃i = W̃3 Y (2) W̃i , b̃i + b̃3 1mT
i=1 i=1 i=1 i=1
⎡    2  ⎤
d  2
1  Y
⎢ Ŷ W̃i , b̃i + λ1 1 T
dY m ⎥
E i,i s+ ⎢ ⎥
i=1 i=1
= ⎣ ⎦
s+ i=1
λ1d2 −dY 1m T
⎡    2  ⎤
2
⎢ Ŷ W̃i , b̃i + λ1 1 T
dY m ⎥
=⎢ ⎥.
i=1 i=1
⎣ ⎦
0(d3 −dY )×m

Applying Eq. (7.34),


 L  L 
(3)
Ỹ W̃i , b̃i ≥ 0 , element-wise. (7.35)
i=1 i=1

Therefore, the output of the third layer after the activation is


 L  L       L 
L
(3)
Y W̃i 
, b̃i = h s− ,s+ Ỹ (3)
W̃i
, b̃i
i=1 i=1 i=1 i=1
     L 
L
= s+ Ỹ (3) W̃i , b̃i
i=1 i=1
⎡    2  ⎤
2
⎢ Ŷ W̃i i=1 , b̃i i=1 + λ1dY 1m ⎥
T

= s+ ⎢⎣
⎥.

0(d3 −dY )×m

Suppose Eqs. (7.32) and (7.33) hold for k (3 ≤ k ≤ L − 2), when k + 1,


 L  L     2 
2
Ỹ (k+1) W̃i , b̃i 
= W̃k+1 Y (k) W̃i , b̃i 
+ b̃k+1 T
1m
i=1 i=1 i=1 i=1
⎡    2  ⎤
d  Ŷ W̃
2
, b̃ + λ1 1 T
1  Y
⎢ i i d Y m ⎥
E i,i s+ ⎢ ⎥
i=1 i=1
= ⎣ ⎦
s+
i=1
0(dk −dY )×m
⎡    2  ⎤
2
⎢ Ŷ W̃i , b̃i + λ1dY 1mT

=⎢ ⎥.
i=1 i=1
⎣ ⎦
0(dk+1 −dY )×m

Applying Eq. (7.35),


 L  L 
Ỹ (k+1) W̃i , b̃i ≥ 0 , element-wise. (7.36)
i=1 i=1

Therefore, the output of the (k + 1)-th layer after the activation is


 L  L       L 
L
(k+1)
Y W̃i 
, b̃i = h s− ,s+ Ỹ (k+1)
W̃i
, b̃i
i=1 i=1 i=1 i=1
   L 
L
= s+ Ỹ (k+1) W̃i , b̃i
i=1 i=1
⎡    2  ⎤
2
⎢ Ŷ W̃i i=1 , b̃i i=1 + λ1dY 1m ⎥
T

= s+ ⎢⎣
⎥.

0(dk+1 −dY )×m

Therefore, Eqs. (7.32) and (7.33) hold for any i ∈ {3, 4, ..., L − 1}.
Finally, the output of the network is

 L  L     L 
L
Ŷ W̃i , b̃i =Y (L) W̃i , b̃i
i=1 i=1 i=1 i=1
   L 
L
 (L−1) 
= W̃ L Y W̃i , b̃i + b̃L 1mT
i=1 i=1
⎡    2  ⎤
  Ŷ W̃
2
, b̃ + λ1 1 T
1 
d Y
⎢ i i dY m ⎥
E i,i s+ ⎢ ⎥ − λ1d 1T
i=1 i=1
= ⎣ ⎦ Y m
s+
i=1
0(d L−1 −dY )×m
   2 
2
= Ŷ W̃i , b̃i .
i=1 i=1

Applying Theorem 7.9, we have


 L  L   2  2 
R̂ S W̃i , W̃i = R̂ S W̃i , b̃i < f (W̃ ).
i=1 i=1 i=1 i=1

The proof is completed. 

7.2.3.5 Stage (3)

Finally, we prove Theorem 7.5. This stage also follows the two-step strategy.
Step (a). Construct local minima of the loss surface.
Lemma 7.6 Suppose t is a non-differentiable point for the piece-wise linear acti-
vation function h and σ is a constant such that the activation h is differentiable in
the intervals (t − σ, t) and (t, t + σ ). Assume that M is a sufficiently large positive
real such that
1
Ŵ1 X + b̂1 1mT < σ. (7.37)
M F

Let αi be any positive real such that

α1 = 1
0 < αi < 1, i = 2, . . . , L − 1. (7.38)

Then, under Assumption 7.3, any neural network with piecewise linear activations
and L − 1 hidden layers has local minima at

1   1 
Ŵ1 =
Ŵ1 , b̂1 = b̂ + t1d1 ,
M M 1
j=2 α j 
i
Ŵi = αi Ŵi , b̂i = −αi Ŵi h(t)1di−1 + t1di + b̂i ,(i = 2, 3, ..., L − 1),
M
and
1 M
Ŵ L = M Ŵ L , b̂L = − / L−1 Ŵ L h(t)1dL−1 + b̂L
j=2 α j
L
j=2 αj
 L  L
where Ŵi , b̂i is the local minimizer constructed in Lemma 7.5. Also,
i=1 i=1
the loss is continuously differentiable, whose derivative with respect to the prediction
Ŷi may equal to 0 only when the prediction Ŷi and label Yi are the same.
Proof Define $s_- = \lim_{\theta \to t^-} h'(\theta)$ and $s_+ = \lim_{\theta \to t^+} h'(\theta)$.
We then prove by induction that, for all i ∈ [1 : L − 1], all components of the i-th
 L  L
layer output before the activation Ỹ (i) Ŵi , b̂i are in interval (t, t + σ ),
i=1 i=1

and
 L  L   L  L 
j=1 α j
i
(i)
Y Ŵi , b̂i = h(t)1di 1mT + Y (i)
Ŵi , b̂i .
i=1 i=1 M i=1 i=1

The first layer output before the activation is,


 L  L  1  1
Ỹ (1) Ŵi , b̂i = Ŵ1 X + b̂1 1mT = Ŵ X + b̂1 1mT + t1d1 1mT .
i=1 i=1 M 1 M
(7.39)

We proved in Lemma 7.5 that Ŵ1 X + b̂1 1mT is positive (element-wise). Since the
Frobenius norm of a matrix is no smaller than any component’s absolute value,
applying Eq. (7.37), we have that for all i ∈ [1, d1 ] and j ∈ [1 : n],

1
0< Ŵ1 X + b̂1 1mT < σ. (7.40)
M ij

 
Therefore, 1
M
Ŵ1 X + b̂1 1mT + t ∈ (t, t + σ ). So,
ij

 L  L       L 
L
(1)
Y Ŵi 
, b̂i = h Ỹ (1)
Ŵi 
, b̂i
i=1 i=1 i=1 i=1
 
(∗) 1  1  T
= h s− ,s+ Ŵ X + b̂1 1m + h(t)1d1 1mT
M 1 M
 
1 (1)    L    L
= Y Ŵi , b̂i + h(t)1d1 1mT ,
M i=1 i=1

where eq.(∗) is because for any x ∈ (t − σ, t + σ ),

h(x) = h(t) + h s− ,s+ (x − t). (7.41)

Suppose the above argument holds for k (1 ≤ k ≤ L − 2). Then



L  L 
Ỹ (k+1)
Ŵi , b̂i
i=1 i=1
   L 
L

= Ŵk+1 Y (k) Ŵi , b̂i 
+ b̂k+1 1mT
i=1 i=1

= (−αk+1 Ŵk+1 h(t)1dk+1
 L  L 
k+1
αi  
+ t1dk+1 + i=1 b̂k+1 )1mT + αk+1 Ŵk+1 Y (k) Ŵi , b̂i
M i=1 i=1
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 155
  L   L 
i=1 αi
k
 (k)
= αk+1 Ŵk+1 h(t)1dk 1nT + Y Ŵi , b̂i
M i=1 i=1
 
i=1 αi
k+1
 
+ −αk+1 Ŵk+1 h(t)1dk + t1dk+1 + b̂k+1 1mT
M
 L  L 
i=1 αi i=1 αi
k+1 k+1

= Ŵk+1 Y (k) Ŵi , b̂i + 
b̂k+1 1mT + t1dk+1 1mT
M i=1 i=1 M
 L  L 
i=1 αi
k+1
(k+1)
= t1dk+1 1mT + Ỹ Ŵi , b̂i .
M i=1 i=1

 L  L 
Lemma 7.5 has proved that all components of Ỹ (k+1)
Ŵi , b̂i are
   L 
i=1 i=1
L
contained in Ỹ (1) Ŵi , b̂i . Combining
i=1 i=1

 
1 (1)    L    L
t1d1 1mT < t1d1 1mT + Ỹ Ŵi , b̂i < (t + σ )1d1 1mT ,
M i=1 i=1

we have
 L   L  (∗)
i=1 αi
k+1
(k+1)
t1dk+1 1mT < t1dk+1 1mT + Ỹ Ŵi , b̂i < (t + σ )1dk+1 1mT .
M i=1 i=1

Here < are all element-wise, and inequality (∗) comes from the property of αi
(Eq. (7.38)).
Furthermore, the (k + 1)-th layer output after the activation is
 L  L       L 
L
Y (k+1) Ŵi , b̂i = h Ỹ (k+1) Ŵi , b̂i
i=1 i=1 i=1 i=1
     L 
1 L
= h t1dk+1 1mT
+ Ỹ (k+1) Ŵi , b̂i
M i=1 i=1
   
k+1  L 
i=1 αi (k+1)
(∗) L
=h(t)1dk+1 1mT
+ h s− ,s+ Ỹ Ŵi , b̂i
M i=1 i=1

k+1     
αi L L
= h(t)1dk+1 1m
T
+ i=1 Y (k+1) Ŵi , b̂i ,
M i=1 i=1

where Eq. (∗) is because of Eq. (7.41). The above argument is proved for any index
k ∈ {1, . . . , L − 1}.
156 7 The Geometry of the Loss Surfaces

Therefore, the output of the network is


 L  L 
(L)
Y Ŵi , b̂i
i=1 i=1
 L  L 
= Ŵ L Y (L−1) Ŵi , b̂i + b̂L 1mT
i=1 i=1
  
L  L 
i=1 αi
L−1
M
= Ŵ L h(t)1dL−1 1mT + Y (L−1)
Ŵi 
, b̂i
i=1 αi
L−1 M i=1 i=1
 
M
+ − Ŵ L h(t)1dL−1 + b̂L 1mT
i=1 αi
L−1
   L 
L
= Ŵ L Y (L−1) Ŵi , b̂i + b̂L 1mT
i=1 i=1
   L 
L
= Y (L) Ŵi , b̂i .
i=1 i=1

Therefore,
 L  L     L 
L
R̂ S Ŵi 
, b̂i = R̂ S Ŵi
, b̂i = f (W̃ ).
i=1 i=1 i=1 i=1

     L 
 L
We then introduce some small disturbances δW i i=1 , δbi i=1 into
   
L L
Ŵi , b̂i in order to check the local optimality.
i=1 i=1
Since all comonents of Y (i) are in interval (t, t + σ ), the activations in every
hidden layers is realized at linear parts. Therefore, the output of network is
 L  L 
Ŷ Ŵi + 
δW i , b̂i + 
δbi
i=1 i=1

= Ŵ L + 
δW L s+ · · · s+ Ŵ1 
+ δW  
1 X + b̂1 + δb1 1m + f (t)1d1 1m · · ·
T T

+ f (t)1dL 1mT + b̂L + δbL



1mT
 
  X
= M1 M2 .
1mT
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 157
   
L
Similar to Lemma 7.5, M1 M2 approaches W̃ as disturbances [δW i ]i=1 , [δbi ]i=1
L

approach 0 (element-wise). Combining that W̃ is a local minimizer of f (W ), we


have
 L  L     L 
  L
R̂ S Ŵi + δW

i , b̂i + δbi

= f M1 M2 ≥ f (W̃ ) = R̂ S Ŵi , b̂i .
i=1 i=1 i=1 i=1

The proof is completed.




Step (b). Prove the constructed local minima are spurious.

Proof of Theorem 7.5 Without loss of generality, we assume that all activations are
the same.
Let t be a non-differentiable point of the piece-wise linear activation function h
with

s− = lim− h  (θ ),
θ→0
s+ = lim+ h  (θ ).
θ→0

Let σ be a constant such that h is linear in interval (t − σ, t) and interval (t, t + σ ).


Then construct that
1   1 
W̃1 = W̃1 , b̃1 = b̃ + t1d1 ,
M M 1
1 1 1 
W̃2 = W̃2 , b̃2 = t1d2 − h(t)W̃2 1d2 + b̃2 ,
M̃ M̃ M M̃
1 
W̃i = W̃i , b̃i = −W̃i h(t)1di−1 + t1di + b̃i , (i = 3, 4, ..., L − 1)
M M̃
and
W̃ L = M M̃ W̃ L , b̃L = b̃L − M M̃ W̃ L h(t)1 L−1 ,
 L  L
where W̃i , b̃i are constructed in Theorem 7.10, M is a large enough
i=1 i=1
positive real such that
1
W̃1 X + b̃1 1mT < σ, (7.42)
M F

and M̃ a large enough positive real such that


 
1 1 (2)    L    L
Ỹ W̃i , b̃i < σ. (7.43)
M̃ M i=1 i=1 F
158 7 The Geometry of the Loss Surfaces

Then,
 we prove by induction that for any i ∈ [2 : L − 1], all components of
L  L 
Ỹ (i) W̃i , b̃i are in interval (t, t + δ), and
i=1 i=1

 L  L   L  L 
(i) 1
Y W̃i , b̃i = h(t)1di 1mT + Y (i)
W̃i , b̃i .
i=1 i=1 M̃ M i=1 i=1

First,
 L  L  1
Ỹ (1) W̃i , b̃i = W̃1 X + b̃1 1mT = (W̃  X + b̃1 1mT ) + t1dT1 1mT .
i=1 i=1 M 1
(7.44)

For any i ∈ [1 : d1 ] and j ∈ [1 : m], Eq. (7.42) implies


$  $$
$ 1
$   T $ 1
$ (W̃1 X + b̃1 1m ) $ ≤ W̃1 X + b̃1 1mT < σ.
$ M ij$ M F

Thus,  
1
(W̃  X + b̃1 1mT ) + t1dT1 1mT ∈ (t − σ, t + σ ). (7.45)
M 1 ij

Therefore, the output of the first layer after the activation is


 L  L       L 
L
Y (1) W̃i , b̃i = h Ỹ (1) W̃i , b̃i
i=1 i=1 i=1 i=1
 
1
=h (W̃1 X + b̃1 1mT ) + t1d1 1mT
M
 
(∗) 1
=h(t)1d1 1mT + h s− ,s+ W̃1 X + b̃1
M
1
= h(t)1d1 1mT + h s− ,s+ W̃1 X + b̃1
M  
1 L  L 
= h(t)1d1 1m + Y (1) W̃i
T
, b̃i ,
M i=1 i=1

where Eq. (∗) is from Eq. (7.41) for any x ∈ (t − δ, t + δ). Also,
 L  L     L 
L
(2)
Ỹ W̃i 
, b̃i  (1)
= W̃2 Y W̃i
, b̃i + b̃2 1mT
i=1 i=1 i=1 i=1
  
1  1 (1)    L    L
= W̃2 h(t)1d1 1m + YT
W̃i , b̃i
M̃ M i=1 i=1
1 1
+ t1d2 1mT − h(t)W̃2 1d1 1mT + b̃2 1mT
M̃ M M̃
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 159
 L  L 
1
= Ỹ (2) W̃i , b̃i + t1d2 1mT .
M M̃ i=1 i=1

 L  L 
Recall in Theorem 7.10 we prove all components of Ỹ (2)
W̃i , b̃i are
i=1 i=1
positive. Combining the definition of M̃ (Eq. (7.43)), we have
 L  L 
(2)
t1d2 1mT < Ỹ W̃i , b̃i
i=1 i=1
 L  L 
1 (2)
= Ỹ W̃i , b̃i + t1d2 1mT
M̃ M i=1 i=1

< (t + σ )1d2 1mT .

Therefore,
   L       L 
L L
(2)   (2) 
Y W̃i , b̃i = h Ỹ W̃i , b̃i
i=1 i=1 i=1 i=1
    L  
1 (2) 
L

=h Y W̃i , b̃i + t1d2 1m T
M̃ M i=1 i=1
     L 
1 (2) 
L
= h(t)1d2 1m + h s− ,s+
T
Ỹ W̃i , b̃i
M̃ M i=1 i=1
     
1 L L
= h(t)1d2 1mT + h s− ,s+ Ỹ (2) W̃i , b̃i
M̃ M i=1 i=1
    
1 L L
= h(t)1d2 1mT + Y (2) W̃i , b̃i .
M̃ M i=1 i=1

Suppose the above argument holds for k-th layer.


The output of (k + 1)-th layer before the activation is
 L  L 
Ỹ (k+1) W̃i , b̃i
i=1 i=1
   L 
L

= W̃k+1 Y (k) W̃i , b̃i 
+ b̃k+1 1mT
i=1 i=1
     L 
 1 L
= W̃k+1 h(t)1dk 1mT + Y (k) W̃i , b̃i
M̃ M i=1 i=1
 
 1 
+ −W̃k+1 h(t)1dk + t1dk+1 + b̃k+1 1mT
M M̃
    L  
1 
L
= W̃k+1 Y (k) W̃i , b̃i 
+ b̃k+1 1mT + t1dk+1 1mT
M M̃ i=1 i=1
   L 
1 L
= Ỹ (k+1) W̃i , b̃i + t1dk+1 1mT .
M M̃ i=1 i=1
160 7 The Geometry of the Loss Surfaces
   L 
L
Recall proved in Theorem 7.10 that all components of Ỹ (k+1) W̃i , b̃i
   L 
i=1 i=1
L
except those that are 0 are contained in Ỹ (k) W̃i , b̃i . We have
i=1 i=1

 L  L 
1 (k+1)
t1dk+1 1mT < Ỹ W̃i , b̃i + t1dk+1 1mT < (t + σ )1dk+1 1mT .
M M̃ i=1 i=1

Therefore,
   L       L 
L L
Y (k+1) W̃i , b̃i = h Ỹ (k) W̃i , b̃i
i=1 i=1 i=1 i=1
    L  
1 L
=h Ỹ (k+1) W̃i , b̃i + t1dk+1 1mT
M M̃ i=1 i=1
   L 
1 L
= h(t)1dk+1 1mT + Y (k+1) W̃i , b̃i .
M M̃ i=1 i=1

Thus, the argument holds for any k ∈ {2, . . . , L − 1}.


So, we obtain
 L  L     L 
L
Y (L) W̃i , b̃i = W̃ L Y (L−1) W̃i , b̃i + b̃L
i=1 i=1 i=1 i=1
     L 
1 L
= M M̃ W̃ L h(t)1d L−1 1mT
+ Y (L−1) W̃i , b̃i
M M̃ i=1 i=1

+ b̃L 1m
T
− M M̃ W̃ L h(t)1d L−1 1m
T
    
L L
= Y (L) W̃i , b̃i .
i=1 i=1

Therefore,
 L  L     L 
L
R̂ S W̃i 
, b̃i = R̂ S W̃i
, b̃i . (7.46)
i=1 i=1 i=1 i=1

From Eq. (7.46) and Theorem 7.10, we have


 L  L 
R̂ S W̃i , b̃i < f W̃ ,
i=1 i=1

which completes the proof of local minimizer.


Furthermore, the parameter M used in Lemma 7.6 (not those in this proof) is
arbitrary in a continuous interval (cf. Eq. (7.37)), we have actually constructed infinite
spurious local minima. This complete the proof of Theorem 7.5.
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 161

Theorem 7.5 relies on Assumption 7.4. We can further remove it by replacing


Assumption 7.3 by a mildly more restrictive variant Assumption 7.6.

Corollary 7.3 Suppose that Assumptions 7.1, 7.2, and 7.6 hold. Neural networks
with arbitrary depth and arbitrary piecewise linear activations (excluding linear
functions) have infinitely many spurious local minima under arbitrary continuously
differentiable loss whose derivative can equal 0 only when the prediction and label
are the same.

Proof The proof is delivered by modifications of Theorem 7.9 in Stage 1 of


Theorem 7.5’s proof. We only need to prove the corollary under the assumption that
s− + s+ = 0.  
2  2
Let the local minimizer constructed in Lemma 7.4 be Ŵi , b̂i .
i=1 i=1
Denote by two zero matrix 0W̃1 := 0d X ×(d1 −dY −1) and 0b̃1 := 01×(d1 −dY −1) , and ν =
W̃1,[d X +1] − η1 . Then, we construct a point in the parameter space whose empirical
risk is smaller as follows:

W̃1 = (W̃1,[1:d
T
X]
− βα T , W̃1,[1:d
T
X]
, −W̃1,[1:d
T
X]
+ βα T ,W̃2,[1:d
T
X]
, . . . , W̃dTY ,[1:d X ] , 0W̃1 )T ,

T
b̃1 = ν + γ , W̃1,[d X +1] − η, −ν + γ , W̃2,[d X +1] − η2 , · · ·W̃dY ,[d X +1] − ηdY , 0b̃1 ,
(7.47)

⎡ ⎤
1 1
2s+ s+
− 2s1+ 0 0 ··· 0 0 ··· 0
⎢ 1
··· ··· 0⎥
⎢ 0 0 0 s+
0 0 0 ⎥
⎢ ⎥
W̃2 = ⎢ 0 0 0 0 s1+ ··· 0 0 ··· 0⎥, (7.48)
⎢ ⎥
⎢ .. .. .. .. .. .. .. .. .. .. ⎥
⎣ . . . . . . . . . .⎦
0 0 0 0 0 ··· 1
s+
0 ··· 0

and  T
b̃2 = η, η2 , · · ·, ηdY , (7.49)

where α, β, and ηi are defined the same as those in Theorem 7.9, and η is defined by
Eq. (7.25).
162 7 The Geometry of the Loss Surfaces

Then, the output of the first layer is


 2  2 
(1)
Y W̃i , b̃i = h s− ,s+ W̃1 X + b̃1 1mT
i=1 i=1
⎛⎡ ⎤⎞
W̃1,· X − αβ T X − η1 1mT + γ 1mT
⎜⎢ W̃1,· X − η1mT ⎥⎟
⎜⎢ ⎥⎟
⎜⎢ ⎥⎟
⎜⎢ −W̃1,· X + αβ T X + η1 1mT + γ 1mT ⎥⎟
⎜⎢ ⎥⎟
= h s− ,s+ ⎜⎢
⎜⎢
W̃2,· X − η2 1mT ⎥⎟ .
⎥⎟
⎜⎢ .. ⎥⎟
⎜⎢ . ⎥⎟
⎜⎢ ⎥⎟
⎝⎣ W̃dY ,· X − ηdY 1mT ⎦⎠
T
0d1 −dY −2 1m

Further, the output of the whole network is


⎛⎡
T + γ 1T ⎤⎞
W̃1,· X − αβ T X − η1 1m m
⎜⎢ W̃1,· X − η1mT ⎥⎟
⎜⎢ ⎥⎟
⎜⎢ −W̃ X + αβ T X + η 1T + γ 1T ⎥⎟
   2  ⎜ ⎢ 1,· 1 m m ⎥⎟
2 ⎜⎢ T ⎥⎟
Ŷ W̃i , b̃i = W̃2 h s− ,s+ ⎜ ⎢ W̃2,· X − η2 1m ⎥⎟ + b̃2 1T
⎜ ⎢ .. ⎥⎟ m
i=1 i=1 ⎜⎢ ⎥⎟
⎜⎢ . ⎥⎟
⎜⎢ ⎥⎟
⎝⎣ W̃dY ,· X − ηdY 1mT ⎦⎠
T
0d1 −dY −2 1m
 2  2 

Therefore, if j ≤ l , the (1, j)-th component of Ŷ W̃i , b̃i is
i=1 i=1

 2  2 
W̃2 Ỹ (1) W̃i , b̃i
1 i=1 i=1 j
1
= −s+ Ỹ1, j − αβ T x j − η1 + γ + 2s+ Ỹ1, j − η − s+ −Ỹ1 j + αβ T x j + η1 + γ
2s+

1
= 2s+ Ỹ1, j − 2s+ η − 2s+ γ + η
2s+
= Ỹ1, j − γ .
 2  2 

Otherwise ( j > l ), the (1, j)-th component of Ŷ W̃i , b̃i is
i=1 i=1

 2  2 
W̃2 Ỹ (1) W̃i , b̃i
1 i=1 i=1 j
1
= s+ Ỹ1, j − αβ T x j − η1 + γ + 2s+ Ỹ1, j − η − s− −Ỹ1 j + αβ T x j + η1 + γ
2s+

7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 163

1
= s+ Ỹ1, j − αβ T x j − η1 + γ + 2s+ Ỹ1, j − η + s+ −Ỹ1 j + αβ T x j + η1 + γ
2s+

1
= 2s+ Ỹ1, j − 2s+ η + 2s+ γ + η
2s+
= Ỹ1, j + γ ,

 2  2 
and the (i, j)-th (i > 1) component of Ŷ W̃i , b̃i is Ỹi, j .
i=1 i=1
Therefore, we have

   ⎨ − γ,
⎪ j = 1, i ≤ l;
x
W̃2 W̃1 xi + b̃1 + b̃2 − W̃ i = γ, j = 1, i > l;
1 j ⎪⎩
0, j ≥ 2.

Then, similar to Theorem 7.9, we have


   L     L 
L L
R̂ S W̃i , b̃i − R̂ S Ŵi , b̂i
i=1 i=1 i=1 i=1
  
1  1 
n m
x
= l Yi , W̃2 W̃1 xi + b̃1 + b̃2 − l Yi , Ŵ i
m i=1 m i=1 1
     
1 
n
x x
= ∇Ŷi l Yi , W̃ i W̃2 W̃1 xi + b̃1 1mT + b̃2 1mT − W̃ i
m i=1 1 1
n    
x
+ o W̃2 W̃1 xi + b̃1 1mT + b̃2 1mT − W̃ i
1
i=1

2 
l
=− V 1,i γ + o(γ ),
m i=1

where V and l  are also defined the same as those in


Theorem 7.9.
l
When γ is sufficiently small and sgn(γ ) = sgn i=1 V 1,i , we have that
 2  2 
R̂ S W̃i , b̃i < f (W̃ ).
i=1 i=1

This complete the proof of Corollary 7.3. 


164 7 The Geometry of the Loss Surfaces

7.2.3.6 A Preparation Lemma

We now prove the preparation lemma used above.


Lemma 7.7 Suppose u = (u 1 · · · u m ) ∈ R1×m which satisfies u = 0 and


n
u i = 0, (7.50)
i=1

while {x1 ,...,xm } is a set of vector ⊂ Rm×1 . Suppose index set S = {1, 2, . . . , m}.
Then for any series of real number {v1 , · · · , vm }, there exists a non-empty separation
I , J of S, which satisfies I ∪ J = S, I ∩ J = ∅ and both I and J are not empty, a
vector β ∈ Rm×1 ,such that,
(1.1) for any sufficiently small positive real α, i ∈ I , and j ∈ J , we have vi −
αβ T xi < v j − αβ T x j ;
(1.2) i∈I u i = 0.
Proof If there exists a non-empty separation I and J of the index set S, such that
when β = 0, (1.1) and (1.2) hold, the lemma is apparently correct.
Otherwise, suppose that there is no non-empty separation I and J of the index
set S such that (1.1) and (1.2) hold simultaneously when β = 0.
Some number vi in the sequence (v1 , v2 , . . . , vm ) are probably equal to each other.
We rerarrange the sequence by the increasing order as follows,

v1 = v2 = · · · = vs1 < vs1 +1 = · · · = vs2 < · · · < vsk−1 +1 = · · · = vsk = vm ,


(7.51)
where sk = m.
Then, for any j ∈ {1, 2, . . . , k − 1}, we argue that


sj
u i = 0.
i=1

Otherwise, suppose there exists a s j , such that


sj
u i = 0.
i=1

Let I = {1, 2, ..., s j } and J = {s j + 1, ..., m}. Then, when β = 0, we have

vi − αβ T xi = vi < v j = v j − αβ T x j ,

and
 
sj
ui = u i = 0,
i∈I i=1
7.2 Nonlinear Activations Bring Infinite Spurious Local Minima 165

which are exactly the arguments (1.1) and (1.2). Thereby we construct a contrary
example. Therefore, for any j ∈ {1, 2, . . . , k − 1}, we have


sj
u i = 0.
i=1

Since we assume that u = 0, there exists an index t ∈ {1, . . . , k − 1}, such that
there exists an index i ∈ {st + 1, ..., st+1 } that u i = 0.
Let l ∈ {st + 1, ..., st+1 } is the index such that xl has the largest norm while u l = 0:

l= arg max xj . (7.52)


j∈{st +1,...,st+1 },u j =0

We further rearrange the sequence (vst +1 , ..., vst+1 ) such that there is an index
l  ∈ {st + 1, . . . , st+1 },

xl  = max xj ,
j∈{st +1,...,st+1 },u j =0

and

∀i ∈ {st + 1, ..., l  }, xl  , xi  ≥ xl  2


; (7.53)

∀i ∈ {l + 1, . . . , st+1 }, xl  , xi  < xl  2
. (7.54)

It is worth noting that it is probably l  = st+1 , but it is a trivial case that would not
influence the result of this lemma.
Let I = {1, ..., l  }, J = {l  + 1, ..., n}, and β = xl  . We prove (1.1) and (1.2) as
follows.
Proof of argument (1.1) We argue that for any i ∈ I , vi − αβ T xi ≤ vl  − αβ T xl  and
for any j ∈ J , v j − αβ T x j > vl  − αβ T xl  .
There are three situations:
(A) i ∈ {1, . . . , st } and j ∈ {st+1 + 1, . . . , n}. Applying Eq. (7.51), for any i ∈
{1, . . . , st } and j ∈ {st+1 + 1, . . . , m}, we have that vi < vl  and v j > vl  . Therefore,
when α is sufficiently small, we have the following inequalities,

vi − αβ T xi < vl  − αβ T xl  ,
v j − αβ T x j > vl  − αβ T xl  .

(B) i ∈ {st + 1, . . . , l  }. Applying Eq. (7.53) and because of α > 0, we have

−αβ, xi  ≤ −α β 2
= −αβ, xl  .

Since vi = vl  , it further leads to


166 7 The Geometry of the Loss Surfaces

vi − αβ T xi ≤ vl  − αβ T xl  .

(C) j ∈ {l  + 1, . . . , st+1 }. Similarly, applying Eq. (7.54) and because of α > 0,


we have
−αβ, x j  > −α β 2 = −αβ, xl  .

Since v j = vl  , it further leads to

v j − αβ T x j > vl  − αβ T xl  ,

which is exactly the argument (1.1).


Proof of argument (1.2) We argue that for any i ∈ {st + 1, . . . , l  − 1}, u i = 0.
Otherwise, suppose there exists an i ∈ {st + 1, . . . , l  − 1} such that u i = 0. From
Eq. (7.52), we have xi ≤ xl  . Therefore,

xl  , xi  ≤ xl  x i ≤ xl  2
,

where the first inequality strictly holds if the vector xl  and xi have the same direction,
while the second inequlity strictly holds when xi and xl  have the same norm. Because
xl  = xi , we have the following inequality,

xl  , xi  < xl  2
,

which contradicts to Eq. (7.53), i.e.,

xl  , xi  ≥ xl  2
, ∀i ∈ {st + 1, . . . , l  }.

Therefore,

 
st l −1

ui = ui + u i + u l  = u l  = 0.
i∈I i=1 i=st +1

The proof is completed. 

Remark 7.3 For any i ∈ {l  + 1, ..., st+1 }, we have


   
vi − αβ T xi − vl  − αβ T xl  = αβ T (xl  − xi ), (7.55)

while for any j ∈ {st+1 + 1, ..., m}, we have


   
v j − αβ T x j − vl  − αβ T xl  = v j − vl  + αβ T (xl  − x j ). (7.56)

Because v j > vl  , when the real number α is sufficiently small, we have

αβ T (xl  − xi ) < v j − vl  + αβ T (xl  − x j ).


7.3 Proofs of Theorems 7.7, 7.8, Corollaries 7.1, and 7.2 167

Applying Eqs. (7.55) and (7.56), we have


       
vi − αβ T xi − vl  − αβ T xl  < v j − αβ T x j − vl  − αβ T xl  .

Therefore, if l  < st+1 , we have


   
min vi − αβ T xi − vl  − αβ T xl  = min αβ T (xl  − xi ); (7.57)
i∈{l  +1,...,m} i∈{l  +1,...,st+1 }

while if l  = st+1 ,
   

min vi − αβ T xi − vl  − αβ T xl  = min vi − vl + αβ T (xl  − xi ).
i∈{l +1,...,m} i∈{l  +1,...,m}
(7.58)
(7.57) and (7.58) make senses because l  < m. Otherwise, from Lemma
Equations
m
7.7 we have i=1 u i = 0, which contradicts to the assumption.

7.3 Proofs of Theorems 7.7, 7.8, Corollaries 7.1, and 7.2

This section gives the proofs of Theorems 7.7, 7.8, Corollaries 7.1, and 7.2 omitted
from Sect. 7.2.2.

7.3.1 Squared Loss

We first check that the squared loss is strictly convex, which is even restrictive than
“convex”.

Lemma 7.8 The empirical risk R̂ S under squared loss is strictly convex with respect
to the prediction Ŷ .

Proof The second derivative of the empirical risk R̂ S under squared loss with respect
to the prediction Ŷ is

∂ 2 lce (Y, Ŷ ) ∂ 2 (y − Ŷ )2
= = 2 > 0.
∂ Ŷ 2 ∂ Ŷ 2

Therefore, the empirical risk R̂ S under squared loss is strictly convex with respect to
prediction Ŷ . 
168 7 The Geometry of the Loss Surfaces

7.3.2 Smooth and Multilinear Partition

In the context where all activations are linear functions, the neural network simplifies
to a multilinear model characterized by a smooth and multilinear loss surface. How-
ever, when nonlinear activations are introduced, the landscape of the loss surface
becomes more complex due to the nonlinearity introduced by these activation func-
tions. When input data passes through the linear portions of activation functions, the
resulting output resides in a smooth and multilinear region of the loss surface. This
smoothness and linearity allow for predictable behavior under parameter changes,
ensuring that each region expands into an open cell with continuous and differentiable
characteristics.
Conversely, nonlinear points within the activations are non-differentiable, leading
to non-smooth empirical risk concerning the parameters. These nonlinear points
correspond to the non-differentiable boundaries between cells on the loss surface,
where the loss function exhibits abrupt changes due to the inherent nonlinearity of
the activations.

7.3.3 Every Local Minimum Is Globally Minimal Within a


Cell

Proof of Theorem 7.7 In every cell, the input sample points flows through the same
linear parts of the activations no matter what values the parameters are.
(1) We first proves that the empirical risk R̂ S equals to a convex function with
respect to a variable Ŵ that is calculated from the parameters W .
Suppose (W1 , W2 ) is a local minimum within a cell. We argue that


m
    m
 
l yi , W2 diag A·,i W1 xi = l yi , A·,i
T
diag(W2 )W1 xi , (7.59)
i=1 i=1

where A·,i is the i-th column of the following matrix


⎡ ⎤
h s− ,s+ ((W1 )1,· x1 ) · · · h s− ,s+ ((W1 )1,· xm )
⎢ .. .. .. ⎥
A=⎣ . . . ⎦. (7.60)
 
h s− ,s+ ((W1 )d1 ,· x1 ) · · · h s− ,s+ ((W1 )d1 ,· xm )

The left-hand side (LHS) is as follows,


m
   
LHS = l yi , W2 diag A·,i W1 xi
i=1
7.3 Proofs of Theorems 7.7, 7.8, Corollaries 7.1, and 7.2 169


m
   
= l yi , (W2 )1,1 A1,i · · · (W2 )1,d1 Ad1 ,i W1 xi .
i=1

Meanwhile, the right-hand side (RHS) is as follows,


m
 
RHS = l yi , A·,i
T
diag(W2 )W1 xi ,
i=1

m
   
= l yi , (W2 )1,1 A1,i · · · (W2 )1,d1 Ad1 ,i W1 xi .
i=1

Apparently, LHS = RHS. Thereby, we proved Eq. (7.59).


Afterwards, we define
Ŵ1 = diag(W2 )W1 , (7.61)

and then straighten the matrix Ŵ1 to a vector Ŵ ,


 
Ŵ = (Ŵ1 )1,· · · · (Ŵ1 )d1 ,· . (7.62)

Also define  
X̂ = A·,1 ⊗ x1 · · · A·,m ⊗ xm . (7.63)

Then, we can prove that the following equations,


  
T
A·,1 Ŵ1 x1 · · · A·,n
T
Ŵ1 xm = (Ŵ1 )1,· · · · (Ŵ1 )d1 ,· A·,1 ⊗ x1 · · · A·,n ⊗ xm
= Ŵ X̂ . (7.64)

Applying Eq. (7.64), the empirical risk is transferred to convex function,

1   T 1 
m m
R̂ S (W1 , W2 ) = l yi , A·,i diag(W2 )W1 xi = l yi , Ŵ X̂ i .
m i=1 m i=1
(7.65)

The empirical risk is rearranged as a convex function in terms of Ŵ which unite the
two weight matrices W1 and W2 and the activation h are together as Ŵ .
Applying Eqs. (7.61) and (7.62), we have
 
Ŵ = (W2 )1 (W1 )1,· · · · (W2 )d1 (W1 )d1 ,· .

(2) We then prove that the local minima (including global minima) of the empirical
risk R̂ S with respect to the parameter W is also local minima with respect to the
corresponding variable Ŵ .
170 7 The Geometry of the Loss Surfaces

We first prove that for any i ∈ [1 : d1 d2 ], we have ei X̂ ∇ = 0, where ∇ is


 T
∇= ∇ Ŵ X̂
l Y1 , Ŵ X̂ ··· ∇ Ŵ X̂
l Ym , Ŵ X̂ .
1 1 m m

To see this, we divide i into two cases: (W2 )i = 0 and (W2 )i = 0.


Case 1: (W2 )i = 0. The local minimizer of the empirical risk R̂ S with respect to
∂ R̂ S
the parameter W satisfies the following equation ∂(W 1 )i, j
= 0. Therefore,

∂ R̂ S
0=
∂(W1 )i, j
 n  

∂ l Y·,k , Ŵ X̂
k=1 ·,k
=
∂(W1 )i, j
n 
 
 
= 0 · · · 0 (W2 )i 0 · · · 0 X̂ k ∇ Ŵ X̂
l Y·,k , Ŵ X̂ ,
k=1 ·,k (7.66)
0 12 3 0 12 3 ·,k

d X (i−1)+ j−1 d X d1 −d X (i−1)− j

where (W2 ) is a vector and (W2 )i is its i-th component.


By dividing the both hand sides of Eq. (7.66) with (W2 )i , we can get

(ed X (i−1)+ j X̂ )∇ = 0.

Case 2: (W2 )i = 0. Suppose u1 ∈ Rd0 is a unitary vector, u 2 ∈ R is a real number,


and ε is a small enough positive constant. Define a disturbance of W1 and W2 as

W1 = [00 ·12


· · 03, εu1 , 00 ·12
· · 03 ],
d X (i−1) d1 d X −d X i

and

W2 = [00 ·12


· · 03, ε2 u 2 , 00 ·12
· · 03].
i−1 d1 −i

When ε is sufficiently small, W1 and W2 are also sufficiently small. Since
(W1 , W2 ) is a local minimum, we have

1 
m
l Yk , Ŵ +  X̂ = R̂ S (W1 + W1 , W2 + W2 )
m k=1 k

1 
m
≥ R̂ S (W1 , W2 ) = l Yk , Ŵ X̂ , (7.67)
m k=1 k
7.3 Proofs of Theorems 7.7, 7.8, Corollaries 7.1, and 7.2 171

where  is defined as follows,

 = [(W2 + W2 )1 (W1 + W1 )1 , . . . , (W2 + W2 )d1 (W1 + W1 )d1 ]
 
− (W2 )1 (W1 )1 , . . . , (W2 )d1 (W1 )d1
(∗)
=[00 ·12
· · 03 , ε2 u 2 (εu1 + (W1 )i ) , 0
0 ·12
· · 03 ]. (7.68)
d X (i−1) d1 d X −d Xi

Here, Eq. (∗) comes from (W2 )i = 0. Rearrange Eq. (7.67) and apply the Taylor’s
Theorem, we can get that

 · X̂ ∇ + O  · X̂ 2
≥ 0.

Applying Eq. (7.68), we have

[0 · · · 0, ε2 u 2 (εu1 + (W1 )i ) , 0 · · · 0] X̂ ∇
+ ε4 O( [0 · · · 0, u 2 (εu1 + (W1 )i ) , 0 · · · 0] X̂ 2
)
(∗∗)
= [0 · · · 0, ε3 u 2 u1 , 0 · · · 0] X̂ ∇ + ε4 O( [0 · · · 0, u 2 (εu1 + (W1 )i ) , 0 · · · 0] X̂ 2
)
= ε [0 · · · 0, u 2 u1 , 0 · · · 0] X̂ ∇ + o(ε ) ≥ 0.
3 3
(7.69)

Here, Eq. (∗∗) can be obtained from follows. Because W2 is a local minimizer, for
any component (W2 )i of W2 ,
m
∂ k=1 l Yk , Ŵ X̂
k
=0,
∂(W2 )i

which leads to

[ 00 ·12
· · 03, (W1 )i , 00 ·12
· · 03 ] X̂ ∇ = 0.
d X (i−1) d X d1 −id X

When ε approaches 0, Eq. (7.69) leads to the following inequality,

· · 03, u 2 u1 , 00 ·12
[ 00 ·12 · · 03 ] X̂ ∇ ≥ 0.
d X (i−1) d X d1 −id X

Since u1 and u 2 are arbitrarily picked (while the norms equal 1), the inequality above
further leads to
  (7.70)
0 · · · 0 e 0 · · · 0 X̂ ∇ = 0,
j

which completes the proof of the argument.


172 7 The Geometry of the Loss Surfaces

Therefore, for any i and j, we have proven that

ed0 (i−1)+ j X̂ ∇ = 0,

which demonstrates that


X̂ ∇ = 0,

which means Ŵ is also a local minimizer of the empirical risk R̂,


m
R̂ S (W ) = l(Yi , W X̂ i ). (7.71)
i=1

(3) Applying the property of convex function, Ŵ is a global minimizer of the


empirical risk R̂ S , which leads to (W1 , W2 ) is a global minimum inside this cell.
The proof is completed.

7.3.4 Equivalence Classes of Local Minimum Valleys in Cells

Proof of Theorem 7.8 and Corollary 7.1 In the proof of Theorem 7.7, we constructed
a map Q: (W1 , W2 ) → Ŵ . Further, in any fixed cell, the represented hypothesis of a
neural network is uniquely determined by Ŵ .
We first prove that all local minima in a cell are concentrated as a local minimum
valley. Since the loss function l is strictly convex, the empirical risk has one unique
local minimum (which is also a global minimum) with respect to Ŵ in every cell,
if there exists some local minimum in the cell. Meanwhile, we have proved that all
local minima with respect to (W1 , W2 ) are also local minima with respect to the
corresponding Ŵ . Therefore, all local minima with respect to (W1 , W2 ) correspond
one unique Ŵ . Within a cell, when W1 expands by a positive real factor α to W1 and W2
shrinks by the same positive real factor α to W2 , we have Q(W1 , W2 ) = Q(W1 , W2 ),
i.e., the Ŵ remains invariant.
Further, we argue that all local minima in a cell are connected with each other by
a continuous path, on which the empirical risk is invariant. For every local minima
pair (W1 , W2 ) and (W1 , W2 ), we have

diag(W2 )W1 = diag(W2 )W1 . (7.72)

Since h s− ,s+ (W1 X ) = h s− ,s+ (W1 X ) (element-wise), for every i ∈ [1, d1 ],
  
sgn ((W2 )i ) = sgn W2 i .
7.3 Proofs of Theorems 7.7, 7.8, Corollaries 7.1, and 7.2 173

Therefore, a continuous path from (W1 , W2 ) to (W1 , W2 ) can be constructed by


finite moves, each of which expands a component of W2 by a real constant α and
then shrinks the corresponding line of W1 by the same constant α.
We then prove that all local minima in a cell constitute an equivalence class.
Define an operation ∼ R as follows,

(W11 , W21 ) ∼ R (W12 , W22 ),

if
Q(W11 , W21 ) = Q(W12 , W22 ).

We then argue that ∼ R is an equivalence relation. The three properties of


equivalence relations are checked as follows.
(1) Reflexivity:
For any (W1 , W2 ), we have

Q(W1 , W2 ) = Q(W1 , W2 ).

Therefore,
(W1 , W2 ) ∼ R (W1 , W2 ).

(2) Symmetry:
For any pair (W11 , W21 ) and (W12 , W22 ), Suppose that

(W11 , W21 ) ∼ R (W12 , W22 ).

Thus,
Q(W11 , W21 ) = Q(W12 , W22 ).

Apparently,
Q(W12 , W22 ) = Q(W11 , W21 ).

Therefore,
Q(W12 , W22 ) ∼ R Q(W11 , W21 ).

(3) Transitivity:
For any (W11 , W21 ), (W12 , W22 ), and (W13 , W23 ), suppose that

(W11 , W21 ) ∼ R (W12 , W22 ),


(W12 , W22 ) ∼ R (W13 , W23 ).

Then,

Q(W11 , W21 ) = Q(W12 , W22 ),


174 7 The Geometry of the Loss Surfaces

Q(W12 , W22 ) = Q(W13 , W23 ).

Apparently,
Q(W11 , W21 ) = Q(W12 , W22 ) = Q(W13 , W23 ).

Therefore,
(W11 , W21 ) ∼ R (W13 , W23 ).

We then prove the mapping Q is the quotient map.


Define a map as follows,

T : (W1 , W2 ) → (diag(W2 )W1 , 11×d1 ).

We then define an operator ⊕ as,

(W11 , W21 ) ⊕ (W12 , W22 ) = T (W11 , W21 ) + T (W12 , W22 ),

the inverse of (W1 , W2 ) is defined to be (−W1 , W2 ) and the zero element is defined
to be (0, 11×d1 ).
Obviously, the following is a linear mapping:

Q : ((Rd1 ×d X , R1×d1 ), ⊕) → (R1×d X d1 , +).

For any pair (W11 , W21 ) and (W12 , W22 ), we have

(W11 , W21 ) ∼ R (W12 , W22 ),

if and only if
(W11 , W21 ) ⊕ (−W12 , W22 ) ∈ Ker(Q).

Therefore, the quotient space (Rd1 ×d X , R1×d1 )/Ker(Q) is a definition of the equiva-
lence relation ∼ R .
The proof is completed.

7.4 Geometric Structure of the Loss Surface

What does the loss surface actually look like? Understanding the geometric structure
of the loss surface could lead to significant advancements in our comprehension of
deep learning.
Linear partition of the loss surface. Soudry and Hoffer (2018) introduced the
concept of a smooth and multilinear partition within the loss surface of neural net-
works, highlighting the impact of nonlinearities in piecewise linear activations. These
nonlinearities effectively segment the loss surface into distinct regions characterized
7.4 Geometric Structure of the Loss Surface 175

by smooth and multilinear properties. Specifically, every nonlinear point within the
activation functions contributes to a set of non-differentiable boundaries between
cells, while the linear segments of the activations correspond to the smooth and mul-
tilinear interiors of these cells. This decomposition provides insights into the geo-
metric structure of the loss surface and its relationship with neural network behavior
and training dynamics.
He et al. (2020) demonstrated several intricate properties of mode connectivity:
(1) Within an open cell, if local minima exist, they are equally optimal in terms of
empirical risk, making all local minima global within that cell. This highlights a
uniformity of performance among local minima within the same region of the loss
surface. (2) The local minima within any open cell form an equivalence class and
are concentrated in a valley, suggesting a concentrating effect of optimal solutions in
specific regions of the loss landscape. (3) When all activations are linear, the partition
collapses into a single cell, including linear neural networks as a special case. This
observation suggests the value of nonlinear activations in shaping the complexity
and structure of the loss surface. These three findings provide important insights into
the behavior and landscape of neural network loss surfaces under different activation
regimes, shedding light on the nature of optimization and generalization in deep
learning models.
The property (2) introduced by He et al. (2020) elucidates the concept of mode
connectivity, which suggests that solutions discovered through SGD or its variants
are connected by a path in the weight space, where all points along the path exhibit
nearly identical empirical risk. This mode connectivity phenomenon has been empir-
ically observed in studies by Garipov et al. (2018) and Draxler et al. (2018). Recently,
Kuditipudi et al. (2019) provided theoretical support by demonstrating that mode con-
nectivity can be assured through dropout stability and noise stability. These findings
contribute to the understanding of how optimization algorithms explore the weight
space and the resilience of solutions under perturbations and dropout conditions.
Correspondence between the landscapes of the empirical risk and expected
risk: Two seminal works by Zhou and Feng (2018) and Mei et al. (2018) have theoret-
ically revealed a critical link between the landscapes of the empirical risk surface and
the expected counterpart. This correspondence suggests that studying the geometric
properties of empirical risk surfaces can provide insights into model generalizability,
which pertains to the gap between expected and empirical risks. Notably, these stud-
ies demonstrated that the gradient of the empirical risk, the stationary points of the
empirical risk, and the empirical risk itself can all be asymptotically approximated
by their population equivalents. Moreover, they delivered a generalization bound for
the nonconvex scenario: with probability at least 1 − δ,
 4 4 
9 s log(mU/D) + log(4/δ)
O τ [1 + cr (D − 1)] ,
8 m

 l−1
where all data x are assumed to be τ -sub-Gaussian, cr = max r 2 /16, r 2 /16 ,
and s is the number of nonzero components of the weights. Later, Mei et al. (2018)
176 7 The Geometry of the Loss Surfaces

proved a similar correspondence between the Hessian matrices of the expected risk
and the empirical risk under the assumption that the sample size is greater than the
number of parameters.
Eigenvalues of the Hessian. Sagun et al. in 2016 (Sagun et al. 2016) and 2018
(Sagun et al. 2018) and Papyan in 2018 (Papyan 2018) conducted experiments to
study the eigenvalues of the Hessian of the loss surface. Sagun et al. (2016), Sagun
et al. (2018) discovered that (1) a large bulk of the eigenvalues are centred close to
zero and (2) several outliers are located far from this bulk. Papyan (2018) presented
the full spectrum of the Hessian matrix. In Papyan (2018, pp. 2, Figs. 1(a) and 1(c),
and pp. 3, Figs. 2(a) and 2(c)), the authors compare the spectra of the Hessian matrices
obtained when training and testing a VGG-11 network on the MNIST and CIFAR-10
datasets.

References

Baldi, Pierre, and Kurt Hornik. 1989. Neural networks and principal component analysis: learning
from examples without local minima. Neural Networks 2 (1): 53–58.
Choromanska, Anna, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. 2015.
The loss surfaces of multilayer networks. In International Conference on Artificial Intelligence
and Statistics.
Draxler, Felix, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. 2018. Essentially no
barriers in neural network energy landscape. In International Conference on Machine Learning.
Freeman, C Daniel, and Joan Bruna. 2017. Topology and geometry of half-rectified network
optimization. In International Conference on Learning Representations.
Garipov, Timur, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson.
2018. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural
Information Processing Systems.
Goldblum, Micah, Jonas Geiping, Avi Schwarzschild, Michael Moeller, and Tom Goldstein. 2020.
Truth or backpropaganda? An empirical investigation of deep learning theory. In International
Conference on Learning Representations.
Hanin, Boris, and David Rolnick. 2019. Complexity of linear regions in deep networks. In
International Conference on Machine Learning.
He, Fengxiang, Bohan Wang, and Dacheng Tao. 2020. Piecewise linear activations substan-
tially shape the loss surfaces of neural networks. In International Conference on Learning
Representations.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image
recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
Kawaguchi, Kenji. 2016. Deep learning without poor local minima. In Advances in Neural
Information Processing Systems.
Kuditipudi, Rohith, Xiang Wang, Holden Lee, Yi Zhang, Zhiyuan Li, Wei Hu, Sanjeev Arora, and
Rong Ge. 2019. Explaining landscape connectivity of low-cost solutions for multilayer nets. In
Advances in Neural Information Processing Systems.
Laurent, Thomas, and James von Brecht. 2018. The multilinear structure of ReLU networks. In
International Conference on Machine Learning.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521 (7553): 436.
Litjens, Geert, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco
Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I
References 177

Sánchez. 2017. A survey on deep learning in medical image analysis. Medical Image Analysis
42: 60–88.
Lu, Haihao, and Kenji Kawaguchi. 2017. Depth creates no bad local minima. arXiv preprint
arXiv:1702.08580.
Mei, Song, Yu Bai, and Andrea Montanari. 2018. The landscape of empirical risk for nonconvex
losses. The Annals of Statistics 46 (6A): 2747–2774.
Papyan, Vardan. 2018. The full spectrum of deep net hessians at scale: dynamics with sample size.
arXiv preprint arXiv:1811.07062.
Safran, Itay, and Ohad Shamir. 2018. Spurious local minima are common in two-layer ReLU neural
networks. In International Conference on Machine Learning.
Sagun, Levent, Léon Bottou, and Yann LeCun. 2016. Singularity of the hessian in deep learning.
arXiv preprint arXiv:1611.07476.
Sagun, Levent, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. 2018. Empirical analysis
of the hessian of over-parametrized neural networks. In International Conference on Learning
Representations Workshop.
Silver, David, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess-
che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. 2016.
Mastering the game of go with deep neural networks and tree search. Nature 529 (7587): 484–489.
Soudry, Daniel, and Elad Hoffer. 2018. Exponentially vanishing sub-optimal local minima in
multilayer neural networks. In International Conference on Learning Representations Workshop.
Swirszcz, Grzegorz, Wojciech Marian Czarnecki, and Razvan Pascanu. 2016. Local minima in
training of deep networks. arXiv preprint arXiv:1611:06310.
Witten, Ian H, Eibe Frank, Mark A Hall, and Christopher J Pal. 2016. Data Mining: Practical
Machine Learning Tools and Techniques. Morgan Kaufmann.
Yun, Chulhee, Suvrit Sra, and Ali Jadbabaie. 2018. Global optimality conditions for deep neural
networks. In International Conference on Learning Representations.
Yun, Chulhee, Suvrit Sra, and Ali Jadbabaie. 2019. Small nonlinearities in activation functions create
bad local minima in neural networks. In International Conference on Learning Representations.
Zhou, Pan, and Jiashi Feng. 2018. Empirical risk landscape analysis for understanding deep neural
networks. In International Conference on Learning Representations.
Zhou, Yi, and Yingbin Liang. 2018. Critical points of neural networks: analytical forms and
landscape properties. In International Conference on Learning Representations.
Chapter 8
Linear Partition in the Input Space

Recent research has demonstrated that the input space of a rectified linear unit (ReLU)
network, which exclusively uses ReLU-like (two-piece linear) activation functions,
is divided into linear regions by the nonlinear activations.
Specifically, within these linear regions, the mapping induced by a ReLU network
behaves linearly for input data. Conversely, at the boundaries between linear regions,
the mapping becomes nonlinear and nonsmooth. Intuitively, the linear regions corre-
spond to the linear segments of the ReLU activations, representing specific activation
patterns. Meanwhile, the boundaries are defined by transition points where the acti-
vation pattern changes. As a result, each input example is associated with a neural
code—a 0-1 matrix representing its activation pattern within the linear region it
occupies.
This chapter introduces the concept of a neural code, which serves as a param-
eterized representation of activation patterns induced by a neural network for input
examples. Sufficient experiments demonstrate that the neural code exhibits the inter-
esting encoding properties, which are shared by hash code, in most common scenarios
of deep learning for classification tasks.

8.1 Preliminaries

Suppose that a ReLU network N is trained to fit a dataset S = {(xi , yi ), i =


1, . . . , m} for classification, where xi ∈ X ⊂ Rd X , d X is the dimensionality of x,
yi ∈ Y = {1, . . . , d}, d is the number of potential categories, and m is the training
sample size. Additionally, we assume that all examples (xi , yi ) are independent and
identically distributed (i.i.d.) random variables drawn from a data distribution D.
We denote the resulting well-trained model by M. Here, “well-trained” refers to the
condition in which the training procedure has converged.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025 179
F. He and D. Tao, Foundations of Deep Learning, Machine Learning: Foundations,
Methodologies, and Applications, https://doi.org/10.1007/978-981-16-8233-9_8
180 8 Linear Partition in the Input Space

Recent works have shown that the input space of a ReLU network N is partitioned
into multiple linear regions, each of which corresponds to a specific activation pat-
tern of the ReLU activation functions. In this section, we represent the activation
pattern as a matrix P ∈ P ⊂ {0, 1}l×w , where l and w are the depth and largest
width, respectively, of this neural network N . Specifically, the (i, j)-th component
characterizes the activation status of the j-th ReLU neuron in the i-th layer. If the
(i, j)-th component is equal to 1, this represents that this neuron is activated, while a
value of 0 means that this neuron is deactivated or invalid.1 The matrix P is termed
the neural code. We can also reformulate the neural code as a vector if there is no
possibility for confusion of the depth and width.
It’s essential to note that the boundaries separating linear regions have no physical
width. This occurs because these boundaries align precisely with transition points in
the activations, effectively resembling infinitely thin lines. In more straightforward
terms, imagine these boundaries as razor-thin lines that separate different regions
within the model. Consequently, the likelihood of an example precisely landing on
one of these boundaries is virtually non-existent. Therefore, for practical analysis
purposes, we assume that no example resides within these boundaries. This simpli-
fication greatly aids in our understanding and interpretation of the model’s behavior.
Consequently, upon fixing the weights w of the model M, every example x ∈ X
can be indexed by the neural code P of the corresponding linear region. Note that
an example x can be seen in either the training sample or the test sample.

8.2 Neural Networks Act as Hash Encoders

In this section, we investigate the results of an empirical study, shedding light on


two important discoveries. First, we find that the neural code, which represents data
within a neural network, functions much like a hash code for the corresponding
input data. Second, we show that the mapping from raw data to their neural codes,
known as the activation mapping X → P, behaves similarly to a hash function.
What’s intriguing is that unlike specialized methods for learning to hash, well-trained
neural networks, typically used for tasks like classification, inherently exhibit hash
encoding capabilities without any deliberate tuning. Specifically, the neural code
displays two fundamental encoding properties: determinism and categorization. To
rigorously evaluate the effectiveness of the activation mapping as a hash mapping,
we employ the following two standard quantitative metrics, similar to those used in
learning-to-hash approaches.
Similar to works on learning-to-hash, we adopt the following two measures to
quantitatively evaluate the activation mapping as a hash mapping.

1 Different layers may have different numbers of neurons. Therefore, there might be some indices
(i, j) that are invalid. We represent the activation patterns of these neurons by 0 since they are never
activated.
8.2 Neural Networks Act as Hash Encoders 181

Redundancy ratio. We define the following redundancy ratio to measure the


determinism property. Generally, a smaller redundancy ratio is preferred when we
are evaluating the activation mapping as a hashing function.

Definition 8.1 (Redundancy ratio) Suppose that there are m examples in a dataset
S. If they are located in n activation regions, the redundancy ratio is defined as m−n
m
.

Categorization accuracy. In typical applications, hash codes are usually utilized


for nearest neighbor searches. In our study, we employ straightforward algorithms
like K -means, K -NN, and logistic regression on the neural codes. We gauge the
encoding properties through training and test accuracy, where higher accuracy indi-
cates superior encoding. Notably, K -means is traditionally employed for unsuper-
vised learning, but here, we adapt it to verify our encoding properties within the
realm of supervised learning. The modified pipeline is detailed in Sect. 8.4.
We examine the activation mappings generated by MLP (Multi-layer Perception),
VGG-18, ResNet-18, ResNet-34, ResNeXt-26, and DenseNet-28 models trained
on the MNIST dataset, as well as VGG-19, ResNet-18, ResNet-20, and ResNet-
32 models trained on the CIFAR-10 dataset. Detailed information regarding the
implementations is provided in Sect. 8.4. Across all cases, the redundancy ratio is
nearly negligible (i.e. almost zero), and the categorization accuracy is commendable,
as illustrated in Tables 8.1 and 8.2.
These results serve to confirm the reliability of both the determinism and cat-
egorization properties of our neural codes. Furthermore, we employ t-distributed
stochastic neighbor embedding (t-SNE) (Maaten and Hinton 2008) to visually depict
the neural codes, a technique showcased in Fig. 8.1. Through this visualization,
we observe that data points belonging to the same category tend to cluster closely

Table 8.1 Accuracy of K -means and K -NN on the neural codes of convolutional neural networks
(CNNs) trained on MNIST
Architecture K -means acc (%) K -NN acc (%)
VGG-18 99.95 99.33
ResNet-18 98.96 99.32
ResNet-34 99.66 99.49
ResNeXt-26 98.31 99.24
DenseNet-28 69.87 98.59

Table 8.2 Accuracy of logistic regression (LR) on the neural codes of CNNs trained on CIFAR-10
Architecture LR acc (%) Test acc (%)
VGG-19 92.19 91.43
ResNet-18 89.55 90.42
ResNet-20 88.76 90.44
ResNet-32 89.05 90.45
182 8 Linear Partition in the Input Space

Fig. 8.1 t-SNE visualization


of the neural codes of
MNIST generated by a
one-hidden-layer MLP of
width 100

together, indicating a high degree of similarity. Conversely, distinct boundaries are


discernible between data points from different categories, highlighting clear sep-
arations. This visual representation vividly illustrates the categorization property,
demonstrating that neural codes effectively capture and differentiate between vari-
ous categories of data.

8.3 Factors that Influence the Encoding Properties

The results presented in Tables 8.1 and 8.2 also suggest that the encoding properties
exhibit variability in different scenarios. Through comprehensive experiments, we
investigate which factors influence the encoding properties and how. The investi-
gated factors include the model size, training time, sample size, three different pop-
ular regularizers, random data, and noisy labels. Some details of the experimental
implementations are given in Sect. 8.4.

8.3.1 Relationship Between Model Size and Encoding


Properties

We first study how the model size influences the encoding properties. We trained
115 one-hidden-layer MLPs of different widths on the MNIST dataset and 200 five-
hidden-layer MLPs of different widths on the CIFAR-10 dataset (see Sect. 8.4 for
the full list of widths considered in our experiments), while all irrelevant variables
were strictly controlled. The experiments were repeated for 5 trials on MNIST and
10 trials on CIFAR-10.
Measurement of model size by width. When considering MLPs, a natural metric
for evaluating model complexity is the layer width.2 Using layer width as our measure
of model size, we computed both the redundancy ratio and categorization accuracy
across all scenarios, as illustrated in Fig. 8.2a, b. These plots reveal notable correla-
tions between encoding properties and layer width: (1) Redundancy Ratio: Initially,

2 Depth is also a natural measure of model size. However, the optimal training protocol (especially
training time) significantly differs for networks of different depths. Thus, it is difficult to conduct
experiments on depth while controlling other factors.
8.3 Factors that Influence the Encoding Properties 183

(a) Determinism vs. layer width (b) Categorization vs. layer width

(c) D vs. layer width (d) D at different times (e) D for random/true (f) R vs. random/true
data data

Fig. 8.2 a Plots of the redundancy ratio as a function of the layer width of MLPs on both the
training set (blue) and the test set (red). b Plots of the test accuracies of K -means (blue), K -NN
(red), and logistic regression (LR, range) as functions of the MLP layer width. c Plots of the average
stochastic activation diameter (D) as a function of the MLP layer width on MNIST. d Histograms
of D calculated on MNIST for an MLP of width 50 trained on MNIST for 10 epochs (blue) and 500
epochs (red), respectively. e Histograms of D calculated on MNIST (blue) and randomly generated
data of the same dimensions (red) for an MLP of width 50 trained on MNIST. The two red histograms
are identical. f Plots of the redundancy ratio (R) calculated on MNIST (“True data”, solid lines)
and randomly generated data (dotted lines) as functions of the training time for MLPs of width
40 (blue), 50 (red), and 60 (orange). The dotted lines represent networks trained on the unaltered
data and evaluated on random data. The models were trained 5 times on MNIST and 10 times on
CIFAR-10 with different random seeds. The darker lines represent the averages over all seeds, and
the shaded areas show the standard deviations

the redundancy ratio starts at relatively high values (almost 1 on both the training and
test sets of MNIST, around 0.1 on the training set of CIFAR-10, and approximately
0.04 on the CIFAR-10 test set). However, it steadily decreases to nearly 0 across all
cases as the layer width increases. (2) Categorization Accuracy: Initially, the cate-
gorization accuracy commences at relatively low values (about 25% on MNIST and
32-45% on CIFAR-10). However, as the width increases, the accuracy consistently
improves across all scenarios, reaching relatively high values (approximately 70%
for K -means on both datasets, exceeding 90% for K -NN and logistic regression on
MNIST, and approximately 50% for K -NN and logistic regression on CIFAR-10,
akin to the test accuracy on the raw data).
Measurement of model capacity in terms of the diameters of the linear
regions. We devise a new measure to assess model capacity, termed the average
stochastic activation diameter. This metric is computed through the following three
steps: (1) we randomly sample a direction from a uniform distribution; (2) the stochas-
184 8 Linear Partition in the Input Space

tic diameter of a linear region is defined as the length of the longest line segment
intersecting the linear region along the sampled direction; and (3) the average stochas-
tic activation diameter is defined as the mean of these stochastic diameters across
all linear regions containing data. In essence, a smaller average stochastic activa-
tion diameter indicates a finer division of the input space into smaller linear regions,
enabling the representation of more intricate data structures. Therefore, this metric is
an effective measure of the model capacity. Correspondingly, we observe a negative
correlation between layer width and average stochastic activation diameter, as shown
in Fig. 8.2c.
Similarly, Hanin and Rolnick (2019) introduced a concept termed the “typical
distance from a random input to the boundary of its linear region”. In contrast, our
diameter conceptually represents the longest distance between any two points within
a linear region. While (Hanin and Rolnick 2019) typical distance metric can be under-
stood as the distance from a random input to the boundary of its linear region, our
diameter measures the maximum possible separation within the linear region itself.
When the linear region takes the form of an ideal ball, (Hanin and Rolnick 2019)
typical distance will be equal to or smaller than the radius of the ball, which is half of
our diameter. However, linear regions are often irregular in practice, as illustrated in
Fig. 1 of Hanin and Rolnick (2019). Consequently, their distances typically turn out
to be considerably smaller than our diameter. Consequently, the discrepancy between
these two definitions can be significant, particularly depending on the level of irreg-
ularity. One may remain relatively constant while the other experiences significant
variation. Consequently, Hanin and Rolnick (2019) distance metric may provide a
lower bound on the linear region volume, whereas our diameter serves as an upper
bound.
We further investigate the encoding properties beyond those of the data-generating
distribution. To this end, we generate a set of examples following a uniform distribu-
tion across the unit ball centered at the original point. Additionally, we normalize the
original data such that each pixel falls within the range [0, 1], ensuring comparable
scales between the randomly generated and original data. Notably, we observe that
the redundancy ratio exceeds 0.8 on the randomly generated data, as illustrated in Fig.
8.2f. This observation indicates that the determinism property is no longer upheld,
implying that unique neural codes cannot effectively represent randomly generated
data. Consequently, the categorization property becomes less discernible. To account
for these findings, we propose the following hypothesis.

Hypothesis 8.1 The diameters of the linear regions in the support of a data distri-
bution decrease as training progresses, while the diameters of regions far away show
little change. 

We then collected the average stochastic activation diameters for each scenario,
as illustrated in Fig. 8.2d, e. We observe that the stochastic diameters are more
concentrated when the training time is longer; see Fig. 8.2d. Moreover, we observe the
interesting result that the stochastic diameters for the true data are more concentrated
8.3 Factors that Influence the Encoding Properties 185

at lower values than the stochastic diameters for the random data. Figure 8.2e shows
the histograms of the stochastic diameters. The histograms for other scenarios are
given in Sect. 8.5. These results fully support our hypothesis.

8.3.2 Relationship Between Training Time and Encoding


Properties

We then explore how training time affects the encoding properties. Our experiments
involved one-hidden-layer MLPs of varying widths trained on the MNIST dataset
and five-hidden-layer MLPs of different widths trained on CIFAR-10. In total, we
evaluated 810 models. To ensure the reliability of our findings, the experiments
were repeated five times for MNIST and ten times for CIFAR-10, amounting to a
comprehensive analysis of the influence of training time on encoding properties.
Throughout our experiments, we meticulously tracked the evolution of both the
redundancy ratio and categorization accuracy at every epoch for all scenarios, doc-
umented in Fig. 8.3. These plots serve as a comprehensive visualization of the rela-
tionship between encoding properties and training time. Notably, our observations
reveal a clear positive correlation: as the training time increases, we observe (1) a
steady decline in the redundancy ratio and (2) a consistent improvement in catego-
rization accuracy. This trend underscores the importance of training time in shaping
the encoding capabilities of our models, shedding light on the dynamics of learning
processes within neural networks.
We also make a noteworthy observation regarding the redundancy ratio of an
untrained MLP on the MNIST dataset, which is nearly zero, as depicted in Fig.
8.3c. To provide a deeper understanding, let’s investigate the rationale behind this
finding: Upon random initialization, a neural network partitions the input space into
numerous activation regions. If these regions are sufficiently small, each training
data point effectively has its own dedicated activation region. However, it’s essen-
tial to recognize that during this random initialization phase, the mapping from input
data to output predictions lacks any meaningful structure. Consequently, the network
may exhibit erratic behavior, producing vastly different predictions for data points
originating from neighboring activation regions. As a consequence, categorization
accuracy is notably poor, aligning with intuitive expectations. This phenomenon val-
idates the notion that determinism alone is insufficient for adequately characterizing
encoding properties. It also resonates with the concept of reservoir effects (Jaeger
2001; Maass et al. 2002), which emphasizes the dynamic and complex nature of
neural network behavior during the initialization phase.
It’s important to highlight a key distinction between our findings and those
reported in Hanin and Rolnick (2019). While their research suggests that the count
of linear regions increases with the training progress, our observations present a
different perspective. We assert that the encoding properties we’ve examined do
not extend beyond the training data distribution, regardless of any variations in
186 8 Linear Partition in the Input Space

(a) R ratio vs. time (b) Categorization accuracy vs. time

(c) R ratio vs. time (d) Test accuracies of -means, -NN, and logistic regression vs. time

Fig. 8.3 a The redundancy ratio (R ratio) as a function of the training time on CIFAR-10. b The
test accuracies of K -means (left), K -NN (middle), and logistic regression (right) as functions of
the training time. The models are MLPs of width 10 (blue), 20 (red), 40 (orange), and 80 (green). c
The R ratio as a function of the training time on MNIST. d The test accuracies of K -means (left),
K -NN (middle), and logistic regression (right) as functions of the training time. The models are
MLPs of depth 40 (blue), 50 (red), and 60 (orange) on MNIST. The dotted lines represent networks
trained on the unaltered data and evaluated on random data. All models obtained from MNIST were
trained 5 times with different random seeds. The darker lines represent the averages over all seeds,
and the shaded areas show the standard deviations

the count of linear regions. This assertion is supported by the data presented in
Fig. 8.2f. What this implies is that while there may indeed be an increase in the
count of linear regions, this phenomenon is localized to only a small fraction of the
expansive input space. Consequently, even with this observed increase, there’s no
guarantee of a corresponding decrease in the redundancy ratio. This understanding highlights the intricate relationship between training dynamics and encoding properties within neural networks.

8.3.3 Relationship Between Sample Size and Encoding Properties

We now study the influence of sample size on the encoding properties of neural
networks. To conduct this study, we trained a total of 210 one-hidden-layer MLPs,
each with three different widths, and 480 five-hidden-layer MLPs, each with four
different widths. These models were trained on training samples of varying sizes ran-
domly sampled from the MNIST and CIFAR-10 datasets, respectively. It’s essential
to highlight our stringent experimental controls, ensuring that all irrelevant vari-
ables are meticulously managed to maintain experimental integrity (i.e. accurate and
repeatable results). Additionally, instead of tracking epochs, we monitor the num-
ber of iterations. This decision is informed by the understanding that the number
of iterations in one epoch scales proportionally with the sample size. To ensure the
reliability and repeatability of our findings, each experiment is rigorously repeated
five times for MNIST and ten times for CIFAR-10 datasets.
We conducted an extensive analysis by computing both the redundancy ratio
and categorization accuracy across all scenarios, visually detailed in Fig. 8.4. Our
findings unveil two significant trends: (1) The redundancy ratio, whether assessed on
the training or test sample, exhibits a notable pattern. Upon initialization, it begins
at a relatively high value, gradually diminishing to near-zero levels as the training
sample size increases. This observation validates the diminishing redundancy as the
model learns from a larger and more diverse dataset. (2) We observe a distinct positive
correlation between the sample size and the test accuracies of all three algorithms.
Specifically, as the sample size expands, the accuracy of K -means escalates from 20
to 40%, K -NN accuracy experiences a surge from 10 to 45%, and logistic regression
accuracy shows a pronounced increase from 15 to 45%. These trends elucidate the
profound impact of sample size on both redundancy ratio and classification accuracy
and a negative correlation between redundancy ratio and categorization accuracy,
providing useful insights into model performance under varying data conditions.
Surprisingly, we observe that the encoding properties on the test set also become
stronger as the training sample size increases. Our hypothesis is as follows. Intu-
itively, a larger training sample size supports the neural network in attaining a higher
expressive power, i.e., a finer linear partition in the input space. Meanwhile, a sample
of larger size requires a finer linear partition to yield the same redundancy ratio. Our
experiments show that the first effect is stronger than the second one. Thus, a larger
sample size can help reduce the redundancy ratio.

8.3.4 Layerwise Ablation Study

We next study how different layers impact the encoding properties. We conducted a
layerwise ablation study on the CIFAR-10 dataset based on five-hidden-layer MLPs,
in which every layer was of width 40.
We conducted an exhaustive analysis, meticulously calculating both the redun-
dancy ratio and categorization accuracy across every epoch, as illustrated in Fig. 8.5.
This thorough examination yielded the following insights:
(1) The redundancy ratio of the neural code formed by the initial layer consis-
tently remains close to zero, indicating a notable absence of redundancy. However,
despite this, the corresponding categorization accuracy remains relatively poor. This
observation suggests that while redundancy is minimized, the initial layer may not
capture sufficient discriminative information for effective categorization.

Fig. 8.4 a The redundancy ratios (R ratios) on the training set (dotted lines) and test set (solid
lines) of CIFAR-10 as functions of the sample size. b The test accuracies of K -means (left), K -NN
(middle), and logistic regression (right) as functions of the sample size. The models are MLPs of
width 10 (blue), 20 (red), 40 (orange), and 80 (green). c The R ratios on the training set (dotted
lines) and test set (solid lines) of MNIST as functions of the sample size. d The test accuracies of
K -means (left), K -NN (middle), and logistic regression (right) as functions of the sample size. The
models are MLPs of width 40 (blue), 50 (red), and 60 (orange) on MNIST. The models were trained
10 times on CIFAR-10 or 5 times on MNIST, with a different random seed each time. The darker
lines represent the averages over all seeds, and the shaded areas show the standard deviations

(2) As we progress through increasingly higher single layers, the redundancy ratio
gradually increases, indicating a more diverse representation of data. Correspond-
ingly, there is a noticeable improvement in categorization accuracy, suggesting that
deeper layers encode more discriminative features.
(3) The impact of training time on the encoding properties of a single layer mirrors
that of the entire network, emphasizing the importance of adequate training time in
shaping encoding capabilities.
(4) As the neural code incorporates more layers, the redundancy ratio steadily
decreases, indicating a more efficient encoding of data across multiple layers.
(5) Categorization accuracy exhibits a gradual enhancement as the neural code
evolves from the first layer to encompass the entire network. This progression high-
lights the iterative refinement of features and representations throughout the network
architecture.
(6) The observed pattern in (5) is disrupted when forming the neural code in
reverse, starting from the last layer and progressing backward. This reversal suggests
that the hierarchical organization of features may play a crucial role in information
representation.

Fig. 8.5 a Plots of the redundancy ratios of the neural codes formed by different single layers of
MLPs trained on CIFAR-10 as functions of the training time. b The test accuracies of K -means (left),
K -NN (middle), and logistic regression (right) as functions of the training time. c The redundancy
ratios of the neural codes formed by multiple MLP layers as functions of the sample size. d The
test accuracies of K -means (left), K -NN (middle), and logistic regression (right) as functions of
the sample size. The models are MLPs of width 40 on CIFAR-10

(7) The categorization accuracy of the neural code formed solely by the last layer
is comparable to that of the entire network, indicating that the final layer encapsulates
critical discriminative features necessary for accurate classification.
Insight (2), particularly regarding the redundancy ratio, offers a window into
the interplay between hashing properties and the generalizability of deep learning.
The gradual concentration of data into fewer cells from the initial to the final layer
validates the network’s ability to extract increasingly informative and discriminative
features, contributing to its overall effectiveness and generalization for categoriza-
tion. This understanding sheds light on the intricate relationship between encoding
properties and the broader principles governing deep learning architectures.

8.3.5 Impacts of Regularization, Random Data, and Random Labels

We also study the impact of regularization on the encoding properties. We trained


345 MLPs on the MNIST dataset with and without batch normalization, gradient
clipping, and weight decay. The results suggest that regularization has an impact on
the encoding properties but that this impact is smaller than those of the model size,
training time, and sample size; see Fig. 8.6 (Figs. 8.7 and 8.8).

Fig. 8.6 First row: Scatter plots of the redundancy ratios and test accuracies of K-means (blue), K-NN (violet), and logistic regression (LR, orange) for MLPs with widths ranging from 3 to 100, with batch normalization (y-axis) and without batch normalization (x-axis). A smaller ȳ − x̄ is preferred for the redundancy ratio, and a larger one is preferred for the test accuracy. In total, 115 models are represented in one scatter plot. Second row: Scatter plots of the redundancy ratios and test accuracies of K-means (blue), K-NN (violet), and LR (orange) for MLPs with widths ranging from 3 to 100, with gradient clipping (y-axis) and without gradient clipping (x-axis). A smaller ȳ − x̄ is preferred for the redundancy ratio, and a larger one is preferred for the test accuracy. In total, 115 models are represented in one scatter plot. Third row: Scatter plots of the redundancy ratios and test accuracies of K-means (blue), K-NN (violet), and LR (orange) for MLPs with widths ranging from 3 to 100, with weight decay (y-axis) and without weight decay (x-axis). A smaller ȳ − x̄ is preferred for the redundancy ratio, and a larger one is preferred for the test accuracy. In total, 115 models are represented in one scatter plot

We also generated random data in which every pixel was generated from the
uniform distribution U (0, 1). We then trained MLPs and convolutional neural net-
works (CNNs) on the generated data. Unfortunately, the training process did not
converge. We then added label noise to MNIST at different noise rates (0.1, 0.2,
0.3). The encoding properties still showed the same general trends, although they
became somewhat worse. Our results suggest that the structure of the input data can
influence the organization of the hash space (Table 8.3).

Fig. 8.7 a The redundancy ratios for MNIST with different levels of label noise as functions of the
layer width. b The test accuracies of K -means (left), K -NN (middle), and logistic regression (right)
with different levels of label noise as functions of the layer width. The models are MLPs trained
on MNIST at different label noise rates of 0 (blue), 0.1 (red), 0.2 (orange) and 0.3 (green). All
models were trained on MNIST with noisy labels for classification 5 times with different random
seeds. The darker lines represent the averages over all seeds, and the shaded areas show the standard
deviations

Fig. 8.8 An example of random data

Table 8.3 Training accuracy and training loss of a one-hidden-layer MLP of width 100 on random
data
Epoch 0 100 300 500
Training acc (%) 10.92 11.24 11.24 11.24
Loss 230.56 230.13 230.13 230.13

8.3.6 Activation Hash Phase Chart

Finally, we can define an activation hash phase chart that characterizes the space
formed by the redundancy ratio, categorization accuracy, model size, training time,
and sample size. Summarizing the relationships discovered above, the activation
hash phase chart is divided into three canonical regions, corresponding to the under-
expressive regime, the critically expressive regime, and the sufficiently expressive
regime. This chart can provide guidance in hyperparameter tuning, the design of
novel algorithms, and algorithm diagnosis. We note that the thresholds between the
three regimes are currently unknown. Exploring them is a promising direction for
future research.

8.4 Additional Experimental Implementation Details

Datasets. Our experiments are based on the MNIST dataset (LeCun et al. 1998) and
the CIFAR-10 dataset (Krizhevsky and Hinton 2009): (1) MNIST contains 60,000 training examples and 10,000 test examples from 10 classes. This dataset can be downloaded at http://yann.lecun.com/exdb/mnist/. (2) CIFAR-10 consists of 50,000 training images and 10,000 test images that belong to 10 classes. CIFAR-10 can be downloaded at https://www.cs.toronto.edu/~kriz/cifar.html. The splits of the training
and test sets follow the official versions. All images are normalized such that every
pixel value is in the range [0, 1].
Training settings. (1) For MNIST, MLPs were trained with Adam for 2, 000
epochs with a batch size of 128 and a constant learning rate. VGG, ResNet, ResNeXt, and DenseNet models were trained with Adam for 500 epochs with a batch
size of 128. The learning rate was initialized as 0.01 and decayed to 1/10 of the
previous value every 100 epochs. For all models, the hyperparameter β1 was set to
0.9, and the hyperparameter β2 was set to 0.999. (2) For CIFAR-10, MLPs with 5
hidden layers were trained with Adam for 200 epochs with a batch size of 64. The
learning rate was initialized as 0.01 and decayed to 1/10 of the previous value every
20 epochs. VGG and ResNet models were trained with stochastic gradient descent
(SGD) for 200 epochs with a batch size of 64. The learning rate was initialized as
0.01 and decayed to 1/10 of the previous value every 50 epochs. On MNIST, the
MLPs were trained five times with different random seeds. On CIFAR-10, the MLPs
were trained ten times with different random seeds.
Average stochastic diameter. We first trained MLPs with widths of {5, 10, 15, 20,
25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 90, 100} on MNIST. Then, we ran-
domly selected 600 examples from the test set and calculated the mean of their
corresponding stochastic diameters.
Network architectures. The details of the network architectures investigated in
Sect. 8.2 are shown in Tables 8.4 and 8.5.
Experimental design for K -means. The pipeline for the experiments on
K -means was designed as follows: (1) We set K equal to the number of classes.
(2) We ran K -means on the neural codes to obtain K clusters. (3) Every cluster was
assigned a label from {1, 2, . . . , 10}. Thus, 90 (cluster, label) pairs were obtained.
(4) For every (cluster, label) pair, we assigned the label to all data in the cluster and
calculated the accuracy. (5) We selected the highest accuracy as the accuracy of the
K -means algorithm.
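One natural implementation of this pipeline is sketched below (Python with NumPy and scikit-learn, which the chapter does not prescribe; function names and the placeholder data are illustrative assumptions). Assigning each cluster its most frequent label and measuring the resulting overall accuracy is one reading of the "highest accuracy" selection over (cluster, label) pairs.

```python
# A minimal sketch (assumed libraries: NumPy, scikit-learn) of the K-means evaluation
# pipeline described above, applied to neural codes with ground-truth class labels.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_categorization_accuracy(codes, labels, n_classes=10, seed=0):
    """codes: (n, d) array of neural codes; labels: (n,) integer class labels."""
    clusters = KMeans(n_clusters=n_classes, random_state=seed, n_init=10).fit_predict(codes)
    correct = 0
    for c in range(n_classes):
        members = labels[clusters == c]
        if members.size == 0:
            continue
        # The most frequent label in a cluster yields the best (cluster, label) accuracy.
        correct += np.bincount(members, minlength=n_classes).max()
    return correct / labels.size

# Example with random placeholder data (not the neural codes from the experiments):
codes = np.random.default_rng(0).normal(size=(600, 64))
labels = np.random.default_rng(1).integers(0, 10, size=600)
print("K-means categorization accuracy:", kmeans_categorization_accuracy(codes, labels))
```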

Table 8.4 The detailed architectures of the neural networks trained on MNIST
VGG-18: 3×3, 32, stride 2 → maxpool 3×3 → (3×3, 32)×4 → (3×3, 64)×4 → (3×3, 128)×4 → (3×3, 256)×4 → avgpool → fc-10, softmax
ResNet-18: 3×3, 32, stride 2 → maxpool 3×3 → [3×3, 32; 3×3, 32]×2 → [3×3, 64; 3×3, 64]×2 → [3×3, 128; 3×3, 128]×2 → [3×3, 256; 3×3, 256]×2 → avgpool → fc-10, softmax
ResNet-34: 3×3, 32, stride 2 → maxpool 3×3 → [3×3, 32; 3×3, 32]×3 → [3×3, 64; 3×3, 64]×4 → [3×3, 128; 3×3, 128]×6 → [3×3, 256; 3×3, 256]×3 → avgpool → fc-10, softmax
ResNeXt-26: 3×3, 32, stride 2 → maxpool 3×3 → [1×1, 32; 3×3, 32, C=8; 1×1, 64]×2 → [1×1, 64; 3×3, 64, C=8; 1×1, 128]×3 → [1×1, 128; 3×3, 128, C=8; 1×1, 256]×3 → avgpool → fc-10, softmax
DenseNet-28: 3×3, 6, stride 2 → maxpool 3×3 → [1×1, 12; 3×3, 3]×4 → conv 1×1, avgpool 2×2 → [1×1, 12; 3×3, 3]×4 → conv 1×1, avgpool 2×2 → [1×1, 12; 3×3, 3]×4 → avgpool → fc-10, softmax

Table 8.5 The detailed architectures of the neural networks trained on CIFAR-10
VGG-19: (3×3, 32)×2 → maxpool 2×2 → (3×3, 128)×2 → maxpool 2×2 → (3×3, 256)×4 → maxpool 2×2 → (3×3, 512)×4 → maxpool 2×2 → (3×3, 512)×4 → maxpool 2×2 → fc-4096 → fc-4096 → fc-10, softmax
ResNet-18: 3×3, 64 → [3×3, 64; 3×3, 64]×2 → [3×3, 128; 3×3, 128]×2 → [3×3, 256; 3×3, 256]×2 → [3×3, 512; 3×3, 512]×2 → avgpool → fc-10, softmax
ResNet-20: 3×3, 16 → [3×3, 16; 3×3, 16]×3 → [3×3, 32; 3×3, 32]×3 → [3×3, 64; 3×3, 64]×3 → avgpool → fc-10, softmax
ResNet-32: 3×3, 16 → [3×3, 16; 3×3, 16]×5 → [3×3, 32; 3×3, 32]×5 → [3×3, 64; 3×3, 64]×5 → avgpool → fc-10, softmax

Experiments on the relationship between the model size and the encoding
properties. We trained MLPs with widths of {3, 7, 10, 15, 20, 23, 27, 30, 33, 37, 40,
43, 47, 50, 53, 57, 60, 65, 70, 75, 80, 90, 100} on MNIST and {10, 20, 30, 40, 50,
60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190,200} on CIFAR-10.
Experiments concerning the relationship between the training process and the encoding properties. (1) For MNIST, we trained MLPs with widths of {40, 50, 60}.
The redundancy ratios and test accuracies of K -means, K -NN, and logistic regres-
sion were calculated for the following training epochs: {1, 3, 6, 10, 30, 60, 100, 300,
600, 1000, 1200, 1500, 1800, 2000}. (2) For CIFAR-10, we trained MLPs with
widths of {10, 20, 40, 80}. The redundancy ratios and test accuracies of K -means,
K -NN, and logistic regression were calculated for the following training epochs:
{1, 3, 6, 10, 20, 30, 40, 60, 80, 100, 120, 140, 160, 180, 200}.
Experiments concerning the relationship between the sample size and the encoding properties. (1) For MNIST, we trained MLPs with widths of {40, 50, 60} on training
samples with sizes of {10, 30, 60, 100, 300, 600, 1000, 2000, 3000, 6000, 10000,
20000, 30000, 60000} randomly drawn from the training set. (2) For CIFAR-10,
we trained MLPs with widths of {10, 20, 40, 80} on training samples with sizes
of {10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 40000} randomly
drawn from the training set.
Experiments concerning the relationship between regularization and the encoding properties. Three regularizers were tested in our experiments:
• Batch normalization: Adding a batch normalization layer before every ReLU layer.
• Weight decay: Utilizing the L 2 weight regularizer with hyperparameter λ = 0.01.
• Gradient clipping: Setting the clip norm to 1.
Layerwise ablation study. We trained MLPs of width 40 on CIFAR-10. The train-
ing strategy was the same as that previously used for MLPs trained on
CIFAR-10.
Experiments concerning random data. All pixels of the random data were
individually generated from the uniform distribution U (0, 1). Each random example
had dimensions of 28 × 28, i.e., the same as the MNIST images.
Experiments concerning noisy labels. Specified numbers of training exam-
ples from MNIST were assigned random labels in accordance with label noise
ratios of 0.1, 0.2, and 0.3. Then, we trained one-hidden-layer MLPs with widths
of {3, 7, 10, 15, 20, 23, 27, 30, 33, 37, 40, 43, 47, 50, 53, 57, 60, 65, 70, 75, 80, 90,
100} on each noisy training set.
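The two data manipulations above are simple to reproduce. Below is a minimal sketch (NumPy; the sample counts and random seeds are illustrative assumptions, not the values used in the experiments) of generating uniform random images with the MNIST shape and injecting label noise at a given rate.

```python
# A minimal sketch (NumPy) of the random-data and noisy-label constructions described above.
import numpy as np

rng = np.random.default_rng(0)

def random_images(n):
    """Every pixel drawn independently from U(0, 1); shape matches MNIST (28 x 28)."""
    return rng.uniform(0.0, 1.0, size=(n, 28, 28))

def add_label_noise(labels, noise_rate, n_classes=10):
    """Reassign random labels to a fraction `noise_rate` of the examples."""
    labels = labels.copy()
    n_noisy = int(noise_rate * len(labels))
    idx = rng.choice(len(labels), size=n_noisy, replace=False)
    labels[idx] = rng.integers(0, n_classes, size=n_noisy)
    return labels

X_rand = random_images(600)                                        # illustrative sample count
y_noisy = add_label_noise(rng.integers(0, 10, size=600), noise_rate=0.2)
```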

8.5 Additional Experimental Results

This section collects experimental results omitted from the main analysis section for
brevity. The following figure corresponds to the study of the diameters. Please refer
to Sect. 8.3.1.

References

Hanin, Boris, and David Rolnick. 2019. Deep relu networks have surprisingly few activation pat-
terns. In Advances in Neural Information Processing Systems, 361–370.
Jaeger, Herbert. 2001. The “echo state” approach to analysing and training recurrent neural networks-
with an erratum note. Bonn, Germany: German National Research Center for Information Tech-
nology GMD Technical Report 148 (34): 13.
Krizhevsky, Alex, and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images.
Technical report, Citeseer.
LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning
applied to document recognition. Proceedings of the IEEE 86 (11): 2278–2324.
Maass, Wolfgang, Thomas Natschläger, and Henry Markram. 2002. Real-time computing without
stable states: a new framework for neural computation based on perturbations. Neural Computa-
tion 14 (11): 2531–2560.
Nesterov, Yurii E. 1983. A method for solving the convex programming problem with convergence
rate o (1/k2 ). In Dokl. Akad. Nauk Sssr, vol. 269, 543–547.
van der Maaten, Laurens, and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of
Machine Learning Research 9 (Nov): 2579–2605.
Chapter 9
Reflecting on the Role of Overparameterization: Is it Solely Harmful?

One might blame the overparameterized nature of deep learning for its lack of theo-
retical foundations. However, recent works have discovered that this overparameter-
ization also contributes to the success of deep learning. Notably, there is currently
no standard definition of “overparameterization”. In many cases, the definition is
quite restrictive, such as requiring an infinite network width. A promising direction
of future research is to relax the conditions of overparameterization.

9.1 Double Descent and Benign Overfitting

Double descent. Classical machine learning usually exhibits a U-shaped bias-


variance trade-off: the test performance initially increases as the hypothesis com-
plexity increases and then decreases after an inflection point. However, Belkin et al.
(2019) empirically found that this “single-descent” curve shape holds only when the
model is not able to perfectly fit (interpolate) the training data; the bias-variance
curve may also exhibit a second descent after an interpolation point. Nakkiran et al.
(2020) empirically showed that a similar double-descent phenomenon also occurs as
a function of the number of training epochs. These findings motivate a new complex-
ity measure, the effective model complexity, which unifies the two aforementioned
curves as a double-descent curve of the test error versus the effective model com-
plexity. Specifically, the effective model complexity $\mathrm{EMC}_{\mathcal{D},\epsilon}(\mathcal{T})$ of a model $\mathcal{T}$ on a sample set S drawn from the data distribution $\mathcal{D}$ is defined as follows:

$$\mathrm{EMC}_{\mathcal{D},\epsilon}(\mathcal{T}) = \max\{m \mid \mathbb{E}_{S\sim\mathcal{D}^m}[\hat{R}_S(\mathcal{T})] \le \epsilon\}.$$

Neural networks as interpolators. The double-descent phenomenon introduces a


conflict with conventional statistical learning theory. A key factor in this phenomenon
is the interpolation point. A related line of research is focused on studying neural

networks as interpolators. Belkin et al. (2018b) empirically demonstrated that kernel-


based models with zero classification error or near-zero regression error also have
very low test error, even when a high level of label noise exists. Belkin et al. (2018a)
further constructed several examples to illustrate the “blessing of dimensionality”:
the asymptotic risk of the constructed interpolator approaches the Bayesian risk as the
model dimensions increase to infinity. Muthukumar et al. (2020) studied the influence
of a 0-1 loss and a squared loss on overparameterized neural networks. Liang and
Rakhlin (2020) proved that a nonlinear kernel “ridgeless” regression that interpolates
the training data generalizes well on test data. Belkin et al. (2020) calculated the peak
points of the double-descent curves for two families of linear models, while Hastie
et al. (2019) further studied the case of overparameterized linear regression models.
Specifically, as the training sample size m → ∞, the feature dimension p → ∞, and the ratio p/m → γ ∈ (0, ∞),

$$R(\hat\beta) \to \begin{cases} \sigma^2\,\dfrac{\gamma}{1-\gamma}, & \gamma < 1,\\[6pt] r^2\left(1-\dfrac{1}{\gamma}\right) + \sigma^2\,\dfrac{1}{\gamma-1}, & \gamma > 1, \end{cases} \qquad (9.1)$$

where β̂ is the minimum-norm least-squares estimator, β is the target mapping, and


 
$$R(\hat\beta) = \mathbb{E}\big[(x^{\top}\hat\beta - x^{\top}\beta)^2\big].$$
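The double-descent shape summarized in Eq. (9.1) can be observed in a small simulation. The sketch below (NumPy; isotropic Gaussian features, a unit-norm target, and the specific sizes are illustrative assumptions, not the settings of Hastie et al.) estimates the risk of the minimum-norm least-squares interpolator as γ = p/m crosses 1.

```python
# A minimal simulation of the risk of the minimum-norm interpolator beta_hat = X^+ y.
import numpy as np

rng = np.random.default_rng(0)
m, sigma, trials = 100, 0.5, 20            # sample size, noise level, repetitions (illustrative)

def risk(p):
    errs = []
    for _ in range(trials):
        beta = rng.normal(size=p) / np.sqrt(p)      # target mapping with norm ~ 1
        X = rng.normal(size=(m, p))
        y = X @ beta + sigma * rng.normal(size=m)
        beta_hat = np.linalg.pinv(X) @ y            # minimum-norm least-squares estimator
        X_test = rng.normal(size=(2000, p))
        errs.append(np.mean((X_test @ (beta_hat - beta)) ** 2))
    return np.mean(errs)

for p in [20, 50, 80, 95, 105, 150, 400, 1600]:
    print(f"gamma = p/m = {p/m:5.2f}   estimated risk = {risk(p):.3f}")
# The estimated risk peaks near gamma = 1 and descends again in the overparameterized regime.
```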

This line of research has also led to research on benign overfitting in linear regres-
sion (Bartlett et al. 2020), ridge regression (Tsigler and Bartlett 2020), the large-
deviation regime (Chinot and Lerasle 2020), constant-step-size stochastic gradient
descent (SGD) for regression (Zou et al. 2021), and neural networks (Li et al. 2021).
Bartlett et al. (2020) characterized linear regression problems in which the minimum-
norm interpolating prediction rule has near-optimal prediction accuracy. The authors
demonstrated that overparameterization is crucial for benign overfitting: the number of directions in the parameter space must be substantially larger than the sample size. They derived nearly matching lower and upper bounds for the risk
of the minimum-norm interpolating estimator, as follows (Fig. 9.1).

Fig. 9.1 Figure illustrating test loss and training loss versus parameter size for ResNet-18 on
CIFAR-10. See Nakkiran et al. (2021)

Theorem 9.1 Consider a linear regression problem defined in terms of a random covariate vector x, an outcome y, and an optimal parameter vector θ* satisfying $\mathbb{E}(y - x^{\top}\theta^*)^2 = \min_{\theta}\mathbb{E}(y - x^{\top}\theta)^2$. For any positive constant $\sigma_x$, there exist quantities b, c, and $c_1$ for which the following holds. Let us define

$$k^* = \min\{k \ge 0 : r_k(\Sigma) \ge bm\},$$

where $\Sigma = \mathbb{E}[xx^{\top}]$ is the covariance operator and $r_k(\Sigma) = \sum_{i>k}\lambda_i/\lambda_{k+1}$. Suppose that δ < 1 with log(1/δ) < m/c. If $k^* \ge m/c_1$, then $\mathbb{E}[R(\hat\theta)] \ge \sigma^2/c$. Otherwise,

$$R(\hat\theta) \le c\,\|\theta^*\|^2\,\|\Sigma\|\max\left\{\sqrt{\frac{r_0(\Sigma)}{m}},\ \frac{r_0(\Sigma)}{m},\ \sqrt{\frac{\log(1/\delta)}{m}}\right\} + c\,\log(1/\delta)\,\sigma_y^2\left(\frac{k^*}{m} + \frac{m}{R_{k^*}(\Sigma)}\right)$$

with probability at least 1 − δ, and

$$\mathbb{E}[R(\hat\theta)] \ge \frac{\sigma^2}{c}\left(\frac{k^*}{m} + \frac{m}{R_{k^*}(\Sigma)}\right),$$

where $R_k(\Sigma) = \big(\sum_{i>k}\lambda_i\big)^2/\sum_{i>k}\lambda_i^2$. Moreover, there exist universal constants $a_1$, $a_2$, and $m_0$ such that for all $m \ge m_0$, for all Σ, and for all t ≥ 0, there exists a θ* with ‖θ*‖ = t such that, for $x \sim N(0, \Sigma)$ and $y|x \sim N(x^{\top}\theta^*, \|\theta^*\|^2\|\Sigma\|)$, with probability at least 1/4,

$$R(\hat\theta) \ge \frac{1}{a_1}\,\|\theta^*\|^2\,\|\Sigma\|\;\mathbb{I}\!\left[\frac{r_0(\Sigma)}{m\,\log(1+r_0(\Sigma))} \ge a_2\right].$$

Cao et al. (2022) investigated benign overfitting in a two-layer convolutional neural


network (CNN) and found that a two-layer CNN can achieve arbitrarily small training
and test losses when the signal-to-noise ratio satisfies certain conditions.
Theorem 9.2 Suppose that the training sample size m, the network width n, and the dimension d satisfy $d = \tilde{\Omega}(n^8 m^4)$, where $n, m = \Omega(\mathrm{polylog}(d))$; the learning rate satisfies $\eta \le \tilde{O}\big(n^{-2/q}\min\{\|\mu\|_2^{-2},\ (4\sigma_p^2 d)^{-1}\}\big)$; and the initialization level satisfies $\tilde{\Omega}(m d^{-1/2})\cdot\min\{(\sigma_p\sqrt{d})^{-1}, \|\mu\|_2^{-1}\} \le \sigma_0 \le \tilde{O}(n^{-4} m^{-1})\cdot\min\{(\sigma_p\sqrt{d})^{-1}, \|\mu\|_2^{-1}\}$. For any $\epsilon > 0$, if $n\|\mu\|_2^q = \tilde{\Omega}\big(\sigma_p^q(\sqrt{d})^q\big)$, then there exists a $T = \tilde{O}\big(\eta^{-1}\sigma_0^{-(q-2)}\|\mu\|_2^{-q} + \eta^{-1}\epsilon^{-1} n^4\|\mu\|_2^{-2}\big)$ such that, with probability at least $1 - d^{-1}$: (1) the training loss converges to $\epsilon$, i.e., $\hat{R}_S(W^{(T)}) \le \epsilon$; (2) the CNN learns the signal: $\max_r \gamma_{j,r}^{(T)} \ge \Omega(n^{-1/q})$ for $j \in \{-1, +1\}$; and (3) the CNN does not memorize the noise in the training data: $\max_{j,r,i}\zeta_{j,r,i}^{(T)} = \tilde{O}(\sigma_0\sigma_p\sqrt{d})$ and $\max_{j,r,i}|\omega_{j,r,i}^{(T)}| = \tilde{O}(\sigma_0\sigma_p\sqrt{d})$. Here, $\mu$ and $\sigma_p$ represent the signal strength and the noise level, respectively.

This theorem illustrates that if the condition $n\|\mu\|_2^q = \tilde{\Omega}\big(\sigma_p^q(\sqrt{d})^q\big)$ holds, then there exists a filter that learns $\mu$ by achieving $\gamma_{j,r_j^*}^{(T)} \ge \Omega(n^{-1/q})$.


In addition, these authors presented a theorem for noise memorization by a CNN.

Theorem 9.3 Suppose that $d = \tilde{\Omega}(n^8 m^4)$, where $n, m = \Omega(\mathrm{polylog}(d))$; the learning rate satisfies $\eta \le \tilde{O}\big(n^{-2/q}\min\{\|\mu\|_2^{-2},\ (4\sigma_p^2 d)^{-1}\}\big)$; and the initialization level satisfies $\tilde{\Omega}(m d^{-1/2})\cdot\min\{(\sigma_p\sqrt{d})^{-1}, \|\mu\|_2^{-1}\} \le \sigma_0 \le \tilde{O}(n^{-4} m^{-1})\cdot\min\{(\sigma_p\sqrt{d})^{-1}, \|\mu\|_2^{-1}\}$. For any $\epsilon > 0$, if $\sigma_p^q(\sqrt{d})^q = \tilde{\Omega}(m\|\mu\|_2^q)$, then there exists a $T = \tilde{O}\big(\eta^{-1} m(\sigma_p\sqrt{d})^{-q}\sigma_0^{-(q-2)} + \eta^{-1}\epsilon^{-1} m n^4 d^{-1}\sigma_p^{-2}\big)$ such that, with probability at least $1 - d^{-1}$: (1) the training loss converges to $\epsilon$, i.e., $\hat{R}_S(W^{(T)}) \le \epsilon$; and (2) the CNN memorizes the noise in the training data: $\max_r \zeta_{y_i,r,i}^{(T)} \ge \Omega(n^{-1/q})$.

Lower and upper bounds on the population loss achieved by the CNN can be obtained based on the bounds on $\gamma_{j,r}^{(T)}$, $\zeta_{j,r,i}^{(T)}$, and $\omega_{j,r,i}^{(T)}$ in Theorems 9.2 and 9.3.

Theorem 9.4 Suppose that $d = \tilde{\Omega}(n^8 m^4)$, where $n, m = \Omega(\mathrm{polylog}(d))$; the learning rate satisfies $\eta \le \tilde{O}\big(n^{-2/q}\min\{\|\mu\|_2^{-2},\ (4\sigma_p^2 d)^{-1}\}\big)$; and the initialization level satisfies $\tilde{\Omega}(m d^{-1/2})\cdot\min\{(\sigma_p\sqrt{d})^{-1}, \|\mu\|_2^{-1}\} \le \sigma_0 \le \tilde{O}(n^{-4} m^{-1})\cdot\min\{(\sigma_p\sqrt{d})^{-1}, \|\mu\|_2^{-1}\}$. For any $\epsilon > 0$: (1) if $n\|\mu\|_2^q = \tilde{\Omega}\big(\sigma_p^q(\sqrt{d})^q\big)$, then gradient descent will yield a CNN with parameters $\widetilde{W}$ such that $\hat{R}_S(\widetilde{W}) < \epsilon$ and $R(\widetilde{W}) < 6\epsilon + \exp(-n^2)$; and (2) if $\sigma_p^q(\sqrt{d})^q = \tilde{\Omega}\big(n\|\mu\|_2^q\big)$, then gradient descent will yield a CNN with parameters $\widetilde{W}$ such that $\hat{R}_S(\widetilde{W}) < \epsilon$ and $R(\widetilde{W}) = \Omega(1)$, where $R(W) := \mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell\big[y \cdot f(W, x)\big]$.

This theorem further illustrates the phenomenon of phase transition that is revealed
in Theorems 9.2 and 9.3 with respect to the population loss.

9.2 Neural Tangent Kernels (NTKs)

Neal (1995, 1996), Williams (1996), and Hazan and Jaakkola (2015) laid the ground-
work by proving the equivalence between an infinite-width shallow neural network
and a Gaussian process. This fundamental insight provided a solid foundation for
further exploration. Subsequently, researchers including Damianou and Lawrence


(2013), Duvenaud et al. (2014), Hensman and Lawrence (2014), Lawrence and Moore
(2007), Bui et al. (2016), and Lee et al. (2018) gradually expanded this equivalence
to encompass infinite-width neural networks of arbitrary depths. These advance-
ments broadened our understanding of the fundamental connections between neural
networks and Gaussian processes, shedding light on their underlying mathematical
properties. Furthermore, Lee et al. (2019) pushed the boundaries even further by
demonstrating that infinite-width neural networks of any depth behave similarly to
linear models when trained using gradient descent. This comprehensive body of work
not only elucidates the intricate relationship between neural networks and Gaussian
processes but also provides valuable insights into the behavior of deep learning
models during training.
Suppose that f θ is the network function induced by a neural network parameterized
by θ . Further suppose that a realization function F (L) maps the parameters θ to the
network function f θ , where L is the network depth. The network function f θ is
optimized along the gradient with respect to the weights θ in the parameter space.
The dynamics of f θ in the function space (when the network width is infinitely large)
have been proven by Jacot et al. (2018) to follow the gradient with respect to the
following neural tangent kernel (NTK):


$$\Theta^{(L)}(\theta) = \sum_{p=1}^{P} \partial_{\theta_p} F^{(L)}(\theta) \otimes \partial_{\theta_p} F^{(L)}(\theta).$$

The finite-width NTK is defined as


$$H(x, x') = J(x)\, J(x')^{\top}, \qquad (9.2)$$

where $J_{i\alpha}(x) = \partial_{\theta_\alpha} z_i^L(x)$ is the Jacobian evaluated at the point x with respect to parameter α, and $z_i^L(x)$ is the i-th output of the last layer. Jacot et al. (2018) showed that
the NTK converges to a deterministic kernel and remains constant over the course of
training. Subsequently, the authors proved that the expected outputs of an infinitely
wide network can be solved for by means of an ordinary differential equation (ODE)
as follows:

$$\mu_t(X_{\mathrm{train}}) = \big(\mathrm{Id} - e^{-\eta H_{\mathrm{train,\,train}}\, t}\big)\, Y_{\mathrm{train}}. \qquad (9.3)$$

Here, Htrain, train denotes the NTK between the training inputs. As the number of
steps t approaches infinity, Eq. 9.3 reduces to μt (X train ) = Ytrain . Equation 9.3 can
be further written as

$$\tilde\mu_t(X_{\mathrm{train}})_i = \big(1 - e^{-\eta\lambda_i t}\big)\, \tilde{Y}_{\mathrm{train},\, i}, \qquad (9.4)$$

where the λi are the eigenvalues of Htrain, train and μ̃t (X train ) and Ỹtrain are the mean
predictions and labels, respectively. Given the ordered eigenvalues as λ0 ≥ · · ·λm ,
Lee et al. (2019) assumed that the maximum feasible learning rate scales as η ∼ 2/λ0 ,
which has been empirically verified by Xiao et al. (2020). Plugging η ∼ 2/λ0 into Eq. 9.3, we find that the slowest eigenmode, associated with λm, converges exponentially at a rate governed by κ = λ0/λm, i.e., the condition number. This shows that if the condition number of the NTK associated with a neural network diverges, then the network becomes untrainable. Therefore, κ can be used as a metric for trainability. Investigation has shown that κ is inversely related to the performance of an architecture, so we generally wish to find a network with a smaller κ.
Jacot et al. (2018) further proved that, for a least-squares regression problem, the learned model is characterized by a linear differential equation in the infinite-width regime during the training process.
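To make the finite-width NTK of Eq. (9.2) and the linearized dynamics of Eq. (9.3) concrete, here is a minimal sketch (NumPy only; the one-hidden-layer ReLU network, its sizes, and the toy regression targets are illustrative assumptions rather than the settings analysed above).

```python
# A minimal sketch: empirical NTK H = J J^T for a one-hidden-layer ReLU network with
# scalar output, and the NTK-linearized training-set predictions of Eq. (9.3).
import numpy as np

rng = np.random.default_rng(0)
d, width, m = 5, 256, 20                         # input dim, hidden width, sample size (toy)
W = rng.normal(0, 1 / np.sqrt(d), (width, d))    # first-layer weights
v = rng.normal(0, 1 / np.sqrt(width), width)     # output weights

def jacobian(x):
    """Gradient of f(x) = v^T relu(W x) with respect to all parameters (W and v), flattened."""
    pre = W @ x
    act = np.maximum(pre, 0.0)
    gate = (pre > 0).astype(float)
    dW = np.outer(v * gate, x)                   # d f / d W
    dv = act                                     # d f / d v
    return np.concatenate([dW.ravel(), dv])

X = rng.normal(size=(m, d))
Y = rng.normal(size=m)                           # toy regression targets

J = np.stack([jacobian(x) for x in X])           # (m, n_params)
H = J @ J.T                                      # empirical NTK on the training inputs

# Eq. (9.3): mu_t = (I - exp(-eta * H * t)) Y, evaluated via the eigendecomposition of H.
lam, Q = np.linalg.eigh(H)
eta = 1.0 / lam.max()                            # a step size below the ~2/lambda_0 stability limit
def mu(t):
    return Q @ ((1 - np.exp(-eta * lam * t)) * (Q.T @ Y))

print("condition number kappa =", lam.max() / max(lam.min(), 1e-12))
print("||mu_t - Y|| at t=1e4  =", np.linalg.norm(mu(1e4) - Y))
```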

9.3 Loss Surface and Optimization

Influence of overparameterization on the loss surface. Overparameterization sig-


nificantly influences the shapes of the loss surfaces of neural networks. Choromanska
et al. (2015) empirically showed that (1) a large proportion of the local minima on
the loss surface of an overparameterized network are “equivalent” to each other (they
have the same empirical risk) and (2) neural networks of small or normal sizes have
spurious local minima, but the proportion of spurious local minima decreases rapidly
as the network size increases. Li et al. (2018) proved that overparameterized fully
connected deep neural networks do not have any spurious local minima under the fol-
lowing assumptions: (1) the activation functions are continuous and (2) the loss func-
tions are convex and differentiable. Nguyen et al. (2019) extended the “no bad local
minima” property to networks trained under a cross-entropy loss. Nguyen (2019) fur-
ther discovered that the global minima are connected to each other and concentrated
in a unique valley if the neural network is sufficiently overparameterized.
Influence of overparameterization on optimization. It has been shown that
overparameterization contributes to ensuring the optimization performance of
gradient-based optimizers. Li and Liang (2018) proved that overparameterized neu-
ral networks trained via SGD on data drawn from a mixture of several distributions
have a small generalization error. Du et al. (2019b) proved that, for a two-layer neural network, as long as the network width is sufficiently large, gradient descent converges to a global minimum at a linear convergence rate:

$$\mathbb{P}\left[\|\hat{y}(t) - y\|_2^2 \le \exp(-\lambda_0 t)\,\|\hat{y}(0) - y\|_2^2\right] \ge 1 - \delta,$$

where ŷ(t) is the prediction at iteration t, λ0 > 0 is a constant real number, and 0 < δ < 1 is a real number. This result relies on the following overparameterization condition: the number of hidden units must satisfy $n = \Omega\big(m^6 \lambda_0^{-4} \delta^{-3}\big)$. The authors
also show that overparameterization can restrict the weights to be close to the random
initialization. In Chizat and Bach (2018), the optimization of a one-hidden-layer network was modelled as the minimization of a convex function of a measure discretized
into a mixture of particles via continuous-time gradient descent. They proved that
the gradient flow characterizing the optimization converges to a global minimizer.
Going beyond a single hidden layer, Allen-Zhu et al. (2019b), Du et al. (2019a),
and Zou et al. (2020) concurrently proved that SGD converges to a globally optimal
solution for an overparameterized deep neural network in polynomial time under
slightly different assumptions on the network architecture and training data. Chen
et al. (2019) proved that the convergence of optimization is guaranteed if the network width is polylogarithmic in the sample size m and 1/ε, where ε is the target error. A network width of

$$n(\delta, R, L) = \tilde{O}\big(\mathrm{poly}(R, L)\, \log^{4/3}(m/\delta)\big)$$

is needed to ensure that, with probability at least 1 − δ, SGD with a specific learning rate achieves an error no larger than

$$\frac{8L^2 R^2}{m} + \frac{8\log(2/\delta)}{m} + 24\,\epsilon_{\mathrm{NTRF}},$$

where $\epsilon_{\mathrm{NTRF}}$ is the minimum achievable training loss, L is the network depth, R is the radius of the neural tangent random feature (NTRF) function class (Cao and Gu 2019), and n(δ, R, L) is the network width.

9.4 Generalization and Learnability

Under the assumption that the nodes in the first layer all have linear functions while
the hidden nodes in the last layer are nonlinear, Andoni et al. (2014) studied two-
layer neural networks. These authors proved that if a sufficiently wide neural network
is trained with a generic gradient descent algorithm and all weights are randomly
initialized, it can learn any low-degree polynomial function. Brutzkus et al. (2018)
also gave generalization guarantees for two-layer neural networks under specific
assumptions. Arora et al. (2019) and Allen-Zhu et al. (2019a) proved that for neural
networks with two or three layers, the sample complexity is almost independent of
the parameter size. Arora et al. (2019) proved that any one-hidden-layer rectified
linear unit (ReLU) network trained via gradient descent under a 1-Lipschitz loss has
a generalization error of at most

$$\sqrt{\frac{2\, y^{\top} (H^{\infty})^{-1} y}{m}},$$

where $y = (y_1, \ldots, y_m)^{\top}$, the $y_i$ ($i = 1, \ldots, m$) are the labels, and $H^{\infty}$ is the Gram matrix defined, for all $i, j = 1, \ldots, m$, as follows:

$$H^{\infty}_{i,j} = \mathbb{E}_{w\sim\mathcal{N}(0,I)}\left[x_i^{\top} x_j\, \mathbb{I}_{\{w^{\top} x_i \ge 0,\ w^{\top} x_j \ge 0\}}\right] = \frac{x_i^{\top} x_j\,\big(\pi - \arccos(x_i^{\top} x_j)\big)}{2\pi}.$$

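This bound is fully data-dependent and easy to evaluate. Below is a small sketch (NumPy; the random inputs on the unit sphere and the toy labels are placeholders, not the data used by Arora et al.) that builds the Gram matrix above and computes the quantity $\sqrt{2\,y^{\top}(H^{\infty})^{-1}y/m}$.

```python
# A minimal sketch of evaluating the data-dependent generalization bound above on toy data.
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 10
X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # put each x_i on the unit sphere
y = np.sign(X[:, 0])                               # toy labels in {-1, +1}

G = np.clip(X @ X.T, -1.0, 1.0)                    # inner products x_i^T x_j
H_inf = G * (np.pi - np.arccos(G)) / (2 * np.pi)   # Gram matrix H^infinity

quad = y @ np.linalg.solve(H_inf, y)
print("y^T (H^inf)^{-1} y / m :", quad / m)
print("bound sqrt(2 * quad/m) :", np.sqrt(2 * quad / m))
```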
Chen et al. (2019) showed that generalization is guaranteed if the network width is
polylogarithmic in the sample size m and 1/ε, where ε is the target error. Cao and Gu
(2019) also proved the generalization bounds for wide and deep neural networks. Wei
et al. (2019) proved that regularizers can significantly influence the generalization
and optimization properties.
Influence of the network depth on generalizability. Summarizing various practical applications of neural networks, Canziani et al. (2016) suggested that deeper neural networks may have better generalizability. This understanding plays
a major part in the reconciliation between the overparameterization of neural net-
works and their excellent generalizability. In addition to the previous results, there
is another possible explanation that originates from information theory. Zhang et al.
(2018) adopted the techniques developed by Xu and Raginsky (2017) and proved a
generalization bound to characterize how the generalizability evolves as a network
becomes deeper. The expectation of the generalization error, E[R − R̂ S ], has the
following upper bound:

$$\exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{m}\, I(S, W)},$$

where L is the network depth, 0 < η < 1 is a real constant, S is the training sample,
and W denotes the parameters of the learned hypothesis.

References

Allen-Zhu, Zeyuan, Yuanzhi Li, and Yingyu Liang. 2019a. Learning and generalization in over-
parameterized neural networks, going beyond two layers. In Advances in Neural Information
Processing Systems, 6158–6169.
Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. 2019b. A convergence theory for deep learning
via over-parameterization. In International Conference on Machine Learning.
Andoni, Alexandr, Rina Panigrahy, Gregory Valiant, and Li Zhang. 2014. Learning polynomials
with neural networks. In International Conference on Machine Learning, 1908–1916.
Arora, Sanjeev, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. 2019. Fine-grained analysis of
optimization and generalization for overparameterized two-layer neural networks. arXiv preprint
arXiv:1901.08584.
Bartlett, Peter L, Philip M Long, Gábor Lugosi, and Alexander Tsigler. 2020. Benign overfitting in
linear regression. Proceedings of the National Academy of Sciences 117 (48): 30063–30070.
Belkin, Mikhail, Daniel J Hsu, and Partha Mitra. 2018a. Overfitting or perfect fitting? risk bounds
for classification and regression rules that interpolate. Advances in Neural Information Processing
Systems 31.
Belkin, Mikhail, Siyuan Ma, and Soumik Mandal. 2018b. To understand deep learning we need to
understand kernel learning. In International Conference on Machine Learning, 541–549.
Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. Reconciling modern machine-
learning practice and the classical bias-variance trade-off. Proceedings of the National Academy
of Sciences 116 (32): 15849–15854.
Belkin, Mikhail, Daniel Hsu, and Ji Xu. 2020. Two models of double descent for weak features.
SIAM Journal on Mathematics of Data Science 2 (4): 1167–1180.
Brutzkus, Alon, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. 2018. SGD learns over-
parameterized networks that provably generalize on linearly separable data. In International
Conference on Learning Representations.
Bui, Thang, Daniel Hernández-Lobato, Jose Hernandez-Lobato, Yingzhen Li, and Richard Turner.
2016. Deep Gaussian processes for regression using approximate expectation propagation. In
International Conference on Machine Learning, 1472–1481.
Canziani, Alfredo, Adam Paszke, and Eugenio Culurciello. 2016. An analysis of deep neural network
models for practical applications. arXiv preprint arXiv:1605.07678.
Cao, Yuan, and Quanquan Gu. 2019. Generalization bounds of stochastic gradient descent for wide
and deep neural networks. In Advances in Neural Information Processing Systems, 10836–10846.
Cao, Yuan, Zixiang Chen, Mikhail Belkin, and Quanquan Gu. 2022. Benign overfitting in two-layer
convolutional neural networks. arXiv preprint arXiv:2202.06526.
Chen, Zixiang, Yuan Cao, Difan Zou, and Quanquan Gu. 2019. How much over-parameterization
is sufficient to learn deep ReLU networks? arXiv preprint arXiv:1911.12360.
Chinot, Geoffrey, and Matthieu Lerasle. 2020. Benign overfitting in the large deviation regime.
arXiv preprint arXiv:2003.05838, 1(5).
Chizat, Lenaic, and Francis Bach. 2018. On the global convergence of gradient descent for over-
parameterized models using optimal transport. In Advances in Neural Information Processing
Systems, 3036–3046.
Choromanska, Anna, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. 2015.
The loss surfaces of multilayer networks. In International Conference on Artificial Intelligence
and Statistics.
Damianou, Andreas, and Neil D Lawrence. 2013. Deep Gaussian processes. In Artificial Intelligence
and Statistics, 207–215.
Du, Simon, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. 2019a. Gradient descent finds
global minima of deep neural networks. In International Conference on Machine Learning,
1675–1685.
Du, Simon S., Xiyu Zhai, Barnabas Poczos, and Aarti Singh. 2019b. Gradient descent prov-
ably optimizes over-parameterized neural networks. In International Conference on Learning
Representations.
Duvenaud, David, Oren Rippel, Ryan Adams, and Zoubin Ghahramani. 2014. Avoiding pathologies
in very deep networks. In Artificial Intelligence and Statistics, 202–210.
Hastie, Trevor, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. 2019. Surprises in
high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560.
Hazan, Tamir, and Tommi Jaakkola. 2015. Steps toward deep kernel methods from infinite neural
networks. arXiv preprint arXiv:1508.05133.
Hensman, James, and Neil D Lawrence. 2014. Nested variational compression in deep Gaussian
processes. arXiv preprint arXiv:1412.1370.
Jacot, Arthur, Franck Gabriel, and Clément Hongler. 2018. Neural tangent kernel: convergence
and generalization in neural networks. In Advances in Neural Information Processing Systems,
8571–8580.
Lawrence, Neil D, and Andrew J Moore. 2007. Hierarchical Gaussian process latent variable models.
In International Conference on Machine Learning, 481–488.
Lee, Jaehoon, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-
Dickstein, and Jeffrey Pennington. 2019. Wide neural networks of any depth evolve as linear
models under gradient descent. In Advances in Neural Information Processing Systems, 8572–
8583.
Lee, Jaehoon, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha
Sohl-Dickstein. 2018. Deep neural networks as Gaussian processes. In International Conference
on Learning Representations.
Li, Dawei, Tian Ding, and Ruoyu Sun. 2018. Over-parameterized deep neural networks have no
strict local minima for any continuous activations. arXiv preprint arXiv:1812.11039.
Li, Yaopeng, Ming Jia, Xu Han, and Xue-Song Bai. 2021. Towards a comprehensive optimization of
engine efficiency and emissions by coupling artificial neural network (ann) with genetic algorithm
(ga). Energy 225: 120331.
Li, Yuanzhi, and Yingyu Liang. 2018. Learning overparameterized neural networks via stochastic
gradient descent on structured data. Advances in Neural Information Processing Systems 31:
8157–8166.
Liang, Tengyuan, and Alexander Rakhlin. 2020. Just interpolate: Kernel “ridgeless” regression can
generalize. The Annals of Statistics 48 (3): 1329–1347.
Muthukumar, Vidya, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and
Anant Sahai. 2020. Classification vs regression in overparameterized regimes: does the loss
function matter? arXiv preprint arXiv:2005.08054.
Nakkiran, Preetum, Behnam Neyshabur, and Hanie Sedghi. 2020. The deep bootstrap framework:
Good online learners are good offline generalizers. arXiv preprint arXiv:2010.08127.
Nakkiran, Preetum, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. 2021.
Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics:
Theory and Experiment 2021 (12): 124003.
Neal, Radford M. 1995. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto.
Neal, Radford M. 1996. Priors for infinite networks. In Bayesian Learning for Neural Networks,
29–53. Springer.
Nguyen, Quynh, Mahesh Chandra Mukkamala, and Matthias Hein. 2019. On the loss landscape of a
class of deep neural networks with no bad local valleys. In International Conference on Learning
Representations.
Nguyen, Quynh. 2019. On connected sublevel sets in deep learning. In International Conference
on Machine Learning.
Tsigler, Alexander, and Peter L Bartlett. 2020. Benign overfitting in ridge regression. arXiv preprint
arXiv:2009.14286.
Wei, Colin, Jason D Lee, Qiang Liu, and Tengyu Ma. 2019. Regularization matters: generalization
and optimization of neural nets vs their induced kernel. In Advances in Neural Information
Processing Systems, 9712–9724.
Williams, Christopher. 1996. Computing with infinite networks. Advances in Neural Information
Processing Systems 9: 295–301.
Xiao, Lechao, Jeffrey Pennington, and Samuel Schoenholz. 2020. Disentangling trainability and
generalization in deep neural networks. In International Conference on Machine Learning,
10462–10472.
Xu, Aolin, and Maxim Raginsky. 2017. Information-theoretic analysis of generalization capability
of learning algorithms. In Advances in Neural Information Processing Systems, 2524–2533.
Zhang, Jingwei, Tongliang Liu, and Dacheng Tao. 2018. An information-theoretic view for deep
learning. arXiv preprint arXiv:1804.09060.
Zou, Difan, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, and Sham Kakade. 2021. Benign
overfitting of constant-stepsize sgd for linear regression. In Conference on Learning Theory,
4633–4635.
Zou, Difan, Yuan Cao, Dongruo Zhou, and Quanquan Gu. 2020. Gradient descent optimizes over-
parameterized deep ReLU networks. Machine Learning 109 (3): 467–492.
Chapter 10
Theoretical Foundations for Specific Architectures

In the previous chapters, we have presented an overview of the generalizability


of neural networks. This chapter discusses the influence of some specific network
structures. Convolutional neural networks (CNNs) introduce convolutional layers
into deep learning, which have been widely applied in computer vision, natural
language processing, and deep reinforcement learning. Recurrent neural networks
(RNNs) possess a recurrent structure and shows its promising performance in the
processing and analysis of sequential data. They have been applied to many real-
world scenarios, including NLP and speech recognition. Recently, equivariant neural
networks have shown great success in various areas, including 3D point cloud
processing, chemistry, and astronomy. The intuition is that when the prior symmetry
in the data can be maintained, networks can achieve better performance.

10.1 Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)

Convolutional neural networks. Convolutional neural networks (CNNs) introduce


convolutional layers into deep learning, which have been widely applied in computer
vision (Krizhevsky et al. 2012), natural language processing (Yu et al. 2018), and
deep reinforcement learning (Silver et al. 2016).
Du et al. (2018) proved that, to achieve a population prediction error of ε on d-dimensional input, a c-dimensional convolutional filter with a linear activation requires $\tilde{O}(c/\epsilon^2)$ training examples, while a fully connected layer requires $\Omega(d/\epsilon^2)$ training examples. Moreover, the authors further proved that a one-hidden-layer CNN with output dimensionality r and a fixed ratio of the stride size to the filter size requires $\tilde{O}\big((c + r)/\epsilon^2\big)$ training examples to achieve this. Zhou and Feng (2018) proved (1) that, given the magnitude $b_i$ of the parameters in the i-th layer, a generalization bound for a CNN is $O\big(\log \prod_{i=0}^{D} b_i\big)$, and (2) a one-to-one correspondence between convergence guarantees for the stationary points and their

population counterparts. Lin and Zhang (2019) studied the influence of the sparsity and
permutation of convolutional layers on the spectral norm and generalizability. Other
works have also studied the generalizability of CNNs with residual networks (He
et al. 2016), including He et al. (2020) and Chen et al. (2019a).
Recurrent neural networks. Recurrent neural networks (RNNs) possess a recur-
rent structure and show promising performance in the processing and analysis
of sequential data. They have been applied to many real-world scenarios, including
NLP (Bahdanau et al. 2015; Sutskever et al. 2014) and speech recognition (Graves
et al. 2006, 2013). Chen et al. (2019b) developed a generalization bound for RNNs
based on works by Neyshabur et al. (2017) and Bartlett et al. (2017). Allen-Zhu and
Li (2019) also analysed the generalizability of RNNs.
Based on the Fisher-Rao norm, Tu et al. (2020) developed a gradient measure and
then derived a generalization bound as follows.
Theorem 10.1 Let us fix a margin parameter α; then, for any δ > 0, with prob-
ability at least 1 − δ, the following holds for every RNN whose weight matrices
θ = (U, W, V ) satisfy ||V T ||1 ≤ βV , ||W T ||1 ≤ βW , ||U T ||1 ≤ βU and ||θ || f s ≤ r :
 
$$\mathbb{E}\big[\mathbb{I}_{M_{y_L}(x,y)\le 0}\big] \le \frac{4k}{\alpha}\left(\frac{r\,\|X\|_F}{\sqrt{2m\,\lambda_{\min}\big(\mathbb{E}(XX^{\top})\big)}} + \frac{1}{m}\,\beta_V\beta_U\|X^{\top}\|_1\right) + \frac{1}{m}\sum_{i=1}^{m}\mathbb{I}_{M_{y_L}(x_i,y_i)\le \alpha} + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \qquad (10.1)$$

This gradient measure builds a relationship between generalizability and trainability


and yields a tighter bound. It shows that adding noise to the input data is a way to
boost the generalizability. It also justifies the use of the gradient clipping technique
(Merity et al. 2018; Gehring et al. 2017; Peters et al. 2018).

10.2 Equivariant Neural Networks

Recently, equivariant neural networks have shown great success in various areas,
including 3D point cloud processing (Li et al. 2018), chemistry (Faber et al. 2016),
and astronomy (Ntampaka et al. 2016; Ravanbakhsh et al. 2016). The intuition is that
when the prior symmetry in the data can be maintained, networks can achieve better
performance. As an example, objects in images of different classes can be oblique,
while all rotated versions of the same image are expected to belong to the same class.
Hence, there is some prior symmetry in such data, and if a model can preserve this
symmetry, it will enjoy better performance.
This section is organized as follows. First, there are two main ways to design a new
equivariant network. One way is to modify a traditional network; as an example, we
consider group CNNs, which are impressive traditional models of this type. The other
way is to limit the layers such that they always remain equivariant. Subsequently,
we discuss the nonlinearity in equivariant neural networks. Finally, we present a


theoretical analysis, including analyses of generalization and approximation, which
show the benefits of equivariant neural networks.

10.2.1 Group CNNs

An equivariant neural network was generalized and developed by Cohen and Welling
(2016a), who noted the following shift equivariance in CNNs:

$$\Phi(T_g x) = T'_g\, \Phi(x) \quad \text{for all } x, \qquad (10.2)$$

where Φ is a convolutional layer, $T_g$ is a shift transformation acting on the input x, and $T'_g$ is the corresponding transformation acting on the output. That is, transforming the input x by a transformation $T_g$ is equivalent to transforming the output Φ(x) by a corresponding transformation $T'_g$, which is independent of the input x. In addition,
the transformation T should be a linear representation of the group G, that is, for
two elements g and h of the group G,

Tgh = Tg Th . (10.3)

However, convolution is shift equivariant but not rotation equivariant. To overcome


this shortcoming, the authors defined the operation of group convolution, which is
always equivariant with regular representations. Let f denote the input and ψ denote
a filter; then, group convolution is defined as

$$(f \star \psi)(g) = \sum_{h\in G}\sum_{k} f_k(h)\,\psi_k(g^{-1}h). \qquad (10.4)$$

For a regular representation L such that L h f (g) = f (h −1 g), it can be proven that

$$(L_u f)\star\psi = L_u(f\star\psi) \quad \text{for all } u \in G. \qquad (10.5)$$
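To make Eqs. (10.4) and (10.5) concrete, the sketch below numerically checks the equivariance of group convolution for a toy finite group (NumPy; the cyclic group C4, the channel count, and the random signals are illustrative assumptions, not constructions from the text). The same check works for any finite group once its multiplication and inverse tables are supplied.

```python
# A minimal numerical check of group convolution (Eq. 10.4) and its equivariance (Eq. 10.5)
# on the cyclic group C4, whose "product" is g*h = (g + h) mod 4.
import numpy as np

G = 4                                    # |C4|
mul = lambda g, h: (g + h) % G           # group multiplication
inv = lambda g: (-g) % G                 # group inverse

rng = np.random.default_rng(0)
K = 3                                    # number of channels k
f = rng.normal(size=(K, G))              # input signal f_k(h) on the group
psi = rng.normal(size=(K, G))            # filter psi_k(h)

def group_conv(f, psi):
    """(f * psi)(g) = sum_h sum_k f_k(h) psi_k(g^{-1} h)."""
    out = np.zeros(G)
    for g in range(G):
        for h in range(G):
            out[g] += np.sum(f[:, h] * psi[:, mul(inv(g), h)])
    return out

def L(u, signal):
    """Regular representation: (L_u f)(g) = f(u^{-1} g), applied along the group axis."""
    return signal[..., [mul(inv(u), g) for g in range(G)]]

u = 1                                    # "rotate by 90 degrees"
lhs = group_conv(L(u, f), psi)           # transform the input, then convolve
rhs = L(u, group_conv(f, psi))           # convolve, then transform the output
print(np.allclose(lhs, rhs))             # True: group convolution is equivariant
```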

Beyond the convolution itself, the equivariance of the nonlinearity and the other operations should also be maintained. For any pointwise nonlinearity Cv , it can be verified
that

Cv L h f = v ◦ f ◦ h −1 = (v ◦ f ) ◦ h −1 = L h (v ◦ f ) = L h Cv f. (10.6)

In addition, these authors defined subgroup pooling as follows:

$$P(f)(g) = \max_{h \in gU} f(h), \qquad (10.7)$$
where U is a subgroup of G and gU = {gu : u ∈ U } is a coset. The reader can


prove that P L h f = L h P f . As seen from the examples above, each operator in an
equivariant neural network should itself be equivariant. Thus, when a new equivariant
neural network is designed, all operators should be proven to be equivariant. The original operator is sometimes not equivariant; in this case, it needs to be modified into an equivariant version.

10.2.2 Steerable Neural Networks

10.2.2.1 Steerable CNNs

In Cohen and Welling (2016b), the authors showed another way to achieve equivari-
ance, that is, constraining the filter such that the corresponding layer is equivariant.
A subgroup H of G is often considered first; for any h ∈ H, the filter Ψ satisfies

$$\psi_h \circ \Psi = \Psi \circ \rho_h \quad \text{for all } h \in H \qquad (10.8)$$

for some linear representation ψ of H acting on the output of the layer and ρ acting on the input of the layer. Note that if the constraint is linear, then the solution space will be linear. The space of all solutions is denoted by Hom_H(ρ, ψ) because an equivariant map is a “homomorphism of group representations”. Equivariant maps are also called intertwiners (Serre et al. 1977). Before training, a basis Ψ_1, ..., Ψ_m of the solution space can be calculated, and all admitted weights can be represented as linear combinations of this basis, of the form Ψ = Σ_i α_i Ψ_i. For any
G, there are many choices of different linear representations, and thus, a steerable
neural network is flexible and efficient. By training the coefficients of the linear
combination, steerable neural networks can achieve excellent performance in various
tasks; examples include graph neural networks (Maron et al. 2018), E(2)-steerable
CNNs (Weiler and Cesa 2019), and rotation-equivariant steerable CNNs (Weiler
et al. 2018b). In the following, we will present some theoretical implementations of
steerable neural networks.
A linear representation of G can be defined as a representation induced by a linear representation of H. The following example comes from Cohen and Welling (2016b). The correlation Ψ ⋆ f can be computed as

$$(\Psi \star f)(x) = \Psi\big(\rho^{-1}(x) f\big), \qquad (10.9)$$

where x ∈ Z² is viewed as a translation. Let H represent all transformations acting on the input f, and let t ∈ Z² and r ∈ H. It can be verified that

$$[\Psi \star \rho(tr) f](x) = \psi(r)\,[\Psi \star f]\big((tr)^{-1}x\big), \qquad (10.10)$$


where t and r denote a translation and a transformation, respectively. Thus, if we consider a representation ψ′ such that

$$[\psi'(tr) f](x) = \psi(r)\, f\big((tr)^{-1}x\big), \qquad (10.11)$$

then ψ′(tr)[Ψ ⋆ f] = Ψ ⋆ [ρ(tr) f]. The representation ψ′ is the representation of G induced by ψ, which is denoted by $\mathrm{Ind}_H^G\,\psi$.
Any linear representation can be decomposed into a direct sum of irreducible
representations, that is,

    ψ = P^{-1} (⊕_i ψ_i) P,    (10.12)

where P is an invertible matrix and ψ_i is an irreducible representation for all i.
Similarly, if we write ρ as Q^{-1} (⊕_j ρ_j) Q, then the equation ψ Ψ = Ψ ρ can be written
as

    (⊕_i ψ_i) Ψ̃ = Ψ̃ (⊕_j ρ_j),    (10.13)

where Ψ̃ = P Ψ Q^{-1}. Then, we can decompose Ψ̃ into blocks [Ψ̃_{ij}]_{ij} such that ψ_i Ψ̃_{ij} =
Ψ̃_{ij} ρ_j. By Schur's lemma, if ψ_i and ρ_j are not isomorphic, then Ψ̃_{ij} = 0.
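In practice, the basis of the solution space Hom_H(ρ, ψ) can be computed numerically before training by solving the linear constraint (10.8). The following minimal sketch (our own example; the representations of H = Z_2 and the helper names are assumptions for illustration) vectorizes the constraint and extracts a basis from its null space.

import numpy as np
from scipy.linalg import null_space

# Two representations of H = Z_2 = {e, s}: rho acts on the input, psi on the output.
rho = {0: np.eye(2), 1: np.diag([1.0, -1.0])}
psi = {0: np.eye(2), 1: np.array([[0.0, 1.0], [1.0, 0.0]])}

# psi_h W - W rho_h = 0  <=>  (I ⊗ psi_h - rho_h^T ⊗ I) vec(W) = 0 (column-major vec).
rows = [np.kron(np.eye(2), psi[h]) - np.kron(rho[h].T, np.eye(2)) for h in rho]
basis = null_space(np.vstack(rows))                      # columns span Hom_H(rho, psi)
filters = [b.reshape(2, 2, order="F") for b in basis.T]  # back from vec to matrices

# Any admitted weight is a linear combination of the basis; only the coefficients are trained.
alpha = np.random.default_rng(0).normal(size=len(filters))
W = sum(a * Psi for a, Psi in zip(alpha, filters))
for h in rho:
    assert np.allclose(psi[h] @ W, W @ rho[h])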

10.2.2.2 Steerable Neural Networks

A general equivariant feedforward neural network (without bias) can be expressed


as
    W_L σ(W_{L−1} σ(· · · σ(W_1 x))),    (10.14)

where W_k satisfies

    ψ^k ∘ W_k = W_k ∘ ψ^{k−1}    (10.15)

and σ satisfies that for any k ∈ [L],

    ψ^k ∘ σ = σ ∘ ψ^k.    (10.16)

One can verify that networks satisfying these two constraints are equivariant. We first
assume that σ is a pointwise nonlinearity. Constraint (10.16) will be discussed further
in the following section. We can prove that the image of the linear representation
satisfying constraint (10.16) must be a set of generalized permutation matrices, where
each ψg is a permutation matrix but the value of each nonzero entry need not be 1.
In addition, the network can be expressed as

    F(x) = W̃_L σ(W̃_{L−1} σ(· · · σ(W̃_1 x̃))),    (10.17)

where W̃_L = [ψ^{L−1}_{g_1} W_L , . . . , ψ^{L−1}_{g_t} W_L],

    x̃ = [ψ^0_{g_1} ; ψ^0_{g_2} ; . . . ; ψ^0_{g_t}] x,

and W̃_k is the t × t block matrix whose (i, j)-th block is ψ^k_{g_i g_j^{−1}} W_k, that is,

    W̃_k = [ψ^k_{g_i g_j^{−1}} W_k]_{i,j=1}^{t},

with any k < L and any arbitrary Wk . Applications of steerable neural networks have
also been reported. For example, Maron et al. (2018) considered equivariant graph
neural networks and solved for the solution space Hom_G(ρ, ψ).
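The weight-tying pattern behind Eq. (10.17) can be illustrated for the regular representation of a small group. In the sketch below (our own construction; the group Z_3 and the block size are arbitrary choices), the block of W̃ at position (i, j) depends only on g_i g_j^{-1}, which makes the resulting layer commute with the block permutations induced by the group.

import numpy as np

t, d = 3, 2                                      # group Z_3; each feature block has dimension d
rng = np.random.default_rng(0)
B = [rng.normal(size=(d, d)) for _ in range(t)]  # one base weight per group element

# Weight tying: block (i, j) is B[(i - j) mod t], so W_tilde is block-circulant.
W_tilde = np.block([[B[(i - j) % t] for j in range(t)] for i in range(t)])

def block_shift(u):
    # regular representation of Z_3 on stacked features: cyclically permute the t blocks
    P = np.zeros((t, t))
    for i in range(t):
        P[(i + u) % t, i] = 1.0
    return np.kron(P, np.eye(d))

x = rng.normal(size=t * d)
for u in range(t):
    assert np.allclose(W_tilde @ (block_shift(u) @ x), block_shift(u) @ (W_tilde @ x))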

10.2.2.3 The Dimensionality of Hom G (ρ, ψ)

Recall the definition of Hom_G(ρ, ψ) = {W : W ∘ ρ = ψ ∘ W}. Here, we will intro-
duce a result on the dimensionality of Hom_G(ρ, ψ). More details can be found in
Reeder (2014). The dimensionality of Hom_G(ρ, ψ) is smaller than that of the weight
matrix. The equation can be written as

    (ρ_g^T ⊗ ψ_g^{-1}) vec(W) = vec(W)   for all g ∈ G.    (10.18)

Let us define ρ' : G → GL(Hom(V, V')) as ρ'_g W = ψ_g^{-1} W ρ_g. If f is in
Hom_G(ρ, ψ), then ρ'_g f = f for any g ∈ G. Now, we consider that P = (1/|G|) Σ_{g∈G} ρ'_g
is a projection from Hom(V, V') to Hom_G(ρ, ψ). Then, we know that

    Hom(V, V') = Ker P ⊕ Hom_G(ρ, ψ),    (10.19)

and in turn,

    dim Hom_G(ρ, ψ) = tr(P) = (1/|G|) Σ_{g∈G} χ_{ρ'}(g),    (10.20)

where χ_{ρ'}(g) = tr(ρ'_g) is the character. Equation (10.18) implies that χ_{ρ'}(g) =
χ_{ρ^T ⊗ ψ^{-1}}(g) = χ_ρ(g) χ_ψ(g^{-1}). Finally, we know that

    dim Hom_G(ρ, ψ) = (1/|G|) Σ_{g∈G} χ_ρ(g) χ_ψ(g^{-1}) = ⟨χ_ρ, χ_ψ⟩.    (10.21)

As mentioned above, any representation can be decomposed into a direct sum of


irreducible representations. For a finite group G, the number of different (non-
isomorphic) irreducible representations is equal to the number of conjugacy classes of G. Let all
irreducible representations be denoted by ψ_1, . . . , ψ_n, and let ρ = Q^{-1}(⊕_i m_i ψ_i)Q
and ψ = P^{-1}(⊕_i m'_i ψ_i)P, where m_i and m'_i denote multiplicities. Note that the
characters χ_{ψ_1}, . . . , χ_{ψ_n} form an orthonormal set; then,

    dim Hom_G(ρ, ψ) = ⟨χ_ρ, χ_ψ⟩ = Σ_{i=1}^{n} m_i m'_i.    (10.22)

From this result, we know that the dimensionality of Hom_G(ρ, ψ) can be modified to
any desired value. In addition, in practice, we need only to compute Hom_G(ψ_i, ψ_j) for
all irreducible representations ψi and ψ j . However, the linear representation should
commute with the nonlinearity, and thus, the choice of the invertible matrix P is also
important. We discuss this topic in the following section.
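Equation (10.21) can be evaluated directly for small finite groups. The following sketch (our own example) takes G = Z_4 with ρ the regular representation and ψ the two-dimensional rotation representation, and computes the dimensionality of Hom_G(ρ, ψ) from the characters alone.

import numpy as np

# G = Z_4; rho = regular representation (4x4 cyclic shifts), psi = rotation by g * 90 degrees.
def rho(g):
    P = np.zeros((4, 4))
    for i in range(4):
        P[(i + g) % 4, i] = 1.0
    return P

def psi(g):
    c, s = np.cos(np.pi * g / 2), np.sin(np.pi * g / 2)
    return np.array([[c, -s], [s, c]])

# Eq. (10.21): dim Hom_G(rho, psi) = (1/|G|) sum_g chi_rho(g) chi_psi(g^{-1}).
dim = sum(np.trace(rho(g)) * np.trace(psi((-g) % 4)) for g in range(4)) / 4
print(round(dim))   # 2: every equivariant map from rho to psi has two free parameters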

10.2.3 Nonlinearities in Equivariant Networks

As shown above, we can express linear representations as direct sums of irre-


ducible representations. However, any linear representation should commute with the
nonlinearity: σ ◦ ψ = ψ ◦ σ . More generally, this requirement can be extended to
ψ ◦ σ = σ ◦ ρ. That means that the selection of group linear representations restricts
the range of the admitted nonlinearities, or conversely, that a certain choice of non-
linearity allows for only particular linear representations. We consider the condition
ψ ◦ σ = σ ◦ ρ.
For a pointwise nonlinearity, such as rectified linear unit (ReLU) activation, it can
be proven that the image of the linear representation satisfying constraint (10.16)
must consist of generalized permutation matrices, where each ψg is a permutation
matrix but the value of each nonzero entry need not be 1. When the nonlinearity
is ReLU, we can further prove that any such nonzero value is independent of the
group element g. Two useful linear representations for pointwise nonlinearity are
a regular representation and a quotient representation. In a regular representation,
the basis is {e_{g'}}_{g'∈G}, and ρ_g e_{g'} = e_{gg'}. Thus, it is a permutation representation with
dimensions of |G| × |G|. A quotient representation uses a subgroup H and the basis
{e_{g'H}}. Then, ρ_g e_{g'H} = e_{gg'H}, which is also a permutation matrix but with dimensions
of [G : H ] × [G : H ]. In the following section, we introduce the result that a network
under this setting has a greater approximation ability.
Beyond elementwise nonlinearities, there are also some special
nonlinearities for which a given group representation achieves equivariance. For
unitary representations, for which ‖ψ_g x‖ = ‖x‖, nonlinearities that act on the norm
of a feature but preserve its orientation are popular because they are always com-
mutative. In general, such a nonlinearity can be decomposed into η(‖x‖) x/‖x‖, where
η(‖x‖) can be ReLU(‖x‖ − b) for some positive b (Weiler et al. 2018a; Worrall
et al. 2017) or a squashing nonlinearity of the form ‖x‖²/(‖x‖² + 1) (Sabour et al.
2017). These nonlinearities all offer excellent performance in different tasks. Mean-
while, the theory of steerable networks itself shows no preference for any specific
nonlinearity.
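A minimal sketch of such a norm nonlinearity is given below (our own illustration): it rescales a feature by η(‖x‖) = ReLU(‖x‖ − b) while preserving its orientation, and the final check confirms that it commutes with an arbitrary orthogonal transformation.

import numpy as np

def norm_relu(x, b=0.5, eps=1e-12):
    # sigma(x) = ReLU(||x|| - b) * x / ||x||: acts on the norm, preserves the orientation
    n = np.linalg.norm(x)
    return max(n - b, 0.0) * x / (n + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))     # a random orthogonal matrix: ||Qx|| = ||x||
assert np.allclose(norm_relu(Q @ x), Q @ norm_relu(x))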

10.2.4 Generalization of Equivariant Networks

In this section, we present two results to show the benefits of equivariant neural
networks. First, however, let us begin with some generalization bounds. Intuitively,
equivariant neural networks are relatively constrained in comparison to all possible
neural networks. Specifically, the hypothesis set is smaller, implying better gen-
eralization. Some works have attempted to provide new generalization bounds for
equivariant neural networks, including Kondor and Trivedi (2018), Sokolic et al.
(2017), Sannai et al. (2021). However, generalization bounds cover only the worst
case, and the results of the above works do not explicitly consider the benefits of
equivariant models. To address this gap, we introduce the benefits of equivariant
neural networks (Elesedy and Zaidi 2021; Lyle et al. 2020).

10.2.5 Generalization Bounds of Equivariant Networks

Some works have attempted to present generalization bounds for equivariant net-
works, such as Kondor and Trivedi (2018), Sannai et al. (2021). Here, we discuss
the result in Sannai et al. (2021), which is the newest generalization bound. We first
introduce the generalization bounds for invariant neural networks and equivariant
neural networks, and we then share the method used to obtain them. Let R(f) =
E_{(x,y)∼D}[l(f(x), y)] be the expected risk, and let R̂_S(f) = (1/m) Σ_{i=1}^{m} l(f(x_i), y_i) be
the empirical risk. Then, for any invariant neural network f uniformly bounded by
1, the following inequality holds with probability at least 1 − 2δ:

    R(f) − R̂_S(f) ≤ C/(|G| m^{2/n}) + √(2 log(1/δ)/m).    (10.23)

For any equivariant neural network f uniformly bounded by 1, the following
inequality holds with probability at least 1 − 2δ:

    R(f) − R̂_S(f) ≤ C/(|St(G)| m^{2/n}) + √(2 log(1/δ)/m),    (10.24)

where C is independent of m, n and δ, and St(G) ⊂ G is a subgroup of elements


whose first coordinates are fixed. These results follow from a Rademacher com-
plexity bound, obtained by estimating the Rademacher complexity of the constrained hypothesis class. Note that
m is the number of examples and n is the dimensionality of the input, and it has been
shown that the Rademacher complexity is O(m^{−2/n}).
Now, we introduce the method. For an orbit {gx : g ∈ G}, if f (x) is deter-
mined, then f (gx) = g f (x) is determined. This means that the whole space
[0, 1]^n can be divided into orbits. Then, a quotient map φ_G : [0, 1]^n → [0, 1]^n/G
can be defined as φ_G(x) = {gx : g ∈ G}, and the hypothesis set {f : [0, 1]^n →
R^M such that f is equivariant} can be covered by {f : [0, 1]^n/G → R^M}. The cov-
ering number of [0, 1]^n/G can be bounded by C/(|G| ε^n). The results above can be
proven in accordance with the Rademacher complexity bound.
In addition, note that |G| is sometimes not independent of n. For example, when
G = Sn , the generalization bound becomes
 
    C/(n! m^{2/n}) + √(2 log(1/δ)/m).    (10.25)

10.2.5.1 Benefits of Equivariant Networks

Beyond traditional generalization bounds, some works have focused on the bene-
fits of equivariant networks. In these works, the authors have shown the benefits
of equivariant maps but not those of equivariant networks in practice. This means
that these analyses offer no suggestions regarding how to design a new equivariant
neural network. Thus, the theoretical analysis of equivariant neural networks is still
immature.
Lyle et al. (2020) compared data augmentation and feature averaging for invariant
targets. For a given dataset {(x_i, y_i)}_{i=1}^{m} and a linear representation ρ transforming
the data, suppose that the data have an invariant distribution, that is, P_D(x, y) =
P_D(ρ_g x, y) for all x, y, and g ∈ G. For data augmentation, the empirical risk is
considered to be

    R̂_S(f) = (1/m) Σ_{i=1}^{m} E_{g∼λ} l(f(g x_i), y_i) ≈ (1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} l(f(g_j x_i), y_i),    (10.26)

where g j is sampled from a distribution λ. The empirical risk of data augmentation


has been shown to have a relatively small variance because it relies on a better
dataset with a better empirical distribution. However, this risk cannot guarantee that
an invariant network will be learned. Another approach, feature averaging, is to take
the average of the outputs over all transformations, E_{g∼λ} f_H(gx), as the new output of
the first H layers, where f_H denotes those layers. It is easy to verify that in this approach, the empirical risk is the
same as that in data augmentation. In the case of feature averaging, however, the
network is always invariant. When the posterior Q is restricted to be invariant, the
symmetrization gap K L(P||Q) is smaller than that when Q is not invariant. The
smaller symmetrization gap K L(P||Q) indicates a better Probably Approximately
Correct (PAC)-Bayes bound. In this sense, it reveals a benefit of invariant neural
networks, i.e., the invariant nature of the learned networks.
In addition, the orbit averaging technique was used in Elesedy and Zaidi (2021).
The orbit averaging operation is defined as

    Q f = (1/|G|) Σ_{g∈G} ψ_g^{-1} ∘ f ∘ ρ_g    (10.27)

for finite groups and

    Q f = ∫_G ψ_g^{-1} ∘ f ∘ ρ_g dλ(g)    (10.28)

for compact groups, where λ is the (normalized) Haar measure over G. Note that

    Q f ∘ ρ_h = ∫_G ψ_h ψ_{gh}^{-1} ∘ f ∘ ρ_{gh} dλ(g) = ψ_h ∘ Q f,    (10.29)

and thus, Q f is equivariant. Furthermore, Q is a projection map from the space of all
functions to the space of all equivariant functions, and Q² = Q. When we consider
the expected loss

    R(f) = E_{x∼D}[l(f(x), s(x))] = E_{x∼D}[‖f(x) − s(x)‖_2^2],    (10.30)

if the target s is equivariant such that ψ ∘ s = s ∘ ρ, then R(Q f) ≤ R(f) for any f,
and the gap is exactly

    R(f) − R(Q f) = ∫_X ⟨f(x) − Q f(x), f(x) − Q f(x)⟩ dμ(x) ≥ 0.

In particular, when f is not equivariant, the gap satisfies R( f ) − R(Q f ) > 0. Finally,
if f is the best predictor with the least expected risk, then f must be equivariant.
If not, then Q f is a better predictor. However, a problem arises in this case: Q f
may not be a neural network of the same width. If we use the predictor Q f , then
the process will require more calculation, which is not desirable. In practice, an
equivariant neural network is layerwise equivariant, and the linear representations in
each of the layers are chosen prior to training. Thus, a real equivariant neural network
has additional constraints.
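The orbit-averaging operator Q in Eq. (10.27) is straightforward to implement for a finite group. The sketch below (our own example, with Z_4 acting by cyclic shifts on both input and output) verifies that Qf is equivariant and that Q is a projection, Q² = Q.

import numpy as np

G = 4
def act(g, v):
    return np.roll(v, g)                 # rho_g = psi_g: cyclic shift of the coordinates

def Q(f):
    # (Q f)(x) = (1/|G|) sum_g psi_g^{-1} f(rho_g x)
    return lambda x: sum(act(-g, f(act(g, x))) for g in range(G)) / G

f = lambda x: np.tanh(x) + x[0]          # an arbitrary, non-equivariant map R^4 -> R^4
Qf = Q(f)

x = np.random.default_rng(0).normal(size=G)
for g in range(G):
    assert np.allclose(Qf(act(g, x)), act(g, Qf(x)))   # Q f is equivariant
assert np.allclose(Q(Qf)(x), Qf(x))                    # Q is a projection: Q^2 = Q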

10.2.6 Approximation of Equivariant Networks

Another aspect addressed by theoretical research is the approximation ability.


Because equivariant neural networks can be viewed as constrained neural networks,
a natural question is whether equivariant neural networks are able to approximate
any equivariant function. To answer this question, some works have considered cer-
tain special cases and proved the approximation property under certain assumptions.
Here, we introduce some related results from Ravanbakhsh (2020).
Two assumptions are adopted in the proof presented by the cited authors. One
is that orbit averaging is a projection to all equivariant maps, and the other is that

two-layer neural networks have uniform approximation properties. That is, for a
given compact set C, any continuous function f supported on C can be uniformly
approximated by a two-layer neural network F with ReLU nonlinearity. Any equiv-
ariant map f defined on C can be extended to be defined on C̃ = {gx : x ∈ C} as
f˜(gx) = g f (x) for any x ∈ C. Hence, without loss of generality, we can assume
that C = C̃.
Now, there are two claims to be proven. On the one hand, when F is a two-layer
universal approximator such that ‖f(x) − F(x)‖ < ε for any x ∈ C, then QF is
as well, and QF can be modified into a two-layer neural network. Note that

    ‖QF(x) − f(x)‖ = ‖(1/|G|) Σ_{g∈G} g^{-1}[F(gx) − f(gx)]‖    (10.31)
                   ≤ (1/|G|) Σ_{g∈G} ‖g^{-1}‖ ‖F(gx) − f(gx)‖    (10.32)
                   ≤ ε sup_{g∈G} ‖g‖,    (10.33)

which implies that QF is also a uniform approximator. On the other hand, for a
two-layer neural network F(x) = W_2 σ(W_1 x), QF can be written as

    QF(x) = (1/|G|) [g_1^{-1} W_2 , . . . , g_t^{-1} W_2] σ( [W_1 g_1 ; . . . ; W_1 g_t] x ),    (10.34)

which is an equivariant neural network with a regular representation. By combining


these two results, we prove the uniform approximation property for finite groups.

References

Allen-Zhu, Zeyuan, and Yuanzhi Li. 2019. Can SGD learn recurrent neural networks with provable
generalization? In Advances in Neural Information Processing Systems, 10331–10341.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In International Conference on Learning Representations.
Bartlett, Peter L, Dylan J Foster, and Matus J Telgarsky. 2017. Spectrally-normalized margin bounds
for neural networks. In Advances in Neural Information Processing Systems, 6240–6249.
Chen, Hao, Zhanfeng Mo, Zhouwang Yang, and Xiao Wang. 2019a. Theoretical investigation
of generalization bound for residual networks. In International Joint Conference on Artificial
Intelligence, 2081–2087.
Chen, Minshuo, Xingguo Li, and Tuo Zhao. 2019b. On generalization bounds of a family of recurrent
neural networks. arXiv preprint arXiv:1910.12947.
Cohen, Taco, and Max Welling. 2016a. Group equivariant convolutional networks. In International
Conference on Machine Learning, 2990–2999.
Cohen, Taco S, and Max Welling. 2016b. Steerable cnns. arXiv preprint arXiv:1612.08498.

Du, Simon S, Yining Wang, Xiyu Zhai, Sivaraman Balakrishnan, Russ R Salakhutdinov, and Aarti
Singh. 2018. How many samples are needed to estimate a convolutional neural network? In
Advances in Neural Information Processing Systems, 373–383.
Elesedy, Bryn, and Sheheryar Zaidi. 2021. Provably strict generalisation benefit for equivariant
models. In International Conference on Machine Learning, 2959–2969.
Faber, Felix A, Alexander Lindmaa, O Anatole Von Lilienfeld, and Rickard Armiento. 2016.
Machine learning energies of 2 million elpasolite (a b c 2 d 6) crystals. Physical Review Letters
117 (13): 135502.
Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Con-
volutional sequence to sequence learning. In International Conference on Machine Learning,
1243–1252.
Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep
recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal
Processing, 6645–6649.
Graves, Alex, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist
temporal classification: labelling unsegmented sequence data with recurrent neural networks. In
International Conference on Machine Learning, 369–376.
He, Fengxiang, Tongliang Liu, and Dacheng Tao. 2020. Why resnet works? residuals generalize.
IEEE Transactions on Neural Networks and Learning Systems.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image
recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
Kondor, Risi, and Shubhendu Trivedi. 2018. On the generalization of equivariance and convolution
in neural networks to the action of compact groups. In International Conference on Machine
Learning, 2747–2755.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–
1105.
Li, Yangyan, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. 2018. Pointcnn:
convolution on x-transformed points. In Advances in Neural Information Processing Systems,
820–830.
Lin, Shan, and Jingwei Zhang. 2019. Generalization bounds for convolutional neural networks.
arXiv preprint arXiv:1910.01487.
Lyle, Clare, Mark van der Wilk, Marta Kwiatkowska, Yarin Gal, and Benjamin Bloem-Reddy. 2020.
On the benefits of invariance in neural networks. arXiv preprint arXiv:2005.00178.
Maron, Haggai, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. 2018. Invariant and equivariant
graph networks. arXiv preprint arXiv:1812.09902.
Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing
lstm language models. In International Conference on Learning Representations.
Neyshabur, Behnam, Srinadh Bhojanapalli, and Nathan Srebro. 2017. A PAC-Bayesian approach
to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564.
Ntampaka, Michelle, Hy Trac, Dougal J Sutherland, Sebastian Fromenteau, Barnabás Póczos, and
Jeff Schneider. 2016. Dynamical mass measurements of contaminated galaxy clusters using
machine learning. The Astrophysical Journal 831 (2): 135.
Peters, Matthew E, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Annual Conference
of the North American Chapter of the Association for Computational Linguistics, 2227–2237.
Ravanbakhsh, Siamak, Junier Oliva, Sebastian Fromenteau, Layne Price, Shirley Ho, Jeff Schnei-
der, and Barnabás Póczos. 2016. Estimating cosmological parameters from the dark matter
distribution. In International Conference on Machine Learning, 2407–2416.
Ravanbakhsh, Siamak. 2020. Universal equivariant multilayer perceptrons. In International
Conference on Machine Learning.
Reeder, Mark. 2014. Notes on group theory.

Sabour, Sara, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules.
Advances in Neural Information Processing Systems, 30.
Sannai, Akiyoshi, Masaaki Imaizumi, and Makoto Kawano. 2021. Improved generalization bounds
of group invariant/equivariant deep networks via quotient feature spaces. In Uncertainty in
Artificial Intelligence, 771–780.
Serre, Jean-Pierre, et al. 1977. Linear Representations of Finite Groups, vol. 42. Springer.
Silver, David, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess-
che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. 2016.
Mastering the game of go with deep neural networks and tree search. Nature 529 (7587): 484–489.
Sokolic, Jure, Raja Giryes, Guillermo Sapiro, and Miguel Rodrigues. 2017. Generalization error of
invariant classifiers. In Artificial Intelligence and Statistics, 1094–1103.
Sutskever, Ilya, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural
networks. In Advances in Neural Information Processing Systems, 3104–3112.
Tu, Zhuozhuo, Fengxiang He, and Dacheng Tao. 2020. Understanding generalization in recurrent
neural networks. In International Conference on Learning Representations.
Weiler, Maurice, and Gabriele Cesa. 2019. General e (2)-equivariant steerable cnns. Advances in
Neural Information Processing Systems, 32.
Weiler, Maurice, Mario Geiger, Max Welling, Wouter Boomsma, and Taco S Cohen. 2018a. 3d
steerable cnns: learning rotationally equivariant features in volumetric data. Advances in Neural
Information Processing Systems, 31.
Weiler, Maurice, Fred A Hamprecht, and Martin Storath. 2018b. Learning steerable filters for
rotation equivariant cnns. In IEEE Conference on Computer Vision and Pattern Recognition,
849–858.
Worrall, Daniel E, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. 2017.
Harmonic networks: deep translation and rotation equivariance. In IEEE Conference on Computer
Vision and Pattern Recognition, 5028–5037.
Yu, Adams Wei, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi,
and Quoc V Le. 2018. Qanet: combining local convolution with global self-attention for reading
comprehension. In International Conference on Learning Representations.
Zhou, Pan, and Jiashi Feng. 2018. Understanding generalization and optimization performance of
deep cnns. In International Conference on Machine Learning, 5960–5969.
Part III
Deep Learning Theory from
the Trust-worthy Facet
Chapter 11
Privacy Preservation

In this chapter, we introduce an important line of research dedicated to measuring


and providing protection for individual privacy in machine learning. We will start
from a common standard for measuring privacy protection, differential privacy, and
present related definitions, properties, and extended theories. We will examine vari-
ous means of privacy protection that aim to hide personal information from hostile
adversaries. To orient the topic in a context appropriate to this book, we also provide
discussions concerning the relationship between privacy protection and other factors
in trustworthy artificial intelligence.

11.1 Differential Privacy

Deep learning is deployed to process massive amounts of personal data, including


financial data and medical information. The ability to extract highly valuable knowl-
edge from the population while protecting each individual’s privacy and sensitive
data is of critical importance (Dwork and Mulligan 2013; Dwork and Roth 2014).
Differential privacy. Differential privacy is a major metric to measure the privacy-
preserving ability of an algorithm (Dwork and Roth 2014). A learning algorithm is
(ε, δ)-differentially private if, upon exposure of the training data to a small
disturbance, its change in the output hypothesis is bounded as follows:

    log [ (P_{A(S)}(A(S) ∈ B) − δ) / P_{A(S')}(A(S') ∈ B) ] ≤ ε,    (11.1)

where B is an arbitrary subset of the hypothesis space and (S, S') is a pair of neigh-
bouring sample sets in which S and S' differ by only one example. The left-hand
side of Eq. (11.1) is also called the privacy loss.
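As a concrete illustration of Eq. (11.1), the following sketch (our own example, not part of the cited analyses) releases a counting query through the classical Laplace mechanism, which is (ε, 0)-differentially private, and evaluates the resulting privacy loss on a pair of neighbouring sample sets; the function names are ours.

import numpy as np

def laplace_mechanism(data, query, sensitivity, eps, rng):
    # (eps, 0)-differentially private release of query(data)
    return query(data) + rng.laplace(scale=sensitivity / eps)

def privacy_loss(output, data, data_neighbor, query, sensitivity, eps):
    # log ratio of the output densities on two neighbouring datasets (delta = 0 in Eq. 11.1)
    b = sensitivity / eps
    log_density = lambda d: -abs(output - query(d)) / b - np.log(2 * b)
    return log_density(data) - log_density(data_neighbor)

query = lambda d: float(np.sum(d))        # counting query with sensitivity 1
S = np.array([1, 0, 1, 1, 0])
S_prime = np.array([1, 0, 1, 1, 1])       # neighbouring set: exactly one record differs
eps, rng = 0.5, np.random.default_rng(0)

out = laplace_mechanism(S, query, 1.0, eps, rng)
print(abs(privacy_loss(out, S, S_prime, query, 1.0, eps)) <= eps + 1e-9)   # True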
In Eq. (11.1), a division operation is employed to measure changes in the out-
put hypothesis. In recent years, numerous variants of the privacy-preserving abil-
ity metrics have been devised by tweaking this division operation or introducing

specific assumptions. The variants include: (1) Concentration Differential Privacy,


which assumes privacy loss follows a sub-Gaussian distribution (Dwork and Roth-
blum 2016; Bun and Steinke 2016); (2) Mutual-Information Differential Privacy and
Kullback-Leibler (KL) Differential Privacy, which replace the division operation
with mutual information and the KL divergence, respectively (Cuff and Yu 2016;
Wang et al. 2016; Liao et al. 2017; Chaudhuri et al. 2019); and (3) Rényi Differential
Privacy, which substitutes the KL divergence with the Rényi divergence (Mironov
2017; Geumlek et al. 2017). These approaches offer nuanced perspectives on pri-
vacy preservation in machine learning models and broaden the prospects for privacy
preservation in data-driven applications.
These techniques have also been applied to preserve privacy in deep learning
(Abadi et al. 2016).
Generalization-privacy relationship. It has been suggested that generalizability
is, in a certain sense, equivalent to the privacy-preserving ability measured by differential
privacy.
Dwork et al. (2015) proved a high-probability generalization bound for (ε, δ)-
differentially private machine learning algorithms formulated as follows:

    P( R(A(S)) − R̂_S(A(S)) < 4ε ) > 1 − 8δ/ε.

Oneto et al. (2017) improved this generalization bound as follows:

    P( Diff_R < √(6 R̂_S(A(S)) ε̂) + 6(ε̂² + 1/m) ) > 1 − 3e^{−mε²},

and

    P( Diff_R < √(4 V̂_S(A(S)) ε̂) + (5m/(m − 1))(ε̂² + 1/m) ) > 1 − 3e^{−mε²},

where

    Diff_R = R(A(S)) − R̂_S(A(S)),    ε̂ = ε + 1/m,

and V̂_S(A(S)) is the empirical variance of l(A(S), ·):

    V̂_S(A(S)) = (1/(2m(m − 1))) Σ_{i≠j} ( l(A(S), z_i) − l(A(S), z_j) )².

Nissim and Stemmer (2015) further proved that

    P( R(A(S)) − R̂_S(A(S)) < 13ε ) > 1 − (2δ/ε) log(2/ε).

To date, the tightest generalization bound has been given by He et al. (2020)
as follows:

    P( |R̂_S(A(S)) − R(A(S))| < 9ε ) > 1 − (e^{−ε} δ/ε) ln(2/ε).    (11.2)

This generalization bound was derived in three stages. The authors first proved an
on-average generalization bound for any (ε, δ)-differentially private multidatabase
learning algorithm

    Ã : S → H × {1, . . . , k}

as follows:

    | E_{S∼D^{km}} E_{A(S)} [ R̂_{S_{i_{A(S)}}}(h_{A(S)}) ] − E_{S∼D^{km}} E_{A(S)} [ R(h_{A(S)}) ] | ≤ e^{−ε} kδ + 1 − e^{−ε},    (11.3)

where the loss function satisfies ‖l‖_∞ ≤ 1.


They then obtained a high-probability generalization bound for multidatabase
algorithms as follows:

    P( R̂_{S_{i_{A(S)}}}(h_{A(S)}) ≤ R(h_{A(S)}) + k e^{−ε} δ + 3ε ) ≥ ε.    (11.4)

Finally, they proved Eq. (11.2) through reduction to absurdity.

11.2 The Interplay of Generalizability and Privacy-Preserving Ability

The abilities to generalize to unseen data and to preserve privacy are becoming


more and more important for machine learning. Specifically, a good generalization
ability means that an algorithm learns the underlying patterns in the training data
instead of just memorizing it (Vapnik 2013; Mohri et al. 2018). Therefore, good
generalizability provides confidence for models learned on given data to readily apply
to unseen data with similar distribution. Additionally, a massive amount of personal
data has been collected for training machine learning models, such as financial and
medical applications. Thus, extracting the underlying highly valuable information in the
data while preventing the highly sensitive information and individual privacy from
leaking is profoundly important (Dwork and Mulligan 2013; Dwork and Roth 2014;
Pittaluga and Koppal 2016).
This section focuses on the relationship between the privacy-preserving ability and
generalization ability in iterative machine learning algorithms through the following
steps: (1) exploring the relationship between privacy-preserving ability and general-
ization ability in any learning algorithm; and (2) analyzing how the iterative nature

commonly existing in learning algorithms would influence the privacy-preserving


ability and further the generalization ability.
We first derive two theorems to upper bound the generalization error of a learning
algorithm via its differential privacy. Specifically, we prove a high-probability upper
bound for the generalization error,

R(A(S)) − R̂ S (A(S)),

where A(S) is the hypothesis learned by algorithm A on the training sample set S,
R(A(S)) is the expected risk, and R̂ S (A(S)) is the empirical risk.
To derive this bound, we employ a new on-average generalization bound for (ε, δ)-
differentially private multi-database learning algorithms. Moreover, this bound can
also be deduced from the proven high-probability generalization bound, indicating
that differentially private machine learning algorithms are probably approximately
correct (PAC)-learnable. These findings indicate a notable connection between an
algorithm’s privacy-preserving capabilities and its ability to generalize effectively.
In essence, algorithms exhibiting strong privacy preservation also tend to demon-
strate robust generalizability. Consequently, there exists an opportunity to enhance
the generalizability of learning algorithms by bolstering their privacy-preserving
mechanisms.
We then investigate the impact of the iterative nature inherent in learning algo-
rithms on both their privacy-preserving capabilities and generalizability. Typically,
the privacy-preserving effectiveness of an iterative algorithm diminishes over the
course of training. This decline occurs due to the accumulation of leaked informa-
tion as the algorithm iteratively processes data. To investigate this phenomenon, we
further derived three composition theorems that measure the differential privacy of
any iterative algorithm by assessing the privacy of each individual iteration. By inter-
twining these theorems with the established correlation between generalizability and
privacy preservation, we gained insights into the generalizability of iterative learn-
ing algorithms. These composition theorems provide a framework for understanding
how the iterative nature of algorithms influences both their privacy and their ability
to generalize effectively.
These results provide an insight of the relationship between generalizability and
privacy-preserving ability in iterative learning algorithms.
Existing works (Dwork et al. 2015; Nissim and Stemmer 2015; Oneto et al. 2017)
have already proved some high-probability generalization bounds in the following
form,

    P( R(A(S)) − R̂_S(A(S)) > a ) < b,

where a and b are two positive constant real numbers. The high-probability bound
provided in this section is strictly tighter than the current best results proved in
Nissim and Stemmer (2015) in two aspects: (1) our bound improves the term
a from 13ε to 9ε; and (2) our bound improves the term b from 2δε^{−1} log(2/ε) to
2e^{−ε}δε^{−1} log(2/ε). Besides, based on this bound, we further derive a PAC-learnable

guarantee for differentially private machine learning algorithms. Nissim and Stem-
mer have proved an on-average multi-database generalization bound, which is looser
than ours by a factor of eε . Such improvements are significant in practice because ε
can always be as large as 10 (Abadi et al. 2016). Besides, the bounds in Nissim and
Stemmer (2015) are only applicable for binary classification, while ours are suitable
for any differentially private learning algorithms.
Some works have also proved composition theorems (Dwork and Roth 2014;
Kairouz et al. 2017). The approximation of factor δ in our composition theorems is
tighter than the best existing result (Kairouz et al. 2017) by
  
eε − 1 ε
δ T − ,
eε + 1 ε

where T is the number of iterations, while the estimate of ε remains the same. This
improvement is significant in practice, where T is typically very large, and it
helps to further tighten our generalization bounds for iterative learning
algorithms.
Our results are applicable to a wide range of machine learning algorithms. In
this book, we consider the stochastic gradient MCMC scheme (Ma et al. 2015) and
agnostic federated learning (Geyer et al. 2017) and take stochastic gradient Langevin
dynamics (Welling and Teh 2011) as an example of application. Our results deliver
generalization bounds for SGLD and agnostic federated learning. The obtained gener-
alization bounds do not explicitly rely on the model size, which can be prohibitively
large in modern methods, such as deep neural networks.
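For readers unfamiliar with stochastic gradient Langevin dynamics, the sketch below (our own toy example on least squares; the step size, batch size, and variable names are arbitrary choices) shows the update rule: a half-step of the rescaled stochastic gradient plus Gaussian noise whose variance matches the step size.

import numpy as np

def sgld(grad, theta0, n_steps, step_size, rng, data):
    # theta_{t+1} = theta_t - (eta/2) * grad(theta_t; minibatch) + N(0, eta * I)
    X, y = data
    theta = theta0.copy()
    for _ in range(n_steps):
        idx = rng.choice(len(X), size=8, replace=False)        # minibatch of size 8
        theta = (theta
                 - 0.5 * step_size * grad(theta, X[idx], y[idx])
                 + rng.normal(scale=np.sqrt(step_size), size=theta.shape))
    return theta

# toy least-squares objective; the minibatch gradient is rescaled to the full sample size
rng = np.random.default_rng(0)
X, w_true = rng.normal(size=(256, 3)), np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=256)
grad = lambda th, Xb, yb: 2.0 * (len(X) / len(Xb)) * Xb.T @ (Xb @ th - yb)

theta = sgld(grad, np.zeros(3), n_steps=2000, step_size=1e-4, rng=rng, data=(X, y))
print(np.round(theta, 2))   # close to w_true, up to the injected posterior noise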

11.2.1 Preliminaries

Here, we slightly abuse the notations of distribution and its cumulative distribu-
tion function when no ambiguity is introduced because there is a one-one mapping
between them if we ignore zero-probability events.

Definition 11.1 (Max Divergence; cf. Dwork and Roth (2014), Definition 3.6) For
any random variables X and Y, the max divergence between X and Y is defined as

    D_∞(X‖Y) = max_{S⊆Supp(X)} log [ P(X ∈ S) / P(Y ∈ S) ].

Definition 11.2 (δ-Approximate Max Divergence; cf. (Dwork and Roth 2014), Def-
inition 3.6) For any random variables X and Y, the δ-approximate max divergence
from X to Y is defined as

    D_∞^δ(X‖Y) = max_{S⊆Supp(X): P(X∈S)≥δ} log [ (P(X ∈ S) − δ) / P(Y ∈ S) ].

Definition 11.3 (Statistical Distance; cf. (Dwork and Roth 2014)) For any random
variables X and Y, the statistical distance between X and Y is defined as

    Δ(X, Y) = max_S |P(X ∈ S) − P(Y ∈ S)|.
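For finite supports, the three quantities in Definitions 11.1–11.3 can be computed by enumerating subsets of the support. The sketch below (our own illustration; the two example distributions are arbitrary) does exactly that.

import numpy as np
from itertools import chain, combinations

def subsets(n):
    return chain.from_iterable(combinations(range(n), r) for r in range(1, n + 1))

def max_divergence(p, q, delta=0.0):
    # D_infty^delta(X || Y): maximize over subsets S of the support with P(X in S) >= delta
    best = -np.inf
    for S in subsets(len(p)):
        pS, qS = sum(p[i] for i in S), sum(q[i] for i in S)
        if pS >= delta and qS > 0:
            best = max(best, np.log(max(pS - delta, 0.0) / qS + 1e-300))
    return best

def statistical_distance(p, q):
    return max(abs(sum(p[i] - q[i] for i in S)) for S in subsets(len(p)))

p, q = np.array([0.5, 0.3, 0.2]), np.array([0.4, 0.4, 0.2])
print(max_divergence(p, q), max_divergence(p, q, delta=0.05), statistical_distance(p, q))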

We then recall the following two lemmas.

Lemma 11.1 (cf. (Dwork and Rothblum 2016), Lemmas 3.9 and 3.10) For any two
distributions D and D', there exist distributions M and M' such that

    max{D_∞(M‖M'), D_∞(M'‖M)} = max{D_∞(D‖D'), D_∞(D'‖D)},

and

    D_{KL}(D‖D') ≤ D_{KL}(M‖M') = D_{KL}(M'‖M).

Lemma 11.2 (cf. Dwork and Roth (2014), Theorem 3.17) For any random variables
Y and Z, we have

    D_∞^δ(Y‖Z) ≤ ε,   D_∞^δ(Z‖Y) ≤ ε,

if and only if there exist random variables Y', Z' such that

    Δ(Y, Y') ≤ δ/(1 + e^ε),   Δ(Z, Z') ≤ δ/(1 + e^ε),
    D_∞(Y'‖Z') ≤ ε,   D_∞(Z'‖Y') ≤ ε.

We finally recall the Azuma Lemma (Boucheron et al. 2013), which gives a
concentration inequality for martingales.

Lemma 11.3 (Azuma Lemma; cf. (Mohri et al. 2018), p. 371) Suppose {Y_i}_{i=1}^T is
a sequence of random variables, where Y_i ∈ [−a_i, a_i]. Let {X_i}_{i=1}^T be a sequence of
random variables such that

    E(Y_i | X_{i−1}, . . . , X_1) ≤ C_i,

where {C_i}_{i=1}^T is a sequence of constant real numbers. Then, we have the following
inequality,

    P( Σ_{i=1}^T Y_i ≥ Σ_{i=1}^T C_i + t √(Σ_{i=1}^T a_i²) ) ≤ e^{−t²/2}.

11.2.2 Generalization Bounds for Iterative Differentially Private Algorithms

In this section, we ascertain the generalizability of iterative differentially private


algorithms through the following two primary steps: (1) we establish generalization
bounds applicable to any differentially private learning algorithm; and (2) we explore
how the prevalent iterative nature inherent in learning algorithms affects both their
privacy-preserving efficacy and their ability to generalize. This exploration is facili-
tated by three composition theorems. Additionally, we provide a brief outline of the
theoretical proofs and illustrate their superiority over existing findings.

11.2.2.1 Bridging Generalization and Privacy Preservation

We first prove a high-probability generalization bound for any (ε, δ)-differentially


private machine learning algorithm as follows.

Theorem 11.1 (High-Probability Generalization Bound via Differential Privacy)


Suppose algorithm A is (ε, δ)-differentially private, the training sample size m ≥
(2/ε²) ln(16/(e^{−ε}δ)), and the loss function satisfies ‖l‖_∞ ≤ 1. Then, for any data distribution D over
data space Z, we have the following inequality,

    P( |R̂_S(A(S)) − R(A(S))| < 9ε ) > 1 − (e^{−ε} δ/ε) ln(2/ε).

Theorem 11.1 presents a valuable finding: it establishes a direct link between a


learning algorithm’s privacy-preserving prowess and its generalizability. This implies
that efforts aimed at improving privacy preservation can also lead to improved gen-
eralization performance. Moreover, the theorem suggests that (ε, δ)-differentially
private algorithms may offer a guarantee of probably approximately correct (PAC)
learnability. PAC-learnability ensures that the algorithm is likely to provide accu-
rate predictions with high probability. This insight validates the potential synergy
between privacy and performance in algorithm design. PAC-learnability is defined
as follows.

Definition 11.4 (PAC-Learnability; cf. (Mohri et al. 2018), Definition 2.4) A concept
class C is said to be PAC-learnable if there exists an algorithm A and a polynomial
function poly(·, ·, ·, ·) such that for any s > 0 and t > 0, for all distributions D on
the training example Z , any target concept c ∈ C, and any sample size

m ≥ poly(1/s, 1/t, n, size(C)),

the following inequality holds,

P S∼Dm (R(A(S)) < s) > 1 − t.



In Sect. 11.2.3, we show how our result leads to PAC-learnable guarantees by


using SGLD and agnostic federated learning as examples.

Proof Skeleton

We now give the proof skeleton for Theorem 11.1. The proofs have three stages:
(1) we first prove an on-average generalization bound for multi-database learning
algorithms; (2) we then obtain a high-probability generalization bound for multi-
database algorithms; and (3) we eventually prove Theorem 11.1 by reduction to
absurdity.
Stage 1: Prove an on-average generalization bound for multi-database
learning algorithms.
We first prove the following on-average generalization bound for multi-database
learning algorithms which are defined as follows.

Definition 11.5 (Multi-Database Learning Algorithms; cf. (Nissim and Stemmer


2015)) Suppose the training sample set S is separated into k sub-databases S1 , . . . , Sk ,
each of which has the size of m. For brevity, we rewrite the training sample set as
follows.
S = (S1 , . . . , Sk ).

The hypothesis Ã(S) learned by multi-database algorithm à on dataset S is defined


as follows,

    Ã : Z^{km} → H × {1, . . . , k},   S ↦ (h_{A(S)}, i_{A(S)}).

Theorem 11.2 (On-Average Multi-Database Generalization Bound) Let algorithm,

à : S → H × {1, . . . , k},

be (ε, δ)-differentially private and the loss function l∞ ≤ 1. Then, for any data
distribution D over data space Z, we have the following inequality,
   
     
 E E R̂ SiA(S) h A(S) − E R h A(S)  ≤ e−ε kδ + 1 − e−ε . (11.5)
S∼Dm A(S) A(S) 

Proof The left side of Eq. (11.5) can be rewritten as


    
    
E E E
R̂ SiA(S) h A(S) E Ez∼SiA(S) l h A(S) , z
=
S∼Dkm A(S) S∼Dkm A(S)
     
(∗)    
= E E E l h A(S) , z iA(S) = E E E l h A(S) , z iA(S)
S∼Dkm A(S) z∼S S∼Dkm z∼S A(S)
  
 
= E E E l h A(S) , z iA(S)
S∼Dkm z∼S A(S)
  
1    
= E E P l h A(S) , z iA(S) ≤ t dt
S∼Dkm z∼S 0
   
k 1    
= E E P l h A(S) , z i ≤ t, i A(S) = i dt ,
S∼Dkm z∼S 0
i=1

where z in the right side of (∗) is defined as {z 1 , . . . , z k }, z i is uniformly selected


from Si . Since A is (ε, δ)-differentially private, we further have
   
k 1    
E E P l h A(S) , z i ≤ t, i A(S) = i dt
S∼Dkm z∼S 0
i=1
   
k 1  
ε
 
≤ E E e P l h A(Szi :z0 ) , z i ≤ t, i A(Szi :z0 ) = i + δ dt
S∼Dkm z∼S,z 0 ∼D 0
i=1
   
ε
k 1    
=e E E P l h A(Szi :z0 ) , z i ≤ t, i A(Szi :z0 ) = i dt + kδ
S∼Dkm z∼S,z 0 ∼D 0
i=1
  
k
ε
1    
= e E E P l h A(Szi :z0 ) , z i ≤ t, i A(Szi :z0 ) = i dt + kδ
S∼Dkm z∼S,z 0 ∼D 0
i=1
  
k
ε
1    
= e E E P l h A(S ∪{z0 }) , z i ≤ t, i A(S ∪{z0 }) = i dt + kδ.
S ∼Dkm−1 z i ∼D,z 0 ∼D 0
i=1

Let S = S ∪ {z 0 } and z = z i (it is without less of generality since all z i is i.i.d. drawn
from D). Since S ∪ {z 0 } ∼ Dkm , we have
  
k 1    
eε E E P l h A(S ∪{z0 }) , z i ≤ t, i A(S ∪{z0 }) = i dt + kδ
S ∼Dkm−1 z i ∼D,z 0 ∼D 0
i=1
  
k 1    
= eε E E P l h A(S) , z ≤ t, i A(S) = i dt + kδ
S∼Dkm z∼D 0
i=1
  
    1
= eε E E
P l h A(S) , z ≤ t dt + kδ
S∼Dkm z∼D 0
 
 
= eε E E EA(S) l h A(S) , z + kδ.
S∼Dkm z∼D

Therefore, we have
    
E E R̂ SiA(S) (h A(S) ) ≤ kδ + eε E [ E [R h A(S) ]]. (11.6)
S∼Dkm A(S) S∼Dkm A(S)

Rearranging Eq. (11.6), we have


    
 
e−ε E E R̂ SiA(S) (h A(S) ) ≤ e−ε kδ +
E R h A(S) E
S∼Dkm A(S) S∼Dkm A(S)
  
 
− E [ E [R h A(S) ]] ≤ e−ε kδ − e−ε E E R̂ SiA(S) (h A(S) ) ,
S∼Dkm A(S) S∼Dkm A(S)

which leads to
    
E E R̂ SiA(S) (h A(S) ) − E [ E [R h A(S) ]]
S∼Dkm A(S) S∼Dkm A(S)
     
≤e−ε kδ − e−ε E E R̂ SiA(S) (h A(S) ) + E E R̂ SiA(S) (h A(S) )
S∼Dkm A(S) S∼Dkm A(S)
−ε −ε
≤1 − e + e kδ.

The other side of the inequality can be similarly obtained.


The proof is completed. 
Since 1 − e−ε ≤ ε, we have the following corollary.
Corollary 11.1 Suppose all the conditions in Theorem 11.2 hold, then we have the
following inequality,

    E_{S∼D^{km}} E_{A(S)} [ R̂_{S_{i_{A(S)}}}(h_{A(S)}) ] ≤ e^{−ε} kδ + ε + E_{S∼D^{km}} E_{A(S)} [ R(h_{A(S)}) ].

Stage 2: Prove a high-probability generalization bound for multi-database


algorithms.
Markov bound (cf. Mohri et al. 2018, Theorem C.1) is an important concentration
inequality in learning theory. Here, we slightly modify the original version as follows,

    E_x[h(x)] ≥ E_x[h(x) I_{h(x)≥g(x)}] ≥ E_x[g(x) I_{h(x)≥g(x)}].

Then, combining it with Theorem 11.2, we derive the following high-probability


generalization bound for multi-database algorithms.
Theorem 11.3 (High-Probability Multi-Database Generalization Bound) Let the
following algorithm,
 
    A : Z^{km} → Y^X × {1, . . . , k},   S ↦ (h_{A(S)}, i_{A(S)}),

be (ε, δ)-differentially private, where km is the size of the whole dataset S and Y^X =
{f : X → Y}. Then, for any data distribution D over data space Z and any database
set S = {S_i}_{i=1}^k, where S_i is a database containing m i.i.d. examples drawn from D, we
have the following generalization bound,

    P( R̂_{S_{i_{A(S)}}}(h_{A(S)}) ≤ R(h_{A(S)}) + k e^{−ε} δ + 3ε ) ≥ ε.    (11.7)

Proof By Corollary 11.1, we have that


    
−ε
 
E E R̂ SiA(S) (h A(S) ) ≤ e kδ + ε + E E R h A(S) .
S∼Dkm A(S) S∼Dkm A(S)

 
Since R̂ SiA(S) h A(S) ≥ 0, we have that for any α > 0,
  
E E R̂ SiA(S) (h A(S) )
S∼Dkm A(S)
  
≥ E E R̂ SiA(S) (h A(S) )I R̂ S (h A(S) )≥R (h A(S) )+α
S∼Dkm A(S) i A(S)
  
 
≥ E E (α + R h A(S) )I R̂ S (h A(S) )≥R (h A(S) )+α .
S∼Dkm A(S) i A(S)

 
Furthermore, by splitting E [ E [R h A(S) ]] into two parts, we have
S∼Dkm A(S)

    
 
E E R̂ SiA(S) (h A(S) ) − E E R h A(S)
S∼Dkm A(S) S∼Dkm A(S)
  
 
≥ E E (α + R h A(S) )I R̂ S (h A(S) )≥R (h A(S) )+α
S∼Dkm A(S) i A(S)
  
 
− E E R h A(S) I R̂ S (h A(S) )≥R (h A(S) )+α
S∼Dkm A(S) i A(S)
  
 
+ E E R h A(S) I R̂ S (h A(S) )<R (h A(S) )+α
S∼Dkm A(S) i A(S)
  
≥ E E αI R̂ S (h A(S) )≥R (h A(S) )+α
S∼Dkm A(S) i A(S)
  
 
− E E R h A(S) I R̂ S (h A(S) )<R (h A(S) )+α
S∼Dkm A(S) i A(S)
       
≥αP R̂ SiA(S) (h A(S) ) > R h A(S) + α − P R̂ SiA(S) (h A(S) ) ≤ R h A(S) + α .

Let α = e−ε kδ + 3ε. We have


    α − (e−ε kδ + ε)
P R̂ SiA(S) (h A(S) ) ≤ R h A(S) + α ≤ ≤ ε.
1+α

The proof is completed. 

Stage 3: Prove Theorem 11.1 by Reduction to Absurdity.


We eventually prove Theorem 11.1 by reduction to absurdity. We assume that there
exists an algorithm A which conflicts with Theorem 11.1. We can then construct an
algorithm B based on the exponential mechanism which is defined as follows.

Definition 11.6 (Exponential Mechanism; cf. (Nissim and Stemmer 2015), p. 3, and
(McSherry and Talwar 2007)) Suppose that S is a sample set, u : (S, r) → R_+ is the
utility function, R is an index set, ε is the privacy parameter, and Δu is the sensitivity
of u defined by

    Δu = max_{r∈R} max_{S,S' adjacent} |u(S, r) − u(S', r)|.

Then, the exponential mechanism q(S, u, R, ε) maps (S, u, R, ε) to an output r ∈ R,
sampled with probability proportional to exp(ε u(S, r)/(2Δu)); this mechanism is
ε-differentially private.
Then, we can prove the following lemma.

Lemma 11.4 We define an algorithm A : Z^m → Y^X, where m is the training sam-
ple size, Z is the data space, Z^m is the space of the training sample set, and
Y^X = {f : X → Y}. Suppose k = ⌈ε/(e^{−ε}δ)⌉ and

    m ≥ (2/ε²) ln(16/(e^{−ε}δ)).

If we have

    P( R̂_S(A(S)) ≤ e^{−ε} kδ + 8ε + R(A(S)) ) < 1 − (e^{−ε} δ/ε) ln(2/ε),    (11.8)

then there exists an algorithm

    B : Z^{km} → Y^X × {1, . . . , k}

that is (2ε, δ)-differentially private and satisfies

    P( R̂_{S_{i_{B(S)}}}(h_{B(S)}) ≤ R(h_{B(S)}) + k e^{−ε} δ + 3ε ) < ε,    (11.9)

where S = {S_i}_{i=1}^k and each S_i is a database containing m examples i.i.d. sampled from D.

Proof of Lemma 11.4 Construct algorithm B with input S = {S_i}_{i=1}^k and T (where
S_i, T ∈ Z^m) as follows:
Step 1. Run A on S_i, i = 1, . . . , k. We denote the output as h_i = A(S_i).
Step 2. Let the utility function be q(S, T, i) = m( R̂_{S_i}(h_i) − R̂_T(h_i) ). We apply
the utility q to an ε-differentially private exponential mechanism M(h_i, S, T) and
return the output.
We then prove that B satisfies

    P( R̂_{S_{i_{B(S)}}}(h_{B(S)}) ≤ R(h_{B(S)}) + k e^{−ε} δ + 3ε ) < ε.

By using Eq. (11.8), we have

    P( ∀i, R̂_{S_i}(A(S_i)) ≤ e^{−ε} kδ + 8ε + R(A(S_i)) ) ≤ ( 1 − (e^{−ε} δ/ε) ln(2/ε) )^k,

which leads to

    P( ∃i, R̂_{S_i}(A(S_i)) > e^{−ε} kδ + 8ε + R(A(S_i)) ) > 1 − ( 1 − (1/k) ln(2/ε) )^k ≥ 1 − ε/2.    (11.10)

Furthermore, since T is independent of S, by using the Hoeffding inequality, we
have

    P( ∀i, |l(h_i, T) − R(h_i)| ≤ ε/2 ) ≥ ( 1 − 2e^{−mε²/2} )^k ≥ 1 − ε/8.    (11.11)

Therefore, by combining Eqs. (11.10) and (11.11), we obtain

    P( ∃i, R̂_{S_i}(A(S_i)) > e^{−ε} kδ + (15/2)ε + l(h_i, T) ) > 1 − 5ε/8.

Since q has sensitivity 1, for fixed h_i we have

    P( q(S, T, M(h_i, S, T)) ≥ OPT(q(S, T, ·)) − mε ) ≥ 1 − ε/4,

which leads to

    P( R̂_{S_{i_{B(S)}}}(h_{B(S)}) > e^{−ε} kδ + (13/2)ε + R̂_T(h_{B(S)}) ) > 1 − 7ε/8.

Then, using Eq. (11.11) again, we have

    P( R̂_{S_{i_{B(S)}}}(h_{B(S)}) > e^{−ε} kδ + 6ε + R(h_{B(S)}) ) > 1 − ε.

We complete the proof.

The Eq. (11.9) in Lemma 11.4 conflicts with Eq. (11.7). Thus, we proved Theorem
11.1.
We then compares our results with the existing works.
Comparison of Theorem 11.1.
There have been several high-probability generalization bounds for (ε, δ)-
differentially private machine learning algorithms.
Dwork et al. (2015) proved that

    P( R(A(S)) − R̂_S(A(S)) < 4ε ) > 1 − 8δ/ε.

Oneto et al. (2017) proved that

    P( Diff_R < √(6 R̂_S(A(S)) ε̂) + 6(ε̂² + 1/m) ) > 1 − 3e^{−mε²},

and

    P( Diff_R < √(4 V̂_S(A(S)) ε̂) + (5m/(m − 1))(ε̂² + 1/m) ) > 1 − 3e^{−mε²},

where

    Diff_R = R(A(S)) − R̂_S(A(S)),    ε̂ = ε + 1/m,

and V̂_S(A(S)) is the empirical variance of l(A(S), ·):

    V̂_S(A(S)) = (1/(2m(m − 1))) Σ_{i≠j} ( l(A(S), z_i) − l(A(S), z_j) )².

Nissim and Stemmer (2015) proved that

    P( R(A(S)) − R̂_S(A(S)) < 13ε ) > 1 − (2δ/ε) log(2/ε).

This is the existing tightest high-probability generalization bound in the literature.


However, this bound only stands for binary classification problems. By contrast, our
high-probability generalization bound holds for any machine learning algorithm.
Also, our bound is strictly tighter. All the bounds, including ours, are in the
following form,

    P( R(A(S)) − R̂_S(A(S)) < a ) > 1 − b,

where a and b are two positive constant real numbers. Apparently, a smaller a and
a smaller b imply a tighter generalization bound. Our bound improves the current
tightest result from two aspects:
• Our bound tightens the term a from 13ε to 9ε.
• Our bound tightens the term b from (2δ/ε) log (2/ε) to (2e−ε δ/ε) log (2/ε).
These improvements are significant. Abadi et al. (2016) conducted experiments on
the differential privacy in deep learning. Their empirical results demonstrate that the
factor ε can be as large as 10.
A comparison of Theorem 11.2.
There is only one related work in the literature that presents an on-average gener-
alization bound for multi-database algorithms. Nissim and Stemmer (2015) proved
that,
    
   
    | E_{S∼D^{km}} E_{A(S)} [ R̂_{S_{i_{A(S)}}}(h_{A(S)}) ] − E_{S∼D^{km}} E_{A(S)} [ R(h_{A(S)}) ] | ≤ kδ + 2ε.

Our bound is tighter by a factor of eε . According to the empirical results by Abadi


et al. (2016), this factor can be as large as e10 ≈ 20, 000. It is a significant multiplier
for loss function. Furthermore, the result by Nissim and Stemmer stands only for
binary classification, while our result applies to all differentially private learning
algorithms.

11.2.2.2 How Does the Iterative Nature Contribute?

Most machine learning algorithms are iterative, and the privacy-preserving ability
may degrade as the number of iterations grows. This section studies this degenerative
nature of privacy preservation in iterative machine learning algorithms and its
influence on generalization.
We have the following composition theorem.

Theorem 11.4 (Composition Theorem I) Suppose an iterative machine learning


algorithm A has T steps: {Y_i(S)}_{i=0}^T, where Y_i is the learned hypothesis after the
i-th iteration. Suppose the i-th iterator

    M_i : (Y_{i−1}, S) → Y_i

is (ε_i, δ_i)-differentially private. Then, the algorithm A is (ε', δ')-differentially private.
The factor ε' is as follows,

    ε' = min{ ε'_1, ε'_2, ε'_3 },    (11.12)

where

    ε'_1 = Σ_{i=1}^T ε_i,    (11.13)

    ε'_2 = Σ_{i=1}^T ((e^{ε_i} − 1) ε_i)/(e^{ε_i} + 1) + √( 2 Σ_{i=1}^T ε_i² · log( e + √(Σ_{i=1}^T ε_i²)/δ̃ ) ),

    ε'_3 = Σ_{i=1}^T ((e^{ε_i} − 1) ε_i)/(e^{ε_i} + 1) + √( 2 log(1/δ̃) Σ_{i=1}^T ε_i² ),    (11.14)

and δ̃ is an arbitrary positive real constant.
Correspondingly, the factor δ' is defined as the maximal value of the following
equation with respect to {α_i}_{i=1}^T ∈ I,

    1 − Π_{i=1}^T ( 1 − (e^{α_i} δ_i)/(1 + e^{ε_i}) ) + 1 − Π_{i=1}^T ( 1 − δ_i/(1 + e^{ε_i}) ) + δ̃,    (11.15)

where

    I = { {α_i}_{i=1}^T : Σ_{i=1}^T α_i = ε', |{i : α_i ≠ ε_i, α_i ≠ 0}| ≤ 1 },

and δ̃ is the same real constant mentioned above.
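To give a sense of the magnitudes involved, the sketch below (our own numerical illustration; it evaluates only the ε' part of the theorem, and the δ' part is omitted) compares the three estimates in Eqs. (11.12)–(11.14) for T identical per-iteration budgets.

import numpy as np

def composed_epsilon(eps, delta_tilde):
    # per-iteration budgets eps_1, ..., eps_T combined according to Eqs. (11.12)-(11.14)
    eps = np.asarray(eps, dtype=float)
    drift = np.sum(eps * (np.exp(eps) - 1) / (np.exp(eps) + 1))
    sq = np.sum(eps ** 2)
    e1 = np.sum(eps)                                                     # Eq. (11.13)
    e2 = drift + np.sqrt(2 * sq * np.log(np.e + np.sqrt(sq) / delta_tilde))
    e3 = drift + np.sqrt(2 * np.log(1 / delta_tilde) * sq)               # Eq. (11.14)
    return min(e1, e2, e3)

T, eps_i = 1000, 0.01
print(composed_epsilon([eps_i] * T, delta_tilde=1e-5))   # about 1.5, far below T * eps_i = 10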

Proof Skeleton

We proceed by providing a sketch of the proofs for Theorem 11.4, accompanied by


the derivation of two additional composition theorems as incidental outcomes. While
these two composition theorems are not as strong as Theorem 11.4, they play crucial
roles in the overall proofs. The proof process unfolds across four distinct stages: (1)
we approximate the KL-divergence between hypotheses learned from neighboring
training sample sets; (2) we establish a composition bound for ε-differentially private
learning algorithms; (3) we refine this composition theorem to derive a composition
bound suitable for (ε, δ)-differentially private learning algorithms; and (4) we further
tighten the results from the stage (2) to derive Theorem 11.4.
Stage 1: Approximate the KL-divergence between hypotheses acquired from
adjacent training sample sets.
Addressing the differential privacy of an iterative learning algorithm by directly
considering the differential privacy of each iteration would pose a significant techni-
cal challenge. To overcome this obstacle, we introduce KL divergence as an interme-
diary in this section. Specifically, for any ε-differentially private learning algorithm,
we establish the following lemma to provide an approximation of the KL-divergence
between hypotheses learned from neighboring training sample sets.

Lemma 11.5 If A is an ε-differentially private algorithm, then for every neighbor-
ing database pair S and S', the KL divergence between hypotheses A(S) and A(S')
satisfies the following inequality,

    D_{KL}(A(S)‖A(S')) ≤ ε (e^ε − 1)/(e^ε + 1).

This lemma is new, and its proof involves non-trivial technical challenges. Lemma
11.1 is a crucial component in establishing the proof of Lemma 11.5. Notably, there
are two related results in the literature, albeit considerably less stringent than ours.
Dwork et al. (2010) demonstrated an inequality of the KL divergence as follows,

    D_{KL}(A(S)‖A(S')) ≤ ε(e^ε − 1).



Then, Dwork and Rothblum (2016) further improved it to

    D_{KL}(A(S)‖A(S')) ≤ (1/2) ε(e^ε − 1).    (11.16)
Compared with ours, Eq. (11.16) is larger by a factor of (1 + eε )/2, which can be
very large in practice.

Proof By Lemma 11.1, we have a random variable M(S) and M(S  ), which satisfies

d∞ (M(S)|M(S  )) ≤ ε, d∞ (M(s  )|M(S)) ≤ ε,

and
dkl (A(S)|A(S  )) ≤ dkl (M(S)|M(S  )) = dkl (M(S  )|M(S)). (11.17)

Therefore, we only need to derive a bound for dkl (M(S)|M(S  )).


By direct calculation,

dkl (M(S)|M(S  ))
(∗) 1
= dkl (M(S)|M(S  )) + dkl (M(S  )|M(S))
2 
1 dP(M(S)) 1 dP(M(S  ))
= log dP(M(S)) + log dP(M(S  ))
2 dP(M(S  )) 2 dP(M(S))

1 dP(M(S))
= log d P(M(S)) − P(M(S  ))
2 dP(M(S  ))
 
1 dP(M(S  )) dP(M(S))
+ log + log dP(M(S  ))
2 dP(M(S)) dP(M(S  ))
 
1 dP(M(S))  1
= log d P(M(S)) − P(M(S )) + log 1dP(M(S  ))
2 dP(M(S  )) 2

1 dP(M(S))
= log d P(M(S)) − P(M(S  )) , (11.18)
2 dP(M(S  ))

where Eq. (∗) comes from Eq. (11.17).


We now analyze the last integration in Eq. (11.18). We define

 dP(M(S) = y)
k(y) = − 1. (11.19)
dP(M(S  ) = y)

Therefore,

k(y)dP(M(S  ) = y) = dP(M(S) = y) − dP(M(S  ) = y). (11.20)



Additionally,

M(S  ) k(M(S  ) = k(y)dP(M(S  ) = y)
y∈

 
= d P(M(S) = y) − dP(M(S  ) = y)
y∈
=0. (11.21)

By calculating the integration of both sides of Eq. (11.20), we have



k(y)dP(M(S  ) = y) = 0.

Also, combined with the definition of k(y) (see Eq. 11.19), the right-hand side
(rhs) of Eq. (11.18) becomes

rhs = M(S  ) k(M(S )) log(k(M(S  )) + 1). (11.22)

Since m is ε-differentially private, k(y) is bounded from both sides as follows,

e−ε − 1 ≤ k(y) ≤ eε − 1. (11.23)

We now calculate the maximum of Eq. (11.22) subject to Eqs. (11.21) and (11.23).
First, we argue that the maximum is achieved when k(M(S  )) ∈ {e−ε − 1, eε − 1}
with a probability of 1 (almost surely). when k(M(S  )) ∈ {e−ε − 1, eε − 1}, almost
surely, the distribution for k(M(S  )) is as follows,

1
P∗ (k(M(S  )) = eε − 1) = ,
1 + eε

P∗ (k(M(S  )) = e−ε − 1) = .
1 + eε

We argue that it is the distribution that maximizes k(M(S  )).


For brevity, we denote the probability measure for a given distribution q as Pq .
Similarly, P∗ corresponds with the distribution q ∗ . We prove that q ∗ maximizes
Eq. (11.22) in the following two cases: (1) Pq (k(M(S  )) ≥ 0) ≤ P∗ (k(M(S  )) =
eε − 1), and (2) Pq (k(M(S  )) ≥ 0) > P∗ (k(M(S  )) = eε − 1)
Case 1: Pq (k(M(S  )) ≥ 0) ≤ P∗ (k(M(S  )) = eε − 1)
We have

e M(S  )∼q ∗ (k(M(S  ) log(k(M(S  )) + 1))


= P∗ (k(M(S  )) = eε − 1) · ε(eε − 1) − P∗ (k(M(S  )) = e−ε − 1) · ε(e−ε − 1)
= (P∗ (k(M(S  )) = eε − 1) − Pq (k(M(S  )) ≥ 0)) · ε(eε − 1)

+ Pq (k(M(S  )) ≥ 0) · ε(eε − 1) − P∗ (k(M(S  )) = e−ε − 1) · ε(e−ε − 1)


≥Pq (k(M(S  )) ≥ 0) · ε(1 − e−ε ) − P∗ (k(M(S  )) = e−ε − 1) · ε(1 − e−ε )
+ Pq (k(M(S  )) ≥ 0) · ε(eε − 1) − P∗ (k(M(S  )) = e−ε − 1) · ε(e−ε − 1).

Note that

Pq (k(M(S  )) < 0) = P∗ (k(M(S  )) = eε − 1) − Pq (k(M(S  )) ≥ 0)


+ P∗ (k(M(S  )) = e−ε − 1).

Therefore, together with the condition in Eq. (11.23),

e M(S  )∼q (k(M(S  ) log(k(M(S  )) + 1)i k(M(S  )≤0) )


≤ (P∗ (k(M(S  )) = eε − 1) − Pq (k(M(S  )) ≥ 0)) · ε(1 − e−ε )
+ P∗ (k(M(S  )) = e−ε − 1) · ε(1 − e−ε ). (11.24)

Also,

e M(S  )∼q (k(M(S  ) log(k(M(S  )) + 1)i k(M(S  ))>0 ) ≤ Pq (k(M(S  )) ≥ 0) · ε(eε − 1).
(11.25)
Therefore, by combining the inequalities in Eqs. (11.24) and (11.25), we have

e M(S  )∼q (k(M(S  ) log(k(M(S  )) + 1)) ≤ e M(S  )∼q ∗ (k(M(S  ) log(k(M(S  )) + 1)).

Since the distribution q is arbitrary, the distribution q ∗ maximizes the


k(m(s  ) log(k(m(s  )) + 1).
Case 2: Pq (k(M(S  )) ≥ 0) > P∗ (k(M(S  )) = eε − 1)
We first prove that if Pq (1 − e−ε < k(M(S  )) < 0) = 0, there exists a distribution

q such that

Pq  (k(M(S  )) ≥ 0) = Pq (k(M(S  )) ≥ 0),


Pq  (k(M(S  )) < 0) = Pq (k(M(S  )) < 0),
Pq  (k(M(S  )) < 0) = Pq  (k(M(S  ) = e−ε − 1),
   
q  (k(M(S ) log(k(M(S )) + 1)) > q  (k(M(S ) log(k(M(S )) + 1)),

while the two conditions (Eqs. 11.21, 11.23) still hold.


Additionally, we have assumed that

Pq (k(M(S  )) ≥ 0) > P∗ (k(M(S  )) = eε − 1).

Therefore,
Pq (k(M(S  )) ≤ 0) < P∗ (k(M(S  )) = e−ε − 1).

Also, since the distribution q  is arbitrary, let it satisfy

Pq  (k(M(S  )) < 0) = Pq (k(M(S  )) < 0) = Pq  (k(M(S  ) = e−ε − 1).

Then, in order to meet the condition Eq. (11.21), let

Pq  (k(M(S  ) = eε − 1) > Pq (k(M(S  ) = eε − 1),

and
Pq  (0 < k(M(S  )) < eε − 1) ≤ Pq (0 < k(M(S  )) < eε − 1),

Since x log(x + 1) increases when x > 0 and decreases when x < 0, we have

q  (k(M(S ) log(k(M(S  )) + 1)) > q (k(M(S  ) log(k(M(S  )) + 1)).

Therefore, we have proved that the argument when Pq (k(M(S  )) < 0) =


Pq (k(M(S  )) = e−ε − 1). We now prove the case that

Pq (k(M(S  )) < 0) = Pq (k(M(S  )) = e−ε − 1),

where

q (k(M(S ) log(k(M(S  )) + 1)i k(M(S  ))<0 ) = ε(1 − e−ε )Pq (k(M(S  )) < 0).

By applying Jensen’s inequality to bound the q (k(m(s ) log(k(m(s  )) +
1)i k(m(s  ))≥0 ), we have
 
q (k(M(S )) log(k(M(S )) + 1)i k(M(S  ))≥0 )
= Pq (M(S  ) ≥ 0)q  (k(M(S  )) log(k(M(S  )) + 1)|k(M(S  )) ≥ 0)
(∗)    
≤ Pq (M(S  ) ≥ 0)q k(M(S  ))|k(M(S  )) ≥ 0 · log(q k(M(S  ))|k(M(S  )) ≥ 0 + 1),
(11.26)

where the inequality (∗) uses jensen’s inequality (x log(1 + x) is convex with respect
to x when x > 0). The upper bound in Eq. (11.26) is achieved as long as

Pq (k(M(S  )) ≥ 0) = Pq (k(M(S  )) = q (k(M(S  ))|k(M(S  )) ≥ 0)).

Furthermore,

Pq (k(M(S  )) < 0) = Pq (k(M(S  )) = e−ε − 1).

Therefore, the distribution q is determined by the cumulative density functions


Pq (k(M(S  )) < 0) and Pq (k(M(S  )) ≥ 0).
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 243

Hence, maximizing q (k(M(S  ) log(k(M(S  )) + 1)) is equivalent to maximizing


the following object function,

−ε ε q −ε q −ε
g(q) = q(1 − e ) log e + (1 − q) (1 − e ) log (1 − e ) + 1 ,
1−q 1−q

subject to
q
≤ eε , (11.27)
1−q

where g(q) is the maximum of Eq. (11.22) subject to Pq (k(M(S  )) < 0) = q, and the
condition Eq. (11.27) comes from the Pq (k(M(S  )) ≥ 0) > P∗ (k(M(S  )) = eε − 1)
(the assumption of case 2).
Additionally, g(q) can be represented as follows,

q
q(1 − e−ε ) log (eε − 1) + ε .
1−q

Since both q and q/(1 − q) monotonously increase, g(q) monotonously


increases. Therefore, Q ∗ maximize Eq. (11.22), which completes the proof. 

Stage 2: Prove a weaker composition theorem where every iteration is


ε-differential private.
Based on Lemma 11.5, we can prove the following composition theorem as a
preparation theorem.

Theorem 11.5 (Composition Theorem IV) Suppose an iterative machine learning


algorithm A has T steps: {Yi (S)}i=1
T
. Specifically, we define the i-th iterator as
follows,
Mi : (Yi−1 (S), S) → Yi (S). (11.28)

We assume that Y0 is the initial hypothesis (which does not depend on S). If for any
fixed observation yi−1 of the variable Yi−1 , Mi (yi−1 , S) is εi -differentially private,
then {Yi (S)}i=0
T
is (ε , δ  )-differentially private that


 $ T
% T
1 eεi − 1
ε = 2 log

εi2 + εi .
δ i=1 i=1
eεi + 1

Based on Lemma 11.5, we can prove the following composition theorem for
ε-differential privacy as a preparation theorem of the general case.
244 11 Privacy Preservation

Proof of Theorem 11.5 By calculating log P ({Y i(S  )=yi }i=0


P {Y (S)=y }T )
, we have
( i i i=0 )
T

 
P {Yi (S) = yi }i=0 T
log  
P {Yi (S  ) = yi }i=0
T
$T %
! P (Yi (S) = yi |Yi−1 (S) = yi−1 , ..., Y0 (S) = y0 )
= log
i=0
P (Yi (S  ) = yi |Yi−1 (S  ) = yi−1 , ..., Y0 (S  ) = y0 )
T 
P (Yi (S) = yi |Yi−1 (S) = yi−1 , ..., Y0 (S) = y0 )
= log
i=0
P (Yi (S  ) = yi |Yi−1 (S  ) = yi−1 , ..., Y0 (S  ) = y0 )
T 
(∗) P (Yi (S) = yi |Yi−1 (S) = yi−1 , ..., Y0 (S) = y0 )
= log
i=1
P (Yi (S  ) = yi |Yi−1 (S  ) = yi−1 , ..., Y0 (S  ) = y0 )
T 
P (Mi (yi−1 , S) = yi |Yi−1 (S) = yi−1 , ..., Y0 (S) = y0 )
= log
i=1
P (Mi (yi−1 , S  ) = yi |Yi−1 (S  ) = yi−1 , ..., Y0 (S  ) = y0 )
T 
(∗∗) P (Mi (yi−1 , S) = yi )
= log ,
i=1
P (Mi (yi−1 , S  ) = yi )

where Eq. (∗) comes from the independence of Y0 with respect to S and Eq. (∗∗) is
due the independence of Mi to Yk (k < i) when the Yi−1 is fixed.
By the definition of ε-differential privacy, one has for arbitrary yi−1 as the
observation of Yi−1 ,
   
D∞ Mi (yi−1 , S)Mi (yi−1 , S  ) < εi , D∞ Mi (yi−1 , S  )Mi (yi−1 , S) < εi .

Thus, by Lemma 11.5, we have


 
P (Mi (Yi−1 , S) = Yi ) 
E log  Yi−1 (S) = yi−1 , ..., Y0 (S) = y0
P (Mi (Yi−1 , S  ) = Yi )
= D K L (Mi (yi−1 , S)Mi (yi−1 , S  ))
eεi − 1
≤ εi ε . (11.29)
e i +1

Combining Azuma Lemma (Lemma 11.3), Eq. (11.29) derives the following
equation $ %
 
P {Yi (S) = yi }i=0T
ε
P {Yi (S) = yi }i=0 : 
T
 >e < δ,
P {Yi (S  ) = yi }i=0
T

where S and S  are neighbouring sample sets.


Therefore, the algorithm A is ε -differentially private.
The proof is completed.
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 245

Stage 3: Prove a weaker composition theorem where every iteration is (εi , δi )-


differentially private.
Based on Lemmas 11.1 and 11.3, we proved the following lemma that the
maximum of the following function,

  !
T
f {α }
i i=1 = 1 −
T
(1 − αi Ai ), (11.30)
i=1

is achieved when {αi }i=1


T
are at the boundary under some restrictions.

Lemma 11.6 The maximum of function (11.30) when Ai is positive real such that,

!
T
1 ≤ αi ≤ ci , (here ci Ai ≤ 1), and αi = c,
i=1

is achieved at the point when the cardinality follows the inequality:

|{i : αi = ci and αi = 1}| ≤ 1. (11.31)

Based on Lemmas 11.2 and 11.6, we can prove the following composition theorem
whose estimate of ε is somewhat looser than our main results.

Theorem 11.6 (Composition Theorem V) Suppose an iterative machine learning


algorithm A has T steps: {Yi (S)}i=1
T
. Specifically, let the i-th iterator be as follows,

Mi : (Yi−1 (S), S) → Yi (S). (11.32)

We assume that Y0 is the initial hypothesis (which does not depend on S). If for
any fixed observation yi−1 of the variable Yi−1 , Mi (yi−1 , S) is (εi , δi )-differentially
private (i ≥ 1), then {Yi (S)}i=0
T
is (ε , δ  )-differentially private where


 $ T % T
1 eεi − 1
ε =2 log

εi2 + εi ,
δ̃ i=1 i=1
eεi + 1

!
T  !
T 
 δi
αi δi
δ = max 1 − 1−e +1− 1− + δ̃,
{αi }i=1
T
∈I
i=1
1 + eεi i=1
1 + eεi

and I is defined as the set of {αi }i=1


T
such that

T
αi = ε , |{i : αi = εi and αi = 0}| ≤ 1.
i=1
246 11 Privacy Preservation

Now, we can prove our composition theorems for (ε, δ)-differential privacy. We
first prove a composition algorithm of (ε, δ)-differential privacy whose estimate of ε
is somewhat looser than the existing results. Then, we tighten the results and obtain
a composition theorem that is strictly tighter than the current estimate.

Proof of Theorem 11.6 It has been proved that the optimal privacy preservation
can be achieved by a sequence of independent iterations (see Kairouz et al. (2017),
Theorem 3.5). Therefore, without loss of generality, we assume that the iterations in
our theorem are independent with each other.
Rewrite Yi (S) as Yi0 , and Yi (S  ) as Yi1 (i ≥ 1). Then, by Lemma 11.2, there exist
random variables Ỹi0 and Ỹi1 , such that
  δi
 Yi0 Ỹi0 ≤ , (11.33)
1 + eεi
  δi
 Yi1 Ỹi1 ≤ , (11.34)
1 + eεi
 
D∞ Ỹi0 Ỹi1 ≤εi , (11.35)
 
D∞ Ỹi1 Ỹi0 ≤εi . (11.36)

Applying Theorem 11.6 (here, δ = δ̃), we have


 
δ̃
D∞ {Ỹi0 }i=0
T
{Ỹi1 }i=0
T
≤ ε ,
 
δ̃
D∞ {Ỹi1 }i=0
T
{Ỹi0 }i=0
T
≤ ε .

Apparently,
& '
δi
P(Yi0 ∈ Bi ) − min , P(Yi ∈ Bi ) ≥ 0.
0
1 + eεi

Therefore, for any sequence of hypothesis sets B0 , . . ., BT ,


& '
δ1
P(Y00 ∈ B0 ) P(Y10
∈ B1 ) − min , P(Y10 ∈ B1 )
1 + eε1
& '
δT
· · · P(YT ∈ BT ) − min
0
, P(YT ∈ B1 )
0
1 + e εT
≤P(Ỹ00 ∈ B0 ) · · · P(ỸT0 ∈ BT )

≤eε P(Ỹ01 ∈ B0 ) · · · P(ỸT1 ∈ BT ) + δ̃. (11.37)
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 247

Furthermore, by Eq. (11.36), we also have that


" #
εi 1
P(Ỹi0 ∈ Bi ) ≤ min e , P(Ỹi1 ∈ Bi ).
P(Ỹi1 ∈ Bi )

Therefore,
" #
!
T
1
P(Ỹ00 ∈ B0 ) · · · P(Ỹn0 ∈ BT ) ≤ min eεi , P(Ỹ01 ∈ B0 ) · · · P(ỸT1 ∈ BT ) + δ̃.
i=1
P(Ỹi1 ∈ Bi )

Then, we prove )this theorem * in two cases: ) *


(T  (T 
(1) i=1 min eεi , P(Ỹ 11∈B ) ≤ eε ; and (2) i=1 min eεi , P(Ỹ 11∈B ) > eε .
(T ) i i
* 
i i

Case 1- i=1 min eεi , P(Ỹ 11∈B ) ≤ eε .


i i
We have
 
δ1 δT
P(Ỹ01 ∈ B0 ) P(Ỹ11 ∈ B1 ) − · · · P( Ỹ 1
∈ B T ) −
1 + eε1 T
1 + e εT
≤ P(Y01 ∈ B0 ) · · · P(YT1 ∈ BT ).

By simple calculation, we have


" #
!
T
1
εi
min e , P(Ỹ01 ∈ B0 ) · · · P(ỸT1 ∈ BT )
i=1
P(Ỹi1 ∈ Bi )
" #
!
T
1
εi
≤ min e , P(Y01 ∈ B0 ) · · · P(YT1 ∈ BT )
i=1
P(Ỹi1 ∈ Bi )
" #
!
T
1
εi
+ min e , P(Ỹ01 ∈ B0 ) · · · P(ỸT1 ∈ BT )
i=1
P(Ỹi1 ∈ Bi )
" #
!
n
1
− min eεi , P(Ỹ01 ∈ B0 )
i=1
P(Ỹi1 ∈ Bi )
 
δ1 δT
P(Ỹ11 ∈ B1 ) − ··· P(ỸT1 ∈ BT ) − .
1 + eε1 1 + e εT

Apparently, " #
εi 1
min e , P(Ỹ0i ∈ Bi ) ≤ 1,
P(Ỹi1 ∈ Bi )

and when A > B, f (x) = Ax − (x − a)B increases when x increases.


248 11 Privacy Preservation

Therefore, we have
" #
!
T
1
εi
min e , P(Ỹ01 ∈ B0 ) · · · P(ỸT1 ∈ BT )
i=1
P(Ỹi1 ∈ Bi )
" #
!
T
1
εi
− min e , P(Ỹ01 ∈ B0 )
i=1
P(Ỹi1 ∈ Bi )
 
δ1 δT
P(Ỹ11 ∈ B1 ) − · · · P( Ỹ 1
∈ B T ) −
1 + eε1 T
1 + e εT
$ " # %
!T
1 δi
≤1 − 1 − min eεi , .
i=1
P(Ỹi ∈ Bi ) 1 + eεi
1

Combining with Eq. (11.37), we have


$ " # % 
!
T
1 δi !
T
δi
 εi
δ ≤1− 1 − min e , +1− 1− + δ̃.
i=1
P(Ỹi1 ∈ Bi ) 1 + eεi i=1
1 + eεi

(T ) * 
εi
Case 2- i=1 min e , 1
P(Ỹ 1 ∈B )
> eε :
i i

There exists a sequence of reals {αi }i=1


T
such that
" # T
αi εi 1
e ≤ min e , , αi = ε .
P(Ỹi1 ∈ Bi ) i=1

Therefore, similar to Case 1, we have that

!
T  !
T 
 δiαi δi
δ ≤1− 1−e +1− 1− .
i=1
1 + eεi i=1
1 + eεi

Overall, we have proven that

!
T  !
T 
δi δi
δ ≤ 1 − 1 − eαi +1− 1− ,
i=1
1 + eεi i=1
1 + eεi

where i=1 T
αi ≤ ε and αi ≤ εi .
From Lemma 11.6, the minimum is realized on the boundary, which is exactly as
this theorem claims.
The proof is completed.
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 249

Stage 4: Prove Theorem 11.4.


Applying Lemma 11.3 and Theorem 3.5 in Kairouz et al. (2015), we eventually
extend the weaker versions to Theorem 11.4. Theorem 3.5 in Kairouz et al. (2017)
relies on the term privacy area. A larger privacy area corresponds to a worse privacy
preservation. In this section, we make a novel contribution that proves the moment
generating function of the following random variable represents the worst case,

P(∩i Yi (S) ∈ Bi )
log , (11.38)
P(∩i Yi (S  ) ∈ Bi )

where Yi (S) and Yi (S  ) are the mechanisms achieving the largest privacy area.
Thus, we can deliver an approximation of the differential privacy via this moment
generating function.
Proof of Theorem 11.4 By applying Theorem 3.5 proposed in Kairouz et al. (2017)
and replacing ε in the proof of Theorem 11.6 as

ε = min {I1 , I2 , I3 } ,

where
T
I1 = εi ,
i=1
 
 ⎛ ⎞
T εi  T T
ε 2
(e − 1) εi 
2εi2 log ⎝e + ⎠,
i=1 i
I2 = εi + 1
+
i=1
e i=1
δ̃

 T 
(eεi − 1) εi 
T
 1
I3 = + 2
2εi log .
i=1
eεi + 1 i=1
δ̃

The proof is completed.


Comparison with Existing Results

Our composition theorem is strictly tighter than the existing results.


A classic composition theorem is as follows (see Dwork and Roth (2014), Theorem
3.20 and Corollary 3.21, pp. 49–52),

  T
T  1
T

ε = εi 
εi (e − 1) + 2 log εi2 , δ  = δ̃ + δi ,
i=1
δ i=1 i=1

where δ̃ is an arbitrary positive real number, (ε , δ  ) is the differential privacy of the
whole algorithm, and (εi , δi ) is the differential privacy of the i-th iteration.
250 11 Privacy Preservation

Currently, the tightest approximation is given by Kairouz et al. (2017) as follows,


 
ε = min ε1 , ε2 , ε3 , δ  = 1 − (1 − δ)T (1 − δ̃),

where

ε1 = T ε,

 $ %
 √
(e − ε
1) εT  T ε 2
ε2 = + ε2T log e + ,
eε + 1 δ̃
+ 
 (eε − 1) εT 1
ε3 = + ε 2T log .
eε + 1 δ̃

The estimate of the ε is the 


- as ours, while their δ is also larger than ours
, same
ε
approximately δ eeε −1
+1
T − εε . In many situations, the number of iterations T is
overwhelmingly large, which guarantees that our advantage is significant.
When all the iterations have the same privacy-preserving ability, we can tighten
the approximation of the factor δ  as the following corollary.

Corollary 11.2 (Composition Theorem II) When all the iterations are (ε, δ)-
differential private, δ  is
, - , -
 εε T − εε T
 ε δ δ δ
δ =1 − 1 − e 1 − + 1 − 1 − + δ̃
1 + eε 1 + eε 1 + eε
     $ 2 %
ε 2δ ε δ
= T− + δ + δ̃ + O .
ε 1 + eε ε 1 + eε
, -
ε
Proof The maximum of δ  is achieved when at most T − ε
elements αi = 0. We
note that (1 − x) = 1 − nx + O(x ). Then, the proof is completed by estimating
n 2

the δ  in Theorem 11.4 as

, - , -
 ε T − ε T
 δ ε
ε δ ε δ
δ =1 − 1 − e 1− + 1 − 1 − + δ̃
1 + eε 1 + eε 1 + eε
$ 2 %
δ δ
=1 + T + δ̃ + O
1 + eε 1 + eε
⎛ , - $ ⎞
ε 2 % $    $ 2 %%
ε δ δ ε δ δ
− ⎝1 − +O ⎠ 1− T − +O
1 + eε 1 + eε ε 1 + eε 1 + eε
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 251

     $ 2 %
εδ ε δ δ δ
= + T − + T + O + δ̃
1 + eε
ε ε 1 + eε 1 + eε 1 + eε
    
ε 2δ ε
≈ T− + δ + δ̃.
ε 1 + eε ε

When all the iterators Mi are ε-differentially private, we can further tighten the
third estimation of ε in Theorem 11.4, Eq. (11.12) as the following theorem.

Corollary 11.3 (Composition Theorem III) Suppose all the iterators Mi are
ε-differentially private and all the other conditions in Theorem 11.4 hold. Then,
the algorithm A is (ε , δ  )-differentially private that
+ 
(eε − 1) ε 1
ε =T ε + 2 log T ε2 ,
e +1 δ̃
, - , -
 ε T − ε T
δ ε δ ε δ
δ  =1 − 1 − eε 1− +1− 1− + δ  ,
1 + eε 1 + eε 1 + eε

where δ  is defined as follows:


T − ε +T ε



− ε +T ε 1 2T ε T ε + ε 2ε
δ =e 2 .
1 + eε T ε − ε T ε − ε

Proof of Corollary 11.3 Let P0 and P1 be two distributions whose cumulative


distribution functions P0 and P1 are respectively defined as follows:


⎪ δ, x =0



⎪ (1 − δ)eε

⎨ , x =1
1 + eε
P0 (x) = ,

⎪ 1−δ

⎪ , x =2
⎪ ε
⎪1+e


0, x =3

and ⎧

⎪ 0, x =0



⎪ (1 − δ)eε

⎨ , x =1
1 + eε
P1 (x) = .

⎪ 1−δ

⎪ , x =2

⎪ 1 + eε


δ, x =3

According to Theorem 3.4 by Kairouz et al. (2017), the largest magnitude of the
(ε , δ  )-differential privacy can be calculated from the P0T and P1T .
252 11 Privacy Preservation

Construct P̃0 and P̃1 , whose cumulative distribution functions are as follows,
⎧ ε
⎪ e δ

⎪ , x =0
⎪ 1 + eε



⎪ (1 − δ)eε


⎨ , x =1
1 + eε
P̃0 (x) = ,

⎪ 1−δ

⎪ , x =2

⎪ 1 + eε



⎩ δ ,

x =3
1 + eε

and ⎧
⎪ δ

⎪ , x =0

⎪ 1 + eε

⎪ (1 − δ)eε



⎨ , x =1
1 + eε
P̃1 (x) .
⎪ 1−δ ,

⎪ x =2


⎪ 1 + eε


⎪ ε
⎩ e δ ,

x =3
1 + eε

One can easily verify that

δ
(P0 P̃0 ) ≤ ,
1 + eε
δ
(P1 P̃1 ) ≤ ,
1 + eε
D∞ (P̃0 P̃1 ) ≤ ε,
D∞ (P̃1 P̃0 ) ≤ εi .
 
P̃0 (xi ) T
Let Vi (xi ) = log P̃1 (xi )
and S(x1 , . . . , x T ) = i=1 Vi (xi ). We have that for any
t > 0,

PP̃0T ({xi } : S({xi }) > ε ) ≤ e−ε t EP̃0T (et S )
T
−ε t etε+ε e−tε
=e +
1 + eε 1 + eε
T
−ε t−T tε e2tε+ε 1
=e + . (11.39)
1 + eε 1 + eε

By calculating the derivative, we can deduce that the minimum of the RHS of
Eq. (11.39) is achieved at
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 253

T ε + ε
e2εt = e−ε . (11.40)
T ε − ε
ε
Since ε ≥ T eeε −1
+1
,
T ε + ε
e−ε > 1.
T ε − ε

Therefore, by applying Eq. (11.40) into the RHS of Eq. (11.39), we have

T − ε +T ε



− ε +T ε 1 2T ε T ε + ε 2ε
PP̃0T ({xi } : S({xi }) > ε ) ≤ e 2 .
1 + eε T ε − ε T ε − ε
(11.41)

We define the RHS of Eq. (11.41) as δ  . We have (P̃b )T (b = 0, 1) is (ε ,δ  )-


differentially private. Then, using similar analysis of the Proof of Theorem 11.6, we
prove this theorem.

We now analyze the tightness of Corollary 11.3. Specifically, we compare it with


our Theorem 11.4.
In the proof of Theorem 11.4, ε3 is derived through Azuma Lemma (Lemma 11.3).
Specifically, the δ  is derived by
 
eε − 1
  eε −1
P ST ≥ ε − T ε ≤ e−t (ε −T eε +1 ) E et ST
e +1
 eε −1
= e−t ( −T eε +1 ) E et ST −1 E etVT |Y1 (S), . . . , YT −1 (S)
 eε −1
≤ e−t ( −T eε +1 ) E et ST −1 e4t ε /8
2 2

 eε −1
≤ e−t ( −T eε +1 ) e T t ε /2
2 2
,
 
P(Yi (S)) P(Yi (S))
where Vi is defined as log P(Y 
i (S ))
− E log 
P(Yi (S ))
|Y1 (S), . . . , Yi−1 (S) and S j is
j
defined as i=1 Vi .
ε
Since P ST ≥ ε − T eeε −1
+1
does not depend on t,
  ε
eε − 1
 −
(  −T eε −1 )2
e +1
P ST ≥ ε − T ε ≤ min e 2T ε 2
= δ,
e +1 t>0

By contrary, the approach here directly calculates E[et ST ], without the shrinkage
in the proof of Theorem 11.4. Specifically,
T
 e2tε+ε 1  eε −1
e−ε t−T tε = e−t E et ST ≤ e−t ( −T eε +1 ) e T t ε /2
2 2
+ .
1 + eε 1 + eε
254 11 Privacy Preservation

Therefore,
T
 e2tε+ε 1  eε −1
min e−ε t−T tε ≤ min e−t ( −T eε +1 ) e T t ε /2
2 2

ε
+ ,
t>0 1+e 1 + eε t>0

which leads to
T − ε +T ε

− ε +T ε 1 2T ε T ε + ε 2ε
e 2 ≤ δ.
1 + eε T ε − ε T ε − ε

It ensures that this estimate further tightens δ  than Theorem 11.4 if the ε is the same.
The trio of composition theorems expands upon the established connection
between generalizability and privacy-preserving capabilities to encompass itera-
tive machine learning algorithms. With these theorems, we establish the theoretical
groundwork for understanding the generalizability of iterative differentially private
machine learning algorithms.

11.2.3 Applications

Our theories apply to a wide spectrum of machine learning algorithms. This section
implements them on two popular regimes as examples: (1) stochastic gradient
Langevin dynamics (Welling and Teh 2011) as an example of the stochastic gra-
dient Markov chain Monte Carlo scheme (Ma et al. 2015; Zhang et al. 2018); and (2)
agnostic federated
√ learning (Geyer et al. 2017; Mohri et al. 2019). Our results help
deliver O( log m/m) high-probability generalization bounds and PAC-learnability
guarantees for the two schemes.

11.2.3.1 Application in SGLD

Bayesian inference estimates the posterior distribution of model parameters in para-


metric machine learning models, enabling us to approach to the optimal parame-
ters by iteratively increasing evidence. However, in practice, obtaining an analytical
expression for the posterior distribution is challenging. Consequently, Markov Chain
Monte Carlo (MCMC) methods are widely deployed to posterior inference (Hast-
ings 1970; Duane et al. 1987). Unfortunately, traditional MCMC methods can be
computationally demanding in general, especially when dealing with large-scale
datasets. In response, stochastic gradient Markov chain Monte Carlo (SGMCMC,
Ma et al. (2015)) methods have emerged by integrating stochastic gradient estimation
techniques (Robbins and Monro 1951) into Bayesian inference. Stochastic Gradi-
ent Langevin Dynamics (SGLD) Welling and Teh (2011), serves as a representative
example of SGMCMC methods. SGMCMC methods have found applications across
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 255

a wide range of domains, including topic modeling (Larochelle and Lauly 2012;
Zhang et al. 2020), Bayesian neural networks (Louizos and Welling 2017; Roth and
Pernkopf 2018; Ban et al. 2019; Ye et al. 2020), and generative models (Wang et al.
2019; Kobyzev et al. 2020). In this section, we investigate the analysis of SGLD as
an example of the SGMCMC framework. SGLD is illustrated as the following table.

Algorithm 1 Stochastic Gradient Langevin Dynamics


Require: Samples S = {z 1 , ...z m }, Gaussian noise variance σ , size of mini-batch τ , iteration steps
T , learning rate {η1 , ...ηT }, Regularization function r , Lipschitz constant L of loss l.
1: Initialize θ0 randomly.
2: For t = 1 to T do:
3: Randomly sample a mini-batch B of size τ ;
4: Sample gt from N (0, σ 2 I );
5: Update
6: θt ← θt−1 − ηt τ1 ∇r (θt−1 ) + τ1 z∈B ∇l(z|θt−1 ) + gt .

The following theorem provides an estimation of the differential privacy and


generalization bounds of SGLD.

Theorem 11.7 SGLD is (ε , δ  )-differentially private. The factor ε is as follows,


+   τ
 1 τ2 2 τ e2 m ε̃ − 1
ε = 8 log T ε̃ + 2T ε̃ τ ,
δ̃ m2 m e2 m ε̃ + 1

and the factor δ  is as follows,


, - , -
τδ  mε τδ T − mε τ T

τ 2τ ε̃ 2τ ε̃
1− 1− e2 m ε̃ m
1− m
+1− 1− + δ̃,
2 mτ ε̃ 2 mτ ε̃ τ
2 m ε̃
1+e 1+e 1+e

where
√ 
2 2Lσ τ1 log 1δ + 4
τ2
L2
ε̃ = ,
2σ 2
and
T − mε2τ+τε̃ T ε̃

τ T ε̃
ε + m 1 2 mτ T ε̃ τ
ε̃T + ε
δ̃ = e 2
τ τ
m
τ .
1 + e m ε̃ m
T ε̃ − ε m
T ε̃ − ε

Additionally, a generalization bound is delivered by combining with Theorem 11.1.

Some existing works have also studied the privacy-preservation and generalization
of SGLD. Wang et al. (2015) proved that SGLD has “privacy for free” without
injecting noise. Specifically, the authors proved that SGLD is (ε, δ)-differentially
private if
256 11 Privacy Preservation

ε2 m
T > .
32τ log(2/δ)

Pensia et al. (2018) analyzed the generalizability of SGLD via information theory.
Some works also deliver generalization bounds via algorithmic stability or the PAC-
Bayesian framework (Hardt et al. 2016; Raginsky et al. 2017; Mou et al. 2018).
Our Theorem 11.7 also demonstrates that SGLD is PAC-learnable under the
following assumption.

Assumption 11.8 There exist constants K 1 > 0, K 2 , T0 , and m 0 , such that, for T >
T0 and any m > m 0 , we have

R̂ S (A(S)) ≤ exp(−K 1 T + K 2 ).

This assumption can be easily justified: the training error is possible to achieve a
near-0 training error in machine learning. Then, we have the following remark.

Remark 11.1 Theorem 11.7 implies that


   
T T
P R̂ S (A(S)) ≤ O √ + R(A(S)) ≥ 1 − O √ .
m m

It leads to a PAC-learnable guarantee under Assumption 11.8.

Proof of Theorem 11.7 We first calculate the differential privacy of each step. We
assume mini-batch B has been selected and define ∇ R̂ Sτ (θ ) as following:

∇ R̂ Sτ (θ ) = ∇r (θ ) + ∇l(z|θ ).
z∈B

For any two neighboring sample sets S and S  and fixed θi−1 , we have

p(θiS = θi |θi−1
S =θ
i−1 ) p(ηi (− τ1 ∇ R̂ τS (θi−1 ) + N (0, σ 2 I)) = θi − θi−1 )
max = max
θi S S
p(θi = θi |θi−1 = θi−1 ) θi p(ηi (− τ1 ∇ R̂ τS  (θi−1 ) + N (0, σ 2 I)) = θi − θi−1 )
p(ηi (− τ1 ∇ R̂ τS (θi−1 ) + N (0, σ 2 I)) = θi )
= max .
θi p(ηi (− τ1 ∇ R̂ τS  (θi−1 ) + N (0, σ 2 I)) = θi )

We define

p(− τ1 ∇ R̂ Sτ (θi−1 ) + N (0, σ 2 I) = θ  )


D(θ  ) = log ,
p(− τ1 ∇ R̂ Sτ  (θi−1 ) + N (0, σ 2 I) = θ  )

where θ  = 1 
θ
ηi i
obeys − τ1 ∇ R̂ Sτ (θi−1 ) + N (0, σ 2 I).
Let θ  = θ + τ1 ∇ R̂ Sτ (θi−1 ) and we rewrite D(θ  ) as:

11.2 The Interplay of Generalizability and Privacy-Preserving Ability 257

θ  + τ1 ∇ R̂ τS (θi−1 ))2

 e− 2σ 2
D(θ ) = log θ  + τ1 ∇ R̂ τ  (θi−1 )2

e−
S
2σ 2

θ + τ ∇ R̂ Sτ (θi−1 ))2
 1
θ  + τ1 ∇ R̂ Sτ  (θi−1 )2
=− +
2σ 2 2σ 2
θ  2 θ + τ ∇ R̂ S  (θi−1 ) − τ ∇ R̂ Sτ (θi−1 ))2
 1 τ 1
=− +
2σ 2 2σ 2
2θ τ (∇ R̂ S  (θi−1 ) − ∇ R̂ S (θi−1 )) + τ12 (∇ R̂ Sτ  (θi−1 ) − ∇ R̂ Sτ (θi−1 )2 )
T 1 τ τ
= .
2σ 2

We define ∇ R̂ Sτ  (θi−1 ) − ∇ R̂ Sτ (θi−1 ) as v. By the definition of L, we have

v < 2L .

Therefore, since θ T v ∼ N (0, v2 σ 2 ), by the Chernoff Bound technique,


$ 2 % $ 2 %
√ 1 √ 1
P θ T v ≥ 2 2Lσ log ≤ P θ T v ≥ 2vσ log
δ δ
√ √ 1 T
≤ min e− 2tvσ log δ E(etθ v )
t
= δ.

Therefore, with probability at least 1 − δ with respect to θ  , we have


√ 
2 2Lσ τ1 log 1δ + 4
τ2
L2
D(θ  ) ≤ .
2σ 2
√ 
We define ε = 21 (2 2Lσ τ1 log 1δ + 4
τ2
L 2 )/σ 2 . By applying Lemma 4.4 in
Beimel et al. (2010), we arrive at the conclusion iteration − τ1 ∇ R̂ Sτ (θi−1 ) + N (0, σ 2 I)
is (2τ m −1 ε, τ m −1 δ)-differentially private. By applying Theorem 11.3 and
+   τ
 1 τ2 2 τ e2 m ε − 1
ε = 8 log T ε + 2T ε 2 τ ε ,
δ̃ m2 m e m +1

we can prove the differential privacy.


By sampling B randomly and applying Theorem 11.1, we can prove the
generalization bound.
The proof is completed.
258 11 Privacy Preservation

11.2.3.2 Application in Agnostic Federated Learning

In today’s data-driven world, enormous amounts of personal information, ranging


from financial transactions to medical records, are routinely collected and used.
While such data holds immense value for deriving insights on a population scale,
it simultaneously poses significant concerns regarding individual privacy leaking.
Federated learning (Shokri and Shmatikov 2015; Konečnỳ et al. 2016; McMahan
et al. 2017; Yang et al. 2019) aims to protect the privacy by adopting a decentralized
approach. Specifically, instead of accessing raw data stored on personal devices,
federated learning deploys learning models directly on these devices. The central
server then aggregates gradients computed locally on each device and distributes
weight updates without ever accessing the raw data. This innovative framework offers
a promising solution to the privacy-preserving dilemma. Furthermore, algorithms
developed by Geyer et al. (2017), Mohri et al. (2019) introduce additional layers of
privacy protection by shielding client identities from potential differential attacks.

Algorithm 2 Differentially Private Federated Learning


Require: Clients {c1 , ...cm }, Gaussian noise variance σ , size of mini-batch τ , iteration steps T ,
upper bound L of the step size.
1: Initialize θ0 randomly.
2: For t = 1 to T do:
3: Randomly sample a mini-batch of clients of size τ ;
4: Randomly sample bt from N (0, L 2 σ 2 I );
5: $ θt−1 to the clients in the mini-batch
Central curator distributes % B;
ClientUpdate(c ,θt )
6: Update θt+1 ← θt + 1
B i∈B

h 
i
+ bt .
max 1, Li 2

The following theorem provides an estimation of the differential privacy and a


generalization bound for agnostic federated learning.

Theorem 11.9 (Differential Privacy and Generalization Bounds of Differentially


Private Federated Learning) Agnostic federated learning is (ε , δ  )-differentially
private. The factor ε is as follows,
+   τ
 1 τ2 2 τ e2 m ε − 1
ε = 8 log T ε + 2T ε τ , (11.42)
δ̃ m2 m e2 m ε + 1

and the factor δ  is defined as follows,


, - , -
τ  mε τ T − mε
τ T
mδ mδ mδ
2τ ε̃ 2τ ε̃
2 mτ ε̃
1− 1−e 2 mτ ε̃
1− 2 mτ ε̃
+1− 1− 2 mτ ε̃
+ δ̃,
1+e 1+e 1+e

where 
4σ τ1 log 1δ + 1
τ2
ε̃ = ,
2σ 2
11.2 The Interplay of Generalizability and Privacy-Preserving Ability 259

and
T − mε2τ+τε̃ T ε̃

τ T ε̃
ε + m 1 2 mτ T ε̃ τ
ε̃T + ε
δ̃ =e 2
τ τ
m
τ .
1 + e m ε̃ m
T ε̃ − ε m
T ε̃ − ε

Additionally, a generalization bound is delivered by combining with Theorem 11.1.

The following remark gives a PAC-learnable guarantee for agnostic federated


learning.

Remark 11.2 Theorem 11.9 implies that


   
T T
P R̂ S (A(S)) ≤ O √ + R(A(S)) ≥ 1 − O √ .
m m

It leads to a PAC-learnable guarantee under Assumption 11.8.

To prove Theorem 11.9, we only need to prove the differential privacy part of
Theorem 11.9, while the rest of the proof is similar to that of Theorem 11.7.

Proof of Theorem 11.9 The proof bears resemblance to the proof of Theorem 11.7.
One only has to notice that each update is still a Gauss mechanism, while
3 3
3 h it 3
3 3
3 3 ≤ L.
3 max(1, h i 2 ) 3
t

Then, in this situation, D(θ  ) is as follows:


$ $ % %
h it 
p 1
τ h it 2
+ N (0, L σ I) = θ
2 2
ck ∈B max(1, )
 L
D(θ ) = log $ $ % %.
h it
p 1
τ h it 2
+ N (0, L 2 σ 2 I) = θ 
ck ∈B max(1, L )

All other reasoning is the same as the previous proof.


By Theorem 11.3 and
+   τ
 1 τ2 2 τ e2 m ε − 1
ε = 8 log T ε + 2T ε 2 τ ε ,
δ̃ m2 m e m +1

we can calculate the differential privacy of federated learning.


The proof is completed.
260 11 Privacy Preservation

References

Abadi, Martin, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar,
and Li Zhang. 2016. Deep learning with differential privacy. In ACM SIGSAC Conference on
Computer and Communications Security, 308–318.
Ban, Yutong, Xavier Alameda-Pineda, Laurent Girin, and Radu Horaud. Variational bayesian infer-
ence for audio-visual tracking of multiple speakers. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2019.
Beimel, Amos, Shiva Prasad Kasiviswanathan, and Kobbi Nissim. 2010. Bounds on the sample
complexity for private learning and private data release. In Theory of Cryptography Conference,
437–454. Springer.
Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. Concentration inequalities: a
nonasymptotic theory of independence. Oxford University Press.
Bun, Mark, and Thomas Steinke. 2016. Concentrated differential privacy: simplifications, exten-
sions, and lower bounds. In Theory of Cryptography Conference, 635–658.
Chaudhuri, Kamalika, Jacob Imola, and Ashwin Machanavajjhala. 2019. Capacity bounded
differential privacy. arXiv preprint arXiv:1907.02159.
Cuff, Paul, and Lanqing Yu. 2016. Differential privacy as a mutual information constraint. In ACM
SIGSAC Conference on Computer and Communications Security, 43–54.
Duane, Simon, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. 1987. Hybrid Monte
Carlo. Physics Letters B 195 (2): 216–222.
Dwork, Cynthia, and Aaron Roth. 2014. The algorithmic foundations of differential privacy.
Foundations and Trends® in Theoretical Computer Science 9 (3–4): 211–407.
Dwork, Cynthia, and Deirdre K Mulligan. 2013. It’s not privacy, and it’s not fair. Stanford Law
Review Online 66: 35.
Dwork, Cynthia, and Guy N Rothblum. 2016. Concentrated differential privacy. arXiv preprint
arXiv:1603.01887.
Dwork, Cynthia, Guy N Rothblum, and Salil Vadhan. 2010. Boosting and differential privacy. In
IEEE Annual Symposium on Foundations of Computer Science, 51–60.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon
Roth. 2015. Preserving statistical validity in adaptive data analysis. In Annual ACM Symposium
on Theory of Computing, 117–126.
Geumlek, Joseph, Shuang Song, and Kamalika Chaudhuri. 2017. Renyi differential privacy
mechanisms for posterior sampling. In Advances in Neural Information Processing Systems,
5289–5298.
Geyer, Robin C, Tassilo Klein, and Moin Nabi. 2017. Differentially private federated learning: a
client level perspective. In Advances in Neural Information Processing Systems.
Hardt, Moritz, Ben Recht, and Yoram Singer. 2016. Train faster, generalize better: Stability of
stochastic gradient descent. In International Conference on Machine learning, 1225–1234.
Hastings, W Keith. 1970. Monte Carlo sampling methods using Markov chains and their
applications.
He, Fengxiang, Bohan Wang, and Dacheng Tao. 2020. Tighter generalization bounds for iterative
differentially private learning algorithms. arXiv preprint arXiv:2007.09371.
Kairouz, Peter, Sewoong Oh, and Pramod Viswanath. 2015. The composition theorem for
differential privacy. In International Conference on Machine Learning, 1376–1385.
Kairouz, Peter, Oh. Sewoong, and Pramod Viswanath. 2017. The composition theorem for
differential privacy. IEEE Transactions on Information Theory 63 (6): 4037–4049.
Kobyzev, Ivan, Simon Prince, and Marcus Brubaker. 2020. Normalizing flows: an introduction and
review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Konečnỳ, Jakub, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and
Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. In
Advances in Neural Information Processing Systems Workshop on Private Multi-Party Machine
Learning.
References 261

Larochelle, Hugo, and Stanislas Lauly. 2012. A neural autoregressive topic model. In Advances in
Neural Information Processing Systems, 2708–2716.
Liao, Jiachun, Lalitha Sankar, Vincent YF Tan, and Flavio du Pin Calmon. 2017. Hypothesis testing
under mutual information privacy constraints in the high privacy regime. IEEE Transactions on
Information Forensics and Security 13 (4): 1058–1071.
Louizos, Christos, and Max Welling. 2017. Multiplicative normalizing flows for variational
Bayesian neural networks. In International Conference on Machine Learning, 2218–2227.
Ma, Yi-An, Tianqi Chen, and Emily Fox. 2015. A complete recipe for stochastic gradient mcmc.
In Advances in Neural Information Processing Systems, 2917–2925.
McMahan, H Brendan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y
Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In
International Conference on Artificial Intelligence and Statistics.
McSherry, Frank, and Kunal Talwar. 2007. Mechanism design via differential privacy. In 48th
Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), 94–103. IEEE.
Mironov, Ilya. 2017. Rényi differential privacy. In IEEE Computer Security Foundations Sympo-
sium, 263–275.
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. Foundations of Machine
Learning. MIT Press.
Mohri, Mehryar, Gary Sivek, and Ananda Theertha Suresh. 2019. Agnostic federated learning. In
International Conference on Machine Learning, 4615–4625.
Mou, Wenlong, Liwei Wang, Xiyu Zhai, and Kai Zheng. 2018. Generalization bounds of sgld for
non-convex learning: two theoretical viewpoints. In Annual Conference on Learning Theory,
605–638.
Nissim, Kobbi, and Uri Stemmer. 2015. On the generalization properties of differential privacy.
CoRR, abs/1504.05800.
Oneto, Luca, Sandro Ridella, and Davide Anguita. 2017. Differential privacy and generalization:
Sharper bounds with applications. Pattern Recognition Letters 89: 31–38.
Pensia, Ankit, Varun Jog, and Po-Ling Loh. 2018. Generalization error bounds for noisy, iterative
algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), 546–550.
Pittaluga, Francesco, and Sanjeev Jagannatha Koppal. 2016. Pre-capture privacy for small vision
sensors. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (11): 2215–2226.
Raginsky, Maxim, Alexander Rakhlin, and Matus Telgarsky. 2017. Non-convex learning via stochas-
tic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory,
1674–1703.
Robbins, Herbert, and Sutton Monro. 1951. A stochastic approximation method. The Annals of
Mathematical Statistics, 400–407.
Roth, Wolfgang, and Franz Pernkopf. 2018. Bayesian neural networks with weight sharing using
Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (1):
246–252.
Shokri, Reza, and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In ACM SIGSAC
Conference on Computer and Communications Security, 1310–1321.
Vapnik, Vladimir. 2013. The Nature of Statistical Learning Theory. Springer Science & Business
Media.
Wang, Yu-Xiang, Stephen Fienberg, and Alex Smola. 2015. Privacy for free: posterior sampling and
stochastic gradient monte carlo. In International Conference on Machine Learning, 2493–2502.
Wang, Weina, Lei Ying, and Junshan Zhang. 2016. On the relation between identifiability, differen-
tial privacy, and mutual-information privacy. IEEE Transactions on Information Theory 62 (9):
5018–5029.
Wang, Hongwei, Jialin Wang, Jia Wang, Miao Zhao, Weinan Zhang, Fuzheng Zhang, Wenjie Li,
Xing Xie, and Minyi Guo. 2019. Learning graph representation with generative adversarial nets.
IEEE Transactions on Knowledge and Data Engineering 33 (8): 3090–3103.
Welling, Max, and Yee W Teh. 2011. Bayesian learning via stochastic gradient Langevin dynamics.
In International Conference on Machine Learning, 681–688.
262 11 Privacy Preservation

Yang, Qiang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated machine learning:
concept and applications. ACM Transactions on Intelligent Systems and Technology 10 (2): 12:1–
12:19. ISSN 2157-6904.
Ye, Qiaoling, Arash A Amini, and Qing Zhou. 2020. Optimizing regularized cholesky score for
order-based learning of bayesian networks. IEEE Transactions on Pattern Analysis and Machine
Intelligence 43 (10): 3555–3572.
Zhang, Hao, Bo Chen, Yulai Cong, Dandan Guo, Hongwei Liu, and Mingyuan Zhou. 2020. Deep
autoencoding topic model with scalable hybrid Bayesian inference. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
Zhang, Cheng, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. 2018. Advances in vari-
ational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8):
2008–2026.
Chapter 12
Algorithmic Fairness

Deep learning technology has been widely deployed in increasingly critical decision-
making tasks, such as mortgage approval, credit card assessment, college admissions,
employee selection, and recidivism prediction. However, these areas are observed
to be subject to long-standing systematic discrimination against certain people on
the basis of diverse background traits, including race, gender, nationality, age, and
religion. Unfortunately, the introduction of intelligent algorithms into these areas
has failed to relieve the discrimination conundrum because people with minority
backgrounds are institutionally underrepresented in the historical data that fuel the
algorithmic systems. Thus, the unfairness residing in biased historical data is inher-
ited and sometimes intensified by a machine learning (ML) model that is trained on
such data. The consequent fairness concerns are particularly severe due to the black-
box nature of ML algorithms. Therefore, mitigation of the fairness issues arising in
ML applications is both urgent and important.

12.1 Definitions of Fairness

In general, there are two types of fairness related to ML: individual fairness and
group fairness. The concept of individual fairness was first proposed by Dwork et al.
(2012), based on the core principle that similar individuals should be treated similarly.
Since then, many other individual fairness measures have been developed, including
but not limited to average individual fairness (Kearns et al. 2019), counterfactual
fairness (Kusner et al. 2017), meritocratic fairness (Kearns et al. 2017), and others
(Yurochkin and Sun 2021). On the other hand, group fairness measures the level
of bias across different groups of individuals. The most commonly used measures
include demographic parity (Calders et al. 2009), equalized odds (Hardt et al. 2016),

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025 263
F. He and D. Tao, Foundations of Deep Learning, Machine Learning: Foundations,
Methodologies, and Applications, https://doi.org/10.1007/978-981-16-8233-9_12
264 12 Algorithmic Fairness

and equalized opportunity (Hardt et al. 2016), among others (Zafar et al. 2017; Choi
et al. 2020).
We consider a binary classification task. Each sample has the form z = (x, a, y),
where x ∈ X is the input feature, a ∈ A = {0, 1} represents one or more sensitive
attributes (e.g., gender, race, and age), and y ∈ {0, 1} is the prediction target. Let Z ,
X , A, and Y denote the random variables corresponding to z, x, a, and y, respec-
tively. Then, the goal of fair ML is to learn a binary classifier h : X × A → {0, 1}
while ensuring a specific notion of fairness concerning the sensitive attributes A.
For simplicity, we define Ŷ := h(X, A) to denote the prediction of the classifier h
for variable Z = (X, A, Y ).

Group fairness.

A major family of fairness concepts is group fairness, which aims to characterize


discrimination across various groups of individuals. The most commonly used def-
initions of group fairness are demographic parity (Calders et al. 2009), equalized
odds (Hardt et al. 2016), and equal opportunity (Hardt et al. 2016).
Given a data distribution D, a classifier h satisfies
• demographic parity if the prediction Ŷ is independent of the sensitive attributes A,
• equalized odds if the prediction Ŷ and the sensitive attribute A are independent
conditional on the target Y , and
• equal opportunity if P(Ŷ = 1|A = 0, Y = 1) = P(Ŷ = 1|A = 1, Y = 1).
The following metrics are further designed to assess the degree to which a classifier
satisfies the fairness constraints presented in Definition 12.1.
Given a data distribution D, for a binary classifier h,
• the demographic parity DP is defined as

DP (Ŷ , D) = |P(Ŷ = 1|A = 0) − P(Ŷ = 1|A = 1)|, (12.1)

• the equalized odds gap EO is defined as

EO (Ŷ , D) = max |P(Ŷ = 1|A = 0, Y = y) − P(Ŷ = 1|A = 1, Y = y)|,


y∈{0,1}
(12.2)

and
• the equal opportunity gap EOP is defined as

EOP (Ŷ , D) = |P(Ŷ = 1|A = 0, Y = 1) − P(Ŷ = 1|A = 1, Y = 1)|. (12.3)

Combined with Definition 12.1, it is clear that a small value of any fairness
measure in Definition 12.1 would indicate a strongly nondiscriminatory nature of
the given classifier, and vice versa. When the metric DP , EO , or EOP is equal to
12.2 Fairness-Aware Algorithms 265

zero, the classifier h perfectly satisfies demographic parity, equalized odds, or equal
opportunity, respectively.

12.2 Fairness-Aware Algorithms

Based on the mathematical definition of fairness, many algorithms are believed to be


able to mitigate existing biases. We argue that ML algorithms achieve bias mitigation
via three dominant approaches and thus can be divided into three corresponding
classes: preprocessing methods, in-processing methods, and postprocessing methods.

12.2.1 Preprocessing Methods

Preprocessing methods aim to eliminate the inherent bias in data before those data
are fed into learning algorithms (Calders et al. 2009; Feldman et al. 2014; Calmon
et al. 2017; Louizos et al. 2015; Choi et al. 2020; Kamiran and Calders 2012; Zemel
et al. 2013; Zhao et al. 2020). For example, Kamiran and Calders (2012) reviewed
several data preprocessing techniques, including (1) removing attributes that are most
relevant to the target sensitive attribute, (2) changing data labels, and (3) reweighting
or resampling the data. Zemel et al. (2013) proposed learning fair representations
by learning a linear transform to encode all information in the input features except
information that could lead to biased decision-making. This method can enhance both
individual fairness and group fairness. Zhao et al. (2020) focused on group fairness
and further proposed a novel fair representation learning technique that can mitigate
unfairness in terms of both accuracy parity and equalized odds simultaneously while
achieving a better utility-fairness trade-off.

12.2.2 In-processing Methods

In-processing methods enforce fairness by imposing constraints or regularizers dur-


ing the training procedure (Mozannar et al. 2020; Baharlouei et al. 2020; Yurochkin
and Sun 2021; Zafar et al. 2017; Martinez et al. 2020; Agarwal et al. 2018; Kamishima
et al. 2012; Menon and Williamson 2018; Kearns et al. 2018). We present an example
to illustrate how these methods work (Mozannar et al. 2020).
Let H be the set of all possible distributions over the hypothesis space H, where
each hypothesis depends only on X . Let γ y,a (Ỹ ) be P(Ỹ = 1|Y = y, A = a). Then,
the goal is to learn a predictor that approximates the optimal nondiscriminatory
distribution:
266 12 Algorithmic Fairness

Y ∗ = arg min P(Q(X ) = Y )


Q∈H

s.t. γ y,a (Q) = γ y,0 (Q), ∀y ∈ {0, 1}, a ∈ A.

Several works have explored methods of finding a near-optimal nondiscriminatory


solution. Mozannar et al. (2020) proposed a two-step method for learning a near-
optimal predictor. They divided the whole dataset into two equal parts, S1 and S2 .
First, they trained a predictor Ŷ , defined as follows:

Ŷ = arg min PS1 (Q(X ) = Y )


Q∈H
S1 S1
s.t. |γ y,a (Q) − γ y,0 (Q)| ≤ αn , ∀y ∈ {0, 1}, a ∈ A.

To solve this optimization problem, they used the Lagrange method to minimize the
following loss function with Lagrange multipliers λ ∈ R+K
:
 S1
S1
L(Q, λ) = PS1 (Q(X ) = Y ) + λk (|γ y,a (Q) − γ y,0 (Q)| − αn ).

Thus, they obtained a predictor Ŷ that depended only on X . To extend this to a


predictor Ỹ that would also depend on A, based on the given predictor Ŷ , they found
Ỹ by punishing the gap  E O P . Let p y,a be P(Ỹ = 1|Ŷ = y, A = a). Then, they
found the transition probability p by solving the following problem:
 
p∗ = arg min PS2 (Ŷ = y, A = a, Y = 0) − PS2 (Ŷ = y, A = a, Y = 1) p y,a
p
s.t. | p0,a PS2 (Ŷ = 0|A = a, Y = y) + p1,a PS2 (Ŷ = 1|A = a, Y = y)
− p0,0 PS2 (Ŷ = 0|A = 0, Y = y) − p1,0 PS2 (Ŷ = 1|A = 0, Y = y)| ≤ α̃n , ∀y, a.

The transition probability p learned in this way is the optimal solution to make Ỹ
closer to being nondiscrinminatory. Based on these two steps, the authors obtained
a both near-optimal and near-nondiscrinminatory predictor Ỹ .
Moreover, Zafar et al. (2017) studied binary classification and proposed a fair
learning algorithm by restricting the covariance among the target sensitive attributes
and restricting the distance between the data features and the decision boundary.
Baharlouei et al. (2020) employed Rényi correlation as a regularizer and proposed a
min-max optimization algorithm that can reduce arbitrary dependence between the
predictions of a model and the sensitive attributes of the input data. Yurochkin and
Sun (2021) further defined the concept of distributional individual fairness and then
designed an approximation algorithm to enforce such a fairness restriction by means
of regularization.
References 267

12.2.3 Postprocessing Methods

Postprocessing methods aim to refine the outputs of learning systems to remove


potential discrimination (Woodworth et al. 2017; Kim et al. 2019; Hardt et al. 2016;
Fish et al. 2016; Lahoti et al. 2020; Dwork et al. 2018). Fish et al. (2016) proposed the
shifted decision boundary algorithm based on predicted confidences; this algorithm
finds the shift for the data that best balances fairness and decision accuracy and shifts
the data accordingly. Woodworth et al. (2017) argued that the post hoc correction
method proposed by Hardt et al. (2016) to achieve fair learning through postpro-
cessing is not optimal and may fail to reduce bias in some situations. To address
this shortcoming, the authors further suggested a two-step learning framework that
employs a second-moment fairness constraint. Based on the concept of multiaccu-
racy, which is defined as a reweighting of the prediction accuracy of a model, Kim
et al. (2019) proposed the Multiaccuracy Boost algorithm to mitigate the decision
bias of any black-box learning system.

References

Agarwal, Alekh, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. 2018.
A reductions approach to fair classification. arXiv preprint arXiv:1803.02453.
Baharlouei, Sina, Maher Nouiehed, Ahmad Beirami, and Meisam Razaviyayn. 2020. Rényi fair
inference. In International Conference on Learning Representations.
Calders, Toon, Faisal Kamiran, and Mykola Pechenizkiy. 2009. Building classifiers with indepen-
dency constraints. In IEEE International Conference on Data Mining Workshops, 13–18.
Calmon, Flavio, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and
Kush R Varshney. 2017. Optimized pre-processing for discrimination prevention. Advances in
Neural Information Processing Systems 30: 3992–4001.
Choi, Kristy, Aditya Grover, Trisha Singh, Rui Shu, and Stefano Ermon. 2020. Fair generative
modeling via weak supervision. In International Conference on Machine Learning, 1887–1898.
Dwork, Cynthia, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness
through awareness. In Innovations in Theoretical Computer Science Conference, 214–226.
Dwork, Cynthia, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson. 2018. Decoupled
classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability
and Transparency, 119–133.
Feldman, Michael, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubra-
manian. 2014. Certifying and removing disparate impact.
Fish, Benjamin, Jeremy Kun, and Ádám D Lelkes. 2016. A confidence-based approach for balancing
fairness and accuracy. In 2016 SIAM International Conference on Data Mining, 144–152. SIAM.
Hardt, Moritz, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. In
Advances in Neural Information Processing Systems, 3315–3323.
Kamiran, Faisal, and Toon Calders. 2012. Data preprocessing techniques for classification without
discrimination. Knowledge and Information Systems 33 (1): 1–33.
Kamishima, Toshihiro, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. 2012. Fairness-aware clas-
sifier with prejudice remover regularizer. In Joint European Conference on Machine Learning
and Knowledge Discovery in Databases, 35–50.
Kearns, Michael, Aaron Roth, and Saeed Sharifi-Malvajerdi. 2019. Average individual fairness:
Algorithms, generalization and experiments. In Advances in Neural Information Processing
Systems, vol. 32.
268 12 Algorithmic Fairness

Kearns, Michael, Aaron Roth, and Zhiwei Steven Wu. 2017. Meritocratic fairness for cross-
population selection. In International Conference on Machine Learning, 1828–1836.
Kearns, Michael, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. 2018. Preventing fairness gerry-
mandering: Auditing and learning for subgroup fairness. In International Conference on Machine
learning, 2564–2572.
Kim, Michael P, Amirata Ghorbani, and James Zou. 2019. Multiaccuracy: Black-box post-
processing for fairness in classification. In 2019 AAAI/ACM Conference on AI, Ethics, and Society,
247–254.
Kusner, Matt J, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. In
Advances in Neural Information Processing Systems, vol. 30.
Lahoti, Preethi, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang,
and Ed Chi. 2020. Fairness without demographics through adversarially reweighted learning. In
Advances in Neural Information Processing Systems, vol. 33, 728–740.
Louizos, Christos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. 2015. The variational
fair autoencoder. arXiv preprint arXiv:1511.00830.
Martinez, Natalia, Martin Bertran, and Guillermo Sapiro. 2020. Minimax Pareto fairness: A multi
objective perspective. In International Conference on Machine Learning, 6755–6764.
Menon, Aditya Krishna, and Robert C Williamson. 2018. The cost of fairness in binary classification.
In Conference on Fairness, Accountability and Transparency, 107–118.
Mozannar, Hussein, Mesrob Ohannessian, and Nathan Srebro. 2020. Fair learning with private
demographic data. In International Conference on Machine Learning, 7066–7075.
Nesterov, Yurii E. 1983. A method for solving the convex programming problem with convergence
rate o (1/k∧ 2). In Dokl. Akad. Nauk Sssr, vol. 269, 543–547.
Woodworth, Blake, Suriya Gunasekar, Mesrob I Ohannessian, and Nathan Srebro. 2017. Learning
non-discriminatory predictors. arXiv preprint arXiv:1702.06081.
Yurochkin, Mikhail, and Yuekai Sun. 2021. SenSeI: Sensitive set invariance for enforcing individual
fairness. In International Conference on Learning Representations.
Zafar, Muhammad Bilal, Isabel Valera, Manuel Gomez Rogriguez, and Krishna P Gummadi. 2017.
Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics,
962–970.
Zemel, Rich, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. 2013. Learning fair
representations. In International Conference on Machine Learning, 325–333.
Zhao, Han, Amanda Coston, Tameem Adel, and Geoffrey J. Gordon. 2020. Conditional learning of
fair representations. In International Conference on Learning Representations.
Chapter 13
Adversarial Robustness

Szegedy et al. (2014) discovered that neural networks are vulnerable to adversarial
examples, which are true examples perturbed with small artificial noise to fake the
classifiers. Since that finding was reported, many methods have been proposed to
study attacks on neural network using adversarial examples and defences against
such adversarial attacks. This chapter discusses the theory of adversarial robustness
and its relation to generalizability and privacy preservation.

13.1 Adversarial Attacks and Defences

Generalizability in the presence of adversarial examples. Schmidt et al. (2018)


first studied the generalizability in the problems with adversarial examples. They
propose new notion of adversarial robustness generalization to measure the gen-
eralizability of neural networks trained on data including adversarial examples. In
contrast, the generalizability of neural networks trained without adversarial exam-
ples can be evaluated with the standard generalization measure. They proved that
to ensure the same generalizability, adversarial robustness generalization requires a
larger sample size when the data are drawn from certain specific distributions and
the adversarial examples are generated under the l∞ norm.
Cullina et al. (2018) extended the Probably Approximately Correct (PAC) learn-
ing framework to that with adversarial examples. The authors proposed corrupted
hypothesis classes, which are defined as the hypothesis classes for a classification
problem with adversaries. The Vapnik-Chervonenkis (VC) dimension of a corrupted
hypothesis class is then defined as the adversarial VC dimension, while the standard
VC dimension is defined for clean datasets. They showed that the adversarial VC
dimension can be either larger or smaller than the standard VC dimension. Mon-
tasser et al. (2019) further defined agnostic robust PAC learnability and realizable

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025 269
F. He and D. Tao, Foundations of Deep Learning, Machine Learning: Foundations,
Methodologies, and Applications, https://doi.org/10.1007/978-981-16-8233-9_13
270 13 Adversarial Robustness

robust PAC learnability. When a learning algorithm meets the requirements for both
agnostic robust PAC learnability and realizable robust PAC learnability, the learning
rule is defined as improper. They then showed that under an improper learning rule,
any hypothesis space H is robustly PAC learnable if its VC dimension is finite.
Generalizability of adversarial training. Among the proposed defence
approaches, adversarial training (Dai et al. 2018; Li et al. 2018a; Baluja and Fischer
2018; Zheng et al. 2019) have promising performance on improving the adversarial
robustness of deep neural networks against adversarial examples. Mathematically,
adversarial training can be formulated as solving the following minimax loss problem
(Tu et al. 2019; Yin et al. 2019; Khim and Loh 2018):

1 
m
min max l(h θ (xi ), yi ),
θ m i=1 xi −xi ≤ρ

where h θ is the hypothesis parameterized by θ , m is the training sample size, xi is a


feature, yi is the corresponding label, and l is the loss function. Intuitively, adversarial
training optimizes a neural network on the data with adversarial examples.
As for such minimax loss problem, several generalization bounds have been pro-
posed (Yin et al. 2019; Tu et al. 2019; Khim and Loh 2018). All these results suggest
that adversarial training will comprise generalizability while improving adversarial
robustness.
Yin et al. (2019) studied how adversarial training influences generalizability by
leveraging the techniques developed by Bartlett et al. (2017) and Raghunathan et al.
(2018). They derived a tight upper bound on the Rademacher complexity and obtained
a generalization bound. However, this generalization bound is only applicable to
neural networks with one hidden layer and rectified linear unit (ReLU) activation
functions.
To address this drawback, Khim and Loh (2018) proved a generalization bound
via function transformation. An upper bound for the Rademacher complexity is then
proved based on the techniques developed by Golowich et al. (2018).
Tu et al. (2019) proposed an innovative pathway for assessing the generalizability
of machine learning models, comprising the following three steps. (1) We demon-
strated the feasibility of eliminating the maximization operation from the objective
function within the minimax framework. This was achieved through a reparameteri-
zation mapping that minimally altered the parameter distribution under the Wasser-
stein distribution. (2) We established a local worst-case risk bound for the resulting
reparameterized minimization problem, providing valuable insights into the model’s
performance. (3) We derived a comprehensive generalization bound based on the
local worst-case risk, offering a broader understanding of the model’s predictive
ability. Notably, this generalization bound extends its applicability in two perspec-
tives: (1) it covers adversarial examples generated by bounded adversaries under the
lq norm, accommodating various threat levels; and (2) the method is highly versatile,
applicable to multiclass classification tasks and compatible with a wide range of
practical loss functions, including the hinge loss and ramp loss.
13.2 Interplay Between Adversarial Robustness, Privacy … 271

In contrast, Bhagoji et al. (2019) provided valuable insights by establishing a


lower bound for the generalization error of classifiers when considering adversarial
examples. Meanwhile, Min et al. (2020) delved deeper into the dynamics between
adversarial training performance and training sample size. Their study categorized
all learning problems into three distinct regimes, shedding light on various scenarios:
(1) In the weak adversary regime, enlarging the training sample size corresponds to
an improvement in generalizability. (2) Within the medium adversary regime, an
intriguing phenomenon emerged, where the generalization error followed a double-
descent curve with increasing training sample size, suggesting a complex interplay
between model complexity and adversarial robustness. (3) In contrast, within the
strong adversary regime, increasing the training sample size actually led to a rise in
the generalization error, revealing the limitations imposed by formidable adversaries.
By and large, these regimes, categorized based on the adversaries’ strength, provide
a nuanced understanding of model behavior. Additionally, Chen et al. (2020) cor-
roborated similar observations akin to those found in the medium adversary regime,
further reinforcing the significance of these findings.
Several recent theoretical analyses have significantly enriched our understand-
ing of machine learning models’ generalization and robustness. Attias et al.
(2019) advanced the field by deriving an improved generalization bound through
Rademacher complexity. Meanwhile, Zhang et al. (2019) conducted a thorough
exploration into the delicate trade-offs between accuracy and robustness, shedding
light on the challenges posed by adversarial settings. Additionally, Pydi and Jog
(2020) introduced a generalization bound framework based on optimal transport
and optimal couplings, offering a new perspective on the model’s generalization
abilities under different learning scenarios. These theoretical analyses collectively
contribute to a more comprehensive understanding of the generalization, robustness,
and convergence underlying machine learning algorithms.

13.2 Interplay Between Adversarial Robustness, Privacy


Preservation, and Generalizability

Adversarial training (Dai et al. 2018; Li et al. 2018a; Baluja and Fischer 2018; Zheng
et al. 2019) has promising performance on improving the adversarial robustness of
deep neural networks against adversarial examples (Biggio et al. 2013; Szegedy et al.
2013; Goodfellow et al. 2014; Papernot et al. 2016). Specifically, adversarial training
can be formulated as solving the following minimax loss problem:

$$\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \max_{\|x_i' - x_i\| \le \rho} l\big(h_\theta(x_i'), y_i\big),$$

where h_θ is the hypothesis parameterized by θ, m is the number of training samples, x_i is a feature, x_i′ is its perturbed version within the ball of radius ρ, y_i is the corresponding label, and l is the loss function. Intuitively,
adversarial training optimizes a neural network in accordance with its performance
on worst-case examples, which are most likely to be adversarial examples.
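To make the minimax formulation above concrete, the following is a minimal sketch of one adversarial training step in PyTorch, using projected gradient descent (PGD) to approximate the inner maximization. The model, loss function, optimizer, and the radius/step-size values are illustrative assumptions rather than the exact configuration used in the experiments reported later in this chapter.

```python
import torch

def pgd_attack(model, loss_fn, x, y, rho=8/255, step_size=2/255, steps=10):
    """Approximate the inner maximization: search for a perturbed input x'
    within an L-infinity ball of radius rho around x that increases the loss."""
    x_adv = x.clone().detach() + torch.empty_like(x).uniform_(-rho, rho)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()      # gradient ascent on the loss
            x_adv = x + (x_adv - x).clamp(-rho, rho)     # project back into the ball
    return x_adv.detach()

def adversarial_training_step(model, loss_fn, optimizer, x, y, rho=8/255):
    """One stochastic step of the outer minimization, taken on worst-case examples."""
    x_adv = pgd_attack(model, loss_fn, x, y, rho=rho)
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, the step size, the number of inner PGD steps, and the norm (L∞ in this sketch, L2 is also common) are tuned jointly with the radius ρ.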
This section studies how adversarial training influences the privacy-preserving
ability (Dwork and Mulligan 2013; Dwork and Roth 2014) and generalizability
(Vapnik 2013; Mohri et al. 2018) of neural networks, both of which are considerably
important in machine learning. Based on both theoretical and empirical evidence,
we prove the following:
The minimax-based approach can degrade a neural network’s privacy preservation and
generalization abilities while enhancing its adversarial robustness.

The first question to be addressed is how to measure adversarial robustness.


Adversarial accuracy, which is defined as the accuracy on adversarial examples, is
the most straightforward measure for this purpose. However, it might be difficult to
develop theoretical foundations based on this measure. The radius ρ is also straight-
forward; however, it has been observed that a larger ρ does not necessarily imply a
higher adversarial accuracy, and a single adversarial accuracy value might correspond
to multiple ρ values.
Addressing the pressing challenge of securing machine learning algorithms against adversarial attacks, we introduce a new metric called the “robustified intensity”. This
metric serves as a yardstick for assessing how effectively a learning algorithm can
withstand such attacks. It measures the disparity in gradient norms that arise from
adversarial training, providing a tangible measure of robustness.
Moreover, to bridge theory with practical application, we develop an empirical
estimator termed the “empirical robustified intensity”. This estimator offers a prag-
matic approach to quantifying robustness in real-world scenarios. Through rigorous
analysis, we demonstrate that the empirical robustified intensity asymptotically con-
verges to its theoretical counterpart as the sample size grows, ensuring its consistency
and applicability.
Moving beyond theoretical foundations, we conduct a comprehensive empirical
study to validate the utility of the robustified intensity. Our findings reveal a direct and
positive correlation between the robustified intensity and adversarial accuracy. This
empirical evidence validates the robustified intensity as not just a theoretical con-
struct but a practical and informative metric for evaluating the resilience of learning
algorithms against adversarial attacks.
Exploring the intricate dynamics between privacy and robustness in neural net-
works, we shift our focus to adversarial training methodologies. Unlike conventional
approaches that aim for the optimal performance across all training examples, adver-
sarial training takes a more cautious route. Specifically, neural networks are trained
to withstand worst-case scenarios, where adversaries attempt to exploit vulnerabili-
ties by perturbing input data. This strategic adaptation forces the model to rely more
heavily on a small subset of training examples, heightening susceptibility to differ-
ential attacks. These attacks involve replacing a training example with a perturbed
version and then using the model’s output to infer other training examples.
Our investigation extends beyond practical implications to theoretical underpin-
nings. We establish the privacy-preserving nature of adversarial training, demonstrat-
ing that it achieves (ε, δ)-differential privacy when employing stochastic gradient
descent (SGD) to minimize the minimax loss function. Furthermore, we uncover an
intriguing positive correlation between the magnitudes of ε and δ and the robustified
intensity. This relationship marks a significant step forward in understanding the
delicate balance between privacy guarantees and model robustness.
Building upon the concept of privacy preservation, we unveil insights into the
generalization performance of adversarial training methods. Our analysis yields two key generalization bounds tailored for adversarial scenarios: an O(√(log m)/m) on-average generalization bound and an O(1/√m) high-probability bound, where m
denotes the size of the training sample. These two bounds are derived from a new
theorem that establishes a profound link between algorithmic stability and differential
privacy.
Importantly, our generalization bounds offer a distinct advantage—they are not
contingent on the number of model parameters. This is particularly valuable in the
context of deep learning, where models can exhibit substantial complexity. Remark-
ably, the terms governing these bounds, except for the gradient norm, exhibit no
explicit dependence on model size. Through empirical studies, we confirm that the
gradient norm remains consistently small.
Our experimental study involves a meticulous examination conducted across
three widely-used datasets: CIFAR-10, CIFAR-100 (Krizhevsky and Hinton 2009),
and Tiny ImageNet (Le and Yang 2015). We carefully controlled unrelated vari-
ables, ensuring the reproducibility of our results. Throughout our experiments, we
systematically vary training settings to analyze their impact on model performance.
We focus on several key metrics, including generalization gaps, membership infer-
ence attack accuracies, and empirical robustified intensities, to assess the efficacy
of different training configurations. Remarkably, our empirical findings consistently
support our hypotheses.

13.2.1 Measurement of Robustness

The most straightforward measure of adversarial robustness is the adversarial accu-


racy, which is the accuracy on adversarial examples. However, it might be difficult
to establish theoretical foundations concerning the relationships between adversarial
accuracy, privacy preservation, and generalization based on this measure. Another
natural choice is the radius ρ. However, our experiments show that there is no
one-to-one mapping from ρ to the adversarial accuracy (see Fig. 13.1a).
This section proposes a new metric, the robustified intensity, and its asymptoti-
cally consistent empirical estimator, the empirical robustified intensity, to measure
adversarial robustness.
Fig. 13.1 The first three subfigures are plots of a adversarial accuracy versus radius, b robustified
intensity versus radius, and c adversarial accuracy versus robustified intensity. The green and blue
curves correspond to CIFAR-10 and CIFAR-100, respectively. Subfigure d shows a histogram of 8,000,000 gradient noise instances on Tiny ImageNet versus the probability density function of Lap(0, 0.48)

13.2.1.1 Robustified Intensity

In conventional methods, the empirical risk minimization (ERM) problem is solved


to approach the optimal hypothesis, as follows:

$$\min_{\theta} \hat{R}_S(\theta) = \min_{\theta} \frac{1}{m} \sum_{i=1}^{m} l\big(h_\theta(x_i), y_i\big).$$

Stochastic-gradient-based optimizers are then employed for ERM in deep learn-


ing, including SGD (Robbins and Monro 1951), momentum (Nesterov 1983; Tseng
1998), and Adam (Kingma and Ba 2015). For brevity, we consider SGD in this
analysis. The analysis for other stochastic-gradient-based optimizers is similar.
Suppose that B is a minibatch with a batch size of τ . Then, the stochastic gradient
on B is as follows:
$$\hat{g}^{\mathrm{ERM}}(\theta) = \frac{1}{\tau} \sum_{(x_i, y_i) \in B} \nabla_\theta\, l\big(h_\theta(x_i), y_i\big).$$

In iteration t, the weights are updated as follows:

$$\theta_{t+1} = \theta_t - \eta_t\, \hat{g}^{\mathrm{ERM}}(\theta_t), \qquad (13.1)$$

where θt is the weight vector in the t-th iteration and ηt is the corresponding learning
rate.
In adversarial training, SGD is employed to solve the following minimax loss
problem:
$$\min_{\theta} \hat{R}_S^A(\theta) = \min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \max_{\|x_i' - x_i\| \le \rho} l\big(h_\theta(x_i'), y_i\big), \qquad (13.2)$$

where ρ is the radius of a ball centred at example (xi , yi ). Here, we call R̂ SA (θ ) the
adversarial empirical risk. Correspondingly, the stochastic gradient on a minibatch
B and the weight update are calculated as follows:
$$\hat{g}^{A}(\theta) = \frac{1}{|B|} \sum_{(x_i, y_i) \in B} \nabla_\theta \max_{\|x_i' - x_i\| \le \rho} l\big(h_\theta(x_i'), y_i\big), \qquad \theta_{t+1} = \theta_t - \eta_t\, \hat{g}^{A}(\theta_t). \qquad (13.3)$$

We then define the robustified intensity based on the gradient norms as follows.
Definition 13.1 (Robustified intensity) For adversarial training (Eq. 13.2), the
robustified intensity is defined as
$$I = \frac{\max_{\theta, x, y} \big\|\nabla_\theta \max_{\|x' - x\| \le \rho} l\big(h_\theta(x'), y\big)\big\|}{\max_{\theta, x, y} \big\|\nabla_\theta\, l\big(h_\theta(x), y\big)\big\|}, \qquad (13.4)$$

where ‖·‖ is a norm defined in the space of the gradient.


For brevity, we denote the numerator and denominator by L_A and L_ERM, respectively; i.e., I = L_A / L_ERM. Specifically, when ρ = 0, we have L_A = L_ERM and thus I = 1. In our experiments, we usually observe I ≥ 1 (see Fig. 13.1b).

13.2.1.2 How can the Robustified Intensity be Estimated?

Conducting a rigorous search for the exact maximal gradient norm, for either adversarial training or ERM, over all hypotheses and examples (including every ball of radius ρ in the Euclidean space) is technically impossible.
Therefore, we define an empirical robustified intensity for practical utilization to
empirically estimate the robustified intensity, as follows.
Definition 13.2 (Empirical robustified intensity) For adversarial training (Eq. 13.2),
the empirical robustified intensity is defined as
$$\hat{I} = \frac{\max_{(x_i, y_i) \in B,\, \theta} \big\|\nabla_\theta \max_{\|x_i' - x_i\| \le \rho} l\big(h_\theta(x_i'), y_i\big)\big\|}{\max_{(x_i, y_i) \in B,\, \theta} \big\|\nabla_\theta\, l\big(h_\theta(x_i), y_i\big)\big\|}, \qquad (13.5)$$

where ‖·‖ is a norm defined in the space of the gradient.


To calculate the empirical robustified intensity, the maximal values of the gradient
norm for either adversarial training or ERM in the robustified intensity are replaced
by the maximal gradient norm in all minibatches.
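As a rough illustration, the sketch below estimates Î by scanning minibatches and recording the largest gradient norm under the adversarial loss and under the plain ERM loss. The maximization over θ (for example, over checkpoints visited during training) is left to the caller, and `attack` denotes an inner-maximization routine such as the PGD sketch given earlier, so the exact protocol used in the experiments may differ.

```python
import torch

def max_grad_norm(model, loss_fn, loader, attack=None):
    """Largest minibatch gradient norm for the current parameters; if an attack
    is supplied, the gradient of the (approximate) adversarial loss is used."""
    worst = 0.0
    for x, y in loader:
        if attack is not None:
            x = attack(model, loss_fn, x, y)      # approximate the inner maximization
        model.zero_grad()
        loss_fn(model(x), y).backward()
        sq = sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None)
        worst = max(worst, sq.sqrt().item())
    return worst

def empirical_robustified_intensity(model, loss_fn, loader, attack):
    """Ratio of the maximal adversarial gradient norm to the maximal ERM gradient norm."""
    l_adv = max_grad_norm(model, loss_fn, loader, attack=attack)
    l_erm = max_grad_norm(model, loss_fn, loader, attack=None)
    return l_adv / l_erm
```

Scanning several checkpoints θ and taking the maximum in both the numerator and the denominator brings the estimate closer to Definition 13.2.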
We can now prove the following theorem.
Theorem 13.1 (Asymptotic consistency of the empirical robustified intensity) Suppose that the empirical robustified intensity is Î_m when the number of training samples is m. The empirical robustified intensity is an asymptotically consistent estimator of the robustified intensity, i.e., lim_{m→∞} Î_m = I.

This theorem ensures that when the number of training samples is sufficiently large, the empirical robustified intensity converges to the robustified intensity.
Theorem 13.1 requires only one mild assumption, as shown below.
Assumption 13.2 The gradient of the loss function satisfies ∇_θ l(h_θ(x), y) ∈ C⁰(Z); i.e., for any hypothesis h_θ ∈ H, ∇_θ l(h_θ(x), y) is continuous with respect to z.

To prove Theorem 13.1, we first recall additional preliminaries that are necessary
for the proofs.
Suppose that every example z is sampled in an independent and identically dis-
tributed (i.i.d.) manner from the data distribution D, i.e., z ∼ D. Thus, the training
sample set satisfies S ∼ Dm , where m is the number of training samples.
To prove our theorem, we establish the following definitions.

Definition 13.3 (Ball and sphere) The ball of radius r > 0 in terms of norm  ·  in
space H centred at point x ∈ H is defined as

$$B_r(h) = \{x : \|x - h\| \le r\}.$$

The sphere ∂ Br (h) corresponding to ball Br (h) is defined as

$$\partial B_r(h) = \{x : \|x - h\| = r\}.$$

Definition 13.4 (Complementary set) For a subset A ⊂ H of a space H, its


complementary set Ac is defined as follows:

$$A^c = \{h : h \in H,\ h \notin A\}.$$

Proof of Theorem 13.1 We need only to prove that
$$\lim_{m \to \infty} \max_{\theta,\, (x_i, y_i) \in S} \Big\|\nabla_\theta \max_{\|x_i' - x_i\| \le \rho} l\big(h_\theta(x_i'), y_i\big)\Big\| = \max_{\theta, x, y} \Big\|\nabla_\theta \max_{\|x' - x\| \le \rho} l\big(h_\theta(x'), y\big)\Big\| \qquad (13.6)$$
almost surely and

$$\lim_{m \to \infty} \max_{\theta,\, (x_i, y_i) \in S} \big\|\nabla_\theta\, l\big(h_\theta(x_i), y_i\big)\big\| = \max_{\theta, x, y} \big\|\nabla_\theta\, l\big(h_\theta(x), y\big)\big\| \qquad (13.7)$$

almost surely.
We first prove that, for any positive real ρ > 0,
$$g(\theta, z) = \Big\|\nabla_\theta \max_{x' \in B_x(\rho)} l\big(h_\theta(x'), y\big)\Big\|$$
is a continuous function with respect to z = (x, y), where B_x(ρ) = {x′ : ‖x − x′‖ ≤ ρ} is a ball centred at x with a radius of ρ.
By fixing y ∈ Y, we define T_y(x) = arg max_{x′ ∈ B_x(ρ)} l(h_θ(x′), y) as a mapping from X to X. We will prove by reduction to absurdity that T_y(x) is continuous with
respect to (x, y). Suppose that there exist a sequence
$$\{z_i = (x_i, y_i)\}_{i=1}^{\infty}, \qquad \lim_{i \to \infty} z_i = z_0,$$
and a constant ε_0 > 0 such that
$$\big\|T_{y_i}(x_i) - T_{y_0}(x_0)\big\| \ge \varepsilon_0.$$


Since {T_{y_i}(x_i)}_{i=1}^∞ is a bounded set, there exists an increasing subsequence {k_i}_{i=1}^∞ ⊆ ℤ⁺ such that {T_{y_{k_i}}(x_{k_i})}_{i=1}^∞ converges to some point T_∞. Then, we have
$$T_\infty \in \bigcap_{i=1}^{\infty} B_{x_{k_i}}(\rho) \subset B_{x_0}(\rho).$$

Furthermore, for any ε > 0, there exists a δ > 0 such that for any x ∈ B_{T_{y_0}(x_0)}(δ), we have l(h_θ(x), y_0) ≥ l(h_θ(T_{y_0}(x_0)), y_0) − ε. In the case that T_{y_0}(x_0) ∈ ∂B_{x_0}(ρ) and T_{y_0}(x_0) ∉ ∩_{i=1}^∞ B_{x_{k_i}}(ρ), let x′ ∈ B_{T_{y_0}(x_0)}(δ) be an inner point of B_{x_0}(ρ). When i is sufficiently large, we have x′ ∈ B_{x_{k_i}}(ρ), which yields
$$l\big(h_\theta(x'), y_{k_i}\big) \le l\big(h_\theta(T_{y_{k_i}}(x_{k_i})), y_{k_i}\big).$$
Letting i approach ∞, we have
$$l\big(h_\theta(T_{y_0}(x_0)), y_0\big) - \varepsilon \le l\big(h_\theta(x'), y_0\big) \le l\big(h_\theta(T_\infty), y_0\big).$$
Since ε is arbitrarily selected, we then have
$$l\big(h_\theta(T_{y_0}(x_0)), y_0\big) \le l\big(h_\theta(T_\infty), y_0\big) \le l\big(h_\theta(T_{y_0}(x_0)), y_0\big).$$

Therefore, T_∞ = T_{y_0}(x_0), which leads to a contradiction since ‖T_{y_i}(x_i) − T_{y_0}(x_0)‖ ≥ ε_0. Hence, T_y(x) is continuous with respect to (x, y).
Since g(θ, z) can be rewritten as
$$g(\theta, z) = \Big\|\nabla_\theta \max_{x' \in B_x(\rho)} l\big(h_\theta(x'), y\big)\Big\| = \big\|\nabla_\theta\, l\big(h_\theta(T_y(x)), y\big)\big\|,$$

from Assumption 13.2, we know that g(θ, z) is continuous with respect to z.


We can now prove Eqs. (13.6) and (13.7). For Eq. (13.6), there exist a θ0 and a
z_0 = (x_0, y_0) such that
$$g(\theta_0, z_0) = \Big\|\nabla_\theta \max_{\|x' - x_0\| \le \rho} l\big(h_{\theta_0}(x'), y_0\big)\Big\| = \max_{\theta, x, y} \Big\|\nabla_\theta \max_{\|x' - x\| \le \rho} l\big(h_\theta(x'), y\big)\Big\|.$$

For any ε > 0, since g(θ0 , z) is continuous with respect to z, there exists a δ > 0
such that for any (x  , y  ) ∈ B(x0 ,y0 ) (δ),

g(θ0 , (x  , y  )) ≥ g(θ0 , (x0 , y0 )) − ε.


Therefore,
$$\{(x, y) : g(\theta_0, (x, y)) < g(\theta_0, (x_0, y_0)) - \varepsilon\} \subset \big(B_{(x_0, y_0)}(\delta)\big)^c,$$
and we have
$$P_{S \sim \mathcal{D}^m}\Big[\max_{\theta,\, z \in S} g(\theta, z) \le \max_{\theta, z} g(\theta, z) - \varepsilon\Big] \le P_{S \sim \mathcal{D}^m}\Big[\max_{z \in S} g(\theta_0, z) \le g(\theta_0, z_0) - \varepsilon\Big] \le P_{S \sim \mathcal{D}^m}\Big[S \cap B_{(x_0, y_0)}(\delta) = \emptyset\Big] = \Big(1 - P_{z \sim \mathcal{D}}\big[z \in B_{(x_0, y_0)}(\delta)\big]\Big)^m.$$

As m → ∞, we have
$$\lim_{m \to \infty} P_{S \sim \mathcal{D}^m}\Big[\max_{\theta,\, z \in S} g(\theta, z) \le \max_{\theta, z} g(\theta, z) - \varepsilon\Big] = 0.$$
Since ε is arbitrarily selected, we have
$$\lim_{m \to \infty} P_{S \sim \mathcal{D}^m}\Big[\max_{\theta,\, z \in S} g(\theta, z) < \max_{\theta, z} g(\theta, z)\Big] = 0,$$

which proves Eq. (13.6).


By replacing g(θ, z) = ‖∇_θ l(h_θ(x), y)‖, we can prove Eq. (13.7) via the same
procedure.
The proof is complete.

13.2.1.3 Is the Robustified Intensity Informative?

We present a comprehensive empirical study conducted to compare the robustified


intensity, radius ρ, and adversarial accuracy on CIFAR-10 and CIFAR-100.
In Fig. 13.1, we present three plots, namely, the adversarial accuracy versus the
radius ρ, the robustified intensity versus the radius ρ, and the adversarial accuracy
versus the robustified intensity. From these plots, we can draw three major observa-
tions: (1) The adversarial test accuracy has a clear positive correlation with the radius
ρ when ρ is smaller than 5 on CIFAR-10 or smaller than 9 on CIFAR-100; after those
points, no significant correlation between the adversarial test accuracy and radius is
observed. This observation suggests that the radius ρ is not always an informative
choice for measuring the robustness. (2) A bell curve shape is observed in the plot
of the robustified intensity I versus the radius ρ, which suggests that both maximal
and minimal values of the robustified intensity I are achieved, i.e., the full domain
of the potential robustified intensity I is discovered. (3) A clear positive correla-
tion is observed between the adversarial test accuracy and the robustified intensity I
throughout the full interval of I , which suggests that there is a one-to-one mapping
from robustified intensity to adversarial accuracy. These three observations verify
that the robustified intensity I is an informative measure of robustness.

13.2.2 Privacy–Robustness Trade-Off

This section studies the relationship between privacy preservation and robustness in
adversarial training. We prove that adversarial training is (ε, δ)-differentially private.

13.2.2.1 What Is the Distribution of the Gradient Noise?

In stochastic-gradient-based optimization, noise is an inherent factor. This leads us


to ponder: what exactly characterizes the distribution of this noise? Past research
efforts by Kushner and Yin (2003), Ljung et al. (2012), Mandt et al. (2017) have
leaned towards assuming a Gaussian distribution for the gradient noise. However,
in our quest for deeper understanding, we embark on a large-scale experiment to
scrutinize this assumption more closely.
Our investigation takes us to the Tiny ImageNet dataset, where we conduct exten-
sive experiments. The results, as depicted in Fig. 13.1d, reveal an intriguing insight:
the gradient noise seems to align more closely with a Laplacian distribution. This
pivotal finding prompts us to adopt a new foundational assumption moving forward,
one that acknowledges the Laplacian nature of the gradient noise distribution.
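The following is a hedged sketch of how such a check can be run: minibatch gradients are collected at a fixed parameter vector, their deviation from the mean gradient is treated as the noise, and zero-centred Laplace and Gaussian fits are compared by average log-likelihood. The model, loss function, and data loader are placeholder assumptions, and this is not the exact protocol behind Fig. 13.1d.

```python
import numpy as np
import torch

def flat_grad(model, loss_fn, x, y):
    """Flattened gradient of the minibatch loss with respect to all parameters."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])

def gradient_noise(model, loss_fn, loader, num_batches=100, num_coords=100_000):
    """Per-coordinate gradient noise: minibatch gradients minus their mean
    (a proxy for the full-batch gradient), restricted to a random coordinate
    subset so that memory stays manageable for large models."""
    grads, idx = [], None
    for i, (x, y) in enumerate(loader):
        if i >= num_batches:
            break
        g = flat_grad(model, loss_fn, x, y)
        if idx is None:
            idx = torch.randperm(g.numel())[:num_coords]
        grads.append(g[idx])
    grads = torch.stack(grads)
    return (grads - grads.mean(dim=0)).flatten().cpu().numpy()

def laplace_vs_gaussian(noise):
    """Average log-likelihood of zero-centred Laplace and Gaussian fits to the noise."""
    b = np.abs(noise).mean()                 # maximum-likelihood scale of Lap(0, b)
    sigma2 = (noise ** 2).mean()             # maximum-likelihood variance of N(0, sigma^2)
    ll_lap = -np.log(2 * b) - np.abs(noise).mean() / b
    ll_gauss = -0.5 * np.log(2 * np.pi * sigma2) - (noise ** 2).mean() / (2 * sigma2)
    return ll_lap, ll_gauss
```

A higher average log-likelihood for the Laplace fit is the kind of evidence that motivates the assumption below; the same procedure applies to either the ERM gradient or the adversarial gradient.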
Assumption 13.3 The stochastic gradient calculated from a minibatch is drawn from a Laplacian distribution centred at the gradient of the adversarial empirical risk,
$$\frac{1}{\tau} \sum_{(x, y) \in B} \nabla_\theta \max_{\|x' - x\| \le \rho} l\big(h_\theta(x'), y\big) \sim \mathrm{Lap}\big(\nabla_\theta \hat{R}_S^A(\theta),\, b\big).$$

13.2.2.2 Theoretical Evidence

Then, following the Laplacian mechanism, we approximate the differential privacy


of adversarial training as follows.
Theorem 13.4 Suppose that SGD is employed for adversarial training. L E R M is the
maximal gradient norm in ERM. Additionally, suppose that the whole training pro-
cedure consists of T iterations. Then, the adversarial training is (ε, δ)-differentially
private, where
$$\varepsilon = \varepsilon_0 \sqrt{2T \log \frac{m}{\delta'}} + T \varepsilon_0 \big(e^{\varepsilon_0} - 1\big), \qquad \delta = \frac{\delta'}{m},$$
with
$$\varepsilon_0 = \frac{2 L_{\mathrm{ERM}}}{m b}\, I.$$

Here, δ′ is a positive real number, τ is the batch size, I is the robustified intensity, and b is the Laplace parameter.
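To get a feel for Theorem 13.4, the short sketch below plugs illustrative constants into the (ε, δ) formulae. The Laplace parameter b = 0.48 echoes the fit shown in Fig. 13.1d, while the values of m, T, L_ERM, I, and δ′ are assumptions chosen only to show how a larger robustified intensity I inflates the privacy loss.

```python
import math

def adversarial_training_privacy(m, T, b, l_erm, intensity, delta_prime):
    """(eps, delta) of adversarially trained SGD as given by Theorem 13.4."""
    eps0 = 2 * l_erm * intensity / (m * b)          # per-iteration privacy loss
    eps = (eps0 * math.sqrt(2 * T * math.log(m / delta_prime))
           + T * eps0 * (math.exp(eps0) - 1))
    delta = delta_prime / m
    return eps, delta

# Illustrative constants only; these are not values measured in this chapter's experiments.
for intensity in (1.0, 2.0, 4.0):
    eps, delta = adversarial_training_privacy(
        m=50_000, T=20_000, b=0.48, l_erm=1.0, intensity=intensity, delta_prime=1e-3)
    print(f"I = {intensity:3.1f}: eps = {eps:.3f}, delta = {delta:.1e}")
```

Setting the intensity to 1 recovers the ERM guarantee of Corollary 13.1, so the ratio between the resulting ε values approximately reflects I.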
Remark 13.1 The approximation of the differential privacy given by Theorem 13.4 is (O(√(log m)/m), O(1/m)).
To prove Theorem 13.4, we will first prove two lemmas (Lemma 13.1 and Lemma
13.2 below).
In practice, it is easier to obtain high-probability approximations of ε-differential
privacy from concentration inequalities. Lemma 13.1 presents a relationship between
high-probability approximations of ε-differential privacy and approximations of
(ε, δ)-differential privacy. Similar arguments have been used in some related works;
see, for example, the proof of Theorem 3.20 in (Dwork and Roth 2014). Here, we
present a detailed proof to conclude our work in this study.
Lemma 13.1 Suppose that A : Zm → H is a stochastic algorithm, whose output
hypothesis learned on training sample set S is A(S). For any hypothesis h ∈ H, if
the condition
$$\log \frac{P[A(S) = h]}{P[A(S') = h]} \le \varepsilon \qquad (13.8)$$

is satisfied with probability at least 1 − δ over the randomness of A(S), then


algorithm A is (ε, δ)-differentially private.
Proof After rearranging Eq. (13.8), we have that, with probability at least 1 − δ,
$$P[A(S) = h] \le P[A(S') = h]\, e^{\varepsilon}. \qquad (13.9)$$

For brevity, we define an event as follows:
$$B_0 = \left\{ h : \log \frac{P[A(S) = h]}{P[A(S') = h]} \le \varepsilon \right\}.$$

Additionally, we define
B0c = H − B0 .

Apparently, for any subset B ⊆ H,
$$P\big[A(S) \in B_0 \cap B\big] \le P\big[A(S') \in B_0 \cap B\big]\, e^{\varepsilon}, \qquad (13.10)$$
$$P\big[A(S) \in B_0\big] \ge 1 - \delta, \qquad P\big[A(S) \in B_0^c\big] \le \delta. \qquad (13.11)$$

Then, for any subset B ⊆ H, we have
$$P\big[A(S) \in B\big] = P\big[A(S) \in B \cap (B_0 \cup B_0^c)\big] = P\big[A(S) \in B \cap B_0\big] + P\big[A(S) \in B \cap B_0^c\big] \le P\big[A(S) \in B \cap B_0\big] + P\big[A(S) \in B_0^c\big].$$

Combining Eqs. (13.9), (13.10), and (13.11), we obtain
$$P\big[A(S) \in B\big] \le e^{\varepsilon} P\big[A(S') \in B \cap B_0\big] + \delta \le e^{\varepsilon} P\big[A(S') \in B\big] + \delta.$$

Therefore, the stochastic algorithm A is (ε, δ)-differentially private.


The proof is complete. 

Quantifying the level of differential privacy for an iterative algorithm can be a


daunting task. Yet, by dissecting the algorithm into its individual steps, we find that
assessing the differential privacy of each step is more manageable. The following
advanced composition lemma enables us to estimate the overall level of differen-
tial privacy for the entire iterative algorithm. It achieves this by aggregating the
differential privacy of each step.

Lemma 13.2 (Advanced composition; cf. (Dwork and Roth 2014), Theorem 3.20) Suppose that an (ε_0, δ_0)-differentially private process is run repeatedly T times. Then, the whole algorithm is (ε, δ)-differentially private, where
$$\varepsilon = \varepsilon_0 \sqrt{2T \log \frac{1}{\delta'}} + T \varepsilon_0 \big(e^{\varepsilon_0} - 1\big), \qquad \delta = T \delta_0 + \delta',$$
and δ′ is a positive real number.

We now prove Theorem 13.4.

Proof of Theorem 13.4 We assume that the gradients calculated from a randomly
sampled minibatch B with a size of τ are random variables drawn from a Laplace
distribution, as justified previously:

$$\frac{1}{\tau} \sum_{(x, y) \in B} \nabla_\theta \max_{\|x' - x\| \le \rho} l\big(h_\theta(x'), y\big) \sim \mathrm{Lap}\big(\nabla_\theta \hat{R}_S^A(\theta),\, b\big).$$

Correspondingly, the counterpart of this distribution on the training sample set S′ is


as follows:
$$\frac{1}{\tau} \sum_{(x, y) \in B'} \nabla_\theta \max_{\|x' - x\| \le \rho} l\big(h_\theta(x'), y\big) \sim \mathrm{Lap}\big(\nabla_\theta \hat{R}_{S'}^A(\theta),\, b\big),$$

where B′ is uniformly sampled from S′ and is also of size τ.


The output hypothesis is uniquely indexed by the weights. Specifically, we denote the weights after the t-th iteration by θ_{t+1}. Furthermore, the weight updates Δθ_t = θ_{t+1} − θ_t are uniquely determined by the gradients. Therefore, we can calculate the probability of the gradients to approximate the differential privacy. For any ĝ_t^A,
$$\log \frac{p\big[\mathrm{Lap}(\nabla_\theta \hat{R}_S^A(\theta_t), b) = \hat{g}_t^A\big]}{p\big[\mathrm{Lap}(\nabla_\theta \hat{R}_{S'}^A(\theta_t), b) = \hat{g}_t^A\big]} = \log \frac{\exp\big(-\|\nabla_\theta \hat{R}_S^A(\theta_t) - \hat{g}_t^A\|/b\big)}{\exp\big(-\|\nabla_\theta \hat{R}_{S'}^A(\theta_t) - \hat{g}_t^A\|/b\big)} = \frac{1}{b}\Big(\big\|\nabla_\theta \hat{R}_{S'}^A(\theta_t) - \hat{g}_t^A\big\| - \big\|\nabla_\theta \hat{R}_S^A(\theta_t) - \hat{g}_t^A\big\|\Big). \qquad (13.12)$$

We define
$$L_A = \max_{\theta_t, x, y} \Big\|\nabla_\theta \max_{\|x' - x\| \le \rho} l\big(h_{\theta_t}(x'), y\big)\Big\|$$
and
$$v = \nabla_\theta \hat{R}_{S'}^A(\theta_t) - \nabla_\theta \hat{R}_S^A(\theta_t).$$
Since the training sample sets S and S′ differ by only one pair of examples, we have
$$\|v\| \le \frac{2 L_A}{m}. \qquad (13.13)$$
Combining Eqs. (13.12) and (13.13) with the triangle inequality, we obtain
$$\log \frac{p\big[\mathrm{Lap}(\nabla_\theta \hat{R}_S^A(\theta_t), b) = \hat{g}_t^A\big]}{p\big[\mathrm{Lap}(\nabla_\theta \hat{R}_{S'}^A(\theta_t), b) = \hat{g}_t^A\big]} \le \frac{\|v\|}{b} \le \frac{2 L_A}{m b}. \qquad (13.14)$$
Since L_A = I · L_ERM, we have
$$\log \frac{p\big[\mathrm{Lap}(\nabla_\theta \hat{R}_S^A(\theta_t), b) = \hat{g}_t^A\big]}{p\big[\mathrm{Lap}(\nabla_\theta \hat{R}_{S'}^A(\theta_t), b) = \hat{g}_t^A\big]} \le \frac{2 L_{\mathrm{ERM}}}{m b}\, I.$$
We define
$$\varepsilon_0 = \frac{2 L_{\mathrm{ERM}}}{m b}\, I.$$
By applying Lemma 13.2 with per-iteration privacy parameters (ε_0, 0) and with δ′ replaced by δ′/m, the proof is complete.

Similarly, we can calculate the differential privacy of ERM.

Corollary 13.1 Suppose that SGD is employed for ERM and that the whole training
procedure consists of T iterations. Then, the ERM process is (ε, δ)-differentially
private, where
$$\varepsilon = \varepsilon_0^{\mathrm{ERM}} \sqrt{2T \log \frac{m}{\delta'}} + T \varepsilon_0^{\mathrm{ERM}} \big(e^{\varepsilon_0^{\mathrm{ERM}}} - 1\big), \qquad \delta = \frac{\delta'}{m},$$
with
$$\varepsilon_0^{\mathrm{ERM}} = \frac{2 L_{\mathrm{ERM}}}{m b}.$$
By comparing the results between adversarial training and ERM, we can conclude
that ε0 = I · ε0ERM .
Theorem 13.4 and Corollary 13.1 show that the factors ε and δ are both positively
correlated with the robustified intensity, which suggests a trade-off between privacy
preservation and adversarial robustness.

13.2.2.3 Empirical Evidence

We conducted an extensive empirical study on the privacy–robustness trade-off based


on the ResNet network architecture and the CIFAR-10, CIFAR-100, and Tiny Ima-
geNet datasets. We employed membership inference attack (Shokri et al. 2017), a
standard privacy attack tool, to measure the privacy preservation ability. A mem-
bership inference attack attempts to infer whether a data point is present in the
training sample from the output of the model. Therefore, the membership inference
attack accuracy intuitively expresses the algorithm’s privacy preservation ability.
Specifically, a higher membership inference attack accuracy suggests worse privacy
preservation.
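Shokri et al. (2017) implement the attack with shadow models; the sketch below uses a simpler loss-thresholding baseline, a common proxy whose accuracy can be read the same way. The function names, the per-example loss convention, and the threshold search are illustrative assumptions rather than the exact attack used in the experiments.

```python
import numpy as np
import torch

@torch.no_grad()
def per_example_losses(model, loss_fn, loader):
    """Per-example losses of a trained model; loss_fn is assumed to return one
    loss per example (e.g., torch.nn.CrossEntropyLoss(reduction='none'))."""
    losses = [loss_fn(model(x), y).cpu().numpy() for x, y in loader]
    return np.concatenate(losses)

def membership_inference_accuracy(model, loss_fn, member_loader, nonmember_loader):
    """Loss-thresholding attack: predict 'member' when the loss falls below a
    threshold, and report the best attack accuracy over a grid of thresholds."""
    l_in = per_example_losses(model, loss_fn, member_loader)      # training examples
    l_out = per_example_losses(model, loss_fn, nonmember_loader)  # held-out examples
    losses = np.concatenate([l_in, l_out])
    labels = np.concatenate([np.ones(len(l_in)), np.zeros(len(l_out))])
    best = 0.5
    for t in np.quantile(losses, np.linspace(0.01, 0.99, 99)):
        pred = (losses <= t).astype(float)      # low loss -> predicted member
        best = max(best, (pred == labels).mean())
    return best
```

With roughly balanced member and non-member sets, an accuracy near 0.5 indicates little leakage, while larger values indicate weaker privacy preservation.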
We collected data on both the membership inference attack accuracy and the empir-
ical robustified intensity across all models. This comprehensive dataset enabled us to
construct the four plots showcased in Fig. 13.2. Understanding these plots is crucial:
a higher membership inference attack accuracy implies an increased susceptibility
to privacy breaches, indicating weaker privacy preservation measures.
By investigating the figure, we can observe a pattern: there’s a noticeable positive
correlation between the membership inference attack accuracy and the robustified
intensity I . Essentially, this correlation highlights a critical trade-off between pri-
vacy preservation and model robustness. The stronger the model’s defense against

Fig. 13.2 Plots of membership inference attack accuracy versus empirical robustified intensity. For
the four plots, the datasets and norms used in projected gradient descent (PGD) are (1) CIFAR-10
and the L ∞ norm, (2) CIFAR-100 and the L ∞ norm, (3) CIFAR-100 and the L 2 norm, and (4) Tiny
ImageNet and the L ∞ norm

adversarial attacks (reflected in its robustness), the more vulnerable it becomes to pri-
vacy breaches. This observation suggests the intricate balance required in developing
learning models that prioritize both privacy and robustness.

13.2.3 Generalization–Robustness Trade-Off

This section studies the relationship between generalization and robustness.


We first prove a high-probability generalization bound for an (ε, δ)-differentially
private machine learning algorithm. Since we have already approximated the
(ε, δ)-differential privacy of adversarial training, this bound can help us study the
generalizability of adversarial training.
Theorem 13.5 (High-probability generalization bound in terms of differential pri-
vacy) Suppose that all conditions of Theorem 13.4 hold. Then, algorithm A
has a high-probability generalization bound as follows. Specifically, the following
inequality holds with probability at least 1 − γ:
$$\mathbb{E}_A R(A(S)) - \mathbb{E}_A \hat{R}_S(A(S)) \le c \left( M\big(1 - e^{-\varepsilon} + e^{-\varepsilon}\delta\big) \log m\, \log \frac{m}{\gamma} + \sqrt{\frac{\log (1/\gamma)}{m}} \right), \qquad (13.15)$$

where γ is an arbitrary probability mass, M is the bound on the loss l, m is the


number of training samples, c is a universal constant for any sample distribution,
and the probability is defined over the sample set S.
The proof of Theorem 13.5 relies on the following lemma from Oneto et al. (2017).
Lemma 13.3 (cf. (Feldman and Vondrak 2019), Theorem 1) Suppose that a deter-
ministic machine learning algorithm A is stable with uniform stability β. Suppose
that l ≤ 1. Then, for any sample distribution and any γ ∈ (0, 1), there exists a uni-
versal constant c such that, with probability at least 1 − γ over the draw of the
sample, the generalization error can be upper bounded as follows:
$$\mathbb{E}_z\, l(A(S), z) - \frac{1}{m} \sum_{z \in S} l(A(S), z) \le c \left( \beta \log m\, \log \frac{m}{\gamma} + \sqrt{\frac{\log (1/\gamma)}{m}} \right).$$

Proof of Theorem 13.5 By combining Lemmas 13.5 and 13.3, we can directly prove
Theorem 13.5.
Remark 13.2 By the postprocessing property of differential privacy, since the mapping h ↦ max_{x′ ∈ B_x(ρ)} l(h, (x′, y)) is a deterministic postprocessing of the output hypothesis, max_{x′ ∈ B_x(ρ)} l(A(S), (x′, y)) is also (ε, δ)-differentially private. Therefore, Theorems 13.5 and 13.6 hold for adversarial learning algorithms.
In our endeavor to quantify the generalizability of a learning model, we not only
establish a high-probability generalization bound but also derive an on-average gen-
eralization bound. This on-average bound offers a glimpse into the “expected” perfor-
mance of the learning model. It’s essential to understand the theoretical possibility
of deriving on-average bounds from high-probability bounds through integration.
However, in practice, executing such calculations becomes dauntingly challenging.
As a result, we opt for a more feasible and independent approach to achieve the
on-average bound.
Theorem 13.6 (On-average generalization bound in terms of differential pri-
vacy) Suppose that all conditions of Theorem 13.4 hold. Then, the on-average
generalization error of algorithm A is upper bounded by
$$\mathbb{E}_{S, A}\Big[R(A(S)) - \hat{R}_S(A(S))\Big] \le M\delta e^{-\varepsilon} + M\big(1 - e^{-\varepsilon}\big).$$

The proof of Theorem 13.6 relies on the following lemma from Dwork et al.
(2015a).
Lemma 13.4 (Lemma 11, cf. (Shalev-Shwartz et al. 2010)) Suppose that the loss function is upper bounded. For any machine learning algorithm with β replace-one stability, its generalization error is upper bounded as follows:
$$R(A(S)) - \hat{R}_S(A(S)) \le \beta.$$

Proof of Theorem 13.6 By combining Lemma 13.5 and Lemma 13.4, we can
directly prove Theorem 13.6.
By combining Theorem 13.4, Corollary 13.1, Theorem 13.5, and Theorem 13.6,
we can obtain generalization bounds for both adversarial training and conventional
ERM.
Both generalization bounds are positively correlated with the magnitude of the
differential privacy, which, in turn, is positively correlated with the adversarial
robustness. This leads to the following corollary.
Corollary 13.2 A trade-off exists between generalizability and adversarial robust-
ness (as measured by the robustified intensity) in adversarial training.
13.2.3.1 Establishing Generalization Bounds Based on Algorithmic Stability

Theorems 13.5 and 13.6 are established via algorithmic stability, which measures
how stable an algorithm is when the training sample is exposed to disturbance (Rogers
and Wagner 1978; Kearns and Ron 1999; Bousquet and Elisseeff 2002). While algo-
rithmic stability has many different definitions, this work mainly discusses uniform
stability.

Definition 13.5 (Uniform stability; cf. (Bousquet and Elisseeff 2002)) A machine
learning algorithm A is uniformly stable if, for any neighbouring sample pair S and
S  that differ by only one example, the following inequality holds:
$$\big|\mathbb{E}_A\, l(A(S), z) - \mathbb{E}_A\, l(A(S'), z)\big| \le \beta,$$

where z is an arbitrary example; A(S) and A(S  ) are the output hypotheses learned
on training sets S and S  , respectively; and β is a positive real constant. The constant
β is called the uniform stability of A.

In this section, we prove that (ε, δ)-differentially private machine learning


algorithms are algorithmically stable, as shown by the following lemma.

Lemma 13.5 (Stability–privacy relationship) Suppose that a machine learning


algorithm A is (ε, δ)-differentially private. Additionally, suppose that the loss func-
tion l is upper bounded by a positive real constant M > 0. Then, the algorithm A is
uniformly stable:
$$\big|\mathbb{E}_A\, l(A(S), z) - \mathbb{E}_A\, l(A(S'), z)\big| \le M\delta e^{-\varepsilon} + M\big(1 - e^{-\varepsilon}\big).$$

To prove Lemma 13.5, we first prove a weaker version of it that holds when
algorithm A has ε-pure differential privacy.

Lemma 13.6 Suppose that a machine learning algorithm A is ε-differentially pri-


vate. Additionally, suppose that the loss function l is upper bounded by a positive
real constant M > 0. Then, the algorithm A is uniformly stable:
$$\big|\mathbb{E}_{A(S)}\, l(A(S), z) - \mathbb{E}_{A(S')}\, l(A(S'), z)\big| \le M\big(1 - e^{-\varepsilon}\big).$$

Proof We define a set B as B = {h ∈ H : l(h, z) > t}, where t is an arbitrary real


number. Then, for any t ∈ R,

$$P_{A(S)}\big(A(S) \in B\big) \le e^{\varepsilon} P_{A(S')}\big(A(S') \in B\big). \qquad (13.16)$$

By rearranging Eq. (13.16), we obtain
$$e^{-\varepsilon} P_{A(S)}\big(A(S) \in B\big) \le P_{A(S')}\big(A(S') \in B\big),$$
$$\big(e^{-\varepsilon} - 1\big) P_{A(S)}\big(A(S) \in B\big) \le P_{A(S')}\big(A(S') \in B\big) - P_{A(S)}\big(A(S) \in B\big).$$

Since ε > 0, we have e^{−ε} < 1. Therefore,
$$e^{-\varepsilon} - 1 \le P_{A(S')}\big(A(S') \in B\big) - P_{A(S)}\big(A(S) \in B\big). \qquad (13.17)$$
Equation (13.17) holds for every pair of neighbouring sample sets S and S′. Thus,
$$e^{-\varepsilon} - 1 \le \min_{S,\, S'\ \text{neighbouring}} \Big( P_{A(S')}\big(A(S') \in B\big) - P_{A(S)}\big(A(S) \in B\big) \Big).$$
Therefore,
$$\max_{S,\, S'\ \text{neighbouring}} \Big| P_{A(S')}\big(A(S') \in B\big) - P_{A(S)}\big(A(S) \in B\big) \Big| \le 1 - e^{-\varepsilon}.$$

Thus,
$$\big|\mathbb{E}_{A(S')}\, l(A(S'), z) - \mathbb{E}_{A(S)}\, l(A(S), z)\big| = \left| \int l(A(S), z)\, \mathrm{d}P_{A(S)} - \int l(A(S'), z)\, \mathrm{d}P_{A(S')} \right| \overset{(*)}{\le} \max\{I_1, I_2\} \le M\big(1 - e^{-\varepsilon}\big),$$
where I_1 and I_2 in inequality (∗) are defined as
$$I_1 = \int_{P_{A(S)} > P_{A(S')}} l(A(S), z)\, \big(\mathrm{d}P_{A(S)} - \mathrm{d}P_{A(S')}\big), \qquad I_2 = \int_{P_{A(S)} \le P_{A(S')}} l(A(S), z)\, \big(\mathrm{d}P_{A(S')} - \mathrm{d}P_{A(S)}\big).$$

The proof is complete. 

Now, we prove Lemma 13.5 using a different method.

Proof of Lemma 13.5 Since algorithm A is (ε, δ)-differentially private, we have

$$P_{A(S)}\big(A(S) \in B\big) \le e^{\varepsilon} P_{A(S')}\big(A(S') \in B\big) + \delta,$$

where B is an arbitrary subset drawn from the hypothesis space H. Let B = {h ∈


H : l(h, z) > t}. Then, we have the following inequality:

$$P_{A(S)}\big(l(A(S), z) > t\big) \le e^{\varepsilon} P_{A(S')}\big(l(A(S'), z) > t\big) + \delta. \qquad (13.18)$$

Additionally, EA(S) l(A(S), z) is calculated as follows:


$$\mathbb{E}_{A(S)}\, l(A(S), z) = \int_0^M P_{A(S)}\big(l(A(S), z) > t\big)\, \mathrm{d}t.$$

By applying Eq. (13.18), we obtain


$$\mathbb{E}_{A(S)}\, l(A(S), z) = \int_0^M P_{A(S)}\big(l(A(S), z) > t\big)\, \mathrm{d}t \le \int_0^M e^{\varepsilon} P_{A(S')}\big(l(A(S'), z) > t\big)\, \mathrm{d}t + M\delta = e^{\varepsilon}\, \mathbb{E}_{A(S')}\, l(A(S'), z) + M\delta. \qquad (13.19)$$

By rearranging Eq. (13.19), we obtain
$$e^{-\varepsilon}\, \mathbb{E}_{A(S)}\, l(A(S), z) \le \mathbb{E}_{A(S')}\, l(A(S'), z) + e^{-\varepsilon} M\delta,$$
$$\big(e^{-\varepsilon} - 1\big)\, \mathbb{E}_{A(S)}\, l(A(S), z) \le \mathbb{E}_{A(S')}\, l(A(S'), z) - \mathbb{E}_{A(S)}\, l(A(S), z) + e^{-\varepsilon} M\delta.$$
Therefore,
$$\mathbb{E}_{A(S)}\, l(A(S), z) - \mathbb{E}_{A(S')}\, l(A(S'), z) \le e^{-\varepsilon} M\delta + \big(1 - e^{-\varepsilon}\big)\, \mathbb{E}_{A(S)}\, l(A(S), z).$$

By exchanging the roles of S and S′, we similarly obtain
$$\mathbb{E}_{A(S')}\, l(A(S'), z) - \mathbb{E}_{A(S)}\, l(A(S), z) \le e^{-\varepsilon} M\delta + \big(1 - e^{-\varepsilon}\big)\, \mathbb{E}_{A(S')}\, l(A(S'), z).$$
Since the loss is upper bounded by M, both expectations are at most M.

Thus,
$$\big|\mathbb{E}_{A(S)}\, l(A(S), z) - \mathbb{E}_{A(S')}\, l(A(S'), z)\big| \le M\delta e^{-\varepsilon} + M\big(1 - e^{-\varepsilon}\big).$$

The proof is complete.

Theorems 13.5 and 13.6 are further established based on Lemma 13.5 with
Feldman and Vondrak (2019) and Bousquet and Elisseeff (2002), respectively.

13.2.3.2 Tightness of Generalization Bounds

Dependency on the number of training samples m.

In Sect. 13.2.2, we have approximated the differential privacy rate of adversarial


training with respect to the number of training samples m; see Remark 13.1. By
combining Theorems 13.5 and 13.6, we can approximate the tightness of the high-
probability generalization bound and the on-average generalization bound as shown
in the following remarks.

Remark 13.3 The high-probability generalization bound given by Theorem 13.5 is O(1/√m).

Remark 13.4 The on-average generalization bound given by Theorem 13.6 is O(√(log m)/m).

Dependency on the model size.

Our generalization bounds do not explicitly depend on the number of parameters,


which typically is prohibitively large in deep learning. The only term that could
implicitly depend on the model size is the gradient norm. We trained ResNet models on two datasets, namely, CIFAR-100 and Tiny ImageNet, within the frameworks of ERM and adversarial training with different values of the radius ρ and two different norms in
projected gradient descent (PGD), namely, the L 2 norm and the L ∞ norm. Some of
the collected data are shown in Fig. 13.3. These box plots clearly illustrate that the
gradient norm is very small in comparison to the number of parameters (which is on
the order of tens of millions): the mean of the gradient norm is approximately 1, and
the largest value is approximately 5.
Existing works have proven generalization bounds based on hypothesis complexity. Yin et al. (2019) proved an O(1/√m) generalization bound for models learned via adversarial training based on the Rademacher complexity of deep neural networks. Khim and Loh (2018) also proved an O(1/√m) generalization bound based on the Rademacher complexity of the hypothesis space. Tu et al. (2019) proved an O(1/√m) generalization bound based on the covering number of the hypothesis space. However, the hypothesis complexity will be prohibitively large in deep learning.

13.2.3.3 Experimental Verification

Our investigation into the trade-off between generalization and robustness is fur-
ther grounded in empirical analysis, leveraging the ResNet architecture and datasets

Fig. 13.3 Box plots of the gradient norms in ERM and adversarial training on CIFAR-100 and
Tiny ImageNet with six different ρ values and two different norms in PGD. The sample sizes are 55,000 and 120,000, respectively. The last two plots correspond to experimental settings A and B, respectively

Fig. 13.4 Plots of the generalization gap versus the empirical robustified intensity. For the four
plots, the datasets and norms used in PGD are (1) CIFAR-10 and the L ∞ norm, (2) CIFAR-100 and
the L ∞ norm, (3) CIFAR-100 and the L 2 norm, and (4) Tiny ImageNet and the L ∞ norm

including CIFAR-10, CIFAR-100, and Tiny ImageNet. To gauge the generalizabil-


ity of our models, we employ a key metric known as the generalization gap. This
metric quantifies the disparity between training and test accuracies, offering valuable
insights into the model’s ability to perform well on unseen data.
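A minimal sketch of the bookkeeping behind plots like Fig. 13.4 is given below: the generalization gap of each trained model is paired with its empirical robustified intensity, and the Pearson correlation between the two is computed. The numerical values in the example are purely illustrative placeholders, not measurements from this chapter.

```python
import numpy as np

def generalization_gap(train_acc: float, test_acc: float) -> float:
    """Generalization gap: disparity between training and test accuracies."""
    return train_acc - test_acc

# Purely illustrative placeholder values, one entry per trained model.
accuracies = [(0.99, 0.93), (0.99, 0.90), (0.98, 0.86), (0.97, 0.82)]
gaps = np.array([generalization_gap(tr, te) for tr, te in accuracies])
intensities = np.array([1.1, 1.8, 2.6, 3.4])   # empirical robustified intensities

# A positive Pearson correlation reflects the generalization-robustness trade-off.
r = np.corrcoef(gaps, intensities)[0, 1]
print(f"Pearson correlation between gap and intensity: {r:.2f}")
```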
We collected data on both the generalization gaps and empirical robustified inten-
sities across all models, using this information to construct four informative plots
presented in Fig. 13.4. A larger generalization gap signifies poorer generalizability
of the models to unseen data. Upon scrutinizing these plots, a notable trend emerges:
there’s a distinct positive correlation between the generalization gap and the empiri-
cal robustified intensity. This observation serves as compelling evidence confirming
the existence of a trade-off between generalization and robustness in the examined
deep learning models.

References

Attias, Idan, Aryeh Kontorovich, and Yishay Mansour. 2019. Improved generalization bounds for
robust learning. In Algorithmic Learning Theory, 162–183.
Baluja, Shumeet, and Ian Fischer. 2018. Learning to attack: adversarial transformation networks.
In AAAI Conference on Artificial Intelligence, vol. 1, 3.
Bartlett, Peter L, Dylan J Foster, and Matus J Telgarsky. 2017. Spectrally-normalized margin bounds
for neural networks. In Advances in Neural Information Processing Systems, 6240–6249.
Bhagoji, Arjun Nitin, Daniel Cullina, and Prateek Mittal. 2019. Lower bounds on adversarial
robustness from optimal transport. In Advances in Neural Information Processing Systems,
7498–7510.
Biggio, Battista, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Gior-
gio Giacinto, and Fabio Roli. 2013. Evasion attacks against machine learning at test time. In
European Conference on Machine Learning.
Bousquet, Olivier, and André Elisseeff. 2002. Stability and generalization. Journal of Machine
Learning Research 2 (Mar): 499–526.
Chen, Lin, Yifei Min, Mingrui Zhang, and Amin Karbasi. 2020. More data can expand the general-
ization gap between adversarially robust and standard models. arXiv preprint arXiv:2002.04725.
Cullina, Daniel, Arjun Nitin Bhagoji, and Prateek Mittal. 2018. PAC-learning in the presence of
adversaries. In Advances in Neural Information Processing Systems, 230–241.
Dai, Hanjun, Hui Li, Tian Tian, Xin Huang, Lin Wang, Jun Zhu, and Le Song. 2018. Adversarial
attack on graph structured data. arXiv preprint arXiv:1806.02371.

Dwork, Cynthia, and Aaron Roth. 2014. The algorithmic foundations of differential privacy.
Foundations and Trends® in Theoretical Computer Science 9 (3–4): 211–407.
Dwork, Cynthia, and Deirdre K Mulligan. 2013. It’s not privacy, and it’s not fair. Stanford Law
Review Online 66: 35.
Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toni Pitassi, Omer Reingold, and Aaron Roth. 2015.
Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information
Processing Systems, 2350–2358.
Feldman, Vitaly, and Jan Vondrak. 2019. High probability generalization bounds for uniformly
stable algorithms with nearly optimal rate. arXiv preprint arXiv:1902.10710.
Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. 2018. Size-independent sample complexity
of neural networks. In Annual Conference on Learning Theory, 297–299.
Goodfellow, Ian J, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing
adversarial examples. arXiv preprint arXiv:1412.6572.
Kearns, Michael, and Dana Ron. 1999. Algorithmic stability and sanity-check bounds for leave-
one-out cross-validation. Neural Computation 11 (6): 1427–1453.
Khim, Justin, and Po-Ling Loh. 2018. Adversarial risk bounds via function transformation. arXiv
preprint arXiv:1810.09519.
Kingma, Diederik P, and Jimmy Ba. 2015. Adam: a method for stochastic optimization. In
International Conference on Learning Systems.
Krizhevsky, Alex, and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images.
Technical report, Citeseer.
Kushner, Harold, and G George Yin. 2003. Stochastic Approximation and Recursive Algorithms
and Applications, vol. 35. Springer Science & Business Media.
Le, Ya, and Xuan Yang. 2015. Tiny imagenet visual recognition challenge. CS 231N 7 (7): 3.
Li, Bai, Changyou Chen, Wenlin Wang, and Lawrence Carin. 2018. Second-order adversarial attack
and certifiable robustness.
Ljung, Lennart, Georg Pflug, and Harro Walk. 2012. Stochastic Approximation and Optimization
of Random Systems, vol. 17. Birkhäuser.
Mandt, Stephan, Matthew D Hoffman, and David M Blei. 2017. Stochastic gradient descent as
approximate Bayesian inference. Journal of Machine Learning Research 18 (1): 4873–4907.
Min, Yifei, Lin Chen, and Amin Karbasi. 2020. The curious case of adversarially robust models:
More data can help, double descend, or hurt generalization. arXiv preprint arXiv:2002.11080.
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. Foundations of Machine
Learning. MIT Press.
Montasser, Omar, Steve Hanneke, and Nathan Srebro. 2019. VC classes are adversarially robustly
learnable, but only improperly. In Conference on Learning Theory, 2512–2530.
Nesterov, Yurii E. 1983. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, vol. 269, 543–547.
Oneto, Luca, Sandro Ridella, and Davide Anguita. 2017. Differential privacy and generalization:
Sharper bounds with applications. Pattern Recognition Letters 89: 31–38.
Papernot, Nicolas, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Anan-
thram Swami. 2016. The limitations of deep learning in adversarial settings. In IEEE European
Symposium on Security and Privacy, 372–387.
Pydi, Muni Sreenivas, and Varun Jog. 2020. Adversarial risk via optimal transport and optimal
couplings. In International Conference on Machine Learning, 7814–7823.
Raghunathan, Aditi, Jacob Steinhardt, and Percy Liang. 2018. Certified defenses against adversarial
examples. In International Conference on Learning Representations.
Robbins, Herbert, and Sutton Monro. 1951. A stochastic approximation method. The Annals of
Mathematical Statistics, 400–407.
Rogers, William H, and Terry J Wagner. 1978. A finite sample distribution-free performance bound
for local discrimination rules. The Annals of Statistics, 506–514.

Schmidt, Ludwig, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry.
2018. Adversarially robust generalization requires more data. In Advances in Neural Information
Processing Systems, 5014–5026.
Shalev-Shwartz, Shai, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. 2010. Learnability,
stability and uniform convergence. Journal of Machine Learning Research 11 (Oct): 2635–2670.
Shokri, Reza, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference
attacks against machine learning models. In IEEE Symposium on Security and Privacy (SP), 3–18.
Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow,
and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow,
and Rob Fergus. 2014. Intriguing properties of neural networks. In International Conference on
Learning Representations.
Tseng, Paul. 1998. An incremental gradient (-projection) method with momentum term and adaptive
stepsize rule. SIAM Journal on Optimization 8 (2): 506–531.
Tu, Zhuozhuo, Jingwei Zhang, and Dacheng Tao. 2019. Theoretical analysis of adversarial learning:
a minimax approach. In Advances in Neural Information Processing Systems, 12280–12290.
Vapnik, Vladimir. 2013. The Nature of Statistical Learning Theory. Springer Science & Business
Media.
Yin, Dong, Ramchandran Kannan, and Peter Bartlett. 2019. Rademacher complexity for adversar-
ially robust generalization. In International Conference on Machine Learning, 7085–7094.
Zhang, Hongyang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I
Jordan. 2019. Theoretically principled trade-off between robustness and accuracy. arXiv preprint
arXiv:1901.08573.
Zheng, Tianhang, Changyou Chen, and Kui Ren. 2019. Distributionally adversarial attack. In AAAI
Conference on Artificial Intelligence, vol. 33, 2253–2260.
