
Full error analysis for the training of deep neural networks

Christian Beck¹, Arnulf Jentzen²,³, and Benno Kuckuck⁴,⁵

1 Department of Mathematics, ETH Zurich, Zürich, Switzerland, e-mail: [email protected]
2 Department of Mathematics, ETH Zurich, Zürich, Switzerland, e-mail: [email protected]
3 Faculty of Mathematics and Computer Science, University of Münster, Münster, Germany, e-mail: [email protected]
4 Institute of Mathematics, University of Düsseldorf, Düsseldorf, Germany, e-mail: [email protected]
5 Faculty of Mathematics and Computer Science, University of Münster, Münster, Germany, e-mail: [email protected]

arXiv:1910.00121v2 [math.NA] 30 Jan 2020

January 31, 2020

Abstract
Deep learning algorithms have been applied very successfully in recent years to a range of problems out of reach for classical solution paradigms. Nevertheless, there is no completely rigorous mathematical error and convergence analysis which explains the success of deep learning algorithms. The error of a deep learning algorithm can in many situations be decomposed into three parts: the approximation error, the generalization error, and the optimization error. In this work we estimate each of these three errors for a certain deep learning algorithm and combine the three error estimates to obtain an overall error analysis for the deep learning algorithm under consideration. In particular, we thereby establish convergence with a suitable convergence speed for the overall error of the deep learning algorithm under consideration. Our convergence speed analysis is far from optimal and the convergence speed that we establish is rather slow, increases exponentially in the dimensions, and, in particular, suffers from the curse of dimensionality. The main contribution of this work is, instead, to provide a full error analysis (i) which covers each of the three different sources of errors usually emerging in deep learning algorithms and (ii) which merges these three sources of errors into one overall error estimate for the considered deep learning algorithm.

Contents

1 Introduction

2 Deep neural networks (DNNs)
  2.1 Vectorized description of DNNs
    2.1.1 Affine functions
    2.1.2 Vectorized description of DNNs
    2.1.3 Activation functions
    2.1.4 Rectified DNNs
  2.2 Structured description of DNNs
    2.2.1 Structured description of DNNs
    2.2.2 Realizations of DNNs
    2.2.3 On the connection to the vectorized description of DNNs
    2.2.4 Parallelizations of DNNs
    2.2.5 Basic examples for DNNs
    2.2.6 Compositions of DNNs
    2.2.7 Powers and extensions of DNNs
    2.2.8 Embedding DNNs in larger architectures
  2.3 Local Lipschitz continuity of the parametrization function

3 Separate analyses of the error sources
  3.1 Analysis of the approximation error
    3.1.1 Approximations for Lipschitz continuous functions
    3.1.2 DNN representations for maxima
    3.1.3 Interpolations through DNNs
    3.1.4 Explicit approximations through DNNs
    3.1.5 Implicit approximations through DNNs
  3.2 Analysis of the generalization error
    3.2.1 Hoeffding’s concentration inequality
    3.2.2 Covering number estimates
    3.2.3 Measurability properties for suprema
    3.2.4 Concentration inequalities for random fields
    3.2.5 Uniform estimates for the statistical learning error
  3.3 Analysis of the optimization error
    3.3.1 Convergence rates for the minimum Monte Carlo method
    3.3.2 Continuous uniformly distributed samples

4 Overall error analysis
  4.1 Bias-variance decomposition
  4.2 Overall error decomposition
  4.3 Analysis of the convergence speed
    4.3.1 Convergence rates for convergence in probability
    4.3.2 Convergence rates for strong convergence

1 Introduction
In problems like image recognition, text analysis, speech recognition, or playing various games, to name
a few, it is very hard and seems at the moment entirely impossible to provide a function or to hard-code
a computer program which attaches to the input – be it a picture, a piece of text, an audio recording,
or a certain game situation – a meaning or a recommended action. Nevertheless, deep learning has been
applied very successfully in recent years to such and related problems. The success of deep learning
in applications is even more surprising as, to this day, the reasons for its performance are not entirely
rigorously understood. In particular, there is no rigorous mathematical error and convergence analysis
which explains the success of deep learning algorithms.
In contrast to traditional approaches, machine learning methods in general and deep learning methods
in particular attempt to infer the unknown target function or at least a good enough approximation thereof
from examples encountered during the training. Often a deep learning algorithm has three ingredients: (i)
the hypothesis class, a parametrizable class of functions in which we try to find a reasonable approximation
of the unknown target function, (ii) a numerical approximation of the expected loss function based on the

training examples, and (iii) an optimization algorithm which tries to approximately calculate an element
of the hypothesis class which minimizes the numerical approximation of the expected loss function from
(ii) given the training examples. Common approaches are to choose a set of suitable fully connected deep
neural networks (DNNs) as hypothesis class in (i), empirical risks as approximations of the expected loss
function in (ii), and stochastic gradient descent-type algorithms with random initializations as optimization algorithms in (iii). Each of these three ingredients contributes to the overall error of the considered
approximation algorithm. The choice of the hypothesis class results in the so-called approximation error
(cf., e.g., [3, 4, 19, 38, 40, 41] and the references mentioned at the beginning of Section 3), replacing the
exact expected loss function by a numerical approximation leads to the so-called generalization error (cf.,
e.g., [5, 10, 18, 35, 51, 68, 71] and the references mentioned therein), and the employed optimization algorithm introduces the optimization error (cf., e.g., [2, 6, 9, 15, 20, 26, 43, 45] and the references mentioned
therein).
In this work we estimate the approximation error, the generalization error, as well as the optimization
error and we also combine these three errors to establish convergence with a suitable convergence speed
for the overall error of the deep learning algorithm under consideration. Our convergence speed analysis
is far from optimal and the convergence speed that we establish is rather slow, increases exponentially in
the dimensions, and, in particular, suffers from the curse of dimensionality (cf., e.g., Bellman [8], Novak
& Woźniakowski [56, Chapter 1], and Novak & Woźniakowski [57, Chapter 9]). The main contribution of
this work is, instead, to provide a full error analysis (i) which covers each of the three different sources of
errors usually emerging in deep learning algorithms and (ii) which merges these three sources of errors into
one overall error estimate for the considered deep learning algorithm. In the next result, Theorem 1.1,
we briefly illustrate the findings of this article in a special case and we refer to Section 4.2 below for the
more general convergence results which we develop in this article.
Theorem 1.1. Let d ∈ N, L, a ∈ R, b ∈ (a, ∞), R ∈ [max{2, L, |a|, |b|}, ∞), let (Ω, F, P) be a probability space, let X_m : Ω → [a, b]^d, m ∈ N, be i.i.d. random variables, let ‖·‖ : R^d → [0, ∞) be the standard norm on R^d, let ϕ : [a, b]^d → [0, 1] satisfy for all x, y ∈ [a, b]^d that |ϕ(x) − ϕ(y)| ≤ L‖x − y‖, for every 𝔡, r, s ∈ N, δ ∈ N₀, θ = (θ₁, θ₂, . . . , θ_𝔡) ∈ R^𝔡 with 𝔡 ≥ δ + rs + r let A^{θ,δ}_{r,s} : R^s → R^r satisfy for all x = (x₁, x₂, . . . , x_s) ∈ R^s that
\[
\mathcal{A}^{\theta,\delta}_{r,s}(x) = \left( \Big[ \textstyle\sum_{i=1}^{s} x_i \theta_{\delta+i} \Big] + \theta_{\delta+rs+1},\; \Big[ \textstyle\sum_{i=1}^{s} x_i \theta_{\delta+s+i} \Big] + \theta_{\delta+rs+2},\; \ldots,\; \Big[ \textstyle\sum_{i=1}^{s} x_i \theta_{\delta+(r-1)s+i} \Big] + \theta_{\delta+rs+r} \right), \tag{1}
\]
let c : R → [0, 1] and R_τ : R^τ → R^τ, τ ∈ N, satisfy for all τ ∈ N, x = (x₁, x₂, . . . , x_τ) ∈ R^τ, y ∈ R that c(y) = min{1, max{0, y}} and R_τ(x) = (max{x₁, 0}, max{x₂, 0}, . . . , max{x_τ, 0}), for every 𝔡, τ ∈ {3, 4, . . .}, θ ∈ R^𝔡 with 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 let N^{θ,τ} : R^d → R satisfy for all x ∈ R^d that
\[
\mathcal{N}^{\theta,\tau}(x) = \Big( c \circ \mathcal{A}^{\theta,\,\tau(d+1)+(\tau-3)\tau(\tau+1)}_{1,\tau} \circ R_\tau \circ \mathcal{A}^{\theta,\,\tau(d+1)+(\tau-4)\tau(\tau+1)}_{\tau,\tau} \circ R_\tau \circ \ldots \circ \mathcal{A}^{\theta,\,\tau(d+1)}_{\tau,\tau} \circ R_\tau \circ \mathcal{A}^{\theta,0}_{\tau,d} \Big)(x), \tag{2}
\]
let E_{𝔡,M,τ} : [−R, R]^𝔡 × Ω → [0, ∞), 𝔡, M, τ ∈ N, satisfy for all 𝔡, M ∈ N, τ ∈ {3, 4, . . .}, θ ∈ [−R, R]^𝔡, ω ∈ Ω with 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 that
\[
\mathcal{E}_{\mathfrak{d},M,\tau}(\theta, \omega) = \frac{1}{M} \Bigg[ \sum_{m=1}^{M} \big| \mathcal{N}^{\theta,\tau}(X_m(\omega)) - \varphi(X_m(\omega)) \big|^2 \Bigg], \tag{3}
\]
for every 𝔡 ∈ N let Θ_{𝔡,k} : Ω → [−R, R]^𝔡, k ∈ N, be i.i.d. random variables, assume for all 𝔡 ∈ N that Θ_{𝔡,1} is continuous uniformly distributed on [−R, R]^𝔡, and let Ξ_{𝔡,K,M,τ} : Ω → [−R, R]^𝔡, 𝔡, K, M, τ ∈ N, satisfy for all 𝔡, K, M, τ ∈ N that Ξ_{𝔡,K,M,τ} = Θ_{𝔡, min{k ∈ {1,2,...,K} : E_{𝔡,M,τ}(Θ_{𝔡,k}) = min_{l ∈ {1,2,...,K}} E_{𝔡,M,τ}(Θ_{𝔡,l})}}. Then there exists 𝔠 ∈ (0, ∞) such that for all 𝔡, K, M, τ ∈ N, ε ∈ (0, 1] with τ ≥ 2d(2dLε⁻¹ + 2)^d and 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 it holds that
\[
\mathbb{P}\Bigg[ \int_{[a,b]^d} \big| \mathcal{N}^{\Xi_{\mathfrak{d},K,M,\tau},\tau}(x) - \varphi(x) \big| \, \mathbb{P}_{X_1}(dx) > \varepsilon \Bigg] \le \exp\!\big( -K (\mathfrak{c}\tau)^{-\tau\mathfrak{d}} \varepsilon^{2\mathfrak{d}} \big) + 2 \exp\!\big( \mathfrak{d} \ln\!\big( (\mathfrak{c}\tau)^{\tau} \varepsilon^{-2} \big) - \varepsilon^{4} \mathfrak{c}^{-1} M \big). \tag{4}
\]

Theorem 1.1 is an immediate consequence of Corollary 4.8 in Section 4.2 below. Corollary 4.8 follows from Corollary 4.7 which, in turn, is implied by Theorem 4.5, the main result of this article. In the following we add some comments and explanations regarding the mathematical objects which appear in Theorem 1.1 above. For every 𝔡, τ ∈ {3, 4, . . .}, θ ∈ R^𝔡 with 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 the function N^{θ,τ} : R^d → R in (2) above describes the realization of a fully connected deep neural network with τ layers (1 input layer with d neurons [d dimensions], 1 output layer with 1 neuron [1 dimension], as well as τ − 2 hidden layers with τ neurons on each hidden layer [τ dimensions in each hidden layer]). The vector θ ∈ R^𝔡 in (2) in Theorem 1.1 above stores the real parameters (the weights and the biases) for the concrete considered neural network. In particular, the architecture of the deep neural network in (2) is chosen so that we have τd + (τ − 3)τ² + τ real parameters in the weight matrices and (τ − 2)τ + 1 real parameters in the bias vectors, resulting in [τd + (τ − 3)τ² + τ] + [(τ − 2)τ + 1] = τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 real parameters for the deep neural network overall. This explains why the dimension 𝔡 of the parameter vector θ ∈ R^𝔡 must be larger or equal than the number of real parameters used to describe the deep neural network in (2) in the sense that 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 (see above (2)). The affine linear transformations for the deep neural network, which appear just after the input layer and just after each hidden layer in (2), are specified in (1) above. The functions R_τ : R^τ → R^τ, τ ∈ N, describe the multidimensional rectifier functions which are employed as activation functions in (2). Realizations of the random variables (X_m, Y_m) := (X_m, ϕ(X_m)), m ∈ {1, 2, . . . , M}, act as training data and the neural network parameter vector θ ∈ R^𝔡 should be chosen so that the empirical risk in (3) gets minimized.
In Theorem 1.1 above, we employ just random initializations as the optimization algorithm and perform no gradient descent steps. The inequality in (4) in Theorem 1.1 above provides a quantitative error estimate for the probability that the L¹-distance between the trained deep neural network approximation N^{Ξ_{𝔡,K,M,τ},τ}(x), x ∈ [a, b]^d, and the function ϕ(x), x ∈ [a, b]^d, which we actually want to learn, is larger than a possibly arbitrarily small real number ε ∈ (0, 1]. In (4) in Theorem 1.1 above we measure the error between the deep neural network and the function ϕ : [a, b]^d → [0, 1], which we intend to learn, in the L¹-distance instead of in the L²-distance. However, in the more general results in Section 4.2 below we measure the error in the L²-distance and, just to keep the statement in Theorem 1.1 as easily accessible as possible, we restrict ourselves in Theorem 1.1 above to the L¹-distance. Observe that for every ε ∈ (0, 1] and every 𝔡, τ ∈ {3, 4, . . .} with 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 we have that the right hand side of (4) converges to zero as K and M tend to infinity. The right hand side of (4) also specifies a concrete speed of convergence and in this sense Theorem 1.1 provides a full error analysis for the deep learning algorithm under consideration. Our analysis is in parts inspired by Maggi [50], Berner et al. [10], Cucker & Smale [18], Beck et al. [6], and Fehrman et al. [26].
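
To make the algorithm behind Theorem 1.1 concrete, the following minimal Python/NumPy sketch (our own illustration, not code from this article; the names realize and random_search_fit are hypothetical) implements the network realization in (2), the empirical risk in (3), and the pure random-search optimization over K uniform draws from [−R, R]^𝔡:

```python
import numpy as np

def relu(x):
    # the multidimensional rectifier R_tau from Theorem 1.1
    return np.maximum(x, 0.0)

def realize(theta, d, tau, x):
    """Realization N^{theta,tau}(x) of the network in (2): d inputs,
    tau - 2 hidden layers of width tau, and one output clipped to [0, 1]."""
    offset = 0
    W = theta[offset:offset + tau * d].reshape(tau, d); offset += tau * d
    b = theta[offset:offset + tau]; offset += tau
    y = relu(W @ x + b)                      # input layer -> first hidden layer
    for _ in range(tau - 3):                 # remaining hidden transitions
        W = theta[offset:offset + tau * tau].reshape(tau, tau); offset += tau * tau
        b = theta[offset:offset + tau]; offset += tau
        y = relu(W @ y + b)
    W = theta[offset:offset + tau].reshape(1, tau); offset += tau
    b = theta[offset:offset + 1]
    return float(np.clip(W @ y + b, 0.0, 1.0))  # the clipping function c

def random_search_fit(X, y, d, tau, K, R, rng):
    """Minimum Monte Carlo: draw K uniform parameter vectors on [-R, R]^D
    and keep the one with the smallest empirical risk (3)."""
    D = tau * (d + 1) + (tau - 3) * tau * (tau + 1) + tau + 1
    best_theta, best_risk = None, np.inf
    for _ in range(K):
        theta = rng.uniform(-R, R, size=D)
        preds = np.array([realize(theta, d, tau, xm) for xm in X])
        risk = np.mean((preds - y) ** 2)
        if risk < best_risk:
            best_theta, best_risk = theta, risk
    return best_theta, best_risk
```

For τ = 3 the loop over hidden transitions is empty and the network has exactly one hidden layer, consistent with the τ − 2 hidden layers in (2).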
The remainder of this article is organized as follows. In Section 2 we present two elementary approaches to describing DNNs in a mathematical fashion. Both approaches will be used in our error
analyses in the later parts of this article. In Section 3 we separately analyze the approximation error, the
generalization error, and the optimization error of the considered algorithm. In Section 4 we combine the
separate error analyses in Section 3 to obtain an overall error analysis of the considered algorithm.

2 Deep neural networks (DNNs)


In this section we present two elementary approaches to describing DNNs in a mathematical fashion. More specifically, we present in Section 2.1 a vectorized description for DNNs and we present
in Section 2.2 a structured description for DNNs. Both approaches will be used in our error analyses in
the later parts of this article. Sections 2.1, 2.2, and 2.3 are partially based on material in publications
from the scientific literature such as Beck et al. [6, 7], Berner et al. [10], Goodfellow et al. [28], and
Grohs et al. [31, 32]. In particular, Definition 2.1 is inspired by, e.g., (25) in [7], Definition 2.2 is inspired
by, e.g., (26) in [7], Definition 2.3 is, e.g., [31, Definition 2.2], Definitions 2.4, 2.5, 2.6, 2.7, and 2.8 are
inspired by, e.g., [10, Setting 2.3], Definition 2.9 is, e.g., [31, Definition 2.1], Definition 2.10 is, e.g., [31,
Definition 2.3], Definition 2.16 is, e.g., [31, Definition 2.17], Definition 2.17 is, e.g., [32, Definition 3.10],

Definition 2.18 is, e.g., [32, Definition 3.15], Definition 2.19 is, e.g., [31, Definition 2.5], Definition 2.23 is,
e.g., [31, Definition 2.11], Definition 2.24 is, e.g., [31, Definition 2.12], and Theorem 2.36 is a strengthened
version of [10, Theorem 4.2].

2.1 Vectorized description of DNNs


2.1.1 Affine functions
Definition 2.1 (Affine function). Let d, r, s ∈ N, δ ∈ N₀, θ = (θ₁, θ₂, . . . , θ_d) ∈ R^d satisfy d ≥ δ + rs + r. Then we denote by A^{θ,δ}_{r,s} : R^s → R^r the function which satisfies for all x = (x₁, x₂, . . . , x_s) ∈ R^s that
\[
\mathcal{A}^{\theta,\delta}_{r,s}(x) =
\begin{pmatrix}
\theta_{\delta+1} & \theta_{\delta+2} & \cdots & \theta_{\delta+s} \\
\theta_{\delta+s+1} & \theta_{\delta+s+2} & \cdots & \theta_{\delta+2s} \\
\theta_{\delta+2s+1} & \theta_{\delta+2s+2} & \cdots & \theta_{\delta+3s} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_{\delta+(r-1)s+1} & \theta_{\delta+(r-1)s+2} & \cdots & \theta_{\delta+rs}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_s \end{pmatrix}
+
\begin{pmatrix} \theta_{\delta+rs+1} \\ \theta_{\delta+rs+2} \\ \theta_{\delta+rs+3} \\ \vdots \\ \theta_{\delta+rs+r} \end{pmatrix}
= \left( \Big[ \textstyle\sum_{k=1}^{s} x_k \theta_{\delta+k} \Big] + \theta_{\delta+rs+1},\; \Big[ \textstyle\sum_{k=1}^{s} x_k \theta_{\delta+s+k} \Big] + \theta_{\delta+rs+2},\; \ldots,\; \Big[ \textstyle\sum_{k=1}^{s} x_k \theta_{\delta+(r-1)s+k} \Big] + \theta_{\delta+rs+r} \right). \tag{5}
\]
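
As an illustration of Definition 2.1, the following sketch (assuming NumPy and 0-based indexing, so that θ_{δ+1} corresponds to theta[delta]; the name affine is ours) reads the weight matrix and bias vector of (5) out of a flat parameter vector:

```python
import numpy as np

def affine(theta, delta, r, s, x):
    """A^{theta,delta}_{r,s}(x) from Definition 2.1 / equation (5):
    an r x s weight matrix stored row-major in theta starting at offset
    delta, followed by a bias vector of length r."""
    W = np.asarray(theta[delta:delta + r * s]).reshape(r, s)
    b = np.asarray(theta[delta + r * s:delta + r * s + r])
    return W @ np.asarray(x) + b
```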

2.1.2 Vectorized description of DNNs


Definition 2.2. Let d, L ∈ N, l₀, l₁, . . . , l_L ∈ N, δ ∈ N₀, θ ∈ R^d satisfy
\[
d \ge \delta + \sum_{k=1}^{L} l_k (l_{k-1} + 1) \tag{6}
\]
and let Ψ_k : R^{l_k} → R^{l_k}, k ∈ {1, 2, . . . , L}, be functions. Then we denote by N^{θ,δ,l₀}_{Ψ₁,Ψ₂,...,Ψ_L} : R^{l₀} → R^{l_L} the function which satisfies for all x ∈ R^{l₀} that
\[
\mathcal{N}^{\theta,\delta,l_0}_{\Psi_1,\Psi_2,\ldots,\Psi_L}(x) = \Big( \Psi_L \circ \mathcal{A}^{\theta,\,\delta+\sum_{k=1}^{L-1} l_k(l_{k-1}+1)}_{l_L,l_{L-1}} \circ \Psi_{L-1} \circ \mathcal{A}^{\theta,\,\delta+\sum_{k=1}^{L-2} l_k(l_{k-1}+1)}_{l_{L-1},l_{L-2}} \circ \ldots \circ \Psi_2 \circ \mathcal{A}^{\theta,\,\delta+l_1(l_0+1)}_{l_2,l_1} \circ \Psi_1 \circ \mathcal{A}^{\theta,\delta}_{l_1,l_0} \Big)(x) \tag{7}
\]
(cf. Definition 2.1).
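
Building on the affine sketch above, a minimal implementation of the vectorized description in Definition 2.2 walks through the layers and advances the offset by l_k(l_{k−1} + 1) after each affine step; the function name vectorized_dnn is our own:

```python
def vectorized_dnn(theta, delta, dims, activations, x):
    """N^{theta,delta,l0}_{Psi_1,...,Psi_L}(x) from Definition 2.2, with
    dims = (l_0, l_1, ..., l_L) and activations = [Psi_1, ..., Psi_L];
    reuses the affine() sketch given after Definition 2.1."""
    offset = delta
    for k in range(1, len(dims)):
        x = activations[k - 1](affine(theta, offset, dims[k], dims[k - 1], x))
        offset += dims[k] * (dims[k - 1] + 1)
    return x
```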

2.1.3 Activation functions


Definition 2.3 (Multidimensional version). Let d ∈ N and let ψ : R → R be a function. Then we denote
by Mψ,d : Rd → Rd the function which satisfies for all x = (x1 , x2 , . . . , xd ) ∈ Rd that
Mψ,d (x) = (ψ(x1 ), ψ(x2 ), . . . , ψ(xd )) . (8)
Definition 2.4 (Rectifier function). We denote by r : R → R the function which satisfies for all x ∈ R
that
r(x) = max{x, 0}. (9)
Definition 2.5 (Multidimensional rectifier function). Let d ∈ N. Then we denote by Rd : Rd → Rd the
function given by
Rd = Mr,d (10)
(cf. Definitions 2.3 and 2.4).
Definition 2.6 (Clipping function). Let u ∈ [−∞, ∞), v ∈ (u, ∞]. Then we denote by cu,v : R → R the
function which satisfies for all x ∈ R that
cu,v (x) = max{u, min{x, v}}. (11)

Definition 2.7 (Multidimensional clipping function). Let d ∈ N, u ∈ [−∞, ∞), v ∈ (u, ∞]. Then we
denote by Cu,v,d : Rd → Rd the function given by

Cu,v,d = Mcu,v ,d (12)

(cf. Definitions 2.3 and 2.6).

2.1.4 Rectified DNNs


Definition 2.8 (Rectified clipped DNN). Let L, d ∈ N, u ∈ [−∞, ∞), v ∈ (u, ∞], l = (l₀, l₁, . . . , l_L) ∈ N^{L+1}, θ ∈ R^d satisfy
\[
d \ge \sum_{k=1}^{L} l_k (l_{k-1} + 1). \tag{13}
\]
Then we denote by N^{θ,l}_{u,v} : R^{l₀} → R^{l_L} the function which satisfies for all x ∈ R^{l₀} that
\[
\mathcal{N}^{\theta,l}_{u,v}(x) =
\begin{cases}
\big( \mathcal{N}^{\theta,0,l_0}_{C_{u,v,l_L}} \big)(x) & : L = 1 \\
\big( \mathcal{N}^{\theta,0,l_0}_{R_{l_1},R_{l_2},\ldots,R_{l_{L-1}},C_{u,v,l_L}} \big)(x) & : L > 1
\end{cases} \tag{14}
\]
(cf. Definitions 2.7, 2.5, and 2.2).

2.2 Structured description of DNNs


2.2.1 Structured description of DNNs
Definition 2.9. We denote by N the set given by
\[
\mathbf{N} = \bigcup_{L \in \mathbb{N}} \bigcup_{(l_0,l_1,\ldots,l_L) \in \mathbb{N}^{L+1}} \Big( \mathop{\times}_{k=1}^{L} \big( \mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k} \big) \Big) \tag{15}
\]
and we denote by P, L, I, O : N → N, H : N → N₀, and D : N → ∪_{L=2}^{∞} N^L the functions which satisfy for all L ∈ N, l₀, l₁, . . . , l_L ∈ N, Φ ∈ (×_{k=1}^{L} (R^{l_k×l_{k−1}} × R^{l_k})) that P(Φ) = Σ_{k=1}^{L} l_k(l_{k−1} + 1), L(Φ) = L, I(Φ) = l₀, O(Φ) = l_L, H(Φ) = L − 1, and D(Φ) = (l₀, l₁, . . . , l_L).

2.2.2 Realizations of DNNs


Definition 2.10 (Realization associated to a DNN). Let a ∈ C(R, R). Then we denote by R_a : N → ∪_{k,l∈N} C(R^k, R^l) the function which satisfies for all L ∈ N, l₀, l₁, . . . , l_L ∈ N, Φ = ((W₁, B₁), (W₂, B₂), . . . , (W_L, B_L)) ∈ (×_{k=1}^{L} (R^{l_k×l_{k−1}} × R^{l_k})), x₀ ∈ R^{l₀}, x₁ ∈ R^{l₁}, . . . , x_{L−1} ∈ R^{l_{L−1}} with ∀ k ∈ N ∩ (0, L) : x_k = M_{a,l_k}(W_k x_{k−1} + B_k) that
\[
\mathcal{R}_a(\Phi) \in C(\mathbb{R}^{l_0}, \mathbb{R}^{l_L}) \qquad \text{and} \qquad (\mathcal{R}_a(\Phi))(x_0) = W_L x_{L-1} + B_L \tag{16}
\]
(cf. Definitions 2.9 and 2.3).
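
For comparison with the vectorized description, here is a sketch of the realization map R_a of Definition 2.10, with a structured network represented as a Python list of (W, B) pairs (a representation we assume for illustration, not notation from this article):

```python
import numpy as np

def realization(phi, a, x):
    """R_a(Phi)(x) from Definition 2.10 for phi = [(W_1, B_1), ..., (W_L, B_L)]:
    the activation a (a vectorized, componentwise function such as a ReLU)
    is applied after every affine step except the last one, as in (16)."""
    for W, B in phi[:-1]:
        x = a(W @ x + B)
    W, B = phi[-1]
    return W @ x + B
```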

2.2.3 On the connection to the vectorized description of DNNs


Definition 2.11. We denote by T : N → ∪_{d∈N} R^d the function which satisfies for all L, d ∈ N, l₀, l₁, . . . , l_L ∈ N, Φ = ((W₁, B₁), (W₂, B₂), . . . , (W_L, B_L)) ∈ (×_{m=1}^{L} (R^{l_m×l_{m−1}} × R^{l_m})), θ = (θ₁, θ₂, . . . , θ_d) ∈ R^d, k ∈ {1, 2, . . . , L} with T(Φ) = θ that d = P(Φ),
\[
B_k =
\begin{pmatrix}
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_k l_{k-1} + 1} \\
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_k l_{k-1} + 2} \\
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_k l_{k-1} + 3} \\
\vdots \\
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_k l_{k-1} + l_k}
\end{pmatrix}, \qquad \text{and}
\]
\[
W_k =
\begin{pmatrix}
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + 1} & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + 2} & \cdots & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_{k-1}} \\
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_{k-1} + 1} & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_{k-1} + 2} & \cdots & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + 2l_{k-1}} \\
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + 2l_{k-1} + 1} & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + 2l_{k-1} + 2} & \cdots & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + 3l_{k-1}} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + (l_k-1)l_{k-1} + 1} & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + (l_k-1)l_{k-1} + 2} & \cdots & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_k l_{k-1}}
\end{pmatrix} \tag{17}
\]
(cf. Definition 2.9).


Lemma 2.12. Let a, b ∈ N, W = (Wi,j )(i,j)∈{1,2,...,a}×{1,2,...,b} ∈ Ra×b , B = (Bi )i∈{1,2,...,a} ∈ Ra . Then
 
T ((W, B)) = W1,1 , W1,2 , . . . , W1,b , W2,1 , W2,2 , . . . , W2,b , . . . , Wa,1 , Wa,2 , . . . , Wa,b , B1 , B2 , . . . , Ba (18)
(cf. Definition 2.11).
Proof of Lemma 2.12. Observe that (17) establishes (18). The proof of Lemma 2.12 is thus completed.
Lemma 2.13. Let L ∈ N, l0 , l1 , . . . , lL ∈ N, let Wk = (Wk,i,j )(i,j)∈{1,2,...,lk }×{1,2,...,lk−1 } ∈ Rlk ×lk−1 , k ∈
{1, 2, . . . , L}, and let Bk = (Bk,i )i∈{1,2,...,lk } ∈ Rlk , k ∈ {1, 2, . . . , L}. Then
(i) it holds for all k ∈ {1, 2, . . . , L} that

T ((Wk , Bk )) = Wk,1,1, Wk,1,2, . . . , Wk,1,lk−1 , Wk,2,1, Wk,2,2, . . . , Wk,2,lk−1 , . . . ,

Wk,lk ,1 , Wk,lk ,2 , . . . , Wk,lk ,lk−1 , Bk,1, Bk,2, . . . , Bk,lk (19)
and
(ii) it holds that
 
T (W1 , B1 ), (W2 , B2 ), . . . , (WL , BL )

= W1,1,1 , W1,1,2 , . . . , W1,1,l0 , . . . , W1,l1 ,1 , W1,l1 ,2 , . . . , W1,l1 ,l0 , B1,1 , B1,2 , . . . , B1,l1 ,
W2,1,1 , W2,1,2 , . . . , W2,1,l1 , . . . , W2,l2 ,1 , W2,l2 ,2 , . . . , W2,l2 ,l1 , B2,1 , B2,2 , . . . , B2,l2 , (20)
...,

WL,1,1 , WL,1,2 , . . . , WL,1,lL−1 , . . . WL,lL ,1 , WL,lL ,2 , . . . , WL,lL,lL−1 , BL,1 , BL,2 , . . . , BL,lL

(cf. Definition 2.11).


Proof of Lemma 2.13. Note that Lemma 2.12 proves item (i). Moreover, observe that (17) establishes
item (ii). The proof of Lemma 2.13 is thus completed.
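
The ordering spelled out in Lemma 2.13 is exactly row-major (C-order) flattening of each weight matrix followed by its bias vector, layer by layer, which the following NumPy one-liner (our own illustration) reproduces:

```python
import numpy as np

def flatten(phi):
    """T(Phi) from Definition 2.11, in the layout of Lemma 2.13: for each
    layer, the weight matrix in row-major order followed by the bias."""
    return np.concatenate([np.concatenate([W.reshape(-1), B]) for W, B in phi])
```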
Lemma 2.14. Let a ∈ C(R, R), Φ ∈ N, L ∈ N, l₀, l₁, . . . , l_L ∈ N satisfy D(Φ) = (l₀, l₁, . . . , l_L) (cf. Definition 2.9). Then it holds for all x ∈ R^{l₀} that
\[
(\mathcal{R}_a(\Phi))(x) =
\begin{cases}
\big( \mathcal{N}^{\mathcal{T}(\Phi),0,l_0}_{\mathrm{id}_{\mathbb{R}^{l_L}}} \big)(x) & : L = 1 \\
\big( \mathcal{N}^{\mathcal{T}(\Phi),0,l_0}_{M_{a,l_1},M_{a,l_2},\ldots,M_{a,l_{L-1}},\mathrm{id}_{\mathbb{R}^{l_L}}} \big)(x) & : L > 1
\end{cases} \tag{21}
\]
(cf. Definitions 2.10, 2.11, 2.3, and 2.2).

Proof of Lemma 2.14. Throughout this proof let W₁ ∈ R^{l₁×l₀}, B₁ ∈ R^{l₁}, W₂ ∈ R^{l₂×l₁}, B₂ ∈ R^{l₂}, . . . , W_L ∈ R^{l_L×l_{L−1}}, B_L ∈ R^{l_L} satisfy Φ = ((W₁, B₁), (W₂, B₂), . . . , (W_L, B_L)). Note that (17) shows that for all k ∈ {1, 2, . . . , L}, x ∈ R^{l_{k−1}} it holds that
\[
W_k x + B_k = \mathcal{A}^{\mathcal{T}(\Phi),\,\sum_{i=1}^{k-1} l_i(l_{i-1}+1)}_{l_k,l_{k-1}}(x) \tag{22}
\]
(cf. Definitions 2.11 and 2.1). This demonstrates that for all x₀ ∈ R^{l₀}, x₁ ∈ R^{l₁}, . . . , x_{L−1} ∈ R^{l_{L−1}} with ∀ k ∈ N ∩ (0, L) : x_k = M_{a,l_k}(W_k x_{k−1} + B_k) it holds that
\[
x_{L-1} =
\begin{cases}
x_0 & : L = 1 \\
\big( M_{a,l_{L-1}} \circ \mathcal{A}^{\mathcal{T}(\Phi),\,\sum_{i=1}^{L-2} l_i(l_{i-1}+1)}_{l_{L-1},l_{L-2}} \circ M_{a,l_{L-2}} \circ \mathcal{A}^{\mathcal{T}(\Phi),\,\sum_{i=1}^{L-3} l_i(l_{i-1}+1)}_{l_{L-2},l_{L-3}} \circ \ldots \circ M_{a,l_1} \circ \mathcal{A}^{\mathcal{T}(\Phi),0}_{l_1,l_0} \big)(x_0) & : L > 1
\end{cases} \tag{23}
\]
(cf. Definition 2.3). Combining this and (22) with (7) and (16) proves that for all x₀ ∈ R^{l₀}, x₁ ∈ R^{l₁}, . . . , x_{L−1} ∈ R^{l_{L−1}} with ∀ k ∈ N ∩ (0, L) : x_k = M_{a,l_k}(W_k x_{k−1} + B_k) it holds that
\[
(\mathcal{R}_a(\Phi))(x_0) = W_L x_{L-1} + B_L = \mathcal{A}^{\mathcal{T}(\Phi),\,\sum_{i=1}^{L-1} l_i(l_{i-1}+1)}_{l_L,l_{L-1}}(x_{L-1}) =
\begin{cases}
\big( \mathcal{N}^{\mathcal{T}(\Phi),0,l_0}_{\mathrm{id}_{\mathbb{R}^{l_L}}} \big)(x_0) & : L = 1 \\
\big( \mathcal{N}^{\mathcal{T}(\Phi),0,l_0}_{M_{a,l_1},M_{a,l_2},\ldots,M_{a,l_{L-1}},\mathrm{id}_{\mathbb{R}^{l_L}}} \big)(x_0) & : L > 1
\end{cases} \tag{24}
\]
(cf. Definitions 2.10 and 2.2). The proof of Lemma 2.14 is thus completed.

Corollary 2.15. Let Φ ∈ N (cf. Definition 2.9). Then it holds for all x ∈ R^{I(Φ)} that
\[
\mathcal{N}^{\mathcal{T}(\Phi),\mathcal{D}(\Phi)}_{-\infty,\infty}(x) = (\mathcal{R}_{\mathfrak{r}}(\Phi))(x) \tag{25}
\]
(cf. Definitions 2.11, 2.8, 2.4, and 2.10).

Proof of Corollary 2.15. Note that Lemma 2.14, (14), (10), and the fact that for all d ∈ N it holds that C_{−∞,∞,d} = id_{R^d} establish (25) (cf. Definition 2.7). The proof of Corollary 2.15 is thus completed.

2.2.4 Parallelizations of DNNs
Definition 2.16 (Parallelization of DNNs). Let n ∈ N. Then we denote by
\[
\mathbf{P}_n : \big\{ (\Phi_1, \Phi_2, \ldots, \Phi_n) \in \mathbf{N}^n : \mathcal{L}(\Phi_1) = \mathcal{L}(\Phi_2) = \ldots = \mathcal{L}(\Phi_n) \big\} \to \mathbf{N} \tag{26}
\]
the function which satisfies for all L ∈ N, (l_{1,0}, l_{1,1}, . . . , l_{1,L}), (l_{2,0}, l_{2,1}, . . . , l_{2,L}), . . . , (l_{n,0}, l_{n,1}, . . . , l_{n,L}) ∈ N^{L+1}, Φ₁ = ((W_{1,1}, B_{1,1}), (W_{1,2}, B_{1,2}), . . . , (W_{1,L}, B_{1,L})) ∈ (×_{k=1}^{L} (R^{l_{1,k}×l_{1,k−1}} × R^{l_{1,k}})), Φ₂ = ((W_{2,1}, B_{2,1}), (W_{2,2}, B_{2,2}), . . . , (W_{2,L}, B_{2,L})) ∈ (×_{k=1}^{L} (R^{l_{2,k}×l_{2,k−1}} × R^{l_{2,k}})), . . . , Φ_n = ((W_{n,1}, B_{n,1}), (W_{n,2}, B_{n,2}), . . . , (W_{n,L}, B_{n,L})) ∈ (×_{k=1}^{L} (R^{l_{n,k}×l_{n,k−1}} × R^{l_{n,k}})) that
\[
\mathbf{P}_n(\Phi_1, \Phi_2, \ldots, \Phi_n) = \Bigg( \bigg( \begin{pmatrix} W_{1,1} & 0 & 0 & \cdots & 0 \\ 0 & W_{2,1} & 0 & \cdots & 0 \\ 0 & 0 & W_{3,1} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & W_{n,1} \end{pmatrix}, \begin{pmatrix} B_{1,1} \\ B_{2,1} \\ B_{3,1} \\ \vdots \\ B_{n,1} \end{pmatrix} \bigg), \bigg( \begin{pmatrix} W_{1,2} & 0 & 0 & \cdots & 0 \\ 0 & W_{2,2} & 0 & \cdots & 0 \\ 0 & 0 & W_{3,2} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & W_{n,2} \end{pmatrix}, \begin{pmatrix} B_{1,2} \\ B_{2,2} \\ B_{3,2} \\ \vdots \\ B_{n,2} \end{pmatrix} \bigg), \ldots, \bigg( \begin{pmatrix} W_{1,L} & 0 & 0 & \cdots & 0 \\ 0 & W_{2,L} & 0 & \cdots & 0 \\ 0 & 0 & W_{3,L} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & W_{n,L} \end{pmatrix}, \begin{pmatrix} B_{1,L} \\ B_{2,L} \\ B_{3,L} \\ \vdots \\ B_{n,L} \end{pmatrix} \bigg) \Bigg) \tag{27}
\]
(cf. Definition 2.9).
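
In code, the parallelization of Definition 2.16 amounts to stacking the weight matrices block-diagonally and concatenating the biases, layer by layer; the following sketch (our illustration) assumes SciPy's block_diag and represents networks as lists of (W, B) pairs:

```python
import numpy as np
from scipy.linalg import block_diag

def parallelize(*phis):
    """P_n(Phi_1, ..., Phi_n) from Definition 2.16 for networks with the
    same number of layers L: in every layer the weights are stacked
    block-diagonally, as in (27), and the biases are concatenated."""
    L = len(phis[0])
    assert all(len(phi) == L for phi in phis)
    return [(block_diag(*(phi[k][0] for phi in phis)),
             np.concatenate([phi[k][1] for phi in phis]))
            for k in range(L)]
```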

2.2.5 Basic examples for DNNs


Definition 2.17 (Linear transformations as DNNs). Let m, n ∈ N, W ∈ R^{m×n}. Then we denote by N_W ∈ R^{m×n} × R^m the pair given by N_W = (W, 0).

Definition 2.18. We denote by I = (I_d)_{d∈N} : N → N the function which satisfies for all d ∈ N that
\[
\mathfrak{I}_1 = \left( \left( \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \end{pmatrix} \right), \big( \begin{pmatrix} 1 & -1 \end{pmatrix}, 0 \big) \right) \in (\mathbb{R}^{2 \times 1} \times \mathbb{R}^{2}) \times (\mathbb{R}^{1 \times 2} \times \mathbb{R}^{1}) \tag{28}
\]
and
\[
\mathfrak{I}_d = \mathbf{P}_d(\mathfrak{I}_1, \mathfrak{I}_1, \ldots, \mathfrak{I}_1) \tag{29}
\]
(cf. Definitions 2.9 and 2.16).
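
The network I₁ realizes the identity on R under the ReLU activation because r(x) − r(−x) = x; the following small check (our own illustration) verifies this numerically:

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

# I_1 from Definition 2.18: the first layer maps x to (x, -x), the second
# layer computes r(x) - r(-x) = x, so the ReLU realization is the identity.
W1, B1 = np.array([[1.0], [-1.0]]), np.zeros(2)
W2, B2 = np.array([[1.0, -1.0]]), np.zeros(1)
for x in (-2.0, 0.0, 3.5):
    out = W2 @ relu(W1 @ np.array([x]) + B1) + B2
    assert np.isclose(out.item(), x)
```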

2.2.6 Compositions of DNNs
Definition 2.19 (Composition of DNNs). We denote by (·) • (·) : {(Φ₁, Φ₂) ∈ N × N : I(Φ₁) = O(Φ₂)} → N the function which satisfies for all L, L̄ ∈ N, l₀, l₁, . . . , l_L, l̄₀, l̄₁, . . . , l̄_{L̄} ∈ N, Φ₁ = ((W₁, B₁), (W₂, B₂), . . . , (W_L, B_L)) ∈ (×_{k=1}^{L} (R^{l_k×l_{k−1}} × R^{l_k})), Φ₂ = ((W̄₁, B̄₁), (W̄₂, B̄₂), . . . , (W̄_{L̄}, B̄_{L̄})) ∈ (×_{k=1}^{L̄} (R^{l̄_k×l̄_{k−1}} × R^{l̄_k})) with l₀ = I(Φ₁) = O(Φ₂) = l̄_{L̄} that
\[
\Phi_1 \bullet \Phi_2 =
\begin{cases}
\big( (\bar{W}_1, \bar{B}_1), (\bar{W}_2, \bar{B}_2), \ldots, (\bar{W}_{\bar{L}-1}, \bar{B}_{\bar{L}-1}), (W_1 \bar{W}_{\bar{L}}, W_1 \bar{B}_{\bar{L}} + B_1), (W_2, B_2), (W_3, B_3), \ldots, (W_L, B_L) \big) & : L > 1 < \bar{L} \\
\big( (W_1 \bar{W}_1, W_1 \bar{B}_1 + B_1), (W_2, B_2), (W_3, B_3), \ldots, (W_L, B_L) \big) & : L > 1 = \bar{L} \\
\big( (\bar{W}_1, \bar{B}_1), (\bar{W}_2, \bar{B}_2), \ldots, (\bar{W}_{\bar{L}-1}, \bar{B}_{\bar{L}-1}), (W_1 \bar{W}_{\bar{L}}, W_1 \bar{B}_{\bar{L}} + B_1) \big) & : L = 1 < \bar{L} \\
\big( (W_1 \bar{W}_1, W_1 \bar{B}_1 + B_1) \big) & : L = 1 = \bar{L}
\end{cases} \tag{30}
\]
(cf. Definition 2.9).
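
The four cases of (30) can be expressed uniformly in code: take all but the last layer of Φ₂, merge Φ₂'s last affine map into Φ₁'s first one, and append the remaining layers of Φ₁. A sketch with networks as lists of (W, B) pairs (our representation, not the article's):

```python
import numpy as np

def compose(phi1, phi2):
    """Phi_1 . Phi_2 from Definition 2.19, assuming I(phi1) = O(phi2)."""
    W1, B1 = phi1[0]
    Wb, Bb = phi2[-1]
    # merged middle layer (W_1 W-bar_{L-bar}, W_1 B-bar_{L-bar} + B_1)
    return phi2[:-1] + [(W1 @ Wb, W1 @ Bb + B1)] + phi1[1:]
```

Python's empty-slice semantics make phi2[:-1] and phi1[1:] vanish when the corresponding network has a single layer, which is why this one expression covers all four cases of (30).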


Definition 2.20 (Maximum norm). We denote by |||·||| : ∪_{d∈N} R^d → [0, ∞) the function which satisfies for all d ∈ N, θ = (θ₁, θ₂, . . . , θ_d) ∈ R^d that
\[
|||\theta||| = \max_{i \in \{1,2,\ldots,d\}} |\theta_i|. \tag{31}
\]

Lemma 2.21. Let L, L̄ ∈ N, l₀, l₁, . . . , l_L, l̄₀, l̄₁, . . . , l̄_{L̄} ∈ N, Φ₁ = ((W₁, B₁), (W₂, B₂), . . . , (W_L, B_L)) ∈ (×_{k=1}^{L} (R^{l_k×l_{k−1}} × R^{l_k})), Φ₂ = ((W̄₁, B̄₁), (W̄₂, B̄₂), . . . , (W̄_{L̄}, B̄_{L̄})) ∈ (×_{k=1}^{L̄} (R^{l̄_k×l̄_{k−1}} × R^{l̄_k})) satisfy l̄_{L̄} = l₀. Then
\[
|||\mathcal{T}(\Phi_1 \bullet \Phi_2)||| \le \max\big\{ |||\mathcal{T}(\Phi_1)|||,\; |||\mathcal{T}(\Phi_2)|||,\; |||\mathcal{T}((W_1 \bar{W}_{\bar{L}}, W_1 \bar{B}_{\bar{L}} + B_1))||| \big\} \tag{32}
\]
(cf. Definitions 2.19, 2.11, and 2.20).

Proof of Lemma 2.21. Note that (30) and Lemma 2.13 establish (32). The proof of Lemma 2.21 is thus
completed.

2.2.7 Powers and extensions of DNNs


Definition 2.22. Let d ∈ N. Then we denote by Id ∈ Rd×d the identity matrix in Rd×d .

Definition 2.23. We denote by (·)^{•n} : {Φ ∈ N : I(Φ) = O(Φ)} → N, n ∈ N₀, the functions which satisfy for all n ∈ N₀, Φ ∈ N with I(Φ) = O(Φ) that
\[
\Phi^{\bullet n} =
\begin{cases}
\big( \mathrm{I}_{\mathcal{O}(\Phi)}, (0, 0, \ldots, 0) \big) \in \mathbb{R}^{\mathcal{O}(\Phi) \times \mathcal{O}(\Phi)} \times \mathbb{R}^{\mathcal{O}(\Phi)} & : n = 0 \\
\Phi \bullet (\Phi^{\bullet(n-1)}) & : n \in \mathbb{N}
\end{cases} \tag{33}
\]
(cf. Definitions 2.9, 2.22, and 2.19).

Definition 2.24 (Extension of DNNs). Let L ∈ N, Ψ ∈ N satisfy I(Ψ) = O(Ψ). Then we denote by E_{L,Ψ} : {Φ ∈ N : (L(Φ) ≤ L and O(Φ) = I(Ψ))} → N the function which satisfies for all Φ ∈ N with L(Φ) ≤ L and O(Φ) = I(Ψ) that
\[
\mathcal{E}_{L,\Psi}(\Phi) = \big( \Psi^{\bullet(L-\mathcal{L}(\Phi))} \big) \bullet \Phi \tag{34}
\]
(cf. Definitions 2.9, 2.23, and 2.19).

Lemma 2.25. Let d, i, L, L̄ ∈ N, l₀, l₁, . . . , l_{L−1} ∈ N, Φ, Ψ ∈ N satisfy L̄ ≥ L, D(Φ) = (l₀, l₁, . . . , l_{L−1}, d), and D(Ψ) = (d, i, d) (cf. Definition 2.9). Then it holds that D(E_{L̄,Ψ}(Φ)) ∈ N^{L̄+1} and
\[
\mathcal{D}(\mathcal{E}_{\bar{L},\Psi}(\Phi)) =
\begin{cases}
(l_0, l_1, \ldots, l_{L-1}, d) & : \bar{L} = L \\
(l_0, l_1, \ldots, l_{L-1}, i, i, \ldots, i, d) & : \bar{L} > L
\end{cases} \tag{35}
\]
(cf. Definition 2.24).

Proof of Lemma 2.25. Observe that item (i) in [31, Lemma 2.13] ensures that L(Ψ^{•(L̄−L)}) = L̄ − L + 1, D(Ψ^{•(L̄−L)}) ∈ N^{L̄−L+2}, and
\[
\mathcal{D}(\Psi^{\bullet(\bar{L}-L)}) =
\begin{cases}
(d, d) & : \bar{L} = L \\
(d, i, i, \ldots, i, d) & : \bar{L} > L
\end{cases} \tag{36}
\]
(cf. Definition 2.23). Combining this with [31, Proposition 2.6] shows that L((Ψ^{•(L̄−L)}) • Φ) = L(Ψ^{•(L̄−L)}) + L(Φ) − 1 = L̄, D((Ψ^{•(L̄−L)}) • Φ) ∈ N^{L̄+1}, and
\[
\mathcal{D}((\Psi^{\bullet(\bar{L}-L)}) \bullet \Phi) =
\begin{cases}
(l_0, l_1, \ldots, l_{L-1}, d) & : \bar{L} = L \\
(l_0, l_1, \ldots, l_{L-1}, i, i, \ldots, i, d) & : \bar{L} > L.
\end{cases} \tag{37}
\]
This and (34) establish (35). The proof of Lemma 2.25 is thus completed.

Lemma 2.26. Let d, L ∈ N, Φ ∈ N satisfy L ≥ L(Φ) and d = O(Φ) (cf. Definition 2.9). Then
\[
|||\mathcal{T}(\mathcal{E}_{L,\mathfrak{I}_d}(\Phi))||| \le \max\{1, |||\mathcal{T}(\Phi)|||\} \tag{38}
\]
(cf. Definitions 2.18, 2.24, 2.11, and 2.20).

Proof of Lemma 2.26. Throughout this proof assume w.l.o.g. that L > L(Φ) and let l₀, l₁, . . . , l_{L−L(Φ)+1} ∈ N satisfy (l₀, l₁, . . . , l_{L−L(Φ)+1}) = (d, 2d, 2d, . . . , 2d, d). Note that [32, Lemma 3.16] ensures that D(I_d) = (d, 2d, d) ∈ N³ (cf. Definition 2.18). Item (i) in [31, Lemma 2.13] hence establishes that
\[
\mathcal{L}((\mathfrak{I}_d)^{\bullet(L-\mathcal{L}(\Phi))}) = L - \mathcal{L}(\Phi) + 1 \qquad \text{and} \qquad \mathcal{D}((\mathfrak{I}_d)^{\bullet(L-\mathcal{L}(\Phi))}) = (l_0, l_1, \ldots, l_{L-\mathcal{L}(\Phi)+1}) \in \mathbb{N}^{L-\mathcal{L}(\Phi)+2} \tag{39}
\]
(cf. Definition 2.23). This shows that there exist W_k ∈ R^{l_k×l_{k−1}}, k ∈ {1, 2, . . . , L−L(Φ)+1}, and B_k ∈ R^{l_k}, k ∈ {1, 2, . . . , L−L(Φ)+1}, which satisfy
\[
(\mathfrak{I}_d)^{\bullet(L-\mathcal{L}(\Phi))} = ((W_1, B_1), (W_2, B_2), \ldots, (W_{L-\mathcal{L}(\Phi)+1}, B_{L-\mathcal{L}(\Phi)+1})). \tag{40}
\]
Next observe that (27), (28), (29), (30), and (33) demonstrate that
\[
W_1 = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
-1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
0 & -1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 \\
0 & 0 & \cdots & -1
\end{pmatrix} \in \mathbb{R}^{(2d) \times d}
\qquad \text{and} \qquad
W_{L-\mathcal{L}(\Phi)+1} = \begin{pmatrix}
1 & -1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & 1 & -1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 & -1
\end{pmatrix} \in \mathbb{R}^{d \times (2d)}. \tag{41}
\]
Moreover, note that (27), (28), (29), (30), and (33) prove that for all k ∈ N ∩ (1, L − L(Φ) + 1) it holds that
\[
W_k = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
-1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
0 & -1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 \\
0 & 0 & \cdots & -1
\end{pmatrix}
\begin{pmatrix}
1 & -1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & 1 & -1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 & -1
\end{pmatrix}
= \begin{pmatrix}
1 & -1 & 0 & 0 & \cdots & 0 & 0 \\
-1 & 1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & 1 & -1 & \cdots & 0 & 0 \\
0 & 0 & -1 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 & -1 \\
0 & 0 & 0 & 0 & \cdots & -1 & 1
\end{pmatrix} \in \mathbb{R}^{(2d) \times (2d)}, \tag{42}
\]
where the first factor belongs to R^{(2d)×d} and the second factor belongs to R^{d×(2d)}. In addition, observe that (28), (29), (27), (33), and (30) show that for all k ∈ N ∩ [1, L − L(Φ)] it holds that
\[
B_k = 0 \in \mathbb{R}^{2d} \qquad \text{and} \qquad B_{L-\mathcal{L}(\Phi)+1} = 0 \in \mathbb{R}^{d}. \tag{43}
\]
Combining this, (41), and (42) establishes that
\[
|||\mathcal{T}((\mathfrak{I}_d)^{\bullet(L-\mathcal{L}(\Phi))})||| = 1 \tag{44}
\]
(cf. Definitions 2.11 and 2.20).

(cf. Definitions 2.11 and 2.20). Furthermore, note that (41) demonstrates that for all k ∈ N, W =
(wi,j )(i,j)∈{1,2,...,d}×{1,2,...,k} ∈ Rd×k it holds that
 
w1,1 w1,2 · · · w1,k
−w1,1 −w1,2 · · · −w1,k 
 
 w2,1
 w2,2 · · · w2,k 
−w2,1 −w2,2 · · · −w2,k 
W1 W =   ∈ R(2d)×k . (45)
 .. .. .. .. 
 . . . . 
 
 wd,1 wd,2 · · · wd,k 
−wd,1 −wd,2 · · · −wd,k

Next observe that (41) and (43) show that for all B = (b1 , b2 , . . . , bd ) ∈ Rd it holds that
   
1 0 ··· 0 b1
−1 0 ··· 0    −b1 
  b1  
0 1 ··· 0   
 b
 2 


 0 −1 b 2
W1 B + B1 =  ··· 0    ..  = −b2  ∈ R2d .
   
(46)
 .. .. .. .  .  . 
. ..   .. 
 
 . .
  b  
0 0 ··· 1  d  bd 
0 0 · · · −1 −bd

Combining this with (45) proves that for all k ∈ N, W ∈ Rd×k , B ∈ Rd it holds that
 
T ((W1 W, W1 B + B1 )) = T ((W, B)) . (47)

12
This, Lemma 2.21, and (44) establish that
\[
|||\mathcal{T}(\mathcal{E}_{L,\mathfrak{I}_d}(\Phi))||| = |||\mathcal{T}\big( ((\mathfrak{I}_d)^{\bullet(L-\mathcal{L}(\Phi))}) \bullet \Phi \big)||| \le \max\big\{ |||\mathcal{T}((\mathfrak{I}_d)^{\bullet(L-\mathcal{L}(\Phi))})|||,\; |||\mathcal{T}(\Phi)||| \big\} = \max\{1, |||\mathcal{T}(\Phi)|||\} \tag{48}
\]
(cf. Definition 2.24). The proof of Lemma 2.26 is thus completed.

2.2.8 Embedding DNNs in larger architectures


Lemma 2.27. Let a ∈ C(R, R), L ∈ N, l₀, l₁, . . . , l_L, l̄₀, l̄₁, . . . , l̄_L ∈ N satisfy for all k ∈ {1, 2, . . . , L} that l̄₀ = l₀, l̄_L = l_L, and l̄_k ≥ l_k, for every k ∈ {1, 2, . . . , L} let W_k = (W_{k,i,j})_{(i,j)∈{1,...,l_k}×{1,...,l_{k−1}}} ∈ R^{l_k×l_{k−1}}, W̄_k = (W̄_{k,i,j})_{(i,j)∈{1,...,l̄_k}×{1,...,l̄_{k−1}}} ∈ R^{l̄_k×l̄_{k−1}}, B_k = (B_{k,i})_{i∈{1,...,l_k}} ∈ R^{l_k}, B̄_k = (B̄_{k,i})_{i∈{1,...,l̄_k}} ∈ R^{l̄_k}, assume for all k ∈ {1, 2, . . . , L}, i ∈ {1, 2, . . . , l_k}, j ∈ N ∩ (0, l_{k−1}] that W̄_{k,i,j} = W_{k,i,j} and B̄_{k,i} = B_{k,i}, and assume for all k ∈ {1, 2, . . . , L}, i ∈ {1, 2, . . . , l_k}, j ∈ N ∩ (l_{k−1}, l̄_{k−1} + 1) that W̄_{k,i,j} = 0. Then
\[
\mathcal{R}_a\big( ((\bar{W}_1, \bar{B}_1), (\bar{W}_2, \bar{B}_2), \ldots, (\bar{W}_L, \bar{B}_L)) \big) = \mathcal{R}_a\big( ((W_1, B_1), (W_2, B_2), \ldots, (W_L, B_L)) \big) \tag{49}
\]
(cf. Definition 2.10).


Proof of Lemma 2.27. Throughout this proof let π_k : R^{l̄_k} → R^{l_k}, k ∈ {0, 1, . . . , L}, satisfy for all k ∈ {0, 1, . . . , L}, x = (x₁, x₂, . . . , x_{l̄_k}) ∈ R^{l̄_k} that
\[
\pi_k(x) = (x_1, x_2, \ldots, x_{l_k}). \tag{50}
\]
Observe that the hypothesis that l̄₀ = l₀ and l̄_L = l_L shows that
\[
\mathcal{R}_a\big( ((\bar{W}_1, \bar{B}_1), (\bar{W}_2, \bar{B}_2), \ldots, (\bar{W}_L, \bar{B}_L)) \big) \in C(\mathbb{R}^{l_0}, \mathbb{R}^{l_L}) \tag{51}
\]
(cf. Definition 2.10). Furthermore, note that the hypothesis that for all k ∈ {1, 2, . . . , L}, i ∈ {1, 2, . . . , l_k}, j ∈ N ∩ (l_{k−1}, l̄_{k−1} + 1) it holds that W̄_{k,i,j} = 0 ensures that for all k ∈ {1, 2, . . . , L}, x = (x₁, x₂, . . . , x_{l̄_{k−1}}) ∈ R^{l̄_{k−1}} it holds that
\[
\pi_k(\bar{W}_k x + \bar{B}_k) = \Big( \Big[ \textstyle\sum_{i=1}^{\bar{l}_{k-1}} \bar{W}_{k,1,i} x_i \Big] + \bar{B}_{k,1},\; \ldots,\; \Big[ \textstyle\sum_{i=1}^{\bar{l}_{k-1}} \bar{W}_{k,l_k,i} x_i \Big] + \bar{B}_{k,l_k} \Big) = \Big( \Big[ \textstyle\sum_{i=1}^{l_{k-1}} \bar{W}_{k,1,i} x_i \Big] + \bar{B}_{k,1},\; \ldots,\; \Big[ \textstyle\sum_{i=1}^{l_{k-1}} \bar{W}_{k,l_k,i} x_i \Big] + \bar{B}_{k,l_k} \Big). \tag{52}
\]
Combining this with the hypothesis that for all k ∈ {1, 2, . . . , L}, i ∈ {1, 2, . . . , l_k}, j ∈ N ∩ (0, l_{k−1}] it holds that W̄_{k,i,j} = W_{k,i,j} and B̄_{k,i} = B_{k,i} shows that for all k ∈ {1, 2, . . . , L}, x = (x₁, x₂, . . . , x_{l̄_{k−1}}) ∈ R^{l̄_{k−1}} it holds that
\[
\pi_k(\bar{W}_k x + \bar{B}_k) = \Big( \Big[ \textstyle\sum_{i=1}^{l_{k-1}} W_{k,1,i} x_i \Big] + B_{k,1},\; \ldots,\; \Big[ \textstyle\sum_{i=1}^{l_{k-1}} W_{k,l_k,i} x_i \Big] + B_{k,l_k} \Big) = W_k(\pi_{k-1}(x)) + B_k. \tag{53}
\]
Moreover, observe that (50) and (8) ensure that for all k ∈ {0, 1, . . . , L}, x = (x₁, x₂, . . . , x_{l̄_k}) ∈ R^{l̄_k} it holds that
\[
\pi_k(M_{a,\bar{l}_k}(x)) = \pi_k(a(x_1), a(x_2), \ldots, a(x_{\bar{l}_k})) = (a(x_1), a(x_2), \ldots, a(x_{l_k})) = M_{a,l_k}(\pi_k(x)). \tag{54}
\]
Combining this and (53) demonstrates that for all x₀ ∈ R^{l̄₀}, x₁ ∈ R^{l̄₁}, . . . , x_{L−1} ∈ R^{l̄_{L−1}}, k ∈ N ∩ (0, L) with ∀ m ∈ N ∩ (0, L) : x_m = M_{a,l̄_m}(W̄_m x_{m−1} + B̄_m) it holds that
\[
\pi_k(x_k) = \pi_k(M_{a,\bar{l}_k}(\bar{W}_k x_{k-1} + \bar{B}_k)) = M_{a,l_k}(\pi_k(\bar{W}_k x_{k-1} + \bar{B}_k)) = M_{a,l_k}(W_k \pi_{k-1}(x_{k-1}) + B_k) \tag{55}
\]
(cf. Definition 2.3). The hypothesis that l̄₀ = l₀ and l̄_L = l_L and (53) therefore prove that for all x₀ ∈ R^{l̄₀}, x₁ ∈ R^{l̄₁}, . . . , x_{L−1} ∈ R^{l̄_{L−1}} with ∀ k ∈ N ∩ (0, L) : x_k = M_{a,l̄_k}(W̄_k x_{k−1} + B̄_k) it holds that
\[
\big( \mathcal{R}_a(((W_1, B_1), \ldots, (W_L, B_L))) \big)(x_0) = \big( \mathcal{R}_a(((W_1, B_1), \ldots, (W_L, B_L))) \big)(\pi_0(x_0)) = W_L \pi_{L-1}(x_{L-1}) + B_L = \pi_L(\bar{W}_L x_{L-1} + \bar{B}_L) = \bar{W}_L x_{L-1} + \bar{B}_L = \big( \mathcal{R}_a(((\bar{W}_1, \bar{B}_1), \ldots, (\bar{W}_L, \bar{B}_L))) \big)(x_0) \tag{56}
\]
(cf. Definition 2.10). The proof of Lemma 2.27 is thus completed.


Lemma 2.28. Let a ∈ C(R, R), L ∈ N, l₀, l₁, . . . , l_L, l̄₀, l̄₁, . . . , l̄_L ∈ N satisfy for all k ∈ {1, 2, . . . , L} that l̄₀ = l₀, l̄_L = l_L, and l̄_k ≥ l_k and let Φ ∈ N satisfy D(Φ) = (l₀, l₁, . . . , l_L) (cf. Definition 2.9). Then there exists Ψ ∈ N such that
\[
\mathcal{D}(\Psi) = (\bar{l}_0, \bar{l}_1, \ldots, \bar{l}_L), \qquad |||\mathcal{T}(\Psi)||| = |||\mathcal{T}(\Phi)|||, \qquad \text{and} \qquad \mathcal{R}_a(\Psi) = \mathcal{R}_a(\Phi) \tag{57}
\]
(cf. Definitions 2.11, 2.20, and 2.10).

Proof of Lemma 2.28. Throughout this proof let B_k = (B_{k,i})_{i∈{1,...,l_k}} ∈ R^{l_k}, k ∈ {1, 2, . . . , L}, and W_k = (W_{k,i,j})_{(i,j)∈{1,...,l_k}×{1,...,l_{k−1}}} ∈ R^{l_k×l_{k−1}}, k ∈ {1, 2, . . . , L}, satisfy Φ = ((W₁, B₁), (W₂, B₂), . . . , (W_L, B_L)) and let W̄_k = (W̄_{k,i,j})_{(i,j)∈{1,...,l̄_k}×{1,...,l̄_{k−1}}} ∈ R^{l̄_k×l̄_{k−1}}, k ∈ {1, 2, . . . , L}, and B̄_k = (B̄_{k,i})_{i∈{1,...,l̄_k}} ∈ R^{l̄_k}, k ∈ {1, 2, . . . , L}, satisfy for all k ∈ {1, 2, . . . , L}, i ∈ {1, 2, . . . , l̄_k}, j ∈ {1, 2, . . . , l̄_{k−1}} that
\[
\bar{W}_{k,i,j} = \begin{cases} W_{k,i,j} & : (i \le l_k) \wedge (j \le l_{k-1}) \\ 0 & : (i > l_k) \vee (j > l_{k-1}) \end{cases} \qquad \text{and} \qquad \bar{B}_{k,i} = \begin{cases} B_{k,i} & : i \le l_k \\ 0 & : i > l_k. \end{cases} \tag{58}
\]
Note that (15) ensures that ((W̄₁, B̄₁), (W̄₂, B̄₂), . . . , (W̄_L, B̄_L)) ∈ (×_{i=1}^{L} (R^{l̄_i×l̄_{i−1}} × R^{l̄_i})) ⊆ N and
\[
\mathcal{D}\big( ((\bar{W}_1, \bar{B}_1), (\bar{W}_2, \bar{B}_2), \ldots, (\bar{W}_L, \bar{B}_L)) \big) = (\bar{l}_0, \bar{l}_1, \ldots, \bar{l}_L). \tag{59}
\]
Furthermore, observe that Lemma 2.13 and (58) show that
\[
|||\mathcal{T}\big( ((\bar{W}_1, \bar{B}_1), (\bar{W}_2, \bar{B}_2), \ldots, (\bar{W}_L, \bar{B}_L)) \big)||| = |||\mathcal{T}(\Phi)||| \tag{60}
\]
(cf. Definitions 2.11 and 2.20). In addition, note that Lemma 2.27 establishes that
\[
\mathcal{R}_a(\Phi) = \mathcal{R}_a\big( ((W_1, B_1), (W_2, B_2), \ldots, (W_L, B_L)) \big) = \mathcal{R}_a\big( ((\bar{W}_1, \bar{B}_1), (\bar{W}_2, \bar{B}_2), \ldots, (\bar{W}_L, \bar{B}_L)) \big) \tag{61}
\]
(cf. Definition 2.10). The proof of Lemma 2.28 is thus completed.


Lemma 2.29. Let L, L̄ ∈ N, l₀, l₁, . . . , l_L, l̄₀, l̄₁, . . . , l̄_{L̄} ∈ N satisfy for all i ∈ N ∩ [0, L) that L̄ ≥ L, l̄₀ = l₀, l̄_{L̄} = l_L, and l̄_i ≥ l_i, assume for all i ∈ N ∩ (L − 1, L̄) that l̄_i ≥ 2l_L, and let Φ ∈ N satisfy D(Φ) = (l₀, l₁, . . . , l_L) (cf. Definition 2.9). Then there exists Ψ ∈ N such that
\[
\mathcal{D}(\Psi) = (\bar{l}_0, \bar{l}_1, \ldots, \bar{l}_{\bar{L}}), \qquad |||\mathcal{T}(\Psi)||| \le \max\{1, |||\mathcal{T}(\Phi)|||\}, \qquad \text{and} \qquad \mathcal{R}_{\mathfrak{r}}(\Psi) = \mathcal{R}_{\mathfrak{r}}(\Phi) \tag{62}
\]
(cf. Definitions 2.11, 2.20, 2.4, and 2.10).

Proof of Lemma 2.29. Throughout this proof let Ξ ∈ N satisfy Ξ = E_{L̄,I_{l_L}}(Φ) (cf. Definitions 2.18 and 2.24). Note that item (i) in [32, Lemma 3.16] demonstrates that D(I_{l_L}) = (l_L, 2l_L, l_L) ∈ N³. Combining this with Lemma 2.25 shows that D(Ξ) ∈ N^{L̄+1} and
\[
\mathcal{D}(\Xi) = \begin{cases} (l_0, l_1, \ldots, l_L) & : \bar{L} = L \\ (l_0, l_1, \ldots, l_{L-1}, 2l_L, 2l_L, \ldots, 2l_L, l_L) & : \bar{L} > L. \end{cases} \tag{63}
\]
Moreover, observe that Lemma 2.26 (with d ← l_L, L ← L̄, Φ ← Φ in the notation of Lemma 2.26) establishes that
\[
|||\mathcal{T}(\Xi)||| \le \max\{1, |||\mathcal{T}(\Phi)|||\} \tag{64}
\]
(cf. Definitions 2.11 and 2.20). In addition, note that item (iii) in [32, Lemma 3.16] ensures that for all x ∈ R^{l_L} it holds that
\[
(\mathcal{R}_{\mathfrak{r}}(\mathfrak{I}_{l_L}))(x) = x \tag{65}
\]
(cf. Definitions 2.4 and 2.10). This and item (ii) in [31, Lemma 2.14] prove that
\[
\mathcal{R}_{\mathfrak{r}}(\Xi) = \mathcal{R}_{\mathfrak{r}}(\Phi). \tag{66}
\]
In the next step, we observe that (63), the hypothesis that for all i ∈ N ∩ [0, L) it holds that l̄₀ = l₀, l̄_{L̄} = l_L, and l_i ≤ l̄_i, the hypothesis that for all i ∈ N ∩ (L − 1, L̄) it holds that l̄_i ≥ 2l_L, and Lemma 2.28 (with a ← r, L ← L̄, (l₀, l₁, . . . , l_L) ← D(Ξ), (l̄₀, l̄₁, . . . , l̄_L) ← (l̄₀, l̄₁, . . . , l̄_{L̄}), Φ ← Ξ in the notation of Lemma 2.28) ensure that there exists Ψ ∈ N such that
\[
\mathcal{D}(\Psi) = (\bar{l}_0, \bar{l}_1, \ldots, \bar{l}_{\bar{L}}), \qquad |||\mathcal{T}(\Psi)||| = |||\mathcal{T}(\Xi)|||, \qquad \text{and} \qquad \mathcal{R}_{\mathfrak{r}}(\Psi) = \mathcal{R}_{\mathfrak{r}}(\Xi). \tag{67}
\]
Combining this with (64) and (66) establishes (62). The proof of Lemma 2.29 is thus completed.
Lemma 2.30. Let u ∈ [−∞, ∞), v ∈ (u, ∞], L, L̄, d, d̄ ∈ N, θ ∈ R^d, l₀, l₁, . . . , l_L, l̄₀, l̄₁, . . . , l̄_{L̄} ∈ N satisfy for all i ∈ N ∩ [0, L) that d ≥ Σ_{k=1}^{L} l_k(l_{k−1} + 1), d̄ ≥ Σ_{k=1}^{L̄} l̄_k(l̄_{k−1} + 1), L̄ ≥ L, l̄₀ = l₀, l̄_{L̄} = l_L, and l̄_i ≥ l_i and assume for all i ∈ N ∩ (L − 1, L̄) that l̄_i ≥ 2l_L. Then there exists ϑ ∈ R^{d̄} such that
\[
|||\vartheta||| \le \max\{1, |||\theta|||\} \qquad \text{and} \qquad \mathcal{N}^{\vartheta,(\bar{l}_0,\bar{l}_1,\ldots,\bar{l}_{\bar{L}})}_{u,v} = \mathcal{N}^{\theta,(l_0,l_1,\ldots,l_L)}_{u,v} \tag{68}
\]
(cf. Definitions 2.20 and 2.8).

Proof of Lemma 2.30. Throughout this proof let η₁, η₂, . . . , η_d ∈ R satisfy
\[
\theta = (\eta_1, \eta_2, \ldots, \eta_d) \tag{69}
\]
and let Φ ∈ (×_{i=1}^{L} (R^{l_i×l_{i−1}} × R^{l_i})) satisfy
\[
\mathcal{T}(\Phi) = (\eta_1, \eta_2, \ldots, \eta_{\mathcal{P}(\Phi)}) \tag{70}
\]
(cf. Definition 2.11). Note that Lemma 2.29 establishes that there exists Ψ ∈ N which satisfies
\[
\mathcal{D}(\Psi) = (\bar{l}_0, \bar{l}_1, \ldots, \bar{l}_{\bar{L}}), \qquad |||\mathcal{T}(\Psi)||| \le \max\{1, |||\mathcal{T}(\Phi)|||\}, \qquad \text{and} \qquad \mathcal{R}_{\mathfrak{r}}(\Psi) = \mathcal{R}_{\mathfrak{r}}(\Phi) \tag{71}
\]
(cf. Definitions 2.9, 2.20, 2.4, and 2.10). Next let ϑ = (ϑ₁, ϑ₂, . . . , ϑ_{d̄}) ∈ R^{d̄} satisfy
\[
(\vartheta_1, \vartheta_2, \ldots, \vartheta_{\mathcal{P}(\Psi)}) = \mathcal{T}(\Psi) \qquad \text{and} \qquad \forall\, i \in \mathbb{N} \cap (\mathcal{P}(\Psi), \bar{d} + 1) : \vartheta_i = 0. \tag{72}
\]
Note that (69), (70), (71), and (72) show that
\[
|||\vartheta||| = |||\mathcal{T}(\Psi)||| \le \max\{1, |||\mathcal{T}(\Phi)|||\} \le \max\{1, |||\theta|||\}. \tag{73}
\]
Next observe that Corollary 2.15 and (70) establish that for all x ∈ R^{l₀} it holds that
\[
\big( \mathcal{N}^{\theta,(l_0,l_1,\ldots,l_L)}_{-\infty,\infty} \big)(x) = \big( \mathcal{N}^{\mathcal{T}(\Phi),\mathcal{D}(\Phi)}_{-\infty,\infty} \big)(x) = (\mathcal{R}_{\mathfrak{r}}(\Phi))(x). \tag{74}
\]
In addition, observe that Corollary 2.15, (71), and (72) prove that for all x ∈ R^{l₀} it holds that
\[
\big( \mathcal{N}^{\vartheta,(\bar{l}_0,\bar{l}_1,\ldots,\bar{l}_{\bar{L}})}_{-\infty,\infty} \big)(x) = \big( \mathcal{N}^{\mathcal{T}(\Psi),\mathcal{D}(\Psi)}_{-\infty,\infty} \big)(x) = (\mathcal{R}_{\mathfrak{r}}(\Psi))(x). \tag{75}
\]
Combining this and (74) with (71) and the hypothesis that l̄₀ = l₀ and l̄_{L̄} = l_L demonstrates that
\[
\mathcal{N}^{\theta,(l_0,l_1,\ldots,l_L)}_{-\infty,\infty} = \mathcal{N}^{\vartheta,(\bar{l}_0,\bar{l}_1,\ldots,\bar{l}_{\bar{L}})}_{-\infty,\infty}. \tag{76}
\]
Hence, we obtain that
\[
\mathcal{N}^{\theta,(l_0,l_1,\ldots,l_L)}_{u,v} = C_{u,v,l_L} \circ \mathcal{N}^{\theta,(l_0,l_1,\ldots,l_L)}_{-\infty,\infty} = C_{u,v,l_L} \circ \mathcal{N}^{\vartheta,(\bar{l}_0,\bar{l}_1,\ldots,\bar{l}_{\bar{L}})}_{-\infty,\infty} = \mathcal{N}^{\vartheta,(\bar{l}_0,\bar{l}_1,\ldots,\bar{l}_{\bar{L}})}_{u,v} \tag{77}
\]
(cf. Definition 2.7). This and (73) establish (68). The proof of Lemma 2.30 is thus completed.

2.3 Local Lipschitz continuity of the parametrization function
Lemma 2.31. Let a, x, y ∈ R. Then

|max{x, a} − max{y, a}| ≤ max{x, y} − min{x, y} = |x − y|. (78)

Proof of Lemma 2.31. Observe that
\[
\begin{aligned}
|\max\{x, a\} - \max\{y, a\}| &= |\max\{\max\{x, y\}, a\} - \max\{\min\{x, y\}, a\}| \\
&= \max\{\max\{x, y\}, a\} - \max\{\min\{x, y\}, a\} \\
&= \max\big\{ \max\{x, y\} - \max\{\min\{x, y\}, a\},\; a - \max\{\min\{x, y\}, a\} \big\} \\
&\le \max\big\{ \max\{x, y\} - \max\{\min\{x, y\}, a\},\; a - a \big\} \\
&= \max\big\{ \max\{x, y\} - \max\{\min\{x, y\}, a\},\; 0 \big\} \le \max\big\{ \max\{x, y\} - \min\{x, y\},\; 0 \big\} \\
&= \max\{x, y\} - \min\{x, y\} = |\max\{x, y\} - \min\{x, y\}| = |x - y|.
\end{aligned} \tag{79}
\]
The proof of Lemma 2.31 is thus completed.

Corollary 2.32. Let a, x, y ∈ R. Then

|min{x, a} − min{y, a}| ≤ max{x, y} − min{x, y} = |x − y|. (80)

Proof of Corollary 2.32. Note that Lemma 2.31 ensures that

|min{x, a} − min{y, a}| = |− (min{x, a} − min{y, a})| = |max{−x, −a} − max{−y, −a}|
(81)
≤ |(−x) − (−y)| = |x − y| .

The proof of Corollary 2.32 is thus completed.

Lemma 2.33. Let d ∈ N, u ∈ [−∞, ∞), v ∈ (u, ∞]. Then it holds for all x, y ∈ Rd that

|||Cu,v,d (x) − Cu,v,d (y)||| ≤ |||x − y||| (82)

(cf. Definitions 2.7 and 2.20).

Proof of Lemma 2.33. Note that Lemma 2.31, Corollary 2.32, and the fact that for all x ∈ R it holds that
max{−∞, x} = x = min{x, ∞} show that for all x, y ∈ R it holds that

|cu,v (x) − cu,v (y)| = |max{u, min{x, v}} − max{u, min{y, v}}| ≤ |min{x, v} − min{y, v}| ≤ |x − y| (83)

(cf. Definition 2.6). Hence, we obtain that for all x = (x₁, x₂, . . . , x_d), y = (y₁, y₂, . . . , y_d) ∈ R^d it holds that
\[
|||C_{u,v,d}(x) - C_{u,v,d}(y)||| = \max_{i \in \{1,2,\ldots,d\}} |\mathfrak{c}_{u,v}(x_i) - \mathfrak{c}_{u,v}(y_i)| \le \max_{i \in \{1,2,\ldots,d\}} |x_i - y_i| = |||x - y||| \tag{84}
\]

(cf. Definitions 2.7 and 2.20). The proof of Lemma 2.33 is thus completed.

Lemma 2.34. Let d ∈ N. Then it holds for all x, y ∈ Rd that

|||Rd (x) − Rd (y)||| ≤ |||x − y||| (85)

(cf. Definitions 2.5 and 2.20).

Proof of Lemma 2.34. Note that Lemma 2.33 and the fact that Rd = C0,∞,d establish (85). The proof of
Lemma 2.34 is thus completed.

Lemma 2.35. Let a, b ∈ N, M = (M_{i,j})_{(i,j)∈{1,...,a}×{1,...,b}} ∈ R^{a×b}. Then
\[
\sup_{v \in \mathbb{R}^b \setminus \{0\}} \left[ \frac{|||Mv|||}{|||v|||} \right] = \max_{i \in \{1,2,\ldots,a\}} \left[ \sum_{j=1}^{b} |M_{i,j}| \right] \le b \left[ \max_{i \in \{1,2,\ldots,a\}} \max_{j \in \{1,2,\ldots,b\}} |M_{i,j}| \right] \tag{86}
\]
(cf. Definition 2.20).

Proof of Lemma 2.35. Observe that
\[
\begin{aligned}
\sup_{v \in \mathbb{R}^b \setminus \{0\}} \left[ \frac{|||Mv|||}{|||v|||} \right] &= \sup_{v \in \mathbb{R}^b,\, |||v||| \le 1} |||Mv||| = \sup_{v = (v_1, v_2, \ldots, v_b) \in [-1,1]^b} |||Mv||| \\
&= \sup_{v = (v_1, v_2, \ldots, v_b) \in [-1,1]^b} \left( \max_{i \in \{1,2,\ldots,a\}} \left| \sum_{j=1}^{b} M_{i,j} v_j \right| \right) \\
&= \max_{i \in \{1,2,\ldots,a\}} \left( \sup_{v = (v_1, v_2, \ldots, v_b) \in [-1,1]^b} \left| \sum_{j=1}^{b} M_{i,j} v_j \right| \right) = \max_{i \in \{1,2,\ldots,a\}} \left( \sum_{j=1}^{b} |M_{i,j}| \right)
\end{aligned} \tag{87}
\]
(cf. Definition 2.20). The proof of Lemma 2.35 is thus completed.


Theorem 2.36. Let a ∈ R, b ∈ [a, ∞), d, L ∈ N, l = (l₀, l₁, . . . , l_L) ∈ N^{L+1} satisfy
\[
d \ge \sum_{k=1}^{L} l_k (l_{k-1} + 1). \tag{88}
\]
Then it holds for all θ, ϑ ∈ R^d that
\[
\begin{aligned}
\sup_{x \in [a,b]^{l_0}} |||\mathcal{N}^{\theta,l}_{-\infty,\infty}(x) - \mathcal{N}^{\vartheta,l}_{-\infty,\infty}(x)||| &\le \max\{1, |a|, |b|\} \, |||\theta - \vartheta||| \left[ \prod_{m=0}^{L-1} (l_m + 1) \right] \left[ \sum_{n=0}^{L-1} \max\{1, |||\theta|||^n\} \, |||\vartheta|||^{L-1-n} \right] \\
&\le L \max\{1, |a|, |b|\} \, (\max\{1, |||\theta|||, |||\vartheta|||\})^{L-1} \left[ \prod_{m=0}^{L-1} (l_m + 1) \right] |||\theta - \vartheta||| \\
&\le L \max\{1, |a|, |b|\} \, (|||l||| + 1)^{L} \, (\max\{1, |||\theta|||, |||\vartheta|||\})^{L-1} \, |||\theta - \vartheta|||
\end{aligned} \tag{89}
\]
(cf. Definitions 2.8 and 2.20).


Proof of Theorem 2.36. Throughout this proof let θ_j = (θ_{j,1}, θ_{j,2}, . . . , θ_{j,d}) ∈ R^d, j ∈ {1, 2}, let 𝔡 ∈ N satisfy that
\[
\mathfrak{d} = \sum_{k=1}^{L} l_k (l_{k-1} + 1), \tag{90}
\]
let W_{j,k} ∈ R^{l_k×l_{k−1}}, k ∈ {1, 2, . . . , L}, j ∈ {1, 2}, and B_{j,k} ∈ R^{l_k}, k ∈ {1, 2, . . . , L}, j ∈ {1, 2}, satisfy for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} that
\[
\mathcal{T}\big( ((W_{j,1}, B_{j,1}), (W_{j,2}, B_{j,2}), \ldots, (W_{j,L}, B_{j,L})) \big) = (\theta_{j,1}, \theta_{j,2}, \ldots, \theta_{j,\mathfrak{d}}), \tag{91}
\]
let φ_{j,k} ∈ N, k ∈ {1, 2, . . . , L}, j ∈ {1, 2}, satisfy for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} that
\[
\phi_{j,k} = \big( (W_{j,1}, B_{j,1}), (W_{j,2}, B_{j,2}), \ldots, (W_{j,k}, B_{j,k}) \big) \in \Big( \mathop{\times}_{i=1}^{k} \big( \mathbb{R}^{l_i \times l_{i-1}} \times \mathbb{R}^{l_i} \big) \Big), \tag{92}
\]
let D = [a, b]^{l₀}, let m_{j,k} ∈ [0, ∞), j ∈ {1, 2}, k ∈ {0, 1, . . . , L}, satisfy for all j ∈ {1, 2}, k ∈ {0, 1, . . . , L} that
\[
m_{j,k} = \begin{cases} \max\{1, |a|, |b|\} & : k = 0 \\ \max\{1, \sup_{x \in D} |||(\mathcal{R}_{\mathfrak{r}}(\phi_{j,k}))(x)|||\} & : k > 0, \end{cases} \tag{93}
\]
and let e_k ∈ [0, ∞), k ∈ {0, 1, . . . , L}, satisfy for all k ∈ {0, 1, . . . , L} that
\[
e_k = \begin{cases} 0 & : k = 0 \\ \sup_{x \in D} |||(\mathcal{R}_{\mathfrak{r}}(\phi_{1,k}))(x) - (\mathcal{R}_{\mathfrak{r}}(\phi_{2,k}))(x)||| & : k > 0 \end{cases} \tag{94}
\]
(cf. Definitions 2.11, 2.4, 2.10, and 2.20). Note that Lemma 2.35 demonstrates that
\[
\begin{aligned}
e_1 &= \sup_{x \in D} |||(\mathcal{R}_{\mathfrak{r}}(\phi_{1,1}))(x) - (\mathcal{R}_{\mathfrak{r}}(\phi_{2,1}))(x)||| = \sup_{x \in D} |||(W_{1,1} x + B_{1,1}) - (W_{2,1} x + B_{2,1})||| \\
&\le \sup_{x \in D} \big[ |||(W_{1,1} - W_{2,1}) x||| + |||B_{1,1} - B_{2,1}||| \big] \\
&\le \Bigg[ \sup_{v \in \mathbb{R}^{l_0} \setminus \{0\}} \frac{|||(W_{1,1} - W_{2,1}) v|||}{|||v|||} \Bigg] \Big[ \sup_{x \in D} |||x||| \Big] + |||B_{1,1} - B_{2,1}||| \\
&\le l_0 \, |||\theta_1 - \theta_2||| \max\{|a|, |b|\} + |||B_{1,1} - B_{2,1}||| \le l_0 \, |||\theta_1 - \theta_2||| \max\{|a|, |b|\} + |||\theta_1 - \theta_2||| \\
&= |||\theta_1 - \theta_2||| \, (l_0 \max\{|a|, |b|\} + 1) \le m_{1,0} \, |||\theta_1 - \theta_2||| \, (l_0 + 1).
\end{aligned} \tag{95}
\]
Moreover, observe that the triangle inequality assures that for all k ∈ {1, 2, . . . , L} ∩ (1, ∞) it holds that
\[
\begin{aligned}
e_k &= \sup_{x \in D} |||(\mathcal{R}_{\mathfrak{r}}(\phi_{1,k}))(x) - (\mathcal{R}_{\mathfrak{r}}(\phi_{2,k}))(x)||| \\
&= \sup_{x \in D} \Big|\Big|\Big| \Big[ W_{1,k} \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{1,k-1}))(x) \big) + B_{1,k} \Big] - \Big[ W_{2,k} \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{2,k-1}))(x) \big) + B_{2,k} \Big] \Big|\Big|\Big| \\
&\le \sup_{x \in D} \Big|\Big|\Big| W_{1,k} \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{1,k-1}))(x) \big) - W_{2,k} \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{2,k-1}))(x) \big) \Big|\Big|\Big| + |||\theta_1 - \theta_2|||.
\end{aligned} \tag{96}
\]
The triangle inequality hence implies that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} ∩ (1, ∞) it holds that
\[
\begin{aligned}
e_k &\le \sup_{x \in D} \Big|\Big|\Big| \big[ W_{1,k} - W_{2,k} \big] \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{j,k-1}))(x) \big) \Big|\Big|\Big| \\
&\quad + \sup_{x \in D} \Big|\Big|\Big| W_{3-j,k} \Big[ \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{1,k-1}))(x) \big) - \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{2,k-1}))(x) \big) \Big] \Big|\Big|\Big| + |||\theta_1 - \theta_2||| \\
&\le \Bigg[ \sup_{v \in \mathbb{R}^{l_{k-1}} \setminus \{0\}} \frac{|||(W_{1,k} - W_{2,k}) v|||}{|||v|||} \Bigg] \Big[ \sup_{x \in D} \big|\big|\big| \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{j,k-1}))(x) \big) \big|\big|\big| \Big] + |||\theta_1 - \theta_2||| \\
&\quad + \Bigg[ \sup_{v \in \mathbb{R}^{l_{k-1}} \setminus \{0\}} \frac{|||W_{3-j,k} v|||}{|||v|||} \Bigg] \Big[ \sup_{x \in D} \big|\big|\big| \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{1,k-1}))(x) \big) - \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{2,k-1}))(x) \big) \big|\big|\big| \Big].
\end{aligned} \tag{97}
\]
Lemma 2.35 and Lemma 2.34 therefore show that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} ∩ (1, ∞) it holds that
\[
\begin{aligned}
e_k &\le l_{k-1} \, |||\theta_1 - \theta_2||| \Big[ \sup_{x \in D} \big|\big|\big| \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{j,k-1}))(x) \big) \big|\big|\big| \Big] + |||\theta_1 - \theta_2||| \\
&\quad + l_{k-1} \, |||\theta_{3-j}||| \Big[ \sup_{x \in D} \big|\big|\big| \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{1,k-1}))(x) \big) - \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{2,k-1}))(x) \big) \big|\big|\big| \Big] \\
&\le l_{k-1} \, |||\theta_1 - \theta_2||| \Big[ \sup_{x \in D} |||(\mathcal{R}_{\mathfrak{r}}(\phi_{j,k-1}))(x)||| \Big] + |||\theta_1 - \theta_2||| \\
&\quad + l_{k-1} \, |||\theta_{3-j}||| \Big[ \sup_{x \in D} |||(\mathcal{R}_{\mathfrak{r}}(\phi_{1,k-1}))(x) - (\mathcal{R}_{\mathfrak{r}}(\phi_{2,k-1}))(x)||| \Big] \\
&\le |||\theta_1 - \theta_2||| \, (l_{k-1} m_{j,k-1} + 1) + l_{k-1} \, |||\theta_{3-j}||| \, e_{k-1}.
\end{aligned} \tag{98}
\]
Hence, we obtain that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} ∩ (1, ∞) it holds that
\[
e_k \le m_{j,k-1} \, |||\theta_1 - \theta_2||| \, (l_{k-1} + 1) + l_{k-1} \, |||\theta_{3-j}||| \, e_{k-1}. \tag{99}
\]
Combining this with (95), the fact that e₀ = 0, and the fact that m_{1,0} = m_{2,0} demonstrates that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} it holds that
\[
e_k \le m_{j,k-1} \, (l_{k-1} + 1) \, |||\theta_1 - \theta_2||| + l_{k-1} \, |||\theta_{3-j}||| \, e_{k-1}. \tag{100}
\]
This shows that for all j = (j_n)_{n∈{0,1,...,L}} : {0, 1, . . . , L} → {1, 2} and all k ∈ {1, 2, . . . , L} it holds that
\[
e_k \le m_{j_{k-1},k-1} \, (l_{k-1} + 1) \, |||\theta_1 - \theta_2||| + l_{k-1} \, |||\theta_{3-j_{k-1}}||| \, e_{k-1}. \tag{101}
\]
Therefore, we obtain that for all j = (j_n)_{n∈{0,1,...,L}} : {0, 1, . . . , L} → {1, 2} and all k ∈ {1, 2, . . . , L} it holds that
\[
e_k \le \sum_{n=0}^{k-1} \Bigg( \bigg[ \prod_{m=n+1}^{k-1} l_m \, |||\theta_{3-j_m}||| \bigg] m_{j_n,n} \, (l_n + 1) \Bigg) |||\theta_1 - \theta_2||| = |||\theta_1 - \theta_2||| \Bigg[ \sum_{n=0}^{k-1} \Bigg( \bigg[ \prod_{m=n+1}^{k-1} l_m \, |||\theta_{3-j_m}||| \bigg] m_{j_n,n} \, (l_n + 1) \Bigg) \Bigg]. \tag{102}
\]
Next observe that Lemma 2.35 ensures that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} ∩ (1, ∞), x ∈ D it holds that
\[
\begin{aligned}
|||(\mathcal{R}_{\mathfrak{r}}(\phi_{j,k}))(x)||| &= \Big|\Big|\Big| W_{j,k} \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{j,k-1}))(x) \big) + B_{j,k} \Big|\Big|\Big| \\
&\le \Bigg[ \sup_{v \in \mathbb{R}^{l_{k-1}} \setminus \{0\}} \frac{|||W_{j,k} v|||}{|||v|||} \Bigg] \Big|\Big|\Big| \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{j,k-1}))(x) \big) \Big|\Big|\Big| + |||B_{j,k}||| \\
&\le l_{k-1} \, |||\theta_j||| \, \Big|\Big|\Big| \mathbf{R}_{l_{k-1}}\big( (\mathcal{R}_{\mathfrak{r}}(\phi_{j,k-1}))(x) \big) \Big|\Big|\Big| + |||\theta_j||| \le l_{k-1} \, |||\theta_j||| \, |||(\mathcal{R}_{\mathfrak{r}}(\phi_{j,k-1}))(x)||| + |||\theta_j||| \\
&= (l_{k-1} \, |||(\mathcal{R}_{\mathfrak{r}}(\phi_{j,k-1}))(x)||| + 1) \, |||\theta_j||| \le m_{j,k-1} \, (l_{k-1} + 1) \, |||\theta_j|||.
\end{aligned} \tag{103}
\]
Hence, we obtain for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} ∩ (1, ∞) that
\[
m_{j,k} \le \max\{1, m_{j,k-1} \, (l_{k-1} + 1) \, |||\theta_j|||\}. \tag{104}
\]
Furthermore, note that Lemma 2.35 assures that for all j ∈ {1, 2}, x ∈ D it holds that
\[
\begin{aligned}
|||(\mathcal{R}_{\mathfrak{r}}(\phi_{j,1}))(x)||| &= |||W_{j,1} x + B_{j,1}||| \le \Bigg[ \sup_{v \in \mathbb{R}^{l_0} \setminus \{0\}} \frac{|||W_{j,1} v|||}{|||v|||} \Bigg] |||x||| + |||B_{j,1}||| \\
&\le l_0 \, |||\theta_j||| \, |||x||| + |||\theta_j||| \le l_0 \, |||\theta_j||| \max\{|a|, |b|\} + |||\theta_j||| \\
&= (l_0 \max\{|a|, |b|\} + 1) \, |||\theta_j||| \le m_{1,0} \, (l_0 + 1) \, |||\theta_j|||.
\end{aligned} \tag{105}
\]
Therefore, we obtain that for all j ∈ {1, 2} it holds that
\[
m_{j,1} \le \max\{1, m_{j,0} \, (l_0 + 1) \, |||\theta_j|||\}. \tag{106}
\]
Combining this with (104) demonstrates that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} it holds that
\[
m_{j,k} \le \max\{1, m_{j,k-1} \, (l_{k-1} + 1) \, |||\theta_j|||\}. \tag{107}
\]
Hence, we obtain that for all j ∈ {1, 2}, k ∈ {0, 1, . . . , L} it holds that
\[
m_{j,k} \le m_{j,0} \Bigg[ \prod_{n=0}^{k-1} (l_n + 1) \Bigg] \big( \max\{1, |||\theta_j|||\} \big)^{k}. \tag{108}
\]
Combining this with (102) proves that for all j = (j_n)_{n∈{0,1,...,L}} : {0, 1, . . . , L} → {1, 2} and all k ∈ {1, 2, . . . , L} it holds that
\[
\begin{aligned}
e_k &\le |||\theta_1 - \theta_2||| \Bigg[ \sum_{n=0}^{k-1} \Bigg( \bigg[ \prod_{m=n+1}^{k-1} l_m \, |||\theta_{3-j_m}||| \bigg] m_{j_n,0} \bigg[ \prod_{v=0}^{n-1} (l_v + 1) \bigg] \max\{1, |||\theta_{j_n}|||^n\} \, (l_n + 1) \Bigg) \Bigg] \\
&= m_{1,0} \, |||\theta_1 - \theta_2||| \Bigg[ \sum_{n=0}^{k-1} \Bigg( \bigg[ \prod_{m=n+1}^{k-1} l_m \, |||\theta_{3-j_m}||| \bigg] \bigg[ \prod_{v=0}^{n} (l_v + 1) \bigg] \max\{1, |||\theta_{j_n}|||^n\} \Bigg) \Bigg] \\
&\le m_{1,0} \, |||\theta_1 - \theta_2||| \Bigg[ \sum_{n=0}^{k-1} \Bigg( \bigg[ \prod_{m=n+1}^{k-1} |||\theta_{3-j_m}||| \bigg] \bigg[ \prod_{v=0}^{k-1} (l_v + 1) \bigg] \max\{1, |||\theta_{j_n}|||^n\} \Bigg) \Bigg] \\
&= m_{1,0} \, |||\theta_1 - \theta_2||| \Bigg[ \prod_{n=0}^{k-1} (l_n + 1) \Bigg] \Bigg[ \sum_{n=0}^{k-1} \Bigg( \bigg[ \prod_{m=n+1}^{k-1} |||\theta_{3-j_m}||| \bigg] \max\{1, |||\theta_{j_n}|||^n\} \Bigg) \Bigg].
\end{aligned} \tag{109}
\]
Therefore, we obtain that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} it holds that
\[
\begin{aligned}
e_k &\le m_{1,0} \, |||\theta_1 - \theta_2||| \Bigg[ \prod_{n=0}^{k-1} (l_n + 1) \Bigg] \Bigg[ \sum_{n=0}^{k-1} \Bigg( \bigg[ \prod_{m=n+1}^{k-1} |||\theta_{3-j}||| \bigg] \max\{1, |||\theta_j|||^n\} \Bigg) \Bigg] \\
&= m_{1,0} \, |||\theta_1 - \theta_2||| \Bigg[ \prod_{n=0}^{k-1} (l_n + 1) \Bigg] \Bigg[ \sum_{n=0}^{k-1} \max\{1, |||\theta_j|||^n\} \, |||\theta_{3-j}|||^{k-1-n} \Bigg] \\
&\le k \, m_{1,0} \, |||\theta_1 - \theta_2||| \, \big( \max\{1, |||\theta_1|||, |||\theta_2|||\} \big)^{k-1} \Bigg[ \prod_{m=0}^{k-1} (l_m + 1) \Bigg].
\end{aligned} \tag{110}
\]
The proof of Theorem 2.36 is thus completed.

Corollary 2.37. Let a ∈ R, b ∈ [a, ∞), u ∈ [−∞, ∞), v ∈ (u, ∞], d, L ∈ N, l = (l₀, l₁, . . . , l_L) ∈ N^{L+1} satisfy
\[
d \ge \sum_{k=1}^{L} l_k (l_{k-1} + 1). \tag{111}
\]
Then it holds for all θ, ϑ ∈ R^d that
\[
\sup_{x \in [a,b]^{l_0}} |||\mathcal{N}^{\theta,l}_{u,v}(x) - \mathcal{N}^{\vartheta,l}_{u,v}(x)||| \le L \max\{1, |a|, |b|\} \, (|||l||| + 1)^{L} \, (\max\{1, |||\theta|||, |||\vartheta|||\})^{L-1} \, |||\theta - \vartheta||| \tag{112}
\]
(cf. Definitions 2.8 and 2.20).

Proof of Corollary 2.37. Observe that Theorem 2.36 and Lemma 2.33 demonstrate that for all θ, ϑ ∈ R^d it holds that
\[
\begin{aligned}
\sup_{x \in [a,b]^{l_0}} |||\mathcal{N}^{\theta,l}_{u,v}(x) - \mathcal{N}^{\vartheta,l}_{u,v}(x)||| &= \sup_{x \in [a,b]^{l_0}} \Big|\Big|\Big| C_{u,v,l_L}\big( \mathcal{N}^{\theta,l}_{-\infty,\infty}(x) \big) - C_{u,v,l_L}\big( \mathcal{N}^{\vartheta,l}_{-\infty,\infty}(x) \big) \Big|\Big|\Big| \\
&\le \sup_{x \in [a,b]^{l_0}} |||\mathcal{N}^{\theta,l}_{-\infty,\infty}(x) - \mathcal{N}^{\vartheta,l}_{-\infty,\infty}(x)||| \\
&\le L \max\{1, |a|, |b|\} \, (|||l||| + 1)^{L} \, (\max\{1, |||\theta|||, |||\vartheta|||\})^{L-1} \, |||\theta - \vartheta|||
\end{aligned} \tag{113}
\]
(cf. Definitions 2.8, 2.20, and 2.7). This completes the proof of Corollary 2.37.
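
As a numerical sanity check of Corollary 2.37, the following sketch (our own illustration; the architecture and sample sizes are arbitrary choices, and the supremum is only estimated on random inputs) compares the distance of two ReLU network realizations against the right-hand side of (112):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)
norm = lambda v: np.max(np.abs(v))  # the maximum norm of Definition 2.20

def net(theta, dims, x):
    # N^{theta,l}_{-inf,inf}(x): ReLU after every affine step except the last
    offset = 0
    for k in range(1, len(dims)):
        r, s = dims[k], dims[k - 1]
        W = theta[offset:offset + r * s].reshape(r, s)
        b = theta[offset + r * s:offset + r * s + r]
        x = W @ x + b
        if k < len(dims) - 1:
            x = relu(x)
        offset += r * (s + 1)
    return x

dims, a, b_ = (2, 4, 4, 1), -1.0, 1.0
d = sum(dims[k] * (dims[k - 1] + 1) for k in range(1, len(dims)))
theta, vartheta = rng.uniform(-1, 1, d), rng.uniform(-1, 1, d)
X = rng.uniform(a, b_, size=(2000, dims[0]))
lhs = max(norm(net(theta, dims, x) - net(vartheta, dims, x)) for x in X)
L = len(dims) - 1
rhs = (L * max(1.0, abs(a), abs(b_)) * (max(dims) + 1) ** L
       * max(1.0, norm(theta), norm(vartheta)) ** (L - 1) * norm(theta - vartheta))
assert lhs <= rhs  # the sampled distance never exceeds the bound (112)
```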

3 Separate analyses of the error sources
In this section we study separately the approximation error (see Section 3.1 below), the generalization
error (see Section 3.2 below), and the optimization error (see Section 3.3 below).
In particular, the main result in Section 3.1, Proposition 3.5 below, establishes an upper bound for
the error in the approximation of a Lipschitz continuous function by DNNs. This approximation result
is obtained by combining the essentially well-known approximation result in Lemma 3.1 with the DNN
calculus in Section 2.2 above (cf., e.g., Grohs et al. [31, 32]). Some of the results in Section 3.1 are partially
based on material in publications from the scientific literature. In particular, the elementary result in
Lemma 3.2 is basically well-known in the scientific literature. For further approximation results for DNNs
we refer, e.g., to [1, 3, 4, 11, 12, 13, 14, 16, 17, 19, 21, 22, 23, 24, 25, 27, 29, 30, 31, 33, 34, 36, 38, 39, 40,
41, 42, 44, 47, 49, 52, 53, 54, 55, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 69, 70, 72, 73, 74] and the references
mentioned therein.
In Lemmas 3.20 and 3.21 in Section 3.2 below we study the generalization error. Our analysis in
Section 3.2 is in parts inspired by Berner et al. [10] and Cucker & Smale [18]. Proposition 3.10 in
Section 3.2.1 is known as Hoeffding’s inequality in the scientific literature and Proposition 3.10 is, e.g.,
proved as Theorem 2 in Hoeffding [37]. The proof of Proposition 3.12 can be found, e.g., in Cucker
& Smale [18, Proposition 5] (cf. also Berner et al. [10, Proposition 4.3]). For further results on the
generalization error we refer, e.g., to [5, 35, 51, 68, 71] and the references mentioned therein.
In the two elementary results in Section 3.3, Lemmas 3.22 and 3.23, we study the optimization error of
the minimum Monte Carlo algorithm. A related result can be found, e.g., in [6, Lemma 3.5]. For further
results on the optimization error we refer, e.g., to [2, 9, 15, 20, 26, 43, 45, 46, 48] and the references
mentioned therein.

3.1 Analysis of the approximation error


3.1.1 Approximations for Lipschitz continuous functions
Lemma 3.1. Let (E, δ) be a metric space, let M ⊆ E satisfy M ≠ ∅, let L ∈ [0, ∞), let f : E → R satisfy for all x ∈ E, y ∈ M that |f(x) − f(y)| ≤ Lδ(x, y), and let F : E → R ∪ {∞} satisfy for all x ∈ E that
\[
F(x) = \sup_{y \in M} \, [f(y) - L\delta(x, y)]. \tag{114}
\]

Then

(i) it holds for all x ∈ E that F (x) ≤ f (x),

(ii) it holds for all x ∈ M that F (x) = f (x),

(iii) it holds for all x, y ∈ E that |F (x) − F (y)| ≤ Lδ(x, y), and

(iv) it holds for all x ∈ E that
\[
|F(x) - f(x)| \le 2L \Big[ \inf_{y \in M} \delta(x, y) \Big]. \tag{115}
\]

Proof of Lemma 3.1. First, observe that the hypothesis that for all x ∈ E, y ∈ M it holds that |f (x) −
f (y)| ≤ Lδ(x, y) ensures that for all x ∈ E, y ∈ M it holds that

f (x) ≥ f (y) − Lδ(x, y). (116)

Hence, we obtain that for all x ∈ E it holds that

\[
f(x) \ge \sup_{y \in M} \, [f(y) - L\delta(x, y)] = F(x). \tag{117}
\]

This establishes item (i). Next observe that (114) implies that for all x ∈ M it holds that

F (x) ≥ f (x) − Lδ(x, x) = f (x). (118)

Combining this with item (i) establishes item (ii). In the next step we note that (114) and the fact that
for all x ∈ E it holds that F(x) ≤ f(x) < ∞ show that for all x, y ∈ E it holds that
\[
\begin{aligned}
F(x) - F(y) &= \Big[ \sup_{v \in M} (f(v) - L\delta(x, v)) \Big] - \Big[ \sup_{w \in M} (f(w) - L\delta(y, w)) \Big] \\
&= \sup_{v \in M} \Big[ f(v) - L\delta(x, v) - \sup_{w \in M} (f(w) - L\delta(y, w)) \Big] \\
&\le \sup_{v \in M} \big[ f(v) - L\delta(x, v) - (f(v) - L\delta(y, v)) \big] = L \Big[ \sup_{v \in M} (\delta(y, v) - \delta(x, v)) \Big] \\
&\le L \Big[ \sup_{v \in M} (\delta(y, x) + \delta(x, v) - \delta(x, v)) \Big] = L\delta(x, y).
\end{aligned} \tag{119}
\]

Combining this with the fact that for all x, y ∈ E it holds that δ(x, y) = δ(y, x) establishes item (iii).
Observe that item (ii), the triangle inequality, item (iii), and the hypothesis that for all x ∈ E, y ∈ M it
holds that |f (x) − f (y)| ≤ Lδ(x, y) ensure that for all x ∈ E it holds that

\[
\begin{aligned}
|F(x) - f(x)| &= \inf_{y \in M} |F(x) - F(y) + f(y) - f(x)| \\
&\le \inf_{y \in M} \big( |F(x) - F(y)| + |f(y) - f(x)| \big) \\
&\le \inf_{y \in M} \big( 2L\delta(x, y) \big) = 2L \Big[ \inf_{y \in M} \delta(x, y) \Big].
\end{aligned} \tag{120}
\]

This establishes item (iv). The proof of Lemma 3.1 is thus completed.
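
For a finite sample M of points, the extension (114) is directly computable; the following sketch (our own illustration, assuming the Euclidean metric for δ and NumPy) returns a callable F with the properties of items (ii)–(iv):

```python
import numpy as np

def lipschitz_extension(f_vals, M_pts, L):
    """F(x) = sup_{y in M} [f(y) - L * delta(x, y)] from (114), for a
    finite set M_pts of points with values f_vals: F is L-Lipschitz,
    agrees with f on M_pts, and is within 2 L dist(x, M) of f."""
    M_pts = np.asarray(M_pts, dtype=float)
    f_vals = np.asarray(f_vals, dtype=float)
    def F(x):
        dists = np.linalg.norm(M_pts - np.asarray(x, dtype=float), axis=1)
        return np.max(f_vals - L * dists)
    return F
```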

3.1.2 DNN representations for maxima


Lemma 3.2. Let Φ ∈ N satisfy
\[
\Phi = \left( \left( \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 0 & -1 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \right), \big( \begin{pmatrix} 1 & 1 & -1 \end{pmatrix}, 0 \big) \right) \in (\mathbb{R}^{3 \times 2} \times \mathbb{R}^{3}) \times (\mathbb{R}^{1 \times 3} \times \mathbb{R}) \tag{121}
\]
(cf. Definition 2.9). Then

(i) it holds for all k ∈ N that L(I_k) = 2,

(ii) there exist unique φ_k ∈ N, k ∈ {2, 3, . . .}, which satisfy for all k ∈ {2, 3, . . .} that φ₂ = Φ, I(φ_k) = O(P₂(Φ, I_{k−1})), and
\[
\phi_{k+1} = \phi_k \bullet \big( \mathbf{P}_2(\Phi, \mathfrak{I}_{k-1}) \big), \tag{122}
\]

(iii) it holds for all k ∈ {2, 3, . . .} that L(φ_k) = k,

(iv) it holds for all k ∈ {2, 3, . . .} that D(φ_k) = (k, 2k − 1, 2k − 3, . . . , 3, 1) ∈ N^{k+1}, and

(v) it holds for all k ∈ {2, 3, . . .}, x = (x₁, x₂, . . . , x_k) ∈ R^k that
\[
\big( \mathcal{R}_{\mathfrak{r}}(\phi_k) \big)(x) = \max\{x_1, x_2, \ldots, x_k\} \tag{123}
\]
(cf. Definitions 2.18, 2.16, 2.19, 2.4, and 2.10).

Proof of Lemma 3.2. First, note that, e.g., item (i) in [32, Lemma 3.16] shows that for all k ∈ N it holds that
\[
\mathcal{D}(\mathfrak{I}_k) = (k, 2k, k). \tag{124}
\]
This establishes item (i). Next note that (121) demonstrates that
\[
\mathcal{D}(\Phi) = (2, 3, 1). \tag{125}
\]
Combining this and (124) with item (i) in [31, Proposition 2.20] shows that for all k ∈ N it holds that
\[
\mathcal{D}(\mathbf{P}_2(\Phi, \mathfrak{I}_k)) = (k + 2, 2k + 3, k + 1) \tag{126}
\]
(cf. Definition 2.16). Hence, we obtain that for all k ∈ {2, 3, . . .} it holds that
\[
\mathcal{D}(\mathbf{P}_2(\Phi, \mathfrak{I}_{k-1})) = (k + 1, 2k + 1, k). \tag{127}
\]
Combining this with (126) ensures that for all k ∈ {2, 3, . . .} it holds that
\[
\mathcal{O}(\mathbf{P}_2(\Phi, \mathfrak{I}_k)) = k + 1 = \mathcal{I}(\mathbf{P}_2(\Phi, \mathfrak{I}_{k-1})). \tag{128}
\]
Moreover, note that (121) and (126) assure that
\[
\mathcal{I}(\Phi) = 2 = \mathcal{O}(\mathbf{P}_2(\Phi, \mathfrak{I}_1)). \tag{129}
\]
Furthermore, observe that item (i) in [31, Proposition 2.6] and (128) show that for all k ∈ {2, 3, . . .}, ψ ∈ N with I(ψ) = O(P₂(Φ, I_{k−1})) it holds that
\[
\mathcal{I}\big( \psi \bullet \mathbf{P}_2(\Phi, \mathfrak{I}_{k-1}) \big) = \mathcal{I}(\mathbf{P}_2(\Phi, \mathfrak{I}_{k-1})) = \mathcal{O}(\mathbf{P}_2(\Phi, \mathfrak{I}_k)) \tag{130}
\]
(cf. Definition 2.19). Combining this and (129) with induction establishes item (ii). In the next step we note that (122) and item (ii) in [31, Proposition 2.6] imply that for all k ∈ {2, 3, . . .} it holds that
\[
\mathcal{L}(\phi_{k+1}) = \mathcal{L}(\phi_k) + \mathcal{L}(\mathbf{P}_2(\phi_2, \mathfrak{I}_{k-1})) - 1 = \mathcal{L}(\phi_k) + 1. \tag{131}
\]
Combining this and the fact that L(φ₂) = 2 with induction establishes item (iii). Furthermore, observe that (122), (127), and item (i) in [31, Proposition 2.6] demonstrate that for all k ∈ {2, 3, . . .}, l₀, l₁, . . . , l_k ∈ N with D(φ_k) = (l₀, l₁, . . . , l_k) it holds that
\[
\mathcal{D}(\phi_{k+1}) = \mathcal{D}\big( \phi_k \bullet \mathbf{P}_2(\Phi, \mathfrak{I}_{k-1}) \big) = (k + 1, 2k + 1, l_1, l_2, \ldots, l_k). \tag{132}
\]

This, item (iii), the fact that D(φ2 ) = (2, 3, 1), and induction establish item (iv). Moreover, note that
(121) ensures that for all (x1 , x2 ) ∈ R2 it holds that
    
1 −1   0
  x1
Rr (Φ) (x1 , x2 ) = 1 1 −1 Mr,3   0 1  + 0 + 0

x2
0 −1 0
 
 max{x1 − x2 , 0} (133)
= 1 1 −1  max{x2 , 0} 
max{−x2 , 0}
= max{x1 − x2 , 0} + max{x2 , 0} − max{−x2 , 0}
= max{x1 − x2 , 0} + x2 = max{x1 , x2 }

23
(cf. Definitions 2.4, 2.10, and 2.3). Combining this and item (iii) in [32, Lemma 3.16] with [31, Proposi-
tion 2.19] proves that for all k ∈ {2, 3, . . .}, x = (x1 , x2 , . . . , xk+1 ) ∈ Rk+1 it holds that
   
Rr (P2 (Φ, Ik−1 )) (x) = Rr (Φ) (x1 , x2 ), Rr (Ik−1 ) (x3 , x4 , . . . , xk+1 )
(134)
= (max{x1 , x2 }, x3 , x4 , . . . , xk+1 ) ∈ Rk .

Item (v) in [31, Proposition 2.6] and (122) hence show that for all k ∈ {2, 3, . . .}, x = (x1 , x2 , . . . , xk+1 ) ∈
Rk+1 it holds that
      
Rr (φk+1) (x) = Rr φk • P2 (Φ, Ik−1 ) (x) = Rr (φk ) ◦ Rr (P2 (Φ, Ik−1 )) (x)
 (135)
= Rr (φk ) (max{x1 , x2 }, x3 , x4 , . . . , xk+1 ).

This, the fact that φ2 = Φ, (133), and induction establish item (v). The proof of Lemma 3.2 is thus
completed.

Lemma 3.3. Let Ak ∈ R(2k−1)×k , k ∈ {2, 3, . . .}, and Ck ∈ R(k−1)×(2k−1) , k ∈ {2, 3, . . .}, satisfy for all
k ∈ {2, 3, . . .} that
 
1 −1 0 · · · 0
0 1
 0 ··· 0    
0 −1 0 · · · 0  1 1 −1 0 0 · · · 0 0
 
0 0
 1 · · · 0 

0
 0 0 1 −1 · · · 0 0  
Ak = 0 0 −1 · · · 0  and Ck =  .. .. .. .. .. . . .. ..  (136)
  . . . . . . . . 
 .. .. .. . . .. 
. . . . . 
 0 0 0 0 0 · · · 1 −1

0 0 0 ··· 1 
0 0 0 · · · −1

and let φk = ((Wk,1 , Bk,1), (Wk,2, Bk,2 ), . . . , (Wk,k , Bk,k )) ∈ N, k ∈ {2, 3, . . .}, satisfy for all k ∈ {2, 3, . . .}
that I(φk ) = O(P2 (φ2 , Ik−1)), φk+1 = φk • (P2 (φ2 , Ik−1 )), and
     
1 −1 0   
φ2 =  0 1  , 0 , 1 1 −1 , 0  ∈ (R3×2 × R3 ) × (R1×3 × R)

(137)
0 −1 0

(cf. Definitions 2.9, 2.18, 2.19, and 2.16 and Lemma 3.2). Then

(i) it holds for all k ∈ {2, 3, . . .} that Wk,1 = Ak ,

(ii) it holds for all k ∈ {2, 3, . . .}, l ∈ {1, 2, . . . , k} that Bk,l = 0 ∈ R2(k−l)+1 ,

(iii) it holds for all k ∈ {2, 3, . . .}, l ∈ {3, 4, . . . , k + 1} that (Wk+1,l , Bk+1,l ) = (Wk,l−1, Bk,l−1),

(iv) it holds for all k ∈ {2, 3, . . .} that Wk+1,2 = Wk,1 Ck+1 ,



(v) it holds for all k ∈ {2, 3, . . .} that (0, 0, . . . , 0) 6= T (φk ) ∈ {−1, 0, 1}P(φk ) , and

(vi) it holds for all k ∈ {2, 3, . . .} that |||T (φk )||| = 1

(cf. Definitions 2.11 and 2.20).

Proof of Lemma 3.3. First, note that (28), (29), (136), and (137) ensure that for all k ∈ {2, 3, . . .} it holds
that
P2 (φ2 , Ik−1 ) = (NAk+1 , NCk+1 ) (138)

24
(cf. Definition 2.17). This and (137) imply that for all k ∈ {2, 3, . . .} it holds that

φk+1 = φk • P2 (φ2 , Ik−1 )
= ((Wk,1 , Bk,1), (Wk,2, Bk,2), . . . , (Wk,k , Bk,k )) • (NAk+1 , NCk+1 ) (139)
= (NAk+1 , (Wk,1Ck+1 , Bk,1), (Wk,2 , Bk,2), . . . , (Wk,k , Bk,k )).
This, (136), and (137) establish item (i). Next observe that (137), (139), item (iv) in Lemma 3.2, and
induction prove item (ii). Moreover, note that (139) establishes items (iii) and (iv). In addition, observe
that item (i) proves that for all k ∈ {2, 3, . . .} it holds that
 
1 −1 0 · · · 0
0 1
 0 ··· 0   
0 −1 0 · · · 0  1 1 −1 0 0 · · · 0 0
 
0 0
 1 ··· 0   0 0 0 1 −1 · · · 0 0 
 
Wk,1 Ck+1 = Ak Ck+1 = 0 0 −1 · · · 0   .. .. .. .. .. . . .. .. 
  . . . . . . . . 
 .. .. .. . . .. 
. .
 0 0 0 0 0 · · · 1 −1
. . . 

0 0 0 ··· 1 
0 0 0 · · · −1
  (140)
1 1 −1 −1 1 0 0 ··· 0 0
0 0 0
 1 −1 0 0 ··· 0 0 
0 0 0 −1 1 0 0 ··· 0 0
 
0 0 0
 0 0 1 −1 · · · 0 0 

= 0 0 0 0 0 −1 1 · · · 0 0 .
 
 .. .. .. .. .. .. .. . . . .. 
. . . . . . . . .. . 
 
0 0 0 0 0 0 0 · · · 1 −1
0 0 0 0 0 0 0 · · · −1 1
Combining this, (137), and (139) with induction proves item (v). Next note that item (v) establishes
item (vi). The proof of Lemma 3.3 is thus completed.

3.1.3 Interpolations through DNNs


Lemma 3.4. Let φk ∈ N, k ∈ {2, 3, . . .}, satisfy for all k ∈ {2, 3, . . .} that I(φk ) = O(P2 (φ2 , Ik−1 )),
φk+1 = φk • (P2 (φ2 , Ik−1 )), and
     
1 −1 0   
φ2 =  0 1  , 0 , 1 1 −1 , 0  ∈ (R3×2 × R3 ) × (R1×3 × R) ,

(141)
0 −1 0

let d ∈ N, L ∈ [0, ∞), let M ⊆ Rd satisfy |M| ∈ {2, 3, . . .}, let m : {1, 2, . . . , |M|} → M be bijective, let
f : M → R and F : Rd → R satisfy for all x = (x1 , x2 , . . . , xd ) ∈ Rd that
" d
!#
X
F (x) = max f (y) − L |xi − yi | , (142)
y=(y1 ,y2 ...,yd )∈M
i=1

let W1 ∈ R(2d)×d , W2 ∈ R1×(2d) , and Bz ∈ R2d , z ∈ M, satisfy for all z = (z1 , z2 , . . . , zd ) ∈ M that
   
1 0 ··· 0 −z1
−1 0 ··· 0   z1 
   
0
 1 ··· 0   −z2 
  
W1 =  0
 −1 · · · 0  ,
 z2 
Bz =  , and W2 = −L −L · · · −L , (143)
 .. .. . . ..   .. 
 . . . .
 . 
  
 
0 0 ··· 1  −zd 
0 0 · · · −1 zd

25
let W1 ∈ R(2d|M|)×d , B1 ∈ R2d|M| , W2 ∈ R|M|×(2d|M|) , B2 ∈ R|M| satisfy
       
W1 Bm(1) W2 0 ··· 0 f (m(1))
W1 
 
 Bm(2) 
 
 0 W2
 ··· 0  
 f (m(2)) 
 
W1 =  ..  , B1 =  ..  , W2 =  .. .. .. .
.  , and B2 =  .. , (144)
 .   .   . . . .   . 
W1 Bm(|M|) 0 0 · · · W2 f (m(|M|))
and let Φ ∈ N satisfy Φ = φ|M| • ((W1 , B1 ), (W2 , B2 )) (cf. Definitions 2.9, 2.18, 2.16, and 2.19 and
Lemma 3.2). Then
(i) it holds that D(Φ) = (d, 2d|M|, 2|M| − 1, 2|M| − 3, . . . , 3, 1) ∈ N|M|+2,
(ii) it holds that L(Φ) = |M| + 1,
(iii) it holds that |||T (Φ)||| ≤ max{1, L, supz∈M |||z|||, 2[supz∈M |f (z)|]}, and
(iv) it holds that F = Rr (Φ)
(cf. Definitions 2.11, 2.20, 2.4, and 2.10).
Proof of Lemma 3.4. Throughout this proof let Ψ ∈ N satisfy Ψ = ((W1 , B1 ), (W2 , B2 )) and let mi,j ∈ R,
i ∈ {1, 2, . . . , |M|}, j ∈ {1, 2 . . . , d}, satisfy for all i ∈ {1, 2, . . . , |M|}, j ∈ {1, 2 . . . , d} that m(i) =
(mi,1 , mi,2 , . . . , mi,d ). Note that Lemma 3.2 establishes that there exist W1 ∈ R(2|M|−1)×|M| , B1 ∈ R2|M|−1 ,
W2 ∈ R(2|M|−3)×(2|M|−1) , B2 ∈ R2|M|−3 , . . . , W|M|−1 ∈ R3×5 , B|M|−1 ∈ R3 , W|M| ∈ R1×3 , B|M| ∈ R such
that
φ|M| = ((W1 , B1 ), (W2 , B2 ), . . . , (W|M| , B|M| )). (145)
Next observe that (144) establishes that L(Ψ) = 2 and
D(Ψ) = (d, 2d|M|, |M|). (146)
Moreover, note that item (iv) in Lemma 3.2 ensures that
D(φ|M| ) = (|M|, 2|M| − 1, 2|M| − 3, . . . , 3, 1) ∈ N|M|+1 . (147)
This, the fact that Φ = φ|M| • Ψ, (146), and item (i) in [31, Proposition 2.6] show that L(Φ) = |M| + 1
and
D(Φ) = (d, 2d|M|, 2|M| − 1, 2|M| − 3, . . . , 3, 1) ∈ N|M|+2. (148)
This establishes items (i) and (ii). In the next step we note that the hypothesis that Φ = φ|M| •
((W1 , B1 ), (W2 , B2 )) and (145) ensure that
Φ = ((W1 , B1 ), (W2 , B2 ), . . . , (W|M| , B|M| )) • ((W1 , B1 ), (W2 , B2 ))
(149)
= ((W1 , B1 ), (W1 W2 , W1 B2 + B1 ), (W2 , B2 ), . . . , (W|M| , B|M| )).
Lemma 2.13 hence implies that
   
T (Φ) = T ((W1 , B1 )) , T ((W1 W2 , W1 B2 + B1 )) , T ((W2 , B2 )) , . . . , T ((W|M| , B|M| )) (150)
(cf. Definition 2.11). Moreover, note that (144) and item (i) in Lemma 3.3 imply that
   
1 −1 0 · · · 0 W2 −W2 0 ··· 0
0 1
 0 ··· 0  
 0
 W2 0 ··· 0 
0 −1 0 · · · 0   0 −W2 0 ··· 0 
   
0 0
 1 · · · 0 

 0
 0 W 2 · · · 0 

W1 W2 = 0 0 −1 · · · 0  W2 =  0 0 −W2 · · · 0  .

 .. ..
   (151)
.. . . .   .. .. .. .. .. 
. . . . ..   . . . . . 
   
0 0 0 ··· 1   0 0 0 · · · W2 
0 0 0 · · · −1 0 0 0 · · · −W2
| {z } | {z }
∈R(2|M|−1)×|M| ∈R(2|M|−1)×(2d|M|)

26
In addition, observe that (144) and items (i) and (ii) in Lemma 3.3 show that
   
1 −1 0 · · · 0 0
0 1
 0 ··· 0 
0
 
0 −1 0 · · · 0 0
   
0 0
 1 ··· 0  0
 
W1 B2 + B1 = 0 0 −1 · · · 0 B2 + 0
   
 .. .. .. . . ..   .. 
. . . . .  .
   
0 0 0 ··· 1 0
0 0 0 ··· −1 0
| {z } | {z }
∈R(2|M|−1)×|M| ∈R2|M|−1
    (152)
1 −1 0 · · · 0 f (m(1)) − f (m(2))
0 1
 0 ··· 0  

  f (m(2)) 

0 −1 0 · · · 0  f (m(1))  −f (m(2)) 
   
0 0
 1 · · · 0   f (m(2))  
   f (m(3)) 

= 0 0 −1 · · · 0   .. = −f (m(3)) .

 .. ..
 .   
.. . . ..   .. 
. . . . .  f (m(|M|))  . 
 |  
0 0 0 ··· 1  {z
|M|
}  f (m(|M|)) 
∈R
0 0 0 · · · −1 −f (m(|M|))
| {z } | {z }
∈R(2|M|−1)×|M| ∈R2|M|−1

This and (151) demonstrate that



T ((W1 W2 , W1 B2 + B1 ))
  
= max{L, |f (m(1)) − f (m(2))|, |f (m(2))|, |f (m(3))|, . . . , |f (m(|M|))|} ≤ max L, 2 sup |f (z)|
z∈M
(153)

(cf. Definition 2.20). Combining this, (144), and item (vi) in Lemma 3.3 with (150) proves that
  
|||T (Φ)||| ≤ max T ((W1 , B1 )) , T ((W1 W2 , W1 B2 + B1 )) , |||T (φ|M| )|||
  
(154)
≤ max 1, sup |||z|||, L, 2 sup |f (z)| .
z∈M z∈M

This establishes item (iii). Observe that (143) ensures that for all x = (x1 , x2 , . . . , xd ) ∈ Rd , z =
(z1 , z2 , . . . , zd ) ∈ M it holds that
         
1 0 ··· 0 −z1 x1 −z1 x1 − z1
−1 0 ··· 0     z1  −x1   z1  −(x1 − z1 )
  x1        
0 1 ··· 0   
 −z   x 
 2  2   2  2
 −z   x − z2 

 x 2
W1 x + Bz =  0
 −1 · · · 0    ..  +  z2  = −x2  +  z2  = −(x2 − z2 ) . (155)
         
 .. .. . . .  .  .   .   .   .. 
. ..   ..   ..   ..  
 
 . . . 
  xd        
0 0 ··· 1  −zd   xd  −zd   xd − zd 
0 0 · · · −1 zd −xd zd −(xd − zd )
| {z }
∈R(2d)×d

27
This and (144) prove that for all x = (x1 , x2 , . . . , xd ) ∈ Rd , z = (z1 , z2 , . . . , zd ) ∈ M it holds that
 
max{x1 − z1 , 0}
max{z1 − x1 , 0}
 
max{x2 − z2 , 0}
  
W2 R2d (W1 x + Bz ) = −L −L · · · −L max{z2 − x2 , 0}
 
| {z } .. 
∈R1×(2d)

 . 
 (156)
max{xd − zd , 0}
max{zd − xd , 0}
" d # " d #
X X
= −L (max{xi − zi , 0} + max{zi − xi , 0}) = −L |xi − zi |
i=1 i=1
d
(cf. Definition 2.5). Moreover, note that (144) implies that for all x ∈ R it holds that
 
W1 x + Bm(1)
 W1 x + Bm(2) 
 
W1 x + B1 =  .. . (157)
 . 
W1 x + Bm(|M|)
Therefore, we obtain that for all x ∈ Rd it holds that
 
R2d (W1 x + Bm(1) )

 R2d (W1 x + Bm(2) )


R2d|M| (W1 x + B1 ) =  .. . (158)
 . 
R2d (W1 x + Bm(|M|) )
This, (144), and (156) imply that for all x = (x1 , x2 , . . . , xd ) ∈ Rd it holds that
 
Rr (Ψ) (x) = W2 R2d|M| (W1 x + B1 ) + B2
    
W2 0 · · · 0 R2d (W1 x + Bm(1) ) f (m(1))
 0 W2 · · · 0   R2d (W1 x + Bm(2) )   f (m(2)) 
    
=  .. .. . . ..   .. + .. 
 . . . .  .   . 
0 0 · · · W2 R2d (W1 x + Bm(|M|) ) f (m(|M|))
    
W2 R2d (W1 x + Bm(1) ) f (m(1))
 W2 R2d (W1 x + Bm(2) )   f (m(2))  (159)
   
= .. + .. 
 . 
  . 
W2 R2d (W1 x + Bm(|M|) ) f (m(|M|))
 Pd  
f (m(1)) − L i=1 |xi − m1,i |
 f (m(2)) − LPd |x − m | 
 i=1 i 2,i 
= .. 
 . 
Pd 
f (m(|M|)) − L i=1 |xi − m|M|,i |
(cf. Definitions 2.4 and 2.10). This, the fact that Φ = φ|M| • Ψ, item (v) in Lemma 3.2, and item (v) in
[31, Proposition 2.6] ensure that for all x = (x1 , x2 , . . . , xd ) ∈ Rd it holds that
" d
!#
     X
Rr (Φ) (x) = Rr (φ|M| ) ◦ Rr (Ψ) (x) = max f (m(i)) − L |xj − mi,j |
i∈{1,2,...,|M|}
j=1
" !# (160)
d
X
= max f (z) − L |xi − yi | .
y=(y1 ,y2 ...,yd )∈M
i=1

This establishes item (iv). The proof of Lemma 3.4 is thus completed.

28
3.1.4 Explicit approximations through DNNs
Proposition 3.5. Let φk ∈ N, k ∈ {2, 3, . . .}, satisfy for all k ∈ {2, 3, . . .} that I(φk ) = O(P2 (φ2 , Ik−1 )),
φk+1 = φk • (P2 (φ2 , Ik−1 )), and
     
1 −1 0   
φ2 =  0 1  , 0 , 1 1 −1 , 0  ∈ (R3×2 × R3 ) × (R1×3 × R) ,

(161)
0 −1 0

let d ∈ N, L ∈ R, let D ⊆ Rd be a set,P let f : D → R satisfy for all x = (x1 , x2 , . . . , xd ), y =


(y1 , y2 , . . . , yd ) ∈ D that |f (x) − f (y)| ≤ L di=1 |xi − yi | , let M ⊆ D satisfy |M| ∈ {2, 3, . . .}, let
m : {1, 2, . . . , |M|} → M be bijective, let W1 ∈ R(2d)×d , W2 ∈ R1×(2d) , and Bz ∈ R2d , z ∈ M, satisfy for
all z = (z1 , z2 , . . . , zd ) ∈ M that
   
1 0 ··· 0 −z1
−1 0 · · · 0   z1 
   
0
 1 · · · 0 

−z2 
  
W1 =  0 −1 · · · 0  , Bz =  z2  ,
   
and W2 = −L −L · · · −L , (162)
 .. .. . . ..   .. 
 . . . .   . 
   
0 0 ··· 1  −zd 
0 0 · · · −1 zd

let W1 ∈ R(2d|M|)×d , B1 ∈ R2d|M| , W2 ∈ R|M|×(2d|M|) , B2 ∈ R|M| satisfy


       
W1 Bm(1) W2 0 ··· 0 f (m(1))
W1 
 
 Bm(2) 
 
 0 W2
 ··· 0  
 f (m(2)) 
 
W1 =  ..  , B1 =  ..  , W2 =  .. .. .. .  , and B2 =  .. , (163)
 .   .   . . . ..   . 
W1 Bm(|M|) 0 0 · · · W2 f (m(|M|))

and let Φ ∈ N satisfy Φ = φ|M| • ((W1 , B1 ), (W2 , B2 )) (cf. Definitions 2.9, 2.18, 2.16, and 2.19 and
Lemma 3.2). Then

(i) it holds that D(Φ) = (d, 2d|M|, 2|M| − 1, 2|M| − 3, . . . , 3, 1) ∈ N|M|+2,

(ii) it holds that |||T (Φ)||| ≤ max{1, L, supz∈M |||z|||, 2[supz∈M |f (z)|]}, and

(iii) it holds that


" d
!#
 X
sup f (x) − Rr (Φ) (x) ≤ 2L sup inf |xi − yi | (164)
x∈D x=(x1 ,x2 ,...,xd )∈D y=(y1 ,y2 ,...,yd )∈M
i=1

(cf. Definitions 2.11, 2.20, 2.4, and 2.10).

Proof of Proposition 3.5. Throughout this proof let F : Rd → R satisfy for all x = (x1 , x2 , . . . , xd ) ∈ Rd
that " !#
Xd
F (x) = max f (y) − L |xi − yi | . (165)
y=(y1 ,y2 ...,yd )∈M
i=1

Observe that Lemma 3.4 establishes that

(A) it holds that D(Φ) = (d, 2d|M|, 2|M| − 1, 2|M| − 3, . . . , 3, 1) ∈ N|M|+2 ,

(B) it holds that |||T (Φ)||| ≤ max{1, L, supz∈M |||z|||, 2[supz∈M |f (z)|]}, and

29

(C) it holds for all x ∈ D that Rr (Φ) (x) = F (x)

(cf. Definitions 2.11, 2.20, 2.4, and 2.10). Note that items (A) and (B) prove items (i) and (ii). Next
observe that item (C) and Lemma 3.1 (with E ← D, δ ← (D × D ∋ ((x1 , x2 , . . . , xd ), (y1, y2 , . . . , yd )) 7→
P d
i=1 |xi − yi | ∈ [0, ∞)), M ← M, L ← L, f ← f , F ← (D ∋ x 7→ F (x) ∈ R ∪ {∞}) in the notation of
Lemma 3.1) ensure that

sup f (x) − Rr (Φ) (x) = sup |f (x) − F (x)|
x∈D x∈D
" !#
Xd (166)
≤ 2L sup inf |xi − yi | .
x=(x1 ,x2 ,...,xd )∈D y=(y1 ,y2 ,...,yd )∈M
i=1

The proof of Proposition 3.5 is thus completed.

3.1.5 Implicit approximations through DNNs


d
Corollary 3.6. Let d, d ∈ N, L ∈ R, let D ⊆ R Pbe a set, let f : D → R satisfy for all x = (x1 , x2 , . . . , xd ),
y = (y1 , y2 , . . . , yd ) ∈ D that |f (x)−f (y)| ≤ L di=1 |xi −yi | , let M ⊆ D satisfy |M| ∈ {2, 3, . . .}, and let
P|M|+1
l = (l0 , l1 , . . . , l|M|+1) ∈ N|M|+2 satisfy l = (d, 2d|M|, 2|M|−1, 2|M|−3, . . . , 3, 1) and k=1 lk (lk−1 +1) ≤
d. Then there exists θ ∈ Rd such that |||θ||| ≤ max{1, L, supz∈M |||z|||, 2[supz∈M |f (z)|]} and
" d
!#
θ,l
X
sup f (x) − (N−∞,∞)(x) ≤ 2L sup inf |xi − yi | (167)
x∈D x=(x1 ,x2 ,...,xd )∈D y=(y1 ,y2 ,...,yd )∈M
i=1

(cf. Definitions 2.20 and 2.8).

Proof of Corollary 3.6. Observe that Proposition 3.5 and item (ii) in Lemma 3.2 ensure that there exists
Φ ∈ N such that

(A) it holds that D(Φ) = l,

(B) it holds that |||T (Φ)||| ≤ max{1, L, supz∈M |||z|||, 2[supz∈M |f (z)|]}, and

(C) it holds that


" d
!#
 X
sup f (x) − Rr (Φ) (x) ≤ 2L sup inf |xi − yi | (168)
x∈D x=(x1 ,x2 ,...,xd )∈D y=(y1 ,y2 ,...,yd )∈M
i=1

(cf. Definitions 2.9, 2.11, 2.20, 2.4, and 2.10). Combining this with Corollary 2.15 establishes (167). The
proof of Corollary 3.6 is thus completed.

Corollary 3.7. Let d, d ∈ N, L ∈ R, u ∈ [−∞, ∞), v ∈ (u, ∞], let D ⊆ Rd be a set, Plet f : D → [u, v]
satisfy for all x = (x1 , x2 , . . . , xd ), y = (y1 , y2 , . . . , yd) ∈ D that |f (x) − f (y)| ≤ L di=1 |xi − yi | , let
M ⊆ D satisfy |M| ∈ {2, 3, . . .}, let l = (l0 , l1 , . . . , l|M|+1) ∈ N|M|+2 satisfy l = (d, 2d|M|, 2|M|−1, 2|M|−
P|M|+1
3, . . . , 3, 1) and d ≥ k=1 lk (lk−1 + 1). Then there exists θ ∈ Rd such that |||θ||| ≤ max{1, L, supz∈M |||z|||,
2[supz∈M |f (z)|]} and
" d
!#
X
θ,l
sup f (x) − Nu,v (x) ≤ 2L sup inf |xi − yi | (169)
x∈D x=(x1 ,x2 ,...,xd )∈D y=(y1 ,y2 ,...,yd )∈M
i=1

(cf. Definitions 2.20 and 2.8).

30
Proof of Corollary 3.7. First, observe that Corollary 3.6 (with d ← d, d ← d, L ← L, D ← D, f ← (D ∋
x 7→ f (x) ∈ R), M ← M, l ← l in the notation of Corollary 3.6) ensures that there exists θ ∈ Rd which
satisfies |||θ||| ≤ max{1, L, supz∈M |||z|||, 2[supz∈M |f (z)|]} and
" d
!#
θ,l
X
sup f (x) − (N−∞,∞ )(x) ≤ 2L sup inf |xi − yi | (170)
x∈D x=(x1 ,x2 ,...,xd )∈D y=(y1 ,y2 ,...,yd )∈M
i=1

(cf. Definitions 2.20 and 2.8). The assumption that for all x ∈ D it holds that u ≤ f (x) ≤ v and
Lemma 2.33 hence imply that
θ,l θ,l

sup f (x) − Nu,v (x) = sup Cu,v,1 (f (x)) − Cu,v,1 (N−∞,∞ )(x)
x∈D x∈D
" !#
d
X (171)
θ,l
≤ sup f (x) − (N−∞,∞ )(x) ≤ 2L sup inf |xi − yi |
x∈D x=(x1 ,x2 ,...,xd )∈D y=(y1 ,y2 ,...,yd )∈M
i=1

(cf. Definition 2.7). The proof of Corollary 3.7 is thus completed.

Corollary 3.8. Let d, d, L ∈ N, L ∈ R, u ∈ [−∞, ∞), v ∈ (u, ∞], let D ⊆ Rd be a set, let Pdf : D → ([u, v]∩
R) satisfy for all x = (x1 , x2 , . . . , xd ), y = (y1 , y2 , . . . , yd ) ∈ D that |f (x) − f (y)| ≤ L i=1 |xi − yi | , let
M⊆DP satisfy |M| ∈ {2, 3, . . .}, let l = (l0 , l1 , . . . , lL ) ∈ NL+1 satisfy for all k ∈ {2, 3, . . . , |M|} that L ≥
|M|+1, Li=1 li (li−1 +1) ≤ d, l0 = d, lL = 1, l1 ≥ 2d|M|, and lk ≥ 2|M|−2k+3, and assume for all i ∈ N∩
(|M|, L) that li ≥ 2. Then there exists θ ∈ Rd such that |||θ||| ≤ max{1, L, supz∈M |||z|||, 2[supz∈M |f (z)|]}
and " !#
Xd
θ,l
sup f (x) − Nu,v (x) ≤ 2L sup inf |xi − yi | (172)
x∈D x=(x1 ,x2 ,...,xd )∈D y=(y1 ,y2 ,...,yd )∈M
i=1

(cf. Definitions 2.20 and 2.8).

Proof of Corollary 3.8. Throughout this proof let l = (l0 , l1 , . . . , l|M|+1) ∈ N|M|+2 satisfy l = (d, 2d|M|,
P|M|+1
2|M| − 1, 2|M| − 3, . . . , 3, 1). First, note that Corollary 3.7 (with d ← d, d ← k=1 lk (lk−1 + 1), L ← L,
u ← u, v ← v, D ← D, f ← f , M ← M, l ← l in the notation of Corollary 3.7) establishes that there
P|M|+1
exists η ∈ R k=1 lk (lk−1 +1) which satisfies |||η||| ≤ max{1, L, supz∈M |||z|||, 2[supz∈M |f (z)|]} and
" d
!#
X
η,l
sup f (x) − Nu,v (x) ≤ 2L sup inf |xi − yi | (173)
x∈D x=(x1 ,x2 ,...,xd )∈D y=(y1 ,y2 ,...,yd )∈M
i=1

(cf. Definitions 2.20 and 2.8). Next observe that Lemma 2.30 (with u ← u, v ← v, L ← |M| + 1, L ← L,
P|M|+1
d ← k=1 lk (lk−1 + 1), d ← d, θ ← η, (l0 , l1 , . . . , lL ) ← (l0 , l1 , . . . , l|M|+1), (l0 , l1 , . . . , lL ) ← (l0 , l1 , . . . , lL ),
in the notation of Lemma 2.30) shows that there exists θ ∈ Rd such that
θ,l η,l
|||θ||| ≤ max{1, |||η|||} and Nu,v = Nu,v . (174)

Combining this with (173) proves (172). The proof of Corollary 3.8 is thus completed.

Corollary 3.9. Let d, d, N ∈ N, L ∈ R, u ∈ [−∞, ∞), v ∈ (u, ∞] satisfy d ≥ 2d2 (N + 1)d + 5d(N + 1)2d +
4
3
(N + 1)3d , let k·k : Rd → [0, ∞) be the standard norm, let p = (p1 , p2 , . . . , pd ), q = (q1 , q2 , . . . , qd ) ∈ Rd
Q
satisfy for all i ∈ {1, 2, . . . , d} that pi ≤ qi and maxj∈{1,2,...,d} (qj − pj ) > 0, let D = di=1 [pi , qi ], let M ⊆ D
satisfy   
d ∃ k1 , k2 , . . . , kd ∈ {0, 1, . . . , N} :
M = y = (y1 , y2 , . . . , yd ) ∈ R : , (175)
∀ i ∈ {1, 2, . . . , d} : yi = pi + kNi (qi − pi )

31
and let f : D → ([u, v] ∩ R) satisfy for all x, y ∈ D that |f (x) − f (y)| ≤ Lkx − yk. Then there Pexist θ ∈ Rd ,
L ∈ N, l = (l0 , l1 , . . . , lL ) ∈ NL+1 such that |||θ||| ≤ max{1, L, |||p|||, |||q|||, 2[supz∈D |f (z)|]}, Lk=1 lk (lk−1 +
1) ≤ d, and " d #
θ,l L X
sup f (x) − Nu,v (x) ≤ |qi − pi | (176)
x∈D N i=1
(cf. Definitions 2.20 and 2.8).
Proof of Corollary 3.9. Throughout this proof let l = (l0 , l1 , . . . , l|M|+1) ∈ N|M|+2 satisfy l = (d, 2d|M|,
2|M| − 1, 2|M| − 3, . . . , 3, 1). Observe that the fact that |M| ≤ (N + 1)d , the fact that for all n ∈ N it
P P 3
holds that ni=1 (2i − 1) = n2 , and the fact that for all n ∈ N it holds that ni=1 i2 = n(n+1)(2n+1)
6
≤ (n+1)
3
ensure that
|M|+1
X
lk (lk−1 + 1)
k=1
"|M|−1 # " |M| #
X X
= d(2d|M|) + 2d|M|(2|M| − 1) + (2i + 1)(2i − 1) + 2d|M| + (2i − 1)
i=1 i=1
| {z } | {z }
number of weights number of biases
"|M|−1 # (177)
X
= 2d2 |M| + 4d|M|2 − 2d|M| + (4i2 − 1) + 2d|M| + |M|2
i=1
"|M|−1 #
X
= 2d2 |M| + (4d + 1)|M|2 + 4 i2 − (|M| − 1)
i=1
2 2
≤ 2d |M| + 5d|M| + 4
3
|M|3 ≤ 2d (N + 1)d + 5d(N + 1)2d + 43 (N + 1)3d ≤ d.
2

In addition, note that the hypothesis that for all x, y ∈ D it holds that |f (x) − f (y)| ≤ Lkx − yk implies
that for all x = (x1 , x2 , . . . , xd ), y = (y1 , y2 , . . . , yd ) ∈ D it holds that
" d #
X
|f (x) − f (y)| ≤ L |xi − yi | . (178)
i=1

Furthermore, observe that the hypothesis that maxj∈{1,2,...,d} (qj − pj ) > 0 ensures that |M| ≥ 2. Com-
bining this, (177), and (178) with Corollary 3.7 establishes that there exists θ ∈ Rd such that |||θ||| ≤
max{1, L, supz∈M |||z|||, 2[supz∈M |f (z)|]} and
" d
!#
X
θ,l
sup f (x) − Nu,v (x) ≤ 2L sup inf |xi − yi | (179)
x∈D x=(x1 ,x2 ,...,xd )∈D y=(y1 ,y2 ,...,yd )∈M
i=1
Qd
(cf. Definitions 2.20 and 2.8). Next note that the hypothesis that M ⊆ D = i=1 [pi , qi ] implies that for
all z ∈ M it holds that
|||z||| ≤ max{|||p|||, |||q|||}. (180)
Therefore, we obtain that
     
|||θ||| ≤ max 1, L, |||p|||, |||q|||, 2 sup |f (z)| ≤ max 1, L, |||p|||, |||q|||, 2 sup |f (z)| . (181)
z∈M z∈D

In the next step we note that the fact that for all N ∈ N, r ∈ R, s ∈ [r, ∞), x ∈ [r, s] there exists
k ∈ {0, 1, . . . , N} such that |x − (r + Nk (s − r))| ≤ s−r 2N
ensures that for all x = (x1 , x2 , . . . , xd ) ∈ D there
exists y = (y1 , y2, . . . , yd ) ∈ M such that
d
" d #
X 1 X
|xi − yi | ≤ |qi − pi | . (182)
i=1
2N i=1

Combining this, (177), (179), and (181) establishes (176). The proof of Corollary 3.9 is thus completed.

32
3.2 Analysis of the generalization error
3.2.1 Hoeffding’s concentration inequality
Proposition 3.10. Let (Ω, F , P) be a probability space, P let N ∈ N, 2ε ∈ [0, ∞), a1 , a2 , . . . , aN ∈ R,
b1 ∈ [a1 , ∞), b2 ∈ [a2 , ∞), . . . , bN ∈ [aN , ∞), assume Nn=1 (bn − an ) 6= 0, and let Xn : Ω → [an , bn ],
n ∈ {1, 2, . . . , N}, be independent random variables. Then
N
! !
1 X  −2ε2 N 2
P Xn − E[Xn ] ≥ ε ≤ 2 exp PN . (183)
N n=1 n=1 (bn − an ) 2

3.2.2 Covering number estimates


Definition 3.11 (Covering number). Let (E, δ) be a metric space and let r ∈ [0, ∞]. Then we denote by
C(E,δ),r ∈ N0 ∪ {∞} (we denote by CE,r ∈ N0 ∪ {∞}) the extended real number given by
n  o 
C(E,δ),r = inf n ∈ N0 : ∃ A ⊆ E : (|A| ≤ n) ∧ (∀ x ∈ E : ∃ a ∈ A : δ(a, x) ≤ r) ∪ {∞} . (184)

Proposition 3.12. Let (X, k·k) be a finite-dimensional Banach space, let R, r ∈ (0, ∞), B = {θ ∈
X : kθk ≤ R}, and let δ : B × B → [0, ∞) satisfy for all θ, ϑ ∈ B that δ(θ, ϑ) = kθ − ϑk. Then
(
1 :r≥R
C(B,δ),r ≤  4R dim(X) (185)
r
:r<R

(cf. Definition 3.11).

3.2.3 Measurability properties for suprema


Lemma 3.13. Let (E, E ) be a topological space, assume E 6= ∅, let E ⊆ E be an at most countable
set, assume that E is dense in E, let (Ω, F ) be a measurable space, let fx : Ω → R, x ∈ E, be F /B(R)-
measurable functions, assume for all ω ∈ Ω that E ∋ x 7→ fx (ω) ∈ R is a continuous function, and let
F : Ω → R ∪ {∞} satisfy for all ω ∈ Ω that F (ω) = supx∈E fx (ω). Then

(i) it holds for all ω ∈ Ω that F (ω) = supx∈E fx (ω) and

(ii) it holds that F is an F /B(R ∪ {∞})-measurable function.

Proof of Lemma 3.13. Note that the hypothesis that E is dense in E implies that for all g ∈ C(E, R) it
holds that
sup g(x) = sup g(x). (186)
x∈E x∈E

This and the hypothesis that for all ω ∈ Ω it holds that E ∋ x 7→ fx (ω) ∈ R is a continuous function
show that for all ω ∈ Ω it holds that

F (ω) = sup fx (ω) = sup fx (ω). (187)


x∈E x∈E

This establishes item (i). Next note that item (i) and the hypothesis that for all x ∈ E it holds that
fx : Ω → R is an F /B(R)-measurable function demonstrate item (ii). The proof of Lemma 3.13 is thus
completed.

Lemma 3.14. Let (E, δ) be a separable metric space, assume E 6= ∅, let (Ω, F , P) be a probability space,
let L ∈ R, and let Zx : Ω → R, x ∈ E, be random variables which satisfy for all x, y ∈ E that E[|Zx |] < ∞
and |Zx − Zy | ≤ Lδ(x, y). Then

33
(i) it holds for all x, y ∈ E, η ∈ Ω that |(Zx (η) − E[Zx ]) − (Zy (η) − E[Zy ])| ≤ 2Lδ(x, y) and

(ii) it holds that Ω ∋ η 7→ supx∈E |Zx (η) − E[Zx ]| ∈ [0, ∞] is an F /B([0, ∞])-measurable function.

Proof of Lemma 3.14. Note that the hypothesis that for all x, y ∈ E it holds that |Zx − Zy | ≤ Lδ(x, y)
shows that for all x, y ∈ E, η ∈ Ω it holds that

|(Zx (η) − E[Zx ]) − (Zy (η) − E[Zy ])| = |(Zx (η) − Zy (η)) + (E[Zy ] − E[Zx ])|
≤ |Zx (η) − Zy (η)| + |E[Zx ] − E[Zy ]| ≤ Lδ(x, y) + |E[Zx ] − E[Zy ]| (188)
= Lδ(x, y) + |E[Zx − Zy ]| ≤ Lδ(x, y) + E[|Zx − Zy |] ≤ Lδ(x, y) + Lδ(x, y) = 2Lδ(x, y).

This proves item (i). Next observe that item (i) implies that for all η ∈ Ω it holds that E ∋ x 7→
|Zx (η) − E[Zx ]| ∈ R is a continuous function. Combining this and the hypothesis that E is separable with
Lemma 3.13 establishes item (ii). The proof of Lemma 3.14 is thus completed.

3.2.4 Concentration inequalities for random fields


LemmaS 3.15. Let (E, δ) be a separable metric space, let ε, L ∈ R, N ∈ N, z1 , z2 , . . . , zN ∈ E satisfy
E ⊆ N i=1 {x ∈ E : 2Lδ(x, zi ) ≤ ε}, let (Ω, F , P) be a probability space, and let Zx : Ω → R, x ∈ E, be
random variables which satisfy for all x, y ∈ E that |Zx − Zy | ≤ Lδ(x, y). Then
N
X
ε

P(supx∈E |Zx | ≥ ε) ≤ P |Zzi | ≥ 2
(189)
i=1

(cf. Lemma 3.13).

Proof of Lemma 3.15. Throughout this proof let B1 , B2 , . . . , BN ⊆ E satisfy for all i ∈ {1, 2, . . . , N} that
Bi = {x ∈ E : 2Lδ(x, zi ) ≤ ε}. Observe that the triangle inequality and the hypothesis that for all
x, y ∈ E it holds that |Zx − Zy | ≤ Lδ(x, y) show that for all i ∈ {1, 2, . . . , N}, x ∈ Bi it holds that
ε
|Zx | = |Zx − Zzi + Zzi | ≤ |Zx − Zzi | + |Zzi | ≤ Lδ(x, zi ) + |Zzi | ≤ 2
+ |Zzi |. (190)

Combining this with Lemma 3.13 proves that for all i ∈ {1, 2, . . . , N} it holds that
  
P supx∈Bi |Zx | ≥ ε ≤ P 2ε + |Zzi | ≥ ε = P |Zzi | ≥ 2ε . (191)

This and Lemma 3.13 establish that


  S  
N
P(supx∈E |Zx | ≥ ε) = P supx∈(SN Bi ) |Zx | ≥ ε = P i=1 supx∈Bi |Zx | ≥ ε
i=1

N
X  N
X  (192)
ε
≤ P supx∈Bi |Zx | ≥ ε ≤ P |Zzi | ≥ 2
.
i=1 i=1

This completes the proof of Lemma 3.15.

Lemma 3.16. Let (E, δ) be a separable metric space, assume E 6= ∅, let ε, L ∈ (0, ∞), let (Ω, F , P) be
a probability space, and let Zx : Ω → R, x ∈ E, be random variables which satisfy for all x, y ∈ E that
|Zx − Zy | ≤ Lδ(x, y). Then
 −1 ε

C(E,δ), 2L
ε P(supx∈E |Zx | ≥ ε) ≤ supx∈E P |Zx | ≥ 2
. (193)

(cf. Definition 3.11 and Lemma 3.13).

34
Proof of Lemma 3.16. Throughout this proof let N ∈ N ∪ {∞} satisfy N = C(E,δ), 2L ε , assume without
SN ε
loss of generality that N < ∞, and let z1 , z2 , . . . , zN ∈ E satisfy E ⊆ i=1 {x ∈ E : δ(x, zi ) ≤ 2L } (cf.
Definition 3.11). Observe that Lemma 3.13 and Lemma 3.15 establish that
N
X
ε
  
P(supx∈E |Zx | ≥ ε) ≤ P |Zzi | ≥ 2
≤ N supx∈E P |Zx | ≥ 2ε . (194)
i=1

This completes the proof of Lemma 3.16.

Lemma 3.17. Let (E, δ) be a separable metric space, assume E 6= ∅, let ε, L ∈ (0, ∞), let (Ω, F , P) be
a probability space, and let Zx : Ω → R, x ∈ E, be random variables which satisfy for all x, y ∈ E that
E[|Zx |] < ∞ and |Zx − Zy | ≤ Lδ(x, y). Then
 −1 ε

C(E,δ), 4L
ε P(supx∈E |Zx − E[Zx ]| ≥ ε) ≤ supx∈E P |Zx − E[Zx ]| ≥ 2
. (195)

(cf. Definition 3.11 and Lemma 3.14).

Proof of Lemma 3.17. Throughout this proof let Yx : Ω → R, x ∈ E, satisfy for all x ∈ E, η ∈ Ω that
Yx (η) = Zx (η) − E[Zx ]. Observe that Lemma 3.14 ensures that for all x, y ∈ E it holds that

|Yx − Yy | ≤ 2Lδ(x, y). (196)

This and Lemma 3.16 (with (E, δ) ← (E, δ), ε ← ε, L ← 2L, (Ω, F , P) ← (Ω, F , P), (Zx )x∈E ← (Yx )x∈E
in the notation of Lemma 3.16) establish (195). The proof of Lemma 3.17 is thus completed.

Lemma 3.18. Let (E, δ) be a separable metric space, assume E 6= ∅, let M ∈ N, ε, L, D ∈ (0, ∞),
let (Ω, F , P) be a probability space, for every x ∈ E let Yx,1, Yx,2, . . . , Yx,M : Ω → [0, D] be independent
random variables, assume for all x, y ∈ E, m ∈ {1, 2, . . . , M} that |Yx,m − Yy,m | ≤ Lδ(x, y), and let
Zx : Ω → [0, ∞), x ∈ E, satisfy for all x ∈ E that
"M #
1 X
Zx = Yx,m . (197)
M m=1

Then

(i) it holds for all x ∈ E that E[|Zx |] ≤ D < ∞,

(ii) it holds that Ω ∋ η 7→ supx∈E |Zx (η) − E[Zx ]| ∈ [0, ∞] is an F /B([0, ∞])-measurable function, and

(iii) it holds that  


−ε2 M
P(supx∈E |Zx − E[Zx ]| ≥ ε) ≤ 2C(E,δ), 4L
ε exp (198)
2D 2

(cf. Definition 3.11).

Proof of Lemma 3.18. First, observe that the triangle inequality and the hypothesis that for all x, y ∈ E,
m ∈ {1, 2, . . . , M} it holds that |Yx,m − Yy,m | ≤ Lδ(x, y) imply that for all x, y ∈ E it holds that
"M # "M # M
1 X 1 X 1 X 
|Zx − Zy | = Yx,m − Yy,m = Yx,m − Yy,m
M m=1 M m=1 M m=1
"M # (199)
1 X
≤ Yx,m − Yy,m ≤ Lδ(x, y).
M m=1

35
Next note that the hypothesis that for all x ∈ E, m ∈ {1, 2, . . . , M}, ω ∈ Ω it holds that |Yx,m (ω)| ∈ [0, D]
ensures that for all x ∈ E it holds that
" "M ## "M #
  1 X 1 X  
E |Zx | = E Yx,m = E Yx,m ≤ D < ∞. (200)
M m=1 M m=1

This proves item (i). Furthermore, note that item (i), (199), and Lemma 3.14 establish item (ii). Next
observe that (197) shows that for all x ∈ E it holds that
"M # " "M ## M
1 X 1 X 1 X  
|Zx − E[Zx ]| = Yx,m − E Yx,m = Yx,m − E Yx,m . (201)
M m=1 M m=1 M m=1

Combining this with Proposition 3.10 (with (Ω, F , P) ← (Ω, F , P), N ← M, ε ← 2ε , (a1 , a2 , . . . , aN ) ←
(0, 0, . . . , 0), (b1 , b2 , . . . , bN ) ← (D, D, . . . , D), (Xn )n∈{1,2,...,N } ← (Yx,m)m∈{1,2,...,M } for x ∈ E in the nota-
tion of Proposition 3.10) ensures that for all x ∈ E it holds that
 ε 2 2 !  2 
ε
 −2 2
M −ε M
P |Zx − E[Zx ]| ≥ 2 ≤ 2 exp 2
= 2 exp . (202)
MD 2D 2

Combining this, (199), and (200) with Lemma 3.17 establishes item (iii). The proof of Lemma 3.18 is
thus completed.

3.2.5 Uniform estimates for the statistical learning error


Lemma 3.19. Let (E, δ) be a separable metric space, assume E 6= ∅, let M ∈ N, ε, L, D ∈ (0, ∞), let
(Ω, F , P) be a probability space, let Xx,m : Ω → R, x ∈ E, m ∈ {1, 2, . . . , M}, and Ym : Ω → R, m ∈
{1, 2, . . . , M}, be functions, assume for all x ∈ E that (Xx,m, Ym ), m ∈ {1, 2, . . . , M}, are i.i.d. random
variables, assume for all x, y ∈ E, m ∈ {1, 2, . . . , M} that |Xx,m − Xy,m | ≤ Lδ(x, y) and |Xx,m − Ym | ≤ D,
let Ex : Ω → [0, ∞), x ∈ E, satisfy for all x ∈ E that
"M #
1 X
Ex = |Xx,m − Ym |2 , (203)
M m=1

and let Ex ∈ [0, ∞), x ∈ E, satisfy for all x ∈ E that Ex = E[|Xx,1 − Y1 |2 ]. Then Ω ∋ ω 7→ supx∈E |Ex (ω) −
Ex | ∈ [0, ∞] is an F /B([0, ∞])-measurable function and
 2 
−ε M
P(supx∈E |Ex − Ex | ≥ ε) ≤ 2C(E,δ), 8LD
ε exp (204)
2D 4

(cf. Definition 3.11).

Proof of Lemma 3.19. Throughout this proof let Ex,m : Ω → [0, D 2 ], x ∈ E, m ∈ {1, 2, . . . , M}, satisfy for
all x ∈ E, m ∈ {1, 2, . . . , M} that
Ex,m = |Xx,m − Ym |2 . (205)
Observe that the fact that for all x1 , x2 , y ∈ R it holds that (x1 −y)2 −(x2 −y)2 = (x1 −x2 )((x1 −y)+(x2 −y)),
the hypothesis that for all x ∈ E, m ∈ {1, 2, . . . , M} it holds that |Xx,m − Ym | ≤ D, and the hypothesis
that for all x, y ∈ E, m ∈ {1, 2, . . . , M} it holds that |Xx,m − Xy,m | ≤ Lδ(x, y) imply that for all x, y ∈ E,
m ∈ {1, 2, . . . , M} it holds that

|Ex,m − Ey,m | = (Xx,m − Ym )2 − (Xy,m − Ym )2 = |Xx,m − Xy,m | (Xx,m − Ym ) + (Xy,m − Ym )


 (206)
≤ |Xx,m − Xy,m | |Xx,m − Ym | + |Xy,m − Ym | ≤ 2D|Xx,m − Xy,m | ≤ 2LDδ(x, y).

36
In addition, note that (203) and the hypothesis that for all x ∈ E it holds that (Xx,m , Ym ), m ∈
{1, 2, . . . , M}, are i.i.d. random variables show that for all x ∈ E it holds that
"M # "M # "M #
  1 X   1 X   1 X
E Ex = E |Xx,m − Ym |2 = E |Xx,1 − Y1 |2 = Ex = Ex . (207)
M m=1 M m=1 M m=1

Furthermore, observe that the hypothesis that for all x ∈ E it holds that (Xx,m , Ym), m ∈ {1, 2, . . . , M},
are i.i.d. random variables ensures that for all x ∈ E it holds that Ex,m , m ∈ {1, 2, . . . , M}, are i.i.d.
random variables. Combining this, (206), and (207) with Lemma 3.18 (with (E, δ) ← (E, δ), M ← M,
ε ← ε, L ← 2LD, D ← D 2 , (Ω, F , P) ← (Ω, F , P), (Yx,m)x∈E, m∈{1,2,...,M } ← (Ex,m )x∈E, m∈{1,2,...,M } ,
(Zx )x∈E = (Ex )x∈E in the notation of Lemma 3.18) establishes (204). The proof of Lemma 3.19 is thus
completed.
Lemma 3.20. Let d, d, M ∈ N, R, L, R, ε ∈ (0, ∞), let D ⊆ Rd be a compact set, let (Ω, F , P) be a
probability space, let Xm : Ω → D, m ∈ {1, 2, . . . , M}, and Ym : Ω → R, m ∈ {1, 2, . . . , M}, be functions,
assume that (Xm , Ym ), m ∈ {1, 2, . . . , M}, are i.i.d. random variables, let H = (Hθ )θ∈[−R,R]d : [−R, R]d →
C(D, R) satisfy for all θ, ϑ ∈ [−R, R]d , x ∈ D that |Hθ (x) − Hϑ (x)| ≤ L|||θ − ϑ|||, assume for all θ ∈
[−R, R]d , m ∈ {1, 2, . . . , M} that |Hθ (Xm ) − Ym | ≤ R and E[|Y1 |2 ] < ∞, let E : C(D, R) → [0, ∞) satisfy
for all f ∈ C(D, R) that E(f ) = E[|f (X1 ) − Y1 |2 ], and let E : [−R, R]d × Ω → [0, ∞) satisfy for all
θ ∈ [−R, R]d , ω ∈ Ω that "M #
1 X 2
E(θ, ω) = |Hθ (Xm (ω)) − Ym (ω)| (208)
M m=1
(cf. Definition 2.20). Then Ω ∋ ω 7→ supθ∈[−R,R]d |E(θ, ω) − E(Hθ )| ∈ [0, ∞] is an F /B([0, ∞])-measurable
function and
  d   2 
 32LRR −ε M
P supθ∈[−R,R]d |E(θ) − E(Hθ )| ≥ ε ≤ 2 max 1, exp . (209)
ε 2R4
Proof of Lemma 3.20. Throughout this proof let B ⊆ Rd satisfy B = [−R, R]d = {θ ∈ Rd : |||θ||| ≤ R}
and let δ : B × B → [0, ∞) satisfy for all θ, ϑ ∈ B that

δ(θ, ϑ) = |||θ − ϑ|||. (210)

Observe that the hypothesis that (Xm , Ym ), m ∈ {1, 2, . . . , M}, are i.i.d. random variables and the hy-
pothesis that for all θ ∈ [−R, R]d it holds that Hθ is a continuous function imply that for all θ ∈ B it holds
that (Hθ (Xm ), Ym), m ∈ {1, 2, . . . , M}, are i.i.d. random variables. Combining this, the hypothesis that
for all θ, ϑ ∈ B, x ∈ D it holds that |Hθ (x) − Hϑ (x)| ≤ L|||θ − ϑ|||, and the hypothesis that for all θ ∈ B,
m ∈ {1, 2, . . . , M} it holds that |Hθ (Xm ) − Ym | ≤ R with Lemma 3.19 (with (E, δ) ← (B, δ), M ← M,
ε ← ε, L ← L, D ← R, (Ω, F , P) ← (Ω, F , P), (Xx,m )x∈E, m∈{1,2,...,M } ←  (Hθ (Xm ))θ∈B, m∈{1,2,...,M } ,
(Ym )m∈{1,2,...,M } ← (Ym )m∈{1,2,...,M } , (Ex )x∈E ← (Ω ∋ ω 7→ E(θ, ω) ∈ [0, ∞)) θ∈B , (Ex )x∈E ← (E(Hθ ))θ∈B
in the notation of Lemma 3.19) establishes that Ω ∋ ω 7→ supθ∈B |E(θ, ω) − E(Hθ )| ∈ [0, ∞] is an
F /B([0, ∞])-measurable function and
 2 
 −ε M
P supθ∈B |E(θ) − E(Hθ )| ≥ ε ≤ 2C(B,δ), 8LR exp
ε (211)
2R4
(cf. Definition 3.11). Moreover, note that Proposition 3.12 (with X ← Rd , k·k ← (Rd ∋ x 7→ |||x||| ∈
ε
[0, ∞)), R ← R, r ← 8LR , B ← B, δ ← δ in the notation of Proposition 3.12) demonstrates that
  d 
32LRR
C(B,δ), 8LR
ε ≤ max 1, . (212)
ε
This and (211) prove (209). The proof of Lemma 3.20 is thus completed.

37
Lemma 3.21. LetP d, M, L ∈ N, u ∈ R, v ∈ (u, ∞), R ∈ [1, ∞), ε, b ∈ (0, ∞), l = (l0 , l1 , . . . , lL ) ∈ NL+1
satisfy lL = 1 and Lk=1 lk (lk−1 + 1) ≤ d, let D ⊆ [−b, b]l0 be a compact set, let (Ω, F , P) be a probability
space, let Xm : Ω → D, m ∈ {1, 2, . . . , M}, and Ym : Ω → [u, v], m ∈ {1, 2, . . . , M}, be functions, assume
that (Xm , Ym), m ∈ {1, 2, . . . , M}, are i.i.d. random variables, let E : C(D, R) → [0, ∞) satisfy for all
f ∈ C(D, R) that E(f ) = E[|f (X1 ) − Y1 |2 ], and let E : [−R, R]d × Ω → [0, ∞) satisfy for all θ ∈ [−R, R]d ,
ω ∈ Ω that "M #
1 X
E(θ, ω) = |N θ,l (Xm (ω)) − Ym (ω)|2 (213)
M m=1 u,v
θ,l

(cf. Definitions 2.20 and 2.8). Then Ω ∋ ω 7→ supθ∈[−R,R]d E(θ, ω)−E Nu,v |D ∈ [0, ∞] is an F /B([0, ∞])-
measurable function and
θ,l
 
P supθ∈[−R,R]d E(θ) − E Nu,v |D ≥ ε
  d   
32L max{1, b}(|||l||| + 1)L RL (v − u) −ε2 M (214)
≤ 2 max 1, exp .
ε 2(v − u)4
Proof of Lemma 3.21. Throughout this proof let L ∈ (0, ∞) satisfy
L = L max{1, b} (|||l||| + 1)L RL−1 . (215)
Observe that Corollary 2.37 (with a ← −b, b ← b, u ← u, v ← v, d ← d, L ← L, l ← l in the notation of
Corollary 2.37) and the hypothesis that D ⊆ [−b, b]l0 show that for all θ, ϑ ∈ [−R, R]d it holds that
θ,l ϑ,l θ,l ϑ,l
sup |Nu,v (x) − Nu,v (x)| ≤ sup |Nu,v (x) − Nu,v (x)|
x∈D x∈[−b,b]l0

≤ L max{1, b} (|||l||| + 1)L (max{1, |||θ|||, |||ϑ|||})L−1 |||θ − ϑ||| (216)


≤ L max{1, b} (|||l||| + 1)L RL−1 |||θ − ϑ||| = L|||θ − ϑ|||.

Furthermore, observe that the fact that for all θ ∈ Rd , x ∈ Rl0 it holds that Nu,v θ,l
(x) ∈ [u, v] and the
hypothesis that for all m ∈ {1, 2, . . . , M}, ω ∈ Ω it holds that Ym (ω) ∈ [u, v] demonstrate that for all
θ ∈ [−R, R]d , m ∈ {1, 2, . . . , M} it holds that
θ,l
|Nu,v (Xm ) − Ym | ≤ v − u. (217)
Combining this and (216) with Lemma 3.20 (with d ← l0 , d ← d, M ← M, R ← R, L ← L, R ← v − u,
ε ← ε, D ← D, (Ω, F , P) ← (Ω, F , P), (Xm )m∈{1,2,...,M } ← (Xm )m∈{1,2,...,M } , (Ym )m∈{1,2,...,M } ← ((Ω ∋ ω 7→
θ,l
Ym (ω) ∈ R))m∈{1,2,...,M } , H ← ([−R, R]d ∋ θ 7→ Nu,v |D ∈ C(D, R)), E ← E, E ← E in the notation of
θ,l

Lemma 3.20) establishes that Ω ∋ ω 7→ supθ∈[−R,R]d E(θ, ω) − E Nu,v |D ∈ [0, ∞] is an F /B([0, ∞])-
measurable function and
  d   
θ,l
  32LR(v − u) −ε2 M
P supθ∈[−R,R]d E(θ) − E Nu,v |D ≥ ε ≤ 2 max 1, exp . (218)
ε 2(v − u)4
The proof of Lemma 3.21 is thus completed.

3.3 Analysis of the optimization error


3.3.1 Convergence rates for the minimum Monte Carlo method
Lemma 3.22. Let (Ω, F , P) be a probability space, let d, N ∈ N, let k·k : Rd → [0, ∞) be a norm, let
H ⊆ Rd be a set, let ϑ ∈ H, L, ε ∈ (0, ∞), let E : H × Ω → R be a (B(H) ⊗ F )/B(R)-measurable function,
assume for all x, y ∈ H, ω ∈ Ω that |E(x, ω) − E(y, ω)| ≤ Lkx − yk, and let Θn : Ω → H, n ∈ {1, 2, . . . , N},
be i.i.d. random variables. Then
    N 
P minn∈{1,2,...,N } E(Θn ) − E(ϑ) > ε ≤ P kΘ1 − ϑk > Lε ≤ exp −N P kΘ1 − ϑk ≤ Lε . (219)

38
Proof of Lemma 3.22. Note that the hypothesis that for all x, y ∈ H, ω ∈ Ω it holds that |E(x, ω) −
E(y, ω)| ≤ Lkx − yk implies that
 
minn∈{1,2,...,N } E(Θn ) − E(ϑ) = minn∈{1,2,...,N } [E(Θn ) − E(ϑ)]
 
≤ minn∈{1,2,...,N } |E(Θn ) − E(ϑ)| ≤ minn∈{1,2,...,N } LkΘn − ϑk (220)
 
= L minn∈{1,2,...,N } kΘn − ϑk .

The hypothesis that Θn , n ∈ {1, 2, . . . , N}, are i.i.d. random variables and the fact that ∀ x ∈ R : 1 − x ≤
e−x hence show that
      
P minn∈{1,2,...,N } E(Θn ) − E(ϑ) > ε ≤ P L minn∈{1,2,...,N } kΘn − ϑk > ε
  N
= P minn∈{1,2,...,N } kΘn − ϑk > Lε = P kΘ1 − ϑk > Lε (221)
 N 
= 1 − P kΘ1 − ϑk ≤ Lε ≤ exp −N P kΘ1 − ϑk ≤ Lε .

The proof of Lemma 3.22 is thus completed.

3.3.2 Continuous uniformly distributed samples


Lemma 3.23. Let (Ω, F , P) be a probability space, let d, N ∈ N, a ∈ R, b ∈ (a, ∞), ϑ ∈ [a, b]d , L, ε ∈
(0, ∞), let E : [a, b]d × Ω → R be a (B([a, b]d ) ⊗ F )/B(R)-measurable function, assume for all x, y ∈ [a, b]d ,
ω ∈ Ω that |E(x, ω) − E(y, ω)| ≤ L|||x − y|||, let Θn : Ω → [a, b]d , n ∈ {1, 2, . . . , N}, be i.i.d. random
variables, and assume that Θ1 is continuous uniformly distributed on [a, b]d (cf. Definition 2.20). Then
  
   εd
P minn∈{1,2,...,N } E(Θn ) − E(ϑ) > ε ≤ exp −N min 1, d . (222)
L (b − a)d

Proof of Lemma 3.23. Note that the hypothesis that Θ1 is continuous uniformly distributed on [a, b]d
ensures that
  
P |||Θ1 − ϑ||| ≤ Lε ≥ P |||Θ1 − (a, a, . . . , a)||| ≤ Lε = P |||Θ1 − (a, a, . . . , a)||| ≤ min{ Lε , b − a}
  (  d )
min{ Lε , b − a} d ε (223)
= = min 1, .
(b − a) L (b − a)

Combining this with Lemma 3.22 proves (222). The proof of Lemma 3.23 is thus completed.

4 Overall error analysis


In this section we combine the separate error analyses of the approximation error, the generalization error,
and the optimization error in Section 3 to obtain an overall analysis (cf. Theorem 4.5 below). We note that,
e.g., [6, Lemma 2.4] ensures that the integral appearing on the left-hand side of (238) in Theorem 4.5
and subsequent results (cf. (251) in Corollary 4.6, (259) in Corollary 4.7, (269) in Corollary 4.8, and
(274) in Corollary 4.10) is indeed measurable. In Lemma 4.1 below we present the well-known bias-
variance decomposition result. To formulate this bias-variance decomposition lemma we observe that
for every probability space (Ω, F , P), every measurable space (S, S), every random variable X : Ω → S,
and every A ∈ S it holds that PX (A) = P(X ∈ A). Moreover, note that for every probability space
(Ω, F , P), every measurable spaceR (S, S), every Rrandom variable X : RΩ → S, and every S/B(R)-measurable
R
function
 f: S → R it holds that S |f |2 dPX = S |f (x)|2 PX (dx) = Ω |f (X(ω))|2 P(dω) = Ω |f (X)|2 dP =
E |f (X)|2 . A result related to Lemmas 4.1 and 4.2 can, e.g., be found in Berner et al. [10, Lemma 2.8].

39
4.1 Bias-variance decomposition
Lemma 4.1 (Bias-variance decomposition). Let (Ω, F , P) be a probability space, let (S, S) be a measurable
space, let X : Ω → S and Y : Ω → R be random variables with E[|Y |2 ] < ∞, and let E : L2 (PX ; R) → [0, ∞)
satisfy for all f ∈ L2 (PX ; R) that E(f ) = E[|f (X) − Y |2 ]. Then
(i) it holds for all f ∈ L2 (PX ; R) that
   
E(f ) = E |f (X) − E[Y |X]|2 + E |Y − E[Y |X]|2 , (224)

(ii) it holds for all f, g ∈ L2 (PX ; R) that


   
E(f ) − E(g) = E |f (X) − E[Y |X]|2 − E |g(X) − E[Y |X]|2 , (225)
and
(iii) it holds for all f, g ∈ L2 (PX ; R) that
    
E |f (X) − E[Y |X]|2 = E |g(X) − E[Y |X]|2 + E(f ) − E(g) . (226)

Proof of Lemma 4.1. First, observe that the hypothesis that for all f ∈ L2 (PX ; R) it holds that E(f ) =
E[|f (X) − Y |2 ] shows that for all f ∈ L2 (PX ; R) it holds that
   
E(f ) = E |f (X) − Y |2 = E |(f (X) − E[Y |X]) + (E[Y |X] − Y )|2
      
= E |f (X) − E[Y |X]|2 + 2 E f (X) − E[Y |X] E[Y |X] − Y + E |E[Y |X] − Y |2
  h    i  
= E |f (X) − E[Y |X]|2 + 2 E E f (X) − E[Y |X] E[Y |X] − Y X + E |E[Y |X] − Y |2
  h    i   (227)
= E |f (X) − E[Y |X]|2 + 2 E f (X) − E[Y |X] E E[Y |X] − Y X + E |E[Y |X] − Y |2
      
= E |f (X) − E[Y |X]|2 + 2 E f (X) − E[Y |X] E[Y |X] − E[Y |X] + E |E[Y |X] − Y |2
   
= E |f (X) − E[Y |X]|2 + E |E[Y |X] − Y |2 .
This implies that for all f, g ∈ L2 (PX ; R) it holds that
   
E(f ) − E(g) = E |f (X) − E[Y |X]|2 − E |g(X) − E[Y |X]|2 . (228)
Hence, we obtain that for all f, g ∈ L2 (PX ; R) it holds that
   
E |f (X) − E[Y |X]|2 = E |g(X) − E[Y |X]|2 + E(f ) − E(g). (229)
Combining this with (227) and (228) establishes items (i), (ii), and (iii). The proof of Lemma 4.1 is thus
completed.

4.2 Overall error decomposition


Lemma 4.2. Let (Ω, F , P) be a probability space, let d, M ∈ N, let D ⊆ Rd be a compact set, let Xm : Ω →
D, m ∈ {1, 2, . . . , M}, and Ym : Ω → R, m ∈ {1, 2, . . . , M}, be functions, assume that (Xm , Ym ), m ∈
{1, 2, . . . , M}, are i.i.d. random variables, assume E[|Y1 |2 ] < ∞, let E : C(D, R) → [0, ∞) satisfy for all
f ∈ C(D, R) that E(f ) = E[|f (X1 ) − Y1 |2 ], and let E : C(D, R) × Ω → [0, ∞) satisfy for all f ∈ C(D, R),
ω ∈ Ω that "M #
1 X
E(f, ω) = |f (Xm (ω)) − Ym (ω)|2 . (230)
M m=1
Then it holds for all f, φ ∈ C(D, R) that
   
E |f (X1 ) − E[Y1 |X1 ]|2 = E |φ(X1 ) − E[Y1 |X1 ]|2 + E(f ) − E(φ)
 
 2   (231)
≤ E |φ(X1 ) − E[Y1 |X1 ]| + E(f ) − E(φ) + 2 max |E(v) − E(v)| .
v∈{f,φ}

40
Proof of Lemma 4.2. Note that Lemma 4.1 ensures that for all f, φ ∈ C(D, R) it holds that
 
E |f (X1 ) − E[Y1 |X1 ]|2
 
= E |φ(X1 ) − E[Y1 |X1 ]|2 + E(f ) − E(φ)
 
= E |φ(X1 ) − E[Y1 |X1 ]|2 + E(f ) − E(f ) + E(f ) − E(φ) + E(φ) − E(φ)
      
= E |φ(X1 ) − E[Y1 |X1 ]|2 + E(f ) − E(f ) + E(φ) − E(φ) + E(f ) − E(φ)
" # (232)
  X  
≤ E |φ(X1 ) − E[Y1 |X1 ]|2 + |E(v) − E(v)| + E(f ) − E(φ)
v∈{f,φ}
 
 2  
≤ E |φ(X1 ) − E[Y1 |X1 ]| + 2 max |E(v) − E(v)| + E(f ) − E(φ) .
v∈{f,φ}

The proof of Lemma 4.2 is thus completed.

Lemma 4.3. Let (Ω, F , P) be a probability space, let d, d, M ∈ N, let D ⊆ Rd be a compact set, let
B ⊆ Rd be a set, let H = (Hθ )θ∈B : B → C(D, R) be a function, let Xm : Ω → D, m ∈ {1, 2, . . . , M},
and Ym : Ω → R, m ∈ {1, 2, . . . , M}, be functions, assume that (Xm , Ym ), m ∈ {1, 2, . . . , M}, are i.i.d.
random variables, assume E[|Y1 |2 ] < ∞, let ϕ : D → R be a B(D)/B(R)-measurable function, assume
that it holds P-a.s. that ϕ(X1 ) = E[Y1 |X1 ], let E : C(D, R) → [0, ∞) satisfy for all f ∈ C(D, R) that
E(f ) = E[|f (X1 ) − Y1 |2 ], and let E : B × Ω → [0, ∞) satisfy for all θ ∈ B, ω ∈ Ω that
"M #
1 X
E(θ, ω) = |Hθ (Xm (ω)) − Ym (ω)|2 . (233)
M m=1

Then it holds for all θ, ϑ ∈ B that


Z Z
2
|Hθ (x) − ϕ(x)| PX1 (dx) = |Hϑ (x) − ϕ(x)|2 PX1 (dx) + E(Hθ ) − E(Hϑ )
D D
Z
 
  (234)
2
≤ |Hϑ (x) − ϕ(x)| PX1 (dx) + E(θ) − E(ϑ) + 2 sup |E(η) − E(Hη )| .
D η∈B

Proof of Lemma 4.3. First, observe that Lemma 4.2 (with (Ω, F , P) ← (Ω, F , P), d ← d, M ← M, D ←
D, (Xm )m∈{1,2,...,M } ← (Xm )m∈{1,2,...,M } , (Ym )m∈{1,2,...,M } ← (Ym )m∈{1,2,...,M } , E ← E, E ← C(D, R) × Ω ∋
PM  
(f, ω) 7→ M1 m=1 |f (X m (ω)) − Y m (ω)| 2
∈ [0, ∞) in the notation of Lemma 4.2) shows that for all
θ, ϑ ∈ B it holds that
   
E |Hθ (X1 ) − E[Y1 |X1 ]|2 = E |Hϑ (X1 ) − E[Y1 |X1 ]|2 + E(Hθ ) − E(Hϑ )
 
 2  
≤ E |Hϑ (X1 ) − E[Y1 |X1 ]| + E(θ) − E(ϑ) + 2 max |E(η) − E(Hη )|
η∈{θ,ϑ} (235)
 
   
≤ E |Hϑ (X1 ) − E[Y1 |X1 ]|2 + E(θ) − E(ϑ) + 2 sup |E(η) − E(Hη )| .
η∈B

In addition, note that the hypothesis that it holds P-a.s. that ϕ(X1 ) = E[Y1 |X1 ] ensures that for all η ∈ B
it holds that
Z
 2  2
E |Hη (X1 ) − E[Y1 |X1 ]| = E |Hη (X1 ) − ϕ(X1 )| = |Hη (x) − ϕ(x)|2 PX1 (dx). (236)
D

Combining this with (235) establishes (234). The proof of Lemma 4.3 is thus completed.

41
4.3 Analysis of the convergence speed
4.3.1 Convergence rates for convergence in probability
Lemma 4.4. Let (Ω, F ,P P) be a probability space, let u ∈ R, v ∈ (u, ∞), d, L ∈ N, let l = (l0 , l1 , . . . , lL ) ∈
NL+1 satisfy lL = 1 and Li=1 li (li−1 + 1) ≤ d, let B ⊆ Rd be a non-empty compact set, and let X : Ω → Rl0
and Y : Ω → [u, v] be random variables. Then
θ,l
(i) it holds for all θ ∈ B, ω ∈ Ω that |Nu,v (X(ω)) − Y (ω)|2 ∈ [0, (v − u)2 ],
 θ,l

(ii) it holds that B ∋ θ 7→ E |Nu,v (X) − Y |2 ∈ [0, ∞) is continuous, and
 ϑ,l
  
(iii) there exists ϑ ∈ B such that E |Nu,v (X) − Y |2 = inf E |Nu,v θ,l
(X) − Y |2
θ∈B

(cf. Definition 2.8).


Proof of Lemma 4.4. First, note that the fact that for all θ ∈ Rd , x ∈ Rl0 it holds that Nu,v
θ,l
(x) ∈ [u, v]
and the hypothesis that for all ω ∈ Ω it holds that Y (ω) ∈ [u, v] demonstrate item (i). Next observe
θ,l
that Corollary 2.37 ensures that for all ω ∈ Ω it holds that B ∋ θ 7→ |Nu,v (X(ω)) − Y (ω)|2 ∈ [0, ∞)
is a continuous function. Combining this and item (i) with Lebesgue’s dominated convergence theorem
establishes item (ii). Furthermore, note that item (ii) and the assumption that B ⊆ Rd is a non-empty
compact set prove item (iii). The proof of Lemma 4.4 is thus completed.
Theorem 4.5. Let (Ω, F , P) be a probability space, let d, d, K, M ∈ N, ε ∈ (0, ∞), L, u ∈ R, v ∈ (u, ∞),
let D ⊆ Rd be a compact set, assume |D| ≥ 2, let Xm : Ω → D, m ∈ {1, 2, . . . , M}, and Ym : Ω → [u, v],
m ∈ {1, 2, . . . , M}, be functions, assume that (Xm , Ym ), m ∈ {1, 2, . . . , M}, are i.i.d. randomP variables, let
δ : D × D → [0, ∞) satisfy for all x = (x1 , x2 , . . . , xd ), y = (y1 , y2 , . . . , yd ) ∈ D that δ(x, y) = di=1 |xi − yi |,
let ϕ : D → [u, v] satisfy P-a.s. that ϕ(X1 ) = E[Y1 |X1 ], assume for all x, y ∈ D that |ϕ(x) − ϕ(y)| ≤
Lδ(x, y), let N ∈ N ∩ [max{2, C(D,δ), 4L ε }, ∞), let l ∈ N ∩ (N, ∞), let l = (l , l , . . . , l ) ∈ Nl+1 satisfy
0 1 l
for all i ∈ N ∩ [2, N], j ∈ N ∩ [N, l) that l0 = d, l1 ≥ 2dN, li ≥ 2N − 2i + 3, lj ≥ 2, ll = 1,
Pl d
and k=1 lk (lk−1 + 1) ≤ d, let R ∈ [max{1, L, supz∈D |||z|||, 2[supz∈D |ϕ(z)|]}, ∞), let B ⊆ R satisfy
B = [−R, R]d , let E : B × Ω → [0, ∞) satisfy for all θ ∈ B, ω ∈ Ω that
"M #
1 X
E(θ, ω) = |N θ,l (Xm (ω)) − Ym (ω)|2 , (237)
M m=1 u,v

let Θk : Ω → B, k ∈ {1, 2, . . . , K}, be i.i.d. random variables, assume that Θ1 is continuous uniformly dis-
tributed on B, and let Ξ : Ω → B satisfy Ξ = Θmin{k∈{1,2,...,K} : E(Θk )=minl∈{1,2,...,K} E(Θl )} (cf. Definitions 3.11,
2.20, and 2.8). Then
Z   
Ξ,l 2 2 ε2d
P |Nu,v (x) − ϕ(x)| PX1 (dx) > ε ≤ exp −K min 1,
D (16(v − u)l(|||l||| + 1)l Rl+1 )d
    
128l(|||l||| + 1)l Rl+1 (v − u) ε4 M
+ 2 exp d ln max 1, − . (238)
ε2 32(v − u)4

Proof of Theorem 4.5. Throughout this proof let M ⊆ D satisfy |M| = max{2, C(D,δ), 4L
ε } and

  
4L sup inf δ(x, y) ≤ ε, (239)
x∈D y∈M

let b ∈ [0, ∞) satisfy b = supz∈D |||z|||, let E : C(D, R) → [0, ∞) satisfy for all f ∈ C(D, R) that E(f ) =
E[|f (X1) − Y1 |2 ], and let ϑ ∈ B satisfy E(Nu,v ϑ,l θ,l
|D ) = inf θ∈B E(Nu,v |D ) (cf. Lemma 4.4). Observe that the
hypothesis that for all x, y ∈ D it holds that |ϕ(x) − ϕ(y)| ≤ Lδ(x, y) implies that ϕ is a B(D)/B([u, v])-
measurable function. Lemma 4.3 (with (Ω, F , P) ← (Ω, F , P), d ← d, d ← d, M ← M, D ← D, B ← B,

42
θ,l
H ← (B ∋ θ 7→ Nu,v |D ∈ C(D, R)), (Xm )m∈{1,2,...,M } ← (Xm )m∈{1,2,...,M } , (Ym )m∈{1,2,...,M } ← ((Ω ∋ ω 7→
Ym (ω) ∈ R))m∈{1,2,...,M } , ϕ ← (D ∋ x 7→ ϕ(x) ∈ R), E ← E, E ← E in the notation of Lemma 4.3)
therefore ensures that for all ω ∈ Ω it holds that
Z
Ξ(ω),l
|Nu,v (x) − ϕ(x)|2 PX1 (dx)
D
Z  
ϑ,l 2
  θ,l (240)
≤ |Nu,v (x) − ϕ(x)| PX1 (dx) + E(Ξ(ω), ω) − E(ϑ, ω) + 2 sup |E(θ, ω) − E(Nu,v |D )| .
| {z } θ∈B
|D {z } Optimization error | {z }
Approximation error Generalization error

Next observe that the assumption that N ≥ max{2, C(D,δ), 4L ε } = |M| shows that for all i ∈ N ∩ [2, N]

it holds that l ≥ |M| + 1, l1 ≥ 2d|M| and li ≥ 2|M| − 2i + 3. The hypothesis that for all x, y ∈ D
it holds that |ϕ(x) − ϕ(y)| ≤ Lδ(x, y), the hypothesis that R ≥ max{1, L, supz∈D |||z|||, 2[supz∈D |ϕ(z)|]},
Corollary 3.8 (with d ← d, d ← d, L ← l, L ← L, u ← u, v ← v, D ← D, f ← ϕ, M ← M, l ← l in the
notation of Corollary 3.8), and (239) hence ensure that there exists η ∈ B which satisfies
" d
!#
X
η,l
sup |Nu,v (x) − ϕ(x)| ≤ 2L sup inf |xi − yi |
x∈D x=(x1 ,x2 ,...,xd )∈D y=(y1 ,y2 ,...,yd )∈M i=1
   (241)
ε
= 2L sup inf δ(x, y) ≤ .
x∈D y∈M 2
Lemma 4.3 (with (Ω, F , P) ← (Ω, F , P), d ← d, d ← d, M ← M, D ← D, B ← B, H ← (B ∋
θ,l
θ 7→ Nu,v |D ∈ C(D, R)), (Xm )m∈{1,2,...,M } ← (Xm )m∈{1,2,...,M } , (Ym )m∈{1,2,...,M } ← ((Ω ∋ ω 7→ Ym (ω) ∈
R))m∈{1,2,...,M } , ϕ ← (D ∋ x 7→ ϕ(x) ∈ R), E ← E, E ← E in the notation of Lemma 4.3) and the
ϑ,l θ,l
assumption that E(Nu,v |D ) = inf θ∈B E(Nu,v |D ) therefore prove that
Z Z
ϑ,l 2 η,l
|Nu,v (x) − ϕ(x)| PX1 (dx) = |Nu,v (x) − ϕ|2 PX1 (dx) + E(Nu,v
ϑ,l η,l
|D ) − E(Nu,v |D )
D D | {z }
≤0 (242)
Z 2
η,l ε
≤ |Nu,v (x) − ϕ(x)|2 PX1 (dx) ≤ sup |Nu,v η,l
(x) − ϕ(x)|2 ≤ .
D x∈D 4
Combining this with (240) shows that for all ω ∈ Ω it holds that
Z  
Ξ(ω),l 2 ε2   θ,l
|Nu,v (x) − ϕ(x)| PX1 (dx) ≤ + E(Ξ(ω), ω) − E(ϑ, ω) + 2 sup |E(θ, ω) − E(Nu,v |D )| . (243)
D 4 θ∈B

Hence, we obtain that


Z     
Ξ,l 2 2
  θ,l 3ε2
P |Nu,v (x) − ϕ(x)| PX1 (dx) > ε ≤ P E(Ξ) − E(ϑ) + 2 sup |E(θ) − E(Nu,v |D )| >
D θ∈B 4
 2
  2

ε θ,l ε
≤ P E(Ξ) − E(ϑ) > + P sup |E(θ) − E(Nu,v |D )| > . (244)
4 θ∈B 4
Next observe that Corollary 2.37 (with a ← −b, b ← b, u ← u, v ← v, d ← d, L ← l, l ← l in the notation
of Corollary 2.37) demonstrates that for all θ, ξ ∈ B it holds that
θ,l ξ,l θ,l ξ,l
sup |Nu,v (x) − Nu,v (x)| ≤ sup |Nu,v (x) − Nu,v (x)|
x∈D x∈[−b,b]d

≤ l max{1, b}(|||l||| + 1)l (max{1, |||θ|||, |||ξ|||})l−1 |||θ − ξ||| (245)


≤ lR(|||l||| + 1)l Rl−1 |||θ − ξ||| = l(|||l||| + 1)l Rl |||θ − ξ|||.
θ,l
Combining this with the fact that for all θ ∈ Rd , x ∈ D it holds that Nu,v (x) ∈ [u, v], the hypothesis that
for all m ∈ {1, 2, . . . , M}, ω ∈ Ω it holds that Ym (ω) ∈ [u, v], the fact that for all x1 , x2 , y ∈ R it holds

43
that (x1 − y)2 − (x2 − y)2 = (x1 − x2 )((x1 − y) + (x2 − y)), and (237) ensures that for all θ, ξ ∈ B, ω ∈ Ω
it holds that

|E(θ, ω) − E(ξ, ω)|


"M # "M #
1 X 1 X
= |N θ,l (Xm (ω)) − Ym (ω)|2 − |N ξ,l (Xm (ω)) − Ym (ω)|2 (246)
M m=1 u,v M m=1 u,v
M
1 X θ,l ξ,l
 θ,l
 ξ,l

= Nu,v (Xm (ω)) − Nu,v (Xm (ω)) Nu,v (Xm (ω)) − Ym (ω) + Nu,v (Xm (ω)) − Ym (ω)
M m=1
"M #
1 X θ,l ξ,l
 θ,l ξ,l
 
≤ Nu,v (Xm (ω)) − Nu,v (Xm (ω)) |Nu,v (Xm (ω)) − Ym (ω)| + |Nu,v (Xm (ω)) − Ym (ω)|
M m=1 | {z }
≤2(v−u)
l l
≤ 2(v − u)l(|||l||| + 1) R |||θ − ξ|||.

Lemma 3.23 (with (Ω, F , P) ← (Ω, F , P), d ← d, N ← K, a ← −R, b ← R, ϑ ← ϑ, L ← 2(v − u)l(|||l||| +


2
1)l Rl , ε ← ε4 , E ← E, (Θn )n∈{1,2,...,N } ← (Θk )k∈{1,2,...,K} in the notation of Lemma 3.23) therefore shows
that
    
ε2 ε2
P E(Ξ) − E(ϑ) > =P min E(Θk ) − E(ϑ) >
4 k∈{1,2,...,K} 4
  
ε2 d 
4 (247)
≤ exp −K min 1,
[2(v − u)l(|||l||| + 1)l Rl ]d (2R)d
  
ε2d
= exp −K min 1, .
(16(v − u)l(|||l||| + 1)l Rl+1 )d
2
Moreover, note that Lemma 3.21 (with d ← d, M ← M, L ← l, u ← u, v ← v, R ← R, ε ← ε4 ,
b ← b, l ← l, D ← D, (Ω, F , P) ← (Ω, F , P), (Xm )m∈{1,2,...,M } ← (Xm )m∈{1,2,...,M } , (Ym )m∈{1,2,...,M } ←
(Ym )m∈{1,2,...,M } , E ← E, E ← E in the notation of Lemma 3.21) establishes that
 
θ,l ε2
P supθ∈B |E(θ) − E(Nu,v |D )| ≥
4
  d   
128l max{1, b}(|||l||| + 1)l Rl (v − u) −ε4 M
≤ 2 max 1, exp
ε2 32(v − u)4
  d    (248)
128l(|||l||| + 1)l Rl+1 (v − u) −ε4 M
≤ 2 max 1, exp
ε2 32(v − u)4
    
128l(|||l||| + 1)l Rl+1 (v − u) ε4 M
= 2 exp d ln max 1, − .
ε2 32(v − u)4

Combining this and (247) with (244) proves that


Z    
Ξ,l 2 2 ε2d
P |Nu,v (x) − ϕ(x)| PX1 (dx) > ε ≤ exp −K min 1,
D (16(v − u)l(|||l||| + 1)l Rl+1 )d
    
128l(|||l||| + 1)l Rl+1 (v − u) ε4 M
+ 2 exp d ln max 1, − . (249)
ε2 32(v − u)4
The proof of Theorem 4.5 is thus completed.
Corollary 4.6. Let (Ω, F , P) be a probability space, let d, d, K, M, τ ∈ N, ε ∈ (0, ∞), L, a, u ∈ R,
b ∈ (a, ∞), v ∈ (u, ∞), R ∈ [max{1, L, |a|, |b|, 2|u|, 2|v|}, ∞), let Xm : Ω → [a, b]d , m ∈ {1, 2, . . . , M},
be i.i.d. random variables, let k·k : Rd → [0, ∞) be the standard norm on Rd , let ϕ : [a, b]d → [u, v]

44
satisfy for all x, y ∈ [a, b]d that |ϕ(x) − ϕ(y)| ≤ Lkx − yk, assume τ ≥ 2d(2dL(b − a)ε−1 + 2)d and
d ≥ τ (d + 1) + (τ − 3)τ (τ + 1) + τ + 1, let l ∈ Nτ satisfy l = (d, τ, τ, . . . , τ, 1), let B ⊆ Rd satisfy
B = [−R, R]d , let E : B × Ω → [0, ∞) satisfy for all θ ∈ B, ω ∈ Ω that
"M #
1 X
E(θ, ω) = |N θ,l (Xm (ω)) − ϕ(Xm (ω))|2 , (250)
M m=1 u,v

let Θk : Ω → B, k ∈ {1, 2, . . . , K}, be i.i.d. random variables, assume that Θ1 is continuous uniformly dis-
tributed on B, and let Ξ : Ω → B satisfy Ξ = Θmin{k∈{1,2,...,K} : E(Θk )=minl∈{1,2,...,K} E(Θl )} (cf. Definition 2.8).
Then
Z 1/2 !   
Ξ,l 2 ε2d
P |Nu,v (x) − ϕ(x)| PX1 (dx) > ε ≤ exp −K min 1,
[a,b]d (16(v − u)(τ + 1)τ Rτ )d
    
128(τ + 1)τ Rτ (v − u) ε4 M
+ 2 exp d ln max 1, − . (251)
ε2 32(v − u)4

Proof of Corollary 4.6. Throughout this proof let N ∈ N satisfy


 
2dL(b − a)
N = min k ∈ N : k ≥ , (252)
ε

let M ⊆ [a, b]d satisfy M = {a, a + b−a N


, . . . , a + (N −1)(b−a)
N
, b}d , let δ : [a, b]d × [a, b]d → [0, ∞) satisfy for
P
all x = (x1 , x2 , . . . , xd ), y = (y1, y2 , . . . , yd ) ∈ [a, b]d that δ(x, y) = di=1 |xi − yi |, and let l0 , l1 , . . . , lτ −1 ∈ N
satisfy l = (l0 , l1 , . . . , lτ −1 ). Observe that for all x ∈ [a, b] there exists y ∈ {a, a + b−a N
, . . . , a + (N −1)(b−a)
N
, b}
b−a
such that |x − y| ≤ 2N . This demonstrates that
" d
!#
X 2Ld(b − a)
4L sup inf |xi − yi | ≤ ≤ ε. (253)
x=(x1 ,x2 ,...,xd )∈[a,b]d y=(y1 ,y2 ,...,yd )∈M
i=1
N

Hence, we obtain that


d
ε ≤ |M| = (N + 1) .
C([a,b]d ,δ), 4L (254)
Next note that (252) implies that N < 2dL(b − a)ε−1 + 1. The hypothesis that τ ≥ 2d(2dL(b − a)ε−1 + 2)d
therefore ensures that
τ > 2d(N + 1)d ≥ (N + 1)d + 2. (255)
Hence, we obtain that for all i ∈ {2, 3, . . . , (N + 1)d }, j ∈ {(N + 1)d + 1, (N + 1)d + 2, . . . , τ − 2} it holds
that

l0 = d, l1 = τ ≥ 2d(N + 1)d , lτ −1 = 1, li = τ ≥ 2(N + 1)d − 2i + 3, and lj = τ ≥ 2. (256)

Furthermore, observe that the hypothesis that for all x, y ∈ [a, b]d it holds that |ϕ(x) − ϕ(y)| ≤ Lkx − yk
implies that for all x, y ∈ [a, b]d it holds that |ϕ(x) − ϕ(y)| ≤ Lδ(x, P y). Combining this, (254), (255),
τ −1
(256), and the hypothesis that d ≥ τ (d + 1) + (τ − 3)τ (τ + 1) + τ + 1 = i=1 li (li−1 + 1) with Theorem 4.5
(with (Ω, F , P) ← (Ω, F , P), d ← d, d ← d, K ← K, M ← M, ε ← ε, L ← L, u ← u, v ← v, D ← [a, b]d ,
(Xm )m∈{1,2,...,M } ← (Xm )m∈{1,2,...,M } , (Ym )m∈{1,2,...,M } ← (ϕ(Xm ))m∈{1,2,...,M } , δ ← δ, ϕ ← ϕ, N ← (N +1)d ,
l ← τ − 1, l ← l, R ← R, B ← B, E ← E, (Θk )k∈{1,2,...,K} ← (Θk )k∈{1,2,...,K}, Ξ ← Ξ in the notation of

45
Theorem 4.5) establishes that
Z 1/2 !
Ξ,l 2
P |Nu,v (x) − ϕ(x)| PX1 (dx) >ε
[a,b]d
  
ε2d
≤ exp −K min 1,
(16(v − u)(τ − 1)(τ + 1)τ −1 Rτ )d
    
128(τ − 1)(τ + 1)τ −1 Rτ (v − u) ε4 M (257)
+ 2 exp d ln max 1, −
ε2 32(v − u)4
  
ε2d
≤ exp −K min 1,
(16(v − u)(τ + 1)τ Rτ )d
    
128(τ + 1)τ Rτ (v − u) ε4 M
+ 2 exp d ln max 1, − .
ε2 32(v − u)4

The proof of Corollary 4.6 is thus completed.

Corollary 4.7. Let (Ω, F, P) be a probability space, let d ∈ ℕ, L, a, u ∈ ℝ, b ∈ (a, ∞), v ∈ (u, ∞), R ∈ [max{1, L, |a|, |b|, 2|u|, 2|v|}, ∞), let X_m : Ω → [a, b]^d, m ∈ ℕ, be i.i.d. random variables, let ‖·‖ : ℝ^d → [0, ∞) be the standard norm on ℝ^d, let ϕ : [a, b]^d → [u, v] satisfy for all x, y ∈ [a, b]^d that |ϕ(x) − ϕ(y)| ≤ L‖x − y‖, let l_τ ∈ ℕ^τ, τ ∈ ℕ, satisfy for all τ ∈ ℕ ∩ [3, ∞) that l_τ = (d, τ, τ, . . . , τ, 1), let E_{𝔡,M,τ} : [−R, R]^𝔡 × Ω → [0, ∞), 𝔡, M, τ ∈ ℕ, satisfy for all 𝔡, M ∈ ℕ, τ ∈ ℕ ∩ [3, ∞), θ ∈ [−R, R]^𝔡, ω ∈ Ω with 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 that
\[
E_{\mathfrak{d},M,\tau}(\theta, \omega) = \frac{1}{M} \left[ \sum_{m=1}^{M} \bigl| \mathcal{N}^{\theta, l_\tau}_{u,v}(X_m(\omega)) - \varphi(X_m(\omega)) \bigr|^2 \right], \tag{258}
\]
for every 𝔡 ∈ ℕ let Θ_{𝔡,k} : Ω → [−R, R]^𝔡, k ∈ ℕ, be i.i.d. random variables, assume for all 𝔡 ∈ ℕ that Θ_{𝔡,1} is continuous uniformly distributed on [−R, R]^𝔡, and let Ξ_{𝔡,K,M,τ} : Ω → [−R, R]^𝔡, 𝔡, K, M, τ ∈ ℕ, satisfy for all 𝔡, K, M, τ ∈ ℕ that Ξ_{𝔡,K,M,τ} = Θ_{𝔡, min{k ∈ {1,2,...,K} : E_{𝔡,M,τ}(Θ_{𝔡,k}) = min_{l ∈ {1,2,...,K}} E_{𝔡,M,τ}(Θ_{𝔡,l})}} (cf. Definition 2.8). Then there exists c ∈ (0, ∞) such that for all 𝔡, K, M, τ ∈ ℕ, ε ∈ (0, v − u] with τ ≥ 2d(2dL(b − a)ε^{−1} + 2)^d and 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 it holds that
\[
\mathbb{P}\Biggl( \biggl[ \int_{[a,b]^d} \bigl| \mathcal{N}^{\Xi_{\mathfrak{d},K,M,\tau}, l_\tau}_{u,v}(x) - \varphi(x) \bigr|^2 \, \mathbb{P}_{X_1}(dx) \biggr]^{1/2} > \varepsilon \Biggr)
\le \exp\bigl( -K (c\tau)^{-\tau\mathfrak{d}} \varepsilon^{2\mathfrak{d}} \bigr) + 2 \exp\bigl( \mathfrak{d} \ln\bigl( (c\tau)^{\tau} \varepsilon^{-2} \bigr) - c^{-1} \varepsilon^{4} M \bigr). \tag{259}
\]

Proof of Corollary 4.7. Throughout this proof let c ∈ (0, ∞) satisfy

\[
c = \max\bigl\{ 32(v-u)^{4}, \, 256(v-u+1)R \bigr\}. \tag{260}
\]

Note that Corollary 4.6 establishes that for all 𝔡, K, M, τ ∈ ℕ, ε ∈ (0, ∞) with τ ≥ 2d(2dL(b − a)ε^{−1} + 2)^d and 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 it holds that
\[
\begin{split}
\mathbb{P}\Biggl( \biggl[ \int_{[a,b]^d} \bigl| \mathcal{N}^{\Xi_{\mathfrak{d},K,M,\tau}, l_\tau}_{u,v}(x) - \varphi(x) \bigr|^2 \, \mathbb{P}_{X_1}(dx) \biggr]^{1/2} > \varepsilon \Biggr)
&\le \exp\Biggl( -K \min\Biggl\{ 1, \frac{\varepsilon^{2\mathfrak{d}}}{\bigl(16(v-u)(\tau+1)^{\tau} R^{\tau}\bigr)^{\mathfrak{d}}} \Biggr\} \Biggr) \\
&\quad + 2 \exp\Biggl( \mathfrak{d} \ln\Biggl( \max\Biggl\{ 1, \frac{128(\tau+1)^{\tau} R^{\tau} (v-u)}{\varepsilon^{2}} \Biggr\} \Biggr) - \frac{\varepsilon^{4} M}{32(v-u)^{4}} \Biggr). \tag{261}
\end{split}
\]

Next observe that (260) ensures that for all τ ∈ N it holds that

\[
16(v-u)(\tau+1)^{\tau} R^{\tau} \le \bigl(16(v-u+1)(\tau+1)R\bigr)^{\tau} \le \bigl(32(v-u+1)R\tau\bigr)^{\tau} \le (c\tau)^{\tau}. \tag{262}
\]


Moreover, note that the hypothesis that R ≥ max{2|u|, 2|v|} ensures that v − u ≤ |u| + |v| ≤ R. The fact that for all ε ∈ (0, v − u], τ ∈ ℕ it therefore holds that ε² ≤ (v − u)R ≤ 16(v − u)(τ + 1)^τ R^τ shows that for all ε ∈ (0, v − u], τ ∈ ℕ it holds that
\[
- \min\Biggl\{ 1, \frac{\varepsilon^{2\mathfrak{d}}}{\bigl(16(v-u)(\tau+1)^{\tau} R^{\tau}\bigr)^{\mathfrak{d}}} \Biggr\} = \frac{-\varepsilon^{2\mathfrak{d}}}{\bigl(16(v-u)(\tau+1)^{\tau} R^{\tau}\bigr)^{\mathfrak{d}}} \le \frac{-\varepsilon^{2\mathfrak{d}}}{(c\tau)^{\tau\mathfrak{d}}}. \tag{263}
\]
Furthermore, note that (260) implies that for all τ ∈ N it holds that
\[
128(\tau+1)^{\tau} R^{\tau} (v-u) \le 128(2\tau)^{\tau} R^{\tau} (v-u) \le \bigl(256R\tau(v-u+1)\bigr)^{\tau} \le (c\tau)^{\tau}. \tag{264}
\]
The fact that for all ε ∈ (0, v − u], τ ∈ ℕ it holds that ε² ≤ (v − u)R ≤ 128(τ + 1)^τ R^τ (v − u) hence proves that for all ε ∈ (0, v − u], τ ∈ ℕ it holds that
\[
\ln\Biggl( \max\Biggl\{ 1, \frac{128(\tau+1)^{\tau} R^{\tau} (v-u)}{\varepsilon^{2}} \Biggr\} \Biggr) = \ln\Biggl( \frac{128(\tau+1)^{\tau} R^{\tau} (v-u)}{\varepsilon^{2}} \Biggr) \le \ln\bigl( (c\tau)^{\tau} \varepsilon^{-2} \bigr). \tag{265}
\]
In addition, observe that (260) ensures that
\[
\frac{-1}{32(v-u)^{4}} \le \frac{-1}{c}. \tag{266}
\]

Combining this, (263), and (265) with (261) proves that for all 𝔡, K, M, τ ∈ ℕ, ε ∈ (0, v − u] with τ ≥ 2d(2dL(b − a)ε^{−1} + 2)^d and 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 it holds that
\[
\mathbb{P}\Biggl( \biggl[ \int_{[a,b]^d} \bigl| \mathcal{N}^{\Xi_{\mathfrak{d},K,M,\tau}, l_\tau}_{u,v}(x) - \varphi(x) \bigr|^2 \, \mathbb{P}_{X_1}(dx) \biggr]^{1/2} > \varepsilon \Biggr)
\le \exp\biggl( \frac{-K\varepsilon^{2\mathfrak{d}}}{(c\tau)^{\tau\mathfrak{d}}} \biggr) + 2 \exp\biggl( \mathfrak{d} \ln\biggl( \frac{(c\tau)^{\tau}}{\varepsilon^{2}} \biggr) - \frac{\varepsilon^{4} M}{c} \biggr). \tag{267}
\]
The proof of Corollary 4.7 is thus completed.
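The constant c from (260) merely absorbs the elementary estimates (262) and (264) into the single expression (cτ)^τ. The following quick numerical check of these two estimates, for a sample choice of R and v − u, is our own illustration and not part of the proof:

```python
# Check (262) and (264): both left-hand sides are dominated by
# (c*tau)**tau with c = max{32(v-u)^4, 256(v-u+1)R} as in (260).
v_minus_u, R = 1.0, 2.0
c = max(32.0 * v_minus_u ** 4, 256.0 * (v_minus_u + 1.0) * R)

for tau in range(1, 40):
    lhs_262 = 16.0 * v_minus_u * (tau + 1) ** tau * R ** tau
    lhs_264 = 128.0 * (tau + 1) ** tau * R ** tau * v_minus_u
    assert max(lhs_262, lhs_264) <= (c * tau) ** tau
```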
Corollary 4.8. Let (Ω, F, P) be a probability space, let d ∈ ℕ, L, a, u ∈ ℝ, b ∈ (a, ∞), v ∈ (u, ∞), R ∈ [max{1, L, |a|, |b|, 2|u|, 2|v|}, ∞), let X_m : Ω → [a, b]^d, m ∈ ℕ, be i.i.d. random variables, let ‖·‖ : ℝ^d → [0, ∞) be the standard norm on ℝ^d, let ϕ : [a, b]^d → [u, v] satisfy for all x, y ∈ [a, b]^d that |ϕ(x) − ϕ(y)| ≤ L‖x − y‖, let l_τ ∈ ℕ^τ, τ ∈ ℕ, satisfy for all τ ∈ ℕ ∩ [3, ∞) that l_τ = (d, τ, τ, . . . , τ, 1), let E_{𝔡,M,τ} : [−R, R]^𝔡 × Ω → [0, ∞), 𝔡, M, τ ∈ ℕ, satisfy for all 𝔡, M ∈ ℕ, τ ∈ ℕ ∩ [3, ∞), θ ∈ [−R, R]^𝔡, ω ∈ Ω with 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 that
\[
E_{\mathfrak{d},M,\tau}(\theta, \omega) = \frac{1}{M} \left[ \sum_{m=1}^{M} \bigl| \mathcal{N}^{\theta, l_\tau}_{u,v}(X_m(\omega)) - \varphi(X_m(\omega)) \bigr|^2 \right], \tag{268}
\]
for every 𝔡 ∈ ℕ let Θ_{𝔡,k} : Ω → [−R, R]^𝔡, k ∈ ℕ, be i.i.d. random variables, assume for all 𝔡 ∈ ℕ that Θ_{𝔡,1} is continuous uniformly distributed on [−R, R]^𝔡, and let Ξ_{𝔡,K,M,τ} : Ω → [−R, R]^𝔡, 𝔡, K, M, τ ∈ ℕ, satisfy for all 𝔡, K, M, τ ∈ ℕ that Ξ_{𝔡,K,M,τ} = Θ_{𝔡, min{k ∈ {1,2,...,K} : E_{𝔡,M,τ}(Θ_{𝔡,k}) = min_{l ∈ {1,2,...,K}} E_{𝔡,M,τ}(Θ_{𝔡,l})}} (cf. Definition 2.8). Then there exists c ∈ (0, ∞) such that for all 𝔡, K, M, τ ∈ ℕ, ε ∈ (0, v − u] with τ ≥ 2d(2dL(b − a)ε^{−1} + 2)^d and 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 it holds that
\[
\mathbb{P}\biggl( \int_{[a,b]^d} \bigl| \mathcal{N}^{\Xi_{\mathfrak{d},K,M,\tau}, l_\tau}_{u,v}(x) - \varphi(x) \bigr| \, \mathbb{P}_{X_1}(dx) > \varepsilon \biggr)
\le \exp\bigl( -K (c\tau)^{-\tau\mathfrak{d}} \varepsilon^{2\mathfrak{d}} \bigr) + 2 \exp\bigl( \mathfrak{d} \ln\bigl( (c\tau)^{\tau} \varepsilon^{-2} \bigr) - c^{-1} \varepsilon^{4} M \bigr). \tag{269}
\]
Proof of Corollary 4.8. Note that Jensen's inequality shows that for all f ∈ C([a, b]^d, ℝ) it holds that
\[
\int_{[a,b]^d} |f(x)| \, \mathbb{P}_{X_1}(dx) \le \biggl( \int_{[a,b]^d} |f(x)|^{2} \, \mathbb{P}_{X_1}(dx) \biggr)^{1/2}. \tag{270}
\]

Combining this with Corollary 4.7 proves (269). The proof of Corollary 4.8 is thus completed.
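The only new ingredient here is the passage from the L²- to the L¹-error via (270). A Monte Carlo illustration of this step, under an assumed uniform distribution of X_1 and an arbitrary choice of f (both ours):

```python
import numpy as np

# Empirically: the L^1(P_{X_1})-norm never exceeds the L^2(P_{X_1})-norm.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=10**6)  # samples of X_1 for [a, b] = [0, 1], d = 1
f = np.sin(3.0 * x) - 0.2              # any f in C([a, b]^d, R)
l1 = np.mean(np.abs(f))
l2 = np.sqrt(np.mean(f ** 2))
assert l1 <= l2
print(f"L1 = {l1:.4f} <= {l2:.4f} = L2")
```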

4.3.2 Convergence rates for strong convergence
Lemma 4.9. Let (Ω, F , P) be a probability space, let c ∈ [0, ∞), and let X : Ω → [−c, c] be a random
variable. Then it holds for all ε, p ∈ (0, ∞) that
\[
\mathbb{E}\bigl[|X|^{p}\bigr] \le \varepsilon^{p} \, \mathbb{P}(|X| \le \varepsilon) + c^{p} \, \mathbb{P}(|X| > \varepsilon) \le \varepsilon^{p} + c^{p} \, \mathbb{P}(|X| > \varepsilon). \tag{271}
\]
Proof of Lemma 4.9. Observe that the hypothesis that for all ω ∈ Ω it holds that |X(ω)| ≤ c ensures that
for all ε, p ∈ (0, ∞) it holds that
\[
\mathbb{E}\bigl[|X|^{p}\bigr] = \mathbb{E}\bigl[|X|^{p} \mathbf{1}_{\{|X| \le \varepsilon\}}\bigr] + \mathbb{E}\bigl[|X|^{p} \mathbf{1}_{\{|X| > \varepsilon\}}\bigr] \le \varepsilon^{p} \, \mathbb{P}(|X| \le \varepsilon) + c^{p} \, \mathbb{P}(|X| > \varepsilon) \le \varepsilon^{p} + c^{p} \, \mathbb{P}(|X| > \varepsilon). \tag{272}
\]
The proof of Lemma 4.9 is thus completed.
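Lemma 4.9 is the standard device for turning the probabilistic bound of Corollary 4.7 into a bound on moments: split E[|X|^p] at the threshold ε and bound the tail by the uniform bound c. A quick empirical check of (271), with a distribution of X chosen by us purely for illustration:

```python
import numpy as np

# For any random variable X with |X| <= c and any eps, p > 0:
# E[|X|^p] <= eps^p P(|X| <= eps) + c^p P(|X| > eps) <= eps^p + c^p P(|X| > eps).
rng = np.random.default_rng(3)
c, p, eps = 2.0, 3.0, 0.5
X = c * (2.0 * rng.beta(0.3, 0.7, size=10**6) - 1.0)  # values in [-c, c]

lhs = np.mean(np.abs(X) ** p)
mid = eps ** p * np.mean(np.abs(X) <= eps) + c ** p * np.mean(np.abs(X) > eps)
rhs = eps ** p + c ** p * np.mean(np.abs(X) > eps)
assert lhs <= mid <= rhs
```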
Corollary 4.10. Let (Ω, F, P) be a probability space, let d ∈ ℕ, L, a, u ∈ ℝ, b ∈ (a, ∞), v ∈ (u, ∞), R ∈ [max{1, L, |a|, |b|, 2|u|, 2|v|}, ∞), let X_m : Ω → [a, b]^d, m ∈ ℕ, be i.i.d. random variables, let ‖·‖ : ℝ^d → [0, ∞) be the standard norm on ℝ^d, let ϕ : [a, b]^d → [u, v] satisfy for all x, y ∈ [a, b]^d that |ϕ(x) − ϕ(y)| ≤ L‖x − y‖, let l_τ ∈ ℕ^τ, τ ∈ ℕ, satisfy for all τ ∈ ℕ ∩ [3, ∞) that l_τ = (d, τ, τ, . . . , τ, 1), let E_{𝔡,M,τ} : [−R, R]^𝔡 × Ω → [0, ∞), 𝔡, M, τ ∈ ℕ, satisfy for all 𝔡, M ∈ ℕ, τ ∈ ℕ ∩ [3, ∞), θ ∈ [−R, R]^𝔡, ω ∈ Ω with 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 that
\[
E_{\mathfrak{d},M,\tau}(\theta, \omega) = \frac{1}{M} \left[ \sum_{m=1}^{M} \bigl| \mathcal{N}^{\theta, l_\tau}_{u,v}(X_m(\omega)) - \varphi(X_m(\omega)) \bigr|^2 \right], \tag{273}
\]
for every 𝔡 ∈ ℕ let Θ_{𝔡,k} : Ω → [−R, R]^𝔡, k ∈ ℕ, be i.i.d. random variables, assume for all 𝔡 ∈ ℕ that Θ_{𝔡,1} is continuous uniformly distributed on [−R, R]^𝔡, and let Ξ_{𝔡,K,M,τ} : Ω → [−R, R]^𝔡, 𝔡, K, M, τ ∈ ℕ, satisfy for all 𝔡, K, M, τ ∈ ℕ that Ξ_{𝔡,K,M,τ} = Θ_{𝔡, min{k ∈ {1,2,...,K} : E_{𝔡,M,τ}(Θ_{𝔡,k}) = min_{l ∈ {1,2,...,K}} E_{𝔡,M,τ}(Θ_{𝔡,l})}} (cf. Definition 2.8). Then there exists c ∈ (0, ∞) such that for all 𝔡, K, M, τ ∈ ℕ, p ∈ [1, ∞), ε ∈ (0, v − u] with τ ≥ 2d(2dL(b − a)ε^{−1} + 2)^d and 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 it holds that
\[
\biggl( \mathbb{E}\biggl[ \biggl( \int_{[a,b]^d} \bigl| \mathcal{N}^{\Xi_{\mathfrak{d},K,M,\tau}, l_\tau}_{u,v}(x) - \varphi(x) \bigr|^2 \, \mathbb{P}_{X_1}(dx) \biggr)^{p/2} \biggr] \biggr)^{1/p}
\le (v-u) \Bigl[ \exp\bigl( -K (c\tau)^{-\tau\mathfrak{d}} \varepsilon^{2\mathfrak{d}} \bigr) + 2 \exp\bigl( \mathfrak{d} \ln\bigl( (c\tau)^{\tau} \varepsilon^{-2} \bigr) - c^{-1} \varepsilon^{4} M \bigr) \Bigr]^{1/p} + \varepsilon. \tag{274}
\]
Proof of Corollary 4.10. First, observe that Corollary 4.7 ensures that there exists c ∈ (0, ∞) which satisfies for all 𝔡, K, M, τ ∈ ℕ, ε ∈ (0, v − u] with τ ≥ 2d(2dL(b − a)ε^{−1} + 2)^d and 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 that
\[
\mathbb{P}\Biggl( \biggl[ \int_{[a,b]^d} \bigl| \mathcal{N}^{\Xi_{\mathfrak{d},K,M,\tau}, l_\tau}_{u,v}(x) - \varphi(x) \bigr|^2 \, \mathbb{P}_{X_1}(dx) \biggr]^{1/2} > \varepsilon \Biggr)
\le \exp\bigl( -K (c\tau)^{-\tau\mathfrak{d}} \varepsilon^{2\mathfrak{d}} \bigr) + 2 \exp\bigl( \mathfrak{d} \ln\bigl( (c\tau)^{\tau} \varepsilon^{-2} \bigr) - c^{-1} \varepsilon^{4} M \bigr). \tag{275}
\]
Lemma 4.9 (with (Ω, F, P) ← (Ω, F, P), c ← v − u, X ← (Ω ∋ ω ↦ (∫_{[a,b]^d} |𝒩^{Ξ_{𝔡,K,M,τ}(ω), l_τ}_{u,v}(x) − ϕ(x)|² P_{X_1}(dx))^{1/2} ∈ [u − v, v − u]) in the notation of Lemma 4.9) hence ensures that for all 𝔡, K, M, τ ∈ ℕ, ε ∈ (0, v − u], p ∈ (0, ∞) with τ ≥ 2d(2dL(b − a)ε^{−1} + 2)^d and 𝔡 ≥ τ(d + 1) + (τ − 3)τ(τ + 1) + τ + 1 it holds that
\[
\mathbb{E}\biggl[ \biggl( \int_{[a,b]^d} \bigl| \mathcal{N}^{\Xi_{\mathfrak{d},K,M,\tau}, l_\tau}_{u,v}(x) - \varphi(x) \bigr|^2 \, \mathbb{P}_{X_1}(dx) \biggr)^{p/2} \biggr]
\le \varepsilon^{p} + (v-u)^{p} \Bigl[ \exp\bigl( -K (c\tau)^{-\tau\mathfrak{d}} \varepsilon^{2\mathfrak{d}} \bigr) + 2 \exp\bigl( \mathfrak{d} \ln\bigl( (c\tau)^{\tau} \varepsilon^{-2} \bigr) - c^{-1} \varepsilon^{4} M \bigr) \Bigr]. \tag{276}
\]
The fact that for all p ∈ [1, ∞), x, y ∈ [0, ∞) it holds that (x + y)^{1/p} ≤ x^{1/p} + y^{1/p} therefore establishes (274). The proof of Corollary 4.10 is thus completed.
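To see what (274) yields quantitatively, one can evaluate its right-hand side for concrete parameter choices. The helper below is our own sketch: the constant c is computed as in (260), and the exponents are handled in log space because (cτ)^{τ𝔡} overflows floating-point arithmetic even for modest architectures. It makes the slow convergence speed discussed in the abstract concrete: already for d = 1, L = b − a = 1, and ε = 1/4 the hypotheses force τ ≥ 20 and 𝔡 ≥ 7201, and the resulting bound stays vacuous unless K and M are astronomically large.

```python
import math

def c_from_260(R, v_minus_u):
    """The constant c = max{32(v-u)^4, 256(v-u+1)R} from (260)."""
    return max(32.0 * v_minus_u ** 4, 256.0 * (v_minus_u + 1.0) * R)

def rhs_274(dfrak, K, M, tau, eps, p, R, v_minus_u):
    """Right-hand side of (274), evaluated in log space."""
    c = c_from_260(R, v_minus_u)
    # optimization term: exp(-K * (c*tau)^(-tau*dfrak) * eps^(2*dfrak))
    log_rate = 2 * dfrak * math.log(eps) - tau * dfrak * math.log(c * tau)
    opt = math.exp(-K * math.exp(log_rate))  # inner exp underflows to 0 quickly
    # generalization term: 2 * exp(dfrak * ln((c*tau)^tau / eps^2) - eps^4 * M / c)
    gen_exp = (dfrak * (tau * math.log(c * tau) - 2 * math.log(eps))
               - eps ** 4 * M / c)
    gen = 2.0 * math.exp(min(gen_exp, 700.0))  # clamp to avoid float overflow
    return v_minus_u * (opt + gen) ** (1.0 / p) + eps

# d = 1, L = b - a = 1, eps = 0.25: tau >= 2*(2/0.25 + 2) = 20 and
# dfrak >= 20*2 + 17*20*21 + 21 = 7201. Even with K = M = 10^9 the first
# factor stays near 1, since K would need to be of order
# exp(tau * dfrak * ln(c * tau)) to move it.
print(rhs_274(dfrak=7201, K=10**9, M=10**9, tau=20, eps=0.25,
              p=2.0, R=1.0, v_minus_u=1.0))
```

The clamp at 700 only matters in this vacuous regime; for parameter ranges in which the bound is informative, the exponent of the generalization term is negative.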

Acknowledgements
This work has been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Founda-
tion) under Germany’s Excellence Strategy EXC 2044–390685587, Mathematics Münster: Dynamics–
Geometry–Structure.

References
[1] Bach, F. Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Res.
18 (2017), 53 pages.
[2] Bach, F., and Moulines, E. Non-strongly-convex smooth stochastic approximation with con-
vergence rate O(1/n). In Proceedings of the 26th International Conference on Neural Information
Processing Systems (USA, 2013), NIPS’13, Curran Associates Inc., pp. 773–781.
[3] Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE
Trans. Inform. Theory 39, 3 (1993), 930–945.
[4] Barron, A. R. Approximation and estimation bounds for artificial neural networks. Machine
Learning 14, 1 (1994), 115–133.
[5] Bartlett, P. L., Bousquet, O., and Mendelson, S. Local Rademacher complexities. Ann.
Statist. 33, 4 (2005), 1497–1537.
[6] Beck, C., Becker, S., Grohs, P., Jaafari, N., and Jentzen, A. Solving stochastic differential
equations and Kolmogorov equations by means of deep learning. arXiv:1806.00421 (2018), 56 pages.
[7] Beck, C., E, W., and Jentzen, A. Machine Learning Approximation Algorithms for High-
Dimensional Fully Nonlinear Partial Differential Equations and Second-order Backward Stochastic
Differential Equations. J. Nonlinear Sci. 29, 4 (2019), 1563–1619.
[8] Bellman, R. Dynamic programming. Princeton Landmarks in Mathematics. Princeton University
Press, Princeton, NJ, 2010. Reprint of the 1957 edition.
[9] Bercu, B., and Fort, J.-C. Generic stochastic gradient methods. Wiley Encyclopedia of Opera-
tions Research and Management Science (2011), 1–8.
[10] Berner, J., Grohs, P., and Jentzen, A. Analysis of the generalization error: Empirical risk
minimization over deep artificial neural networks overcomes the curse of dimensionality in the nu-
merical approximation of Black–Scholes partial differential equations. arXiv:1809.03062 (2018), 35
pages.
[11] Blum, E. K., and Li, L. K. Approximation theory and feedforward networks. Neural Networks
4, 4 (1991), 511–515.
[12] Bölcskei, H., Grohs, P., Kutyniok, G., and Petersen, P. Optimal approximation with
sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1, 1 (2019), 8–45.
[13] Burger, M., and Neubauer, A. Error bounds for approximation with neural networks. J.
Approx. Theory 112, 2 (2001), 235–250.
[14] Candes, E. J. Ridgelets: theory and applications. PhD thesis, Stanford University Stanford, 1998.
[15] Chau, N. H., Moulines, É., Rásonyi, M., Sabanis, S., and Zhang, Y. On stochastic gradient
Langevin dynamics with dependent data streams: the fully non-convex case. arXiv:1905.13142
(2019), 27 pages.

[16] Chen, T., and Chen, H. Approximation capability to functions of several variables, nonlinear
functionals, and operators by radial basis function neural networks. IEEE Trans. Neural Netw. 6, 4
(1995), 904–910.

[17] Chui, C. K., Li, X., and Mhaskar, H. N. Neural networks for localized approximation. Math.
Comp. 63, 208 (1994), 607–623.

[18] Cucker, F., and Smale, S. On the mathematical foundations of learning. Bull. Amer. Math.
Soc. (N.S.) 39, 1 (2002), 1–49.

[19] Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems 2, 4 (1989), 303–314.

[20] Dereich, S., and Müller-Gronbach, T. General multilevel adaptations for stochastic ap-
proximation algorithms of Robbins-Monro and Polyak-Ruppert type. Numer. Math. 142, 2 (2019),
279–328.

[21] DeVore, R. A., Oskolkov, K. I., and Petrushev, P. P. Approximation by feed-forward neural networks. In The heritage of P. L. Chebyshev: a Festschrift in honor of the 70th birthday of T. J. Rivlin, vol. 4. Baltzer Science Publishers BV, Amsterdam, 1997, pp. 261–287.

[22] E, W., and Wang, Q. Exponential convergence of the deep neural network approximation for
analytic functions. arXiv:1807.00297 (2018), 7 pages.

[23] Elbrächter, D., Grohs, P., Jentzen, A., and Schwab, C. DNN expression rate analysis of
high-dimensional PDEs: Application to option pricing. arXiv:1809.07669 (2018), 50 pages.

[24] Eldan, R., and Shamir, O. The power of depth for feedforward neural networks. In 29th Annual
Conference on Learning Theory (Columbia University, New York, New York, USA, 23–26 Jun 2016),
V. Feldman, A. Rakhlin, and O. Shamir, Eds., vol. 49 of Proceedings of Machine Learning Research,
PMLR, pp. 907–940.

[25] Ellacott, S. W. Aspects of the numerical analysis of neural networks. In Acta numerica, 1994,
Acta Numer. Cambridge University Press, Cambridge, 1994, pp. 145–202.

[26] Fehrman, B., Gess, B., and Jentzen, A. Convergence rates for the stochastic gradient descent
method for non-convex objective functions. arXiv:1904.01517 (2019), 59 pages.

[27] Funahashi, K.-I. On the approximate realization of continuous mappings by neural networks.
Neural Networks 2, 3 (1989), 183–192.

[28] Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. Adaptive Computation and
Machine Learning. MIT Press, Cambridge, MA, 2016.

[29] Gribonval, R., Kutyniok, G., Nielsen, M., and Voigtlaender, F. Approximation spaces
of deep neural networks. arXiv:1905.01208 (2019), 63 pages.

[30] Grohs, P., Hornung, F., Jentzen, A., and von Wurstemberger, P. A proof that artificial
neural networks overcome the curse of dimensionality in the numerical approximation of Black–
Scholes partial differential equations. arXiv:1809.02362 (2018), 124 pages. Revision requested from
Memoirs of the AMS.

[31] Grohs, P., Hornung, F., Jentzen, A., and Zimmermann, P. Space-time error estimates for
deep neural network approximations for differential equations. arXiv:1908.03833 (2019), 86 pages.

[32] Grohs, P., Jentzen, A., and Salimova, D. Deep neural network approximations for Monte
Carlo algorithms. arXiv:1908.10828 (2019), 45 pages.

[33] Grohs, P., Perekrestenko, D., Elbrächter, D., and Bölcskei, H. Deep neural network
approximation theory. arXiv:1901.02220 (2019), 60 pages.

[34] Gühring, I., Kutyniok, G., and Petersen, P. Error bounds for approximations with deep
ReLU neural networks in W s,p norms. arXiv:1902.07896 (2019), 42 pages.

[35] Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. A distribution-free theory of nonpara-
metric regression. Springer Series in Statistics. Springer-Verlag, New York, 2002.

[36] Hartman, E. J., Keeler, J. D., and Kowalski, J. M. Layered neural networks with Gaussian
hidden units as universal approximations. Neural Comput. 2, 2 (1990), 210–215.

[37] Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Amer. Statist.
Assoc. 58, 301 (1963), 13–30.

[38] Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 2 (1991), 251–257.

[39] Hornik, K. Some new results on neural network approximation. Neural Networks 6, 8 (1993),
1069–1072.

[40] Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal
approximators. Neural Networks 2, 5 (1989), 359–366.

[41] Hornik, K., Stinchcombe, M., and White, H. Universal approximation of an unknown map-
ping and its derivatives using multilayer feedforward networks. Neural Networks 3, 5 (1990), 551–560.

[42] Hutzenthaler, M., Jentzen, A., Kruse, T., and Nguyen, T. A. A proof that rectified deep
neural networks overcome the curse of dimensionality in the numerical approximation of semilinear
heat equations. arXiv:1901.10854 (2019), 29 pages.

[43] Jentzen, A., Kuckuck, B., Neufeld, A., and von Wurstemberger, P. Strong error analysis
for stochastic gradient descent optimization algorithms. arXiv:1801.09324 (2018), 75 pages. Revision
requested from IMA J. Numer. Anal.

[44] Jentzen, A., Salimova, D., and Welti, T. A proof that deep artificial neural networks over-
come the curse of dimensionality in the numerical approximation of Kolmogorov partial differential
equations with constant diffusion and nonlinear drift coefficients. arXiv:1809.07321 (2018), 48 pages.

[45] Jentzen, A., and von Wurstemberger, P. Lower error bounds for the stochastic gradient
descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates.
arXiv:1803.08600 (2018), 42 pages. To appear in J. Complex.

[46] Karimi, B., Miasojedow, B., Moulines, E., and Wai, H.-T. Non-asymptotic Analysis of
Biased Stochastic Approximation Scheme. arXiv:1902.00629 (2019), 32 pages.

[47] Kutyniok, G., Petersen, P., Raslan, M., and Schneider, R. A theoretical analysis of deep
neural networks and parametric PDEs. arXiv:1904.00377 (2019), 43 pages.

[48] Lei, Y., Hu, T., Li, G., and Tang, K. Stochastic Gradient Descent for Nonconvex Learning
without Bounded Gradient Assumptions. arXiv:1902.00908 (2019), 6 pages.

[49] Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. Multilayer feedforward networks with
a nonpolynomial activation function can approximate any function. Neural Networks 6, 6 (1993),
861–867.

[50] Maggi, F. Sets of finite perimeter and geometric variational problems, vol. 135 of Cambridge Studies
in Advanced Mathematics. Cambridge University Press, Cambridge, 2012.

[51] Massart, P. Concentration inequalities and model selection, vol. 1896 of Lecture Notes in Mathe-
matics. Springer, Berlin, 2007. Lectures from the 33rd Summer School on Probability Theory held
in Saint-Flour, July 6–23, 2003.

[52] Mhaskar, H. N. Neural networks for optimal approximation of smooth and analytic functions.
Neural Comput. 8, 1 (1996), 164–177.

[53] Mhaskar, H. N., and Micchelli, C. A. Degree of approximation by neural and translation
networks with a single hidden layer. Adv. in Appl. Math. 16, 2 (1995), 151–183.

[54] Mhaskar, H. N., and Poggio, T. Deep vs. shallow networks: an approximation theory perspec-
tive. Anal. Appl. (Singap.) 14, 6 (2016), 829–848.

[55] Nguyen-Thien, T., and Tran-Cong, T. Approximation of functions and their derivatives: A
neural network implementation with applications. Appl. Math. Model. 23, 9 (1999), 687–704.

[56] Novak, E., and Woźniakowski, H. Tractability of multivariate problems. Vol. 1: Linear in-
formation, vol. 6 of EMS Tracts in Mathematics. European Mathematical Society (EMS), Zürich,
2008.

[57] Novak, E., and Woźniakowski, H. Tractability of multivariate problems. Volume II: Standard
information for functionals, vol. 12 of EMS Tracts in Mathematics. European Mathematical Society
(EMS), Zürich, 2010.

[58] Park, J., and Sandberg, I. W. Universal approximation using radial-basis-function networks.
Neural Comput. 3, 2 (1991), 246–257.

[59] Perekrestenko, D., Grohs, P., Elbrächter, D., and Bölcskei, H. The universal approx-
imation power of finite-width deep ReLU networks. arXiv:1806.01528 (2018), 16 pages.

[60] Petersen, P., Raslan, M., and Voigtlaender, F. Topological properties of the set of functions
generated by neural networks of fixed size. arXiv:1806.08459 (2018), 56 pages.

[61] Petersen, P., and Voigtlaender, F. Equivalence of approximation by convolutional neural networks and fully-connected networks. arXiv:1809.00973 (2018), 10 pages.

[62] Petersen, P., and Voigtlaender, F. Optimal approximation of piecewise smooth functions
using deep ReLU neural networks. Neural Networks 108 (2018), 296–330.

[63] Pinkus, A. Approximation theory of the MLP model in neural networks. In Acta numerica, 1999,
vol. 8 of Acta Numer. Cambridge University Press, Cambridge, 1999, pp. 143–195.

[64] Reisinger, C., and Zhang, Y. Rectified deep neural networks overcome the curse of dimension-
ality for nonsmooth value functions in zero-sum games of nonlinear stiff systems. arXiv:1903.06652
(2019), 34 pages.

[65] Schmitt, M. Lower bounds on the complexity of approximating continuous functions by sigmoidal
neural networks. In Proceedings of the 12th International Conference on Neural Information Process-
ing Systems (Cambridge, MA, USA, 1999), NIPS’99, MIT Press, pp. 328–334.

[66] Schwab, C., and Zech, J. Deep learning in high dimension: neural network expression rates for
generalized polynomial chaos expansions in UQ. Anal. Appl. (Singap.) 17, 1 (2019), 19–55.

[67] Shaham, U., Cloninger, A., and Coifman, R. R. Provable approximation properties for deep
neural networks. Appl. Comput. Harmon. Anal. 44, 3 (2018), 537–557.

[68] Shalev-Shwartz, S., and Ben-David, S. Understanding machine learning: From theory to
algorithms. Cambridge University Press, Cambridge, 2014.

[69] Shen, Z., Yang, H., and Zhang, S. Deep network approximation characterized by number of
neurons. arXiv:1906.05497 (2019), 36 pages.

[70] Shen, Z., Yang, H., and Zhang, S. Nonlinear approximation via compositions. arXiv:1902.10170
(2019), 19 pages.

[71] van de Geer, S. A. Applications of empirical process theory, vol. 6 of Cambridge Series in Statistical
and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2000.

[72] Voigtlaender, F., and Petersen, P. Approximation in Lp (µ) with deep ReLU neural networks.
arXiv:1904.04789 (2019), 4 pages.

[73] Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks 94
(2017), 103–114.

[74] Yarotsky, D. Universal approximations of invariant maps by neural networks. arXiv:1804.10306 (2018), 64 pages.
