Appendix D. Neural Representations
Example D.1. The fundamentals of neural networks and parameter tuning.

Consider a very simple neural network, fθ(x) = θ1 + θ2 x. We wish our neural network to take the square footage x of a home and predict its price ypred. We want to minimize the squared deviation between the predicted housing price and the true housing price, as measured by the loss function ℓ(ypred, ytrue) = (ypred − ytrue)². Given a training pair, we can compute the gradient:

∇θℓ = [2(ypred − ytrue), 2(ypred − ytrue) x]

If our initial parameterization were θ = [10,000, 123] and we had the input-output pair (x = 2,500, ytrue = 360,000), then the loss gradient would be ∇θℓ = [−85,000, −2.125 × 10⁸]. We would take a small step in the opposite direction to improve our function approximation.
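To make the update concrete, here is a minimal Python sketch (not from the text) of one gradient descent step for this example; the step size is an arbitrary illustrative choice.

```python
# Minimal sketch of example D.1: one gradient descent step for
# f_theta(x) = theta1 + theta2 * x with squared-error loss.
theta = [10_000.0, 123.0]          # [theta1, theta2]
x, y_true = 2_500.0, 360_000.0     # training pair

y_pred = theta[0] + theta[1] * x               # 317,500
grad = [2 * (y_pred - y_true),                 # d loss / d theta1 = -85,000
        2 * (y_pred - y_true) * x]             # d loss / d theta2 = -2.125e8

alpha = 1e-9                                   # small, illustrative step size
theta = [theta[i] - alpha * grad[i] for i in range(2)]
```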
Data sets for modern problems tend to be very large, making the gradient of equation (D.1) expensive to evaluate. It is common to sample random subsets of the training data in each iteration, using these batches to compute the loss gradient. In addition to reducing computation, computing gradients with smaller batch sizes introduces some stochasticity to the gradient, which helps training to avoid getting stuck in local minima.
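As a sketch of this idea (illustrative code, with a placeholder batch size), a minibatch gradient for the model of example D.1 could be computed as follows.

```python
import random

def batch_gradient(theta, data, batch_size=32):
    """Gradient of the summed squared-error loss over a random minibatch.

    data is a list of (x, y_true) pairs; theta = [theta1, theta2].
    """
    batch = random.sample(data, min(batch_size, len(data)))
    g1 = g2 = 0.0
    for x, y_true in batch:
        y_pred = theta[0] + theta[1] * x
        g1 += 2 * (y_pred - y_true)
        g2 += 2 * (y_pred - y_true) * x
    return [g1, g2]
```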
D.2 Feedforward Networks
Neural networks are typically constructed to pass the input through a series of layers.3 Networks with many layers are often called deep. In feedforward networks, each layer applies an affine transform, followed by a nonlinear activation function applied elementwise:4

x′ = φ.(Wx + b)   (D.2)

where matrix W and vector b are parameters associated with the layer. A fully connected layer is shown in figure D.1. The dimension of the output layer is different from that of the input layer when W is nonsquare. Figure D.2 shows a more compact depiction of the same network.

3. A sufficiently large, single-layer neural network can, in theory, approximate any function. See A. Pinkus, "Approximation Theory of the MLP Model in Neural Networks," Acta Numerica, vol. 8, pp. 143–195, 1999.
4. The nonlinearity introduced by the activation function provides something analogous to the activation behavior of biological neurons, in which input buildup eventually causes a neuron to fire. A. L. Hodgkin and A. F. Huxley, "A Quantitative Description of Membrane Current and Its Application to Conduction and Excitation in Nerve," Journal of Physiology, vol. 117, no. 4, pp. 500–544, 1952.
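As an illustrative sketch of equation (D.2) (not code from the text), a fully connected layer can be written with NumPy as follows.

```python
import numpy as np

def dense_layer(x, W, b, phi=np.tanh):
    """One feedforward layer: affine transform followed by an
    elementwise activation, x' = phi(W x + b)."""
    return phi(W @ x + b)

# Example: map a 3-dimensional input to a 5-dimensional hidden representation.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
b = rng.normal(size=5)
x = np.array([1.0, -0.5, 2.0])
h = dense_layer(x, W, b)   # h has shape (5,)
```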
There are many types of activation functions that are commonly used. Similar to their biological inspiration, they tend to be close to zero when their input is low and large when their input is high. Some common activation functions are shown in figure D.5.

[Figure D.5: common activation functions φ(x): relu, max(0, x); leaky relu, max(αx, x); and swish, x sigmoid(x).]
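The three activation functions in figure D.5 can be written directly in code; this is a small illustrative sketch (not from the text), with α the leaky relu slope.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):   # alpha is a small, user-chosen slope
    return np.maximum(alpha * x, x)

def swish(x):
    return x * sigmoid(x)
```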
Sometimes special layers are incorporated to achieve certain effects. For ex-
ample, in figure D.4, we used a softmax layer at the end to force the output to
represent a two-element categorical distribution. The softmax function applies
the exponential function to each element, which ensures that they are positive, and then renormalizes the resulting values:

softmax(x)i = exp(xi) / ∑j exp(xj)   (D.4)

[Figures: a network whose final fully connected + softmax layer maps an intermediate representation in R5 to ypred ∈ R2, and a plot of the complicated, nonlinear decision boundaries such a network can represent.]
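A small Python sketch of equation (D.4); subtracting the maximum element before exponentiating is a standard numerical-stability detail not discussed in the text.

```python
import numpy as np

def softmax(x):
    """Softmax of equation (D.4), with a max-shift for numerical stability."""
    z = np.exp(x - np.max(x))   # subtracting a constant does not change the result
    return z / np.sum(z)

softmax(np.array([1.0, 2.0]))   # a two-element categorical distribution
```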
Gradients for neural networks are typically computed using reverse accumulation.5 The method begins with a forward step, in which the neural network is evaluated using all input parameters. In the backward step, the gradient of each term of interest is computed working from the output back to the input. Reverse accumulation uses the chain rule for derivatives:

∂f(g(h(x)))/∂x = ∂f(g(h))/∂h · ∂h(x)/∂x = ∂f(g)/∂g · ∂g(h)/∂h · ∂h(x)/∂x   (D.5)

Example D.2 demonstrates this process. Many deep learning packages compute gradients using such automatic differentiation techniques.6 Users rarely have to provide their own gradients.

5. This process is commonly called backpropagation, which specifically refers to reverse accumulation applied to a scalar loss function. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Representations by Back-Propagating Errors," Nature, vol. 323, pp. 533–536, 1986.
6. A. Griewank and A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd ed. SIAM, 2008.
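As an illustration of such packages (assuming PyTorch is installed; this is not code from the text), the gradient from example D.1 can be obtained without deriving it by hand.

```python
import torch

theta = torch.tensor([10_000.0, 123.0], requires_grad=True)
x, y_true = 2_500.0, 360_000.0

y_pred = theta[0] + theta[1] * x
loss = (y_pred - y_true) ** 2
loss.backward()       # reverse accumulation from the scalar loss

print(theta.grad)     # approximately [-8.5e4, -2.125e8]
```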
D.3 Parameter Regularization

Neural networks are typically underdetermined, meaning that there are multiple parameter instantiations that can result in the same optimal training loss.7 It is common to use parameter regularization, also called weight regularization, to introduce an additional term to the loss function that penalizes large parameter values. Regularization also helps prevent overfitting, which occurs when a network over-specializes to the training data but fails to generalize to unseen data.

Regularization often takes the form of an L2-norm of the parameterization vector:

arg min_θ ∑_{(x,y)∈D} ℓ(fθ(x), y) + β‖θ‖2   (D.6)

where the positive scalar β controls the strength of the parameter regularization. The scalar is often quite small, with values as low as 10⁻⁶, to minimize the degree to which matching the training set is sacrificed by introducing regularization.

7. For example, suppose that we have a neural network with a final softmax layer. The inputs to that layer can all be shifted by the same constant while producing the same output, and therefore the same loss.
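A minimal sketch of the regularized objective in equation (D.6) for the linear model of example D.1 (the β value is illustrative).

```python
import numpy as np

def regularized_loss(theta, data, beta=1e-6):
    """Objective of equation (D.6): summed squared error plus an
    L2 penalty on the parameter vector, weighted by beta."""
    theta = np.asarray(theta)
    total = sum((theta[0] + theta[1] * x - y_true) ** 2 for x, y_true in data)
    return total + beta * np.linalg.norm(theta)
```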
Example D.2. How reverse accumulation is used to compute parameter gradients given training data.

Recall the neural network and loss function from example D.1. The computational graph for the loss calculation introduces the intermediate quantities c1 = θ2 x, ypred = c1 + θ1, c2 = ypred − ytrue, and ℓ = c2².

Reverse accumulation begins with a forward pass, in which the computational graph is evaluated. We will again use θ = [10,000, 123] and the input-output pair (x = 2,500, ytrue = 360,000), which yields c1 = 307,500, ypred = 317,500, c2 = −42,500, and ℓ ≈ 1.81 × 10⁹.

The gradient is then computed by working back up the tree, using the local derivatives ∂ℓ/∂c2 = 2c2 = −85,000, ∂c2/∂ypred = 1, ∂ypred/∂θ1 = 1, ∂ypred/∂c1 = 1, and ∂c1/∂θ2 = 2,500.

Finally, we compute:

∂ℓ/∂θ1 = (∂ℓ/∂c2)(∂c2/∂ypred)(∂ypred/∂θ1) = −85,000 · 1 · 1 = −85,000
∂ℓ/∂θ2 = (∂ℓ/∂c2)(∂c2/∂ypred)(∂ypred/∂c1)(∂c1/∂θ2) = −85,000 · 1 · 1 · 2,500 = −2.125 × 10⁸
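The same bookkeeping can be written out in Python. This is a hand-coded sketch of reverse accumulation for this particular graph, not a general automatic differentiation implementation.

```python
# Forward pass: evaluate each node of the computational graph.
theta1, theta2 = 10_000.0, 123.0
x, y_true = 2_500.0, 360_000.0

c1     = theta2 * x          # 307,500
y_pred = c1 + theta1         # 317,500
c2     = y_pred - y_true     # -42,500
loss   = c2 ** 2             # about 1.81e9

# Backward pass: accumulate derivatives from the output back to the inputs.
d_c2     = 2 * c2            # d loss / d c2       = -85,000
d_y_pred = d_c2 * 1.0        # d c2 / d y_pred     = 1
d_theta1 = d_y_pred * 1.0    # d y_pred / d theta1 = 1  -> -85,000
d_c1     = d_y_pred * 1.0    # d y_pred / d c1     = 1
d_theta2 = d_c1 * x          # d c1 / d theta2 = x      -> -2.125e8
```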
D.5 Recurrent Networks

The neural network architectures discussed so far are ill-suited for temporal or sequential inputs. Operations on sequences occur when processing images
When a neural network has sequential input, sequential output, or both (figure D.8), we can use a recurrent neural network to act over multiple iterations. These neural networks maintain a recurrent state r, sometimes called the network's memory, to retain information over time. For example, in translation, a word used early in a sentence may be relevant to the proper translation of words later in the sentence. Figure D.9 shows the structure of a basic recurrent neural network and how the same neural network can be understood to be a larger network unrolled in time.
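As an illustrative sketch (not code from the text), a basic recurrent cell updates its state r from each input and carries it forward across the sequence.

```python
import numpy as np

def rnn_step(x, r, Wx, Wr, b, Wy, by):
    """One step of a basic recurrent network: update the recurrent state r
    from the input x and previous state, then produce an output y."""
    r_next = np.tanh(Wx @ x + Wr @ r + b)
    y = Wy @ r_next + by
    return y, r_next

def rnn_unroll(xs, r0, params):
    """Apply the same cell to an input sequence, carrying the state forward."""
    r, ys = r0, []
    for x in xs:
        y, r = rnn_step(x, r, *params)
        ys.append(y)
    return ys, r
```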
[Figure: recurrent networks unrolled in time for one-to-many, many-to-one, and many-to-many input-output configurations.]
Example D.4. A demonstration of how vanishing and exploding gradients arise in deep neural networks. This example uses a very simple neural network; in larger, fully connected layers, the same principles apply.

To illustrate vanishing and exploding gradients, consider a deep neural network made of one-dimensional, fully connected layers with relu activations. For example, if the network has three layers, its output is

fθ(x) = relu(w3 relu(w2 relu(w1 x + b1) + b2) + b3)
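The effect can be seen numerically. This short sketch (not from the text) multiplies the per-layer derivatives of a deep stack of such one-dimensional layers, assuming every pre-activation is positive so that each relu passes its input through unchanged.

```python
import numpy as np

# d f / d x is the product of the layer weights whenever every relu is active
# (positive pre-activation), since relu'(z) = 1 there.
def end_to_end_derivative(weights):
    return np.prod(weights)

print(end_to_end_derivative([0.5] * 30))   # ~9.3e-10: vanishing gradient
print(end_to_end_derivative([1.5] * 30))   # ~1.9e5:  exploding gradient
```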
While exploding gradients can often be handled with gradient clipping, regularization, and initializing parameters to small values, these solutions merely shift the problem toward that of vanishing gradients. Recurrent neural networks often use layers specifically constructed to mitigate the vanishing gradient problem. These layers use gates that selectively choose whether to retain memory, and the gates help regulate both the memory and the gradient. Two common recurrent layers are long short-term memory (LSTM)9 and gated recurrent units (GRU).10

9. S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
10. K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation," in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
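As an illustrative sketch of such gating (one common formulation of the GRU equations, not code from the text), an update gate z and a reset gate r decide how much of the previous memory h to keep.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One step of a gated recurrent unit (GRU)."""
    z = sigmoid(Wz @ x + Uz @ h + bz)              # update gate: how much to overwrite
    r = sigmoid(Wr @ x + Ur @ h + br)              # reset gate: how much past to use
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)  # candidate memory
    return (1.0 - z) * h + z * h_tilde             # blend old memory with candidate
```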
D.6 Autoencoder Networks

Neural networks are often used to process high-dimensional inputs such as images or point clouds. These high-dimensional inputs are often highly structured, with the actual information content being much lower-dimensional than the high-dimensional space in which it is presented. Pixels in images tend to be highly correlated with their neighbors, and point clouds often have many regions of continuity. Sometimes we wish to build an understanding of the information content of our data sets by converting them to a much smaller set of features, or an embedding. This compression, or representation learning, has many advantages.11 Lower-dimensional representations can help facilitate the application of traditional machine learning techniques like Bayesian networks to what would have otherwise been intractable. The features can be inspected to develop an understanding of the information content of the data set, and these features can be used as inputs to other models.

11. Such dimensionality reduction can also be done using traditional machine learning techniques, such as principal component analysis. Neural models allow more flexibility and can handle nonlinear representations.
An autoencoder is a neural network trained to discover a low-dimensional feature representation of a higher-dimensional input. An autoencoder network takes in a high-dimensional input x and produces an output x′ with the same dimensionality. We design the network architecture to pass through a lower-dimensional intermediate representation called a bottleneck. The activations z at this bottleneck are our low-dimensional features, which exist in a latent space that is not explicitly observed. Such an architecture is shown in figure D.11.

[Figure D.11: x → encoder → encoding z → decoder → x′. An autoencoder passes a high-dimensional input through a low-dimensional bottleneck and then reconstructs the original input. Minimizing reconstruction loss can result in an efficient low-dimensional encoding.]

We train the autoencoder to reproduce its input. For example, to encourage the output x′ to match x as closely as possible, we may simply minimize the L2-norm:

minimize_θ E_{x∈D}[‖fθ(x) − x‖2]   (D.7)
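A minimal sketch of such an architecture (illustrative layer sizes, untrained weights, NumPy only; not the book's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    return rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out)

# Illustrative sizes: 784-dimensional input, 2-dimensional bottleneck.
W1, b1 = layer(784, 64)   # encoder
W2, b2 = layer(64, 2)
W3, b3 = layer(2, 64)     # decoder
W4, b4 = layer(64, 784)

def encode(x):
    return W2 @ np.tanh(W1 @ x + b1) + b2          # bottleneck activations z

def decode(z):
    return W4 @ np.tanh(W3 @ z + b3) + b4          # reconstruction x'

def reconstruction_loss(x):
    return np.linalg.norm(decode(encode(x)) - x)   # the L2-norm of equation (D.7)
```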
Noise is often added to the input to produce a more robust feature embedding, encouraging the network to learn the most efficient low-dimensional encoding that is sufficient to accurately reconstruct the original input. Furthermore, training is unsupervised, in that we do not need to guide the training to a particular feature set.

After training, the upper portion of the autoencoder above the bottleneck can be used as an encoder that transforms an input into the feature representation. The lower portion of the autoencoder can be used as a decoder that transforms the feature representation back into the original space, which can be useful when training neural networks to generate images or other high-dimensional outputs. Example D.5 shows an embedding learned for handwritten digits.
A variational autoencoder, shown in figure D.12, extends the autoencoder framework to learn a probabilistic encoder.12 Rather than outputting a deterministic encoding, the encoder produces a distribution over the encoding, which allows the model to assign confidence to its encoding. Multivariate Gaussian distributions with diagonal covariance matrices are often used for their mathematical convenience. In such a case, the encoder outputs both an encoding mean and a diagonal covariance matrix.

[Figure D.12: x → encoder → encoding distribution P(z) → z ∼ P(z) → decoder → x′. A variational autoencoder passes a high-dimensional input through a low-dimensional bottleneck that produces a probability distribution over the encoding. The decoder reconstructs samples from this encoding distribution to recover the original input. Variational autoencoders can therefore assign confidence to each encoded feature, and the decoder can thereafter be used as a generative model.]

Variational autoencoders are trained to minimize the expected reconstruction loss while keeping the encoding components close to a unit Gaussian. The former is achieved by taking a single sample from the encoding distribution with each pass through the network, z ∼ N(µ, σ⊤Iσ). For backpropagation to work, we typically include random noise w ∼ N(0, I) as an additional input to the neural network and obtain our sample according to z = µ + w ⊙ σ.

The components are kept close to a unit Gaussian by also minimizing the KL divergence (appendix A.10).13 This objective encourages smooth latent space representations. The network is penalized for spreading out the latent representations (large values of ‖µ‖) and for focusing each representation into a very small encoding space (small values of ‖σ‖), ensuring better coverage of the latent space. As a result, smooth variations in the input to the decoder result in smoothly varying outputs. This property allows decoders to be used as generative models, where samples from a unit multivariate Gaussian can be input to the decoder to produce realistic samples in the original space. The combined loss function is

minimize_θ E_{x∈D}[ ‖x′ − x‖2 + c ∑_{i=1}^{|µ|} DKL(N(µi, σi²) ‖ N(0, 1)) ]   (D.9)
subject to µ, σ = encoder(x + ε)
           x′ = decoder(µ + w ⊙ σ)

where the trade-off between the two losses is tuned by the scalar c > 0. Example D.6 demonstrates this process on a latent space learned from handwritten digits.

12. D. Kingma and M. Welling, "Auto-Encoding Variational Bayes," in International Conference on Learning Representations (ICLR), 2013.
13. The KL divergence between two univariate Gaussians, DKL(N(µ1, σ1²) ‖ N(µ2, σ2²)), is log(σ2/σ1) + (σ1² + (µ1 − µ2)²) / (2σ2²) − 1/2.
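A sketch of the sampling step and loss in equation (D.9) (illustrative Python; the encoder and decoder are assumed callables such as those sketched earlier, and the input noise ε is omitted for brevity). The closed-form KL term follows from the footnote formula with µ2 = 0 and σ2 = 1.

```python
import numpy as np

def kl_to_unit_gaussian(mu, sigma):
    """Closed-form KL divergence of N(mu_i, sigma_i^2) from N(0, 1),
    summed over the encoding components."""
    return np.sum(-np.log(sigma) + 0.5 * (sigma**2 + mu**2 - 1.0))

def vae_loss(x, encoder, decoder, c=1.0, rng=np.random.default_rng()):
    mu, sigma = encoder(x)                   # encoding distribution parameters
    w = rng.standard_normal(mu.shape)        # w ~ N(0, I)
    z = mu + w * sigma                       # reparameterized sample
    x_prime = decoder(z)                     # reconstruction
    return np.linalg.norm(x_prime - x) + c * kl_to_unit_gaussian(mu, sigma)
```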
Variational autoencoders are derived by representing the encoder as a condi-
tional distribution q(z | x), where x belongs to the observed input space and z
is in the unobserved embedding space. The decoder performs inference in the
other direction, representing p(x | z), in which case it also outputs a probability
distribution. We seek to minimize the KL divergence between q(z | x) and p(z | x),
which is the same as maximizing E[log p(x | z)] − DKL(q(z | x) ‖ p(z)), where
p(z) is our prior, the unit multivariate Gaussian to which we bias our encoding
distribution. We thus obtain our reconstruction loss and our KL divergence.
Example D.5. A visualization of a two-dimensional embedding learned for the MNIST digits.

We can use an autoencoder to train an embedding for the MNIST data set. In this example, we use an encoder similar to the convolutional network in example D.3, except with a two-dimensional output and no softmax layer. We construct a decoder that mirrors the encoder and train the full network to minimize the reconstruction loss. Here are the encodings for 10,000 images from the MNIST data set after training. Each encoding is colored according to the corresponding digit:
[Scatter plot of the 10,000 MNIST encodings in the (z1, z2) plane, colored by digit 0–9.]
We find that the digits tend to be clustered into regions that are roughly
radially distributed from the origin. Note how the encodings for 1 and 7 are
similar, as the two digits look alike. Recall that training is unsupervised, and
the network is not given any information about the digit values. Nevertheless,
these clusterings are produced.
Example D.6. A visualization of a two-dimensional embedding learned using a variational autoencoder for the MNIST digits, along with decoded outputs from inputs panned over the encoding space.

In example D.5, we trained an autoencoder on the MNIST data set. We can adapt the same network to produce two-dimensional mean and variance vectors at the bottleneck instead of a two-dimensional embedding, and then train it to minimize both the reconstruction loss and the KL divergence. Here, we show the mean encodings for the same 10,000 images from the MNIST data set. Each encoding is again colored according to the corresponding digit:

[Scatter plot of the 10,000 mean encodings in the (z1, z2) plane, colored by digit 0–9.]
D.7 Adversarial Networks

One common approach to penalize off-nominal outputs or behavior is to use adversarial learning by including a discriminator, as shown in figure D.13.14 A discriminator is a neural network that acts as a binary classifier: it takes in neural network outputs and learns to distinguish between real outputs from the training set and the outputs from the primary neural network. The primary neural network, also called a generator, is then trained to deceive the discriminator, thereby naturally producing outputs that are more difficult to distinguish from the data set. The primary advantage of this technique is that we do not need to design special features to identify or quantify how the output fails to match the training data; instead, we allow the discriminator to naturally learn such differences.

[Figure D.13: the primary network maps x to y; a discriminator receives y or ytrue and outputs P(true). A generative adversarial network uses a discriminator to force the primary network to produce more realistic output.]

Learning is adversarial in the sense that we have two neural networks: the primary neural network that we would like to produce realistic outputs, and the discriminator network that distinguishes between primary network outputs and real examples. Each is trained to outperform the other. Training is an iterative process in which each network is improved in turn. It can sometimes be challenging to balance their relative performance; if one network becomes too good, the other can become stuck.

14. These techniques were introduced by I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Nets," in Advances in Neural Information Processing Systems (NIPS), 2014.
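As a toy illustration of this alternating scheme (not from the text; the one-dimensional models and hand-derived gradients are purely for demonstration), a generator and a discriminator can be trained against each other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: real data come from N(3, 1); the generator G(z) = a*z + b maps
# unit Gaussian noise to samples, and the discriminator D(x) = sigmoid(w*x + c)
# estimates the probability that x came from the real data set.
a, b = 1.0, 0.0          # generator parameters
w, c = 0.1, 0.0          # discriminator parameters
lr = 0.01

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

for step in range(5_000):
    x_real = 3.0 + rng.standard_normal()
    z = rng.standard_normal()
    x_fake = a * z + b

    # Discriminator update: push D(x_real) toward 1 and D(x_fake) toward 0.
    p_real, p_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    dw = (p_real - 1.0) * x_real + p_fake * x_fake
    dc = (p_real - 1.0) + p_fake
    w, c = w - lr * dw, c - lr * dc

    # Generator update: push D(x_fake) toward 1 (deceive the discriminator).
    p_fake = sigmoid(w * x_fake + c)
    da = (p_fake - 1.0) * w * z
    db = (p_fake - 1.0) * w
    a, b = a - lr * da, b - lr * db

print(b)   # should drift toward 3, the mean of the real data
```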