
D Neural Representations

Neural networks are parametric representations of nonlinear functions.¹ The function represented by a neural network is differentiable, allowing gradient-based optimization algorithms such as stochastic gradient descent to optimize their parameters to better approximate desired input-output relationships.² Neural representations can be helpful in a variety of contexts related to decision making, such as representing probabilistic models, utility functions, and decision policies. This appendix outlines several relevant architectures.

¹ The name derives from the loose inspiration of networks of neurons in biological brains. We will not discuss these biological connections, but an overview and historical perspective is provided by B. Müller, J. Reinhardt, and M. T. Strickland, Neural Networks. Springer, 1995.

² This optimization process, when applied to neural networks with many layers, as we will discuss shortly, is often called deep learning. Several textbooks are dedicated entirely to these techniques, including I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. The Julia package Flux.jl provides efficient implementations of various learning algorithms.

D.1 Neural Networks

A neural network is a differentiable function y = fθ(x) that maps inputs x to produce outputs y and is parameterized by θ. Modern neural networks may have millions of parameters and can be used to convert inputs in the form of high-dimensional images or video into high-dimensional outputs like multidimensional classifications or speech.

The parameters of the network θ are generally tuned to minimize a scalar loss
function ℓ(fθ (x), y) that is related to how far the network output is from the desired
output. Both the loss function and the neural network are differentiable, allowing
us to use the gradient of the loss function with respect to the parameterization
∇θ ℓ to iteratively improve the parameterization. This process is often referred to
as neural network training or parameter tuning. It is demonstrated in example D.1.
Neural networks are typically trained on a data set of input-output pairs D. In
this case, we tune the parameters to minimize the aggregate loss over the data
set:
    arg min_θ ∑_{(x,y)∈D} ℓ(fθ(x), y)    (D.1)

Example D.1. The fundamentals of neural networks and parameter tuning.

Consider a very simple neural network, fθ(x) = θ1 + θ2 x. We wish our neural network to take the square footage x of a home and predict its price ypred. We want to minimize the square deviation between the predicted housing price and the true housing price by the loss function ℓ(ypred, ytrue) = (ypred − ytrue)². Given a training pair, we can compute the gradient:

    ∇θ ℓ(f(x), ytrue) = ∇θ (θ1 + θ2 x − ytrue)²
                      = [2(θ1 + θ2 x − ytrue), 2(θ1 + θ2 x − ytrue) x]

If our initial parameterization were θ = [10,000, 123] and we had the input-output pair (x = 2,500, ytrue = 360,000), then the loss gradient would be ∇θ ℓ = [−85,000, −2.125 × 10⁸]. We would take a small step in the opposite direction to improve our function approximation.
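
As a concrete sketch of one update, the gradient above can be evaluated and a gradient-descent step taken. The step size α below is an illustrative assumption; the example itself only says to take a small step in the opposite direction.

    import numpy as np

    def loss_gradient(theta, x, y_true):
        # Gradient of (theta1 + theta2*x - y_true)^2 with respect to [theta1, theta2].
        residual = theta[0] + theta[1] * x - y_true
        return np.array([2 * residual, 2 * residual * x])

    theta = np.array([10_000.0, 123.0])
    x, y_true = 2_500.0, 360_000.0

    grad = loss_gradient(theta, x, y_true)   # [-85_000, -2.125e8], matching the example
    alpha = 1e-9                             # assumed step size, not specified in the text
    theta = theta - alpha * grad             # step opposite the gradient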

Data sets for modern problems tend to be very large, making the gradient of
equation (D.1) expensive to evaluate. It is common to sample random subsets of
the training data in each iteration, using these batches to compute the loss gradient.
In addition to reducing computation, computing gradients with smaller batch sizes introduces some stochasticity to the gradient, which helps training to avoid getting stuck in local minima.
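
A sketch of such a minibatch loop for the linear model of example D.1. The synthetic data, batch size, learning rate, and iteration count are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(500, 4_000, size=1_000)                     # synthetic square footages
    Y = 150.0 * X + 20_000 + rng.normal(0, 5_000, size=1_000)   # synthetic prices

    theta = np.zeros(2)
    alpha, batch_size = 1e-8, 32                                # assumed hyperparameters

    for _ in range(10_000):
        idx = rng.choice(len(X), size=batch_size, replace=False)   # random batch
        x, y = X[idx], Y[idx]
        residual = theta[0] + theta[1] * x - y
        grad = np.array([2 * residual.mean(), 2 * (residual * x).mean()])
        theta -= alpha * grad                                   # stochastic gradient step

Each batch yields a noisy estimate of the full gradient, which is what introduces the stochasticity described above.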

D.2 Feedforward Networks

Neural networks are typically constructed to pass the input through a series of layers.³ Networks with many layers are often called deep. In feedforward networks, each layer applies an affine transform, followed by a nonlinear activation function applied elementwise:⁴

    x′ = φ.(Wx + b)    (D.2)

where matrix W and vector b are parameters associated with the layer. A fully connected layer is shown in figure D.1. The dimension of the output layer is different from that of the input layer when W is nonsquare. Figure D.2 shows a more compact depiction of the same network.

³ A sufficiently large, single-layer neural network can, in theory, approximate any function. See A. Pinkus, “Approximation Theory of the MLP Model in Neural Networks,” Acta Numerica, vol. 8, pp. 143–195, 1999.

⁴ The nonlinearity introduced by the activation function provides something analogous to the activation behavior of biological neurons, in which input buildup eventually causes a neuron to fire. A. L. Hodgkin and A. F. Huxley, “A Quantitative Description of Membrane Current and Its Application to Conduction and Excitation in Nerve,” Journal of Physiology, vol. 117, no. 4, pp. 500–544, 1952.


Figure D.1. A fully connected layer with a three-component input and a five-component output:

    x1′ = φ(w1,1 x1 + w1,2 x2 + w1,3 x3 + b1)
    x2′ = φ(w2,1 x1 + w2,2 x2 + w2,3 x3 + b2)
    x3′ = φ(w3,1 x1 + w3,2 x2 + w3,3 x3 + b3)
    x4′ = φ(w4,1 x1 + w4,2 x2 + w4,3 x3 + b4)
    x5′ = φ(w5,1 x1 + w5,2 x2 + w5,3 x3 + b5)

Figure D.2. A more compact depiction of figure D.1, drawn as x ∈ R³ → fully connected + φ → x′ ∈ R⁵. Neural network layers are often represented as blocks or slices for simplicity.

If there are no activation functions between them, multiple successive affine transformations can be collapsed into a single, equivalent affine transform:

    W2 (W1 x + b1) + b2 = W2 W1 x + (W2 b1 + b2)    (D.3)

These nonlinearities are necessary to allow neural networks to adapt to fit arbitrary target functions. To illustrate, figure D.3 shows the output of a neural network trained to approximate a nonlinear function.
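
As an illustrative sketch of equations (D.2) and (D.3), with the layer sizes, random parameters, and the choice of tanh as φ being assumptions made for the example:

    import numpy as np

    def dense(W, b, x, phi=np.tanh):
        # Fully connected layer: elementwise activation of an affine transform (equation D.2).
        return phi(W @ x + b)

    rng = np.random.default_rng(1)
    x = rng.normal(size=3)
    W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
    W2, b2 = rng.normal(size=(4, 5)), rng.normal(size=4)

    # Two affine layers with no activation in between...
    two_layers = W2 @ (W1 @ x + b1) + b2
    # ...equal a single affine layer with W = W2 W1 and b = W2 b1 + b2 (equation D.3).
    collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
    assert np.allclose(two_layers, collapsed)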

Figure D.3. A deep neural network fit to samples from a nonlinear function so as to minimize the squared error. This neural network has four affine layers, with 10 neurons in each intermediate representation. (The plot compares the true function, the training samples, and the learned model; plot omitted.)

There are many types of activation functions that are commonly used. Similar
to their biological inspiration, they tend to be close to zero when their input is
low and large when their input is high. Some common activation functions are
shown in figure D.5.

Sometimes special layers are incorporated to achieve certain effects. For example, in figure D.4, we used a softmax layer at the end to force the output to represent a two-element categorical distribution. The softmax function applies the exponential function to each element, which ensures that they are positive, and then renormalizes the resulting values:

    softmax(x)i = exp(xi) / ∑_j exp(xj)    (D.4)

Figure D.4. A simple, two-layer, fully connected network (x ∈ R² → fully connected + sigmoid → R⁵ → fully connected + softmax → ypred ∈ R²) trained to classify whether a given coordinate lies within a circle (shown in white). The nonlinearities allow neural networks to form complicated, nonlinear decision boundaries. (Plot omitted.)

Figure D.5. Several common activation functions (plots omitted): sigmoid 1/(1 + exp(−x)), tanh(x), softplus log(1 + exp(x)), relu max(0, x), leaky relu max(αx, x), and swish x · sigmoid(x).
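
A direct implementation of equation (D.4) might look as follows. Subtracting the maximum element before exponentiating is a standard numerical-stability trick and is an addition not mentioned in the text.

    import numpy as np

    def softmax(x):
        # Equation (D.4): exponentiate each element, then renormalize.
        z = np.exp(x - np.max(x))   # subtracting a constant does not change the result
        return z / np.sum(z)

    print(softmax(np.array([1.0, 2.0, 3.0])))   # ~[0.090, 0.245, 0.665], sums to 1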

Gradients for neural networks are typically computed using reverse accumulation.⁵ The method begins with a forward step, in which the neural network is evaluated using all input parameters. In the backward step, the gradient of each term of interest is computed working from the output back to the input. Reverse accumulation uses the chain rule for derivatives:

    ∂f(g(h(x)))/∂x = (∂f(g(h))/∂h)(∂h(x)/∂x) = (∂f(g)/∂g)(∂g(h)/∂h)(∂h(x)/∂x)    (D.5)

Example D.2 demonstrates this process. Many deep learning packages compute gradients using such automatic differentiation techniques.⁶ Users rarely have to provide their own gradients.

⁵ This process is commonly called backpropagation, which specifically refers to reverse accumulation applied to a scalar loss function. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning Representations by Back-Propagating Errors,” Nature, vol. 323, pp. 533–536, 1986.

⁶ A. Griewank and A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd ed. SIAM, 2008.

D.3 Parameter Regularization

Neural networks are typically underdetermined, meaning that there are multiple parameter instantiations that can result in the same optimal training loss.⁷ It is common to use parameter regularization, also called weight regularization, to introduce an additional term to the loss function that penalizes large parameter values. Regularization also helps prevent overfitting, which occurs when a network over-specializes to the training data but fails to generalize to unseen data.

Regularization often takes the form of an L2-norm of the parameterization vector:

    arg min_θ ∑_{(x,y)∈D} ℓ(fθ(x), y) + β‖θ‖₂    (D.6)

where the positive scalar β controls the strength of the parameter regularization. The scalar is often quite small, with values as low as 10⁻⁶, to minimize the degree to which matching the training set is sacrificed by introducing regularization.

⁷ For example, suppose that we have a neural network with a final softmax layer. The inputs to that layer can be scaled while producing the same output, and therefore the same loss.
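
A sketch of the regularized objective in equation (D.6), reusing the linear model from example D.1; the data points and the value of β here are purely illustrative.

    import numpy as np

    def regularized_loss(theta, X, Y, beta=1e-6):
        # Sum of squared errors plus an L2 penalty on the parameters (equation D.6).
        predictions = theta[0] + theta[1] * X
        data_loss = np.sum((predictions - Y) ** 2)
        return data_loss + beta * np.linalg.norm(theta)

    theta = np.array([10_000.0, 123.0])
    X = np.array([2_500.0, 1_200.0])
    Y = np.array([360_000.0, 180_000.0])
    print(regularized_loss(theta, X, Y))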


Example D.2. How reverse accumulation is used to compute parameter gradients given training data.

Recall the neural network and loss function from example D.1. The loss calculation can be drawn as a computational graph with intermediate values c1 = θ2 x, ypred = c1 + θ1, c2 = ypred − ytrue, and ℓ = c2².

Reverse accumulation begins with a forward pass, in which the computational graph is evaluated. We will again use θ = [10,000, 123] and the input-output pair (x = 2,500, ytrue = 360,000), giving c1 = 307,500, ypred = 317,500, c2 = −42,500, and ℓ = 1.81 × 10⁹.

The gradient is then computed by working back up the tree:

    ∂ℓ/∂c2 = −85,000    ∂c2/∂ypred = 1    ∂ypred/∂θ1 = 1    ∂ypred/∂c1 = 1    ∂c1/∂θ2 = 2,500

Finally, we compute:

    ∂ℓ/∂θ1 = (∂ℓ/∂c2)(∂c2/∂ypred)(∂ypred/∂θ1) = −85,000 · 1 · 1 = −85,000
    ∂ℓ/∂θ2 = (∂ℓ/∂c2)(∂c2/∂ypred)(∂ypred/∂c1)(∂c1/∂θ2) = −85,000 · 1 · 1 · 2,500 = −2.125 × 10⁸
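
A minimal sketch of the same forward and backward passes written out by hand, mirroring the numbers in this example rather than relying on an automatic differentiation package:

    # Forward pass through the computational graph of example D.2.
    x, y_true = 2_500.0, 360_000.0
    theta1, theta2 = 10_000.0, 123.0

    c1 = theta2 * x         # 307,500
    y_pred = c1 + theta1    # 317,500
    c2 = y_pred - y_true    # -42,500
    loss = c2 ** 2          # 1.81e9

    # Backward pass: local derivatives applied from the output back toward the inputs.
    dloss_dc2 = 2 * c2      # -85,000
    dc2_dypred = 1.0
    dypred_dtheta1 = 1.0
    dypred_dc1 = 1.0
    dc1_dtheta2 = x         # 2,500

    dloss_dtheta1 = dloss_dc2 * dc2_dypred * dypred_dtheta1            # -85,000
    dloss_dtheta2 = dloss_dc2 * dc2_dypred * dypred_dc1 * dc1_dtheta2  # -2.125e8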


D.4 Convolutional Neural Networks

Neural networks may have images or other multidimensional structures such as lidar scans as inputs. Even a relatively small 256 × 256 RGB image (similar to figure D.6) has 256 × 256 × 3 = 196,608 entries. Any fully connected layer taking an m × m × 3 image as input and producing a vector of n outputs would have a weight matrix with 3m²n values. The large number of parameters to learn is not only computationally expensive, it is also wasteful. Information in images is typically translation-invariant; an object in an image that is shifted right by 1 pixel should produce a similar, if not identical, output.

Convolutional layers⁸ both significantly reduce the amount of computation and support translation invariance by sliding a smaller, fully connected window to produce their output. Significantly fewer parameters need to be learned. These parameters tend to be receptive to local textures in much the same way that the neurons in the visual cortex respond to stimuli in their receptive fields.

Figure D.6. Multidimensional inputs like images generalize vectors to tensors. Here, we show a three-layer RGB image. Such inputs can have very many entries. (Image omitted.)

⁸ Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

Figure D.7. A convolutional layer repeatedly applies filters across an input tensor, such as an image, to produce an output tensor. This illustration shows how each application of the filter acts like a small, fully connected layer applied to a small receptive field to produce a single entry in the output tensor. Each filter is shifted across the input according to a prescribed stride. The resulting output has as many layers as there are filters. (Illustration omitted.)

The convolutional layer consists of a set of features, or kernels, each of which is
equivalent to a fully connected layer into which one can input a smaller region
of the input tensor. A single kernel is being applied once in figure D.7. These
features have full depth, meaning that if an input tensor is n × m × d, the features
will also have a third dimension of d. The features are applied many times by
sliding them over the input in both the first and second dimensions. If the stride
is 1 × 1, then all k filters are applied to every possible position and the output
dimension will be n × m × k. If the stride is 2 × 2, then the filters are shifted by 2
in the first and second dimensions with every application, resulting in an output
of size n/2 × m/2 × k. It is common for convolutional neural networks to increase
in the third dimension and reduce in the first two dimensions with each layer.
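
The following sketch applies a bank of filters to an input tensor with a prescribed stride, assuming no padding; the text does not specify boundary handling, and the 14 × 14 × 8 output in example D.3 implies that padding is used there.

    import numpy as np

    def conv_layer(x, filters, stride=1):
        # x has shape (n, m, d); filters has shape (k, fh, fw, d).
        # Each filter is applied at every stride-spaced receptive field,
        # producing one output layer per filter, as in figure D.7.
        n, m, d = x.shape
        k, fh, fw, _ = filters.shape
        out_h = (n - fh) // stride + 1
        out_w = (m - fw) // stride + 1
        out = np.zeros((out_h, out_w, k))
        for i in range(out_h):
            for j in range(out_w):
                patch = x[i * stride:i * stride + fh, j * stride:j * stride + fw, :]
                out[i, j, :] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
        return out

    x = np.random.default_rng(0).normal(size=(28, 28, 1))
    filters = np.random.default_rng(1).normal(size=(8, 5, 5, 1))
    print(conv_layer(x, filters, stride=2).shape)   # (12, 12, 8) without padding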


Convolutional layers are translation-invariant because each filter behaves the same regardless of where in the input it is applied. This property is especially useful
in spatial processing because shifts in an input image can yield similar outputs,
making it easier for neural networks to extract common features. Individual
features tend to learn how to recognize local attributes such as colors and textures.

Example D.3. A convolutional neural network for the MNIST data set. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

The MNIST data set contains handwritten digits in the form of 28 × 28 monochromatic images. It is often used to test image classification networks. Below is a sample convolutional neural network that takes an MNIST image as input and produces a categorical probability distribution over the 10 possible digits:

    28 × 28 × 1 input
    conv 5 × 5, stride 2 + relu → 14 × 14 × 8
    conv 3 × 3, stride 2 + relu → 7 × 7 × 16
    conv 3 × 3, stride 2 + relu → 4 × 4 × 32
    conv 3 × 3, stride 2 + relu → 2 × 2 × 32
    conv 2 × 2, stride 1 + relu → 1 × 1 × 32
    flatten → 32
    fully connected + softmax → ypred ∈ R¹⁰

Convolutional layers are used to efficiently extract features. The model shrinks in the first two dimensions and expands in the third dimension (the number of features) as the network depth increases. Eventually reaching a first and second dimension of 1 ensures that information from across the entire image can affect every feature. The flatten operation takes the 1 × 1 × 32 input and flattens it into a 32-component output. Such operations are common when transitioning between convolutional and fully connected layers. This model has 19,722 parameters. The parameters can be tuned to maximize the likelihood of the training data.

D.5 Recurrent Networks

The neural network architectures discussed so far are ill suited for temporal
or sequential inputs. Operations on sequences occur when processing images

from videos, when translating a sequence of words, or when tracking time-series data. In such cases, the outputs depend on more than just the most recent input.
In addition, the neural network architectures discussed so far do not naturally
produce variable-length outputs. For example, a neural network that writes an
essay would be difficult to train using a conventional, fully connected neural
network.

Figure D.8. Traditional neural networks do not naturally accept variable-length inputs or produce variable-length outputs. The figure depicts many-to-one, one-to-many, and many-to-many relationships between input sequences {x1, x2, x3, . . .} and output sequences {y1, y2, y3, . . .}. (Diagram omitted.)

When a neural network has sequential input, sequential output, or both (fig-
ure D.8), we can use a recurrent neural network to act over multiple iterations.
These neural networks maintain a recurrent state r, sometimes called its memory,
to retain information over time. For example, in translation, a word used early in
a sentence may be relevant to the proper translation of words later in the sentence.
Figure D.9 shows the structure of a basic recurrent neural network and how the
same neural network can be understood to be a larger network unrolled in time.

Figure D.9. A recurrent neural network (left) and the same recurrent neural network unrolled in time (right). These networks maintain a recurrent state r that allows the network to develop a sort of memory, transferring information across iterations. (Diagram omitted.)

This unrolled structure can be used to produce a rich diversity of sequential neural networks, as shown in figure D.10. Many-to-many structures come in multiple forms. In one form, the output sequence begins with the input sequence. In another form, the output sequence does not begin with the input sequence. When using variable-length outputs, the neural network output itself often indicates when a sequence begins or ends. The recurrent state is often initialized to zero, as are extra inputs after the input sequence has been passed in, but this need not be the case.
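
A minimal sketch of an unrolled recurrent network in a many-to-many form; the tanh cell, the dimensions, and the random parameters are assumptions rather than anything prescribed by the text.

    import numpy as np

    def rnn_forward(xs, Wr, Wx, Wy, br, by):
        # Unroll a simple recurrent cell over an input sequence. At each step the
        # recurrent state r is updated from the previous state and the current
        # input, and an output y is produced (a many-to-many structure).
        r = np.zeros(Wr.shape[0])          # recurrent state initialized to zero
        ys = []
        for x in xs:
            r = np.tanh(Wr @ r + Wx @ x + br)
            ys.append(Wy @ r + by)
        return ys

    rng = np.random.default_rng(0)
    n_state, n_in, n_out = 4, 3, 2         # assumed sizes
    Wr, Wx = rng.normal(size=(n_state, n_state)), rng.normal(size=(n_state, n_in))
    Wy = rng.normal(size=(n_out, n_state))
    br, by = np.zeros(n_state), np.zeros(n_out)

    sequence = [rng.normal(size=n_in) for _ in range(5)]
    outputs = rnn_forward(sequence, Wr, Wx, Wy, br, by)   # one output per input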

Figure D.10. A recurrent neural network can be unrolled in time to produce different relationships: one-to-many, many-to-one, and two forms of many-to-many. Unused or default inputs and outputs are grayed out. (Diagram omitted.)

Recurrent neural networks with many layers, unrolled over multiple time steps, effectively produce a very deep neural network. During training, gradients are computed with respect to the loss function. The contribution of layers farther from the loss function tends to be smaller than that of layers close to the loss function. This leads to the vanishing gradient problem, in which deep neural networks have vanishingly small gradients in their upper layers. These small gradients slow training.

Very deep neural networks can also suffer from exploding gradients, in which
successive gradient contributions through the layers combine to produce very
large values. Such large values make learning unstable. Example D.4 shows both
exploding and vanishing gradients.


Example D.4. A demonstration of how vanishing and exploding gradients arise in deep neural networks. This example uses a very simple neural network. In larger, fully connected layers, the same principles apply.

To illustrate vanishing and exploding gradients, consider a deep neural network made of one-dimensional, fully connected layers with relu activations. For example, if the network has three layers, its output is

    fθ(x) = relu(w3 relu(w2 relu(w1 x + b1) + b2) + b3)

The gradient with respect to a loss function depends on the gradient of fθ.

We can get vanishing gradients in the parameters of the first layer, w1 and b1, if the gradient contributions in successive layers are less than 1. For example, if any of the layers has a negative input to its relu function, the gradient of its inputs will be zero, so the gradient vanishes entirely. In a less extreme case, suppose that the weights are all w = 0.5, the offsets are all b = 0, and the input x is positive. In this case, the gradient with respect to w1 is

    ∂f/∂w1 = x · w2 · w3 · w4 · w5 · · ·

The deeper the network, the smaller the gradient will be.

We can get exploding gradients in the parameters of the first layer if the gradient contributions in successive layers are greater than 1. If we merely increase our weights to w = 2, the very same gradient doubles with every layer.
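
The effect can be checked numerically. This sketch evaluates the first-layer gradient x · w2 · w3 · · · for the network above, assuming equal weights, zero offsets, and a positive input so that every relu passes its input through unchanged.

    def first_layer_gradient(w, depth, x=1.0):
        # d f / d w1 for a deep 1-D relu network with equal weights w and zero offsets,
        # when every relu is active: the per-layer contributions simply multiply.
        return x * w ** (depth - 1)

    for depth in (5, 20, 50):
        print(depth, first_layer_gradient(0.5, depth), first_layer_gradient(2.0, depth))
    # With w = 0.5 the gradient vanishes as depth grows; with w = 2 it explodes.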


While exploding gradients can often be handled with gradient clipping, regularization, and initializing parameters to small values, these solutions merely shift the problem toward that of vanishing gradients. Recurrent neural networks often use layers specifically constructed to mitigate the vanishing gradient problem. They function by selectively choosing whether to retain memory, and these gates help regulate the memory and the gradient. Two common recurrent layers are long short-term memory (LSTM)⁹ and gated recurrent units (GRU).¹⁰

⁹ S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

¹⁰ K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation,” in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

D.6 Autoencoder Networks
Y. Bengio, “Learning Phrase
Neural networks are often used to process high-dimensional inputs such as images or point clouds. These high-dimensional inputs are often highly structured, with the actual information content being much lower-dimensional than the high-dimensional space in which it is presented. Pixels in images tend to be highly correlated with their neighbors, and point clouds often have many regions of continuity. Sometimes we wish to build an understanding of the information content of our data sets by converting them to a much smaller set of features, or an embedding. This compression, or representation learning, has many advantages.¹¹ Lower-dimensional representations can help facilitate the application of traditional machine learning techniques like Bayesian networks to what would have otherwise been intractable. The features can be inspected to develop an understanding of the information content of the data set, and these features can be used as inputs to other models.

¹¹ Such dimensionality reduction can also be done using traditional machine learning techniques, such as principal component analysis. Neural models allow more flexibility and can handle nonlinear representations.

An autoencoder is a neural network trained to discover a low-dimensional feature representation of a higher-dimensional input. An autoencoder network takes in a high-dimensional input x and produces an output x′ with the same dimensionality. We design the network architecture to pass through a lower-dimensional intermediate representation called a bottleneck. The activations z at this bottleneck are our low-dimensional features, which exist in a latent space that is not explicitly observed. Such an architecture is shown in figure D.11.

We train the autoencoder to reproduce its input. For example, to encourage the output x′ to match x as closely as possible, we may simply minimize the L2-norm:

    minimize_θ  E_{x∈D} [‖fθ(x) − x‖₂]    (D.7)

Figure D.11. An autoencoder passes a high-dimensional input through a low-dimensional bottleneck (x → encoder → encoding z → decoder → x′) and then reconstructs the original input. Minimizing reconstruction loss can result in an efficient low-dimensional encoding.
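
A sketch of the objective in equation (D.7); using purely linear encoder and decoder maps with a two-dimensional bottleneck is a simplification for illustration, not how the autoencoders in this appendix are built.

    import numpy as np

    def reconstruction_loss(x, W_enc, W_dec):
        # Encode x to a low-dimensional z, decode it back, and measure the L2 error (equation D.7).
        z = W_enc @ x       # bottleneck encoding
        x_rec = W_dec @ z   # reconstruction
        return np.linalg.norm(x_rec - x)

    rng = np.random.default_rng(0)
    x = rng.normal(size=16)                   # high-dimensional input
    W_enc = 0.1 * rng.normal(size=(2, 16))    # encoder to a 2-D latent space
    W_dec = 0.1 * rng.normal(size=(16, 2))    # decoder back to 16 dimensions
    print(reconstruction_loss(x, W_enc, W_dec))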

Noise is often added to the input to produce a more robust feature embedding:

    minimize_θ  E_{x∈D} [‖fθ(x + ǫ) − x‖₂]    (D.8)

Training to minimize the reconstruction loss forces the autoencoder to find the most efficient low-dimensional encoding that is sufficient to accurately reconstruct the original input. Furthermore, training is unsupervised, in that we do not need to guide the training to a particular feature set.

After training, the upper portion of the autoencoder above the bottleneck can be used as an encoder that transforms an input into the feature representation. The lower portion of the autoencoder can be used as a decoder that transforms the feature representation into the input representation. Decoding is useful when training neural networks to generate images or other high-dimensional outputs. Example D.5 shows an embedding learned for handwritten digits.

A variational autoencoder, shown in figure D.12, extends the autoencoder framework to learn a probabilistic encoder.¹² Rather than outputting a deterministic sample, the encoder produces a distribution over the encoding, which allows the model to assign confidence to its encoding. Multivariate Gaussian distributions with diagonal covariance matrices are often used for their mathematical convenience. In such a case, the encoder outputs both an encoding mean and a diagonal covariance matrix.

¹² D. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in International Conference on Learning Representations (ICLR), 2013.

Figure D.12. A variational autoencoder passes a high-dimensional input through a low-dimensional bottleneck that produces a probability distribution over the encoding (x → encoder → encoding distribution P(z) → z ∼ P(z) → decoder → x′). The decoder reconstructs samples from this encoding to reconstruct the original input. Variational autoencoders can therefore assign confidence to each encoded feature. The decoder can thereafter be used as a generative model.

Variational autoencoders are trained both to minimize the expected reconstruction loss and to keep the encoding components close to unit Gaussian. The former is achieved by taking a single sample from the encoding distribution with each passthrough, z ∼ N(µ, σ⊤Iσ). For backpropagation to work, we typically include random noise w ∼ N(0, I) as an additional input to the neural network and obtain our sample according to z = µ + w ⊙ σ.

The components are kept close to unit Gaussian by also minimizing the KL divergence (appendix A.10).¹³ This objective encourages smooth latent space representations. The network is penalized for spreading out the latent representations (large values for ‖µ‖) and for focusing each representation into a very small encoding space (small values for ‖σ‖), ensuring better coverage of the latent space. As a result, smooth variations of the input to the decoder can result in smoothly varying outputs. This property allows decoders to be used as generative models, where samples from a unit multivariate Gaussian can be input to the decoder to produce realistic samples in the original space. The combined loss function is

    minimize_θ  E_{x∈D} [ ‖x′ − x‖₂ + c ∑_{i=1}^{|µ|} DKL(N(µi, σi²), N(0, 1)) ]
    subject to  µ, σ = encoder(x + ǫ)
                x′ = decoder(µ + w ⊙ σ)    (D.9)

where the trade-off between the two losses is tuned by the scalar c > 0. Example D.6 demonstrates this process on a latent space learned from handwritten digits.

¹³ The KL divergence between two univariate Gaussians N(µ1, σ1²) and N(µ2, σ2²) is log(σ2/σ1) + (σ1² + (µ1 − µ2)²)/(2σ2²) − 1/2.

Variational autoencoders are derived by representing the encoder as a condi-
tional distribution q(z | x), where x belongs to the observed input space and z
is in the unobserved embedding space. The decoder performs inference in the
other direction, representing p(x | z), in which case it also outputs a probability
distribution. We seek to minimize the KL divergence between q(z | x) and p(z | x),
which is the same as maximizing E [log p(x | z)] − DKL (q(z | x) || p(z)), where
p(z) is our prior, the unit multivariate Gaussian to which we bias our encoding
distribution. We thus obtain our reconstruction loss and our KL divergence.
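
A sketch of a single-sample evaluation of equation (D.9). The encoder and decoder below are hypothetical stand-ins for neural networks, and the per-component KL term uses the closed form from the footnote with µ2 = 0 and σ2 = 1.

    import numpy as np

    def kl_to_unit_gaussian(mu, sigma):
        # Sum over components of D_KL(N(mu_i, sigma_i^2), N(0, 1)).
        return np.sum(np.log(1.0 / sigma) + (sigma ** 2 + mu ** 2) / 2.0 - 0.5)

    def vae_loss(x, encoder, decoder, c, rng):
        # One-sample estimate of the combined objective in equation (D.9).
        mu, sigma = encoder(x)
        w = rng.normal(size=mu.shape)    # reparameterization noise, w ~ N(0, I)
        z = mu + w * sigma               # z = mu + w ⊙ sigma
        x_rec = decoder(z)
        return np.linalg.norm(x_rec - x) + c * kl_to_unit_gaussian(mu, sigma)

    rng = np.random.default_rng(0)
    encoder = lambda x: (0.1 * x[:2], np.ones(2))         # hypothetical stand-in
    decoder = lambda z: np.concatenate([z, np.zeros(2)])  # hypothetical stand-in
    print(vae_loss(rng.normal(size=4), encoder, decoder, c=0.5, rng=rng))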

D.7 Adversarial Networks

We often want to train neural networks to produce high-dimensional outputs, such as images or sequences of helicopter control inputs. When the output space is
large, the training data may cover only a very small region of the state space. Hence,
training purely on the available data can cause unrealistic results or overfitting.
We generally want the neural network to produce plausible outputs. For example,
when producing images, we want the images to look realistic. When mimicking
human driving, such as in imitation learning (chapter 18), we want the vehicle to
typically stay in its lane and to react appropriately to other vehicles.


Example D.5. A visualization of a two-dimensional embedding learned for the MNIST digits.

We can use an autoencoder to train an embedding for the MNIST data set. In this example, we use an encoder similar to the convolutional network in example D.3, except with a two-dimensional output and no softmax layer. We construct a decoder that mirrors the encoder and train the full network to minimize the reconstruction loss. The resulting encodings (z1, z2) for 10,000 images from the MNIST data set after training, with each encoding colored according to the corresponding digit, are shown in a scatter plot that is omitted here.

We find that the digits tend to be clustered into regions that are roughly radially distributed from the origin. Note how the encodings for 1 and 7 are similar, as the two digits look alike. Recall that training is unsupervised, and the network is not given any information about the digit values. Nevertheless, these clusterings are produced.


Example D.6. A visualization of a two-dimensional embedding learned using a variational autoencoder for the MNIST digits, along with decoded outputs from inputs panned over the encoding space (plots omitted).

In example D.5, we trained an autoencoder on the MNIST data set. We can adapt the same network to produce two-dimensional mean and variance vectors at the bottleneck instead of a two-dimensional embedding, and then train it to minimize both the reconstruction loss and the KL divergence. The mean encodings (z1, z2) for the same 10,000 images from the MNIST data set, each again colored according to the corresponding digit, are shown in a scatter plot that is omitted here.

The variational autoencoder also produces clusters in the embedding space for each digit, but this time they are roughly distributed according to a zero-mean, unit variance Gaussian distribution. We again see how some encodings are similar, such as the significant overlap for 4 and 9.


One common approach to penalize off-nominal outputs or behavior is to use adversarial learning by including a discriminator, as shown in figure D.13.¹⁴ A discriminator is a neural network that acts as a binary classifier that takes in neural network outputs and learns to distinguish between real outputs from the training set and the outputs from the primary neural network. The primary neural network, also called a generator, is then trained to deceive the discriminator, thereby naturally producing outputs that are more difficult to distinguish from the data set. The primary advantage of this technique is that we do not need to design special features to identify or quantify how the output fails to match the training data; instead, we allow the discriminator to naturally learn such differences.

Learning is adversarial in the sense that we have two neural networks: the primary neural network that we would like to produce realistic outputs and the discriminator network that distinguishes between primary network outputs and real examples. Each is trained to outperform the other. Training is an iterative process in which each network is improved in turn. It can sometimes be challenging to balance their relative performance; if one network becomes too good, the other can become stuck.

Figure D.13. A generative adversarial network causes a primary network’s output to be more realistic by using a discriminator to force the primary network to produce more realistic output. (Diagram omitted; it shows an input x feeding the primary network, whose output y is compared with ytrue by the discriminator, which outputs P(true).)

¹⁴ These techniques were introduced by I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” in Advances in Neural Information Processing Systems (NIPS), 2014.
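
The alternating updates might be sketched as follows for a toy one-dimensional problem, with a linear generator, a logistic discriminator, hand-derived gradients, and a generator loss that rewards deceiving the discriminator; all of these choices are illustrative assumptions rather than the architectures discussed above.

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

    theta_g = np.array([0.0, 1.0])   # generator g(z) = theta_g[0] + theta_g[1] * z
    theta_d = np.array([0.0, 0.0])   # discriminator D(x) = sigmoid(theta_d[0] + theta_d[1] * x)
    alpha = 0.01

    for _ in range(5_000):
        x_real = rng.normal(3.0, 1.5, size=32)                  # samples from the data set
        x_fake = theta_g[0] + theta_g[1] * rng.normal(size=32)  # samples from the generator

        # Discriminator update: push D(x_real) toward 1 and D(x_fake) toward 0.
        d_real = sigmoid(theta_d[0] + theta_d[1] * x_real)
        d_fake = sigmoid(theta_d[0] + theta_d[1] * x_fake)
        grad_d = np.array([-(1 - d_real).mean() + d_fake.mean(),
                           -((1 - d_real) * x_real).mean() + (d_fake * x_fake).mean()])
        theta_d -= alpha * grad_d

        # Generator update: push D(g(z)) toward 1 to deceive the discriminator.
        z = rng.normal(size=32)
        x_fake = theta_g[0] + theta_g[1] * z
        d_fake = sigmoid(theta_d[0] + theta_d[1] * x_fake)
        grad_g = np.array([(-(1 - d_fake) * theta_d[1]).mean(),
                           (-(1 - d_fake) * theta_d[1] * z).mean()])
        theta_g -= alpha * grad_g

    print(theta_g)   # the generator's output distribution should drift toward the data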

