HODL Lec 2 Training NNs Intro TF

The document discusses setting up a simple neural network with two inputs, one hidden layer with 3 neurons, and one output neuron. It explains that the network has 13 total parameters and walks through predicting the output for new input values. The output layer choices are also reviewed: a linear output for regression, sigmoid for binary classification, and softmax for multi-class classification.

Uploaded by

Josh Li

Lecture 2: Building Deep Learning

Neural Networks in Tensorflow

15.S04: Hands-on Deep Learning


Spring 2022
Farias, Ramakrishnan
Recap: Designing a DNN
[Figure: a network with an input layer (x1 ... xk), hidden layer 1, hidden layer 2, and an output layer]

User chooses the # of hidden layers, the # of units in each layer, and the activation function(s) for the hidden layers and for the output layer

Let’s practice setting up a simple NN

• Problem Specification
• Two input variables
• An output variable that has to be between 0 and 1

• Design choices
• We will use one hidden layer with 3 ReLU neurons
• Since the output is constrained to be in (0,1), we will use the sigmoid for the output layer

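Let’s practice setting up a simple NN

[Figure: input layer (x1, x2), hidden layer with 3 neurons, output layer (y); every weight and bias is marked "?"]

How many parameters (i.e., weights and biases) does this network have? 13

• Hidden layer: 2 × 3 weights + 3 biases = 9
• Output layer: 3 × 1 weights + 1 bias = 4
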
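The count above can be checked with a few lines of Python. This is a minimal sketch; the helper name `dense_param_count` is made up for illustration and is not part of any library.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the layers
        total += n_out         # one bias per receiving unit
    return total

# 2 inputs -> 3 hidden neurons -> 1 output
print(dense_param_count([2, 3, 1]))  # 2*3 + 3 + 3*1 + 1 = 13
```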
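Let’s assume that we have trained* this network on data and have found these values for the parameters

[Figure: the same network with a trained value on every weight and bias]

Node | Bias | Weights
Hidden, top | -0.3 | 0.5 (x1), 0.1 (x2)
Hidden, middle | 0.2 | -0.1 (x1), 0.3 (x2)
Hidden, bottom | 0.5 | 0.2 (x1), -0.1 (x2)
Output (y) | 0.05 | -0.2 (top), -0.3 (middle), -0.15 (bottom)

*details coming up shortly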


Predicting with the NN

[Figure: the trained network, with the hidden-layer outputs labeled a1 (top), a2 (middle), and a3 (bottom)]

Output of hidden layer:

• Top node: max(0, -0.3 + 0.5x1 + 0.1x2) = a1
• Middle node: max(0, 0.2 - 0.1x1 + 0.3x2) = a2
• Bottom node: max(0, 0.5 + 0.2x1 - 0.1x2) = a3

Predicting with the NN

• Recall a1, a2, and a3 are the output of the hidden layer nodes
• Output of output layer node: y = sigmoid(0.05 - 0.2a1 - 0.3a2 - 0.15a3), where sigmoid(z) = 1 / (1 + exp(-z))

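The network can be written as this function

y = sigmoid(0.05 - 0.2*max(0, -0.3 + 0.5x1 + 0.1x2) - 0.3*max(0, 0.2 - 0.1x1 + 0.3x2) - 0.15*max(0, 0.5 + 0.2x1 - 0.1x2))

The network diagram and this function are equivalent.
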
Note the complexity of even this simple network compared to a logistic regression model

y is a much more complex function of its inputs x1 and x2 compared to (say) the logistic regression model y = sigmoid(w0 + w1x1 + w2x2), and it can capture more complex relationships between x and y.

Predicting with the NN

[Figure: the trained network with inputs x1 = 2.3 and x2 = 10.2]

Output of hidden layer nodes:

• Top node: max(0, -0.3 + 0.5*2.3 + 0.1*10.2) = 1.87 = a1
• Middle node: max(0, 0.2 - 0.1*2.3 + 0.3*10.2) = 3.03 = a2
• Bottom node: max(0, 0.5 + 0.2*2.3 - 0.1*10.2) = 0 = a3

Predicting with the NN

[Figure: the trained network with x1 = 2.3, x2 = 10.2, a1 = 1.87, a2 = 3.03, a3 = 0, and y = 0.226]

Output of output layer node:

• y = sigmoid(0.05 - 0.2*1.87 - 0.3*3.03 - 0.15*0) = sigmoid(-1.233) = 0.226

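The prediction walked through on these slides can be reproduced in plain Python. This is a sketch under one assumption: the output-node values (bias 0.05; weights -0.2, -0.3, -0.15) are read off the diagram, and they are the assignment that reproduces y = 0.226. The function name `predict` is illustrative only.

```python
import math

# Hidden-layer parameters from the slides: (bias, weight for x1, weight for x2).
HIDDEN = [(-0.3, 0.5, 0.1), (0.2, -0.1, 0.3), (0.5, 0.2, -0.1)]
# Output node (assumed reading of the diagram): bias, then weights for a1, a2, a3.
OUT_BIAS, OUT_WEIGHTS = 0.05, (-0.2, -0.3, -0.15)

def predict(x1, x2):
    # ReLU hidden layer: max(0, bias + weighted sum of inputs)
    a = [max(0.0, b + w1 * x1 + w2 * x2) for b, w1, w2 in HIDDEN]
    # Sigmoid output: squash the weighted sum of hidden outputs into (0, 1)
    z = OUT_BIAS + sum(w * ai for w, ai in zip(OUT_WEIGHTS, a))
    y = 1.0 / (1.0 + math.exp(-z))
    return a, y

a, y = predict(2.3, 10.2)
print([round(v, 2) for v in a], round(y, 3))  # activations 1.87, 3.03, 0.0 and y of about 0.226
```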
Output Layers

13
Output Layers for Regression and Classification
• If we want an output layer that predicts a probability (i.e., a single number that’s between 0 and 1), we can use the sigmoid activation.
• What if we wanted other kinds of outputs?
• A single number
• A vector of numbers (e.g., the GPS coordinates of points on a map)
• A vector of probabilities that add up to 1.0 (i.e., for classifying x into one of many classes)
• …

Output Layers for Regression and Classification

Output Variable | Output Layer
Single number (regression with a single output) | Single linear neuron
Single probability (binary classification) | Single sigmoid neuron
Vector of n numbers (regression with multiple outputs) | Stack of n linear neurons
Vector of n probabilities that add up to 1 (multi-class classification) | ?

Multi-Class Classification

Suppose the output variable is categorical with 10 levels

• We know how to output 10 numbers
• We know how to output 10 probabilities

How do we output 10 probabilities that sum to 1.0?

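The Softmax Layer
softmax takes in n arbitrary numbers and converts them to n probabilities that sum to 1

[Figure: a1, a2, …, a10 fed into a softmax layer]

$$\text{softmax}(a)_i = \frac{e^{a_i}}{\sum_{j=1}^{n} e^{a_j}}$$
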
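A minimal sketch of softmax in plain Python follows. The input values are made up, and subtracting the maximum before exponentiating is a common numerical-stability step that does not change the result.

```python
import math

def softmax(a):
    m = max(a)                           # shift by the max to avoid overflow in exp
    exps = [math.exp(v - m) for v in a]  # exponentiate each shifted number
    total = sum(exps)
    return [e / total for e in exps]     # normalize so the outputs sum to 1

probs = softmax([2.0, 1.0, 0.1, -1.0])   # made-up inputs
print(probs, sum(probs))
```

The largest input always receives the largest probability, and every output lies strictly between 0 and 1.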
Output Layers for Regression and Classification

Output Variable | Output Layer
Single number (regression with a single output) | Single linear neuron
Single probability (binary classification) | Single sigmoid neuron
Vector of n numbers (regression with multiple outputs) | Stack of n linear neurons
Vector of n probabilities that add up to 1 (multi-class classification) | Softmax

Loss Functions

19
Loss functions

• A “loss function” is a function that quantifies the error in a model’s predictions.
• If the predictions are close to the actual values, the “loss” would be small.
• A perfect model would have a loss of zero.
• In linear regression, you will recall that we quantify this error using the “sum of squared errors”. So, “sum of squared errors” is the loss function used in linear regression.
• The loss function we choose must be matched well with the kind of output that comes out of the model.

Loss functions for different output layers

Output Variable | Output Layer | Loss Function
Single number (regression with a single output) | Single linear neuron | Mean squared error
Single probability (binary classification) | Single sigmoid neuron | Binary cross-entropy
Vector of n numbers (regression with multiple outputs) | Stack of n linear neurons | Mean squared error
Vector of n probabilities that add up to 1 (multi-class classification) | Softmax | Categorical cross-entropy

Mean Squared Error (MSE) Loss

$$\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \text{model}(x^{(i)})\right)^2$$

• $y^{(i)}$: actual value of the ith data point
• $\text{model}(x^{(i)})$: predicted value of the ith data point

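The MSE formula translates directly into code. A small sketch with made-up actual and predicted values:

```python
def mse(y_true, y_pred):
    """Mean of squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

# Made-up example: squared errors are 0.25, 0, and 4
print(mse([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))  # (0.25 + 0 + 4) / 3
```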
Binary Cross-Entropy Loss

$$\frac{1}{n}\sum_{i=1}^{n}\left[-y^{(i)}\log\left(\text{model}(x^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-\text{model}(x^{(i)})\right)\right]$$

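A sketch of binary cross-entropy in plain Python, using made-up labels and predicted probabilities. The clipping constant `eps` is an added safeguard against log(0), not something the slide specifies.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / len(y_true)

# Made-up predictions: the first set is mostly right, the second mostly wrong
good = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
bad = binary_cross_entropy([1, 0, 1], [0.2, 0.7, 0.4])
print(good, bad)  # the better predictions give the smaller loss
```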
Why cross-entropy is a good loss function
for classification
See appendix

24
In case you are wondering …
An intuitive error metric for classification is the “error rate” (i.e., the fraction of misclassified data points). Why can’t we use this as the loss function?

We will come back to this after we learn about derivatives and gradient descent

Training a Deep Neural Network

26
Recap: Training Linear and Logistic Regression Models

• Linear Regression model + Data → lm
• Logistic Regression model + Data → glm

Recall
• Training is finding values for the weights/coefficients so that the model’s predictions come as close to the actual values as possible
• ‘lm’ and ‘glm’ use optimization algorithms under the hood to find these “best” values

Training a DNN

[Figure: DNN + Data → Training]

Training a DNN is no different since it is just a (very complex) function with lots of parameters

The essence of training

• The essence of training is to find the “best” values for the parameters i.e., those that minimize the chosen loss function

• “Finding the best parameters” = solving the optimization problem to minimize the loss function = “training the neural network”

Minimizing loss functions

30
Minimizing functions

• Loss functions are just a particular kind of function so we will first consider the general problem of minimizing an arbitrary function

• After we develop some intuition about how to do this, we will return to the specific task of minimizing a loss function

Minimizing a single-variable function

Let’s say we want to minimize the function g(w) [plot not reproduced].

• How can we go about this?
• Can we use its derivative?
• What does the derivative at a point tell us?

The derivative (or slope) tells us the change in g(w) for a small increase in w

https://kenndanielso.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html

The value of knowing the derivative

If the derivative at a point w is | What it means
Positive | Increasing w slightly will increase g(w)
Negative | Increasing w slightly will decrease g(w)
~0 | Changing w slightly won’t change g(w)

This suggests an algorithm for minimizing g(w)
1. Start with some point w
2. Calculate the derivative (i.e., slope) of g(w) at w

If the derivative is … | What it means | Since we want to minimize loss, do this …
Positive | Increasing w will increase the loss function | Reduce w slightly
Negative | Increasing w will decrease the loss function | Increase w slightly
~0 | Changing w won’t change the loss function | Stop

3. Go to step 2

This is Gradient Descent!

1. Start with some point w
2. Calculate the derivative (i.e., slope) of g(w) at w. The three cases (reduce w if the derivative is positive, increase w if it is negative, stop if it is ~0) can be written compactly as

$$w \leftarrow w - \alpha\,\frac{dg(w)}{dw}$$

3. Go to step 2

Gradient Descent

$$w \leftarrow w - \alpha\,\frac{dg(w)}{dw}$$

α is called the “learning rate” and is our way of ensuring that we increase or decrease w slightly

Typically set to small values (e.g., 0.1, 0.001, 0.0001) and determined by trial and error

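The update rule can be sketched in a few lines of Python. The function g(w) = (w - 3)^2 + 1, the starting point, and the learning rate below are hypothetical choices for illustration; they are not the function plotted in the lecture.

```python
def gradient_descent_1d(dg, w, alpha=0.1, tol=1e-8, max_iters=1000):
    """Repeat w <- w - alpha * dg(w) until the slope is (nearly) zero."""
    for _ in range(max_iters):
        slope = dg(w)
        if abs(slope) < tol:   # derivative ~0: stop
            break
        w = w - alpha * slope  # positive slope lowers w, negative slope raises it
    return w

# Hypothetical example function and its derivative
g = lambda w: (w - 3.0) ** 2 + 1.0
dg = lambda w: 2.0 * (w - 3.0)

w_min = gradient_descent_1d(dg, w=0.0)
print(w_min, g(w_min))  # w ends up very close to 3, where g is smallest
```

A learning rate that is too large can overshoot the minimum and oscillate or diverge, which is one reason small values are the usual starting point.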
Let’s apply this algorithm to g(w)

$$w \leftarrow w - \alpha\,\frac{dg(w)}{dw}$$

We will start at w = 2.5, set α = 1 and run the algorithm for a few iterations (switch to animation)

Gradient Descent in action

[Animation not reproduced; source: https://kenndanielso.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html]
Minimizing a multi-variable function

$$g(w_1, w_2) = w_1^2 + w_2^2 + 2$$

We can calculate the partial derivatives of $g(w_1, w_2)$:

$$\nabla g = \left[\frac{\partial g}{\partial w_1}, \frac{\partial g}{\partial w_2}\right] = [2w_1, 2w_2]$$

How should we interpret this?

The first number is the change in g(w) for a small increase in w1, with w2 kept unchanged. The second number is the change in g(w) for a small increase in w2, with w1 kept unchanged.

This is called the “gradient” of $g(w_1, w_2)$ and written as $\nabla g$
Minimizing a multi-variable function

$$\nabla g = \left[\frac{\partial g}{\partial w_1}, \frac{\partial g}{\partial w_2}\right] = [2w_1, 2w_2]$$

We can do gradient descent on each coordinate by using the corresponding partial derivative.

$$w_1 \leftarrow w_1 - \alpha\,\frac{\partial g}{\partial w_1}$$

$$w_2 \leftarrow w_2 - \alpha\,\frac{\partial g}{\partial w_2}$$

As before, this whole thing can be summarized compactly as:

$$w \leftarrow w - \alpha\,\nabla g(w)$$
Gradient Descent in two dimensions

$$g(w_0, w_1) = w_0^2 + w_1^2 + 2$$

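The two-dimensional update can be sketched with NumPy on the same function. The starting point and learning rate are hypothetical choices.

```python
import numpy as np

def g(w):
    return w[0] ** 2 + w[1] ** 2 + 2.0

def grad_g(w):
    return 2.0 * w  # gradient [2*w0, 2*w1]

w = np.array([2.0, -1.5])  # hypothetical starting point
alpha = 0.1                # hypothetical learning rate
for _ in range(100):
    w = w - alpha * grad_g(w)  # w <- w - alpha * gradient

print(w, g(w))  # w approaches [0, 0], where g reaches its minimum of 2
```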
GD may stop near a local minimum or a stationary point
(not necessarily a global minimum) but we don’t worry
about this in practice

50
Minimizing a loss function with gradient descent

Minimize

$$\frac{1}{n}\sum_{i=1}^{n}\left[-y^{(i)}\log\left(\text{model}(x^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-\text{model}(x^{(i)})\right)\right]$$

What are the variables we need to change to minimize this function?

They are the parameters “hiding” inside $\text{model}(x^{(i)})$

Recall this model and the NN it represents [figure not reproduced]

• w1, w2, …, w13 are the variables we can change to minimize the loss function
• The values of x1, x2 and y, on the other hand, are just data

Imagine replacing $\text{model}(x^{(i)})$ with the mathematical expression for the network wherever $\text{model}(x^{(i)})$ appears in the loss function

Now, your loss function is just a “good old” function of w1, w2, …, w13 and you can apply gradient descent to it as we normally would.

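To make "the loss is just a function of w1, …, w13" concrete, the sketch below packs the 13 parameters of the 2-3-1 network into one vector and runs plain gradient descent on the binary cross-entropy loss. Several things here are assumptions for illustration: the data is randomly generated, the parameter ordering is arbitrary, and the gradient is estimated by finite differences instead of backpropagation (which is what real frameworks use).

```python
import numpy as np

def model(params, X):
    """2-3-1 network: 3 ReLU hidden units, sigmoid output (13 parameters)."""
    W1 = params[0:6].reshape(2, 3)  # input -> hidden weights
    b1 = params[6:9]                # hidden biases
    W2 = params[9:12]               # hidden -> output weights
    b2 = params[12]                 # output bias
    A = np.maximum(0.0, X @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(A @ W2 + b2)))

def bce_loss(params, X, y):
    p = np.clip(model(params, X), 1e-12, 1 - 1e-12)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

def numerical_gradient(f, params, h=1e-5):
    """Central-difference estimate of each partial derivative."""
    grad = np.zeros_like(params)
    for k in range(params.size):
        step = np.zeros_like(params)
        step[k] = h
        grad[k] = (f(params + step) - f(params - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))              # made-up inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # made-up labels
params = rng.normal(scale=0.5, size=13)    # random starting point

loss_before = bce_loss(params, X, y)
for _ in range(500):
    params = params - 0.1 * numerical_gradient(lambda p: bce_loss(p, X, y), params)
loss_after = bce_loss(params, X, y)
print(loss_before, loss_after)  # the loss should go down
```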
An intuitive error metric for classification is the “error rate” (i.e., the fraction of misclassified data points). Why can’t we use this as the loss function?

Imagine increasing one weight from 0.3 to 0.301, keeping the other weights unchanged.

For every data point x, the predicted value model(x) will also change, but only very slightly. This change will alter the classification of x only if the predicted value went from below 0.5 to above 0.5 (or vice-versa).

This “0.5 crossing” will happen only if the predicted value for an x is very close to 0.5 before the change. This is very unlikely to be the case for most/all the points.

As a result, our classifications won’t change for most/all points! Therefore, the error rate won’t change either ⇒ the partial derivative of the error rate with respect to the weight we changed will be 0.0 ⇒ gradient descent will stop immediately!

[Figure: # misclassifications plotted against weight w around w = 0.3, with flat segments labeled “Slope = 0”]

No gradient information for gradient descent to act on!

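The argument can be checked numerically. The sketch below uses a hypothetical one-input sigmoid model with a fixed bias and a made-up dataset: nudging the weight from 0.3 to 0.301 leaves the misclassification count unchanged, while the cross-entropy loss moves.

```python
import math

X = [-2.0, -1.0, 0.2, 1.0, 2.0, 3.0]  # made-up inputs
Y = [0, 0, 1, 0, 1, 1]                # made-up labels
BIAS = -0.15                          # hypothetical fixed bias

def predict(w, x):
    return 1.0 / (1.0 + math.exp(-(w * x + BIAS)))

def misclassified(w):
    # A point is classified as 1 when the predicted probability exceeds 0.5
    return sum(int(predict(w, x) > 0.5) != y for x, y in zip(X, Y))

def cross_entropy(w):
    return sum(-y * math.log(predict(w, x)) - (1 - y) * math.log(1 - predict(w, x))
               for x, y in zip(X, Y)) / len(X)

for w in (0.3, 0.301):
    print(w, misclassified(w), cross_entropy(w))
```

The misclassification count stays at 2 for both weights, so it gives gradient descent nothing to follow, whereas the cross-entropy decreases slightly and indicates which way to move w.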
Chain rule/Backprop [placeholder]
• Skim sections 2.4.3 and 2.4.4 of textbook

58
Gradient Descent → Stochastic Gradient Descent

59
Making Gradient Descent work with large datasets

• Problem: For large datasets (e.g., n in the millions), computing the gradient of the loss function can be very expensive

• The Solution:
• At each iteration, instead of using all n data points to calculate the gradient of the loss function, randomly choose just a few of the n observations (called a minibatch) and use only these observations to compute the partial derivatives.
• This is called Stochastic Gradient Descent (SGD)*
• Because not all n data points are used in the calculation, this only approximates the true gradient but nevertheless works well in practice.
• SGD comes in many “flavors” and we will see these flavors in the remainder of HODL

* Strictly speaking, SGD chooses just one observation. What we are describing here is Minibatch Gradient Descent, but the term SGD is widely used in the field to describe the latter, so we will do the same
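A minimal sketch of minibatch gradient descent with NumPy follows. To keep the gradient formula short it fits a linear model with mean squared error instead of a neural network; the data, batch size, and learning rate are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 2))                                      # made-up features
y = X @ np.array([1.5, -2.0]) + 0.5 + 0.1 * rng.normal(size=n)   # made-up targets

w, b = np.zeros(2), 0.0
alpha, batch_size = 0.05, 32  # hypothetical settings

for step in range(2000):
    idx = rng.choice(n, size=batch_size, replace=False)  # pick a random minibatch
    Xb, yb = X[idx], y[idx]
    err = Xb @ w + b - yb                                # prediction errors on the minibatch only
    grad_w = 2.0 * Xb.T @ err / batch_size               # approximate gradient of MSE w.r.t. w
    grad_b = 2.0 * err.mean()                            # approximate gradient of MSE w.r.t. b
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)  # close to the values used to generate the data: [1.5, -2.0] and 0.5
```

Each step touches only 32 of the 10,000 rows, which is where the savings on large datasets come from.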
Summary of overall training flow

SGD and its siblings

Image: Page 61 of textbook

61
Overfitting and Regularization

62
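Recall Underfitting vs. Overfitting

[Figure: error (vertical axis) against model complexity (horizontal axis), with one curve for training data and one for unseen data; the “sweet spot” is marked between the underfitting and overfitting regions]

• Underfitting: Model cannot capture the richness of the data
• Overfitting: Model captures idiosyncrasies of training data
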
Overfitting in Neural Networks
• To learn smart representations of complex, unstructured data, the NN needs to have large “capacity” i.e., many layers and many neurons in each layer

• But this raises the likelihood of overfitting so we need to add regularization

• Several regularization methods have been developed to address this problem

Regularization strategy: Dropout
Randomly zero out the output from some of the nodes (typically 50% of the nodes) in a hidden layer
(implemented as a “dropout layer” in Keras)
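What a dropout layer does during training can be sketched in NumPy. This is an illustration, not Keras's implementation: the function name and inputs are made up. It follows the common "inverted dropout" convention, in which surviving outputs are scaled up by 1 / (1 - rate) so their expected value is unchanged, and dropout is switched off at prediction time.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    if not training or rate == 0.0:
        return activations                        # no dropout when predicting
    keep = rng.random(activations.shape) >= rate  # True where a node's output survives
    return activations * keep / (1.0 - rate)      # zero the rest, rescale the survivors

rng = np.random.default_rng(0)
a = np.ones((4, 8))                      # made-up hidden-layer outputs
dropped = dropout(a, rate=0.5, rng=rng)
print(dropped)                           # a mix of 0.0 (dropped) and 2.0 (kept, rescaled)
```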

<“Bank teller” analogy>


65
Regularization strategy: Early Stopping

Stop the training early, before the training loss is minimized, by monitoring the loss on a validation dataset.

[Figure: error against iterations for the training dataset and the validation dataset, with the early stopping point marked]
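The stopping rule can be sketched as a small function over a sequence of validation losses. The `patience` parameter (how many iterations without improvement to tolerate) and the loss values are assumptions for illustration; Keras offers a ready-made `keras.callbacks.EarlyStopping` callback built on the same idea.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # new best: remember it
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch             # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch       # ran out of epochs without stopping

# Made-up validation losses: they improve until epoch 3, then get worse
val = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.61, 0.65]
print(early_stopping_epoch(val))  # (6, 3): stop at epoch 6, best model was at epoch 3
```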
Summary: Creating and training a DNN from
scratch
• We get the data ready

• We design i.e., “lay out” the network


• Choose the number of hidden layers and the number of ‘neurons’ in each layer
• Pick the right output layer based on the type of the output (more on this shortly)

• We pick
• An appropriate loss function based on the type of the output (more on this shortly)
• An optimizer from the many SGD flavors that are available and a “good” learning rate

• We decide on a regularization strategy

• We set things up in Keras/Tensorflow and start training!

67
Lightning Intro to Keras and Tensorflow

68
Tensorflow and Keras
Tensorflow (TF) is a library that provides
• Numerous built-in functions to manipulate and transform tensors
• Automatic calculation of gradient of (complicated) loss functions
• Library of state-of-the-art optimizers i.e., SGD and its “siblings”
• Automatic distribution of computational load across servers
• Automatic adaptation of code to work on parallel hardware (GPUs and TPUs)

Image: Page 70 of textbook

Tensorflow and Keras
Keras “sits on top of” TF and provides
• Pre-defined layers
• Incredibly flexible ways to specify network architectures
• Easy ways to preprocess data
• Easy ways to train models and report metrics
• Easy access to pre-built industrial-strength models that you can download and customize

Image: Page 70 of textbook

A wealth of introductory and advanced material, with colabs, at tensorflow.org

What’s a Tensor?
• Tensor of rank 0 (aka Scalar): 42
• Tensor of rank 1 (aka Vector): (42, 23.4, 11.2)
• Tensor of rank 2 (aka Matrix)
• Tensor of rank 3 (aka “cube”)

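The rank of a tensor is simply its number of axes. The sketch below uses NumPy arrays, which follow the same rank convention as TensorFlow tensors; the matrix and cube shapes are arbitrary.

```python
import numpy as np

scalar = np.array(42)                # rank 0
vector = np.array([42, 23.4, 11.2])  # rank 1
matrix = np.zeros((2, 3))            # rank 2 (arbitrary shape)
cube = np.zeros((2, 3, 4))           # rank 3 (arbitrary shape)

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix), ("cube", cube)]:
    print(name, "rank:", t.ndim, "shape:", t.shape)
```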
Application: Predicting heart disease

72
Predicting Heart Disease
Using a dataset of patients made available by the Cleveland
Clinic, we will build our first DL model to predict if a patient
has been diagnosed with heart disease from demographics
and bio-markers

What we want to predict

73
Checklist reminder

• We get the data ready (will cover in the colab)

• We design i.e., “lay out” the network
• Choose the number of hidden layers and the number of ‘neurons’ in each layer: 1 hidden layer with 16 ReLU neurons
• Pick the right output layer based on the type of the output (more on this shortly): Sigmoid

• We pick
• An appropriate loss function based on the type of the output (more on this shortly): Binary cross-entropy
• An optimizer from the many SGD flavors that are available: “adam”

• We decide on a regularization strategy: Early stopping

• We set things up in Keras/Tensorflow and start training!

Before we start coding …
• Don’t worry if you don’t understand every detail of what we will do in class.

• But go through the Colab notebooks carefully later, play around with the code and make sure you understand every line

Colab

Predicting Heart Disease

76
Recap: Heart Disease Prediction Model

from tensorflow import keras

# num_columns is the number of input features (set in the Colab)
inputs = keras.Input(shape=(num_columns,))

# One hidden layer with 16 ReLU neurons
h = keras.layers.Dense(16, activation="relu")(inputs)

# A single sigmoid neuron outputs the predicted probability
output = keras.layers.Dense(1, activation="sigmoid")(h)

model = keras.Model(inputs=inputs, outputs=output)

https://www.tensorflow.org/guide/keras/functional
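The remaining checklist items (loss, optimizer, early stopping) could be wired up roughly as follows. This is a sketch, not the Colab's code: the data is random stand-in data, and `num_columns = 13`, the patience, the epoch count, the batch size, and the validation split are all hypothetical choices.

```python
import numpy as np
from tensorflow import keras

num_columns = 13  # hypothetical number of input features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, num_columns)).astype("float32")  # stand-in for the patient data
y = (X[:, 0] > 0).astype("float32")                        # stand-in labels

# Same architecture as the recap: 16 ReLU neurons, then one sigmoid neuron
inputs = keras.Input(shape=(num_columns,))
h = keras.layers.Dense(16, activation="relu")(inputs)
outputs = keras.layers.Dense(1, activation="sigmoid")(h)
model = keras.Model(inputs=inputs, outputs=outputs)

# Loss and optimizer from the checklist
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping: watch the validation loss and keep the best weights
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

history = model.fit(X, y, validation_split=0.2, epochs=50, batch_size=32,
                    callbacks=[early_stop], verbose=0)
print(model.count_params(), len(history.history["loss"]))
```

With 13 inputs the model has 13*16 + 16 + 16 + 1 = 241 parameters; the number of epochs actually run depends on when early stopping triggers.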
Output Layers for Regression and Classification Revisited

Output Variable | Output Layer | Keras
Single number (regression with a single output) | Single linear neuron | keras.layers.Dense(1, activation="linear")
Single probability (binary classification) | Single sigmoid neuron | keras.layers.Dense(1, activation="sigmoid")
Vector of n numbers (regression with multiple outputs) | Stack of n linear neurons | keras.layers.Dense(n, activation="linear")
Vector of n probabilities that add up to 1 (multi-class classification) | Softmax | keras.layers.Dense(n, activation="softmax")

Before the next class …
Go through today’s Colab notebook carefully later, play
around with the code and make sure you understand every
line

79
Colab Instructions
https://colab.research.google.com/drive/1f9jJDN0GH2pbGr_fo0WO3I_u4kRNOoGD

Step 1: Make your own copy of the notebook
Step 2: Request a GPU for your notebook
Step 3: Start your notebook

You need to do steps 1 and 2 just the first time you use a notebook. From the second time onwards, jump to Step 3.

Appendix

81
Why cross-entropy is a good loss function for classification
For a data point x, let’s say the predicted probability from the model is model(x) and the true classification is y. Recall that y = 0 or 1. Now, consider this function:

$$-\log(\text{probability assigned by model for the true classification of } x)$$

If y = 1, this becomes $-\log(\text{model}(x))$. If y = 0, this becomes $-\log(1-\text{model}(x))$.

This can be written in one line as: $-y\log(\text{model}(x)) - (1-y)\log(1-\text{model}(x))$

Convince yourself that this is a loss function:
• it is always non-negative
• a perfect model will have a value of zero
• models with better predictions will generally have lower values

Summing this across every data point and averaging, we get the cross-entropy loss:

$$\frac{1}{n}\sum_{i=1}^{n}\left[-y^{(i)}\log\left(\text{model}(x^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-\text{model}(x^{(i)})\right)\right]$$

Further reading (optional)
• http://neuralnetworksanddeeplearning.com/chap1.html#learning_with_gradient_descent
• https://kenndanielso.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html
• Skim sections 2.4.3 and 2.4.4 of textbook (for backprop and computation graphs)
