HODL Lec 2 Training NNs Intro TF
2
Let’s practice setting up a simple NN
• Problem Specification
• Two input variables
• An output variable that has to be between 0
and 1
• Design choices
• We will use one hidden layer with 3 ReLU
neurons
• Since the output is constrained to be in (0,1),
we will use the sigmoid for the output layer
3
Let’s practice setting up a simple NN
[Figure: network diagram with inputs x1 and x2, one hidden layer, and output y]
How many parameters (i.e., weights and biases) does this network have?
4
Let’s practice setting up a simple NN
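[Figure: the same network diagram, with every weight and bias marked “?”]
How many parameters (i.e., weights and biases) does this network have? 13
(2 × 3 weights + 3 biases for the hidden layer, plus 3 weights + 1 bias for the output layer)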
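The count of 13 can be checked with a few lines of Python. This is a minimal sketch for fully connected layers only; the function name `count_dense_parameters` is made up for illustration and is not part of any library.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, inputs first,
    e.g. [2, 3, 1] for 2 inputs, 3 hidden neurons and 1 output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = n_in * n_out  # one weight per connection
        biases = n_out          # one bias per receiving neuron
        total += weights + biases
    return total


n_params = count_dense_parameters([2, 3, 1])
print(n_params)  # 13
```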
5
Let’s assume that we have trained* this network on
data and have found these values for the parameters
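[Figure: the network diagram annotated with the trained weight and bias values; a1, a2 and a3 label the outputs of the three hidden-layer nodes. Values as extracted from the figure, in reading order: 0.5, -0.2, -0.1, 0.2, 0.05, 0.2, -0.3, 0.1, 0.3, 0.5, -0.15, -0.1]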
7
Predicting with the NN
• Recall a1, a2, and a3 are the output of the hidden layer nodes
• Output of output layer node:
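As a rough sketch of how a prediction is computed for the 2-input, 3-ReLU, 1-sigmoid network described earlier: each hidden node applies ReLU to a weighted sum of the inputs plus a bias, and the output node applies the sigmoid to a weighted sum of a1, a2, a3 plus a bias. The parameter values below are illustrative assumptions, not the trained values from the slide's figure.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative (assumed) parameters: 6 + 3 + 3 + 1 = 13 values in total.
HIDDEN_WEIGHTS = [(0.5, 0.2), (-0.1, 0.3), (0.2, 0.5)]  # (weight on x1, weight on x2) per hidden node
HIDDEN_BIASES = [-0.2, 0.1, -0.1]
OUTPUT_WEIGHTS = [0.05, -0.3, -0.15]                    # weights on a1, a2, a3
OUTPUT_BIAS = 0.2


def predict(x1, x2):
    # Hidden layer: a_j = ReLU(w_j1 * x1 + w_j2 * x2 + b_j)
    a = [relu(w1 * x1 + w2 * x2 + b)
         for (w1, w2), b in zip(HIDDEN_WEIGHTS, HIDDEN_BIASES)]
    # Output layer: y = sigmoid(sum_j v_j * a_j + c)
    z = sum(v * a_j for v, a_j in zip(OUTPUT_WEIGHTS, a)) + OUTPUT_BIAS
    return a, sigmoid(z)


hidden_outputs, y = predict(1.0, 2.0)
print(hidden_outputs, y)
```

Whatever the inputs, the ReLU outputs are never negative and the sigmoid keeps y strictly between 0 and 1.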
8
The Network can be written as this
function
Equivalent
9
Note the complexity of even this simple network
compared to a logistic regression model
10
Predicting with the NN
11
Predicting with the NN
12
Output Layers
13
Output Layers for Regression and
Classification
• If we want an output layer that predicts a probability (i.e.,
single number that’s between 0 and 1), we can use the
sigmoid activation.
• What if we wanted other kinds of outputs?
• A single number
• A vector of numbers (e.g., the GPS coordinates of points on a
map)
• A vector of probabilities that add up to 1.0 (i.e., for classifying x
into one of many classes)
• …
14
Output Layers for Regression and
Classification
Output Variable Output Layer
15
Multi-Class Classification
How do we output 10
probabilities that sum to
1.0?
16
The Softmax Layer
softmax takes in n arbitrary numbers and converts them to n probabilities
[Figure: $a_1, a_2, \ldots, a_{10}$ → softmax → 10 probabilities that sum to 1.0]
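A minimal sketch of the softmax computation in plain Python: exponentiate each number and divide by the sum of the exponentials. Subtracting the maximum first is a common numerical-stability trick and does not change the result. The ten input values are arbitrary, chosen only for illustration.

```python
import math


def softmax(values):
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, -1.0, 0.5, 0.0, 3.0, 1.0, -2.0, 0.3, 1.5, -0.7])
print(probs, sum(probs))
```

The largest input (here the fifth one) always receives the largest probability.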
17
Output Layers for Regression and
Classification
18
Loss Functions
19
Loss functions
20
Loss functions for different output layers
21
Mean Squared Error (MSE) Loss
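$$\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \mathrm{model}(x^{(i)})\right)^2$$
where $y^{(i)}$ is the actual value of the ith data point and $\mathrm{model}(x^{(i)})$ is the predicted value of the ith data point.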
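The MSE formula translates directly into code. This is a plain-Python sketch with made-up actual and predicted values.

```python
def mean_squared_error(y_true, y_pred):
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


# Squared errors are 0.25, 0.25 and 0.0, so the mean is 0.5 / 3.
loss = mean_squared_error([3.0, -0.5, 2.0], [2.5, 0.0, 2.0])
print(loss)
```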
22
Binary Cross-Entropy Loss
$$\frac{1}{n}\sum_{i=1}^{n}\left[-y^{(i)}\log\left(\mathrm{model}(x^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-\mathrm{model}(x^{(i)})\right)\right]$$
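A small plain-Python sketch of the binary cross-entropy loss. The clipping constant `eps` is an added assumption to avoid taking log(0); it is not part of the formula. The example shows that confident correct predictions get a small loss and confident wrong predictions get a large one.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```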
23
Why cross-entropy is a good loss function
for classification
See appendix
24
In case you are wondering …
An intuitive error metric for classification is the “error
rate” (i.e., the fraction of data points that are misclassified).
Why can’t we use this as the loss function?
25
Training a Deep Neural Network
26
Recap: Training Linear and Logistic Regression
Models
Linear Regression + Data → lm
Logistic Regression + Data → glm
Recall
• Training is finding values for the weights/coefficients so that the model’s predictions
come as close to the actual values as possible
• ‘lm’ and ‘glm’ use optimization algorithms under the hood to find these “best” values
27
Training a DNN
+ Data
Training
28
The essence of training
29
Minimizing loss functions
30
Minimizing functions
31
Minimizing a single-variable function
https://kenndanielso.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html
32
The value of knowing the derivative
https://kenndanielso.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html
36
This suggests an algorithm for minimizing
g(w)
1. Start with some point w
2. Calculate the derivative (i.e., slope) of g(w) at w
3. If the derivative is positive, decrease w slightly; if it is negative, increase w slightly
4. Go to step 2
39
This is Gradient Descent!
$$w \leftarrow w - \alpha \frac{dg(w)}{dw}$$
3. Go to step 2
40
Gradient Descent
$$w \leftarrow w - \alpha \frac{dg(w)}{dw}$$
𝛼 is called the “learning rate” and is our way of
ensuring that we increase or decrease 𝑤 slightly
41
Let’s apply this algorithm to g(w)
$$w \leftarrow w - \alpha \frac{dg(w)}{dw}$$
We will start at 𝑤 = 2.5, set 𝛼 = 1 and run the algorithm for
a few iterations (switch to animation)
https://kenndanielso.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html 42
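The update rule can be tried out in a few lines of Python. The slide does not state g(w) in text, so the function below, g(w) = (w^4 + w^2 + 10w) / 50, is an assumption chosen for illustration; the starting point w = 2.5 and the learning rate α = 1 are the values given above.

```python
def g(w):
    # Assumed example function; not stated in the slide text.
    return (w ** 4 + w ** 2 + 10 * w) / 50


def dg_dw(w):
    # Derivative of the assumed g(w).
    return (4 * w ** 3 + 2 * w + 10) / 50


def gradient_descent(w, alpha, n_iterations):
    history = [w]
    for _ in range(n_iterations):
        w = w - alpha * dg_dw(w)  # step against the slope
        history.append(w)
    return history


history = gradient_descent(w=2.5, alpha=1.0, n_iterations=200)
w_final = history[-1]
print(history[:4], w_final)
```

For this assumed function the iterates move left from 2.5 and settle where the derivative is essentially zero.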
Gradient Descent in action
https://kenndanielso.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html 43
Minimizing a multi-variable function
$$\nabla g = \left[\frac{\partial g}{\partial w_1}, \frac{\partial g}{\partial w_2}\right] = [2w_1, 2w_2]$$
The first number is the rate of change of g(w) for a small increase in w1,
with w2 kept unchanged. The second number is the rate of change of
g(w) for a small increase in w2, with w1 kept unchanged.
This vector is called the “gradient” of g(w1, w2) and is written as ∇g
https://kenndanielso.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html 46
Minimizing a multi-variable function
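$$\nabla g = [2w_1, 2w_2]$$
$$w_1 \leftarrow w_1 - \alpha \frac{\partial g}{\partial w_1}$$
$$w_2 \leftarrow w_2 - \alpha \frac{\partial g}{\partial w_2}$$
Or, in vector form:
$$w \leftarrow w - \alpha \nabla g(w)$$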
https://kenndanielso.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html 48
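The same update works coordinate by coordinate in two dimensions. This sketch uses only the gradient [2w1, 2w2] given above; the starting point, the learning rate of 0.1 and the iteration count are arbitrary choices for illustration.

```python
def grad_g(w):
    w1, w2 = w
    return [2 * w1, 2 * w2]  # gradient given on the slide


def gradient_descent(w, alpha, n_iterations):
    for _ in range(n_iterations):
        grad = grad_g(w)
        # w <- w - alpha * grad, applied to each coordinate
        w = [w_i - alpha * g_i for w_i, g_i in zip(w, grad)]
    return w


w_final = gradient_descent([3.0, -2.0], alpha=0.1, n_iterations=100)
print(w_final)
```

With this gradient, each step shrinks both coordinates by a factor of 0.8, so w approaches [0, 0], the point where the gradient vanishes.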
Gradient Descent in two dimensions
https://kenndanielso.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html
49
GD may stop near a local minimum or a stationary point
(not necessarily a global minimum) but we don’t worry
about this in practice
50
Minimizing a loss function with
gradient descent
Minimize
$$\frac{1}{n}\sum_{i=1}^{n}\left[-y^{(i)}\log\left(\mathrm{model}(x^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-\mathrm{model}(x^{(i)})\right)\right]$$
51
Minimizing a loss function with
gradient descent
Minimize
$$\frac{1}{n}\sum_{i=1}^{n}\left[-y^{(i)}\log\left(\mathrm{model}(x^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-\mathrm{model}(x^{(i)})\right)\right]$$
• w1, w2, …, w13 are the variables we can change to minimize the loss function
• The values of x1, x2 and y, on the other hand, are just data
54
Minimizing a loss function with
gradient descent
Minimize
$$\frac{1}{n}\sum_{i=1}^{n}\left[-y^{(i)}\log\left(\mathrm{model}(x^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-\mathrm{model}(x^{(i)})\right)\right]$$
Now, your loss function is just a “good old” function of w1, w2, …, w13 and you can apply
gradient descent to it as we normally would.
55
An intuitive error metric for classification is the “error rate”
(i.e., the fraction of data points that are misclassified). Why can’t
we use this as the loss function?
Imagine increasing this weight from 0.3 to 0.301,
keeping the other weights unchanged.
56
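An intuitive error metric for classification is the “error rate”
(i.e., the fraction of data points that are misclassified). Why can’t
we use this as the loss function?
[Figure: # misclassifications (vertical axis) plotted against weight w (horizontal axis), with w = 0.3 marked and flat regions labeled “Slope = 0”]
Because a tiny change in a weight usually leaves the number of misclassifications unchanged, the slope is 0 and gradient descent gets no signal about which way to move.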
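The zero-slope problem can be seen numerically. This is a toy sketch, assuming a one-weight model sigmoid(w · x) and four made-up data points: nudging w from 0.3 to 0.301 leaves the misclassification count unchanged, while the binary cross-entropy loss moves.

```python
import math

# Made-up toy data for a one-weight model: prediction = sigmoid(w * x).
X = [-2.0, -1.0, 1.0, 2.0]
Y = [0, 1, 0, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def n_misclassified(w):
    # Classify as 1 when the predicted probability is at least 0.5.
    return sum(int((sigmoid(w * x) >= 0.5) != (y == 1)) for x, y in zip(X, Y))


def cross_entropy(w):
    total = 0.0
    for x, y in zip(X, Y):
        p = sigmoid(w * x)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total / len(X)


errors_before, errors_after = n_misclassified(0.3), n_misclassified(0.301)
loss_before, loss_after = cross_entropy(0.3), cross_entropy(0.301)
print(errors_before, errors_after)  # identical counts: no slope to follow
print(loss_before, loss_after)      # slightly different: a usable slope
```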
57
Chain rule/Backprop [placeholder]
• Skim sections 2.4.3 and 2.4.4 of textbook
58
Gradient Descent → Stochastic Gradient Descent
59
Making Gradient Descent work with large
datasets
• Problem: For large datasets (e.g., n in the millions), computing the gradient of the
loss function can be very expensive
• The Solution:
• At each iteration, instead of using all the n data points in the calculation of the gradient of
the loss function, randomly choose just a few of the n observations (called a minibatch)
and use only these observations to compute the partial derivatives.
• This is called Stochastic Gradient Descent (SGD)*
• Because not all n data points are used in the calculation, this only approximates the true
gradient but nevertheless works well in practice.
• SGD comes in many “flavors” and we will see these flavors in the remainder of HODL
* Strictly speaking, SGD chooses just one observation at each iteration. What we are describing here is Minibatch Gradient
Descent, but the term SGD is widely used in the field to describe the latter, so we will do the same
60
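A minimal sketch of the minibatch idea in plain Python, assuming a one-weight linear model w · x with MSE loss and made-up, noise-free data generated with a true weight of 3. At each iteration the gradient is estimated from a random minibatch of 32 points instead of all 1,000. The batch size, learning rate and iteration count are arbitrary illustrative choices.

```python
import random


def minibatch_sgd(xs, ys, batch_size=32, learning_rate=0.1,
                  n_iterations=500, seed=0):
    rng = random.Random(seed)
    w = 0.0
    indices = range(len(xs))
    for _ in range(n_iterations):
        batch = rng.sample(indices, batch_size)  # random minibatch
        # Gradient of the MSE loss, estimated on the minibatch only.
        grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= learning_rate * grad
    return w


xs = [i / 500 - 1 for i in range(1000)]  # inputs spread over [-1, 1)
ys = [3.0 * x for x in xs]               # made-up data, true weight = 3
w_hat = minibatch_sgd(xs, ys)
print(w_hat)
```

Each step touches only 32 of the 1,000 points, yet the estimate still ends up very close to the true weight.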
Summary of overall training flow
61
Overfitting and Regularization
62
Recall Underfitting vs. Overfitting
[Figure: underfitting vs. overfitting as a function of model complexity, with the “sweet spot” marked]
63
Overfitting in Neural Networks
• To learn smart representations of complex, unstructured
data, the NN needs to have large “capacity” i.e., many
layers and many neurons in each layer
64
Regularization strategy: Dropout
Randomly zero out the output from some of the nodes (typically 50% of the nodes) in a hidden layer
(implemented as a “dropout layer” in Keras)
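To make the mechanics concrete, here is a plain-Python sketch of dropout applied to one layer's outputs. It uses the "inverted" form, in which surviving outputs are scaled up by 1 / (1 − rate) during training so that the expected output is unchanged; the Keras dropout layer applies this scaling during training and does nothing at prediction time. The function and the sample values are illustrative, not Keras code.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=random):
    if not training:
        return list(activations)  # dropout is switched off when predicting
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
hidden = [0.2, 1.5, 0.7, 0.0, 2.1, 0.9]
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)
```

With rate=0.5, every value in the result is either 0 or twice the original output.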
[Figure: error (vertical axis) vs. iterations (horizontal axis) for the training dataset and the validation dataset, with the early stopping point marked]
66
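Early stopping can be sketched as a simple rule: keep track of the best validation error seen so far and stop once it has failed to improve for a set number of iterations. The `patience` parameter and the validation losses below are assumptions made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, best_loss) under a simple patience rule."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_loss


# Validation error falls, then rises as the model starts to overfit.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, best_loss = early_stopping_epoch(val_losses)
print(best_epoch, best_loss)  # 3 0.5
```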
Summary: Creating and training a DNN from
scratch
• We get the data ready
• We pick
• An appropriate loss function based on the type of the output (more on this shortly)
• An optimizer from the many SGD flavors that are available and a “good” learning rate
67
Lightning Intro to Keras and Tensorflow
68
Tensorflow and Keras
Tensorflow (TF) is a library that provides
• Numerous built-in functions to manipulate and transform tensors
• Automatic calculation of the gradient of (complicated) loss functions
• A library of state-of-the-art optimizers, i.e., SGD and its “siblings”
• Automatic distribution of computational load across servers
• Automatic adaptation of code to work on parallel hardware (GPUs and TPUs)
Image: Page 70 of textbook
69
Tensorflow and Keras
Keras “sits on top of” TF and provides
• Pre-defined layers
• Incredibly flexible ways to specify network architectures
• Easy ways to preprocess data
• Easy ways to train models and report metrics
• Easy access to pre-built industrial-strength models that you can download and customize
Image: Page 70 of textbook
70
What’s a Tensor?
Tensor of rank 2 (aka Matrix)
Tensor of rank 0 (aka Scalar)
42
71
Application: Predicting heart disease
72
Predicting Heart Disease
Using a dataset of patients made available by the Cleveland
Clinic, we will build our first DL model to predict if a patient
has been diagnosed with heart disease from demographics
and bio-markers
73
Checklist reminder
• We design i.e., “lay out” the network
• Choose the number of hidden layers and the number of ‘neurons’ in each layer → 1 hidden layer with 16 ReLU neurons
• Pick the right output layer based on the type of the output (more on this shortly) → Sigmoid
74
Before we start coding …
• Don’t worry if you don’t understand every detail of what we will
do in class.
75
Colab
76
Recap: Heart Disease Prediction Model
from tensorflow import keras
input = keras.Input(shape=(num_columns,))  # one input per feature column
h = keras.layers.Dense(16, activation="relu")(input)  # hidden layer: 16 ReLU neurons
https://www.tensorflow.org/guide/keras/functional
77
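For reference, a complete version of this model might look like the sketch below. The recap shows only the input and hidden layers; the sigmoid output layer follows the checklist above, while the `num_columns` value, the `adam` optimizer and the `accuracy` metric are assumptions made for illustration and may differ from the Colab notebook.

```python
from tensorflow import keras

num_columns = 29  # hypothetical number of input features after preprocessing

inputs = keras.Input(shape=(num_columns,))
h = keras.layers.Dense(16, activation="relu")(inputs)     # 1 hidden layer, 16 ReLU neurons
output = keras.layers.Dense(1, activation="sigmoid")(h)   # probability of heart disease
model = keras.Model(inputs, output)

# Binary cross-entropy pairs with a sigmoid output; optimizer choice is an assumption.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

n_params = model.count_params()
print(n_params)
```

With these assumed sizes the parameter count is (29 × 16 + 16) + (16 + 1), following the same weights-plus-biases logic used for the 13-parameter network.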
Output Layers for Regression and
Classification Revisited
Output Variable Output Layer Keras
78
Before the next class …
Go through today’s Colab notebook carefully later, play
around with the code and make sure you understand every
line
79
Colab Instructions
https://colab.research.google.com/drive/1f9jJDN0GH2pbGr_fo0WO3I_u4kRNOoGD
You need to do steps 1 and 2 just the first time you use a notebook. From the second time onwards, jump to Step 3.
80
Appendix
81
Why cross-entropy is a good loss function
for classification
For a data-point x, let’s say the predicted probability from the model is model(x) and the true
classification is y. Recall that y = 0 or 1. Now, consider this function:
$$-\log(\text{probability assigned by model for the true classification of } x)$$
This can be written in one line as: $-y\log(\mathrm{model}(x)) - (1-y)\log(1-\mathrm{model}(x))$
Summing this across every data point and averaging, we get the cross-entropy loss:
$$\frac{1}{n}\sum_{i=1}^{n}\left[-y^{(i)}\log\left(\mathrm{model}(x^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-\mathrm{model}(x^{(i)})\right)\right]$$
82
Further reading (optional)
• http://neuralnetworksanddeeplearning.com/chap1.html#learning_with_gradient_descent
• https://kenndanielso.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html
83