HODL Lec 2 Training NNs Intro TF

The document discusses setting up a simple neural network with two inputs, one hidden layer with 3 neurons, and one output neuron. It explains that the network has 13 total parameters and walks through predicting the output for new input values. The output layer choices are also reviewed, including sigmoid for binary classification, softmax for multiclass classification, and linear outputs for regression.

Uploaded by

Josh Li


Lecture 2: Building Deep Learning

Neural Networks in Tensorflow

15.S04: Hands-on Deep Learning

Spring 2022
Farias, Ramakrishnan
Recap: Designing a DNN
[Figure: a deep neural network with an input layer, hidden layer 1, hidden layer 2, and an output layer]

User chooses the # of hidden layers, # units in each layer, the activation function(s) for the hidden layers and for the output layer
Let’s practice setting up a simple NN

• Problem Specification
• Two input variables
• An output variable that has to be between 0 and 1

• Design choices
• We will use one hidden layer with 3 ReLU neurons
• Since the output is constrained to be in (0,1), we will use the sigmoid for the output layer
Let’s practice setting up a simple NN

[Figure: network diagram with inputs x1 and x2, a hidden layer of three neurons, and one output y; each weight and bias is marked “?”]

How many parameters (i.e., weights and biases) does this network have? 13
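The count of 13 can be checked by hand: each neuron has one weight per incoming connection plus one bias. The sketch below does that arithmetic for any fully connected layout; the function name `count_dense_params` is made up for illustration and is not part of Keras.

```python
def count_dense_params(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, inputs first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the layers
        total += n_out         # one bias per receiving neuron
    return total


# 2 inputs -> 3 hidden neurons -> 1 output: (2*3 + 3) + (3*1 + 1)
n_params = count_dense_params([2, 3, 1])
print(n_params)
```

The hidden layer contributes 9 parameters and the output layer 4.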
Let’s assume that we have trained* this network on
data and have found these values for the parameters

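Hidden layer (bias, weight on x1, weight on x2):
• Top node: -0.3, 0.5, 0.1
• Middle node: 0.2, -0.1, 0.3
• Bottom node: 0.5, 0.2, -0.1

Output node (bias, weights on the top, middle, and bottom hidden nodes): 0.05, -0.2, -0.3, -0.15

*details coming up shortly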

Predicting with the NN

Output of hidden layer:

• Top node: max(0, -0.3 + 0.5x1 + 0.1x2) = a1
• Middle node: max(0, 0.2 - 0.1x1 + 0.3x2) = a2
• Bottom node: max(0, 0.5 + 0.2x1 - 0.1x2) = a3
Predicting with the NN

• Recall a1, a2, and a3 are the output of the hidden layer nodes
• Output of output layer node: y = sigmoid(0.05 - 0.2a1 - 0.3a2 - 0.15a3), where sigmoid(z) = 1 / (1 + e^(-z))
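The network can be written as this function:

y = sigmoid(0.05 - 0.2 max(0, -0.3 + 0.5x1 + 0.1x2) - 0.3 max(0, 0.2 - 0.1x1 + 0.3x2) - 0.15 max(0, 0.5 + 0.2x1 - 0.1x2))

The diagram and the function are equivalent.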
Note the complexity of even this simple network
compared to a logistic regression model

y is a much more complex function of its inputs x1 and x2 compared to (say) a logistic regression model, y = sigmoid(w0 + w1x1 + w2x2), and it can capture more complex relationships between x and y.
Predicting with the NN

Inputs: x1 = 2.3, x2 = 10.2

Output of hidden layer nodes:

• Top node: max(0, -0.3 + 0.5*2.3 + 0.1*10.2) = 1.87 = a1
• Middle node: max(0, 0.2 - 0.1*2.3 + 0.3*10.2) = 3.03 = a2
• Bottom node: max(0, 0.5 + 0.2*2.3 - 0.1*10.2) = max(0, -0.06) = 0 = a3
Predicting with the NN

Output of output layer node:

y = sigmoid(0.05 - 0.2*1.87 - 0.3*3.03 - 0.15*0) = sigmoid(-1.233) = 0.226
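The prediction walked through on these slides can be reproduced in a few lines of plain Python. This is a sketch, not the course's code: the hidden-layer parameters come from the slide's formulas, while the assignment of the output-layer numbers (bias 0.05, weights -0.2, -0.3, -0.15) is an assumption read off the diagram, chosen because it reproduces the slide's y = 0.226.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x1, x2):
    # Hidden layer: three ReLU neurons, parameters taken from the slides
    a1 = relu(-0.3 + 0.5 * x1 + 0.1 * x2)
    a2 = relu(0.2 - 0.1 * x1 + 0.3 * x2)
    a3 = relu(0.5 + 0.2 * x1 - 0.1 * x2)
    # Output layer: ASSUMED bias 0.05 and weights (-0.2, -0.3, -0.15)
    z = 0.05 - 0.2 * a1 - 0.3 * a2 - 0.15 * a3
    return a1, a2, a3, sigmoid(z)


a1, a2, a3, y = predict(2.3, 10.2)
print(round(a1, 2), round(a2, 2), round(a3, 2), round(y, 3))
```

The bottom node's pre-activation is negative for these inputs, so ReLU clips it to zero and it contributes nothing to y.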
Output Layers

13
Output Layers for Regression and
Classification
• If we want an output layer that predicts a probability (i.e.,
single number that’s between 0 and 1), we can use the
sigmoid activation.
• What if we wanted other kinds of outputs?
• A single number
• A vector of numbers (e.g., the GPS coordinates of points on a
map)
• A vector of probabilities that add up to 1.0 (i.e., for classifying x
into one of many classes)
• …

14
Output Layers for Regression and Classification

Output Variable | Output Layer
Single number (regression with a single output) | Single linear unit
Single probability (binary classification) | Single sigmoid unit
Vector of n numbers (regression with multiple outputs) | Stack of n linear units
Vector of n probabilities that add up to 1 (multi-class classification) | ?
Multi-Class Classification

Suppose the output variable is categorical with 10 levels

We know how to output 10 numbers. We know how to output 10 probabilities.

How do we output 10 probabilities that sum to 1.0?
The Softmax Layer
softmax takes in n arbitrary numbers and converts them to n probabilities

[Figure: ten numbers $a_1, a_2, \dots, a_{10}$ feed into a softmax layer]
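The slide does not spell out the formula, so as a sketch of the standard definition: softmax exponentiates each input and divides by the sum of the exponentials, which makes every output positive and makes the outputs sum to 1. Subtracting the largest input first is a common numerical-stability step added here; it does not change the result. The input values are made up.

```python
import math


def softmax(scores):
    # Shift by the max so exp() never overflows; the ratios are unchanged
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, -1.0, 0.5])  # hypothetical inputs
print(probs, sum(probs))
```

Larger inputs receive larger probabilities, so the ordering of the inputs is preserved.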
Output Layers for Regression and Classification

Output Variable | Output Layer
Single number (regression with a single output) | Single linear unit
Single probability (binary classification) | Single sigmoid unit
Vector of n numbers (regression with multiple outputs) | Stack of n linear units
Vector of n probabilities that add up to 1 (multi-class classification) | Softmax
Loss Functions

19
Loss functions

• A “loss function” is a function that quantifies the error in a model’s prediction.
• If the predictions are close to the actual values, the “loss” would be small.
• A perfect model would have a loss of zero.
• In linear regression, you will recall that we quantify this error using the “sum of squared errors”. So, “sum of squared errors” is the loss function used in linear regression.
• The loss function we choose must be matched well with the kind of output that comes out of the model.
Loss functions for different output layers

Output Variable | Output Layer | Loss Function
Single number (regression with a single output) | Single linear unit | Mean squared error
Single probability (binary classification) | Single sigmoid unit | Binary cross-entropy
Vector of n numbers (regression with multiple outputs) | Stack of n linear units | Mean squared error
Vector of n probabilities that add up to 1 (multi-class classification) | Softmax | Categorical cross-entropy
Mean Squared Error (MSE) Loss

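$$\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \text{model}(x^{(i)})\right)^2$$

where $y^{(i)}$ is the actual value of the ith data point and $\text{model}(x^{(i)})$ is the predicted value of the ith data point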
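As a minimal sketch of the MSE formula, the function below averages the squared differences between actual and predicted values. The function name and the three data points are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    n = len(y_true)
    # Average of squared (actual - predicted) over all data points
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


# Hypothetical actual and predicted values: errors are 0.5, 0.0, and -2.0
mse = mean_squared_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0])
print(mse)  # (0.25 + 0 + 4) / 3
```

Squaring makes the single large error dominate the average, which is why MSE is sensitive to outliers.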
Binary Cross-Entropy Loss

$$\frac{1}{n}\sum_{i=1}^{n}\left[-y^{(i)}\log \text{model}(x^{(i)}) - \left(1-y^{(i)}\right)\log\left(1-\text{model}(x^{(i)})\right)\right]$$
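A small sketch of the binary cross-entropy formula follows, with made-up labels and predicted probabilities. It assumes every predicted probability is strictly between 0 and 1, since log(0) is undefined; libraries typically clip predictions to avoid this.

```python
import math


def binary_cross_entropy(y_true, p_pred):
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Only one of the two terms is non-zero, depending on whether y is 1 or 0
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / n


# Hypothetical predictions: confident and right vs. confident and wrong
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confidently wrong predictions are penalized far more heavily than confidently right ones are rewarded.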
Why cross-entropy is a good loss function
for classification
See appendix

24
In case you are wondering …
An intuitive error metric for classification is the “error
rate” (i.e., the number of misclassified data points).
Why can’t we use this as the loss function?

We will come back to this after we learn about

derivatives and gradient descent

25
Training a Deep Neural Network

26
Recap: Training Linear and Logistic Regression
Models
[Figure: a linear regression model plus data is passed to lm; a logistic regression model plus data is passed to glm]

Recall
• Training is finding values for the weights/coefficients so that the model’s predictions come as close to the actual values as possible
• ‘lm’ and ‘glm’ use optimization algorithms under the hood to find these “best” values
Training a DNN

[Figure: a DNN plus data is passed to training]

Training a DNN is no different since it is just a (very complex) function with lots of parameters
The essence of training

• The essence of training is to find the “best” values for

the parameters i.e., those that minimize the chosen
loss function

• “Finding the best parameters” = solving the

optimization problem to minimize the loss function =
“training the neural network”

29
Minimizing loss functions

30
Minimizing functions

• Loss functions are just a particular kind of function so

we will first consider the general problem of
minimizing an arbitrary function

• After we develop some intuition about how to do

this, we will return to the specific task of minimizing a
loss function

31
Minimizing a single-variable function

Let’s say we want to minimize the function:

How can we go about this?

https://kenndanielso.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html
Can we use its derivative? What does the derivative at a point tell us?

The derivative (or slope) tells us the change in g(w) for a small increase in w
The value of knowing the derivative

If the derivative at a point w is … | What it means
Positive | Increasing w slightly will increase g(w)
Negative | Increasing w slightly will decrease g(w)
~0 | Changing w slightly won’t change g(w)
This suggests an algorithm for minimizing g(w)

1. Start with some point w
2. Calculate the derivative (i.e., slope) of g(w) at w

If the derivative is … | What it means | Since we want to minimize loss, do this …
Positive | Increasing w will increase the loss function | Reduce w slightly
Negative | Increasing w will decrease the loss function | Increase w slightly
~0 | Changing w won’t change the loss function | Stop

3. Go to step 2
This is Gradient Descent!

1. Start with some point w
2. Calculate the derivative (i.e., slope) of g(w) at w. The three cases (reduce w slightly if the derivative is positive, increase w slightly if it is negative, stop if it is ~0) can be written compactly as

$$w \leftarrow w - \alpha \frac{dg(w)}{dw}$$

3. Go to step 2
Gradient Descent

$$w \leftarrow w - \alpha \frac{dg(w)}{dw}$$

$\alpha$ is called the “learning rate” and is our way of ensuring that we increase or decrease $w$ slightly

Typically set to small values (e.g., 0.1, 0.001, 0.0001) and determined by trial and error
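The update rule can be sketched as a short loop. The function being minimized on the slides is not recoverable from this text, so the example uses a stand-in, g(w) = (w - 1)^2 with derivative 2(w - 1); the stand-in, the step count, and the learning rate of 0.1 are all assumptions made for illustration.

```python
def gradient_descent_1d(dg, w, alpha, steps):
    """Repeatedly apply w <- w - alpha * dg(w)."""
    for _ in range(steps):
        w = w - alpha * dg(w)
    return w


# Hypothetical stand-in function: g(w) = (w - 1)**2, so dg/dw = 2 * (w - 1)
w_final = gradient_descent_1d(lambda w: 2 * (w - 1), w=2.5, alpha=0.1, steps=100)
print(w_final)  # approaches the minimizer w = 1
```

Positive slopes push w down and negative slopes push it up, matching the table; near the minimum the slope is close to zero, so the steps shrink on their own.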
Let’s apply this algorithm to g(w)

$$w \leftarrow w - \alpha \frac{dg(w)}{dw}$$

We will start at $w = 2.5$, set $\alpha = 1$ and run the algorithm for a few iterations (switch to animation)

Gradient Descent in action
Minimizing a multi-variable function

$$g(w_1, w_2) = w_1^2 + w_2^2 + 2$$

We can calculate the partial derivatives of $g(w_1, w_2)$

$$\left[\frac{\partial g}{\partial w_1}, \frac{\partial g}{\partial w_2}\right] = [2w_1, 2w_2]$$

How should we interpret this?
Minimizing a multi-variable function

$$g(w_1, w_2) = w_1^2 + w_2^2 + 2$$

$$\nabla g = \left[\frac{\partial g}{\partial w_1}, \frac{\partial g}{\partial w_2}\right] = [2w_1, 2w_2]$$

The first number is the change in g(w) for a small increase in w1, with w2 kept unchanged. The second number is the change in g(w) for a small increase in w2, with w1 kept unchanged

This is called the “gradient” of $g(w_1, w_2)$ and written as $\nabla g$
Minimizing a multi-variable function

$$\nabla g = \left[\frac{\partial g}{\partial w_1}, \frac{\partial g}{\partial w_2}\right] = [2w_1, 2w_2]$$

We can do gradient descent on each coordinate by using the corresponding partial derivative.

$$w_1 \leftarrow w_1 - \alpha \frac{\partial g}{\partial w_1}$$

$$w_2 \leftarrow w_2 - \alpha \frac{\partial g}{\partial w_2}$$
Minimizing a multi-variable function

$$\nabla g = [2w_1, 2w_2]$$

$$w_1 \leftarrow w_1 - \alpha \frac{\partial g}{\partial w_1}$$

$$w_2 \leftarrow w_2 - \alpha \frac{\partial g}{\partial w_2}$$

As before, this whole thing can be summarized compactly as:

$$w \leftarrow w - \alpha \nabla g(w)$$
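The coordinate-wise updates can be sketched directly on the slide's function g(w1, w2) = w1^2 + w2^2 + 2. The starting point, learning rate, and number of steps below are arbitrary choices for illustration.

```python
def g(w1, w2):
    return w1 ** 2 + w2 ** 2 + 2


def grad_g(w1, w2):
    # Partial derivatives from the slide: [2*w1, 2*w2]
    return 2 * w1, 2 * w2


w1, w2 = 3.0, -2.0  # hypothetical starting point
alpha = 0.1         # hypothetical learning rate
for _ in range(50):
    d1, d2 = grad_g(w1, w2)
    # Both coordinates are updated from the same gradient evaluation
    w1, w2 = w1 - alpha * d1, w2 - alpha * d2

print(w1, w2, g(w1, w2))
```

Both coordinates shrink toward 0, where g reaches its minimum value of 2.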
Gradient Descent in two dimensions

$$g(w_1, w_2) = w_1^2 + w_2^2 + 2$$
GD may stop near a local minimum or a stationary point
(not necessarily a global minimum) but we don’t worry
about this in practice

50
Minimizing a loss function with
gradient descent
Minimize

$$\frac{1}{n}\sum_{i=1}^{n}\left[-y^{(i)}\log \text{model}(x^{(i)}) - \left(1-y^{(i)}\right)\log\left(1-\text{model}(x^{(i)})\right)\right]$$

What are the variables we need to change to minimize this function?

They are the parameters “hiding” inside $\text{model}(x^{(i)})$
Recall this model and the NN it represents

w1, w2, …, w13 are the variables we can change to minimize the loss function. The values of x1, x2 and y, on the other hand, are just data.

Imagine replacing $\text{model}(x^{(i)})$ with the mathematical expression above wherever $\text{model}(x^{(i)})$ appears in the loss function

Now, your loss function is just a “good old” function of w1, w2, …, w13 and you can apply gradient descent to it as we normally would.
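To make "the loss is just a function of the weights" concrete, here is a sketch that treats the binary cross-entropy loss as an ordinary function of a weight list and runs gradient descent on it. Several things are assumptions for illustration: the toy data, a 2-parameter stand-in model instead of the 13-parameter network, and a finite-difference gradient. Real frameworks compute gradients with backpropagation rather than finite differences.

```python
import math

# Hypothetical toy data: (x, y) pairs with y in {0, 1}
DATA = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]


def model(x, w):
    # Stand-in model with two parameters: bias w[0] and weight w[1]
    return 1.0 / (1.0 + math.exp(-(w[0] + w[1] * x)))


def loss(w):
    # Binary cross-entropy; the data is fixed, only w can vary
    total = 0.0
    for x, y in DATA:
        p = model(x, w)
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / len(DATA)


def numerical_gradient(f, w, eps=1e-6):
    # Central finite differences, one coordinate at a time
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


w = [0.0, 0.0]
alpha = 0.5
start_loss = loss(w)
for _ in range(200):
    grad = numerical_gradient(loss, w)
    w = [wi - alpha * gi for wi, gi in zip(w, grad)]
end_loss = loss(w)
print(start_loss, end_loss, w)
```

The same loop would apply to 13 weights; only `model` and the length of `w` would change.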
An intuitive error metric for classification is the “error rate”
(i.e., the number of misclassified data points). Why can’t
we use this as the loss function?
Imagine increasing this weight from 0.3 to 0.301, keeping the other weights unchanged.

For every data point x, the predicted value model(x) will also change but very slightly. But this change will change the classification of x only if the predicted value went from below 0.5 to above 0.5 (or vice-versa).

This “0.5 crossing” will happen only if the predicted value for an x is very close to 0.5 before the change. This is very unlikely to be the case for most/all the points.

As a result, our classifications won’t change for most/all points! And therefore, the error rate won’t change either => the partial derivative of the error rate with respect to the weight we changed will be 0.0 => gradient descent will stop immediately!
An intuitive error metric for classification is the “error rate”
(i.e., the number of misclassified data points). Why can’t
we use this as the loss function?

No gradient information for gradient descent to act on!

[Figure: # misclassifications plotted against weight w; the curve is flat (slope = 0) around w = 0.3]
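The flat-slope argument can be seen numerically. The sketch below uses a made-up one-weight model and a made-up dataset, nudges the weight from 0.3 to 0.301, and compares the number of misclassifications with the binary cross-entropy loss.

```python
import math

# Hypothetical data: (x, y) pairs; the last point is on the "wrong" side
DATA = [(-2.0, 0), (-1.0, 0), (0.5, 1), (2.0, 1), (-0.5, 1)]


def predict(x, w):
    # One-weight stand-in model: sigmoid(w * x)
    return 1.0 / (1.0 + math.exp(-w * x))


def num_misclassified(w):
    # A point is classified as 1 when the predicted probability exceeds 0.5
    return sum((predict(x, w) > 0.5) != (y == 1) for x, y in DATA)


def bce(w):
    total = 0.0
    for x, y in DATA:
        p = predict(x, w)
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / len(DATA)


for w in (0.3, 0.301):
    print(w, num_misclassified(w), bce(w))
```

The misclassification count is identical at both weights, while the cross-entropy changes slightly, which gives gradient descent a slope to follow.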
Chain rule/Backprop [placeholder]
• Skim sections 2.4.3 and 2.4.4 of textbook

58
Gradient Descent → Stochastic Gradient Descent

59
Making Gradient Descent work with large
datasets

• Problem: For large datasets (e.g., n in the millions), computing the gradient of the loss function can be very expensive

• The Solution:
• At each iteration, instead of using all the n data points in the calculation of the gradient of the loss function, randomly choose just a few of the n observations (called a minibatch) and use only these observations to compute the partial derivatives.
• This is called Stochastic Gradient Descent (SGD)*
• Because not all n data points are used in the calculation, this only approximates the true gradient but nevertheless works well in practice.
• SGD comes in many “flavors” and we will see these flavors in the remainder of HODL

* Strictly speaking, SGD chooses just one observation. What we are describing here is Minibatch Gradient Descent but the term SGD is widely used in the field to describe the latter so we will do the same
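A minimal sketch of minibatch gradient descent, under stated assumptions: synthetic data generated from y = 3x plus noise, a one-weight linear model, mean squared error as the loss, and arbitrary choices for batch size, learning rate, and step count.

```python
import random

random.seed(0)

# Hypothetical dataset: y = 3x plus a little Gaussian noise
xs = [random.uniform(-1, 1) for _ in range(1000)]
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in xs]

w = 0.0
alpha = 0.1
batch_size = 32

for step in range(500):
    # Randomly choose a few observations: the minibatch
    batch = random.sample(data, batch_size)
    # Gradient of the mean squared error, computed on the minibatch only
    grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    w = w - alpha * grad

print(w)  # should end up near 3
```

Each step touches 32 points instead of 1000, so the gradient is noisy but far cheaper to compute.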
Summary of overall training flow

SGD and its siblings

Image: Page 61 of textbook

61
Overfitting and Regularization

62
Recall Underfitting vs. Overfitting

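Underfitting: Model cannot capture the richness of the data

Overfitting: Model captures idiosyncrasies of training data

[Figure: error versus model complexity, with one curve for training data and one for unseen data; the “sweet spot” lies between the underfitting and overfitting regions]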
Overfitting in Neural Networks
• To learn smart representations of complex, unstructured
data, the NN needs to have large “capacity” i.e., many
layers and many neurons in each layer

• But this raises the likelihood of overfitting so we need to

add regularization

• Several regularization methods have been developed to

address this problem

64
Regularization strategy: Dropout
Randomly zero out the output from some of the nodes (typically 50% of the nodes) in a hidden layer
(implemented as a “dropout layer” in Keras)

<“Bank teller” analogy>

65
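Dropout can be sketched in a few lines. The function below is a hypothetical illustration, not Keras code: during training it zeroes each activation with probability `rate` and rescales the survivors by 1 / (1 - rate), which is the rescaling the Keras `Dropout` layer documents; outside training it passes values through unchanged. The activation values are made up.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=random):
    if not training:
        # At prediction time nothing is dropped
        return list(activations)
    keep = 1.0 - rate
    # Keep each value with probability `keep`, scaling survivors by 1/keep
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


random.seed(1)
acts = [1.87, 3.03, 0.0, 0.5]  # hypothetical hidden-layer outputs
train_out = dropout(acts, rate=0.5)
eval_out = dropout(acts, training=False)
print(train_out)
print(eval_out)
```

The rescaling keeps the expected value of each activation the same with or without dropout.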
Regularization strategy: Early Stopping

Stop the training early before the training loss is minimized by monitoring the loss on a validation dataset.

[Figure: error versus iterations for the training dataset and the validation dataset, with the early stopping point marked]
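One common way to decide when to stop is to wait a fixed number of iterations without improvement in validation loss. The sketch below is a hypothetical helper, not the Keras implementation; the name `patience` is borrowed from the Keras `EarlyStopping` callback, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: improving, then getting worse
val = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(val, patience=2)
print(stop_epoch, best_epoch)
```

In practice the weights from the best epoch are usually the ones kept, not the weights at the stopping epoch.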
Summary: Creating and training a DNN from
scratch
• We get the data ready

• We design i.e., “lay out” the network

• Choose the number of hidden layers and the number of ‘neurons’ in each layer
• Pick the right output layer based on the type of the output (more on this shortly)

• We pick
• An appropriate loss function based on the type of the output (more on this shortly)
• An optimizer from the many SGD flavors that are available and a “good” learning rate

• We decide on a regularization strategy

• We set things up in Keras/Tensorflow and start training!

67
Lightning Intro to Keras and Tensorflow

68
Tensorflow and Keras
Tensorflow (TF) is a library that provides
• Numerous built-in functions to manipulate and transform tensors
• Automatic calculation of gradient of (complicated) loss functions
• Library of state-of-the-art optimizers i.e., SGD and its “siblings”
• Automatic distribution of computational load across servers
• Automatic adaptation of code to work on parallel hardware (GPUs and TPUs)

Image: Page 70 of textbook
Tensorflow and Keras
Keras “sits on top of” TF and provides
• Pre-defined layers
• Incredibly flexible ways to specify network architectures
• Easy ways to preprocess data
• Easy ways to train models and report metrics
• Easy access to pre-built industrial-strength models that you can download and customize

Image: Page 70 of textbook

A wealth of introductory and advanced material, with colabs, at tensorflow.org
What’s a Tensor?
• Tensor of rank 0 (aka Scalar)
• Tensor of rank 1 (aka Vector), e.g., (42, 23.4, 11.2)
• Tensor of rank 2 (aka Matrix)
• Tensor of rank 3 (aka “cube”)
Application: Predicting heart disease

72
Predicting Heart Disease
Using a dataset of patients made available by the Cleveland
Clinic, we will build our first DL model to predict if a patient
has been diagnosed with heart disease from demographics
and bio-markers

What we want to predict

73
Checklist reminder

• We get the data ready (will cover in the colab)

• We design i.e., “lay out” the network
• Choose the number of hidden layers and the number of ‘neurons’ in each layer: 1 hidden layer with 16 ReLU neurons
• Pick the right output layer based on the type of the output: Sigmoid

• We pick
• An appropriate loss function based on the type of the output: Binary cross-entropy
• An optimizer from the many SGD flavors that are available: “adam”

• We decide on a regularization strategy: Early stopping

• We set things up in Keras/Tensorflow and start training!
Before we start coding …
• Don’t worry if you don’t understand every detail of what we will
do in class.

• But go through the Colab notebooks carefully later, play around

with the code and make sure you understand every line

75
Colab

Predicting Heart Disease

76
Recap: Heart Disease Prediction Model

from tensorflow import keras

# num_columns: number of input features, assumed to be defined beforehand
inputs = keras.Input(shape=(num_columns,))
h = keras.layers.Dense(16, activation="relu")(inputs)
outputs = keras.layers.Dense(1, activation="sigmoid")(h)
model = keras.Model(inputs=inputs, outputs=outputs)

https://www.tensorflow.org/guide/keras/functional
Output Layers for Regression and Classification Revisited

Output Variable | Output Layer | Keras
Single number (regression with a single output) | Single linear unit | keras.layers.Dense(1, activation="linear")
Single probability (binary classification) | Single sigmoid unit | keras.layers.Dense(1, activation="sigmoid")
Vector of n numbers (regression with multiple outputs) | Stack of n linear units | keras.layers.Dense(n, activation="linear")
Vector of n probabilities that add up to 1 (multi-class classification) | Softmax | keras.layers.Dense(n, activation="softmax")
Before the next class …
Go through today’s Colab notebook carefully later, play
around with the code and make sure you understand every
line

79
Colab Instructions
https://colab.research.google.com/drive/1f9jJDN0GH2pbGr_fo0WO3I_u4kRNOoGD

Step 1: Make your own copy of the notebook
Step 2: Request a GPU for your notebook
Step 3: Start your notebook

You need to do steps 1 and 2 just the first time you use a notebook. From the second time onwards, jump to Step 3.
Appendix

81
Why cross-entropy is a good loss function
for classification
For a data-point x, let’s say the predicted probability from the model is model(x) and the true classification is y. Recall that y = 0 or 1. Now, consider this function:

$$-\log(\text{probability assigned by model for the true classification of } x)$$

If y = 1, this becomes $-\log \text{model}(x)$. If y = 0, this becomes $-\log(1 - \text{model}(x))$

This can be written in one line as: $-y \log \text{model}(x) - (1 - y)\log(1 - \text{model}(x))$

Convince yourself that this is a loss function:

• it is always non-negative
• a perfect model will have a value of zero
• models with better predictions will generally have lower values

Summing this across every data point and averaging, we get the cross-entropy loss:

$$\frac{1}{n}\sum_{i=1}^{n}\left[-y^{(i)}\log \text{model}(x^{(i)}) - \left(1-y^{(i)}\right)\log\left(1-\text{model}(x^{(i)})\right)\right]$$
Further reading (optional)
• http://neuralnetworksanddeeplearning.com/chap1.html#learning_with_gradient_descent

• https://kenndanielso.github.io/mlrefined/blog_posts/6_First_order_methods/6_4_Gradient_descent.html

• Skim sections 2.4.3 and 2.4.4 of textbook (for backprop and computation graphs)
