Udacity Deep Learning Part 4: RNN
The neural network architectures you've seen so far were trained using the current inputs only.
We did not consider previous inputs when generating the current output. In other words, our
systems did not have any memory elements. RNNs address this very basic and important issue
by using memory (i.e. past inputs to the network) when producing the current output.
03. RNN History
A bit of history
How did the theory behind RNN evolve? Where were we a few years ago and where
are we now?
As mentioned in this video, RNNs have a key flaw: capturing relationships that span more than 8 or 10 steps back is practically impossible. This flaw stems from the "vanishing gradient" problem, in which the contribution of information decays geometrically over time.
Please use these resources if you would like to read more about the Vanishing
Gradient problem or understand further the concept of a Geometric Series and how
its values may exponentially decrease.
If you are still curious, for more information on the important milestones mentioned
here, please take a peek at the following links:
TDNN
As mentioned in the video, Long Short-Term Memory Cells (LSTMs) and Gated
Recurrent Units (GRUs) give a solution to the vanishing gradient problem, by helping
us apply networks that have temporal dependencies. In this lesson we will focus on
RNNs and continue with LSTMs. We will not be focusing on GRUs.
More information about GRUs can be found in the following blog. Focus on the
overview titled: GRUs.
04. RNN Applications
Applications
The world's leading tech companies are all using RNNs, particularly LSTMs, in their
applications. Let's take a look at a few.
There are so many interesting applications; let's look at a few more!
Are you into gaming and bots? Check out the DotA 2 bot by OpenAI
The mathematical calculations needed for training RNN systems are fascinating. To deeply
understand the process, we first need to feel confident with the vanilla FFNN system. We need
to thoroughly understand the feedforward process, as well as the backpropagation process used
in the training phases of such systems.
The next few videos will cover these topics, which you are already familiar with. We will
address the feedforward process as well as backpropagation, using specific examples. These
examples will serve as extra content to help further understand RNNs later in this lesson.
The following couple of videos will give you a brief overview of the Feedforward Neural
Network (FFNN).
OK, you can take a small break now. We will continue with FFNN when you come back!
As mentioned before, when working with neural networks we have two primary phases: Training and Evaluation.
During the training phase, we take the data set (also called the training set), which includes
many pairs of inputs and their corresponding targets (outputs). Our goal is to find a set of
weights that would best map the inputs to the desired outputs.
In the evaluation phase, we use the network that was created in the training phase, apply our
new inputs and expect to obtain the desired outputs.
The training phase itself consists of two steps: Feedforward and Backpropagation.
We will repeat these steps as many times as we need until we decide that our system has
reached the best set of weights, giving us the best possible outputs.
You will notice that in these videos I use subscripts as well as superscripts as numeric notation for the weight matrices.
If you are not feeling confident with linear combinations and matrix multiplications,
you can use the following links as a refresher:
Linear Combination
Matrix Multiplication
Assuming that we have a single hidden layer, we will need two steps in our calculations. The first will be calculating the value of the hidden states, and the second will be calculating the value of the outputs.
Notice that both the hidden layer and the output layer are displayed as vectors, as
they are both represented by more than a single neuron.
Our first video will help you understand the first step: calculating the value of the hidden states.
As you saw in the video above, vector \bar{h}' of the hidden layer will be calculated by multiplying the input vector with the weight matrix W^1 in the following way:

\bar{h}' = \bar{x} W^1
Using vector by matrix multiplication, we can look at this computation the following
way:
Equation 1
\bar{h} = \Phi(\bar{h}')
Since W_{ij} represents the weight component in the weight matrix, connecting neuron i from the input to neuron j in the hidden layer, we can also write these calculations in the following way (notice that in this example we have n inputs and only 3 hidden neurons):
Equation 2
More information on the activation functions and how to use them can be found here
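The following is a minimal NumPy sketch of this first step, assuming a sigmoid activation for \Phi (the lesson keeps the activation general, and the sizes here are hypothetical):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical sizes: n = 4 inputs and 3 hidden neurons.
    x = np.random.randn(4)        # input vector x
    W1 = np.random.randn(4, 3)    # weight matrix W^1 (input to hidden)

    h_prime = x @ W1              # h' = x W^1 (linear combination)
    h = sigmoid(h_prime)          # h = Phi(h')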
This next video will help you understand the second step: calculating the values of the outputs.
As you've seen in the video above, the process of calculating the output vector is
mathematically similar to that of calculating the vector of the hidden layer. We use,
again, a vector by matrix multiplication, which can be followed by an activation
function. The vector is the newly calculated hidden layer and the matrix is the one
connecting the hidden layer to the output.
Equation 3
The two error functions that are most commonly used are the Mean Squared Error
(MSE) (usually used in regression problems) and the cross entropy (usually used in
classification problems).
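Continuing the NumPy sketch from the first step, here is a hedged example of the output computation and both error functions (the output weight vector W^2, the target d, and the sizes are all hypothetical):

    # W^2 connects the 3 hidden neurons to 2 output neurons.
    W2 = np.random.randn(3, 2)
    y = sigmoid(h @ W2)           # Equation 3: y = Phi(h W^2)

    d = np.array([1.0, 0.0])      # desired (target) output

    mse = 0.5 * np.sum((d - y) ** 2)     # Mean Squared Error (with the 1/2 factor)
    cross_entropy = -np.sum(d * np.log(y) + (1 - d) * np.log(1 - y))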
The next few videos will focus on the backpropagation process, or what we also call stochastic gradient descent with the use of the chain rule.
07. Feedforward Quiz
The following picture is of a feedforward network with:
A single input x
Two hidden layers (with M and N neurons, respectively)
A single output
How many multiplication operations are needed for a single feedforward pass?
MN
M+N
M+N+2MN
M+N+NM
There isn't enough information to solve this question
SOLUTION:M+N+NM
Solution
To calculate the number of multiplications needed for a single feedforward pass, we can break
down the network to three steps:
Step 1
The single input is multiplied by a vector with M values. Each value in the vector will
represent a weight connecting the input to the first hidden layer. Therefore, we will
have M multiplication operations.
Step 2
Each value in the first hidden layer (M in total) will be multiplied by a vector with N values.
Each value in the vector will represent a weight connecting the neurons in the first hidden
layer to the neurons in the second hidden layer. Therefore, we will have here M times N
calculations, or simply MN multiplication operations.
Step 3
Each value in the second hidden layer (N in total) will be multiplied once, by the weight element connecting it to the single output. Therefore, we will have N multiplication operations. Summing all three steps gives M + MN + N multiplications in total.
For calculation purposes in future quizzes of the lesson, you can use the following link as a
reference for common derivatives.
If we look at an arbitrary layer k, we can define the amount by which we change the weights from neuron i to neuron j stemming from layer k as \Delta W_{ij}^k.
The superscript k indicates that the weight connects layer k to layer k+1.
Therefore, the weight update rule for that neuron can be expressed as:
The updated value \Delta W_{ij}^k is calculated through the use of the gradient calculation, in the following way:

\Delta W_{ij}^k = \alpha \left(-\frac{\partial E}{\partial W}\right),

where \alpha is a small positive number called the Learning Rate.
Equation 5
From this derivation we can easily see that the weight updates are calculated by the following equation:
Since many weights determine the network’s output, we can use a vector of the partial derivatives (denoted by the Greek letter nabla, \nabla) of the network error, each with respect to a different weight:

W_{new} = W_{previous} + \alpha \nabla_W(-E)
Equation 7
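As a minimal sketch of one such gradient-descent update in NumPy (the gradient value here is a hypothetical placeholder for a computed \partial E / \partial W):

    import numpy as np

    def gradient_descent_step(W_previous, grad_E, alpha=0.01):
        # Equation 7: W_new = W_previous + alpha * (-dE/dW)
        return W_previous + alpha * (-grad_E)

    # Example with a 2x3 weight matrix and a made-up gradient.
    W = np.ones((2, 3))
    W = gradient_descent_step(W, grad_E=np.full((2, 3), 0.5))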
Here you can find other good resources for understanding and tuning the Learning Rate:
resource 1
resource 2
The following video is given as a refresher on overfitting. You have already seen this concept in the Training Neural Networks lesson. Feel free to skip it and jump right into the next video.
09. Backpropagation - Example (part a)
Backpropagation- Example (part a)
We will now continue with an example focusing on the backpropagation process, and consider a network having two inputs [x_1, x_2], three neurons in a single hidden layer [h_1, h_2, h_3] and a single output y.
The weight matrices to update are W^1 from the input to the hidden layer, and W^2 from the hidden layer to the output. Notice that in our case W^2 is a vector, not a matrix, as we only have one output.
The chain of thought in the weight updating process is as follows:
To update the weights, we need the network error. To find the network error, we need the
network output, and to find the network output we need the value of the hidden layer,
vector \bar{h}.

\bar{h} = [h_1, h_2, h_3]
Equation 8
Each element of vector \bar{h} is calculated by a simple linear combination of the input vector with its corresponding weight matrix W^1, followed by an activation function.
Equation 9
We now need to find the network's output, y. y is calculated in a similar way, by using a linear combination of the vector \bar{h} with its corresponding elements of the weight vector W^2.
Equation 10
After computing the output, we can finally find the network error.
As a reminder, the two Error functions most commonly used are the Mean Squared Error
(MSE) (usually used in regression problems) and the cross entropy (often used in classification
problems).
E = \frac{(d-y)^2}{2},
where d is the desired output and y is the calculated one. Notice that y and d are not vectors in this case, as we have a single output.
The error is their squared difference, E = (d-y)^2, and is also called the network's Loss Function. We are dividing the error term by 2 to simplify notation, as will become clear soon.
The aim of the backpropagation process is to minimize the error, which in our case is the Loss
Function. To do that we need to calculate its partial derivative with respect to all of the
weights.
Since we just found the output y, we can now minimize the error by finding the updated values \Delta W_{ij}^k.
The superscript k indicates that we need to update each and every layer k.
As we noted before, the weight update value \Delta W_{ij}^k is calculated with the use of the gradient in the following way:

\Delta W_{ij}^k = \alpha \left(-\frac{\partial E}{\partial W}\right)
Therefore:
Equation 12
We will find all the elements of the gradient using the chain rule.
If you are feeling confident with the chain rule and understand how to apply it, skip the next
video and continue with our example. Otherwise, give Luis a few minutes of your time as he
takes you through the process!
Chain Rule
10. Backpropagation- Example (part b)
Backpropagation- Example (part b)
Now that we understand the chain rule, we can continue with our backpropagation example, where we will calculate the gradient.
In our example we only have one hidden layer, so our backpropagation process will consist of
two steps:
Step 1: Calculating the gradient with respect to the weight vector W^2 (from the output to the hidden layer).
Step 2: Calculating the gradient with respect to the weight matrix W^1 (from the hidden layer to the input).
Step 1
(Note that the weight vector referenced here will be W^2. All indices referring to W^2 have been omitted from the calculations to keep the notation simple.)
Equation 13
Having calculated the incremental value, we can update vector W^2 in the following way:
Equation 15
Step 2
(In this step, we will need to use both weight matrices. Therefore we will not be omitting the
weight indices.)
In our second step we will update the weights of matrix W^1 by calculating the partial derivative of y with respect to the weight matrix W^1.
The chain rule will be used in the following way: obtain the partial derivative of y with respect to \bar{h}, and multiply it by the partial derivative of \bar{h} with respect to the corresponding elements in W^1. Instead of referring to vector \bar{h}, we can look at each element and present the equation in the following way:
Equation 16
In this example we have only 3 neurons in the single hidden layer, so this will be a linear combination of three elements:
Equation 17
Equation 18
Notice that most of the derivatives were zero, leaving us with the simple solution of \frac{\partial y}{\partial h_j} = W^2_j.
To calculate \frac{\partial h_j}{\partial W^1_{ij}}, we first need to remember that
Equation 19
Therefore:
Equation 20
Since the function h_j is an activation function \Phi of a linear combination, its partial derivative will be calculated in the following way:
Equation 21
Given that there are various activation functions, we will leave the partial derivative of \Phi in general notation. Each neuron j will have its own value for \Phi and \Phi', according to the activation function we choose to use.
Equation 22
(Notice how simple the result is, as most of the components of this partial derivative are zero).
Equation 23
After understanding how to treat each multiplication of equation 21 separately, we can now summarize it in the following way:
Equation 24
Equation 25
Since \Delta W^1_{ij} = \alpha (d-y) \frac{\partial y}{\partial W^1_{ij}}, when finalizing Step 2 we have:
Equation 26
Having calculated the incremental value, we can update matrix W^1 in the following way:

W^1_{new} = W^1_{previous} + \Delta W^1_{ij}

W^1_{new} = W^1_{previous} + \alpha (d-y) W^2_j \Phi'_j x_i
Equation 27
After updating the weight matrices we begin once again with the Feedforward pass, starting
the process of updating the weights all over again.
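Putting the whole example together, here is a minimal NumPy sketch of one feedforward pass and one weight update for this 2-input, 3-hidden-neuron, single-output network. It assumes a sigmoid activation for \Phi and a linear output, and the input, target, and learning-rate values are hypothetical:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -0.2])     # inputs [x_1, x_2]
    d = 0.9                       # desired output
    alpha = 0.1                   # learning rate

    W1 = np.random.randn(2, 3)    # input -> hidden
    W2 = np.random.randn(3)       # hidden -> output (a vector)

    # Feedforward.
    h = sigmoid(x @ W1)           # Equation 9
    y = h @ W2                    # Equation 10
    E = 0.5 * (d - y) ** 2        # the Loss Function

    # Step 1: Delta W^2 = alpha * (d - y) * h, since dy/dW^2_j = h_j.
    W2_new = W2 + alpha * (d - y) * h

    # Step 2: Delta W^1_ij = alpha * (d - y) * W^2_j * Phi'_j * x_i
    # (Equation 27); for a sigmoid, Phi'_j = h_j * (1 - h_j).
    phi_prime = h * (1.0 - h)
    W1_new = W1 + alpha * (d - y) * np.outer(x, W2 * phi_prime)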
This video touches on the subject of Mini Batch Training. We will further explain things in
our Hyperparameters lesson coming up.
11. Backpropagation Quiz
The following picture is of a feedforward network with
A single input x
Two hidden layers with two neurons in each layer
A single output
What is the update rule of weight matrix W1?
(In other words, what is the partial derivative of y with respect to W1?)
Equation A
Equation B
Equation C
Equation D
SOLUTION:Equation A
12. RNN (part a)
We are finally ready to talk about Recurrent Neural Networks (or RNNs), where we will be
opening the doors to new content!
RNNs are based on the same principles as those behind FFNNs, which is why we spent so
much time reminding ourselves of the feedforward and backpropagation steps that are used in
the training phase.
There are two main differences between FFNNs and RNNs. The Recurrent Neural Network uses:
sequences as inputs in the training phase, and
memory elements.
Memory is defined as the output of the hidden layer neurons, which will serve as additional input to the network during the next training step.
The basic three-layer neural network with feedback that serves as memory input is called the Elman Network and is depicted in the following picture:
Diff between RNN and feedforward NN
13. RNN (part b)
As we've seen, in FFNNs the output at any time t is a function of the current input and the weights. This can be easily expressed using the following equation:
Equation 28
In RNNs, our output at time t depends not only on the current input and the weights, but also on previous inputs. In this case the output at time t will be defined as:
Equation 29
In both the folded and unfolded models shown above the following notation is used:
In RNNs the state layer depends on the current inputs, their corresponding weights, the activation function, and also on the previous state:
Equation 31
The output vector is calculated exactly the same as in FFNNs. It can be a linear combination of the inputs to each output node with the corresponding weight matrix W_y, or a softmax function of the same linear combination:

\bar{y}_t = \bar{s}_t W_y

or

\bar{y}_t = \sigma(\bar{s}_t W_y)
Equation 32
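A minimal NumPy sketch of Equations 31 and 32, assuming \Phi = tanh and hypothetical sizes (4-dimensional inputs, a 3-dimensional state, 2-dimensional outputs):

    import numpy as np

    Wx = np.random.randn(4, 3)    # W_x: inputs -> state
    Ws = np.random.randn(3, 3)    # W_s: previous state -> state
    Wy = np.random.randn(3, 2)    # W_y: state -> output

    s = np.zeros(3)               # initial state
    inputs = [np.random.randn(4) for _ in range(5)]   # a 5-step sequence

    for x_t in inputs:
        s = np.tanh(x_t @ Wx + s @ Ws)   # Equation 31: s_t = Phi(x_t W_x + s_{t-1} W_s)
        y = s @ Wy                       # Equation 32: y_t = s_t W_y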
The next video will focus on the unfolded model as we try to further understand it.
14. RNN- Unfolded Model
15. Unfolded Model Quiz
Picture A
Picture B
Both A and B are correct
I don't have enough information to answer this question
If you are unfamiliar with the term sequence detection, the idea is to see if a specific pattern of inputs has entered the system. In our example the pattern will be the word UDACITY, entered one letter at a time (U, D, A, C, I, T, Y).
17. Backpropagation Through Time (part a)
We are now ready to understand how to train the RNN.
When we train RNNs we also use backpropagation, but with a conceptual change. The process
is similar to that in the FFNN, with the exception that we need to consider previous time steps,
as the system has memory. This process is called Backpropagation Through Time
(BPTT) and will be the topic of the next three videos.
In the following videos we will use the Loss Function for our error. The Loss Function is the
square of the difference between the desired and the calculated outputs. There are variations to
the Loss Function, for example, factoring it with a scalar. In the backpropagation example we
used a factoring scalar of 1/2 for calculation convenience.
As described previously, the two most commonly used are the Mean Squared Error
(MSE) (usually used in regression problems) and the cross entropy (usually used in
classification problems).
Play
02:07
03:07
Disable captions
Settings
Enter fullscreen
Play
Before diving into Backpropagation Through Time we need a few reminders.
Equation 33
As mentioned before, for the error calculations we will use the Loss Function, where
Equation 35
In BPTT we train the network at timestep t as well as take into account all of the previous
timesteps.
The easiest way to explain the idea is to simply jump into an example.
In this example we will focus on the BPTT process for time step t=3. You will see that in order to adjust all three weight matrices, W_x, W_s and W_y, we need to consider timestep 3 as well as timestep 2 and timestep 1.
As we are focusing on timestep t=3, the Loss function will be: E_3 = (\bar{d}_3 - \bar{y}_3)^2
To update each weight matrix, we need to find the partial derivatives of the Loss Function at
time 3, as a function of all of the weight matrices. We will modify each matrix using gradient
descent while considering the previous timesteps.
Gradient Considerations in the Folded Model
18. Backpropagation Through Time (part b)
We will now unfold the model. You will see that unfolding the model in time is very
helpful in visualizing the number of steps (translated into multiplication) needed in
the Backpropagation Through Time process. These multiplications stem from the
chain rule and are easily visualized using this model.
In this video we will understand how to use Backpropagation Through Time (BPTT)
when adjusting two weight matrices:
The unfolded model can be very helpful in visualizing the BPTT process.
The Unfolded Model at timestep 3
Generally speaking, we can consider multiple timesteps back, and not only 3 as in this example. For an arbitrary timestep N, the gradient calculation needed for adjusting W_y is:
Equation 37
When calculating the partial derivative of the Loss Function with respect to W_s, we need to consider all of the states contributing to the output. In the case of this example it will be state \bar{s}_3, which depends on its predecessor \bar{s}_2, which depends on its predecessor \bar{s}_1, the first state.
In BPTT we will take into account every gradient stemming from each
state, accumulating all of these contributions.
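As a runnable scalar sketch of this accumulation (every weight is a single number here, \Phi = tanh, and all values are hypothetical), the contributions from s_3, s_2 and s_1 fold into one recursion:

    import numpy as np

    # Scalar RNN: s_t = tanh(x_t * wx + s_{t-1} * ws), y_t = s_t * wy.
    wx, ws, wy = 0.5, 0.3, 0.8          # hypothetical weights
    xs = [1.0, -0.5, 0.25]              # inputs x_1, x_2, x_3
    d3 = 0.7                            # desired output at t = 3

    # Forward pass, storing every state (s[0] is the initial state).
    s = [0.0]
    for x in xs:
        s.append(np.tanh(x * wx + s[-1] * ws))
    y3 = s[3] * wy
    E3 = (d3 - y3) ** 2

    # BPTT accumulation: ds_t/dws = phi'_t * (s_{t-1} + ws * ds_{t-1}/dws),
    # which sums the gradient contributions stemming from each state.
    ds_dws = 0.0
    for t in range(1, 4):
        phi_prime = 1.0 - s[t] ** 2     # tanh'(z) = 1 - tanh(z)^2
        ds_dws = phi_prime * (s[t - 1] + ws * ds_dws)

    grad_ws = -2.0 * (d3 - y3) * wy * ds_dws   # dE_3/dW_s
    ws = ws - 0.1 * grad_ws                    # gradient-descent update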
Gradient calculations needed to adjust W_x
To further understand the BPTT process, we will simplify the unfolded model again. This time the focus will be on the contributions of W_x to the output, in the following way:
Simplified Unfolded model for Adjusting Wx
When calculating the partial derivative of the Loss Function with respect to W_x, we need to consider, again, all of the states contributing to the output. As we saw before, in the case of this example it will be state \bar{s}_3, which depends on its predecessor \bar{s}_2, which depends on its predecessor \bar{s}_1, the first state.
As we mentioned previously, in BPTT we will take into account each gradient
stemming from each state, accumulating all of the contributions.
Equation 43
Equation 45
20. BPTT Quiz 1
Equation A
Equation B
Equation C
Equation D
SOLUTION:Equation D
Solution
\bar{z} and \bar{s} are vectors, as we indicated that they have multiple neurons in each layer. Using this logic we can understand that equations A and C are incorrect.
Since w_2 connects the hidden state \bar{z} to itself, we know that we need to consider the previous timestep here. Therefore only equation D is the correct one.
21. BPTT Quiz 2
Equation A
Equation B
Equation C
Equation D
SOLUTION:Equation B
Solution
Equation B is the only equation with the correct derivation of the chain rule and the proper use of the learning rate.
22. BPTT Quiz 3
Equation A
Equation B
Equation C
SOLUTION:Equation C
23. Some more math
This section is given as bonus material and is not mandatory. If you are curious how we
derived the final accumulative equation for BPTT, this section will help you out.
As a reminder, the following two equations were derived when adjusting the weights of matrix W_s and matrix W_x:
To generalize the case, we will avoid proving equation 48 or 49, and will focus on a general
framework.
Let's look at the following sketch, presenting a portion of a network:
RNN Summary
As you have seen, in RNNs the current state depends on the input as well as the previous
states, with the use of an activation function.
Equation 56
The current output is a simple linear combination of the current state elements with the
corresponding weight matrix.
We can represent the recurrent network with the use of a folded model or an unfolded model:
The RNN Folded Model
In the case of a single hidden (state) layer, we will have three weight matrices to consider.
Here we use the following notations:
W_x - represents the weight matrix connecting the inputs to the state layer.
W_y - represents the weight matrix connecting the state to the output.
W_s - represents the weight matrix connecting the state from the previous timestep to the state in the following timestep.
The gradient calculations for the purpose of adjusting the weight matrices are the following:
Equation 58
Equation 59
Equation 60
When training RNNs using BPTT, we can choose to use mini-batches, where we update the weights in batches periodically (as opposed to once every input sample). We calculate the gradient for each step but do not update the weights right away. Instead, we update the weights once every fixed number of steps. This helps reduce the complexity of the training process and helps remove noise from the weight updates.
The following is the equation used for Mini-Batch Training Using Gradient Descent (where \delta_{ij} represents the gradient calculated once every input sample and M represents the number of gradients we accumulate in the process):
Equation 61
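A minimal sketch of this idea; compute_gradient and samples are hypothetical placeholders for the per-sample gradient \delta_{ij} and the training data:

    import numpy as np

    M = 32                                # number of gradients to accumulate
    alpha = 0.01                          # learning rate
    W = np.zeros((4, 3))                  # hypothetical weight matrix

    accumulated = np.zeros_like(W)
    for i, sample in enumerate(samples, start=1):
        accumulated += compute_gradient(sample, W)   # one delta per sample
        if i % M == 0:                    # update once every M samples
            W -= alpha * accumulated / M  # apply the averaged gradient
            accumulated = np.zeros_like(W)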
If we backpropagate more than ~10 timesteps, the gradient will become too small. This phenomenon is known as the vanishing gradient problem, where the contribution of information decays geometrically over time. Therefore, temporal dependencies that span many time steps will effectively be discarded by the network. Long Short-Term Memory (LSTM) cells were designed to specifically solve this problem.
In RNNs we can also have the opposite problem, called the exploding gradient problem, in
which the value of the gradient grows uncontrollably. A simple solution for the exploding
gradient problem is Gradient Clipping.
You can concentrate on Algorithm 1, which describes the gradient clipping idea simply.
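A minimal sketch of gradient clipping by norm (the threshold value is a hypothetical choice):

    import numpy as np

    def clip_gradient(grad, threshold=5.0):
        # If the gradient's norm exceeds the threshold, rescale it so
        # its norm equals the threshold; otherwise leave it unchanged.
        norm = np.linalg.norm(grad)
        if norm > threshold:
            grad = grad * (threshold / norm)
        return grad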
25. From RNN to LSTM
Before we take a close look at the Long Short-Term Memory (LSTM) cell, let's take a look
at the following video:
Long Short-Term Memory (LSTM) cells give a solution to the vanishing gradient problem by helping us apply networks that have temporal dependencies. They were proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber.
If we take a close look at the RNN neuron, we can see that we have simple linear
combinations (with or without the use of an activation function). We can also see that we have
a single addition.
Zooming in on the neuron, we can graphically see this in the following configuration:
26. Wrap Up
Long Short-Term Memory Networks (LSTM)
Now that you've gone through the Recurrent Neural Network lesson, I'll be teaching you what an LSTM is. This stands for Long Short-Term Memory Networks, which are quite useful when our neural network needs to switch between remembering recent things and things from a long time ago. But first, I want to give you some great references to study this further. There are many posts out there about LSTMs; here are a few of my favorites:
If you find ways to improve it, make a pull request and we'll add it in.
10. Network Loss
Network Loss
11. Output and Loss Solutions
Output And Loss Solutions
12. Build the Network
Build The Network
13. Build the Network Solution
Build The Network And Results
Part 04-Module 01-Lesson 04_Hyperparameters
01. Introducing Jay
For this section, we're introducing a new Udacity instructor, Jay Alammar. Jay has done some great work in interactive explorations of neural networks; check out his blog.
Jay will be reviewing some of the material you saw in the Deep Neural Networks
section on hyperparameters, and he will also introduce the hyperparameters used in
Recurrent Neural Networks.
02. Introduction
Introduction
03. Learning Rate
Learning Rate
AdamOptimizer
AdagradOptimizer
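As a small illustration in the TF 1.x style used elsewhere in this course, here is how a learning rate is passed to these two optimizers; `loss` is a hypothetical loss tensor:

    import tensorflow as tf

    # Adam with its commonly used default learning rate.
    train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

    # Adagrad adapts the effective learning rate per parameter over time.
    # train_op = tf.train.AdagradOptimizer(learning_rate=0.01).minimize(loss)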
04. Learning Rate
Learning Rate Tuning #1
Say you're training a model. If the output from the training process looks as shown
below, what action would you take on the learning rate to improve the training?
Try again using the same learning rate
Increase the learning rate
Say you're training a model. If the output from the training process looks as shown
below, what action would you take on the learning rate to improve the training?
Use an adaptive learning rate
Increase the learning rate
SOLUTION:
ValidationMonitor (Deprecated)
In TensorFlow, we can use a ValidationMonitor with tf.contrib.learn to not only monitor the progress of training, but to also stop the training when certain conditions are met.
The following example from the ValidationMonitor documentation shows how to set
it up. Note that the last three parameters indicate which metric we're optimizing.
    validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
        test_set.data,
        test_set.target,
        every_n_steps=50,
        metrics=validation_metrics,
        early_stopping_metric="loss",
        early_stopping_metric_minimize=True,
        early_stopping_rounds=200)
The last parameter indicates to ValidationMonitor that it should stop the training
process if the loss did not decrease in 200 steps (rounds) of training.
The validation_monitor is then passed to tf.contrib.learn's "fit" method which runs the
training process:
    classifier = tf.contrib.learn.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=[10, 20, 10],
        n_classes=3,
        model_dir="/tmp/iris_model",
        config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1))
    classifier.fit(x=training_set.data,
                   y=training_set.target,
                   steps=2000,
                   monitors=[validation_monitor])
SessionRunHook
More recent versions of TensorFlow deprecated monitors in favor
of SessionRunHooks. SessionRunHooks are an evolving part of tf.train, and going
forward appear to be the proper place where you'd implement early stopping.
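As a hedged sketch of what such a hook might look like (the loss tensor and patience value are hypothetical, and this is not an official implementation):

    import tensorflow as tf

    class EarlyStoppingHook(tf.train.SessionRunHook):
        """Stop training when the loss has not improved for `patience` runs."""

        def __init__(self, loss_tensor, patience=200):
            self._loss_tensor = loss_tensor
            self._patience = patience
            self._best = float("inf")
            self._bad_runs = 0

        def before_run(self, run_context):
            # Also fetch the loss tensor on every session run.
            return tf.train.SessionRunArgs(self._loss_tensor)

        def after_run(self, run_context, run_values):
            loss = run_values.results
            if loss < self._best:
                self._best, self._bad_runs = loss, 0
            else:
                self._bad_runs += 1
            if self._bad_runs >= self._patience:
                run_context.request_stop()    # ends the training loop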
http://jalammar.github.io/
07. Number of Hidden Units / Layers
Number Of Hidden Units Layers
"in practice it is often the case that 3-layer neural networks will outperform 2-layer
nets, but going even deeper (4,5,6-layer) rarely helps much more. This is in stark
contrast to Convolutional Networks, where depth has been found to be an extremely
important component for a good recognition system (e.g. on order of 10 learnable
layers)." ~ Andrej Karpathy in https://cs231n.github.io/neural-networks-1/
More on Capacity
A more detailed discussion on a model's capacity appears in the Deep Learning book,
chapter 5.2 (pages 110-120).
08. RNN Hyperparameters
RNN Hyperparameters
LSTM Vs GRU
"These results clearly indicate the advantages of the gating units over the more
traditional recurrent
units. Convergence is often faster, and the final solutions tend to be better. However,
our results are
not conclusive in comparing the LSTM and the GRU, which suggests that the choice of
the type of
gated recurrent unit may depend heavily on the dataset and corresponding task."
"The GRU outperformed the LSTM on all tasks with the exception of language
modelling"
How do Long Short Term Memory (LSTM) cells and Gated Recurrent Unit (GRU) cells
compare?
LSTMs are superior to GRUs in every way
GRUs are superior to LSTMs in every way
It depends. It's probably worth it to compare the two on my task and dataset.
SOLUTION:It depends. It's probably worth it to compare the two on my task and dataset.
Which embedding size looks more reasonable for the majority of cases?
500
50,000
SOLUTION:500
10. Sources & References
If you want to learn more about hyperparameters, these are some great resources on
the topic:
Neural Networks and Deep Learning book - Chapter 3: How to choose a neural
network's hyper-parameters? by Michael Nielsen
How to Generate a Good Word Embedding? by Siwei Lai, Kang Liu, Liheng Xu, Jun
Zhao
Systematic evaluation of CNN advances on the ImageNet by Dmytro Mishkin, Nikolay
Sergievskiy, Jiri Matas
Visualizing and Understanding Recurrent Networks by Andrej Karpathy, Justin
Johnson, Li Fei-Fei
Word Embeddings
This week, we'll be covering embeddings. This is a deep neural network method for
representing data with a huge number of classes more efficiently. Embeddings greatly
improve the ability of networks to learn from data of this sort by representing the
data with lower dimensional vectors.
Word embeddings in particular are interesting because the networks are able to learn
semantic relationships between words. For example, the embeddings will know that
the male equivalent of a queen is a king.
These word embeddings are learned using a model called Word2vec. In this lesson,
you'll implement Word2vec yourself.
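As a minimal sketch of the core idea (the vocabulary and embedding sizes are hypothetical), an embedding is just a weight matrix whose rows are looked up by word id, shown here in the TF 1.x style used in this course:

    import tensorflow as tf

    vocab_size, embed_dim = 10000, 300
    embedding = tf.Variable(
        tf.random_uniform((vocab_size, embed_dim), -1.0, 1.0))

    word_ids = tf.placeholder(tf.int32, [None])              # batch of word ids
    embedded = tf.nn.embedding_lookup(embedding, word_ids)   # [batch, 300]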
We've built a notebook with exercises and also provided our solutions. You can find
the notebooks in our GitHub repo in the embeddings folder.
Next up, I'll walk you through implementing the Word2Vec model.
02. Implementing Word2Vec
Implementing Word2Vec
03. Subsampling Solution
04. Making Batches
05. Batches Solution
06. Building the Network
07. Negative Sampling
08. Building the Network Solution
09. Training Results
Part 04-Module 01-Lesson 06_Sentiment Prediction RNN
Sentiment Prediction RNN
01. Intro
02. Sentiment RNN
03. Data Preprocessing
04. Creating Testing Sets
05. Building the RNN
06. Training the Network
07. Solutions
01. Intro
I'm going to have you implement this RNN. You can find the notebooks in our public
GitHub repo. You can download the notebooks from the sentiment-rnn folder there,
or clone the repository:
The Data
The data, reviews.txt and labels.txt, is located in the sentiment_network directory. You can also find the labels here and the reviews here.
02. Sentiment RNN
03. Data Preprocessing
04. Creating Testing Sets
Note from Mat
Here I say split_frac is the fraction to keep in the "test" set. It's actually the fraction
to keep in the training set. My apologies, will fix this video.
05. Building the RNN
Building The RNN 1
01. Introduction
02. TV Script Workspace
Project Description - Generate TV Scripts
Project Rubric - Generate TV Scripts
01. Introduction
Project-3-Intro
This section contains a workspace (it can be a Jupyter Notebook workspace, an online code editor workspace, etc.) that cannot be automatically downloaded and reproduced here. Please access the classroom with your account and manually download the workspace to your local machine. Note that for some courses, Udacity uploads the workspace files onto https://github.com/udacity, so you may be able to download them there.
Workspace Information:
In this project, you'll generate your own Simpsons TV scripts using RNNs. You'll be using
part of the Simpsons dataset of scripts from 27 seasons. The Neural Network you'll build will
generate a new TV script for a scene at Moe's Tavern.
The project files can be found in our public GitHub repo, in the tv-script-generation folder. You can download the files from there, but it's better to clone the repository to your computer.
Submission
Advanced Projects
After completing this project, try applying what you learned to one of these problems.