Udacity Deep Learning Part 4: RNN

The document describes the feedforward process in a neural network. It explains that the feedforward process has two steps: 1) calculating the hidden state vector by multiplying the input vector by the first weight matrix, and 2) calculating the output vector by multiplying the hidden state vector by the second weight matrix. It provides the mathematical equations for these calculations and notes that an activation function is applied to the hidden state vector before obtaining the final hidden vector values.

Uploaded by

yousef shaban
Copyright
© All Rights Reserved

Recurrent Neural Networks


 01. Introducing Ortal


 02. RNN Introduction
 03. RNN History
 04. RNN Applications
 05. Feedforward Neural Network-Reminder
 06. The Feedforward Process
 07. Feedforward Quiz
 08. Backpropagation- Theory
 09. Backpropagation - Example (part a)
 10. Backpropagation- Example (part b)
 11. Backpropagation Quiz
 12. RNN (part a)
 13. RNN (part b)
 14. RNN- Unfolded Model
 15. Unfolded Model Quiz
 16. RNN- Example
 17. Backpropagation Through Time (part a)
 18. Backpropagation Through Time (part b)
 19. Backpropagation Through Time (part c)
 20. BPTT Quiz 1
 21. BPTT Quiz 2
 22. BPTT Quiz 3
 23. Some more math
 24. RNN Summary
 25. From RNN to LSTM
 26. Wrap Up
02. RNN Introduction
RNN Introduction
Hi! I am Ortal, your instructor for this lesson!

In this lesson we will learn about Recurrent Neural Networks (RNNs).

The neural network architectures you've seen so far were trained using the current inputs only.
We did not consider previous inputs when generating the current output. In other words, our
systems did not have any memory elements. RNNs address this very basic and important issue
by using memory (i.e. past inputs to the network) when producing the current output.
03. RNN History
A bit of history
How did the theory behind RNN evolve? Where were we a few years ago and where
are we now?

02 RNN History V4 Final

As mentioned in this video, RNNs have a key flaw, as capturing relationships that span
more than 8 or 10 steps back is practically impossible. This flaw stems from the
"vanishing gradient" problem in which the contribution of information decays
geometrically over time.

What does this mean?

As you may recall, while training our network we use backpropagation. In the
backpropagation process we adjust our weight matrices with the use of a gradient. In
the process, gradients are calculated by repeated multiplications of derivatives. The
value of these derivatives may be so small that these repeated multiplications may
cause the gradient to practically "vanish".
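
To make the "vanishing" effect concrete, here is a minimal sketch in plain Python. It assumes every derivative in the chain has the same value, 0.25 (the maximum slope of the sigmoid); that number and the function name `gradient_factor` are illustrative assumptions, not values taken from the lesson.

```python
def gradient_factor(derivative: float, steps: int) -> float:
    """Product of `steps` identical derivative terms, as in a long chain rule."""
    factor = 1.0
    for _ in range(steps):
        factor *= derivative
    return factor

# Assumed example: each step contributes a derivative of 0.25.
for steps in (1, 5, 10, 20):
    print(steps, gradient_factor(0.25, steps))
```

The printed factor shrinks geometrically with the number of steps, which is why contributions from roughly ten steps back barely influence the weight update.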

LSTM is one option to overcome the Vanishing Gradient problem in RNNs.

Please use these resources if you would like to read more about the Vanishing
Gradient problem or understand further the concept of a Geometric Series and how
its values may exponentially decrease.

If you are still curious, for more information on the important milestones mentioned
here, please take a peek at the following links:

 TDNN

 Here is the original Elman Network publication from 1990. This link is provided
here as it's a significant milestone in the world of RNNs. To simplify things a
bit, you can take a look at the following additional info.
 In this LSTM link you will find the original paper written by Sepp
Hochreiter and Jürgen Schmidhuber. Don't get into all the details just yet. We
will cover all of this later!

As mentioned in the video, Long Short-Term Memory Cells (LSTMs) and Gated
Recurrent Units (GRUs) give a solution to the vanishing gradient problem, by helping
us apply networks that have temporal dependencies. In this lesson we will focus on
RNNs and continue with LSTMs. We will not be focusing on GRUs.
More information about GRUs can be found in the following blog. Focus on the
overview titled: GRUs.
04. RNN Applications
Applications
The world's leading tech companies are all using RNNs, particularly LSTMs, in their
applications. Let's take a look at a few.

03 RNN Applications V3 Final

There are so many interesting applications, let's look at a few more!

 Are you into gaming and bots? Check out the DotA 2 bot by OpenAI

o How about automatically adding sounds to silent movies?

o Here is a cool tool for automatic handwriting generation

o Amazon's voice to text using high quality speech recognition, Amazon Lex.

o Facebook uses RNN and LSTM technologies for building language models

o Netflix also uses RNN models - here is an interesting read


05. Feedforward Neural Network-Reminder
Feedforward Neural Network - A Reminder

The mathematical calculations needed for training RNN systems are fascinating. To deeply
understand the process, we first need to feel confident with the vanilla FFNN system. We need
to thoroughly understand the feedforward process, as well as the backpropagation process used
in the training phases of such systems.
The next few videos will cover these topics, which you are already familiar with. We will
address the feedforward process as well as backpropagation, using specific examples. These
examples will serve as extra content to help further understand RNNs later in this lesson.

The following couple of videos will give you a brief overview of the Feedforward Neural
Network (FFNN).

04 RNN FFNN Reminder A V7 Final

OK, you can take a small break now. We will continue with FFNN when you come back!

05 RNN FFNN Reminder B V6 Final

As mentioned before, when working with neural networks we have 2 primary phases:

Training

and

Evaluation.

During the training phase, we take the data set (also called the training set), which includes
many pairs of inputs and their corresponding targets (outputs). Our goal is to find a set of
weights that would best map the inputs to the desired outputs.
In the evaluation phase, we use the network that was created in the training phase, apply our
new inputs and expect to obtain the desired outputs.

The training phase will include two steps:

Feedforward

and
Backpropagation

We will repeat these steps as many times as we need until we decide that our system has
reached the best set of weights, giving us the best possible outputs.

The next two videos will focus on the feedforward process.

You will notice that in these videos I use subscripts as well as superscript as a numeric
notation for the weight matrix.

For example:

 $W_k$ is weight matrix $k$
 $W_{ij}^k$ is the $ij$ element of weight matrix $k$
06. The Feedforward Process
Feedforward
In this section we will look closely at the math behind the feedforward process. With
the use of basic Linear Algebra tools, these calculations are pretty simple!

If you are not feeling confident with linear combinations and matrix multiplications,
you can use the following links as a refresher:

 Linear Combination
 Matrix Multiplication

Assuming that we have a single hidden layer, we will need two steps in our
calculations. The first will be calculating the value of the hidden states and the latter
will be calculating the value of the outputs.
Notice that both the hidden layer and the output layer are displayed as vectors, as
they are both represented by more than a single neuron.

Our first video will help you understand the first step- Calculating the value of the
hidden states.

06 FeedForward A V7 Final

As you saw in the video above, vector $\bar{h'}$ of the hidden layer will be calculated by
multiplying the input vector with the weight matrix $W^1$ the following way:
$\bar{h'} = (\bar{x} W^1)$
Using vector by matrix multiplication, we can look at this computation the following
way:

Equation 1

After finding $\bar{h'}$, we need an activation function ($\Phi$) to finalize the
computation of the hidden layer's values. This activation function can be a Hyperbolic
Tangent, a Sigmoid or a ReLU function. We can use the following two equations to
express the final hidden vector $\bar{h}$:
$\bar{h} = \Phi(\bar{x} W^1)$
or

$\bar{h} = \Phi(\bar{h'})$
Since $W_{ij}$ represents the weight component in the weight matrix, connecting neuron i from the
input to neuron j in the hidden layer, we can also write these calculations in the
following way:
(notice that in this example we have n inputs and only 3 hidden neurons)

Equation 2
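
As a minimal sketch of this first feedforward step, the element-wise form (one linear combination per hidden neuron, followed by $\Phi$) can be written in plain Python. The function name `hidden_layer`, the choice of tanh as $\Phi$, and all numeric values are assumptions made for illustration only.

```python
import math

def hidden_layer(x, W1, phi=math.tanh):
    """Return h = phi(x W1) for a row vector x (length n) and an n-by-m matrix W1."""
    n, m = len(W1), len(W1[0])
    # h'_j = sum_i x_i * W1[i][j]: one linear combination per hidden neuron j
    h_prime = [sum(x[i] * W1[i][j] for i in range(n)) for j in range(m)]
    # The activation function is applied to each element of h'
    return [phi(v) for v in h_prime]

# Assumed example values: n = 2 inputs, 3 hidden neurons.
x = [1.0, 2.0]
W1 = [[0.1, 0.2, 0.3],
      [0.4, 0.5, 0.6]]
h = hidden_layer(x, W1)
print(h)
```

Here `W1[i][j]` plays the role of $W_{ij}$, the weight connecting input neuron i to hidden neuron j.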

More information on the activation functions and how to use them can be found here

This next video will help you understand the second step- Calculating the values of
the Outputs.

07 FeedForward B V3

As you've seen in the video above, the process of calculating the output vector is
mathematically similar to that of calculating the vector of the hidden layer. We use,
again, a vector by matrix multiplication, which can be followed by an activation
function. The vector is the newly calculated hidden layer and the matrix is the one
connecting the hidden layer to the output.

Essentially, each new layer in a neural network is calculated by a vector by matrix
multiplication, where the vector represents the inputs to the new layer and the matrix
is the one connecting these new inputs to the next layer.

In our example, the input vector is $\bar{h}$ and the matrix is $W^2$,
therefore $\bar{y}=\bar{h}W^2$. In some applications it can be beneficial to
use a softmax function (if we want all output values to be between zero and 1, and
their sum to be 1).

Equation 3
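
The second step can be sketched the same way. The helper names `vec_mat` and `softmax` and the numeric values below are illustrative assumptions; the softmax is the standard exponential normalization, written with the usual max-subtraction for numerical stability.

```python
import math

def vec_mat(v, W):
    """Row vector v (length n) times an n-by-m matrix W."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def softmax(z):
    """Map z to positive values that sum to 1."""
    m = max(z)  # subtracting the max does not change the result
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed example: 3 hidden neurons, 2 outputs.
h = [0.5, -0.2, 0.1]
W2 = [[0.3, -0.1],
      [0.2, 0.4],
      [-0.5, 0.6]]
y = vec_mat(h, W2)   # y = h W2
p = softmax(y)       # optional: values in (0, 1) that sum to 1
print(y, p)
```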

The two error functions that are most commonly used are the Mean Squared Error
(MSE) (usually used in regression problems) and the cross entropy (usually used in
classification problems).

In the above calculations we used a variation of the MSE.

The next few videos will focus on the backpropagation process, or what we also call
stochastic gradient descent with the use of the chain rule.
07. Feedforward Quiz
The following picture is of a feedforward network with

 A single input x
 Two hidden layers
 A single output

The first hidden layer has M neurons.

The second hidden layer has N neurons.


What is the total number of multiplication operations needed for a single feedforward pass?

 
MN
M+N
M+N+2MN
M+N+NM
There isn't enough information to solve this question

SOLUTION: M+N+NM
Solution
To calculate the number of multiplications needed for a single feedforward pass, we can break
down the network into three steps:

 Step 1: From the single input to the first hidden layer
 Step 2: From the first hidden layer to the second hidden layer
 Step 3: From the second hidden layer to the single output

Step 1

The single input is multiplied by a vector with M values. Each value in the vector will
represent a weight connecting the input to the first hidden layer. Therefore, we will
have M multiplication operations.

Step 2

Each value in the first hidden layer (M in total) will be multiplied by a vector with N values.
Each value in the vector will represent a weight connecting the neurons in the first hidden
layer to the neurons in the second hidden layer. Therefore, we will have here M times N
calculations, or simply MN multiplication operations.

Step 3

Each value in the second hidden layer (N in total) will be multiplied once, by the weight
element connecting it to the single output. Therefore, we will have N multiplication
operations.

In total, we will add the number of operations we calculated in each step: M+MN+N.
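
The count can be checked with a short sketch. It assumes a fully connected network and counts only the weight multiplications (activation functions are ignored); the function name `count_multiplications` and the sizes M = 4, N = 3 are illustrative.

```python
def count_multiplications(layer_sizes):
    """Multiplications in one feedforward pass of a fully connected network.

    layer_sizes lists the number of neurons per layer, input layer first.
    Each pair of adjacent layers with a and b neurons needs a * b multiplications.
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Assumed example sizes for the two hidden layers.
M, N = 4, 3
total = count_multiplications([1, M, N, 1])
print(total, M + M * N + N)
```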


08. Backpropagation- Theory
Backpropagation Theory
Since partial derivatives are the key mathematical concept used in backpropagation, it's
important that you feel confident in your ability to calculate them. Once you know how to
calculate basic derivatives, calculating partial derivatives is easy to understand.
For more information on partial derivatives use the following link

For calculation purposes in future quizzes of the lesson, you can use the following link as a
reference for common derivatives.

In the backpropagation process we minimize the network error slightly with each iteration,
by adjusting the weights. The following video will help you understand the mathematical
process we use for computing these adjustments.

08 Backpropagation Theory V6 Final

If we look at an arbitrary layer k, we can define the amount by which we change the weights
from neuron i to neuron j stemming from layer k as: $\Delta W^k_{ij}$.
The superscript (k) indicates that the weight connects layer k to layer k+1.

Therefore, the weight update rule for that neuron can be expressed as:

$W_{new} = W_{previous} + \Delta W^k_{ij}$

Equation 4

The update value $\Delta W_{ij}^k$ is calculated through the use of the gradient
calculation, in the following way:
$\Delta W_{ij}^k=\alpha \left(-\frac{\partial E}{\partial W}\right)$,
where $\alpha$ is a small positive number called the Learning Rate.
Equation 5

From this derivation we can see that the weight updates are calculated by the
following equation:

$W_{new}= W_{previous} +\alpha \left(-\frac{\partial E}{\partial W}\right)$
Equation 6

Since many weights determine the network’s output, we can use a vector of the partial
derivatives (defined by the Greek letter Nabla $\nabla$) of the network error - each with
respect to a different weight.
$W_{new}= W_{previous}+\alpha \nabla_W(-E)$
Equation 7
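
A minimal sketch of this update rule on a toy problem follows. The quadratic error, the `TARGET` values, the learning rate 0.1 and the 50 iterations are all assumptions chosen for illustration; they stand in for a real network error.

```python
# Hypothetical minimum of the toy error, used only for this illustration.
TARGET = [3.0, -1.0]

def error(w):
    """Toy error E(w) = 1/2 * sum (w_i - target_i)^2."""
    return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, TARGET))

def gradient(w):
    """Vector of partial derivatives dE/dw_i for the toy error."""
    return [wi - ti for wi, ti in zip(w, TARGET)]

def gradient_step(w, alpha):
    """W_new = W_previous + alpha * (-dE/dW), applied to every weight."""
    return [wi - alpha * gi for wi, gi in zip(w, gradient(w))]

w = [0.0, 0.0]
errors = [error(w)]
for _ in range(50):
    w = gradient_step(w, alpha=0.1)
    errors.append(error(w))
print(w, errors[-1])
```

With this small learning rate the error shrinks a little on every iteration, and the weights approach the values that minimize the toy error.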

Here you can find other good resources for understanding and tuning the Learning Rate:

 resource 1
 resource 2

The following video is given as a refresher on overfitting. You have already seen this concept
in the Training Neural Networks lesson. Feel free to skip it and jump right into the next video.

13 Overfitting Intro V4 Final

09. Backpropagation - Example (part a)
Backpropagation- Example (part a)
We will now continue with an example focusing on the backpropagation process, and consider
a network having two inputs $[x_1, x_2]$, three neurons in a single hidden layer
$[h_1, h_2, h_3]$ and a single output $y$.

The weight matrices to update are $W^1$ from the input to the hidden layer,
and $W^2$ from the hidden layer to the output. Notice that in our case $W^2$ is a vector,
not a matrix, as we only have one output.
10 Backpropagation Example A V3 Final

The chain of thought in the weight updating process is as follows:

To update the weights, we need the network error. To find the network error, we need the
network output, and to find the network output we need the value of the hidden layer,
vector $\bar{h}$.
$\bar{h}=[h_1, h_2, h_3]$
Equation 8

Each element of vector $\bar{h}$ is calculated by a simple linear combination of the input
vector with its corresponding weight matrix $W^1$, followed by an activation function.

Equation 9

We now need to find the network's output, $y$. $y$ is calculated in a similar way by using a
linear combination of the vector $\bar{h}$ with its corresponding elements of the weight
vector $W^2$.

Equation 10

After computing the output, we can finally find the network error.

As a reminder, the two Error functions most commonly used are the Mean Squared Error
(MSE) (usually used in regression problems) and the cross entropy (often used in classification
problems).

In this example, we use a variation of the MSE:

$E=\frac{(d-y)^2}{2}$,
where $d$ is the desired output and $y$ is the calculated one. Notice that $y$ and $d$ are not vectors
in this case, as we have a single output.
The error is their squared difference, $E=(d-y)^2$, and is also called the
network's Loss Function. We are dividing the error term by 2 to simplify notation, as will
become clear soon.
The aim of the backpropagation process is to minimize the error, which in our case is the Loss
Function. To do that we need to calculate its partial derivative with respect to all of the
weights.

Since we just found the output $y$, we can now minimize the error by finding the update
values $\Delta W_{ij}^k$.
The superscript k indicates that we need to update each and every layer k.
As we noted before, the weight update value $\Delta W_{ij}^k$ is calculated with the
use of the gradient the following way:
$\Delta W_{ij}^k=\alpha \left(-\frac{\partial E}{\partial W}\right)$
Therefore:

$\Delta W_{ij}^k=\alpha \left(-\frac{\partial E}{\partial W}\right)=-\frac{\alpha}{2}\frac{\partial (d-y)^2}{\partial W_{ij}}=-2\frac{\alpha}{2}(d-y)\frac{\partial (d-y)}{\partial W_{ij}}$
which can be simplified as:

$\Delta W_{ij}^k=\alpha(d-y) \frac{\partial y}{\partial W_{ij}}$
Equation 11

(Notice that $d$ is a constant value, so its partial derivative is simply zero)


This partial derivative of the output with respect to each weight defines the gradient and is
often denoted by the Greek letter $\delta$.

Equation 12

We will find all the elements of the gradient using the chain rule.

If you are feeling confident with the chain rule and understand how to apply it, skip the next
video and continue with our example. Otherwise, give Luis a few minutes of your time as he
takes you through the process!

Chain Rule
10. Backpropagation- Example (part b)
Backpropagation- Example (part b)
Now that we understand the chain rule, we can continue with our backpropagation example,
where we will calculate the gradient

12 Backpropagation Example B V6 Final

In our example we only have one hidden layer, so our backpropagation process will consist of
two steps:

Step 1: Calculating the gradient with respect to the weight vector $W^2$ (from the output to
the hidden layer).
Step 2: Calculating the gradient with respect to the weight matrix $W^1$ (from the hidden
layer to the input).
Step 1
(Note that the weight vector referenced here will be $W^2$. All indices referring
to $W^2$ have been omitted from the calculations to keep the notation simple).
Equation 13

As you may recall:

$\Delta W_{ij}=\alpha(d-y) \frac{\partial y}{\partial W_{ij}}$
In this specific step, since the output is of only a single value, we can rewrite the equation the
following way (in which we have a weights vector):

$\Delta W_i=\alpha(d-y) \frac{\partial y}{\partial W_i}$
Since we already calculated the gradient, we now know that the incremental value we need for
step one is:

$\Delta W_i=\alpha(d-y) h_i$

Equation 14

Having calculated the incremental value, we can update vector $W^2$ the following way:
Equation 15
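
Step 1 can be sketched in a few lines of plain Python, assuming a linear output (no activation on $y$), as in the derivation above. The function name `step1_update` and the numeric values for the hidden vector, weights, desired output and learning rate are assumptions for illustration.

```python
def step1_update(W2, h, d, alpha):
    """Apply Delta W_i = alpha * (d - y) * h_i to every element of W2."""
    y = sum(hi * wi for hi, wi in zip(h, W2))      # y = h . W2 (linear output)
    delta = [alpha * (d - y) * hi for hi in h]     # incremental values
    W2_new = [wi + dwi for wi, dwi in zip(W2, delta)]
    return W2_new, y

# Assumed example values for a 3-neuron hidden layer and a single output.
h = [0.5, -0.3, 0.8]
W2 = [0.1, 0.2, -0.4]
d = 1.0
W2_new, y_before = step1_update(W2, h, d, alpha=0.1)
y_after = sum(hi * wi for hi, wi in zip(h, W2_new))
print(y_before, y_after)
```

After the update the output moves toward the desired value $d$, which is the intended effect of a gradient descent step.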

Step 2
(In this step, we will need to use both weight matrices. Therefore we will not be omitting the
weight indices.)

In our second step we will update the weights of matrix $W^1$ by calculating the partial
derivative of $y$ with respect to the weight matrix $W^1$.
The chain rule will be used the following way:

Obtain the partial derivative of $y$ with respect to $\bar{h}$, and multiply it by the partial
derivative of $\bar{h}$ with respect to the corresponding elements in $W^1$. Instead of
referring to vector $\bar{h}$, we can observe each element and present the equation the
following way:
Equation 16

In this example we have only 3 neurons in the single hidden layer, therefore this will be a
linear combination of three elements:

Equation 17

We will calculate each derivative separately. $\frac{\partial y}{\partial h_j}$ will be
calculated first, followed by $\frac{\partial h_j}{\partial W^1_{ij}}$.

Equation 18

Notice that most of the derivatives were zero, leaving us with the simple solution
of $\frac{\partial y}{\partial h_j}=W^2_j$.
To calculate $\frac{\partial h_j}{\partial W^1_{ij}}$ we need to remember first
that
Equation 19

Therefore:

Equation 20

Since the function $h_j$ is an activation function ($\Phi$) of a linear combination, its
partial derivative will be calculated the following way:
Equation 21

Given that there are various activation functions, we will leave the partial derivative
of $\Phi$ using a general notation. Each neuron j will have its own value
for $\Phi$ and $\Phi'$, according to the activation function we choose to use.

Equation 22

The second factor of equation 21 can be calculated the following way:

(Notice how simple the result is, as most of the components of this partial derivative are zero).
Equation 23

After understanding how to treat each multiplication of equation 21 separately, we can now
summarize it the following way:

Equation 24

We are ready to finalize step 2, in which we update the weights of matrix $W^1$ by
calculating the gradient shown in equation 17. From the above calculations, we can conclude
that:

Equation 25

Since $\Delta W^1_{ij}=\alpha(d-y) \frac{\partial y}{\partial W^1_{ij}}$, when finalizing step 2, we have:
Equation 26

Having calculated the incremental value, we can update matrix $W^1$ the following way:
$W^1_{new}=W^1_{previous}+\Delta W^1_{ij}$
$W^1_{new}=W^1_{previous}+\alpha(d-y)W^2_j\Phi'_jx_i$
Equation 27
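
The result $\frac{\partial y}{\partial W^1_{ij}}=W^2_j\Phi'_jx_i$ can be checked numerically. The sketch below assumes tanh as the activation function (so $\Phi' = 1-\tanh^2$), a linear output, and made-up numeric values; the function names are hypothetical. It compares the analytic expression with a central finite difference and then applies the update for $W^1$.

```python
import math

def forward(x, W1, W2):
    """Two inputs -> three tanh hidden neurons -> one linear output."""
    z = [sum(x[i] * W1[i][j] for i in range(len(x))) for j in range(len(W2))]
    h = [math.tanh(v) for v in z]
    y = sum(h[j] * W2[j] for j in range(len(W2)))
    return z, h, y

def dy_dW1(x, W1, W2):
    """Analytic gradient: dy/dW1[i][j] = W2[j] * Phi'_j * x[i], with Phi = tanh."""
    _, h, _ = forward(x, W1, W2)
    return [[W2[j] * (1.0 - h[j] ** 2) * x[i] for j in range(len(W2))]
            for i in range(len(x))]

def numeric_dy_dW1(x, W1, W2, i, j, eps=1e-6):
    """Central finite difference for a single weight W1[i][j]."""
    Wp = [row[:] for row in W1]
    Wm = [row[:] for row in W1]
    Wp[i][j] += eps
    Wm[i][j] -= eps
    return (forward(x, Wp, W2)[2] - forward(x, Wm, W2)[2]) / (2 * eps)

# Assumed example values.
x = [0.5, -1.0]
W1 = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
W2 = [0.7, -0.8, 0.9]
d, alpha = 1.0, 0.1

analytic = dy_dW1(x, W1, W2)
_, _, y = forward(x, W1, W2)
# W1_new = W1_previous + alpha * (d - y) * W2_j * Phi'_j * x_i
W1_new = [[W1[i][j] + alpha * (d - y) * analytic[i][j] for j in range(3)]
          for i in range(2)]
y_new = forward(x, W1_new, W2)[2]
print(y, y_new)
```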

After updating the weight matrices we begin once again with the Feedforward pass, starting
the process of updating the weights all over again.

This video touches on the subject of Mini Batch Training. We will further explain things in
our Hyperparameters lesson coming up.
11. Backpropagation Quiz
The following picture is of a feedforward network with

 A single input x
 Two hidden layers with two neurons in each layer
 A single output
What is the update rule of weight matrix W1?

(In other words, what is the partial derivative of y with respect to W1?)

Hint: Use the chain rule

 
Equation A

 
Equation B

 
Equation C

 
Equation D

SOLUTION: Equation A
12. RNN (part a)
We are finally ready to talk about Recurrent Neural Networks (or RNNs), where we will be
opening the doors to new content!

14 RNN A V4 Final

RNNs are based on the same principles as those behind FFNNs, which is why we spent so
much time reminding ourselves of the feedforward and backpropagation steps that are used in
the training phase.

There are two main differences between FFNNs and RNNs. The Recurrent Neural Network
uses:

 sequences as inputs in the training phase, and


 memory elements

Memory is defined as the output of hidden layer neurons, which will serve as additional input
to the network during the next training step.

The basic three layer neural network with feedback that serves as memory inputs is called
the Elman Network and is depicted in the following picture:
Diff between RNN and feedforward NN
13. RNN (part b)
16 RNN B V4 Final

As we've seen, in FFNNs the output at any time t is a function of the current input and
the weights. This can be easily expressed using the following equation:

Equation 28

In RNNs, our output at time t depends not only on the current input and the weights,
but also on previous inputs. In this case the output at time t will be defined as:

Equation 29

This is the RNN folded model:


The RNN folded model

In this picture, $\bar{x}$ represents the input vector, $\bar{y}$ represents the
output vector and $\bar{s}$ denotes the state vector.
$W_x$ is the weight matrix connecting the inputs to the state layer.
$W_y$ is the weight matrix connecting the state layer to the output layer.
$W_s$ represents the weight matrix connecting the state from the previous timestep
to the state in the current timestep.
The model can also be "unfolded in time". The unfolded model is usually what we
use when working with RNNs.

The RNN unfolded model

In both the folded and unfolded models shown above the following notation is used:

$\bar{x}$ represents the input vector, $\bar{y}$ represents the output vector
and $\bar{s}$ represents the state vector.
$W_x$ is the weight matrix connecting the inputs to the state layer.
$W_y$ is the weight matrix connecting the state layer to the output layer.
$W_s$ represents the weight matrix connecting the state from the previous timestep
to the state in the current timestep.
In FFNNs the hidden layer depends only on the current inputs and weights, as well
as on an activation function $\Phi$ in the following way:
$\bar{h}=\Phi(\bar{x}W)$.
Equation 30

In RNNs the state layer depends on the current inputs, their corresponding weights,
the activation function and also on the previous state:

Equation 31

The output vector is calculated exactly the same as in FFNNs. It can be a linear
combination of the inputs to each output node with the corresponding weight
matrix $W_y$, or a softmax function of the same linear combination.
$\bar{y}_t=\bar{s}_t W_y$
or

$\bar{y}_t=\sigma(\bar{s}_t W_y)$
Equation 32
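
A minimal forward-pass sketch of these relations follows. It assumes the standard Elman form $\bar{s}_t=\Phi(\bar{x}_t W_x+\bar{s}_{t-1} W_s)$ for the state (consistent with the description above, although the equation image itself is not reproduced here), tanh as $\Phi$, a linear output $\bar{y}_t=\bar{s}_t W_y$, and made-up weights. The names `vec_mat` and `rnn_forward` are hypothetical.

```python
import math

def vec_mat(v, W):
    """Row vector v (length n) times an n-by-m matrix W."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def rnn_forward(xs, Wx, Ws, Wy, s0):
    """Run the unfolded model over a sequence xs; return all states and outputs."""
    s = s0
    states, outputs = [], []
    for x in xs:
        # Assumed state update: s_t = tanh(x_t Wx + s_{t-1} Ws)
        pre = [a + b for a, b in zip(vec_mat(x, Wx), vec_mat(s, Ws))]
        s = [math.tanh(v) for v in pre]
        states.append(s)
        outputs.append(vec_mat(s, Wy))   # y_t = s_t Wy (linear output)
    return states, outputs

# Assumed example: 1 input, 2 state neurons, 1 output.
Wx = [[0.5, -0.3]]
Ws = [[0.2, 0.1], [-0.4, 0.3]]
Wy = [[1.0], [-1.0]]
s0 = [0.0, 0.0]

# Two sequences that end with the same input but differ earlier.
_, ya = rnn_forward([[1.0], [0.0], [1.0]], Wx, Ws, Wy, s0)
_, yb = rnn_forward([[0.0], [0.0], [1.0]], Wx, Ws, Wy, s0)
print(ya[-1], yb[-1])
```

The two final outputs differ even though the last input is identical, which illustrates how the state carries information about previous inputs.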

The next video will focus on the unfolded model as we try to further understand it.
14. RNN- Unfolded Model
The next video will focus on the unfolded model as we try to further understand it.

17 RNN Unfolded V3 Final

15. Unfolded Model Quiz

A folded model of an RNN


Look at the above picture of a folded model of an RNN. Which of the pictures below represents
the unfolded model of the same network?

 
Picture A

 
Picture B

 
Both A and B are correct

 
I don't have enough information to answer this question

SOLUTION: Both A and B are correct


Picture A
Picture B
16. RNN- Example
In this example we will illustrate how RNNs can be helpful in detecting sequences. When
detecting a sequence, the system has to remember what the previous inputs were, so it makes
sense to use a recurrent network.

If you are unfamiliar with the term sequence detection, the idea is to see if a specific pattern of
inputs has entered the system. In our example the pattern will be the word U,D,A,C,I,T,Y.

18 RNN Example V5 Final

17. Backpropagation Through Time (part a)
We are now ready to understand how to train the RNN.

When we train RNNs we also use backpropagation, but with a conceptual change. The process
is similar to that in the FFNN, with the exception that we need to consider previous time steps,
as the system has memory. This process is called Backpropagation Through Time
(BPTT) and will be the topic of the next three videos.

 As always, don't forget to take notes.

In the following videos we will use the Loss Function for our error. The Loss Function is the
square of the difference between the desired and the calculated outputs. There are variations to
the Loss Function, for example, factoring it with a scalar. In the backpropagation example we
used a factoring scalar of 1/2 for calculation convenience.

As described previously, the two most commonly used error functions are the Mean Squared Error
(MSE) (usually used in regression problems) and the cross entropy (usually used in
classification problems).

Here, we are using a variation of the MSE.

19 RNN BPTT A V6 Final

Before diving into Backpropagation Through Time we need a few reminders.

The state vector $\bar{s}_t$ is calculated the following way:

Equation 33

The output vector $\bar{y}_t$ can be the product of the state vector $\bar{s}_t$ and the
corresponding weight elements of matrix $W_y$. As mentioned before, if the desired
outputs are between 0 and 1, we can also use a softmax function. The following set of
equations depicts these calculations:
Equation 34

As mentioned before, for the error calculations we will use the Loss Function, where

$E_t$ represents the output error at time t
$d_t$ represents the desired output at time t
$y_t$ represents the calculated output at time t

Equation 35

In BPTT we train the network at timestep t as well as take into account all of the previous
timesteps.
The easiest way to explain the idea is to simply jump into an example.

In this example we will focus on the BPTT process for time step t=3. You will see that in
order to adjust all three weight matrices, $W_x$, $W_s$ and $W_y$, we need to
consider timestep 3 as well as timestep 2 and timestep 1.
As we are focusing on timestep t=3, the Loss function will
be: $E_3=(\bar{d}_3-\bar{y}_3)^2$

The Folded Model at Timestep 3

To update each weight matrix, we need to find the partial derivatives of the Loss Function at
time 3 with respect to all of the weight matrices. We will modify each matrix using gradient
descent while considering the previous timesteps.
Gradient Considerations in the Folded Model
18. Backpropagation Through Time (part b)
We will now unfold the model. You will see that unfolding the model in time is very
helpful in visualizing the number of steps (translated into multiplication) needed in
the Backpropagation Through Time process. These multiplications stem from the
chain rule and are easily visualized using this model.

In this video we will understand how to use Backpropagation Through Time (BPTT)
when adjusting two weight matrices:

 $W_y$ - the weight matrix connecting the state to the output
 $W_s$ - the weight matrix connecting one state to the next state

20 RNN BPTT B V5 Final

The unfolded model can be very helpful in visualizing the BPTT process.
The Unfolded Model at timestep 3

Gradient calculations needed to adjust $W_y$

The partial derivative of the Loss Function with respect to $W_y$ is found by a
simple one step chain rule:
(Note that in this case we do not need to use BPTT. Visualization of the calculations
path can be found in the video).
Equation 36

Generally speaking, we can consider multiple timesteps back, and not only 3 as in this
example. For an arbitrary timestep N, the gradient calculation needed for
adjusting $W_y$, is:

Equation 37
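
As a hedged reconstruction of the one-step chain rule described above (the original equation images are not reproduced in this text), and assuming the loss $E_N=(\bar{d}_N-\bar{y}_N)^2$ with the linear output $\bar{y}_N=\bar{s}_N W_y$ used earlier in this lesson, the gradient for $W_y$ at timestep 3 and at an arbitrary timestep N can be written as:

```latex
\frac{\partial E_3}{\partial W_y}
  = \frac{\partial E_3}{\partial \bar{y}_3}\,
    \frac{\partial \bar{y}_3}{\partial W_y},
\qquad
\frac{\partial E_N}{\partial W_y}
  = \frac{\partial E_N}{\partial \bar{y}_N}\,
    \frac{\partial \bar{y}_N}{\partial W_y}
```

Under the same assumptions, for a single scalar output each element works out to $-2(d_N-y_N)\,s_{N,j}$, where $j$ is an index over the state neurons introduced here only for illustration.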

Gradient calculations needed to adjust $W_s$

We still need to adjust $W_s$, the weight matrix connecting one state to the next,
and $W_x$, the weight matrix connecting the input to the state. We will arbitrarily
start with $W_s$.
To understand the BPTT process, we can simplify the unfolded model. We will focus
on the contributions of $W_s$ to the output, the following way:
Simplified Unfolded model for Adjusting Ws

When calculating the partial derivative of the loss function with respect to W_s,
we need to consider all of the states contributing to the output. In this
example these are state \bar{s}_3, which depends on its predecessor \bar{s}_2,
which in turn depends on its predecessor \bar{s}_1, the first state.
In BPTT we take into account every gradient stemming from each
state, accumulating all of these contributions.

 At timestep t=3, the contribution to the gradient stemming from \bar{s}_3 is the
following:
(Notice the use of the chain rule here. If you need to, go back to the video to visualize
the calculation path.)
Equation 38

 At timestep t=3, the contribution to the gradient stemming from \bar{s}_2 is the
following:
(Notice how the equation, derived by the chain rule, considers the contribution
of \bar{s}_2 to \bar{s}_3. If you need to, go back to the video to visualize the
calculation path.)
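
As a hedged sketch of where this derivation leads, the three chain-rule paths for W_s at timestep 3 and their accumulated sum can be written as below. The symbols E_3 (loss at timestep 3) and \bar{y}_3 (output at timestep 3) are assumed notation, and each \partial\bar{s}_i / \partial W_s denotes only the direct dependence of state i on W_s.

```latex
\begin{aligned}
\text{via } \bar{s}_3:\quad &
  \frac{\partial E_3}{\partial \bar{y}_3}\,
  \frac{\partial \bar{y}_3}{\partial \bar{s}_3}\,
  \frac{\partial \bar{s}_3}{\partial W_s} \\
\text{via } \bar{s}_2:\quad &
  \frac{\partial E_3}{\partial \bar{y}_3}\,
  \frac{\partial \bar{y}_3}{\partial \bar{s}_3}\,
  \frac{\partial \bar{s}_3}{\partial \bar{s}_2}\,
  \frac{\partial \bar{s}_2}{\partial W_s} \\
\text{via } \bar{s}_1:\quad &
  \frac{\partial E_3}{\partial \bar{y}_3}\,
  \frac{\partial \bar{y}_3}{\partial \bar{s}_3}\,
  \frac{\partial \bar{s}_3}{\partial \bar{s}_2}\,
  \frac{\partial \bar{s}_2}{\partial \bar{s}_1}\,
  \frac{\partial \bar{s}_1}{\partial W_s} \\[4pt]
\frac{\partial E_3}{\partial W_s} &=
  \sum_{i=1}^{3}
  \frac{\partial E_3}{\partial \bar{y}_3}\,
  \frac{\partial \bar{y}_3}{\partial \bar{s}_i}\,
  \frac{\partial \bar{s}_i}{\partial W_s}
\end{aligned}
```

In the sum, \partial\bar{y}_3 / \partial\bar{s}_i is shorthand for the product of state-to-state derivatives along the path from \bar{s}_i up to \bar{s}_3.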
19. Backpropagation Through Time (part c)
Last step! Adjusting W_x, the weight matrix connecting the input to the state.
If you took on the previous challenge of deriving the math by yourself first, sit back,
fasten your seat belts and compare our notes to yours! Don't worry if you made
mistakes, we all do. Your mistakes will help you learn what to avoid next time.

Gradient calculations needed to adjust W_x
To further understand the BPTT process, we will simplify the unfolded model again.
This time the focus will be on the contributions of W_x to the output, in the
following way:
Simplified Unfolded model for Adjusting Wx

When calculating the partial derivative of the loss function with respect
to W_x we need to consider, again, all of the states contributing to the output. As
we saw before, in this example these are state \bar{s}_3, which
depends on its predecessor \bar{s}_2, which in turn depends on its
predecessor \bar{s}_1, the first state.
As we mentioned previously, in BPTT we take into account each gradient
stemming from each state, accumulating all of the contributions.

 At timestep t=3, the contribution to the gradient stemming from \bar{s}_3 is the
following:
(Notice the use of the chain rule here. If you need to, go back to the video to visualize
the calculation path.)

Equation 43

 At timestep t=3, the contribution to the gradient stemming from \bar{s}_2 is the
following:
(Notice how the equation, derived by the chain rule, considers the contribution
of \bar{s}_2 to \bar{s}_3. If you need to, go back to the video to visualize the
calculation path.)
Equation 44

 At timestep t=3, the contribution to the gradient stemming from \bar{s}_1 is the
following:
(Notice how the equation, derived by the chain rule, considers the contribution
of \bar{s}_1 to \bar{s}_2 and \bar{s}_3. If you need to, go back to the
video to visualize the calculation path.)

Equation 45

After considering the contributions from all three states, \bar{s}_3, \bar{s}_2
and \bar{s}_1, we accumulate them to find the final gradient.
The following equation is the gradient contributing to the adjustment
of W_x using Backpropagation Through Time:
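
A sketch of that accumulated gradient, under assumed notation (E_N for the loss at timestep N, \bar{y}_N for the output at timestep N), is given below for the three-timestep example and for an arbitrary timestep N.

```latex
\frac{\partial E_3}{\partial W_x}
  = \sum_{i=1}^{3}
    \frac{\partial E_3}{\partial \bar{y}_3}\,
    \frac{\partial \bar{y}_3}{\partial \bar{s}_i}\,
    \frac{\partial \bar{s}_i}{\partial W_x}
\qquad\qquad
\frac{\partial E_N}{\partial W_x}
  = \sum_{i=1}^{N}
    \frac{\partial E_N}{\partial \bar{y}_N}\,
    \frac{\partial \bar{y}_N}{\partial \bar{s}_i}\,
    \frac{\partial \bar{s}_i}{\partial W_x}
```

Here \partial\bar{y}_N / \partial\bar{s}_i stands for the full chain through the intermediate states between \bar{s}_i and the output, and \partial\bar{s}_i / \partial W_x is the direct dependence of state i on W_x.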
20. BPTT Quiz 1

A folded RNN model


Consider the above folded RNN model. Both states S and Z have multiple neurons in each
layer.
The mathematical derivation of state Z at time t is:

 Equation A
 Equation B
 Equation C
 Equation D

SOLUTION: Equation D
Solution
\bar{z} and \bar{s} are vectors, since both states have multiple neurons in each
layer. Using this logic we can see that equations A and C are incorrect.
Since w_2 connects the hidden state \bar{z} to itself, we know that we need to
consider the previous timestep here. Therefore only equation D is correct.
21. BPTT Quiz 2

A folded RNN model


Let's look at the same folded model again (displayed above). Assume that the error is denoted by
the symbol E. What is the update rule of weight matrix V1 at time t, over a single timestep?

 Equation A
 Equation B
 Equation C
 Equation D

SOLUTION: Equation B
Solution
Equation B is the only equation with the correct derivation of the chain rule and the proper
use of the learning rate.
22. BPTT Quiz 3

A folded RNN model


Let's look at the same folded model again (displayed above). Assume that the error is denoted by
the symbol E. What is the update rule of weight matrix U at time t+1 (over 2 timesteps)?
Hint: Use the unfolded model for a better visualization.

 Equation A
 Equation B
 Equation C

SOLUTION: Equation C
23. Some more math
This section is given as bonus material and is not mandatory. If you are curious how we
derived the final accumulative equation for BPTT, this section will help you out.

In the previous videos, we talked about Backpropagation Through Time. We used a lot of
partial derivatives, accumulating the contributions to the change in the error from each state.
Remember?
When we needed a general scheme for BPTT, I simply displayed the equation without
giving you further explanation.

As a reminder, the following two equations were derived when adjusting the weights of
matrix W_s and matrix W_x:

Equation 48: BPTT calculations for the purpose of adjusting Ws

Equation 49: BPTT calculations for the purpose of adjusting Wx

To generalize the case, we will avoid proving equation 48 or 49, and will focus on a general
framework.
Let's look at the following sketch, presenting a portion of a network:

In the picture above, we have four states, starting with s_t.
We will initially consider the three weight matrices W_1, W_2 and W_3 as three
different matrices.
Using the chain rule we can derive the following three equations:
24. RNN Summary
Let's summarize what we have seen so far:

As you have seen, in RNNs the current state depends on the input as well as the previous
states, with the use of an activation function.

Equation 56

The current output is a simple linear combination of the current state elements with the
corresponding weight matrix:

\bar{y}_t = \bar{s}_t W_y (without the use of an activation function)

or

\bar{y}_t = \sigma(\bar{s}_t W_y) (with the use of an activation function)


Equation 57
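
To make the state and output equations concrete, here is a minimal pure-Python sketch of one RNN timestep applied over a short sequence. It assumes tanh as the state activation and the linear (no activation) form of the output; the weights, inputs, and the helper name `rnn_step` are hypothetical and chosen only for illustration.

```python
import math

def rnn_step(x, s_prev, W_x, W_s, W_y):
    """One timestep with row vectors: s_t = tanh(x W_x + s_prev W_s), y_t = s_t W_y."""
    n_state = len(s_prev)
    s = []
    for j in range(n_state):
        pre = sum(x[i] * W_x[i][j] for i in range(len(x)))          # input contribution
        pre += sum(s_prev[k] * W_s[k][j] for k in range(n_state))   # previous-state contribution
        s.append(math.tanh(pre))
    y = [sum(s[j] * W_y[j][o] for j in range(n_state)) for o in range(len(W_y[0]))]
    return s, y

# Hypothetical toy weights: 2 inputs, 2 state units, 1 output.
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_s = [[0.2, 0.0], [0.0, 0.2]]
W_y = [[1.0], [-1.0]]

state = [0.0, 0.0]   # initial state
outputs = []
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    # The same three matrices are reused at every timestep.
    state, y_t = rnn_step(x_t, state, W_x, W_s, W_y)
    outputs.append(y_t[0])
```

The loop is the unfolded model in code: the state returned at one timestep is fed back in at the next.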

We can represent the recurrent network with the use of a folded model or an unfolded model:
The RNN Folded Model

The RNN Unfolded Model

In the case of a single hidden (state) layer, we will have three weight matrices to consider.
Here we use the following notation:

W_x - the weight matrix connecting the inputs to the state layer.
W_y - the weight matrix connecting the state to the output.
W_s - the weight matrix connecting the state from the previous timestep to the
state in the following timestep.
The gradient calculations for the purpose of adjusting the weight matrices are the following:

Equation 58

Equation 59
Equation 60

In the equations for W_s and W_x we used Backpropagation Through Time (BPTT), where we
accumulate all of the contributions from previous timesteps.
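
The accumulation can be checked numerically on a tiny example. The sketch below uses a scalar RNN with a tanh state activation and a squared-error loss on the last output; both choices, and every name in the code, are assumptions made for illustration. It accumulates the per-state contributions for w_s and w_x and compares them with finite-difference estimates.

```python
import math

def forward(xs, w_x, w_s, w_y):
    """Scalar RNN: s_t = tanh(w_x*x_t + w_s*s_{t-1}), y = w_y*s_N, with s_0 = 0."""
    states = [0.0]
    for x in xs:
        states.append(math.tanh(w_x * x + w_s * states[-1]))
    return states, w_y * states[-1]

def loss(xs, d, w_x, w_s, w_y):
    """Squared error between the desired value d and the final output."""
    _, y = forward(xs, w_x, w_s, w_y)
    return 0.5 * (d - y) ** 2

def bptt_grads(xs, d, w_x, w_s, w_y):
    """Accumulate the contribution of every state to dE/dw_x and dE/dw_s."""
    states, y = forward(xs, w_x, w_s, w_y)
    dE_dy = y - d
    grad_wy = dE_dy * states[-1]          # one-step chain rule, no BPTT needed
    grad_wx = grad_ws = 0.0
    dE_ds = dE_dy * w_y                   # derivative of the loss w.r.t. the last state
    for t in range(len(xs), 0, -1):
        local = 1.0 - states[t] ** 2      # derivative of tanh at timestep t
        grad_ws += dE_ds * local * states[t - 1]
        grad_wx += dE_ds * local * xs[t - 1]
        dE_ds = dE_ds * local * w_s       # step back to the previous state
    return grad_wx, grad_ws, grad_wy

xs, d = [0.5, -1.0, 0.25], 0.3
w_x, w_s, w_y = 0.7, -0.4, 1.2
g_wx, g_ws, g_wy = bptt_grads(xs, d, w_x, w_s, w_y)

eps = 1e-6  # central finite differences as an independent check
num_wx = (loss(xs, d, w_x + eps, w_s, w_y) - loss(xs, d, w_x - eps, w_s, w_y)) / (2 * eps)
num_ws = (loss(xs, d, w_x, w_s + eps, w_y) - loss(xs, d, w_x, w_s - eps, w_y)) / (2 * eps)
num_wy = (loss(xs, d, w_x, w_s, w_y + eps) - loss(xs, d, w_x, w_s, w_y - eps)) / (2 * eps)
```

Each pass through the backward loop adds one state's contribution, which is the sum over timesteps described above.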

When training RNNs using BPTT, we can choose to use mini-batches, where we update the
weights in batches periodically (as opposed to after every input sample). We calculate the
gradient for each step but do not update the weights right away. Instead, we update the weights
once every fixed number of steps. This helps reduce the complexity of the training process and
helps remove noise from the weight updates.

The following is the equation used for mini-batch training using gradient descent
(where \delta_{ij} represents the gradient calculated for each input sample and M
represents the number of gradients we accumulate in the process):
Equation 61
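
As a rough illustration of this idea, the sketch below averages every M per-sample gradients and applies a single update per group. It assumes the mini-batch gradient is the plain average of the M accumulated gradients; the function name, gradient values, and learning rate are hypothetical.

```python
def minibatch_updates(weight, gradients, learning_rate, M):
    """Average every M per-sample gradients, then apply one update per group."""
    history = []
    accumulated = 0.0
    for count, g in enumerate(gradients, start=1):
        accumulated += g                      # gradient is computed at every step...
        if count % M == 0:                    # ...but the weight changes once every M steps
            weight -= learning_rate * accumulated / M
            history.append(weight)
            accumulated = 0.0
    return weight, history

# Six hypothetical per-sample gradients, grouped into two mini-batches of M = 3.
w, hist = minibatch_updates(1.0, [0.2, 0.4, -0.1, 0.3, 0.5, 0.1], learning_rate=0.5, M=3)
```

With six gradients and M = 3, the weight is updated twice rather than six times, and each update uses a smoothed gradient.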

If we backpropagate more than ~10 timesteps, the gradient tends to become too small. This
phenomenon is known as the vanishing gradient problem, where the contribution of
information decays geometrically over time. Therefore temporal dependencies that span many
time steps will effectively be discarded by the network. Long Short-Term Memory
(LSTM) cells were designed specifically to solve this problem.
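
The geometric decay can be seen with a back-of-the-envelope calculation. The sketch below assumes, for simplicity, that every step back in time multiplies the gradient by the same factor (a recurrent weight times an activation derivative); the specific numbers are hypothetical.

```python
def gradient_factor(w_s, local_derivative, steps):
    """Magnitude of the product of `steps` identical chain-rule factors |w_s * f'|."""
    return abs(w_s * local_derivative) ** steps

# A per-step factor below 1 shrinks the gradient geometrically with depth in time.
for steps in (1, 5, 10, 20):
    print(steps, gradient_factor(w_s=0.9, local_derivative=0.5, steps=steps))

# A per-step factor above 1 grows geometrically instead.
growing = gradient_factor(w_s=1.5, local_derivative=1.0, steps=10)
```

Real networks have a different factor at every step, but the same multiplicative structure drives both shrinking and growing gradients.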

In RNNs we can also have the opposite problem, called the exploding gradient problem, in
which the value of the gradient grows uncontrollably. A simple solution for the exploding
gradient problem is Gradient Clipping.

More information about Gradient Clipping can be found here.

You can concentrate on Algorithm 1, which describes the gradient clipping idea simply.
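
A minimal sketch of gradient clipping follows, assuming the common norm-rescaling form: if the gradient norm exceeds a threshold, the gradient is scaled down so that its norm equals the threshold. The function name and threshold value are hypothetical.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale the gradient vector if its L2 norm exceeds the threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]   # same direction, norm equal to threshold
    return list(grad)                      # small gradients pass through unchanged

clipped = clip_by_norm([3.0, 4.0], threshold=1.0)   # norm 5, so it is rescaled
small = clip_by_norm([0.3, 0.4], threshold=1.0)     # norm 0.5, left as is
```

Clipping keeps the direction of the update and only limits its size, which is why it is a simple remedy for exploding gradients.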
25. From RNN to LSTM
Before we take a close look at the Long Short-Term Memory (LSTM) cell, let's take a look
at the following video:

Long Short-Term Memory (LSTM) cells offer a solution to the vanishing gradient problem,
helping us apply networks to data with long temporal dependencies. They were proposed in 1997
by Sepp Hochreiter and Jürgen Schmidhuber.

If we take a close look at the RNN neuron, we can see that we have simple linear
combinations (with or without the use of an activation function). We can also see that we have
a single addition.

Zooming in on the neuron, we can graphically see this in the following configuration:
26. Wrap Up
Long Short-Term Memory Networks (LSTM)


 01. Intro to LSTM


 02. RNN vs LSTM
 03. Basics of LSTM
 04. Architecture of LSTM
 05. The Learn Gate
 06. The Forget Gate
 07. The Remember Gate
 08. The Use Gate
 09. Putting it All Together
 10. Quiz
 11. Other architectures
 12. Outro LSTM
01. Intro to LSTM

Hi! It's Luis again!


Now that you've gone through the Recurrent Neural Network lesson, I'll be teaching you
what an LSTM is. LSTM stands for Long Short-Term Memory; these networks are quite useful
when our neural network needs to switch between remembering recent things and things from
a long time ago. But first, I want to give you some great references to study this further. There
are many posts out there about LSTMs; here are a few of my favorites:

 Chris Olah's LSTM post


 Edwin Chen's LSTM post
 Andrej Karpathy's lecture on RNNs and LSTMs from CS231n

So, let's dig in!


03. Sequence Batching
Sequence-Batching
04. Character-wise RNN Notebook
Character-wise RNN Notebook
You can get the notebook with the character-wise RNN from our public GitHub
repo in the  intro-to-rnns  folder.

To clone the entire repository to your machine:

git clone https://github.com/udacity/deep-learning.git


This code requires TensorFlow 1.0, so make sure you upgrade if you're using an older
version.
Play with the network, improve it, train it on your own text. This thing is for you to
build off of.

If you find ways to improve it, make a pull request and we'll add it in.

05. Implementing a Character-wise RNN


Implementing a Character-wise RNN
06. Batching Data Solution
Batching Data Solution
07. LSTM Cell
LSTM Cell
08. LSTM Cell Solution
LSTM Cell Solution
09. RNN Output
RNN Output

10. Network Loss
Network Loss
11. Output and Loss Solutions
Output And Loss Solutions

12. Build the Network
Build The Network
13. Build the Network Solution
Build The Network And Results
Part 04-Module 01-Lesson 04_Hyperparameters
01. Introducing Jay
For this section, we're introducing a new Udacity instructor, Jay Alammar. Jay has
done some great work in interactive explorations of neural networks; check out his
blog.

Jay will be reviewing some of the material you saw in the Deep Neural Networks
section on hyperparameters, and he will also introduce the hyperparameters used in
Recurrent Neural Networks.

02. Introduction
Introduction
03. Learning Rate
Learning Rate

Exponential Decay in TensorFlow.
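
For intuition, the schedule that TensorFlow 1.x exposes as tf.train.exponential_decay computes learning_rate * decay_rate ^ (global_step / decay_steps), with an optional staircase mode that uses integer division. The plain-Python sketch below mirrors that formula without depending on TensorFlow; the example values are hypothetical.

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False):
    """Decayed learning rate: learning_rate * decay_rate ** (global_step / decay_steps)."""
    if staircase:
        exponent = global_step // decay_steps   # rate drops in discrete steps
    else:
        exponent = global_step / decay_steps    # rate decays smoothly
    return learning_rate * decay_rate ** exponent

smooth = exponential_decay(0.1, global_step=1500, decay_steps=1000, decay_rate=0.96)
stepped = exponential_decay(0.1, global_step=1500, decay_steps=1000, decay_rate=0.96, staircase=True)
```

In staircase mode the rate stays constant within each block of decay_steps, so at step 1500 it is still the value reached at step 1000.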

Adaptive Learning Optimizers

 AdamOptimizer
 AdagradOptimizer
04. Learning Rate
Learning Rate Tuning #1

Say you're training a model. If the output from the training process looks as shown
below, what action would you take on the learning rate to improve the training?

Epoch 1, Batch 1, Training Error: 8.4181
Epoch 1, Batch 2, Training Error: 8.4177
Epoch 1, Batch 3, Training Error: 8.4177
Epoch 1, Batch 4, Training Error: 8.4173
Epoch 1, Batch 5, Training Error: 8.4169

 Decrease the learning rate
 Try again using the same learning rate
 Increase the learning rate

SOLUTION: Increase the learning rate


Learning Rate Tuning #2

Say you're training a model. If the output from the training process looks as shown
below, what action would you take on the learning rate to improve the training?

Epoch 1, Batch 1, Training Error: 8.71
Epoch 1, Batch 2, Training Error: 3.25
Epoch 1, Batch 3, Training Error: 4.93
Epoch 1, Batch 4, Training Error: 3.30
Epoch 1, Batch 5, Training Error: 4.82

 Decrease the learning rate
 Use an adaptive learning rate
 Increase the learning rate

SOLUTION:
 Decrease the learning rate
 Use an adaptive learning rate
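
The behaviour behind both quizzes can be reproduced on a toy problem. The sketch below runs plain gradient descent on f(w) = w^2, a hypothetical stand-in for a training error, with three learning rates: one far too small, one reasonable, and one too large. All values are illustrative assumptions.

```python
def descend(lr, steps=5, w=1.0):
    """Gradient descent on f(w) = w**2 (gradient 2*w); returns the loss after each step."""
    losses = []
    for _ in range(steps):
        w -= lr * 2 * w
        losses.append(w * w)
    return losses

tiny = descend(0.0001)   # loss barely moves, as in quiz #1: increase the rate
good = descend(0.3)      # loss drops quickly
big = descend(1.05)      # each step overshoots the minimum and the loss grows: decrease the rate
```

On a real network an overly large rate more often shows up as the error bouncing up and down, as in quiz #2, rather than growing steadily.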

05. Minibatch Size


Minibatch Size

Systematic evaluation of CNN advances on the ImageNet by Dmytro Mishkin, Nikolay
Sergievskiy, Jiri Matas
06. Number of Training Iterations / Epochs
Number Of Iterations

The number of training iterations is a hyperparameter we can optimize automatically
using a technique called early stopping (also "early termination").

ValidationMonitor (Deprecated)
In TensorFlow, we can use a ValidationMonitor with tf.contrib.learn not only to monitor
the progress of training, but also to stop the training when certain conditions are met.

The following example from the ValidationMonitor documentation shows how to set
it up. Note that the last three parameters indicate which metric we're optimizing.

validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    test_set.data,
    test_set.target,
    every_n_steps=50,
    metrics=validation_metrics,
    early_stopping_metric="loss",
    early_stopping_metric_minimize=True,
    early_stopping_rounds=200)
The last parameter indicates to ValidationMonitor that it should stop the training
process if the loss did not decrease in 200 steps (rounds) of training.

The validation_monitor is then passed to tf.contrib.learn's "fit" method which runs the
training process:

classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    n_classes=3,
    model_dir="/tmp/iris_model",
    config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1))

classifier.fit(x=training_set.data,
               y=training_set.target,
               steps=2000,
               monitors=[validation_monitor])
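
Independently of any framework, the early stopping rule itself can be sketched in a few lines. The version below is a simplified illustration, not the ValidationMonitor implementation: it counts validation checks rather than training steps, and the function name, loss values, and patience setting are all hypothetical.

```python
def train_with_early_stopping(validation_losses, patience):
    """Return (stop_check, best_loss): stop once `patience` checks pass without a new best loss."""
    best = float("inf")
    since_best = 0
    for check, loss in enumerate(validation_losses, start=1):
        if loss < best:
            best, since_best = loss, 0     # improvement: remember it and reset the counter
        else:
            since_best += 1                # no improvement at this check
            if since_best >= patience:
                return check, best
    return len(validation_losses), best

# Hypothetical validation losses recorded at successive checks.
stop_check, best = train_with_early_stopping(
    [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5], patience=3)
```

Note the trade-off in this example: training stops before the later, lower loss is ever seen, so a patience that is too small can end training prematurely.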
SessionRunHook
More recent versions of TensorFlow deprecated monitors in favor
of SessionRunHooks. SessionRunHooks are an evolving part of tf.train, and going
forward appear to be the proper place to implement early stopping.

At the time of writing, two pre-defined stopping monitors exist as a part of
tf.train's training hooks:

 StopAtStepHook: a monitor that requests that training stop after a certain number of
steps
 NanTensorHook: a monitor that watches the loss and stops training if it encounters a
NaN loss

Jay Alammar's blog: http://jalammar.github.io/
07. Number of Hidden Units / Layers
Number Of Hidden Units Layers

"in practice it is often the case that 3-layer neural networks will outperform 2-layer
nets, but going even deeper (4,5,6-layer) rarely helps much more. This is in stark
contrast to Convolutional Networks, where depth has been found to be an extremely
important component for a good recognition system (e.g. on order of 10 learnable
layers)." ~ Andrej Karpathy in https://cs231n.github.io/neural-networks-1/

More on Capacity
A more detailed discussion on a model's capacity appears in the Deep Learning book,
chapter 5.2 (pages 110-120).
08. RNN Hyperparameters
RNN Hyperparameters
LSTM Vs GRU
"These results clearly indicate the advantages of the gating units over the more
traditional recurrent units. Convergence is often faster, and the final solutions tend
to be better. However, our results are not conclusive in comparing the LSTM and the
GRU, which suggests that the choice of the type of gated recurrent unit may depend
heavily on the dataset and corresponding task."

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling by
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio

"The GRU outperformed the LSTM on all tasks with the exception of language
modelling"

An Empirical Exploration of Recurrent Network Architectures by Rafal Jozefowicz,
Wojciech Zaremba, Ilya Sutskever

"Our consistent finding is that depth of at least two is beneficial. However, between
two and three layers our results are mixed. Additionally, the results are mixed between
the LSTM and the GRU, but both significantly outperform the RNN."

Visualizing and Understanding Recurrent Networks by Andrej Karpathy, Justin
Johnson, Li Fei-Fei

"Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a
nice comparison of popular variants, finding that they’re all about the
same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures,
finding some that worked better than LSTMs on certain tasks."

Understanding LSTM Networks by Chris Olah

"In our [Neural Machine Translation] experiments, LSTM cells consistently
outperformed GRU cells. Since the computational bottleneck in our architecture is the
softmax operation we did not observe large difference in training speed between
LSTM and GRU cells. Somewhat to our surprise, we found that the vanilla decoder is
unable to learn nearly as well as the gated variant."

Massive Exploration of Neural Machine Translation Architectures by Denny Britz, Anna
Goldie, Minh-Thang Luong, Quoc Le
09. RNN Hyperparameters
LSTM Vs. GRU

How do Long Short-Term Memory (LSTM) cells and Gated Recurrent Unit (GRU) cells
compare?

 LSTMs are superior to GRUs in every way
 GRUs are superior to LSTMs in every way
 It depends. It's probably worth it to compare the two on my task and dataset.

SOLUTION: It depends. It's probably worth it to compare the two on my task and dataset.

Which embedding size looks more reasonable for the majority of cases?

 500
 50,000

SOLUTION: 500
10. Sources & References
If you want to learn more about hyperparameters, these are some great resources on
the topic:

 Practical recommendations for gradient-based training of deep
architectures by Yoshua Bengio
 Deep Learning book - chapter 11.4: Selecting Hyperparameters by Ian
Goodfellow, Yoshua Bengio, Aaron Courville
 Neural Networks and Deep Learning book - Chapter 3: How to choose a neural
network's hyper-parameters? by Michael Nielsen
 Efficient BackProp (pdf) by Yann LeCun

More specialized sources:

 How to Generate a Good Word Embedding? by Siwei Lai, Kang Liu, Liheng Xu, Jun
Zhao
 Systematic evaluation of CNN advances on the ImageNet  by Dmytro Mishkin, Nikolay
Sergievskiy, Jiri Matas
 Visualizing and Understanding Recurrent Networks  by Andrej Karpathy, Justin
Johnson, Li Fei-Fei

Part 04-Module 01-Lesson 05_Embeddings and Word2vec


Embeddings and Word2vec


 01. Embeddings Intro


 02. Implementing Word2Vec
 03. Subsampling Solution
 04. Making Batches
 05. Batches Solution
 06. Building the Network
 07. Negative Sampling
 08. Building the Network Solution
 09. Training Results
01. Embeddings Intro

Hi, it's Mat again!

Word Embeddings
This week, we'll be covering embeddings. This is a deep neural network method for
representing data with a huge number of classes more efficiently. Embeddings greatly
improve the ability of networks to learn from data of this sort by representing the
data with lower dimensional vectors.
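
One way to see why this is efficient: multiplying a one-hot vector by a weight matrix just selects one row of that matrix, so an embedding layer can be treated as a lookup table. The toy sketch below checks this equivalence in plain Python; the 4-word vocabulary, the 2-dimensional embedding values, and the helper names are hypothetical.

```python
def one_hot(index, size):
    """One-hot row vector with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_matmul(vec, matrix):
    """Row vector times matrix."""
    cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix))) for c in range(cols)]

# Hypothetical embedding matrix: 4 words in the vocabulary, 2 dimensions each.
embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

word_id = 2
via_matmul = vec_matmul(one_hot(word_id, len(embedding)), embedding)
via_lookup = embedding[word_id]   # same result without any multiplication
```

With tens of thousands of classes, the lookup avoids a large multiplication in which almost every term is zero.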

Word embeddings in particular are interesting because the networks are able to learn
semantic relationships between words. For example, the embeddings will know that
the male equivalent of a queen is a king.
These word embeddings are learned using a model called Word2vec. In this lesson,
you'll implement Word2vec yourself.

We've built a notebook with exercises and also provided our solutions. You can find
the notebooks in our GitHub repo in the  embeddings  folder.

Requirements: You'll need NumPy, Matplotlib, Scikit-learn, tqdm, and TensorFlow
1.0 to run this code.

Next up, I'll walk you through implementing the Word2Vec model.
02. Implementing Word2Vec
Implementing Word2Vec
03. Subsampling Solution
04. Making Batches
05. Batches Solution
06. Building the Network
07. Negative Sampling
08. Building the Network Solution
09. Training Results
Part 04-Module 01-Lesson 06_Sentiment Prediction RNN
Sentiment Prediction RNN


 01. Intro
 02. Sentiment RNN
 03. Data Preprocessing
 04. Creating Testing Sets
 05. Building the RNN
 06. Training the Network
 07. Solutions
01. Intro

It's Mat again, with more knowledge for you

Sentiment Prediction RNN


Welcome to this lesson on building a recurrent neural network for predicting
sentiment. This is intended to give you more experience building RNNs. I'm using the
dataset from Andrew Trask's lesson, a bunch of movie reviews from IMDB labeled with
sentiment (positive or negative).

I'm going to have you implement this RNN. You can find the notebooks in our public
GitHub repo. You can download the notebooks from the  sentiment-rnn  folder there,
or clone the repository:

git clone https://github.com/udacity/deep-learning.git


If you already have the repo, do a  git pull  to get the new notebooks.

The Data
The data,  reviews.txt  and  labels.txt , is located in
the  sentiment_network  directory. You can also find the labels here and the
reviews here.
02. Sentiment RNN
03. Data Preprocessing
04. Creating Testing Sets
Note from Mat
Here I say  split_frac  is the fraction to keep in the "test" set. It's actually the fraction
to keep in the training set. My apologies, will fix this video.
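
To make the corrected meaning concrete, here is a minimal sketch of such a split. It assumes that `split_frac` is the training fraction, as the note says, and that the remaining data is divided evenly between validation and test sets; that even division and the function name are assumptions for illustration, not taken from the notebook.

```python
def split_data(features, labels, split_frac=0.8):
    """Split into (train, validation, test); split_frac is the fraction kept for training."""
    split_idx = int(len(features) * split_frac)
    train_x, rest_x = features[:split_idx], features[split_idx:]
    train_y, rest_y = labels[:split_idx], labels[split_idx:]
    half = len(rest_x) // 2   # assumed: remainder is shared evenly by validation and test
    val = (rest_x[:half], rest_y[:half])
    test = (rest_x[half:], rest_y[half:])
    return (train_x, train_y), val, test

# Hypothetical stand-in data: 100 samples with binary labels.
features = list(range(100))
labels = [i % 2 for i in range(100)]
train, val, test = split_data(features, labels, split_frac=0.8)
```

With 100 samples and split_frac = 0.8, this yields 80 training, 10 validation, and 10 test samples.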
05. Building the RNN
Building The RNN 1

Note from Mat


We're aware of some audio mixup at 1:00 here. We're on it!
06. Training the Network
07. Solutions
Sentiment RNN 2
Part 04-Module 01-Lesson 07_Generate TV Scripts
Generate TV Scripts


 01. Introduction
 02. TV Script Workspace
 Project Description - Generate TV Scripts
 Project Rubric - Generate TV Scripts
01. Introduction
Project-3-Intro

02. TV Script Workspace


Workspace

Generate TV Scripts
Introduction

In this project, you'll generate your own Simpsons TV scripts using RNNs. You'll be using
part of the Simpsons dataset of scripts from 27 seasons. The Neural Network you'll build will
generate a new TV script for a scene at Moe's Tavern.

Getting the project files

The project files can be found in our public GitHub repo, in the  tv-script-generation  folder.
You can download the files from there, but it's better to clone the
repository to your computer:

git clone https://github.com/udacity/deep-learning.git


This way you can stay up to date with any changes we make by pulling the changes to your
local repository with  git pull .

Submission

1. Ensure you've passed all the unit tests in the notebook.
2. Ensure you pass all points on the rubric.
3. When you're done with the project, please save the notebook as an HTML file. You can do this
by going to the File menu in the notebook and choosing "Download as" > HTML. Ensure
you submit both the Jupyter Notebook and its HTML version together.
4. Package the "dlnd_tv_script_generation.ipynb", "helper.py", "problem_unittests.py", and the
HTML file into a zip archive, or push the files from your GitHub repo.
5. Hit Submit Project below!

Advanced Projects

After completing this project, try applying what you learned to one of these problems.

 Generate your own Bach music using something like DeepBach.
 Predict seizures in intracranial EEG recordings on Kaggle.
