Udacity Deep Learning Part 4: RNN
The neural network architectures you've seen so far were trained using the current inputs only.
We did not consider previous inputs when generating the current output. In other words, our
systems did not have any memory elements. RNNs address this very basic and important issue
by using memory (i.e. past inputs to the network) when producing the current output.
03. RNN History
A bit of history
How did the theory behind RNN evolve? Where were we a few years ago and where
are we now?
As mentioned in this video, RNNs have a key flaw: capturing relationships that span more than 8 or 10 steps back is practically impossible. This flaw stems from the "vanishing gradient" problem, in which the contribution of information decays geometrically over time.
Please use these resources if you would like to read more about the Vanishing
Gradient problem or understand further the concept of a Geometric Series and how
its values may exponentially decrease.
If you are still curious, for more information on the important milestones mentioned
here, please take a peek at the following links:
TDNN
As mentioned in the video, Long Short-Term Memory Cells (LSTMs) and Gated
Recurrent Units (GRUs) give a solution to the vanishing gradient problem, by helping
us apply networks that have temporal dependencies. In this lesson we will focus on
RNNs and continue with LSTMs. We will not be focusing on GRUs.
More information about GRUs can be found in the following blog. Focus on the
overview titled: GRUs.
04. RNN Applications
Applications
The world's leading tech companies are all using RNNs, particularly LSTMs, in their
applications. Let's take a look at a few.
There are so many interesting applications; let's look at a few more!
Are you into gaming and bots? Check out the DotA 2 bot by OpenAI
The mathematical calculations needed for training RNN systems are fascinating. To deeply
understand the process, we first need to feel confident with the vanilla FFNN system. We need
to thoroughly understand the feedforward process, as well as the backpropagation process used
in the training phases of such systems.
The next few videos will cover these topics, which you are already familiar with. We will
address the feedforward process as well as backpropagation, using specific examples. These
examples will serve as extra content to help further understand RNNs later in this lesson.
The following couple of videos will give you a brief overview of the Feedforward Neural
Network (FFNN).
OK, you can take a small break now. We will continue with FFNN when you come back!
As mentioned before, when working with neural networks we have two primary phases: Training and Evaluation.
During the training phase, we take the data set (also called the training set), which includes
many pairs of inputs and their corresponding targets (outputs). Our goal is to find a set of
weights that would best map the inputs to the desired outputs.
In the evaluation phase, we use the network that was created in the training phase, apply our
new inputs and expect to obtain the desired outputs.
The training phase itself consists of two steps: Feedforward and Backpropagation.
We will repeat these steps as many times as we need until we decide that our system has
reached the best set of weights, giving us the best possible outputs.
You will notice that in these videos I use subscripts as well as superscripts as numeric notation for the weight matrices.
If you are not feeling confident with linear combinations and matrix multiplications,
you can use the following links as a refresher:
Linear Combination
Matrix Multiplication
Assuming that we have a single hidden layer, we will need two steps in our calculations. The first will be calculating the value of the hidden states, and the second will be calculating the value of the outputs.
Notice that both the hidden layer and the output layer are displayed as vectors, as
they are both represented by more than a single neuron.
Our first video will help you understand the first step: calculating the value of the hidden states.
As you saw in the video above, vector \bar{h}' of the hidden layer will be calculated by multiplying the input vector with the weight matrix W^1 in the following way:

\bar{h}' = \bar{x} W^1
Using vector by matrix multiplication, we can look at this computation the following
way:
Equation 1
\bar{h} = \Phi(\bar{h}')
Since W_{ij} represents the weight component in the weight matrix, connecting neuron i from the input to neuron j in the hidden layer, we can also write these calculations in the following way (notice that in this example we have n inputs and only 3 hidden neurons):
Equation 2
More information on the activation functions and how to use them can be found here
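The following is a minimal NumPy sketch of this first step, assuming a sigmoid activation for \Phi (the lesson keeps the activation general, and the sizes here are hypothetical):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical sizes: n = 4 inputs and 3 hidden neurons.
    x = np.random.randn(4)        # input vector x
    W1 = np.random.randn(4, 3)    # weight matrix W^1 (input to hidden)

    h_prime = x @ W1              # h' = x W^1 (linear combination)
    h = sigmoid(h_prime)          # h = Phi(h')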
This next video will help you understand the second step: calculating the values of the outputs.
As you've seen in the video above, the process of calculating the output vector is
mathematically similar to that of calculating the vector of the hidden layer. We use,
again, a vector by matrix multiplication, which can be followed by an activation
function. The vector is the newly calculated hidden layer and the matrix is the one
connecting the hidden layer to the output.
Equation 3
The two error functions that are most commonly used are the Mean Squared Error
(MSE) (usually used in regression problems) and the cross entropy (usually used in
classification problems).
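Continuing the NumPy sketch from the first step, here is a hedged example of the output computation and both error functions (the output weight vector W^2, the target d, and the sizes are all hypothetical):

    # W^2 connects the 3 hidden neurons to 2 output neurons.
    W2 = np.random.randn(3, 2)
    y = sigmoid(h @ W2)           # Equation 3: y = Phi(h W^2)

    d = np.array([1.0, 0.0])      # desired (target) output

    mse = 0.5 * np.sum((d - y) ** 2)     # Mean Squared Error (with the 1/2 factor)
    cross_entropy = -np.sum(d * np.log(y) + (1 - d) * np.log(1 - y))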
The next few videos will focus on the backpropagation process, or what we also call stochastic gradient descent with the use of the chain rule.
07. Feedforward Quiz
The following picture is of a feedforward network with:
A single input x
Two hidden layers (with M and N neurons, respectively)
A single output
How many multiplication operations are needed for a single feedforward pass?
MN
M+N
M+N+2MN
M+N+NM
There isn't enough information to solve this question
SOLUTION:M+N+NM
Solution
To calculate the number of multiplications needed for a single feedforward pass, we can break
down the network to three steps:
Step 1
The single input is multiplied by a vector with M values. Each value in the vector will
represent a weight connecting the input to the first hidden layer. Therefore, we will
have M multiplication operations.
Step 2
Each value in the first hidden layer (M in total) will be multiplied by a vector with N values.
Each value in the vector will represent a weight connecting the neurons in the first hidden
layer to the neurons in the second hidden layer. Therefore, we will have here M times N
calculations, or simply MN multiplication operations.
Step 3
Each value in the second hidden layer (N in total) will be multiplied once, by the weight element connecting it to the single output. Therefore, we will have N multiplication operations. Summing all three steps gives M + MN + N multiplications in total.
For calculation purposes in future quizzes of the lesson, you can use the following link as a
reference for common derivatives.
If we look at an arbitrary layer k, we can define the amount by which we change the weights from neuron i to neuron j stemming from layer k as \Delta W_{ij}^k.
The superscript k indicates that the weight connects layer k to layer k+1.
Therefore, the weight update rule for that neuron can be expressed as:
The updated value \Delta W_{ij}^k is calculated through the use of the gradient calculation, in the following way:

\Delta W_{ij}^k = \alpha \left(-\frac{\partial E}{\partial W}\right),

where \alpha is a small positive number called the Learning Rate.
Equation 5
From this derivation we can easily see that the weight updates are calculated by the following equation:
Since many weights determine the network’s output, we can use a vector of the partial derivatives (denoted by the Greek letter nabla, \nabla) of the network error, each with respect to a different weight:

W_{new} = W_{previous} + \alpha \nabla_W(-E)
Equation 7
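As a minimal sketch of one such gradient-descent update in NumPy (the gradient value here is a hypothetical placeholder for a computed \partial E / \partial W):

    import numpy as np

    def gradient_descent_step(W_previous, grad_E, alpha=0.01):
        # Equation 7: W_new = W_previous + alpha * (-dE/dW)
        return W_previous + alpha * (-grad_E)

    # Example with a 2x3 weight matrix and a made-up gradient.
    W = np.ones((2, 3))
    W = gradient_descent_step(W, grad_E=np.full((2, 3), 0.5))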
Here you can find other good resources for understanding and tuning the Learning Rate:
resource 1
resource 2
The following video is given as a refresher on overfitting. You have already seen this concept in the Training Neural Networks lesson. Feel free to skip it and jump right into the next video.
09. Backpropagation - Example (part a)
Backpropagation- Example (part a)
We will now continue with an example focusing on the backpropagation process, and consider a network having two inputs [x_1, x_2], three neurons in a single hidden layer [h_1, h_2, h_3] and a single output y.
The weight matrices to update are W^1 from the input to the hidden layer, and W^2 from the hidden layer to the output. Notice that in our case W^2 is a vector, not a matrix, as we only have one output.
The chain of thought in the weight updating process is as follows:
To update the weights, we need the network error. To find the network error, we need the
network output, and to find the network output we need the value of the hidden layer,
vector \bar{h}.

\bar{h} = [h_1, h_2, h_3]
Equation 8
Each element of vector \bar{h} is calculated by a simple linear combination of the input vector with its corresponding weight matrix W^1, followed by an activation function.
Equation 9
We now need to find the network's output, y. y is calculated in a similar way, by using a linear combination of the vector \bar{h} with its corresponding elements of the weight vector W^2.
Equation 10
After computing the output, we can finally find the network error.
As a reminder, the two Error functions most commonly used are the Mean Squared Error
(MSE) (usually used in regression problems) and the cross entropy (often used in classification
problems).
E = \frac{(d-y)^2}{2},
where d is the desired output and y is the calculated one. Notice that y and d are not vectors in this case, as we have a single output.
The error is their squared difference, E = (d-y)^2, and is also called the network's Loss Function. We are dividing the error term by 2 to simplify notation, as will become clear soon.
The aim of the backpropagation process is to minimize the error, which in our case is the Loss
Function. To do that we need to calculate its partial derivative with respect to all of the
weights.
Since we just found the output y, we can now minimize the error by finding the updated values \Delta W_{ij}^k.
The superscript k indicates that we need to update each and every layer k.
As we noted before, the weight update value \Delta W_{ij}^k is calculated with the use of the gradient in the following way:

\Delta W_{ij}^k = \alpha \left(-\frac{\partial E}{\partial W}\right)
Therefore:
Equation 12
We will find all the elements of the gradient using the chain rule.
If you are feeling confident with the chain rule and understand how to apply it, skip the next
video and continue with our example. Otherwise, give Luis a few minutes of your time as he
takes you through the process!
Chain Rule
10. Backpropagation- Example (part b)
Backpropagation- Example (part b)
Now that we understand the chain rule, we can continue with our backpropagation example, where we will calculate the gradient.
In our example we only have one hidden layer, so our backpropagation process will consist of
two steps:
Step 1: Calculating the gradient with respect to the weight vector W^2 (from the output to the hidden layer).
Step 2: Calculating the gradient with respect to the weight matrix W^1 (from the hidden layer to the input).
Step 1
(Note that the weight vector referenced here will be W^2. All indices referring to W^2 have been omitted from the calculations to keep the notation simple.)
Equation 13
Having calculated the incremental value, we can update vector W^2 in the following way:
Equation 15
Step 2
(In this step, we will need to use both weight matrices. Therefore we will not be omitting the
weight indices.)
In our second step we will update the weights of matrix W^1 by calculating the partial derivative of y with respect to the weight matrix W^1.
The chain rule will be used in the following way: obtain the partial derivative of y with respect to \bar{h}, and multiply it by the partial derivative of \bar{h} with respect to the corresponding elements in W^1. Instead of referring to vector \bar{h}, we can look at each element and present the equation in the following way:
Equation 16
In this example we have only 3 neurons in the single hidden layer, so this will be a linear combination of three elements:
Equation 17
Equation 18
Notice that most of the derivatives were zero, leaving us with the simple solution of \frac{\partial y}{\partial h_j} = W^2_j.
To calculate \frac{\partial h_j}{\partial W^1_{ij}}, we first need to remember that
Equation 19
Therefore:
Equation 20
Since the function h_j is an activation function \Phi of a linear combination, its partial derivative will be calculated in the following way:
Equation 21
Given that there are various activation functions, we will leave the partial derivative of \Phi in general notation. Each neuron j will have its own value for \Phi and \Phi', according to the activation function we choose to use.
Equation 22
(Notice how simple the result is, as most of the components of this partial derivative are zero).
Equation 23
After understanding how to treat each multiplication of equation 21 separately, we can now summarize it in the following way:
Equation 24
Equation 25
Since \Delta W^1_{ij} = \alpha (d-y) \frac{\partial y}{\partial W^1_{ij}}, when finalizing Step 2 we have:
Equation 26
Having calculated the incremental value, we can update matrix W^1 in the following way:

W^1_{new} = W^1_{previous} + \Delta W^1_{ij}

W^1_{new} = W^1_{previous} + \alpha (d-y) W^2_j \Phi'_j x_i
Equation 27
After updating the weight matrices we begin once again with the Feedforward pass, starting
the process of updating the weights all over again.
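Putting the whole example together, here is a minimal NumPy sketch of one feedforward pass and one weight update for this 2-input, 3-hidden-neuron, single-output network. It assumes a sigmoid activation for \Phi and a linear output, and the input, target, and learning-rate values are hypothetical:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -0.2])     # inputs [x_1, x_2]
    d = 0.9                       # desired output
    alpha = 0.1                   # learning rate

    W1 = np.random.randn(2, 3)    # input -> hidden
    W2 = np.random.randn(3)       # hidden -> output (a vector)

    # Feedforward.
    h = sigmoid(x @ W1)           # Equation 9
    y = h @ W2                    # Equation 10
    E = 0.5 * (d - y) ** 2        # the Loss Function

    # Step 1: Delta W^2 = alpha * (d - y) * h, since dy/dW^2_j = h_j.
    W2_new = W2 + alpha * (d - y) * h

    # Step 2: Delta W^1_ij = alpha * (d - y) * W^2_j * Phi'_j * x_i
    # (Equation 27); for a sigmoid, Phi'_j = h_j * (1 - h_j).
    phi_prime = h * (1.0 - h)
    W1_new = W1 + alpha * (d - y) * np.outer(x, W2 * phi_prime)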
This video touches on the subject of Mini Batch Training. We will further explain things in
our Hyperparameters lesson coming up.
11. Backpropagation Quiz
The following picture is of a feedforward network with
A single input x
Two hidden layers with two neurons in each layer
A single output
What is the update rule of weight matrix W1?
(In other words, what is the partial derivative of y with respect to W1?)
Equation A
Equation B
Equation C
Equation D
SOLUTION:Equation A
12. RNN (part a)
We are finally ready to talk about Recurrent Neural Networks (or RNNs), where we will be
opening the doors to new content!
RNNs are based on the same principles as those behind FFNNs, which is why we spent so
much time reminding ourselves of the feedforward and backpropagation steps that are used in
the training phase.
There are two main differences between FFNNs and RNNs. The Recurrent Neural Network uses:
sequences as inputs in the training phase, and
memory elements.
Memory is defined as the output of the hidden layer neurons, which will serve as additional input to the network during the next training step.
The basic three-layer neural network with feedback that serves as memory input is called the Elman Network and is depicted in the following picture:
Diff between RNN and feedforward NN
13. RNN (part b)
As we've seen, in FFNNs the output at any time t is a function of the current input and the weights. This can be easily expressed using the following equation:
Equation 28
In RNNs, our output at time t depends not only on the current input and the weights, but also on previous inputs. In this case the output at time t will be defined as:
Equation 29
In both the folded and unfolded models shown above the following notation is used:
In RNNs the state layer depends on the current inputs, their corresponding weights, the activation function, and also on the previous state:
Equation 31
The output vector is calculated exactly the same as in FFNNs. It can be a linear combination of the inputs to each output node with the corresponding weight matrix W_y, or a softmax function of the same linear combination:

\bar{y}_t = \bar{s}_t W_y

or

\bar{y}_t = \sigma(\bar{s}_t W_y)
Equation 32
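A minimal NumPy sketch of Equations 31 and 32, assuming \Phi = tanh and hypothetical sizes (4-dimensional inputs, a 3-dimensional state, 2-dimensional outputs):

    import numpy as np

    Wx = np.random.randn(4, 3)    # W_x: inputs -> state
    Ws = np.random.randn(3, 3)    # W_s: previous state -> state
    Wy = np.random.randn(3, 2)    # W_y: state -> output

    s = np.zeros(3)               # initial state
    inputs = [np.random.randn(4) for _ in range(5)]   # a 5-step sequence

    for x_t in inputs:
        s = np.tanh(x_t @ Wx + s @ Ws)   # Equation 31: s_t = Phi(x_t W_x + s_{t-1} W_s)
        y = s @ Wy                       # Equation 32: y_t = s_t W_y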
The next video will focus on the unfolded model as we try to further understand it.
14. RNN- Unfolded Model
15. Unfolded Model Quiz
Picture A
Picture B
Both A and B are correct
I don't have enough information to answer this question
If you are unfamiliar with the term sequence detection, the idea is to see if a specific pattern of inputs has entered the system. In our example the pattern will be the word UDACITY, entered one letter at a time (U, D, A, C, I, T, Y).
17. Backpropagation Through Time (part a)
We are now ready to understand how to train the RNN.
When we train RNNs we also use backpropagation, but with a conceptual change. The process
is similar to that in the FFNN, with the exception that we need to consider previous time steps,
as the system has memory. This process is called Backpropagation Through Time
(BPTT) and will be the topic of the next three videos.
In the following videos we will use the Loss Function for our error. The Loss Function is the
square of the difference between the desired and the calculated outputs. There are variations to
the Loss Function, for example, factoring it with a scalar. In the backpropagation example we
used a factoring scalar of 1/2 for calculation convenience.
As described previously, the two most commonly used are the Mean Squared Error
(MSE) (usually used in regression problems) and the cross entropy (usually used in
classification problems).
Play
02:07
03:07
Disable captions
Settings
Enter fullscreen
Play
Before diving into Backpropagation Through Time we need a few reminders.
Equation 33
As mentioned before, for the error calculations we will use the Loss Function, where
Equation 35
In BPTT we train the network at timestep t as well as take into account all of the previous
timesteps.
The easiest way to explain the idea is to simply jump into an example.
In this example we will focus on the BPTT process for time step t=3. You will see that in order to adjust all three weight matrices, W_x, W_s and W_y, we need to consider timestep 3 as well as timestep 2 and timestep 1.
As we are focusing on timestep t=3, the Loss function will be: E_3 = (\bar{d}_3 - \bar{y}_3)^2
To update each weight matrix, we need to find the partial derivatives of the Loss Function at
time 3, as a function of all of the weight matrices. We will modify each matrix using gradient
descent while considering the previous timesteps.
Gradient Considerations in the Folded Model
18. Backpropagation Through Time (part b)
We will now unfold the model. You will see that unfolding the model in time is very
helpful in visualizing the number of steps (translated into multiplication) needed in
the Backpropagation Through Time process. These multiplications stem from the
chain rule and are easily visualized using this model.
In this video we will understand how to use Backpropagation Through Time (BPTT)
when adjusting two weight matrices:
The unfolded model can be very helpful in visualizing the BPTT process.
The Unfolded Model at timestep 3
Generally speaking, we can consider multiple timesteps back, and not only 3 as in this example. For an arbitrary timestep N, the gradient calculation needed for adjusting W_y is:
Equation 37
When calculating the partial derivative of the Loss Function with respect to W_s, we need to consider all of the states contributing to the output. In the case of this example it will be state \bar{s}_3, which depends on its predecessor \bar{s}_2, which depends on its predecessor \bar{s}_1, the first state.
In BPTT we will take into account every gradient stemming from each
state, accumulating all of these contributions.
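As a runnable scalar sketch of this accumulation (every weight is a single number here, \Phi = tanh, and all values are hypothetical), the contributions from s_3, s_2 and s_1 fold into one recursion:

    import numpy as np

    # Scalar RNN: s_t = tanh(x_t * wx + s_{t-1} * ws), y_t = s_t * wy.
    wx, ws, wy = 0.5, 0.3, 0.8          # hypothetical weights
    xs = [1.0, -0.5, 0.25]              # inputs x_1, x_2, x_3
    d3 = 0.7                            # desired output at t = 3

    # Forward pass, storing every state (s[0] is the initial state).
    s = [0.0]
    for x in xs:
        s.append(np.tanh(x * wx + s[-1] * ws))
    y3 = s[3] * wy
    E3 = (d3 - y3) ** 2

    # BPTT accumulation: ds_t/dws = phi'_t * (s_{t-1} + ws * ds_{t-1}/dws),
    # which sums the gradient contributions stemming from each state.
    ds_dws = 0.0
    for t in range(1, 4):
        phi_prime = 1.0 - s[t] ** 2     # tanh'(z) = 1 - tanh(z)^2
        ds_dws = phi_prime * (s[t - 1] + ws * ds_dws)

    grad_ws = -2.0 * (d3 - y3) * wy * ds_dws   # dE_3/dW_s
    ws = ws - 0.1 * grad_ws                    # gradient-descent update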
Gradient calculations needed to adjust W_x
To further understand the BPTT process, we will simplify the unfolded model again. This time the focus will be on the contributions of W_x to the output, in the following way:
Simplified Unfolded model for Adjusting Wx
When calculating the partial derivative of the Loss Function with respect to W_x, we need to consider, again, all of the states contributing to the output. As we saw before, in the case of this example it will be state \bar{s}_3, which depends on its predecessor \bar{s}_2, which depends on its predecessor \bar{s}_1, the first state.
As we mentioned previously, in BPTT we will take into account each gradient
stemming from each state, accumulating all of the contributions.
Equation 43
Equation 45
20. BPTT Quiz 1
Equation A
Equation B
Equation C
Equation D
SOLUTION:Equation D
Solution
\bar{z} and \bar{s} are vectors, as we indicated that they have multiple neurons in each layer. Using this logic we can understand that equations A and C are incorrect.
Since w_2 connects the hidden state \bar{z} to itself, we know that we need to consider the previous timestep here. Therefore only equation D is the correct one.
21. BPTT Quiz 2
Equation A
Equation B
Equation C
Equation D
SOLUTION:Equation B
Solution
Equation B is the only equation with the correct derivation of the chain rule and the proper use of the learning rate.
22. BPTT Quiz 3
Equation A
Equation B
Equation C
SOLUTION:Equation C
23. Some more math
This section is given as bonus material and is not mandatory. If you are curious how we
derived the final accumulative equation for BPTT, this section will help you out.
As a reminder, the following two equations were derived when adjusting the weights of matrix W_s and matrix W_x:
To generalize the case, we will avoid proving equation 48 or 49, and will focus on a general
framework.
Let's look at the following sketch, presenting a portion of a network:
RNN Summary
As you have seen, in RNNs the current state depends on the input as well as the previous
states, with the use of an activation function.
Equation 56
The current output is a simple linear combination of the current state elements with the
corresponding weight matrix.
We can represent the recurrent network with the use of a folded model or an unfolded model:
The RNN Folded Model
In the case of a single hidden (state) layer, we will have three weight matrices to consider.
Here we use the following notations:
W_x - represents the weight matrix connecting the inputs to the state layer.
W_y - represents the weight matrix connecting the state to the output.
W_s - represents the weight matrix connecting the state from the previous timestep to the state in the following timestep.
The gradient calculations for the purpose of adjusting the weight matrices are the following:
Equation 58
Equation 59
Equation 60
When training RNNs using BPTT, we can choose to use mini-batches, where we update the weights in batches periodically (as opposed to once every input sample). We calculate the gradient for each step but do not update the weights right away. Instead, we update the weights once every fixed number of steps. This helps reduce the complexity of the training process and helps remove noise from the weight updates.
The following is the equation used for Mini-Batch Training Using Gradient Descent (where \delta_{ij} represents the gradient calculated once every input sample and M represents the number of gradients we accumulate in the process):
Equation 61
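A minimal sketch of this idea; compute_gradient and samples are hypothetical placeholders for the per-sample gradient \delta_{ij} and the training data:

    import numpy as np

    M = 32                                # number of gradients to accumulate
    alpha = 0.01                          # learning rate
    W = np.zeros((4, 3))                  # hypothetical weight matrix

    accumulated = np.zeros_like(W)
    for i, sample in enumerate(samples, start=1):
        accumulated += compute_gradient(sample, W)   # one delta per sample
        if i % M == 0:                    # update once every M samples
            W -= alpha * accumulated / M  # apply the averaged gradient
            accumulated = np.zeros_like(W)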
If we backpropagate more than ~10 timesteps, the gradient will become too small. This phenomenon is known as the vanishing gradient problem, where the contribution of information decays geometrically over time. Therefore, temporal dependencies that span many time steps will effectively be discarded by the network. Long Short-Term Memory (LSTM) cells were designed to specifically solve this problem.
In RNNs we can also have the opposite problem, called the exploding gradient problem, in
which the value of the gradient grows uncontrollably. A simple solution for the exploding
gradient problem is Gradient Clipping.
You can concentrate on Algorithm 1, which describes the gradient clipping idea simply.
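A minimal sketch of gradient clipping by norm (the threshold value is a hypothetical choice):

    import numpy as np

    def clip_gradient(grad, threshold=5.0):
        # If the gradient's norm exceeds the threshold, rescale it so
        # its norm equals the threshold; otherwise leave it unchanged.
        norm = np.linalg.norm(grad)
        if norm > threshold:
            grad = grad * (threshold / norm)
        return grad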
25. From RNN to LSTM
Before we take a close look at the Long Short-Term Memory (LSTM) cell, let's take a look
at the following video:
Long Short-Term Memory (LSTM) cells give a solution to the vanishing gradient problem by helping us apply networks that have temporal dependencies. They were proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber.
If we take a close look at the RNN neuron, we can see that we have simple linear
combinations (with or without the use of an activation function). We can also see that we have
a single addition.
Zooming in on the neuron, we can graphically see this in the following configuration:
26. Wrap Up
Long Short-Term Memory Networks (LSTM)
Now that you've gone through the Recurrent Neural Network lesson, I'll be teaching you what an LSTM is. This stands for Long Short-Term Memory Networks, which are quite useful when our neural network needs to switch between remembering recent things and things from a long time ago. But first, I want to give you some great references to study this further. There are many posts out there about LSTMs; here are a few of my favorites:
If you find ways to improve it, make a pull request and we'll add it in.
10. Network Loss
Network Loss
11. Output and Loss Solutions
Output And Loss Solutions
12. Build the Network
Build The Network
13. Build the Network Solution
Build The Network And Results
Part 04-Module 01-Lesson 04_Hyperparameters
01. Introducing Jay
For this section, we're introducing a new Udacity instructor, Jay Alammar. Jay has done some great work in interactive explorations of neural networks; check out his blog.
Jay will be reviewing some of the material you saw in the Deep Neural Networks
section on hyperparameters, and he will also introduce the hyperparameters used in
Recurrent Neural Networks.
02. Introduction
Introduction
03. Learning Rate
Learning Rate
AdamOptimizer
AdagradOptimizer
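As a small illustration in the TF 1.x style used elsewhere in this course, here is how a learning rate is passed to these two optimizers; `loss` is a hypothetical loss tensor:

    import tensorflow as tf

    # Adam with its commonly used default learning rate.
    train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

    # Adagrad adapts the effective learning rate per parameter over time.
    # train_op = tf.train.AdagradOptimizer(learning_rate=0.01).minimize(loss)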
04. Learning Rate
Learning Rate Tuning #1
Say you're training a model. If the output from the training process looks as shown
below, what action would you take on the learning rate to improve the training?
Try again using the same learning rate
Increase the learning rate
Say you're training a model. If the output from the training process looks as shown
below, what action would you take on the learning rate to improve the training?
Use an adaptive learning rate
Increase the learning rate
SOLUTION:
ValidationMonitor (Deprecated)
In TensorFlow, we can use a ValidationMonitor with tf.contrib.learn to not only monitor the progress of training, but to also stop the training when certain conditions are met.
The following example from the ValidationMonitor documentation shows how to set
it up. Note that the last three parameters indicate which metric we're optimizing.
    validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
        test_set.data,
        test_set.target,
        every_n_steps=50,
        metrics=validation_metrics,
        early_stopping_metric="loss",
        early_stopping_metric_minimize=True,
        early_stopping_rounds=200)
The last parameter indicates to ValidationMonitor that it should stop the training
process if the loss did not decrease in 200 steps (rounds) of training.
The validation_monitor is then passed to tf.contrib.learn's "fit" method which runs the
training process:
    classifier = tf.contrib.learn.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=[10, 20, 10],
        n_classes=3,
        model_dir="/tmp/iris_model",
        config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1))
    classifier.fit(x=training_set.data,
                   y=training_set.target,
                   steps=2000,
                   monitors=[validation_monitor])
SessionRunHook
More recent versions of TensorFlow deprecated monitors in favor
of SessionRunHooks. SessionRunHooks are an evolving part of tf.train, and going
forward appear to be the proper place where you'd implement early stopping.
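As a hedged sketch of what such a hook might look like (the loss tensor and patience value are hypothetical, and this is not an official implementation):

    import tensorflow as tf

    class EarlyStoppingHook(tf.train.SessionRunHook):
        """Stop training when the loss has not improved for `patience` runs."""

        def __init__(self, loss_tensor, patience=200):
            self._loss_tensor = loss_tensor
            self._patience = patience
            self._best = float("inf")
            self._bad_runs = 0

        def before_run(self, run_context):
            # Also fetch the loss tensor on every session run.
            return tf.train.SessionRunArgs(self._loss_tensor)

        def after_run(self, run_context, run_values):
            loss = run_values.results
            if loss < self._best:
                self._best, self._bad_runs = loss, 0
            else:
                self._bad_runs += 1
            if self._bad_runs >= self._patience:
                run_context.request_stop()    # ends the training loop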
http://jalammar.github.io/
07. Number of Hidden Units / Layers
Number Of Hidden Units Layers
"in practice it is often the case that 3-layer neural networks will outperform 2-layer
nets, but going even deeper (4,5,6-layer) rarely helps much more. This is in stark
contrast to Convolutional Networks, where depth has been found to be an extremely
important component for a good recognition system (e.g. on order of 10 learnable
layers)." ~ Andrej Karpathy in https://cs231n.github.io/neural-networks-1/
More on Capacity
A more detailed discussion on a model's capacity appears in the Deep Learning book,
chapter 5.2 (pages 110-120).
08. RNN Hyperparameters
RNN Hyperparameters
LSTM Vs GRU
"These results clearly indicate the advantages of the gating units over the more
traditional recurrent
units. Convergence is often faster, and the final solutions tend to be better. However,
our results are
not conclusive in comparing the LSTM and the GRU, which suggests that the choice of
the type of
gated recurrent unit may depend heavily on the dataset and corresponding task."
"The GRU outperformed the LSTM on all tasks with the exception of language
modelling"
How do Long Short Term Memory (LSTM) cells and Gated Recurrent Unit (GRU) cells
compare?
LSTMs are superior to GRUs in every way
GRUs are superior to LSTMs in every way
It depends. It's probably worth it to compare the two on my task and dataset.
SOLUTION:It depends. It's probably worth it to compare the two on my task and dataset.
Which embedding size looks more reasonable for the majority of cases?
500
50,000
SOLUTION:500
10. Sources & References
If you want to learn more about hyperparameters, these are some great resources on
the topic:
Neural Networks and Deep Learning book - Chapter 3: How to choose a neural
network's hyper-parameters? by Michael Nielsen
How to Generate a Good Word Embedding? by Siwei Lai, Kang Liu, Liheng Xu, Jun
Zhao
Systematic evaluation of CNN advances on the ImageNet by Dmytro Mishkin, Nikolay
Sergievskiy, Jiri Matas
Visualizing and Understanding Recurrent Networks by Andrej Karpathy, Justin
Johnson, Li Fei-Fei
Word Embeddings
This week, we'll be covering embeddings. This is a deep neural network method for
representing data with a huge number of classes more efficiently. Embeddings greatly
improve the ability of networks to learn from data of this sort by representing the
data with lower dimensional vectors.
Word embeddings in particular are interesting because the networks are able to learn
semantic relationships between words. For example, the embeddings will know that
the male equivalent of a queen is a king.
These word embeddings are learned using a model called Word2vec. In this lesson,
you'll implement Word2vec yourself.
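As a minimal sketch of the core idea (the vocabulary and embedding sizes are hypothetical), an embedding is just a weight matrix whose rows are looked up by word id, shown here in the TF 1.x style used in this course:

    import tensorflow as tf

    vocab_size, embed_dim = 10000, 300
    embedding = tf.Variable(
        tf.random_uniform((vocab_size, embed_dim), -1.0, 1.0))

    word_ids = tf.placeholder(tf.int32, [None])              # batch of word ids
    embedded = tf.nn.embedding_lookup(embedding, word_ids)   # [batch, 300]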
We've built a notebook with exercises and also provided our solutions. You can find
the notebooks in our GitHub repo in the embeddings folder.
Next up, I'll walk you through implementing the Word2Vec model.
02. Implementing Word2Vec
Implementing Word2Vec
03. Subsampling Solution
04. Making Batches
05. Batches Solution
06. Building the Network
07. Negative Sampling
08. Building the Network Solution
09. Training Results
Part 04-Module 01-Lesson 06_Sentiment Prediction RNN
Sentiment Prediction RNN
01. Intro
02. Sentiment RNN
03. Data Preprocessing
04. Creating Testing Sets
05. Building the RNN
06. Training the Network
07. Solutions
01. Intro
I'm going to have you implement this RNN. You can find the notebooks in our public
GitHub repo. You can download the notebooks from the sentiment-rnn folder there,
or clone the repository:
The Data
The data, reviews.txt and labels.txt, is located in the sentiment_network directory. You can also find the labels here and the reviews here.
02. Sentiment RNN
03. Data Preprocessing
04. Creating Testing Sets
Note from Mat
Here I say split_frac is the fraction to keep in the "test" set. It's actually the fraction
to keep in the training set. My apologies, will fix this video.
05. Building the RNN
Building The RNN 1
01. Introduction
02. TV Script Workspace
Project Description - Generate TV Scripts
Project Rubric - Generate TV Scripts
01. Introduction
Project-3-Intro
This section contains a workspace (it can be a Jupyter Notebook workspace, an online code editor workspace, etc.) that cannot be automatically downloaded and reproduced here. Please access the classroom with your account and manually download the workspace to your local machine. Note that for some courses, Udacity uploads the workspace files onto https://github.com/udacity, so you may be able to download them there.
Workspace Information:
In this project, you'll generate your own Simpsons TV scripts using RNNs. You'll be using
part of the Simpsons dataset of scripts from 27 seasons. The Neural Network you'll build will
generate a new TV script for a scene at Moe's Tavern.
The project files can be found in our public GitHub repo, in the tv-script-generation folder. You can download the files from there, but it's better to clone the repository to your computer.
Submission
Advanced Projects
After completing this project, try applying what you learned to one of these problems.