Recurrent Neural Networks (RNNs)
Shusen Wang
How to model sequential data?
• Limitations of FC Nets and ConvNets:
• They process the whole input (e.g., a whole paragraph) in one shot rather than step by step.
• Fixed-size input (e.g., an image).
• Fixed-size output (e.g., predicted probabilities).
• RNNs are better suited to modeling sequential data (e.g., text, speech, and time series).
Recurrent Neural Networks (RNNs)
[Figure: an RNN unrolled over the sentence "the cat sat ⋯ mat". Each word is mapped to a word embedding 𝐱ₜ, the hidden state is updated step by step, and the same parameter matrix is shared across all time steps.]
Simple RNN Model
Simple RNN
𝐡ₜ = tanh(𝐀 ⋅ [𝐡ₜ₋₁; 𝐱ₜ]), where tanh(⋅) is the hyperbolic tangent function and [𝐡ₜ₋₁; 𝐱ₜ] stacks the previous state and the current word embedding into one vector.
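To make the update concrete, here is a minimal NumPy sketch of one recurrence step (my own illustration; the variable names A, h_prev, x are not from the slides):
import numpy as np

def simple_rnn_step(A, h_prev, x):
    # h_t = tanh(A · [h_{t-1}; x_t]): concatenate the old state with the new input,
    # multiply by the shared parameter matrix A, then squash with tanh.
    return np.tanh(A @ np.concatenate([h_prev, x]))

# example shapes matching the slides: shape(h) = shape(x) = 32
A = np.random.randn(32, 64) * 0.1
h = np.zeros(32)
for x in np.random.randn(500, 32):   # one 500-word review, one embedding per word
    h = simple_rnn_step(A, h, x)     # h now summarizes the whole sequence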
Simple RNN
Question: Why do we need the tanh function?
• Suppose 𝐱₁ = ⋯ = 𝐱₁₀₀ = 𝟎, and drop the tanh so that the update is purely linear.
• Then 𝐡₁₀₀ = 𝐀𝐡₉₉ = 𝐀²𝐡₉₈ = ⋯ = 𝐀¹⁰⁰𝐡₀.
• What will happen if 𝜆max(𝐀) = 0.9? (0.9¹⁰⁰ ≈ 3×10⁻⁵, so the state shrinks toward zero.)
• What will happen if 𝜆max(𝐀) = 1.2? (1.2¹⁰⁰ ≈ 8×10⁷, so the state blows up.)
• The tanh acts as a normalization that keeps every entry of 𝐡ₜ in (−1, 1).
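A quick numerical illustration of the two cases (my own sketch, not from the slides), using a diagonal 𝐀 so that 𝜆max is easy to control:
import numpy as np

h0 = np.ones(32)                      # some nonzero initial state
for lam in (0.9, 1.2):
    A = lam * np.eye(32)              # largest eigenvalue of A is exactly lam
    h = h0.copy()
    for _ in range(100):              # 100 steps with x_t = 0 and no tanh
        h = A @ h
    print(lam, np.linalg.norm(h))     # 0.9 -> ~2.7e-5 * ||h0|| (vanishes), 1.2 -> ~8.3e7 * ||h0|| (explodes)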
Simple RNN
Trainable parameters: the matrix 𝐀 in 𝐡ₜ = tanh(𝐀 ⋅ [𝐡ₜ₋₁; 𝐱ₜ]).
• #rows of 𝐀: shape(h).
• #cols of 𝐀: shape(h) + shape(x).
• Total #parameters: shape(h) × [shape(h) + shape(x)].
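For the concrete shapes used below (shape(h) = shape(x) = 32), this is a quick sanity check of the count; note that Keras also adds an intercept (bias) vector of length shape(h) by default, which is where the 2080 in the later model summary comes from:
state_dim = 32        # shape(h)
embedding_dim = 32    # shape(x)

matrix_params = state_dim * (state_dim + embedding_dim)   # the matrix A: 32 x 64 = 2048
bias_params = state_dim                                    # intercept (bias) vector added by Keras
print(matrix_params, matrix_params + bias_params)          # 2048 2080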
Simple RNN for Movie Review Analysis
Simple RNN for IMDB Review
[Figure: a SimpleRNN applied to a review "i love the ⋯ much". Each word is mapped to a 32-dimensional word embedding 𝐱ₜ (shape(𝐱) = 32), the hidden state has shape(𝐡) = 32, and the final state 𝐡ₜ is fed to the classifier sigmoid(𝐯ᵀ𝐡ₜ), which outputs the probability that the review is positive.]
4. Build a Simple Recurrent Neural Network
Simple RNN for IMDB Review
In [8]:
from keras.models import Sequential
from keras.layers import SimpleRNN, Embedding, Dense

vocabulary = 10000      # number of unique words in the dictionary
embedding_dim = 32      # shape(x) = 32
word_num = 500          # sequence length
state_dim = 32          # shape(h) = 32

model = Sequential()
model.add(Embedding(vocabulary, embedding_dim, input_length=word_num))
model.add(SimpleRNN(state_dim, return_sequences=False))   # only return the last state h_t
model.add(Dense(1, activation='sigmoid'))

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 500, 32)           320000
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 32)                2080
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33
=================================================================
Total params: 322,113
Trainable params: 322,113
Non-trainable params: 0
_________________________________________________________________
#parameters in the SimpleRNN layer: 2080 = 32 × (32 + 32) + 32, i.e., shape(h) × [shape(h) + shape(x)] + shape(h) for the intercept vector.
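The training and evaluation cells below use x_train, y_train, x_valid, y_valid, x_test, and labels_test, which were prepared in earlier cells not shown in these slides. A minimal sketch of that preprocessing (my own reconstruction; the variable names and the 5,000-review validation split are assumptions based on the training log below), reusing vocabulary and word_num from the cell above:
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences

# load the IMDB reviews, keeping only the 10,000 most frequent words
(train_data, train_labels), (test_data, labels_test) = imdb.load_data(num_words=vocabulary)

# pad / truncate every review to exactly word_num = 500 tokens
train_data = pad_sequences(train_data, maxlen=word_num)
x_test = pad_sequences(test_data, maxlen=word_num)

# hold out 5,000 of the 25,000 training reviews for validation
x_valid, y_valid = train_data[:5000], train_labels[:5000]
x_train, y_train = train_data[5000:], train_labels[5000:]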
Simple RNN for IMDB Review
In [9]:
from keras import optimizers

epochs = 3     # stopping early (only 3 epochs) alleviates overfitting

model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=epochs,
                    batch_size=32, validation_data=(x_valid, y_valid))
Train on 20000 samples, validate on 5000 samples
Epoch 1/3
20000/20000 [==============================] - 65s 3ms/step - loss: 0.5514 - acc: 0.6959 - val_loss: 0.4095 - val_acc: 0.8176
Epoch 2/3
20000/20000 [==============================] - 66s 3ms/step - loss: 0.3336 - acc: 0.8620 - val_loss: 0.3296 - val_acc: 0.8658
Epoch 3/3
20000/20000 [==============================] - 65s 3ms/step - loss: 0.2774 - acc: 0.8918 - val_loss: 0.3569 - val_acc: 0.8428
In [10]:
import matplotlib.pyplot as plt
%matplotlib inline
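The plotting cell is truncated in these slides; a minimal sketch of what it presumably plots (training vs. validation accuracy stored in history.history), continuing the cell above:
acc = history.history['acc']            # training accuracy per epoch
val_acc = history.history['val_acc']    # validation accuracy per epoch
epochs_range = range(1, len(acc) + 1)

plt.plot(epochs_range, acc, 'bo-', label='Training acc')
plt.plot(epochs_range, val_acc, 'r^-', label='Validation acc')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()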
Simple RNN for IMDB Review
In [13]:
loss_and_acc = model.evaluate(x_test, labels_test)
print('loss = ' + str(loss_and_acc[0]))
print('acc = ' + str(loss_and_acc[1]))
25000/25000 [==============================] - 21s 833us/step
loss = 0.6593638356399536
acc = 0.78984
Higher than a naïve shallow model (whose test accuracy is about 75%).
Simple RNN for IMDB Review
• Training Accuracy: 89.2%
• Validation Accuracy: 84.3%
• Test Accuracy: 84.4%
Higher than a naïve shallow model (whose test accuracy is about 75%).
Simple RNN for IMDB Review
[Figure: a SimpleRNN variant that keeps all the states. The states 𝐡₁, ⋯, 𝐡ₜ are concatenated by a Flatten layer, 𝐡 = vec[𝐡₁, ⋯, 𝐡ₜ], and the classifier sigmoid(𝐯ᵀ𝐡) is applied to the flattened vector.]
Simple RNN for IMDB Review
In [15]:
from keras.models import Sequential
from keras.layers import SimpleRNN, Embedding, Dense, Flatten

vocabulary = 10000
embedding_dim = 32
word_num = 500
state_dim = 32

model = Sequential()
model.add(Embedding(vocabulary, embedding_dim, input_length=word_num))
model.add(SimpleRNN(state_dim, return_sequences=True))   # return all the states h_1, ..., h_t
model.add(Flatten())                                     # concatenate the 500 states into one vector
model.add(Dense(1, activation='sigmoid'))

model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_2 (Embedding) (None, 500, 32) 320000
_________________________________________________________________
simple_rnn_2 (SimpleRNN) (None, 500, 32) 2080
_________________________________________________________________
flatten_2 (Flatten) (None, 16000) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 16001
=================================================================
Total params: 338,081
Trainable params: 338,081
Non-trainable params: 0
_________________________________________________________________
Simple RNN for IMDB Review
• Training Accuracy: 96.3%
• Validation Accuracy: 85.4%
• Test Accuracy: 84.7%
Not really better than using only the final state (whose accuracy is 84.4%).
Shortcomings of SimpleRNN
SimpleRNN is good at short-term dependence.
Input text: "clouds are in the ___"
Predicted next word: "sky"
Figures are from Christopher Olah's blog: Understanding LSTM Networks.
SimpleRNN is bad at long-term dependence.
𝐡₁₀₀ is almost irrelevant to 𝐱₁: the partial derivative ∂𝐡₁₀₀ / ∂𝐱₁ is near zero.
Figures are from Christopher Olah's blog: Understanding LSTM Networks.
SimpleRNN is bad at long-term dependence.
Input text: "... in China ... speak fluent ___"
Predicted next word: "Chinese" (this requires remembering "China" from much earlier in the text).
Figures are from Christopher Olah's blog: Understanding LSTM Networks.
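A rough numerical check of this claim (my own sketch, not from the slides): run the recurrence 𝐡ₜ = tanh(𝐀 ⋅ [𝐡ₜ₋₁; 𝐱ₜ]) over 100 random inputs, perturb either the first or the last input, and compare how much the final state moves.
import numpy as np

rng = np.random.default_rng(0)
dim_h, dim_x, T = 32, 32, 100
A = rng.normal(scale=0.05, size=(dim_h, dim_h + dim_x))   # a small random parameter matrix

def final_state(x_seq):
    h = np.zeros(dim_h)
    for x in x_seq:
        h = np.tanh(A @ np.concatenate([h, x]))
    return h

x_seq = rng.normal(size=(T, dim_x))
x_first, x_last = x_seq.copy(), x_seq.copy()
x_first[0] += 1.0      # perturb the first input x_1
x_last[-1] += 1.0      # perturb the last input x_100

print(np.linalg.norm(final_state(x_first) - final_state(x_seq)))   # ~0: x_1 has been forgotten
print(np.linalg.norm(final_state(x_last) - final_state(x_seq)))    # much larger: recent inputs still matter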
Summary
• RNNs are suited to text, speech, and time-series data.
• The hidden state 𝐡ₜ aggregates information from the inputs 𝐱₁, ⋯, 𝐱ₜ.
• RNNs can forget early inputs: if 𝑡 is large, 𝐡ₜ is almost irrelevant to 𝐱₁.
Number of Parameters
• SimpleRNN has a parameter matrix (and perhaps an intercept vector).
• Shape of the parameter matrix is
shape(h) × [shape(h)+shape(x)].
• Only one such parameter matrix, no matter how long the sequence is.
Thank You!