Attention Mechanisms
with TensorFlow
Keon Kim
DeepCoding
2016. 08. 26
Today, We Will Study...
1. Attention Mechanism and Its Implementation
- Attention Mechanism (Short) Review
- Attention Mechanism Code Review
2. Attention Mechanism Variant (Pointer Networks) and Its Implementation
- Pointer Networks (Short) Review
- Pointer Networks Code Review
Attention Mechanism
Review
Global Attention
The encoder compresses the input sequence into one vector
The decoder uses this vector to generate the output
The attention mechanism predicts the output y_t with a weighted-average context vector c_t, not just the last encoder state.
[Figure: the attention scores e_{t,1}, e_{t,2}, … pass through a softmax to give the weights of the encoder states h; their weighted average is the context vector.]
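To make the figure concrete, here is a minimal NumPy sketch of that computation; the dot-product scoring and the toy shapes are illustrative assumptions, not the exact model from the slides.

import numpy as np

def softmax(x):
    x = x - x.max()            # numerical stability
    e = np.exp(x)
    return e / e.sum()

# encoder hidden states h_1..h_T (T steps, hidden size H) and a decoder state s_t
T, H = 4, 8
rng = np.random.default_rng(0)
h = rng.normal(size=(T, H))    # encoder states
s_t = rng.normal(size=H)       # current decoder state

e_t = h @ s_t                  # scores e_{t,j} (dot-product scoring, one choice of many)
alpha_t = softmax(e_t)         # attention weights over the T encoder states
c_t = alpha_t @ h              # context vector: weighted average of the encoder states

# c_t (together with s_t) is then used to predict the output y_t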
Attention Mechanism
Character-based Neural Machine Translation [Ling+2015]
http://arxiv.org/pdf/1511.04586v1.pdf
String to Vector
String Generation
Attention Mechanism
Code Review
translate.py
example in TensorFlow
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/models/rnn/translate
Implementation of Grammar as a Foreign Language
Given an input string, it outputs a syntax tree
[Go.] is put in reversed order. translate.py uses the same model, but for the en-fr translation task
Basic Flow of the
Implementation
Preprocess → Create Model → Train → Test
Preprocess
Original English: I go.
Original French: Je vais.
Use of Bucket
Bucketing is a method for efficiently handling sentences of different lengths
English sentence with length L1, French sentence with length L2 + 1 (prefixed with GO symbol)
English sentence -> encoder_inputs
French sentence -> decoder_inputs
we should in principle create a seq2seq model for every pair (L1, L2+1) of lengths
of an English and French sentence. This would result in an enormous graph
consisting of many very similar subgraphs.
On the other hand, we could just pad every sentence with a special PAD symbol.
Then we'd need only one seq2seq model, for the padded lengths. But on shorter
sentences our model would be inefficient, encoding and decoding many PAD
symbols that are useless.
As a compromise between constructing a graph for every pair of lengths and
padding to a single length, we use a number of buckets and pad each sentence to
the length of the bucket above it. In translate.py we use the following default
buckets.
buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
This means that if the input is an English sentence with 3 tokens, and the
corresponding output is a French sentence with 6 tokens, then they will be put in
the first bucket and padded to length 5 for encoder inputs, and length 10 for
decoder inputs.
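A small sketch of the bucket selection described above (plain Python; it follows the slide's "L2 + 1 for the GO symbol" description, while the exact GO/EOS bookkeeping in translate.py differs slightly):

buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

def pick_bucket(source_tokens, target_tokens, buckets):
    # return the index of the smallest bucket that fits the sentence pair;
    # the decoder side needs room for the GO prefix (length L2 + 1)
    for i, (source_size, target_size) in enumerate(buckets):
        if len(source_tokens) <= source_size and len(target_tokens) + 1 <= target_size:
            return i
    raise ValueError("sentence pair is too long for all buckets")

# a 3-token English sentence and a 3-token French sentence land in bucket 0: (5, 10)
print(pick_bucket(["I", "go", "."], ["Je", "vais", "."], buckets))   # -> 0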
Preprocess
Encoder and Decoder Inputs in Bucket (5, 10)
Original English: I go.   → Tokenization → ["I", "go", "."]
Original French: Je vais. → Tokenization → ["Je", "vais", "."]
Encoder Input: ["PAD", "PAD", ".", "go", "I"]
Decoder Input: ["GO", "Je", "vais", ".", "EOS", "PAD", "PAD", "PAD", "PAD", "PAD"]
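And a sketch of how these padded inputs can be built (plain Python with string tokens; translate.py itself works with integer ids):

PAD, GO, EOS = "PAD", "GO", "EOS"

def make_encoder_input(tokens, encoder_size):
    # pad to the bucket size, then reverse (reversing the source helps the seq2seq model)
    pad = [PAD] * (encoder_size - len(tokens))
    return list(reversed(tokens + pad))

def make_decoder_input(tokens, decoder_size):
    # prefix GO, append EOS, then pad to the bucket size
    seq = [GO] + tokens + [EOS]
    return seq + [PAD] * (decoder_size - len(seq))

print(make_encoder_input(["I", "go", "."], 5))
# ['PAD', 'PAD', '.', 'go', 'I']
print(make_decoder_input(["Je", "vais", "."], 10))
# ['GO', 'Je', 'vais', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD']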
Create Model
Creating Seq2Seq Attention Model
Encoder Inputs: [ ["PAD", "PAD", ".", "go", "I"], … ]
Decoder Inputs: [ ["GO", "Je", "vais", ".", "EOS", "PAD", "PAD", "PAD", "PAD", "PAD"], … ]
model: embedding_attention_seq2seq(encoder_inputs, decoder_inputs, …, feed_previous=False)
- embedding_attention_seq2seq() is made of an encoder + embedding_attention_decoder()
- embedding_attention_decoder() is made of an embedding layer + attention_decoder()
- feed_previous=False means that the decoder will use the decoder_inputs tensors as provided (teacher forcing)
outputs, states = model_with_buckets(encoder_inputs, decoder_inputs, model, … )
(model_with_buckets() is just a wrapper function that helps with using buckets.)
Train
session.run()
with:
- GradientDescentOptimizer
- Gradient clipping by global norm
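Gradient clipping by global norm rescales all gradients together when their combined norm exceeds a threshold (TensorFlow provides this as tf.clip_by_global_norm()). A minimal NumPy sketch of the rule, not the translate.py training code:

import numpy as np

def clip_by_global_norm(grads, max_norm):
    # the global norm is taken across *all* gradient tensors at once
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]      # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
print(norm, clipped)                                  # 13.0, every gradient scaled by 5/13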
Test
decode()
with:
- feed_previous = True
feed_previous=True means that the decoder only uses the first element of decoder_inputs (the GO symbol); from the second step on, it feeds back its own previous output.
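A toy sketch of what the feed_previous flag changes in the decoder loop (plain Python with a dummy step function; not the TensorFlow graph code):

def run_decoder(decoder_inputs, step_fn, state, feed_previous):
    """step_fn(token, state) -> (output_token, new_state)."""
    outputs = []
    prev = decoder_inputs[0]                # always start from the GO symbol
    for t in range(len(decoder_inputs)):
        out, state = step_fn(prev, state)
        outputs.append(out)
        if feed_previous:
            prev = out                      # decoding: feed back our own output
        elif t + 1 < len(decoder_inputs):
            prev = decoder_inputs[t + 1]    # training: teacher forcing with the given inputs
    return outputs

echo = lambda token, state: ("out(%s)" % token, state)   # dummy decoder step
print(run_decoder(["GO", "Je", "vais"], echo, state=None, feed_previous=False))
print(run_decoder(["GO", "Je", "vais"], echo, state=None, feed_previous=True))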
python translate.py --decode
--data_dir [your_data_directory] --train_dir [checkpoints_directory]
Reading model parameters from /tmp/translate.ckpt-340000
> Who is the president of the United States?
Qui est le président des États-Unis ?
Result
Let’s go to TensorFlow
import tensorflow as tf
Attention Mechanism and Its Variants
- Global attention
- Local attention
- Pointer networks ⇠ this one for today
- Attention for image (image caption generation)
…
Pointer Networks
Review
Convex Hull
Pointer Networks ‘Point’ Input Elements!
In Ptr-Net, we do not blend the encoder states into a context vector to propagate extra information to the decoder, as the standard attention mechanism does. Instead ...
[Figure: the standard attention mechanism, with scores e_{t,1}, e_{t,2} over the encoder states, shown for contrast]
Pointer Networks ‘Point’ Input Elements!
We use the attention scores u^i_j (a distribution over the input positions) as pointers to the input elements.
The Ptr-Net approach specifically targets problems whose outputs are discrete and correspond to positions in the input.
Pointer Network
Another model for each number of points?
Distribution of the Attention is the Answer!
[Figure: the distribution of u over the input positions]
Attention Mechanism vs Pointer Networks
Softmax normalizes the score vector (e_ij in the attention mechanism, u^i_j in the Ptr-Net); in the Ptr-Net, this softmax output is itself the output distribution over the dictionary of inputs.
[Equations: attention mechanism vs. Ptr-Net, side by side]
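For reference, the two sets of equations being compared (following the notation of the cited papers, indices adapted to match the slides):

Attention mechanism (Bahdanau-style):
    e_{t,j} = v^T tanh(W_1 s_{t-1} + W_2 h_j)        score of encoder state h_j at output step t
    a_{t,j} = softmax_j(e_{t,j})                      attention weights
    c_t     = \sum_j a_{t,j} h_j                      context vector, blended in to predict y_t

Ptr-Net:
    u^t_j = v^T tanh(W_1 e_j + W_2 d_t)               e_j: encoder state, d_t: decoder state
    p(C_t | C_1, …, C_{t-1}, P) = softmax(u^t)        the softmax itself is the output: a pointer over input positions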
Pointer Networks
Code Review
Sorting Implementation
https://github.com/ikostrikov/TensorFlow-Pointer-Networks
Characteristics of the Implementation
This Implementation:
- The model code is a slightly modified version of attention_decoder() from the TensorFlow seq2seq model
- A simple implementation, but with poor comments
Focus on:
- The general structure of the code (so you can refer to it while creating your own)
- How the original implementation of attention_decoder() is modified
Task: "Order Matters" 4.4
- Learn “how to sort N numbers between 0 and 1”
- ex) 0.2 0.5 0.6 0.23 0.1 0.9 => 0.1 0.2 0.23 0.5 0.6 0.9
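A sketch of what one training pair for this task looks like (plain Python; the actual dataset.py may format batches differently):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=6).round(2)     # input: N numbers in [0, 1]
target = np.argsort(x)                     # output: positions in the *input*, i.e. pointers

print("input:          ", x)
print("target pointers:", target)          # indices into x, in sorted order
print("sorted:         ", x[target])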
Structure of the
Implementation
How the Implementation is Structured
main.py (Execution): PointerNetwork(), with __init__() and step()
pointer.py (Decoder Model): pointer_decoder(), attention()
dataset.py (Dataset): next_batch(), create_feed_dict()
session
- The encoder and decoder are instantiated in __init__()
- The dataset is instantiated, and next_batch() is called in every step() (one batch); create_feed_dict() is then used to feed the data into the session
- Finally, step() uses the model created in __init__() to run the session
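The flow above, written out as a self-contained toy that mirrors the diagram (stub classes and hypothetical method bodies, not the repository code):

class DataGenerator:                         # plays the role of dataset.py
    def next_batch(self, batch_size):
        # dummy data: unsorted numbers and their argsort as pointer targets
        return [[0.3, 0.1, 0.2]] * batch_size, [[1, 2, 0]] * batch_size

class PointerNetwork:                        # plays the role of main.py's PointerNetwork
    def __init__(self):
        # in the real code, the encoder and decoder graphs are also built here
        self.dataset = DataGenerator()

    def create_feed_dict(self, inputs, targets):
        return {"encoder_inputs": inputs, "decoder_targets": targets}

    def step(self, batch_size=2):
        inputs, targets = self.dataset.next_batch(batch_size)   # 1. fetch a batch
        feed = self.create_feed_dict(inputs, targets)           # 2. build the feed dict
        return feed                                             # 3. the real step() passes this to session.run()

print(PointerNetwork().step())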
Brief Explanations of the Model Part
pointer_decoder()
A simple modification of the attention_decoder() TensorFlow model
pointer_decoder(): attention()
query == states
Standard Attention vs Pointer Attention in Code
attention function from attention_decoder() vs. attention function from pointer_decoder()
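Since the slides compare the two attention() functions side by side, here is a minimal NumPy sketch of the difference; the names hidden, hidden_features, query, W2 and v are assumptions that mirror the structure of attention_decoder(), not the repository code:

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attention_scores(hidden_features, query, W2, v):
    # u_j = v^T tanh(W1 h_j + W2 q); hidden_features already holds W1 h_j
    return np.tanh(hidden_features + query @ W2) @ v            # shape (T,)

def standard_attention(hidden, hidden_features, query, W2, v):
    a = softmax(attention_scores(hidden_features, query, W2, v))
    return a @ hidden            # blend the encoder states into a context vector (the "read")

def pointer_attention(hidden_features, query, W2, v):
    s = attention_scores(hidden_features, query, W2, v)
    return softmax(s)            # the distribution itself is the output: a pointer over inputs

# toy shapes: T=5 encoder steps, attention size A=4, hidden size H=3, query size Q=3
rng = np.random.default_rng(1)
hidden = rng.normal(size=(5, 3))
hidden_features = rng.normal(size=(5, 4))    # stands in for W1 applied to the encoder states
query, W2, v = rng.normal(size=3), rng.normal(size=(3, 4)), rng.normal(size=4)

print(standard_attention(hidden, hidden_features, query, W2, v))  # a context vector, shape (3,)
print(pointer_attention(hidden_features, query, W2, v))           # a distribution over the 5 inputs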
__init__()
The whole encoder and decoder model is built here
Let’s go to TensorFlow
import tensorflow as tf
82%
Step: 0
Train: 1.07584562302
Test: 1.07516384125
Correct order / All order: 0.000000
Step: 100
Train: 8.91889034099
Test: 8.91508702453
Correct order / All order: 0.000000
….
Step: 9800
Train: 0.437000320964
Test: 0.459392405155
Correct order / All order: 0.841875
Step: 9900
Train: 0.424404183739
Test: 0.636979421763
Correct order / All order: 0.825000
Result
Original Theano Implementation (Part)
Ptr-Net
Thank you :D
References
- Many Slides from: http://www.slideshare.net/yutakikuchi927/deep-learning-nlp-attention
- Character Based Neural Machine Translation: http://arxiv.org/abs/1511.04586
- Grammar as a Foreign Language: https://arxiv.org/abs/1412.7449
- TensorFlow Official Tutorial on Seq2Seq Models: https://www.tensorflow.org/versions/r0.10/tutorials/seq2seq
- Pointer Networks: https://arxiv.org/abs/1506.03134
- Order Matters: http://arxiv.org/abs/1511.06391
- Pointer Networks Implementation: https://github.com/ikostrikov/TensorFlow-Pointer-Networks
