CS294-129: Designing, Visualizing and
Understanding Deep Neural Networks
John Canny
Fall 2016
Lecture 13: Text Processing with DNNs
Outline
Semantics
• Propositional models
• Matrix factorization
• Word2vec
• Skip-Thought vectors
• Siamese models
Translation + Structure Extraction
• Translation
• Parsing
• Entity-Relation extraction
Text Semantics
• In Natural Language Processing (NLP), semantics is
concerned with the meanings of texts.
• There are two main approaches:
• Propositional or formal semantics: A block of text is
converted into a formula in a logical language, e.g.
predicate calculus.
• Vector representation. Texts are embedded into a
high-dimensional space.
Semantic Approaches
Propositional:
• “dog bites man” → bites(dog, man)
• bites(*,*) is a binary relation. man, dog are objects.
• Probabilities can be attached.
Vector representation:
• vec(“dog bites man”) = (0.2, -0.3, 1.5,…) ∈ ℝⁿ
• Sentences similar in meaning should be close to this
embedding (e.g. use human judgments)
Propositional Semantics
• Allow logical inferences: “Socrates is a man” + “all men are
mortal” → “Socrates is mortal”
• Important for inference in well-defined domains, e.g.
inferring gene regulation from medical journals.
See DARPA’s “Big Mechanism” project
From “DARPAʼs Big Mechanism program” Paul R Cohen, Phys. Biol. 12 (2015)
Propositional Semantics
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
• Contemporary approaches use latent variable models to
group entities (objects) and the relations between them in a
data-driven way.
Propositional Semantics
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
(Figure annotations: a per-topic relation distribution and a per-topic
instance distribution; each can be thought of as a matrix mapping a topic
to an instance or relation distribution.)
Propositional Semantics
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
Document membership
observations
Propositional Semantics
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
Subject-Object-Verb triples
from parsing each sentence
Propositional Semantics
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
Class-instance relations found from
linguistic patterns (Hearst Patterns)
“Netscape, an early web browser…”
Propositional Semantics
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
(Figure annotation: per-word topic distribution for ontologies.)
KB-LDA
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
A blueprint for deep semantic network design?
Other Machine Reading Systems
Aristo from AI2: Allen Institute for Artificial Intelligence:
http://aristo-demo.allenai.org/
Vector Embedding of Words
Word embeddings depend on a notion of word similarity.
A very useful definition is paradigmatic similarity:
Similar words occur in similar contexts. They are
exchangeable.
Example: “Yesterday the President called a press conference”;
here “President” could be exchanged with “POTUS” or “Obama.”
Vector Embedding of Words
Much of the work on text embedding has used word
embeddings and bag-of-words representation:
vec(“dog”) = (0.2, -0.3, 1.5,…)
vec(“bites”) = (0.5, 1.0, -0.4,…)
vec(“man”) = (-0.1, 2.3, -1.5,…)
vec(“dog bites man”) = (0.6, 3.0, -0.4,…)
Vector Embedding: Word Similarity
Word embeddings depend on a notion of word similarity.
A very useful definition is paradigmatic similarity:
Similar words occur in similar contexts. They are
exchangeable.
This definition supports unsupervised learning: cluster or
embed words according to their contexts.
Embedding: Latent Semantic Analysis
Latent semantic analysis studies documents in Bag-Of-Words
format (1988).
i.e. given a matrix T encoding some documents:
Tij is the count* of word j in document i. Most entries are 0.
* Often tfidf or other “squashing” functions of the count are used.
(Diagram: T is an N × M matrix; rows are the N docs, columns the M word features.)
Embedding: Latent Semantic Analysis
Given a bag-of-words matrix T, compute a factorization T ≈ Uᵀ V
(e.g. a best L2 approximation to T).
Factors encode similar whole-document contexts.
Factors are rows of V.
(Diagram: T, an N docs × M word features matrix, is approximated by
Uᵀ, an N × K matrix, times V, a K latent dims × M features matrix.)
Embedding: Latent Semantic Analysis
If in addition U and V are orthogonal and S is a diagonal matrix of
singular values, then if 𝑡 is a document (row of 𝑇):
𝑣 = 𝑉𝑡 is an embedding of the document in the latent space.
𝑡′ = 𝑈ᵀ𝑣 = 𝑈ᵀ𝑉𝑡 is the decoding of the sentence from its
embedding.
(Diagram: T ≈ Uᵀ S V, with N docs, M word features, and S a diagonal
matrix over the K latent dims.)
Embedding: Latent Semantic Analysis
𝑡′ = 𝑈ᵀ𝑣 = 𝑈ᵀ𝑉𝑡 is the decoding of the sentence from its
embedding.
An SVD factorization gives the best possible
reconstructions of the documents 𝑡′ from their embeddings.
(Diagram: the same T ≈ Uᵀ S V factorization as above.)
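As an illustration (not from the original slides), here is a minimal Python sketch of LSA via a truncated SVD. The toy count matrix is made up, and note that NumPy's convention T = U diag(s) Vᵀ differs slightly from the slides' T ≈ Uᵀ S V:

    import numpy as np

    # Toy document-term count matrix T: N=4 docs x M=5 word features (made-up counts).
    T = np.array([[2., 1., 0., 0., 1.],
                  [1., 2., 0., 1., 0.],
                  [0., 0., 3., 1., 0.],
                  [0., 1., 1., 2., 1.]])

    K = 2                                      # number of latent dimensions
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :K], s[:K], Vt[:K, :]

    doc_embeddings = U_k * s_k                 # N x K embedding of each document
    T_approx = doc_embeddings @ Vt_k           # best rank-K L2 reconstruction of T
    print(np.round(T_approx, 2))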
t-SNE of Word Embeddings
From: “Word representations: A simple and general method for semi-
supervised learning” Joseph Turian, Lev Ratinov, Yoshua Bengio, ACL 2010.
t-SNE of Word Embeddings
Left: Number Region; Right: Jobs Region
from “Deep Learning, NLP, and Representations” by Chris Olah. See also
http://colah.github.io/posts/2015-01-Visualizing-Representations/
Word2vec: Local contexts
Instead of entire documents, Word2vec uses words a few
positions away from each center word. The pairs of center
word/context word are called “skip-grams.”
“It was a bright cold day in April, and the clocks were striking”
Center word: red
Context words: blue
Word2vec considers all words as center words, and all their
context words.
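As a small illustration (not from the slides), a sketch of how the (center, context) skip-gram pairs might be generated from a tokenized sentence, assuming a symmetric window of radius 2:

    # Minimal sketch: generate (center, context) skip-gram pairs
    # from a tokenized sentence with a symmetric window (here radius 2).
    def skipgram_pairs(tokens, window=2):
        pairs = []
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
        return pairs

    sentence = "it was a bright cold day in april".split()
    print(skipgram_pairs(sentence)[:6])
    # [('it', 'was'), ('it', 'a'), ('was', 'it'), ('was', 'a'), ('was', 'bright'), ('a', 'it')]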
Word2vec: Local contexts
The pairs of center word/context word are called “skip-
grams.” Typical distances are 3-5 word positions. Skip-gram
model:
Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean, NIPS 2013
(Diagram: the center word is used to predict each of its context words.)
Word2vec: Local contexts
Models can also predict center word from context, CBOW
model. Generally, skip-gram performs better.
Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean, NIPS 2013
(Diagram: in CBOW, the context words are used to predict the center word.)
Word2vec: Local contexts
Word2vec optimizes a softmax loss for each output word:
p(j | i) = exp(uⱼᵀ vᵢ) / Σ_{k=1..V} exp(uₖᵀ vᵢ)
where 𝑗 is the output word and 𝑖 is the input word. 𝑗 ranges over
a context of ±3-5 positions around the input word.
𝑢 is an output embedding vector.
𝑣 is an input embedding vector.
Word2vec can be implemented with standard DNN toolkits, by
backpropagating to optimize 𝑢 and 𝑣.
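For concreteness, here is a minimal, hedged PyTorch sketch of that objective: input embeddings 𝑣, output embeddings 𝑢, and a full-vocabulary softmax via cross-entropy. The vocabulary size and the batches of (center, context) index pairs are placeholders; real word2vec implementations avoid the full softmax with tricks such as negative sampling or a hierarchical softmax.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SkipGram(nn.Module):
        def __init__(self, vocab_size, dim=128):
            super().__init__()
            self.v = nn.Embedding(vocab_size, dim)   # input (center) embeddings v_i
            self.u = nn.Embedding(vocab_size, dim)   # output (context) embeddings u_j

        def forward(self, center_ids):
            v_i = self.v(center_ids)                 # (batch, dim)
            return v_i @ self.u.weight.t()           # (batch, vocab): u_j^T v_i for every j

    model = SkipGram(vocab_size=10000)
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    center_ids = torch.randint(0, 10000, (32,))      # placeholder batch of center word ids
    context_ids = torch.randint(0, 10000, (32,))     # placeholder batch of context word ids
    loss = F.cross_entropy(model(center_ids), context_ids)   # softmax loss p(j|i)
    opt.zero_grad(); loss.backward(); opt.step()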
Matrix perspective
Using matrix representation:
See: “GloVe: Global Vectors for Word Representation” Jeffrey Pennington,
Richard Socher, Christopher D. Manning, 2014
(Diagram: a one-hot input word is mapped by V to a K-dimensional vector
over the latent dims, then by the output weights Uᵀ and a softmax loss to
a distribution over output words j.)
Word2vec: Local contexts
Local contexts capture much more information about relations
and properties than LSA:
Composition
Algebraic relations:
vec(‘‘woman")−vec(‘‘man") ≃ vec(‘‘aunt")−vec(‘‘uncle")
vec(‘‘woman")−vec(‘‘man") ≃ vec(‘‘queen")−vec(‘‘king")
From “Linguistic Regularities in Continuous Space Word Representations”
Tomas Mikolov , Wen-tau Yih, Geoffrey Zweig, NAACL-HLT 2013
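A sketch (under assumed names) of how such analogies are typically queried: E is a hypothetical embedding matrix with one row per word, and word2idx/idx2word are its lookup tables. The answer is the nearest neighbour, by cosine similarity, of vec(b) − vec(a) + vec(c):

    import numpy as np

    def analogy(a, b, c, E, word2idx, idx2word, topk=1):
        """Return the word(s) closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
        q = E[word2idx[b]] - E[word2idx[a]] + E[word2idx[c]]
        sims = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-8)
        for w in (a, b, c):
            sims[word2idx[w]] = -np.inf        # never return one of the query words
        return [idx2word[i] for i in np.argsort(-sims)[:topk]]

    # e.g. analogy("man", "woman", "king", E, word2idx, idx2word) should return
    # ["queen"] for a well-trained embedding.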
Relations Learned by Word2vec
“Efficient Estimation of Word Representations in Vector Space” Tomas Mikolov, Kai
Chen, Greg Corrado, Jeffrey Dean, Arxiv 2013
Word2vec model computed from 6 billion word corpus of news articles
A relation is defined by the vector displacement in the first column. For
each start word in the other column, the closest displaced word is shown.
Lexical and Compositional Semantics
Lexical Semantics: focuses on the meaning of individual
words.
Compositional Semantics: meaning depends on the words,
and on how they are combined.
Beyond Bag-Of-Words: Skip-Thought Vectors
The models we discussed so far embed texts as the sum of
their words (lexical semantics).
Clearly there is a lot missing from these representations:
“man bites dog” = “dog bites man”
“the quick, brown fox jumps over the lazy dog” =
“the lazy fox over the brown dog jumps quick”
…
How can we model text structure as well as word meanings?
Beyond Bag-Of-Words: Skip-Thought Vectors
Skip-thought embeddings use sequence-to-sequence
RNNs to predict the next and previous sentences.
The output state vector of the boundary layer (dotted box)
forms the embedding. RNN units are GRU units.
Once the network is trained, we can discard the red and
green sections of the network.
From “Skip-Thought Vectors,” Ryan Kiros et al., Arxiv 2015.
Beyond Bag-Of-Words: Skip-Thought Vectors
Skip-thought embeddings use sequence-to-sequence RNNs
to predict the next and previous sentences.
Encoding doesn’t require backpropagation, so we can
represent the encoder as a (truly) recurrent network.
Thus we can encode longer units of text: sentences or
paragraphs.
(Diagram: a recurrent encoder with input 𝑥𝑡, output 𝑦𝑡, and state 𝑧𝑡
computed from 𝑧𝑡−1.)
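A minimal sketch of such a recurrent encoder in PyTorch (dimensions are placeholders, and the inputs are assumed to be already-embedded word vectors); the final GRU hidden state serves as the sentence embedding:

    import torch
    import torch.nn as nn

    class SentenceEncoder(nn.Module):
        def __init__(self, word_dim=300, hidden_dim=1200):
            super().__init__()
            self.gru = nn.GRU(word_dim, hidden_dim, batch_first=True)

        def forward(self, word_vectors):      # (batch, seq_len, word_dim)
            _, h_n = self.gru(word_vectors)   # h_n: (1, batch, hidden_dim)
            return h_n.squeeze(0)             # one embedding per sentence

    enc = SentenceEncoder()
    sentences = torch.randn(4, 12, 300)       # 4 sentences of 12 embedded words each
    embeddings = enc(sentences)               # (4, 1200)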
Embedding of TREC queries
Points are colored by query type (t-SNE embedding):
From “Skip-Thought Vectors,” Ryan Kiros et al., Arxiv 2015.
Sentence Similarity
Approximately two weeks of training on a billion-word Books corpus
Semantic Relatedness Evaluation
SICK semantic relatedness task: score sentences for
semantic similarity from 1 to 5 (average of 10 human ratings)
Sentence A: A man is jumping into an empty pool
Sentence B: There is no biker jumping in the air
Relatedness score: 1.6
Sentence A: Two children are lying in the snow and are making snow angels
Sentence B: Two angels are making snow on the lying children
Relatedness score: 2.9
Sentence A: The young boys are playing outdoors and the man is smiling nearby
Sentence B: There is no boy playing outdoors and there is no man smiling
Relatedness score: 3.6
Sentence A: A person in a black jacket is doing tricks on a motorbike
Sentence B: A man in a black jacket is doing tricks on a motorbike
Relatedness score: 4.9
Semantic Relatedness Evaluation
Note: a separate model is trained to predict the scores from
pairs of embedded sentences.
Semantic Relatedness Evaluation
SICK semantic relatedness scores for skip-thought methods:
A Siamese Network for Semantic Relatedness
This network is trained on pairs of sentences a, b with a
similarity label y.
Parameters are shared between the two networks.
From “Siamese Recurrent Architectures for Learning Sentence
Similarity” Jonas Mueller, Aditya Thyagarajan, AAAI-2016
(Figure: the inputs are 300-dimensional word2vec embeddings; each LSTM
produces a 50-dimensional sentence vector; similarity is computed from the
Manhattan distance between the two vectors.)
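The paper's similarity function is the exponential of a negative Manhattan (L1) distance between the two final LSTM states, which keeps the output in (0, 1]. A hedged PyTorch sketch, with layer sizes taken from the figure and everything else assumed:

    import torch
    import torch.nn as nn

    class MaLSTM(nn.Module):
        def __init__(self, word_dim=300, hidden_dim=50):
            super().__init__()
            # One LSTM module used for both sentences = shared parameters.
            self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)

        def encode(self, x):                   # x: (batch, seq_len, word_dim)
            _, (h_n, _) = self.lstm(x)
            return h_n.squeeze(0)              # (batch, hidden_dim)

        def forward(self, sent_a, sent_b):
            h_a, h_b = self.encode(sent_a), self.encode(sent_b)
            # Similarity in (0, 1]: exp(-||h_a - h_b||_1), the Manhattan distance.
            return torch.exp(-torch.sum(torch.abs(h_a - h_b), dim=1))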
A Siamese Network for Semantic Relatedness
The network is trained on Semeval similar sentence pairs,
expanded by substituting for random words using WordNet
(a dataset of synonyms). Results:
From “Siamese Recurrent Architectures for Learning Sentence
Similarity” Jonas Mueller, Aditya Thyagarajan, AAAI-2016
A Siamese Network for Semantic Relatedness
The network is trained on Semeval similar sentence pairs,
expanded by substituting for random words using WordNet
(a dataset of synonyms). Results:
r = Pearson correlation, ρ = Spearman’s rank correlation.
A Siamese Network for Semantic Relatedness
Moral: Train for your evaluation metric!
Hidden Unit Factors 1,2, and 6
Semantic Entailment Evaluation
SICK semantic entailment task: score sentences for
relations: ENTAILMENT, CONTRADICTION, NEUTRAL:
Sentence A: Two teams are competing in a football match
Sentence B: Two groups of people are playing football
Entailment judgment: ENTAILMENT
Sentence A: The brown horse is near a red barrel at the rodeo
Sentence B: The brown horse is far from a red barrel at the rodeo
Entailment judgment: CONTRADICTION
Sentence A: A man in a black jacket is doing tricks on a motorbike
Sentence B: A person is riding the bicycle on one wheel
Entailment judgment: NEUTRAL
Semantic Entailment for MaLSTM
Translation Models
Sequence-To-Sequence RNNs
An input sequence is fed to the left (encoder) array and the output
sentence to the right (decoder) array for training.
For translation:
(Diagram: the encoder reads “I love coffee <EOS>”; the decoder is fed
“Amo el café” and trained to output “Amo el café <EOS>”.)
Sequence-To-Sequence RNNs
Generation:
Keep an n-best list of partial sentences, along with their
partial softmax scores.
(Diagram: beam search expands several partial translations in parallel,
e.g. “Amo el café <EOS>” and “Me gusta el café <EOS>”, keeping the
highest-scoring ones at each step.)
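A schematic sketch of that n-best (beam search) decoding loop; next_token_logprobs is a hypothetical function standing in for one step of the decoder RNN, returning (token, log-probability) pairs:

    import heapq

    def beam_search(next_token_logprobs, start, eos, beam_size=3, max_len=20):
        """Keep the beam_size best partial sentences and their summed log-probs."""
        beams = [(0.0, [start])]                       # (score, token sequence)
        for _ in range(max_len):
            candidates = []
            for score, seq in beams:
                if seq[-1] == eos:                     # finished hypothesis: keep it as-is
                    candidates.append((score, seq))
                    continue
                for tok, logp in next_token_logprobs(seq):
                    candidates.append((score + logp, seq + [tok]))
            beams = heapq.nlargest(beam_size, candidates, key=lambda x: x[0])
            if all(seq[-1] == eos for _, seq in beams):
                break
        return beams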
Bleu Scores for Translation
The goal of bleu scores is to compare machine translations
against human-generated translations, allowing for variation.
Consider these translations for a Chinese sentence:
Candidate 1: It is a guide to action which ensures that the
military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the
activity guidebook that party direct.
Bleu Scores for Translation
Candidate 1: It is a guide to action which ensures that the military always
obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity
guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will
forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military
forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the
directions of the party.
Bleu Scores for Translation
Unigram precision:
(# candidate unigrams occurring in a reference sentence) /
(# unigrams occurring in the test (candidate) sentence)
Modified unigram precision: clip counts by maximum
occurrence in any reference sentence:
Candidate: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
Modified precision is 2/7.
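A small sketch that reproduces the 2/7 example above (tokens lowercased and punctuation dropped for simplicity):

    from collections import Counter

    def modified_unigram_precision(candidate, references):
        counts = Counter(candidate)
        clipped = sum(min(c, max(ref.count(w) for ref in references))
                      for w, c in counts.items())
        return clipped / sum(counts.values())

    cand = "the the the the the the the".split()
    refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
    print(modified_unigram_precision(cand, refs))    # 2/7 = 0.2857...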
Bleu Scores for Translation
Candidate 1: It is a guide to action which ensures that the military always
obeys the commands of the party. unigram precision 17/18
Candidate 2: It is to insure the troops forever hearing the activity
guidebook that party direct. unigram precision 8/14
Reference 1: It is a guide to action that ensures that the military will
forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military
forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the
directions of the party.
Bleu Scores for Translation
N-gram precision is defined similarly:
(# candidate n-grams occurring in a reference sentence) /
(# n-grams occurring in the test (candidate) sentence)
Modified n-gram precision: clip counts by maximum
occurrence in any reference sentence.
Unigram scores tend to capture adequacy.
N-gram scores tend to capture fluency.
Bleu Scores for Translation
Candidate 1: It is a guide to action which ensures that the military always
obeys the commands of the party. bigram precision 10/17
Candidate 2: It is to insure the troops forever hearing the activity
guidebook that party direct. bigram precision 1/13
Reference 1: It is a guide to action that ensures that the military will
forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military
forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the
directions of the party.
Bleu Scores for Translation
How to combine scores for different n-grams?
Averaging sounds good, but precisions are very different for
different n (unigrams have much higher scores).
BLEU score: take a weighted geometric mean of the n-gram
precisions up to some length (usually 4), i.e. exponentiate a
weighted average of their logs. Add a brevity penalty when the
candidate length c is shorter than the reference length r.
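A sketch of the whole recipe (clipped n-gram precisions up to 4, uniform weights, geometric mean, brevity penalty); this follows the standard BLEU definition rather than any particular toolkit, and uses the closest reference length for the penalty, one common convention:

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(candidate, references, max_n=4):
        log_sum = 0.0
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(candidate, n))
            clipped = sum(min(c, max(Counter(ngrams(r, n))[g] for r in references))
                          for g, c in cand_counts.items())
            total = max(sum(cand_counts.values()), 1)
            p_n = max(clipped / total, 1e-9)          # avoid log(0) in this toy version
            log_sum += (1.0 / max_n) * math.log(p_n)  # uniform weights w_n = 1/4
        c = len(candidate)
        r = min((len(ref) for ref in references), key=lambda L: abs(L - c))
        bp = 1.0 if c > r else math.exp(1.0 - r / max(c, 1))   # brevity penalty
        return bp * math.exp(log_sum)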
Sequence-To-Sequence Model Translation
Raw scores for French-English Translation, depth = 4 for
these results.
Sequence-To-Sequence Model Translation
Raw scores for French-English Translation
Sequence-To-Sequence Model Translation
Scores using the LSTM model to rerank 1000-best
sentences from a baseline Machine Translation system:
Sequence-To-Sequence Model Translation
Training details:
• LSTM array depth = 4. Deeper is better.
• LSTM parameters initialized from the uniform distribution [-0.08, 0.08].
• Stochastic gradient descent without momentum, fixed learning
rate of 0.7.
• After 5 epochs, the learning rate was halved every half epoch.
• Models trained for a total of 7.5 epochs.
• Batch size of 128 sequences.
• Gradient clipping at ‖g‖ = 5.
• Sentences of approximately the same length were grouped into
the same minibatch.
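A runnable toy sketch of that recipe in PyTorch (a stand-in 4-layer LSTM and fake batches; only the initialization, clipping, and learning-rate schedule from the list above are the point):

    import torch

    model = torch.nn.LSTM(input_size=64, hidden_size=64, num_layers=4)  # toy stand-in
    for p in model.parameters():
        torch.nn.init.uniform_(p, -0.08, 0.08)              # uniform initialization

    optimizer = torch.optim.SGD(model.parameters(), lr=0.7)  # plain SGD, no momentum

    batches_per_half_epoch = 10                 # placeholder; the real corpus is far larger
    for epoch in range(8):                      # roughly 7.5 epochs in total
        for half in range(2):
            for _ in range(batches_per_half_epoch):
                x = torch.randn(20, 128, 64)    # fake batch: length 20, batch size 128
                out, _ = model(x)
                loss = out.pow(2).mean()        # placeholder loss
                optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
                optimizer.step()
            if epoch >= 5:                      # halve the learning rate every half epoch
                for group in optimizer.param_groups:
                    group["lr"] *= 0.5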
State-of-the-Art Neural Machine Translation
An RNN array with an attention network to regulate information
flow from the source network.
(Figure: global attention model vs. local attention model.)
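A minimal sketch of one global (dot-product) attention step, following the general recipe rather than any specific implementation; decoder_h and encoder_outputs are assumed PyTorch tensors:

    import torch
    import torch.nn.functional as F

    def global_attention(decoder_h, encoder_outputs):
        """decoder_h: (batch, dim); encoder_outputs: (batch, src_len, dim)."""
        # Score every source position against the current decoder state.
        scores = torch.bmm(encoder_outputs, decoder_h.unsqueeze(2)).squeeze(2)  # (batch, src_len)
        weights = F.softmax(scores, dim=1)                                      # attention weights
        # Weighted sum of encoder states = context vector fed to the decoder.
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)   # (batch, dim)
        return context, weights

    h_t = torch.randn(2, 500)
    enc_out = torch.randn(2, 7, 500)
    context, attn = global_attention(h_t, enc_out)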
State-of-the-Art Neural Machine Translation
English-German translation (WMT 14 results):
Parsing
Recall (Lecture 10) RNNs’ ability to generate LaTeX and C code.
They seem to do well with
tree-structured data.
What about natural language
parsing?
Parsing
Sequence models generate linear structures, but these can
easily encode trees by “closing parens” (prefix tree notation); an
example linearization is shown after the cheat sheet below.
Parsing Cheat Sheet
S = Sentence
NP = Noun Phrase
VP = Verb Phrase
NNP = Proper Noun (“John”)
VBZ = Verb, 3rd person, singular (“has”)
DT = Determiner (“a”)
NN = Noun, singular (“dog”)
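Using the tags above, the sentence “John has a dog .” can be linearized roughly as in the Vinyals et al. paper (exact bracket conventions may differ slightly):

    Input:  John has a dog .
    Output: (S (NP NNP )NP (VP VBZ (NP DT NN )NP )VP . )S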
A Sequence-To-Sequence Parser
The model is a depth-3 sequence-to-sequence predictor,
augmented with the attention model of Bahdanau 2014.
“Grammar as a Foreign Language,” Oriol Vinyals, Lukasz Kaiser,
Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton (Google), NIPS 2015
“Neural machine translation by jointly learning to align and translate.” Dzmitry
Bahdanau, Kyunghyun Cho, and Yoshua Bengio. arXiv 2014.
A Sequence-To-Sequence Parser
Chronology:
• First tried training a basic sequence-to-sequence model on
human-annotated treebanks. Poor results.
• Then trained on parse trees generated by the Berkeley
Parser, and matched its performance (90.5 F1 score).
• Next added the attention model, trained on the human
treebank data; this also achieved 90.5 F1.
• Finally, created a synthetic dataset of high-confidence
parse trees (agreed on by two parsers). This achieved a new
state-of-the-art of 92.5 F1 (WSJ dataset).
F1 is a widely-used accuracy measure that combines precision and recall
A Sequence-To-Sequence Parser
Quick Training Details:
• Depth = 3, layer dimension = 256.
• Dropout between layers 1 and 2, and between layers 2 and 3.
• No POS tags!! Leaving them out improved F1 by about 1 point.
• Input reversing.
A Sequence-To-Sequence Parser
Dropout layers shown in purple:
This use of dropout in LSTM arrays is now widely used.
Neural Entity-Relation Extraction
Several approaches have already been tried:
• RNNs running on the raw text only have trouble getting
the relation structure right.
• Adding dependency tree information helps dramatically.
Recent approach: run separate RNNs on the text and
dependency parse tree data.
Neural Entity-Relation Extraction
Equals previous best scores on an entity-relation benchmark
Wrapup
Semantics
• Propositional models, entity-relation extraction
• Matrix factorization
• Word2vec
• Skip-Thought vectors
• Siamese models
Translation + Structure Extraction
• Translation
• Parsing
• Entity-Relation extraction
Take-Aways
Training data quality (consistency) matters!
• DNNs can model anything, but what they model shouldn’t be human
inconsistency.
DNNs need good advice (hints)! cf. ResNets
• DNNs are capable of state-of-the-art parsing, but need a
parser to do good ER-extraction now.
Depth matters – Deeper is better
• Between-level dropout is a good regularization scheme
Editor's Notes
  • #2: Where to put block multi-vector algorithms? Percentage of Future Work Diagram
  • #53: Negation, classification by object (or action)?, classification by subject.