CS294-129: Designing, Visualizing and
Understanding Deep Neural Networks
John Canny
Fall 2016
Lecture 13: Text Processing with DNNs
Outline
Semantics
• Propositional models
• Matrix factorization
• Word2vec
• Skip-Thought vectors
• Siamese models
Translation + Structure Extraction
• Translation
• Parsing
• Entity-Relation extraction
Text Semantics
• In Natural Language Processing (NLP), semantics is
concerned with the meanings of texts.
• There are two main approaches:
• Propositional or formal semantics: A block of text is
converted into a formula in a logical language, e.g.
predicate calculus.
• Vector representation. Texts are embedded into a
high-dimensional space.
Semantic Approaches
Propositional:
• “dog bites man” → bites(dog, man)
• bites(*,*) is a binary relation. man, dog are objects.
• Probabilities can be attached.
Vector representation:
• vec(“dog bites man”) = (0.2, -0.3, 1.5,…) ∈ ℝⁿ
• Sentences similar in meaning should be close to this
embedding (e.g. use human judgments)
Propositional Semantics
• Allow logical inferences: “Socrates is a man” + “all men are
mortal” → “Socrates is mortal”
• Important for inference in well-defined domains, e.g.
inferring gene regulation from medical journals.
See DARPA’s “Big Mechanism” project
From “DARPAʼs Big Mechanism program” Paul R Cohen, Phys. Biol. 12 (2015)
Propositional Semantics
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
• Contemporary approaches use latent variable models to
group entities (objects) and the relations between them in a
data-driven way.
Propositional Semantics
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
(Figure annotations: a per-topic relation distribution and a per-topic
instance distribution; each can be thought of as a matrix mapping a topic
to an instance or relation distribution.)
Propositional Semantics
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
Document membership
observations
Propositional Semantics
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
Subject-Object-Verb triples
from parsing each sentence
Propositional Semantics
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
Class-instance relations found from
linguistic patterns (Hearst Patterns)
“Netscape, an early web browser…”
Propositional Semantics
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
(Figure annotation: per-word topic distribution for ontologies.)
KB-LDA
“KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations,
and Facts,” Dana Movshovitz-Attias and William W. Cohen, ACL 2015
A blueprint for deep semantic network design?
Other Machine Reading Systems
Aristo from AI2: Allen Institute for Artificial Intelligence:
http://aristo-demo.allenai.org/
Vector Embedding of Words
Word embeddings depend on a notion of word similarity.
A very useful definition is paradigmatic similarity:
Similar words occur in similar contexts. They are
exchangeable.
Example: “Yesterday the President called a press conference”;
here “President” could be exchanged with “POTUS” or “Obama.”
Vector Embedding of Words
Much of the work on text embedding has used word
embeddings and bag-of-words representation:
vec(“dog”) = (0.2, -0.3, 1.5,…)
vec(“bites”) = (0.5, 1.0, -0.4,…)
vec(“man”) = (-0.1, 2.3, -1.5,…)
vec(“dog bites man”) = (0.6, 3.0, -0.4,…)
Vector Embedding: Word Similarity
Word embeddings depend on a notion of word similarity.
A very useful definition is paradigmatic similarity:
Similar words occur in similar contexts. They are
exchangeable.
This definition supports unsupervised learning: cluster or
embed words according to their contexts.
Embedding: Latent Semantic Analysis
Latent semantic analysis studies documents in Bag-Of-Words
format (1988).
i.e. given a matrix T encoding some documents:
Tij is the count* of word j in document i. Most entries are 0.
* Often tfidf or other “squashing” functions of the count are used.
(Diagram: T is an N × M matrix; rows are the N docs, columns the M word features.)
Embedding: Latent Semantic Analysis
Given a bag-of-words matrix T, compute a factorization T ≈ Uᵀ V
(e.g. a best L2 approximation to T).
Factors encode similar whole-document contexts.
Factors are rows of V.
(Diagram: T, an N docs × M word features matrix, is approximated by
Uᵀ, an N × K matrix, times V, a K latent dims × M features matrix.)
Embedding: Latent Semantic Analysis
If in addition U and V are orthogonal and S is a diagonal matrix of
singular values, then if 𝑡 is a document (row of 𝑇):
𝑣 = 𝑉𝑡 is an embedding of the document in the latent space.
𝑡′ = 𝑈ᵀ𝑣 = 𝑈ᵀ𝑉𝑡 is the decoding of the sentence from its
embedding.
(Diagram: T ≈ Uᵀ S V, with N docs, M word features, and S a diagonal
matrix over the K latent dims.)
Embedding: Latent Semantic Analysis
𝑡′ = 𝑈ᵀ𝑣 = 𝑈ᵀ𝑉𝑡 is the decoding of the sentence from its
embedding.
An SVD factorization gives the best possible
reconstructions of the documents 𝑡′ from their embeddings.
(Diagram: the same T ≈ Uᵀ S V factorization as above.)
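As an illustration (not from the original slides), here is a minimal Python sketch of LSA via a truncated SVD. The toy count matrix is made up, and note that NumPy's convention T = U diag(s) Vᵀ differs slightly from the slides' T ≈ Uᵀ S V:

    import numpy as np

    # Toy document-term count matrix T: N=4 docs x M=5 word features (made-up counts).
    T = np.array([[2., 1., 0., 0., 1.],
                  [1., 2., 0., 1., 0.],
                  [0., 0., 3., 1., 0.],
                  [0., 1., 1., 2., 1.]])

    K = 2                                      # number of latent dimensions
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :K], s[:K], Vt[:K, :]

    doc_embeddings = U_k * s_k                 # N x K embedding of each document
    T_approx = doc_embeddings @ Vt_k           # best rank-K L2 reconstruction of T
    print(np.round(T_approx, 2))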
t-SNE of Word Embeddings
From: “Word representations: A simple and general method for semi-
supervised learning” Joseph Turian, Lev Ratinov, Yoshua Bengio, ACL 2010.
t-SNE of Word Embeddings
Left: Number Region; Right: Jobs Region
from “Deep Learning, NLP, and Representations” by Chris Olah. See also
http://colah.github.io/posts/2015-01-Visualizing-Representations/
Word2vec: Local contexts
Instead of entire documents, Word2vec uses words a few
positions away from each center word. The pairs of center
word/context word are called “skip-grams.”
“It was a bright cold day in April, and the clocks were striking”
Center word: red
Context words: blue
Word2vec considers all words as center words, and all their
context words.
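As a small illustration (not from the slides), a sketch of how the (center, context) skip-gram pairs might be generated from a tokenized sentence, assuming a symmetric window of radius 2:

    # Minimal sketch: generate (center, context) skip-gram pairs
    # from a tokenized sentence with a symmetric window (here radius 2).
    def skipgram_pairs(tokens, window=2):
        pairs = []
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
        return pairs

    sentence = "it was a bright cold day in april".split()
    print(skipgram_pairs(sentence)[:6])
    # [('it', 'was'), ('it', 'a'), ('was', 'it'), ('was', 'a'), ('was', 'bright'), ('a', 'it')]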
Word2vec: Local contexts
The pairs of center word/context word are called “skip-
grams.” Typical distances are 3-5 word positions. Skip-gram
model:
Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean, NIPS 2013
(Diagram: the center word is used to predict each of its context words.)
Word2vec: Local contexts
Models can also predict center word from context, CBOW
model. Generally, skip-gram performs better.
Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean, NIPS 2013
(Diagram: in CBOW, the context words are used to predict the center word.)
Word2vec: Local contexts
Word2vec optimizes a softmax loss for each output word:
p(j | i) = exp(uⱼᵀ vᵢ) / Σ_{k=1..V} exp(uₖᵀ vᵢ)
where 𝑗 is the output word and 𝑖 is the input word. 𝑗 ranges over
a context of ±3-5 positions around the input word.
𝑢 is an output embedding vector.
𝑣 is an input embedding vector.
Word2vec can be implemented with standard DNN toolkits, by
backpropagating to optimize 𝑢 and 𝑣.
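For concreteness, here is a minimal, hedged PyTorch sketch of that objective: input embeddings 𝑣, output embeddings 𝑢, and a full-vocabulary softmax via cross-entropy. The vocabulary size and the batches of (center, context) index pairs are placeholders; real word2vec implementations avoid the full softmax with tricks such as negative sampling or a hierarchical softmax.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SkipGram(nn.Module):
        def __init__(self, vocab_size, dim=128):
            super().__init__()
            self.v = nn.Embedding(vocab_size, dim)   # input (center) embeddings v_i
            self.u = nn.Embedding(vocab_size, dim)   # output (context) embeddings u_j

        def forward(self, center_ids):
            v_i = self.v(center_ids)                 # (batch, dim)
            return v_i @ self.u.weight.t()           # (batch, vocab): u_j^T v_i for every j

    model = SkipGram(vocab_size=10000)
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    center_ids = torch.randint(0, 10000, (32,))      # placeholder batch of center word ids
    context_ids = torch.randint(0, 10000, (32,))     # placeholder batch of context word ids
    loss = F.cross_entropy(model(center_ids), context_ids)   # softmax loss p(j|i)
    opt.zero_grad(); loss.backward(); opt.step()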
Matrix perspective
Using matrix representation:
See: “GloVe: Global Vectors for Word Representation” Jeffrey Pennington,
Richard Socher, Christopher D. Manning, 2014
(Diagram: a one-hot input word is mapped by V to a K-dimensional vector
over the latent dims, then by the output weights Uᵀ and a softmax loss to
a distribution over output words j.)
Word2vec: Local contexts
Local contexts capture much more information about relations
and properties than LSA:
Composition
Algebraic relations:
vec(‘‘woman")−vec(‘‘man") ≃ vec(‘‘aunt")−vec(‘‘uncle")
vec(‘‘woman")−vec(‘‘man") ≃ vec(‘‘queen")−vec(‘‘king")
From “Linguistic Regularities in Continuous Space Word Representations”
Tomas Mikolov , Wen-tau Yih, Geoffrey Zweig, NAACL-HLT 2013
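A sketch (under assumed names) of how such analogies are typically queried: E is a hypothetical embedding matrix with one row per word, and word2idx/idx2word are its lookup tables. The answer is the nearest neighbour, by cosine similarity, of vec(b) − vec(a) + vec(c):

    import numpy as np

    def analogy(a, b, c, E, word2idx, idx2word, topk=1):
        """Return the word(s) closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
        q = E[word2idx[b]] - E[word2idx[a]] + E[word2idx[c]]
        sims = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-8)
        for w in (a, b, c):
            sims[word2idx[w]] = -np.inf        # never return one of the query words
        return [idx2word[i] for i in np.argsort(-sims)[:topk]]

    # e.g. analogy("man", "woman", "king", E, word2idx, idx2word) should return
    # ["queen"] for a well-trained embedding.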
Relations Learned by Word2vec
“Efficient Estimation of Word Representations in Vector Space” Tomas Mikolov, Kai
Chen, Greg Corrado, Jeffrey Dean, Arxiv 2013
Word2vec model computed from 6 billion word corpus of news articles
A relation is defined by the vector displacement in the first column. For
each start word in the other column, the closest displaced word is shown.
Lexical and Compositional Semantics
Lexical Semantics: focuses on the meaning of individual
words.
Compositional Semantics: meaning depends on the words,
and on how they are combined.
Beyond Bag-Of-Words: Skip-Thought Vectors
The models we discussed so far embed texts as the sum of
their words (lexical semantics).
Clearly there is a lot missing from these representations:
“man bites dog” = “dog bites man”
“the quick, brown fox jumps over the lazy dog” =
“the lazy fox over the brown dog jumps quick”
…
How can we model text structure as well as word meanings?
Beyond Bag-Of-Words: Skip-Thought Vectors
Skip-thought embeddings use sequence-to-sequence
RNNs to predict the next and previous sentences.
The output state vector of the boundary layer (dotted box)
forms the embedding. RNN units are GRU units.
Once the network is trained, we can discard the red and
green sections of the network.
From “Skip-Thought Vectors,” Ryan Kiros et al., Arxiv 2015.
Beyond Bag-Of-Words: Skip-Thought Vectors
Skip-thought embeddings use sequence-to-sequence RNNs
to predict the next and previous sentences.
Encoding doesn’t require backpropagation, so we can
represent the encoder as a (truly) recurrent network.
Thus we can encode longer units of text: sentences or
paragraphs.
(Diagram: a recurrent encoder with input 𝑥𝑡, output 𝑦𝑡, and state 𝑧𝑡
computed from 𝑧𝑡−1.)
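A minimal sketch of such a recurrent encoder in PyTorch (dimensions are placeholders, and the inputs are assumed to be already-embedded word vectors); the final GRU hidden state serves as the sentence embedding:

    import torch
    import torch.nn as nn

    class SentenceEncoder(nn.Module):
        def __init__(self, word_dim=300, hidden_dim=1200):
            super().__init__()
            self.gru = nn.GRU(word_dim, hidden_dim, batch_first=True)

        def forward(self, word_vectors):      # (batch, seq_len, word_dim)
            _, h_n = self.gru(word_vectors)   # h_n: (1, batch, hidden_dim)
            return h_n.squeeze(0)             # one embedding per sentence

    enc = SentenceEncoder()
    sentences = torch.randn(4, 12, 300)       # 4 sentences of 12 embedded words each
    embeddings = enc(sentences)               # (4, 1200)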
Embedding of TREC queries
Points are colored by query type (t-SNE embedding):
From “Skip-Thought Vectors,” Ryan Kiros et al., Arxiv 2015.
Sentence Similarity
Approximately two weeks of training on a billion-word Books corpus
Semantic Relatedness Evaluation
SICK semantic relatedness task: score sentences for
semantic similarity from 1 to 5 (average of 10 human ratings)
Sentence A: A man is jumping into an empty pool
Sentence B: There is no biker jumping in the air
Relatedness score: 1.6
Sentence A: Two children are lying in the snow and are making snow angels
Sentence B: Two angels are making snow on the lying children
Relatedness score: 2.9
Sentence A: The young boys are playing outdoors and the man is smiling nearby
Sentence B: There is no boy playing outdoors and there is no man smiling
Relatedness score: 3.6
Sentence A: A person in a black jacket is doing tricks on a motorbike
Sentence B: A man in a black jacket is doing tricks on a motorbike
Relatedness score: 4.9
Semantic Relatedness Evaluation
Note: a separate model is trained to predict the scores from
pairs of embedded sentences.
Semantic Relatedness Evaluation
SICK semantic relatedness scores for skip-thought methods:
A Siamese Network for Semantic Relatedness
This network is trained on pairs of sentences a, b with a
similarity label y.
Parameters are shared between the two networks.
From “Siamese Recurrent Architectures for Learning Sentence
Similarity” Jonas Mueller, Aditya Thyagarajan, AAAI-2016
(Figure: the inputs are 300-dimensional word2vec embeddings; each LSTM
produces a 50-dimensional sentence vector; similarity is computed from the
Manhattan distance between the two vectors.)
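The paper's similarity function is the exponential of a negative Manhattan (L1) distance between the two final LSTM states, which keeps the output in (0, 1]. A hedged PyTorch sketch, with layer sizes taken from the figure and everything else assumed:

    import torch
    import torch.nn as nn

    class MaLSTM(nn.Module):
        def __init__(self, word_dim=300, hidden_dim=50):
            super().__init__()
            # One LSTM module used for both sentences = shared parameters.
            self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)

        def encode(self, x):                   # x: (batch, seq_len, word_dim)
            _, (h_n, _) = self.lstm(x)
            return h_n.squeeze(0)              # (batch, hidden_dim)

        def forward(self, sent_a, sent_b):
            h_a, h_b = self.encode(sent_a), self.encode(sent_b)
            # Similarity in (0, 1]: exp(-||h_a - h_b||_1), the Manhattan distance.
            return torch.exp(-torch.sum(torch.abs(h_a - h_b), dim=1))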
A Siamese Network for Semantic Relatedness
The network is trained on Semeval similar sentence pairs,
expanded by substituting for random words using WordNet
(a dataset of synonyms). Results:
From “Siamese Recurrent Architectures for Learning Sentence
Similarity” Jonas Mueller, Aditya Thyagarajan, AAAI-2016
A Siamese Network for Semantic Relatedness
The network is trained on Semeval similar sentence pairs,
expanded by substituting for random words using WordNet
(a dataset of synonyms). Results:
r = Pearson correlation, ρ = Spearman’s rank correlation.
A Siamese Network for Semantic Relatedness
Moral: Train for your evaluation metric!
Hidden Unit Factors 1,2, and 6
Semantic Entailment Evaluation
SICK semantic entailment task: score sentences for
relations: ENTAILMENT, CONTRADICTION, NEUTRAL:
Sentence A: Two teams are competing in a football match
Sentence B: Two groups of people are playing football
Entailment judgment: ENTAILMENT
Sentence A: The brown horse is near a red barrel at the rodeo
Sentence B: The brown horse is far from a red barrel at the rodeo
Entailment judgment: CONTRADICTION
Sentence A: A man in a black jacket is doing tricks on a motorbike
Sentence B: A person is riding the bicycle on one wheel
Entailment judgment: NEUTRAL
Semantic Entailment for MaLSTM
Translation Models
Sequence-To-Sequence RNNs
An input sequence is fed to the left (encoder) array and the output
sentence to the right (decoder) array for training.
For translation:
(Diagram: the encoder reads “I love coffee <EOS>”; the decoder is fed
“Amo el café” and trained to output “Amo el café <EOS>”.)
Sequence-To-Sequence RNNs
Generation:
Keep an n-best list of partial sentences, along with their
partial softmax scores.
(Diagram: beam search expands several partial translations in parallel,
e.g. “Amo el café <EOS>” and “Me gusta el café <EOS>”, keeping the
highest-scoring ones at each step.)
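A schematic sketch of that n-best (beam search) decoding loop; next_token_logprobs is a hypothetical function standing in for one step of the decoder RNN, returning (token, log-probability) pairs:

    import heapq

    def beam_search(next_token_logprobs, start, eos, beam_size=3, max_len=20):
        """Keep the beam_size best partial sentences and their summed log-probs."""
        beams = [(0.0, [start])]                       # (score, token sequence)
        for _ in range(max_len):
            candidates = []
            for score, seq in beams:
                if seq[-1] == eos:                     # finished hypothesis: keep it as-is
                    candidates.append((score, seq))
                    continue
                for tok, logp in next_token_logprobs(seq):
                    candidates.append((score + logp, seq + [tok]))
            beams = heapq.nlargest(beam_size, candidates, key=lambda x: x[0])
            if all(seq[-1] == eos for _, seq in beams):
                break
        return beams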
Bleu Scores for Translation
The goal of bleu scores is to compare machine translations
against human-generated translations, allowing for variation.
Consider these translations for a Chinese sentence:
Candidate 1: It is a guide to action which ensures that the
military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the
activity guidebook that party direct.
Bleu Scores for Translation
Candidate 1: It is a guide to action which ensures that the military always
obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity
guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will
forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military
forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the
directions of the party.
Bleu Scores for Translation
Unigram precision:
(# candidate unigrams occurring in a reference sentence) /
(# unigrams occurring in the test (candidate) sentence)
Modified unigram precision: clip counts by maximum
occurrence in any reference sentence:
Candidate: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
Modified precision is 2/7.
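A small sketch that reproduces the 2/7 example above (tokens lowercased and punctuation dropped for simplicity):

    from collections import Counter

    def modified_unigram_precision(candidate, references):
        counts = Counter(candidate)
        clipped = sum(min(c, max(ref.count(w) for ref in references))
                      for w, c in counts.items())
        return clipped / sum(counts.values())

    cand = "the the the the the the the".split()
    refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
    print(modified_unigram_precision(cand, refs))    # 2/7 = 0.2857...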
Bleu Scores for Translation
Candidate 1: It is a guide to action which ensures that the military always
obeys the commands of the party. unigram precision 17/18
Candidate 2: It is to insure the troops forever hearing the activity
guidebook that party direct. unigram precision 8/14
Reference 1: It is a guide to action that ensures that the military will
forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military
forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the
directions of the party.
Bleu Scores for Translation
N-gram precision is defined similarly:
(# candidate n-grams occurring in a reference sentence) /
(# n-grams occurring in the test (candidate) sentence)
Modified n-gram precision: clip counts by maximum
occurrence in any reference sentence.
Unigram scores tend to capture adequacy.
N-gram scores tend to capture fluency.
Bleu Scores for Translation
Candidate 1: It is a guide to action which ensures that the military always
obeys the commands of the party. bigram precision 10/17
Candidate 2: It is to insure the troops forever hearing the activity
guidebook that party direct. bigram precision 1/13
Reference 1: It is a guide to action that ensures that the military will
forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military
forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the
directions of the party.
Bleu Scores for Translation
How to combine scores for different n-grams?
Averaging sounds good, but precisions are very different for
different n (unigrams have much higher scores).
BLEU score: take a weighted geometric mean of the n-gram
precisions up to some length (usually 4), i.e. exponentiate a
weighted average of their logs. Add a brevity penalty when the
candidate length c is shorter than the reference length r.
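A sketch of the whole recipe (clipped n-gram precisions up to 4, uniform weights, geometric mean, brevity penalty); this follows the standard BLEU definition rather than any particular toolkit, and uses the closest reference length for the penalty, one common convention:

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(candidate, references, max_n=4):
        log_sum = 0.0
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(candidate, n))
            clipped = sum(min(c, max(Counter(ngrams(r, n))[g] for r in references))
                          for g, c in cand_counts.items())
            total = max(sum(cand_counts.values()), 1)
            p_n = max(clipped / total, 1e-9)          # avoid log(0) in this toy version
            log_sum += (1.0 / max_n) * math.log(p_n)  # uniform weights w_n = 1/4
        c = len(candidate)
        r = min((len(ref) for ref in references), key=lambda L: abs(L - c))
        bp = 1.0 if c > r else math.exp(1.0 - r / max(c, 1))   # brevity penalty
        return bp * math.exp(log_sum)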
Sequence-To-Sequence Model Translation
Raw scores for French-English Translation, depth = 4 for
these results.
Sequence-To-Sequence Model Translation
Raw scores for French-English Translation
Sequence-To-Sequence Model Translation
Scores using the LSTM model to rerank 1000-best
sentences from a baseline Machine Translation system:
Sequence-To-Sequence Model Translation
Training details:
• LSTM array depth = 4. Deeper is better.
• LSTM parameters initialized from the uniform distribution [-0.08, 0.08].
• Stochastic gradient descent without momentum, fixed learning
rate of 0.7.
• After 5 epochs, the learning rate was halved every half epoch.
• Models trained for a total of 7.5 epochs.
• Batch size of 128 sequences.
• Gradient clipping at ‖g‖ = 5.
• Sentences of approximately the same length were grouped into
the same minibatch.
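A runnable toy sketch of that recipe in PyTorch (a stand-in 4-layer LSTM and fake batches; only the initialization, clipping, and learning-rate schedule from the list above are the point):

    import torch

    model = torch.nn.LSTM(input_size=64, hidden_size=64, num_layers=4)  # toy stand-in
    for p in model.parameters():
        torch.nn.init.uniform_(p, -0.08, 0.08)              # uniform initialization

    optimizer = torch.optim.SGD(model.parameters(), lr=0.7)  # plain SGD, no momentum

    batches_per_half_epoch = 10                 # placeholder; the real corpus is far larger
    for epoch in range(8):                      # roughly 7.5 epochs in total
        for half in range(2):
            for _ in range(batches_per_half_epoch):
                x = torch.randn(20, 128, 64)    # fake batch: length 20, batch size 128
                out, _ = model(x)
                loss = out.pow(2).mean()        # placeholder loss
                optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
                optimizer.step()
            if epoch >= 5:                      # halve the learning rate every half epoch
                for group in optimizer.param_groups:
                    group["lr"] *= 0.5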
State-of-the-Art Neural Machine Translation
An RNN array with an attention network to regulate information
flow from the source network.
(Figure: global attention model vs. local attention model.)
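A minimal sketch of one global (dot-product) attention step, following the general recipe rather than any specific implementation; decoder_h and encoder_outputs are assumed PyTorch tensors:

    import torch
    import torch.nn.functional as F

    def global_attention(decoder_h, encoder_outputs):
        """decoder_h: (batch, dim); encoder_outputs: (batch, src_len, dim)."""
        # Score every source position against the current decoder state.
        scores = torch.bmm(encoder_outputs, decoder_h.unsqueeze(2)).squeeze(2)  # (batch, src_len)
        weights = F.softmax(scores, dim=1)                                      # attention weights
        # Weighted sum of encoder states = context vector fed to the decoder.
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)   # (batch, dim)
        return context, weights

    h_t = torch.randn(2, 500)
    enc_out = torch.randn(2, 7, 500)
    context, attn = global_attention(h_t, enc_out)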
State-of-the-Art Neural Machine Translation
English-German translation (WMT 14 results):
Parsing
Recall (Lecture 10) RNNs’ ability to generate LaTeX and C code.
They seem to do well with
tree-structured data.
What about natural language
parsing?
Parsing
Sequence models generate linear structures, but these can
easily encode trees by “closing parens” (prefix tree notation); an
example linearization is shown after the cheat sheet below.
Parsing Cheat Sheet
S = Sentence
NP = Noun Phrase
VP = Verb Phrase
NNP = Proper Noun (“John”)
VBZ = Verb, 3rd person, singular (“has”)
DT = Determiner (“a”)
NN = Noun, singular (“dog”)
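Using the tags above, the sentence “John has a dog .” can be linearized roughly as in the Vinyals et al. paper (exact bracket conventions may differ slightly):

    Input:  John has a dog .
    Output: (S (NP NNP )NP (VP VBZ (NP DT NN )NP )VP . )S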
A Sequence-To-Sequence Parser
The model is a depth-3 sequence-to-sequence predictor,
augmented with the attention model of Bahdanau 2014.
“Grammar as a Foreign Language,” Oriol Vinyals, Lukasz Kaiser,
Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton (Google), NIPS 2015
“Neural machine translation by jointly learning to align and translate.” Dzmitry
Bahdanau, Kyunghyun Cho, and Yoshua Bengio. arXiv 2014.
A Sequence-To-Sequence Parser
Chronology:
• First tried training a basic sequence-to-sequence model on
human-annotated treebanks. Poor results.
• Then trained on parse trees generated by the Berkeley
Parser, and matched its performance (90.5 F1 score).
• Next added the attention model, trained on the human
treebank data; this also achieved 90.5 F1.
• Finally, created a synthetic dataset of high-confidence
parse trees (agreed on by two parsers). This achieved a new
state-of-the-art of 92.5 F1 (WSJ dataset).
F1 is a widely-used accuracy measure that combines precision and recall
A Sequence-To-Sequence Parser
Quick Training Details:
• Depth = 3, layer dimension = 256.
• Dropout between layers 1 and 2, and between layers 2 and 3.
• No POS tags!! Leaving them out improved F1 by about 1 point.
• Input reversing.
A Sequence-To-Sequence Parser
Dropout layers shown in purple:
This use of dropout in LSTM arrays is now widely used.
Neural Entity-Relation Extraction
Several approaches have already been tried:
• RNNs running on the raw text only have trouble getting
the relation structure right.
• Adding dependency tree information helps dramatically.
Recent approach: run separate RNNs on the text and
dependency parse tree data.
Neural Entity-Relation Extraction
Equals previous best scores on an entity-relation benchmark
Wrapup
Semantics
• Propositional models, entity-relation extraction
• Matrix factorization
• Word2vec
• Skip-Thought vectors
• Siamese models
Translation + Structure Extraction
• Translation
• Parsing
• Entity-Relation extraction
Take-Aways
Training data quality (consistency) matters!
• DNNs can model anything, but what they model shouldn’t be human
inconsistency.
DNNs need good advice (hints)! cf. ResNets
• DNNs are capable of state-of-the-art parsing, but need a
parser to do good ER-extraction now.
Depth matters – Deeper is better
• Between-level dropout is a good regularization scheme
Editor's Notes
  • #2: Where to put block multi-vector algorithms? Percentage of Future Work Diagram
  • #53: Negation, classification by object (or action)?, classification by subject.