Natural Language Processing SoSe 2016: Discourse Analysis
Outline
● Introduction
● Text Segmentation
● Text Coherence
● Reference Resolution
● Evaluation
Discourse
● Discourse is a coherent, structured group of sentences.
(http://evolution.berkeley.edu/evolibrary/teach/journal/dissectingapaper2.php)
Types of Discourse
● Monologue
● Dialogue
  – Human-human
  – Human-computer (conversational agents)
(http://www.imdb.com/title/tt1798709/)
Motivation: Information Extraction
● Reference resolution
„Angelina Jolie Pitt is an American actress, filmmaker, and humanitarian. She has received an Academy Award, two Screen Actors Guild Awards, and three Golden Globe Awards, and has been cited as Hollywood's highest-paid actress. Divorced from actors Jonny Lee Miller and Billy Bob Thornton, she has been married to actor Brad Pitt since 2014. They have six children together, three of whom were adopted internationally."
(https://en.wikipedia.org/wiki/Angelina_Jolie)
Motivation: Summarization
● Text coherence and reference resolution

Motivation: Conversational Agents
● Reference resolution and text coherence

Motivation: Automatic Essay Grading
● Text coherence
(http://www.educationnews.org/technology/idea-works-using-automated-grading-for-collaborative-learning/)
Outline
● Introduction
● Text Segmentation
● Text Coherence
● Reference Resolution
● Evaluation
Text Segmentation
(http://evolution.berkeley.edu/evolibrary/teach/journal/dissectingapaper2.php)
Motivation for Text Segmentation
● Information extraction
  – Some sections are more informative than others
● Summarization
  – Include information from all sections
● Information retrieval
  – Retrieve particular information from the correct section
Text Segmentation
● Linear segmentation (no hierarchy)
● Approaches
  – Unsupervised discourse segmentation
  – Supervised discourse segmentation
Unsupervised Discourse Segmentation
● Based on cohesion: linguistic devices that link textual units
● Lexical cohesion: given by relations between words
  – e.g., identical words, synonyms, or hypernyms
● Cohesion chain: a sequence of related words
Unsupervised Discourse Segmentation
● TextTiling algorithm (Hearst 1997)
  – Tokenization
    ● Lowercase conversion, stoplist removal, and stemming
    ● Create pseudo-sentences (e.g., of length 20 tokens)
      – Not real sentences!
  – Lexical score determination
  – Boundary identification
Unsupervised Discourse Segmentation
● TextTiling algorithm (Hearst 1997)
  – Tokenization
  – Lexical score determination
    ● Average similarity of the words in the pseudo-sentences
    ● Create two vectors: blocks of k pseudo-sentences before and after each gap
    ● Calculate the cosine similarity between the two vectors (see the sketch below)
  – Boundary identification
(https://en.wikipedia.org/wiki/Cosine_similarity)
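A minimal Python sketch of this block-comparison step; the function names and the default block size are illustrative assumptions, not Hearst's original code:

from collections import Counter
from math import sqrt

def cosine(c1, c2):
    # Cosine similarity between two word-count vectors.
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def gap_scores(pseudo_sentences, k=10):
    # Lexical score at each gap: similarity of the blocks of (up to) k
    # pseudo-sentences before and after the gap.
    scores = []
    for gap in range(1, len(pseudo_sentences)):
        before = Counter(w for s in pseudo_sentences[max(0, gap - k):gap] for w in s)
        after = Counter(w for s in pseudo_sentences[gap:gap + k] for w in s)
        scores.append(cosine(before, after))
    return scores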
Unsupervised Discourse Segmentation
[Figure: lexical chains for terms A–I plotted across pseudo-sentences 1–8; each row marks where the chain's term recurs.]
Unsupervised Discourse Segmentation
[Figure: the same lexical chains, with block-comparison scores at three gaps: 8, 3, and 9. The low score (3) at the middle gap marks a candidate topic boundary.]
Unsupervised Discourse Segmentation
● TextTiling algorithm (Hearst 1997)
  – Tokenization
  – Lexical score determination
  – Boundary identification
    ● Compute the depth score: the distance from the peaks on both sides of the valley
      – (y_a1 − y_a2) + (y_a3 − y_a2) = (8 − 3) + (9 − 3) = 11
    ● Define boundaries at valleys deeper than a cutoff threshold (see the sketch below)
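Continuing the sketch above, depth scores and boundaries could be computed as follows; the mean − σ/2 cutoff is the choice reported by Hearst (1997), the rest is an illustrative assumption:

from statistics import mean, stdev

def depth_scores(scores):
    depths = []
    for i in range(len(scores)):
        # Climb from the valley at i to the nearest peak on each side.
        j = i
        while j > 0 and scores[j - 1] >= scores[j]:
            j -= 1
        left = scores[j]
        j = i
        while j < len(scores) - 1 and scores[j + 1] >= scores[j]:
            j += 1
        right = scores[j]
        depths.append((left - scores[i]) + (right - scores[i]))
    return depths

def boundaries(scores):
    d = depth_scores(scores)
    cutoff = mean(d) - stdev(d) / 2  # Hearst-style threshold
    return [i for i, depth in enumerate(d) if depth > cutoff]

# For the example above: depth_scores([8, 3, 9])[1] == (8 - 3) + (9 - 3) == 11.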
Supervised Discourse Segmentation
● Labeled data is available:
  – Paragraph segmentation (e.g., Web pages, <p> tags)
● Methods
  – Binary classifiers (e.g., SVM, Naïve Bayes; a minimal sketch follows below)
  – Sequential classifiers (e.g., HMM, CRF)
● Features
  – Word overlap, word cosine, Latent Semantic Analysis (LSA), lexical chains, coreference, etc.
  – Discourse markers or cue words
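As a hedged sketch, such a boundary classifier might look like this; the two features and the cue list are simplified stand-ins for the richer feature set above:

from sklearn.svm import SVC

CUE_WORDS = {"introduction", "background", "methods", "results"}  # illustrative

def gap_features(prev_sentence, next_sentence):
    prev_w = set(prev_sentence.lower().split())
    next_w = set(next_sentence.lower().split())
    overlap = len(prev_w & next_w) / max(len(prev_w | next_w), 1)  # word overlap
    cue = 1.0 if next_w & CUE_WORDS else 0.0                       # cue-word feature
    return [overlap, cue]

# X = [gap_features(s1, s2) for each labeled gap], y = 1 where a <p> boundary follows:
# clf = SVC(kernel="linear").fit(X, y)
# clf.predict([gap_features("...end of a section.", "Introduction ...")])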
Supervised Discourse Segmentation
● Discourse markers are domain-specific:
  – Broadcast news: „Good evening, I'm ...", „coming now...", etc.
  – Scientific articles: „Introduction", „Background", „Methods", „Results", etc.
  – Business news: „XYZ Incorporated" on first mention, then only „XYZ"
Evaluation of Text Segmentation
● WindowDiff [Pevzner and Hearst 2002]
  – Moving window of size k
  – k is half the average segment length in the reference text
  – Count the boundaries inside the window: r_i in the reference and h_i in the hypothesis; window positions where r_i ≠ h_i count as errors (sketched below)
[Figure: a reference and a hypothesis segmentation compared with a sliding window]
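A sketch of the metric, assuming segmentations are given as binary boundary vectors (1 = boundary after that position):

def window_diff(reference, hypothesis, k):
    # Slide a window of size k; count positions where the number of
    # boundaries inside the window differs between reference and hypothesis.
    n = len(reference)
    errors = sum(
        1 for i in range(n - k)
        if sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
    )
    return errors / (n - k)  # 0 = perfect agreement, higher = worse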
Outline
● Introduction
● Text Segmentation
● Text Coherence
● Reference Resolution
● Evaluation
Text Coherence
● A meaning relation between two units.
  – The meanings of different units can combine to build a discourse meaning for the larger unit.
Text Coherence
● A text is more coherent if the focus stays on one entity.
Coherence Relations
● Result: one event causes the following event
  – „John hid Bill's car keys. He was very upset about it."
● Explanation: a previous event causes the stated event
  – „John hid Bill's car keys. He was drunk."
● Parallel: both events happen at the same time
  – „John hid Bill's car keys. Bill was sleeping."
● Elaboration: a more detailed description of an event
  – „John hid Bill's car keys. He put them in his bag."
● Occasion: a change of state
  – „John hid Bill's car keys. He found them ten minutes later."
Discourse structure
(S1) Bill was drunk.
(S2) John hid Bill's car keys.
(S3) (While) Bill was sleeping.
(S4) He put them in his bag.
(S5) Bill was very upset about it.
(S6) Bill found them ten minutes later.
[Figure: discourse structure tree linking S1–S6 through Explanation, Parallel, Elaboration, and Result relations]
Rhetorical Structure Theory (RST)
● A model of text organization, originally developed for text generation (Mann and Thompson 1987)
● Defines 23 rhetorical relations that hold between a nucleus and a satellite, which are clauses or sentences
Rhetorical Structure Theory
● Deriving this structure automatically is also called discourse parsing
  – It is still a hard task
  – An open research question
Cue-phrase-based algorithm
When no cue phrases are available
● Exploit lexical semantics
  – „I don't want a truck; I'd prefer a convertible" [CONTRAST]
    ● negative vs. affirmative
    ● truck vs. convertible
● Use bootstrapping (see the sketch below):
  – Use strong cue markers (e.g., „because", „but") to acquire text
  – Remove the cue markers from the text
  – Use the resulting labeled data as training data
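A small sketch of the harvesting step; the cue-to-relation mapping and the clause-splitting regex are illustrative assumptions:

import re

CUES = {"because": "EXPLANATION", "but": "CONTRAST"}  # strong, mostly unambiguous markers

def harvest(sentences):
    examples = []
    for sent in sentences:
        for cue, relation in CUES.items():
            m = re.search(rf"^(.+?),?\s+{cue}\s+(.+)$", sent, re.IGNORECASE)
            if m:
                # The marker itself is dropped, so a classifier trained on these
                # pairs must learn the relation from the clauses alone.
                examples.append((m.group(1), m.group(2), relation))
    return examples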
Outline
● Introduction
● Text Segmentation
● Text Coherence
● Reference Resolution
● Evaluation
Reference resolution
(http://www.bbc.com/news/world-europe-36433061)
Reference resolution
● The task of determining which entities (referents) are referred to by which linguistic expressions (referring expressions).
● Anaphora: reference to an entity (the antecedent) previously introduced in the discourse
Reference resolution
● Discourse model:
  – A representation of the entities that have been referred to in the discourse and the relationships in which they participate.
● Two components are required:
  – A discourse model that evolves with the dynamically changing discourse
  – A method for mapping between the signals that various referring expressions encode and the hearer's set of beliefs
Reference resolution
● evoke (first mention) vs. access (subsequent mentions)
[Figure: the headline „Lionel Messi to give evidence at tax fraud trial in Spain" evokes a new entity into the discourse model]
Reference resolution
● Coreference resolution
  – Find referring expressions in the text that refer to the same entity, i.e., that corefer
  – e.g., „Messi", „the player", „both men", „his father"
● Pronominal anaphora resolution
  – Find the antecedent of a single pronoun
    ● e.g., „he", „they"
  – A subtask of coreference resolution
Types of referring expressions
● Indefinite noun phrases:
  – e.g., „a", „an", „this", „some"
  – Introduce an entity new to the hearer
  – Evoke a new entity into the discourse model
  – Can be specific or non-specific
Types of referring expressions
● Definite noun phrases:
  – Refer to an entity identifiable by the hearer
  – Evoke a representation of the referent into the discourse model
„The authorities allege that the two used tax havens in Belize and Uruguay to conceal earnings from image rights."
„Because of the trial, Messi has missed part of his national team's preparations for the Copa America, which starts on Friday in the US. Argentina's first game is on Monday June 6."
Types of referring expressions
● Pronouns (pronominalization):
  – e.g., „he", „she", „they"
  – The entity was usually referred to one or two sentences back
  – Cataphora: the pronoun is mentioned before the referent
„His lawyers had argued that the player had "never devoted a minute of his life to reading, studying or analysing" the contracts."
Types of referring expressions
● Demonstratives:
  – e.g., „this", „that"
  – Can appear either alone or as determiners
„The authorities allege that the two used tax havens in Belize and Uruguay to conceal earnings from image rights. Both men deny that."
Types of referring expressions
● Names:
  – Names of people, organizations, locations, etc.
  – Can refer to both new and old entities
„Messi and his father Jorge, who manages his financial affairs, are accused of defrauding Spain of more than €4m (£3m; $4.5m) between 2007 and 2009."
Information Status (or Information Structure)
● The way that different referential forms are used to provide new or old information.
● Givenness hierarchy (Gundel et al. 1993):
  in focus {it} > activated {that, this, this N} > familiar {that N} > uniquely identifiable {the N} > referential {indef. this N} > type identifiable {a N}
● Accessibility scale (Ariel 2001):
  full name > long definite description > short definite description > last name > first name > distal demonstrative > NP > stressed pronoun > unstressed pronoun
Information Status
● Complicating factors for the relation between referring expressions and information status:
  – Inferrables: not explicitly evoked in the text, but related to an evoked entity
    ● „Because of the trial, Messi has missed part of his national team's preparations for the Copa America, which starts on Friday in the US. Argentina's first game is on Monday."
Information Status
● Complicating factors for the relation between referring expressions and information status:
  – Generics: not explicitly evoked in the text, but a generic reference
    ● „I only worried about playing football," he told the judge. But they did not believe a word of it."
Information Status
● Complicating factors for the relation between referring expressions and information status:
  – Non-referential uses:
    ● „It was the judge who asked Messi the questions."
Features for Pronominal Anaphora Resolution
● Number agreement:
  – „I only worried about playing football," Messi told the judge. But she did not believe a word of it."
  instead of
  – „I only worried about playing football," he told the judge. But they did not believe a word of it."
  But semantically plural entities can be referred to with „it" or „they":
  – „Argentina's first game is on Monday. They cannot count on Messi for the first game."
Features for Pronominal Anaphora Resolution
● Person agreement:
  – „I only worried about playing football," Messi told the judge. But he should not be granted immunity for not knowing what was happening with his finances."
● Gender agreement:
  – „Messi's lawyers had argued that the player had never devoted a minute of his life to reading, studying or analysing the contracts. He should be more careful about them."
● Binding theory constraints:
  – „Messi said that the father himself manages his finances."
Preferences in Pronoun Interpretation
● Recency:
  – „Messi played in the match against Brazil last week. Argentina will play against Chile this week. They hope to win it."
● Grammatical role: the subject position is more salient than the object position
  – „Argentina played a match against Brazil last week. They won."
● Repeated mention: focused entities are likely to remain in focus
  – „Argentina played a match against Brazil last week. They also played matches against Chile and Peru. But Uruguay didn't want to play against them."
Preferences in Pronoun Interpretation
● Parallelism: preferences can be induced by parallelism effects
  – „Argentina's first match was against Brazil in the last Copa America. Chile will also play against them this year."
● Verb semantics: interpretation can be biased by semantically driven emphasis
  – „Argentina beat Brazil. They won."
  – „Argentina lost to Brazil. They won."
● Selectional restrictions: semantic roles can play a role in referent preferences
  – „Messi said he signed the documents related to his finances without reading them."
Algorithms for Anaphora Resolution
● Hobbs algorithm
● Log-linear model
● Centering (see the textbook)
Hobbs algorithm
● Relies on
  – a syntactic parser
  – a morphological gender and number checker
● Input: the pronoun to be resolved
● Output: a noun phrase (the antecedent)
Hobbs algorithm (breadth-first, left-to-right search)
„I only worried about playing football," Messi told the judge.
(ROOT
  (S
    (S
      (NP (PRP I))
      (ADVP (RB only))
      (VP (VBN worried)
        (PP (IN about)
          (S
            (VP (VBG playing)
              (NP (NN football)))))))
    (, ,)
    (NP (NNP Messi))
    (VP (VBD told)
      (NP (DT the) (NN judge)))
    (. .)))
But she did not believe a word of it.
(ROOT
  (S (CC But)
    (NP (PRP she))
    (VP (VBD did) (RB not)
      (VP (VB believe)
        (NP
          (NP (DT a) (NN word))
          (PP (IN of)
            (NP (PRP it))))))
    (. .)))
Hobbs algorithm (breadth-first, left-to-right search)
● Starting from the pronoun „she", the search of its own sentence finds no NPs (there is nothing to its left).
Hobbs algorithm (breadth-first, left-to-right search)
● The search then moves to the tree of the previous sentence and checks all of its NPs breadth-first, left-to-right; the agreement check settles on „the judge" as the antecedent of „she".
Hobbs algorithm
● Advantages:
  – The search order accounts for binding theory, recency, and grammatical role (a simplified sketch of the search order follows)
  – A final check accounts for gender, person, and number constraints
● Disadvantages:
  – Does not have an explicit discourse model
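A heavily simplified, hedged sketch of the search order only; the real algorithm first walks specific paths inside the pronoun's own tree, and the agreement check is left to the caller here:

from collections import deque
from nltk import Tree

def candidate_nps(trees):
    # trees: parse trees, current sentence first, then earlier sentences.
    for tree in trees:
        queue = deque([tree])
        while queue:
            node = queue.popleft()
            if isinstance(node, Tree):
                if node.label() == "NP":
                    yield node
                queue.extend(node)  # children in order: breadth-first, left-to-right

def resolve(pronoun, trees, agrees):
    # agrees(pronoun, np): gender/person/number check supplied by the caller.
    return next((np for np in candidate_nps(trees) if agrees(pronoun, np)), None)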
Log-Linear Algorithm
● Supervised machine learning approach
● Log-linear model, but any other ML algorithm also works
● Must rely on
  – data annotated with positive and negative examples
  – a full parser or a chunker
● Usually, pleonastic pronouns are removed first, e.g., „It is raining."
● Input: a pair of NP and pronoun
● Output: 0 or 1 (binary classification)
Features for ML
● Strict number (true/false)
● Strict gender (true/false)
● Regarding the pronoun and the NP:
  – Compatible number (true/false)
  – Compatible gender (true/false)
  – Sentence distance [0, 1, 2, 3, ...]: number of sentences between them
  – Hobbs distance [0, 1, 2, 3, ...]: number of skipped NPs
  – Grammatical role [subject, object, PP] of the potential antecedent
  – Linguistic form [proper, definite, indefinite, pronoun] of the potential antecedent (see the feature sketch below)
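An illustrative feature extractor for one (potential antecedent, pronoun) pair; the tiny gender lexicon is a stand-in for a real morphological checker:

PRONOUN_GENDER = {"he": "m", "him": "m", "she": "f", "her": "f", "it": "n", "they": "p"}

def pair_features(antecedent_head, pronoun, sentence_distance, hobbs_distance, role, form):
    g_ant = PRONOUN_GENDER.get(antecedent_head.lower(), "unknown")
    g_pro = PRONOUN_GENDER.get(pronoun.lower(), "unknown")
    return {
        "strict_gender": g_ant == g_pro,
        "compatible_gender": "unknown" in (g_ant, g_pro) or g_ant == g_pro,
        "sentence_distance": sentence_distance,  # sentences between the pair
        "hobbs_distance": hobbs_distance,        # NPs skipped by the Hobbs search
        "role": role,                            # "subject" / "object" / "PP"
        "form": form,                            # "proper" / "definite" / "indefinite" / "pronoun"
    }

# A log-linear model (e.g., logistic regression over one-hot encoded features)
# then labels each pair as coreferent (1) or not (0).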
Coreference Resolution
● Deals with definite noun phrases and names
  – „Messi", „the player", „the football star", „Barcelona", „Argentina"
● We can use a similar log-linear/ML classifier
Coreference Resolution
● We can rely on the same features as for anaphora, plus (two of them are sketched below):
  – Anaphor edit distance: minimum edit distance from the potential antecedent to the anaphor
  – Antecedent edit distance: minimum edit distance from the anaphor to the antecedent
  – Alias (true/false), based on named-entity recognition
  – Appositive (true/false): e.g., „Lionel Messi, the Barcelona football star, ..."
  – Linguistic form [proper, definite, indefinite, pronoun]
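Two of these string features sketched in Python; the acronym test is just one illustrative alias heuristic, not the full NER-based check:

from difflib import SequenceMatcher

def edit_similarity(a, b):
    # difflib's ratio as a cheap stand-in for minimum edit distance.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_acronym_alias(name, candidate):
    # e.g., is_acronym_alias("Federation Internationale de Football Association", "FIFA")
    acronym = "".join(w[0] for w in name.split() if w[:1].isupper())
    return len(acronym) > 1 and candidate.upper() == acronym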
Outline
● Introduction
● Text Segmentation
● Text Coherence
● Reference Resolution
● Evaluation
Evaluation of Coreference Resolution
● B-CUBED
  – Reference chain (or true chain) vs. hypothesis chain
  – Computes the precision and recall of the entities in the hypothesis chains against the reference chains (see the sketch below):

Precision = Σ_{i=1}^{N} w_i · (# of correct elements in the hypothesis chain containing entity i) / (# of elements in the hypothesis chain containing entity i)

Recall = Σ_{i=1}^{N} w_i · (# of correct elements in the hypothesis chain containing entity i) / (# of elements in the reference chain containing entity i)
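A minimal sketch with chains as sets of mention ids and uniform weights w_i = 1/N (one common choice):

def b_cubed(reference_chains, hypothesis_chains):
    def chain_of(mention, chains):
        return next((c for c in chains if mention in c), set())
    mentions = {m for c in reference_chains for m in c}
    n = len(mentions)
    precision = recall = 0.0
    for m in mentions:
        ref = chain_of(m, reference_chains)
        hyp = chain_of(m, hypothesis_chains)
        correct = len(ref & hyp)        # elements the two chains share
        if hyp:
            precision += correct / len(hyp)
        recall += correct / len(ref)
    return precision / n, recall / n    # the averaging applies w_i = 1/n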
Further Reading
● Book „Speech and Language Processing" (Jurafsky and Martin)
  – Chapter 21