
Natural Language Processing

SoSe 2016

Discourse Analysis

Dr. Mariana Neves
June 6th, 2016


Outline


Introduction

Text Segmentation

Text Coherence

Reference Resolution

Evaluation

2
Outline


Introduction

Text Segmentation

Text Coherence

Reference Resolution

Evaluation

3
Discourse


Discourse is a coherent, structured group of sentences.

4 (http://evolution.berkeley.edu/evolibrary/teach/journal/dissectingapaper2.php)
Types of Discourse


Monologues


Dialogue
– Human-human
– Human-computer (conversational agent)

5 (http://www.imdb.com/title/tt1798709/)
Motivation: Information extraction


Reference resolution

Angelina Jolie Pitt is an American actress, filmmaker, and humanitarian. She has
received an Academy Award, two Screen Actors Guild Awards, and three Golden
Globe Awards, and has been cited as Hollywood's highest-paid actress.

Divorced from actors Jonny Lee Miller and Billy Bob Thornton, she has been
married to actor Brad Pitt since 2014. They have six children together, three of
whom were adopted internationally.

6 (https://en.wikipedia.org/wiki/Angelina_Jolie)
Motivation: Summarization


Text coherence and reference resolution

“To review available studies of empagliflozin, a
sodium glucose co-transporter-2 (SGLT2) inhibitor
approved in 2014 by the European Commission and
the United States Food and Drug Administration for
the treatment of type 2 diabetes mellitus (T2DM).
Inhibitors of the sodium-glucose co-transporter 2
(SGLT2) promote the excretion of glucose to reduce
glycated hemoglobin (HbA1c) levels.”

[Slide callouts mark „this protein“ and „SGLT2“ as referring expressions
for the SGLT2-related phrases in the abstract.]

7
Motivation: Conversational agents


Reference resolution and text coherence

Coherent dialogue:
- Good morning, I want to fly to Denver.
- When do you want it to leave?
- Next Thursday.
- An early or a late flight?
- ...

Incoherent dialogue:
- Good morning, I want a flight to Denver.
- Which kind of hotel do you want to book?

8
Motivation: Automatic Essay Grading


Text coherence

9 (http://www.educationnews.org/technology/idea-works-using-automated-grading-for-collaborative-learning/)
Outline


Introduction

Text Segmentation

Text Coherence

Reference Resolution

Evaluation

10
Text Segmentation

11 (http://evolution.berkeley.edu/evolibrary/teach/journal/dissectingapaper2.php)
Motivation for Text Segmentation


Information extraction
– Some sections are more informative than others


Summarization
– Include information from all sections


Information retrieval
– Retrieve particular information from the correct section

12
Text Segmentation


Linear segmentation (no hierarchy)


Approaches
– Unsupervised discourse segmentation
– Supervised discourse segmentation

13
Unsupervised Discourse Segmentation


Based on cohesion: linguistic devices to link textual units

Lexical cohesion: given by relations between words
– e.g., identical words, synonyms or hypernyms.

Cohesion chain: sequence of related words.

Peel, core and slice the pears and the apples.


Add the fruits to the skillet.
When they are soft, ….

14
Unsupervised Discourse Segmentation


TextTiling algorithm (Hearst 1997)
– Tokenization

Lowercase conversion, stoplist removal and stemming

Create pseudo-sentences (e.g., length 20)
– No real sentences!
– Lexical score determination
– Boundary identification

15
Unsupervised Discourse Segmentation


TextTiling algorithm (Hearst 1997)
– Tokenization
– Lexical score determination

Average similarity of words in the pseudo-sentences

Create two vectors
– Blocks of k pseudo-sentences before and after each
gap
– Calculate cosine similarity between the vectors
– Boundary identification

16 (https://en.wikipedia.org/wiki/Cosine_similarity)
Unsupervised Discourse Segmentation

[Figure: word-occurrence matrix for terms A-I across eight pseudo-sentences,
showing which terms recur in which blocks. The lexical cohesion scores
computed at three of the gaps are 8, 3 and 9; the valley with score 3 marks
a candidate boundary.]

18
Unsupervised Discourse Segmentation


TextTiling algorithm (Hearst 1997)
– Tokenization
– Lexical score determination
– Boundary identification

Compute depth score: distance from the peaks on both
sides of the valley
– (y_a1 − y_a2) + (y_a3 − y_a2) = (8 − 3) + (9 − 3) = 11

Define boundaries for valleys deeper than a cutoff
threshold

19
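The TextTiling steps above can be condensed into a short sketch. This is a
minimal illustration rather than Hearst's full algorithm: the helper names are
hypothetical, pseudo-sentences are given as token lists, and the peak search
is simplified to a maximum over each side of the valley (Hearst climbs only
until the score stops increasing).

from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def depth_scores(pseudo_sents, k=2):
    # Gap score: similarity of the blocks of k pseudo-sentences
    # before and after each gap.
    sims = []
    for gap in range(1, len(pseudo_sents)):
        before = Counter(w for s in pseudo_sents[max(0, gap - k):gap] for w in s)
        after = Counter(w for s in pseudo_sents[gap:gap + k] for w in s)
        sims.append(cosine(before, after))
    # Depth score: rise from each valley to the highest point on both sides.
    return [(max(sims[:i + 1]) - y) + (max(sims[i:]) - y) for i, y in enumerate(sims)]

Gaps whose depth score exceeds the cutoff threshold are then returned as
segment boundaries.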
Supervised Discourse Segmentation


Labeled data are available:
– Paragraph segmentation (e.g., Web pages, <p> tag)

Methods
– Binary classifiers (e.g., SVM, Naïve Bayes)
– Sequential classifiers (HMM, CRF)

Features
– Word overlap, word cosine, Latent Semantic Analysis
(LSA), lexical chains, coreference, etc.
– Discourse markers or cue words

20
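As a rough illustration of the supervised setting, the sketch below trains a
binary boundary classifier on a single word-overlap feature. Everything here
is an assumption for illustration: toy data, one feature, scikit-learn's
logistic regression standing in for the SVM or Naïve Bayes mentioned above.
Real systems combine the whole feature set listed on this slide.

from sklearn.linear_model import LogisticRegression

def word_overlap(prev: str, curr: str) -> float:
    """Jaccard word overlap of adjacent sentences; low overlap suggests a boundary."""
    a, b = set(prev.lower().split()), set(curr.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy labeled pairs of adjacent sentences: label 1 = segment boundary after `prev`.
pairs = [
    ("Peel, core and slice the pears.", "Add the fruits to the skillet.", 0),
    ("When they are soft, serve them.", "Next, we discuss information retrieval.", 1),
    ("Retrieve the documents for the query.", "Rank the documents by relevance.", 0),
    ("Evaluation uses precision and recall.", "Good evening, I'm your host.", 1),
]
X = [[word_overlap(p, c)] for p, c, _ in pairs]
y = [label for _, _, label in pairs]
clf = LogisticRegression().fit(X, y)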
Supervised Discourse Segmentation


Discourse markers are domain-specific:
– Broadcast news: „Good evening, I'm ...“, „coming
now...“, etc.
– Scientific articles: „Introduction“, „Background“,
„Methods“, „Results“, etc.
– Business: „XYZ Incorporated“ on first mention, then only „XYZ“

21
Evaluation of Text Segmentation


WindowDiff [Pevzner and Hearst 2002]
– Moving window of size „k“
– „k“ is half the average segment length in the reference text
– # of boundaries in the probe: r_i (reference) and h_i (hypothesis)
– WindowDiff = (1 / (N − k)) · Σ_{i=1..N−k} 1[r_i ≠ h_i]

[Figure: a reference segmentation and a hypothesis segmentation compared
through the moving window]

22
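A minimal sketch of the metric, assuming boundaries are encoded as 0/1 arrays
(1 = a boundary after position i):

def window_diff(ref, hyp, k):
    """WindowDiff (Pevzner and Hearst 2002): fraction of window positions where
    reference and hypothesis disagree on the number of boundaries."""
    n = len(ref)
    errors = sum(sum(ref[i:i + k]) != sum(hyp[i:i + k]) for i in range(n - k))
    return errors / (n - k)

ref = [0, 0, 1, 0, 0, 0, 1, 0]
hyp = [0, 1, 0, 0, 0, 0, 1, 0]
print(window_diff(ref, hyp, k=2))  # 0.333..., one boundary is misplaced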
Outline


Introduction

Text Segmentation

Text Coherence

Reference Resolution

Evaluation

23
Text Coherence


Meaning relation between two units.
– The meaning of different units can combine to build a
discourse meaning for the larger unit.

John hid Bill's car keys. He was drunk. (coherent)

John hid Bill's car keys. He likes spinach. (not coherent)

24
Text coherence


Better if the focus is on one entity

a. John went to his favorite music store to buy a piano.


b. He had frequented the store for many years.
c. He was excited that he could finally buy a piano. 
d. He arrived just as the store was closing for the day.

a. John went to his favorite music store to buy a piano.


b. It was a store John had frequented for many years.
c. He was excited that he could finally buy a piano. 
d. It was closing for the day just as John arrived.

25
Coherence Relations


Result: one event can cause a following event
– „John hid Bill's car keys. He was very upset about it.“

Explanation: a later sentence states the cause of an earlier event
– „John hid Bill's car keys. He was drunk.“

Parallel: both events happen at the same time
– „John hid Bill's car keys. Bill was sleeping.“

Elaboration: the second sentence gives more detail about the event
– „John hid Bill's car keys. He put them in his bag.“

Occasion: a change of state can be inferred
– „John hid Bill's car keys. He found them ten minutes later.“

26
Discourse structure
(S1) Bill was drunk.
(S2) John hid Bill's car keys.
(S3) (While) Bill was sleeping.
(S4) He put them in his bag.
(S5) Bill was very upset about it.
(S6) Bill found them ten minutes later.

[Figure: a discourse tree combining S1-S6, with Explanation, Parallel,
Elaboration and Result relations holding between the segments]
27
Rhetorical Structure Theory (RST)


Model of text organization originally for text generation (Mann
and Thompson 1987)

Uses 23 rhetorical relations that hold between a nucleus and a
satellite, which can be clauses or sentences

His car is parked outside. (satellite)

Kevin must be here. (nucleus)

28
Rhetorical Structure Theory

[Figure: example RST discourse tree, taken from Marcu 2000]


29
Automatic Coherence Assignment


Also called discourse parsing
– It is still a hard task
– Open research question

30
Cue-phrase-based algorithm

1. Identify the cue phrases in text



Search for connectives, e.g., „because“, „but“, „for
example“, „with“, „and“, etc.

„John hid Bill's car keys because he was drunk.“
(EXPLANATION)
2. Segment the text into discourse segments
3. Classify the relation between each consecutive segment

31
Cue-phrase-based algorithm

1. Identify the cue phrases in text


2. Segment the text into discourse segments
– Segments can be clauses or sentences

„[John hid Bill's car keys] [because he was drunk].“
– Segmentation can be carried out with rules and can rely
on the output of syntactic parsing
3. Classify the relation between each consecutive segment

32
Cue-phrase-based algorithm

1. Identify the cue phrases in text


2. Segment the text into discourse segments
3. Classify the relation between each consecutive segment

Usually also rule-based, driven by the cue phrases

However, one cue phrase can be related to various RST
relations

e.g., „because“ can be CAUSE or EVIDENCE

33
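A toy sketch of the three steps, using a hypothetical cue-to-relation table
(real inventories are far larger, and a cue such as „because“ still needs
disambiguation between relations):

import re

CUES = {"because": "EXPLANATION", "but": "CONTRAST", "for example": "ELABORATION"}

def parse_relation(sentence: str):
    # Step 1: look for a known cue phrase.
    for cue, relation in CUES.items():
        m = re.search(rf"\b{re.escape(cue)}\b", sentence, flags=re.IGNORECASE)
        if m:
            # Step 2: split into two discourse segments around the connective.
            left = sentence[:m.start()].rstrip(" ,")
            right = sentence[m.end():].strip()
            # Step 3: label the relation signalled by the cue phrase.
            return relation, left, right
    return None

print(parse_relation("John hid Bill's car keys because he was drunk."))
# ('EXPLANATION', "John hid Bill's car keys", 'he was drunk.')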
When no cue phrases are available


Explore lexical semantics
– „I don't want a truck; I'd prefer a convertible“ [CONTRAST]

negative vs. affirmative

truck vs. convertible


Use of bootstrapping:
– Use strong cue markers (e.g., „because“, „but“) to acquire
text;
– Remove the cue markers from the text;
– Use the labeled data as training data

34
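The bootstrapping idea can be sketched as follows (toy corpus and marker
table assumed): sentences containing a strong, unambiguous marker are
harvested, the marker is removed so the classifier cannot rely on it, and
the relation implied by the marker becomes a silver-standard label.

MARKER_TO_RELATION = {"because": "EXPLANATION", "but": "CONTRAST"}

def harvest_training_data(sentences):
    data = []
    for s in sentences:
        for marker, relation in MARKER_TO_RELATION.items():
            if f" {marker} " in s.lower():
                # Strip the marker; the remaining words are the training input.
                stripped = " ".join(w for w in s.split() if w.lower() != marker)
                data.append((stripped, relation))
    return data

print(harvest_training_data(["John hid Bill's car keys because he was drunk."]))
# [("John hid Bill's car keys he was drunk.", 'EXPLANATION')]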
Outline


Introduction

Text Segmentation

Text Coherence

Reference Resolution

Evaluation

35
Reference resolution

Lionel Messi to give evidence at tax fraud trial in Spain


The Argentina and Barcelona footballer Lionel Messi is due to give evidence in a
Spanish court on tax fraud charges.
Messi and his father Jorge, who manages his financial affairs, are accused of
defrauding Spain of more than €4m (£3m; $4.5m) between 2007 and 2009.
The authorities allege that the two used tax havens in Belize and Uruguay to
conceal earnings from image rights.
Spain's tax agency is demanding heavy fines and prison sentences. Both men
deny any wrongdoing.
...
Messi's lawyers had argued that the player had "never devoted a minute of his
life to reading, studying or analysing" the contracts.
But the high court in Barcelona ruled in June 2015 that the football star should
not be granted immunity for not knowing what was happening with his finances,
which were being managed in part by his father.
..
The footballer is the five-time World Player of the Year and one of the richest
athletes in the world.

36 (http://www.bbc.com/news/world-europe-36433061)
Reference resolution


The task of determining what entities are referred to by which
linguistic expressions.

Anaphora: reference to an entity (antecedent) previously
introduced in the discourse

referent

Lionel Messi to give evidence at tax fraud trial


in Spain

Messi's lawyers had argued that the player had


"never devoted a minute of his life to reading,
studying or analysing" the contracts.

referring expressions
37
Reference resolution


Discourse model:
– Representation of the entities that have been referred to in
the discourse and the relationships in which they
participate.


Two components are required
– A discourse model should evolve with the dynamically
changing discourse
– A method for mapping between the signals that various
referring expressions encode and the hearer's set of beliefs

38
Reference resolution


evoke (first mention) vs. access (subsequent mentions)

Discourse Model
evoke Lionel Messi to give evidence at
tax fraud trial in Spain

Messi's lawyers had argued that


the player had "never devoted a
access minute of his life to reading,
studying or analysing" the
contracts.

39
Reference resolution


Coreference resolution
– Find referring expressions in the text that refer to the
same entity, i.e., that corefer.
– e.g., „Messi“, „the player“, „both men“, „his father“


Pronominal anaphora resolution
– Find the antecedent for a single pronoun

e.g., „he“, „they“
– It is a subtask of coreference resolution.

40
Types of referring expressions


Indefinite noun phrases:
– e.g., „a“, „an“, „this“, „some“
– An entity new to the hearer
– Evoke a new entity into the discourse model
– Can be specific or non-specific

“The Argentina and Barcelona footballer Lionel Messi is due to give


evidence in a Spanish court on tax fraud charges.”

“A verdict is not expected until next week.”

41
Types of referring expressions


Definite noun phrases:
– An entity identifiable to the hearer
– Evoke a representation of the referent into the discourse
model

“The authorities allege that the two used tax havens in Belize and
Uruguay to conceal earnings from image rights.”

“Because of the trial, Messi has missed part of his national team's
preparations for the Copa America, which starts on Friday in the US.
Argentina's first game is on Monday June 6.”

42
Types of referring expressions


Pronouns (Pronominalization):
– e.g., „he“, „she“, „they“
– The entity was usually referred to one or two sentences back.
– Cataphora: the pronoun is mentioned before the referent.

“Because of the trial, he has missed part of his national team's


preparations for the Copa America, which starts on Friday in the US.
Argentina's first game is on Monday June 6.”

“His lawyers had argued that the player had "never devoted a minute of
his life to reading, studying or analysing" the contracts.”

43
Types of referring expressions


Demonstratives:
– e.g., „this“, „that“
– Can appear either alone or as determiners.

“The trial began on Tuesday, and Thursday is expected to be the final


day. A verdict is not expected until the end of this week.”

“The authorities allege that the two used tax havens in Belize and
Uruguay to conceal earnings from image rights. Both men deny that.”

44
Types of referring expressions


Names:
– Names of people, organizations, locations, etc.
– Can refer to both new and old entities.

“Messi and his father Jorge, who manages his financial affairs, are
accused of defrauding Spain of more than €4m (£3m; $4.5m) between
2007 and 2009.”

“The Argentina and Barcelona footballer Lionel Messi is due to give


evidence in a Spanish court on tax fraud charges.”

45
Information Status (or Information Structure)


The way that different referential forms are used to provide
new or old information.


Givenness hierarchy (Gundel et al. 1993):

in focus {it} > activated {that, this, this N} > familiar {that N} >
uniquely identifiable {the N} > referential {indef. this N} >
type identifiable {a N}


Accessibility scale (Ariel 2001):
full name > long definite description > short definite description > last name >
first name > distal demonstrative > NP > stressed pronoun > unstressed pronoun
46
Information Status


Complicating factors for relations between referring
expressions and information status:
– Inferrable: not explicitly evoked in text, but related to an
evoked entity

„Because of the trial, Messi has missed part of his
national team's preparations for the Copa America,
which starts on Friday in the US. Argentina's first
game is on Monday.“

47
Information Status


Complicating factors for relations between referring
expressions and information status:
– Generics: not explicitly evoked in text, but a generic
reference

„I only worried about playing football," he told the
judge. But they did not believe a word of it.“

48
Information Status


Complicating factors for relations between referring
expressions and information status:
– Non-referential uses:

„It was the judge who asked Messi the questions.“

49
Features for Pronominal Anaphora Resolution


Number agreement:
– „I only worried about playing football," Messi told the
judge. But she did not believe a word of it.“
instead of
– „I only worried about playing football," he told the judge.
But they did not believe a word of it.“
but semantically plural entities can use it or they:
– „Argentina's first game is on Monday. They cannot count
with Messi for the first game.“

50
Features for Pronominal Anaphora Resolution


Person agreement:
– „I only worried about playing football," Messi told the
judge. But he should not be granted immunity for not
knowing what was happening with his finances.“

Gender agreement:
– „Messi's lawyers had argued that the player had never
devoted a minute of his life to reading, studying or
analysing the contracts. He should be more careful about
them.“

Binding Theory Constraints:
– „Messi said that the father himself manages his finances.“

51
Preferences in Pronoun Interpretation


Recency:
– „Messi played in the match against Brazil last week. Argentina
will play against Chile this week. They hope to win it.“

Grammatical role: subject position is more salient than object
position
– „Argentina played a match against Brazil last week. They
won.“

Repeated mention: focused entities are likely to continue to be
focused
– „Argentina played a match against Brazil last week. They also
played matches against Chile and Peru. But Uruguay didn't
want to play against them.“

52
Preferences in Pronoun Interpretation


Parallelism: preferences can be induced by parallelism effects.
– „Argentina's first match was against Brazil in the last
America Cup. Chile will also play against them this year.“

Verb semantics: interpretation can be biased by semantically
oriented emphasis.
– „Argentina beat Brazil. They won.“
– „Argentina lost to Brazil. They won.“

Selectional restrictions: semantic role can play a role in
referent preferences.
– „Messi said he signed the documents related to his finances
without reading them.“

53
Algorithms for Anaphora resolution


Hobbs

Log-linear


Centering (check the book)

54
Hobbs algorithm


Relies on
– Syntactic parser
– Morphological gender and number checker

Input: the pronoun to be resolved

Output: a noun phrase

55
Hobbs algorithm (Breadth-first, left-to-right search)

I only worried about playing football, Messi told the judge.

(ROOT
  (S
    (S
      (NP (PRP I))
      (ADVP (RB only))
      (VP (VBN worried)
        (PP (IN about)
          (S
            (VP (VBG playing)
              (NP (NN football)))))))
    (, ,)
    (NP (NNP Messi))
    (VP (VBD told)
      (NP (DT the) (NN judge)))
    (. .)))

But she did not believe a word of it.

(ROOT
  (S (CC But)
    (NP (PRP she))
    (VP (VBD did) (RB not)
      (VP (VB believe)
        (NP
          (NP (DT a) (NN word))
          (PP (IN of)
            (NP (PRP it))))))
    (. .)))

56
Hobbs algorithm (Breadth-first, left-to-right search)

[Same sentences and parse trees as on the previous slide. Annotation: no NPs
are found while searching the tree of the sentence containing „she“.]

57
Hobbs algorithm (Breadth-first, left-to-right search)

[Same sentences and parse trees. Annotation: all NPs of the preceding
sentence are then checked, breadth-first and left to right.]

58
Hobbs algorithm


Advantages:
– The search order considers binding theory, recency
and grammatical role
– A final check accounts for gender, person and number
constraints

Disadvantages:
– Does not have an explicit discourse model

59
Log-Linear Algorithm


Supervised machine learning approach

Log-linear, but any other ML algorithm can also be used

Must rely on
– annotated data with positive and negative examples
– full parser or chunker

Usually, pleonastic pronouns are removed, e.g., „It is raining.“


Input: pair of NP and pronoun

Output: 0 or 1 (binary classification)

60
Features for ML


Strict number (true/false)

Strict gender (true/false)

Regarding pronoun and NP
– Compatible number (true/false)
– Compatible gender (true/false)
– Sentence distance [0,1,2,3,...]: # of sentences
– Hobbs distance [0,1,2,3,...]: # of skipped NPs
– Grammatical role [subject,object,PP] of the potential
antecedent
– Linguistic form [proper, definite, indefinite, pronoun] of the
potential antecedent

61
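A sketch of how such a feature vector might be assembled for one
(antecedent NP, pronoun) pair. The Mention class and its field encodings are
assumptions for illustration, and the Hobbs distance is taken as given:

from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    number: str    # "sg" or "pl"
    gender: str    # "m", "f" or "n"
    sentence: int  # index of the sentence in the document
    role: str      # "subject", "object" or "PP"
    form: str      # "proper", "definite", "indefinite" or "pronoun"

def features(antecedent: Mention, pronoun: Mention, hobbs_distance: int) -> dict:
    return {
        "compatible_number": antecedent.number == pronoun.number,
        "compatible_gender": antecedent.gender == pronoun.gender,
        "sentence_distance": pronoun.sentence - antecedent.sentence,
        "hobbs_distance": hobbs_distance,  # number of NPs skipped in Hobbs order
        "role": antecedent.role,
        "form": antecedent.form,
    }

messi = Mention("Messi", "sg", "m", 0, "subject", "proper")
he = Mention("he", "sg", "m", 1, "subject", "pronoun")
print(features(messi, he, hobbs_distance=0))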
Coreference Resolution


Deal with definite noun phrases and names
– „Messi“, „the player“, „the football star“, „Barcelona“,
„Argentina“


We can use a similar log-linear/ML classifier

62
Coreference Resolution


We can rely on the same features for anaphora, plus:
– Anaphor edit distance: minimum edit distance from
potential antecedent to anaphor
– Antecedent edit distance: minimum edit distance from
anaphor to antecedent
– Alias (true/false) based on named-entity recognition
– Appositive (true/false): e.g., „Lionel Messi, the Barcelona
football star, ….“
– Linguistic form [proper, definite, indefinite, pronoun]

63
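The two edit-distance features above rely on the minimum (Levenshtein) edit
distance; below is a compact dynamic-programming version, a generic textbook
implementation rather than code from the lecture:

def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("the football star", "the player"))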
Outline


Introduction

Text Segmentation

Text Coherence

Reference Resolution

Evaluation

64
Evaluation of Coreference Resolution


B-CUBED
– Reference chain (or true chain)
– Hypothesis chain
– Computation of precision and recall of entities in the
hypothesis against the reference chains

Precision = Σ_{i=1..N} w_i · (# of correct elements in the hypothesis chain
            containing entity_i) / (# of elements in the hypothesis chain
            containing entity_i)

Recall = Σ_{i=1..N} w_i · (# of correct elements in the hypothesis chain
         containing entity_i) / (# of elements in the reference chain
         containing entity_i)

65
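A minimal sketch of B-CUBED with uniform weights w_i = 1/N (the weighting is
left open by the metric). Chains are given as sets of mention strings, and
every entity is assumed to appear in both chain sets:

def b_cubed(reference, hypothesis):
    entities = sorted(set().union(*reference))
    n = len(entities)
    precision = recall = 0.0
    for e in entities:
        ref_chain = next(c for c in reference if e in c)
        hyp_chain = next(c for c in hypothesis if e in c)
        correct = len(ref_chain & hyp_chain)  # correct elements for this entity
        precision += correct / len(hyp_chain)
        recall += correct / len(ref_chain)
    return precision / n, recall / n

ref = [{"Messi", "he", "the player"}, {"Jorge", "his father"}]
hyp = [{"Messi", "he"}, {"the player", "Jorge", "his father"}]
print(b_cubed(ref, hyp))  # (0.733..., 0.733...)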
Further Reading


Book „Speech and Language Processing“
– Chapter 21

66
