CHAPTER
24 Discourse Coherence
And even in our wildest and most wandering reveries, nay in our very dreams,
we shall find, if we reflect, that the imagination ran not altogether at adven-
tures, but that there was still a connection upheld among the different ideas,
which succeeded each other. Were the loosest and freest conversation to be
transcribed, there would immediately
be observed something which connected it in all its transitions.
David Hume, An enquiry concerning human understanding, 1748
Orson Welles’ movie Citizen Kane was groundbreaking in many ways, perhaps most
notably in its structure. The story of the life of fictional media magnate Charles
Foster Kane, the movie does not proceed in chronological order through Kane’s
life. Instead, the film begins with Kane’s death (famously murmuring “Rosebud”)
and is structured around flashbacks to his life inserted among scenes of a reporter
investigating his death. The novel idea that the structure of a movie does not have
to linearly follow the structure of the real timeline made apparent for 20th century
cinematography the infinite possibilities and impact of different kinds of coherent
narrative structures.
But coherent structure is not just a fact about movies or works of art. Like
movies, language does not normally consist of isolated, unrelated sentences, but
instead of collocated, structured, coherent groups of sentences. We refer to such
a coherent structured group of sentences as a discourse, and we use the word
coherence to refer to the relationship between sentences that makes real discourses
different than just random assemblages of sentences. The chapter you are now read-
ing is an example of a discourse, as is a news article, a conversation, a thread on
social media, a Wikipedia page, and your favorite novel.
What makes a discourse coherent? If you created a text by taking random sen-
tences each from many different sources and pasted them together, would that be a
coherent discourse? Almost certainly not. Real discourses exhibit both local coherence
and global coherence. Let's consider three ways in which real discourses are
locally coherent.
First, sentences or clauses in real discourses are related to nearby sentences in
systematic ways. Consider this example from Hobbs (1979):
(24.1) John took a train from Paris to Istanbul. He likes spinach.
This sequence is incoherent because it is unclear to a reader why the second
sentence follows the first; what does liking spinach have to do with train trips? In
fact, a reader might go to some effort to try to figure out how the discourse could be
coherent; perhaps there is a French spinach shortage? The very fact that hearers try
to identify such connections suggests that human discourse comprehension involves
the need to establish this kind of coherence.
By contrast, in the following coherent example:
(24.2) Jane took a train from Paris to Istanbul. She had to attend a conference.
the second sentence gives a REASON for Jane’s action in the first sentence. Struc-
tured relationships like REASON that hold between text units are called coherence
relations, and coherent discourses are structured by many such coherence relations.
Coherence relations are introduced in Section 24.1.
A second way a discourse can be locally coherent is by virtue of being “about”
someone or something. In a coherent discourse some entities are salient, and the
discourse focuses on them and doesn’t go back and forth between multiple entities.
This is called entity-based coherence. Consider the following incoherent passage,
in which the salient entity seems to wildly swing from John to Jenny to the piano
store to the living room, back to Jenny, then the piano again:
(24.3) John wanted to buy a piano for his living room.
Jenny also wanted to buy a piano.
He went to the piano store.
It was nearby.
The living room was on the second floor.
She didn’t find anything she liked.
The piano he bought was hard to get up to that floor.
Entity-based coherence models measure this kind of coherence by tracking salient
entities across a discourse. For example, Centering Theory (Grosz et al., 1995), the
most influential theory of entity-based coherence, keeps track of which entities in
the discourse model are salient at any point (salient entities are more likely to be
pronominalized or to appear in prominent syntactic positions like subject or object).
In Centering Theory, transitions between sentences that maintain the same salient
entity are considered more coherent than ones that repeatedly shift between entities.
The entity grid model of coherence (Barzilay and Lapata, 2008) is a commonly
used model that realizes some of the intuitions of the Centering Theory framework.
Entity-based coherence is introduced in Section 24.3.
Finally, discourses can be locally coherent by being topically coherent: nearby
sentences are generally about the same topic and use the same or similar vocab-
ulary to discuss these topics. Because topically coherent discourses draw from a
single semantic field or topic, they tend to exhibit the surface property known as
lexical cohesion (Halliday and Hasan, 1976): the sharing of identical or semanti-
cally related words in nearby sentences. For example, the fact that the words house,
chimney, garret, closet, and window— all of which belong to the same semantic
field— appear in the two sentences in (24.4), or that they share the identical word
shingled, is a cue that the two are tied together as a discourse:
(24.4) Before winter I built a chimney, and shingled the sides of my house...
I have thus a tight shingled and plastered house... with a garret and a
closet, a large window on each side....
In addition to the local coherence between adjacent or nearby sentences, dis-
courses also exhibit global coherence. Many genres of text are associated with
particular conventional discourse structures. Academic articles might have sections
describing the Methodology or Results. Stories might follow conventional plotlines
or motifs. Persuasive essays have a particular claim they are trying to argue for,
and an essay might express this claim together with a structured set of premises that
support the argument and demolish potential counterarguments. We’ll introduce
versions of each of these kinds of global coherence.
Why do we care about the local or global coherence of a discourse? Since co-
herence is a property of a well-written text, coherence detection plays a part in any
24.1 • C OHERENCE R ELATIONS 3
task that requires measuring the quality of a text. For example, coherence can help
in pedagogical tasks like essay grading or essay quality measurement that try to
grade how well-written a human essay is (Somasundaran et al. 2014, Feng et al.
2014, Lai and Tetreault 2018). Coherence can also help in summarization; knowing
the coherence relationships between sentences can help in deciding which informa-
tion to select from them. Finally, detecting incoherent text may even play a role in mental
health tasks like measuring symptoms of schizophrenia or other kinds of disordered
language (Ditman and Kuperberg 2010, Elvevåg et al. 2007, Bedi et al. 2015, Iter
et al. 2018).
Elaboration: The satellite gives additional information or detail about the situation
presented in the nucleus.
(24.8) [NUC Dorothy was from Kansas.] [SAT She lived in the midst of the great
Kansas prairies.]
Evidence: The satellite gives additional information or detail about the situation
presented in the nucleus. The information is presented with the goal of convincing
the reader to accept the information presented in the nucleus.
(24.9) [NUC Kevin must be here.] [SAT His car is parked outside.]
Attribution: The satellite gives the source of attribution for an instance of reported
speech in the nucleus.
(24.10) [SAT Analysts estimated] [NUC that sales at U.S. stores declined in the
quarter, too]
We can also talk about the coherence of a larger text by considering the hierar-
chical structure between coherence relations. Figure 24.1 shows the rhetorical struc-
ture of a paragraph from Marcu (2000a) for the text in (24.12) from the Scientific
American magazine.
(24.12) With its distant orbit–50 percent farther from the sun than Earth–and slim
atmospheric blanket, Mars experiences frigid weather conditions. Surface
temperatures typically average about -60 degrees Celsius (-76 degrees
Fahrenheit) at the equator and can dip to -123 degrees C near the poles. Only
the midday sun at tropical latitudes is warm enough to thaw ice on occasion,
but any liquid water formed in this way would evaporate almost instantly
because of the low atmospheric pressure.
Figure 24.1 A discourse tree for the Scientific American text in (24.12), from Marcu (2000a),
with relations including evidence, background, and elaboration-additional. Note that
asymmetric relations are represented with a curved arrow from the satellite to the nucleus.
The leaves in the Fig. 24.1 tree correspond to text spans of a sentence, clause or
phrase that are called elementary discourse units or EDUs in RST; these units can
also be referred to as discourse segments. Because these units may correspond to
arbitrary spans of text, determining the boundaries of an EDU is an important task
for extracting coherence relations. Roughly speaking, one can think of discourse
Class        Type         Example
TEMPORAL     SYNCHRONOUS  The parishioners of St. Michael and All Angels stop to chat at
                          the church door, as members here always have. (Implicit while)
                          In the tower, five men and women pull rhythmically on ropes
                          attached to the same five bells that first sounded here in 1614.
CONTINGENCY  REASON       Also unlike Mr. Ruder, Mr. Breeden appears to be in a position
                          to get somewhere with his agenda. (Implicit because) As a for-
                          mer White House aide who worked closely with Congress, he is
                          savvy in the ways of Washington.
COMPARISON   CONTRAST     The U.S. wants the removal of what it perceives as barriers to
                          investment; Japan denies there are real barriers.
EXPANSION    CONJUNCTION  Not only do the actors stand outside their characters and make
                          it clear they are at odds with them, but they often literally stand
                          on their heads.
Figure 24.2 The four high-level semantic distinctions in the PDTB sense hierarchy
Temporal
  • Asynchronous
  • Synchronous (Precedence, Succession)
Contingency
  • Cause (Reason, Result)
  • Pragmatic Cause (Justification)
  • Condition (Hypothetical, General, Unreal Present/Past, Factual Present/Past)
  • Pragmatic Condition (Relevance, Implicit Assertion)
Comparison
  • Contrast (Juxtaposition, Opposition)
  • Pragmatic Contrast (Juxtaposition, Opposition)
  • Concession (Expectation, Contra-expectation)
  • Pragmatic Concession
Expansion
  • Exception
  • Instantiation
  • Restatement (Specification, Equivalence, Generalization)
  • Alternative (Conjunction, Disjunction, Chosen Alternative)
  • List
Figure 24.3 The PDTB sense hierarchy. There are four top-level classes, 16 types, and 23 subtypes (not all
types have subtypes). 11 of the 16 types are commonly used for implicit argument classification; the 5 types in
italics are too rare in implicit labeling to be used.
Figure 24.4 Predicting EDU segment beginnings from encoded text: each token's encoder
output passes through a linear layer and a softmax to predict a binary EDU-break label.
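As a minimal sketch of the architecture in Fig. 24.4, EDU segmentation can be framed as per-token binary classification. The PyTorch module below uses a generic bidirectional LSTM as a stand-in encoder; the class name, layer sizes, and embedding dimensions are illustrative assumptions, not a published configuration.

```python
import torch
import torch.nn as nn

class EDUSegmenter(nn.Module):
    """Label each token 1 if it begins a new EDU, else 0 (cf. Fig. 24.4)."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, bidirectional=True,
                               batch_first=True)
        self.linear = nn.Linear(2 * hidden_dim, 2)   # scores for {0, 1}

    def forward(self, token_ids):                    # (batch, seq_len)
        h, _ = self.encoder(self.embed(token_ids))   # (batch, seq_len, 2*hidden)
        return self.linear(h)                        # per-token logits

# Training would apply nn.CrossEntropyLoss to these logits against the
# gold 0/1 EDU-break labels, one label per token.
```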
Figure 24.5 An example RST discourse tree, showing four EDUs {e1, e2, e3, e4}; attr and
elab are discourse relation labels, and arrows indicate the nuclearities of discourse relations.
Figure from Yu et al. (2018).
The decoder is then a feedforward network W that outputs an action o based on a
concatenation of the top three subtrees on the stack (s_0, s_1, s_2) plus the first EDU in
the queue (q_0):

o = W(h^{s_0}_t, h^{s_1}_t, h^{s_2}_t, h^{e}_{q_0})    (24.20)

where the representation of the EDU on the queue h^{e}_{q_0} comes directly from the
encoder, and the three hidden vectors representing partial trees are computed by
average pooling over the encoder output for the EDUs in those trees:

h^{s}_t = \frac{1}{j - i + 1} \sum_{k=i}^{j} h^{e}_k    (24.21)
Training first maps each RST gold parse tree into a sequence of oracle actions, and
then uses the standard cross-entropy loss (with l2 regularization) to train the system
to take such actions. Given a state S and oracle action a, we first compute the decoder
output using Eq. 24.20, then apply a softmax to get probabilities:

p_a = \frac{\exp(o_a)}{\sum_{a' \in A} \exp(o_{a'})}    (24.22)

which are used in the cross-entropy loss:

L_{CE} = -\log(p_a) + \frac{\lambda}{2} \|\Theta\|^2    (24.23)
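To make the shift-reduce mechanics concrete, here is a minimal sketch in Python of the transition system itself, separate from the neural scoring: a shift action moves the next EDU onto the stack as a leaf, and a reduce action merges the top two subtrees under a relation and nuclearity. The Node class and the action-triple format are illustrative assumptions, not the data structures of any particular parser.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """A subtree covering the contiguous EDU span [i, j]."""
    i: int
    j: int
    relation: str = None     # coherence relation, for internal nodes
    nuclearity: str = None   # e.g. "NS", "SN", or "NN"
    children: tuple = ()

def parse(n_edus, actions):
    """Run (action, relation, nuclearity) triples, e.g.
    [("shift", None, None), ("shift", None, None), ("reduce", "elab", "NS")].
    Returns the single tree left on the stack."""
    stack = []
    queue = list(range(n_edus))       # EDU indices waiting to be shifted
    for action, rel, nuc in actions:
        if action == "shift":
            k = queue.pop(0)
            stack.append(Node(k, k))
        else:                         # "reduce": merge the top two subtrees
            right = stack.pop()
            left = stack.pop()
            stack.append(Node(left.i, right.j, rel, nuc, (left, right)))
    assert len(stack) == 1 and not queue
    return stack[0]
```

In training, the oracle action sequence derived from the gold tree plays the role of `actions`; at test time each action is instead chosen by the argmax over the probabilities in Eq. 24.22.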
RST discourse parsers are evaluated on the test section of the RST Discourse Tree-
bank, either with gold EDUs or end-to-end, using the RST-Parseval metrics (Marcu,
2000b). It is standard to first transform the gold RST trees into right-branching bi-
nary trees, and to report four metrics: trees with no labels (S for Span), labeled
with nuclei (N), with relations (R), or both (F for Full), for each metric computing
micro-averaged F1 over all spans from all documents (Marcu 2000b, Morey et al.
2017).
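As a sketch of the S (unlabeled span) metric, micro-averaged F1 over spans can be computed as below; the N, R, and F variants would simply add nuclearity and/or relation labels to each span tuple. Representing each document as a set of (start, end) EDU-index pairs is an illustrative assumption.

```python
def micro_span_f1(gold_docs, pred_docs):
    """gold_docs, pred_docs: parallel lists; each element is the set of
    (start, end) spans in one document's binarized RST tree."""
    tp = sum(len(g & p) for g, p in zip(gold_docs, pred_docs))
    n_gold = sum(len(g) for g in gold_docs)   # pooled over all documents,
    n_pred = sum(len(p) for p in pred_docs)   # hence "micro-averaged"
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)
```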
24.3.1 Centering
Centering Theory (Grosz et al., 1995) is a theory of both discourse salience and
discourse coherence. As a model of discourse salience, Centering proposes that at
any given point in the discourse one of the entities in the discourse model is salient:
it is being “centered” on. As a model of discourse coherence, Centering proposes
that discourses in which adjacent sentences CONTINUE to maintain the same salient
entity are more coherent than those which SHIFT back and forth between multiple
entities (we will see that CONTINUE and SHIFT are technical terms in the theory).
The following two texts from Grosz et al. (1995), which have exactly the same
propositional content but different saliences, can help in understanding the main
Centering intuition.
(24.28) a. John went to his favorite music store to buy a piano.
b. He had frequented the store for many years.
c. He was excited that he could finally buy a piano.
d. He arrived just as the store was closing for the day.
entity. In a RETAIN relation, the speaker intends to SHIFT to a new entity in a future
utterance and meanwhile places the current entity in a lower rank C f . In a SHIFT
relation, the speaker is shifting to a new salient entity.
Let’s walk through the start of (24.28) again, repeated as (24.30), showing the
representations after each utterance is processed.
(24.30) John went to his favorite music store to buy a piano. (U1 )
He was excited that he could finally buy a piano. (U2 )
He arrived just as the store was closing for the day. (U3 )
It was closing just as John arrived (U4 )
Using the grammatical role hierarchy to order the Cf, for sentence U1 we get:
Cf(U1): {John, music store, piano}
Cp(U1): John
Cb(U1): undefined
and then for sentence U2:
Cf(U2): {John, piano}
Cp(U2): John
Cb(U2): John
Result: Continue (Cp(U2)=Cb(U2); Cb(U1) undefined)
The transition from U1 to U2 is thus a CONTINUE. Completing this example is left
as Exercise 24.1 for the reader.
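The toy sketch below shows one way this bookkeeping could be implemented. Each utterance is represented (by assumption, hand-annotated rather than parsed) as a list of its entities ordered by grammatical role, and the code derives Cp, Cb, and a simplified transition label for each utterance: CONTINUE if Cb is unchanged (or was undefined) and equals Cp, RETAIN if Cb is unchanged but differs from Cp, and SHIFT otherwise.

```python
def centering_transitions(utterances):
    """utterances: one entity list per utterance, ordered by grammatical
    role (subject first). Returns a transition label per utterance."""
    labels, cb_prev, cf_prev = [], None, []
    for i, cf in enumerate(utterances):
        cp = cf[0]                                        # preferred center
        cb = next((e for e in cf_prev if e in cf), None)  # backward center
        if i == 0:
            labels.append("undefined")
        elif cb_prev in (cb, None):                       # Cb unchanged/undefined
            labels.append("CONTINUE" if cb == cp else "RETAIN")
        else:
            labels.append("SHIFT")
        cb_prev, cf_prev = cb, cf
    return labels

# (24.30): John stays the salient entity, so U2 and U3 are CONTINUEs.
print(centering_transitions([
    ["John", "music store", "piano"],   # U1
    ["John", "piano"],                  # U2
    ["John", "store"],                  # U3
]))  # -> ['undefined', 'CONTINUE', 'CONTINUE']
```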
Figure 24.8 Part of the entity grid for the text in Fig. 24.9. Entities are listed by their head
noun; each cell represents whether an entity appears as subject (S), object (O), neither (X), or
is absent (–). Figure from Barzilay and Lapata (2008). (The grid's columns include the entities
Department, Trial, Microsoft, Evidence, Competitors, Markets, Products, Brands, Government,
Suit, Earnings, Netscape, Software, Tactics, and Case.)
1 [The Justice Department]S is conducting an [anti-trust trial]O against [Microsoft Corp.]X
  with [evidence]X that [the company]S is increasingly attempting to crush [competitors]O.
2 [Microsoft]O is accused of trying to forcefully buy into [markets]X where [its own
  products]S are not competitive enough to unseat [established brands]O.
3 [The case]S revolves around [evidence]O of [Microsoft]S aggressively pressuring
  [Netscape]O into merging [browser software]O.
4 [Microsoft]S claims [its tactics]S are commonplace and good economically.
5 [The government]S may file [a civil suit]O ruling that [conspiracy]S to curb [competition]O
  through [collusion]X is [a violation of the Sherman Act]O.
6 [Microsoft]S continues to show [increased earnings]O despite [the trial]X.
Figure 24.9 A discourse with the entities marked and annotated with grammatical func-
tions. Figure from Barzilay and Lapata (2008).
Computing the entity grids requires extracting entities and doing coreference
resolution to cluster them into discourse entities (Chapter 23) as well as parsing
the sentences to get grammatical roles. In the resulting grid, columns that are dense
(like the column for Microsoft) indicate entities that are mentioned often in the texts;
sparse columns (like the column for earnings) indicate entities that are mentioned rarely.
In the entity grid model, coherence is measured by patterns of local entity tran-
sition. For example, Department is a subject in sentence 1, and then not men-
tioned in sentence 2; this is the transition [S –]. The transitions are thus sequences
{S, O, X, –}^n which can be extracted as continuous cells from each column. Each
transition has a probability; the probability of [S –] in the grid from Fig. 24.8 is 0.08
(it occurs 6 times out of the 75 total transitions of length two). Fig. 24.10 shows the
distribution over transitions of length 2 for the text of Fig. 24.9 (shown as the first
row d1), and 2 other documents.
     SS   SO   SX   S–   OS   OO   OX   O–   XS   XO   XX   X–   –S   –O   –X   ––
d1   .01  .01  0    .08  .01  0    0    .09  0    0    0    .03  .05  .07  .03  .59
d2   .02  .01  .01  .02  0    .07  0    .02  .14  .14  .06  .04  .03  .07  .1   .36
d3   .02  0    0    .03  .09  0    .09  .06  0    0    0    .05  .03  .07  .17  .39
Figure 24.10 A feature vector for representing documents using all transitions of length 2.
Document d1 is the text in Fig. 24.9. Figure from Barzilay and Lapata (2008).
The transitions and their probabilities can then be used as features for a machine
learning model. This model can be a text classifier trained to produce human-labeled
coherence scores (for example from humans labeling each text as coherent or inco-
herent). But such data is expensive to gather. Barzilay and Lapata (2005) introduced
a simplifying innovation: coherence models can be trained by self-supervision:
trained to distinguish the natural original order of sentences in a discourse from
a modified order (such as a randomized order).
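Here is a minimal sketch of the core entity-grid computation just described: transitions of length n are read as consecutive cells down each column, and their counts are normalized into the probabilities used as feature values in Fig. 24.10. Encoding the grid as a dict from entity to a list of per-sentence role strings is an illustrative assumption.

```python
from collections import Counter
from itertools import product

def transition_probs(grid, n=2):
    """grid: dict mapping entity -> list of roles per sentence, each role
    one of "S", "O", "X", "-".  Returns the probability of every role
    n-gram, read as consecutive cells down each column."""
    counts = Counter()
    for roles in grid.values():
        for i in range(len(roles) - n + 1):
            counts[tuple(roles[i:i + n])] += 1
    total = sum(counts.values())
    return {t: counts[t] / total for t in product("SOX-", repeat=n)}

# Toy grid in the style of Fig. 24.8 (3 sentences, 3 entities):
grid = {"Microsoft": ["S", "O", "S"],
        "trial":     ["O", "-", "-"],
        "evidence":  ["X", "-", "O"]}
probs = transition_probs(grid)
print(probs[("S", "O")], probs[("O", "-")])   # feature values for [S O], [O -]
```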
a subtopic have high cosine with each other, but not with sentences in a neighboring
subtopic.
A third early model, the LSA Coherence method of Foltz et al. (1998), was the
first to use embeddings, modeling the coherence between two sentences as the co-
sine between their LSA sentence embedding vectors¹, computing embeddings for a
sentence s by summing the embeddings of its words w:

sim(s,t) = \cos(s,t) = \cos\Big(\sum_{w \in s} w, \; \sum_{w \in t} w\Big)    (24.31)
and defining the overall coherence of a text as the average similarity over all pairs of
adjacent sentences s_i and s_{i+1}:

coherence(T) = \frac{1}{n-1} \sum_{i=1}^{n-1} \cos(s_i, s_{i+1})    (24.32)
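A minimal sketch of Eqs. 24.31–24.32 with numpy, assuming some pretrained word-embedding lookup `emb` (here a plain dict from words to vectors; the original method used LSA embeddings):

```python
import numpy as np

def sentence_vec(sentence, emb):
    """Embed a sentence by summing its word vectors (Eq. 24.31)."""
    return np.sum([emb[w] for w in sentence.split()], axis=0)

def coherence(text, emb):
    """Average cosine between adjacent sentence vectors (Eq. 24.32).
    text: list of sentence strings; emb: dict word -> numpy vector."""
    vecs = [sentence_vec(s, emb) for s in text]
    cosines = [np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
               for u, v in zip(vecs, vecs[1:])]
    return np.mean(cosines)
```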
E_{p(s'|s_i)} is the expectation with respect to the negative sampling distribution con-
ditioned on s_i: given a sentence s_i the algorithm samples a negative sentence s'
1 See Chapter 6 for more on LSA embeddings; they are computed by applying SVD to the term-
document matrix (each cell weighted by log frequency and normalized by entropy), and then the first
300 dimensions are used as the embedding.
Figure 24.12 Argumentation structure of a persuasive essay. Arrows indicate argumentation relations, ei-
ther of SUPPORT (with arrowheads) or ATTACK (with circleheads); P denotes premises. Figure from Stab and
Gurevych (2017).
annotation scheme for modeling these rhetorical goals is the argumentative zon-
ing model of Teufel et al. (1999) and Teufel et al. (2009), which is informed by the
idea that each scientific paper tries to make a knowledge claim about a new piece
of knowledge being added to the repository of the field (Myers, 1992). Sentences
in a scientific paper can be assigned one of 15 tags; Fig. 24.13 shows 7 (shortened)
examples of labeled sentences.
Teufel et al. (1999) and Teufel et al. (2009) develop labeled corpora of scientific
articles from computational linguistics and chemistry, which can be used as supervi-
sion for training a standard sentence-classification architecture to assign the 15 labels.
24.6 Summary
In this chapter we introduced local and global models for discourse coherence.
• Discourses are not arbitrary collections of sentences; they must be coherent.
Among the factors that make a discourse coherent are coherence relations
between the sentences, entity-based coherence, and topical coherence.
• Various sets of coherence relations and rhetorical relations have been pro-
posed. The relations in Rhetorical Structure Theory (RST) hold between
spans of text and are structured into a tree. Because of this, shift-reduce
and other parsing algorithms are generally used to assign these structures.
The Penn Discourse Treebank (PDTB) labels only relations between pairs of
spans, and the labels are generally assigned by sequence models.
• Entity-based coherence captures the intuition that discourses are about an
entity, and continue mentioning the entity from sentence to sentence. Cen-
tering Theory is a family of models describing how salience is modeled for
discourse entities, and hence how coherence is achieved by virtue of keeping
the same discourse entities salient over the discourse. The entity grid model
gives a more bottom-up way to compute which entity realization transitions
lead to coherence.
Exercises
24.1 Finish the Centering Theory processing of the last two utterances of (24.30),
and show how (24.29) would be processed. Does the algorithm indeed mark
(24.29) as less coherent?
24.2 Select an editorial column from your favorite newspaper, and determine the
discourse structure for a 10–20 sentence portion. What problems did you
encounter? Were you helped by superficial cues the speaker included (e.g.,
discourse connectives) in any places?
Althoff, T., C. Danescu-Niculescu-Mizil, and D. Jurafsky. 2014. How to ask for a favor: A case study on the success of altruistic requests. ICWSM 2014.
Asher, N. 1993. Reference to Abstract Objects in Discourse. Studies in Linguistics and Philosophy (SLAP) 50, Kluwer.
Asher, N. and A. Lascarides. 2003. Logics of Conversation. Cambridge University Press.
Baldridge, J., N. Asher, and J. Hunter. 2007. Annotation for and robust parsing of discourse structure on unrestricted texts. Zeitschrift für Sprachwissenschaft, 26:213–239.
Bamman, D., B. O’Connor, and N. A. Smith. 2013. Learning latent personas of film characters. ACL.
Barzilay, R. and M. Lapata. 2005. Modeling local coherence: An entity-based approach. ACL.
Barzilay, R. and M. Lapata. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34.
Barzilay, R. and L. Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. HLT-NAACL.
Bedi, G., F. Carrillo, G. A. Cecchi, D. F. Slezak, M. Sigman, N. B. Mota, S. Ribeiro, D. C. Javitt, M. Copelli, and C. M. Corcoran. 2015. Automated analysis of free speech predicts psychosis onset in high-risk youths. npj Schizophrenia, 1.
Biran, O. and K. McKeown. 2015. PDTB discourse parsing as a tagging task: The two taggers approach. SIGDIAL.
Braud, C., M. Coavoux, and A. Søgaard. 2017. Cross-lingual RST discourse parsing. EACL.
Brennan, S. E., M. W. Friedman, and C. Pollard. 1987. A centering approach to pronouns. ACL.
Carlson, L. and D. Marcu. 2001. Discourse tagging manual. Technical Report ISI-TR-545, ISI.
Carlson, L., D. Marcu, and M. E. Okurowski. 2001. Building a discourse-tagged corpus in the framework of rhetorical structure theory. SIGDIAL.
Chafe, W. L. 1976. Givenness, contrastiveness, definiteness, subjects, topics, and point of view. In C. N. Li, ed., Subject and Topic, 25–55. Academic Press.
Chen, E., B. Snyder, and R. Barzilay. 2007. Incremental text structuring with online hierarchical ranking. EMNLP/CoNLL.
Cialdini, R. B. 1984. Influence: The psychology of persuasion. Morrow.
Ditman, T. and G. R. Kuperberg. 2010. Building coherence: A framework for exploring the breakdown of links across clause boundaries in schizophrenia. Journal of Neurolinguistics, 23(3):254–269.
Elsner, M., J. Austerweil, and E. Charniak. 2007. A unified local and global model for discourse coherence. NAACL-HLT.
Elsner, M. and E. Charniak. 2008. Coreference-inspired coherence modeling. ACL.
Elsner, M. and E. Charniak. 2011. Extending the entity grid with entity-specific features. ACL.
Elvevåg, B., P. W. Foltz, D. R. Weinberger, and T. E. Goldberg. 2007. Quantifying incoherence in speech: an automated methodology and novel application to schizophrenia. Schizophrenia Research, 93(1-3):304–316.
Feng, V. W. and G. Hirst. 2011. Classifying arguments by scheme. ACL.
Feng, V. W. and G. Hirst. 2014. A linear-time bottom-up discourse parser with constraints and post-editing. ACL.
Feng, V. W., Z. Lin, and G. Hirst. 2014. The impact of deep hierarchical discourse structures in the evaluation of text coherence. COLING.
Finlayson, M. A. 2016. Inferring Propp’s functions from semantically annotated text. The Journal of American Folklore, 129(511):55–77.
Foltz, P. W., W. Kintsch, and T. K. Landauer. 1998. The measurement of textual coherence with latent semantic analysis. Discourse Processes, 25(2-3):285–307.
Grosz, B. J. 1977. The representation and use of focus in a system for understanding dialogs. IJCAI-77. Morgan Kaufmann.
Grosz, B. J., A. K. Joshi, and S. Weinstein. 1983. Providing a unified account of definite noun phrases in English. ACL.
Grosz, B. J., A. K. Joshi, and S. Weinstein. 1995. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–225.
Guinaudeau, C. and M. Strube. 2013. Graph-based local coherence modeling. ACL.
Habernal, I. and I. Gurevych. 2016. Which argument is more convincing? Analyzing and predicting convincingness of Web arguments using bidirectional LSTM. ACL.
Habernal, I. and I. Gurevych. 2017. Argumentation mining in user-generated web discourse. Computational Linguistics, 43(1):125–179.
Halliday, M. A. K. and R. Hasan. 1976. Cohesion in English. Longman. English Language Series, Title No. 9.
Hearst, M. A. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23:33–64.
Hernault, H., H. Prendinger, D. A. duVerle, and M. Ishizuka. 2010. HILDA: A discourse parser using support vector machine classification. Dialogue & Discourse, 1(3).
Hidey, C., E. Musi, A. Hwang, S. Muresan, and K. McKeown. 2017. Analyzing the semantic types of claims and premises in an online persuasive forum. 4th Workshop on Argument Mining.
Hobbs, J. R. 1979. Coherence and coreference. Cognitive Science, 3:67–90.
Hovy, E. H. 1990. Parsimonious and profligate approaches to the question of discourse structure relations. Proceedings of the 5th International Workshop on Natural Language Generation.
Iter, D., K. Guu, L. Lansing, and D. Jurafsky. 2020. Pretraining with contrastive sentence objectives improves discourse performance of language models. ACL.
Iter, D., J. Yoon, and D. Jurafsky. 2018. Automatic detection of incoherent speech for diagnosing schizophrenia. Fifth Workshop on Computational Linguistics and Clinical Psychology.
Ji, Y. and J. Eisenstein. 2014. Representation learning for text-level discourse parsing. ACL.
Ji, Y. and J. Eisenstein. 2015. One vector is not enough: Entity-augmented distributed semantics for discourse relations. TACL, 3:329–344.
Joshi, A. K. and S. Kuhn. 1979. Centered logic: The role of entity centered sentence representation in natural language inferencing. IJCAI-79.
Joshi, A. K. and S. Weinstein. 1981. Control of inference: Role of some aspects of discourse structure – centering. IJCAI-81.
Joty, S., G. Carenini, and R. T. Ng. 2015. CODRA: A novel discriminative framework for rhetorical analysis. Computational Linguistics, 41(3):385–435.
Karamanis, N., M. Poesio, C. Mellish, and J. Oberlander. 2004. Evaluating centering-based metrics of coherence for text structuring using a reliably annotated corpus. ACL.
Kehler, A. 1993. The effect of establishing coherence in ellipsis and anaphora resolution. ACL.
Kehler, A. 1994. Temporal relations: Reference or discourse coherence? ACL.
Kehler, A. 2000. Coherence, Reference, and the Theory of Grammar. CSLI Publications.
Kintsch, W. and T. A. Van Dijk. 1978. Toward a model of text comprehension and production. Psychological Review, 85(5):363–394.
Knott, A. and R. Dale. 1994. Using linguistic phenomena to motivate a set of coherence relations. Discourse Processes, 18(1):35–62.
Lai, A. and J. Tetreault. 2018. Discourse coherence in the wild: A dataset, evaluation and methods. SIGDIAL.
Lakoff, G. 1972. Structural complexity in fairy tales. In The Study of Man, 128–50. School of Social Sciences, University of California, Irvine, CA.
Lapata, M. 2003. Probabilistic text structuring: Experiments with sentence ordering. ACL.
Lascarides, A. and N. Asher. 1993. Temporal interpretation, discourse relations, and common sense entailment. Linguistics and Philosophy, 16(5):437–493.
Li, J. and D. Jurafsky. 2017. Neural net models of open-domain discourse coherence. EMNLP.
Li, J., R. Li, and E. H. Hovy. 2014. Recursive deep models for discourse parsing. EMNLP.
Li, Q., T. Li, and B. Chang. 2016. Discourse parsing with attention-based hierarchical neural networks. EMNLP.
Lin, Z., M.-Y. Kan, and H. T. Ng. 2009. Recognizing implicit discourse relations in the Penn Discourse Treebank. EMNLP.
Lin, Z., H. T. Ng, and M.-Y. Kan. 2011. Automatically evaluating text coherence using discourse relations. ACL.
Lin, Z., H. T. Ng, and M.-Y. Kan. 2014. A PDTB-styled end-to-end discourse parser. Natural Language Engineering, 20(2):151–184.
Logeswaran, L., H. Lee, and D. Radev. 2018. Sentence ordering and coherence modeling using recurrent neural networks. AAAI.
Louis, A. and A. Nenkova. 2012. A coherence model based on syntactic patterns. EMNLP.
Lukasik, M., B. Dadachev, K. Papineni, and G. Simões. 2020. Text segmentation by cross segment attention. EMNLP.
Mann, W. C. and S. A. Thompson. 1987. Rhetorical structure theory: A theory of text organization. Technical Report RS-87-190, Information Sciences Institute.
Marcu, D. 1997. The rhetorical parsing of natural language texts. ACL.
Marcu, D. 1999. A decision-based approach to rhetorical parsing. ACL.
Marcu, D. 2000a. The rhetorical parsing of unrestricted texts: A surface-based approach. Computational Linguistics, 26(3):395–448.
Marcu, D., ed. 2000b. The Theory and Practice of Discourse Parsing and Summarization. MIT Press.
Marcu, D. and A. Echihabi. 2002. An unsupervised approach to recognizing discourse relations. ACL.
Mesgar, M. and M. Strube. 2016. Lexical coherence graph modeling using word embeddings. ACL.
Miltsakaki, E., R. Prasad, A. K. Joshi, and B. L. Webber. 2004. The Penn Discourse Treebank. LREC.
Morey, M., P. Muller, and N. Asher. 2017. How much progress have we made on RST discourse parsing? A replication study of recent results on the RST-DT. EMNLP.
Morris, J. and G. Hirst. 1991. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17(1):21–48.
Muller, P., C. Braud, and M. Morey. 2019. ToNy: Contextual embeddings for accurate multilingual discourse segmentation of full documents. Workshop on Discourse Relation Parsing and Treebanking.
Musi, E., M. Stede, L. Kriese, S. Muresan, and A. Rocci. 2018. A multi-layer annotated corpus of argumentative text: From argument schemes to discourse relations. LREC.
Myers, G. 1992. “In this paper we report...”: Speech acts and scientific facts. Journal of Pragmatics, 17(4):295–313.
Nguyen, D. T. and S. Joty. 2017. A neural local coherence model. ACL.
Nie, A., E. Bennett, and N. Goodman. 2019. DisSent: Learning sentence representations from explicit discourse relations. ACL.
Park, J. and C. Cardie. 2014. Identifying appropriate support for propositions in online user comments. First Workshop on Argumentation Mining.
Peldszus, A. and M. Stede. 2013. From argument diagrams to argumentation mining in texts: A survey. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 7(1):1–31.
Peldszus, A. and M. Stede. 2016. An annotated corpus of argumentative microtexts. 1st European Conference on Argumentation.
Pitler, E., A. Louis, and A. Nenkova. 2009. Automatic sense prediction for implicit discourse relations in text. ACL IJCNLP.
Pitler, E. and A. Nenkova. 2009. Using syntax to disambiguate explicit discourse connectives in text. ACL IJCNLP.
Poesio, M., R. Stevenson, B. Di Eugenio, and J. Hitzeman. 2004. Centering: A parametric theory and its instantiations. Computational Linguistics, 30(3):309–363.
Polanyi, L. 1988. A formal model of the structure of discourse. Journal of Pragmatics, 12.
Polanyi, L., C. Culy, M. van den Berg, G. L. Thione, and D. Ahn. 2004. A rule based approach to discourse parsing. Proceedings of SIGDIAL.
Prasad, R., N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. K. Joshi, and B. L. Webber. 2008. The Penn Discourse TreeBank 2.0. LREC.
Prasad, R., B. L. Webber, and A. Joshi. 2014. Reflections on the Penn Discourse Treebank, comparable corpora, and complementary annotation. Computational Linguistics, 40(4):921–950.
Propp, V. 1968. Morphology of the Folktale, 2nd edition. University of Texas Press. Original Russian 1928. Translated by Laurence Scott.
Qin, L., Z. Zhang, and H. Zhao. 2016. A stacking gated neural architecture for implicit discourse relation classification. EMNLP.
Qin, L., Z. Zhang, H. Zhao, Z. Hu, and E. Xing. 2017. Adversarial connective-exploiting networks for implicit discourse relation classification. ACL.
Reed, C., R. Mochales Palau, G. Rowe, and M.-F. Moens. 2008. Language resources for studying argument. LREC.
Rosenthal, S. and K. McKeown. 2017. Detecting influencers in multiple online genres. ACM Transactions on Internet Technology (TOIT), 17(2).
Rutherford, A. and N. Xue. 2015. Improving the inference of implicit discourse relations via classifying explicit discourse connectives. NAACL HLT.
Sagae, K. 2009. Analysis of discourse structure with syntactic dependencies and data-driven shift-reduce parsing. IWPT-09.
Scha, R. and L. Polanyi. 1988. An augmented context free grammar for discourse. COLING.
Sidner, C. L. 1979. Towards a computational theory of definite anaphora comprehension in English discourse. Technical Report 537, MIT Artificial Intelligence Laboratory, Cambridge, MA.
Sidner, C. L. 1983. Focusing in the comprehension of definite anaphora. In M. Brady and R. C. Berwick, eds, Computational Models of Discourse, 267–330. MIT Press.
Somasundaran, S., J. Burstein, and M. Chodorow. 2014. Lexical chaining for measuring discourse coherence quality in test-taker essays. COLING.
Soricut, R. and D. Marcu. 2003. Sentence level discourse parsing using syntactic and lexical information. HLT-NAACL.
Soricut, R. and D. Marcu. 2006. Discourse generation using utility-trained coherence models. COLING/ACL.
Sporleder, C. and A. Lascarides. 2005. Exploiting linguistic cues to classify rhetorical relations. RANLP-05.
Sporleder, C. and M. Lapata. 2005. Discourse chunking and its application to sentence compression. EMNLP.
Stab, C. and I. Gurevych. 2014a. Annotating argument components and relations in persuasive essays. COLING.
Stab, C. and I. Gurevych. 2014b. Identifying argumentative discourse structures in persuasive essays. EMNLP.
Stab, C. and I. Gurevych. 2017. Parsing argumentation structures in persuasive essays. Computational Linguistics, 43(3):619–659.
Stede, M. 2011. Discourse Processing. Morgan & Claypool.
Stede, M. and J. Schneider. 2018. Argumentation Mining. Morgan & Claypool.
Subba, R. and B. Di Eugenio. 2009. An effective discourse parser that uses rich linguistic information. NAACL HLT.
Surdeanu, M., T. Hicks, and M. A. Valenzuela-Escarcega. 2015. Two practical rhetorical structure theory parsers. NAACL HLT.
Tan, C., V. Niculae, C. Danescu-Niculescu-Mizil, and L. Lee. 2016. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. WWW-16.
Teufel, S., J. Carletta, and M. Moens. 1999. An annotation scheme for discourse-level argumentation in research articles. EACL.
Teufel, S., A. Siddharthan, and C. Batchelor. 2009. Towards domain-independent argumentative zoning: Evidence from chemistry and computational linguistics. EMNLP.
Walker, M. A., A. K. Joshi, and E. Prince, eds. 1998. Centering in Discourse. Oxford University Press.
Wang, Y., S. Li, and J. Yang. 2018. Toward fast and accurate neural discourse segmentation. EMNLP.
Webber, B. L., M. Egg, and V. Kordoni. 2012. Discourse structure and language technology. Natural Language Engineering, 18(4):437–490.
Wolf, F. and E. Gibson. 2005. Representing discourse coherence: A corpus-based analysis. Computational Linguistics, 31(2):249–287.
Xu, P., H. Saghir, J. S. Kang, T. Long, A. J. Bose, Y. Cao, and J. C. K. Cheung. 2019. A cross-domain transferable neural coherence model. ACL.
Xue, N., H. T. Ng, S. Pradhan, A. Rutherford, B. L. Webber, C. Wang, and H. Wang. 2016. CoNLL 2016 shared task on multilingual shallow discourse parsing. CoNLL-16 shared task.
Yang, D., J. Chen, Z. Yang, D. Jurafsky, and E. H. Hovy. 2019. Let’s make your request more persuasive: Modeling persuasive strategies via semi-supervised neural nets on crowdfunding platforms. NAACL HLT.
Yu, N., M. Zhang, and G. Fu. 2018. Transition-based neural RST parsing with implicit syntax features. COLING.
Yu, Y., Y. Zhu, Y. Liu, Y. Liu, S. Peng, M. Gong, and A. Zeldes. 2019. GumDrop at the DISRPT2019 shared task: A model stacking approach to discourse unit segmentation and connective detection. Workshop on Discourse Relation Parsing and Treebanking 2019.
Zhou, Y. and N. Xue. 2015. The Chinese Discourse TreeBank: a Chinese corpus annotated with discourse relations. Language Resources and Evaluation, 49(2):397–431.