Sprouse - A Program for Experimental Syntax
Jon Sprouse
Doctor of Philosophy, 2007
collection, as well as the reliability of the results. It seems, though, that the past
several years have seen an increase in the number of studies employing formal exper-
imental techniques for the collection of acceptability judgments, so much so that the
term experimental syntax has come to be applied to the use of those techniques. The
question this dissertation asks is whether the extent of the utility of experimental syn-
tax is to find areas in which informal judgment collection was insufficient, or whether
there is a complementary research program for experimental syntax that is more than
tax. This dissertation is a first attempt at a tentative yes: the tools of experimental
syntax can be used to explore the relationship between acceptability judgments and
the form or nature of grammatical knowledge, not just the content of grammatical
knowledge. This dissertation begins by identifying several recent claims about the na-
ture of grammatical knowledge that have been made based upon hypotheses about the
nature of acceptability judgments. Each chapter applies the tools of experimental syn-
that acceptability judgments have nothing further to contribute to debates over the
number and nature of dependency forming operations. Using wh-movement and Is-
land effects as the empirical basis of the research, the results of these studies suggest
that the relationship between acceptability and grammatical knowledge is much more
program for experimental syntax that is independent of simple data collection: only
through the tools of experimental syntax can we achieve a better understanding of the
by
Jon Sprouse
Advisory Committee:
Professor Howard Lasnik, Chair
Professor Norbert Hornstein
Associate Professor Jeffrey Lidz
Associate Professor Colin Phillips
Professor Nan Ratner
© Copyright by
Jon Sprouse
2007
ACKNOWLEDGMENTS
Words cannot express the gratitude, and appreciation, I feel for all of the
wonderful people in my life, but this page will have to be a feeble attempt. I will
apologize now for any sentiments I forget to express, or any names I forget to mention.
Luckily, if you have been an important part of my life and are reading this dissertation,
First and foremost, I owe more than I can say to my advisor Howard Lasnik. I
was once told by a close friend that Howard is the second greatest living syntactician.
I am not qualified to judge that statement (although anyone who has read Howard’s
work can attest to his brilliance), but I can say this: He is by far one of the greatest
living advisors, both within the field of linguistics and without, and the best teacher
Although he would never admit it, Norbert Hornstein also deserves a large
debt of gratitude for guiding me through graduate school. Always content to let
others take the credit, Norbert is a de facto advisor to all of the students in the
department, and in many ways a role model for all of us. He has that rare ability to
find the best in everyone, and coax it out (by force if necessary). I can only hope to
Robert Freidin also deserves the credit (or blame) for pulling me into the field
for that I will always be grateful. While it would have been easy enough for him to
stop advising me after I moved on to graduate school, he has always been willing to
offer advice and encouragement, accepting me as a colleague despite the fact that I
am nowhere near his equal. Though I can never thank him appropriately for this, I
can promise to remember his example while teaching my own undergraduate students
one day.
friends. I won’t name them all here (you know who you are), but from conversations
to gym trips, smoothie runs to Moe’s burritos, it has been an incredible four years.
And although this dissertation signifies my departure from you all too soon, you
And of course, I must thank my parents. Although they still can’t understand
why I have gone to school for this long and am not yet a millionaire, they have
always been willing to let me follow my own path. I’ve learned in recent years that
such freedom is a true rarity. I can only hope to remember that if I ever have children
of my own.
am bad with words, and I fear that I only have one good attempt at expressing my
love for you in me, so I will save that for the wedding. However, you should know
that I could never - never - have finished this, or anything, without you. You are my
best friend and the best thing that has ever happened to me. I can only hope that
Table of Contents
List of Tables
List of Figures
1 Introduction
4.4.1 Magnitude estimation and balanced designs
4.4.2 The memory confound
4.4.3 Magnitude estimation and unbalanced designs
4.5 Conclusion
4.6 Some remaining questions
4.6.1 Whether Islands versus the That-trace effect
4.6.2 What about the models of satiation?
7 Conclusion
List of Tables
2.9 Qualitative normality results for few observations per participant, large sample size
2.10 Qualitative normality results for few observations per participant, small sample size
3.7 Two-way repeated measures ANOVAs using ordinal and rank data
4.1 Violations tested in Snyder 2000
4.2 Summary of results for Snyder 2000, Hiramatsu 2000, and Goodall 2005
5.1 Reading times (in ms) at critical words for each dependency, from Stowe 1986
5.2 Reading times at critical words by filler type, from Pickering and Traxler 2003
5.3 Mean complexity ratings for short and long movement, from Phillips et al. 2005
6.5 Mean complexity ratings for short and long movement, from Phillips et al. 2005
List of Figures
6.1 Overt versus covert Island effects following Huang 1982
Chapter 1
Introduction
As the extensive review by Schütze 1996 demonstrates, there has always been
liability of the results. It seems, though, that the past several years have seen an
increase in the number of studies employing formal experimental techniques for the
has come to be applied to the use of those techniques (perhaps from the title of Cow-
art 1997). Indeed, there have been several journal articles in recent years advocating
the use of such techniques by demonstrating the new data that they may reveal, and
how this new data prompts reconsideration of theoretical analyses. The question this
dissertation asks is whether that is the extent of the utility of experimental syntax -
to find areas in which informal judgment collection was insufficient - or whether there
is a complementary research program for experimental syntax that is more than just
This dissertation is a first attempt at a tentative yes: the tools of experimental syntax
can be used to explore the relationship between acceptability judgments and the form
1. What is the content of grammatical knowledge?
Potential answers to the first question arise in any theoretical syntax paper: Island
language Z, et cetera. The potential answers to the second question are much more
complicated, and rarely appear in any single analysis. In fact, they usually form the
foundation for the different approaches to syntactic theory: transformations are or are
form the primary data for most syntactic analyses. However, applying experimental
syntax to the second question is less straightforward, as it is not always clear what
role individual acceptability facts play in supporting these claims about grammatical
knowledge.
This dissertation begins by identifying several recent claims about the nature of
grammatical knowledge that have been made based upon hypotheses about the nature
between acceptability and grammatical knowledge. While each chapter has its own
experimental syntax that is independent of simple data collection: only through the
acceptability, and how it relates to the nature of grammatical knowledge.
question formation known as Island effects (Ross 1967). As will soon become obvious,
there is no shortage of claims about the nature of the grammar that have been made
on the basis of the acceptability of wh-questions. This is, of course, not entirely sur-
prising given the amount of research that has been done on wh-questions over the
past forty years. The topics of the subsequent chapters are briefly summarized
below.
who are unfamiliar with the similarities and differences between the two techniques. Second,
chapter 2 provides a critical assessment of the claim that magnitude estimation
provides ‘better’ data than other acceptability collection techniques, focusing on the
assertion that magnitude estimation provides interval level acceptability data. The
results of that assessment suggest that linguistic magnitude estimation does not pro-
vide interval level data, and in fact, is more like the standard ordinal rating tasks
knowledge is gradient. The gradience claim is by no means new, but has received
added attention over the past few years thanks in no small part to magnitude es-
timation and the continuous data that it yields (see especially Keller 2000, 2003,
Sorace and Keller 2005). While obtaining gradient data from a continuous task such
One of the recurrent questions facing syntactic research is whether the fact
that acceptability judgments are given for isolated sentences out of their natural lin-
guistic context affects the results. This chapter lays out several possible ways in which
the presence or absence of context may affect the results of acceptability studies, and
what effect each could have on theories of grammatical knowledge. Because the effect
on a case by case basis, this chapter then presents 3 case studies of the effect of vari-
there is an effect, and ii) if so, what the underlying source of the effect may be. These
case studies serve a dual purpose: they serve as a first attempt at systematically
investigating the effect of context on properties of wh-movement, and, given that wh-
movement serves as the empirical object of study in the rest of the dissertation, they
least the types of context studied here do not affect major properties of wh-movement
such as Island effects and D-linking effects, indicating that context need not be in-
cluded as a factor in subsequent studies of Island effects in the rest of the dissertation.
that is more acceptable, after several repetitions. The fact that some violations
satiate while others do not has been interpreted in the literature as an indication of
different underlying sources of the unacceptability. In other words, the nature of the
violation (or the nature of the grammatical knowledge) affects its relationship with
This chapter presents evidence that the reported satiation data is not robust,
in that it is not replicable. Several experiments are presented that attempt to track
down the reason for the replication problem, in the process teasing apart the various
factors that may contribute to the satiation effect. In the end, the results suggest that
Linguists have agreed since at least Chomsky 1965 that acceptability judg-
ments are too coarse grained to distinguish between effects of grammatical knowledge
(what Chomsky 1965 would call competence effects) and effects of implementing that
knowledge (or performance effects). With the rise of experimental methodologies for
collecting acceptability judgments, there has been a renewed interest in attempting
to acceptability judgments. For instance, Fanselow and Frisch 2004 report that lo-
cal ambiguity in German can lead to increases in acceptability, suggesting that the
momentary possibility of two representations can affect acceptability. Sag et al. submitted
report that factors affecting the acceptability of Superiority violations also
there might be a correlation between processing factors and the acceptability of Su-
periority violations. This chapter builds on this work by investigating three different
(Frazier and Flores d’Arcais 1989) to determine if they affect the acceptability of
the final representation. The question is whether every type of processing effect that
arises due to the active filling strategy affects acceptability, or whether acceptability
is differentially sensitive to such processing effects. The results suggest that judgment
tasks are indeed differentially sensitive: they are sensitive to some processing effects,
but not others. This differential sensitivity in turn suggests that further research is
required to determine the class of processing effects that affect acceptability in order
fects. Determining that relationship will be the first step toward assessing the merits
ability.
Chapter 6: The role of acceptability in theories of wh-in-situ
of the nature of grammatical knowledge, this chapter demonstrates a more direct re-
The claim in this chapter is straightforward: there are new types of acceptability
the relationship between these new acceptability effects and grammatical knowledge
can have significant consequences for the set of possible dependency forming opera-
consensus that there must be a dependency between these two positions, but there
is significant debate over the nature of that dependency, and in particular, over the
dependency forming operation(s) that create it. Various proposals have been made
in the literature, such as covert wh-movement (Huang 1982), null operator movement
and unselective binding (Tsai 1994), choice-function application and existential clo-
sure (Reinhart 1997), overt movement and pronunciation of the lower copy (Bošković
2002), and long distance AGREE (Chomsky 2000). This chapter uses the two new
pieces of evidence gained from experimental syntax to refine our understanding of the
English.
The results of these investigations suggest that:
3. While there still may be different causes underlying various violations, satiation
4. While it goes without saying that processing effects affect acceptability judg-
ments, it is not the case that all processing effects have an effect. This dif-
5. While the value of non-acceptability data such as possible answers are undoubt-
were presented that may have important consequences for wh-in-situ theories.
Like many studies, the work presented in this dissertation raises far more questions
than it answers. However, it is clear from these results that there is a good deal of
potential for experimental syntax to provide more than a simple fact-checking service
for theoretical syntax: the tools of experimental syntax are in a unique position to
knowledge, and ultimately, refine our theories of the nature of grammatical knowledge
itself.
Chapter 2
Although logically separable, in many ways the term experimental syntax has
become synonymous with the use of the magnitude estimation task to collect ac-
judgments, and ii) to provide a foundation from which theoretical syntacticians can
begin to evaluate the claim that magnitude estimation reveals the gradient nature of
chophysics and linguistics, the claim that linguistic magnitude estimation data has
through meta-analyses of several of the experiments that will be presented in the rest
of the dissertation. The surprising finding from these analyses is that at least two of
untrue (that the intervals between data points are regular, and that the responses
for analyzing the data. These results suggest that linguistic magnitude estimation
claims in the literature that magnitude estimation demonstrates the gradient nature
of linguistic knowledge, and suggestions are made as to what the true benefit of mag-
nitude estimation may be: the freedom it gives participants to convey any distinction
they see as relevant. In fact, surprising evidence emerges that participants use this
ical sentences, despite the lack of any explicit or implicit mention of a categorical
the relationship between the physical strength of a given stimulus, such as the bright-
ness of light, and the perceived strength of that stimulus. This subsection provides
example. Imagine you are presented with a set of lines. The first line is the modulus
or reference, and you are told its length is 100 units. You can use this information
to estimate the length of the other lines using your perception of visual length. For
instance, if you believe that the line labeled Item 1 is twice as long as the reference
line, you could assign it a length of 200 units. If you believe Item 2 is half as long
as the reference line, you could assign it a length of 50 units: The resulting data
are estimates of the length of the items in units equal to the length of the reference
Reference:  Length: 100
Item 1:     Length: 200
Item 2:     Length: 50
Item 3:     Length: 300
line. These estimates can be compared to the actual physical length of the lines:
the reference line is actually 100 points, or about 1.4 inches, item 1 is 200 points
(2.8 inches), item 2 is 50 points (0.7 inches), and item 3 is 300 points (4.2 inches).
different lengths of lines, psychologists can determine how accurate humans are at
While line length is a simple case, the magnitude estimation technique can
be extended to any physical stimulus, and indeed, over the past 50 years magnitude
estimation has been applied to hundreds of physical stimuli such as light (brightness),
sound (volume), heat, cold, and pressure. One of the early findings in psychophysics
was that the perception of physical stimuli is regular, but that the relationship be-
tween the perceived strength and physical strength of most stimuli is non-linear, or
of interest is between the physical strength of the stimulus and the perceived strength,
one can plot the physical strength on the x-axis and the perceived (estimated) strength
on the y-axis. If the relationship between physical and perceived strength were linear
(as it is in the line length example above), the graph would be a straight line.1
1
This abstracts away from the fact that the slope of this line could be a value other than 1. A
slope other than 1 would indicate a regular (lawful) distortion of the perception.
The graph can be read in the following way: a stimulus that is physically identical
to the reference (a value of 1 along the x-axis) will be perceived as being
as strong as the reference (a value of 1 along the y-axis). However, a stimulus that is
physically twice as strong as the reference (a value of 2 along the x-axis) will be per-
ceived as 4 times as strong as the reference (a value of 4 along the y-axis). In general,
exponential relationships in which the exponent is greater than 1 have two consequences:
(i) physical stimuli larger than the reference will be overestimated, while
physical stimuli smaller than the reference will be underestimated, and (ii) the larger
the difference between the stimulus and the reference, the larger the overestimation
posite is true: (i) physical stimuli larger than the reference will be underestimated
while physical stimuli smaller than the reference will be overestimated, and (ii) the
larger the difference between the reference and the stimulus, the larger the underes-
timation (or overestimation if the stimulus is smaller than the reference). This can
opening):
For clarity, exponential relationships like the ones above can be graphed on
a logarithmic scale. The resulting graph is a straight line with a slope equal to the
exponent of the relationship. In the first example the exponent was 2, therefore the
same relationship graphed on a logarithmic scale would result in a straight line with
a slope of 2. In the second example the exponent was .5, therefore the same graph
The fact that the relationship between physical stimuli and perception is ex-
ponential is known as the Psychophysical Power Law (Stevens 1957). Each physical
stimulus has its own characteristic exponent; some of these exponents are provided
in the following table, adapted from Lodge 1981, along with the ratios of perception
Stimulus      Exponent   2x    3x    4x
brightness    0.5        1.4   1.7   2.0
volume        0.67       1.6   2.1   2.5
vibration     0.95       1.9   2.8   3.7
length        1.0        2     3     4
cold          1.0        2     3     4
duration      1.1        2.1   3.3   4.6
finger span   1.3        2.5   4.2   6
sweet         1.3        2.5   4.2   6
salty         1.4        2.6   4.7   7
hot           1.6        3     5.8   9.2
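The perceived ratios in a table of this kind follow directly from the characteristic exponent, so they can be recomputed as a check. The following is a minimal sketch, assuming the perceived strength is simply the physical ratio raised to the exponent; the function names are illustrative and do not come from Lodge 1981 or Stevens 1957. It also recovers the exponent as the slope between two points on logarithmic axes.

```python
import math


def perceived_ratio(physical_ratio: float, exponent: float) -> float:
    """Perceived strength relative to the reference under a power relationship."""
    return physical_ratio ** exponent


def log_log_slope(exponent: float, x1: float = 2.0, x2: float = 4.0) -> float:
    """Slope between two points of the relationship when both axes are logarithmic."""
    y1 = perceived_ratio(x1, exponent)
    y2 = perceived_ratio(x2, exponent)
    return (math.log(y2) - math.log(y1)) / (math.log(x2) - math.log(x1))


# Exponents for three of the stimuli listed in the table above.
exponents = {"brightness": 0.5, "length": 1.0, "hot": 1.6}

for stimulus, exponent in exponents.items():
    # Perceived ratios for stimuli that are physically 2, 3, and 4 times the reference.
    ratios = [round(perceived_ratio(k, exponent), 1) for k in (2, 3, 4)]
    print(stimulus, exponent, ratios, round(log_log_slope(exponent), 2))
```

An exponent below 1 compresses differences (a stimulus four times as bright is perceived as only about twice as bright), while an exponent above 1 exaggerates them.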
forward. Participants are presented with a pair of sentences. The first is the reference
sentence, and has a value associated with its acceptability (in this example, 100). The
acceptability of the second sentence can then be estimated using the acceptability of
the first. If the sentence is two times more acceptable than the reference, it would
receive a value twice the value of the reference (e.g., 200). If the sentence is only half
as acceptable, it would receive a value half that of the reference (e.g., 50):
Reference: What do you wonder whether Mary bought?
Acceptability: 100
Again, by using the same reference sentence to judge several other sentence
types, the relative acceptability of those sentence types can be estimated in units
The obvious question, then, is why linguists should adopt magnitude esti-
mation over other more familiar measurement techniques, such as yes/no tasks or
5- (or 7-) point scale tasks. There are two major purported benefits of magnitude
estimation for linguistics. First, the freedom of choice offered by the real number
line means that respondents can report any difference between stimuli that they find
relevant. This is in contrast with categorization tasks such as the yes/no or scale
tasks in which the number of choices is set by the experimenter. Categorization tasks
increase the likelihood of losing relevant data by forcing respondents to group poten-
tially different stimuli into the same category (e.g., the alphabetic grading system in
the U.S.). Magnitude estimation sets no such categories, and the theoretical infinity
of the real number line ensures that respondents always have the option of creating
an additional distinction. While the question of whether scale tasks actually lose
linguistically relevant data is an empirical one (see Schütze 1996 for a discussion of
the effect of different scale sizes), there is growing evidence that the answer may be
yes: Bard et al. 1996 demonstrate that respondents tend to distinguish more than
2005b demonstrates that magnitude estimation reveals distinctions among sentences
the data itself: it has been claimed that magnitude estimation data can be analyzed
using parametric statistics, while yes/no and scale studies are only amenable to non-
parametric statistics (Bard et al. 1996, Keller 2000). Parametric statistics are so
named because they assume the data has the following 5 properties:
The first three assumptions are independent of the task, and generally under the
control of the experimenter, therefore not much more will be said about them.2 As
2
However, that is not to say that they are necessarily satisfied by acceptability studies. For one,
the assumption of random sampling is hardly ever satisfied: most participants are college students
who have actively sought out the experimenter. The assumption of independence of observations
is of considerable concern, especially given the claims tested in chapter 4 that the repetition of
structures affects acceptability. As for the assumption of equal variances, no large scale study of
the variance of acceptability across structures has ever been conducted, although there is no a priori
for the final two assumptions, the claim for magnitude estimation has been that it
In general there are four types of scales in statistics: nominal, ordinal, interval,
and ratio. Nominal scales assign categorical differences to data, but do not specify
any ranking, ordering, or distance among the categories. The visible color names are
a good example of a nominal scale: while there is a physical spectrum of color, the
names themselves do not specify this order; the names simply categorize chunks of
the spectrum into groups, with no regard for their physical relationship (hence the
need for mnemonics such as ROY G BIV). The yes/no task of acceptability uses a
nominal scale. Ordinal scales add ordering to nominal scales, but still do not specify
the distance between the categories. Again, the alphabetic grading system in the
U.S. is a good example: there is a definite ordering of the grades, but the distance
between them is not necessarily stable, as anyone who has dealt with multiple graders
or multiple sections of a class can attest. The scale tasks of linguistic acceptability
are ordinal scales: they specify the order of structures, but it is not clear that the
differences between points on the scale are stable either within or across participants.
Interval scales, which are assumed by parametric statistics, specify both the
ordering of items and the distance between them, but not the location of 0. The tem-
perature scales Fahrenheit and Celsius are examples of interval scales: the distances
between any two degree marks on a thermometer are identical, so the intervals are
stable, but there is no meaningful 0 point. The 0 point on the Celsius scale is arbi-
trarily set as the point at which water freezes, whereas the 0 point on the Fahrenheit
scale is even more arbitrarily set as 32 degrees below the point at which water freezes.
Because the 0 points are arbitrary, it is not the case that 64 degrees F is twice as
much as 32 degrees F; all we can say is that 64 degrees F is 32 degrees higher than
timation scales of acceptability have been claimed to be interval scales: the distance
between any two structures is measured in units equal to the acceptability of the
reference, therefore the intervals are stable. It is not clear what it means to say that
Ratio scales are also amenable to parametric statistics because they also as-
sume stable intervals, but add a meaningful 0 point. The Kelvin temperature scale is
a good example of a ratio scale: 0 Kelvin is commonly called ‘absolute zero’ because
it represents the absence of all temperature. On this scale 100 Kelvin is twice as
much as 50 Kelvin because 0 Kelvin is meaningful. The Kelvin scale confirms that
64 F is about 291 K. The magnitude estimates from psychophysics are all on ratio
et cetera.
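The temperature example can be checked with a short calculation. The sketch below uses the standard Fahrenheit-to-Kelvin conversion; the function name is an illustrative assumption. It shows that doubling a Fahrenheit reading corresponds to only a small proportional change on the Kelvin scale, where ratios are meaningful.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert degrees Fahrenheit to Kelvin."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


low_k = fahrenheit_to_kelvin(32.0)
high_k = fahrenheit_to_kelvin(64.0)

# On the interval (Fahrenheit) scale only the difference is meaningful.
print("difference in F:", 64.0 - 32.0)

# On the ratio (Kelvin) scale the ratio is meaningful, and it is far from 2.
print("Kelvin values:", round(low_k, 2), round(high_k, 2))
print("ratio in K:", round(high_k / low_k, 3))
```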
In addition to providing interval level data, it has also been suggested that
magnitude estimation provides normally distributed data (as opposed to the ordinal
data from scale tasks, which is never normally distributed). The normal distribution
(or Gaussian distribution, also commonly known as the bell curve) is a crucial as-
asks the following question: given these (two) samples (or sets of data), how likely
is it that they come from the same population? In behavioral studies, it is generally
assumed that if the likelihood is less than 5% that the two samples come from the
same population, then it can be concluded that they do not (hence, a significant
difference).
While the mathematics of each statistical test is different, the logic is the
population mean from the sample mean, we can then compare the population mean
estimates from each sample mean to determine if they are from the same or different
properties of the normal distribution can be used to estimate the population mean
from a sample mean. So in a very real sense, the numbers that are used to determine
statistical significance (the estimated population means for each sample) are con-
tingent upon the samples being normally distributed. If the sample is not normally
distributed, the estimated population means will be incorrect, and the likelihood that
they are from the same population will be incorrect. Therefore, the fact that magnitude
In the following sections, these two claims about magnitude estimation data
Unfortunately, the results suggest that the data, although internally consistent, is
not interval level at all, and while the data is normally distributed, the standard
analysis techniques applied in the syntax literature destroy the normality prior to
statistical analysis. These facts will be considered with respect to the two major
goals of syntactic theory presented in the first chapter to determine whether we need
to reconsider any of the conclusions that have been drawn from magnitude estimation
data. One note is in order before proceeding: Because of the quantity of data being
meta-analyzed (a necessity given the goal) and because the details of these exper-
iments will be presented in subsequent chapters, these sections will provide only a
claim that magnitude estimation of acceptability yields interval level data, this raises
the obvious question: with no possibility of external validation, how can we be sure
Bard et al. 1996 address this concern directly. They demonstrate that indi-
and asked to estimate their magnitude using one modality, for instance real numbers
as presented above. Then, the same participants are asked to estimate the magnitude
of the same stimuli using a different modality, such as drawing lengths of lines. For
this second modality, rather than assigning numbers with the correct proportions,
they would be asked to draw lines with the correct proportions. The two modalities
rect relationship between the responses in each modality. In concrete terms, imagine
a reference sentence whose acceptability is set using the number modality as 100.
200. Then imagine the same reference sentence and stimulus sentence pair, in the
line length modality. The reference sentence’s acceptability could be assigned a line
length of 100 points (about 1.4 inches). If the participant is being internally consis-
tent, they should then assign a line length of 200 points (about 2.8 inches) to the
the two modalities. Bard et al. 1996 find exactly that internal consistency.
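The logic of the cross-modality check can be illustrated with a small sketch. This is a simplification under stated assumptions, not the analysis reported by Bard et al. 1996: the responses are hypothetical, the function names are illustrative, and the tolerance of 0.1 is an arbitrary choice. Each response is expressed as a ratio to the reference in its own modality, and the two sets of ratios are then compared.

```python
def ratios_to_reference(estimates, reference):
    """Express each estimate as a ratio of the reference value."""
    return [value / reference for value in estimates]


def is_internally_consistent(number_estimates, number_reference,
                             line_estimates, line_reference,
                             tolerance=0.1):
    """True if the ratios agree across modalities within the (assumed) tolerance."""
    number_ratios = ratios_to_reference(number_estimates, number_reference)
    line_ratios = ratios_to_reference(line_estimates, line_reference)
    return all(abs(n - l) <= tolerance
               for n, l in zip(number_ratios, line_ratios))


# Hypothetical responses from one participant to the same three sentences.
numbers = [200.0, 50.0, 120.0]   # numeric estimates, reference = 100
lines = [198.0, 52.0, 125.0]     # drawn line lengths in points, reference = 100

print(is_internally_consistent(numbers, 100.0, lines, 100.0))
```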
2.2.2 Interval regularity and the effect of the reference sentence
the intervals in magnitude estimation are regular, just that the non-regularity is con-
sistent: scale tasks are known to result in non-regular intervals while still yielding
consistent results within participants. To test the regularity of intervals, two experi-
ments will be compared. The experiments are identical in nearly every way: they test
the same conditions, use the same items, are presented in the same order, et cetera.
The only difference between the two experiments is that they each use a different
reference sentence. The first experiment uses an If-Island violation as the reference
and the second uses a Coordinate Structure Constraint (CSC) violation as the ref-
each experiment. Therefore, the acceptability values for each of the conditions in the
first experiment can be translated into values of the Coordinate Structure Constraint
- in essence, predicting the values that should be obtained in the second experiment
Participants were presented with 5 tokens of each violation type, along with
10 grammatical fillers, for a total of 50 items. Mean values for each condition were
obtained by first dividing the response by the reference value (to normalize the scale
3
The If-reference was chosen because it is a frequently used reference in the literature (e.g., Keller
2000).
Table 2.3. Conditions in both experiments
Adjunct Island            What does Jeff do the housework because Cindy injured?
CSC violation             What did Sarah claim she wrote the article and ?
Infin. Sent. Subject      What will to admit in public be easier someday?
Left Branch Condition     How much did Mary saw that you earned money?
Relative Clause           What did Sarah meet the mechanic who fixed quickly?
Sentential Subject        What does that you bought anger the other students?
Complex NP Constraint     What did you doubt the claim that Jesse invented?
Whether Island            What do you wonder whether Sharon spilled by accident?
to ratios of the reference sentence), then determining the mean of each condition for
each participant. The grand mean of the participant means was then calculated for
each condition:
As the table indicates, the mean for the CSC condition in the If-reference
experiment was 0.65 times the reference. This value can then be set as 1 to calculate
the predicted values for the other conditions in an experiment in which the CSC is
Table 2.5. Predicted means for CSC-reference based on If-reference results
Condition                Predicted Mean
Whether Island           1.37
Complex NP Constraint    1.21
Left Branch Condition    1.08
Relative Clause          1.08
Adjunct Island           1.07
CSC violation            1.00
Sentential Subject       0.89
Infin. Sent. Subject     0.80
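The normalization and rescaling steps behind these predicted means can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used for these experiments: the function names, the data layout, and the toy responses are all hypothetical, and only the CSC mean of 0.65 is taken from the text.

```python
from statistics import mean

def condition_means(responses, reference_value):
    """Grand mean of per-participant means of response / reference.

    responses: {participant: {condition: [raw estimates]}} (hypothetical layout)
    """
    per_condition = {}
    for by_condition in responses.values():
        for condition, values in by_condition.items():
            # Normalize each response to a ratio of the reference value.
            ratios = [v / reference_value for v in values]
            per_condition.setdefault(condition, []).append(mean(ratios))
    # Grand mean of the participant means for each condition.
    return {c: mean(ms) for c, ms in per_condition.items()}

def rescale_to_reference(means, new_reference):
    """Express every condition mean as a multiple of the new reference's mean."""
    base = means[new_reference]
    return {c: m / base for c, m in means.items()}

# Toy data with a reference value of 100. The responses are invented;
# they are chosen only so that the CSC mean comes out at 0.65.
toy = {
    "p1": {"CSC": [60, 70], "Whether": [90, 90]},
    "p2": {"CSC": [65, 65], "Whether": [85, 91]},
}
if_means = condition_means(toy, reference_value=100)
predicted = rescale_to_reference(if_means, "CSC")
```

By construction, the condition chosen as the new reference always receives a predicted value of 1, and every other condition is expressed as a multiple of it.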
As the table indicates, the predictions do not hold experimentally. In fact, the
relative order of the absolute values of the means in the CSC experiment is different
from the relative order of the absolute values of the means in the If experiment:
Figure 2.1. Relative means for both experiments (two panels: CSC-reference and If-reference)
These results indicate that the intervals in the first experiment are different
from the intervals in the second experiment. This suggests that either (i) magnitude
acceptability is not real magnitude estimation (i.e., the magnitudes are not estimated
using the reference sentence as a unit). Either way, this suggests that the intervals
The irregularity of the intervals in the previous subsection is not that surpris-
ing: it has already been admitted that there is no meaningful 0 point for acceptability,
4
In fact, at least one published study has implicitly accepted this as true. Featherston 2005a
based on how far each response is from that particular participant’s mean. In Featherston’s words,
“this effectively unifies the different scales that the individual subjects adopted for themselves.”
yet despite this fact, the magnitude estimation task explicitly asks participants to give
ratio judgments. Bard et al. 1996 rationalize this by suggesting that the ratio scale
created by participants assumes an interval scale, therefore the data will be interval
level:
which .6 times as acceptable, and so forth, can at least give us the interval
However, it is not clear how sound this logic is. If there is no true 0 point,
but participants are still asked to respond as if there were one, are we not asking the
participants to estimate a 0 point? Even in the unlikely event that all participants
estimate the same 0 point for a given reference sentence, it is still possible that they
will estimate a different 0 point for a different reference sentence, and that the two
given experiment. Such a situation is not that unlike the scale tasks that magnitude
While the lack of interval level data may be distressing to the statistically con-
scientious, the real question for the practicing linguist is whether this fact affects the
interpretation of magnitude estimation data with respect to the two driving questions
of theoretical syntax:
Ultimately, the answer to both of these questions will be empirically determined as
the body of magnitude estimation results grows. These two experiments can only
individual phenomena, the answer appears to be yes and no. Any analysis that is
strongly dependent upon the choice of reference sentence, and at least from these
two studies, there does not appear to be a regular relationship between the values
of different reference sentences. For the most part this will have little impact on
theoretical syntax, as the field in general has never attempted to make numerical
predictions regarding the acceptability of sentences. There have been some recent
attempts to calculate weightings for different syntactic constraints (e.g., Keller 2003),
which may require further testing using a variety of reference sentences to determine
For analyses that are predicated upon relative acceptability, as the majority of
theoretical studies are, these experiments suggest little reason for concern. The sta-
tistically significant relative comparisons among these islands are maintained across
these two experiments (except for comparisons involving the CSC, which will be ad-
dressed shortly). This can be seen both in graphs of the relative acceptability of
the conditions in both experiments, which show the same general shape, and in the
Figure 2.2. Pattern of acceptability for both ratios and logs
pected difference between the two experiments. In the If-reference experiment, the
CSC is one of the least acceptable violations; indeed, only the two Sentential Subject
all of the violations except the CNPC Island are judged less acceptable than the CSC-
reference. In a certain sense, it seems as though the reference sentence was judged as
more acceptable than the other violations by virtue of being the reference sentence,
as if its inherent acceptability were ignored. This is also true of the If-reference ex-
periment: all of the violations were judged as worse than the If-reference sentence.
should be judged worse than some of the other violations. However, in the case of the
CSC, the first experiment suggests that we should expect it to be near the bottom of
One claim about the nature of grammaticality that has been put forward is that
(e.g., Keller 2000, Sorace and Keller 2005). Of course, the fact that the results
from magnitude estimation are continuous is unsurprising given that the task is a
continuous task, much the way it is unsurprising that yes/no tasks provide evidence
for binary grammatical knowledge. However, it has been claimed that one of the
virtues of magnitude estimation tasks is that they do not impose any conception of
relative both to a reference item and the individual subject’s own previous
applied.
ungrammatical distinction being brought to the task by the participants: these results
suggest that ungrammatical sentences are judged as equal to or less acceptable than
the reference sentence regardless of its relative acceptability to the other violations.
In fact (and unsurprisingly), the grammatical fillers in these experiments are judged
as more acceptable than the reference as well. Only further studies can demonstrate
whether this is in fact a stable trend across all possible reference sentences or just
a quirk inherent to the two under consideration here. For now, though, this fact
suggests that despite the lack of explicit grammaticality in the magnitude es-
the sentences being investigated - a potential piece of evidence for a binary form of
grammatical knowledge.
physical stimuli was not normal. In fact, the distribution of psychophysical magnitude
estimates was log-normal, in that it was characterized by a power law. The broad
will demonstrate that acceptability magnitude estimates are normal (as assumed by
Before addressing the broad question about the nature of the distribution of
data analysis methods of linguistic magnitude estimation. The issue at hand is that
most of the magnitude estimation experiments in the syntactic literature do not report
the raw ratio responses given by the participants; instead, these experiments report
the (natural) logs of the ratios. Transforming the ratios into logs (a log transforma-
to deconstruct the logic behind this transformation before addressing the underlying
There appear to be several reasons given in the literature for why the log-
Logs are used both to keep the scale manageable in the presence of very
the difference between log estimates provides the ratio of the acceptability
The first reason, to keep large numbers manageable, is a non-technical reference to the
fact that magnitude estimation data given in terms of numbers will by definition have
a rightward skew: the number line used by participants is infinite in the rightward
(increasing) direction, but bounded by 0 to the left. The log transformation limits
the effect of large numbers, therefore keeping the data ‘manageable’ by limiting the
effect of outliers. However, there are many other methods for dealing with outliers.
In fact, the log transformation is a rather extreme measure for simple outlier removal:
as the name implies, it changes the entire nature of the distribution. Weaker outlier
removal techniques can achieve the same goal without affecting the overall distribution
of the data. The second reason, that exponentiating the difference between logs yields a ratio, is true, but
is peculiar to the goals of the study being analyzed: the reported study focused
indeed reveal the ratio of the two original scores, which in this case saves one step
of the analysis. However, reporting the original ratios of each condition would just
as easily make both the differences and the ratio relationships apparent (with one
Keller 2000 offers two additional reasons for applying the log transformation
and is standard practice for magnitude estimation data (Bard et al. 1996,
Lodge 1981).
The first reason is exactly the empirical question that needs to be answered. If it is
the case that acceptability judgments are log-normal, the log transformation will be
necessary to create a normal distribution of the data. The second reason, that it is
just seen, the reasons offered by Bard et al. 1996 may not be the most compelling. The
other referenced work, Lodge 1981, is a brief how-to guide for applying magnitude
psychophysical/sociological data analysis, it does not have the same effect as it does
this more apparent.
participants. The fact that there is a 0 point in physical perception meant that these
ratios were meaningful, and the fact that the physical stimuli could be externally
measured meant that the perceived ratios could be compared to the actual ratios.
Because ratios are a form of multiplication, the appropriate central tendency for
ratios is not the more familiar arithmetic mean (commonly called the average), but
rather the geometric mean.5 Now, there is a simple algorithm for determining the
chophysicists did apply log transformations, but these transformations were then ‘undone’
by the exponentiation. So while the magnitude estimates were indeed log-normal
(as proven by the Power Law), the log transformation was not used to normalize
the distribution - after exponentiation, the distribution was again log-normal
5
If the arithmetic mean answers the question What is the value I need to add to itself X times to
achieve this sum?, the geometric mean answers the question What is the value I need to multiply by
itself X times to achieve this product?. In everyday life, the geometric mean is the appropriate way
(as demonstrated by the use of logarithmic graphs).
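The relationship between logs and the geometric mean described here can be checked numerically. The following is a minimal sketch with hypothetical ratio judgments; it shows that exponentiating the arithmetic mean of the logs recovers the geometric mean, and that exponentiating a difference between two logs recovers the ratio of the original values.

```python
import math
from statistics import geometric_mean

# Hypothetical ratio judgments (multiples of the reference); not real data.
ratios = [0.5, 1.0, 2.0, 4.0]

# Step 1: log transform, then take the arithmetic mean of the logs.
logs = [math.log(r) for r in ratios]
mean_log = sum(logs) / len(logs)

# Step 2: 'undo' the transformation by exponentiating the mean of the logs.
geo_from_logs = math.exp(mean_log)

# The arithmetic mean of the same ratios is larger, reflecting the rightward skew.
arithmetic = sum(ratios) / len(ratios)

# A difference between two logs, once exponentiated, is the ratio of the originals.
recovered_ratio = math.exp(math.log(4.0) - math.log(2.0))

print(geo_from_logs, geometric_mean(ratios), arithmetic, recovered_ratio)
```

For these four values the geometric mean is the square root of 2, while the arithmetic mean is 1.875.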
pants’ responses are log transformed, and the arithmetic mean of the logs is calculated.
However, the parallels with the psychophysical analysis end there. The logs are used
in the subsequent statistical analysis, not the geometric means that can be calculated
from the logs. The question, then, is why linguistic magnitude estimation uses the log
follow-up question being why linguistic magnitude estimation does not calculate the
We have already discussed one possible reason for not analyzing the geometric
means of the ratios: acceptability ratios have no meaningful 0 point, so the ratios
themselves are not meaningful. However, if we admit the ratios are meaningless,
then we must ask why it is that the task asks participants to determine ratios. Log
transforming the meaningless ratios will not make them more meaningful. As for the
logic behind the log transformation, we are left with Keller’s reason: the acceptability
judgments are log-normal to begin with, so log transforming them is necessary to make
them normal.
reporting normality tests (specifically the Shapiro-Wilk test) for both the raw (ra-
tio) and log transformed responses to four of the experiments reported in subsequent
chapters. As we shall see shortly, the tests reveal that the raw magnitude estimates
are normally distributed when multiple responses are collected for each participant,
but that the log-transformation destroys this normality. This suggests that the log-
transformation is not only unnecessary from a normality perspective, but also inap-
propriate as a method for removing outliers as suggested by Bard et al. 1996. More
The first two experiments to be analyzed are the two Island studies reported
condition, which were averaged for each participant to reduce the influence of out-
condition, for a total of 18 conditions across the two experiments. The Shapiro-Wilk
test was applied to each condition for both ratios and logs. The results of the nor-
mality tests for both types of data (ratio and log) can be summarized in the following
table:6
Table 2.8. Qualitative normality results for multiple observations per participant
              ratios    logs
normal        16        4
non-normal    2         14
6
Due to the sheer number of Shapiro-Wilk tests performed, the results will only be reported with
qualitative summaries. The statistics for each test are reported at the end of the chapter.
As the table suggests, the raw magnitude estimates are overwhelmingly nor-
mal. In fact, the two non-normal conditions are easily explained. The first non-normal
condition is the CSC condition within the CSC-reference experiment. The fact that
the responses to this condition are non-normal is unsurprising given that it is struc-
turally identical to the reference sentence. The second non-normal condition, the
Perhaps more interesting than these two exceptions to the overwhelming normality
of the ratio judgments is the fact that the logs of the ratios are overwhelmingly non-normal.
These two facts combined suggest, at least for these two experiments, that
the log transformation is unnecessary (and inappropriate) as the ratio judgments are
The second pair of experiments differ from the first pair in that multiple obser-
vations were not collected for each participant. However, these two experiments did
involve a large number of participants (86 and 92), resulting in an equivalent number
of total observations. While the lack of multiple observations per participant means
that outliers have a greater effect on normality, the unusually large number of total
observations ensures that once the outliers are accounted for, the distribution of the
resulting scores will not be non-normal due to insufficient measurements. The two
for a total of 18 conditions, as summarized in the following table:
Table 2.9. Qualitative normality results for few observations per participant, large sample size
              ratios    logs
normal        1         0
non-normal    17        18
Unlike the first pair of experiments, the raw ratios of these two experiments
are overwhelmingly non-normal. Interestingly, the logs are also overwhelmingly non-
normal. This suggests that the non-normality in the ratios is not because they are
log-normal. Given the lack of multiple observations per condition per participant, the
non-normality may be due to the influence of outliers. Whatever the ultimate source of
the non-normality of the raw ratios, together with the results from the first pair of ex-
periments, these results strongly suggest that the log transformation is inappropriate
estimates are log-normal, and in fact, there is evidence that the untransformed ratios
along both dimensions: only 1 observation per participant and a relatively small
number of participants (between 20 and 26). The relatively low number of total
observations means that these experiments are more likely to be influenced by outliers,
and therefore less likely to be normal. However, we can still compare the distribution
of the raw ratios to the distribution of the logs to determine the validity of the log
transformation:
Table 2.10. Qualitative normality results for few observations per participant, small sample size
              ratios    logs
normal        19        35
non-normal    43        27
For the raw ratios, there are significantly more non-normal conditions than
normal conditions by Sign Test (p<.003), which is unsurprising given the increased
influence of outliers in these designs. And while the log transformation does increase
the number of normal conditions from 19 to 35, there still are not significantly more
normal conditions than non-normal conditions by Sign test (p<.374). So while the
log transformation does improve the overall normality in these designs, there is no
are log-normal.
Table 2.11. Qualitative normality results for multiple observations per participant
                             ratios                          logs
                             normal   non-normal   p         normal   non-normal   p
many observations            16       2            .001      4        14           .031
few observations, large N    1        17           .001      0        18           .001
few observations, small N    19       43           .003      35       27           .374
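The Sign Test comparisons summarized in this table can be sketched as an exact binomial test with a success probability of 0.5. The text does not state which variant of the test was used, so the two-tailed exact version below is an assumption, and the function name is hypothetical.

```python
from math import comb

def sign_test_p(n_a, n_b):
    """Two-tailed exact sign test for counts n_a versus n_b.

    Under the null hypothesis each condition is equally likely to fall
    in either category, so the smaller count follows Binomial(n, 0.5).
    """
    n = n_a + n_b
    k = min(n_a, n_b)
    # Probability of a count at least as extreme as k in the lower tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double for the two-tailed test, capping at 1.
    return min(1.0, 2 * tail)

# Counts of normal versus non-normal conditions from the first row of the table.
p_ratios = sign_test_p(16, 2)
p_logs = sign_test_p(4, 14)
print(round(p_ratios, 3), round(p_logs, 3))
```

Under this assumption the first row of the table is reproduced to three decimal places, which suggests, but does not establish, that a two-tailed test was the one reported.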
The general picture that emerges is as follows:
1. The raw ratios are normally distributed in designs that minimize the influence
of outliers.
2. In designs that do not minimize the influence of outliers, the raw ratios are
3. This suggests that the non-normality is more likely to be due to outliers than
in fact tends to be normal in designs that minimize the influence of outliers, it is more
The results of the previous subsection seem to suggest that the log transfor-
purpose, to ensure the normality of the results, seems unnecessary given the nor-
mality of the responses prior to the transformation. The question, then, is whether
the application of the log transformation has affected the results of magnitude esti-
mation experiments in a way that affects syntactic theory; or in other words, does
the log transformation affect the first major question driving theoretical syntax, what
In general, F-tests such as ANOVA are robust to violations of normality when
the non-normal distributions in question are identical (i.e., non-normal in the same
way) (Wilcox 1997). In the case of acceptability magnitude estimates, it is true that
the log transformation creates a non-normal distribution, but it is also true that the
log transformation is applied to all conditions, making it likely that the non-normal
non-normality, we would not expect the log transformation to significantly alter the
results of these tests. This was tested using the experiments from this dissertation
by comparing the raw ratios and logs: while absolute p-values and effect sizes did
change after the log transformation, the qualitative results (the presence or absence
One of the major claims about magnitude estimation is that it provides data
that is more amenable to parametric statistics than other standard judgment col-
lection techniques, such as scale tasks. However, the results of this chapter suggest
that the quality of data from magnitude estimation suffers from some of the same
drawbacks as data from scale techniques. For one, it has been claimed that magni-
tude estimation yields interval level data, as is required for parametric statistics. Yet
intervals constructed by participants are not regular. While this result is not entirely
surprising given that participants are actually asked to produce ratio responses to a
stimulus with no meaningful 0 point, it does suggest that magnitude estimation data is
ordinal, just like the data from scale tasks. In fact, the distribution of responses with
respect to the reference sentence suggests that participants are not performing magni-
tude estimation at all, but rather a standard scaling technique in which grammatical
sentences are placed above the reference, and ungrammatical below the reference.
and ungrammatical sentences, it is also very similar to standard scale tasks in which
the mid-point of the scale marks the difference between the two categories.
It has also been claimed that the continuous measures produced by magni-
tude estimation are normally distributed following the ‘standard’ log transformation.
Yet normality tests on several experiments revealed that magnitude estimation re-
sponses are already normal before the log transformation in designs that minimize
the influence of outliers. And although the untransformed responses are non-normal
in designs that do not minimize the influence of outliers, the log transformation does
not make these distributions normal, suggesting that simple outlier removal would
be a more appropriate procedure. Taken together, these findings suggest that the
absolute numbers associated with magnitude estimation are unreliable: the lack of
regular intervals indicates that the magnitude of differences is unreliable, and the lack
of normality after the log transformation suggests that the exact statistics may be
unreliable. However, comparing statistical analyses on both raw and log transformed
data, and on experiments with different reference sentences (hence different inter-
vals), has demonstrated that the qualitative effects are not affected (i.e., statistically
These results suggest that in many ways magnitude estimation of acceptabil-
standard parametric statistical tests such as the ANOVA, but this has more to do
with the robust nature of the effects (and the statistical tests) than the precision
of the measuring instrument. In the end, these results suggest that the real benefit
of magnitude estimation rests not in the data it yields, but in the freedom it gives
participants. Bard et al. 1996 and Featherston 2005b have already demonstrated that
there are significant differences in acceptability that appear reliably with magnitude
estimation but nevertheless may not be apparent to other rating techniques. Fur-
ical and ungrammatical sentences discussed above must have been introduced by the
participants themselves. If this finding holds for a variety of reference sentences,
then magnitude estimation may provide new psychological evidence for the
The final question, then, is what this means for the practice of acceptability
The bottom line appears to be that the absolute values of judgments, and the gradi-
ence implied by them, are not necessarily reliable. The relative differences are reliable,
despite the inappropriate application of the log transformation.7 So as long as effects
are defined relative to control conditions, there is no cause for concern. There is
reason to be cautious in drawing conclusions of gradience from these results (after all,
gradience is built into the task), especially if precise values or weights are assigned to
that gradience, as the values may change with the reference sentence. However, the
ultimately provide a valuable new insight into the nature of grammatical knowledge.
7
Even though the log transformation is not ultimately causing any statistical harm, given that it
obscures the intentions of the participants (a sentence judged as twice as acceptable as the reference
becomes .3 after the log transformation), it would probably be worth abandoning the log transfor-
mation in favor of the geometric means. However, for consistency with the standard practices of the
field, the results for the rest of the dissertation will still include the log transformation.
Chapter 3
One of the recurrent questions facing syntactic research is whether the fact that
acceptability judgments are given for isolated sentences out of their natural linguistic
context affects the results. This chapter lays out several possible ways in which the
presence or absence of context may affect the results of acceptability studies, and
what effect each could have on theories of grammatical knowledge. Because the effect
on a case by case basis, this chapter then presents 3 case studies of the effect of various
is an effect, and ii) if so, what the underlying source of the effect may be. These
case studies serve a dual purpose: they serve as a first attempt at systematically
wh-movement serves as the empirical object of study in the rest of the dissertation,
apparent, these experiments suggest that at least the types of context studied here
do not affect major properties of wh-movement such as Island effects and D-linking
effects, indicating that context need not be included as a factor in subsequent studies
3.1 The complexities of context
property is dependent on context, and given that context is rarely supplied during
ally unaware of the multiple variables that are known to affect linguistic
Christiansen 2003)
In fact, even linguistic methodology seems to acknowledge the cognitive cost of at-
Value Judgment task was created to ease that very cognitive burden such that children
are able to give fairly complex linguistic judgments (Crain and Thornton 1998).
At a methodological level, it is obvious that any research that intends to use
acceptability judgments as primary data must first determine the effect,
gathered, a much more difficult question emerges: What, if anything, does an effect
divided into constraints that are affected by context (context-dependent) and con-
straints that are not affected by context (context-independent). Keller argues that
and soft constraints, a property that distinguishes between constraints that are
present in every language cross-linguistically, and constraints that may or may not
it does not easily transfer from the OT Syntax framework in which Keller is working
into other grammatical theories. In other frameworks, the constraints that Keller
constraints on the use of a given syntactic structure, or preferences for a given in-
terpretation, but not as constraints on the syntactic structure itself. For instance,
from Kuno 1976 that when the remnant material is an NP and VP, they must be
(1) The Tendency for Subject-Predicate Interpretation
b. John persuaded Dr. Thomas to examine Jane and John persuaded Bill
to examine Martha.
c. * John persuaded Dr. Thomas to examine Jane and Bill persuaded Dr.
In many non-OT frameworks, the fact that this tendency can be overridden by an
effect, with the interesting question then being why the information structure of the
gapped sentence presented without any context favors one interpretation over the
other, equally structurally possible interpretation. This is exactly the tack taken by
Erteschik-Shir 2006.
Structure constraints, on the other hand, are by definition dependent on the lin-
context effects into a diagnostic for grammatical knowledge of a sort not normally
investigated by syntactic theory. So in essence, if a structure is found acceptable in
any context, it is a grammatically possible structure, but the contexts in which the
complexity of the structure makes the intended meaning difficult to determine. Be-
is obscured by the indeterminacy of the meaning. Of course, there are many different
gies, default IS strategies, et cetera. - which potentially overlap with the IS proposal
of Erteschik-Shir.
Any of these analyses of context effects are possible, and can really only be
determined empirically on a case by case basis. The rest of this chapter looks at
and in particular Island effects, form the empirical object of study throughout the
rest of this dissertation, so this chapter serves a dual function: to investigate the
possibility that one or more of the scenarios above affects the acceptability of wh-
the experiment presents participants with Island violations without any context,
and Island violations with a context sentence that is a possible answer to the
question, to determine if fore-knowledge of the meaning of the Island violation
gating a claim from Deane 1991 that the acceptability of wh-questions is affected
by the number of focused elements in the sentence: moved wh-words are inter-
Deane argues is the case for most Island structures as well as some non-Island
structures, the two focused elements compete for attention and acceptability is
context.
that have been associated with Discourse Linked wh-phrases, such as Superi-
ority amelioration (Pesetsky 1987) and better resumption binding (Frazier and
text.
The picture that emerges from these three experiments is striking: the contexts in
these experiments do not interact with the properties of wh-movement. This sug-
gests that the properties of wh-movement under investigation, namely Island effects
and Discourse Linking effects, are not due to any of the possibilities discussed above.
In fact, experiment 2 suggests an even stronger result. Deane’s analysis of Island
no effect of conflicting attention using sentences taken directly from Deane’s paper,
regardless of context.
The first experiment asks the simple question: Does fore-knowledge of the in-
tended meaning of an Island violation increase the acceptability? Of course, there are
two possibilities. First, knowledge of the intended meaning could increase acceptabil-
ical wh-questions as well as Island violations. On the other hand, knowledge of the
example. It has been claimed (since at least Huang 1982) that clausal adjuncts
between two factors, structure and movement. Structure has two levels: clausal
complement (2) and clausal adjunct (3); movement also has two levels: movement
from the matrix clause (i) and movement from the embedded clause (ii):
i. Who1 t1 suspects [CP that you left the keys in the car?]
ii. What1 do you suspect [CP that you left t1 in the car?]
i. Who1 t1 worries [ADJ that you leave the keys in the car?]
All things being equal, one might expect sentences containing clausal complements
to be judged as more acceptable than sentences containing clausal adjuncts since the
semantics of clausal adjuncts involve a more complicated relationship with the matrix
predicate. Thus, we might expect both of the examples in (2) to be more acceptable
than the examples in (3). We might also expect movement out of the matrix clause
to be more acceptable than movement out of the embedded clause (perhaps because
shorter movements require less working memory, cf. Phillips et al. 2005), therefore
the (i) examples should be more acceptable than the (ii) examples.
If these hypotheses were to hold, we would expect a graph of these four condi-
tions to yield two parallel lines, indicating that there were just two main effects. The
slope of the top line shows the effect of distance on wh-movement in sentences con-
taining clausal complements. Similarly, the slope of the bottom line shows the effect
distance between the pairs of points shows the effect of clausal complements versus
Figure 3.1. Two main effects, no interaction, no Island effect
the effect of clausal adjuncts. This is the standard graph of two main effects (the
main effect of structure and the main effect of movement) and no interaction.
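The parallel-lines pattern can be expressed numerically as a difference of differences that equals zero. The sketch below uses hypothetical condition means, and the function name and the choice of which difference to subtract from which are illustrative assumptions; a nonzero result indicates that the two lines are no longer parallel, i.e., that the two factors interact.

```python
def interaction_contrast(means):
    """Difference of differences for a 2 x 2 design.

    means maps (structure, movement) to a mean acceptability score.
    Returns the movement effect under adjuncts minus the movement
    effect under complements; 0 means the two lines are parallel.
    """
    complement_drop = means[("complement", "matrix")] - means[("complement", "embedded")]
    adjunct_drop = means[("adjunct", "matrix")] - means[("adjunct", "embedded")]
    return adjunct_drop - complement_drop

# Hypothetical means: two main effects only, so the lines are parallel.
parallel = {
    ("complement", "matrix"): 1.0,
    ("complement", "embedded"): 0.75,
    ("adjunct", "matrix"): 0.5,
    ("adjunct", "embedded"): 0.25,
}

# Hypothetical means with an extra decrease in the adjunct-embedded cell.
superadditive = dict(parallel)
superadditive[("adjunct", "embedded")] = -0.5

print(interaction_contrast(parallel))       # 0.0
print(interaction_contrast(superadditive))  # 0.75
```

In an actual analysis this contrast would be evaluated with a factorial ANOVA rather than read off the raw means; the sketch only shows what quantity the interaction term measures.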
All things are not equal. The claim is that wh-movement out of clausal adjuncts
is impossible, because clausal adjuncts are Islands to wh-movement (e.g., Huang
1982). This means that there is more affecting the acceptability of the
adjunct-embedded condition than just the acceptability decreases we expect from
long distance movement and from clausal adjuncts. On the graph, this extra acceptability
decrease would change the slope of the bottom line such that the two lines are no
two factors, and captures the effect of the Adjunct Island while controlling for the
i. CP Complement
ii. What1 did you deny [CP that you could afford t1 ?]
ii. NP Complement
i. Who1 t1 denied [NP the fact that you could afford the house?]
ii. * What1 did you deny [NP the fact that you could afford t1 ?]
i. CP Complement
ii. What1 did you know [CP that the woman read t1 ?]
ii. NP Complement
(6) Whether Island
i. CP Complement
(7) WH Island
i. CP Complement
i. Who1 t1 thinks [CP that the doctor bought flowers for the nurse?]
ii. Who1 do you think [CP that the doctor bought flowers for t1 ?]
ii. WH Complement
The only minor exception is Subject Islands, for which the two levels of structure
are simple NPs such as the manager versus complex NPs (NPs that contain another
NP) such as the manager of the store, and the two levels of movement are movement
out of the embedded object position and movement out of the embedded subject
position:
(8) Subject Island
i. Simple NPs
TV show about t1 ]?
about whales]?
These 6 sets of conditions will be presented with and without context sentences
(sentences which are fully lexicalized answers for the questions) to determine whether
fore-knowledge of the meaning interacts with each of these Island effects, and if so, to
what extent.
Participants
As discussed above, Island effects were defined as the interaction between the
and distributed among 8 lists in a Latin Square design. 3 orders of each list were
created (pseudorandomized such that no two related conditions were consecutive),
Crucially, a third factor, context, also with two levels (no context and con-
text), was added to determine whether any of the 6 Island effects are affected by the
intended meaning. The context sentence was a fully lexicalized answer appropriate
(9) You think the speech by the president interrupted the TV show about whales.
Who do you think the speech by interrupted the TV show about whales?
The two levels of context created two versions of each of the 24 lists, one with
context and one without. These two versions were paired to create 24 lists with two
sections each (48 total items to be judged). The order of the two sections was counterbalanced,
with 12 lists presenting items with context first, and 12 lists presenting
The task was magnitude estimation with an If Island reference sentence: What
do you wonder if your mother bought for your father? The directions were a modified
version of the instructions bundled with the WebExp online experimental software
suite (Keller et al. 1998). Each section had its own set of instructions, so that the con-
text sentence could be explained. For the context conditions, the reference sentence
3.2.1 Results
chapter 2, all of the scores were divided by the reference sentence and log-transformed
prior to analysis.
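The normalization step described here can be sketched as follows. This is an illustration only: the function name, the sample values, and the choice of the natural logarithm are assumptions, since the base of the logarithm is not stated in this section.

```python
import math


def normalize(estimate, reference_estimate):
    """Divide a magnitude estimate by the reference estimate, then take the log.

    The natural logarithm is an assumption; the source does not state the base.
    """
    if estimate <= 0 or reference_estimate <= 0:
        raise ValueError("magnitude estimates must be positive")
    return math.log(estimate / reference_estimate)


# Hypothetical participant who assigned 100 to the reference sentence.
scores = [200.0, 100.0, 50.0]
normalized = [normalize(score, 100.0) for score in scores]
```

After this transformation a sentence judged equal to the reference receives 0, and judgments twice and half the reference value lie at equal distances above and below it.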
ture x movement x context for each of the 6 Island types. Island effects are
and effect sizes1 for each Island are given in the following tables:
1
Partial Eta-squared is a measure of the proportion of variance accounted for by the effect. For
Table 3.3. Results for three-way repeated measures ANOVA
As hypothesized, there are highly significant and very large effects for struc-
ture, movement, and the interaction structure x movement for all of the is-
lands except for Subject Islands, which do not show an effect of movement. This
interaction can be seen graphically by the non-parallel lines that emerge when the
four conditions of each Island effect are plotted (solid black lines indicate no-context,
dashed lines indicate context): The exception to the distance effect for Subject Islands
is unsurprising given that movement in the Subject Island condition is the difference
between movement out of subject position and object position in the same clause,
instance, .840 would indicate that 84% of the variance is accounted for by that effect. Following
convention, .01 is considered a small effect, .09 a medium effect, and .25 a large effect.
Figure 3.3. Island effects and context
whereas movement in the other Island effects is the difference between movement
The only other effect is the medium-sized significant main effect of context on WH
Islands. This too is unsurprising given that WH Islands are the only Islands that
no context sentence, the participants must determine which wh-word goes in which
position. The context sentence supplies this information, making the processing of
Discussion
structure and movement, as can be seen by the fact that the dashed lines in the
graphs track the solid lines almost perfectly. This suggests that these 6 Island effects
are not affected by the intended meaning. In fact, there is no effect of context in
this experiment except for the one main effect on WH Islands, which, as previously
mentioned, may simply be due to the unique problem of identifying the gap position
of each wh-word in WH Islands.
The goal of Deane 1991 is to demonstrate that the classic Subjacency account
account for the full range of acceptability facts, and that an attention-based account
is empirically superior. The argument has two parts: i) there are acceptable sentences
that Subjacency predicts should be unacceptable, and ii) there are unacceptable sen-
a. This is one newspaper that the editor exercises strict control over the
publication of.
Deane argues that the crucial factor in each of these cases is semantic, and that a the-
ory of attention can capture these facts more adequately than a structural constraint
when the displaced element must compete for focal attention with another element in
the meaning: for instance, it is not surprising to talk about editors exercising control
over the publication of newspapers, so only newspaper needs attention; however, cars
are not often distinguished based upon their female occupants, so girls commands
Given that this effect is predicated upon the attention of the participant, and
given that attention is determined by how surprising the meaning of the sentence is,
this analysis predicts that the effect may be neutralized in an appropriate context. If
participants are given a context that biases them to expect the non-moved elements in
the sentence, these elements should no longer compete for attention with the moved
wh-word, and one might predict that the effect would disappear. The experiment
Participants
This experiment tested 8 pairs of sentences taken directly from Deane 1991.
Each pair consisted of 1 sentence with elements that do not command attention,
and one sentence with unexpected elements that do command attention, so the pairs
formed the factor attention with two levels: no special attention (acceptable) and
sentences to make the conditions closer to minimal pairs (for instance, what type was
changed to which type in the first pair): The 8 items in each level were distributed
using a Latin Square design yielding 8 non-matched sentence pairs. These pairs were
Table 3.5. Pairs from Deane 1991
acceptable Which apartments do we have security keys to?
unacceptable Which type of security key do you have an apartment with?
acceptable Which reserve divisions do you know the secret locations of?
unacceptable Which locations do you have reserve divisions in?
combined with 3 filler items in a pseudorandomized order such that the two target
sentences were not consecutive, yielding 8 lists of 5 items. The lists were then paired
with non-identical lists to yield 8 surveys, each consisting of two sections, with each
Short (4-5 line) stories were created for each sentence to create a context in
which the elements in the unacceptable conditions would be expected. The sentence
to be judged was always the final sentence of the story, and was bold. For instance:
After test driving many cars, the teenager finally came to a decision about
which car to purchase. He said to his girlfriend: You know, all of the cars
were really nice to drive, but some of them just felt cooler. Knowing him
well, his girlfriend responded: Cooler? I know what’s going on. You’ve
seen commercials for these cars, and some of them had just what you
were looking for, like pretty girls driving them. So tell me: Which car
Two versions of each of the 8 surveys were created: 1 version in which context stories
were added to the first section, and 1 version in which the context stories were added
to the second section, thus the factor context was counterbalanced for order of
presentation across 16 total surveys. An 8 item practice section was added to the
beginning of each survey consisting of items that cover the full 7 point scale.
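The construction of non-matched sentence pairs described above can be sketched as a simple rotation. This is a sketch under assumptions: the pair indices stand in for the actual sentences, and the offset of one position is hypothetical, since the rotation actually used is not specified.

```python
N_PAIRS = 8  # number of sentence pairs taken from Deane 1991


def build_lists(n_pairs=N_PAIRS, offset=1):
    """Return one (acceptable, unacceptable) pairing of pair indices per list.

    List k receives the acceptable sentence of pair k and the unacceptable
    sentence of pair (k + offset) mod n_pairs, so no list contains both
    members of the same original pair. The offset is an assumption.
    """
    if offset % n_pairs == 0:
        raise ValueError("offset must not map a pair onto itself")
    return [(k, (k + offset) % n_pairs) for k in range(n_pairs)]


lists = build_lists()
for acceptable, unacceptable in lists:
    print(f"acceptable from pair {acceptable}, unacceptable from pair {unacceptable}")
```

Under this rotation every sentence appears in exactly one list, and each list contains one sentence from each level of attention.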
The length of the context stories dictated two design decisions. First, as
already mentioned, the survey itself was very short: including practice items there
were only 18 items in the survey. Second, the task chosen was a 7 point ordinal scale
task rather than magnitude estimation: because magnitude estimation involves the
comparison of two sentences, in the context conditions it would require two stories,
and thus make the task almost unmanageable. The instructions for this task were
a modified form of the instructions for previous experiments that included explicit
instruction in the 7 point scale task (with two example items at either end of the
Results
Table 3.6. Mean values and standard deviations for each condition
                            Mean    Standard Deviation
context, acceptable         6.12    0.90
context, unacceptable       5.71    1.60
no context, acceptable      5.42    1.53
no context, unacceptable    5.00    2.04
distributed interval level data, and the 7 point scale task in this experiment only yields
ordinal level data which may or may not be normally distributed. Unfortunately, there
standard ANOVA is often reported for ordinal data in the psychological literature.
To be safe, two analyses were performed. First, a two-way repeated measures ANOVA
was performed on the ordinal data despite violating the assumptions of the test, as
were found with this analysis, but because this could be due to the inappropriate use
of ANOVA, a second analysis was performed. The second analysis was a standard
version of the ordinal data,2 following the suggestion of Conover and Iman 1981. As
Seaman et al. 1994 point out, the Conover and Iman method is more susceptible to
Type I errors - that is, more likely to produce a significant effect. Given that no
significant effects were found with this method either, we can be fairly certain that
Table 3.7. Two-way repeated measures ANOVAs using ordinal and rank data
although there was a nearly significant main effect of context which would reach
significance under a one-tailed test (if we hypothesized that context always leads to
higher acceptability).
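The rank transformation underlying the second analysis can be sketched as follows. This is an illustration, not the analysis script used here: the function name and the sample ratings are hypothetical, ties are assumed to receive the average of the ranks they span, and the subsequent ANOVA on the ranks is not shown.

```python
def rank_transform(values):
    """Replace each value by its rank (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # sorted positions are 0-based
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


ratings = [6, 5, 7, 5, 2]  # hypothetical 7 point scale responses
print(rank_transform(ratings))  # [4.0, 2.5, 5.0, 2.5, 1.0]
```

The ranks, rather than the raw ordinal responses, would then be submitted to the same repeated measures ANOVA.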
Discussion
In the end, the Deane 1991 contrast was not a good candidate for testing the
context hypothesis because the crucial contrast either does not exist, or is not de-
tectable by a 7 point scale task. In fact, while it is possible that a more sensitive
task such as magnitude estimation may detect a significant effect for attention,
there is reason to believe that the contrast is not one between acceptable and unac-
ceptable sentences, but rather between two acceptable sentences: if one inspects the
mean ordinal rankings, we see that all of the conditions are above 5, which is well
above the middle of the scale (3.5). This combined with the lack of any significant
differences suggests that participants found all of the items acceptable. While this is
disappointing from the point of view of investigating the effect of context on Deane’s
effects, as it suggests that the data underlying Deane’s analysis is not robust, and
3.4 Discourse Linking and context
be true discourse or IS constraints. D-linking was chosen for two reasons: First, as the
name suggests, D-linking has been related to the semantic interpretation of the wh-
phrase with respect to a given discourse, thus a priori seems like a good candidate for
Although neither property will figure substantively in any of the analyses in the rest
of this dissertation, they are both major components of any comprehensive theory of
wh-movement.
The Superiority effect is the decrease in acceptability that has been observed
command) wh-word, for instance the object of a verb, is moved ’across’ a structurally
In 14a the subject wh-word who moves to the specifier of the embedded CP and the
sentence is completely acceptable. In 14b the object wh-word what moves to the
specifier of the embedded CP and the sentence is unacceptable. In 14c what again
moves to the specifier of the embedded CP, but does not cross over a ’higher’ wh-word,
Pesetsky 1987 observed that the Superiority effect disappears when the wh-
While a precise semantic definition of D-linking has remained elusive for the past
20 years, the crucial difference appears to be the difference between the possible
sets of answers to D-linked and non-D-linked wh-questions. For instance, the set of
possible answers to the embedded question in 14a is (almost) any human being and
(almost) any piece of reading material. However, the set of possible answers to the
embedded question in 15a is not only restricted to students and books, but to students
and books that have been previously mentioned in the discourse (or possibly made
salient in some non-linguistic way). And as these examples illustrate, D-linking does
not depend upon an answer actually being necessary in the conversation: embedded
questions in English are not normally (or easily) answered in standard conversations,
Given that Superiority appears to be context-dependent in that it is affected
by discourse restrictions on the set of possible answers, the first question investigated
by this experiment is whether a context that restricts the set of possible answers can
for Island violations in which the illicit gap position is filled by a pronoun agreeing
Frazier and Clifton 2002 present a series of experiments demonstrating that D-linked
wh-phrases are ’more prominent’ antecedents for pronouns than non-D-linked wh-
words. One of the experiments is a standard 7 point scale acceptability task comparing
a. Who did the teacher wonder if they had gone to the library?
b. Which students did the teacher wonder if they had gone to the library?
There is a significant effect of D-linking, with mean ratings of 5.58 and 4.87 respectively
pronouns increase the overall acceptability of the sentence. The obvious follow-up
question then is whether context can effect the same increase in acceptability by creating
a D-linking effect for non-D-linked wh-words. However, there are two confounds
in the Frazier and Clifton materials that need to be addressed before manipulating
context.
The first confound is in the materials themselves: all of the items tested by
all conditions, it would most likely lower the acceptability of all conditions equally.
And because the effect is defined as a comparison between two conditions, this should
not have affected the results, but could explain why the mean ratings were so low.
In order to keep the results of the experiment reported in this section comparable to
those of Frazier and Clifton, the Comp-trace violation will be retained in all of the
The second confound is the design of the experiment. It has long been observed
Islands, independently of whether there are resumptive pronouns (e.g., Pesetsky 1987:
This confound means that the effect found by Frazier and Clifton could at least
partially be due to the general effect of D-linking on Islands. Given the other (mostly
reading time) studies presented by Frazier and Clifton, this does not necessarily call
all of their results into question, but it is a confounding factor in their design that
should be rectified. As such, the experiment presented here uses a full 2x2 design of
the factors resumptive pronouns and gaps, and D-linked and non-D-linked wh-words:
(19) Non-D-Linked
b. Who1 did the teacher wonder if they1 had gone to the library?
(20) D-Linked
a. Which student1 did the teacher wonder if t1 had gone to the library?
b. Which student1 did the teacher wonder if they1 had gone to the library?
extra credit. All were native speakers of English and were enrolled in an introductory
linguistics course. The course did not introduce them to the Superiority effect or the
Resumption effect, and the survey was administered in the first half of the semester
The D-linking effect on the Superiority effect is defined as the interaction of the factors
superiority and d-linking, each with two levels: no Superiority violation (i) and
(21) Non-D-linked
(22) D-linked
Similarly, the D-linking effect on the Resumption effect is also defined as the interaction
of two factors with two levels each: resumption, with the levels gap (i) and
(23) Non-D-linked
ii. Who1 did the teacher wonder if they1 had gone to the library?
(24) D-linked
i. Which student1 did the teacher wonder if t1 had gone to the library?
ii. Which student1 did the teacher wonder if they1 had gone to the library?
lists using a Latin Square design. Context stories were created for each sentence in
which the set of possible answers for the wh-word was restricted. The sentence to be
judged was the final sentence of the story, and was bold:
Last semester, Professor Smith assigned 36 books for his literature students
to read, but he knows that no one read all of them. In fact, he’s
pretty sure that each book was read by only one student, so he wants to
only order the books that the literature majors read. Before placing the
book order for the next semester, he thinks to himself: I wish I knew
read.
Two filler items were added to the survey for a total of 10 items, pseudorandomized
such that no two related conditions were consecutive, and preceded by 8 practice
items spanning the entire range of acceptability. Given the length of the context
item survey, then followed by a second survey from this experiment. The two surveys
were paired such that they did not contain the same lexicalizations. Half of the
respondents were given context stories for the first survey but not the second, and
half were given context stories for the second but not the first, such that context
Results
The results of both sub-designs are presented in the following two tables, which
Table 3.8. Superiority: descriptive results
Once again, given that the data obtained from the 7 point scale task is ordinal,
two analyses were performed: one standard three-way repeated measures ANOVA on
the untransformed data, and a second three-way repeated measures ANOVA on the
rank-transformed data (Conover and Iman 1981, Seaman et al. 1994). Both analyses
Table 3.10. Results for three-way repeated measures ANOVA, untransformed data
                    Superiority                     Resumption
                    F       p     partial-eta2      F      p     partial-eta2
sup/res             359.8   ∗∗∗   .800              46.5   ∗∗∗   .341
d-linking           125.4   ∗∗∗   .582              68.7   ∗∗∗   .433
context             29.4    ∗∗∗   .246              64.1   ∗∗∗   .416
sup/res x d-link    101.2   ∗∗∗   .529              9.1    ∗∗    .092
con x sup/res       16.8    ∗∗∗   .158              1.3    .258  .014
con x d-link        0.1     .717  .001              0.8    .365  .009
c x s/r x d         1.4     .238  .015              0.4    .523  .005
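The relation between an F value and its partial eta-squared can be made explicit. Partial eta-squared is the effect sum of squares divided by the sum of the effect and error sums of squares, which is equivalent to F x df_effect / (F x df_effect + df_error). The sketch below is illustrative: the degrees of freedom are not reported in this section, so df_effect = 1 and df_error = 90 are assumptions used only to show the arithmetic.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta-squared recovered from an F value and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F for the sup/res effect on Superiority, as listed in Table 3.10.
# The degrees of freedom (1, 90) are assumed, not taken from the source.
value = partial_eta_squared(359.8, df_effect=1, df_error=90)
print(round(value, 3))  # 0.8
```

Under these assumed degrees of freedom the computed value agrees with the tabled .800 to three decimal places.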
As the tables indicate, the two analyses returned the same results. There
was a large and highly significant main effect of each factor. There was a large and
highly significant interaction of D-linking with the Superiority effect, confirming the
Table 3.11. Results for three-way repeated measures ANOVA, rank-transformed data
                    Superiority                     Resumption
                    F       p     partial-eta2      F      p     partial-eta2
sup/res             431.9   ∗∗∗   .828              43.5   ∗∗∗   .326
d-linking           100.4   ∗∗∗   .527              77.3   ∗∗∗   .462
context             24.3    ∗∗∗   .213              69.0   ∗∗∗   .434
sup/res x d-link    79.8    ∗∗∗   .470              6.4    ∗     .066
con x sup/res       4.0     ∗     .043              2.1    .152  .023
con x d-link        0.7     .403  .008              0.1    .728  .001
c x s/r x d         0.1     .710  .002              0.0    .936  .001
observation of Pesetsky 1987, and a medium sized interaction of D-linking and the
Resumption effect, confirming the observation of Frazier and Clifton 2002. There was
also a surprising interaction of context and the Superiority effect, but no equivalent
interaction of context and the Resumption effect. There was no interaction of context
Discussion
Turning first to the main effects, we see that the main effects of
superiority and resumption confirm that the Superiority and Resumption effects
exist, which is in itself unsurprising. However, we also see that there is a main effect of
Superiority and non-Island structures. In itself this effect is also not very surprising -
and acceptability judgments are predicated upon the participants determining the
meaning of the sentence, which may be easier with this extra information. However,
this effect does raise questions for the observation mentioned briefly before that Island
effects are weaker with D-linked wh-phrases. If D-linking increases acceptability for all
structures, then an experiment along the lines of the first experiment in this chapter
will be necessary to determine whether D-linking has more of an effect on Islands than
non-Islands. The final main effect, context, is also unsurprising as we have been
amassing evidence throughout this chapter that context increases the acceptability
of all structures.
Turning next to the interactions, we see that this experiment confirms the
observations of Pesetsky and Frazier and Clifton that D-linking interacts with the
context and effect for Superiority, which suggests that the main effect of context
ing the means of each of the conditions highlights this effect, as the vertical distance
is greater between the context (dashed lines) and no-context (solid lines) conditions
[Figure: Superiority graph. Legend: no violation and Superiority violation, each with context and no context; x-axis: non-D-linked, D-linked]
The fact that context has less of an effect on no-violation may be an artifact
of the task: the no-violation conditions in this experiment are already near the ceil-
ing of the scale (with means above 6), and context has a main effect of increasing
acceptability. Given that there is very little room left on the scale for increasing
acceptability for the no-violation condition, it may be the case that this interaction
is really a ceiling effect. A follow-up study using a ceiling-less response scale, such as
magnitude estimation, could determine whether this interaction is indeed real or just
The final two interactions are the two of interest for this experiment, as they
the effect of D-linking does not change based on whether there is context or not, or
findings are reinforced by the parallel patterns of no context (solid lines) and context
(dashed lines) conditions in the Superiority graph above, and the Resumption graph
below. So it seems once again we’ve failed to find an interaction of context with
Figure 3.5. Resumption and D-Linking
[Plot legend: resumption and gap, each with context and no context; x-axis: non-D-linked, D-linked]
plex. Compounding this complexity is the fact that the presence and source of context
effects must be identified for each piece of grammatical knowledge case by case. This
chapter laid out several possibilities for these underlying sources, and investigated
erties, and if so, which source causes the effect. The results suggest that various types
this is unfortunate from the point of view of studying context effects, these results
are encouraging for the study of structural effects of wh-movement: they suggest
that context need not be considered in subsequent acceptability studies. While the
effect of context on other aspects of grammatical theory awaits future research, these
results suggest, at a minimum, that when it comes to Island effects and D-linking,
Chapter 4
ter, that is more acceptable, after several repetitions. The fact that some violations
satiate while others do not has been interpreted in the literature as an indication
the violation (or the nature of the grammatical knowledge) affects its relationship
with acceptability such that acceptability may change over time. This chapter takes
a closer look at the satiation effect using the tool of experimental syntax. Section 1
reviews the existing satiation literature and the motivation for interpreting satiation
for such analyses: the fact that the reported satiation results cannot be replicated.
The remainder of the chapter attempts to tease apart various experimental factors
that may have led to the satiation effect. The picture that emerges is one in which
the satiation effect is actually an artifact of experimental design rather than a nat-
4.1 The problem of Syntactic Satiation
Nearly every linguist has been there. After judging several sentences with the
same structure over days or even months while working on a project, the accept-
ability begins to increase - an effect that has come to be called syntactic satiation
(Snyder 2000). While this sounds like a minor occupational hazard for linguists, it
belies a serious problem: the complex analyses created by syntacticians are based
sentences tend to become acceptable over time, then there is reason to be skeptical
of the analyses. Snyder 2000 offers a provocative response to this state of affairs:
if it is the case that some violations satiate while others do not, then this may be
a crucial piece of evidence for syntactic analyses. One possibility is that there are
different classes of violations, those that satiate and those that do not, which needs to
be taken into account by syntactic analyses. Another possibility is that the satiating
violations may not be due to grammatical effects at all, and may actually indicate that
the source of the initial unacceptability is a processing effect (that can be overcome
systematically occurs with some violations and not others, then it is not a problem
for syntactic analyses at all, but rather a new set of data that needs to be integrated
Snyder 2000 reports an experiment that does indeed suggest that only cer-
English with a survey to investigate whether the following 7 violations satiate over
5 repetitions: Adjunct Island, Complex NP Constraint (CNPC) Island, Left Branch
Constraint (LBC) violation, Subject Island, That-trace effect, Want-for effect, and
Whether Island.
The results suggest that Whether Islands and CNPC Islands do satiate, that Subject
Islands marginally satiate, and that the other violations do not satiate over 5 repetitions.
Prima facie, the fact that only a subset of violations exhibits satiation confirms Snyder’s
contention that satiation could be a new type of classifying data for linguistic analysis,
Interest in these findings has led to at least two follow-up studies: the
natural classes of constraints within the grammar itself, and the second, Goodall
and processing-based effects. These follow-up studies are near replications of Snyder’s
original experiment in design, task, and content. However, the results are at best only
a partial replication:1
1
Hiramatsu used Snyder’s original materials, but added 2 blocks, and therefore 2 instances of
each violation to the end of the survey (resulting in 7 instances of each violation) to investigate
whether additional exposures would lead to satiation of Subject Islands (which were marginal in
Table 4.2. Summary of results for Snyder 2000, Hiramatsu 2000, and Goodall 2005
satiation is systematic and thus a valid object of study, this lack of replicability is
distressing. This chapter continues in the tradition of Snyder 2000 and the follow-
not others. The picture that emerges is that Snyder’s original results are not easily
replicable, what I will call the replication problem, suggesting that the source under-
lying satiation in Snyder’s results is not the violation, but some other property of the
aspects of the design that could give rise to the judgment instability, in particular
the statistical definition of satiation, the task used, and the composition of the ex-
attempt to isolate the conditions necessary to license the type of judgment instabil-
the Snyder 2000 study). Goodall followed the general design of Snyder 2000 in that there were 5
blocks of 10 sentences, but there were 6 violations per block instead of 7. Five of these violations are
listed in the table. The sixth violation was the violation of interest, lack of Subject-Aux inversion
ity reported by Snyder. The results suggest that judgment instability only arises in
acceptable sentences), and is much more likely to occur in categorical tasks such as
the yes/no task than in non-categorical tasks such as magnitude estimation. These
piece of evidence at the center of Snyder’s original claim: that in the rare cases when
that satiate, not other weak violations such as the That-trace effect. This receives a
natural explanation under a task-centered account, as it has long been known that
the judgment process underlying violations that are easily correctable (e.g., the That-
trace effect) is qualitatively different from violations that have no obvious correction
(e.g., Island effects) (Crain and Fodor 1987). While these findings cast doubt on
Snyder’s original solution to the satiation problem (that satiation can be studied like
any other property of violations), they simultaneously cast doubt on the satiation
problem itself, instead suggesting that judgments are a strikingly stable type of data.
lem, the first step is to confirm that the replication problem is more than just an
accident of the Hiramatsu and Goodall experiments. This subsection reports three
additional attempts at replication. The first is a direct attempt at replication using
the very same materials as Snyder 2000.2 The second attempt also uses the materials
from Snyder 2000, but includes an additional task after each yes/no judgment: par-
ticipants were also asked to rate their confidence in each yes/no judgment on a scale
of 1 to 7. The third attempt uses the same design as that of Snyder 2000, but with a
few small modifications and new materials. As we shall see, none of these replications
resulted in satiation.
Participants
The direct replication was identical in all respects to the study in Snyder
2000. The Snyder 2000 design was a standard blocked design with 5 blocks, each
containing 10 items. Of the 10 items in each block, 7 of the items were the violations
discussed previously, and the remaining 3 items were grammatical fillers. The order
of the items in each block was randomized, and 2 global orders of items were created,
each the reverse order of the other. This forward-backward balance for the order of
2
A special thank you to William Snyder for providing the original materials.
presentation ensured that the specific tokens of each violation that were seen 1st and
2nd in one order were 4th and 5th in the other order. Thus, the responses in the first
two blocks could be compared to the final two blocks without interference from order
Each sentence was preceded by a context sentence that provided a fully lexicalized
potential answer to the question to be judged. The task was a standard yes/no task.
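The forward-backward balance of the two global orders can be illustrated with a short sketch. The item identifiers are hypothetical placeholders for the 50 survey items; only the block structure (5 blocks of 10 items) comes from the design described above.

```python
N_BLOCKS = 5
ITEMS_PER_BLOCK = 10

# Hypothetical item identifiers 0..49 in the forward order of presentation.
forward = list(range(N_BLOCKS * ITEMS_PER_BLOCK))
# The second global order is the exact reverse of the first.
backward = forward[::-1]


def block_of(item, order):
    """1-based block in which an item is presented under a given global order."""
    return order.index(item) // ITEMS_PER_BLOCK + 1


for item in (0, 15, 25):
    print(item, block_of(item, forward), block_of(item, backward))
```

An item in block 1 of one order falls in block 5 of the other, and an item in block 2 falls in block 4, so each token contributes equally to the early and late blocks across participants.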
The replication with confidence judgments was also identical in materials and
design to the Snyder 2000 study, with the addition of a confidence judgment with
their yes/no judgment on a scale from 1 to 7 (1 being the least confident) following
each item. Therefore each page of the survey included the context sentence, the
question to be judged, a line for indicating yes or no, and a scale from 1 to 7 for
The modified replication followed the general design of Snyder 2000 with a few
minor modifications. First, there were 8 violations per block instead of 7, resulting
in 2 acceptable sentences per block rather than 3 (the reason for this modification
will be discussed in section 2.3). Second, all of the unacceptable sentence types were
Island violations. Third, the individual sentences were constructed according to the
following parameters: i) the length of all of the sentences was 2 clauses, and the length
in number of words was identical for every token of each violation; ii) all of the moved
wh-words for the violations were either who or what to avoid the known acceptability
effects of other wh-words3 (except for LBC violations for which this is impossible); iii) all
3
For instance, wh-phrases involving which have been observed to be more acceptable than other
of the names chosen were high frequency (appearing in the top 100 names of the 1980s
according to the Social Security Administration). Fourth, the order of the blocks was
distributed using a Latin Square design resulting in 5. The island violations tested
Subject Island (ISS), Left Branch Condition (LBC), Relative Clause Island (RC),
Sentential Subject Island (SS), Complex Noun Phrase Constraint (CNPC), and the
Whether Island:
Results
The data from these experiments were analyzed following the procedure in
Snyder 2000. The steps to this procedure are discussed in detail in section 2.2.2, so
they will not be repeated here. However, the basic method is to compare the number
changed from yes to no by using the Sign Test. If the Sign Test returns a significant
wh-words when extracted out of Islands (Pesetsky 1987). Also, wh-adjuncts such as where, when,
how and why can modify most predicates, therefore there is always an acceptable interpretation
of Island violations involving wh-adjuncts in which the displaced wh-adjunct modifies the matrix
predicate.
result, then that violation is interpreted as satiating. The results for each condition
Discussion
As one can see, there was no satiation in any of these replications, including the
direct replication without any modifications whatsoever. The results can be added to
the previous three satiation studies, yielding a new summary of results that confirms
that the replication problem is real: no violation satiates in more than 2 studies
The replication problem suggests that violation-type is not the primary factor
in predicting satiation. The question then is whether there are other factors, perhaps
components of Snyder’s original design, that also contribute to the judgment insta-
bility that leads to satiation as defined in Snyder 2000. In any experiment there are
sometimes asked whether Snyder’s original sample may have been unique in being from a private
1. the design
2. the task
factors (artifacts) that may contribute to the effect being investigated (Schütze 1996,
Cowart 1997, Kaan and Stowe ms). For instance, in these experiments, it is unlikely
that fatigue is contributing to changes in judgment for any given sentence, as the order
of presentation is balanced across all of the participants (if Participant 1 sees sentence
A first, then Participant 2 sees sentence A last). However, these experiments do not
control for the possibility that participants are biased in their responses, perhaps due
to a response strategy: 70% of the items in the survey are by hypothesis unacceptable,
which has the potential to bias participants toward judging sentences as acceptable
The task chosen for these experiments was the yes/no task in which partici-
pants are asked whether a sentence is an acceptable sentence of English. The yes/no
task is a categorization task consisting of two categories, and therefore suffers from
two drawbacks given the design used in these studies. First, because there are only
that participants are not memorizing their responses. While this is not stated as a goal by Snyder
himself (2000), it is possible, and would mean that both unbalanced and balanced designs would
introduce an artifact (response strategy versus memorization). This issue is taken up in a later
section devoted to controlling for participants that may be memorizing their responses.
easy to track two response types, and as the experiment progresses, realize that one is
being used disproportionately more often than the other. In fact, verbal debriefing of
participants confirms this as nearly every participant asked why there were so many
‘bad’ sentences. Second, as a very extreme categorization task, the yes/no task is
prone to lose potentially relevant data: there is a growing body of research indicating
that acceptability may be best characterized as a continuous quantity (e.g., Bard
1. Count the number of yes responses in the first two exposures of each violation
for each participant
2. Count the number of yes responses in the last two exposures of each violation
for each participant
3. If the number of yes responses in the last two exposures is higher than in the
first two, the participant is defined as satiating for that violation
4. If the number of yes responses in the last two exposures is lower than in the
first two, the participant is defined as not-satiating for that violation
5. For each violation, count the number of participants that satiated and the
number of participants that did not satiate (n.b., the participants whose responses
were stable are ignored)
6. If there are statistically more satiaters than non-satiaters, then the violation is
said to satiate.
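The six steps above can be sketched in Python as follows. This is only an illustrative sketch, not the analysis script used in Snyder 2000 or in the replications reported here: the data layout (one list of yes/no responses per participant, ordered by exposure), the function names, and the choice of a two-sided exact Sign Test are all assumptions made for the example.

```python
from math import comb

def classify(responses):
    """Classify one participant's responses to one violation.

    `responses` is a list of booleans (True = yes), ordered by exposure.
    Returns +1 (satiating), -1 (not-satiating), or 0 (stable).
    """
    early = sum(responses[:2])    # yes count in the first two exposures
    late = sum(responses[-2:])    # yes count in the last two exposures
    return (late > early) - (late < early)

def sign_test_p(n_up, n_down):
    """Two-sided exact Sign Test p-value (assumed variant); stable cases are ignored."""
    n = n_up + n_down
    if n == 0:
        return 1.0
    k = max(n_up, n_down)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def violation_satiates(participants, alpha=0.05):
    """Apply steps 1-6 to one violation; `participants` is a list of response lists."""
    labels = [classify(r) for r in participants]
    n_up, n_down = labels.count(1), labels.count(-1)
    return n_up > n_down and sign_test_p(n_up, n_down) < alpha
```

Under these assumptions, a sample with no non-satiaters needs at least six satiaters before the test falls below .05 (six give p = .03125, five give p = .0625), regardless of how many stable participants there are.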
Basically, this definition asks: For those people who have unstable judgments, are
the analysis, this definition artificially limits the scope of satiation in two important
it is a property of violations in speakers of English who have unstable judgments,
CNPC Islands only 23% of the population is unstable, for Whether Islands 55% of
the population is unstable. Second, the question of whether the instability is positive
or negative is a biased question: these are violations, which all things being equal,
will be more likely to be judged no than yes. If there is instability at all, then one
would expect it to manifest itself as a change from no to yes because of the initial
disproportion.
One can see how these three factors could interact to license the type of insta-
bility that is labeled satiation under Snyder’s original definition. The task involves
two response choices, so participants are likely to employ a strategy to balance them.
The disproportionate number of unacceptable sentences means that the strategy will
be one that leads to more yes responses later in the experiment. Because stable par-
ticipants are excluded, the final analysis is conducted over those participants who
demonstrated instability, or in other words, participants who are likely to have em-
ployed just such a strategy. The fact that these violations are initially judged un-
acceptable, and the fact that the composition of the survey leads to a strategy of
increasing yes responses, make it unsurprising that a satiation effect is found when
in itself (e.g., Why does this strategy tend to affect Whether Islands but not That-trace
effects? - a question that is briefly considered at the end of this chapter). However,
given that the effect is defined over the yes/no task, it still is not obvious that the
quantity, then the categories of yes and no might be masking the true nature of
increase in yes responses over time, there are at least three potential models for the
The first model is one in which the acceptability judgments increase over time,
and eventually those that started below the yes/no threshold (solid line) cross it to
become yes responses. This is the model that Snyder and others have assumed is
underlying the satiation effect in the previous studies, and correspondingly, is the one
that is informative from the point of view of studying the violations themselves. The
second model is one in which the mean acceptability of the violation does not change
at all, but the spread or variation of the judgments does change over time, such that
some of the judgments that were below the yes/no threshold cross it over time. If
this were the model underlying previous satiation effects it would be evidence that
judgments are not a good source of data, at least for some violations. The final model
demonstrates that defining satiation based on a categorical judgment means that the
effect could be the result of a change in the category threshold rather than a change in
the underlying percept. If the threshold decreases over time, it would have the same
effect: an increase in the number of yes responses. If this were the model underlying
previous satiation findings, then it would simply be evidence that categorization tasks
do more than lose potentially relevant data, they may also indicate changes in the
terpreted as satiation, these models will begin to be teased apart. Models 1 and 2 are trivial to
investigate as they simply require a non-categorical task. Model 3 is probably impossible to mea-
sure directly, as the threshold in yes/no tasks is most likely due to a combination of normative
factors (grammar, processing, frequency, information structure, context, etc.). However, suggestive
evidence for model 3 will come from two sources: i) that models 1 and 2 appear to be incorrect, and
ii) that confidence in yes/no judgments decreases over the course of these experiments, which could
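The three models can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that judgments of a violation are normally distributed on a continuous acceptability scale and that a yes response is given whenever a judgment exceeds the threshold; the parameter values and names are hypothetical. Each model changes exactly one parameter between early and late exposures.

```python
from math import erf, sqrt

def p_yes(mean, sd, threshold):
    """Probability that a normally distributed judgment exceeds the yes/no threshold."""
    z = (threshold - mean) / sd
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# Hypothetical starting point: a violation whose judgments sit below the threshold.
START = dict(mean=-1.0, sd=1.0, threshold=0.0)

# Each model changes exactly one parameter between early and late exposures.
MODELS = {
    "model 1: mean rises": dict(START, mean=-0.5),
    "model 2: spread grows": dict(START, sd=2.0),
    "model 3: threshold falls": dict(START, threshold=-0.5),
}

def yes_rates():
    """Return {model name: (early yes rate, late yes rate)}."""
    early = p_yes(**START)
    return {name: (early, p_yes(**late)) for name, late in MODELS.items()}

if __name__ == "__main__":
    for name, (early, late) in yes_rates().items():
        print(f"{name}: {early:.3f} -> {late:.3f}")
```

With these particular values all three models raise the proportion of yes responses by the same amount, which illustrates why a categorical task alone cannot distinguish among them.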
4.2.3 A roadmap
The contribution of these three factors (design, task, and definition) to the
several experiments. For instance, it is fairly straightforward to cross the factor type
of task (with 2 levels: yes/no (categorical) and MagE (non-categorical)) with the
factor type of design (also with 2 levels: balanced and unbalanced), resulting in
             Yes/No      MagE
Balanced     ?           ?
Unbalanced   unstable    ?
The problem with the definition of satiation is probably the hardest to manip-
ulate. The Snyder 2000 definition is technically a valid definition, albeit with a very
limited scope. To ensure that the experiments in this study are directly comparable
which is one particular statistical definition (that can be argued over), and instabil-
ity, which is any statistically definable change in judgments for a given structure.
Broadening the domain of investigation in this way allows for the possibility that
the satiation effect only occurs under very specific circumstances (for instance under
certain tasks with certain designs), thus allowing an investigation into the replication
4.3 The yes/no task and balanced/unbalanced designs
Crossing the factors task and design leads to 4 cells. One of these cells has
been studied extensively: the effect of yes/no tasks in unbalanced designs. As we have
seen, these two factors do lead to judgment instability with some violations in some
experiments, but not in every experiment. One of the experiments in this cell differed
from Snyder’s design in two small ways: first, it was composed of 8 violations instead
of 7, and second, the violations have all at one point or another been classified as
Island violations. The reason for these changes can now be made explicit: If we focus
attention on the two violations that Snyder interpreted as satiating (Whether Island
and CNPC Island), then this design is really 6 non-satiating violations, 2 satiating
violations, and 2 completely acceptable sentences. We can then manipulate the design
to be the inverse, while maintaining the two satiating violations as a pivot point: 2
Thus, the effect of the factor design on the yes/no task can be isolated:
Unbalanced                               Balanced
Adjunct Island                           Acceptable Sentence
Coordinate Structure Constraint          Acceptable Sentence
Infinitival Sentential Subject Island    Acceptable Sentence
LBC Violation                            Acceptable Sentence
Relative Clause Island                   Acceptable Sentence
Sentential Subject Island                Acceptable Sentence
CNPC Island                              CNPC Island
Whether Island                           Whether Island
Acceptable Sentence                      Adjunct Island
Acceptable Sentence                      Relative Clause Island
Participants
glish, none with formal exposure to linguistics, participated in the unbalanced ex-
unrelated self-paced reading study during their visit to the lab. All of the participants
ceptable sentences. Therefore the items were divided into 5 blocks of 10 items, with
the composition of the 10 items manipulated as outlined above. All of the items
for LBC violations for which this is impossible. All of the items were controlled for
length in clauses and in number of words. The order of presentation of the blocks
was distributed using a Latin Square design, and the items within each block were
pseudorandomized such that two acceptable sentences did not follow each other in
the unbalanced design, and two violations did not follow each other in the balanced
7
The sample size in Snyder 2000 was 22, which was the target for both of these experiments.
Human error in the scheduling of participants resulted in 3 additional participants in the unbalanced
experiment, and 3 fewer in the balanced, as these were run concurrently. The unequal sample sizes
design. The instructions for the task were identical to those in Snyder 2000.
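The block ordering and within-block pseudorandomization described above can be sketched as follows. The text does not say which Latin Square was used or how the pseudorandomization was carried out, so the cyclic square, the rejection-sampling shuffle, the item labels, and the function names are all assumptions made for illustration.

```python
import random

def latin_square(n):
    """Cyclic n x n Latin square: row i is the block order for participant group i."""
    return [[(i + j) % n for j in range(n)] for i in range(n)]

def pseudorandomize(items, is_restricted, rng, max_tries=1000):
    """Shuffle `items` until no two restricted items are adjacent (rejection sampling)."""
    order = list(items)
    for _ in range(max_tries):
        rng.shuffle(order)
        if not any(is_restricted(a) and is_restricted(b)
                   for a, b in zip(order, order[1:])):
            return order
    raise ValueError("no valid order found; the constraint may be unsatisfiable")

# Hypothetical block in the unbalanced design: 8 violations (V) and 2 acceptable (A).
block = [f"V{i}" for i in range(1, 9)] + ["A1", "A2"]
rng = random.Random(0)  # fixed seed so the example is reproducible
order = pseudorandomize(block, lambda s: s.startswith("A"), rng)
```

For the balanced design the same function would be called with the violations as the restricted items, so that no two violations follow each other.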
Results
participants whose judgments change even once, we find that 15 out of 25 partici-
judgments change in the balanced experiment. Fisher’s exact test reveals that this
Discussion
Once again, we face the replication problem: there were no effects by Snyder’s
instability, in this case, the number of participants that show a change in judgments,
we see that there is a significant effect of design: unbalanced designs lead to much
more instability. In fact, there was barely any instability under the more balanced
design. This is a first step toward our goal of isolating the factors or interactions
that lead to judgment instability: at least within yes/no tasks, balanced designs are
less susceptible to judgment instability - as expected under the theory that unbal-
anced designs lead to a response strategy that causes participants to include more
Recall that there are at least two reasons that yes/no tasks are a less than ideal
choice for investigating judgment instability. First, as a categorization task with only
two categories, they may be more likely to lead to a response strategy under an
true nature of the acceptability judgments, leaving at least three different types of
instability that could lead to a satiation effect, and no way to determine which is
actually the cause. Overcoming these two problems simply requires a non-categorization task
for measuring acceptability, such as magnitude estimation (Stevens 1957, Bard et al.
1996).
4.4.1 Magnitude estimation and balanced designs
Having seen the effect of balanced and unbalanced designs on the yes/no task,
and having seen the benefits of magnitude estimation over categorization tasks, the
next logical step is to investigate whether design has a similar effect on magnitude
estimation, and if so, where the source of the instability lies. The first set of ex-
stable under a balanced design. In this case, balanced design refers to a set of design
properties which are part of the best practices of psycholinguistic experimental design
This subsection reports the results of 5 magnitude estimation experiments using this
type of balanced design. Each of the first 4 tested a different Island violation (Subject,
Adjunct, Whether, and CNPC Islands). Because the magnitude estimation task
requires the comparison of two sentences, context sentences were not included in
these 4 designs even though they were part of Snyder’s original design (as well as the
replication attempts). To ensure that the lack of context sentences had no effect on
the results, a fifth experiment was included in which the CNPC Islands were tested
8
There are undoubtedly many more “best practices” for materials construction (e.g., controlling
for the frequency of lexical items). And while the Snyder 2000 materials violated some of these best
practices as well, it seems likely that they contributed noise across the conditions, not artifacts into
again, but this time with context sentences along the lines of those in Snyder 2000.
Participants
glish. Two of the experiments, Subject and Adjunct Islands, were administered over
the internet using the WebExp experimental software suite (Keller et al. 1998) in
exchange for extra course credit. The other three experiments were conducted in the
participants also participated in an unrelated self-paced reading study and were paid
for their time. The sample sizes were 20, 24, 20, 17, and 20 for Subject, Adjunct,
The general design for each of these experiments was identical. Materials were
distributed using a blocked design. Each block contained 2 tokens of the island viola-
of each block ensured adherence to the balanced design ratios of 1:1 acceptable to
9
The sample sizes for Subject and Adjunct Island experiments were based upon responses to an
extra credit offer, and therefore were contingent upon the number of students who followed through
and completed the survey. Given that the smaller sample size was 20 in these two experiments,
the target sample size for the Whether, CNPC and CNPC with context experiments was also 20.
However, 3 participants were eliminated from the CNPC analysis for reporting a
unacceptable and 2:1 distracters to experimental items:10
Both the experimental items and distracters were controlled for length in
clauses and length in number of words (with the exact number varying by exper-
iment given the differences among the Island violations). All of the experimental
involving which (Pesetsky 1987). Half of the experimental items used the present
form do and half the past did. The distracters included all possible wh-words for
10
One may wonder why this design was chosen over the balanced design used in the yes/no
tasks. The answer is straightforward: because there was no effect (no instability) in that design,
it seems unlikely that there would be an effect under magnitude estimation. While running that
design under magnitude estimation would certainly prove the stability point that is made with these
experiments, the fact that there are only 5 exposures of each violation type in that design means
that the results would be of limited value (e.g., perhaps satiation would occur after 7 exposures as
Hiramatsu 2000 has claimed for Subject Islands). These designs not only allow us to closely follow
general psycholinguistic best practices, but also allow us to increase the number of exposures of
each Island type (10 or 14) without overburdening the participants with too many judgments (recall
that magnitude estimation requires comparing two judgments for each data point, and a little bit
variety (crucially including who and what to avoid response strategies).
The instructions for all of the experiments were a modified version of the
instructions published with the WebExp experimental software package. The modi-
fications were i) changing the example items from declaratives to questions because
of the nature of the experimental items, and ii) including a short passage indicating
that the task was not a memory task to discourage participants from attempting
to memorize their responses. All of the experiments were preceded by two practice
phases: the first to teach them the magnitude estimation task using line lengths, the
It should be noted that there were some minor differences among the exper-
iments. However, there are no theoretical reasons to suspect that these differences
would affect the result. In fact, the differences were included in an attempt to ensure
that the lack of instability (i.e., the stability) found in these experiments was not due
to some unknown experimental design decision. Thus, the fact that these differences
do not actually result in any effect is further corroboration of the striking stability
(25) Differences among the experiments
i. Subject and Adjunct Islands were 7 blocks long (14 exposures), the others
ii. Subject and Adjunct Islands used the reference What did Kate prevent
there from being in the cafeteria?, the others used the reference What did
you say that Larry bought a shirt and?. Neither sentence type has ever
iii. The unacceptable distracters for Subject and Adjunct Islands were agree-
ment violations. The unacceptable distracters for the others were Infini-
tival Sentential Subject Island violations. Neither sentence type has been
claimed to satiate.
iv. Subject and Adjunct Islands were administered over the internet, the
Results
Because the unit of measure is the reference value, the first step in analyzing
magnitude estimation data is to divide all of the responses by the reference value to
obtain a standard interval scale of measure. Because responses are made with the
set of positive numbers, and because the set of positive numbers is unbounded to the
right and bounded by zero to the left, magnitude estimation data is not normally
distributed (it has a rightward skew). To correct for this non-normality, standard
analysis. The log transformation is chosen for at least two reasons: first, it minimizes
the impact of large numbers, thus bringing the data closer to normal; second, it is
straightforward to calculate the geometric mean after a log transformation (it only
requires exponentiation), and given that psychophysics is concerned with the ratios
of stimulus estimates, the geometric mean is the appropriate measure of central tendency. In
point for acceptability), so the second reason for using the log transformation does
not hold. However, it is still the case that the data is non-normal, therefore must
available in inferential statistics, the log transformation is well established within the
magnitude estimation literature, and ensures that the analysis of linguistic magnitude
stimulus of interest.
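The preprocessing steps described above (division by the reference value, log transformation, and recovery of the geometric mean by exponentiation) can be sketched as follows. The natural logarithm, the function names, and the example values are assumptions made for illustration; the text does not specify the base of the logarithm.

```python
from math import exp, log

def normalize(responses, reference_value):
    """Divide raw magnitude estimates by the reference value, then take natural logs."""
    if reference_value <= 0 or any(r <= 0 for r in responses):
        raise ValueError("magnitude estimates must be positive")
    return [log(r / reference_value) for r in responses]

def geometric_mean(log_scores):
    """Geometric mean of the ratio scores, recovered by exponentiating the mean log."""
    return exp(sum(log_scores) / len(log_scores))

# Hypothetical participant: reference assigned 100, followed by four raw judgments.
logs = normalize([50, 100, 200, 400], reference_value=100)
gm = geometric_mean(logs)
```

In this hypothetical case the ratios are 0.5, 1, 2, and 4, so the geometric mean is the fourth root of 4 (about 1.41), whereas the arithmetic mean of the ratios (1.875) is pulled upward by the largest response.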
and Myers 1990, were performed on the means of the log-transformed judgments for
each island to determine whether the mean of the responses changed after repeated
exposures. The essence of the test is to compare two lines: the first line is simply
the horizontal line defined by the grand mean of every response, thus a line that
assumes no change based on the number of exposures; the second line is the line of
best fit obtained by looking at each exposure to the violation independently. If there
is an effect of repeated exposures, then this second line will be significantly different
from the grand mean line. Although there is no standard method for reporting
linear regression coefficients, a table listing the y-intercept and slope of each line is
provided (the two coefficients that define any line, represented in linear regression by
the variable b and Exp(b) respectively), along with the p-value of the comparison
between this line and the grand mean. As is evident, there is no effect of repetition
on the means:
Table 4.14. Linear regressions for means of magnitude estimation in a balanced design
effect is found in the variation or spread of the scores over time, not in their means.
Therefore a repeated measures linear regression was also performed on the residual
scores. Residual scores are the absolute value of the difference between each score
and the grand mean, and form the basis for Levene’s test for homogeneity of variance
(Levene 1960). If the spread of the scores is increasing over time, then there should
be an increase in residual scores over time. A table similar to the one for means is
sentative scatterplot (Subject Island) of the scores and the non-significant trendline
is also included below for a graphical representation of the linear regression method.
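The logic of the two regressions can be sketched as follows. This is a minimal sketch in the spirit of the by-participant approach usually associated with Lorch and Myers 1990 (fit a separate regression for each participant, then test the slopes across participants); it is not the exact procedure used to produce the tables here, and the function names and data layout are assumptions. The same slope test is applied once to the log-transformed scores and once to their residual scores.

```python
from math import sqrt
from statistics import mean, stdev

def ols(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def abs_residuals(scores, grand_mean):
    """Residual scores as defined in the text: |score - grand mean|."""
    return [abs(s - grand_mean) for s in scores]

def slope_t(per_participant):
    """One-sample t statistic testing whether the mean per-participant slope is zero.

    `per_participant` holds one list of scores per participant, ordered by exposure.
    Returns (mean slope, t, degrees of freedom); the slopes must not all be identical.
    """
    slopes = [ols(list(range(1, len(ys) + 1)), ys)[1] for ys in per_participant]
    n = len(slopes)
    t = mean(slopes) / (stdev(slopes) / sqrt(n))
    return mean(slopes), t, n - 1
```

A slope that does not differ reliably from zero corresponds to the horizontal grand-mean line, that is, to no effect of repeated exposure on the means (or, for residual scores, on the spread).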
Discussion
There are three major points made by these analyses. First, there is no increase
in mean judgments using the magnitude estimation task in a balanced design. Second,
balanced design. And finally, the fact that the first four experiments did not include
Table 4.15. Linear regressions for residuals of magnitude estimation in a balanced
design
[Figure: representative scatterplots (Subject Island) with trendlines; left panel: Means, right panel: Residuals; y-axis: judgment, x-axis: repetitions.]
context sentences did not have an effect, as there was no effect of mean or residuals
in the CNPC Island with context experiment. These facts are not entirely surprising
given the lack of instability of the yes/no task with a balanced design. However, taken
together, these results indicate that the instability that led to the satiation effect in
Snyder 2000 does not persist in balanced designs, either for categorical yes/no data
4.4.2 The memory confound
under balanced designs could be due to a confound that is always present in balanced
designs: balanced designs increase the likelihood that participants are able to track
their responses, and this memorization may lead to the statistical stability. While the
In principle, there are two types of evidence that may bear on this problem.
First, if participants were indeed memorizing their responses, one might expect them
to report this during post-experiment debriefing. One of the standard questions dur-
ing debriefing is whether they noticed any sentences being repeated, or any sentences
such similarities. However, given that the participants had no training in linguistics,
they may have lacked the vocabulary to describe their intuitions. Indeed, many of
the participants were highly cognizant of there being ‘bad’ sentences since they had
any participant that reported the same judgment 5 or more times. Given the nature
of memorization, one might choose to only remove participants with repeated scores
later in the course of the experiment, or to only remove participants with consecutive
point in the experiment, even if none of the repetitions were consecutive, were elimi-
nated. Thus, the participants that remained in the analysis were those with no overt
Linear regressions for the remaining participants follow. There are additional
columns to indicate the number of participants who showed significant internal satia-
tion in that their individual judgments showed an increase over time (Satiaters), and
the number of participants left in the sample (N) after removing participants that
As can be seen, judgments were statistically stable even after removing the obvi-
ously consistent participants because they could have potentially memorized their
satiating participants in each sample, well below the critical threshold of 6 required
for a potentially significant result using the Sign Test from the satiation definition of
Snyder 2000. So it seems clear that memorization is not the cause of the stability
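The exclusion criterion used in this subsection (removing any participant who reported the same judgment 5 or more times at any point in the experiment, whether or not the repetitions were consecutive) can be sketched as follows. The data layout, participant labels, and function names are hypothetical and are only meant to illustrate the rule.

```python
from collections import Counter

def may_have_memorized(judgments, max_repeats=5):
    """True if any single judgment value was reported `max_repeats` or more times.

    Repetitions are counted anywhere in the experiment, consecutive or not.
    """
    return any(count >= max_repeats for count in Counter(judgments).values())

def remaining_participants(data, max_repeats=5):
    """Keep only participants with no overt sign of repeating a remembered value."""
    return {pid: js for pid, js in data.items()
            if not may_have_memorized(js, max_repeats)}

# Hypothetical raw magnitude estimates for one violation, ten exposures each.
data = {
    "p1": [50, 60, 50, 40, 50, 55, 50, 45, 50, 65],  # 50 appears five times
    "p2": [30, 45, 40, 35, 50, 42, 38, 47, 33, 44],  # no value repeats
}
kept = remaining_participants(data)
```

This is a deliberately liberal filter: it also removes participants whose repeated values arose by chance, which makes any stability found in the remaining sample harder to attribute to memorization.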
4.4.3 Magnitude estimation and unbalanced designs
The final cell of our crossed design is the stability of data collected using the
magnitude estimation technique with an unbalanced design. In this case, the unbalanced
design uses the materials from the unbalanced yes/no design in section
4.3. Even though no satiation effect was found for these materials under Snyder’s
definition using the yes/no task, instability was recorded in the form of participants
ity for magnitude estimation and unbalanced designs, this experiment may also bear
Two experiments were conducted using this design. The only difference be-
tween the two was the reference sentence used: the first experiment used the Coor-
dinate Structure Constraint (CSC) violation What did you say Larry bought a shirt
and?, and the second the If Island violation What did you wonder if Larry had
bought?. The logic behind this manipulation is as follows: The CSC-reference was
initially chosen because it is an Island-type violation, but has never been claimed to
over time). However, the CSC violation is also considered a very strong violation.
So while the 8 violations in this design are by hypothesis unacceptable, they are not
necessarily worse than a CSC violation, which means that there was no pattern to the
whether scores were higher or lower than the reference sentence. While it is not clear
whether participants actually track their responses with respect to whether they are
higher or lower than the reference, it is at least conceivable that some do.
To ensure that this did not have an effect on the judgments, a second ex-
identical to Whether Islands, therefore are also in the middle range of acceptabil-
ity. This would ensure that the majority of the violations in the study should be
judged worse than the reference. The drawback to this design is that If Islands may
be expected to satiate given their relation to Whether Islands (one of the satiating
violations in Snyder 2000). A priori, this worry is tempered by the fact that the
Islands do not change over time. More interestingly, if it were the case that the If
Island reference satiated over time, then the experiment should yield several negative
the reference is increasing in acceptability. As we shall see, this was not the case.
Participants
in an unrelated self-paced reading experiment during their visit to the lab. All par-
The design of these two experiments is identical to the yes/no version: 5 blocks
The violation types in each block are repeated below for convenience:
Table 4.17. Violations in unbalanced MagE task
Adjunct Island
Coordinate Structure Constraint
Infinitival Sentential Subject Island
LBC Violation
Relative Clause Island
Sentential Subject Island
CNPC Island
Whether Island
The only manipulation between the two was in choice of reference sentence and inci-
Results
As before, the responses were divided by the reference judgment and log-
transformed prior to analysis. First, repeated measures linear regressions were per-
formed on the means for each experiment. Summary tables of the y-intercept, slope,
and p-value are included below. Significant effects are marked in bold:
Table 4.19. Linear regressions for means, If-reference
Discussion
The effects appear to break down like this. First, there were three significant
effects with the CSC-reference, all of which appeared in the variance (or spread) of the
judgments: Left Branch Condition violations, Relative Clause Islands, and Whether
Islands. However, these three effects were not replicated with the If-reference, as
there were no significant effects of variance. There were two significant effects in
means with the If-reference: the Infinitival Sentential Subject Islands and Whether
Islands. Despite the effects showing up in two different measures (variance versus
mean), this does appear to be a partial replication (with respect to Whether Islands)
between these two experiments. The question, then, is what to make of it.
Table 4.21. Linear regressions for residuals, If-reference
Unfortunately, the answer seems to be that not too much can be made of
it. First, it should be noted that there were 32 statistical analyses conducted in
this analysis with direct comparisons of at least 4 conditions at a time (the means
and variances of each island across the two experiments), and upwards of 16 or 32
comparisons. Given the nature of probabilities, the more analyses one performs, the
more likely a significant result will be. In fact, with a target p-value of .05, one spurious
significant result is expected by chance for every 20 analyses. One of the most conservative corrections
for this problem is the Bonferroni correction. The Bonferroni corrected p-value for
4 comparisons is .0125. The only significant effect that achieves this level is in the
mean of Infinitival Sentential Subject Islands with the If-reference. None of the results
reach significance when corrections are made for larger numbers of comparisons such
as 16 or 32.
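The arithmetic behind the multiple comparisons problem and the Bonferroni correction can be sketched as follows. The familywise error calculation assumes that the analyses are statistically independent, which is an idealization for illustration only; the function names are hypothetical.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under the Bonferroni correction."""
    return alpha / n_comparisons

def familywise_error(alpha, n_tests):
    """Chance of at least one false positive in n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

# Corrected thresholds for the numbers of comparisons considered in the text.
thresholds = {n: bonferroni_alpha(0.05, n) for n in (4, 16, 32)}
```

Under the independence assumption, 20 analyses at .05 give roughly a 64% chance of at least one spurious significant result, and the corrected thresholds for 4, 16, and 32 comparisons are .0125, .003125, and .0015625 respectively.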
As one anonymous reviewer points out, there may be more going on in the
similar to Whether Islands. For instance, it could be the case that this structural
equivalence means that the participants are in fact seeing 55 instances of Whether
Islands in this experiment, and that the extreme number of repetitions is what causes
the significant increase in mean judgment for Whether Islands. Setting aside the
previous argument that the correct p-value for Whether Islands is not significant,
there are other reasons that this argument does not go through. First, under this
conception both the reference If Island and the experimental Whether Islands should
be affected by satiation. If that were true, there should be no effect on the Whether
Islands at all, since both the reference and the experimental items would be increasing in
acceptability together. The fact that there is an effect (for the sake of argument; after
Bonferroni correction there is no effect) indicates that the reference If Islands and the
the reference sentence were indeed increasing in acceptability, we would expect to find
negative satiation, that is, decreases in acceptability, for the other violations (unless,
of course, they were satiating at the same rate as the reference). Since there were no
significant negative effects, it does not seem like the If-reference was satiating at all.
So the answer to whether judgments are stable given the magnitude estimation
task and unbalanced designs is a guarded yes. There was one true instability effect,
but much like the instability found in the yes/no task, it is not overwhelming evidence
4.5 Conclusion
Now we are in a position to fill in all four cells of our crossed design:
Table 4.22. Crossed design of factors Task and Design
              Yes/No      MagE
Balanced      stable      stable
Unbalanced    unstable    stable
What we’ve found is that acceptability judgments are strikingly stable within bal-
anced designs. There is instability with unbalanced designs and yes/no tasks, al-
though it seems that magnitude estimation tasks are more resilient to the effect of
leads to instability, but more instability for yes/no tasks than magnitude estimation
tasks. The replication problem for previous satiation studies now receives a natu-
ral explanation: violation type is not the major factor determining instability, the
interaction of task and design is, perhaps due to a response strategy in which par-
The implications for syntactic theory and linguistic methodology are straight-
theory can no longer hold, but it does not have to. Given that satiation is most likely
violations. But that is a small price to pay for the empirical benefit of data that is
4.6 Some remaining questions
The conclusion that instability can be avoided through balanced designs is not
to say that there are not questions about the judgment task that are ripe for future
research. As a final section, I review two such questions, and propose starting points
Statistical tests aside, one does get the impression that there is a pattern
throughout the experiments presented in this chapter: Whether Islands arise in the
discussion of instability more often than any other violation, and some violations,
for instance That-trace effects, never arise. Even given the analysis of instability
presented in this paper, it still may be the case that only certain violations can be
unstable (and conversely that some are always stable). To be clear, given that the
instability has to be licensed by very specific design factors, this is not saying that
there may be a new classification system. The claim would have to be the weaker
claim that some violations are susceptible to instability (not unstable by definition),
and others are not. This suggests that susceptibility to instability may be a side-
effect of the judgment process itself, and how it interacts with the nature of certain
violations.
For instance, the fact that That-trace effects are never susceptible to instabil-
ity could reduce to the fact that That-trace effects are correctable, as demonstrated
experimentally by Crain and Fodor 1987 in their discussion of the sentence matching
task. Because participants can easily identify the source of the violation, their judg-
ments may become ‘anchored’ in a way that is not possible with structural violations
that cannot be easily corrected, such as Whether Islands. And the fact that Whether
Islands seem susceptible to instability while Sentential Subject Islands do not may
the yes/no threshold would logically be more likely to cross that threshold. Interest-
ingly, Snyder rejects relative acceptability as an explanation for the satiating versus
non-satiating violations based on the fact that (in a scale-based rating study with
10 participants and no error terms reported), the order of relative acceptability from
As laid out, there is no direct relationship between relative acceptability and satiation.
for and That-trace from the paradigm since they are both correctable by the removal
of a single word (for or that), the relative acceptability order corresponds almost
Table 4.24. Relative acceptability versus satiation based on non-correctability
during the debriefing of participants after the Snyder replications reported in section
that some of them could be corrected by “changing a for or that”. Given the results
of Crain and Fodor 1987 and their potential relevance for understanding the complete
picture of judgment instability, there is obviously room for future research into the
can only go so far toward identifying the nature of the instability that gives rise to
satiation effects as defined in Snyder 2000. What little we do know is this: magnitude
estimation studies are resilient to instability such that there is no strong evidence for
changes in mean or variance, that is, there is no evidence for model 1 or model 2.
Unfortunately, it is not clear whether this is because these models do not capture the
type of instability seen in yes/no tasks or because magnitude estimation tasks are
too stable. Compounding this problem is the fact that we do not yet have validated
methodologies for investigating model 3, a change in the yes/no threshold itself, given
was run with an additional task: participants were asked to rate their confidence in
their yes/no response on a 7-point scale following each judgment. The idea was that
there might be a correlation between violations that satiate and violations that lead to
lower confidence in judgments over time. For instance, it is plausible that a changing
could track threshold instability (although there are many other reasons for confidence
to change over time). Unfortunately, as we have seen, there were no satiation effects
correlations. However, despite the lack of satiation in this experiment, there were
that future research on the factors influencing participants’ confidence about their
judgments could be correlated with the factors that we have seen influence stability,
Chapter 5
Linguists have agreed since at least Chomsky 1965 that acceptability judg-
ments are too coarse grained to distinguish between effects of grammatical knowledge
(what Chomsky 1965 would call competence effects) and effects of implementing that
knowledge (or performance effects). With the rise of experimental methodologies for
to acceptability judgments. For instance, Fanselow and Frisch 2004 report that local
ambiguity in German can lead to increases in acceptability, suggesting that the
momentary possibility of two representations can affect acceptability. Sag et al. submitted
report that factors affecting the acceptability of Superiority violations also affect the
olations. This chapter builds on this work by investigating three different types of
and Flores d’Arcais 1989) to determine if they affect the acceptability of the final rep-
resentation. The question is whether every type of processing effect that arises due
tially sensitive to such processing effects. The results suggest that judgment tasks are
indeed differentially sensitive: they are sensitive to some processing effects, but not
others. This differential sensitivity in turn suggests that further research is required
to determine the class of processing effects that affect acceptability in order to refine
termining that relationship will be the first step toward assessing the merits of both
The experiments in this chapter build upon one of the major findings of sentence
processing research: the active filling strategy. The active filling strategy is
defined by Frazier and Flores d’Arcais (1989) as follows: when a filler has been identified,
rank the possibility of assigning it to a gap above all other options. Or, in other
words, the human parser prefers to complete long distance dependencies as quickly
as possible. Because the quickest possible completion site is not always the correct
one, the active filling strategy entails the construction of many temporary, incorrect
representations.
One of the major pieces of evidence for the active filling strategy is the filled-
gap effect. Simply put, the filled-gap effect arises when the parser completes a wh-
dependency with a verb that subsequently turns out to have an object, and thus no
free thematic positions. Because sentence processing in English proceeds from left to
right, potentially transitive verbs are encountered prior to their objects. The active
filling strategy mandates that the parser complete an open wh-dependency at the
first appropriate verb. If after the dependency is completed, the parser encounters an
object of the verb, there is a corresponding slow-down in reading times (at the object
of the verb) due to the competition of the two NPs for the object thematic position
of the verb. Stowe 1986 demonstrated this effect with the following quadruplet:
Stowe found a significant reading time slow-down at the position of the object
wh-filler (26a):
Table 5.1. Reading times (in ms) at critical words for each dependency, from Stowe
1986
                   Ruth    us      Mom
None (if)          661     755     755
WH-Subject         —       801     812
WH-Object          680     —       833
WH-Preposition     689     970     —
The plausibility effect is a second piece of evidence for the active filling strategy.
The plausibility effect is a slow-down in reading times caused when the completed
filling strategy mandates that the dependency be completed as soon as possible,
regardless of the ensuing semantic anomaly. As such, there is a reading time slow-
down after the verb when the semantic anomaly is detected. For instance, Pickering
and Traxler (2003) found a significant slow-down at the verb when the displaced
(27) a. That’s the general1 that the soldier killed enthusiastically for t1 dur-
b. % That’s the country that the soldier killed enthusiastically for t1 during
Table 5.2. Reading times at critical words by filler type, from Pickering and Traxler
2003
               killed enthusiastically
Plausible      1045 ms
Implausible    1157 ms
5.1.2 Rationale
tions have an effect on the judgment of the final representation, it goes without saying
that it must be established that the temporary representations being manipulated are
actually constructed. The filled-gap effect (Crain and Fodor 1985, Stowe 1986) and
the plausibility effect (Garnsey et al. 1989, Tanenhaus et al. 1989) were chosen
because their effects are so well-established that they serve as tools for investigating
reflexes of active filling, the source of the effect for each paradigm is different. The
violation. On the other hand, the slow-down in the plausibility paradigm is due to a
two paradigms are ideal for investigating the effects of different types of temporarily
illicit representations.
to avoid any interfering grammaticality effects, three design elements were incorpo-
rated into experiment 1 to ensure that a failure to detect either the filled-gap or the
plausibility effect was not due to a lack of sensitivity. First, the task chosen was
well suited for detecting differences among acceptable sentences (e.g., Featherston
in this study. Recent work has suggested that reliable results can be obtained from
samples as small as 10 (Myers 2006); therefore, the large sample size for experiment
1 (N=86) should be adequate to detect even very small differences. Finally, a third
condition set was included to determine whether the task and participant pool were
sensitive to distinctions among acceptable sentences. The third condition set was
taken from an ERP study of the distance between a wh-filler and its gap by Phillips
et al. (2005). Phillips et al. manipulated the distance by displacing the wh-filler
(28) a. The detective hoped [that the lieutenant knew [which accomplice the
b. The lieutenant knew [which accomplice the detective hoped [that the
Phillips et al. found a delay in the onset of the P600, a brain response that has been
linked to the association of a wh-filler with its gap, for the Long WH condition, which
they interpret as a reflex of the time it takes to retrieve the stored filler from working
memory (longer distance = longer retrieval time, perhaps because of a decaying rep-
resentation). But crucial to our purposes, Phillips et al. conducted a ratings survey
in which they asked the participants to rate the complexity of the two conditions on
a scale from 1 to 5:
Table 5.3. Mean complexity ratings for short and long movement, from Phillips et al.
2005
Mean Standard Deviation
Short – 1 clause 2.71 0.65
Long – 2 clauses 3.51 0.51
The difference between the two conditions was highly significant (t(23)=5.83,
p<.001), indicating that judgment tasks could indeed detect an effect that leads to
baseline to test the sensitivity of the task and participant pool (although experiment
Participants
of the participants were self-reported native speakers of English. The survey was 36
items long including practice items, and took about 15 minutes to complete.
The design included 3 condition sets: the filled-gap paradigm to test temporary
check the sensitivity of the task and participant pool. For the filled-gap condition set
of this experiment, the WH-Object and WH-Preposition conditions from Stowe 1986
were reconstructed:
Christmas.
Christmas.
In the filled-gap condition (29b), the failure to integrate the displaced wh-filler with
the verb creates a representation in which the dependency is incomplete, which per-
sists until the gap in the prepositional phrase. The materials for the plausibility con-
dition set were taken directly from the published materials of Pickering and Traxler
2003, although an additional adverb was added to each token to increase the duration
of the semantically ungrammatical representation, and the matrix clause was changed
to match the style of the filled-gap conditions (i.e., declarative sentences):
a. John wondered which general the soldier killed effectively and enthu-
b. John wondered which country the soldier killed effectively and enthu-
The implausible condition (30b) creates a dependency at the verb killed that is se-
mantically ungrammatical, which persists through the two adverbs until the gap in
the prepositional phrase. The materials for the wh-distance condition set were taken
a. The detective hoped [that the lieutenant knew [which accomplice the
b. The lieutenant knew [which accomplice the detective hoped [that the
izations of the filled-gap conditions were reconstructed following the examples from
Stowe 1986. 24 lexicalizations of the wh-distance conditions were taken from the ma-
terials of Phillips et al. 2005. Only 12 lexicalizations were available for the plausibility
conditions from Pickering and Traxler 2003. A 24-cell Latin Square was constructed
such that each list contained 2 tokens of each condition. 14 unacceptable fillers (var-
ious syntactic island violations) were added, and each list was pseudo-randomized
such that no more than 2 target conditions were consecutive, and no related conditions
were consecutive. 8 practice items were added, resulting in a 34-item survey.
The instructions were a modified version of the instructions distributed with the We-
bExp software suite (Keller et al. 1998). The reference sentence for both the practice
violation: Mary figured out what her mother wondered whether she was hiding.
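The distribution of lexicalizations across lists can be illustrated with a generic Latin Square rotation. This is only a sketch: the exact layout of the 24-cell square is not given here, so the rotation scheme, the function name, and the integer encoding of items and conditions are all assumptions made for illustration.

```python
def latin_square_lists(n_conditions: int, n_items: int) -> list[list[tuple[int, int]]]:
    """Build n_conditions lists of (item, condition) pairs.

    On list k, item i appears in condition (i + k) % n_conditions, so each
    item occurs once per list and in every condition across the lists.
    """
    return [
        [(item, (item + k) % n_conditions) for item in range(n_items)]
        for k in range(n_conditions)
    ]


if __name__ == "__main__":
    # Hypothetical case: 2 conditions and 4 lexicalizations,
    # which yields 2 tokens of each condition on every list.
    for k, lst in enumerate(latin_square_lists(2, 4)):
        print(f"list {k}: {lst}")
```

With 2 conditions and 4 lexicalizations per condition set, every list produced this way contains 2 tokens of each condition, and no participant sees the same lexicalization twice.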
Results
Results were divided by the reference score and log transformed prior to anal-
                  mean    SD     df    t        p       r
long-distance     .08     .19
short-distance    .20     .17    85    5.324    .001    .50
As the chart indicates, there was a large and highly significant decrease in
acceptability for longer wh-dependencies, and in exactly the same direction as obtained
by Phillips et al. 2005. There was also a large and highly significant decrease
in acceptability for filled-gaps, mirroring the direction of the effect found by Stowe
1986. However, there was no effect of plausibility. Even though there are no direct
statistical comparisons across the groups, it is clear that both of the significant p
values are well under the conservative Bonferroni correction level of .0167.
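The reported effect size and correction level can be checked with a short calculation. This sketch assumes that r was obtained from t and its degrees of freedom with the common conversion r = sqrt(t^2 / (t^2 + df)); that formula is not stated in the text, and the function names are hypothetical.

```python
import math


def effect_size_r(t: float, df: int) -> float:
    """Convert a t statistic and its degrees of freedom to the effect size r."""
    return math.sqrt(t * t / (t * t + df))


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    # Values reported for the wh-distance comparison: t(85) = 5.324.
    print(round(effect_size_r(5.324, 85), 2))
    # Three condition sets, hence three comparisons at alpha = .05.
    print(round(bonferroni_threshold(0.05, 3), 4))
```

Under that assumption the wh-distance values give an r that rounds to .50, and three comparisons give the correction level of .0167 cited above.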
Discussion
clear from the wh-distance effect that the design is capable of detecting differences
that lead to processing effects. However, when it comes to the two active filling ef-
fects, a significant effect was only found for the filled-gap effect. Also, given the large
sample size, it seems unlikely that increasing the sample size will lead to an effect of
plausibility. At first glance, this seems to suggest that temporary syntactic ungram-
inition, the filled-gap condition of the filled-gap paradigm involves abandoning one
the association between the wh-filler and the thematically saturated verb fails, the
parser must reanalyze the structure such that the wh-filler is then associated with the
preposition. In other words, the parser attempts to ‘drop’ the filler twice. However,
the true gap condition of the paradigm involves no such reanalysis because the first
association with the verb succeeds. It could be the case then that the difference in
This would also account for the lack of effect in the plausibility conditions: in both
conditions, the wh-filler is initially associated with the verb and later reanalyzed as
ity, one would expect an effect in the filled-gap paradigm but not in the plausibility
paradigm. Experiment 2 was designed to tease apart these two hypotheses (asymmetry
5.2 The reanalysis confound
the true gap condition of the filled-gap paradigm, thus making it completely parallel
to the plausibility paradigm in that both conditions will contain reanalysis. If the
asymmetry in the presence of reanalysis across the two paradigms was the source of the
asymmetry in the results for experiment 1, then eliminating the reanalysis asymmetry
should eliminate the asymmetry in the results such that both paradigms return no
the true-gap condition from experiment 1 that lacks reanalysis and comparing it to the
Participants
were self-reported native speakers of English without any formal training in linguistics.
The materials for experiment 2 were adapted from the materials for the plausi-
bility conditions in experiment 1, which were themselves adapted from the published
materials of Pickering and Traxler 2003. These materials were chosen for two reasons:
(i) the plausibility materials already contained the necessary structure to include re-
analysis in both the filled-gap and true-gap conditions; and (ii) if a filled-gap effect is
indeed found using these materials, it would serve to exclude the possibility that the
lack of effect for plausibility in experiment 1 was due to the meanings of the materials.
Three conditions were used to test whether the source of the asymmetry from exper-
iment 1 was the reanalysis asymmetry. First, a filled-gap condition was constructed
out of the materials from Pickering and Traxler 2003. Next a true-gap condition
with no gap in the prepositional phrase to serve as both a replication of the filled-gap
John wondered which general the soldier killed the enemy effectively
John wondered which general the soldier killed effectively and enthu-
Gap (G)
John wondered which general the soldier killed effectively and enthu-
Again, the competing hypotheses make different predictions: if reanalysis is
the source of the asymmetry, then experiment 2 should yield no effect between FG+R
and G+R because both conditions involve reanalysis, and a significant effect between
the nature of the representation constructed, then there should again be an effect
between FG+R and G and also an effect between FG+R and G+R. This hypothesis
makes no prediction about G+R and G, but that comparison would indicate whether
reanalysis has any effect at all. 8 lexicalizations of each triplet were constructed and
distributed using a Latin Square design. Each list contained 1 token of each condi-
hypothesis, 4 of these fillers were considered acceptable, while 6 were considered
items were included for a total of 21 items. The task was magnitude estimation, and
the instructions were identical to those of experiment 1. The reference sentence was
also identical.
Results
As before, results were divided by the reference score and log-transformed prior
to analysis:
                    mean    SD
filled-gap          -.02    .22
gap + reanalysis    .09     .22
gap only            .11     .20
There was a large and significant effect of FG+R versus G+R (t(20)=2.8,
p=.005, r=.53), and, as expected, of FG+R versus G (t(20)=2.8, p=.005, r=.53).
There was no effect of G+R versus G (t(20)=0.32, p=.37). And although all of the
p values were one-tailed, it should be noted that both of the significant p values were
well below the Bonferroni corrected level of .017, even at their two-tailed value of
p=.01.
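The analysis pipeline described in these Results sections (division by the reference score, log-transformation, then a paired comparison) can be sketched as follows. The data below are invented for illustration and are not from the study; base-10 logarithms are an assumption, since the base is not stated, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def normalize(response: float, reference: float) -> float:
    """Divide a magnitude estimate by the reference score, then log-transform.

    Base 10 is an assumption; the text does not state the base.
    """
    return math.log10(response / reference)


def paired_t(xs: list[float], ys: list[float]) -> tuple[float, int]:
    """Return the paired t statistic and its degrees of freedom.

    Raises ZeroDivisionError if all pairwise differences are identical.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Hypothetical per-participant condition means (not data from the study).
    reference = 100.0
    gap = [normalize(r, reference) for r in (130.0, 120.0, 150.0, 110.0, 140.0)]
    filled = [normalize(r, reference) for r in (100.0, 90.0, 120.0, 105.0, 95.0)]
    t, df = paired_t(gap, filled)
    print(f"t({df}) = {t:.2f}")
```

A score of 0 after normalization means the item was rated equal to the reference sentence; positive scores are more acceptable than the reference and negative scores less acceptable.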
Discussion
By introducing a second gap within the prepositional phrase of the gap condi-
tion, experiment 2 was able to eliminate the asymmetry of reanalysis from the design
of the filled-gap paradigm, and thus tease apart the two possible explanations of the
asymmetry in the results of experiment 1. The persistence of the effect despite the in-
troduction of reanalysis into both conditions confirms that there is something peculiar
to the filled-gap effect that affects the judgment of the final representation. Further-
more, the lack of effect between the two gap conditions suggests that reanalysis has
lasting cost associated with abandoning one well-formed representation for another.1
1
Because there was no comprehension task included in experiment 2, it is possible that the lack
of effect of reanalysis actually represents a lack of reanalysis, in that the participants might not
notice the gap position in the string for during. Of course, if it were the case that for during was
not an appropriate cue for a gap, then it would be unclear why there was no effect of plausibility in
experiment 1, as without reanalysis the implausible condition is actually unacceptable, and should
5.3 The differential sensitivity of acceptability to processing effects
At an empirical level, the results from the experiments in this chapter reveal
ability, suggesting that syntactic difficulties are treated by the judgment process in
indicate that judgment tasks are tapping directly into syntactic knowledge in a very
real sense. At a methodological level, these results demonstrate the sensitivity of for-
mal judgment experiments: the ability to detect significant differences between two
phenomena that are typically the domain of sentence processing studies. And at a
theoretical level, these results indicate that some, but not all, processing effects affect
acceptability facts by first determining whether the processing effects in question af-
fect acceptability at all, and then whether the acceptability of theoretically related
Chapter 6
One of the most salient properties of human language is the presence of non-
local dependencies. One of the major goals of syntactic theory over the past 40 years
has been to classify the properties of these dependencies, and ultimately attempt to
explain them with the smallest number of dependency constructing operations. Yet
surprising that there are a number of different proposals in the literature to capture
long distance AGREE, et cetera. What is surprising is that in many ways the field of
syntax has decided that acceptability judgments can provide little additional insight
into the nature of these dependencies. One of the major factors contributing to this
arguments) is not constrained by Islands. The Island facts thus form the basis from
which all analyses must begin. The data that constitutes evidence for or against
these analyses usually comes from either i) non-wh dependencies that also use the
postulated operation, or ii) the nature of the possible answers to the different types
While previous chapters in this dissertation focused on the relationship be-
of the nature of grammatical knowledge, this chapter demonstrates a more direct re-
The claim in this chapter is straightforward: there are new types of acceptability
the relationship between these new acceptability effects and grammatical knowledge
can have significant consequences for the set of possible dependency forming opera-
consensus that there must be a dependency between these two positions, but there
is significant debate over the nature of that dependency, and in particular, over the
dependency forming operation(s) that create it. Various proposals have been made
in the literature such as covert wh-movement (Huang 1982), null operator movement
and unselective binding (Tsai 1994), choice-function application and existential clo-
sure (Reinhart 1997), overt movement and pronunciation of the lower copy (Bošković
Section 1 provides the first discussion of new data, focusing on Huang’s (1982)
claim that there are no Island effects with wh-in-situ in English. A series of exper-
iments are presented that demonstrate the existence of Subject Island effects with
wh-in-situ, but no other Island effects. Section 2 is the second data section, pre-
senting evidence that the distance between a wh-in-situ and the higher wh-phrase
affects acceptability, while the distance of similarly complex long distance dependen-
these new data points on the various proposals for wh-in-situ dependencies in En-
glish. The general conclusion is that these facts suggest a movement-based account
such as overt movement with lower copy pronunciation, covert movement, or null
operator movement, and raise serious difficulties for non-movement approaches such
Working under the assumption that the dependency between an in-situ wh-
word and the matrix [+wh] C is formed through covert movement, Huang 1982 argues
that there is a direct parallelism between wh-in-situ languages like Chinese and the
wh-in-situ that occurs in multiple wh-questions in English: both undergo covert move-
ment, but (in the case of wh-arguments) neither show Island effects. This can be seen
The first example is just a standard Whether Island effect with overt wh-movement.
In the second example, the in-situ wh-word has matrix scope (because it must be
ensuing Whether Island effect. Huang argues that this is also the case for CNPC
Islands:
(34) a. * What did you make the claim that John bought?
Huang used these facts together with the lack of Island effects in wh-in-situ (of wh-
cency.
properties of movement, it is easy to see how the lack of Island effects with wh-in-situ
Therefore, the first step of this study was to evaluate Huang’s claim that there are no
Island effects with wh-in-situ in English with the major Island types. As will become
clear momentarily, while Huang’s claim is mostly correct, the comparisons such as the
ones above actually obscure potential evidence for covert movement Island effects.
Participants
6 Island types were tested following the paired contrasts for CNPC and Whether
Islands published in Huang 1982. The (a) condition in each pair is an Island violation
with overt wh-movement. The (b) condition in each pair is a multiple wh-question in
which the in-situ wh-word covertly moves out of an Island structure.
a. Who did you claim the bully teased his brother and ?
12 lexicalizations were created for each condition. Lexicalizations were controlled
for length in number of words within each condition set. These 12 lexicalizations
experiment, and distributed using a Latin Square design for a total of 12 lists. The
12 lists were pseudorandomized such that related conditions were never consecutive.
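One simple way to obtain an order that satisfies such a constraint is rejection sampling: shuffle repeatedly until no two related items are adjacent. The procedure actually used is not described, so the sketch below is only an assumption; the encoding of items as (Island type, condition) pairs and the definition of "related" as sharing an Island type are likewise hypothetical.

```python
import random


def pseudorandomize(items, related, rng, max_tries=10_000):
    """Shuffle items until no two adjacent items are related.

    items:   list of (island_type, condition) tuples (hypothetical encoding)
    related: function deciding whether two items count as related
    rng:     a random.Random instance, so that orders are reproducible
    """
    order = list(items)
    for _ in range(max_tries):
        rng.shuffle(order)
        if all(not related(a, b) for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid order found; constraints may be too strict")


def same_island(a, b):
    """Treat two items as related when they share an Island type (an assumption)."""
    return a[0] == b[0]


if __name__ == "__main__":
    islands = ["Adjunct", "CNPC", "CSC", "Relative Clause", "Subject", "Whether"]
    items = [(island, cond) for island in islands for cond in ("overt", "in-situ")]
    for item in pseudorandomize(items, same_island, random.Random(0)):
        print(item)
```

Rejection sampling is adequate for short lists like these; with many items and tighter constraints, a constructive method would be needed to avoid long searches.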
The task was magnitude estimation. The instructions were a modified version
of the instructions published with the WebExp software suite (Keller et al. 1998).
The reference sentence was: What did Mary wonder whether her mother was buying
Results
Responses were divided by the value of the reference sentence and log-transformed
prior to graphing and analysis. Paired t-tests were performed on each Island type
(one-tailed). The results are summarized in the following table and chart, in which it
is clear that covert movement out of an Island is significantly more acceptable than
overt movement out of an Island for every Island type except Subject Islands. There
is no significant difference between overt and covert movement out of Subject Islands:
Figure 6.1. Overt versus covert Island effects following Huang 1982
Discussion
Unsurprisingly, the results of this experiment confirm the claims from Huang
1982 for CNPC and Whether Islands, and extend to Adjunct, CSC, and Relative
Clause Islands. However, there is no difference between overt and covert movement
for Subject Islands, suggesting that Subject Islands are different than the other Island
types. Unfortunately, there is no way to directly interpret the lack of effect with
Subject Islands: it could be that there is a covert movement Island effect, or that
there is no overt movement Island effect, or even that overt Subject Islands are weak
lead to similar acceptability. In fact, because Island effects were not defined across
conditions (the Island structures were tested without non-Island controls) we cannot
directly interpret the effects that were found for the other Island types: the strongest
claim that can be made is that the overt movement Island effect is stronger than
the covert movement Island effect; we cannot actually claim that there are no covert
movement Island effects. To get around these confounds, a second experiment was
run using the designs in which Island effects are defined as interactions of two factors:
ity between the two structures themselves without any wh-in-situ. In other words,
there are two factors, structure and wh-in-situ, each with two levels. So for each
i. Complement
i. Who1 t1 suspects [CP that you left the keys in the car?]
ii. Who1 t1 suspects [CP that you left what in the car?]
ii. Adjunct
i. Who1 t1 worries [ADJ that you leave the keys in the car?]
ii. Who1 t1 worries [ADJ that you leave what in the car?]
(42) CNPC Island
i. CP Complement
ii. NP Complement
i. Who1 t1 denied [NP the fact that you could afford the house?]
ii. Who1 t1 denied [NP the fact that you could afford what?]
i. Simple NPs
ii. * Who1 t1 thinks [the speech by who] interrupted [the TV show about
whales]?
i. CP Complement
ii. Whether Complement
i. Non-specific
ii. Specific
ii. Who1 t1 thinks that you read [John’s book about what]?
i. CP Complement
ii. NP Complement
Specificity Islands were included in this follow-up study for comparison to Subject
Participants
In order to keep the number of items per survey manageable, the Island
types were split among two experiments: experiment 1 tested Adjunct, CNPC, Sub-
ject, and Whether Islands, while experiment 2 tested Specificity and Relative Clause
Design
using a Latin Square design. Conditions from an unrelated experiment were added
ment 2. Three orders for each list were created by pseudorandomizing the conditions
such that no two related conditions were consecutive, for a total of 24 surveys for
each experiment. The task for both experiments was magnitude estimation, and the
reference sentence was What did you ask if your mother bought for your father? The
instructions were a modified version of the instructions published with the WebExp
Results
Responses were divided by the score of the reference sentence and log-transformed
prior to analysis. The means and standard deviations for each level of each factor for
Table 6.2. Wh-in-situ: descriptive results
While there were various significant main effects of structure and wh-in-
situ, the focus of this experiment was on the interaction of the two, as this indicates
an Island effect with wh-in-situ. As the ANOVA table indicates, the only Island type
effect.
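Defining an Island effect as an interaction amounts to a difference of differences over the four cell means of the 2x2 design. The sketch below computes only that contrast, on invented means; it does not perform the ANOVA significance test, and the factor level labels and function name are hypothetical.

```python
def interaction_contrast(cell_means: dict[tuple[str, str], float]) -> float:
    """Difference-in-differences for a 2x2 design.

    cell_means maps (structure, wh) to a mean rating, with structure in
    {"non-island", "island"} and wh in {"no-wh", "wh-in-situ"}.
    A value near 0 means the two factors combine additively (no Island
    effect); a negative value means wh-in-situ is worse inside the Island
    structure than the two main effects alone predict.
    """
    effect_in_island = (cell_means[("island", "wh-in-situ")]
                        - cell_means[("island", "no-wh")])
    effect_in_control = (cell_means[("non-island", "wh-in-situ")]
                         - cell_means[("non-island", "no-wh")])
    return effect_in_island - effect_in_control


if __name__ == "__main__":
    # Hypothetical log-transformed means (not data from the study).
    superadditive = {
        ("non-island", "no-wh"): 0.30, ("non-island", "wh-in-situ"): 0.20,
        ("island", "no-wh"): 0.20, ("island", "wh-in-situ"): -0.10,
    }
    print(round(interaction_contrast(superadditive), 2))
```

The advantage of this definition is that any cost of the Island structure itself and any cost of wh-in-situ itself are subtracted out, so only the superadditive component counts as an Island effect.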
Discussion
shortcomings of the first experiment, and determine whether there is indeed a wh-in-
situ Island effect for several Island types. The results suggest that there is no wh-in-
situ Island effect in English for Adjunct, CNPC, Whether, Specificity, and Relative
Clause Islands. However, there is a wh-in-situ Island effect for Subject Islands. This
result clarifies the lack of significant difference between overt movement and wh-in-situ
with Subject Islands in the first experiment: there is an Island effect for both types of
The design for this experiment is very similar to the first experiment testing
of overt movement and covert movement out of the following Islands: Adjunct, CSC,
CNPC, Relative Clause, Subject and Whether Islands. 12 tokens of each condition
were created and distributed among 6 lists (2 per list). 4 orders of each list were
created for 24 lists. 20 acceptable fillers were added to the lists to better approximate
a 1:1 ratio of acceptable to unacceptable items, for a total of 44 items. The task was
Results and Discussion
Since each participant judged 2 tokens of each condition, there were three pos-
sible response patterns, two unambiguous and one ambiguous: both tokens judged
yes, both no, or one of each judgment. The total number of each type of unambigu-
ous judgment was summed across participants and compared using a Sign Test to
As the table indicates, all of the overt movement Island types were signifi-
cantly judged as categorically unacceptable except for Whether Islands, which were
marginally significant. Wh-in-situ in Islands, on the other hand, were less straightfor-
ward. The only clear cases were CSC and Subject Islands, which were judged as cate-
gorically unacceptable. None of the other Island types reached significance, although
that increasing the sample size would lead to a significant number of unacceptable
responses.
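The tallying of unambiguous response patterns and the Sign Test described above can be sketched as follows. This is a minimal illustration in Python, assuming an exact two-sided binomial form of the test with a null probability of 0.5; the function names and the example responses are hypothetical and are not the data from this experiment.

```python
from math import comb


def tally(pairs):
    """Count unambiguous patterns from (token1, token2) yes/no judgments.

    True stands for a "yes" judgment. Mixed pairs are ambiguous and dropped.
    """
    yes = sum(1 for a, b in pairs if a and b)
    no = sum(1 for a, b in pairs if not a and not b)
    return yes, no


def sign_test(n_yes, n_no):
    """Exact two-sided sign test on the two unambiguous counts."""
    n = n_yes + n_no
    k = min(n_yes, n_no)
    # P(X <= k) under Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical responses from four participants for one condition
responses = [(True, True), (False, False), (True, False), (False, False)]
n_yes, n_no = tally(responses)
print(n_yes, n_no, sign_test(n_yes, n_no))
```

Because ambiguous pairs are excluded, the effective sample size for each condition is the number of participants who gave the same judgment to both tokens.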
to a categorical judgment of unacceptable. This in turn strongly suggests that wh-in-
situ Subject Islands are ungrammatical. However, this result is a double-edged sword:
if the other wh-in-situ Island types are also categorically unacceptable, as suggested
effects in the previous studies other than the Subject Island. It may be the case that
bi-clausal (all of the Island structures involve two clauses) multiple wh-questions are
The results of the three studies presented in this section can be summarized
2. Wh-in-situ within a Subject Island is not significantly different than overt wh-
4. Wh-in-situ within a Subject Island is significantly less acceptable than wh-in-
Or, in other words, there are Subject Island effects with wh-in-situ in English that
are nearly identical to overt wh-movement Subject Islands. These results have con-
sequences for syntactic theory on at least two levels. First, at the level of individual
analyses, these results suggest that Subject Islands are unique among the Islands
tested, at least with respect to wh-in-situ. This raises obvious problems for analyses
in which the underlying cause of Subject Island effects is the same as other Island
types, such as the Subjacency approach to Islands (Chomsky 1973, 1986), or the CED
(Condition on Extraction Domains) approach of Huang 1982 (see Stepanov 2007 for
other arguments against the CED approach). At the level of analysis types, such
the empirical landscape. These analyses must be modified to allow the possibility
of wh-in-situ Island effects, but restrict this to Subject Island effects. This entails a
overt wh-movement and wh-in-situ in that both exhibit at least one Island effect.
in-situ, either through covert movement or through overt movement with Spell-out
of the lower copy. In fact, a movement-based approach also offers the possibility of
accounting for the unique nature of Subject Islands through freezing-style approaches
to Subject Islands (Wexler and Culicover 1981): if subjects must move from a VP
that phrase to become an Island to further movement, then the Island status of
subjects for wh-in-situ would follow directly from movement. As Cedric Boeckx points
out (p.c.), this predicts that wh-in-situ Island effects should not occur if the subject has not
moved, as is possible in some Romance languages (e.g., Spanish in Gallego 2007). The
the presence of Subject Island effects. Those results suggest that wh-in-situ may be
more like overt wh-movement than previously thought. This section reports a second
between the distance of movement (in clauses) and acceptability - and whether wh-
6.2.1 Distance and wh-dependencies
length on working memory, Phillips et al. 2005 report the results of an offline rating
study in which participants were asked to rate the complexity of sentences along a
5-point scale. They found that manipulating the length of an overt wh-movement
dependency affected complexity ratings such that longer wh-dependencies were rated
more complex:
(47) The detective hoped [that the lieutenant knew [which accomplice the shrewd
(48) The lieutenant knew [which accomplice the detective hoped [that the shrewd
Table 6.5. Mean complexity ratings for short and long movement, from Phillips et al. 2005

                    Mean    SD
Short – 1 clause    2.71    0.65
Long – 2 clauses    3.51    0.51
The first question is whether this effect arises in acceptability tasks as well, and if so,
whether wh-in-situ, with no visible movement, is also affected by the distance of the
and acceptability. Given the likelihood that complexity influences acceptability judgments and vice
versa, it is difficult to interpret the source of either a complexity effect or an acceptability effect.
(49) Overt wh-movement distance
a. Who hoped that you knew who the mayor would honor ?
The second set crucially manipulated the distance of the covert movement depen-
dency:
the size of the embedded question (1-clause covert movement involves a 1-clause em-
bedded question, 2-clause covert movement has 2-clause embedded question), a third
minimal pair is necessary to tease apart the contribution of covert movement distance
and embedded question size. In these two conditions, the distance of the wh-in-situ
dependency is always the entire length of the question (answers to these questions
are pair lists involving both the matrix wh-word and the in-situ wh-word, each of
which is marked in bold), but the size of the embedded question is manipulated (the
a. Who hoped that you knew whether the mayor would honor who?
b. Who knew whether you hoped that the mayor would honor who?
If any effect found for (50) is due to the size of the embedded question, we would
expect the same effect for these conditions because the size of the embedded question
is manipulated in these conditions; if the effect is due to the distance of the wh-in-
situ dependency then we would not expect an effect for these conditions because the
Participants
were self-reported monolingual, native speakers of English. Participants were paid for
their participation.
Design
among 8 lists using a Latin Square design. 4 pseudorandomized orders of each list
were created such that related conditions were never consecutive. The task was
magnitude estimation. The instructions were based upon the published instructions
in the WebExp software suite (Keller et al. 1998), and the reference sentence was
What did you ask if your mother bought for your father?
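The list construction described in this design can be sketched as follows: a Latin Square rotation assigns each item to a different condition in each list, and each list is then shuffled until no two related conditions are adjacent. This is a minimal Python illustration under stated assumptions, not the procedure actually used to build the surveys. The condition labels, the item count, the notion of relatedness (sharing a condition set), and the rejection-sampling shuffle are all hypothetical.

```python
import random


def latin_square_lists(n_items, conditions):
    """Rotate items through conditions so each list sees each item once."""
    k = len(conditions)
    return [
        [(item, conditions[(item + lst) % k]) for item in range(n_items)]
        for lst in range(k)
    ]


def same_set(a, b):
    """Hypothetical relatedness: labels share the prefix before the hyphen."""
    return a[1].split("-")[0] == b[1].split("-")[0]


def pseudorandomize(trials, related, rng, max_tries=10_000):
    """Reshuffle until no two consecutive trials are related."""
    order = list(trials)
    for _ in range(max_tries):
        rng.shuffle(order)
        if all(not related(a, b) for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid order found within max_tries")


# Hypothetical labels: three condition sets, two levels each
conds = ["overt-short", "overt-long", "covert-short",
         "covert-long", "size-short", "size-long"]
lists = latin_square_lists(12, conds)
orders = [pseudorandomize(lists[0], same_set, random.Random(seed))
          for seed in range(4)]  # several orders of the first list
```

Rejection sampling is the simplest way to impose the adjacency constraint; it only terminates quickly when each list contains enough unrelated trials (or fillers) to separate the related ones.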
Responses were divided by the score of the reference sentence and log-transformed
prior to analysis. Paired t-tests were performed on each condition set (all p-values
Figure 6.2. Effects for wh-movement distance
There were large significant effects for distance with overt movement and wh-in-situ,
with shorter dependencies being more acceptable than longer dependencies, but no
effect of embedded question size. This suggests that the distance effects with wh-in-
situ are indeed due to the dependency length and not due to the size of the embedded
question. These results suggest yet another parallelism between overt wh-movement
and wh-in-situ dependencies: the effect of distance. However, as the next subsection
will discuss, there are potential analyses of the wh-in-situ distance effect that do not
involve movement.
situ length effect does turn out to be due to (covert) movement, the lack of effect of
embedded question size would become evidence that a large scale pied-piping analysis
along the lines of Nishigauchi 1990 and Dayal 1996 is incorrect for English Whether
Islands. Under a large scale pied-piping analysis, the lack of Whether Island effects
with wh-in-situ follows from covert movement of the entire Whether Island to the
matrix C: because the in-situ wh-word does not move out of the Island, there is
no Island effect. An interesting side effect of this analysis is that the larger the
constituent being moved, the shorter the movement distance. In this case, large scale
pied-piping of the entire embedded question would mean that the larger embedded
question would only covertly move one clause, whereas the smaller embedded question
must move two clauses. If covert movement leads to distance effects, the pied-piping
analysis would predict that larger embedded questions should be more acceptable than
smaller embedded questions because smaller embedded questions must move farther.2
However, there was no effect of the embedded question size in this experiment.
2
It is also possible that movement of larger constituents leads to a decrease in acceptability that
neutralizes the benefit larger constituents gain from moving shorter distances. Such a counteranalysis
would require a detailed investigation of the effect of constituent size on the effect of acceptability
6.2.2 Distance and Binding dependencies
fect with wh-in-situ is compelling evidence for a movement approach. One possibility
is that the successively cyclic nature of movement, which entails one instance of the
movement operation for each CP crossed, has a direct effect on acceptability. Such an
analysis would mean that either covert movement is successive cyclic contrary to ac-
cepted wisdom (see especially Epstein 1992), or that all movement is overt movement,
with the option of pronouncing the lower copy (e.g., Bošković 2002).
There are, of course, other plausible explanations for the distance effect that
have nothing at all to do with the syntactic operations involved. For instance, the
Phillips et al. (2005) study suggests that the farther the displaced wh-word is from
the gap position, the harder it is to process, perhaps because the representation of the
wh-word in working memory decays over time. This working memory cost for long
distance dependencies could underlie the distance effect for overt wh-movement. For
upon the matrix wh-word; therefore, a representation of the first wh-word must be
maintained in memory in order to interpret the second wh-word. This is very similar
An analysis such as the one above in which the interpretation of two function-
ally related items leads to a processing cost on acceptability that is distance dependent
makes a very specific prediction: distance effects should arise with other dependencies
that lead to this functional interpretation, not just wh-dependencies. As such, three
ii. Who knew who hoped that the police found their wallet?
ii. Who knew if everyone hoped that the police found their wallet?
ii. Who knew if John hoped that the police found his wallet?
Participants
were self-reported native speakers of English. All volunteered their time for this study.
3
Possessive pronouns were chosen to avoid Principle B violations in the short distance conditions
with standard pronouns, or a Principle A violation in the long distance conditions with reflexives.
Design
items to balance acceptability, then distributed among 8 lists using a Latin Square
design. 3 orders of each list were created by pseudorandomizing such that no two
The task was magnitude estimation. The instructions were a modified version
of the instructions distributed with the WebExp software suite (Keller et al. 1998).
The reference sentence was: What did you ask if your mother bought for your father?
Responses were divided by the score of the reference sentence and log-transformed
prior to analysis. Paired t-tests were performed on each pair of conditions; all p-values
are two-tailed:
Table 6.7. Results for binding distance t-tests

                  Short          Long
                  Mean   SD      Mean   SD      df    t        p
BV, wh-word       0.13   0.41    0.02   0.30    21    1.386    .180
BV, quantifier    0.16   0.36    0.09   0.25    21    0.899    .379
R-expression      0.16   0.30    0.11   0.16    21    0.899    .379
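The normalization and paired comparison behind a table of this kind can be sketched as follows. This is a minimal Python illustration, not the analysis code used in this study: the function names and the example scores are hypothetical, and the natural logarithm is an assumption, since the base of the log transformation is not stated. Only the t statistic and its degrees of freedom are computed; a two-tailed p-value would additionally require the t distribution (for instance via `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev


def normalize(response, reference_score):
    """Divide a magnitude estimate by the reference score, then take the log.

    Natural log is an assumption; the base is not given in the text.
    """
    return math.log(response / reference_score)


def paired_t(short, long):
    """Paired t statistic and degrees of freedom for per-participant means."""
    diffs = [s - l for s, l in zip(short, long)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical raw estimates from four participants (reference score = 100)
short_raw = [150, 120, 200, 110]
long_raw = [100, 110, 120, 90]
short = [normalize(r, 100) for r in short_raw]
long = [normalize(r, 100) for r in long_raw]
print(paired_t(short, long))
```

With 22 participants, as implied by the 21 degrees of freedom in the table, each list passed to `paired_t` would contain 22 per-participant condition means.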
There were no significant effects of distance for any of the binding dependencies
with longer distance) for all three dependency types, which raises the question of
whether a larger sample size would lead to significant effects. A power analysis
suggests that if the means and standard deviations are accurate, sample sizes of 53,
106, and 144 respectively would be needed for the results to become significant. Given
that the wh-distance results appeared with a sample size of 26, the relatively weak
effect of distance for binding dependencies suggests that even if significant effects
The lack of distance effects with binding dependencies suggests that the distance
effects with wh-in-situ dependencies are not due to the interpretive relationship
between the two wh-words. While acceptability experiments cannot rule out a pro-
cessing explanation, these results do indicate that a processing explanation must take
into account the differential effects between wh-in-situ dependencies and binding de-
pendencies; a broad stroke analysis based on working memory cannot capture this
distinction.
While these results once again suggest a strong similarity between overt move-
ment and wh-in-situ dependencies, they also have interesting consequences for recent
attempts to unify binding and wh-dependencies (e.g., Hornstein 2000). While it is still
possible that binding and wh-dependencies are built by the same structure building
operation (perhaps movement), such unificational theories face the same problem as
processing-based accounts of the distance effects: they must distinguish between the
There are many possibilities (perhaps binding dependencies are not successive cyclic,
This chapter began with the general consensus that acceptability judgments
had little to contribute to the debate between competing analyses of wh-in-situ in English.
The studies presented in this chapter have uncovered new acceptability effects
Or, in other words, wh-in-situ shows two characteristic properties of movement that
were previously unnoticed: distance effects and at least one island effect. These results
they argue against one of the movement approaches: the large scale pied-piping ap-
proach of Nishigauchi 1990 and Dayal 1996. These results also raise many interesting
questions for future research, although the exact nature of the questions depends on
Under a dual cycle syntax model in which covert movement is possible, the
particular, the Subject Island effects with covert movement could receive a straightfor-
ward account under a freezing-style analysis (Wexler and Culicover 1981), although
not under a linearization and freezing analysis such as the one in Uriagereka 1999
(though the other Island effects could still be analyzed as deriving from linearization
as they do not constrain covert movement). Such an analysis also raises the possibil-
ity that the distance effect with covert movement is due to successive cyclicity, which
runs counter to the general consensus in the field that covert movement is not succes-
sive cyclic (see especially Epstein 1992). Furthermore, this would raise the question
As Norbert Hornstein (p.c.) points out, a single cycle syntax model avoids the
altogether. Under such a model, the distance effects with overt wh-movement and
wh-in-situ derive from the same source: overt movement. The difference between
the two dependencies rests solely in which copy, higher or lower, is pronounced (e.g.,
Bošković 2002). Such a model can also adopt the freezing-style analysis for Subject
Islands, but must also eschew the linearization analysis for Subject Islands.
These facts are also compatible with an unselective binding approach with null
operator movement, such as the one in Tsai 1994. Under this model, wh-in-situ does
involve movement, but it is (overt) movement of a null operator, not of the wh-word.
The distance facts fall out naturally from the movement of the null operator. The
the null operator must move out of the Island via something like short movement
from Hagstrom 1998: because subjects have already moved, the null operator cannot
The facts in this chapter raise the most difficult problems for AGREE based
(Chomsky 2000, 2001, 2005) and choice-function based analyses (Reinhart 1997).
Both types of analyses define Island effects as reflexes of movement; therefore, they
must be weakened to account for the Subject Island effects with wh-in-situ. Further-
more, while both analyses involve a long distance dependency (long distance AGREE
of distance effects with binding dependencies suggests that distance effects may be
Chapter 7
Conclusion
This dissertation has argued that the tools of experimental syntax can be used
and the nature of grammatical knowledge. To that end, several existing claims about
2. While context may have an effect on some acceptability judgments, it is likely
3. While there still may be different causes underlying various violations, satiation
4. While it goes without saying that processing effects affect acceptability judg-
ments, it is not the case that all processing effects have an effect. This dif-
5. While the value of non-acceptability data such as possible answers are undoubt-
were presented that may have important consequences for wh-in-situ theories.
Like many studies, the work presented in this dissertation raises far more questions
than it answers. However, it is clear from these results that there is a good deal of
potential for experimental syntax to provide more than a simple fact-checking service
for theoretical syntax: the tools of experimental syntax are in a unique position to
knowledge, and ultimately, refine our theories of the nature of grammatical knowledge
itself.
Bibliography
Bard, Ellen Gurman, Dan Robertson, and Antonella Sorace. 1996. Magnitude esti-
Bresnan, Joan. 2007. A few lessons from typology. Linguistic Typology 11.
M.I.T. Press.
ris Halle, ed. Stephen Anderson and Paul Kiparsky, 232–286. Holt, Rinehart and
Winston.
Chomsky, Noam. 2000. Minimalist inquiries: The framework. In Step by step: Essays
on minimalist syntax in honor of Howard Lasnik, ed. Roger Martin, David
Chomsky, Noam. 2001. Derivation by phase. In Ken Hale: A life in linguistics, ed.
Chomsky, Noam. 2005. Three factors in language design. Linguistic Inquiry 36:1–22.
Cohen, Jacob. 1973. Eta-squared and partial eta-squared in fixed factor ANOVA
Conover, W. J., and R. L. Iman. 1981. Rank transformations as a bridge between
Crain, Stephen, and Janet Fodor. 1987. Sentence matching and overgeneration. Cog-
nition 26:123–169.
in [h]indi. Kluwer.
Dayal, Vaneeta. 2006. Multiple [w]h-questions. In The syntax companion, ed. M. Ev-
Edelman, Shimon, and Morten Christiansen. 2003. How seriously should we take
Erteschik-Shir, Nomi. 2006. What’s what? In Gradience in grammar, ed. Caroline
Féry, Gisbert Fanselow, Matthias Schlesewsky, and Ralf Vogel. Oxford University
Press.
Featherston, Sam. 2005a. Magnitude estimation and what it can do for your syntax:
Frazier, L., and G. Flores d’Arcais. 1989. Filler driven parsing: A study of gap filling
Frazier, Lyn, and Charles Clifton, Jr. 2002. Processing “d-linked” phrases. Journal
Gallego, Angel. 2007. Phase theory and parametric variation. Doctoral Dissertation,
Goodall, Grant. 2005. Satiation and inversion in wh-questions. Talk given at Univer-
sity of Hawaii.
Hagstrom, Paul. 1998. Decomposing questions. Doctoral Dissertation, Massachusetts
cut.
Review 1:369–416.
of Edinburgh.
Lodge, Milton. 1981. Magnitude scaling: Quantitative measurement of opinions. Sage.
Lorch, R. F., Jr., and J. L. Myers. 1990. Regression analyses of repeated measures
Pesetsky, David. 1987. Wh-in-situ: Movement and unselective binding. In The rep-
Phillips, Colin, Nina Kazanina, and Shani Abada. 2005. ERP effects of the processing
Pickering, Martin, and Michael Traxler. 2003. Evidence against the use of subcate-
Reinhart, Tanya. 1997. Quantifier scope: How labor is divided between QR and
Sag, Ivan, Inbal Arnon, Bruno Estigarribia, Philip Hofmeister, T. Florian Jaeger,
Jeanette Pettibone, and Neal Snider. submitted. Processing accounts for superiority
effects.
Seaman, J. W., S. C. Walls, S. E. Wise, and R. G. Jaeger. 1994. Caveat emptor: Rank
Sorace, Antonella, and Frank Keller. 2005. Gradience in linguistic data. Lingua
115:1497–1524.
Stepanov, Arthur. 2007. The end of CED? Minimalism and extraction domains.
Syntax 10:80–126.
64:153–181.
Wexler, Kenneth, and Peter Culicover. 1981. Formal principles of language acquisi-
Wilcox, Rand R. 1997. Introduction to robust estimation and hypothesis testing.
Academic Press.