Corpus Linguistics
CORPUS LINGUISTICS
by Tony McEnery and Andrew Wilson
Ed Finegan
University of Southern California, USA
Dieter Mindt
Freie Universität Berlin, Germany
Bengt Altenberg
Lund University, Sweden
Knut Hofland
Norwegian Computing Centre for the Humanities, Bergen, Norway
Jan Aarts
Katholieke Universiteit Nijmegen,The Netherlands
Pam Peters
Macquarie University, Australia
If you would like information on forthcoming titles in this series, please contact
Edinburgh University Press, 22 George Square, Edinburgh EH8 9LF
Corpus Linguistics
An Introduction
Tony McEnery and Andrew Wilson
Second Edition
EDINBURGH UNIVERSITY PRESS
EDINBURGH UNIVERSITY PRESS
A CIP record for this book is available from the British Library
Contents
Acknowledgements viii
List of Abbreviations ix
3 Quantitative Data 75
3.1 Introduction 75
3.2 Qualitative vs. Quantitative Analysis 76
3.3 Corpus Representativeness 77
3.4 Approaching Quantitative Data 81
3.5 Chapter Summary 98
Glossary 197
Appendix A: Corpora Mentioned in the Text 201
Appendix B: Some Software for Corpus Research 209
Appendix C: Suggested Solutions to Exercises 215
Bibliography 219
Index 233
Acknowledgements
Abbreviations
1
Early corpus linguistics and the Chomskyan revolution
both approaches, which will be discussed later in this chapter. But for the
moment we shall use this characterisation of empiricism and rationalism
within our discussion without exploring the concepts further.
Chomsky changed the object of linguistic enquiry from abstract descriptions
of language to theories which reflected a psychological reality, cognitively
plausible models of language.9 In doing so he apparently invalidated the
corpus as a source of evidence in linguistic enquiry. Chomsky suggested that
the corpus could never be a useful tool for the linguist, as the linguist must
seek to model language competence rather than performance. Chomsky’s
distinction between competence and performance has now been somewhat
superseded by the concepts of I and E Language (see Chomsky 1988), but for
the purposes of this discussion we will consider Chomsky’s original concepts.
Competence is best described as our tacit, internalised knowledge of a
language. Performance, on the other hand, is external evidence of language
competence and its usage on particular occasions when, crucially, factors other
than our linguistic competence may affect its form. Chomsky argued that it
was competence rather than performance that the linguist was trying to
model. It is competence which both explains and characterises a speaker’s
knowledge of the language. As the linguist is attempting to explain and char-
acterise our knowledge of language, it is this that he or she should be trying
to model rather than performance. Performance, it was argued, is a poor
mirror of competence. As already stated, performance may be influenced by
factors other than our competence. For instance, factors as diverse as short-
term memory limitations and whether or not we have been drinking can alter
how we speak on any particular occasion. This brings us to the nub of
Chomsky’s initial criticism. A corpus is by its very nature a collection of
externalised utterances; it is performance data and, as such, it must of necessity
be a poor guide to modelling linguistic competence.
But what if we choose to waive this fact and suggest that it is good enough
anyway? Is it possible that this is a mere quibble with the nature of the data?
Chomsky (1957) suggested not. How, for example, can a theory of syntax
develop from the observation of utterances which only partly account for the
true model of language – one’s linguistic competence? This externalised
language not only encodes our competence but also, as noted, an indeter-
minate number of related features on any particular occasion of language use.
How do we determine from any given utterance what are the linguistically
relevant performance phenomena? This is a crucial question, for, without an
answer to this, we cannot be sure, for any set of observations we make based
upon a corpus, whether what we are discovering is directly relevant to linguis-
tics. We may easily be commenting on the effects of drink on speech pro-
duction without knowing it!
To paint an extreme example, consider a large body of transcribed speech
based on conversations with aphasics. If we are not told they are aphasics, we
If the above view holds, many exciting possibilities are opened up. To be
able to use solely empirical methods to describe language would be feasible. It
would, in short, be possible to make language description a matter of objective
fact and not a matter of subjective speculation. It is possible to see why such
an approach to language may be attractive. To set linguistics up alongside other
empirical sciences such as physics may indeed seem a laudable goal. But is it a
valid one? Is it possible to eschew introspection totally? When we consider
Chomsky’s criticisms we must conclude, unfortunately, that this is not possible.
The number of sentences in a natural language is not merely arbitrarily
large. It is no use sitting around speculating about the number of sentences in
a natural language. The number is uncountable – the number of sentences in
a natural language is potentially infinite. The curious reader may try a simple
test at this point. Go to any page in this book and choose any sentence which
is not a direct quotation. Now go to your local public lending library and start
to search for that exact sentence in another book in the library. Unless it is a
very formulaic sentence (such as those sentences appearing as part of a legal
disclaimer at the beginning of the book), it is deeply unlikely that you will
find it repeated in its exact form in any book, in any library, anywhere. The
reasons for this become apparent when we consider the sheer number of
choices, lexical and syntactic, which are made in the production of a sentence,
and when we observe that some of the rules of language are recursive.
Recursive rules may be called repeatedly; indeed they may even call them-
selves. The following phrase structure rules include recursion:
Index
Symbol Meaning
S Sentence
NP Noun Phrase
VP Verb Phrase
PP Prepositional Phrase
JP Adjectival Phrase
AT Definite Article
N Noun
Prep Preposition
V Verb
J Adjective
PropN Proper Noun
Rules
S ➝ NP VP
NP ➝ AT N
NP ➝ AT N PP
PP ➝ Prep NP
VP ➝ V JP
JP ➝ J
NP ➝ PropN
NP ➝ PropN PP
In this set of rules the second NP rule and the sole PP rule refer to one
another. In principle, there could be an infinite number of prepositional
phrases enclosing an infinite number of noun phrases within a sentence,
according to these simple rules. There is a certain circularity in the phrase
structure of English here. These rules alone, by continued application, may give
infinitely many sentences. We may even begin the infinite sentence. Consider
the following sentence from the Associated Press (AP) corpus:11
The official news agency carried excerpts from a speech by Brezhnev at
a Kremlin dinner for visiting Cambodian leader Heng Samrin.
The sequence that interests us here is ‘from a speech by Brezhnev at a Kremlin
dinner’. Notice how the recursive nature of the preposition and noun phrase
construction rules allows the simple prepositional phrase ‘from a speech’ to
grow by successive postmodification. If we break this example down into layers
of successive modification, the point can be made clearly:
from a speech
PP ➝ Prep NP
NP ➝ AT N
from a speech by Brezhnev
PP ➝ Prep NP
NP ➝ AT N PP
PP ➝ Prep NP
NP ➝ PropN
from a speech by Brezhnev at the Kremlin dinner
PP ➝ Prep NP
NP ➝ AT N PP
PP ➝ Prep NP
NP ➝ PropN PP
PP ➝ Prep NP
NP ➝ AT PropN N
The example continues beyond this level of recursion. Is there any grammati-
cal reason that it should stop when it finally does? Could the recursion not
continue forever – an endless unfolding of preposition and noun phrases,
successively postmodifying each other? It is a possibility – a theoretical door-
way to an infinitely long sentence.
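To see in concrete terms how these two rules feed one another, the short program below (our own illustrative sketch, not part of the original text, and using an invented toy vocabulary) applies the NP and PP rules mechanically; each extra level of depth adds another layer of postmodification of the kind seen in the Brezhnev example.

import random

# The mutually recursive rules from the text: PP -> Prep NP, and an NP may
# itself be postmodified by a PP, so the two rules can call one another.
ARTICLES = ["a", "the"]        # AT  (toy vocabulary, invented for illustration)
NOUNS = ["speech", "dinner"]   # N
PREPS = ["from", "by", "at"]   # Prep

def np(depth):
    """NP -> AT N, optionally followed by a postmodifying PP."""
    phrase = [random.choice(ARTICLES), random.choice(NOUNS)]
    if depth > 0:
        phrase += pp(depth - 1)
    return phrase

def pp(depth):
    """PP -> Prep NP."""
    return [random.choice(PREPS)] + np(depth)

# Nothing in the rules themselves forces the recursion to stop; only the
# depth limit imposed here does, which is the point the chapter is making.
for depth in range(4):
    print(" ".join(pp(depth)))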
Observing the recursive nature of phrase structure rules shows clearly how
the sentences of natural language are not finite. A corpus could never be the
sole explicandum of natural language. Our knowledge of, say, grammar is
enshrined in our syntactic competence. This may be composed of a finite set of
rules which give rise to an infinite number of sentences. Performance data, such
from the mid 1950s onwards. Both quotes have behind them an important
question. Why look through a corpus of a zillion words for facts which may
be readily available via introspection? Chomsky (1984: 44) sums up the
supposed power of introspection by saying that ‘if you sit and think for a few
minutes, you’re just flooded with relevant data’. An example of spontaneous
access to, and use of, such data by Chomsky can be seen in the following
exchange (Hill, 1962: 29):
Chomsky: The verb perform cannot be used with mass word objects: one
can perform a task but one cannot perform labour.
Hatcher: How do you know, if you don’t use a corpus and have not
studied the verb perform?
Chomsky: How do I know? Because I am a native speaker of the English
language.
Such arguments have a certain force – indeed one is initially impressed by the
incisiveness of Chomsky's observation and subsequent defence of it. Yet the
quote also underlines why corpus data may be useful. Chomsky was, in fact,
wrong. One can perform magic, for example, as a check of a corpus such as the
BNC reveals.13 Native-speaker intuition merely allowed Chomsky to be wrong
with an air of absolute certainty. In addition to providing a means of checking
such statements, corpora are our only true sources of accurate frequency-
based information for language. Nonetheless, we concede that at times intu-
ition can save us time in searching a corpus.14
So the manifesto laid out by Chomsky saw the linguist, or native speaker of a
language, as the sole explicandum of linguistics. The conscious observations of
a linguist who has native competence in a language are just as valid as
sentences recorded furtively from somebody who did not know they were
swelling some corpus. Indeed, it is not a simple question of empowerment.
Without recourse to introspective judgements, how can ungrammatical utter-
ances be distinguished from ones that simply haven’t occurred yet? If our finite
corpus does not contain the sentence:
*He shines Tony books.
how do we conclude that it is ungrammatical? Indeed, there may be persuasive
evidence in the corpus to suggest that it is grammatical. The construction ‘He
shines’followed by a proper name does not occur in the British National Corpus.
However, the following examples do occur in the corpus:
He gives Keith the stare that works on small boys.
And apparently he owes Dempster a lot of money and Dempster was
trying to get it back.
The man doesn't say anything; he pushes Andy down into the ferns, and
gets a hand free and punches Andy in the face.
We may see nothing to suggest that the complementation of shines is any differ-
ent to that of gives, owes or pushes, if we have never seen shine before. It is only by
asking a native or expert speaker of a language for their opinion of the grammat-
icality of a sentence that we can hope to differentiate unseen but grammatical
constructions from those which are simply ungrammatical and unseen. This may
seem a minor point, but as language is non-finite and a corpus is finite, the prob-
lem is all too real.
Let us sum up the arguments against the use of corpora so far. First, the
corpus encourages us to model the wrong thing – we try to model perfor-
mance rather than competence. Chomsky argued that the goals of linguistics
are not the enumeration and description of performance phenomena, but
rather they are introspection and explanation of linguistic competence.
Second, even if we accept enumeration and description as a goal for linguis-
tics, it seems an unattainable one, as natural languages are not finite. As a conse-
quence, the enumeration of sentences can never possibly yield an adequate
description of language. How can a partial corpus be the sole explicandum of
an infinite language? Finally, we must not eschew introspection entirely. If we
do, detecting ungrammatical structures and ambiguous structures becomes
difficult and, indeed, may be impossible.
The power and compulsion of these arguments ensured the almost total
rejection of corpus-based (empirical) methodologies in linguistics and the
establishment of a new orthodoxy. Rationalist introspection-based approaches
were in the ascendant.
We will not present any justifications of Chomsky’s theories here. The
purpose of this section is to summarise Chomsky’s criticisms of early corpus
linguistics, not to review what path Chomsky’s theories followed. For readers
interested in this, Horrocks (1987) presents an engaging overview, Matthews
(1981) presents an expressly critical review15 and Haegeman (1991), Radford
(1997) and Smith (1999) review later work.
million words using humans alone is, to put it simply, slow, expensive and
prone to error.
Abercrombie’s (ibid.) concept of the pseudo-procedure can certainly be
construed as a criticism of corpus-based linguistics before computerisation.
There were plenty of things which may have seemed like a good idea, but
were well nigh impossible in practice in the 1950s and before. Most of the
procedures described in this book would be nearly impossible to perform on
corpora of an average size today, if we were still relying on humans alone for
the analysis. Whatever Chomsky’s criticisms, Abercrombie’s discussion of the
pseudo-procedure certainly revealed a practical limitation of early corpus
linguistics. Early corpus linguistics required data processing abilities that were
simply not readily available at the time. Without that data processing ability,
their work was necessarily made more expensive, more time consuming, less
accurate and therefore, ultimately, less feasible.
The impact of the criticisms levelled at early corpus linguistics in the 1950s
was immediate and profound. It seems to the casual reader that almost
overnight linguistics changed and the corpus became an abandoned, discred-
ited tool. But, as the next section shows, that was certainly not the case.
by him. In short, work with corpus data of sorts continued because certain
topics, entirely worthy of being described as part of linguistics, could not be
effectively studied in the artificial world of well-formedness judgements and
idealised speaker-hearers created by Chomsky.
On a general note, work based on the corpus methodology was undertaken
in the 1960s and 1970s, but, as we will examine more closely in the next
section, as a somewhat minority methodology. The next question we must ask
ourselves then is this: why did some researchers bother to continue to use a
corpus-based approach to linguistics? The answer is that, in the rush for ratio-
nalism sparked by Chomsky, drawbacks became apparent which were, in their
own way, just as profound as the drawbacks he had so clearly pointed out in
the position of the early corpus linguists. At this point we must recall the
earlier description of the nature of the data Chomsky wanted to observe. The
great advantage of the rationalist approach is that, by the use of introspection,
we can gather the data we want, when we want, and also gather data which
relates directly to the system under study, the mind. Chomsky had rightly
stated that a theory based on the observation of natural data could not make
as strong a claim on either of these points. But there are advantages to the
observation of natural data, and it may also be that the case against natural data
was somewhat overstated.
Naturally occurring data has the principal benefit of being observable and
verifiable by all who care to examine it. When a speaker makes an introspec-
tive judgement, how can we be sure of it? When they utter a sentence we can
at least observe and record that sentence. But what can we do when they
express an opinion on a thought process? That remains unobservable, and we
have only one, private, point of view as evidence: theirs. With the recorded
sentence we can garner a public point of view – the data is observable by all
and can be commented on by all. This problem of public vs. private point of
view is one which bedevils not only linguistics, but other disciplines where
the divide exists between natural and artificial observation, such as psychology, as
discussed by Baddeley (1976: 3–15). The corpus has the benefit of rendering
public the point of view used to support a theory. Corpus-based observations
are intrinsically more verifiable than introspectively based judgements.
There is another aspect to this argument. The artificial data is just that –
artificial. Sampson (1992: 428) made this point very forcefully in some ways,
when he observed that the type of sentence typically analysed by the intro-
spective linguist is far away from the type of evidence we tend to see typically
occurring in the corpus. It almost seems that the wildest skew lies not in
corpus evidence, but in introspectively informed judgements. It is a truism
that this can almost not be helped. By artificially manipulating the informant,
we artificially manipulate the data itself. This leads to the classic response from
the informant to the researcher seeking an introspective judgement on a
sentence: ‘Yes I could say that – but I never would.’ Chomsky’s criticism that
speeches that 95 per cent of the utterances in natural language are ungram-
matical.18 Yet Chomsky’s assault on the grammaticality of language has turned
out to be somewhat inaccurate. Labov (1969: 201) argued that, based upon his
experience of working with spoken corpus data, ‘the great majority of utter-
ances in all contexts are complete sentences'. Along one dimension at least, we
can suggest that corpus-based enquiry may not be as invalid as originally
supposed. The corpus is not necessarily a mishmash of ungrammatical sentences.
It would appear that there is reason to hope that the corpus may generally
contain sentences which are grammatical. Note we are not saying here that all
of the sentences in a corpus are grammatically acceptable. They are not neces-
sarily so, a theme that will be returned to in Chapter 4. But it does at least seem
that good grounds exist to allow us to believe that the problem of un-gram-
matical sentences in corpora may not be as acute as was initially assumed.
This brings us back to a point made earlier. We showed that corpora are
excellent sources of quantitative data, but noted that Chomsky may well
respond that quantitative data is of no use to linguists. Well, here again, we can
suggest that his point, though well made, is actually not supported by reality.
Setting aside the fact that quantitative approaches to linguistic description have
yielded important results in linguistics, such as in Svartvik’s (1966) study of
passivisation, the quantitative data extracted from corpora can be of great prac-
tical use to linguists developing tools for analysis. To take the example of part-
of-speech analysis, all of the successful modern approaches to automated
part-of-speech analysis rely on quantitative data derived from a corpus. We will
not labour this point here as it is properly the province of Chapter 5. But one
observation must be made. Without the corpus, or some other natural source
of evidence yielding comparable quantitative data, such powerful analytical
tools would not be available to any linguist or computational linguist today.
‘The proof of the pudding is in the eating’ is an old English saying and it
certainly allows us to dismiss Chomsky’s suggestion that quantitative data is of
no use or importance.
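To illustrate, in the simplest possible terms, how corpus-derived frequencies can feed such a tool, the sketch below (ours, not a description of the taggers discussed in Chapter 5) resolves an ambiguous word by choosing the tag it most often carries in a small, invented hand-tagged training sample.

from collections import Counter, defaultdict

# An invented, hand-tagged training sample standing in for a real tagged corpus.
tagged_corpus = [
    ("the", "AT"), ("dogs", "NN2"), ("bark", "VVB"),
    ("the", "AT"), ("bark", "NN1"), ("peels", "VVZ"),
    ("dogs", "NN2"), ("bark", "VVB"),
]

# Count how often each word form occurs with each part-of-speech tag.
tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word][tag] += 1

def most_likely_tag(word):
    """Return the tag this word most frequently carries in the training data."""
    counts = tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"

print(most_likely_tag("bark"))  # VVB - more often a verb in the sample
print(most_likely_tag("the"))   # AT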
So some of the criticisms of corpus linguistics made by Chomsky were in
part valid and have helped, as we will see shortly, to foster a more realistic atti-
tude towards corpora today. But the criticisms were also partially invalid and
avoided a genuine assessment of the strengths of corpora as opposed to their
weaknesses. This observation begins to suggest why some people continued to
work with corpora, yet fails to suggest why they did so in the face of the other
important criticism of early corpus linguistics reviewed here, that of Aber-
crombie (ibid.). So let us turn finally to this criticism. The pseudo-procedure
observation made by Abercrombie was, at the time, somewhat accurate. We
could argue that it need not necessarily have been so, as we shall see. But, for
the moment, we will take its accuracy for granted. A crucial point, however, is
that today, corpus linguistics, certainly where it is driven by lexical goals, is no
longer a pseudo-procedure. The digital computer has been instrumental in
seems worthwhile to consider in slightly more detail what these processes that
allow the machine to aid the linguist are. The computer has the ability to
search for a particular word, sequence of words or even perhaps part of speech
in a text. So, if we are interested, say, in the usage of the word horde in a text,
we can simply ask the machine to search for this word in the text. Its ability
to retrieve all examples of this word, usually in context, is a further aid to the
linguist. The machine can find the relevant text and display it to the user. It
may also calculate the number of occurrences of the word so that information
on the frequency of the word may be gathered. We may then be interested in
sorting the data in some way – for example alphabetically on words appearing
to the right or left. We may even sort the list by searching for words occurring
in the immediate context of the word. We may take our initial list of examples
of horde presented in context (usually referred to as a concordance21) and
extract from that another list, say of all examples of hordes with the word people
close by. Below is a sample of such a concordance for hordes coupled with people.
The concordance is taken from the British National Corpus using the SARA22
concordance program. A wide variety of concordance programs which can
work on a wide variety of data and carry out a host of useful processes is now
available, with some, such as WordSmith,23 being free for evaluation purposes.
Note the concordance program typically highlights and centres the examples
found, with one example appearing per line with context to the left and right
of each example. Concordance programs will be mentioned again from time
to time throughout this book, as they are useful tools for manipulating corpus
data. After using a concordance program it is all too easy to become blasé
about the ability to manipulate corpora of millions of words. In reality we
should temper that cavalier attitude with the realisation that, without the
computer, corpus linguistics would be terrifically difficult and would hover
grail-like beyond reasonable reach. Whatever philosophical advantages we may
eventually see in a corpus, it is the computer which allows us to exploit
corpora on a large scale with speed and accuracy, and we must never forget
that. Technology has allowed a pseudo-procedure to become a valuable linguistic
methodology.
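By way of illustration, the sketch below (our own, and far simpler than SARA or WordSmith) shows the core of such a concordancer: it finds every occurrence of a search word, prints each one centred between its left and right context, counts the hits, and can restrict the list to occurrences with a given collocate nearby. The sample text is invented.

import re

def concordance(text, node, collocate=None, width=30, window=4):
    """Print a simple KWIC concordance for `node`, optionally keeping only
    those occurrences with `collocate` within `window` words of the node."""
    tokens = re.findall(r"\w+", text.lower())
    hits = 0
    for i, token in enumerate(tokens):
        if token != node:
            continue
        nearby = tokens[max(0, i - window): i + window + 1]
        if collocate and collocate not in nearby:
            continue
        hits += 1
        left = " ".join(tokens[max(0, i - window): i])
        right = " ".join(tokens[i + 1: i + window + 1])
        print(f"{left:>{width}}  {token.upper()}  {right:<{width}}")
    print(f"{hits} occurrence(s) of '{node}' found.")

# Invented example text, purely for demonstration.
sample = ("Hordes of people filled the square. The hordes pressed forward, "
          "and still more hordes of tourists arrived.")
concordance(sample, "hordes", collocate="people")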
Maybe now it is possible to state why this book is being written. A
theoretical and technical shift over the past half century has led to some of the
criticisms of the corpus-based approach being tempered. Note that the term
‘tempered’ is used rather than ‘discarded’. Only in certain circumstances in the
case of the pseudo-procedure can we safely say that objections to the use of
corpora in linguistics have been countered.
Chomsky’s criticisms are not wholly invalidated, nor indeed could they be.
Chomsky revealed some powerful verities and these shape the approach taken
to the corpus today. Chomsky stated that natural language was non-finite. This
book will not argue with that finding. Chomsky argued that externalised
speech was affected by factors other than our linguistic competence. This book
will not argue with that finding. Some other criticisms of Chomsky’s may be
reduced in degree, but, with the possible exception of his denigration of quan-
titative data, this text would not seek to dispute the fundamental point being
made. The argument being made here is that, in abandoning the corpus-based
approach, linguistics, if we can speak idiomatically, threw the baby out with the
bath-water. The problems Chomsky rightly highlighted were believed to be
fundamental to the corpus itself, rather than being fundamental to the approach
taken to the corpus by the post-Bloomfieldian linguists. In other words, if you
think language is finite, then your interpretation of the findings in a corpus
may reflect that – if we can change the interpretation of the findings in a corpus
to match the verities Chomsky revealed, then the natural data provided by the
corpus can be a rich and powerful tool for the linguist. But we must under-
stand what we are doing when we are looking in a corpus and building one.
The mention of natural data brings in the other general point. Why move
from one extreme of only natural data to another of only artificial data? Both
have known weaknesses. Why not use a combination of both and rely on the
strengths of each to the exclusion of their weaknesses? A corpus and an intro-
spection-based approach to linguistics are not mutually exclusive. In a very real
sense they can be gainfully viewed as being complementary.
The reasons for the revival of corpus linguistics should now be quite obvi-
ous. It is, in some ways, an attempt to redress the balance in linguistics between
the use of artificial data and the use of naturally occurring data. As we have
stated already and will see again in later chapters, artificial data can have a place
in modern corpus linguistics. Yet it should always be used with naturally occur-
ring data which can act as a control, a yardstick if you will. Corpus linguistics
is, and should be, a synthesis of introspective and observational procedures,
relying on a mix of artificial and natural observation.
Before concluding this chapter and moving on to review the constitution of
the modern corpus, however, we will present a brief overview of important
work in corpus linguistics that occurred during the interregnum of the 1960s
and 1970s.
linguistics. It was Busa and Juilland together who worked out much of the
foundations of modern corpus linguistics. If their contributions are less than
well known, it is largely because corpus linguistics became closely associated
with work in English only, and neither Busa nor Juilland worked on corpora of
modern English. To review work in English linguistics, we need to consider the
last two major groups working on corpus linguistics from the 1950s onwards.
Eastern Europe25 (e.g. Leipzig, Potsdam). There is little doubt that a great deal
of the current popularity of corpus linguistics, especially in studies of the
English language, can be traced to this line of work. However, another related,
though somewhat separate, strand of corpus work has been similarly influen-
tial in English corpus linguistics over the past forty years. That is the work of
the neo-Firthians.
longer than its association with Firth may suggest (see Kennedy 1998 for a
discussion) and the basic idea has been expressed by linguists of the Prague
school since at least the 1930s, who used the term automation rather than
collocation (see Fried 1972 for a review in English of this work). However, it
was Firth who inspired the work of the corpus linguists who worked with the
idea of collocation, hence the term is his and the concept behind the term is
commonly ascribed to him.
By far the largest programme of research inspired by neo-Firthian corpus
linguists has been the project carried out at Birmingham University
by John Sinclair and his team from around 1980 onwards (though Sinclair had
been undertaking corpus-based research at Birmingham earlier than this). The
COBUILD project and its associated corpus, the Bank of English, will be men-
tioned again in the next chapter. It is worthy of note not just because of the
language resources it has given rise to (i.e. the Collins Cobuild series of publi-
cations) but because the neo-Firthian principles upon which it is based have
produced a different type of corpus from that established in the tradition of
Juilland and the associated work. We will discuss this difference in more
detail in section 2.1. For the moment, we will say, in advance, that the work of
the neo-Firthians is based upon the examination of complete texts and the
construction of fairly open-ended corpora. The other tradition of corpus
building relies upon sampling and representativeness to construct a corpus of
a set size which, by and large, eschews the inclusion of complete texts within
a corpus.
1.6. CONCLUSION
So a more accurate pattern of the development of corpus linguistics is now
apparent. During the 1950s a series of criticisms were made of the corpus-
based approach to language study. Some were right, some were half-right and
some have proved themselves, with the passage of time, to be wrong or
irrelevant. The first important point is that these criticisms were not necessar-
ily fatal ones, though they were widely perceived as such at the time. The
second important point is that some linguists carried on using the corpus as a
technique and tried to establish a balance between the use of the corpus and
the use of intuition.
Although the methodology went through a period of relative neglect for
two decades, it was far from abandoned. Indeed, during this time essential
advances in the use of corpora were made. Most importantly of all, the linking
of the corpus to the computer was completed during this era. Following these
advances, corpus studies boomed from 1980 onwards, as corpora, techniques
and new arguments in favour of the use of corpora became more apparent.
Currently this boom continues – and both of the ‘schools’ of corpus linguis-
tics are growing, with work being carried out in both traditions
world-wide. Corpus linguistics is maturing methodologically and
NOTES
1. It is important to state that Chomsky's attacks were rarely directed at the use of corpora as such. He
was attacking the behaviourist and logical positivist underpinnings of the structuralist tradition in most
of his early writings. Corpora – associated with and endorsed by the structuralists – were attacked
largely by implication.
2. Though the corpora used were merely large collections of transcribed interactions. Important notions
such as representativeness (see Chapter 2) were not used in the construction of these 'corpora'.
3. See Chapter 5 for a more detailed discussion of the use of parallel and translation corpora in studies
such as this.
4. Fries based his grammar on a corpus of transcribed telephone conversation. This is interesting as the
modern corpus grammars are based almost exclusively on written data. In some respects, it is fair to
say that the work of Fries has only recently been surpassed by the grammar of Biber, Johansson,
Leech, Conrad and Finegan (1999).
5. See the Glossary for a definition of this term.
6. See the Glossary for a definition of this term.
7. See Seuren (1998) for a coverage of the debate over the nature of data in linguistics. Pages 259 to 267
are especially interesting.
8. It is of interest to note that Chafe (1992: 87) has suggested that naturally occurring, almost subcon-
scious, introspective judgements may be another source of evidence we could draw upon. It is difficult
to conceive of a systematic method for recovering such observations, however, so for the purposes of
this discussion we shall dismiss this source of natural data and refer the curious reader to Chafe (ibid.).
9. For those readers aware of Chomsky's apparently changing position on realism, note that we are presenting
him here as being essentially a realist. There has been some debate over the years about what his posi-
tion actually is. For the purposes of this discussion, we will take Chomsky at his word (Chomsky,
1975: 35) that he was an avowed realist.
10. Sebba (1991) gives an interesting account of early corpus linguistics.
11. This corpus is introduced in detail in Chapter 6.
12. Postal quoted in Harris (1993: 34).
13. In the BNC perform magic occurs once, performing magic occurs three times. Other examples are available
in the BNC – for example perform sex.
14. Indeed, intuition can also be of use at times in correcting the faulty intuition of others. The perform
magic counter example used by Hatcher was drawn from intuition (Hill 1962: 31).
15. Or to use Chomsky’s definition, from his ‘Lectures on Government and Binding’, a ‘pathological’ set
of criticisms.
16. Note that in historical linguistics, corpora of sorts remained in use too – here again it is impossible to
question native speakers.
17. See Ingram (1989: 223) for further discussion – and criticism – of this view. Labov (1969) is crucial
reading for those wanting to read a powerful response to Chomsky's view on the degenerate nature
of performance data.
18. There is no written work by Chomsky that we are aware of where this claim is explicitly made.
However, there is no reason to doubt Labov's report of comments made by Chomsky.
19. See section 2.2 for more information on corpus annotations. Garside, Leech and McEnery (1997) is
a detailed review of the topic of corpus annotation.
20. The motivation for the link between corpora and language engineering outlined in Chapter 5
becomes more apparent when we consider this point – corpus linguists have a vested interest in the
development of systems by language engineers which will eliminate the pseudo-procedure argument
for an ever wider set of research questions. Roughly speaking, corpus linguists want language
engineers to provide tools which enable ever 'smarter' searches of linguistic corpora to be
undertaken.
21. Concordance programs will be referred to briefly in this book. Barnbrook (1996) and Aston and
Burnard (1998) deal with this topic in greater depth.
22. See Aston and Burnard (1998) and Appendix B for details of this program.
23. See Appendix B for details of this program.
24. Francis (1979) describes the early days of the Brown corpus.
25. The East German school of corpus linguistics produced the interesting corpus-based grammar
English Grammar – a University Handbook (Giering, Gottfried, Hoffmann, Kirsten, Neubert and Thiele
1979), which is a rather sophisticated corpus grammar. Parts of it do, however, read rather strangely
nowadays. There cannot be many grammars within which sentences such as 'Capitalism is busy
destroying what it helped to build' (page 291) and 'In a rational socialist state – as in the Soviet
Union – it could bring relief for both manual and clerical workers' (page 347) are analysed.
26. Of course in doing so Firth was not suggesting something entirely new – there are echoes here of
Sapir’s (1929: 214) call for linguistics to become ‘increasingly concerned with the many
anthropological, sociological and psychological problems which invade the field of language’.
2
What is a corpus and what is in it?
therefore necessary to choose the second option and build a sample of the
language variety in which we are interested.
As we discussed in Chapter 1, it was Chomsky's criticism of early corpora
that they would always be skewed: in other words, some utterances would be
excluded because they are rare, other much more common utterances might
be excluded simply by chance, and chance might also act so that some rare
utterances were actually included in the corpus. Although modern computer
technology means that nowadays much larger corpora can be collected than
those Chomsky was thinking about when he made these criticisms, his criti-
cism about the potential skewedness of a corpus is an important and valid one
which must be taken seriously. However, this need not mean abandoning the
corpus analysis enterprise. Rather, consideration of Chomsky’s criticism
should be directed towards the establishment of ways in which a much less
biased and more generally representative corpus may be constructed.
In building a corpus of a language variety, we are interested in a sample
which is maximally representative of the variety under examination, that is,
which provides us with as accurate a picture as possible of the tendencies of
that variety, including their proportions. We would not, for example, want to
use only the novels of Charles Dickens or Charlotte Brontë as a basis for
analysing the written English language of the mid-nineteenth century. We
would not even want to base our sample purely on text selected from the
genre of the novel. What we would be looking for are samples of a broad range
of different authors and genres which, when taken together, may be considered
to ‘average out’ and provide a reasonably accurate picture of the entire
language population in which we are interested. We shall return in more detail
to this issue of corpus representativeness and sampling in Chapter 3.
ing in size and are less rigorously sampled than finite corpora, they are not such
a reliable source of quantitative (as opposed to qualitative) data about a
language. With the exception of the monitor corpus observed, though, it
should be noted that it is more often the case that a corpus has a finite number
of words contained in it. At the beginning of a corpus-building project, the
research plan will set out in detail how the language variety is to be sampled,
and how many samples of how many words are to be collected so that a pre-
defined grand total is arrived at. With the Lancaster-Oslo/Bergen (LOB) corpus
and the Brown corpus the grand total was 1,000,000 running words of text; with
the British National Corpus (BNC) it was 100,000,000 running words. Unlike
the monitor corpus, therefore, when such a corpus reaches the grand total of
words, collection stops and the corpus is not thereafter increased in size. (One
exception to this is the London-Lund corpus, which was augmented in the
mid-1970s by Sidney Greenbaum to cover a wider variety of genres.)
Leech (1993) identifies seven maxims which should apply in the annotation
of text corpora. These may be paraphrased as follows:
1. It should be possible to remove the annotation from an annotated corpus
and revert to the raw corpus. Thus if the raw corpus contains the sentence
Claire collects shoes (BNC)1 and this is annotated for part of speech as
‘Claire_NP1 collects_VVZ shoes_NN2’, then it should be possible to remove
the annotation and revert to the original Claire collects shoes. The ease of recov-
erability in this case (simply by stripping everything between an underscore
character and a space or punctuation mark) may be contrasted with the
prosodic annotation in the London-Lund corpus, which is interspersed within
words – for example ‘g/oing’ indicates a rising pitch on the first syllable of
going – and means that the original words cannot so easily be reconstructed.
2. It should be possible to extract the annotations by themselves from the text
for storage elsewhere, for example in the form of a relational database or in an
interlinear format where the annotation occurs on a separate line below the
relevant line of running text. This is the flip side of (1). Taking points (1) and
(2) together, in other words, the annotated corpus should allow the maximum
flexibility for manipulation by the user.
3. The annotation scheme should be based on guidelines which are available to
the end user. For instance, most corpora have a manual available with full
details of the annotation scheme and the guidelines issued to the annotators.
This enables the user to understand fully what each instance of annotation
represents without resorting to guesswork and to understand in cases where
more than one interpretation of the text is possible why a particular annota-
tion decision was made at that point.
4. It should be made clear how and by whom the annotation was carried out.
For instance, a corpus may be annotated manually, sometimes just by a single
person and sometimes by a number of different people; alternatively, the anno-
tation may be carried out completely automatically by a computer program,
whose output may or may not then be corrected by human beings. Again, this
information is often contained in a printed manual or a documentation file
issued with the corpus.
5. The end user should be made aware that the corpus annotation is not infal-
lible, but simply a potentially useful tool. Although, as prescribed by point (6),
annotators normally try to aim for as consensus based an annotation scheme as
possible, any act of corpus annotation is by definition also an act of interpreta-
tion, either of the structure of the text or of its content.
6. Annotation schemes should be based as far as possible on widely agreed and
theory-neutral principles. Thus, for instance, as we shall see below in section
2.2.3(b), parsed corpora often adopt a basic context-free phrase structure
Corpus; the Lampeter Corpus of Early Modern English Tracts; and the CELT
project, which aims to provide a range of machine-readable texts in the vari-
ous languages which have been used in Ireland during its history – Irish,
Hiberno-Latin and Hiberno-English. It should be noted, however, that the
TEI’s guidelines refer only to the final user-ready text and need not affect the
annotation practices adopted during the preliminary processing of the text: at
Lancaster University, for example, initial part-of-speech annotation is carried out using
the same practice which has been in use for over a decade (i.e. attaching a tag
to the end of a word using the underscore character) and this is only converted
to TEI-conformant format at a later stage in the processing.
The full TEI mark-up system is very comprehensive and quite complex.
However, it is possible to encode documents using only a subset of the
complete TEI guidelines – TEI-LITE. TEI-LITE is a standardised set of the most
common or important TEI tags, which has been developed to facilitate and
broaden the use of the TEI guidelines and to function as a starter package for
those coming to them for the first time. TEI-LITE documents are still TEI-
conformant, but they will tend to contain a smaller set of explicitly marked
features than documents which are encoded using the full guidelines.
The TEI only provides quite broad guidelines for the encoding of texts.
There exists, therefore, considerable scope for variation, even within the broad
remit of being TEI-conformant. Hence, further standards are necessary if
groups of researchers and resource users wish to specify in more detail the
content and form of annotations. The European Union (EU), for example, has
set up an advisory body known as EAGLES (Expert Advisory Groups on
Language Engineering Standards), whose remit is to examine existing prac-
tices of encoding and annotation for the official languages of the European
Union, and to arrive at specifications for European standards which are to be
employed in future EU-funded work. EAGLES consists of a number of working
groups which deal with the various types of resource, including lexicons,
computational linguistic formalisms and, most importantly for our present
purpose, text corpora. The text corpora working group is particularly
concerned with defining standard schemes for annotation which may be
applied to all the EU languages: for example, it has produced a base set of
features for the annotation of parts of speech (e.g. singular common noun,
comparative adverb and so on) which will form the standard scheme for all
the languages, but will be sufficiently flexible to allow for the variations
between languages in how they signal different types of morphosyntactic
information. (See section 2.2.2.3(a) on the EAGLES part-of-speech guidelines
and section 2.2.2.3(c) on a set of preliminary guidelines for syntactic annota-
tion.) Thus the EAGLES initiative, as compared with the TEI, aims more directly
at specifying the content of text annotations rather than their form. It should be
noted, however, that a number of features of language are so susceptible to
variation across languages, and are of varying importance in different contexts,
that a number of EAGLES guidelines opt for the description of ‘good laboratory
practice’ rather than specifying concrete features and values to be annotated.
In this context, the most important EAGLES recommendation is
the Corpus Encoding Standard or CES (Ide, Priest-Dorman and Veronis 1996).
This provides a standard definition of the markup – both textual and linguis-
tic – which should be applied to corpora in the EU languages.
It is also being applied to languages outside the EU – for instance to Eastern
European languages in the MULTEXT-EAST project (see http://www.lpl.univ-
aix.fr/projects/multext-east/). The CES specifies formally such things as which
character sets should be used; what header tags should be used and how; what
corpus and text architectures should be recognised (e.g. various levels of text
division); how to treat ‘encyclopaedic’ information such as bibliographic refer-
ences, proper names and so on; and how linguistic annotation and alignment
information should be represented. All the markup specified in the CES is in SGML
and furthermore adheres to the guidelines of the TEI: the CES is an application
of the TEI to a specific document type (the linguistic corpus). Thus, any corpus
that is CES-conformant is also, by definition, TEI-conformant and SGML-confor-
mant. This allows for maximal interchangeability of corpus data world-wide.
The CES recognises three levels of conformance to its own specifications, with
level 1 being a minimal subset of recommendations which must be applied to
a corpus for it to be considered CES-conformant. (This is a little like the
TEI-LITE scheme that we just mentioned.) The fact that the CES goes beyond
the TEI – that is, specifies what elements are to be applied and how – not only
means that CES-conformant corpora are maximally interchangeable at the TEI/SGML
level, but also allows more specific assumptions to be made in, for example,
designing software architectures which involve the processing of corpus data.
Finally in this section, we may note briefly a recent development, which
implements Leech’s principle of the separability of text and annotation –
standoff annotation. In a version of this, Thompson (1997) has used
Extensible Markup Language (XML) (a restricted, but conformant, form of
SGML) to annotate texts in such a way that the annotation is stored separately
from the text itself. The text has within it the most basic level of encoding –
for example, sequential numbers of words in the text. The separate annotation
then makes use of this information to associate items of annotation with
specific words or stretches of words within the text. For instance, it might
assign the part of speech y to word 16. The advantage of stand-off annotation
is that it can help maintain readability and processibility with heavily anno-
tated texts.
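A rough sketch of the idea follows (ours, using a deliberately simplified representation rather than Thompson's actual XML format): the words of the text are numbered, the part-of-speech tags live in a separate structure keyed to those numbers, and the raw text is therefore always recoverable, in the spirit of Leech's first two maxims.

# Raw text: the most basic level of encoding is simply the numbered word
# sequence; the example sentence is the one quoted earlier from the BNC.
words = ["Claire", "collects", "shoes"]

# Standoff annotation: the tags are stored apart from the text, keyed by
# word number (the tag values are those quoted earlier in the chapter).
pos_annotation = {1: "NP1", 2: "VVZ", 3: "NN2"}

def raw_text():
    """The unannotated text is recovered simply by ignoring the annotation."""
    return " ".join(words)

def inline_view():
    """Merge text and standoff tags into the familiar word_TAG format."""
    return " ".join(f"{w}_{pos_annotation[i + 1]}" for i, w in enumerate(words))

def strip_tags(annotated):
    """Conversely, inline word_TAG annotation can be stripped back to text."""
    return " ".join(token.split("_")[0] for token in annotated.split())

print(raw_text())                 # Claire collects shoes
print(inline_view())              # Claire_NP1 collects_VVZ shoes_NN2
print(strip_tags(inline_view()))  # Claire collects shoes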
annotated in corpora. For example, CELT indicates those words in the texts
which are in a language other than the main language of the document and
says which language they belong to. A variety of what might be termed
‘encyclopaedic knowledge’ is also encoded in the CELT texts: for instance, all
words which are personal or place names are annotated as such. Figure 2.3
shows a short TEI-encoded extract from a Middle/Early Modern Irish text
from CELT.
Note in Figure 2.3 the various kinds of information encoded using SGML
tags delimited by < … > and </ … >:
<TEI.2 ID="G201001">
<TEIHEADER STATUS="UPDATE" CREATOR="Donnchadh Ó Corráin"
DATE.CREATED="1995" DATE.UPDATED="1997-09-15">
<FILEDESC>
<TITLESTMT>
<TITLE>Lives of the Saints from the Book of Lismore</TITLE>
<TITLE TYPE="GMD">An electronic edition</TITLE>
<RESPSTMT>
<RESP>compiled by</RESP>
<NAME ID="EBJ">Elva Johnston</NAME>
</RESPSTMT>
<FUNDER>University College, Cork</FUNDER>
<FUNDER>Professor Marianne McDonald via the CELT Project.</FUNDER>
</TITLESTMT>
<EDITIONSTMT>
<EDITION N="2">First draft, revised and corrected.</EDITION>
</EDITIONSTMT>
<EXTENT><MEASURE TYPE="words">56 024</MEASURE></EXTENT>
<PUBLICATIONSTMT>
<PUBLISHER>CELT: Corpus of Electronic Texts: a project of University College,
Cork</PUBLISHER>
<ADDRESS>
<ADDRLINE>College Road, Cork, Ireland.</ADDRLINE>
</ADDRESS>
<DATE>1995</DATE>
<DISTRIBUTOR>CELT online at University College, Cork,
Ireland.</DISTRIBUTOR>
<IDNO TYPE="celt">G201001</IDNO>
<AVAILABILITY STATUS="RESTRICTED">
<P>Available with prior consent of the CELT programme for purposes of academic
research and teaching only.</P>
</AVAILABILITY>
</PUBLICATIONSTMT>
<SOURCEDESC>
<LISTBIBL>
<HEAD>Manuscript sources.</HEAD>
<BIBL N="1">Chatsworth, Book of Lismore (at Chatsworth since 1930), folios
41ra1–84ra9 (facsimile foliation); published in facsimile: R. A. S. Macalister
(ed.), The Book of Mac Carthaigh Riabhach otherwise the Book of Lismore,
Facsimiles in Collotype of Irish Manuscripts V (Dublin: Stationery Office for
Irish Manuscripts Commission 1950). Stokes's edition was made from the original,
lodged for him in the British Museum.</BIBL>
<BIBL N="2">Dublin, Royal Irish Academy, MS 57, 477 (olim 23 K 5 see Catalogue
of Irish Manuscripts in the Royal Irish Academy, fasc. 2, 163), made by Eugene
O'Curry, c. 1839.</BIBL>
<BIBL N="3">Dublin, Royal Irish Academy, MS 478, 474 (olim 23 H 6 see Catalogue
of Irish Manuscripts in the Royal Irish Academy, fasc. 10, 1278), made by Joseph
O'Longan in 1868.</BIBL>
</LISTBIBL>
<LISTBIBL>
<HEAD>The edition used in the digital edition.</HEAD>
<BIBLFULL>
<TITLESTMT>
<TITLE>Lives of the Saints from the Book of Lismore.</TITLE>
<EDITOR ID="WS">Whitley Stokes</EDITOR>
</TITLESTMT>
<EDITIONSTMT>
<EDITION>First edition</EDITION>
</EDITIONSTMT>
<EXTENT>cxx + 411 pp</EXTENT>
<PUBLICATIONSTMT>
<PUBLISHER>Clarendon Press</PUBLISHER>
<PUBPLACE>Oxford</PUBPLACE>
<DATE>1890</DATE>
</PUBLICATIONSTMT>
</BIBLFULL>
</LISTBIBL>
</SOURCEDESC>
</FILEDESC>
<ENCODINGDESC>
<PROJECTDESC>
<P>CELT: Corpus of Electronic Texts</P>
</PROJECTDESC>
<SAMPLINGDECL>
<P>All editorial introduction, translation, notes and indexes have been omitted.
Editorial corrigenda (pp. 40–411 and elsewhere in the edition) are integrated
into the electronic edition. In general, only the text of the Book of Lismore
had been retained, since variants are cited from other MSS in a non-systemtic
way and most are non-significant. Therefore, only significant variants are
retained and these are tagged as variants. Missing text supplied by the
editor is tagged.</P>
</SAMPLINGDECL>
<EDITORIALDECL>
<CORRECTION STATUS="HIGH">
<P>Text has been thoroughly checked and proofread. All corrections and
supplied text are tagged.</P>
</CORRECTION>
Figure 2.2 Partial extract from TEI (P3) document header (CELT project)
2.2.2.2 Orthography
Moving on from general, largely external, attributes of the text, we now
consider the annotation of specific features within the text itself.
It might be thought that converting a written or spoken text into machine-
readable form is a relatively simple typing or optical scanning task. But, even
with a basic running machine-readable text, issues of encoding are vital,
although their extent may not at first be apparent to English-speakers.
Figure 2.3 Extract from a TEI-encoded (P3) Middle/Early Modern Irish text from the CELT project
In languages other than English, there arise the issues of accents and, even
more seriously, of non-Roman alphabets such as Greek, Russian, Japanese,
Chinese and Arabic. IBM-compatible personal computers, along with certain
other machines, are capable of handling accented characters by using the
extended 8-bit character set. Many mainframe computers, however, especially
in English-speaking countries, do not make use of an 8-bit character set. For
maximal interchangeability, therefore, accented characters need to be encoded
in other ways. Native speakers of languages with accented characters tend to
adopt particular strategies for handling their alphabets when using computers
or typewriters which lack these characters. French speakers on the one hand
typically omit the accent entirely, so that, for example, the name Hélène would
be written Helene. German speakers on the other hand either introduce the
additional letter e for the umlaut (which is historically what the umlaut repre-
sented), so that Frühling would be written Fruehling, or alternatively they place
a double quote mark immediately before the relevant letter, for example
Fr"uhling; at the same time, they simply ignore the variant scharfes s (ß) and
replace it with the ordinary ss. However, although these strategies for text
encoding are the natural practice of native speakers, they pose problems in the
encoding of electronic text for other purposes. With the French strategy, one
loses information which is present in the original text, and with the German
strategies this problem is compounded by the addition of extraneous ortho-
graphic information. In encoding texts, we want to aim as much as possible at
a complete representation of the text as it exists in its natural state. In response
to this need, the TEI has recently put forward suggestions for the encoding of
such special characters. Although the TEI provides for the use of several
standard character sets (e.g. the extended character set including the basic
accented characters), which must be defined in a writing system declara-
tion or WSD, it is strongly recommended that only the ISO-646 subset (which
is more or less the basic English alphabet) is used, with special characters being
annotated in other ways.The guidelines suggest that the characters are encoded
as TEI entities, using the delimiting characters of & and ;. Thus, for example, ü
would be encoded in the TEI as &uuml;. This can make the text somewhat
hard for the human user to read (see, for example, Figure 2.3 above),2 but has
the huge advantage that all the information in the original text is encoded.
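To give a concrete flavour of this style of encoding, the short Python sketch below replaces a handful of accented characters with entity references of this kind. The sketch and its function name are purely illustrative and are not part of the TEI guidelines; the entity names used (uuml, eacute, egrave, szlig) are the standard ISO Latin-1 names.

# A minimal sketch of entity-style encoding for accented characters.
# Only a few illustrative mappings are included; a real project would
# declare its full character repertoire in the document header.
ENTITIES = {
    "ü": "&uuml;",
    "é": "&eacute;",
    "è": "&egrave;",
    "ß": "&szlig;",
}

def encode_entities(text):
    # Replace each accented character by its entity reference and leave
    # plain (ISO-646) characters untouched.
    return "".join(ENTITIES.get(ch, ch) for ch in text)

print(encode_entities("Hélène, Frühling"))
# -> H&eacute;l&egrave;ne, Fr&uuml;hling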
As to the problem of non-Roman alphabets, a number of solutions have
been adopted by text encoders. There is presently a movement towards the use
of Unicode, a form of encoding that can represent non-Roman and other
characters as they are. However, until Unicode becomes firmly and widely
established, other approaches are required. One of the most common
approaches, now quite frequently used for machine-readable Greek texts such
as the Fribergs’ morphologically analysed Greek New Testament, is to repre-
sent the letters of the non-Roman alphabet by an artificial alphabet made up
of only the basic ISO-646 (‘English’) characters which are available to nearly
all computers. In the Friberg Greek text, for instance, ISO-646 letters which
have direct correspondences to Greek letters are used to represent those letters
so that, for example, A = alpha; ISO-646 letters which do not exist in Greek
(e.g. W ) are used to represent those Greek characters which do not have
direct equivalents in the computer character set (e.g. W is used for omega, as
it closely resembles a lower-case omega). An alternative approach to this,
which has been adopted by some projects, is actually to use the non-Roman
character set itself. The TEI guidelines, as noted above, do allow for the use of
a number of different character sets, but such an approach makes text
exchange and analysis less easy: special graphics capabilities are required to
display the characters on screen and the encoding may pose problems for
some of the commonly available text analysis programs (e.g. concordancers)
which are geared solely towards Roman alphabets.
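The transliteration approach itself is simple enough to sketch in a few lines of Python. Of the mappings below, only A for alpha and W for omega are taken from the description of the Friberg text above; the remaining entries are merely plausible stand-ins and do not reproduce the actual scheme.

# Illustrative only: represent a few Greek letters with ISO-646 letters.
GREEK_TO_ASCII = {
    "α": "A",   # alpha, a direct correspondence (as in the Friberg text)
    "ω": "W",   # omega, chosen for its resemblance to the lower-case letter
    "β": "B",   # hypothetical stand-in
    "γ": "G",   # hypothetical stand-in
}

def transliterate(word):
    # Characters outside the table are passed through unchanged.
    return "".join(GREEK_TO_ASCII.get(ch, ch) for ch in word)

print(transliterate("αβγω"))   # -> ABGW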
In encoding a corpus, a decision must also be made at an early stage as to
whether to represent the language as unaffected by external aspects of typog-
raphy and so on, or to represent as fully as possible the text in all its aspects.
Representing the text in the latter way means that one needs to account for
various aspects of its original form – these include the formal layout (e.g.
where line breaks and page breaks occur), where special fonts are used for
emphasis, where non-textual material such as figures occurs and so on. Again,
in the past these aspects have been handled in various ways – for example, the
LOB corpus used asterisks to represent changes of typography. However, the
TEI has now also suggested standard ways of representing these phenomena,
based upon their concept of start and end tags.
The transcription of spoken data presents special problems for text encod-
ing. In speech there is no explicit punctuation: any attempt at breaking down
spoken language into sentences and phrases is an act of interpretation on the
part of the corpus builder. One basic decision which needs to be made with
spoken data is whether to attempt to transcribe it in the form of orthographic
sentences or whether to use intonation units instead, which are often, though
not always, coterminous with sentences. There also follows from the intona-
tion unit/sentence decision the issue of whether to attempt to further punc-
tuate spoken language or leave it at the sentence level without punctuation.
The London-Lund corpus, which exists solely in a prosodically transcribed
form (see section 2.2.2.3(g) below), was transcribed using intonation units
alone, with no punctuation whatsoever. In contrast, the Lancaster/IBM Spoken
English Corpus used intonation units for its prosodic version, but ortho-
graphic transcriptions of the speech were also produced; where possible, these
transcriptions were made using the actual scripts used by speakers (given that
most of the corpus is made up of radio broadcasts) or from transcriptions
made by the speakers themselves.
Spoken language also entails further encoding issues. One major issue in
spoken discourse involving several speakers is overlapping speech, that is,
where more than one speaker is speaking simultaneously. This could be
ignored and treated as an interruption, that is, giving each speaker a different
turn, but this distorts what is actually occurring in the discourse. A common
way of approaching turn taking and overlaps in written transcriptions is to use
‘staves’, such as the following:
D   did you go to the doctor                 hmmm
P              I went to the clinic
N                                    did they weigh you
Here we have a conversation between three people – D, P and N. The
movement of the speech with time is from left to right.We can see clearly here
who is speaking at any one time and where speech overlaps: for example, P
starts speaking as D is saying the word go and their speech then overlaps. In this
method, when one reaches the end of a line of the page, a new three-person
stave is begun, so one ‘line’ of text is actually made up of three parallel lines,
one representing each speaker at each point in the discourse. But staves are
almost impossible to manipulate computationally, so other techniques need to
be adopted in machine-readable text encoding. Typically, these involve
enclosing the overlapping speech in some kind of marker to indicate where
the overlap occurs: the London-Lund corpus used balanced pairs of asterisks
(** … **) for this, whereas the TEI uses its normal tag-based markup to indi-
cate the extent of overlapping sections (see above).
In transcribing speech one also needs to make a decision whether to
attempt to indicate phenomena such as false starts, hesitations, associated ‘body
language’ (such as the direction of eye contact between speakers), and non-
linguistic material such as laughs, coughs and so on. Corpora vary consider-
ably in the extent to which these features are included, but, where included,
they are also typically included as comments, within some form of delimiting
brackets.
Figure 2.4 Example of part-of-speech tagging from LOB corpus (C1 tagset)
Figure 2.5 Example of part-of-speech tagging from Spoken English Corpus (C7 tagset)
Figure 2.6 Examples of part-of-speech tagging from the British National Corpus (C5 tagset in TEI-
conformant layout)
Figure 2.7 Example of part-of-speech tagging of Spanish, from the CRATER corpus
A two-word conjunction such as so that, for example, is tagged as a single unit,
with both words receiving the tag for subordinating conjunction, followed by
(a) a number indicating the number of words in the phraseological unit (in
this case 2); and (b) a number indicating the position of word x in that unit.
Hence so is the first part of a two-part unit (CS21) and that is the second part
of a two-part unit (CS22).
One of the main design specifications for linguistic annotation schemes, as
mentioned above, is recoverability, that is, the possibility for the user to be able
to recover the basic original text from any text which has been annotated with
further information. This means that the linguistic content of texts themselves
should not be manipulated or, if it is, it should be made explicit how this was
done so that the original text can be reconstructed. One of the problems for
recoverability in the context of part-of-speech annotation is that of contracted
forms such as don’t. Clearly, these are made up of two separate part-of-speech
units: a verbal part (do) and a negative particle (not). Although it is possible to
expand out such forms to, for example, do not, this makes recoverability impossi-
ble. One alternative possibility is to assign two parts of speech to the one
orthographic unit, so that don’t might be tagged:
don’t_VD0+XX
This is, however, far from ideal as it is the basic aim of part-of-speech annota-
tion that there should be only one part of speech per unit. What the CLAWS
system did, therefore, was to expand out the forms for tagging purposes but
also to mark clearly where this had been done using angled brackets to mark
the interdependency of the two forms. In this way, each unit could receive a
single part-of-speech tag, but it would also be clear where the text had been
expanded so that the original contracted form could be recovered, for example:
0000117 040 I 03 PPIS1
0000117 050 do > 03 VD0
0000117 051 n’t < 03 XX
0000117 060 think 99 VVI
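The same idea can be sketched in a few lines of Python. The bracket markers mirror those in the example above, but the function and its small table of contracted forms are illustrative only and are not part of the CLAWS system.

# Expand a contracted form into two tagging units while recording the
# '>' / '<' markers that tie the parts together, so that the original
# orthographic form remains recoverable.
CONTRACTIONS = {
    "don't": ("do", "n't"),
    "can't": ("ca", "n't"),     # illustrative entry
    "it's":  ("it", "'s"),      # illustrative entry
}

def expand(token):
    if token in CONTRACTIONS:
        first, second = CONTRACTIONS[token]
        return [(first, ">"), (second, "<")]
    return [(token, "")]

units = []
for word in ["I", "don't", "think"]:
    units.extend(expand(word))
# units -> [('I', ''), ('do', '>'), ("n't", '<'), ('think', '')]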
One design consideration in annotation has been that of ease of use. This
has come partly to be reflected in the ways that tagsets often strive for
mnemonicity, that is the ability to see at a glance from a tag what it means
rather than having to look it up in a table. So, for example, a singular proper
noun might be tagged NP1 (where the N clearly stands for ‘noun’, the P
stands for ‘proper’ and the 1 indicates singularity) rather than, say, 221, a
number which only makes sense when it is looked up in a tag table. A further
usability precept which has become commonplace is divisibility, the idea that
each element in each tag should mean something in itself and add to the over-
all ‘sense’ of that tag. In the tagsets for the Brown and, subsequently, the LOB
corpora, divisibility was only partly present. Whilst distinctions were made
between, for example, present participles (VBG) and past participles (VBN),
which both belonged to the verb class (VB-), modal verbs had their own tag
of MD which had no visible relation with the other verbal tags, which all
began with V. In later versions of the CLAWS tagset, the concept of divisibility
has been more rigorously applied. In the current tagset (known as C7), all verb
tags begin with V. The second element in the tag indicates what class of verb
they belong to: lexical (V), modal (M) or one of the irregular verbs be (B), do
(D) or have (H). The third element indicates the verb form, for example, N
(past participle) or G (present participle). Similar hierarchies apply to the other
parts of speech such as nouns and adjectives.
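Because the tags are divisible in this way, they can be unpacked mechanically. The following sketch decodes a C7-style verb tag using only the distinctions named above; the function is illustrative, and the 'I = infinitive' entry is an assumption based on tags such as VVI rather than something stated in the text.

# Decode a C7-style verb tag, e.g. 'VVN' or 'VM', element by element.
VERB_CLASS = {"V": "lexical", "M": "modal", "B": "be", "D": "do", "H": "have"}
VERB_FORM = {"N": "past participle", "G": "present participle",
             "I": "infinitive"}     # 'I' is an assumption (cf. the tag VVI)

def decode_verb_tag(tag):
    if not tag.startswith("V"):
        return None                               # not a verb tag
    decoded = {"major class": "verb",
               "verb class": VERB_CLASS.get(tag[1:2], "unknown")}
    if len(tag) > 2:
        decoded["form"] = VERB_FORM.get(tag[2:3], "other")
    return decoded

print(decode_verb_tag("VVN"))   # lexical verb, past participle
print(decode_verb_tag("VM"))    # modal verb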
As commented earlier, there is a conflict in annotation between retaining
fine distinctions for maximal utility to the end user and removing difficult
distinctions to make automatic annotation more accurate. Some annotation
projects have considerably reduced the number of part-of-speech distinctions
which are made in the tagset. Amongst those projects which employ such
reduced tagsets is the Penn Treebank project at the University of Pennsylvania.
Whereas the CLAWS tagset used at Lancaster distinguishes lexical verbs, modal
verbs and the irregular verbs be, do and have, the Penn tagset uses the same base
tag for all verbs – namely VB – and distinguishes only the morphosyntactic
forms of the verb. This reduction is made on the perfectly justifiable grounds
that these distinctions are lexically recoverable and do not need additional
annotation. Whilst, for example, one cannot easily separate out the verbal uses
of boot from the nominal uses, one can separate out the modal verbs from other
verbs, since they form a lexically closed set: the only important thing for anno-
tation is that they are actually identified as verbs rather than nouns. Explicit
annotation makes it a bit quicker to extract all the modals, but this particular
distinction is not as essential as others.
EAGLES has made concrete recommendations on the content of part-of-
speech tagsets for European languages (Leech and Wilson 1994). These recom-
mendations recognise three levels of features:
• obligatory features, which are the most basic distinctions that must be
annotated in any text on which part-of-speech tagging is performed
• recommended features, which are further widely recognized gram-
matical categories that should be annotated if possible
• optional features, which may be of use for specific purposes but which
are not sufficiently important to be obligatory or generally recommended.
The obligatory features that EAGLES recognises are the major parts of speech
Noun, Verb, Adjective, Pronoun/Determiner, Article, Adverb, Adposition,
Conjunction, Numeral, Interjection, Unique, Residual and Punctuation.
(Unique consists of word forms such as the English negative particle not or the
infinitive marker to and Residual covers such things as foreign words, mathe-
matical formulae and so on; Adposition for most languages means preposi-
tion.) To give an example of the recommended features for the obligatory
feature Noun, EAGLES recommends that Number, Gender, Case and Type (i.e.
[Diagram: phrase-structure tree for a sentence such as Claudia sat on a stool, in which the prepositional phrase is governed by the verb phrase]
A form of indented layout similar to this is used by the Penn Treebank project.
Different parsing schemes are employed by different annotators. These
schemes differ both in the number of constituent types which they employ (as
with part-of-speech tagging) and in the way in which constituents are permit-
ted to combine with one another. To give an example of the latter, with
sentences such as Claudia sat on a stool, some annotators have the prepositional
phrase governed by the verb phrase (as shown above) whilst others have it as
a sister of the noun phrase and the verb phrase, governed only by the sentence
(S) node, as in the diagram opposite.
However, despite differences such as these, the majority of parsing schemes
have in common the fact that they are based on a form of context-free phrase
structure grammar. Within this broad framework of context-free phrase structure
grammars, an important distinction which is made is between full parsing
and skeleton parsing. Full parsing on the one hand aims to provide as detailed
as possible an analysis of the sentence structure. Skeleton parsing on the other
hand is, as its name suggests, a less detailed approach which tends to use a less
finely distinguished set of syntactic constituent types and ignores, for example,
the internal structure of certain constituent types. The difference between full
and skeleton parsing will be evident from Figures 2.9 and 2.10.
[Diagram: the alternative analysis, in which the prepositional phrase is a sister of the noun phrase and the verb phrase, governed directly by the sentence (S) node]
The extract below (Figure 2.11) shows a functionally annotated analysis: each running word form, given in angle brackets, is followed on the next line by its base form, morphosyntactic tags and one or more grammatical function tags prefixed with @.
“<to>”
“to” PREP @<NOM @ADVL
“<a>”
“a” <Indef> DET CENTRAL ART SG @DN>
“<nation>”
“nation” N NOM SG @<P
“<of>”
“of” PREP @<NOM-OF
“<23>”
“23” NUM CARD @QN>
“<member>”
“member” N NOM SG @NN>
“<states>”
“state” N NOM PL @<P
“<$.>”
“<*it>”
“it” <*> <NonMod> PRON NOM SG3 SUBJ @SUBJ
“<has>”
“have” <SVO> <SVOC/A> V PRES SG3 VFIN @+FAUXV
“<maintained>”
“maintain” <Vcog> <SVO> <SVOC/A> PCP2 @-FMAINV
“<its>”
“it” PRON GEN SG3 @GN>
“<independence>”
“independence” <-Indef> N NOM SG @OBJ @NN>
“<and>”
“and” CC @CC
“<present>”
“present” <SVO> <P/in> <P/with> V INF @-FMAINV
“present” A ABS @AN>
“<boundaries>”
“boundary” N NOM PL @OBJ
“<intact>”
“intact” A ABS @PCOMPL-O @<NOM
“<since>”
“since” PREP @<NOM @ADVL
“<1815>”
“1815” <1900> NUM CARD @<P
“<$.>”
The elements prefixed with @ are tags indicating the grammatical function of the word. Grammatical
function tags used in Figure 2.11 above are:
@+FMAINV finite main predicator
@-FMAINV non-finite main predicator
@<NOM-OF postmodifying of
@<NOM-FMAINV postmodifying non-finite verb
@<NOM other postmodifier
@<P other complement of preposition
@ADVL adverbial
@AN> premodifying adjective
@CC coordinator
@DN> determiner
@GN> premodifying genitive
@INFMARK> infinitive marker
@NN> premodifying noun
@OBJ object
@PCOMPL-O object complement
@PCOMPL-S subject complement
@QN> premodifying quantifier
@SUBJ subject
Unlike part-of-speech annotation, which is almost always performed auto-
matically by computer (occasionally with human post-editing), parsing is
performed by a greater variety of means. This reflects the fact that at the
present time, with a very small number of exceptions, automatic parsing soft-
ware has a lesser success rate than automated part-of-speech tagging. Parsing
may be carried out fully automatically (again perhaps with human post-
editing); it may be carried out by human analysts aided by specially written
parsing software; or it may be carried out completely by hand. The Lancaster-
Leeds treebank, a fully parsed subset of the LOB corpus, was parsed manually
by Geoffrey Sampson (now at the University of Sussex) and the parses then
entered into the computer. By contrast, the Lancaster Parsed Corpus, also a
subset of the LOB corpus, was parsed by a fully automated parser and then
substantially corrected by hand. In between these two approaches, it is also
common for human analysts to parse corpora with the aid of specially writ-
ten intelligent editors. At Lancaster University, for example, an editor known
as EPICS (Garside 1993a) has been used by analysts to parse several corpora,
including a subset of the BNC. Sometimes two approaches to parsing may be
combined. At the Catholic University of Nijmegen, a hybrid approach was
used to parse the Nijmegen corpus: parts of speech and constituent bound-
aries were identified by human annotators, and then full parse trees were
assigned using an automatic parser. At the present time, human parsing or
correction cannot completely be dispensed with. The disadvantage of manual
parsing, however, in the absence of extremely strict guidelines, is inconsis-
tency. This is particularly true where more than one person is parsing or edit-
ing the corpus, which is often the case on large corpus annotation projects.
But even with guidelines, there also frequently occur ambiguities where at the
level of detailed syntactic relations more than one interpretation is possible.
EAGLES has made some preliminary recommendations on syntactic annotation
(Leech, Barnett and Kahrel 1995), but, unlike the recommendations on part-of-
speech tagging, they do not yet have the official stamp of final recommendations.
Nevertheless, it is worthwhile noting their contents. The syntax working group
set out with a similar approach to that taken with the part-of-speech guidelines,
that is, they planned to arrive at obligatory, recommended and optional features.
However, as we have observed, syntactic annotation can take different forms – for
example, phrase structure grammar, dependency grammar, systemic functional
grammar and so on. Furthermore, the rules which say how elements are permit-
ted to combine in the several European languages are less amenable to ‘universal’
generalisations than are parts of speech. Thus there are no absolute obligatory
features, but it is suggested that the basic bracketing of constituents should be
obligatory if a form of phrase-structure grammar is employed. The remaining two
levels – recommended and optional – also concern themselves (at least for the
present) only with phrase-structure grammars. At the recommended level, it is
suggested that the basic constituent types of Sentence, Clause, Noun Phrase, Verb
Phrase, Adjectival Phrase, Adverbial Phrase and Prepositional Phrase should be
distinguished. Examples of optional annotations include the marking of sentence
types (Question, Imperative etc.), the functional annotation of subjects and
objects and the identification of semantic subtypes of constituents such as adver-
bial phrases.
2.2.2.3(d) Semantics Two broad types of semantic annotation may be identified:
1. the marking of semantic relationships between items in the text, for
example, the agents or patients of particular actions;
2. the marking of semantic features of words in a text, essentially the anno-
tation of word senses in one form or another.
The first type of annotation has scarcely begun to be widely applied to corpora
at the time of writing, although forms of parsing which use dependency or
functional grammars (see above) capture much of its import.
The second type of semantic annotation, by comparison, has quite a long
history. The machine-aided annotation of lexical meanings has previously
spent much of its history within the social scientific research methodology of
content analysis, a technique which, to simplify greatly, uses frequency counts
of groups of related words (e.g. words connected with death) in texts to arrive
at conclusions about the prominence of particular concepts. Despite the fact
that reasonable quality automatic annotation of this type was being carried out
in the late 1960s within the content analysis paradigm, predating much of the
research on the arguably easier task of part-of-speech annotation (see e.g.
Stone et al. 1966), it is only recently that it has become prominent in main-
stream corpus linguistics. Few of these older tagged texts have survived, or at
least they have not been made more widely available, and, at the time of writ-
ing, few semantically analysed corpora exist: to the best of our knowledge, the
only publicly available word-sense-tagged corpus is a subset of the Brown
corpus which is distributed as a component of WORDNET, a piece of software
which provides access to a semantic network of English words (Miller et al.
1990/93).
As with part-of-speech annotation, there is no universal agreement in
semantics about which features of words should be annotated. In the past,
owing to the practical orientation of word sense annotation in content analy-
sis, much of the annotation was not in fact motivated by notions of linguistic
semantics at all, but by social scientific theories of, for instance, social interac-
tion or political behaviour. An exception to this was the work of Sedelow and
Sedelow (e.g. 1969), who made use of Roget’s Thesaurus – in which words are
organised into general semantic categories – for text annotation, but they
represent only a small minority. However, there has been a recent resurgence
of interest in the semantic analysis of corpora in general, as well as from the
standpoint of content analysis. At the University of Amsterdam, for instance,
some preliminary work has been carried out on the enrichment of parsed
corpora with semantic information. This work has focused on the automatic
disambiguation of word senses and the assignment of semantic domains to the
words in the text using the ‘field codes’ from the machine-readable Longman
Dictionary of Contemporary English (Janssen 1990).
Other projects on semantic text analysis are also underway. For example, a
joint project between the Bowling Green State University in Ohio and the
Christian-Albrechts-Universität Kiel, directed by Klaus Schmidt, is carrying
out the semantic analysis of a body of mediaeval German epic poetry (see also
4.3). One of the present authors has carried out a similar analysis of a Latin
biblical text (Wilson 1996). Figure 2.12 shows an English example based on
the latter work and gives an idea of the sorts of semantic categories which
are often used in this type of work. Such projects frequently store their data in
a relational database format, but the illustrative example here is read vertically,
with the text words on the left and the semantic categories, represented by 8-
digit numbers, on the right. The category system exemplified here, which is
based on that used by Schmidt (e.g. 1993), has a hierarchical structure, that is,
it is made up of three top-level categories, each of which is sub-divided into
further categories, which are themselves further subdivided and so on. For
example, it will be seen that the word thorns has a category which begins with
a number 1 (indicating the top-level category of ‘Universe’), purple has a category
which begins with a number 3 (indicating the top-level category of ‘Man and
the World’) and other words have categories beginning with a number 2 (indi-
cating the top-level category of ‘Man’).The finer levels of sub-division can be
exemplified by looking at the word crown: this belongs in the category 211104,
where 2, as already stated, indicates the top-level category of ‘Man’; the first 1
indicates the first sub-division of that category (‘Bodily Being’); the second
two 1s indicate the eleventh subdivision of this category (‘General Human
Needs’); and the 4 in turn marks the fourth sub-division of ‘General Human
Needs’ (‘Headgear’). Words with a lower degree of semantic importance (for
example, pronouns and articles) are placed in a completely separate category
of 00000000.
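The hierarchical structure of such codes can be unpacked automatically. The sketch below follows the digit layout implied by the worked example of crown (one digit for the top level, one for the first sub-division, two for the next, two for the next, with trailing zeros as padding); this layout is inferred from that single example and may not hold for every branch of the real category system.

# Unpack an 8-digit semantic category code into its hierarchical levels.
TOP_LEVEL = {"1": "Universe", "2": "Man", "3": "Man and the World",
             "0": "Low content word"}

def decode_category(code):
    return {
        "top level": TOP_LEVEL.get(code[0], "?"),
        "sub-division 1": code[1],
        "sub-division 2": code[2:4],
        "sub-division 3": code[4:6],
        "sub-division 4": code[6:8],
    }

print(decode_category("21110400"))
# crown: Man > 1 (Bodily Being) > 11 (General Human Needs) > 04 (Headgear)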
And 00000000
the 00000000
soldiers 23241000
platted 21072000
a 00000000
crown 21110400
of 00000000
thorns 13010000
and 00000000
put 21072000
it 00000000
on 00000000
his 00000000
head 21030000
and 00000000
they 00000000
put 21072000
on 00000000
him 00000000
a 00000000
purple 31241100
robe 21110321
Key:
00000000 Low content word
13010000 Plant life in general
21030000 Body and body parts
21072000 Object-oriented physical activity
21110321 Men’s clothing: outer clothing
21110400 Headgear
23241000 War and conflict: general
31241100 Colour
Figure 2.12 Example of semantic text analysis, based upon Wilson (1996)
A039 1 v
(1 [N Local_JJ atheists_NN2 N] 1) [V want_VV0 (2 [N the_AT (9
Charlotte_NP1 9) Police_NN2 Department_NNJ N] 2) [Ti to_TO get_VV0
rid_VVN of_IO [N (3 <REF=2 its_APP$ chaplain_NN1 3) ,_, [N {{3 the_AT
Rev._NNSB1 Dennis_NP1 Whitaker_NP1 3} ,_, 38_MC N]N]Ti]V] ._.
called the ‘bootstrapping problem’.) To help the Lancaster analysts in this task,
Roger Garside developed a special editing program called Xanadu which
enabled the analyst to highlight an anaphor and its antecedent and select the
reference type from a menu (Garside 1993b). The set of possible reference
types was based on Halliday and Hasan’s theory, but excluded what they call
collocational cohesion, that is, the use of words with different senses but within the
same semantic domain. The marking of this latter form of cohesion is implicit
within word sense annotation discussed above. Figure 2.13 shows an extract
from the anaphoric treebank with the categories of cohesion which are
present.
The text in Figure 2.13 has been part-of-speech tagged and skeleton parsed
using the scheme demonstrated in Figure 2.10. Further annotation has been
added to show anaphoric relations. The following are the anaphoric annota-
tion codes used in this example:
(1 1) etc. noun phrase which enters into a relationship with anaphoric
elements in the text
<REF=2 referential anaphor; the number indicates the noun phrase which
it refers to – here it refers to noun phrase number 2, the Charlotte
Police Department
{{3 3} etc. noun phrase entering into an equivalence relationship with a preceding
noun phrase; here the Rev. Dennis Whitaker is identified as being
the same referent as noun phrase number 3, its chaplain.
2.2.2.3(f) Phonetic transcription Corpora of spoken language may, in addi-
tion to being transcribed orthographically, also be transcribed using a form of
phonetic transcription. Although the limitations of phonetic transcription are
well known (speech is a continuous process, hence the same ‘sound’ is pro-
duced slightly differently in different contexts; transcription cannot encode
precise details such as frequency; pronunciation varies between individuals),
such corpora are nevertheless useful, especially to those who lack the tools and
expertise for the laboratory analysis of recorded speech.
For the reasons discussed above under the heading of character sets, phonetic
corpus transcription is, at present, unlikely to use the standard International
Phonetic Alphabet (IPA) characters, but rather will employ methods by which
the content of the IPA system may be represented using either the basic
characters available on the computer or other conventions.
One increasingly common scheme for representing the IPA is SAMPA (Speech
Assessment Methods Phonetic Alphabet), developed in the EU-funded SAM
project. SAMPA uses basic keyboard characters (both upper and lower case) to
represent IPA characters. For example, @ is used to represent the schwa
character (ə), V for the ʌ character and Q for the ɒ character; more transparently, i
represents i and j represents j. An extended version of SAMPA (X-SAMPA) has also
been proposed to take account of the diacritics needed for more detailed
phonetic transcription: for example, a fronted /k/ (k+) would be represented as
k_+. SAMPA guidelines have been developed for the sound systems of a number
of different languages (including all EU languages) and SAMPA has gained
considerable currency among the community working with speech corpora.
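In practice such a conversion is little more than a table lookup. The tiny Python sketch below includes only the correspondences mentioned above and is in no sense a full SAMPA table.

# A tiny illustrative subset of an IPA-to-SAMPA mapping.
IPA_TO_SAMPA = {
    "ə": "@",   # schwa
    "ɒ": "Q",
    "i": "i",
    "j": "j",
}

def to_sampa(ipa_string):
    # Characters not in the table are passed through unchanged.
    return "".join(IPA_TO_SAMPA.get(ch, ch) for ch in ipa_string)

print(to_sampa("ɒ"))   # -> Q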
Phonetic transcription can be either narrow (where an attempt is made to
represent the precise details of how a sound was pronounced in a given
instance) or broad (where only the phoneme – or functional sound family – is
represented). Narrow transcription can at present only be done by hand and
moreover by skilled transcribers. This perhaps explains the paucity of publicly
available phonetically transcribed corpora. However, broad transcriptions can
be made relatively automatically from orthographic transcriptions, by means of
pattern matching on a pronunciation dictionary. This was the method adopted
for producing the phonetic transcription in the MARSEC version of the
Lancaster/IBM Spoken English Corpus (Knowles 1994). The output from such
pattern matching still needs to be checked manually, for example, for homo-
graphs which differ in pronunciation, but this approach considerably speeds up
the production of phonetically transcribed corpora.
The MARSEC corpus is probably the only phonetically transcribed corpus that
is widely available and well known to corpus linguists. However, it should be
noted that there are many more speech databases in existence, some of which
contain phonetic transcriptions and other annotations, as well as the speech
waveforms themselves. Examples are the set of speech databases for Central
and East European languages being collected by the EU’s BABEL project (Roach et
al. 1996). Speech databases are often built by speech technologists, rather than
corpus linguists, and do not always contain naturally occurring language,
hence our use of the term ‘database’ rather than ‘corpus’.
2.2.2.3(g) Prosody Although phonetic transcription has not been employed
until very recently on a large scale in corpus annotation, prosodic annotation
has a much longer history. The spoken parts of Randolph Quirk’s Survey of
English Usage, collected in the early 1960s, were prosodically annotated and
later computer encoded as the London-Lund corpus. More recently, a major
component of the analysis of the Lancaster/IBM Spoken English Corpus was
the production of a prosodically annotated version of the corpus.
Prosodic annotation aims to capture in a written form the suprasegmental
features of spoken language – primarily stress, intonation and pauses. The most
commonly encountered schemes for prosodic markup are those based upon the
American ToBI system (Beckman and Ayers Elam 1997) and those based upon
the British Tonetic Stress Marker (TSM) tradition exemplified by O'Connor and
Arnold (1961). ToBI has been gaining ground among speech researchers and
adaptations of it have been made for a number of different languages and dialects.
Some have questioned its ability to represent the richness and diversity of
prosody, since it does not, for example, distinguish pitch range in the way that the
TSM method does (Nolan and Grabe 1997); however, attempts have been made to
map between the two schemes and these have not been wholly unsuccessful,
although the schemes are not wholly compatible either (e.g. Roach 1994).
Certain other annotation schemes have also been proposed, including a
partner to the SAMPA transcription scheme mentioned in the previous section
(SAMPROSA). For a longer discussion of these schemes, see Gibbon et al. (1997).
Figure 2.14 shows an example of prosodic annotation from the London-Lund
corpus, which follows the British tradition. The codes in Figure 2.14 are:
# end of tone group
^ onset
/ rising nuclear tone
\ falling nuclear tone
/\ rise-fall nuclear tone
_ level nuclear tone
[] enclose partial words and phonetic symbols
' normal stress
! booster: higher pitch than preceding prominent syllable
= booster: continuance
(()) unclear, with estimated length or attempted transcription
** simultaneous speech
- pause of one stress unit
Rather than attempting to indicate the intonation of every syllable in the
corpus, it is more usual for only those which are most prominent to be anno-
tated and the intonation of others left to be inferred from the direction of the
intonation on neighbouring prominent syllables.
It may or may not be the case that the annotator of a spoken corpus has access
to the intonation (or F0) curve. This partially determines whether the prosodic
annotation will be auditory and impressionistic or whether it will be based on
physical acoustic measurements. In the former case, there are a number of diffi-
culties. These revolve around the impressionistic nature of the judgements
which are made and the specialist ear training required for making those judge-
ments. Prosodic transcription is a task which requires the manual involvement
of highly skilled phoneticians: unlike part-of-speech analysis, it is not a task
which can be delegated to the computer and, unlike the construction of an
direct Polish equivalents of English words which would not be used in similar
contexts by Polish speakers – the connotations are inappropriate:
English: People who are overweight experience difficulties
Polish translation: Osoby z nadwaga doswiadczaja trudnosci z
Polish: Osoby otyle odczuwaja
Overweight is translated as with overweight rather than as otyle (overweight). Also,
the direct Polish translation of experience, doswiadczyc, is more to do with exter-
nal experiences rather than internal/bodily ones, for which odczuc is more
typical. Further to this, some unusual lexical preferences occur, such as:
English: Currently
Polish translation: Wspolczesnie
Polish: Obecnie
English: Every/each year
Polish translation: Kazdego roku
Polish: Co roku
English: Contains information
Polish translation: Zamieszcza informacje
Polish: Zawiera informacje
The evident flavour of a translation is present here. Unusual syntactic choices
are evident in the use of finite clauses as opposed to the participial clauses of
natural Polish, and in the use of prepositional constructions instead of inflectional
ones. There are also too many analytical constructions in general in the trans-
lations provided.
Another point worthy of mention is that Subject-Verb-Object and
Subject-Verb ordering are more common than one would expect in the
Polish, giving evidence again for a noticeable translation effect. The effect of
this to a Polish reader is that new information is rendered via a preverbal
subject much more often than would be the case in natural Polish, where such
a subject is much more likely to be postverbal, as noted by Siewierska (1993),
who documents that, whereas only 24 per cent of the subjects in SVO clauses
convey new information, 79 per cent of clause-final subjects are new.
So the reason why people may wish only to compare and exploit L1 texts
is clear – incorporating L2 texts within any multilingual corpus-based system
would be to permit the possibility of incorporating inaccurate and/or unrep-
resentative data. Yet parallel corpora continue to be the subject of research and
construction. Of importance to understanding why this is the case is the
sustainability of projects aimed at the construction of parallel and translation
corpora. While translation corpora are highly attractive because of the ‘natu-
ralness’ of the data they contain (all material within such a corpus being L1
material), populating such a corpus can be difficult – Johansson et al. (1993)
and your tutor is not able to provide samples, you will find some part-
of-speech tagsets in the appendices to Garside et al. 1987.)
3. Ready-annotated corpora are now widely available. How far does the
use of such corpora constrain the end user and to what extent is such
constraint a serious problem? Are there any aspects of language for
which you consider pre-encoded annotation to be unsuitable? Explain
your answer.
NOTES
1. Here and elsewhere, (BNC) = example taken from British National Corpus.
2. However,‘intelligent’ text retrieval programs can replace the TEI markup with the original graphic
character.
3. Sometimes the lexeme itself is also referred to as the lemma.
4. For the latter, see e.g. Souter (1990).
5. See http://www.linguistics.rdg.ac.uk/research/speechlab/babel.
6. The Minority Language Engineering Project: see http://www.ling.lancs.ac.uk/monkey/ihe/mille/
public/title.htm.
7. Also referred to as comparable corpora.
8. Our thanks to Professor Anna Siewierska for her help with this study.
9. This problem is exactly the same as Shastri et al. (1986: xii–xiii) reported in trying to build an
Indian English equivalent of the Brown and LOB corpora, the so-called Kolhapur Corpus.
3
Quantitative data
3.1 INTRODUCTION
The notion of a form of linguistics based firmly upon empirical data is present
throughout this book. But, as the warnings of scholars such as Chomsky
(which we reviewed in Chapter 1) suggest, empirical data can sometimes be
deceptive if not used with appropriate caution. Chomsky’s argument against
corpora was based upon the observation that when one derives a sample of a
language variety it will be skewed: chance will operate so that rare construc-
tions may occur more frequently than in the variety as a whole and some
common constructions may occur less frequently than in the variety as a
whole. It was suggested briefly in Chapter 2 that, whilst these criticisms are
serious and valid ones, the effects which they describe can at least partially be
countered through better methods of achieving representativeness. This is one
of the themes of this chapter. But there is also another issue implicit in
Chomsky’s criticism and that is the issue of frequency. Chomsky’s concentra-
tion on rarity and commonness appears to assume that corpus linguistics is a
quantitative approach. This is at least partly true: a corpus, considered to be a
maximally representative finite sample, enables results to be quantified and
compared to other results in the same way as any other scientific investigation
which is based on a data sample. The corpus thus stands in contrast to other
empirical data sets which have not been sampled to be maximally representa-
tive and from which broader conclusions cannot therefore be extrapolated.
But it is not essential that corpus data be used solely for quantitative research
and, in fact, many researchers have used it as a source of qualitative data. Before
moving on to consider the various quantitative issues which arise in corpus
linguistics, therefore, we shall look in the next section at the relationship
between quantitative and qualitative approaches to corpus analysis.
it is true with any kind of sample that rare elements may occur in higher
proportions and frequent elements in lesser proportions than in the population
as a whole – and this criticism applies not only to linguistic corpora but to any
form of scientific investigation which is based on sampling rather than on the
exhaustive analysis of an entire and finite population: in other words, it applies
to a very large proportion of the scientific and social scientific research which
is carried out today. However, the effects of Chomsky’s criticism are not quite
so drastic as it appears at first glance, since there are many safeguards which
may be applied in sampling for maximal representativeness.
The reader will recall from Chapter 1 that, at the time when Chomsky first
made his criticism in the 1950s, most corpora were very small entities.This was
due as much to necessity as to choice: the development of text analysis by
computer had still to progress considerably and thus corpora had still largely to
be analysed by hand. Hence these corpora had to be of a manageable size for
manual analysis. Although size – short of including the whole target popula-
tion – is not a guarantee of representativeness, it does enter significantly into
the factors and calculations which need to be considered in producing a maxi-
mally representative corpus. Small corpora tend only to be representative for
certain high frequency linguistic features, and thus Chomsky’s criticism was at
least partly true of these early corpora. But since today we have powerful
computers which can readily store, search and manipulate many millions of
words, the issue of size is no longer such a problem and we can attempt to
make much more representative corpora than Chomsky could dream of when
he first criticised corpus-based linguistics.
In discussing the ways of achieving the maximal degree of representativeness,
it should first be emphasised once again that in producing a corpus we are
dealing with a sample of a much larger population. Random sampling tech-
niques in themselves are standard to many areas of science and social science,
and these same techniques are also used in corpus building. But there are
particular additional caveats which the corpus builder must be aware of.
Biber (1993b), in a detailed survey of this issue, emphasises as the first step
in corpus sampling the need to define as clearly as possible the limits of the
population which we are aiming to study before we can proceed to define
sampling procedures for it. This means that we should not start off by saying
vaguely that we are interested in, for instance, the written German of 1993, but
that we must actually rigorously define what the boundaries of ‘the written
German of 1993’ are for our present purpose, that is, what our sampling
frame – the entire population of texts from which we will take our samples –
is. Two approaches have been taken to this question in the building of corpora
of written language.The first approach is to use a comprehensive bibliograph-
ical index. So, for ‘the written German of 1993’, we might define our sampling
frame as being the entire contents of an index of published works in German
for that year, for example, the Deutsche National-Bibliographie. This is the
Empirical Linguistics series – notably Language and Computers and Statistics for
Corpus Linguistics – pick up again on these methods and present them in much
more detail than we have space to do here. Our recommendation is that the
student reads what we have to say here as a brief introduction to how the tech-
niques may be used and then progresses to the more detailed treatments in the
other texts for further explanation. In what follows, precise references are
given to the more detailed treatments in these two volumes.
type in another text, may actually indicate a smaller proportion of the type in
that text than the smaller count indicates for the other text. For instance,
assume that we have a corpus of spoken English with a size of 50,000 words
and a corpus of written English with a size of 500,000 words. We may find
that, in the corpus of spoken English, the word boot occurs 50 times whereas,
in the corpus of written English, it occurs 500 times. So it looks at first glance
as if boot is more frequent in written than in spoken English. But let us now
look at these data in a different way. This time we shall go one step further
beyond the simple arithmetical frequency and calculate the frequency of
occurrence of the type boot as a percentage of the total number of tokens in the
corpus, that is, the total size of the corpus. So we do the following calculations:
spoken English: 50 / 50,000 x 100 = 0.1%
written English: 500 / 500,000 x 100 = 0.1%
Looking at these figures, we see that, far from being 10 times more frequent in
written English than in spoken English, boot has the same frequency of occur-
rence in both varieties: it makes up 0.1 per cent of the total number of tokens
in each sample. It should be noted, therefore, that if the sample sizes on which
a count is based are different, then simple arithmetical frequency counts
cannot be compared directly with one another: it is necessary in those cases to
normalise the data using some indicator of proportion. Even where disparity
of size is not an issue, proportional statistics are a better approach to presenting
frequencies, since most people find it easier to understand and compare figures
such as percentages than fractions of unusual numbers such as 53,000.
There are several ways of indicating proportion, but they all boil down to a
ratio between the size of the sample and the number of occurrences of the
type under investigation.The most basic involves simply calculating the ratio:
ratio = number of occurrences of the type / number of tokens in entire sample
The result of this calculation may be expressed as a fraction or, more com-
monly, as a decimal. Usually, however, when working with large samples such
as corpora and potentially many classifications, this calculation gives unwieldy
looking small numbers. For example, the calculation we performed above
would give a simple ratio of 0.0001. Normally, therefore, the ratio is scaled up
to a larger, more manageable number by multiplying the result of the above
equation by a constant. This is what we did with the example: in that case the
constant was 100 and the result was therefore a percentage. Percentages are
perhaps the most common way of representing proportions in empirical
linguistics but, with a large number of classifications, or with a set of classifica-
tions in which the first few make up something like half the entire sample, the
numbers can still look awkward, with few being greater than 1. It may some-
times be sensible, therefore, to multiply the ratio formula by a larger constant,
for example 1,000 (giving a proportion per mille (‰)) or 1,000,000 (giving a
frequency per million words).
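To make the arithmetic concrete, the following short Python sketch performs this normalisation for the boot example used above (the function name is illustrative only):

# Normalise a raw frequency count to a proportion of the sample, scaled by a
# constant: 100 gives a percentage, 1,000,000 a frequency per million words.
def normalised_frequency(count, sample_size, per=1_000_000):
    return count / sample_size * per

print(normalised_frequency(50, 50_000, per=100))     # spoken English:  0.1 (%)
print(normalised_frequency(500, 500_000, per=100))   # written English: 0.1 (%)
print(normalised_frequency(50, 50_000))              # 1000.0 per million words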
Consider, for example, the frequencies of the present tense form dicit ('says')
and the past tense form dixit ('said') in the gospels of Matthew and John:
dicit dixit
Matthew 46 107
John 118 119
From these figures, it looks as if John uses the present tense form (dicit) propor-
tionally more often than Matthew does. But with what degree of certainty can
we infer that this is a genuine finding about the two texts rather than a result
of chance? From these figures alone we cannot decide: we need to perform a
further calculation – a test of statistical significance – to determine how
high or low the probability is that the difference between the two texts on
these features is due to chance.
There are several significance tests available to the corpus linguist – the chi-
squared test, the [Student’s] t-test, Wilcoxon’s rank sum test and so on – and we
will not try to cover each one here. As an example of the role of such tests we
will concentrate on just one test – the chi-squared test (see Oakes 1998:
24–9). The chi-squared test is probably the most commonly used significance
test in corpus linguistics and also has the advantages that (1) it is more sensitive
than, for example, the t-test; (2) it does not assume that the data are ‘normally
distributed’ – this is often not true of linguistic data; and (3) in 2 x 2 tables such
as the one above – a common calculation in linguistics – it is very easy to calcu-
late, even without a computer statistics package (see Swinscow 1983). Note,
however, in comparison to Swinscow, that Oakes (1998: 25) recommends the
use of Yates’s correction with 2 x 2 tables. The main disadvantage of chi-square
is that it is unreliable with very small frequencies. It should also be noted that
proportional data (percentages etc.) cannot be used with the chi-squared test:
disparities in corpus size are unimportant, since the chi-squared test itself
compares the figures in the table proportionally.
Very simply, the chi-squared test compares the difference between the actual
frequencies which have been observed in the corpus (the observed frequencies)
and those which one would expect if no factor other than chance had been
operating to affect the frequencies (the expected frequencies). The closer the
expected frequencies are to the observed frequencies, the more likely it is that
the observed frequencies are a result of chance. On the other hand, the greater
the difference between the observed frequencies and the expected frequencies,
the more likely it is that the observed frequencies are being influenced by
something other than chance, for instance, a true difference in the grammars of
two language varieties.
Let us for the present purpose omit the technicality of calculating the chi-
square value and assume that it has already been calculated for our data. Having
done this, it is then necessary (if not using a computer program which gives
the information automatically) to look in a set of statistical tables to see how
significant our chi-square value is. To do this one first requires one further
value – the number of degrees of freedom (usually written d.f.). This is very
simple to work out. It is simply:
(number of columns in the frequency table – 1) x (number of rows in the frequency table – 1)
We now look in the table of chi-square values in the row for the relevant
number of degrees of freedom until we find the nearest chi-square value to the
one which has been calculated, then we read off the probability value for that
column. A probability value close to 0 means that the difference is very
strongly significant, that is, it is very unlikely to be due to chance; a value close
to 1 means that it is almost certainly due to chance. Although the interval
between 1 and 0 is a continuum, in practice it is normal to assign a cut-off
point which is taken to be the difference between a ‘significant’ result and an
‘insignificant’ result. In linguistics (and most other fields) this is normally taken
to be a probability value of 0.05: probability values of less than 0.05 (written as p
< 0.05) are assumed to be significant, whereas those greater than 0.05 are not.
Let us then return to our example and find out whether the difference
which we found is statistically significant. If we calculate the chi-square value
for this table (using Yates’s correction) we find that it is 14.04. We have two
columns and two rows in the original frequency table, so the number of
degrees of freedom in this case is (2 – 1) x (2 – 1) = 1 d.f. For 1 d.f. we find
that the probability value for this chi-square value is 0.0002. Thus the differ-
ence which we found between Matthew and John is significant at p < 0.05,
and we can therefore say with quite a high degree of certainty that this differ-
ence is a true reflection of variation in the two texts and is not due to chance.
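Readers who wish to replicate this calculation can do so in a few lines of Python, for instance with the chi2_contingency function from the scipy library; the frequencies are those of the dicit/dixit table above.

from scipy.stats import chi2_contingency

# Observed frequencies: rows are Matthew and John, columns are dicit and dixit.
observed = [[46, 107],
            [118, 119]]

# With the default correction=True, Yates's correction is applied to 2 x 2 tables.
chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2))   # approximately 14.04
print(dof)              # 1 degree of freedom
print(p)                # approximately 0.0002, i.e. significant at p < 0.05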
As an alternative to the chi-squared test, Dunning (1993) proposes the use
of the log-likelihood test (also G^2 or λ), which, he argues, is more
reliable with low frequencies and with samples that have a comparatively large
discrepancy in size.
Chapter 5.) Kjellmer (1991), for instance, has argued that our mental lexicon
is made up not only of single words but also of larger phraseological units, both
fixed and more variable. The identification of patterns of word co-occurrence
in textual data is particularly important in dictionary writing, since, in addition
to identifying Kjellmer’s phraseological units, the company which individual
words keep often helps to define their senses and use (cf. the basis of proba-
bilistic tagging, which is discussed in section 5.3).This information is in turn
important both for natural language processing and for language teaching. But
in connected discourse every single word occurs in the company of other words.
How, therefore, is it possible to identify which co-occurrences are significant
collocations, especially if one is not a native speaker of a language or language
variety so that introspection is not an available option?
Given a text corpus, it is possible to determine empirically which pairs of
words have a substantial amount of ‘glue’ between them and which are, hence,
likely to constitute significant collocations in that variety rather than chance
pairings. The two formulae which are most commonly used to calculate this
relationship are mutual information (Oakes 1998: 63–5, 89–90; Barnbrook
1996: 98–100) and the Z-score (Oakes 1998: 7–8, 163–6; Barnbrook 1996:
95–7).
Mutual information is a formula borrowed from the area of theoretical
computer science known as information theory. The mutual information score
between any given pair of words – or indeed any pair of other items such as,
for example, part-of-speech categories – compares the probability that the two
items occur together as a joint event (i.e. because they belong together) with
the probability that they occur individually and that their co-occurrences are
simply a result of chance. For example, the words riding and boots may occur as
a joint event by reason of their belonging to the same multiword unit (riding
boots) whereas the words formula and borrowed in the sentence above simply
occur together in a relatively one-off juxtaposition: they do not have any
special relationship to each other. The more strongly connected two items are,
the higher will be the mutual information score. On the other hand, if the two
items have a very low level of co-occurrence, that is, they occur more often in
isolation than together, then the mutual information score will be a negative
number. And if the co-occurrence of item 1 and item 2 is largely due to
chance, then the mutual information score will be close to zero. In other
words, pairs of items with high positive mutual information scores are
more likely to constitute characteristic collocations than pairs with much
lower mutual information scores.2
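A bare-bones version of the calculation, restricted for simplicity to immediately adjacent word pairs (real implementations normally use a wider context window and a minimum frequency threshold), might be sketched in Python as follows:

import math
from collections import Counter

def mutual_information(tokens, word1, word2):
    # Compare the observed probability of the two words occurring as an
    # adjacent pair with the probability expected if they were independent.
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_joint = bigrams[(word1, word2)] / (n - 1)
    p_independent = (unigrams[word1] / n) * (unigrams[word2] / n)
    if p_joint == 0:
        return float("-inf")        # the pair never co-occurs
    return math.log2(p_joint / p_independent)

tokens = "he bought new riding boots and old riding boots".split()
print(mutual_information(tokens, "riding", "boots"))   # clearly positive
print(mutual_information(tokens, "he", "riding"))      # never adjacent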
The Z-score provides similar data to mutual information. For any given
word (or other item) in a text, this test compares the actual frequency of all
other words occurring within a specified size of context window (for exam-
ple three words either side of the item) with their expected frequency of
occurrence within that window if only chance were affecting the distribution.
The higher the Z-score is for a given word or item in connection with the
node word (i.e. the word whose associations we are looking at), the greater is its
degree of collocability with that word. The Z-score is on the whole used
rather less frequently than mutual information in corpus linguistics, but it is
important to mention it since the TACT concordance package – one of the
very few widely available packages which include a collocation significance
function – does in fact use the Z-score rather than mutual information.
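One simple formulation of the calculation, broadly along the lines just described, is sketched below; the window handling and the variance term vary between implementations, so this should be read as an illustration rather than as the formula used by any particular package.

import math
from collections import Counter

def z_score(tokens, node, collocate, span=3):
    # Count how often `collocate` appears within `span` words either side of
    # each occurrence of `node`, then compare with the count expected if the
    # collocate were spread evenly through the rest of the text.
    n = len(tokens)
    freq = Counter(tokens)
    observed = 0
    window_size = 0
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            observed += window.count(collocate)
            window_size += len(window)
    p = freq[collocate] / (n - freq[node])    # chance rate of the collocate
    expected = p * window_size
    if expected == 0:
        return 0.0
    return (observed - expected) / math.sqrt(expected * (1 - p))

With a corpus of any realistic size, words that cluster around the node word more often than chance predicts will receive high positive scores.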
As suggested above, techniques such as mutual information and the Z-score
are of particular use in lexicography. One of their uses is to extract what are
known as multiword units from corpora, which include not only traditional
idiomatic word groups such as cock and bull but also, for example, multiword
noun phrases such as temporal mandibular joint. The extraction of this latter kind
of terminology is useful not only in traditional lexicography but also particu-
larly in specialist technical translation where a detailed knowledge of the
terminology of a field, both at the level of individual words and at the level of
multiword units, is an important step towards establishing an exhaustive data-
base of translation equivalents.
A second use of mutual information and the Z-score is as aids to sense
discrimination in corpus data. In this case, instead of trying to extract specific
multiword units, we are interested in the more general patterns of collocation
for particular words. If we take the most significant collocates for a word, it is
possible either (1) to group similar collocates together to help in semi-auto-
matically identifying different senses of the word (for example, bank might
collocate with geographical words such as river (indicating the landscape sense
of bank) and with financial words such as investment (indicating the financial
type of bank)); or (2) to compare the significant collocates of one word with
those of another with the aim of discriminating differences in usage between
two rather similar words. As an example of the latter we may consider one of
the experiments carried out by Church et al. (1991), in which they looked at
the different collocations of strong and powerful in a 44.3-million-word corpus
of press reports. Although these words may be seen to have very similar mean-
ings, the mutual information scores for their associations with other words in
the corpus in fact revealed interesting differences. Strong, for example, collo-
cated particularly with words such as northerly, showings, believer, currents,
supporter and odor, whereas powerful had significant collocations with words
such as tool, minority, neighbor, symbol, figure, weapon and post. Although these
collocates do not form very generalisable semantic groups, such information
about the delicate differences in collocation between the two words has a
potentially important role, for example, in helping students of English as a
foreign language to refine their vocabulary usage.
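To make this second use concrete, the sketch below compares the collocates of two near-synonyms in the manner of the strong/powerful study. The association scores are invented for illustration; in practice they would be mutual information or Z-scores computed from a corpus as described above.

def distinctive_collocates(scores_a, scores_b, top_n=5):
    # Given dictionaries mapping collocates to association scores for two
    # near-synonyms, return the collocates that attach much more strongly
    # to one word than to the other.
    only_a = {w: s for w, s in scores_a.items() if s > scores_b.get(w, 0.0)}
    only_b = {w: s for w, s in scores_b.items() if s > scores_a.get(w, 0.0)}
    rank = lambda d: sorted(d, key=d.get, reverse=True)[:top_n]
    return rank(only_a), rank(only_b)

# Hypothetical scores loosely modelled on Church et al.'s findings.
strong_scores = {"northerly": 8.2, "believer": 7.4, "supporter": 6.9, "tool": 2.1}
powerful_scores = {"tool": 7.8, "weapon": 7.1, "symbol": 6.5, "believer": 1.9}
print(distinctive_collocates(strong_scores, powerful_scores))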
One further application of mutual information should also be noted.
Mutual information can be used not only to study associations within a corpus
but also to help define associations between two parallel aligned corpora.
Corpus Ling/ch3 22/2/01 5:04 pm Page 88
88 Q UA N T I TAT I V E DATA
Assuming that a bilingual parallel corpus has been aligned at the level of the
sentence, so that, for any given sentence in one language, we know which
sentence is its translation in the other language, we may then want to know
which words in those sentences are translations of each other. If we were to take
the two sentences, we could make a list of all the possible pairs of words which
could be translations of each other. So with the two sentences
Die Studentin bestand ihre Prüfung.
The student passed her exam.
die could potentially be translated as the, student, passed, her or exam; and simi-
larly Studentin could potentially be translated as the, student, passed, her or exam.
Throughout the corpus, some words will be paired together more often than
other words, but patterns of frequency distribution may result in a word’s being
paired less frequently with its correct translation than with some other word.
However, if mutual information is used instead of pure frequencies to guide
this process of pairing, then it is possible to discover which word pairs are the
most statistically significant rather than simply the most frequent and, hence,
to approximate more closely to the correct pairing of translation equivalents.
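The sketch below illustrates the shape of such a computation: every source word is paired with every target word of the aligned sentence, the pairs are counted across the whole corpus, and mutual information is then used to separate genuine translation pairs from chance pairings. The two sentence pairs shown are of course far too little data to produce meaningful scores; they merely show the form of the input.

import math
from collections import Counter
from itertools import product

def translation_pair_scores(aligned_pairs):
    # aligned_pairs is a list of (source_tokens, target_tokens) tuples, one
    # per aligned sentence pair. Each candidate word pair is scored with
    # mutual information over sentence-level co-occurrence.
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    n = len(aligned_pairs)
    for src, tgt in aligned_pairs:
        for w in set(src):
            src_freq[w] += 1
        for w in set(tgt):
            tgt_freq[w] += 1
        for s, t in product(set(src), set(tgt)):
            pair_freq[(s, t)] += 1
    return {
        (s, t): math.log2((f / n) / ((src_freq[s] / n) * (tgt_freq[t] / n)))
        for (s, t), f in pair_freq.items()
    }

corpus = [
    (["die", "Studentin", "bestand", "ihre", "Prüfung"],
     ["the", "student", "passed", "her", "exam"]),
    (["die", "Prüfung", "war", "schwer"],
     ["the", "exam", "was", "hard"]),
]
scores = translation_pair_scores(corpus)
# On a corpus of realistic size, pairs such as (Prüfung, exam) receive much
# higher scores than chance pairings such as (Prüfung, her).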
which the techniques work. All the techniques start off with a traditional basic
cross-tabulation of the variables and samples. Table 3.1 shows an example of
a hypothetical cross-tabulation of the frequencies of different modal verbs (the
variables) across fifteen different genres (the samples) within the Kolhapur
corpus of Indian English.
For factor analysis (Oakes 1998: 105–8), an intercorrelation matrix is
then calculated from the cross-tabulation, showing how statistically similar all
pairs of variables in the table are in their distributions across the various
samples. Table 3.2 shows the first seven columns of the intercorrelation matrix
calculated from the data in Table 3.1. Here we see that the similarity of vari-
ables with themselves (e.g. can and can) is 1: they are, as we would expect, iden-
tical. But we can also see that some variables show a greater similarity in their
distributions than others: for instance, can shows a greater similarity to may
(0.798) than it does to shall (0.118).
Factor analysis takes intercorrelation matrices such as that shown in Table
3.2 and attempts to ‘summarise’ the similarities between the variables in terms
of a smaller number of reference factors which the technique extracts. The
hypothesis is that the many variables which appear in the original frequency
cross-tabulation are in fact masking a smaller number of variables (the factors)
which can help explain better why the observed frequency differences occur.
Each variable receives a loading on each of the factors which are extracted,
signifying its closeness to that factor. Different variables will have larger or
smaller loadings on each of the hypothesised factors, so that it is possible to see
which variables are most characteristic of a given factor: for example, in analy-
sing a set of word frequencies across several texts, one might find that words in
a certain conceptual field (e.g. religion) received high loadings on one factor,
whereas those in another field (e.g. government) loaded highly on another factor.
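A rough illustration of these first two steps is given below. The cross-tabulated frequencies are hypothetical, and a simple principal-components extraction from the intercorrelation matrix stands in for factor analysis proper (the two techniques differ in detail but share the same aim; see note 3).

import numpy as np

# Hypothetical cross-tabulation: rows are modal verbs (the variables),
# columns are genre samples, cells are frequencies.
modals = ["can", "could", "may", "might", "shall", "will"]
freqs = np.array([
    [120,  95,  80,  60, 110,  70],   # can
    [ 60,  88,  40,  75,  30,  90],   # could
    [ 90,  70,  65,  50,  85,  55],   # may
    [ 30,  45,  20,  45,  15,  50],   # might
    [ 10,   5,  12,   4,  20,   3],   # shall
    [150, 120, 100,  90, 140,  95],   # will
], dtype=float)

# Intercorrelation matrix: the correlation of each pair of variables'
# distributions across the samples (1.0 on the diagonal, as in Table 3.2).
corr = np.corrcoef(freqs)

# Extract 'factors' as the leading eigenvectors of the correlation matrix;
# each variable's loading is its coordinate on those factors.
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
loadings = eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))
for modal, row in zip(modals, loadings[:, :2]):
    print(f"{modal:>6}: factor 1 {row[0]:+.2f}   factor 2 {row[1]:+.2f}")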
Correspondence analysis is very similar in intention to factor analysis. It
again tries to summarise the similarities between larger sets of variables and
samples in terms of a smaller number of ‘best fit’ axes, rather like the factors in
factor analysis. However, it differs from factor analysis in the basis of its calcu-
lations, though these details need not detain us here.
Multidimensional scaling (MDS) (Oakes 1998: 109) is another useful
technique for visualising the relationships between different variables. MDS
starts off with an intercorrelation matrix in the same way as factor analysis.
However, this is then converted to a matrix in which the correlation coeffi-
cients are replaced with rank order values, that is the highest correlation value
receives a rank order of 1, the next highest a rank order of 2 and so on. MDS
then iteratively attempts to plot and arrange these variables in a (usually) two-
dimensional space, so that the more closely related items are plotted closer to
each other than the less closely related items, until the difference between the
rank orderings on the plot and the rank orderings in the original table is
minimised as far as possible.
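The following sketch shows non-metric multidimensional scaling applied to a small, hypothetical intercorrelation matrix, using the scikit-learn library: the correlations are first converted to distances, and the rank-based, iterative fitting described above is then handled by the library.

import numpy as np
from sklearn.manifold import MDS

# Hypothetical intercorrelation matrix for four variables. MDS expects
# dissimilarities, so correlations are converted to distances first.
corr = np.array([
    [1.00, 0.80, 0.30, 0.10],
    [0.80, 1.00, 0.25, 0.15],
    [0.30, 0.25, 1.00, 0.70],
    [0.10, 0.15, 0.70, 1.00],
])
distances = 1.0 - corr

# metric=False gives non-metric MDS, which works on the rank order of the
# distances and iteratively arranges the items in two dimensions.
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
coordinates = mds.fit_transform(distances)
print(coordinates)   # closely related variables end up plotted near one another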
These last two techniques may be thought of as mapping techniques,
since their results are normally represented graphically on a set of axes.5 The
techniques attribute scores to each sample as well as to each variable on the
same sets of axes. When the scores and axes are plotted, therefore, it is possible,
by looking at the graphs of variables and samples side by side, to see not only
how the variables group together but also where the samples fall on the same
axes, and thus to attempt to explain the differences between the samples in
terms of their closeness to and distance from particular variables.
In cluster analysis (Oakes 1998: 110–20) the idea is slightly different,
namely to assemble variables into unique groups or clusters of similar items.
Starting from the initial cross-tabulation, cluster analysis requires a matrix of
statistics in the same way as factor analysis and the mapping techniques. This
may be an intercorrelation matrix, as in factor analysis, or sometimes instead it
may be a distance matrix, showing the degree of difference rather than
similarity between the pairs of variables in the cross-tabulation. Using the
matrix which has been constructed, cluster analysis then proceeds to group
the most important collocates. He then counted the frequencies of the collo-
cations of each of these words with right in all the texts longer than 20,000
words in his corpus. Biber thus constructed a cross-tabulation from these data,
from which he computed an intercorrelation matrix. This matrix was then
factor analysed.
The factor analysis of the right data suggested that four factors accounted
best for the data in the original table. It will be recalled from the brief descrip-
tion of factor analysis given above that in a factor analysis each item (in this
case a collocation) receives a loading on each factor signifying its contribution
to that factor. By looking down the lists of loadings, it is therefore possible to
see which items receive the highest loadings on each factor and hence which
are most characteristic of those individual factors. In this case, looking at the
loadings of the different collocations on his four factors, Biber was able to see
that each factor appeared to represent a different usage of the word right.
Factor 1 gave high loadings to collocations such as right hemisphere, right sided,
right hander and so on: this factor thus appeared to identify the locational sense
of right. Factor 2 gave high loadings to collocations such as right now, right away
and right here, thus identifying the sense of ‘immediately’, ‘exactly’ and so on.
Factor 3, with high loadings for such collocations as that’s right, you’re right and
not right, seemed to signify the sense of ‘correct’, whereas Factor 4 appeared to
mark a somewhat less clearly defined stylistic usage of right at the end of a clause.
This example shows the role that factor analysis can play in an investigation
where a corpus linguist wishes to look for more general groupings and inter-
pretations to explain a large number of ostensibly independent variables (such
as the different collocations in the example). In such cases, factor analysis is
useful in that it is able to reduce these variables to a much smaller number of
reference factors, with loadings signifying the degree of association of each
given variable with each reference factor. By looking at how the variables load
on the different factors, the linguist is then able to identify the high-loading
groups of items on each factor and interpret those factors in terms of more
general properties underlying the variables in the high-loading groups (such
as the word senses of the other collocate (right) in the example above).
Let us now look at an example of the use of cluster analysis. Mindt (1992)
was interested in how far the English depicted in German textbooks of
English as a foreign language constituted an accurate representation of the
language as used by native speakers. In this particular study he was interested
in one specific area of English grammar, namely that of future time reference.
Mindt took four corpora as the data for his study: the Corpus of English
Conversation; a corpus of twelve modern plays; and the texts of two different
German English courses. For each of these four corpora, Mindt made a count
of a number of morphological, syntactic and semantic variables related to the
expression of future time, for example the frequencies of individual verb
lemmas and verb forms denoting futurity, the frequencies of different subjects
of future time constructions (e.g. first person singular or third person plural)
and so on. When this frequency analysis had been carried out, Mindt was left
with a frequency table of values for 193 variables across each of the four
corpora. Such a table is hard to interpret in its totality and so, to try to make
sense of this table in terms of the degrees of similarity and difference between
the corpora, Mindt carried out a hierarchical cluster analysis of the table. This,
as we have seen, first required the calculation of a matrix of pairwise distance
measures between the 4 corpora on the basis of the 193 variables and then
proceeded to group the corpora into clusters according to how low the
degree of dissimilarity was between them. The dendrogram which resulted
from this analysis is reproduced in Figure 3.1.
Figure 3.1 Cluster analysis dendrogram (Source: Mindt 1992, reproduced by courtesy of Gunter Narr Verlag)
If we look along the bottom of Mindt’s dendrogram, we see a scale which
represents the various distance values. On the left-hand side we see four digits,
each representing one of the four corpora. Moving right from each digit, we
see lines of dashes. At some point each line of dashes comes to an end with an
‘I’ character, which links it with the line for one of the other corpora or
groups of corpora. Where this happens, we say that the two corpora or groups
have formed a cluster, as we discussed above. The closer to the left of the
diagram a cluster is formed, the less is the difference between the two corpora
or groups in relation to the overall pattern of variables in the original table. In
this dendrogram, we see that corpora 1 and 2 (the corpus of conversation
and the plays corpus) form a cluster very close to the left, that is, there is very
little difference between them. Somewhat further on, but still within the left-
hand half of the dendrogram, corpora 3 and 4 (the two textbooks) also form
a cluster: there is thus also not a great deal of difference between these two
samples, but they are not quite as closely related as the conversation and plays
corpora. However, these two independent clusters which are formed by the
corpus pairs do not themselves link together until the very right-hand side of
the dendrogram, that is, there is a very large degree of difference between the
cluster formed by corpora 1 and 2 and the cluster formed by corpora 3 and 4.
So what does this analysis tell us? Corpora 1 and 2, which form the first
cluster, are both corpora of native-speaker English.Their clustering at the left-
hand side of the dendrogram suggests that they have a close relationship to one
another in their ways of expressing future time. Corpora 3 and 4, which form
the second cluster, are both German textbooks of English. Again the cluster
analysis shows them to be closely related in their future time usage. However,
since the two clusters which are formed by the data are so dissimilar, we must
conclude that there exist important differences between the patterns of usage
which the textbooks present to the student and those which are actually used
by native speakers.
What the cluster analysis has done for us in this case is to take the large
number of variables in the original table, summarise them in terms of the
overall amount of difference between the four corpora and then use this
information to automatically group the corpora. It is then up to us to interpret
the empirical groupings in terms of what we already know about their
members and to draw conclusions from them. As this example shows, cluster
analysis is especially useful in corpus linguistics when we want to see in purely
empirical terms whether the nature of a number of corpora or linguistic
features can be accounted for in terms of a smaller number of broader but
distinct groupings. In this case the four corpora could be so accounted for: the
analysis reliably distinguished two groups representing native-speaker English
and textbook English respectively.
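By way of illustration, the sketch below reproduces the shape of such an analysis with the scipy library: four 'corpora' are described by rows of invented frequency counts, a pairwise distance matrix is computed, and a hierarchical clustering is drawn as a dendrogram (plotting requires matplotlib). The data are hypothetical and merely mimic the structure of Mindt's table.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Four corpora (rows) described by hypothetical frequency counts for a set
# of future-time variables (columns); the first two rows stand in for the
# native-speaker corpora and the last two for the textbook corpora.
labels = ["conversation", "plays", "textbook A", "textbook B"]
rng = np.random.default_rng(0)
table = np.vstack([rng.poisson(50, size=(2, 20)),
                   rng.poisson(20, size=(2, 20))]).astype(float)

# Pairwise distances between the corpora, then agglomerative clustering.
distances = pdist(table, metric="euclidean")
tree = linkage(distances, method="average")

dendrogram(tree, labels=labels, orientation="left")
plt.tight_layout()
plt.show()   # the two native-speaker rows and the two textbook rows each form a cluster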
Let us finally turn to an example of one of the mapping techniques.
Nakamura (1993) was interested in exploring further an observation made by
Hofland and Johansson (1982), namely that the frequencies of the modal verbs
in the LOB and Brown corpora exhibit considerable variations between genre
categories. To assist him in examining this two-way relationship between verbs
and genres, Nakamura selected a statistical technique known as Hayashi’s
Quantification Method Type III, which is a very close cousin of correspon-
dence analysis. Like correspondence analysis, this technique enables the analyst
to look simultaneously at both the variables (in this case the modal verbs) and
the samples (in this case the corpus genre categories). The first step in
Nakamura’s analysis was to make a cross-tabulation, similar to those which we
have already met, of the frequencies of each modal verb in each genre category
for each of the two corpora. This cross-tabulation was then subjected to
Hayashi’s quantification method type III.
The analysis produced three types of quantitative data for each corpus
which was analysed: (1) a set of numerical scores for each genre category on
each of n axes; (2) a set of numerical scores for each modal verb on each of the
same n axes; and (3) a set of figures denoting the proportion of the information
in the original cross-tabulation which was accounted for by each of the n axes.
The latter set of results showed that the first three axes extracted by the analy-
sis accounted for approximately 90 per cent of the Brown corpus data and 84
per cent of the LOB corpus data. Nakamura was therefore able to select with
some confidence just these three axes for further analysis. (It is in any case not
possible to plot the results graphically in more than three dimensions.) He
proceeded to plot the genres and modal verbs in the three-dimensional space
formed by these three axes according to their scores on each axis, which can be
considered as constituting three-dimensional coordinates. Nakamura performed
three analyses – one for the Brown corpus, one for the LOB corpus and one
for the combined corpora – but, for the purposes of illustrating the role of
multivariate methods in corpus analysis, it will suffice here only to look briefly
at the analysis of the Brown corpus. The plots for this analysis are reproduced
as Figures 3.2 and 3.3.
Figure 3.2 Hayashi’s quantification method type III: three-dimensional distribution of genres in the Brown corpus (Source: Nakamura 1993, reproduced by courtesy of ICAME Journal)
Figure 3.3 Hayashi’s quantification method type III: three-dimensional distribution of modals in the Brown corpus (Source: Nakamura 1993, reproduced by courtesy of ICAME Journal)
Axis 1 of the three Brown corpus axes (which runs diagonally from bottom
left to top right in the figures) accounted for more than half of the total infor-
mation in the original table (65 per cent). Looking at the points on this axis
which represent the various genres (see Figure 3.2), Nakamura found that the
axis reliably differentiated the ‘informative’ genres (such as learned writing)
from the ‘imaginative’ genres (such as fiction): the informative genres (apart
from press reportage) fell in the negative (bottom left) range of the axis,
whereas the imaginative genres fell in the positive (top right) range. This
distribution of points suggested that the informative/imaginative distinction
was a major factor influencing the distribution of the different modal verbs
and this conclusion corresponded well with previous studies of other linguis-
tic categories which had suggested the importance of the informative/imagi-
native distinction. Looking next at the plots of the modal verbs themselves on
the same set of axes (Figure 3.3), Nakamura was able to identify, by compar-
ing the two graphs, which of the verbs were characteristic of the different
genre types: would, used, ought and might, which fell in the positive range (top
right of the diagram), could be seen to make up a group representative of
imaginative prose, and dare, need, may, shall and could, which fell in the negative
range (bottom left of the diagram), could be seen to make up a group charac-
teristic of informative prose. Can, should and must showed only a smaller
tendency towards a preferred use in informative prose (being situated closer to
zero than the other modals) and will was neutral between the two types, falling
as it did almost on the zero point on the axis.
Nakamura also looked at where the plots for verbs and genres fell on Axis
2 (which runs from top left to bottom right in the two figures). Here he was
able to see some more specific relationships between the individual modals
and the corpus categories. Shall was situated closest on this axis to category D
(religion) – most likely reflecting the frequency of shall in biblical quotation –
and category H (miscellaneous), which is mainly composed of government
documents; could was located closest to category G (skills and hobbies); and
will was located closest to category A (press reportage).
This sample analysis shows how the complex variations in frequency for
the thirteen modals across the fifteen genre categories of the Brown corpus
can be made more understandable by using a form of multivariate analysis to
summarise these data statistically. It is then possible to depict the statistical
summary data diagrammatically and to interpret an otherwise almost im-
possibly difficult matrix of numbers in terms of a broader distinction underly-
ing those numbers which can be seen quite clearly on the graph, that is, a
distinction between informative and imaginative prose. The example also
demonstrates how plotting multivariate analysis scores for both variables and
samples – in this case modal verbs and genres – on the same set of axes can
facilitate the drawing of specific inferences about the connections between
one and the other: for example, in this study it could be concluded from a
comparison of the two graphs that the modal verb shall appeared to be
especially characteristic of government documents and religious prose.6
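Although Hayashi's quantification method type III itself is not widely implemented in general-purpose software, its close cousin, correspondence analysis, can be sketched quite compactly. The function below extracts shared axes for the rows and columns of a cross-tabulation via a singular value decomposition of the standardised residuals; the modal-by-genre counts are hypothetical.

import numpy as np

def correspondence_analysis(table, n_axes=3):
    # Convert the cross-tabulation to proportions and compute row and
    # column masses.
    p = table / table.sum()
    r = p.sum(axis=1, keepdims=True)
    c = p.sum(axis=0, keepdims=True)
    # Standardised residuals measure departure from independence.
    expected = r @ c
    residuals = (p - expected) / np.sqrt(expected)
    u, s, vt = np.linalg.svd(residuals, full_matrices=False)
    # Scores for the variables (rows) and the samples (columns) on the same
    # axes, plus the share of the information accounted for by each axis.
    row_scores = (u * s) / np.sqrt(r)
    col_scores = (vt.T * s) / np.sqrt(c.T)
    inertia = s ** 2 / (s ** 2).sum()
    return row_scores[:, :n_axes], col_scores[:, :n_axes], inertia[:n_axes]

# Hypothetical counts: rows are four modal verbs, columns are four genres.
counts = np.array([
    [320, 150,  90, 200],   # will
    [ 80, 160, 140,  60],   # would
    [ 40,  20,  10,  90],   # shall
    [120,  90,  70,  80],   # may
], dtype=float)
modal_scores, genre_scores, inertia = correspondence_analysis(counts, n_axes=2)
print("share of information per axis:", np.round(inertia, 2))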
each variable at a time from that model and see whether significance is main-
tained in each case, and so on until we reach the model with the lowest possi-
ble number of dimensions. So, if we were positing three variables (e.g. in the
above example, genre, verb class and separation by an adverb from the phrase
of duration), we would first test the significance of the three-variable model,
then each of the three two-variable models (taking away one of the variables
in each case) and then each of the three one-variable models generated from
the two-variable models. The best model would be taken to be the one with
the smallest number of variables which still retained statistical significance.
Another technique which allows us to examine the interaction of several
variables is known as variable rule analysis (VARBRUL) (Oakes 1998: 35–6).
VARBRUL was pioneered in North American sociolinguistics by David Sankoff
(1988) and others to study patterns of linguistic variation, but it is now slowly
becoming more widely employed in corpus linguistics: Gunnel Tottie, for
example, has used VARBRUL to examine negation in spoken and written English
(Tottie 1991). John Kirk has also used it to examine the factors influencing the
use of the modals in several corpora of modern English (Kirk 1993, 1994b).7
VARBRUL uses an estimating procedure to calculate the theoretical probability
values which would maximise the likelihood of the result embodied in the
actual data, and this model is then used to generate predicted
frequencies. The results of the model’s predictions are then compared with
those which actually occurred in the data (for example, using the chi-squared
test, which we met earlier in this chapter, section 3.4.3) to find out how well
the model accounts for the data.
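VARBRUL is, at its core, closely related to logistic regression, so an ordinary logistic model can stand in for it in a sketch. Below, a binary linguistic choice is modelled from two hypothetical, artificially generated factor groups, and the model's predicted frequencies are then compared with the observed ones using the chi-squared statistic, in the spirit of the procedure just described.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Artificial data: each row is one token, the two columns are binary factor
# groups (e.g. spoken vs. written, verb class), and y records which of the
# two variants the speaker actually produced.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 2)).astype(float)
true_logits = -0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-true_logits))).astype(int)

model = LogisticRegression().fit(X, y)
predicted = model.predict_proba(X)[:, 1]

# Compare observed and model-predicted frequencies of the variant, pooled
# over the two levels of the first factor group.
observed = np.array([y[X[:, 0] == level].sum() for level in (0.0, 1.0)])
expected = np.array([predicted[X[:, 0] == level].sum() for level in (0.0, 1.0)])
chi_squared = ((observed - expected) ** 2 / expected).sum()
print("observed:", observed, "expected:", np.round(expected, 1),
      "chi-squared:", round(chi_squared, 3))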
STUDY QUESTIONS

relativiser        124      85
no relativiser      30      36

          Corpus A    Corpus B
BE           500         800
DO            80         400
HAVE         300        1500
GO            60          70
GIVE          40          30
TAKE          20         100
NOTES
1. Schmied goes further in suggesting that this is necessarily the case, but he appears to overlook the
possibility of deriving categories for quantification directly from theory. Furthermore, it is not always
essential that categories should be pre-defined at all, though this is normally the case: in the study of
lexis, for instance, ‘categories’ have sometimes been extracted empirically and quantitatively using
statistical techniques on word frequency data.
2. More details about mutual information, including the formula, may be found in Church et al. (1991).
It should be observed, however, that recently Daille (1995) has shown that mutual information can
sometimes carry with it an unwanted effect which gives less frequent pairings more significance than
frequent ones. One possible alternative measure that is suggested in her paper (IM3) is based on
mutual information but cubes the numerator of the formula. A range of further measures with a
similar function are discussed in Oakes (1998: 162–89).
3. Genuine factor analysis is somewhat different from principal components analysis. Its overall aims,
however, are the same.The term ‘factor analysis’ is so widely used that it is difficult to distinguish the
two in research reports unless one is well versed in their statistical foundations.
4. For a clear and relatively non-mathematical explanation of the details of these techniques, the reader
is referred to the work by Alt (1990), cited in the Bibliography and in the further reading for this
chapter.
5. Factor analyses can also be plotted graphically, though this happens much less often than with corre-
spondence analysis and MDS.
6. For another example of Hayashi’s Quantification Method Type III, see Nakamura and Sinclair (1995),
also summarised in Oakes (1998: 192–3).
7. In the later work, Kirk (1994b) questions, on the basis of semantic indeterminacy, the value of the
regression analysis in looking at the modal verbs. However, the technique is certainly applic-
able to less ‘fuzzy’ areas of grammar, etc.
4
The use of corpora
in language studies
on real data to see whether or not they appear to hold or (2) to generate
hypotheses inductively from the corpus which may then be tested on further
data.
An example of the first of these research paradigms is Wilson’s (1989) study
of prepositional phrases and intonation group boundaries. He hypothesised
that postmodification of a noun by a prepositional phrase (e.g. the man with the
telescope) would constitute a barrier to the placing of an intonation group
boundary between the head noun and the preposition, since the prepositional
phrase forms part of a larger noun phrase: an intonation group boundary
would on the other hand be more likely to occur between a verb and a prepo-
sition where a prepositional phrase functions as an adverbial (e.g. She ran with
great speed). These hypotheses were tested on a subsample of the Lancaster/IBM
Spoken English Corpus and were found generally to hold, suggesting that
there is indeed a relationship between the syntactic cohesiveness of a phrase
and the likelihood of a prosodic boundary.
An example of the second paradigm is the work of Altenberg (1990), also
on intonation group boundaries. Unlike Wilson, Altenberg did not start off
with a hypothesis but instead generated a detailed account of the relationships
between intonation group boundaries and syntactic structures from a mono-
logue from the London-Lund corpus. From the results of this analysis he
devised a set of rules for predicting the location of such boundaries, which
were then applied by a computer program to a sample text from outside the
corpus (a text from the Brown written corpus). When the sample text was
read aloud, the predictions were found to identify correctly 90 per cent of the
actual intonation group boundaries. In Altenberg’s case, therefore, the research
progressed from analysing corpus data to the generation of hypotheses to the
testing of the hypotheses on more corpus data, rather than progressing from
theory to the generation and testing of hypotheses.
A second type of work has looked at the basis of the prosodic transcriptions
which are typically encoded within spoken corpora and used by researchers.
Prosodic transcription raises the question of how far what is perceived and
transcribed relates to the actual acoustic reality of the speech. Looking at the
overlap passages of the Lancaster/IBM Spoken English Corpus, where the same
passages were prosodically transcribed independently by two different phone-
ticians, Wilson (1989) and Knowles (1991) both found significant differences
in the perception of intonation group boundaries, which suggested either that
individual perception of the phonetic correlates of such boundaries differed or
that other factors were affecting the placement of boundaries in the transcrip-
tion. Wichmann (1993) looked more closely at the differences in the tran-
scription of tones rather than boundaries. Looking at the transcription of
falling tones in the corpus, she found that in the overlap passages there were
major discrepancies in the perception of such tones.The transcribers seemed
to have different notions of pitch height in relation to preceding syllables
which was also sometimes overridden according to the level of a given tone
in the speakers’ overall pitch range, and the results of a perception experiment
by Wichmann suggested that there is in fact no real perceptual category of high
and low. Such studies seem to suggest, therefore, that, in comparison to other
forms of annotation such as part of speech, prosodic annotation is a much less
reliable guide, at least to what it claims to depict.
The third type of work with speech corpora has looked at the typology of
texts from a prosodic perspective. A good example of this is Wichmann’s
(1989) prosodic analysis of two activity types in the Lancaster/IBM Spoken
English Corpus – poetry reading and liturgy. Considering Crystal and Davy’s
(1969) suggestion that a high frequency of level tones is especially character-
istic of liturgy, she made a count of the distribution of level tones in all the text
categories in the corpus. This count showed that, whilst liturgy did have a high
proportion of level tones, this was not markedly the case and in fact the high-
est number of level tones was to be found in poetry reading. Looking in more
detail at poetry reading and liturgy, Wichmann found that in the liturgical
passages the highest concentration of level tones was in the prayer, whilst in
the poetry reading the level tones tended to cluster in a final lyrical section of
the poem which was included in the corpus. Wichmann suggests that, in the
context of the prayer reading, the listener may be assumed to constitute an
audience rather than the addressee (which is God), whilst, in the case of the
lyric poetry, the reading is more of a performance than an act of informing. In
contrast, the narrative section of the same poem could be considered to be an
act of informing and this in fact showed a much more conversational typol-
ogy of tones. Wichmann links these observations about the nature of the
speaker/hearer roles to the prosodic patterns which were discovered and, on
the basis of these results, argues that, contrary to the generalisation proposed
by Crystal and Davy, the intonation patterns are not related to activity type
(such as liturgy) but rather to the discourse roles of the hearer such as audi-
ence and addressee. In this study, therefore, we see clearly how corpus data can
be of value in challenging and amending existing theories.
up all the examples of the usage of a word or phrase from many millions of
words of text in a few seconds. This means not only that dictionaries can be
produced and revised much more quickly than before – thus providing more
up-to-date information about the language – but also that the definitions can
(hopefully) be more complete and precise, since a larger sample of natural
examples is being examined. To illustrate the benefits of corpus data in lexi-
cography we may cite briefly one of the findings from Atkins and Levin’s
(1995) study of verbs in the semantic class of ‘shake’. In their paper, they quote
the definitions of these verbs from three dictionaries – the Longman Dictionary
of Contemporary English (1987, 2nd ed.), the Oxford Advanced Learner’s Dictionary
(1989, 4th ed.) and the Collins COBUILD Dictionary (1987, 1st ed.). Let us look
at an aspect of just two of the entries they discuss – those for quake and quiver.
Both the Longman and COBUILD dictionaries list these verbs as being solely
intransitive, that is they never take a direct object; the Oxford dictionary simi-
larly lists quake as intransitive only, but lists quiver as being also transitive, that
is, it can sometimes take an object. However, looking at the occurrences of
these verbs in a corpus of some 50,000,000 words, Atkins and Levin were able
to discover examples of both quiver and quake in transitive constructions (for
example, It quaked her bowels; quivering its wings). In other words, the dictionar-
ies had got it wrong: both these verbs can be transitive as well as intransitive.
This small example thus shows clearly how a sufficiently large and representa-
tive corpus can supplement or refute the lexicographer’s intuitions and
provide information which will in future result in more accurate dictionary
entries.
The examples extracted from corpora may also be organised easily into
more meaningful groups for analysis, for instance, by sorting the right-hand
context of a word alphabetically so that it is possible to see all instances of a
particular collocate together. Furthermore, the corpora being used by lexi-
cographers increasingly contain a rich amount of textual information – the
Longman-Lancaster corpus, for example, contains details of regional variety,
author gender, date and genre – and also linguistic annotations, typically part-
of-speech tagging. The ability to retrieve and sort information according to
these variables means that it is easier (in the case of part-of-speech tagging) to
specify which classes of a homograph the lexicographer wants to examine and
(in the case of textual information) to tie down usages as being typical of
particular regional varieties, genres and so on.
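The kind of sorting described above is easy to sketch. The function below builds a simple KWIC (key word in context) concordance for a node word and orders the lines alphabetically by their right-hand context, so that recurrent collocates to the right of the node are grouped together; the example text is invented.

def concordance(tokens, node, width=4):
    # Collect (left context, node, right context) triples and sort them by
    # the right-hand context so that similar continuations appear together.
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, token, right))
    return sorted(lines, key=lambda line: [w.lower() for w in line[2]])

text = ("the bank of the river was steep and the bank charged interest while "
        "the bank of england set rates near the river bank").split()
for left, node, right in concordance(text, "bank"):
    print(f"{' '.join(left):>25}  [{node}]  {' '.join(right)}")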
It is in dictionary building that the concept of an open-ended monitor
corpus, which we encountered in Chapter 2, has its greatest role, since it
enables the lexicographer to keep on top of new words entering the language
or existing words changing their meanings or the balance of their use accord-
ing to genre, formality and so on. But the finite sampled type of corpus also
has an important role in lexical studies and this is in the area of quantification.
Although frequency counts, such as those of Thorndike and Lorge (1944),
able more easily to provide the sorts of sense frequency data that we discussed
in the previous paragraph. Church et al.’s study shows a further way in which
co-occurrence data can perhaps be used, that is to add greater delicacy to defi-
nitions: strong and powerful, for example, are often treated in dictionaries almost
as synonyms, but the identification of differences in collocation can enable the
lexicographer to draw these important distinctions in their usage as well as
identifying their broad similarity in meaning.
As well as word meaning, we may also consider under the heading of lexi-
cal studies corpus-based work on morphology (word structure). The fact that
morphology deals with language structure at the level of the word may suggest
that corpora do not have any great advantage here over other sources of data
such as existing dictionaries or introspection. However, corpus data do have an
important role to play in studying the frequencies of different morphological
variants and the productivity of different morphemes. Opdahl (1991), for
example, has used the LOB and Brown corpora to study the use of adverbs
which may or may not have a -ly suffix (e.g. low/lowly), finding that the forms
with the -ly suffix are more common than the ‘simple’ forms and that, contrary
to previous claims, the ‘simple’ forms are somewhat less common in American
than in British English. Bauer (1993) has also begun to use data from his
corpus of New Zealand English for morphological analysis. At the time that he
wrote his paper his corpus was incomplete and so his results are suggestive
rather than definitive, but they demonstrate the role which a corpus can play
in morphology. One example which Bauer concentrates on is the use of strong
and weak past tense forms of verbs (e.g. spoilt (strong) vs. spoiled (weak)). In a
previous elicitation study amongst New Zealand students, Bauer had
concluded that the strong form was preferred by respondents to the weak
form, with the exceptions of dreamed and leaned. The written corpus data, on
the other hand, suggested that the weak form, with the exception of lit, was
preferred to a greater degree than the elicitation experiment had suggested.
Bauer wonders how far this difference between the elicitation experiment and
the texts of the written corpus may be due to editorial pressure on writers to
follow the more regular spelling variant, a non-linguistic factor which was not
present in his elicitation experiment, and he looks forward to testing this
theory in relation to the New Zealand spoken corpus, which also lacks this
editorial constraint. Here, then, we see how a corpus, being naturalistic data,
can help to define more clearly which forms are most frequently used and
begin to suggest reasons why this may be so.
role as empirical data, also quantifiable and representative, for the testing of
hypotheses derived from grammatical theory.
Until the last quarter of the twentieth century, the empirical study of gram-
mar had to rely primarily upon qualitative analysis. Such work was able to
provide detailed descriptions of grammar but was largely unable to go beyond
subjective judgements of frequency or rarity. This is even the case with more
recent classic grammars such as the Comprehensive Grammar of the English
Language (Quirk et al. 1985), whose four authors are all well-known corpus
linguists. But advances in the development of parsed corpora (see Chapter 2)
and tools for retrieval from them mean that quantitative analyses of grammar
may now more easily be carried out. Such studies are important, because they
can now at last provide us with a representative picture of which usages are
most typical and to what degree variation occurs both within and across vari-
eties.This in turn is important not only for our understanding of the grammar
of the language itself but also in studies of different kinds of linguistic variation
and in language teaching (see sections 4.7, 4.8, 4.9 and 4.11).
Most smaller-scale studies of grammar using corpora have included quanti-
tative data analyses. Schmied’s (1993) study of relative clauses, for example,
provides quantitative information about many aspects of the relative clauses in
the LOB and Kolhapur corpora. However, there is now also a greater interest in
the more systematic treatment of grammatical frequency and at least one
current project (Oostdijk and de Haan 1994a) is aiming to analyse the
frequency of the various English clause types. Oostdijk and de Haan have
already produced preliminary results based upon the syntactically parsed
Nijmegen corpus and they plan to extend this work in the near future to
larger corpora.The Nijmegen corpus is only a small corpus of some 130,000
words, but, with the completion of the British National Corpus and the
International Corpus of English, the stage seems set for much more intensive
treatments of grammatical frequency.
As explained in Chapter 1, there has since the 1950s been a division in
linguistics between those who have taken a largely rationalist view of linguis-
tic theory and those who have carried on descriptive empirical research with a
view to accounting fully for all the data in a corpus. Often these approaches
have been presented as competitors but they are in fact not always as mutually
exclusive as some would wish to claim: there is a further, though not at present
very large, group of researchers who have harnessed the use of corpora to the
testing of essentially rationalist grammatical theory rather than to pure linguis-
tic description or the inductive generation of theory.
One example of this kind of rationalist-to-empiricist approach to grammar
is provided by the exchange of papers between the team of Taylor, Grover and
Briscoe, and Geoffrey Sampson. Taylor, Grover and Briscoe (1989) had
produced an automatic parser for English in the form of a generative gram-
mar, not directly based on empirical data, which they wanted to test on corpus
situation various choices are more or less likely to be selected by the speaker.
Halliday (1991) uses this idea of a probabilistically ordered choice to interpret
many aspects of linguistic variation and change in terms of the differing
probabilities of linguistic systems. For example, it may be that written English
prefers which to that as a relativiser. We might quantify this statement using a
corpus and say that, in writing, which is 39% probable whereas that is 12%
probable.3 But it may be that, in contrast to writing, conversational speech
shows a greater tendency towards the use of that, so that in speech which is
only 29% probable whereas that is 18% probable.4 It is one of Halliday’s suggestions that
the notion of a register, such as that of conversational speech, is really equiv-
alent to a set of these kinds of variations in the probabilities of the grammar.
Halliday is enthusiastic about the role which corpora may play in testing and
developing this theory further, in that they can provide hard data from which
the frequency profiles of different register systems may be reconstructed. Here,
then, in contrast to the frequent hostility between grammatical theory and
corpus analysis, we see a theoretician actively advocating the use of corpus data
to develop a theory.
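Reconstructing such a probability profile from a corpus is simply a matter of turning raw counts into proportions, as the following sketch shows; the counts are hypothetical and have been chosen so as to reproduce the illustrative percentages mentioned above.

def relativiser_profile(counts):
    # Each relativisation strategy's share of all relative clauses in a
    # given register, expressed as a probability.
    total = sum(counts.values())
    return {form: round(freq / total, 2) for form, freq in counts.items()}

# Hypothetical counts for two registers.
written_counts = {"which": 390, "that": 120, "who": 310, "zero": 180}
speech_counts = {"which": 290, "that": 180, "who": 330, "zero": 200}
print(relativiser_profile(written_counts))   # {'which': 0.39, 'that': 0.12, ...}
print(relativiser_profile(speech_counts))    # {'which': 0.29, 'that': 0.18, ...}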
the present progressive and the simple present – in two corpora – the Corpus
of English Conversation and a corpus of twelve contemporary plays – and
examined the frequency of specification with the four different constructions.
He found in both corpora that the simple present had the highest frequency of
specification, followed in order by the present progressive, will and be going to.
The frequency analysis thus established a hierarchy with the two present tense
constructions at one end of the scale, often modified adverbially to intensify
the sense of future time, and the two inherently future-oriented constructions
at the other end, with a much lesser incidence of additional co-occurring
words indicating futurity. Here, therefore, Mindt was able to demonstrate that
the empirical analysis of linguistic contexts is able to provide objective indica-
tors for intuitive semantic distinctions: in this example, inherent futurity was
shown to be inversely correlated with the frequency of specification.
The second major role of corpora in semantics has been in establishing
more firmly the notions of fuzzy categories and gradience. In theoretical linguis-
tics, categories have typically been envisaged as hard and fast ones, that is, an
item either belongs in a category or it does not. However, psychological work
on categorisation has suggested that cognitive categories typically are not hard
and fast ones but instead have fuzzy boundaries so that it is not so much a
question of whether or not a given item belongs in a particular category as of
how often it falls into that category as opposed to another one. This has impor-
tant implications for our understanding of how language operates: for instance,
it suggests that probabilistically motivated choices of ways of putting things
play a far greater role than a model of language based upon hard and fast cate-
gories would suggest. In looking empirically at natural language in corpora it
becomes clear that this ‘fuzzy’ model accounts better for the data: there are
often no clear-cut category boundaries but rather gradients of membership
which are connected with frequency of inclusion rather than simple inclusion
or exclusion. Corpora are invaluable in determining the existence and scale of
such gradients. To demonstrate this, let us take a second case study from Mindt.
In this instance, Mindt was interested in the subjects of verb constructions with
future time reference, specifically the distinction between subjects that do or
do not involve conscious human agency, which theory had previously identi-
fied as an important distinction. As a rough correlate of this distinction, Mindt
counted the frequency of personal and non-personal subjects of the four
future time constructions in the same two corpora referred to above. He found
that personal subjects occurred most frequently with the present progressive,
whereas the lowest number of personal subjects occurred with the simple
future, refuting previous theoretical claims. Will and be going to had only a small
preference (just 2–3%) for personal subjects, and the rank order was the same
for both corpora. So this case study seemed to suggest that there is a semantic
relationship correlating the type of agency with the verb form used for future
time reference. But note that none of the constructions occurred solely with
furthermore, the spoken part of the BNC has been collected using demo-
graphic market research techniques for age and social class as well as
geographical location. These sociolinguistically annotated corpora should
enable corpus-based work on social variation in language to begin in earnest.
the differences between spoken language and written language. For instance,
Altenberg (1984) has examined the differences in the ordering of cause–result
constructions and Tottie (1991) has examined the differences in negation
strategies. Other work has looked at variation between particular genres, using
subsamples of corpora as the database. Wilson (1992b), for example, looking at
the usage of since, used the learned, Belles-Lettres and fiction genre sections
from the LOB and Kolhapur corpora, in conjunction with a sample of modern
English conversation and the Augustan Prose Sample, and found that causal
since had evolved from being the main causal connective in late seventeenth
century writing to being particularly characteristic of formal learned writing
in the twentieth century. A project has also begun at Lancaster University to
build a speech presentation corpus (Leech, McEnery and Wynne 1997:
94–100). Speech presentation (i.e., how spoken language is represented in
written texts) is an area of stylistics to which much attention has been given.
The speech presentation corpus will contain a broad sample of direct and indi-
rect speech from a variety of genres and allow researchers to look for system-
atic differences between, for example, fictional and non-fictional prose.
Allied to their use for comparing genres, corpora have been used to chal-
lenge, empirically, existing approaches to text typology. Variation studies, and
also the sampling of texts for corpora, have typically been based on external
criteria such as channel (e.g., speech and writing) or genre (e.g., romantic
fiction, scientific writing). However, there is now a large body of work that
addresses textual variation from a language internal perspective. Biber (1988),
for example, looking initially at the variation between speech and writing,
carried out factor analyses of 67 linguistic features across text samples from 23
major genre categories taken mostly from the LOB and London-Lund corpora.
From these analyses, Biber extracted 5 factors which, by reference to the high-
est loaded items on each, he interpreted as representing particular dimensions
of linguistic variation. For instance, Biber’s Factor 2 is taken to represent a
distinction between narrative and non-narrative: past tense verbs, third person
pronouns and verbs with perfective aspect receive high positive loadings on
this factor. In the factor analysis, each genre sample also received a factor score
on each dimension so that, taken together, it is possible to see how genres
differ from one another and on what dimensions. Having once arrived at this
5-factor framework, it is then possible to use it to score other texts. For
example, Biber and his collaborators have already applied the framework to
the historical development of English genres, to the examination of primary
school reading materials and to texts in other languages (cf. Biber 1995). What
is important about Biber’s work from a methodological point of view is that it
enables a broad, empirically motivated comparison of language variation to be
made, rather than the narrow single-feature analyses which have often been
used to add a little bit at a time to our understanding of variation. It is also a
very clear example of how the quantitative empirical analysis of a finite
also call up a concordance of similar examples. Students are given four chances
to get an annotation right. The program keeps a record of the number of
guesses made on each item and how many were correctly annotated by the
student. In the Lent Term of 1994, a preliminary experiment was carried out
to determine how effective the corpus-based CALL system was at teaching the
English parts of speech. A group of volunteer students taking the first-year
English Language course were split randomly into two groups. One group was
taught parts of speech in a traditional seminar environment, whilst others were
taught using the CALL package. Students’ performance was monitored through-
out the term and a final test administered. In general the computer-taught
students performed better than the human-taught students throughout the
term, and the difference between the two groups was particularly marked
towards the end of the term. Indeed, the performance of CALL students in a
final pen and paper annotation test was significantly higher than the group
taught by traditional methods (McEnery, Baker and Wilson 1995).
The increasing availability of multilingual parallel corpora makes possible a
further pedagogic application of corpora, namely as the basis for translation
teaching. Whilst the assessment of translation is frequently a matter of style
rather than of right and wrong, and therefore perhaps does not lend itself to
purely computer-based tutoring, a multilingual corpus has the advantage of
being able to provide side-by-side examples of style and idiom in more than
one language and of being able to generate exercises in which students can
compare their own translations with an existing professional translation or
original. Such an approach is already being pioneered at the University of
Bologna using corpora which, although they do not contain the same text in
more than one language, do contain texts of a similar genre which can be
searched for relevant examples (Zanettin 1994).
Parallel corpora are also beginning to be harnessed for a form of language
teaching which focuses especially on the problems that speakers of a given
language face when learning another. For example, at Chemnitz University of
Technology work is under way on an internet grammar of English aimed
particularly at German-speaking learners. An example of the sort of issue that
this focused grammar will highlight is aspect, an important feature of English
grammar but one which is completely missing from the grammar of German.
The topic will be introduced on the basis of relatively universal principles
(reference time, speech time and event time) and the students will be helped
to see how various combinations of these are encoded differently in the two
languages.The grammar will make use of a German–English parallel corpus to
present the material within an explicitly contrastive framework. The students
will also be able to explore grammatical phenomena for themselves in the
corpus as well as working with interactive online exercises based upon it
(Hahn and Schmied 1998).
building: indeed, this has almost become a growth industry. A few examples of
other English historical corpora that have recently been developed are the
Zürich Corpus of English Newspapers (ZEN) (a corpus covering the period
from 1660 up to the establishment of The Times newspaper), the Lampeter
Corpus of Early Modern English Tracts (a sample of English pamphlets from
between 1640 and 1740, all taken from the collection at the library of St
David’s University College, Lampeter) and the ARCHER corpus (a corpus of
British and American English between the years 1650 and 1990).
The actual work which is carried out on historical corpora is qualitatively
very similar to that which is carried out on modern language corpora,
although, in the case of corpora, such as the Helsinki corpus, which provide
diachronic coverage rather than a ‘snapshot’ of the language at a particular
point in history, it is also possible to carry out work on the evolution of the
language through time. As an example of this latter kind of work, one may take
Peitsara’s (1993) study of prepositional phrases denoting agency with passive
constructions. She made use of four subperiods from the Helsinki corpus
covering late Middle and Early Modern English (c. 1350–1640) and calculated
the frequencies of the different prepositions introducing such agent phrases.
The calculation showed that throughout the period the most common prepo-
sitions introducing agent phrases were of and by, but that, from being almost
equal in frequency at the very beginning of the period (a ratio of 10.6:9), by
rapidly gained precedence so that by the fifteenth century it was three times
more common than of, and by 1640 around eight times as common. Peitsara also
made use of the text type information, showing that, whilst by the end of the
period up to half of the individual texts contained agent phrases introduced by
more than one preposition type, some texts showed an unusual tendency to
use just one type. This was particularly marked in documents, statutes and offi-
cial correspondence and it is suggested that this may be a result of bilingual
influence from French. Individual authors of texts within categories are also
shown to differ in their personal and stylistic preferences.
This kind of quantitative empirical study, by providing concrete data which
it is now at last possible to obtain through the use of a computer corpus, can
only help our understanding of the evolution of a language and its varieties:
indeed, it has a particular importance in the context of Halliday’s (1991)
conception of language evolution as a motivated change in the probabilities of
the grammar. But it is important, as Rissanen (1989) has pointed out, also to
be aware of the limitations of historical corpus linguistics. Rissanen identifies
three main problems. First, there is what he calls the ‘philologist’s dilemma’,
that is the danger that the use of a corpus and computer to extract specific data
may supplant the in-depth knowledge of language history which is to be gained
from the study of the original texts in their context. This is, however, not a danger
inherent in corpus-based research per se but in an overreliance on corpora in
training researchers. Second, there is the ‘God’s truth fallacy’, which is the
Schmied and his colleagues are using the East African component of ICE to look
at the English spoken and written in Kenya and Tanzania (e.g., Schmied 1994).
One of the roles for corpora in national variation studies has been as a test-
bed for two theories of language variation: Quirk et al.’s (1985: 16) ‘common
core’ hypothesis – namely, that all varieties of English have central fundamen-
tal properties in common which differ quantitatively rather than qualitatively
– and Braj Kachru’s conception of national varieties as forming many unique
‘Englishes’ which differ in important ways from one another. To date, most
work on lexis and grammar in the Kolhapur Indian corpus, studied in direct
comparison with Brown and LOB, has appeared to provide support for the
common core hypothesis (Leitner 1991). However, there is still considerable
scope for the extension of such work and the availability of the ICE subcorpora
should provide a wider range of data to test these hypotheses.
Compared to ‘national variety’, ‘dialect’ is a notoriously tricky term in
linguistics, since dialects cannot readily be distinguished from languages on
solely empirical grounds. However, the term ‘dialect’ is most commonly used
of sub-national linguistic variation which is geographically motivated. Hence
Australian English might not be considered expressly to be a dialect of English,
whereas Scottish English, given that Scotland is a part of the United Kingdom,
might well be so regarded; a smaller subset of Scottish English – for example,
the English spoken in the Lowlands – would almost certainly be termed a
‘dialect’. Taking ‘dialect’ to be defined in this way, it is the case that rather few
dialect corpora exist at the present time. However, two examples are the
Helsinki corpus of English dialects and John Kirk’s Northern Ireland
Transcribed Corpus of Speech (NITCS). These corpora both consist of sponta-
neous conversations with a fieldworker: in Kirk’s corpus, as the name suggests,
from Northern Ireland, and in the Helsinki corpus from several English regions.
Dialectology is a firmly empirical field of linguistics but has tended to
concentrate on elicitation experiments and less controlled sampling rather
than using corpora. The disadvantage of this approach is that elicitation exper-
iments tend to concentrate on vocabulary and pronunciation, whereas other
aspects of dialects, such as syntax, have been relatively neglected. The collection
of stretches of natural spontaneous conversation in a corpus means, however,
that these aspects of the language are now more amenable to study. Moreover,
because the corpus is sampled so as to be representative, quantitative as well as
qualitative conclusions can be drawn from it about the target population as a
whole and the corpus can also be compared with corpora of other varieties.
This ability to make comparisons using dialect data has opened up a new
avenue of research for dialectologists, namely, the opportunity to examine the
degree of similarity and difference of dialects as compared with ‘standard’ vari-
eties of a language. A particularly good example of the latter type of research
is the work carried out by John Kirk on the identity of Scots. Kirk has used
corpora of Scots – both dramatic texts and natural conversations – alongside
(1991) and is also available electronically from the ICAME file server (gopher:
nora.hd.uib.no). More recent unpublished updates are also to be found on the
ICAME server. Although not totally exhaustive, this bibliography contains the
vast majority of published work using English language corpora.
For a more detailed overview of corpus-based projects than it has been
possible to provide in this chapter, the student is recommended to look in the
various specialist collections of papers. The festschrifts for Jan Svartvik (Aijmer
and Altenberg 1991), Jan Aarts (Oostdijk and de Haan 1994b), Geoffrey Leech
(Thomas and Short 1996) and Gunnel Tottie (Fries, Müller and Schneider 1997)
contain papers across a broad range of fields, whilst the books of proceedings
from annual ICAME conferences (Aarts and Meijs 1984, 1986, 1990; Johansson
and Stenström 1991; Leitner 1992; Johansson 1982; Meijs 1987; Kytö,
Ihalainen and Rissanen 1988; Souter and Atwell 1993; Aarts, de Haan and
Oostdijk 1993; Fries, Tottie and Schneider 1994; Percy et al. 1996; Ljung 1997)
provide a diachronic as well as a broad perspective of corpus-based research.
For corpus-based historical linguistics see three recent specialist collections of
papers: Kytö, Rissanen and Wright (1994); Hickey, Kytö and Lancashire (1997);
and Rissanen, Kytö and Palander-Collin (1993). Papers, reviews and informa-
tion are also to be found in the annual ICAME Journal (formerly ICAME News).
Foreign language corpus research is harder to track down, since it lacks the
central organisation which English language research has. Students are advised
to search keywords (e.g. corpus/Korpus) or citations of basic corpus linguistic
texts in the standard linguistic bibliographies.
NOTES
1. It should be noted, however, that spoken corpus data are not always purely naturalistic. Naturalism is
easier to achieve with pre-recorded material such as television and radio broadcasts. With spontaneous
conversation, however, there are ethical and legal considerations which prevent the use of surreptitious
recording without prior consent.
2. Taylor, Grover and Briscoe also argue that Sampson’s argument about generative grammars is tied to
the way he has chosen to analyse his data. However, these details are too complex to consider here;
Sampson (1992) continues to dispute their claims.
3. Frequencies, slightly simplified, from Schmied (1993).
4. Frequencies, slightly simplified, from Pannek (1988).
5. See also McEnery and Wilson 1997.
5
Corpora and
language engineering
5.1. INTRODUCTION
The aim of this chapter is to provide a general overview of the use of corpora
in natural language processing (NLP), and to give the reader a general understand-
ing of how they have been used and where. This chapter is either a brief excur-
sion for the linguist who is not primarily interested in NLP or a starting point
for the linguist interested in corpus-based NLP. In the opinion of the authors it
most certainly should not be the end of the road for either type of reader.
In this chapter the focus will be on language engineering. Language engin-
eering is principally concerned with the construction of viable natural language
processing systems for a wide range of tasks. It is essentially a ‘rather pragmatic
approach to computerised language processing’ which seeks to bypass the
‘current inadequacies of theoretical computational linguistics’.1 Several areas of
language engineering are considered here and the impact of corpora upon
them assessed. These areas, split into sections of the chapter, are: part-of-speech
analysis, automated lexicography, parsing and multilingual corpus exploitation.
Each section considers how the various techniques, employed by specific
systems, exploit corpora to achieve some goal. But, before starting to look at
the use of corpora in NLP, it seems worthwhile to consider what, in general,
corpora can provide for the language engineer.
in number. Most often the corpus and data derived from it are used to enhance
the effectiveness of fairly traditional AI (artificial intelligence) systems. The
main point is worth reiterating: any sacrifice of cognitive plausibility is most often one
of degree and rarely an absolute. With this stated we can now begin to consider
some specific examples of the use of corpora in language engineering.
[Figure: the stages of part-of-speech analysis – input natural language text; the lexicon identifies words and associates them with parts of speech; morphological analysis; identification of non-decompositional elements; disambiguation]
explaining how, let us briefly consider the other stages and ask whether
the corpus may have a role to play elsewhere in this schema.
5.3.1.2. Lexicon
The system first tries to see if each word is present in a machine-readable
lexicon it has available. These lexicons are typically of the form <word> <part
of speech 1, … part of speech n>. If the word is present in the lexicon, then
the system assigns to the word the full list of parts of speech it may be
associated with. For example, let us say that the system finds the word dog.
Checking its lexicon, it discovers that the word dog is present and that dog may
be a singular common noun or a verb. Consequently it notes that the word
dog may have one of these two parts of speech. With this task achieved, the
system moves on to consider the next word.
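By way of illustration, the lookup stage can be sketched in a few lines of Python; the tiny lexicon and the tag names used below are invented for the purpose of the example and do not correspond to any particular tagset.

# A toy lexicon of the form <word> <part of speech 1, ..., part of speech n>.
LEXICON = {
    "the": ["ARTICLE"],
    "dog": ["NOUN_SINGULAR", "VERB"],   # dog may be a singular common noun or a verb
    "ate": ["VERB_PAST"],
    "bone": ["NOUN_SINGULAR"],
}

def lookup(word):
    """Return the full list of candidate parts of speech for a word, or None if
    the word is absent (morphological analysis and default rules then take over)."""
    return LEXICON.get(word.lower())

for w in "The dog ate the bone".split():
    print(w, lookup(w))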
the surrounding words? For the moment we will assume that the system
allows the word any of the accepted parts of speech and leaves
the assignment of a unique part of speech to the disambiguation phase.
5.3.1.5. Disambiguation
After a text has been analysed by the lexicon/morphological processor and
any syntactic idioms have been identified, the task of assigning unique part-
of-speech codes to words is far from complete. As we saw in earlier examples,
the lexicon and morphological component merely indicate the range of parts
of speech that may be associated with a word. So, using a previous example,
we may know that dog may be a noun or verb, but we have no idea which it
is in the current context, hence the need for disambiguation. It is possible to
try a rule-based approach to disambiguation, using rules created by drawing
on a linguist’s intuitive knowledge. Indeed, it would be possible to base such
a set of rules on a corpus and it would certainly be possible to test such a
system on a corpus. But this has not tended to be the use to which corpora have
been put for disambiguation in part-of-speech tagging. They have, much more
frequently, been used to create a matrix of probabilities, showing how likely it
is that one part of speech could follow another.
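A minimal sketch of this idea follows, assuming an invented bigram transition matrix: it scores every candidate tag sequence by multiplying transition probabilities and keeps the best. A real tagger would also weight in word–tag probabilities and would use dynamic programming (the Viterbi algorithm) rather than brute-force enumeration.

from itertools import product

# Invented probabilities of one part of speech following another, P(tag2 given tag1).
TRANSITIONS = {
    ("ARTICLE", "NOUN"): 0.6, ("ARTICLE", "ADJ"): 0.3, ("ARTICLE", "VERB"): 0.01,
    ("NOUN", "VERB"): 0.5,    ("NOUN", "NOUN"): 0.2,   ("VERB", "ARTICLE"): 0.4,
    ("ADJ", "NOUN"): 0.7,     ("VERB", "NOUN"): 0.3,
}

def best_sequence(candidates):
    """candidates: one list of possible tags per word.  Returns the tag sequence
    with the highest product of transition probabilities."""
    best, best_score = None, -1.0
    for seq in product(*candidates):
        score = 1.0
        for t1, t2 in zip(seq, seq[1:]):
            score *= TRANSITIONS.get((t1, t2), 0.001)   # small floor for unseen pairs
        if score > best_score:
            best, best_score = seq, score
    return best, best_score

# 'The dog ate the bone': the ambiguous word 'dog' resolves to NOUN here.
tags = [["ARTICLE"], ["NOUN", "VERB"], ["VERB"], ["ARTICLE"], ["NOUN", "ADJ"]]
print(best_sequence(tags))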
The basic idea behind probability-matrix-based disambiguation is very simple
and immediately demonstrable. Consider the following English sentence
fragment: ‘The dog ate the’. Irrespective of what the actual form of the next
word may be, what do you think the part of speech of the next word may be?
Most readers would guess that a singular or plural common noun may be a
distinct possibility, such as ‘The dog ate the bone’ or ‘The dog ate the bones’.
Others may assume an adjective may occur, as in ‘The dog ate the juicy bone’.
But not many would assume that a preposition, article or verb would occur
next. Some may argue that this is because the readers have access to a grammar
of the language which shows what is possible and what is not possible.
But this would miss an important point – why is it that native speakers of
a language would feel that some answers are more likely than others? Halliday
Tagger                        Error rate (per cent)
Brill (1992)                   5
Cutting et al. (1992)          4
De Rose (1991)                 4
Garside (1987)                 4
Greene and Rubin (1971)       23
Voutilainen (1995)             0.7
5.3.3. So what?
Having seen how a part-of-speech tagger may be constructed, it is now poss-
ible to begin to see the relationship between corpora and language engineer-
ing a little more clearly. Language-engineering systems can be based upon
corpus data, which is used to train some model of language which the system
possesses. That training can be undertaken on raw – unannotated – corpus
data or it may be undertaken on corpus data which has been hand-annotated
by linguists. Such hand-annotated data can have at least two functions. Firstly,
it may allow the model developed by the program to become more accurate.
If the annotation represents reliable evidence of the type of linguistic dis-
tinctions that a program wants to model, then the material is likely to provide
better training data than a raw corpus.3 Secondly, such data can be useful as an
evaluation testbed for such programs. Rather than relying on human beings to
test and assess the output of a part-of-speech tagger, this can be done fairly
rapidly and automatically by allowing the part-of-speech tagger to tag text
which has already been annotated – if you like, a human-generated crib-sheet
is available to the computer to allow it to rate its own performance.
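As a concrete illustration of the crib-sheet idea, the following sketch scores a tagger's output against a hand-annotated gold standard; the sentence and tags shown are invented placeholders.

def tagging_accuracy(gold, predicted):
    """Both arguments are lists of (word, tag) pairs over the same text.
    Returns the percentage of words for which the tagger agrees with
    the hand-annotated gold standard."""
    assert len(gold) == len(predicted)
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)

gold      = [("The", "ART"), ("dog", "NOUN"), ("ate", "VERB"), ("the", "ART"), ("bone", "NOUN")]
predicted = [("The", "ART"), ("dog", "VERB"), ("ate", "VERB"), ("the", "ART"), ("bone", "NOUN")]
print(tagging_accuracy(gold, predicted))   # 80.0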
It is clear from this example alone why corpus linguistics and language
engineering interact so frequently. Successful language-engineering systems
have been developed on the basis of corpus data. Consequently language
engineers have called for the creation of corpus data – which is of subsequent
use to corpus linguists – and have also developed systems – such as part-of-
speech taggers – which are of subsequent use to corpus linguists. In the
following section we wish to further this explanation of the relationship
between corpus linguistics and language engineering by examining one area
where language engineering and corpus linguistics are currently in very close
liaison – multilingual corpus linguistics.
5.6. PARSING
As shown in Chapter 4 (section 4.4.4), the field of automated parsing is an
active one (see the reference to the Nijmegen work) and a lively one (see the
references to the Taylor/Grover/Briscoe–Sampson debate). In some senses,
some of the prerequisites for a powerful automated parsing system already
exist. To substantiate this claim, consider what such a system should be able to
do. It should, minimally, be able to:
1. identify the words of a sentence;
2. assign appropriate syntactic descriptions to those words;
3. group those words into higher-order units (typically phrases and
clauses) which identify the main syntactic constituents of a sentence;
4. name those constituents accordingly.4
It should also be able to do this with a high degree of accuracy, producing plau-
sible parses for any given input sentence, a goal often described as robustness.
Considering the advances made in automated part-of-speech recognition,
it is quite easy to see that some of these goals have been as good as achieved.
The elusive goal, however, remains the identification of structural units above
the word level and below or at the sentence level.5 Indeed, such is the
difficulty of this goal that, if you are reading this book twenty years from its
publication date, the authors would not be in the least surprised if no robust
parser for general English has yet been created. The current state of the art is
somewhat unimpressive for everyday English. Black et al. (1993) provide a
brief overview of parsing-system testing. The review makes somewhat
depressing reading for the ambitious linguist hoping to annotate automatically
a corpus with syntactic information above the word level. Satisfactory parses
in automated broad-coverage parsing competitions rarely seem to break the
60 per cent barrier and more commonly languish in the region of 30–40 per cent
accuracy. To compare this to the state of the art in part-of-speech tagging,
recall that the tagger of Greene and Rubin (TAGGIT) achieved an accuracy rate
of 77 per cent in the 1970s. Considering such comparisons, we are tempted to
conclude, appropriately we think, that the parsing problem is much more
complex than the part-of-speech assignment problem. As a consequence the
tools currently available are certainly much cruder than part-of-speech taggers
and, as a practical tool, probably useless for the corpus linguist at present.
So why discuss it in a book such as this? Well, as with other areas of lan-
guage engineering, corpora have been utilised to address the parsing problem
and again seem to promise superior performance. To assess the
handcrafted rules – the grammars derived by these systems are most certainly
not ones which a linguist would recognise, or to which anyone would ascribe
the remotest degree of cognitive plausibility.
Radical statistical grammars seek to use abstract statistical modelling tech-
niques to describe and ascertain the internal structure of language. At no point
is any metaknowledge used beyond the annotated corpus, such as a linguist’s
or system designer’s intuitions about the rules of language. The system merely
observes large treebanks and on the basis of these observations decides which
structures, in terms of word clusterings, are probable in language and which
are not. In many ways it is best described as a statistical derivation of syntactic
structure.
Magerman’s system is an interesting example of this. Magerman (1994)
looks at a sentence in a radical way – he assumes that any ‘tree’ which can fit
the sentence is a possible parse of the sentence. There is no pre-selection of
trees – Magerman simply generates all right-branching trees that fit a given
sentence. He then clips through this forest of parses and uses his corpus-based
probabilities in order to determine what is the most likely parse. Magerman’s
system uses little or no human input – the only implicit input is that language
can be analysed as a right-branching tree and annotated corpora are used as a
source of quantitative data. Otherwise linguistic input to Magerman’s model is
nil. Bod (1993) presents similarly radical work, again analysing language
independently of any handcrafted linguistic rules.
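The general principle of treebank-derived scoring of candidate trees can be illustrated with a toy sketch, which is emphatically not Magerman's system: it enumerates every binary bracketing of a short tag sequence and scores each with invented constituent scores which, in a real system, would be estimated from a treebank.

from functools import lru_cache

# Invented scores standing in for treebank-derived estimates of how likely a
# span of tags is to form a constituent (indexed, crudely, by its first and last tag).
SPAN_SCORE = {("ART", "NOUN"): 0.9, ("VERB", "NOUN"): 0.7,
              ("NOUN", "VERB"): 0.2, ("ART", "VERB"): 0.1}

def constituent_score(tags):
    return SPAN_SCORE.get((tags[0], tags[-1]), 0.05)

def best_bracketing(tags):
    """Consider every binary bracketing of the tag sequence and return the one
    whose bracketed spans have the highest product of constituent scores."""
    tags = tuple(tags)

    @lru_cache(maxsize=None)
    def best(lo, hi):                      # best tree over tags[lo:hi]
        if hi - lo == 1:
            return 1.0, tags[lo]
        candidates = []
        for mid in range(lo + 1, hi):      # every possible split point
            left_p, left_tree = best(lo, mid)
            right_p, right_tree = best(mid, hi)
            p = left_p * right_p * constituent_score(tags[lo:hi])
            candidates.append((p, (left_tree, right_tree)))
        return max(candidates, key=lambda c: c[0])

    return best(0, len(tags))

# 'The dog ate the bone', reduced to part-of-speech tags.
print(best_bracketing(["ART", "NOUN", "VERB", "ART", "NOUN"]))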
The work of Magerman and Bod is genuinely radical. It is common to find
works which, as this book has noted, merely augment traditional AI systems
with some stochastic information and claim that a radical paradigm shift has
taken place. An example of this is Charniak (1993), who develops a probabilis-
tic dependency grammar and argues that this constitutes a major paradigm
shift. A careful examination of Charniak’s work reveals that the major para-
digm shift is constituted by the incorporation of statistical elements into a
traditional parsing system, not surprisingly, at the stage of disambiguation.
When we compare this to the radical paradigm shift promised by Magerman and
Bod, we see that, while Charniak’s work is, in some ways, novel, it is by no
means as radical a shift as theirs.
The main point of importance for the corpus linguist is this: if the radical
statistical paradigm shift in language engineering is to take place, corpora are
required to enable it. Both Magerman and Bod rely on annotated corpora for
training their statistical parsers. Without corpora the a priori probabilities
required for a linguistically atheoretical approach to parsing would be difficult
to acquire. With this stated, it is now possible to move on and consider the pre-
theoretical hybrid approach to parsing mentioned in the previous section.
some respectable accuracy scores, with results running from 96 per cent to 94
per cent for computer manual sentences averaging 12 and 15 words respec-
tively. Black’s work is interesting as it shows how raw quantitative data and
traditional linguists’ grammars can be combined to achieve a useful goal –
accurate parses of sentences in restricted domains. It also shows again, quite
clearly, the need for appropriately annotated corpora in language engineering.
5.7. MULTILINGUAL CORPUS EXPLOITATION
5.7.1. Alignment
Alignment at sentence, word and multiword unit level is seen as a key process
in the exploitation of parallel corpora for a variety of purposes, including, for
example, lexicographic research and machine(-aided) translation.
Sentence alignment software has been developed and tested, based upon a
variety of techniques as described by Gale and Church (1991), Kay and
Röscheisen (1993) and Garside et al. (1994). Such alignment programs allow
users to align parallel corpora as a prelude to exploitation in, for example, lexi-
cographic research. Table 5.3 gives a series of results obtained at Lancaster
using the Gale and Church technique (which is based upon statistical heuris-
tics) for sentence alignment (taken from McEnery and Oakes 1996).
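The length-based intuition behind the Gale and Church technique can be sketched as follows. This is a heavily simplified illustration rather than their algorithm: it allows only 1:1, 1:2 and 2:1 pairings, substitutes a crude length-ratio cost for their probabilistic model, and the example sentences are invented.

import math

def cost(len_a, len_b):
    """Crude length-based cost: the more the character lengths of the two chunks
    diverge, the higher the cost (a stand-in for Gale and Church's length model)."""
    return abs(math.log((len_a + 1) / (len_b + 1)))

def align(src, tgt):
    """Align two lists of sentences, allowing 1:1, 1:2 and 2:1 pairings,
    by dynamic programming over the length-based cost."""
    la, lb = [len(s) for s in src], [len(s) for s in tgt]
    INF = float("inf")
    best = [[(INF, None)] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    best[0][0] = (0.0, None)
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            here = best[i][j][0]
            if here == INF:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                if i + di <= len(src) and j + dj <= len(tgt):
                    c = here + cost(sum(la[i:i + di]), sum(lb[j:j + dj]))
                    if c < best[i + di][j + dj][0]:
                        best[i + di][j + dj] = (c, (i, j, di, dj))
    beads, i, j = [], len(src), len(tgt)      # trace back the chosen pairings
    while (i, j) != (0, 0):
        i, j, di, dj = best[i][j][1]
        beads.append((src[i:i + di], tgt[j:j + dj]))
    return list(reversed(beads))

english = ["The system produces a report.", "Check the output carefully."]
french = ["Le système produit un rapport.", "Vérifiez la sortie.", "Faites-le avec soin."]
print(align(english, french))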
Table 5.4 The success of word alignment in English–French using Dice’s similarity coefficient
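Dice's similarity coefficient itself is simple to state: for a candidate word pair it is 2c/(f1 + f2), where c is the number of aligned sentence pairs in which the two words co-occur and f1 and f2 are the two words' overall frequencies. A minimal sketch, with invented toy data, follows.

from collections import Counter

def dice_scores(aligned_pairs):
    """aligned_pairs: list of (english_sentence_tokens, french_sentence_tokens).
    Returns Dice's similarity coefficient for every English/French word pair
    that co-occurs in at least one aligned sentence pair."""
    f_en, f_fr, co = Counter(), Counter(), Counter()
    for en, fr in aligned_pairs:
        for e in set(en):
            f_en[e] += 1
        for f in set(fr):
            f_fr[f] += 1
        for e in set(en):
            for f in set(fr):
                co[(e, f)] += 1
    return {pair: 2 * c / (f_en[pair[0]] + f_fr[pair[1]]) for pair, c in co.items()}

pairs = [("the system produces a report".split(), "le système produit un rapport".split()),
         ("the report is ready".split(),          "le rapport est prêt".split())]
scores = dice_scores(pairs)
print(scores[("report", "rapport")])   # 1.0: they co-occur in both sentence pairs
print(scores[("report", "système")])   # about 0.67: a weaker, spurious association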
Figure 5.2 The success of compound noun alignment between English and Spanish using finite state
automata and similarity data both with and without co-occurrence measures as an additional filter
(the figure plots accuracy against similarity score (Dice) for filtered and unfiltered alignment)
brief and sketch-like, it is possible to see from the work presented that a range
of techniques have been developed to achieve alignment. Further to this,
sentence-alignment technology has proved to be fairly reliable on the langu-
age pairs it has been tested on so far. While word alignment is still not as reliable,
advances have been made towards this goal also, and work on phrasal align-
ment is under way. Aligned data is clearly of use for machine-translation tasks
(see section 5.7.3), but what of machine-aided translation? At least part of the
answer to this question lies in the development of parallel concordancers
which allow translators to navigate parallel text resources.
it could be hoped that, some time soon, there will be an answer to this criti-
cism. For the moment we must keep an open mind. But the fact that we must
wait admirably underscores the point that this approach to MT requires paral-
lel aligned corpora. Corpora are not an option here. They are a necessity.
On the other hand it also seeks to avoid the abstract non-linguistic processing
of pure statistical MT, in favour of a more cognitively plausible approach to the
problem. The important point is that, without parallel aligned corpora, this
approach to MT would simply not be possible. Corpora have once again pro-
vided the raw resources for a new and promising approach to a difficult prob-
lem in language engineering.
To return to the general discussion of MT, it is obvious that corpora have
had quite an impact in the field of MT. Even where traditional rule-based MT
is being undertaken, it may be the case that the system designers may exploit
corpora to clear the knowledge acquisition bottleneck for their system in
certain key areas (e.g., the lexicon). But where EBMT or statistical MT are taking
place the corpus is of paramount importance and is, indeed, a necessity. EBMT
and statistical MT have, like so many other areas of language engineering, been
enabled by the creation of certain kinds of corpora. Without parallel aligned
corpora, there would be no EBMT and there would be no statistical MT.
5.8. CONCLUSION
In general then, what have corpora got to offer to language engineering?
From what has been reviewed in this chapter it is apparent that corpora have
a great deal to offer language engineering. Apart from being excellent sources
of quantitative data, corpora are also excellent sources of linguistic knowledge,
in such systems as EBMT.
Indeed, corpora are essential to some applications in language engineering,
such as so-called ‘probabilistic’ part-of-speech annotation systems. Their use in
such systems is, however, revealing of the general trend of the use of corpora
in language engineering. Although in general these systems are highly effective
in terms of reliability, they are also generally of the ‘disambiguation’ variety. As
noted in this chapter, ‘radical’ applications of corpus data in language engineer-
ing are relatively rare. Some variety of disambiguation seems to be the most
commonplace use of corpora in language engineering, with few exceptions.
The impact of corpora on language engineering is increasingly profound.
At the time of writing, there seems no reason to suppose that this trend will
not continue. The use of corpora in language engineering, especially within
hybrid systems where corpus data are used in some process of disambiguation,
is burgeoning.
LEXICON
ARTICLES (A)
the a an
DETERMINERS (D)
a all an another any both double each either enough every few half last least
less little many more most much neither next no none other several some the
twice here that there these this those
ORDINALS (M)
(irregular only) first second third fifth eighth ninth twelfth (M)
PRONOUNS (P)
anybody anyone anything ’em everybody everyone everything he he’d he’ll
he’s her hers herself him himself his hisself i i’d i’ll i’m i’ve it it’s its itself me
mine my myself nobody nothing oneself ours ourselves she she’d she’ll she’s
somebody someone something their theirs them themselves they they’d they’ll
they’re they’ve ’tis ’twas us we we’d we’ll we’re we’ve what what’re whatever
whatsoever where’d where’re which whichever whichsoever who who’d who’ll
who’s who’ve whoever whom whomever whomso whomsoever whose whoso
whosoever yer you you’d you’ll you’re you’ve your yours yourself yourselves
ADVERBIALS (R)
anyhow anymore anyplace anyway anyways anywhere e’en e’er else elsewhere
erstwhile even ever evermore everyplace everyway everywhere hence hence-
forth henceforward here hereabout hereabouts hereafter hereby herein here-
inabove hereinafter hereinbelow hereof hereon hereto heretofore hereunder
hereunto hereupon herewith hither hitherto ne’er never nohow nonetheless
noplace not now nowadays noway noways nowhere nowise someday some-
how someplace sometime sometimes somewhat somewhere then thence
thenceforth thenceforward thenceforwards there thereabout thereabouts
thereafter thereat thereby therefor therefore therefrom therein thereinafter
thereinto thereof thereon thereto theretofore thereunder thereunto thereupon
therewith therewithal thither thitherto thitherward thitherwards thrice thus
thusly too
AUXILIARIES (V)
ain’t aint am are aren’t be been being did didn’t do does doesn’t don’t had
hadn’t has hasn’t have haven’t having is isn’t was wasn’t were weren’t can can’t
cannot could couldn’t may mayn’t might mightn’t must mustn’t needn’t ought
oughtn’t shall shan’t should shouldn’t used usedn’t will won’t would wouldn’t
Suffix Rules
-s Remove -s, then search lexicon
-ed V J@
-ing(s) -ing V N J@ ; -ings N
-er(s) -er N V J ; -ers N V
-ly R J@
-(a)tion N
-ment N
-est J
-(e)n JN
-ant JN
-ive No rule possible
-(e)th NM
-ful J N%
-less J
-ness N
-al J N%
-ance N
-y JN
-ity N
-able J
DEFAULT
If nothing else works, the word in question may be VERB, NOUN or
ADJECTIVE (V, N, J).
2. How would you disambiguate the analysis of the sentence from ques-
tion (1), using your linguistic knowledge?
3. Using the table and equation below, show how we could disambiguate
the part-of-speech analysis of the sentence given in question
(1).
EQUATION
For each of the possible tag sequences spanning an ambiguity, a value is gener-
ated by calculating the sum of the values for successive tag transitions taken
from the transition matrix. Imagine we were disambiguating the sequences N
followed by a word which is either N or V, followed by a word which is either
N or V:
N – N – V = 8 + 17 = 25
N – N – N = 8 + 8 = 16
N – V – V = 17 + 18 = 35
N – V – N = 17 + 4 = 21
The probability of a sequence of tags is then determined by dividing the value
obtained for the sequence by the number of transitions in the sequence, e.g., for
this example the most likely answer is the third:
35 / 2 = 17.5%.
If a tag has a rarity marker (@ for 10 per cent rare, % for 100 per cent rare),
then divide the percentage from the matrix by either 10 (@) or 100 (%) before
inserting it into the equation.
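For readers who prefer to see the arithmetic mechanised, the sketch below implements the heuristic just described; the transition values are placeholders to be read off the matrix supplied with the exercise.

# Placeholder transition values (per cent), standing in for the matrix supplied
# with the exercise.  For tags carrying a rarity marker, the exercise says to
# divide the matrix value by 10 (@) or 100 (%) before using it here.
TRANSITIONS = {("N", "N"): 8, ("N", "V"): 17, ("V", "V"): 18, ("V", "N"): 4}

def sequence_value(tags):
    """Sum the transition values along one candidate tag sequence."""
    return sum(TRANSITIONS.get((t1, t2), 0) for t1, t2 in zip(tags, tags[1:]))

candidates = [("N", "N", "V"), ("N", "N", "N"), ("N", "V", "V"), ("N", "V", "N")]
values = {seq: sequence_value(seq) for seq in candidates}
best = max(values, key=values.get)
# Divide by the number of transitions in the sequence to express the result
# as in the worked example: 35 / 2 = 17.5 per cent.
print(best, values[best] / (len(best) - 1))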
4. Use the resources provided to analyse sentences taken at random from
this book. How easy is it to analyse these sentences using the materi-
als provided? Do any shortcomings of the materials provided become
apparent?
NOTES
1. Both quotes are from Robert Garigliano, see http://www.dur.ac.uk/~dcs0www3/lnle/editorial.html.
2. Disambiguation: choosing the preferred analysis from a variety of possible analyses.This may occur at
many levels, from deciding the meaning, in context, of a polysemous word, through to choosing one
possible translation from many.
3. See, for example, http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/.
4. A note of caution should, however, be sounded here. There is evidence that there is an optimal size
for such a training corpus. Should the program be exposed to too much training data of this sort, the
likelihood seems to be that the model will actually deteriorate rather than improve. See Merialdo
(1994) and McEnery et al. (1997).
5. It may be that in the literature you will see sentences described as S-units. This is a more neutral
term, reflecting the difficulty of unambiguously defining what is and what is not a sentence.
6. For readers interested in seeing a detailed example of this see Markantonatou and Sadler (1994).
7. Needless to say, the probabilistic element is used for ambiguity resolution.
8. Needless to say, these were parallel aligned corpora. It also goes without saying that, as a rule of
thumb, the larger they are the better, though recent research has shown that one can overtrain a
statistical model, actually forcing its performance to degrade by exposing it
to a larger sample (Merialdo 1994).
9. Though sentence alignment is much more effective than word alignment.
10. There is another heuristic component to this system, which estimates how often one sentence in
language A translates into one sentence in language B, or two in language B and so on and so forth.
6
A case study:
sublanguages
(section 6.2) and the tools used in annotating and manipulating the corpora
described (section 6.3). Having presented the corpora, we will pause for a
moment to consider how generalisable and valid the results that we may draw
from this study may be (section 6.4). This will provide us with a sound foun-
dation for making any claims on the basis of our study. With this groundwork
in place, we will then proceed to an investigation of the corpora in question
and examine a series of claims made under the general banner of the sublan-
guage hypothesis (sections 6.5–8). With our study complete, we can then make
a series of claims of our own (section 6.9), based on the findings of our study
and tempered by any limitations, following from section 6.4, we may place
upon our results.2
evidence that they are not sublanguages. This contrast, between the potential
sublanguage corpus and the potential non-sublanguage corpora, explains the
selection of these corpora for the study. But there is one other important factor
which commended these corpora over other corpora which are available in
the public domain: annotation.
Annotated corpora will prove of particular use to this study, for, while it
would be possible to study some levels of closure using an unannotated
corpus, most notably lexical closure, other levels of closure would have been
more difficult; indeed, some would have been impossible given an unanno-
tated corpus. By using corpora which have been annotated to record various
forms of linguistic description, we can move our observation from the level of
word form to more interesting levels of linguistic description. Again, advances
in natural language processing, as well as the development of intensive human-
aided editing, allow us here to undertake a study by means which would have
been nothing more than a pseudo-technique only fifty years ago.
So annotation exists in all of these corpora – some introduced automatically
using state-of-the-art annotation tools, others input by semi-automated means
using human-friendly corpus annotation software.3 Table 6.1 describes, for
each corpus, three things: the size of the corpus, what variety of annotation is
available in that corpus and how much of that annotation is reliable.
With this range of annotation available, it should now be apparent that,
when we test for closure in these three corpora, we can do so in a variety of
ways. As well as lexical closure based upon word forms, we can also test for
morphosyntactic closure, using the part-of-speech codes, and grammatical
closure, using the parsing information.
But, before beginning the study, it is useful to pause and consider what soft-
ware tools were essential to its processing.
6.3.2. Parsing
The parsing of the texts had been done by a combination of human and
machine annotation. Using a program called EPICS, human grammarians
parsed the corpora for the study. EPICS allows humans to parse texts quite
rapidly. It has a variety of useful features which allow for speedy annotation.5
Specialised editors such as EPICS are not generally available in the public
domain at present and neither are the corpora which have been produced by
them.There are certain exceptions such as the Penn Treebank (see Appendix
B), but on the whole parsed corpora are hard to find and time-consuming to
produce, even with editing tools such as EPICS.
high frequency of occurrence, then relatively small samples of text can yield
interesting results – Biber and Finegan (1991) for instance use corpus samples
of around 1,000 words for such features (as discussed in section 3.4). But the
lower the frequency of the feature you wish to observe, the larger the corpus
you need. For the purpose of the current study, the corpora are certainly large
enough to monitor the various forms of closure we wish to observe. But, if,
for instance, our goal was to observe, say, the relative frequency of the verbs spy
and snoop, we may need a much larger corpus. In the 2.25 million words of
English gathered for this study, the verbs spy and snoop did not occur at all! This
is hardly a surprise as, in general, words with an open-class7 part-of-speech will
tend to occur with a low frequency. So for the purposes of such a study, the
corpora collected may be quite inadequate.8 For the study in hand, however,
corpus size is not a concern.
A final issue we may address is whether the prior categorisation of corpus
texts, which our study may rely upon, is in itself valid. Corpora tend to be
balanced for genre (see section 2.1), and we may wish to proceed using these
genre distinctions as a series of anchor points for our study. However, the genre
distinction itself is one which largely reflects purpose, not necessarily style
(see section 4.3 for a fuller discussion). We have already seen that style may
vary within a text, depending on the function of a section of writing. Similarly,
texts within a corpus may not be stylistically homogeneous. The texts usually
have a wide range of writers. Within certain genres, where there may be some
control over the writers’ style (such as in a corpus of newspaper reports or in
a technical manual), we may expect the style of individual authors not to vary
widely. But what if our corpus has a ‘fiction’ category, which contains writers
as diverse as D. H. Lawrence and George Orwell? Should we really expect
them to have an homogeneous style? Of course not. So, when composing our
study, we should also have regard to how valid any prior corpus organisation
may be from the point of view of a linguist interested in the form of the
language contained in the corpus.We should not assume that any corpus cate-
gorisations are linguistically motivated. For the purpose of the current study,
the two largest corpora – the Hansard and the IBM manuals – may both be
assumed to be fairly homogeneous stylistically, because of editorial decisions
made in their production. The smaller corpus – the APHB – is not as homoge-
neous stylistically, and this should certainly be one factor we take into account
when explaining any difference in behaviour between this corpus and the
other unrestricted language corpus, the Hansard corpus.
We have now considered three important methodological issues. We have
decided that
1. as we are seeking a gross characterisation of the corpora concerned,
we need not worry about heterogeneity of purpose in sections of the
corpus text;
2. the corpora we have are large enough for the purposes of this study;
3. stylistic heterogeneity may only be a factor which influences the
smallest corpus we will be using.
With the corpora used in the study described, the source of their annotation
explained, the tools used to manipulate them briefly reviewed and relevant
methodological issues addressed, we are now ready to start our study.
and the relative sizes of the lexicons seem to be pointing in the same direction
– towards the IBM manuals having a more restricted lexicon – then we should
investigate further.
The APHB corpus has a yet lower type/token ratio, but as this corpus is signif-
icantly smaller than the Hansard or IBM corpora, we have some reason for caution
in comparing this figure to the other two. To give an example of why we
might be cautious: what if, in a further 800,000-word sample of APHB, we did
not spot one new word form? The type/token ratio of the corpus would rise
to 1:85.9! Although this is an extreme case, it does at least allow us to see the
language can have a variety of syntactic functions, for example, spy can be a
noun or a verb, as shown in the following examples from the BNC: McLean may
act as Scotland’s spy and She might spy on him. Now, while the IBM manuals have
an enumerable lexicon, is it the case that the functions associated with those
words are similarly finite? If we look at the data, we can see some interesting
evidence to suggest that, while the IBM manuals have a more carefully
controlled lexicon, the words in that lexicon seem to have a wider variety of
syntactic functions than entries in the unconstrained language lexicons. Table
6.3 lists, for each corpus, the number of words in the corpus and the average
numbers of parts of speech associated with each word in the corpus.
Surprisingly, the IBM manuals corpus has lexicon entries which are associ-
ated, on average, with more parts of speech than the unrestricted language
corpora. But could it simply be the case that the unrestricted language corpora
have lexicon entries composed of a large number of words which are only
used once and consequently that the average is brought down for them by this
large number of hapax legomena, appearing only once and consequently with
only one part of speech? In order to test this, the average number of parts of
speech per entry was recomputed excluding entries which were only associ-
ated with one part of speech to give a more representative sample. Table 6.4
gives, for each corpus, how many word forms in the corpus were associated
with more than one part of speech and the average number of parts of speech
for each of these entries.
It is interesting that the picture does not change, but rather becomes some-
what clearer. Not only is it the case that any given lexicon entry in the IBM
manuals corpus is likely to be, on average, associated with more parts of speech
than in either of the two unconstrained language corpora, it is also the case
that it is more likely that any given lexicon entry generated from the IBM
corpus will have multiple parts of speech associated with it. Although the IBM
manuals lexicon appears limited, it also appears that the limited lexicon is used
inventively, with any given word being more likely to be used as a variety of
parts of speech.
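The figures behind Tables 6.3 and 6.4 are straightforward to compute from a tagged corpus; a minimal sketch follows, assuming the corpus is available simply as a list of (word, tag) pairs.

from collections import defaultdict

def pos_per_entry(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs.
    Returns (average tags per word form, average tags per word form counting
    only forms seen with more than one tag) - the measures behind Tables 6.3 and 6.4."""
    tags_for = defaultdict(set)
    for word, tag in tagged_corpus:
        tags_for[word.lower()].add(tag)
    counts = [len(t) for t in tags_for.values()]
    multi = [c for c in counts if c > 1]
    return sum(counts) / len(counts), (sum(multi) / len(multi) if multi else 0.0)

toy = [("the", "ART"), ("dog", "NN"), ("barks", "VBZ"), ("the", "ART"),
       ("dog", "VB"), ("dogs", "NNS"), ("bark", "NN"), ("bark", "VB")]
print(pos_per_entry(toy))   # (1.4, 2.0): five word forms, 'dog' and 'bark' with two tags each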
But what would be of most interest is a series of graphs showing how often a
new usage for a word was added. As noted previously, the average represented
by the type of ratio given above does not guarantee that closure has or has not
been achieved. If there is closure of syntactic word category, as of lexis, in the
sublanguage, then we should see that the growth of unique word/tag pairs
levels off for the sublanguage, yet does not for the unconstrained language
corpora. Figures 6.5 to 6.8 are a series of graphs which chart the growth of
those word/part-of-speech tag pairs for the first 200,000 words of each
corpus.
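Charting closure of this kind is equally simple: run through the corpus in order and record, at intervals, how many distinct word/tag pairs have been seen so far; a curve which flattens out indicates closure. A minimal sketch, again assuming a list of (word, tag) pairs, follows.

def growth_curve(tagged_corpus, interval=10000):
    """Return a list of (tokens seen, distinct word/tag pairs seen) points,
    sampled every `interval` tokens - the kind of data plotted in Figures 6.5 to 6.8."""
    seen, points = set(), []
    for i, (word, tag) in enumerate(tagged_corpus, start=1):
        seen.add((word.lower(), tag))
        if i % interval == 0:
            points.append((i, len(seen)))
    return points

# For word-form (lexical) closure, simply replace (word.lower(), tag) by word.lower().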
Having looked at these graphs, we see a different story from that revealed by
the simple ratios.
than it did when lexical growth was being examined, it none the less levels off
after about 180,000 words have been processed. The other corpora do not
level off, but continue to develop. In this graph we see the answer to our
conundrum. The IBM manuals have a lexicon which practically enumerates
itself after some 100,000 words have been observed. It continues to find differ-
ent syntactic uses of those words fairly steadily, however, for a further 80,000
words or so.The other corpora, on the other hand, are adding both new words
and new functions for their words already observed throughout the 200,000
words studied.The higher ratio of parts of speech to words is due solely to the
limited nature of the lexicon of the IBM manuals.
Let us use an analogy to illustrate. Imagine that you, person a, have been
given the job of seeing, over a one-hour period, what colours a car may be and
that a friend of yours has been given the same job in a different part of the
world. Now, in your part of the world there may be 100 makes of car, but you
only see 50 types of car during your hour.You see 40 types in only one colour, 5
in two colours, 3 in three colours, 1 in four colours and 1 in all five colours. From
your observation, you may guess that, on average, there are 68–50 car/colour
combinations.Your friend, person b, however, lives in a part of the world with
only twenty types of car. She sees nearly as many cars as you during her hour,
but she sees a smaller range of cars in a wider range of colours. She sees 12
types in three colours, 4 types in four colours and 4 types in five colours. From
her observation, she may guess that there are, on average, 72/20 = 3.6 colours
per type of car. When you both check with manufacturers, you discover that
there are always five possible colours for a car in any part of the world. Person
b came closest to this answer because person b had a smaller range of cars to
look at, but just as wide a range of colours, so she is likely to see a higher
number of car colours on average (given that all other things are equal).
How does this relate to the work here? We simply switch the cars into
words and the colours into part-of-speech tags to see the point. In the IBM
corpus we have a smaller number of words (the cars observed by person b)
exhibiting just as wide a variety of parts of speech (colours) as words in
another corpus where there is a wider variety of words (the cars observed by
person a). So the higher part-of-speech count per word is explicable not in
terms of a wider range of meaning associated with words as of right in the IBM
corpus, but rather as a function of the lexical closure of the IBM corpus.
We can now see that the IBM manuals achieve closure not just at the lexical
level, but also at this level of morphosyntax. Not only do all of the words
enumerate themselves, but the parts of speech associated with those words
enumerate themselves also.
types in each corpus. In Chapter 1, we noted that Chomsky had stated that the
probability of a sentence type was empirically indistinguishable from zero. In
other words, language is so endlessly productive that the chance of one
sentence mirroring another perfectly is so low that it may as well be viewed as
having a probability of zero. We can use this observation to test our corpora
further: if the manuals are subject to closure at the level of constituent
structure, the consequent reduction of the variety of sentences generated by
the grammar associated with this sublanguage may well lead to the probability
of a sentence type moving away from zero. To test this we will need a working
definition of sentence type. As we have three syntactically parsed corpora, our
concept of a sentence type will be at the level of annotation. Each sentence
will be judged not on the basis of the word forms it contains, but rather in
terms of its ‘parse tree’. Let us take two example sentences from the IBM manu-
als corpus to illuminate this distinction. Some of the corpus is concerned with
getting diagnostic reports from computer systems. The conclusion to the
attempt usually finishes with the same sentence type, summarised below:
[S[N The system N][V does not produce [N the report N]V]S]
[S[N The system N][V does not produce [N a report N]V]S]
Note that the parse tree is identical in both cases. To give an idea of the stabil-
ity of that sentence type in one section of the corpus alone, the construct
occurs fourteen times in the hundred-sentence section dealing with reports
from systems. As a subpart of other parse trees it would be more numerous still.
With this definition in place we can consider what evidence the corpora
yield. This is summarised in Table 6.5.
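Given parsed corpora, the notion of a repeated sentence type is easy to operationalise: strip the word forms out of each sentence's annotation and treat what remains as a key. The following sketch assumes annotation in the style of the two examples above and uses the repetition formula given in note 9.

from collections import Counter

def skeleton(parsed_sentence):
    """Keep only the labelled brackets of a parsed sentence, discarding the word
    forms, e.g. '[S[N The system N][V does not produce [N the report N]V]S]'
    reduces to '[S [N N] [V [N N] V] S]'."""
    spaced = parsed_sentence.replace("[", " [").replace("]", "] ")
    return " ".join(tok for tok in spaced.split() if "[" in tok or "]" in tok)

def repetition_rate(parsed_sentences):
    """Percentage of sentences whose parse tree repeats one already seen:
    r = 100 - (100 / a) * b, with a sentences and b distinct sentence types (note 9)."""
    types = Counter(skeleton(s) for s in parsed_sentences)
    a, b = sum(types.values()), len(types)
    return 100 - (100 / a) * b

sentences = ["[S[N The system N][V does not produce [N the report N]V]S]",
             "[S[N The system N][V does not produce [N a report N]V]S]"]
print(repetition_rate(sentences))   # 50.0: the two sentences share one sentence type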
It would appear that the unconstrained language corpora broadly confirm
Chomsky’s suspicion about sentence types. Only 6.28 per cent of the Hansard
sentences are repetitions of a previously observed sentence type.9 A mere
2.17 per cent of the APHB corpus is composed of repeated sentence types. The IBM
manuals on the other hand seem to have a much more constrained set of
sentence types: 39.8 per cent of the corpus is composed of sentences constituting re-
peated sentence types. It would appear that some persuasive evidence exists to
suggest that sentence type within the IBM manuals corpus may be constrained.
To discover the degree of constraint, it seems sensible once again to plot the
growth of sentence types within the three corpora. If Chomsky’s statement
about sentence types is true, we should see roughly one new sentence type per
sentence examined. If the sublanguage hypothesis is correct, the IBM corpus
should not follow this pattern. Figures 6.9 to 6.12 show the rate of sentence-
type growth for the three corpora.
Looking at these graphs we can see that the IBM corpus is markedly differ-
ent from the APHB and Hansard corpora. The two unconstrained language
corpora do seem to exhibit the general pattern ‘each new sentence means a
new sentence type’. This is not true of the IBM corpus. While the corpus does
not achieve closure, which would be indicated by a ‘flattening-off ’ of the
curve showing sentence-type growth, its growth is dramatically slower than
that of the unconstrained language corpora. While the
unconstrained language Hansard and APHB corpora add a new sentence type
for each 1.06 and 1.02 sentences processed respectively, the IBM corpus adds a
new sentence type for each 1.45 sentences processed. Repetition of sentence
types is clearly more marked in the IBM corpus than in the Hansard corpus and
is far from being of the ‘one sentence equals one new sentence type’ variety. This
particular example is interesting. It may be that the corpus concerned is not
large enough to demonstrate sentence-type closure for the IBM manuals,
hence we cannot say for sure that there is a tendency towards closure at the
sentence-type level for the IBM corpus. We can say, however, that the IBM
corpus seems to be the most prone to a repetition of sentence types, which
again sets it quite clearly apart from those unconstrained language
corpora.
NOTES
1. See section 1 for a definition of the term sublanguage.
2. For readers interested in seeing samples of the corpora used in this chapter and also sample outputs
from the programs used to process this study, these are available on the World Wide Web at
http://www.ling.lancs.ac.uk/staff/tony/tony.htm – please feel free to browse the resources.
3. See Chapter 2 for a description of the process of corpus annotation and the form of annotated
corpora.
4. See Garside and McEnery (1993) for a description of the software used in this so-called ‘corpus
cleanup’ process.
5. Garside and McEnery (1993) describe the EPICS program in detail.
6. Garside and McEnery (1993) again describe this program in some detail.
7. The open classes of words, such as nouns, verbs and adjectives, are notable as a set for being composed of
a large number of members and for readily increasing in number over time.
8. As a matter of interest, in the 100,000,000-word BNC, the word spy occurs 709 times and the word
snoop occurs only 22 times.
9. Calculated by the formula r = 100 – ((100/a) * b), where a is the number of valid sentences analysed,
b is the number of sentence types observed and r is the percentage of corpus sentences based upon
the repetition of a sentence type.
7
Looking backwards,
looking forwards
In this, the final chapter of the book, we would like to look back to our past
predictions about the future of corpus linguistics and to try, once again, to
look forwards.
In the first edition of our book, we made a series of predictions about the
future of corpus linguistics. We outlined a number of pressures that would
define the future shape of corpora, related to size, international concerns,
scope and computing. Let us reconsider these in turn.
is developing to set smaller corpora amongst these giants. For the moment it
seems indisputable that corpora will – and can – continue to increase in size.
The impact of this work over the past five years has been astounding. In that
time the learner corpora available have expanded enormously, covering a
range of genres of academic writing and even encompassing learners’ speech.
Specialist symposia on learner corpora have occurred internationally and
more are undoubtedly to come. From being a type of corpus which was
unknown and/or rare, learner corpora have become a mainstay of a new trend
in second language acquisition studies.
Similarly, corpora of the writing of children have started to become
important. While child language corpora such as CHILDES generally represen-
ted spoken language, research on the acquisition and development of literacy
has led researchers to compile corpora of writing by children of various ages
– in the British context, Smith, McEnery and Ivanic (1998) have gathered
corpus data representing the writing of children aged 7 to 11, 5 have
gathered data covering the mid-teen period and Kifle (1997) gathered data
from late teenage writers. In short, when we wrote the first edition no
corpora representing the writing of children and young adults existed, to
our knowledge. Just a few years later, it
is possible to trace the development of academic writing in the age range 7 to
18. The scope of corpora is expanding as fast as we suggested it would.
made in the first edition were quite correct. Corpus linguistics has grown and
our estimates of what would happen have largely proven to be true. Corpora
are getting bigger, they are covering more languages and they are becoming
multimodal. These trends will continue – of that we are quite sure. But what
else can we see happening in the near future? In this section we would like to
set fresh predictions for the future of corpus linguistics – under the general
headings of specialised corpora, non-professional writing, professional writing,
linguistic theory and language engineering.
For the moment, it seems that an inordinate amount of attention has been
paid to the writings of professional authors in corpus linguistics and very little
has been paid to the writing of the vast majority of writers on the planet, who
are non-professional authors. We would hope that this imbalance will change
in the near future.
construct than written corpora. However, with the advent of corpora such as
the spoken corpus of the BNC, spoken corpus resources of some relevance to
such researchers are now clearly available. With the publication of the Survey
of English Dialects, further spoken language data will become available, this
time including both the acoustic and transcribed data, which should prove
helpful to researchers in sociolinguistics and pragmatics. Also, corpora which
include more of the context of their production are vital if work in areas such
as discourse analysis and pragmatics is going to exploit corpus data.
Let us finally turn to the question of relevance. In short, why should
researchers in, say, sociolinguistics use corpus data? Sociolinguistics has been
using corpora of sorts for decades – what has corpus linguistics got to offer
them that they do not have already? The answer, we believe, is shown in the
way corpus linguists are using micro- and large-scale corpora at the moment
– corpora such as the BNC can provide a useful yardstick to compare micro-
corpora against. If a sociolinguist believes that a particular word/structure is
typical of a dialect, then general-purpose corpora may be able to provide
evidence to support that hypothesis. A further way corpus linguistics differs
from other areas of linguistics in which naturally occurring language data has
been examined is with respect to the articulation and manipulation of data.
Corpus linguistics has had a focus upon the development of schemes to allow
corpora to be encoded reliably. On the basis of that encoding, corpus linguists
have developed programs to exploit that data. This emphasis on the system-
atic encoding of data and tools for its manipulation is an area where corpus
linguistics excels. So, as well as the corpora produced, other areas of linguistics
should take an interest in how corpora have been composed and manipulated
by corpus linguists.We hope that in the near future a full marriage of corpus
linguistics with a wide range of linguistic theories will occur, in part
encouraged by observations such as those we have made here.
7.3. CONCLUSION
This is not a conclusion – it cannot be. Corpus linguistics is an area which
seems to be developing at such an amazing rate that any conclusions can only
be halting and temporary. Indeed, by the time this manuscript appears in
published form, further developments may already have begun. There is little
doubt in our mind that, within a few years of the appearance of this book, the
field will have moved on so far that we will need to substantially rewrite this
book. We welcome that. Corpus linguistics needs to develop further and to
continue doing so into the foreseeable future – because in doing so, it is
changing our view of what we should do when using language data to explore
linguistic hypotheses. What was wild fancy in one year suddenly becomes
technically possible in the next. What seemed like an off-the-wall idea for a
corpus at one point suddenly seems like the best way to approach a subject
when the corpus is built and exploited. Corpora are challenging our view of
what it is possible to study in linguistics and how we should study it. Such
challenges should always be welcomed.
NOTES
1. See http://www.hd.uib.no/AcoHum/nel. for details of such work. Also http://crl.nmsu.edu/
Research/Projects/expedition/index.html.
2. See http://www.ling.lancs.ac.uk/monkey/ihe/mille/1fra1.htm, for example.
3. See http://www.crl.nmsu.edu/Research/Projects/shiraz, for example.
4. See http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/cecl.htm.
5. A corpus of essays by school children writing in British English as an L1. See
http://www.bricle.f2s.com/apu_corpus.htm for details.
6. This is a corpus of dialect data from the British Isles. Check the Routledge website for release date
and availability: http://www.routledge.com.
7. See http://www.mpi.nl/world/tg/lapp/lapp.html.
8. See http://www.comp.lancs.ac.uk/computing/ucrel/claws, for example.
Corpus Ling/Appdxs 22/2/01 5:37 pm Page 196
Corpus Ling/Appdxs 22/2/01 5:38 pm Page 197
Glossary
alignment the practice of defining explicit links between texts in a parallel corpus
anaphora pronouns, noun phrases, etc. which refer back to something already
mentioned in a text; sometimes the term is used more loosely – and, technically,
incorrectly – to refer in general to items which co-refer, even when they do not
occur in the text itself (exophora) or when they refer forwards rather than
backwards in the text (cataphora).
annotation (i) the practice of adding explicit additional information to machine-
readable text; (ii) the physical representation of such information
Baum-Welch algorithm a way of finding the path through the states of a hidden-
Markov model which maximizes the probability score output from the model.
The algorithm arranges the calculations undertaken so that they are done
efficiently using a programming technique called ‘dynamic programming’
CALL computer-aided (or assisted) language learning
COCOA reference a balanced set of angled brackets (<>) containing two things: a
code standing for a particular type of information (e.g. A=AUTHOR) and a string
or set of strings, which are the instantiations of that information (e.g., CHARLES
DICKENS). Thus a COCOA reference indicating that Charles Dickens is the author
of the text might look like this: <A CHARLES DICKENS>
concordance a comprehensive listing of a given item in a corpus (most often a word
or phrase), also showing its immediate context
context-free phrase structure grammar a phrase structure grammar is a grammar
made up of rules which define from which elements phrase types (such as noun
phrases and verb phrases) are built up and how sentences are built from these
phrase types; a context-free grammar is one in which these rules apply in all cases,
regardless of context
corpus (i) (loosely) any body of text; (ii) (most commonly) a body of machine-
readable text; (iii) (more strictly) a finite collection of machine-readable text,
sampled to be maximally representative of a language or variety
dependency grammar a type of formal grammar which shows the dependencies
between different items in a sentence, e.g., which items are governed by which
items; e.g., in the sentence Drop it, the verb governs the pronoun it
DTD Document Type Definition; in the TEI, a formal representation which tells the
user or a computer program what elements a text contains and how these
elements are combined. It also contains a set of entity declarations, for example,
representations of non-standard characters
EAGLES (Expert Advisory Groups on Language Engineering Standards) an EU-
sponsored project to define standards for the computational treatment (e.g.
annotation) of EU languages
element in TEI terminology, a unit of text
empiricism an approach to a subject (in our case linguistics) which is based upon
the analysis of external data (such as texts and corpora); contrast rationalism
entity reference in the TEI, a shorthand way of encoding information in a text
full parsing a form of grammatical analysis which represents all of the grammatical
relationships within a sentence
hapax legomena terms that occur only once in a corpus
header a part of an electronic document preceding the text proper and containing
information about the document such as author, title, source and so on
immediate constituent in a phrase structure grammar, an immediate constituent is
one which belongs directly within a phrase
knowledge base a source of rules used by an artificial intelligence program to carry
out some task. Note that these rules are usually written in a formal, logical
notation
KWAL (key word and line) a form of concordance which can allow several lines of
context either side of the key word
KWIC (key word in context) a form of concordance in which a word is given within
x words of context and is normally centred down the middle of the page
lemma the headword form that one would look for if looking up a word in a
dictionary, e.g., the word-form loves belongs to the lemma LOVE
lexicon essentially synonymous with ‘dictionary’ – a collection of words and
information about them; this term is used more commonly than ‘dictionary’ for
machine readable dictionary databases
modal verb a verb which represents concepts such as possibility, necessity, etc.
English examples include can, should, may
monitor corpus a growing, non-finite collection of texts, of primary use in
lexicography
multivariate statistics those statistical methods which deal with the relationships
between many different variables
mutual information a statistical measure of the degree of relatedness of two
elements
parallel corpus a corpus which contains the same texts in more than one language
parse to assign syntactic structure (often of a phrase structure type) to text
rationalism an approach to a subject (in our case linguistics) which is based upon
introspection rather than external data analysis; contrast empiricism.
representative of a sample, one that approximates as closely as possible to the population from
which it is drawn
semantic field a mental category representing a portion of the universe as we
perceive it; in linguistic terms, a semantic field can be said to consist of a set of
words that are related in terms of their senses or the domain of activity with which
they are associated. For example, the words football, hurdles, javelin and Olympics
could all be said to be members of a notional semantic field of ‘sport’.
significant reaching a degree of statistical certainty at which it is unlikely that a result
is due purely to chance
skeleton parsing a form of grammatical analysis which represents only a basic subset
of the grammatical relationships within a sentence
sublanguage a constrained variety of a language. Although a sublanguage may be
naturally occurring, its key feature is that it lacks the productivity generally
associated with language
tag (i) a code attached to words in a text representing some feature or set of features
relating to those words; (ii) in the TEI, the physical markup of an element such as
a paragraph
tagset a collection of tags in the form of a scheme for annotating corpora
TEI (Text Encoding Initiative) an international project to define standards for the
format of machine readable texts
thesaurus a lexicographic work which arranges words in meaning-related groups
rather than alphabetically; the term is also sometimes used to refer to any
dictionary-style work; however, this latter definition is only rarely encountered in
work on modern languages
transition matrix a repository of transitional probabilities. This is an n-dimensional
matrix, depending on the length of transition in question. So, for example,
with a digram transition, we require a two-dimensional matrix
treebank a corpus which has been annotated with phrase structure information
variable rule analysis a form of statistical analysis which tests the effects of
combinations of different factors and attempts to show which combination
accounts best for the data being analysed
WSD (Writing System Declaration) in the TEI, a formal specification defining the
character set used in encoding an electronic text
Z-score a statistical measure of the degree of relatedness of two elements
Appendix A:
Corpora mentioned in the text
This appendix contains brief details of corpora mentioned in the text of this book. More
detailed information, and information on a number of other corpora, can be found
in the surveys carried out by Taylor, Leech and Fligelstone (1991) and Edwards (1994).
For up-to-date links to corpora and tools, see also Michael Barlow's corpus linguistics
web site: http://www.ruf.rice.edu/~barlow/corpus.html.
Note that a number of the corpora mentioned here (Helsinki Diachronic, Lampeter,
ICE-East Africa, London-Lund, Brown, Kolhapur and the Lancaster Parsed Corpus
among them), together with several others, are available on a handy CD-ROM from
ICAME. The CD-ROM also provides the WordSmith, Wordcruncher, Lexa and other
retrieval programs (for which see Appendix B). Details can be found on the Internet at:
http://www.hd.uib.no/corpora.html.
We include here details of e-mail, Gopher and World-Wide-Web information
sources as well as ordinary postal addresses. For specific information on how to access
these electronic information sources, please contact the computer services department
of your university, college or institution.
TEXT COLLECTIONS
APHB
The American Printing House for the Blind corpus is a treebanked corpus of fiction
text produced for IBM USA at Lancaster University. Not available for research purposes.
DIALECT CORPORA
HELSINKI CORPUS OF ENGLISH DIALECTS
A corpus of approx. 245,000 words of spoken dialect English from several regions of
England. Speakers are elderly and rural in conversation with a fieldworker.
Contact the Department of English, University of Helsinki, PO Box 4, 00014,
Helsinki, Finland.
HISTORICAL CORPORA
A REPRESENTATIVE CORPUS OF HISTORICAL ENGLISH REGISTERS (ARCHER)
A corpus of both British and American English divided into 50-year periods between
1650 and 1990. The corpus contains both spoken and written language. It is being
part-of-speech tagged.
Contact Doug Biber, Department of English, Northern Arizona University, Flagstaff,
AZ 86011-6032, USA (e-mail: [email protected]).
MONOLINGUAL CORPORA
MIXED CHANNEL
THE BIRMINGHAM CORPUS
A corpus of approx. 20,000,000 words. Approximately 90% of the corpus is written
material and 10% is spoken material. The corpus consists mainly of British English,
although some other varieties are also represented.
Contact The Bank of English, Westmere, 50 Edgbaston Park Road, Birmingham, B15
2RX.
SPOKEN
A CORPUS OF ENGLISH CONVERSATION
The London-Lund corpus, minus the additional examples of more formal spoken
English added in the 1970s.
Availability In book form: Svartvik and Quirk (1980).
WRITTEN
THE BROWN CORPUS
A corpus of approx. 1,000,000 words of written American English text dating from 1961.
Availability Contact International Computer Archive of Modern English (ICAME),
Norwegian Computing Centre for the Humanities, Harald Hårfagresgt. 31, N-5007
Bergen, Norway (e-mail: [email protected]).
Contact Della Summers, Longman Dictionaries, Longman House, Burnt Mill, Harlow,
Essex, CM20 2JE (WWW: http://www.awl-elt.com/dictionaries/lclonlan.shtml).
MULTILINGUAL CORPORA
THE AARHUS CORPUS OF CONTRACT LAW
Three 1,000,000-word subcorpora of Danish, English and French respectively. Texts
are taken from the area of contract law. This is not a parallel corpus.
Availability Contact Karen Lauridsen,The Aarhus School of Business, Fuglesangs Allé
4, DK-8210 Aarhus V, Denmark.
MILLE CORPORA
Corpus building for a range of Indic languages is under way at Lancaster. Some
sample corpora are already built and available for use on request, subject to agreement.
Availability Contact Tony McEnery, Dept. Linguistics, Lancaster University, Bailrigg,
Lancaster, LA1 4YT (e-mail: [email protected]).
Appendix B:
Some software for
corpus research
Here we provide brief details of some of the more important software for corpus-
based research. More detailed information and information on other software can be
obtained from Hughes and Lee (1994) and Lancashire (1991). For up-to-date links to
tools, see also Michael Barlow's corpus linguistics web site: http://www.ruf.rice.edu/
~barlow/corpus.html.
CONCORDANCERS ETC.
FOR IBM-COMPATIBLE PCS
LEXA
Sophisticated PC-based corpus analysis system. Lexa produces lexical databases and
concordances. The program is able to handle texts marked with COCOA references. Lexa
goes beyond the basic frequency and concordance features of most corpus analysis
programs and also enables simple (i.e. pattern-matched) tagging and lemmatisation
routines to be run.
Availability By FTP from ftp://www.hd.uib.no/lexainf.html. Also available on the
ICAME CD-ROM (see Appendix A).
MICRO CONCORD
KWIC concordancer for the PC, especially suited to pedagogic applications.
Availability Contact Electronic Publishing, Oxford University Press, Walton Street,
Oxford, OX2 6DP.
MONOCONC
A simple concordance package that runs under Windows on IBM-compatible PCs. A
free version is also available for the Macintosh (see the following section, For the
Apple Mac).
Availability Contact Athelstan, 2476 Bolsover Suite 464, Houston, TX 77005 (e-mail:
[email protected]).Web page: http://www.athelstan.com.
MULTICONC
A concordancer for aligning and browsing parallel texts. The program runs under
Windows on a PC.
Availability Contact the software developers via http://web.bham.ac.uk/johnstf/
lingua.htm.
SARA
A sophisticated PC-based concordancer designed specifically to handle texts which use
TEI/SGML markup.
Contact Lou Burnard, Oxford University Computing Services, 13 Banbury Road,
Oxford OX2 6NN.
TACT
Freeware package. TACT is sometimes called ‘the poor man’s Wordcruncher’ (q.v.) as
its functionality is quite similar to that of Wordcruncher. The program’s basic outputs
are KWAL and KWIC concordances and frequency lists. It also enables the user to
produce graphs of the distribution of words through a text or corpus and includes an
option for identifying statistically significant collocations using the Z-score. Further
features are a basic collocation list generator and the ability to group words for search-
ing according to user-defined categories (e.g., semantic fields). TACT requires the user
to convert the raw text into a TACT database using a program called MAKBAS: this
requires a certain amount of skill if full advantage is to be taken of referencing, etc.
Availability Contact Centre for Computing in the Humanities, Room 14297A, Robarts
Library, University of Toronto, Toronto, Ontario, M5S 1A5, Canada (e-mail: cch@epas.
utoronto.ca); also available by anonymous FTP from the aforementioned (epas.
utoronto.ca) and from ICAME (ftp://epas.utoronto.ca/pub/cch/tact/tact2.1).
WORDCRUNCHER
Easy-to-use package which can produce frequency listings, KWAL and KWIC concord-
ances and concordances of user-selected collocations. It can also produce word-
distribution statistics. Like TACT, Wordcruncher requires texts to be in a specially
indexed format. The LOB, Brown, London-Lund, Kolhapur and Helsinki Diachronic
corpora are available on CD-ROM from ICAME in a ready-indexed form for use with
Wordcruncher (the address of ICAME is given under entries for those corpora in
Appendix A).
Availability At the time of writing, it seems that a new version is becoming available,
but full details have yet to filter through to the corpus linguistics user community
(posting no. 14.0202 by Willard McCarty to the Humanist discussion list, 29 August
2000).Web page: http://www.wordcruncher.com.
WORDSMITH TOOLS
A set of tools by Mike Scott for IBM-compatible PCs running Microsoft Windows. The
tools enable the user to produce wordlists and key-word-in-context concordances.
Other features include the ability to compare two wordlists (using the log-likelihood
statistic), the ability to identify and extract collocations and word clusters, and an
aligner and browser for parallel texts.
Availability Licences for the software can be purchased from Electronic Publishing, Oxford
University Press, Walton Street, Oxford, OX2 6DP (e-mail: [email protected]). See
also Mike Scott’s web page: http://www.liv.ac.uk/~ms2928/homepage.html.
FOR THE APPLE MAC
CONCORDER
A KWIC concordancer.
Contact David Rand, Les Publications CRM, Université de Montréal, CP 6128-A,
Montréal, Québec, H3C 3J7, Canada (e-mail: [email protected]). WWW:
http://www.crm.umontreal.ca/~rand/Concorder.html.
MONOCONC
A simple concordancer for the Macintosh.
Availability Can be downloaded free of charge from http://www.ruf.rice.edu/
~barlow/mono.html. However, you are asked to notify the author, Michael Barlow,
when you do this (e-mail: [email protected]).
PARACONC
A Mac concordancer for parallel (i.e. aligned multilingual) texts. A version for IBM-
compatible PCs (MS Windows) is forthcoming.
Availability Can be downloaded free of charge from http://www.ruf.rice.edu/
~barlow/para.html. However, you are asked to notify the author, Michael Barlow,
when you do this (e-mail: [email protected]).
HUM
A freeware package of C programs which includes, amongst other things, a concord-
ancer, frequency program and word length charting program. The concordancer
produces an exhaustive concordance of a text rather than a concordance of a selected
word or words.
Availability By FTP from ftp://clr.nmsu.edu/CLR/tools/concordances.
QWICK
A concordance and collocation program that is freely available for download from the
Internet. Since it is implemented in Java, it is platform-independent.The collocation
component is particularly sophisticated and incorporates a choice of statistics, inclu-
ding mutual information, Daille’s (1995) cubic association coefficient and the log-
likelihood statistic. Corpora for use with require pre-indexing with ,
another freely available Java application from the same source. is also distribu-
ted, with ready-indexed data, on the (for the corpus) and on the
sampler .
Availability Downloadable from http://www-clg.bham.ac.uk/QWICK/ (for )
and http://www-clg.bham.ac.uk/CUE/ (for ).
PART-OF-SPEECH TAGGERS
CLAWS
A part-of-speech tagger for English which makes use of a probabilistic model trained
on large amounts of manually corrected analysed text.
Contact Prof. Geoffrey Leech, Department of Linguistics and Modern English
Language, Lancaster University, Lancaster LA1 4YT (e-mail: [email protected]).
XEROX TAGGER
A part-of-speech tagger, developed at the Xerox PARC laboratories, which is based
upon a largely self-training Markov Model. The basic tagging program is language-
independent and is being used at the Universidad Autónoma de Madrid to tag the
Spanish part of the CRATER corpus.
Availability By anonymous FTP from parcftp.xerox.com in the directory pub/tagger.
STATISTICAL SOFTWARE
VARBRUL
A statistical package for the production of cross-tabulations and regression analyses (see
Chapter 3).
Availability A version for the PC can be obtained by FTP from ftp://ftp.cis.upenn.edu/
pub/ldc/misc_sw/varbrul.tar.Z.
A version for the Macintosh (GoldVarb) can also be obtained by FTP from a different
address (see http://www.crm.umontreal.ca/~sankoff/GoldVarb_Frn.html).
Appendix C:
Suggested solutions to exercises
CHAPTER 1
1. It is highly unlikely that you will find the sentence repeated in its exact form
anywhere except in quotation. This in itself is a powerful example of the highly
productive nature of natural language grammars. Look at the section on
sentence types in Chapter 6. You will see that, even if we change our definition
of matching sentence to one that matches at the level of constituent structure,
it is unlikely that we will find a repetition of any given sentence in uncon-
strained natural language.
2. It is, of course, easy to start infinite sentences. Recursive rules in natural
language grammars, such as clause subordination rules, ensure that infinitely
many and infinitely long sentences are technically possible. Such rules alone guar-
antee the infinite nature of natural language.
3. The word sorting is used three times in this chapter. This took us about five
seconds to discover using a concordance package. Using the human eye, it would
have taken us a great deal longer and the result may have been inaccurate.
4. The ten most frequent words (with their frequencies, ignoring case) in the
chapter are:
540 the
438 of
231 a
226 to
184 in
168 corpus
159 and
152 is
137 that
110 this
While it is likely that you will have guessed some of the words on this list, it
is less likely that you guessed all of them. Even less likely is that you guessed the
relative and precise frequencies accurately.
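The frequency list above was produced with a concordance package. Purely as an
illustration, a count of this kind can be sketched in a few lines of Python; the
filename chapter1.txt is a stand-in for a plain-text version of the chapter, and
the simple definition of a ‘word’ used here is an assumption, not the tokenisation
rule of any particular package.

# Count word frequencies in a plain-text file, ignoring case.
import re
from collections import Counter

with open('chapter1.txt', encoding='utf-8') as f:
    text = f.read().lower()

# Treat a word as a run of letters, optionally with an internal apostrophe.
words = re.findall(r"[a-z]+(?:'[a-z]+)?", text)

counts = Counter(words)
for word, freq in counts.most_common(10):
    print(f'{freq:6d} {word}')

Running the script over a different text will, of course, produce a different
list, although closed-class words of the kind shown above tend to dominate the
top of any such list.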
CHAPTER 3
1. The mountain dialect shows a somewhat stronger preference than the coastal
dialect for the use of a relativiser to introduce the relative clause. However, the
chi-square value has a p value of 0.0661; this is somewhat higher than the
normally accepted level for statistical significance (0.05) and thus we cannot say
that the difference is significant, i.e., there is a possibility that the difference may be
due to chance rather than a genuine grammatical difference in the two dialects (a
sketch of such a test appears after these solutions).
3. DO, HAVE and TAKE have the same proportion in both corpora.
4. a) The chi-square test (or a similar test such as the t-test) is used to test for
simple significance between two or more sets of frequencies
b) Correspondence analysis will give you a pictorial representation of the texts
and the words on the same set of axes so that you can see how similar or
different individual texts are and in roughly what aspects of their vocabularies
vary. Factor analysis would provide very similar results. Cluster analysis could
be used to group the texts according to similarity, but it would be harder to
see what areas of the vocabulary are influencing the similarities and differences.
c) Loglinear or VARBRUL analysis enables you to test combinations of different
factors until you arrive at the most likely combination of factors.
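To make the kind of test mentioned in solution 1 concrete, the sketch below runs
a chi-square test on an invented two-by-two table of relative clauses with and
without a relativiser in the two dialects. The counts are purely illustrative
and are not the exercise data, and the use of the scipy library is assumed as
one convenient way of obtaining the statistic.

# Chi-square test on a hypothetical 2 x 2 contingency table.
from scipy.stats import chi2_contingency

table = [
    [63, 37],   # mountain dialect: [with relativiser, without relativiser]
    [48, 52],   # coastal dialect:  [with relativiser, without relativiser]
]

chi2, p, dof, expected = chi2_contingency(table)
print(f'chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}')

# Only a p value below the conventional 0.05 threshold would let us treat the
# difference between the dialects as statistically significant.
if p < 0.05:
    print('The difference is significant at the 0.05 level.')
else:
    print('The difference is not significant at the 0.05 level.')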
CHAPTER 5
1. The answers are given below.
Sentence one
The (Article) dog (Verb, Noun or Adjective) ate (Verb, Noun or Adjective)
the (Article) bone (Verb, Noun or Adjective)
The and the have been tagged from the lexicon.
All other words have been given the default set of tags.
2. Using your knowledge of grammar, you could deduce the parts of speech for
each word. For example, you could work out that bone was a noun, acting as the
head of a noun phrase premodified by an article.You could work out that dog
was intended as a noun rather than a verb by a similar process. In the second
sentence, old and fat are in a modifying position in a noun phrase, preceded by
a modifying genitive phrase and followed by the head of the noun phrase.As old
and fat are attributes of the cat, we would label them as adjectives. getting is a
main verb premodified by a form of the primary verb be acting as an auxiliary.
lazy is an adjective appearing in a complement rôle.
3. This is the answer to the question, working from left to right disambiguating
pairs of words and then proceeding, assuming that the correct answer has been
found. The best answer is emboldened in each case.
Using the transition tables, we arrive at the analysis ‘The (A) dog (N) ate (V) the
(A) bone (N)’. A short sketch of this procedure in code follows these solutions.
The dog
Tag sequence Digram probabilities
A,N 64
A,V 0
A,J 29
dog ate
Tag sequence Digram probabilities
N,V 17
N,N 8
N,J 0
the bone
Tag sequence Digram probabilities
A,N 64
A,V 0
A,J 29
4. There is little doubt that with the resources provided you rapidly started to
generate bad analyses for the sentences you chose. This is because the lexical
resources given only cover the closed-class parts of speech. Assuming that every
unknown word can be a noun, verb or adjective soon makes the potential
number of part-of-speech combinations too large and the results too uncertain.
If you generate a supplementary lexicon of nouns, adjectives and main verbs, the
results will improve accordingly.
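Finally, the left-to-right digram procedure followed in solution 3 can be
sketched in a few lines of Python. The lexicon, the default tag set and the
digram scores below mirror the toy resources used in the exercise; the handling
of the first word and of unseen tag pairs are assumptions made purely for
illustration, not the workings of any particular tagger.

# Digram scores from the tables in solution 3: (previous tag, next tag) -> score.
transitions = {
    ('A', 'N'): 64, ('A', 'V'): 0, ('A', 'J'): 29,
    ('N', 'V'): 17, ('N', 'N'): 8, ('N', 'J'): 0,
}

lexicon = {'the': ['A']}          # closed-class words are tagged from the lexicon
default_tags = ['N', 'V', 'J']    # unknown words receive the default open-class tags

def tag(sentence):
    tags = []
    for word in sentence.lower().split():
        candidates = lexicon.get(word, default_tags)
        if not tags:
            tags.append(candidates[0])   # first word: no previous tag to score against
        else:
            prev = tags[-1]
            # Pick the candidate with the best digram score; unseen pairs score 0.
            best = max(candidates, key=lambda t: transitions.get((prev, t), 0))
            tags.append(best)
    return list(zip(sentence.split(), tags))

print(tag('The dog ate the bone'))
# [('The', 'A'), ('dog', 'N'), ('ate', 'V'), ('the', 'A'), ('bone', 'N')]

Disambiguating greedily in this way can go wrong where an early choice rules out
a better overall sequence, which is one reason why practical taggers score whole
sequences of tags rather than isolated pairs.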
CHAPTER 6
1. One important question you would have to ask is what corpus data you would
need. Various corpora of speech data are available, but none seem to have the
same categorisation as the written corpora. So, unless you want to compare
specific sections of a spoken corpus with specific parallel sections of a written
corpus (if those could be found), you would be looking to create an overall
comparison of the two. With that said, you may still subdivide one of the two
corpora and, say, compare female spoken data with written data, but the need
for such a division would have to be apparent in the motivation for the study.
In methodological terms, one may find it difficult to get access to as much
spoken as written data (though that situation is slowly changing).
2. Two obvious strategies spring to mind. Firstly, collect a corpus of such writing
yourself. Secondly, and preferably, access the scientific writing section of an
existing corpus, such as LOB or the BNC, and edit the corpus texts to create the
subview of the corpus appropriate to your study. Issues of representativeness and
sampling would still remain – see Chapter 3 for more details.
Bibliography
Copeland, C., Durand, J., Krauwer, S., and Maegaard, B. (eds.) (1991) Studies in
Machine Translation and Natural Language Processing, vol. 1: The Eurotra Linguistic
Specifications, Luxembourg: Office for Official Publications of the Commission of
the European Communities.
Crowdy, S. (1993) ‘Spoken corpus design’, Literary and Linguistic Computing 8(4):
259–65.
Crystal, D., and Davy, D. (1969) Investigating English Style, London: Longman.
Cutting, D., Kupiec, J., Pedersen, J., and Sibun, P. (1992) ‘A practical part-of-speech
tagger’, in Proceedings of the Third Conference on Applied Natural Language Processing
(ANLP-92), Trento, Italy, pp. 133–40.
Daille, B. (1995) Combined Approach for Terminology Extraction: Lexical Statistics and
Linguistic Filtering, Unit for Computer Research on the English Language Technical
Papers 5, Lancaster University.
Debili, F., and Sammouda E. (1992) ‘Appariement des phrases de textes bilingues
français-anglais et français-arabe’, in Proceedings of COLING-92, Nantes.
Déroualt, A. M., and Merialdo, B. (1986) ‘Natural language modeling for phoneme-
to-text transcription’, IEEE Transactions on Pattern Analysis and Machine Intelligence
8(6): 742–9.
Dunning, T. (1993) ‘Accurate methods for the statistics of surprise and coincidence’,
Computational Linguistics 19(1): 61–74.
Eaton, H. (1940) Semantic Frequency List for English, French, German and Spanish, Chicago:
Chicago University Press.
Edwards, J. A. (1994) ‘Survey of electronic corpora and related resources for language
researchers’, in Edwards and Lampert 1994, pp. 263–310.
Edwards, J. A., and Lampert, M. D. (eds.) (1994) Talking Data: Transcription and Coding
in Discourse Research, Hillside, NJ: Lawrence Erlbaum Associates.
El-Béze, M. (1993) ‘Les modèles de langage probabilistes: quelques domaines
d’applications’, Habilitation à diriger des recherches, Université de Paris Nord.
Elithorn, A., and Banerji, R. (eds.) (1984) Artificial and Human Intelligence, Brussels:
NATO Publications.
Fairclough, N. (1993) Discourse and Social Change. Cambridge: Polity Press.
Fillmore, C. (1992) “‘Corpus linguistics” or “Computer-aided armchair linguistics”’,
in Svartvik 1992, pp. 35–60.
Firth, J. R. (1957) Papers in Linguistics 1934–1951, Oxford: Oxford University Press.
Fletcher, P., and Garman, M. (1988) ‘Normal language development and language
impairment: syntax and beyond’, Clinical Linguistics and Phonetics 2(2): 97–113.
Fordham, A., and Croker, M. (1994) ‘A stochastic government and binding parser’, in
Proceedings of the International Conference on New Methods in Language Processing,
CCL, UMIST, pp. 190–7.
Francis, W. (1979) ‘Problems of assembling, describing and computerizing large
corpora’, in Bergenholtz and Schaeder 1979, pp. 110–23.
Fried,V. (1972) The Prague School of Linguistics and Language Teaching, Oxford: Oxford
University Press.
Fries, C. (1952) The Structure of English: An Introduction to the Construction of Sentences,
New York: Harcourt-Brace.
Fries, C., and Traver, A. (1940) English Word Lists. A Study of their Adaptability and
Instruction, Washington, DC: American Council of Education.
Fries, U.,Tottie, G., and Schneider, P. (eds.) (1994) Creating and Using English Language
Corpora, Amsterdam: Rodopi.
Fries, U., Müller, V., and Schneider, P. (1997) From Aelfric to the New York Times.
Amsterdam: Rodopi.
Gale, W. A., and Church, W. A. (1991) ‘A program for aligning sentences in bilingual
corpora’, in Proceedings of ACL-91, Berkeley.
Garnham, A., Shillcock, R., Brown, G., Mill, A., and Cutler, A. (1981) ‘Slips of the tongue
in the London-Lund corpus of spontaneous conversation’, Linguistics 19: 805–17.
Garside, R. (1987) ‘The CLAWS word-tagging system’, in Garside, Leech and
Sampson 1987, pp. 30–40.
Garside, R. (1993a) ‘The large-scale production of syntactically analysed corpora’,
Literary and Linguistic Computing 8(1): 39-46.
Garside, R. (1993b) ‘The marking of cohesive relationships: tools for the construction
of a large bank of anaphoric data’, ICAME Journal 17: 5-27.
Garside, R., and McEnery, A. (1993) ‘Treebanking: the compilation of a corpus of
skeleton parsed sentences’, in Black, Garside and Leech 1993, pp. 17-35.
Garside, R., Leech, G., and Sampson, G. (eds.) (1987) The Computational Analysis of
English: A Corpus Based Approach, London: Longman.
Garside, R., Hutchinson, J., Leech, G. N., McEnery, A. M., and Oakes, M. P. (1994)
‘The exploitation of parallel corpora in projects ET10/63 and CRATER’, New
Methods in Language Processing, UMIST, UK, pp. 108–15.
Garside, R., Leech, G., and McEnery, A. M. (eds.) (1997) Corpus Annotation, London:
Longman.
Gärtner, K., Sappler, P., and Trauth, M. (eds.) (1991) Maschinelle Verarbeitung altdeutscher
Texte IV, Tübingen: Niemeyer.
Garvin, P. (ed.) (1963) Natural Languages and the Computer, New York: McGraw-Hill.
Gaussier, E. (1995) ‘Some methods for the extraction of bilingual terminology’, Ph.D.
thesis, Université de Paris Jussieu.
Gaussier, E., Langé, J.-M., and Meunier, F. (1992) ‘Towards bilingual terminology
extraction’, Proceedings of the ALLC/ACH Conference, Oxford: Oxford University
Press, pp. 121–4.
Gerbner, G., Holsti, O. R., Krippendorff, K., Paisley,W. J., and Stone, P. J. (eds.) (1969)
The Analysis of Communication Content, New York: John Wiley.
Gibbon, D., Moore, R., and Winski, R. (eds.) (1997) Handbook of Standards and
Resources for Spoken Language Systems, Berlin: Mouton de Gruyter.
Giering, D., Graustein, G., Hoffmann, A., Kirsten, H., Neubert, A., and Thiele, W.
(1979) English Grammar – A University Handbook, Leipzig:VEB Verlag.
Gougenheim, G., Michéa, R., Rivenc, P., and Sauvegot, A. (1956) L’Elaboration du
français élémentaire, Paris: Didier.
Granger, S. (ed.) (1998) Learner English on Computer, London: Longman.
Greene, B., and Rubin, G. (1971) Automatic Grammatical Tagging of English, Technical
Report, Department of Linguistics, Brown University, RI.
de Haan, P. (1984) ‘Problem-oriented tagging of English corpus data’, in Aarts and
Meijs 1984, pp. 123–39.
de Haan, P., and van Hout, R. (1986) ‘Statistics and corpus analysis’, in Aarts and Meijs
1986, pp. 79–97.
de Haan, P. (1992) ‘The optimum corpus sample size?’ in Leitner 1992, pp. 3–19.
Kytö, M. (1991) Variation and Diachrony, with Early American English in Focus: Studies on
CAN/MAY and SHALL/WILL, Frankfurt/Main: Peter Lang.
Kytö, M., Ihalainen, O., and Rissanen, M. (eds.) (1988) Corpus Linguistics Hard and Soft,
Amsterdam: Rodopi.
Kytö, M., Rissanen, M., and Wright, S. (eds.) (1994) Corpora across the Centuries,
Amsterdam: Rodopi.
Labov,W. (1969) ‘The logic of non-standard English’, Georgetown Monographs on Language
and Linguistics 22. Reprinted in P. P. Giglioli (ed.) Language and Social Context,
London: Penguin, 1992.
Lafon, P. (1984) Dépouillements et statistiques en lexicométrie, Geneva: Slatkine-Champion.
Lancashire, I. (ed.) (1991) The Humanities Computing Yearbook 1989–90, Oxford:
Oxford University Press.
Langendoen, D. T. (1968) The London School of Linguistics: A Study of the Linguistic
Theories of B. Malinowski and J. R. Firth, Cambridge, MA: MIT Press.
Leech, G. (1991) ‘The state of the art in corpus linguistics’, in Aijmer and Altenberg
1991, pp. 8–29.
Leech, G. (1992) ‘Corpora and theories of linguistic performance’, in Svartvik 1992,
pp. 105–22.
Leech, G. (1993) ‘Corpus annotation schemes’, Literary and Linguistic Computing 8(4):
275–81.
Leech, G., and Fallon, R. (1992) ‘Computer corpora – what do they tell us about
culture?’, ICAME Journal 16: 29–50.
Leech, G., and Short, M. (1981) Style in Fiction, London: Longman.
Leech, G., and Wilson, A. (1994) EAGLES Morphosyntactic Annotation, EAGLES
Report EAG-CSG/IR-T3.1, Pisa: Istituto di Linguistica Computazionale.
Leech, G., Garside, R., and Bryant, M. (1994) ‘The large-scale grammatical tagging of
text: experience with the British National Corpus’, in Oostdijk and de Haan
1994b, pp. 47–63.
Leech, G., Barnett, R., and Kahrel, P. (1995) Guidelines for the Standardization of
Syntactic Annotation of Corpora, EAGLES Report EAG-TCWG-SASG/1.8, Pisa:
Istituto di Linguistica Computazionale.
Leech, G., McEnery, A. M., and Wynne, M. (1997) ‘Further levels of annotation’, in
Garside, Leech and McEnery 1997, pp. 85–101.
Leech, G., Myers, G., and Thomas, J. (eds.) (1995) Spoken English on Computer, London:
Longman.
Leech, G., Wilson, A., and Rayson, P. (forthcoming) Word Frequencies in Present-Day
British Speech and Writing, London: Longman.
Leitner, G. (1991) ‘The Kolhapur corpus of Indian English: intravarietal description
and/or intervarietal comparison’, in Johansson and Stenström 1991, pp. 215–32.
Leitner, G. (ed.) (1992) New Dimensions in English Language Corpora, Berlin: Mouton
de Gruyter.
Lenat, D., and Guha, R. (1990) Building Large Knowledge Based Systems, New York:
Addison Wesley.
Ljung, M. (1990) A Study of TEFL Vocabulary, Stockholm: Almqvist and Wiksell.
Ljung, M. (ed.) (1997) Corpus-Based Studies in English, Amsterdam: Rodopi.
Longman Dictionary of Contemporary English (1995) London: Longman.
Lorge, I. (1949) Semantic Content of the 570 Commonest English Words, New York:
Columbia University Press.
Sedelow, S., and Sedelow,W. (1969) ‘Categories and procedures for content analysis in
the humanities’, in Gerbner et al. 1969, pp. 487–99.
Sekine, S. (1994) ‘A new direction for sublanguage NLP’, in Proceedings of the International
Conference on New Methods in Language Processing, CCL, UMIST, pp. 123–9.
Seuren, P. (1998) Western Linguistics: an Historical Introduction, Oxford: Blackwell.
Shannon, C., and Weaver, W. (1949) The Mathematical Theory of Communication. Urbana,
IL: University of Illinois Press.
Shastri, S.V., Patilkulkarni, C. T., and Shastri, G. S. (1986) Manual of Information to
Accompany the Kolhapur Corpus of Indian English for Use with Digital Computers,
Department of English, Shivaji University, Kolhapur.
Siewierska, A. (1993) ‘Syntactic weight versus information structure and word order
variation in Polish’, Journal of Linguistics 29: 233–65.
Sinclair, J. (ed.) (1987) Looking Up: An Account of the COBUILD Project in Lexical
Computing. London: Harper Collins.
Sinclair, J. (1991) Corpus, Concordance, Collocation, Oxford: Oxford University Press.
Singh, S., McEnery, A. M., and Baker, J. (2000) ‘Building a parallel corpus of
Panjabi–English’, in J. Veronis (ed.), Parallel Text Processing: Alignment and Use of
Translation Corpora, Dordrecht: Kluwer, pp. 335–46.
Smadja, F. (1991) ‘From N-grams to collocations: an evaluation of Xtract’, in
Proceedings of the 29th ACL Conference, Berkeley.
Smith, N. I. (1999) Chomsky: Ideas and Ideals, Cambridge: Cambridge University Press.
Smith, N. I., and McEnery, A. M. (1997) ‘Can we improve part-of-speech tagging by
inducing probabilistic part-of-speech annotated lexicons from large corpora?’, in
R. Mitkov and N. Nicolov (eds.), Proceedings of Recent Advances in Natural Language
Processing, Bulgaria, pp. 216–20.
Smith, N. I., McEnery, A. M., and Ivanic, R. (1998) ‘Issues in transcribing a corpus of
children’s handwritten projects’, Literary and Linguistic Computing 13(4).
Souter, C. (1990) ‘Systemic-functional grammars and corpora’, in Aarts and Meijs
1990, 179–212.
Souter, C., and Atwell, E. (eds.) (1993) Corpus Based Computational Linguistics,
Amsterdam: Rodopi.
Sperberg-McQueen, C. M., and Burnard, L. (1994) Guidelines for Electronic Text
Encoding and Interchange (P3), Chicago and Oxford: Text Encoding Initiative.
Stenström, A.-B. (1984a) ‘Discourse items and pauses’, Paper presented at Fifth
ICAME Conference,Windermere. Abstract in ICAME News 9 (1985): 11.
Stenström, A.-B. (1984b) ‘Discourse tags’, in Aarts and Meijs 1984, pp. 65–81.
Stenström, A.-B. (1987) ‘Carry-on signals in English conversation’, in Meijs 1987,
pp. 87–119.
Stern, W. (1924) Psychology of Early Childhood up to Six Years of Age, New York: Holt.
Stone, P., Dunphy, D., Smith, M., and Ogilvie, D. (1966) The General Inquirer: A
Computer Approach to Content Analysis, Cambridge, MA: MIT Press.
Summers, D. (1996).‘Computer lexicography: the importance of representativeness in
relation to frequency’, in Thomas and Short 1996, pp. 260–6.
Svartvik, J. (1966) On Voice in the English Verb, The Hague: Mouton.
Svartvik, J. (ed.) (1990) The London-Lund Corpus of Spoken English, Lund: Lund
University Press.
Svartvik, J. (ed.) (1992) Directions in Corpus Linguistics, Berlin: Mouton de Gruyter.
Svartvik, J., and Quirk, R. (1980) A Corpus of English Conversation, Lund: C.W. K. Gleerup.
Swinscow, T. (1983) Statistics at Square One, London: British Medical Association.
Taylor, L., Grover, C., and Briscoe, T. (1989) ‘The syntactic regularity of English noun
phrases’, in Proceedings of ACL European Chapter Meeting, Manchester, pp. 256–63.
Taylor, L., Leech, G., and Fligelstone, S. (1991) ‘A survey of English machine-readable
corpora’, in Johansson and Stenström 1991, pp. 319–54.
Thomas, J., and Short, M. (eds.) (1996) Using Corpora for Language Research, London:
Longman.
Thompson, H. S. (1997) ‘Towards a base architecture for spoken language
transcript{s,tion}’, paper given at COCOSDA, Rhodes, 1997.
Thompson, J. B. (1984). Studies in the Theory of Ideology, Cambridge: Polity Press.
Thorndike, E. (1921) A Teacher’s Wordbook, New York: Columbia Teachers College.
Thorndike, E., and Lorge, I. (1944) The Teacher’s Word Book of 30,000 Words, New York:
Columbia University Press.
Tottie, G. (1991) Negation in English Speech and Writing: A Study in Variation, San Diego:
Academic Press.
Tsujii, J., Ananiadou, S., Carroll, J., and Sekine, S. (1991) Methodologies for the
Development of Sublanguage MT System II, CCL UMIST Report no. 91/11.
Voutilainen, A. (1995) ‘A syntax-based part of speech analyser’, in Proceedings of the 7th
Conference of the European Chapter of the Association for Computational Linguistics,
Association for Computational Linguistics, pp. 157–64.
Warwick-Armstrong, S., and Russell, G. (1990), ‘Bilingual concordancing and
bilingual lexicography’, in EURALEX 4th International Conference, Malaga, Spain.
West, M. (1953) A General Service List of English Words, London: Longman.
Wichmann, A. (1989) Tone of Voice: A Stylistic Approach to Intonation, Lancaster Papers
in Linguistics 70.
Wichmann, A. (1993) ‘Gradients and categories in intonation: a study of the
perception and production of falling tones’, in Souter and Atwell 1993, pp. 71–84.
Wikberg, K. (1992) ‘Discourse category and text type classification: procedural
discourse in the Brown and the LOB corpora’, in Leitner 1992, pp. 247–61.
Wilson, A. (1989) ‘Prepositional phrase postmodifiers of nominals and their prosodic
boundaries: some data from the Lancaster Spoken English Corpus’, MA thesis,
Lancaster University.
Wilson, A. (1992a) Review of ‘The Oxford Psycholinguistic Database’, Computers and
Texts 4: 15–17.
Wilson, A. (1992b) The Usage of Since: A Quantitative Comparison of Augustan, Modern
British and Modern Indian English, Lancaster Papers in Linguistics 80.
Wilson, A. (1996) ‘Conceptual analysis of later Latin texts: a conceptual glossary and
index to the Latin Vulgate translation of the Gospel of John’, Ph.D. thesis, Lancaster
University.
Wilson, A. (2000) Conceptual Glossary and Index to the Vulgate Translation of the Gospel
According to John, Hildesheim: Olms-Weidmann.
Wilson, A., and McEnery, A. (eds.) (1994) Corpora in Language Education and Research:
A Selection of Papers from Talc94, Unit for Computer Research on the English
Language Technical Papers 4 (special issue), Lancaster University.
Woods,A., Fletcher, P., and Hughes,A. (1986) Statistics in Language Studies, Cambridge:
Cambridge University Press.
Woolls, D. (2000) ‘From purity to pragmatism’, in Botley, McEnery and Wilson 2000,
pp. 116–33.
Wu, D. (1994) ‘Aligning a parallel English–Chinese corpus statistically with lexical
criteria’, Proceedings of the 32nd ACL, New Mexico State University, Les Cruces,
New Mexico, pp. 80–7.
Yates, S. (1993) ‘The textuality of computer-mediated communication: speech,
writing and genre in CMC discourse’, Ph.D. thesis, Milton Keynes: The Open
University.
Zanettin, F. (1994) ‘Parallel words: designing a bilingual database for translation
activities’, in Wilson and McEnery 1994, pp. 99–111.
Zernik, U. (ed.) (1991) Lexical Acquisition: Exploiting On-Line Resources to Build a
Lexicon, Hillsdale, NJ: Lawrence Erlbaum Associates.
Index
language acquisition, 128
language competence, 6–7
language engineering, 133–63, 194
language learning and pedagogy, 4, 119–22
  computer assisted language learning (CALL), 121–2
  languages for specific purposes, 120–1
  textbooks, 93, 119–22
lemmatisation, 53
lexicography, 30, 142, 144–5
  dictionary, 106
lexicon, 137–8, 144
Linguistic Data Consortium, 188
metalinguistic awareness, 13
modal verbs, 95–7
morphology, 109, 138–9
national varieties, 125–6
natural language, 102, 129
parsing, 53–61, 145–51, 170, see also grammar
  constituent structure, 55
  dependency, 57, 61
  full parsing, 55
  phrase structure, 8–10
  skeleton parsing, 55
  treebank, 54–6
performance, 6–7
pragmatics, 114
pseudo-procedure, 12, 13, 16, 17
psycholinguistics, 127–8
  speech errors, 128
qualitative analysis, 76
quantitative analysis, 75–101
rationalism, 5, 111
register, 112
s-type (sentence type), 166, 167, 173
second language acquisition, 119–22
semantics, 4, 61
semantic fields, 62
sense frequencies, 61
social psychology, 129–30
sociolinguistics, 115–17
speech presentation, 118
statistics
  chance, 84–8
  cross-tabulation, 89
  dependent probabilities, 139, 142
  frequency, 4, 10, 15, 82, 142, 143, 170, 172
  information theory, 133
  intercorrelation matrix, 89
  loglinear analysis, 97–8
  multivariate analysis, 88, 114
    cluster analysis, 90, 93
    correspondence analysis, 90, 93
    factor analysis, 89
    Hayashi’s Quantification Method Type III, 94–7
    mapping techniques, 90
    multidimensional scaling, 90
  mutual information, 86
  probability distribution, 84–5, 88, 139–41
  proportion, 83
  ratio, 83
  significance, 84
  significance tests, 84
    chi-square, 77, 84–5
    degrees of freedom, 85, 86
  type-token ratio, 82–3
  variable rule analysis, 98
  Z-score, 86–7
structuralist tradition, 7
stylistics, 76, 117–19, 173
sublanguage, 165–86
  sublanguage hypothesis, 166
syntax, 4, 139
term banks, 143
terminology, 87, 108
unconstrained language, 165–7
WordSmith, 18, 190
World Wide Web, 187