Modest XML Encoding for Corpus Linguists
Andrew Hardie
Lancaster University
[email protected]
Introduction
In this paper I outline a modest approach to XML encoding for the majority of everyday
purposes. The aim is both to introduce and explain XML for those unfamiliar with this kind
of markup, and to argue that this limited level of XML knowledge could work as part of the
normal, general training of corpus linguists.
The recommended approach to XML corpus markup is to use the document specifications
of the Text Encoding Initiative or the Corpus Encoding Standard. However, in practice,
uptake of these standards has been very low in corpus linguistics, except for major projects:
the small-scale or ad hoc data collection that characterises a very large proportion of corpus-based work typically ignores XML altogether. The reason why is fairly obvious: the
implementation of these standards is designed for maximum coverage of all-purpose
document encoding, and as a result they are very extensive, very technical, and not user-friendly. For most purposes in corpus linguistics, only a small fraction of the features of
standards like TEI and XCES are actually used regularly. But many researchers who could
benefit from making use of XML to mark up their data are currently put off from doing so by
the inapproachability of the standards.
So there is clearly a benefit to be found in outlining a minimal set of normal practices for
using XML in a way that is easy for anyone to understand, and avoiding all the more
technical aspects of the heavyweight standards. The goal is that it should be possible for
anyone from undergraduate level upwards to start by knowing nothing about XML, and
progress to being able to read and apply basic tags using a very short manual (five pages or
so).
The modest set of things that, I would argue, are sufficient knowledge about XML for most
corpus linguists' day-to-day needs are:
1. Preamble
This document explains a way of adding markup to corpus texts using a system called XML,
the eXtensible Markup Language. XML is an extremely powerful and very flexible system,
widely used for all kinds of documents and databases. Pretty much any kind of information
can be added to a document using XML. In corpus linguistics, however, we most often use
XML to indicate features of the text other than its actual words. So for most purposes, we
only need a modest amount of XML. This includes things like:
Paragraph boundaries
Sentence boundaries
Utterance boundaries
Page breaks
Things omitted from the original text
Anonymisation of spoken texts
This document outlines a series of guidelines that you can follow to use XML to indicate
these things in a reasonably standard and easy-to-understand way, without getting entangled
with all the technical details of XML. If you are creating a corpus, and you want to go beyond
just including the words of the texts, then you should strongly consider following the
suggestions here.
2. What XML is
XML is a system of markup where the information that we add to the text is represented by
tags surrounded by <angle brackets>. Anything within angle brackets counts as markup;
anything outside the angle brackets is part of the actual text. When a whole corpus text is
marked up according to the rules of XML, we call it an XML file. The XML file format can
be read and understood by many different computer programs, as well as being
comprehensible to human beings.
A basic example of this would be using the <s> tag to indicate sentence boundaries. Imagine
you had the following in one of your corpus texts:
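Johnny visited his grandma. He brought her some chocolates.
You could mark up the two sentences like this:
<s>Johnny visited his grandma.</s> <s>He brought her some chocolates.</s>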
If you are familiar with how the World Wide Web works, you might already know about
another angle-bracket-based language: HTML, the markup language used to program web
pages. The big difference is that in XML, you can decide for yourself what tags you want to
use, define what they mean, and decide on the rules for how they are used. So it's not
compulsory to use the <s> tag for sentences, though it is common practice.
3. XML files
An XML file is a type of plain text file. A plain text file is a computer file that consists only
of letters, numbers and other characters. There is no formatting information, such as text
colour or text font, in a plain text file: the kinds of file that do contain this information, such
as word processor files or presentation files, are not plain text and cannot be used as plain
text.
You can create and edit plain text files using a text editor program such as Notepad on
Microsoft Windows. It is normally a bad idea to use word processors, like Microsoft Word or
LibreOffice Writer, to edit XML files. There exist lots of specialised XML editor programs as
well, but for the modest level of XML that this document is introducing, you don't need to
learn how to use one of these: a text editor is enough.
XML files need to follow all the rules of XML which we will encounter in this document. An
XML file which follows all the rules is called well-formed, which means that computer
programs can process it correctly. If your XML file is not well-formed, you will encounter an
error when a computer program attempts to process it.
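As a rough illustration of what "a computer program attempts to process it" means, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree parser. The function name is_well_formed and the two sample strings are invented for this illustration; they are not part of any standard.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


good = "<text><p>This is my first paragraph.</p></text>"
bad = "<text><p>This is my first paragraph.</text>"  # the <p> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

The second call reports a failure together with the parser's own message, which usually includes the line and column where the problem was noticed.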
One important rule is that XML files should use proper Unicode encoding (for corpus
linguistics, we most often use the form of Unicode called UTF-8). Encoding is a separate
question to markup, so we won't explain it here; see McEnery and Xiao (2004).
It's traditional to save XML files with the three-letter extension .xml (whereas other kinds of
plain text file are often saved with the file extension .txt).
4. How to enter tags
When you have your corpus text in a plain text editor, you can add tags simply by typing
them in. Tags consist of a label between two angle brackets, like this:
<p>
We usually try to make the label an easy-to-remember abbreviation for what it is marking up.
So, for instance, <p> tags are normally used to mark up paragraphs, because p is an easy-to-remember abbreviation for paragraph.
Remember that it is important to always type the angle brackets correctly. If you misplace an
angle bracket, the XML file will not be well-formed. Another important rule is that the labels
of tags are always case-sensitive. So <p> and <P> count as two different kinds of tag in XML.
Because of this rule, we normally stick to lowercase letters for our tag labels. Tag labels can't
be more than one word, and normally shouldn't contain any characters other than the letters
of the alphabet.
One of the other rules of XML is that all whitespace counts the same. Whitespace includes
line breaks, tabs, and spaces: when a computer reads the XML file, these are all treated as
being just the same as a space. Another rule is that putting spaces around the tags themselves
is normally optional. So, in XML, this...
<u>Hello.</u><u>Hi there!</u>
... (where u is short for utterance) counts as being exactly the same as this:
<u>
Hello.
</u>
<u>
Hi there!
</u>
The important thing to remember here is that line breaks don't mean anything in XML. In
normal typing, we are used to being able to use a double line break and/or indent to indicate
the end point of a paragraph, for instance:
This is my first paragraph.
And this is
my second paragraph.
But if we are using XML, the line breaks aren't enough to show the paragraphs. We have to
explicitly indicate the start and end points with tags:
<p>This is my first paragraph.</p>
<p>And this is my second paragraph.</p>
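The point about whitespace can be checked with a short sketch. It assumes Python's standard xml.etree.ElementTree parser; the function name utterances is made up for the example. Note that a general-purpose parser like this one hands back the whitespace exactly as typed, so the sketch collapses each run of whitespace to a single space itself, which is what corpus tools typically do.

```python
import xml.etree.ElementTree as ET

# The same two utterances, typed in the two layouts discussed above.
compact = "<text><u>Hello.</u><u>Hi there!</u></text>"
spread = """<text>
<u>
Hello.
</u>
<u>
Hi there!
</u>
</text>"""


def utterances(xml_text):
    """Return the text of each <u>, with all whitespace runs collapsed."""
    root = ET.fromstring(xml_text)
    return [" ".join((u.text or "").split()) for u in root.iter("u")]


print(utterances(compact))
print(utterances(spread))
```

Both calls give the same list of utterances, because once whitespace is normalised the two layouts carry identical content.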
Very often, as in the utterance/paragraph examples above, we use XML tags to represent
regions in the text. Each paragraph of a text, for instance, is a region that has a beginning
point and an end point. When we use XML to mark up regions, the following rules apply:
the beginning of each region is indicated by a start-tag that consists of just the tag
label between angle brackets, for instance <p>
the end of each region is indicated by an end-tag that has a forward-slash before the
tag name, for instance </p>
every start-tag must have a corresponding end-tag
regions can't overlap (that is, one region must end before the next region of the same
type starts)
However, we don't always want to mark up regions in our texts. Sometimes, we want to mark
up something that occurs at a particular single point in the text. For example, if we are
encoding a written text we might want to mark up its page breaks, but a page break is a
point, not a region. For these, there is a special style of XML tag:
<pb/>
(assuming that pb is our abbreviation for page break). The forward-slash after the tag label
indicates that this is a standalone tag marking a point in the text, which has no corresponding
end-tag. Here is an example of region-style tags and a point-style tag being used together to
mark up a paragraph that starts on one page and finishes on the next:
<p>"It's really a dreadful situation," she said fretfully. "Who on
earth <pb/> can predict how things will turn out?"</p>
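To see how a program tells the region from the point, here is a small sketch using Python's standard xml.etree.ElementTree. The variable names are only illustrative. In this particular library, the text that follows a point-style tag is stored as that tag's "tail"; other XML tools expose the same information in other ways.

```python
import xml.etree.ElementTree as ET

sample = (
    "<p>\"It's really a dreadful situation,\" she said fretfully. \"Who on "
    "earth <pb/> can predict how things will turn out?\"</p>"
)

paragraph = ET.fromstring(sample)   # the <p> region
page_break = paragraph.find("pb")   # the <pb/> point inside it

text_before = paragraph.text        # words before the page break
text_after = page_break.tail        # words after the page break

# A point-style tag has no text and no tags of its own inside it.
print(page_break.text, len(page_break))
print(text_before)
print(text_after)
```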
followed by the value, which contains the actual information. Attribute-value pairs are
encoded inside the actual XML tags, as follows:
<pb n="23"/>
The attribute-value pair is placed inside the tag, after the tag label, with a space in
between
The label of the attribute comes first; the same rules apply to attribute labels that
apply to tag labels: they should just be a single word, should normally contain only
letters, and are case-sensitive (in the example above, the attribute label is n)
The attribute label is followed by an equals sign; after the equals sign comes the value
of the attribute.
The value of the attribute is surrounded by quotation marks.
The idea is that the attributes are the same on every tag of the same type in the text, but the
values can be different each time. For instance:
...
<pb n="23"/>
...
<pb n="24"/>
...
<pb n="25"/>
...
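Once attributes are in place, a program can read their values back out. The following sketch assumes Python's standard xml.etree.ElementTree; the surrounding <text> tag and the placeholder words between the page breaks are invented so that the fragment forms a complete file.

```python
import xml.etree.ElementTree as ET

sample = (
    '<text>some words <pb n="23"/> more words <pb n="24"/> '
    'yet more words <pb n="25"/> final words</text>'
)

root = ET.fromstring(sample)

# Same attribute (n) on every <pb/>, but a different value each time.
page_numbers = [pb.get("n") for pb in root.iter("pb")]
print(page_numbers)
```

Attribute values always come back as strings, so a program that wants to do arithmetic on page numbers would need to convert them first.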
It's possible to use either single quotation marks or double quotation marks to mark the start
and end of a value, but the two types of quotation mark should not be mixed: each
attribute-value pair must use one or the other. We suggest sticking to double
quotation marks most of the time.
It's important to note that if you type a quotation mark in a word processor, it will normally
be turned automatically into a curly quote character: that is, either a 66-style quotation
mark (begin quotation) or a 99-style quotation mark (end quotation).
“These are ‘curly quotes’ (sometimes called ‘smart quotes’).”
"These are straight quotes, the 'correct' ones for attribute-values."
This is fine for normal typing, but incorrect for XML: you must never use curly quotes for
attribute-values as this stops the XML from being well-formed. This is one good reason why
it is better to create your XML in a text editor than in a word processor!
When a region-style tag has an attribute, the attribute is marked up once only, on the start tag,
even though it applies to the whole region. In the following example, the who attribute is
added to the <u> tag to indicate who is speaking; obviously, this bit of information applies
to the whole of the utterance surrounded by the <u> ... </u>:
<u who="PETER">Hello, how are you?</u>
There are some other common mistakes that people make when adding attributes. For
example:
<pb="12"/>
The mistake here is that the attribute-value is linked directly to the tag. This isn't allowed in
XML. Values have to be linked to attribute names in the way we explained above.
A second common mistake is repeating the attribute-values on the close tag. For example:
<u who="PETER">Hello, how are you?</u who="PETER">
You should never use encoding like this. If you do, your file is not a well-formed XML file.
Finally, you will sometimes see the following in older corpus files from before XML was
introduced:
<u PETER>Hello, how are you?</u>
Here, the attribute value (name of the speaker) is added directly within the tag, without
linking it to an attribute name. Again, this does not follow the modern rules of XML, so you
should never use markup like this.
5.2. How to use identifiers and sequence numbers
There are two particular attributes that you can use on lots of different tags: id and n.
We saw an example of n above being used with the <pb/> tag. n is short for number and is
customarily used to indicate a sequence number. It can be used with any tag. For example, if
a spoken text has 100 utterances in it, then the <u> tags can contain the attribute-values n="1"
to n="100". This can be useful for keeping track of where examples in a corpus occur; for
instance, you could note that a particular example of a word you are interested in occurs in
utterance number 48 of text AA2, and this would enable you to find the source of the
example whenever you need it.
The id attribute is also used to keep track of particular tags. The difference is that, whereas n
is a sequence number, the id attribute doesn't necessarily tell us anything about sequence.
Instead, it contains an identifier: an arbitrary label that is a name for that particular tag.
There are two generally useful rules for identifiers:
Identifiers have to be unique: no two tags in the entire corpus should ever have the
same identifier, even when they are different kinds of tag; this is different from
sequence numbers, which can be repeated
Identifiers should usually only consist of letters and numbers: if they contain other
characters, this can sometimes create problems when the corpus is processed by some
computer programs
One typical use of identifiers is to give a unique label to each text in a corpus.
<text id="AA01">
An identifier is very useful here because we want each text to be clearly labelled. Typically,
you would store each corpus text in an XML file with the identifier as its filename. So the
text with identifier AA01 would be stored in a file with the name AA01.xml.
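Because identifiers must be unique across the whole corpus, it can be worth checking for accidental repeats. Here is a minimal sketch in Python (standard library only). The function name duplicate_ids and the two tiny sample texts are invented for the illustration; in practice you would read the contents of your own .xml files.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(xml_texts):
    """Return a sorted list of id values used more than once across all texts."""
    counts = Counter()
    for xml_text in xml_texts:
        root = ET.fromstring(xml_text)
        for element in root.iter():          # every tag, of whatever kind
            identifier = element.get("id")
            if identifier is not None:
                counts[identifier] += 1
    return sorted(i for i, c in counts.items() if c > 1)


corpus = [
    '<text id="AA01"><p id="P1">First text.</p></text>',
    '<text id="AA02"><p id="P1">Second text.</p></text>',
]

print(duplicate_ids(corpus))
```

Here P1 is reported, because it appears in both texts even though the text identifiers themselves are distinct.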
5.3. Recommended and unrecommended ways to use attributes (advanced)
It's possible to use attributes in a way that is perfectly well-formed XML, but that actually
makes things difficult later on.
We recommend that you should use attributes for information specific to that instance of the
tag. Try to avoid repeating information unnecessarily.
Utterance tags are a good example of this. Often, we have lots of information about the
speaker: sex, age, social class and so on. It's very tempting to add this information to every
<u> tag, and this is indeed well-formed as XML:
<u who="Mary" sex="F" age="18">Hello, how are you?</u>
<u who="Timmy" sex="M" age="42">Quite well, thank you.</u>
<u who="Mary" sex="F" age="18">I'm very glad to hear it!</u>
However, notice that this style results in lots of repetition. Every time Mary speaks, we repeat
the information that she is a female aged 18. This is not ideal (imagine, for instance, finding
out we'd made a mistake and Mary was actually 17 when the text was recorded: we would
have to go back and change the markup of every single utterance).
The issue here is that things like sex, age and class are information about the speaker, not
information about the utterance. So, a much easier way to arrange things is to use a reference
to an identifier as the value of who, and to store all the speaker information somewhere else.
For example, SP001 could be used as the value of who for Mary and SP002 for Timmy:
<u who="SP001">Hello, how are you?</u>
<u who="SP002">Quite well, thank you.</u>
<u who="SP001">I'm very glad to hear it!</u>
The identifiers then provide an unambiguous link to the speaker information. For example,
you could place this in the header of the spoken text, in this form:
<speaker id="SP001" name="Mary Bloggs" age="18" sex="F"
birthplace="Newcastle-Upon-Tyne" occupation="chef" />
Alternatively, a separate database or spreadsheet can be used to store this information in the
form of a table. That way, pieces of information such as the speaker's sex and age only need
to be logged once, rather than on every single utterance.
The who attribute is not itself an identifier (the values are not unique: different utterances
can have the same value of who). But its value refers to an identifier, which is unique across
speakers; so we know unambiguously who is speaking here.
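The link from who to the speaker information can be followed automatically. The sketch below assumes Python's standard xml.etree.ElementTree and a file that keeps its <speaker/> tags in a header; the exact layout, the shortened attribute lists and the entry for Timmy are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

sample = """<text id="TEXT01">
<header>
<speaker id="SP001" name="Mary Bloggs" age="18" sex="F"/>
<speaker id="SP002" name="Timmy" age="42" sex="M"/>
</header>
<body>
<u who="SP001">Hello, how are you?</u>
<u who="SP002">Quite well, thank you.</u>
<u who="SP001">I'm very glad to hear it!</u>
</body>
</text>"""

root = ET.fromstring(sample)

# Build a lookup table: identifier -> that speaker's attributes.
speakers = {s.get("id"): s.attrib for s in root.iter("speaker")}

# For each utterance, follow who to the speaker record and read the sex.
utterances_by_sex = [
    (speakers[u.get("who")]["sex"], u.text) for u in root.iter("u")
]
print(utterances_by_sex)
```

If Mary's age turned out to be wrong, only the single <speaker/> tag would need correcting, and every utterance would pick up the change.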
Another tempting mistake is to use more than one attribute to mark up the same kind of
information. For example, you might want to mark up pragmatic functions on utterances, in a
style like this:
<u who="SP001" func="greeting">Hello, how are you?</u>
This is fine so far. However, many utterances have more than one function (the example
above is a question as well as a greeting). How can you encode more than one function? You
might be tempted to have multiple func attributes, like this:
<u who="SP001" func="greeting" func="question">Hello, how are
you?</u>
... but this is not allowed (each attribute can only occur once). Another tempting approach is
to classify it in terms of first function, second function..., and mark it up like this:
<u who="SP001" func1="greeting" func2="question">Hello, how are
you?</u>
This is well-formed XML. However, it is a bad idea for later processing, because it means the
information about pragmatic function is spread across two different attributes: a computer
analysing the file won't understand that func1 and func2 are related. It thus becomes very
difficult, for instance, to search for something like "all the questions", because the
"question" value could be on either func1 or func2. So what might be a better approach?
One way is to combine the values together on a single attribute, like this:
<u who="SP001" func="greeting;question">Hello, how are you?</u>
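With the values combined on one attribute, a search for "all the questions" only has to look in one place. Here is a sketch in Python (standard library only); the helper name has_function, the second utterance and its "statement" label are hypothetical additions for the example.

```python
import xml.etree.ElementTree as ET

sample = """<text>
<u who="SP001" func="greeting;question">Hello, how are you?</u>
<u who="SP002" func="statement">Quite well, thank you.</u>
</text>"""

root = ET.fromstring(sample)


def has_function(utterance, function):
    """True if the utterance's func attribute lists the given function."""
    return function in utterance.get("func", "").split(";")


questions = [u.text for u in root.iter("u") if has_function(u, "question")]
print(questions)
```

Splitting on the semi-colon, rather than searching for the letters "question" anywhere in the value, avoids false matches on longer labels that happen to contain that word.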
Because the <func> tags are inside the <u> tag, they count as belonging to it, according to the
rules of XML.
6. How to nest XML tags
You will nearly always want to use more than one kind of XML tag in a text. That means you
have to think about how they fit together. There is one main rule for combining XML tags:
pairs of tags can never overlap.
For example, let's imagine we have both paragraph and sentence tags using <p> and <s>.
After a paragraph tag has been started, we can then start a sentence tag. The rule is that this
sentence tag must be closed before the paragraph tag can close. In other words, this is correct:
<p><s>This is the story of a man called Bill.</s></p>
... and this is incorrect:
<p><s>This is the story of a man called Bill.</p></s>
In the latter example, which is not well-formed, the region of the <p> and the region of the
<s> overlap. In the former example, which is well-formed, the region of the <s> is
completely contained within the region of the <p>. This complete containment is sometimes
called nesting.
Note that nesting is not an issue for tags like <pb/> which indicate a point rather than a region:
they can't ever overlap, precisely because they indicate points in the text.
We sometimes use indenting to illustrate the nesting, as follows (remember that whitespace is
irrelevant to the computer, so this is just a visual guide for human readers):
<p>
    <s>
        This is the story of a man called Bill.
    </s>
</p>
In this layout, each extra level of nesting adds one more indentation, and it is easy to see how
the tags must correspond. Nesting can go on for as many levels as needed, for example:
<u who="SP001">
    <s>
        I er s- started walking and er well left the house.
    </s>
    <s>
        Then when I saw th- the
        <unclear>
            circus
        </unclear>
        it was in the er playground there.
    </s>
</u>
An XML file is only well-formed when all the tags are correctly nested like this (this is
sometimes called perfect nesting).
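Nesting is exactly what a program sees when it reads the file. The sketch below, in Python with the standard xml.etree.ElementTree parser, prints one line per tag with two spaces of indentation per level, and then shows that an overlapping pair of tags is rejected. The function name outline is invented for the illustration.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """List tag labels, indented by two spaces per level of nesting."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


nested = """<u who="SP001">
<s>I er s- started walking and er well left the house.</s>
<s>Then when I saw th- the <unclear>circus</unclear> it was in the er
playground there.</s>
</u>"""

layout = outline(ET.fromstring(nested))
print("\n".join(layout))

# Overlapping regions: <p> is closed while <s> is still open.
overlapping = "<p><s>This is the story of a man called Bill.</p></s>"
try:
    ET.fromstring(overlapping)
    overlap_accepted = True
except ET.ParseError:
    overlap_accepted = False
print(overlap_accepted)
```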
There is one additional rule for XML to be well-formed, which is that the entire XML file
must be nested within a single pair of start- and end-tags. That is, the very first thing in the
file must be a start tag, and the very last thing in the file must be the corresponding end tag.
In corpus linguistics, we very often use a <text> tag for this purpose, since most of the time
one file corresponds to one text.
7. How to encode special symbols
As we've seen, XML is based on two special symbols that mark tags off from the actual text:
< and >. If these symbols occur in your actual corpus text, we need to encode them in a
special way to make it clear that these are part of the text, not part of the markup. The way
XML does this is to represent these special symbols using a short code that starts with an
ampersand and ends with a semi-colon. Because ampersand is used for this special purpose,
we also need a code to represent ampersand itself. The codes are:
&lt; for < (the less-than sign)
&gt; for > (the greater-than sign)
&amp; for & (the ampersand)
The technical term for symbols represented using these codes with an ampersand is entities.
< is the symbol that commonly causes problems: any corpus that contains scientific writing
is likely to have statements of statistical significance like p < 0.05, and just one example of
this in your corpus text that has not been converted into an entity is enough to stop the file
being well-formed.
For example, imagine that your original text has the sentence "The results were tested for
significance using the t-test, & were found to be significant (p < 0.01)." You would need to
encode this in XML as:
<s>The results were tested for significance using the t-test, &amp;
were found to be significant (p &lt; 0.01).</s>
An easy way to do this is to use your text editor's find-and-replace tool to replace all
instances of & with &amp;, all examples of < with &lt;, and all examples of > with &gt;.
Two warnings! First, you must do these replacements before you add any XML tags, or the
angle brackets around the tags will also be turned into entities. Second, you must do the
ampersand replacement first, because if you do it after adding &gt; or &lt;, a replace-all of
& will also affect the &-characters at the start of those entities.
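The same three replacements, in the same order, can be scripted. This is a minimal sketch in Python; the function names to_entities and wrong_order are made up for the example. (Python's standard library also offers xml.sax.saxutils.escape, which performs these three replacements for you.)

```python
import xml.etree.ElementTree as ET


def to_entities(text):
    """Replace &, < and > with entities. The ampersand must go first."""
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


def wrong_order(text):
    """Deliberately wrong: the ampersand is replaced last."""
    return text.replace("<", "&lt;").replace("&", "&amp;")


raw = ("The results were tested for significance using the t-test, & "
       "were found to be significant (p < 0.01).")

# Convert the raw text first, and only then add the tags around it.
encoded = "<s>" + to_entities(raw) + "</s>"
print(encoded)

# The mistake described above: the & of &lt; gets converted a second time.
print(wrong_order("p < 0.01"))
```

Reading the encoded sentence back with an XML parser returns the original characters, which is a quick way to confirm that nothing was converted twice.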
There are also special codes for the single and double quotation marks. You don't need to use
these in normal corpus text, but you do need to use them inside attribute-values because
otherwise, the computer could mistake them for the quotation marks that end the value. The
codes for these symbols are:
&apos; for ' (the single quotation mark)
&quot; for " (the double quotation mark)
For example, if the title of a text contains double quotation marks, the title attribute of the
opening <text> tag should be encoded like this:
<text id="NEWS01" title="Prime Minister says, &quot;No way&quot;">
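If attribute values are generated by a script, the quotation-mark entities can be added at the same time. The sketch below uses Python's standard xml.sax.saxutils.escape function, whose optional second argument lists extra characters to convert; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

title = 'Prime Minister says, "No way"'

# escape() always handles &, < and >; the dictionary adds the two quote marks.
encoded_title = escape(title, {'"': "&quot;", "'": "&apos;"})
start_tag = '<text id="NEWS01" title="' + encoded_title + '">'
print(start_tag)

# Reading the tag back gives the original title, quotation marks included.
round_trip = ET.fromstring(start_tag + "</text>").get("title")
print(round_trip)
```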
browser. If the browser opens the whole file correctly, the file is well-formed and contains no
errors. If the browser reports an error before it gets to the end of the file, there is a mistake in
your XML: maybe an end-tag is missing, or perhaps you have mistyped an angle bracket, or
left a less-than symbol in the text without changing it into &lt;.
Another useful feature of web browsers is that they will usually lay out the XML tags in
indented style (see above) automatically for you, regardless of the layout in the actual plain
text file.
9. The de facto standard tags and their attributes
Certain XML tags are used so widely in corpus linguistics that it is almost always the best
idea to use those same tags, rather than invent your own, if you want to encode the things that
those tags represent. For example, it's perfectly possible, if you wanted, to use the tag <para>
... </para> to indicate the start and end of a paragraph. However, the tags <p> ... </p>
are so widely used for this purpose, that it makes more sense to stick to this practice wherever
possible.
This section lists a number of these de facto standard tags, together with the attributes usually
used, examples, and some comments.
Normally, you would only use a selection of these tags, choosing the types of markup relevant
to your purposes. If you used all the tags listed here, you would be using a very
comprehensive style of markup, well on the way to one of the major, heavyweight standards.
The id and n attributes which we discussed above aren't mentioned again here, because they
are used in the same way for every kind of tag.
9.1. Tags normally used in written texts
<p>
<s>
<s> ... </s> is very frequently used to indicate sentence start and end points. We've
already seen examples of this tag in use. Some part-of-speech taggers automatically insert
these tags for you.
<head>
Sometimes, we want to distinguish between normal text paragraphs and paragraphs that are
graphically distinct in the original text, for example section headings or the headline of a
newspaper article. For instance, a newspaper text could be marked up as follows:
<head><s>MAN BITES DOG</s></head>
<p><s>A man bit a dog yesterday.</s> <s>When interviewed, the man
said, "No comment."</s></p>
If you don't need to draw the distinction between normal paragraphs and headings, it's
perfectly fine to just use the normal <p> tags for headings/headlines.
<head> can have a level attribute, where the level of the heading is used to indicate the
difference between main headings (level 1), sub-headings (level 2), sub-subheadings, and so
on:
<head level="1">VOLUME ONE</head>
<head level="2">Chapter One</head>
<head level="3">- 1 -</head>
<p>I wish to tell the story of my life. I was born ...</p>
<pb/>
This tag is used to indicate the position of a page break in a written text (either between two
paragraphs or mid-way through a paragraph).
<q>
In some corpora, <q> ... </q> tags are used to indicate quoted text, such as direct reported
speech. That is, instead of this:
<s>Mr Johnson said, "Hello, everyone!" as he walked into the
room.</s>
It is usually a lot of work to correctly insert <q> tags, so you would only do it if
distinguishing automatically between the main text and quotations is going to be important
for your use of the data.
<gap/>
The <gap/> tag is used to make a note of something that has been omitted when the text was
transcribed into a corpus file: most commonly figures and illustrations, but also sometimes
tables and other irrelevant material. It usually has a desc attribute containing a short
description of what has been left out. For example:
<gap desc="Figure 1.1, apparatus diagram" />
<gap desc="Illustration - picture of a clown" />
<gap desc="Table (half a page) listing mortality figures in 19th
century Wales" />
<reg>
The <reg> tag is used to indicate regularised spelling. This is often useful with non-standard
types of data: for example, historical corpora predating the full standardisation of spelling, or
written texts produced by young children or foreign language learners. Regularisation allows
the text to be searched using modern spellings. Normally, a <reg> tag would only surround a
single word at a time (there are exceptions to this). It is customary to use an orig attribute to
preserve the original spelling. For example, if you had a historical text with the sentence
"Three shippes sayled to Hamborough", you could mark it up like this:
<s>Three <reg orig="shippes">ships</reg>
<reg orig="sayled">sailed</reg> to
<reg orig="Hamborough">Hamburg</reg>.</s>
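One benefit of this style of markup is that both versions of the sentence can be recovered from the same file. Here is a sketch using Python's standard xml.etree.ElementTree; the function name original_text is hypothetical, and the sentence is typed on one line with ordinary spaces between the words.

```python
import xml.etree.ElementTree as ET

sample = (
    '<s>Three <reg orig="shippes">ships</reg> '
    '<reg orig="sayled">sailed</reg> to '
    '<reg orig="Hamborough">Hamburg</reg>.</s>'
)
sentence = ET.fromstring(sample)

# The regularised version is simply all the text between the tags.
regularised = "".join(sentence.itertext())


def original_text(element):
    """Rebuild the text, using orig in place of the content of each <reg>."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "reg":
            parts.append(child.get("orig", ""))
        else:
            parts.append(original_text(child))
        parts.append(child.tail or "")   # text following the child tag
    return "".join(parts)


print(regularised)
print(original_text(sentence))
```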
<u>
The <u> tag is used for utterance boundaries. We have seen examples of this in use before.
The most common attribute is who, which is used to indicate the speaker of the utterance.
This has also been discussed above in the general outline of XML.
<pause/>
The <pause/> tag indicates the point in an utterance where a pause occurs. It often has the
dur attribute to show how long it is (most often measured in seconds). For example:
<u>I wanted to <pause dur="2"/> er wanted to go home</u>
The pause tag here is equivalent to notation like (2.0), which is sometimes used to mark
pauses in traditional transcription of spoken data.
<voc/>
The <voc/> tag indicates the occurrence of a non-linguistic vocalisation in a spoken text,
such as a cough, a sigh, a laugh, or a sneeze. The type of vocalisation is usually specified
using a desc (description) attribute, for instance:
<u>so <voc desc="laugh"/> three llamas right <voc desc="laugh"/> walk
into a bar <voc desc="laugh"/> and one says </u>
<event/>
An <event/> tag is just like a <voc/>, except it indicates some kind of sound on a recording
of spoken language that is not part of the speech. Here are some examples:
<event desc="prerecorded radio jingle" />
<event desc="noise of passing traffic" />
<event desc="end of side one of tape" />
<stage>
In scripts (plays and other written-to-be-spoken texts), the <stage> ... </stage> tag can
be used to mark off stage directions. This can be useful for later analysis if you want to
distinguish words that are actually spoken by characters in a play from the other words in the
original text. Here is an example of this style of markup applied to Shakespeare's The
Winter's Tale:
<u who="ANTIGONUS">[...] I am gone for ever.</u>
<stage>
[Exit, pursued by a bear.]
[Enter an old SHEPHERD.]
</stage>
<u who="SHEPHERD">I would there were no age between sixteen and
three-and-twenty [...]</u>
<text>
It is very common to use <text> ... </text> as the single tag that encloses an entire
corpus text. (Remember that the rules of XML require every file to be contained within a
single start-tag and end-tag pair.) If you do this, the <text> tag should usually contain the
corpus text's unique identifier code using an id attribute.
<unclear>
The <unclear> tag is very useful for both speech and writing. In a spoken text, you can use it
to mark a region (a word or words) whose transcription you are not certain about, perhaps
because it was spoken unclearly, or perhaps because the recording quality was poor:
<u>So I went to um to my <unclear>village</unclear> cos I wanted to
um you know</u>
In a written text, it is used to mark a word or words which can't be made out in the original
text, perhaps because it is blurry in a printed text or written in illegible handwriting:
<p>Jenny - what do you think of this? Seems <unclear>fishy</unclear>
to me, best, TH</p>
There are two common ways to use <unclear>. In the examples above, you make a best
guess about what the words are, and surround the uncertain word or words with <unclear>
tags. If you have no idea at all, you would use an <unclear/> tag to indicate the unclear
words as a point rather than region, optionally with a dur attribute to state how long the
unclear chunk is, for example:
<unclear dur="3 seconds"/>
<unclear dur="10 words"/>
<unclear dur="whole line"/>
Corpus files often have headers: blobs of data about the text (such as title, publisher, date of
recording, etc.) encoded at the start of each file. If you are adding headers to your files, you
can use <header> and <body> tags to distinguish the header information from the actual text.
The usual way to do this is to have the overall layout of the corpus file as follows.
<text id="TEXT01">
<header>
[The header information goes here.]
</header>
<body>
[The actual content of the corpus text goes here.]
</body>
</text>
Note that a <header> is not the same thing as a <head>! See above for <head>.
An approach to creating a modest XML header is outlined near to the end of this guide.
<w>
The <w> tag is used around individual word tokens. Its main use is to carry attributes that
indicate different types of corpus annotation. Common attributes of this sort include pos for a
part-of-speech tag and lemma or hw (short for headword) for a lemma. For example:
<w pos="ART" lemma="the">The</w>
<w pos="NOUN" lemma="cat">cat</w>
<w pos="VERB" lemma="sit">sat</w>
<w pos="PREP" lemma="on">on</w>
<w pos="ART" lemma="the">the</w>
<w pos="NOUN" lemma="mat">mat</w>
<w pos="PUNC" lemma=".">.</w>
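Word-level markup like this is mainly useful to software. As an illustration, the sketch below reads the example back with Python's standard xml.etree.ElementTree and pulls out each token together with its annotation; the enclosing <s> tag is added here only so that the fragment has a single outermost tag.

```python
import xml.etree.ElementTree as ET

sample = """<s>
<w pos="ART" lemma="the">The</w>
<w pos="NOUN" lemma="cat">cat</w>
<w pos="VERB" lemma="sit">sat</w>
<w pos="PREP" lemma="on">on</w>
<w pos="ART" lemma="the">the</w>
<w pos="NOUN" lemma="mat">mat</w>
<w pos="PUNC" lemma=".">.</w>
</s>"""

sentence = ET.fromstring(sample)

# One (word, part-of-speech, lemma) triple per token.
tokens = [(w.text, w.get("pos"), w.get("lemma")) for w in sentence.iter("w")]

# A simple search: every token tagged as a noun.
nouns = [w.text for w in sentence.iter("w") if w.get("pos") == "NOUN"]

print(tokens)
print(nouns)
```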
Normally you would not add this kind of annotation manually, but would use a computer
program to do it for you (e.g. a part-of-speech tagger). But how this works is not our topic
here.
<c>
This tag is sometimes used (for example, in the British National Corpus) as an equivalent for
<w> that is placed around tokens that are punctuation marks. This allows punctuation tokens
(wrapped in <c> ... </c>) to be distinguished from word tokens (wrapped in <w> ... </w>).
So for instance, in the example above, the last token (full stop) could have been
marked up with <c> instead of <w>. As with <w>, however, you probably wouldn't use <c>
unless you needed to encode word-level annotation.
<anon/>
An <anon/> element can be used as a replacement for any word or phrase that needs to be
omitted for reasons of anonymisation. You can use a type attribute to indicate why the
information has been left out, as in this example:
<p>I want you to assassinate Mr <anon type="name"/> for me. He lives
at <anon type="address"/>. Please hurry, because I really, really
hate him.</p>
The fact that there is a set of de facto standard tags does not mean that you cannot also use
additional tags when you need to. Here are two example cases.
10.1.
Often, when we build a corpus, we are not particularly interested in preserving text
highlighting like bold and italics. However, sometimes this is linguistically significant
(especially in historical text). If so, you might add in tags to indicate regions highlighted in
one or both of these ways.
There are lots of ways of doing this. For example, you might use <b>...</b> and
<i>...</i> for bold and italics respectively; these are borrowed from the HTML language
used to encode web pages. Or you might borrow the TEI standard's technique, which uses a
<hi> tag for this purpose, with a rend attribute to specify whether it is bold or italic.
So, given this text:
Do you seriously think he'll forgive you after everything you did?
... we could mark it up in either HTML-style or TEI-style:
Do you <i>seriously think</i> he will <b>forgive</b> you after
everything you <b><i>did</i></b>?
Do you <hi rend="italic">seriously think</hi> he will
<hi rend="bold">forgive</hi> you after everything you
<hi rend="bold italic">did</hi>?
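One practical consequence of the TEI-style choice is that all highlighting sits on a single kind of tag, so a search only has to inspect rend. A minimal sketch in Python (standard library only) follows; the enclosing <s> tag is an addition for the example, and the code assumes that multiple rend values are separated by spaces, as in "bold italic" above.

```python
import xml.etree.ElementTree as ET

sample = (
    '<s>Do you <hi rend="italic">seriously think</hi> he will '
    '<hi rend="bold">forgive</hi> you after everything you '
    '<hi rend="bold italic">did</hi>?</s>'
)
sentence = ET.fromstring(sample)


def highlighted(element, style):
    """Return the text of every <hi> whose rend value includes the style."""
    return [
        hi.text for hi in element.iter("hi")
        if style in hi.get("rend", "").split()
    ]


print(highlighted(sentence, "bold"))
print(highlighted(sentence, "italic"))
```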
10.2.
In this case, you would create whatever tags and attributes your scheme of grammatical
analysis requires. For example, you might have <np type="definite"> ... </np>
contrasting with <np type="indefinite"> ... </np>.