2 Overview of the human
speech mechanism
2.1 The complexity of speech sounds
Human speech is complex, and lay people are not used to describing it
in technical ways. On the other hand, many people have some inkling
of how to describe music. We could describe the rhythm (where are the
beats? what is the tempo?), the melodic structure (what key is it in? what
scale does it use? are there recurrent themes?), instrumentation and so
forth. All of these are different aspects of music, and all of them contrib-
ute to the totality of what we hear.
Describing speech is a similarly complex task. Speech involves the
careful co-ordination of the lips, tongue, vocal folds, breathing and so
on. The signal that we perceive as successive sounds arises from skills
that we learn over years of our lives, even as our bodies grow and age.
In producing even the simplest of speech sounds, we are co-ordinating
a large number of things. Phonetics involves something like unpicking
the sounds of speech and working out how all the components work
together, what they do, and when. It is a bit like hearing a piece of music
and working out how the score is constructed.
One problem we face is exactly the interconnectivity of the parts: in a
way, we need to know something about everything all of the time. The
purpose of this chapter is to give you an overview of the speech mecha-
nism. The terms and concepts that are introduced here will be developed
in more detail in later chapters, but understanding even the simplest
things about speech is easiest if we have an overview of the whole system:
so this chapter introduces a lot of basic terminology of phonetics.
2.2 Breathing
Speech sounds are made by manipulating the way air moves out of (or
sometimes into) the vocal tract. There are a number of ways of doing
this, as we will see in Chapter 10, but universally across languages
8 an introduction to english phonetics
sounds of speech are produced on an out-breath. This kind of airflow is
called pulmonic (because the movement of air is initiated by the lungs;
the Latin word for lung is ‘pulmo’) and egressive (because the air comes
out of the vocal tract; ‘e-’, ‘out’, ‘-gress-’, ‘move forwards’): all spoken
languages have pulmonic egressive sounds.
Try an experiment. Take a lungful of air and then hum or say ‘aaah’
until you have to stop. Time yourself; it should take you quite a long
time before you run out of air. Now repeat this, but breathe out first.
This time, you will see that you cannot sustain the same sound for any-
thing like as long. This is enough to show you that a simple sound like
‘aaah’ ([ɑː]) or ‘mmm’ ([mː] – [ː] is the diacritic for long) requires an
out-breath with a reasonable amount of air in the lungs.
Now try breathing in while you say ‘aah’ or ‘mmm’. You probably
will find that this is quite hard, and you will probably get a more ‘croaky’
voice quality. If you try saying your name while breathing in, you will
notice that it feels both unpleasant and difficult; and it doesn’t sound
very good either. This is because the vocal tract works best for speech
when breathing out, i.e. on an egressive airflow.
The lungs are large spongy organs in the thoracic cavity (chest).
They are connected to the outside world via the trachea, or windpipe.
The lungs are surrounded at the front by ribs, and at the bottom by the
diaphragm. The ribs are attached to one another by intercostal muscles.
In breathing in, the diaphragm lowers and the intercostal muscles make
the rib cage move upwards and outwards. This increases the size of the
thoracic cavity, and so it lowers the air pressure. As a result, air flows
into the lungs, and they expand and fill up with air. Once inhalation
stops, the diaphragm and the intercostal muscles relax, and exert a
gentle pressure on the lungs. Air is forced out of the lungs, generating a
pulmonic egressive airflow.
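The link between a larger thoracic cavity and lower air pressure can be illustrated with Boyle's law (pressure times volume stays constant for a fixed amount of gas at constant temperature). The sketch below is an idealisation: it treats the lungs as momentarily sealed, before any air has flowed in, and the volumes are hypothetical figures chosen only for illustration, not measurements from the text.

```python
ATMOSPHERIC_KPA = 101.325  # standard atmospheric pressure at sea level


def pressure_after_volume_change(p_initial: float, v_initial: float, v_final: float) -> float:
    """Boyle's law for a fixed amount of gas at constant temperature:
    p_initial * v_initial == p_final * v_final."""
    if v_initial <= 0 or v_final <= 0:
        raise ValueError("volumes must be positive")
    return p_initial * v_initial / v_final


# Hypothetical figures: the thoracic cavity expands from 3.00 to 3.03 litres
# before any air has had time to flow in.
p_lungs = pressure_after_volume_change(ATMOSPHERIC_KPA, 3.00, 3.03)
print(f"pressure in lungs: {p_lungs:.2f} kPa (outside: {ATMOSPHERIC_KPA} kPa)")
```

Because the pressure inside is now lower than the pressure outside, air flows in until the two are equal; the reverse happens when the cavity shrinks again.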
2.2.1 In-breaths to project talk
In beginning to speak, people often make audible in-breaths. In-breaths
are one way to communicate: “I am about to say something.”
Extract (1) shows a question–answer pair, where the answer is given
in overlap with the question. (Where two speakers speak at once, this
is marked with ‘[’ and ‘]’, with the respective talk lined up. The codes
before data extracts are an index to the original sources.)
(1) (Voc9/02.01.04;0342 acid)
1 P Marguerite you need a little bit of acid
2 in there to get a s[et as
3 M → [h↓
4 P [well is that right]
5 M [you wouldn’t] with redcurrants
P and M are talking about making jam. In line 1, P asks M whether some
acid is needed to make it set. In line 3, M marks that she is about to
speak, by producing an audible in-breath (transcribed [h] with [↓] to
indicate that the air is coming into the body, not out) and then gives her
answer while P in line 2 produces the end of his question. Audible
in-breaths like this are one way for a speaker to display “I have something
(more) to say.” Here, the “something to say” is an answer, and M pro-
duces her in-breath at a point relative to P’s talk where it is clear what
kind of answer is relevant in the context. Producing audible in-breaths
is a common device that allows speakers to co-ordinate turn-taking in
conversation.
2.3 The larynx and voicing
The larynx (Figure 2.1) is a structure built of cartilage. Its main purpose
is as a kind of valve to stop things going down into the lungs. We will
look at the larynx in more detail in Chapter 4.
You should be able to locate your larynx quite easily. You probably
know it as your ‘Adam’s apple’ or voice box. It is often visible as a notch
at the front of the neck.
The larynx contains the vocal folds (also known as the vocal cords,
but this suggests that they are like strings on a stringed instrument,
which they are not). When we breathe, they are kept wide apart, which
allows air to pass freely across the glottis, the space between the folds;
but during speaking, the vocal folds play an important role because they
can be made to vibrate. This vibration is called voicing. Sounds which
are accompanied by voicing are called voiced sounds, while those
which are not are called voiceless sounds.
You can sense voicing by a simple experiment. Say the sound [m]
but put your hands over your ears. You will hear quite a loud buzzing
which is conducted through your bones to your ears. Now repeat this
saying a [s] sound, and you will notice that the buzzing stops. Instead,
you will hear a (much quieter) hissing sound, which is due to the turbu-
lent airflow near the back of the teeth. If you now say a [z] sound, you
will notice that everything is the same as for [s], except that there is the
buzzing sound because [z] is voiced. Voicing is caused by the very rapid
vibration of the vocal folds. Voicing is one of the most important fea-
tures of speech sounds, and we will look at it in more detail in Chapter 4.
Figure 2.1 Cross-section of the vocal tract. [Labels: nasal cavities; upper lip; lower lip; teeth; alveolar ridge; hard palate; velum (soft palate); uvula; tongue tip, blade, front, back, body and root; pharynx; epiglottis; hyoid bone; larynx; thyroid cartilage; cricoid cartilage.]
2.4 Airflow
Air passes out of the vocal tract through the mouth or the nose. The way
that it comes out affects the sound generated, so we need a framework to
describe this aspect of speech.
2.4.1 Central and lateral airflow
Central airflow is when the air flows down the middle of the vocal tract.
If you say the sound [s], hold the articulation and then suck air in, you
should feel that it goes cold and dry down the middle of your tongue
and the middle of the roof of your mouth. The cold and dry patches will
be more or less symmetrical on each side of your mouth. All languages
have sounds with central airflow.
Lateral airflow is when the air flows down one or both sides of the
vocal tract. If you say the sound [l], hold the articulation and then suck
air in, you should feel this time that it goes cold and dry down one or
both sides of the mouth, but not down the middle. The sides of the
tongue are lowered, and the air passes out between the back teeth.
In theory, lateral airflow can be produced at the lips too: to do this,
keep the sides of the lips together and try saying something like ‘Pepé
bought a pencil’. It will both sound and look strange. It is probably not a
surprise that no language has lateral airflow caused by constricting the
lips at one side, and this combination is blocked out in the chart of the
International Phonetic Association.
2.4.2 Oral and nasal airflow
Air can exit the vocal tract through the nose or the mouth. This is con-
trolled by the position of the velum. The velum is a sort of valve that
controls airflow through the nose. If the velum is raised, then the nasal
cavities are blocked off. Consequently, air cannot pass through them,
and it must exit the vocal tract through the mouth. Sounds with airflow
exiting through the mouth only are said to have oral airflow. If the
velum is lowered, air flows through the nasal cavities, and out through
the nostrils. If the air flows through the nose, the airflow is nasal.
If you say a [s] sound and pinch your nose, you will notice that you
can easily continue the [s] sound. This is because [s] is oral: the velum
is raised and makes a tight seal, preventing escape of air through the
nose. On the other hand, if you say a [m] sound and pinch your nose,
you will notice that you can only continue the [m] sound for a very
short time. This is because the lips are closed, making oral escape
impossible, but the velum is lowered, so that the airflow is nasal. By
pinching your nose, you effectively seal off the only remaining means
of escape for the air.
A third possibility exists, where air escapes through the nose and the
mouth. For these sounds, the velum is lowered, but there is no com-
plete closure in the oral tract, as we had for [m] (where the complete
closure is at the lips). A good example would be a nasalised vowel, as in
the French word ‘pain’, [pɛ̃], ‘bread’. You might try making a nasalised
[s] sound, [s̃], but you will notice that it is much quieter and less hissy
than it should be, with as much noise caused by air coming through the
nostrils as through the mouth.
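The three possibilities described above depend on just two settings: whether the velum is raised, and whether there is a complete closure in the oral tract. A minimal sketch of that logic follows; the names `Airflow` and `classify_airflow` are illustrative inventions, and the fourth combination (velum raised plus oral closure, where no air escapes at all) is returned as `None`.

```python
from enum import Enum
from typing import Optional


class Airflow(Enum):
    ORAL = "oral"
    NASAL = "nasal"
    ORAL_AND_NASAL = "oral and nasal"


def classify_airflow(velum_raised: bool, oral_closure: bool) -> Optional[Airflow]:
    """Return where air escapes, or None if both exits are sealed."""
    if velum_raised:
        # Nasal cavities are blocked off: air can only leave via the mouth,
        # and not at all if the mouth is closed too.
        return None if oral_closure else Airflow.ORAL
    # Velum lowered: the nose is open; the mouth may or may not be.
    return Airflow.NASAL if oral_closure else Airflow.ORAL_AND_NASAL


# Settings for the sounds discussed in the text.
examples = {
    "[s]": (True, False),               # velum raised, no complete closure
    "[m]": (False, True),               # velum lowered, lips closed
    "nasalised vowel": (False, False),  # velum lowered, mouth open
}
for sound, (velum_raised, oral_closure) in examples.items():
    result = classify_airflow(velum_raised, oral_closure)
    print(sound, "->", result.value if result else "no escape")
```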
2.5 Place of articulation
The vocal tract contains some discrete physical landmarks which are
used primarily in producing and describing consonants. In describing
the place of articulation, we are describing where in the vocal tract a
sound is made.
Articulators are the parts of the oral tract that are used in produc-
ing speech sounds. They are often grouped into two kinds, active and
passive. Active articulators are ones that move: the tongue tip is an
active articulator in sounds like [s t n], since it moves up to behind the
teeth. Passive articulators are articulators that cannot move, but are
the target for active articulators. In the case of sounds like [s t n], the
passive articulator is the bony ridge behind the upper teeth, known as
the alveolar ridge.
Most places of articulation are described by reference to the passive
articulator. We start our description of them with the lips, working our
way down the vocal tract.
2.5.1 Bilabial
Bilabial sounds are sounds made at the lips. ‘Bi-’ means ‘two’, and ‘labial’
is an adjective based on the Latin word for ‘lips’. In English, the sounds [p
b m] are bilabial. If you say [apa aba ama] and look in the mirror, you will
see that they look identical. If you say the sounds silently to yourself and
concentrate on your lips, you will feel that the two lips touch one another
for a short period, and the action is basically the same for all three sounds.
2.5.2 Labiodental
Labiodental sounds are made with the upper teeth (‘dental’) against
the lower lip (‘labio’). In English the labiodental sounds [f v] occur.
Logically speaking, labiodental sounds could involve the lower teeth
and the upper lip, but this is difficult for most people to do: it involves
protruding the jaw, and most people have upper teeth that sit in front of
the lower teeth. Labiodental sounds can be made with the teeth against
either the inside surface of the lip (endolabial) or the outside edge of the
lip (exolabial).
2.5.3 Dental
Dental sounds involve an articulation made against the back of the
upper teeth. [θ ð] in English (as in the initial sounds of ‘think’ and ‘then’)
are often dental; they can also be interdental, that is, produced with the
tongue between (‘inter’ in Latin) the teeth, especially in North America.
Dental forms of [l] and [n] are used in words like ‘health’ and ‘tenth’,
where they are followed by a dental; and dental forms of [t] and [d] are
regularly used in many varieties of English (e.g. some forms of Irish or
New York English, and in Nigeria) as forms of [θ ð].
2.5.4 Alveolar
Alveolar sounds are made at the alveolar ridge. This is a bony ridge
behind the upper teeth. If you rest your tongue on the upper teeth then
gradually move it backwards, you will feel a change in texture from the
smooth enamel to the bumpier gum. Just behind the teeth you should
be able to feel the alveolar ridge. This sticks out a bit just behind the
teeth. People’s alveolar ridges are very variable: some are very promi-
nent, others hardly noticeable. Alternatively, try isolating the consonant
sounds in the word ‘dent’, and you should feel that the tongue tip is
making contact with the alveolar ridge. Sounds with an alveolar place of
articulation in most varieties of English are [t d n l r s z].
2.5.5 Postalveolar
Postalveolar sounds are made just behind (‘post’) the alveolar ridge.
There are four of these in English, [ʃ] and [ʒ], the sounds spelt <sh> in
‘ship’, [ʃɪp], and <si> in ‘invasion’, [ɪnveɪʒən], and the sounds [tʃ dʒ]
as in ‘church’ and ‘judge’. It can be hard to feel the difference in place
of articulation between alveolar and postalveolar sounds, but if you
produce a [s] sound, then a [ʃ] sound, and suck air in immediately after
each sound, you should feel that part of the roof of the mouth which
goes cold and dry is further back for [ʃ] than for [s].
Special symbols for dentals and postalveolars only exist for the frica-
tives. If dental or postalveolar articulations need to be distinguished,
this can be done using diacritics – characters which modify the basic
value of letters, and are placed over or under simple letters. We can
modify the basic alveolar symbol [t] with diacritics: [◌̪] marks dental, so
[t̪] stands for a voiceless dental plosive, and [◌̠] marks ‘retracted’ (i.e.
further back), so [t̠] stands for a voiceless postalveolar plosive.
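In digital text, the dental and retracted diacritics are typed as Unicode combining characters placed after the base letter. The sketch below assumes the usual Unicode encoding of the IPA (U+032A for dental, U+0320 for retracted); the dictionary and function names are illustrative only.

```python
import unicodedata

# Combining characters commonly used for these two IPA diacritics.
DIACRITICS = {
    "dental": "\u032A",     # COMBINING BRIDGE BELOW
    "retracted": "\u0320",  # COMBINING MINUS SIGN BELOW
}


def with_diacritic(base: str, diacritic: str) -> str:
    """Attach a named diacritic to a base symbol, e.g. 't' + dental."""
    return base + DIACRITICS[diacritic]


for name in ("dental", "retracted"):
    symbol = with_diacritic("t", name)
    mark = DIACRITICS[name]
    print(f"{name}: [{symbol}] = 't' + U+{ord(mark):04X} ({unicodedata.name(mark)})")
```

Each result is two code points long (the letter plus the mark), even though it displays as a single modified letter.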
Postalveolars are reported occasionally in dialects which are on their
way to losing distinct [r] sounds. Hedevind (1967) reports a contrast
between dentals/alveolars and postalveolars (transcribed [n̠, z̠, t̠]) in
pairs such as those below in a dialect from Dent (Cumbria, Northern
England).
(2) own [an ]     brain (‘harn’) [an̠]
    mows [maz]    mars [maz̠]
    shot [ʃɔt]    short [ʃɔt̠]
If you slowly move your tongue away from the alveolar ridge and slide
it back along the roof of your mouth, you will feel a change in texture
(it will get smooth and hard) as well as a distinct change in shape (it will
feel domed). This domed part is known as the hard palate. (You may be
able to curl your tongue even further back, when you will feel a change
in texture again – it will feel soft – and it might feel a bit uncomfortable;
this is the velum, or soft palate.)
2.5.6 Retroflex
Retroflex sounds are made with the tongue curled (‘flex’) back (‘retro’)
to the hard palate. (This is one case where the ‘place of articulation’
refers to the active articulator.) The symbols for retroflex sounds are
easy to remember: they all have a rightward-facing hook on the bottom:
[ʈ ɖ ɳ ʂ ʐ ɻ ɭ].
Retroflex [ʈ ɖ ɳ] are frequently used in Indian varieties of English
instead of alveolars for the sounds [t d n]. (Many Indian languages have
dental and retroflex or postalveolar sounds, but not alveolar.) The retroflex
fricative sound [ʂ] also occurs in some varieties of English, notably
some Scottish and North American varieties, as a combination of [r] +
[s], as in ‘of course’, [əv kɔʂ]. And many varieties of American English
use [ɻ] for the r-sound; this is also known as ‘curled-r’.
2.5.7 ‘Coronal’
On the IPA chart, sounds are described according to where in the mouth
they are made; but it is equally important to think about which part of
the tongue is used to make them. Dental, alveolar, postalveolar and ret-
roflex sounds are all made with the front part of the tongue, the tip (the
very frontmost part of the tongue) or the blade (the part just behind the
tip). There is a lot of variability among English speakers as to which part
of the tongue they use to articulate dental, alveolar and postalveolar
sounds, so usually this factor is ignored, since it seems to play no lin-
guistic role for English. In the phonology literature, sounds made with
the front part of the tongue are often called coronal, a term which does
not appear on the IPA chart. (The Latin word ‘corona’ means ‘crown’;
this is the term used to refer to the front part of the tongue.)
2.5.8 Palatal
Palatal sounds are made with the tongue body, the massive part of the
middle of the tongue, raised up to the hard palate, or the roof of the mouth.
Palatal sounds aren’t common in English, except for the sound [j], which
is usually spelt <y>, as in ‘yes’, ‘yacht’, ‘yawn’; or as part of the sequence
[ju] represented by the letter <u> in words like ‘usual’, ‘computer’.
2.5.9 Velar
Velar sounds are made with the tongue back (or dorsum) raised towards
the soft palate. The soft palate is at the back of the roof of the mouth,
and is also known as the velum. The sounds [k ɡ] are velars, as is the
sound [ŋ], represented by <ng> in words like ‘king’, ‘wrong’, ‘hang’;
but as we will see in Chapter 7, there are in fact many variations in the
precise place of articulation in English.
The velum also acts as a kind of valve, because it can be raised and
lowered. When it is lowered, air can pass into the nasal cavities and
escape through the nose. When it is raised, the nasal cavities are sealed
off, and air can only escape through the mouth.
2.5.10 Uvular
Uvular sounds are made with the uvula (which is Latin for ‘little grape’,
the shape of the uvula). The uvula is the little fleshy appendage that
hangs down in the middle of your mouth at the back. If you gargle,
the uvula vibrates. French, German, Dutch and Danish all use uvular
articulations for orthographic <r>; and in fact, one variety of English
(around the north east of England) has, in its more archaic forms, a
uvular sound too in this position.
2.5.11 Pharyngeal
The pharynx is the cavity behind the tongue root and just above the
larynx. Pharyngeal sounds are made by constricting the muscles of the
neck and contracting the pharynx; this kind of articulation occurs rarely
in English.
2.5.12 Glottal
Glottal sounds are made at the glottis, the space between the vocal folds,
which are located at the larynx. English uses a number of such sounds:
[h] as in ‘head’ and its voiced equivalent between two vowels, [ɦ], as
in ‘ahead’; and the glottal stop [ʔ], which is often used alongside or in
place of [t] (as in many Anglo-English – that is, the English of England –
pronunciations of words like ‘water’, [wɔːtə, wɔːʔə]), and in words that
begin with vowels (as in many American and Australian pronunciations
of phrases like ‘the [ʔ] apple’).
2.6 Manner of articulation
As well as knowing where a sound is made, we need to know how it is
made. Consonants involve at least two articulators. When the articula-
tors are brought closer together, the flow of air between them changes:
for instance, it can be stopped or made turbulent. The channels between
any two articulators govern the pressure and flow of air through the
vocal tract, and in turn this affects the kinds of sound that come out. The
way a sound is made (rather than where it is made) is called manner of
articulation. Most manners of articulation are combinable with most
places of articulation.
2.6.1 Stop articulations
Stop articulations are those sounds where a complete closure is made
in the oral tract between two articulators; this stops the air moving out
of the oral tract. Stop articulations include a whole range of sound types,
which vary according to the kind of airflow (oral vs. nasal) and whether
the closure can be maintained for a long time or not.
Plosives are made with a complete closure in the oral tract, and
with the velum raised, which prevents air escaping through the
nose. English plosives include the sounds [p t k b d ɡ]. Plosives are
‘maintainable’ stops because they can be held for a long time, and the
closure portion arises from a deliberate articulation. The term ‘plosive’
relates to the way the stop is released – with what is sometimes called
an ‘explosion’. We look at the release of plosives in more detail in
Chapter 7. It is worth pointing out that many phoneticians use the
word ‘stop’ to mean ‘plosive’. We are using the word ‘stop’ in Catford’s
(2001) sense.
Nasals are made with a complete closure in the oral tract, but with
the velum lowered so that air escapes through the nose. For English
there are three main nasal sounds, [m n ŋ], bilabial, alveolar and velar
respectively. Nasals are usually voiced in English.
The other kinds of stopped articulation are trills and taps. In these
sounds, a closure is made only for a very short time, and the closure
arises because of aerodynamics or the movement of articulators from
one position to another.
Trills are rare in English, but they are one form of ‘rolled r’: they
involve the tongue tip striking the alveolar ridge repeatedly (usually
three to four times). They have a very restricted occurrence in English,
primarily among a very particular kind of theatrical performer, though
they are often thought of as typically Scottish.
Taps on the other hand are quite common in English. These consist
of just one short percussive movement of the tongue tip against the
alveolar ridge. They occur in many varieties of English, but are espe-
cially well known as kinds of /t/ or /d/ sound in many North American
varieties in words like ‘bu[ɾ]er’, ‘wri[ɾ]er’, ‘a[ɾ]om’.
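The four kinds of stop articulation described in this section can be told apart by a few yes/no properties. The sketch below is one reading of the text rather than a standard algorithm: it assumes that nasals, like plosives, count as maintainable stops (the text says this explicitly only for plosives), and it distinguishes taps from trills by the number of brief closures. All names are illustrative.

```python
from enum import Enum


class StopType(Enum):
    PLOSIVE = "plosive"
    NASAL = "nasal"
    TRILL = "trill"
    TAP = "tap"


def classify_stop(maintainable: bool, velum_raised: bool = True, closures: int = 1) -> StopType:
    """Classify a stop articulation.

    maintainable: the closure is deliberate and can be held for a long time.
    velum_raised: only relevant for maintainable stops (oral vs. nasal airflow).
    closures:     number of brief closures, only relevant for momentary stops.
    """
    if closures < 1:
        raise ValueError("a stop needs at least one closure")
    if maintainable:
        return StopType.PLOSIVE if velum_raised else StopType.NASAL
    # Momentary closures: a single strike is a tap, repeated strikes a trill.
    return StopType.TAP if closures == 1 else StopType.TRILL


print(classify_stop(maintainable=True, velum_raised=True).value)   # e.g. [p t k b d ɡ]
print(classify_stop(maintainable=True, velum_raised=False).value)  # e.g. [m n ŋ]
print(classify_stop(maintainable=False, closures=3).value)         # 'rolled r'
print(classify_stop(maintainable=False, closures=1).value)         # e.g. [ɾ]
```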
2.6.2 Fricative articulations
Fricative articulations are the result of two articulators being in close
approximation with each other. This is a degree of stricture whereby
the articulators are held close enough together for air to pass between
them, but because the gap between them is small, the airflow becomes
turbulent and creates friction noise. (In lay terms, we might talk about
a ‘hissing’ sound.) Fricatives in English include [f v θ ð s z ʃ ʒ], the
sounds represented orthographically by the portion in angle brackets: <f>ish,
<v>ow, <th>ink, <th>en, loo<s>e, lo<s>e, wi<sh>, vi<si>on. Notice that there are not very
consistent representations particularly for the sounds [ʃ ʒ] in English
spelling.
Fricative articulations can be held for as long as there is sufficient
air to expel. The amount of friction generated depends on the amount
of air being forced through the stricture and on the degree of stric-
ture. If you produce a [s] sound and then push more air out, you will
notice an increase in the loudness (intensity) of the friction. If you
do this and at the same time make the tongue tenser, the intensity of
the friction will increase and the friction will sound ‘sharper’. On the
other hand, if you relax the articulators in producing a [s] sound, you
will notice that the friction gets quieter and that it changes quality,
becoming ‘flatter’.
Affricates are plosives which are released into fricatives. English has
two of these: [tʃ dʒ], both postalveolar, as in ‘church’ and ‘judge’.
The sounds [h ɦ] as in ‘heart’ and ‘ahead’ are voiceless and voiced
glottal fricatives respectively. These sounds are produced with friction
at the glottis.
Tongue shape plays a determining role in the overall sound of
fricatives. We will return to this in Chapter 8.
2.6.3 Resonant articulations
If articulators are held so as not to generate friction, but to allow air to
pass between them smoothly, then we get articulations known as reso-
nant. The degree of stricture is known as open approximation, and
consonant sounds generated this way are called approximants. Vowels
are another kind of resonant articulation.
Approximants in English include the sounds [j w l r]. (Note: [j] stands
for the sound usually written <y> in English, as in ‘yes’. The phonetic
symbol [y] stands for a vowel.) [j w] are often called glides, because they
are closely related in phonetic terms to the vowels [i] and [u], and can
be thought of as non-syllabic versions of these vowels. [l r] are often
called liquids, and they have certain similarities in the places where
they occur in consonant clusters. We will use the symbol [r] for now
to represent any kind of [r]-sound, though for the majority of English
varieties, a more accurate symbol would be [ɹ].
The English approximants [w j r] are central and [l] is lateral.
Approximants are among the phonetically most complex of sounds in
English because they typically involve more than one articulation; so we
shall leave further discussion of English approximants to a later chapter.
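The manners of articulation in Section 2.6 can be summarised by degree of stricture: complete closure gives a stop, close approximation a fricative, and open approximation a resonant. The sketch below records that mapping for a handful of the English sounds named in the text, together with the central/lateral distinction for approximants. The data structures and function names are illustrative assumptions, not an established scheme.

```python
from enum import Enum


class Stricture(Enum):
    COMPLETE_CLOSURE = "complete closure"
    CLOSE_APPROXIMATION = "close approximation"
    OPEN_APPROXIMATION = "open approximation"


ARTICULATION_BY_STRICTURE = {
    Stricture.COMPLETE_CLOSURE: "stop",
    Stricture.CLOSE_APPROXIMATION: "fricative",
    Stricture.OPEN_APPROXIMATION: "resonant",
}

# A few English consonants from the text, grouped by degree of stricture.
STRICTURE_OF = {
    **{s: Stricture.COMPLETE_CLOSURE for s in ["p", "t", "k", "m", "n", "ŋ"]},
    **{s: Stricture.CLOSE_APPROXIMATION for s in ["f", "v", "s", "z", "ʃ", "ʒ"]},
    **{s: Stricture.OPEN_APPROXIMATION for s in ["j", "w", "l", "r"]},
}

LATERAL_APPROXIMANTS = {"l"}  # [w j r] are central, [l] is lateral


def articulation_type(symbol: str) -> str:
    """Return 'stop', 'fricative' or 'resonant' for a listed symbol."""
    return ARTICULATION_BY_STRICTURE[STRICTURE_OF[symbol]]


def approximant_airflow(symbol: str) -> str:
    """Return 'central' or 'lateral' for an approximant."""
    if STRICTURE_OF[symbol] is not Stricture.OPEN_APPROXIMATION:
        raise ValueError(f"[{symbol}] is not an approximant")
    return "lateral" if symbol in LATERAL_APPROXIMANTS else "central"


for symbol in ["t", "s", "l", "w"]:
    print(f"[{symbol}]: {STRICTURE_OF[symbol].value} -> {articulation_type(symbol)}")
```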
Summary
There are three main aspects of the production of speech sounds in
English: voicing, place of articulation and manner of articulation. We
have introduced much terminology for describing speech sounds.
In later chapters, we will look at place, manner and voicing in much
more detail. We will focus on those aspects of the sound of English
which relate to meaning in its broadest sense: word meaning, utterance
meaning and social meaning. To do this, we will make extensive use of
the categories of the International Phonetic Alphabet.
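The three aspects named in this summary (voicing, place and manner) combine into the conventional three-term label for a consonant, such as "voiced bilabial nasal". The sketch below shows one possible way to hold such labels in a small data structure. The class name, field names and the selection of sounds are illustrative; the place and manner values follow this chapter, while the voicing values for the plosives and for [f v] are standard descriptions that the chapter does not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Consonant:
    symbol: str
    voicing: str  # "voiced" or "voiceless"
    place: str
    manner: str

    def label(self) -> str:
        """Three-term label: voicing, place, manner."""
        return f"{self.voicing} {self.place} {self.manner}"


CONSONANTS = [
    Consonant("p", "voiceless", "bilabial", "plosive"),
    Consonant("b", "voiced", "bilabial", "plosive"),
    Consonant("m", "voiced", "bilabial", "nasal"),
    Consonant("f", "voiceless", "labiodental", "fricative"),
    Consonant("v", "voiced", "labiodental", "fricative"),
    Consonant("s", "voiceless", "alveolar", "fricative"),
    Consonant("z", "voiced", "alveolar", "fricative"),
    Consonant("ŋ", "voiced", "velar", "nasal"),
    Consonant("h", "voiceless", "glottal", "fricative"),
]


def describe(symbol: str) -> str:
    """Look up the three-term label for a symbol in the table above."""
    for consonant in CONSONANTS:
        if consonant.symbol == symbol:
            return consonant.label()
    raise KeyError(symbol)


def symbols_with(**features: str) -> List[str]:
    """Return symbols matching all given features, e.g. place='bilabial'."""
    return [
        c.symbol
        for c in CONSONANTS
        if all(getattr(c, name) == value for name, value in features.items())
    ]


print(describe("m"))
print(symbols_with(place="bilabial"))
print(symbols_with(manner="fricative", voicing="voiced"))
```

Grouping by a shared feature, as `symbols_with` does, is the same kind of reasoning that Exercise 2 below asks for.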
Exercises
1. What is the place and manner of articulation of the consonants in the
following words? Remember to refer to the sounds you make in pro-
nunciation, which do not always straightforwardly correspond with the
letters in the spelling!
a. club      f. Dutch        k. psychology
b. heavy     g. contact      l. hearing
c. deaf      h. community    m. perform
d. kiss      i. industry     n. translate
e. raised    j. night
2. Divide each of the following groups of symbols into two sets of three,
each of which has something in common phonetically. The first one is
done for you.
   Symbols        Set 1                    Set 2
a. p m t n k ŋ    p t k (oral plosives)    m n ŋ (nasals)
b. s l p m v ʃ
c. f j w l z θ
d. s v h ð θ
e. r k n l w
f. t w s m b
g. ʃ t θ ð t
h. h z l ʔ s
i. n a p k j w
j. j w b d ɹ
Further reading
Overviews of the production of speech and discussion on the classifi-
cation of speech sounds can be found in Abercrombie (1967), Catford
(2001) and Ladefoged (2005, 2006). Ball (1993) is aimed at clinicians, but
is very approachable. More advanced readings include Laver (1994) and
Pike (1943). For discussion relating to English more specifically, Jones
(1975) and Gimson’s work (Cruttenden 2001) are classics.