Tagging with Hidden Markov Models
Michael Collins
1 Tagging Problems
In many NLP problems, we would like to model pairs of sequences. Part-of-speech
(POS) tagging is perhaps the earliest, and most famous, example of this type of
problem. In POS tagging our goal is to build a model whose input is a sentence,
for example
the dog saw a cat
and whose output is a tag sequence, for example
D N V D N    (1)
(here we use D for a determiner, N for noun, and V for verb). The tag sequence is
the same length as the input sentence, and therefore specifies a single tag for each
word in the sentence (in this example D for the, N for dog, V for saw, and so on).
We will use x1 . . . xn to denote the input to the tagging model: we will often
refer to this as a sentence. In the above example we have the length n = 5, and
x1 = the, x2 = dog, x3 = saw, x4 = a, x5 = cat. We will use y1 . . . yn to denote
the output of the tagging model: we will often refer to this as the state sequence or
tag sequence. In the above example we have y1 = D, y2 = N, y3 = V, and so on.
This type of problem, where the task is to map a sentence x1 . . . xn to a tag se-
quence y1 . . . yn , is often referred to as a sequence labeling problem, or a tagging
problem.
We will assume that we have a set of training examples, (x(i), y(i)) for i = 1 . . . m, where each x(i) is a sentence x(i)_1 . . . x(i)_{n_i}, and each y(i) is a tag sequence y(i)_1 . . . y(i)_{n_i} (we assume that the i'th example is of length n_i). Hence x(i)_j is the j'th word in the i'th training example, and y(i)_j is the tag for that word. Our task is to
learn a function that maps sentences to tag sequences from these training examples.
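As a concrete illustration (this sketch is not part of the original notes; the variable names and the second example sentence are made up), a single tagging example and a small training set might be represented in Python as follows:

```python
# A single tagging example: a sentence x_1 ... x_n paired with tags y_1 ... y_n.
x = ["the", "dog", "saw", "a", "cat"]   # x_1 ... x_5
y = ["D",   "N",   "V",   "D", "N"]     # y_1 ... y_5, one tag per word

# A training set of m examples; each example pairs a sentence with a tag
# sequence of the same length.
training_examples = [
    (["the", "dog", "saw", "a", "cat"], ["D", "N", "V", "D", "N"]),
    (["the", "cat", "laughs"],          ["D", "N", "V"]),
]
assert all(len(words) == len(tags) for words, tags in training_examples)
```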
2 Generative Models, and The Noisy Channel Model
Supervised problems in machine learning are defined as follows. We assume train-
ing examples (x(1) , y (1) ) . . . (x(m) , y (m) ), where each example consists of an input
x(i) paired with a label y (i) . We use X to refer to the set of possible inputs, and Y
to refer to the set of possible labels. Our task is to learn a function f : X → Y that
maps any input x to a label f (x).
Many problems in natural language processing are supervised learning prob-
lems. For example, in tagging problems each x(i) would be a sequence of words x(i)_1 . . . x(i)_{n_i}, and each y(i) would be a sequence of tags y(i)_1 . . . y(i)_{n_i} (we use n_i to
refer to the length of the i’th training example). X would refer to the set of all
sequences x1 . . . xn , and Y would be the set of all tag sequences y1 . . . yn . Our
task would be to learn a function f : X → Y that maps sentences to tag sequences.
In machine translation, each input x would be a sentence in the source language
(e.g., Chinese), and each “label” would be a sentence in the target language (e.g.,
English). In speech recognition each input would be the recording of some ut-
terance (perhaps pre-processed using a Fourier transform, for example), and each
label is an entire sentence. Our task in all of these examples is to learn a function
from inputs x to labels y, using our training examples (x(i), y(i)) for i = 1 . . . m as
evidence.
One way to define the function f (x) is through a conditional model. In this
approach we define a model that defines the conditional probability
p(y|x)
for any x, y pair. The parameters of the model are estimated from the training
examples. Given a new test example x, the output from the model is

    f(x) = arg max_{y∈Y} p(y|x)    (2)

Thus we simply take the most likely label y as the output from the model. If our
model p(y|x) is close to the true conditional distribution of labels given inputs, the
function f (x) will be close to optimal.
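As a sketch of how a conditional model would be used for prediction (the function p_cond, the label set, and the toy probabilities below are assumptions for illustration, not part of the notes):

```python
def predict(x, labels, p_cond):
    """Return f(x) = argmax over labels y of the estimated p(y|x)."""
    return max(labels, key=lambda y: p_cond(y, x))

# Toy usage with a hand-specified stand-in for a learned conditional model.
labels = ["SPORTS", "POLITICS"]

def p_cond(y, x):
    # Pretend estimate of p(y|x): "game" in the input makes SPORTS likely.
    if "game" in x:
        return 0.9 if y == "SPORTS" else 0.1
    return 0.5

print(predict(["the", "game", "started"], labels, p_cond))  # -> SPORTS
```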
An alternative approach, which is often used in machine learning and natural
language processing, is to define a generative model. Rather than directly estimat-
ing the conditional distribution p(y|x), in generative models we instead model the
joint probability
p(x, y)
over (x, y) pairs. The parameters of the model p(x, y) are again estimated from the
training examples (x(i), y(i)) for i = 1 . . . m. In many cases we further decompose
the probability p(x, y) as follows:

    p(x, y) = p(y)p(x|y)

and then estimate the models for p(y) and p(x|y) separately. These two model
components have the following interpretations:
• p(y) is the prior probability of seeing the label y.
• p(x|y) is the probability of generating the input x, given that the underlying
label is y.
We will see that in many cases it is very convenient to decompose models in this
way; for example, the classical approach to speech recognition is based on this type
of decomposition.
Given a generative model, we can use Bayes rule to derive the conditional
probability p(y|x) for any (x, y) pair:
    p(y|x) = p(y)p(x|y) / p(x)

where

    p(x) = Σ_{y∈Y} p(x, y) = Σ_{y∈Y} p(y)p(x|y)
Thus the joint model is quite versatile, in that we can also derive the probabilities
p(x) and p(y|x).
We use Bayes rule directly in applying the joint model to a new test example.
Given an input x, the output of our model, f(x), can be derived as follows:

    f(x) = arg max_{y∈Y} p(y|x)
         = arg max_{y∈Y} p(y)p(x|y) / p(x)    (3)
         = arg max_{y∈Y} p(y)p(x|y)    (4)

Eq. 3 follows by Bayes rule. Eq. 4 follows because the denominator, p(x), does not
depend on y, and hence does not affect the arg max. This is convenient, because it
means that we do not need to calculate p(x), which can be an expensive operation.
Models that decompose a joint probability into terms p(y) and p(x|y) are
often called noisy-channel models. Intuitively, when we see a test example x, we
assume that it has been generated in two steps: first, a label y has been chosen with
probability p(y); second, the example x has been generated from the distribution
p(x|y). The model p(x|y) can be interpreted as a “channel” which takes a label y
as its input, and corrupts it to produce x as its output. Our task is to find the most
likely label y, given that we observe x.
In summary, in the noisy-channel approach we model the joint probability as

    p(x, y) = p(y)p(x|y)

and, given a new test example x, we predict the label f(x) = arg max_{y∈Y} p(y)p(x|y).
Finding the output f (x) for an input x is often referred to as the decoding
problem.
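The following sketch shows noisy-channel decoding in code: it returns arg max over y of p(y)p(x|y), and never needs to compute p(x). The prior p(y), the per-word channel model, and all of the toy numbers are assumptions made for illustration only.

```python
def decode(x, labels, p_prior, p_channel):
    """Noisy-channel decoding: f(x) = argmax_y p(y) * p(x|y).

    The denominator p(x) is never computed, because it does not depend on y
    and therefore does not affect the argmax.
    """
    return max(labels, key=lambda y: p_prior[y] * p_channel(x, y))

# Toy example: two labels, a prior p(y), and a crude "channel" model p(x|y)
# defined as a product of per-word probabilities (an illustrative assumption,
# not the tagging model used later in these notes).
labels = ["HAM", "SPAM"]
p_prior = {"HAM": 0.7, "SPAM": 0.3}
word_probs = {
    "HAM":  {"hello": 0.5, "meeting": 0.4, "prize": 0.1},
    "SPAM": {"hello": 0.2, "meeting": 0.1, "prize": 0.7},
}

def p_channel(x, y):
    prob = 1.0
    for word in x:
        prob *= word_probs[y].get(word, 1e-6)  # tiny probability for unseen words
    return prob

print(decode(["hello", "prize"], labels, p_prior, p_channel))  # -> SPAM
```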
3 Generative Tagging Models
Assume a finite set V of words, and a finite set K of tags. Define S to be the set of all sequence/tag-sequence pairs ⟨x1 . . . xn, y1 . . . yn⟩ such that n ≥ 1, xi ∈ V for i = 1 . . . n, and yi ∈ K for i = 1 . . . n. A generative tagging model is then a function p such that:
1. For any ⟨x1 . . . xn, y1 . . . yn⟩ ∈ S,

    p(x1 . . . xn, y1 . . . yn) ≥ 0

2. In addition,

    Σ_{⟨x1 ...xn ,y1 ...yn ⟩∈S} p(x1 . . . xn, y1 . . . yn) = 1
Hence p(x1 . . . xn , y1 . . . yn ) is a probability distribution over pairs of sequences
(i.e., a probability distribution over the set S).
Given a generative tagging model, the function from sentences x1 . . . xn to tag
sequences y1 . . . yn is defined as

    f(x1 . . . xn) = arg max_{y1 ...yn} p(x1 . . . xn, y1 . . . yn)

Thus for any input x1 . . . xn, we take the highest probability tag sequence as the
output from the model.
Having introduced generative tagging models, there are three critical questions:
• How do we define a generative tagging model p(x1 . . . xn, y1 . . . yn)?
• How do we estimate the parameters of the model from training examples?
• How do we efficiently find the highest probability tag sequence for any input x1 . . . xn?
The next section describes how trigram hidden Markov models can be used to
answer these three questions.
4 Trigram Hidden Markov Models (Trigram HMMs)
4.1 Definition of Trigram HMMs
A trigram HMM consists of a finite set V of possible words, and a finite set K of possible tags, together with the following parameters:
• A parameter
q(s|u, v)
for any trigram (u, v, s) such that s ∈ K ∪ {STOP}, and u, v ∈ K ∪ {*}.
The value for q(s|u, v) can be interpreted as the probability of seeing the tag
s immediately after the bigram of tags (u, v).
• A parameter
e(x|s)
for any x ∈ V, s ∈ K. The value for e(x|s) can be interpreted as the
probability of seeing observation x paired with state s.
We then define the probability for any sentence x1 . . . xn paired with a tag sequence y1 . . . yn+1, where yn+1 = STOP, as

    p(x1 . . . xn, y1 . . . yn+1) = ∏_{i=1}^{n+1} q(yi | yi−2, yi−1) × ∏_{i=1}^{n} e(xi | yi)

where we have assumed that y0 = y−1 = *.
For example, if we have n = 3, x1 . . . x3 equal to the sentence the dog laughs, and y1 . . . y4 equal to the tag sequence D N V STOP, then

    p(x1 . . . xn, y1 . . . yn+1) = q(D|*, *) × q(N|*, D) × q(V|D, N) × q(STOP|N, V) × e(the|D) × e(dog|N) × e(laughs|V)

The quantity

    q(D|*, *) × q(N|*, D) × q(V|D, N) × q(STOP|N, V)

is the prior probability of seeing the tag sequence D N V STOP, where we have used a second-order Markov model (a trigram model), very similar to the language models we derived in the previous lecture. The quantity

    e(the|D) × e(dog|N) × e(laughs|V)

can be interpreted as the conditional probability of seeing the sentence the dog laughs given the tag sequence D N V STOP.
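As a sketch of this calculation in code (the dictionaries q and e and their values are toy assumptions, not estimates from any corpus):

```python
def joint_probability(words, tags, q, e):
    """Compute p(x_1..x_n, y_1..y_{n+1}) as the product of q and e terms,
    with y_0 = y_{-1} = * and y_{n+1} = STOP."""
    padded = ["*", "*"] + tags + ["STOP"]
    prob = 1.0
    # Trigram (transition) terms, including the final transition to STOP.
    for i in range(2, len(padded)):
        prob *= q[(padded[i], padded[i - 2], padded[i - 1])]
    # Emission terms, one per word.
    for word, tag in zip(words, tags):
        prob *= e[(word, tag)]
    return prob

# Toy parameter values (assumed for illustration).
q = {("D", "*", "*"): 0.9, ("N", "*", "D"): 0.9, ("V", "D", "N"): 0.5,
     ("STOP", "N", "V"): 0.8}
e = {("the", "D"): 0.6, ("dog", "N"): 0.3, ("laughs", "V"): 0.2}

print(joint_probability(["the", "dog", "laughs"], ["D", "N", "V"], q, e))
```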
4.2 Independence Assumptions in Trigram HMMs
We now describe how the form for trigram HMMs can be derived: in particular, we
describe the independence assumptions that are made in the model. Consider a pair
of sequences of random variables X1 . . . Xn , and Y1 . . . Yn , where n is the length
of the sequences. We assume that each Xi can take any value in a finite set V of
words. For example, V might be a set of possible words in English, for example
V = {the, dog, saw, cat, laughs, . . .}. Each Yi can take any value in a finite set K
of possible tags. For example, K might be the set of possible part-of-speech tags
for English, e.g. K = {D, N, V, . . .}.
The length n is itself a random variable—it can vary across different sentences—
but we will use a similar technique to the method used for modeling variable-length
Markov processes (see the previous lecture notes).
Our task will be to model the joint probability
P (X1 = x1 . . . Xn = xn , Y1 = y1 . . . Yn = yn )
for any observation sequence x1 . . . xn paired with a state sequence y1 . . . yn , where
each xi is a member of V, and each yi is a member of K.
We will find it convenient to define one additional random variable Yn+1 , which
always takes the value STOP. This will play a similar role to the STOP symbol seen
for variable-length Markov sequences, as described in the previous lecture notes.
The key idea in hidden Markov models is the following definition:
    P(X1 = x1 . . . Xn = xn, Y1 = y1 . . . Yn+1 = yn+1)
        = ∏_{i=1}^{n+1} P(Yi = yi | Yi−2 = yi−2, Yi−1 = yi−1) × ∏_{i=1}^{n} P(Xi = xi | Yi = yi)    (5)
The first step in deriving this form is to write

    P(X1 = x1 . . . Xn = xn, Y1 = y1 . . . Yn+1 = yn+1)
        = P(Y1 = y1 . . . Yn+1 = yn+1) × P(X1 = x1 . . . Xn = xn | Y1 = y1 . . . Yn+1 = yn+1)

This step is exact, by the chain rule of probabilities. Thus we have decomposed
the joint probability into two terms: first, the probability of choosing tag sequence
y1 . . . yn+1 ; second, the probability of choosing the word sequence x1 . . . xn , con-
ditioned on the choice of tag sequence. Note that this is exactly the same type of
decomposition as seen in noisy channel models.
Now consider the probability of seeing the tag sequence y1 . . . yn+1 . We make
independence assumptions as follows: we assume that for any sequence y1 . . . yn+1 ,
    P(Y1 = y1 . . . Yn+1 = yn+1) = ∏_{i=1}^{n+1} P(Yi = yi | Yi−2 = yi−2, Yi−1 = yi−1)
That is, we have assumed that the sequence Y1 . . . Yn+1 is a second-order Markov
sequence, where each state depends only on the previous two states in the sequence.
Next, consider the probability of the word sequence x1 . . . xn , conditioned on
the choice of tag sequence, y1 . . . yn+1 . We make the following assumption:
    P(X1 = x1 . . . Xn = xn | Y1 = y1 . . . Yn+1 = yn+1)
        = ∏_{i=1}^{n} P(Xi = xi | X1 = x1 . . . Xi−1 = xi−1, Y1 = y1 . . . Yn+1 = yn+1)
        = ∏_{i=1}^{n} P(Xi = xi | Yi = yi)    (7)
The first step of this derivation is exact, by the chain rule. The second step involves
an independence assumption, namely that for i = 1 . . . n,
    P(Xi = xi | X1 = x1 . . . Xi−1 = xi−1, Y1 = y1 . . . Yn+1 = yn+1) = P(Xi = xi | Yi = yi)
Hence we have assumed that the value for the random variable Xi depends only on
the value of Yi . More formally, the value for Xi is conditionally independent of the
previous observations X1 . . . Xi−1 , and the other state values Y1 . . . Yi−1 , Yi+1 . . . Yn+1 ,
given the value of Yi .
One useful way of thinking of this model is to consider the following stochastic
process, which generates sequence pairs y1 . . . yn+1 , x1 . . . xn :
1. Initialize i = 1 and y0 = y−1 = *.
2. Generate yi from the distribution

    q(yi | yi−2, yi−1)

3. If yi = STOP then return the sequence y1 . . . yi, x1 . . . xi−1. Otherwise, generate xi from the distribution e(xi | yi), set i = i + 1, and return to step 2.
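A minimal sketch of this generative process in code, assuming q and e are stored as dictionaries keyed as in the earlier sketch; the toy parameter values and names are illustrative assumptions:

```python
import random

# Toy parameters (assumed for illustration).
q = {("D", "*", "*"): 1.0, ("N", "*", "D"): 1.0, ("V", "D", "N"): 1.0,
     ("STOP", "N", "V"): 1.0}
e = {("the", "D"): 1.0, ("dog", "N"): 0.5, ("cat", "N"): 0.5, ("laughs", "V"): 1.0}
VOCAB = ["the", "dog", "cat", "laughs"]
TAGS = ["D", "N", "V"]

def sample_pair(q, e, vocab, tags):
    """Generate a pair y_1 .. y_{n+1}, x_1 .. x_n by following the process above."""
    y_prev2, y_prev = "*", "*"                    # step 1: y_0 = y_{-1} = *
    tag_seq, word_seq = [], []
    while True:
        # Step 2: draw y_i from q(. | y_{i-2}, y_{i-1}).
        candidates = tags + ["STOP"]
        weights = [q.get((s, y_prev2, y_prev), 0.0) for s in candidates]
        y_i = random.choices(candidates, weights=weights, k=1)[0]
        tag_seq.append(y_i)
        if y_i == "STOP":                         # step 3: stop and return the pair
            return tag_seq, word_seq
        # Otherwise draw x_i from e(. | y_i), then continue.
        word_weights = [e.get((w, y_i), 0.0) for w in vocab]
        word_seq.append(random.choices(vocab, weights=word_weights, k=1)[0])
        y_prev2, y_prev = y_prev, y_i

print(sample_pair(q, e, VOCAB, TAGS))
# e.g. (['D', 'N', 'V', 'STOP'], ['the', 'dog', 'laughs'])
```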
4.3 Estimating the Parameters of a Trigram HMM
We will assume that we have access to some training data. The training data con-
sists of a set of examples where each example is a sentence x1 . . . xn paired with a
tag sequence y1 . . . yn . Given this data, how do we estimate the parameters of the
model? We will see that there is a simple and very intuitive answer to this question.
Define c(u, v, s) to be the number of times the sequence of three states (u, v, s)
is seen in training data: for example, c(V, D, N) would be the number of times the
sequence of three tags V, D, N is seen in the training corpus. Similarly, define
c(u, v) to be the number of times the tag bigram (u, v) is seen. Define c(s) to be
the number of times that the state s is seen in the corpus. Finally, define c(s ; x)
to be the number of times state s is seen paired with observation x in the corpus: for
example, c(N ; dog) would be the number of times the word dog is seen paired
with the tag N.
Given these definitions, the maximum-likelihood estimates are
    q(s|u, v) = c(u, v, s) / c(u, v)

and

    e(x|s) = c(s ; x) / c(s)
For example, we would have the estimates
    q(N|V, D) = c(V, D, N) / c(V, D)

and

    e(dog|N) = c(N ; dog) / c(N)
Thus estimating the parameters of the model is simple: we just read off counts
from the training corpus, and then compute the maximum-likelihood estimates as
described above.
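A minimal sketch of this estimation procedure, assuming the training corpus is a list of (sentence, tag sequence) pairs; the corpus contents and variable names are illustrative assumptions:

```python
from collections import defaultdict

def estimate_parameters(training_examples):
    """Read off counts from the corpus and compute the maximum-likelihood
    estimates q(s|u, v) = c(u, v, s)/c(u, v) and e(x|s) = c(s ; x)/c(s)."""
    c_trigram = defaultdict(int)   # c(u, v, s)
    c_bigram = defaultdict(int)    # c(u, v)
    c_tag = defaultdict(int)       # c(s)
    c_emit = defaultdict(int)      # c(s ; x)

    for words, tags in training_examples:
        padded = ["*", "*"] + tags + ["STOP"]
        for i in range(2, len(padded)):
            c_trigram[(padded[i - 2], padded[i - 1], padded[i])] += 1
            c_bigram[(padded[i - 2], padded[i - 1])] += 1
        for word, tag in zip(words, tags):
            c_tag[tag] += 1
            c_emit[(tag, word)] += 1

    q = {(s, u, v): c_trigram[(u, v, s)] / c_bigram[(u, v)]
         for (u, v, s) in c_trigram}
    e = {(x, s): c_emit[(s, x)] / c_tag[s] for (s, x) in c_emit}
    return q, e

# Toy corpus (assumed for illustration).
corpus = [
    (["the", "dog", "saw", "a", "cat"], ["D", "N", "V", "D", "N"]),
    (["the", "cat", "laughs"],          ["D", "N", "V"]),
]
q, e = estimate_parameters(corpus)
print(q[("N", "*", "D")])   # c(*, D, N) / c(*, D) = 2/2 = 1.0
print(e[("dog", "N")])      # c(N ; dog) / c(N) = 1/3
```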
4.4 Decoding with HMMs
Given a new sentence x1 . . . xn, the decoding problem for a trigram HMM is to find

    arg max_{y1 ...yn+1} p(x1 . . . xn, y1 . . . yn+1)

where the arg max is taken over all sequences y1 . . . yn+1 such that yi ∈ K for
i = 1 . . . n, and yn+1 = STOP. We assume that p again takes the form
    p(x1 . . . xn, y1 . . . yn+1) = ∏_{i=1}^{n+1} q(yi | yi−2, yi−1) × ∏_{i=1}^{n} e(xi | yi)    (8)
Recall that we have assumed in this definition that y0 = y−1 = *, and yn+1 =
STOP.
The naive, brute force method would be to simply enumerate all possible tag
sequences y1 . . . yn+1 , score them under the function p, and take the highest scor-
ing sequence. For example, given the input sentence

    the dog laughs
and assuming that the set of possible tags is K = {D, N, V}, we would consider all
possible tag sequences:
D D D STOP
D D N STOP
D D V STOP
D N D STOP
D N N STOP
D N V STOP
...
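A sketch of the brute-force method in code, assuming q and e dictionaries of the form produced in the estimation sketch above; the enumeration is exponential in the sentence length, so it is usable only for very short sentences:

```python
from itertools import product

def brute_force_decode(words, tags, q, e):
    """Enumerate every tag sequence y_1 .. y_n (with y_{n+1} = STOP), score it
    with p(x_1..x_n, y_1..y_{n+1}) as in Eq. 8, and return the highest scoring one."""
    def score(tag_seq):
        padded = ["*", "*"] + list(tag_seq) + ["STOP"]
        p = 1.0
        for i in range(2, len(padded)):                 # q (transition) terms
            p *= q.get((padded[i], padded[i - 2], padded[i - 1]), 0.0)
        for word, tag in zip(words, tag_seq):           # e (emission) terms
            p *= e.get((word, tag), 0.0)
        return p

    # |K|^n candidate sequences: feasible only for very short sentences.
    return max(product(tags, repeat=len(words)), key=score)

# Usage with q, e from the estimation sketch above (assumed to be in scope):
# print(brute_force_decode(["the", "dog", "laughs"], ["D", "N", "V"], q, e))
```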