Language Modeling
Introduction to N-grams
Dan Jurafsky
Probabilistic Language Models
• Today's goal: assign a probability to a sentence
• Machine Translation:
  • P(high winds tonite) > P(large winds tonite)
• Spell Correction. Why?
  • The office is about fifteen minuets from my house
  • P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition
  • P(I saw a van) >> P(eyes awe of an)
• + Summarization, question-answering, etc., etc.!!
Dan Jurafsky
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words:
  P(W) = P(w_1, w_2, w_3, w_4, w_5 … w_n)
• Related task: probability of an upcoming word:
  P(w_5 | w_1, w_2, w_3, w_4)
• A model that computes either of these:
  P(W) or P(w_n | w_1, w_2 … w_{n-1})
  is called a language model.
• Better: the grammar. But language model or LM is standard.
Dan Jurafsky
How to compute P(W)
• How to compute this joint probability:
  • P(its, water, is, so, transparent, that)
• Intuition: let's rely on the Chain Rule of Probability
Dan Jurafsky
Reminder: The Chain Rule
• Recall the definition of conditional probabilities:
  P(B|A) = P(A,B) / P(A)
  Rewriting: P(A,B) = P(A) P(B|A)
• More variables:
  P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
• The Chain Rule in general:
  P(x_1, x_2, x_3, …, x_n) = P(x_1) P(x_2|x_1) P(x_3|x_1,x_2) … P(x_n|x_1,…,x_{n-1})
Dan Jurafsky
The Chain Rule applied to compute joint probability of words in sentence
  P(w_1 w_2 … w_n) = ∏_i P(w_i | w_1 w_2 … w_{i-1})
P("its water is so transparent") =
  P(its)
  × P(water | its)
  × P(is | its water)
  × P(so | its water is)
  × P(transparent | its water is so)
Dan Jurafsky
How to estimate these probabilities
• Could we just count and divide?
  P(the | its water is so transparent that) =
    Count(its water is so transparent that the) / Count(its water is so transparent that)
• No! Too many possible sentences!
• We'll never see enough data for estimating these
Dan Jurafsky
Markov Assumption
• Simplifying assumption (Andrei Markov):
  P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe:
  P(the | its water is so transparent that) ≈ P(the | transparent that)
Dan Jurafsky
Markov Assumption
  P(w_1 w_2 … w_n) ≈ ∏_i P(w_i | w_{i-k} … w_{i-1})
• In other words, we approximate each component in the product:
  P(w_i | w_1 w_2 … w_{i-1}) ≈ P(w_i | w_{i-k} … w_{i-1})
Dan Jurafsky
Simplest case: Unigram model
  P(w_1 w_2 … w_n) ≈ ∏_i P(w_i)
Some automatically generated sentences from a unigram model:
  fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass
  thrift, did, eighty, said, hard, 'm, july, bullish
  that, or, limited, the
Dan Jurafsky
Bigram model
Condition on the previous word:
  P(w_i | w_1 w_2 … w_{i-1}) ≈ P(w_i | w_{i-1})
Some automatically generated sentences from a bigram model:
  texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen
  outside, new, car, parking, lot, of, the, agreement, reached
  this, would, be, a, record, november
Dan Jurafsky
N-gram models
• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language
  • because language has long-distance dependencies:
    "The computer(s) which I had just put into the machine room on the fifth floor is (are) crashing."
• But we can often get away with N-gram models
Language Modeling
Introduction to N-grams
Language Modeling
Estimating N-gram Probabilities
Dan Jurafsky
Estimating bigram probabilities
• The Maximum Likelihood Estimate:
  P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
  P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
Dan Jurafsky
An example
  P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
  <s> I am Sam </s>
  <s> Sam I am </s>
  <s> I do not like green eggs and ham </s>
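A minimal sketch of how these counts turn into MLE bigram estimates, assuming whitespace-tokenized sentences that already include the <s> and </s> markers (the function name p_mle is this sketch's own, not from the slides):

from collections import Counter

# Toy corpus from the slide, with sentence-boundary markers included.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """Maximum likelihood estimate: P(word | prev) = c(prev, word) / c(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))     # 2/3
print(p_mle("am", "I"))      # 2/3
print(p_mle("</s>", "Sam"))  # 1/2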
Dan Jurafsky
More examples: Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i'm looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i'm looking for a good place to eat breakfast
• when is caffe venezia open during the day
Dan Jurafsky
Raw bigram counts
• Out of 9222 sentences
Dan Jurafsky
Raw bigram probabilities
• Normalize by unigrams:
• Result:
Dan Jurafsky
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) =
  P(I | <s>)
  × P(want | I)
  × P(english | want)
  × P(food | english)
  × P(</s> | food)
  = .000031
Dan Jurafsky
What kinds of knowledge?
• P(english | want) = .0011
• P(chinese | want) = .0065
• P(to | want) = .66
• P(eat | to) = .28
• P(food | to) = 0
• P(want | spend) = 0
• P(i | <s>) = .25
Dan Jurafsky
Practical Issues
• We do everything in log space
  • Avoid underflow
  • (also adding is faster than multiplying)
  log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
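A small sketch of scoring a sentence in log space. The bigram probabilities here are illustrative placeholders (only P(i|<s>) = .25 and P(english|want) = .0011 appear on the slides; the rest are made up so the product lands near the .000031 figure above):

import math

# Hypothetical bigram probabilities; a real model would look these up from counts.
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def sentence_logprob(tokens):
    """Sum log probabilities instead of multiplying raw probabilities."""
    logp = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        logp += math.log(bigram_prob[(prev, word)])
    return logp

tokens = ["<s>", "i", "want", "english", "food", "</s>"]
logp = sentence_logprob(tokens)
print(logp, math.exp(logp))  # for these placeholder values, exp(logp) ≈ 3.1e-5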
Dan Jurafsky
Language Modeling Toolkits
• SRILM
  • http://www.speech.sri.com/projects/srilm/
• KenLM
  • https://kheafield.com/code/kenlm/
Dan Jurafsky
Google N-Gram Release, August 2006
…
Dan Jurafsky
Google N-Gram Release
• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Dan Jurafsky
Google Book N-grams
• http://ngrams.googlelabs.com/
Language Modeling
Estimating N-gram Probabilities
Language Modeling
Evaluation and Perplexity
Dan Jurafsky
Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
  • Assign higher probability to "real" or "frequently observed" sentences
  • than to "ungrammatical" or "rarely observed" sentences?
• We train parameters of our model on a training set.
• We test the model's performance on data we haven't seen.
  • A test set is an unseen dataset that is different from our training set, totally unused.
  • An evaluation metric tells us how well our model does on the test set.
Dan Jurafsky
Training on the test set
• We can't allow test sentences into the training set
• We would assign them an artificially high probability when we see them in the test set
• "Training on the test set"
• Bad science!
• And violates the honor code
Dan Jurafsky
Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B
  • Put each model in a task
    • spelling corrector, speech recognizer, MT system
  • Run the task, get an accuracy for A and for B
    • How many misspelled words corrected properly
    • How many words translated correctly
  • Compare accuracy for A and B
Dan Jurafsky
Difficulty of extrinsic (in-vivo) evaluation of N-gram models
• Extrinsic evaluation
  • Time-consuming; can take days or weeks
• So
  • Sometimes use intrinsic evaluation: perplexity
  • Bad approximation
    • unless the test data looks just like the training data
    • so generally only useful in pilot experiments
  • But is helpful to think about.
Dan Jurafsky
Intuition of Perplexity
• The Shannon Game: How well can we predict the next word?
  • I always order pizza with cheese and ____
    (mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100)
  • The 33rd President of the US was ____
  • I saw a ____
• Unigrams are terrible at this game. (Why?)
• A better model of a text
  • is one which assigns a higher probability to the word that actually occurs
Dan Jurafsky
Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
Perplexity is the inverse probability of the test set, normalized by the number of words:
  PP(W) = P(w_1 w_2 … w_N)^(-1/N)
        = (1 / P(w_1 w_2 … w_N))^(1/N)
Chain rule:
  PP(W) = (∏_{i=1..N} 1 / P(w_i | w_1 … w_{i-1}))^(1/N)
For bigrams:
  PP(W) = (∏_{i=1..N} 1 / P(w_i | w_{i-1}))^(1/N)
Minimizing perplexity is the same as maximizing probability
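A rough sketch of computing perplexity under a bigram model, assuming a probability lookup like the hypothetical bigram_prob dictionary above and a test sequence the model assigns nonzero probability to:

import math

def perplexity(test_tokens, bigram_prob):
    """Perplexity = exp(-(1/N) * sum of log P(w_i | w_{i-1})) over the test bigrams."""
    log_sum = 0.0
    n = 0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_sum += math.log(bigram_prob[(prev, word)])
        n += 1
    return math.exp(-log_sum / n)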
Dan Jurafsky
Perplexity as branching factor
• Let's suppose a sentence consisting of random digits
• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
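One way to write out the arithmetic (not on the slide, but it follows directly from the definition):
  PP(W) = P(w_1 w_2 … w_N)^(-1/N) = ((1/10)^N)^(-1/N) = (1/10)^(-1) = 10
So the perplexity equals the branching factor: at every position the model is choosing among 10 equally likely digits.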
Dan Jurafsky
Lower perplexity = better model
• Training: 38 million words; test: 1.5 million words; WSJ
  N-gram Order:  Unigram   Bigram   Trigram
  Perplexity:    962       170      109
Language Modeling
Evaluation and Perplexity
Language Modeling
Generalization and zeros
Dan Jurafsky
The Shannon Visualization Method
• Choose a random bigram (<s>, w) according to its probability
• Now choose a random bigram (w, x) according to its probability
• And so on until we choose </s>
• Then string the words together
  <s> I
  I want
  want to
  to eat
  eat Chinese
  Chinese food
  food </s>
  → I want to eat Chinese food
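A toy sketch of this sampling procedure, assuming a dictionary mapping each context word to (next word, probability) pairs; the distribution below is made up purely for illustration:

import random

# Hypothetical bigram distributions; a real model would estimate these from counts.
bigram_dist = {
    "<s>": [("I", 0.6), ("want", 0.4)],
    "I": [("want", 1.0)],
    "want": [("to", 1.0)],
    "to": [("eat", 1.0)],
    "eat": [("Chinese", 1.0)],
    "Chinese": [("food", 1.0)],
    "food": [("</s>", 1.0)],
}

def generate_sentence():
    """Repeatedly sample the next word from P(w | previous word) until </s>."""
    word, sentence = "<s>", []
    while word != "</s>":
        candidates, probs = zip(*bigram_dist[word])
        word = random.choices(candidates, weights=probs, k=1)[0]
        if word != "</s>":
            sentence.append(word)
    return " ".join(sentence)

print(generate_sentence())  # e.g. "I want to eat Chinese food"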
In other words, we generate sentences of bigrams by first generating a random bigram that starts with <s> (according to its bigram probability), then choosing a random bigram to follow (again, according to its bigram probability), and so on.
Dan Jurafsky
Approximating Shakespeare
To give an intuition for the increasing power of higher-order N-grams, Fig. 4.3 shows random sentences generated from unigram, bigram, trigram, and 4-gram models trained on Shakespeare's works.
1gram: –To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
       –Hill he late speaks; or! a more to leg less first you enter
2gram: –Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
       –What means, sir. I confess she? then all sorts, he is trim, captain.
3gram: –Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, 'tis done.
       –This shall forbid it should be branded, if renown made it empty.
4gram: –King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv'd in;
       –It cannot be but so.
Figure 4.3: Eight sentences randomly generated from four N-grams computed from Shakespeare's works. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.
Dan Jurafsky
Shakespeare as corpus
• N = 884,647 tokens, V = 29,066
• Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams.
  • So 99.96% of the possible bigrams were never seen (have zero entries in the table)
• Quadrigrams worse: What's coming out looks like Shakespeare because it is Shakespeare
Dan Jurafsky
The wall street journal is not shakespeare (no offense)
1gram: Months the my and issue of year foreign new exchange's september were recession exchange new endorsed a acquire to six executives
2gram: Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U. S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her
3gram: They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions
Figure 4.4: Three sentences randomly generated from three N-gram models computed from 40 million words of the Wall Street Journal, lower-casing all characters and treating punctuation as words.
Dan Jurafsky
Can you guess the author of these random 3-gram sentences?
• They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions
• This shall forbid it should be branded, if renown made it empty.
• "You are uniformly charming!" cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for.
Dan Jurafsky
The perils of overfitting
• N-grams only work well for word prediction if the test corpus looks like the training corpus
  • In real life, it often doesn't
  • We need to train robust models that generalize!
• One kind of generalization: Zeros!
  • Things that don't ever occur in the training set
  • But occur in the test set
Dan Jurafsky
Zeros
• Training set:
  … denied the allegations
  … denied the reports
  … denied the claims
  … denied the request
• Test set:
  … denied the offer
  … denied the loan
P("offer" | denied the) = 0
Dan Jurafsky
Zero probability bigrams
• Bigrams with zero probability
  • mean that we will assign 0 probability to the test set!
• And hence we cannot compute perplexity (can't divide by 0)!
Language Modeling
Generalization and zeros
Language Modeling
Smoothing: Add-one (Laplace) smoothing
Dan Jurafsky
The intuition of smoothing (from Dan Klein)
• When we have sparse statistics:
  P(w | denied the)
    3 allegations
    2 reports
    1 claims
    1 request
    7 total
• Steal probability mass to generalize better:
  P(w | denied the)
    2.5 allegations
    1.5 reports
    0.5 claims
    0.5 request
    2 other (e.g. outcome, attack, man, …)
    7 total
Dan Jurafsky
Add-one estimation
• Also called Laplace smoothing
• Pretend we saw each word one more time than we did
• Just add one to all the counts!
• MLE estimate:
  P_MLE(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
• Add-1 estimate:
  P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
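A minimal sketch of the add-1 estimate on the toy Sam/eggs corpus from earlier; V is the vocabulary size, and the function name p_add1 is this sketch's own:

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]
unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

V = len(unigram_counts)  # vocabulary size

def p_add1(word, prev):
    """Add-1 (Laplace) estimate: (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_add1("am", "I"))   # seen bigram: (2 + 1) / (3 + V)
print(p_add1("ham", "I"))  # unseen bigram: (0 + 1) / (3 + V), no longer zero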
Dan Jurafsky
Maximum Likelihood Estimates
• The maximum likelihood estimate
  • of some parameter of a model M from a training set T
  • maximizes the likelihood of the training set T given the model M
• Suppose the word "bagel" occurs 400 times in a corpus of a million words
• What is the probability that a random word from some other text will be "bagel"?
• MLE estimate is 400/1,000,000 = .0004
• This may be a bad estimate for some other corpus
  • But it is the estimate that makes it most likely that "bagel" will occur 400 times in a million word corpus.
Dan Jurafsky
Berkeley Restaurant Corpus: Laplace-smoothed bigram counts
Dan Jurafsky
Laplace-smoothed bigrams
Dan Jurafsky
Reconstituted counts
Dan Jurafsky
Compare with raw bigram counts
Dan Jurafsky
Add-1 estimation is a blunt instrument
• So add-1 isn't used for N-grams:
  • We'll see better methods
• But add-1 is used to smooth other NLP models
  • For text classification
  • In domains where the number of zeros isn't so huge.
Language Modeling
Smoothing: Add-one (Laplace) smoothing
Language Modeling
Interpolation, Backoff, and Web-Scale LMs
Dan Jurafsky
Backoff and Interpolation
• Sometimes it helps to use less context
  • Condition on less context for contexts you haven't learned much about
• Backoff:
  • use trigram if you have good evidence,
  • otherwise bigram, otherwise unigram
• Interpolation:
  • mix unigram, bigram, trigram
• Interpolation works better
Dan Jurafsky
Linear Interpolation
• Simple interpolation: we combine the different-order N-gram estimators by linearly interpolating all the models, weighing and combining the trigram, bigram, and unigram counts. We estimate the trigram probability P(w_n | w_{n-2} w_{n-1}) by mixing together the unigram, bigram, and trigram probabilities, each weighted by a λ:
  P̂(w_n | w_{n-2} w_{n-1}) = λ_1 P(w_n | w_{n-2} w_{n-1}) + λ_2 P(w_n | w_{n-1}) + λ_3 P(w_n)
  such that the λs sum to 1:  Σ_i λ_i = 1
• Lambdas conditional on context: in a slightly more sophisticated version of linear interpolation, each λ weight is computed by conditioning on the context. This way, if we have particularly accurate counts for a particular bigram, we assume that the counts of the trigrams based on this bigram will be more trustworthy, so we can make the λs for those trigrams higher and give that trigram more weight.
Dan Jurafsky
How to set the lambdas?
• Use a held-out corpus:
  Training Data | Held-Out Data | Test Data
• Choose λs to maximize the probability of held-out data:
  • Fix the N-gram probabilities (on the training data)
  • Then search for λs that give the largest probability to the held-out set:
    log P(w_1 … w_n | M(λ_1 … λ_k)) = Σ_i log P_{M(λ_1 … λ_k)}(w_i | w_{i-1})
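One crude way to do this search is a grid over λ values, scoring the held-out set with each candidate; this is only an illustrative sketch (toolkits typically tune the λs with an EM-style procedure rather than a grid):

import itertools, math

def choose_lambdas(heldout_trigrams, p_uni, p_bi, p_tri, step=0.1):
    """Grid-search lambda triples (summing to 1) that maximize held-out log probability.

    heldout_trigrams: a list of (w, prev1, prev2) tuples drawn from the held-out data.
    """
    best, best_logp = None, float("-inf")
    grid = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in itertools.product(grid, repeat=2):
        l3 = round(1.0 - l1 - l2, 10)
        if l3 < 0:
            continue
        logp = 0.0
        for w, prev1, prev2 in heldout_trigrams:
            p = l1 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l3 * p_uni(w)
            if p <= 0:  # a zero-probability event rules out this setting
                logp = float("-inf")
                break
            logp += math.log(p)
        if logp > best_logp:
            best, best_logp = (l1, l2, l3), logp
    return best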
Dan Jurafsky
Unknown words: Open versus closed vocabulary tasks
• If we know all the words in advance
  • Vocabulary V is fixed
  • Closed vocabulary task
• Often we don't know this
  • Out Of Vocabulary = OOV words
  • Open vocabulary task
• Instead: create an unknown word token <UNK>
  • Training of <UNK> probabilities
    • Create a fixed lexicon L of size V
    • At text normalization phase, any training word not in L is changed to <UNK>
    • Now we train its probabilities like a normal word
  • At decoding time
    • If text input: use UNK probabilities for any word not in training
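A tiny sketch of the <UNK> normalization step. The slide builds the lexicon L as a fixed-size list; here, as one common variant, words are kept only if they appear at least min_count times (the function name and threshold are this sketch's assumptions):

from collections import Counter

def replace_rare_words(sentences, min_count=2, unk="<UNK>"):
    """Map every training token outside the kept lexicon to the <UNK> token."""
    counts = Counter(tok for sent in sentences for tok in sent)
    lexicon = {tok for tok, c in counts.items() if c >= min_count}
    return [[tok if tok in lexicon else unk for tok in sent] for sent in sentences]

# Example: singletons like "like" and "ham" become <UNK>; frequent words are kept.
train = [["i", "am", "sam"], ["sam", "i", "am"], ["i", "like", "ham"]]
print(replace_rare_words(train))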
Dan Jurafsky
Huge web-scale n-grams
• How to deal with, e.g., the Google N-gram corpus
• Pruning
  • Only store N-grams with count > threshold.
    • Remove singletons of higher-order n-grams
  • Entropy-based pruning
• Efficiency
  • Efficient data structures like tries
  • Bloom filters: approximate language models
  • Store words as indexes, not strings
    • Use Huffman coding to fit large numbers of words into two bytes
  • Quantize probabilities (4-8 bits instead of 8-byte float)
Dan Jurafsky
Smoothing for Web-scale N-grams
• "Stupid backoff" (Brants et al. 2007)
• No discounting, just use relative frequencies
  S(w_i | w_{i-k+1}^{i-1}) = count(w_{i-k+1}^{i}) / count(w_{i-k+1}^{i-1})   if count(w_{i-k+1}^{i}) > 0
                           = 0.4 · S(w_i | w_{i-k+2}^{i-1})                  otherwise
  S(w_i) = count(w_i) / N
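A compact sketch of stupid backoff over n-gram count tables. The 0.4 constant and the recursive structure follow the formula above; the function name and the shape of the counts dictionary are assumptions of this sketch:

def stupid_backoff(word, context, counts, total_words, alpha=0.4):
    """S(w | context), backing off to shorter contexts, with the unigram at the bottom.

    counts: dict mapping tuples of words (of every order, including unigrams) to corpus counts.
    context: tuple of the preceding words in order, e.g. ("serve", "as", "the").
    """
    if not context:
        return counts.get((word,), 0) / total_words
    full = context + (word,)
    if counts.get(full, 0) > 0:
        return counts[full] / counts[context]
    # Drop the most distant context word and recurse, multiplying by alpha.
    return alpha * stupid_backoff(word, context[1:], counts, total_words, alpha)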
Dan Jurafsky
N-gram Smoothing Summary
• Add-1 smoothing:
  • OK for text categorization, not for language modeling
• The most commonly used method:
  • Extended Interpolated Kneser-Ney
• For very large N-grams like the Web:
  • Stupid backoff
Dan Jurafsky
Advanced Language Modeling
• Discriminative models:
  • choose n-gram weights to improve a task, not to fit the training set
• Parsing-based models
• Caching models
  • Recently used words are more likely to appear
    P_CACHE(w | history) = λ P(w_i | w_{i-2} w_{i-1}) + (1 - λ) c(w ∈ history) / |history|
  • These perform very poorly for speech recognition (why?)
Language Modeling
Interpolation, Backoff, and Web-Scale LMs
Language Modeling
Advanced: Kneser-Ney Smoothing
Dan Jurafsky
Absolute discounting: just subtract a little from each count
• Suppose we wanted to subtract a little from a count of 4 to save probability mass for the zeros
• How much to subtract?
• Church and Gale (1991)'s clever idea
  • Divide up 22 million words of AP Newswire into a training and a held-out set
  • For each bigram in the training set, see the actual count in the held-out set!
  Bigram count in training   Bigram count in held-out set
  0                          .0000270
  1                          0.448
  2                          1.25
  3                          2.24
  4                          3.23
  5                          4.21
  6                          5.23
  7                          6.21
  8                          7.21
  9                          8.26
Dan Jurafsky
Absolute Discounting Interpolation
• Save ourselves some time and just subtract 0.75 (or some d)!
  P_AbsoluteDiscounting(w_i | w_{i-1}) = (c(w_{i-1}, w_i) - d) / c(w_{i-1})  +  λ(w_{i-1}) P(w)
  (discounted bigram)                                                          (interpolation weight × unigram)
• (Maybe keeping a couple extra values of d for counts 1 and 2)
• But should we really just use the regular unigram P(w)?
Dan Jurafsky
Kneser-Ney Smoothing I
• Better estimate for probabilities of lower-order unigrams!
  • Shannon game: I can't see without my reading ___________?  ("Francisco" or "glasses"?)
  • "Francisco" is more common than "glasses"
  • … but "Francisco" always follows "San"
• The unigram is useful exactly when we haven't seen this bigram!
• Instead of P(w): "How likely is w?"
• P_continuation(w): "How likely is w to appear as a novel continuation?"
  • For each word, count the number of bigram types it completes
  • Every bigram type was a novel continuation the first time it was seen
  P_CONTINUATION(w) ∝ |{w_{i-1} : c(w_{i-1}, w) > 0}|
Dan Jurafsky
Kneser-Ney Smoothing II
• How many times does w appear as a novel continuation:
  P_CONTINUATION(w) ∝ |{w_{i-1} : c(w_{i-1}, w) > 0}|
• Normalized by the total number of word bigram types, |{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0}|:
  P_CONTINUATION(w) = |{w_{i-1} : c(w_{i-1}, w) > 0}| / |{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0}|
Dan Jurafsky
Kneser-Ney Smoothing III
• Alternative metaphor: the number of word types seen to precede w
  |{w_{i-1} : c(w_{i-1}, w) > 0}|
• normalized by the number of word types preceding all words:
  P_CONTINUATION(w) = |{w_{i-1} : c(w_{i-1}, w) > 0}| / Σ_{w'} |{w'_{i-1} : c(w'_{i-1}, w') > 0}|
• A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
Dan Jurafsky
Kneser-Ney Smoothing IV
  P_KN(w_i | w_{i-1}) = max(c(w_{i-1}, w_i) - d, 0) / c(w_{i-1})  +  λ(w_{i-1}) P_CONTINUATION(w_i)
λ is a normalizing constant; the probability mass we've discounted:
  λ(w_{i-1}) = (d / c(w_{i-1})) · |{w : c(w_{i-1}, w) > 0}|
  d / c(w_{i-1}) is the normalized discount
  |{w : c(w_{i-1}, w) > 0}| is the number of word types that can follow w_{i-1}
    = the number of word types we discounted
    = the number of times we applied the normalized discount
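A sketch of bigram Kneser-Ney built from raw bigram counts, following the formulas above (d = 0.75 as on the earlier slide; the function and variable names are this sketch's own, and it assumes the conditioning word was seen in training):

from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, d=0.75):
    """Return a function p_kn(w, prev) built from a Counter of (prev, w) bigram counts."""
    unigram_counts = Counter()
    followers = defaultdict(set)  # prev -> set of word types that follow it
    preceders = defaultdict(set)  # w    -> set of word types that precede it
    for (prev, w), c in bigram_counts.items():
        unigram_counts[prev] += c
        followers[prev].add(w)
        preceders[w].add(prev)
    num_bigram_types = len(bigram_counts)

    def p_continuation(w):
        # Fraction of all bigram types that end in w.
        return len(preceders[w]) / num_bigram_types

    def p_kn(w, prev):
        discounted = max(bigram_counts[(prev, w)] - d, 0) / unigram_counts[prev]
        lam = (d / unigram_counts[prev]) * len(followers[prev])
        return discounted + lam * p_continuation(w)

    return p_kn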
Dan Jurafsky
Kneser-Ney Smoothing: Recursive formulation
  P_KN(w_i | w_{i-n+1}^{i-1}) = max(c_KN(w_{i-n+1}^{i}) - d, 0) / c_KN(w_{i-n+1}^{i-1})  +  λ(w_{i-n+1}^{i-1}) P_KN(w_i | w_{i-n+2}^{i-1})
where
  c_KN(•) = count(•) for the highest order
          = continuation count(•) for lower orders
Continuation count = number of unique single-word contexts for •
Language Modeling
Advanced: Kneser-Ney Smoothing