
Natural Language Processing
Lecture 4: n-gram Language Models
9/13/24
COMS W4705
Daniel Bauer
Probability of a Sentence

• What is the probability that the Naive Bayes model actually computes?

“But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.”
Noam Chomsky (1969)
Language Modeling

• Task: predict the next word given the context.

• Used in speech recognition, handwritten character recognition, spelling correction, text entry UI, machine translation, …
Language Modeling

• Stocks plunged this …

• Let’s meet in Times …

• I took the subway to …

From a NYT story:

• Stocks plunged this ...

• Stocks plunged this morning, despite a cut in interest rates by the …

• Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall …

• Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began …
Human Word Prediction

• Clearly at least some of us have the ability to predict the future.

• How does this work?

• Domain knowledge

• Syntactic knowledge (guess correct part of speech)

• Lexical knowledge
Probability of the Next Word
• Idea: We do not need to model domain, syntactic, and lexical knowledge perfectly.

• Instead, we can rely on the notion of the probability of a sequence (of letters, words, …).
Applications
• Speech recognition: P(“recognize speech”) > P(“wreck a nice beach”)

• Text generation: P(“three houses”) > P(“three house”)

• Spelling correction: P(“my cat eats fish”) > P(“my xat eats fish”)

• Machine translation: P(“the blue house”) > P(“the house blue”)

• Other uses:

• OCR

• Summarization

• Document classification

• Essay scoring
Language Models

• This model can also be used to describe the probability of an entire sentence, not just the last word.

• Use the chain rule:

  P(w1 w2 … wn) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) · … · P(wn | w1, …, wn-1)
Markov Assumption
• P(wn | w1, w2, …, wn-1) is difficult to estimate.

• The longer the sequence becomes, the less likely w1 w2 w3 … wn-1 is to appear in the training data.

• Instead, we make the following simple independence assumption (Markov assumption):

  P(wn | w1, …, wn-1) ≈ P(wn | wn-k+1, …, wn-1)

• The probability of seeing wn depends only on the previous k-1 words.
bi-gram language model
• Using the Markov assumption and the chain rule:

  P(w1 w2 … wn) ≈ P(w1) · P(w2 | w1) · P(w3 | w2) · … · P(wn | wn-1)

• More consistent to use only bigrams:

  P(w1 w2 … wn) ≈ P(w1 | START) · P(w2 | w1) · … · P(wn | wn-1)
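
As a rough illustration (not from the slides), here is a minimal Python sketch of scoring a sentence with a bigram model; the probability table is a toy stand-in, filled with the numbers used in the worked BeRP example later in this lecture.

    # Toy bigram table (values borrowed from the worked BeRP example later on;
    # any pair not listed here is treated as unseen).
    bigram_prob = {
        ("START", "i"): 0.25, ("i", "want"): 0.32, ("want", "to"): 0.65,
        ("to", "eat"): 0.26, ("eat", "chinese"): 0.02, ("chinese", "food"): 0.60,
        ("food", "END"): 0.2,
    }

    def bigram_sentence_prob(words):
        # P(w1 ... wn) ~= product of P(w_i | w_{i-1}) over the START/END-padded sentence.
        padded = ["START"] + words + ["END"]
        p = 1.0
        for prev, cur in zip(padded, padded[1:]):
            p *= bigram_prob.get((prev, cur), 0.0)  # unseen bigram -> 0 (no smoothing yet)
        return p

    print(bigram_sentence_prob("i want to eat chinese food".split()))  # ~3.2e-05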


n-grams

• The sequence wn is a unigram.

• The sequence wn-1, wn is a bigram.

• The sequence wn-2, wn-1, wn is a trigram….

• The sequence wn-3, wn-2, wn-1, wn is a quadrigram…


Variable-Length Language
Models
• We typically don’t know what the length of the sentence is.

• Instead, we use a special marker STOP/END that indicates the end of a sentence.

• We typically just augment the sentence with START and STOP markers to provide the appropriate context.

START i want to eat Chinese food END

P(i|START)·P(want|i)·P(to|want)·P(eat|to)·P(Chinese|eat)·P(food|Chinese)·P(END|food)
trigram example

P(i|START, START)·P(want|START,i)·P(to|i,want)·P(eat|want,to)·
P(Chinese|to,eat) · P(food|eat,Chinese)·P(END|Chinese,food)

Why do we only need one END marker and two START markers?
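
A minimal sketch of the padding step (assuming sentences are plain lists of tokens): an n-gram model needs n-1 START markers so that even the first word has a full-length context, while a single END marker is enough to model where the sentence stops.

    def pad(words, n):
        # n-1 START markers give the first word a full n-gram context;
        # one END marker lets the model assign probability to ending the sentence.
        return ["START"] * (n - 1) + words + ["END"]

    print(pad("i want to eat".split(), 2))  # ['START', 'i', 'want', 'to', 'eat', 'END']
    print(pad("i want to eat".split(), 3))  # ['START', 'START', 'i', 'want', 'to', 'eat', 'END']
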
Bigram example from the Berkeley
Restaurant Project (BeRP)

Eat on      0.16    Eat Thai      0.03
Eat some    0.06    Eat breakfast 0.03
Eat lunch   0.06    Eat in        0.02
Eat dinner  0.05    Eat Chinese   0.02
Eat at      0.04    Eat Mexican   0.02
Eat a       0.04    Eat tomorrow  0.01
Eat Indian  0.04    Eat dessert   0.007
Eat today   0.03    Eat British   0.001

http://www1.icsi.berkeley.edu/Speech/berp.html
Bigram example from the Berkeley
Restaurant Project (BeRP)
START I 0.25 Want some 0.04
START I’d 0.06 Want Thai 0.01
START Tell 0.04 To eat 0.26
START I’m 0.02 To have 0.14
I want 0.32 To spend 0.09
I would 0.29 To be 0.02
I don’t 0.08 British food 0.60
I have 0.04 British restaurant 0.15
Want to 0.65 British cuisine 0.01
Want a 0.05 British lunch 0.01
Bigram example from the Berkeley
Restaurant Project (BeRP)
• Assume P(END | food) = 0.2

P(I want to eat British food) =
P(I | START) · P(want | I) · P(to | want) · P(eat | to) · P(British | eat) · P(food | British) · P(END | food) =
.25 · .32 · .65 · .26 · .001 · .60 · .2 = .0000016

P(I want to eat Chinese food) =
P(I | START) · P(want | I) · P(to | want) · P(eat | to) · P(Chinese | eat) · P(food | Chinese) · P(END | food) =
.25 · .32 · .65 · .26 · .02 · .60 · .2 = .000032
log probabilities
• Probabilities can become very small (a few orders of magnitude per token).

• We often work with log probabilities in practice:

  log P(w1 … wn) ≈ Σi log P(wi | wi-1), with w0 = START
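
As a small sketch, here is the Chinese-food computation from the previous slide done both directly and in log space; the sum of log probabilities represents the same value while staying in a comfortable numeric range.

    import math

    factors = [0.25, 0.32, 0.65, 0.26, 0.02, 0.60, 0.2]  # bigram factors from the BeRP example

    product = math.prod(factors)                   # ~3.2e-05
    log_sum = sum(math.log2(p) for p in factors)   # ~ -14.9
    print(product, 2 ** log_sum)                   # both give the same probability
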
What do ngrams capture?

• Probabilities seem to capture syntactic facts and world knowledge.

• eat is often followed by an NP.

• British food is not too popular, but Chinese is.


Estimating n-gram
probabilities
• We can estimate n-gram probabilities using maximum likelihood estimates. For bigrams:

  P(w | v) = count(v, w) / count(v)

• Or for trigrams:

  P(w | u, v) = count(u, v, w) / count(u, v)
Bigram Counts from BeRP

         I     Want   To    Eat   Chinese  Food   Lunch  ...
I        8     1087   0     13    0        0      0
Want     3     0      786   0     6        8      6
To       3     0      10    860   3        0      12
Eat      0     0      2     0     19       2      52
Chinese  2     0      0     0     0        120    1
Food     19    0      17    0     0        0      0
Lunch    4     0      0     0     0        1      0
Counts to Probabilities
         I     Want   To    Eat   Chinese  Food   Lunch  ...
I        8     1087   0     13    0        0      0
Want     3     0      786   0     6        8      6
To       3     0      10    860   3        0      12
Eat      0     0      2     0     19       2      52
Chinese  2     0      0     0     0        120    1
Food     19    0      17    0     0        0      0
Lunch    4     0      0     0     0        1      0

• Unigram counts:

I      Want   To     Eat   Chinese  Food   Lunch
3437   1215   3256   938   213      1506   459
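
A minimal sketch of the maximum-likelihood estimate computed from counts; the two small dictionaries below hold just a few of the BeRP counts from the tables above, for illustration.

    bigram_count = {("i", "want"): 1087, ("want", "to"): 786, ("to", "eat"): 860}
    unigram_count = {"i": 3437, "want": 1215, "to": 3256}

    def mle_bigram(prev, word):
        # P(word | prev) = count(prev, word) / count(prev)
        return bigram_count.get((prev, word), 0) / unigram_count[prev]

    print(round(mle_bigram("i", "want"), 2))   # 0.32, as in the BeRP bigram table
    print(round(mle_bigram("want", "to"), 2))  # 0.65
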
Corpora
• Large digital collections of text or speech. Different languages, domains, modalities. Annotated or unannotated.

• English:

• Brown Corpus

• BNC, ANC

• Wall Street Journal

• AP newswire

• Gigaword, WAC, ...

• DARPA/NIST text/speech corpora (Call Home, ATIS, Switchboard, Broadcast News, …)

• MT: Hansards, Europarl
Google Web 1T 5-gram
Corpus

File sizes: approx. 24 GB compressed (gzip'ed) text files

Number of tokens: 1,024,908,267,229


Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
Google Web 1T 5-gram
Corpus
• 3-gram examples:

ceramics collectables collectibles 55


ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection </S> 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
Google Web 1T 5-gram
Corpus
• 4-gram examples:
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
serve as the informal 102
Data sparsity in n-gram
models
• Sparsity is a problem all over NLP: Test data contains language phenomena not encountered during training.

• For n-gram models there are two issues:

• We may not have seen all tokens.

• We may not have seen all ngrams (even though the individual tokens are known).

• Token has not been encountered in this context before:

  P(lunch | I) = 0.0
Unseen Tokens
• Typical approach to unseen tokens:

• Start with a specific lexicon of known tokens.

• Replace all tokens in the training and testing corpus that are not in the lexicon with an UNK token.

• Practical approach:

• Lexicon contains all words that appear more than k times in the training corpus.

• Replace all other tokens with UNK.
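
A minimal sketch of the practical approach (the threshold k, the UNK string, and the toy corpus are all just illustrative choices):

    from collections import Counter

    def build_lexicon(training_tokens, k=1):
        # Keep only words that appear more than k times in the training corpus.
        counts = Counter(training_tokens)
        return {w for w, c in counts.items() if c > k}

    def replace_unknown(tokens, lexicon):
        # Map every out-of-lexicon token to the single UNK token.
        return [t if t in lexicon else "UNK" for t in tokens]

    train = "i want to eat chinese food i want to eat thai food".split()
    lexicon = build_lexicon(train, k=1)
    print(replace_unknown("i want to eat british food".split(), lexicon))
    # ['i', 'want', 'to', 'eat', 'UNK', 'food']
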
Unseen Contexts
• Two basic approaches:

• Smoothing / Discounting: Move some probability mass from seen trigrams to unseen trigrams.

• Back-off: Use (n-1)-gram, (n-2)-gram, … probabilities to compute the n-gram probability.

• Other techniques:

• Class-based backoff: use the backoff probability for a specific word class / part of speech.
Zipf’s Law
• Problem: n-grams (and most other linguistic phenomena) follow a Zipfian distribution.

• A few words occur very frequently.

• Most words occur very rarely. Many are seen only once.

• Zipf’s law: a word’s frequency is approximately inversely proportional to its rank in the word distribution list, i.e. frequency ∝ 1 / rank.
Zipf’s Law

[Plot: word frequency (y-axis) against word rank (x-axis), showing the Zipfian curve.]
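
As a quick sketch of how to check this on any corpus (the tokens list is assumed to come from elsewhere): rank words by frequency and look at frequency · rank, which stays roughly constant when frequency is proportional to 1/rank.

    from collections import Counter

    def zipf_table(tokens, top=10):
        # Rank words by frequency; under Zipf's law, freq * rank is roughly constant.
        ranked = Counter(tokens).most_common(top)
        return [(rank, word, freq, freq * rank)
                for rank, (word, freq) in enumerate(ranked, start=1)]

    # for rank, word, freq, product in zipf_table(tokens):
    #     print(rank, word, freq, product)
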
Smoothing
• Smoothing flattens spiky distributions.

• Before smoothing, P(w | We denied the) is based on the counts:
  3 allegations, 2 reports, 1 claims, 1 request (7 total)

• After smoothing:
  2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 UNK (other words such as charges, benefits, motion) (7 total)

Smoothing is like Robin Hood: Steal from the rich, give to the poor.

Example from Dan Klein.
Additive Smoothing
• Classic approach: Laplacian, a.k.a. additive smoothing. Add 1 (or a small constant) to every count before normalizing. For unigrams:

  P(w) = (count(w) + 1) / (N + V)

• N is the number of tokens, V is the number of types (i.e. the size of the vocabulary).

• Inaccurate in practice.
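
A minimal sketch of add-one smoothing applied to bigrams (the count dictionaries and vocabulary size are assumed to come from a training corpus):

    def laplace_bigram(prev, word, bigram_count, unigram_count, vocab_size):
        # P_add(word | prev) = (count(prev, word) + 1) / (count(prev) + V)
        return (bigram_count.get((prev, word), 0) + 1) / (unigram_count.get(prev, 0) + vocab_size)

    # Every unseen bigram now gets a small non-zero probability, but frequent
    # contexts give up a lot of mass to their V possible continuations,
    # which is one reason this is inaccurate in practice.
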
Linear Interpolation
• Use the denser distributions of shorter ngrams to “fill in” sparse ngram distributions:

  P_interp(w | u, v) = λ1 · P_ML(w | u, v) + λ2 · P_ML(w | v) + λ3 · P_ML(w)

• Where λ1 + λ2 + λ3 = 1 and λi ≥ 0.

• Works well in practice (but not a lot of theoretical justification why).

• Parameters λi can be estimated on development data (for example, using Expectation Maximization).
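
A minimal sketch of the interpolation itself (the three probability functions are assumed to be MLE estimators; the lambda values here are arbitrary and would normally be tuned on development data):

    def interpolated_trigram(u, v, w, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
        # P(w | u, v) = l1 * P_ML(w | u, v) + l2 * P_ML(w | v) + l3 * P_ML(w)
        l1, l2, l3 = lambdas
        assert min(lambdas) >= 0 and abs(l1 + l2 + l3 - 1.0) < 1e-9
        return l1 * p_tri(u, v, w) + l2 * p_bi(v, w) + l3 * p_uni(w)
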
Discounting
• Idea: set aside some probability mass, then fill in the missing mass using back-off.

• Define discounted counts count*(v, w) = count(v, w) - β, where 0 < β < 1.

• Then for all seen bigrams:

  P(w | v) = count*(v, w) / count(v)

• For each context v the missing probability mass is

  α(v) = 1 - Σ w: count(v,w)>0  count*(v, w) / count(v)

• We can now divide this held-out mass between the unseen words (evenly or using back-off).
Katz’ Backoff
• Divide the held-out probability mass proportionally to the unigram probability of the unseen words in context v:

  P_katz(w | v) = count*(v, w) / count(v)                            if count(v, w) > 0
  P_katz(w | v) = α(v) · P(w) / Σ w': count(v,w')=0  P(w')           otherwise
Katz’ Backoff for Trigrams
• For trigrams: recursively compute the backoff probability for unseen bigrams. Then distribute the held-out probability mass proportionally to that bigram backoff probability:

  P_katz(w | u, v) = count*(u, v, w) / count(u, v)                                     if count(u, v, w) > 0
  P_katz(w | u, v) = α(u, v) · P_katz(w | v) / Σ w': count(u,v,w')=0  P_katz(w' | v)   otherwise

• where:

  α(u, v) = 1 - Σ w: count(u,v,w)>0  count*(u, v, w) / count(u, v)

• Often combined with Good-Turing smoothing.
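
A minimal sketch of the bigram case, combining the discounting and backoff ideas above (count dictionaries, the unigram distribution, and the discount β are assumed inputs; this is an illustration, not the exact formulation used with Good-Turing smoothing):

    def katz_bigram(prev, word, bigram_count, unigram_count, unigram_prob, beta=0.5):
        seen = {w for (v, w) in bigram_count if v == prev}
        if (prev, word) in bigram_count:
            # Discounted estimate for a seen bigram.
            return (bigram_count[(prev, word)] - beta) / unigram_count[prev]
        # Missing probability mass alpha(prev) left over by the discounting ...
        alpha = 1.0 - sum((bigram_count[(prev, w)] - beta) / unigram_count[prev]
                          for w in seen)
        # ... shared among unseen words in proportion to their unigram probability.
        unseen_total = sum(p for w, p in unigram_prob.items() if w not in seen)
        return alpha * unigram_prob[word] / unseen_total
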
Evaluating n-gram models
• Extrinsic evaluation: Apply the model in an application (for example, language classification). Evaluate the application.

• Intrinsic evaluation: measure how well the model approximates unseen language data.

• Can compute the probability of each sentence according to the model. Higher probability -> better model.

• Typically we compute Perplexity instead.
Perplexity
• Perplexity (per word) measures how well the ngram model predicts the sample.

• Perplexity is defined as 2^(-l), where

  l = (1/M) · Σi log2 P(si)

  for test sentences s1, …, sm containing a total of M tokens.

• Lower perplexity = better model. Intuition:

• Assume we are predicting one word at a time.

• With a uniform distribution, all successor words are equally likely. Perplexity is equal to the vocabulary size.

• Perplexity can be thought of as the “effective vocabulary size”.
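
A minimal sketch of the computation (sentence_logprob is an assumed helper that returns log2 of the model probability of one sentence; counting the END marker in M is one common convention):

    def perplexity(sentences, sentence_logprob):
        # l = (1/M) * sum_i log2 P(s_i); perplexity = 2^(-l).
        M = sum(len(s) + 1 for s in sentences)  # + 1 counts the END marker per sentence
        l = sum(sentence_logprob(s) for s in sentences) / M
        return 2 ** (-l)
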
