
Natural Language Processing
Lecture 4: n-gram Language Models
9/13/24
COMS W4705
Daniel Bauer
Probability of a Sentence

• What is the probability that the Naive Bayes model actually computes?

“But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.”
Noam Chomsky (1969)
Language Modeling

• Task: predict the next word given the context.

• Used in speech recognition, handwritten character recognition, spelling correction, text entry UI, machine translation, …
Language Modeling

• Stocks plunged this …

• Let’s meet in Times …

• I took the subway to …

From a NYT story:

• Stocks plunged this ...

• Stocks plunged this morning, despite a cut in interest rates by the …

• Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall …

• Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began …
Human Word Prediction

• Clearly at least some of us have the ability to predict the future.

• How does this work?

• Domain knowledge

• Syntactic knowledge (guess correct part of speech)

• Lexical knowledge
Probability of the Next Word
• Idea: We do not need to model domain, syntactic, and lexical knowledge perfectly.

• Instead, we can rely on the notion of the probability of a sequence (of letters, words, …).
Applications
• Speech recognition: P(“recognize speech”) > P(“wreck a nice beach”)

• Text generation: P(“three houses”) > P(“three house”)

• Spelling correction: P(“my cat eats fish”) > P(“my xat eats fish”)

• Machine translation: P(“the blue house”) > P(“the house blue”)

• Other uses:

• OCR

• Summarization

• Document classification

• Essay scoring
Language Models

• This model can also be used to describe the probability of an entire sentence, not just the last word.

• Use the chain rule:

  P(w1 w2 … wn) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) · … · P(wn | w1, …, wn-1)
Markov Assumption
• P(wn | w1, w2, …, wn-1) is difficult to estimate.

• The longer the sequence becomes, the less likely w1 w2 w3 … wn-1 is to appear in the training data.

• Instead, we make the following simple independence assumption (Markov assumption):

  P(wn | w1, …, wn-1) ≈ P(wn | wn-k+1, …, wn-1)

• The probability of seeing wn depends only on the previous k-1 words.
bi-gram language model
• Using the Markov assumption and the chain rule:

  P(w1 w2 … wn) ≈ P(w1) · P(w2 | w1) · P(w3 | w2) · … · P(wn | wn-1)

• More consistent to use only bigrams:

  P(w1 w2 … wn) ≈ P(w1 | START) · P(w2 | w1) · … · P(wn | wn-1)
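
As a rough illustration (not from the slides), here is a minimal Python sketch of scoring a sentence with a bigram model; the probability table is a toy stand-in, filled with the numbers used in the worked BeRP example later in this lecture.

    # Toy bigram table (values borrowed from the worked BeRP example later on;
    # any pair not listed here is treated as unseen).
    bigram_prob = {
        ("START", "i"): 0.25, ("i", "want"): 0.32, ("want", "to"): 0.65,
        ("to", "eat"): 0.26, ("eat", "chinese"): 0.02, ("chinese", "food"): 0.60,
        ("food", "END"): 0.2,
    }

    def bigram_sentence_prob(words):
        # P(w1 ... wn) ~= product of P(w_i | w_{i-1}) over the START/END-padded sentence.
        padded = ["START"] + words + ["END"]
        p = 1.0
        for prev, cur in zip(padded, padded[1:]):
            p *= bigram_prob.get((prev, cur), 0.0)  # unseen bigram -> 0 (no smoothing yet)
        return p

    print(bigram_sentence_prob("i want to eat chinese food".split()))  # ~3.2e-05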


n-grams

• The sequence wn is a unigram.

• The sequence wn-1, wn is a bigram.

• The sequence wn-2, wn-1, wn is a trigram….

• The sequence wn-3, wn-2, wn-1, wn is a quadrigram…


Variable-Length Language
Models
• We typically don’t know what the length of the sentence is.

• Instead, we use a special marker STOP/END that indicates the end of a sentence.

• We typically just augment the sentence with START and STOP markers to provide the appropriate context.

START i want to eat Chinese food END

P(i|START)·P(want|i)·P(to|want)·P(eat|to)·P(Chinese|eat)·P(food|Chinese)·P(END|food)
trigram example

P(i|START, START)·P(want|START,i)·P(to|i,want)·P(eat|want,to)·
P(Chinese|to,eat) · P(food|eat,Chinese)·P(END|Chinese,food)

Why do we only need one END marker and two START markers?
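
A minimal sketch of the padding step (assuming sentences are plain lists of tokens): an n-gram model needs n-1 START markers so that even the first word has a full-length context, while a single END marker is enough to model where the sentence stops.

    def pad(words, n):
        # n-1 START markers give the first word a full n-gram context;
        # one END marker lets the model assign probability to ending the sentence.
        return ["START"] * (n - 1) + words + ["END"]

    print(pad("i want to eat".split(), 2))  # ['START', 'i', 'want', 'to', 'eat', 'END']
    print(pad("i want to eat".split(), 3))  # ['START', 'START', 'i', 'want', 'to', 'eat', 'END']
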
Bigram example from the Berkeley
Restaurant Project (BeRP)

Eat on      0.16    Eat Thai      0.03
Eat some    0.06    Eat breakfast 0.03
Eat lunch   0.06    Eat in        0.02
Eat dinner  0.05    Eat Chinese   0.02
Eat at      0.04    Eat Mexican   0.02
Eat a       0.04    Eat tomorrow  0.01
Eat Indian  0.04    Eat dessert   0.007
Eat today   0.03    Eat British   0.001

http://www1.icsi.berkeley.edu/Speech/berp.html
Bigram example from the Berkeley
Restaurant Project (BeRP)
START I 0.25 Want some 0.04
START I’d 0.06 Want Thai 0.01
START Tell 0.04 To eat 0.26
START I’m 0.02 To have 0.14
I want 0.32 To spend 0.09
I would 0.29 To be 0.02
I don’t 0.08 British food 0.60
I have 0.04 British restaurant 0.15
Want to 0.65 British cuisine 0.01
Want a 0.05 British lunch 0.01
Bigram example from the Berkeley
Restaurant Project (BeRP)
• Assume P(END | food) = 0.2

P(I want to eat British food) =
P(I | START) · P(want | I) · P(to | want) · P(eat | to) · P(British | eat) · P(food | British) · P(END | food) =
.25 · .32 · .65 · .26 · .001 · .60 · .2 = .0000016

P(I want to eat Chinese food) =
P(I | START) · P(want | I) · P(to | want) · P(eat | to) · P(Chinese | eat) · P(food | Chinese) · P(END | food) =
.25 · .32 · .65 · .26 · .02 · .60 · .2 = .000032
log probabilities
• Probabilities can become very small (a few orders of magnitude per token).

• We often work with log probabilities in practice:

  log P(w1 … wn) ≈ Σi log P(wi | wi-1), with w0 = START
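
As a small sketch, here is the Chinese-food computation from the previous slide done both directly and in log space; the sum of log probabilities represents the same value while staying in a comfortable numeric range.

    import math

    factors = [0.25, 0.32, 0.65, 0.26, 0.02, 0.60, 0.2]  # bigram factors from the BeRP example

    product = math.prod(factors)                   # ~3.2e-05
    log_sum = sum(math.log2(p) for p in factors)   # ~ -14.9
    print(product, 2 ** log_sum)                   # both give the same probability
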
What do ngrams capture?

• Probabilities seem to capture syntactic facts and world knowledge.

• eat is often followed by an NP.

• British food is not too popular, but Chinese is.


Estimating n-gram
probabilities
• We can estimate n-gram probabilities using maximum likelihood estimates. For bigrams:

  P(w | v) = count(v, w) / count(v)

• Or for trigrams:

  P(w | u, v) = count(u, v, w) / count(u, v)
Bigram Counts from BeRP

         I     Want   To    Eat   Chinese  Food   Lunch  ...
I        8     1087   0     13    0        0      0
Want     3     0      786   0     6        8      6
To       3     0      10    860   3        0      12
Eat      0     0      2     0     19       2      52
Chinese  2     0      0     0     0        120    1
Food     19    0      17    0     0        0      0
Lunch    4     0      0     0     0        1      0
Counts to Probabilities
         I     Want   To    Eat   Chinese  Food   Lunch  ...
I        8     1087   0     13    0        0      0
Want     3     0      786   0     6        8      6
To       3     0      10    860   3        0      12
Eat      0     0      2     0     19       2      52
Chinese  2     0      0     0     0        120    1
Food     19    0      17    0     0        0      0
Lunch    4     0      0     0     0        1      0

• Unigram counts:

I      Want   To     Eat   Chinese  Food   Lunch
3437   1215   3256   938   213      1506   459
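
A minimal sketch of the maximum-likelihood estimate computed from counts; the two small dictionaries below hold just a few of the BeRP counts from the tables above, for illustration.

    bigram_count = {("i", "want"): 1087, ("want", "to"): 786, ("to", "eat"): 860}
    unigram_count = {"i": 3437, "want": 1215, "to": 3256}

    def mle_bigram(prev, word):
        # P(word | prev) = count(prev, word) / count(prev)
        return bigram_count.get((prev, word), 0) / unigram_count[prev]

    print(round(mle_bigram("i", "want"), 2))   # 0.32, as in the BeRP bigram table
    print(round(mle_bigram("want", "to"), 2))  # 0.65
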
Corpora
• Large digital collections of text or speech. Different languages, domains, modalities. Annotated or unannotated.

• English:

• Brown Corpus

• BNC, ANC

• Wall Street Journal

• AP newswire

• Gigaword, WAC, ...

• DARPA/NIST text/speech corpora (Call Home, ATIS, Switchboard, Broadcast News, …)

• MT: Hansards, Europarl
Google Web 1T 5-gram
Corpus

File sizes: approx. 24 GB compressed (gzip'ed) text files

Number of tokens: 1,024,908,267,229


Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
Google Web 1T 5-gram
Corpus
• 3-gram examples:

ceramics collectables collectibles 55


ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection </S> 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
Google Web 1T 5-gram
Corpus
• 4-gram examples:
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
serve as the informal 102
Data sparsity in n-gram
models
• Sparsity is a problem all over NLP: Test data contains language phenomena not encountered during training.

• For n-gram models there are two issues:

• We may not have seen all tokens.

• We may not have seen all ngrams (even though the individual tokens are known).

• Token has not been encountered in this context before:

  P(lunch | I) = 0.0
Unseen Tokens
• Typical approach to unseen tokens:

• Start with a specific lexicon of known tokens.

• Replace all tokens in the training and testing corpus that are not in the lexicon with an UNK token.

• Practical approach:

• Lexicon contains all words that appear more than k times in the training corpus.

• Replace all other tokens with UNK.
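
A minimal sketch of the practical approach (the threshold k, the UNK string, and the toy corpus are all just illustrative choices):

    from collections import Counter

    def build_lexicon(training_tokens, k=1):
        # Keep only words that appear more than k times in the training corpus.
        counts = Counter(training_tokens)
        return {w for w, c in counts.items() if c > k}

    def replace_unknown(tokens, lexicon):
        # Map every out-of-lexicon token to the single UNK token.
        return [t if t in lexicon else "UNK" for t in tokens]

    train = "i want to eat chinese food i want to eat thai food".split()
    lexicon = build_lexicon(train, k=1)
    print(replace_unknown("i want to eat british food".split(), lexicon))
    # ['i', 'want', 'to', 'eat', 'UNK', 'food']
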
Unseen Contexts
• Two basic approaches:

• Smoothing / Discounting: Move some probability mass from seen trigrams to unseen trigrams.

• Back-off: Use (n-1)-gram, (n-2)-gram, … probabilities to compute the n-gram probability.

• Other techniques:

• Class-based backoff: use the backoff probability for a specific word class / part of speech.
Zipf’s Law
• Problem: n-grams (and most other linguistic phenomena) follow a Zipfian distribution.

• A few words occur very frequently.

• Most words occur very rarely. Many are seen only once.

• Zipf’s law: a word’s frequency is approximately inversely proportional to its rank in the word distribution list, i.e. frequency ∝ 1 / rank.
Zipf’s Law

[Plot: word frequency (y-axis) against word rank (x-axis), showing the Zipfian curve.]
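
As a quick sketch of how to check this on any corpus (the tokens list is assumed to come from elsewhere): rank words by frequency and look at frequency · rank, which stays roughly constant when frequency is proportional to 1/rank.

    from collections import Counter

    def zipf_table(tokens, top=10):
        # Rank words by frequency; under Zipf's law, freq * rank is roughly constant.
        ranked = Counter(tokens).most_common(top)
        return [(rank, word, freq, freq * rank)
                for rank, (word, freq) in enumerate(ranked, start=1)]

    # for rank, word, freq, product in zipf_table(tokens):
    #     print(rank, word, freq, product)
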
Smoothing
• Smoothing flattens spiky distributions.

• Before smoothing, P(w | We denied the) is based on the counts:
  3 allegations, 2 reports, 1 claims, 1 request (7 total)

• After smoothing:
  2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 UNK (other words such as charges, benefits, motion) (7 total)

Smoothing is like Robin Hood: Steal from the rich, give to the poor.

Example from Dan Klein.
Additive Smoothing
• Classic approach: Laplacian, a.k.a. additive smoothing. Add 1 (or a small constant) to every count before normalizing. For unigrams:

  P(w) = (count(w) + 1) / (N + V)

• N is the number of tokens, V is the number of types (i.e. the size of the vocabulary).

• Inaccurate in practice.
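
A minimal sketch of add-one smoothing applied to bigrams (the count dictionaries and vocabulary size are assumed to come from a training corpus):

    def laplace_bigram(prev, word, bigram_count, unigram_count, vocab_size):
        # P_add(word | prev) = (count(prev, word) + 1) / (count(prev) + V)
        return (bigram_count.get((prev, word), 0) + 1) / (unigram_count.get(prev, 0) + vocab_size)

    # Every unseen bigram now gets a small non-zero probability, but frequent
    # contexts give up a lot of mass to their V possible continuations,
    # which is one reason this is inaccurate in practice.
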
Linear Interpolation
• Use the denser distributions of shorter ngrams to “fill in” sparse ngram distributions:

  P_interp(w | u, v) = λ1 · P_ML(w | u, v) + λ2 · P_ML(w | v) + λ3 · P_ML(w)

• Where λ1 + λ2 + λ3 = 1 and λi ≥ 0.

• Works well in practice (but not a lot of theoretical justification why).

• Parameters λi can be estimated on development data (for example, using Expectation Maximization).
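
A minimal sketch of the interpolation itself (the three probability functions are assumed to be MLE estimators; the lambda values here are arbitrary and would normally be tuned on development data):

    def interpolated_trigram(u, v, w, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
        # P(w | u, v) = l1 * P_ML(w | u, v) + l2 * P_ML(w | v) + l3 * P_ML(w)
        l1, l2, l3 = lambdas
        assert min(lambdas) >= 0 and abs(l1 + l2 + l3 - 1.0) < 1e-9
        return l1 * p_tri(u, v, w) + l2 * p_bi(v, w) + l3 * p_uni(w)
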
Discounting
• Idea: set aside some probability mass, then fill in the missing mass using back-off.

• Define discounted counts count*(v, w) = count(v, w) - β, where 0 < β < 1.

• Then for all seen bigrams:

  P(w | v) = count*(v, w) / count(v)

• For each context v the missing probability mass is

  α(v) = 1 - Σ w: count(v,w)>0  count*(v, w) / count(v)

• We can now divide this held-out mass between the unseen words (evenly or using back-off).
Katz’ Backoff
• Divide the held-out probability mass proportionally to the unigram probability of the unseen words in context v:

  P_katz(w | v) = count*(v, w) / count(v)                            if count(v, w) > 0
  P_katz(w | v) = α(v) · P(w) / Σ w': count(v,w')=0  P(w')           otherwise
Katz’ Backoff for Trigrams
• For trigrams: recursively compute the backoff probability for unseen bigrams. Then distribute the held-out probability mass proportionally to that bigram backoff probability:

  P_katz(w | u, v) = count*(u, v, w) / count(u, v)                                     if count(u, v, w) > 0
  P_katz(w | u, v) = α(u, v) · P_katz(w | v) / Σ w': count(u,v,w')=0  P_katz(w' | v)   otherwise

• where:

  α(u, v) = 1 - Σ w: count(u,v,w)>0  count*(u, v, w) / count(u, v)

• Often combined with Good-Turing smoothing.
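
A minimal sketch of the bigram case, combining the discounting and backoff ideas above (count dictionaries, the unigram distribution, and the discount β are assumed inputs; this is an illustration, not the exact formulation used with Good-Turing smoothing):

    def katz_bigram(prev, word, bigram_count, unigram_count, unigram_prob, beta=0.5):
        seen = {w for (v, w) in bigram_count if v == prev}
        if (prev, word) in bigram_count:
            # Discounted estimate for a seen bigram.
            return (bigram_count[(prev, word)] - beta) / unigram_count[prev]
        # Missing probability mass alpha(prev) left over by the discounting ...
        alpha = 1.0 - sum((bigram_count[(prev, w)] - beta) / unigram_count[prev]
                          for w in seen)
        # ... shared among unseen words in proportion to their unigram probability.
        unseen_total = sum(p for w, p in unigram_prob.items() if w not in seen)
        return alpha * unigram_prob[word] / unseen_total
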
Evaluating n-gram models
• Extrinsic evaluation: Apply the model in an application (for example, language classification). Evaluate the application.

• Intrinsic evaluation: measure how well the model approximates unseen language data.

• Can compute the probability of each sentence according to the model. Higher probability -> better model.

• Typically we compute Perplexity instead.
Perplexity
• Perplexity (per word) measures how well the ngram model predicts the sample.

• Perplexity is defined as 2^(-l), where

  l = (1/M) · Σi log2 P(si)

  for test sentences s1, …, sm containing a total of M tokens.

• Lower perplexity = better model. Intuition:

• Assume we are predicting one word at a time.

• With a uniform distribution, all successor words are equally likely. Perplexity is equal to the vocabulary size.

• Perplexity can be thought of as the “effective vocabulary size”.
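
A minimal sketch of the computation (sentence_logprob is an assumed helper that returns log2 of the model probability of one sentence; counting the END marker in M is one common convention):

    def perplexity(sentences, sentence_logprob):
        # l = (1/M) * sum_i log2 P(s_i); perplexity = 2^(-l).
        M = sum(len(s) + 1 for s in sentences)  # + 1 counts the END marker per sentence
        l = sum(sentence_logprob(s) for s in sentences) / M
        return 2 ** (-l)
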
