Lecture 4: n-gram Language Models
9/13/24
COMS W4705
Daniel Bauer
Probability of a Sentence
• Domain knowledge
• Syntactic knowledge
• Lexical knowledge
Probability of the Next Word
• Idea: We do not need to model domain, syntactic, and
lexical knowledge perfectly.
• Spelling correction: P(“my cat eats fish”) > P(“my xat eats fish”)
• Other uses
• OCR
• Summarization
• Essay scoring
Language Models
…
Markov Assumption
• The full history P(w_i | w_1, …, w_{i−1}) is difficult to estimate.
• Markov assumption: condition each word only on the previous n−1 words.
bigram example
P(i|START)·P(want|i)·P(to|want)·P(eat|to)·P(Chinese|eat)·P(food|Chinese)·P(END|food)
trigram example
P(i|START, START)·P(want|START,i)·P(to|i,want)·P(eat|want,to)·
P(Chinese|to,eat) · P(food|eat,Chinese)·P(END|Chinese,food)
Why do we only need one END marker and two START markers?
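A minimal Python sketch of this factorization (the helper below is illustrative, not from the lecture): it pads the sentence with n−1 START markers and a single END marker, so the bigram and trigram cases above fall out of the same loop.

```python
def ngram_factors(tokens, n):
    """List the (context, word) factors of an n-gram model:
    pad with n-1 START markers on the left and a single END marker."""
    padded = ["START"] * (n - 1) + tokens + ["END"]
    return [(tuple(padded[i - n + 1:i]), padded[i])
            for i in range(n - 1, len(padded))]

sentence = "i want to eat Chinese food".split()

# Bigram factors: P(i|START), P(want|i), ..., P(END|food)
for context, word in ngram_factors(sentence, n=2):
    print(f"P({word} | {', '.join(context)})")

# Trigram factors: two START markers of padding, but still a single END marker,
# because every factor conditions on exactly the previous n-1 tokens.
for context, word in ngram_factors(sentence, n=3):
    print(f"P({word} | {', '.join(context)})")
```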
Bigram example from the Berkeley Restaurant Project (BeRP)
http://www1.icsi.berkeley.edu/Speech/berp.html
Bigram example from the Berkeley Restaurant Project (BeRP)
Selected bigram probabilities P(w_i | w_{i−1}):
w_{i-1}    w_i           P
START      I             0.25
START      I’d           0.06
START      Tell          0.04
START      I’m           0.02
I          want          0.32
I          would         0.29
I          don’t         0.08
I          have          0.04
Want       to            0.65
Want       a             0.05
Want       some          0.04
Want       Thai          0.01
To         eat           0.26
To         have          0.14
To         spend         0.09
To         be            0.02
British    food          0.60
British    restaurant    0.15
British    cuisine       0.01
British    lunch         0.01
Bigram example from the Berkeley Restaurant Project (BeRP)
• Assume P(END | food) = 0.2
w0 = START
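As a rough sketch (not from the slides), the sentence probability is just the product of the table entries along the chain. P(END | food) = 0.2 is the slide's assumption, and P(British | eat) is a made-up placeholder because that bigram does not appear in the excerpt above.

```python
# Bigram probabilities taken from the BeRP table above, plus the slide's
# assumption P(END | food) = 0.2. P(British | eat) is NOT in the excerpt;
# the 0.001 below is a made-up placeholder just to complete the chain.
bigram_prob = {
    ("START", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.001,   # placeholder, not from the table
    ("British", "food"): 0.60,
    ("food", "END"): 0.2,        # assumed on the slide
}

def sentence_probability(tokens, probs):
    """Multiply bigram probabilities over the padded sentence (w0 = START)."""
    padded = ["START"] + tokens + ["END"]
    p = 1.0
    for w_prev, w in zip(padded, padded[1:]):
        p *= probs[(w_prev, w)]
    return p

print(sentence_probability("I want to eat British food".split(), bigram_prob))
```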
What do ngrams capture?
• Or for trigrams:
Bigram Counts from BeRP
         I    Want  To   Eat  Chinese  Food  Lunch
I        8    1087  0    13   0        0     0
Want     3    0     786  0    6        8     6
To       3    0     10   860  3        0     12
Eat      0    0     2    0    19       2     52
Chinese  2    0     0    0    0        120   1
Food     19   0     17   0    0        0     0
Lunch    4    0     0    0    0        1     0
(rows: w_{i-1}, columns: w_i)
Counts to Probabilities
         I    Want  To   Eat  Chinese  Food  Lunch  ...
I        8    1087  0    13   0        0     0
Want     3    0     786  0    6        8     6
To       3    0     10   860  3        0     12
Eat      0    0     2    0    19       2     52
Chinese  2    0     0    0    0        120   1
Food     19   0     17   0    0        0     0
Lunch    4    0     0    0    0        1     0
(rows: w_{i-1}, columns: w_i)
• Unigram counts:
I Want To Eat Chinese Food Lunch
3437 1215 3256 938 213 1506 459
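The conversion is relative frequency (maximum likelihood): P(w_i | w_{i−1}) = C(w_{i−1} w_i) / C(w_{i−1}). A small sketch over the counts above, which reproduces the earlier table entry P(want | I) ≈ 0.32:

```python
# Bigram counts (rows: w_{i-1}, columns: w_i) and unigram counts from BeRP.
words = ["I", "Want", "To", "Eat", "Chinese", "Food", "Lunch"]
bigram_counts = {
    "I":       [8, 1087, 0, 13, 0, 0, 0],
    "Want":    [3, 0, 786, 0, 6, 8, 6],
    "To":      [3, 0, 10, 860, 3, 0, 12],
    "Eat":     [0, 0, 2, 0, 19, 2, 52],
    "Chinese": [2, 0, 0, 0, 0, 120, 1],
    "Food":    [19, 0, 17, 0, 0, 0, 0],
    "Lunch":   [4, 0, 0, 0, 0, 1, 0],
}
unigram_counts = {"I": 3437, "Want": 1215, "To": 3256, "Eat": 938,
                  "Chinese": 213, "Food": 1506, "Lunch": 459}

# Maximum-likelihood estimate: P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
bigram_prob = {
    (w_prev, w): count / unigram_counts[w_prev]
    for w_prev, row in bigram_counts.items()
    for w, count in zip(words, row)
}

print(round(bigram_prob[("I", "Want")], 2))        # 0.32
print(round(bigram_prob[("Chinese", "Food")], 2))  # 0.56
```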
Corpora
• Large digital collections of text or speech. Different languages, domains, modalities. Annotated or unannotated.
• English:
• Brown Corpus
• BNC, ANC
• AP newswire
• Practical approach:
• Other techniques:
• Most words occur very rarely. Many are seen only once.
[Figure: word frequency vs. word rank; frequency drops off sharply with rank.]
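To see this skew on any corpus, a quick sketch (the file path is a placeholder) that counts whitespace-separated tokens and reports how many word types occur exactly once:

```python
from collections import Counter

# "corpus.txt" is a placeholder path; any plain-text corpus will do.
with open("corpus.txt", encoding="utf-8") as f:
    counts = Counter(f.read().split())

ranked = counts.most_common()                   # (word, frequency) sorted by rank
hapaxes = sum(1 for _, c in ranked if c == 1)   # word types seen exactly once

print(f"{len(ranked)} word types; {hapaxes} occur only once")
for rank, (word, freq) in enumerate(ranked[:10], start=1):
    print(rank, word, freq)
```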
Smoothing
• Smoothing flattens spiky distributions.
[Figure: a small distribution over words (allegations, reports, charges, benefits, motion, claims, request) before and after smoothing: observed counts (reports 2, claims 1, request 1, …; 7 total) are shaved down (claims 0.5, request 0.5, …) and the freed mass (2) is given to unseen words (UNK); the total stays 7.]
Smoothing is like Robin Hood: Steal from the rich, give to the poor.
• Inaccurate in practice.
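One concrete scheme in this spirit is add-one (Laplace) smoothing (a choice of mine for illustration; the extracted slides do not name a specific method): every count is incremented by one, which shaves probability mass off seen events and spreads it over unseen ones. A minimal sketch:

```python
def addone_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed bigram estimate:
    P(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V).
    Unseen bigrams get a small nonzero probability; seen ones are discounted."""
    c_bigram = bigram_counts.get((w_prev, w), 0)
    c_unigram = unigram_counts.get(w_prev, 0)
    return (c_bigram + 1) / (c_unigram + vocab_size)

# Toy counts in the spirit of the BeRP tables above.
bigram_counts = {("I", "want"): 1087, ("want", "to"): 786}
unigram_counts = {"I": 3437, "want": 1215}
V = 1446  # placeholder vocabulary size; not given in the excerpt

print(addone_bigram_prob("I", "want", bigram_counts, unigram_counts, V))   # discounted below 0.32
print(addone_bigram_prob("I", "lunch", bigram_counts, unigram_counts, V))  # unseen, but > 0
```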
Linear Interpolation
• Use denser distributions of shorter ngrams to “fill in” sparse ngram distributions.
• Interpolated trigram estimate:
P_interp(w_i | w_{i−2}, w_{i−1}) = λ1·P(w_i | w_{i−2}, w_{i−1}) + λ2·P(w_i | w_{i−1}) + λ3·P(w_i)
• where λ1 + λ2 + λ3 = 1 and each λj ≥ 0.
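A minimal sketch of the interpolated estimate (the λ weights and the unigram and trigram values below are placeholders; only P(eat | to) = 0.26 comes from the BeRP table; in practice the λs are tuned on held-out data):

```python
def interpolated_prob(w, context, trigram_p, bigram_p, unigram_p, lambdas):
    """Linear interpolation of trigram, bigram, and unigram estimates:
    P_interp(w | u, v) = l1 * P(w | u, v) + l2 * P(w | v) + l3 * P(w),
    with l1 + l2 + l3 = 1."""
    l1, l2, l3 = lambdas
    u, v = context
    return (l1 * trigram_p.get((u, v, w), 0.0)
            + l2 * bigram_p.get((v, w), 0.0)
            + l3 * unigram_p.get(w, 0.0))

# Placeholder distributions and weights, just to show the mechanics.
trigram_p = {("want", "to", "eat"): 0.26}  # placeholder value
bigram_p = {("to", "eat"): 0.26}           # from the BeRP table
unigram_p = {"eat": 0.002}                 # placeholder value
lambdas = (0.6, 0.3, 0.1)                  # must sum to 1

print(interpolated_prob("eat", ("want", "to"), trigram_p, bigram_p, unigram_p, lambdas))
```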