NLP 1.2
Dr Prabhjot Kaur
E16646
Assistant Professor
CSE(AIT), CU
NATURAL LANGUAGE PROCESSING : Course Objectives
The objectives of this course are:
• To understand the foundational concepts of speech and language processing,
including ambiguity and computational models.
• To explore the role of algorithms and automata in morphological parsing and
linguistic analysis.
• To familiarize students with language modelling techniques like n-grams and
smoothing, and their application in speech recognition.
• To analyze the structure of language through parsing, feature structures, and
probabilistic grammars.
• To introduce semantic representation techniques for understanding meaning in
natural language.
• To equip students with the skills to implement NLP systems using tools and
techniques like tagging, parsing, and unification.
COURSE OUTCOMES
On completion of this course, the students shall be able to:
Table of Contents
• N-Gram
• Bi-Gram
• Maximum Likelihood Estimation
• Smoothing
• Entropy
Words
Bi-Gram
● An N-gram is a sequence of N tokens (or words).
Example:
For the sentence “The cow jumps over the moon”, if N=2 (known as
bigrams), then the n-grams would be:
• the cow
• cow jumps
• jumps over
• over the
• the moon
So you have 5 n-grams in this case. Notice that we moved from the->cow
to cow->jumps to jumps->over, etc., essentially moving one word forward
to generate the next bigram.
If N=3 (known as trigrams), the n-grams for the same sentence would be:
• the cow jumps
• cow jumps over
• jumps over the
• over the moon
So you have 4 n-grams in this case.
If X = number of words in a given sentence K, the number of n-grams for sentence K would be:
Number of n-grams in K = X − (N − 1)
For the 6-word sentence above, this gives 6 − (2 − 1) = 5 bigrams and 6 − (3 − 1) = 4 trigrams.
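The sliding-window idea above is easy to express in code. Below is a minimal Python sketch (not part of the original slides; the function name ngrams is our own) that generates the bigrams and trigrams of the example sentence and confirms the X − (N − 1) count.

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) for a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the cow jumps over the moon".split()

bigrams = ngrams(sentence, 2)   # 6 - (2 - 1) = 5 bigrams
trigrams = ngrams(sentence, 3)  # 6 - (3 - 1) = 4 trigrams

print(bigrams)                       # [('the', 'cow'), ('cow', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'moon')]
print(len(bigrams), len(trigrams))   # 5 4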
How do N-gram language models work?
An N-gram language model predicts the probability of a given N-gram within any
sequence of words in the language. If we have a good N-gram model, we can
predict p(w | h), the probability of seeing the word w given a history h of
previous words, where the history contains n-1 words.
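As an illustration (not taken from the slides, and ignoring sentence-boundary markers), a bigram model (N=2) approximates the probability of the earlier example sentence as a product of one-word-history conditionals:
P(the cow jumps over the moon) ≈ P(the) · P(cow | the) · P(jumps | cow) · P(over | jumps) · P(the | over) · P(moon | the)
Here each word is conditioned only on the n − 1 = 1 word before it.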
• We must estimate this probability to construct an N-gram model.
Maximum Likelihood Estimation
● P_MLE(w1, …, wn) = C(w1, …, wn) / N, where C(w1, …, wn) is the frequency
of the n-gram w1, …, wn in the training data and N is the total number of training n-grams.
● P_MLE(wn | w1, …, wn−1) = C(w1, …, wn) / C(w1, …, wn−1)
● This estimate is called the Maximum Likelihood Estimate (MLE) because it
is the choice of parameters that gives the highest probability to the
training corpus.
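A minimal Python sketch of the bigram case of the MLE formula, P_MLE(wn | wn−1) = C(wn−1, wn) / C(wn−1); the toy corpus and names used here are illustrative, not from the slides.

from collections import Counter

corpus = "the cow jumps over the moon the cow sleeps".split()

unigram_counts = Counter(corpus)                  # C(w)
bigram_counts = Counter(zip(corpus, corpus[1:]))  # C(w_prev, w)

def p_mle(word, prev):
    """MLE estimate of P(word | prev) from raw relative frequencies."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("cow", "the"))     # C(the, cow) / C(the) = 2/3
print(p_mle("moon", "the"))    # 1/3
print(p_mle("jumps", "moon"))  # 0.0 -- an unseen bigram gets zero probability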
Smoothing
● What do we do with words that are in our vocabulary (they are not
unknown words) but appear in a test set in an unseen context (for
example they appear after a word they never appeared after in
training)?
● To keep a language model from assigning zero probability to these
unseen events, we’ll have to shave off a bit of probability mass from
some more frequent events and give it to the events we’ve never seen.
● This modification is called smoothing or discounting.
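One simple form of smoothing is add-one (Laplace) smoothing. The sketch below (illustrative, reusing the toy corpus from the MLE sketch) adds 1 to every bigram count and adds V, the vocabulary size, to the denominator, so unseen bigrams no longer get zero probability.

from collections import Counter

corpus = "the cow jumps over the moon the cow sleeps".split()
V = len(set(corpus))  # vocabulary size

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_laplace(word, prev):
    """Add-one smoothed estimate: (C(prev, word) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_laplace("jumps", "moon"))  # (0 + 1) / (1 + 6) -- no longer zero
print(p_laplace("cow", "the"))     # (2 + 1) / (3 + 6) -- seen bigrams are discounted slightly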
Entropy
● Entropy is a measure of information.
● Given a random variable X ranging over whatever we are predicting
(words, letters, parts of speech, the set of which we’ll call χ) and with a
particular probability function, call it p(x), the entropy of the random
variable X is:
H(X) = − Σ x∈χ p(x) log2 p(x)
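A minimal Python sketch of this definition, applied to two illustrative word distributions (the distributions themselves are made up for the example).

import math

def entropy(p):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

uniform = {"the": 0.25, "cow": 0.25, "jumps": 0.25, "moon": 0.25}
skewed  = {"the": 0.70, "cow": 0.10, "jumps": 0.10, "moon": 0.10}

print(entropy(uniform))  # 2.0 bits: maximum uncertainty over four equally likely words
print(entropy(skewed))   # ~1.36 bits: a peaked distribution carries less uncertainty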
TEXTBOOKS
T1: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics,
and Speech Recognition by Daniel Jurafsky and James H. Martin
T2: Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper
REFERENCE BOOKS:
R1: Handbook of Natural Language Processing, Second Edition by Nitin Indurkhya and Fred J. Damerau
Course Link:
https://in.coursera.org/specializations/natural-language-processing
Video Link:
https://youtu.be/YVQcE5tV26s
Web Link:
https://www.tutorialspoint.com/natural_language_processing/natural_language_processing_tutorial.pdf
THANK YOU
For queries
Email:
[email protected]