MODULE 3
Syntax Analysis
Part-of-Speech (POS) Tagging
• The process of assigning one of the parts of speech to a given word.
• Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and
their sub-categories.
• Annotate each word in a sentence with a part-of-speech marker.
• Lowest level of syntactic analysis.
• Useful for subsequent syntactic parsing and word sense disambiguation.
Word Sense Disambiguation (WSD) is the task of determining the correct
meaning of a word in a given context, as words can have multiple senses
What are POS Tags?
• Original Brown corpus used a large set of 87 POS tags.
• Most common in NLP today is the Penn Treebank set of 45 tags.
Closed vs. Open Class
• Closed class categories are composed of a small, fixed set of
grammatical function words for a given language.
• Pronouns, Prepositions, Modals, Determiners, Particles, Conjunctions
• Open class categories have a large number of words, and new ones are
easily invented.
• Nouns (Googler, textlish), Verbs (Google), Adjectives (geeky), Adverbs
(automagically)
Ambiguity in POS Tagging
• “Like” can be a verb or a preposition
• I like/VBP candy.
• Time flies like/IN an arrow.
• “Around” can be a preposition, particle, or adverb
• I bought it at the shop around/IN the corner.
• I never got around/RP to getting a car.
• A new Prius costs around/RB $25K.
Other Issues
• New, rare, or misspelled words that are not in the training data can’t be
tagged reliably. Example: Slang, names, or newly coined words like
“google” as a verb.
• A word’s POS tag often depends on surrounding words. Taggers must
analyze syntactic context, not just the word in isolation.
Example:
• “Book a ticket” → book is a verb
• “Read a book” → book is a noun
POS Tagging Process
• Usually assume a separate initial tokenization process that separates
and/or disambiguates punctuation, including detecting sentence
boundaries.
• Degree of ambiguity in English (based on Brown corpus)
• 11.5% of word types are ambiguous (11.5% of distinct words can have more than one
POS)
• 40% of word tokens are ambiguous (for 40% of words in running text, the POS depends
on context)
• Baseline: Picking the most frequent tag for each specific word type
gives about 90% accuracy.
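As a rough sketch of this baseline (the tiny tagged word list and its counts below are invented for illustration, not drawn from the Brown corpus):

```python
from collections import Counter, defaultdict

# Toy tagged corpus: (word, tag) pairs. The data here is illustrative only.
training_data = [
    ("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("good", "JJ"),
    ("book", "VB"), ("the", "DT"), ("flight", "NN"),
    ("I", "PRP"), ("like", "VBP"), ("the", "DT"), ("book", "NN"),
]

# Count how often each word appears with each tag.
tag_counts = defaultdict(Counter)
for word, tag in training_data:
    tag_counts[word.lower()][tag] += 1

def most_frequent_tag(word, default="NN"):
    """Baseline tagger: pick the tag seen most often with this word."""
    counts = tag_counts.get(word.lower())
    return counts.most_common(1)[0][0] if counts else default

print([(w, most_frequent_tag(w)) for w in "I like the book".split()])
# -> [('I', 'PRP'), ('like', 'VBP'), ('the', 'DT'), ('book', 'NN')]
```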
Approaches to Part-of-Speech Tagging
• Rule-based POS tagging: relies on manually crafted linguistic rules and
dictionaries to assign POS tags to words based on their morphological
features and contextual patterns
• Rule-based taggers often employ a combination of lexical rules (based on word forms)
and contextual rules (based on the surrounding words) to disambiguate POS tags.
• Developing comprehensive rule sets for POS tagging can be time-consuming and
requires linguistic expertise (a disadvantage), but rule-based taggers can achieve high
accuracy for languages with well-defined grammatical structures (an advantage).
• Rule-based taggers employ hand-written rules to choose the correct tag when
a word has more than one possible tag.
• The first stage involves assigning each word a list of potential parts of speech using a
lexicon.
• In the second stage, the tagger narrows the list down to a single part of speech for each
word using large lists of hand-written disambiguation rules.
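A minimal sketch of this two-stage process (the small lexicon and the two contextual rules below are invented for illustration; real rule-based taggers use far larger hand-written rule sets):

```python
# Stage 1: a lexicon maps each word to its list of possible tags.
LEXICON = {
    "the": ["DT"],
    "book": ["NN", "VB"],
    "flight": ["NN"],
    "can": ["MD", "NN", "VB"],
}

def rule_based_tag(words):
    # Stage 1: assign every word its candidate tag list from the lexicon.
    candidates = [LEXICON.get(w.lower(), ["NN"]) for w in words]

    # Stage 2: hand-written contextual rules pick one tag per word.
    tags = []
    for i, cands in enumerate(candidates):
        tag = cands[0]                      # default: first lexicon entry
        prev = tags[i - 1] if i > 0 else None
        # Rule: after a determiner, prefer a noun reading if available.
        if prev == "DT" and "NN" in cands:
            tag = "NN"
        # Rule: at the start of an imperative sentence, prefer a verb reading.
        if i == 0 and "VB" in cands:
            tag = "VB"
        tags.append(tag)
    return list(zip(words, tags))

print(rule_based_tag("book the flight".split()))
# -> [('book', 'VB'), ('the', 'DT'), ('flight', 'NN')]
```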
Stochastic Part-of-Speech (POS) tagging
• Any model that includes frequency or probability (statistics) can be
called stochastic. A number of different approaches to part-of-speech
tagging that use such statistics are referred to as stochastic taggers.
Tag Sequence Probabilities
• It is an approach to stochastic tagging in which the tagger calculates the
probability of a given sequence of tags occurring.
• It is also called the n-gram approach, because the best tag for a given
word is determined by the probability with which it occurs with the n
previous tags.
Word Frequency Approach
• In this approach, the stochastic taggers disambiguate the words based
on the probability that a word occurs with a particular tag. We can also
say that the tag encountered most frequently with the word in the
training set is the one assigned to an ambiguous instance of that word.
The main issue with this approach is that it may yield an inadmissible
sequence of tags.
• An inadmissible sequence of tags is a grammatically invalid or
unlikely combination of POS tags that results from tagging without
considering the overall sentence structure.
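A toy sketch of the issue (all words, counts, and probabilities below are made up): the word-frequency tagger picks each word's most common tag independently, while a tag-sequence (bigram) check scores the resulting tag pairs and exposes the inadmissible combination.

```python
from collections import Counter

# Toy counts; in practice these come from a large tagged training corpus.
WORD_TAG_FREQ = {
    "to":   Counter({"TO": 100}),
    "book": Counter({"NN": 40, "VB": 10}),      # "book" is most often a noun
}

# Tag-sequence (bigram) probabilities P(tag_i | tag_{i-1}); toy numbers.
TAG_BIGRAM_PROB = {
    ("TO", "VB"): 0.60,   # "to" followed by a base-form verb: very common
    ("TO", "NN"): 0.00,   # treated as inadmissible in this toy grammar
    ("<s>", "TO"): 0.20,
}

def word_frequency_tags(words):
    """Pick the most frequent tag for each word independently."""
    return [WORD_TAG_FREQ[w].most_common(1)[0][0] for w in words]

def sequence_prob(tags):
    """Probability of a tag sequence under the bigram tag model."""
    prob = 1.0
    for prev, cur in zip(["<s>"] + tags[:-1], tags):
        prob *= TAG_BIGRAM_PROB.get((prev, cur), 0.0)
    return prob

tags = word_frequency_tags(["to", "book"])
print(tags, sequence_prob(tags))     # ['TO', 'NN'] 0.0  <- inadmissible sequence
print(sequence_prob(["TO", "VB"]))   # 0.12             <- preferred sequence
```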
Properties of Stochastic POS Tagging
Stochastic POS taggers possess the following properties −
• Tagging decisions are based on probabilities estimated from data.
• It requires a training corpus.
• Words that do not occur in the training corpus receive no probability
estimate (the unknown-word problem).
• Evaluation uses a testing corpus that is separate from the training corpus.
Transformation-based Tagging
• It is an example of transformation-based learning (TBL), a rule-based
technique for automatically assigning POS tags to the provided text.
• It is inspired by both the rule-based and stochastic taggers that were
previously explained. Rule-based and transformation taggers are similar
in that they are both based on rules that indicate which tags must be
applied to which words.
Working of Transformation Based Learning
(TBL)
• Start with the solution − TBL usually starts with some solution to
the problem and works in cycles.
• Most beneficial transformation chosen − In each cycle, TBL will
choose the most beneficial transformation.
• Apply to the problem − The transformation chosen in the last step
will be applied to the problem.
The algorithm stops when the transformation selected in step 2 no longer adds
value, or when there are no more transformations to select. This kind of
learning is best suited to classification tasks.
Detailed Process
• Initial Tagging
• Each word is given an initial POS tag.
• This is often done using a simple method (like assigning the most frequent tag
for each word based on training data).
• Rule Learning Phase
• The model then learns correction rules by comparing the initial tags to the
correct tags in the training data.
• Rules have the form: “Change tag X to tag Y if condition C is met” (e.g.,
change "NN" to "VB" if the previous word is "to")
Process (TBL)
Applying Transformations
• The rules are applied sequentially to the initial tags to correct
mistakes and improve accuracy.
• These transformations continue until no more improvements can be
made or a stopping condition is met.
Example Rule:
Rule: Change a word's tag from NN (noun) to VB (verb) if the preceding word is "to".
• Before: "to book a ticket" → "to/TO book/NN a/DT ticket/NN"
• After: Apply rule → "to/TO book/VB a/DT ticket/NN"
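A minimal sketch of applying such a transformation rule to an initially tagged sentence (the rule mirrors the example above; the code structure is only illustrative, not the Brill tagger itself):

```python
# Each rule: (from_tag, to_tag, condition on the previous word).
RULES = [
    ("NN", "VB", lambda prev_word: prev_word == "to"),
]

def apply_transformations(tagged):
    """Apply TBL correction rules left to right over an initially tagged sentence."""
    tagged = list(tagged)
    for i, (word, tag) in enumerate(tagged):
        prev_word = tagged[i - 1][0] if i > 0 else None
        for from_tag, to_tag, condition in RULES:
            if tag == from_tag and condition(prev_word):
                tagged[i] = (word, to_tag)
                tag = to_tag
    return tagged

initial = [("to", "TO"), ("book", "NN"), ("a", "DT"), ("ticket", "NN")]
print(apply_transformations(initial))
# -> [('to', 'TO'), ('book', 'VB'), ('a', 'DT'), ('ticket', 'NN')]
```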
Advantages of Transformation-based
Learning (TBL)
• We learn a small set of simple rules and these rules are enough for
tagging.
• Development as well as debugging is very easy in TBL because the
learned rules are easy to understand.
• Tagging complexity is reduced because TBL interleaves machine-learned
and human-generated rules.
Disadvantages
The disadvantages of TBL are as follows −
• Transformation-based learning (TBL) does not provide tag
probabilities.
• In TBL, the training time is very long, especially on large corpora.
Hidden Markov Model (HMM) POS Tagging
Hidden Markov Model
• An HMM may be defined as a doubly-embedded stochastic model, in
which the underlying stochastic process is hidden. This hidden process
can only be observed through another set of stochastic processes that
produces the sequence of observations.
Markov Model
Say that there are only three kinds of weather conditions, namely
• Rainy
• Sunny
• Cloudy
Now, consider a small kid, Peter, who loves to play outside. He loves it
when the weather is sunny, because all his friends come out to play in
sunny conditions. He hates rainy weather, for obvious reasons.
Every day, his mother observes the weather in the morning (that is when he
usually goes out to play) and, like always, Peter comes up to her right after
getting up and asks her what the weather is going to be like.
Since she is a responsible parent, she wants to answer that question as
accurately as possible. But the only thing she has is a set of observations
taken over multiple days of how the weather has been.
How does she make a prediction of the weather for today based on what
the weather has been for the past N days?
Say you have a sequence. Something like this:
Sunny, Rainy, Cloudy, Cloudy, Sunny, Sunny, Sunny, Rainy
So, the weather on any given day can be in any of the three
states.
Let’s say we decide to use a Markov Chain Model to solve this
problem. Now using the data that we have, we can construct
the following state diagram with the labelled probabilities.
• In order to compute the probability of today’s weather given N
previous observations, we will use the Markovian Property.
Example
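As a minimal sketch of such a Markov chain (the transition probabilities below are invented for illustration, since the labelled state diagram itself is not reproduced here):

```python
# Toy transition probabilities P(today's weather | yesterday's weather).
# Each row sums to 1; the numbers are illustrative only.
TRANSITIONS = {
    "Sunny":  {"Sunny": 0.6, "Rainy": 0.2, "Cloudy": 0.2},
    "Rainy":  {"Sunny": 0.3, "Rainy": 0.5, "Cloudy": 0.2},
    "Cloudy": {"Sunny": 0.4, "Rainy": 0.3, "Cloudy": 0.3},
}

def predict_today(yesterday):
    """Markov property: today's weather depends only on yesterday's weather."""
    probs = TRANSITIONS[yesterday]
    return max(probs, key=probs.get), probs

best, probs = predict_today("Rainy")
print(best, probs)   # Rainy {'Sunny': 0.3, 'Rainy': 0.5, 'Cloudy': 0.2}
```

Under the Markovian property, only yesterday's state matters, so the prediction reduces to a single lookup in the transition table.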
Hidden Markov Model
It’s the small kid Peter again, and this time he’s going to pester his new caretaker.
As a caretaker, one of the most important tasks for you is to tuck Peter into bed
and
make sure he is sound asleep.
Once you’ve tucked him in, you want to make sure he’s actually asleep and not up
to some mischief.
You cannot, however, enter the room again, as that would surely wake Peter up.
So, all you have to go on are the noises that might come from the room.
Either the room is quiet or there is noise coming from the room. These are your
observations.
His mother has given you the following state diagram.
The diagram has some states, observations, and probabilities.
Note that there is no direct correlation between sound from the room and
Peter being asleep.
Probabilities
There are two kinds of probabilities that we can see from the state
diagram.
• One is the emission probabilities, which represent the probability of
making a certain observation given a particular state. For example,
P(noise | awake) = 0.5 is an emission probability.
• The other is the transition probabilities, which represent the
probability of transitioning to another state given the current state. For
example, P(asleep | awake) = 0.4 is a transition probability.
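In code form, the two kinds of probabilities from this example might look like the sketch below. P(noise | awake) = 0.5 and P(asleep | awake) = 0.4 come from the text; the remaining values are filled in only so that each distribution sums to 1 and are assumptions.

```python
# Emission probabilities P(observation | state); states are "awake" / "asleep".
EMISSIONS = {
    "awake":  {"noise": 0.5, "quiet": 0.5},   # P(noise | awake) = 0.5 is from the text
    "asleep": {"noise": 0.1, "quiet": 0.9},   # assumed values
}

# Transition probabilities P(next state | current state).
TRANSITIONS = {
    "awake":  {"awake": 0.6, "asleep": 0.4},  # P(asleep | awake) = 0.4 is from the text
    "asleep": {"awake": 0.2, "asleep": 0.8},  # assumed values
}

# Probability that Peter, awake now, falls asleep next and the room is then quiet:
print(TRANSITIONS["awake"]["asleep"] * EMISSIONS["asleep"]["quiet"])   # 0.36
```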
HMMs for Part of Speech Tagging
• We know that to model any problem using a Hidden Markov Model
we need a set of observations and a set of possible states. The states in
an HMM are hidden.
• In the part of speech tagging problem, the observations are the words
themselves in the given sequence.
• As for the states, which are hidden, these would be the POS tags for
the words.
• The transition probabilities would be somewhat like P(VP | NP) that is,
what is the probability of the current word having a tag of Verb Phrase
given that the previous tag was a Noun Phrase.
• Emission probabilities would be P(john | NP) or P(will | VP) that is, what
is the probability that the word is, say, John given that the tag is a Noun
Phrase.
• Our problem here is that we have an initial state: Peter was awake when
you tucked him into bed. After that, you recorded a sequence of
observations, namely noise or quiet, at different time-steps. Using this set
of observations and the initial state, you want to find out whether Peter
would be awake or asleep after, say, N time steps.
• We draw all possible transitions starting from the initial state. There is an
exponential number of branches as we keep moving forward, so the model
grows exponentially after only a few time steps, even without considering
any observations.
(Figure: S0 is Awake and S1 is Asleep; the number of paths through the model
grows exponentially because of the transitions.)
• If we had the sequence of states, we could calculate the probability of the
sequence. But we don’t have the states; all we have is a sequence of
observations. This is why the model is referred to as a Hidden Markov
Model — the actual states over time are hidden.
Viterbi Algorithm
• The Viterbi algorithm is used to find the most likely sequence of
POS tags for a given sequence of words.
It is particularly useful in Hidden Markov Models (HMMs), where:
• Words are observed outputs
• Tags are hidden states
The algorithm uses dynamic programming to:
• Avoid recomputation
• Store best paths
• Track probability of the most likely tag sequence ending in each
possible tag at each word
Key components
• Hidden states: the set of possible POS tags
• Observations: the words of the input sentence
• Transition probabilities P(tag_i | tag_i-1) and emission probabilities P(word | tag)
• A dynamic-programming table of best scores, plus back-pointers to recover the best path
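A compact Viterbi sketch for HMM tagging (the tiny tag set and all probabilities below are invented for illustration; a real tagger would estimate them from a tagged corpus and add smoothing for unknown words):

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under an HMM."""
    # V[t][tag] = best probability of any tag sequence ending in `tag` at position t.
    V = [{}]
    backptr = [{}]
    for tag in tags:
        V[0][tag] = start_p.get(tag, 0.0) * emit_p[tag].get(words[0], 0.0)
        backptr[0][tag] = None

    for t in range(1, len(words)):
        V.append({})
        backptr.append({})
        for tag in tags:
            best_prev, best_prob = None, 0.0
            for prev in tags:
                prob = V[t - 1][prev] * trans_p[prev].get(tag, 0.0)
                if prob > best_prob:
                    best_prev, best_prob = prev, prob
            V[t][tag] = best_prob * emit_p[tag].get(words[t], 0.0)
            backptr[t][tag] = best_prev

    # Follow back-pointers from the best final tag.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path))

TAGS = ["NOUN", "VERB", "DET"]
START = {"NOUN": 0.3, "VERB": 0.2, "DET": 0.5}
TRANS = {
    "DET":  {"NOUN": 0.9, "VERB": 0.05, "DET": 0.05},
    "NOUN": {"NOUN": 0.2, "VERB": 0.7,  "DET": 0.1},
    "VERB": {"NOUN": 0.2, "VERB": 0.1,  "DET": 0.7},
}
EMIT = {
    "DET":  {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "barks": 0.1, "cat": 0.4},
    "VERB": {"barks": 0.8, "dog": 0.1, "cat": 0.1},
}

print(viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT))
# -> ['DET', 'NOUN', 'VERB']
```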
Viterbi Algorithm
• Advantages
• Efficient: O(N×T²) time (N = words, T = tags)
• Guarantees globally optimal tag sequence
• Widely used in practical POS taggers
• Limitations
• Assumes Markov property: only previous tag matters
• Relies on good training data for accurate probabilities
• Struggles with unknown words (needs smoothing)
Generative vs. Discriminative models for POS tagging
• Generative
• Generative models, such as HMMs, aim to model the joint probability distribution
of words and their corresponding POS tags, essentially learning how these pairs are
generated.
• An HMM for POS tagging would learn probabilities for transitions between tags
(e.g., noun following an adjective) and for the emission of words given a tag (e.g.,
the probability of the word "cat" being a noun).
• Discriminative
• Discriminative models, like CRFs, aim to directly learn the conditional probability
of a tag given a word (or a sequence of words).
• They focus on learning the decision boundaries between different tag categories,
essentially learning which features are most indicative of a particular tag.
• A CRF for POS tagging would learn weights for various features (e.g., word itself,
surrounding words, prefixes, suffixes) to predict the most likely tag for a given
word.
Maximum Entropy Model
Why Maximum Entropy?
• Traditional models like Hidden Markov Models (HMMs) make
strong independence assumptions (e.g., only the previous tag
matters).
• MaxEnt allows us to incorporate diverse contextual features
without assuming independence.
• It’s discriminative, modeling P(tag | context) directly, unlike HMMs,
which are generative.
• It offers a flexible approach to handling contextual information and
resolving ambiguities.
• It belongs to the family of classifiers known as exponential or log-linear
classifiers. It works by extracting some set of features from the input and
combining them linearly (with their weights), using the obtained sum as
an exponent.
• Given the features and weights, our goal is to choose a class (for example,
a part-of-speech tag) for the word. MaxEnt does this by choosing the
most probable tag; the probability of a particular class c given the
observation x is:
P(c | x) = exp( Σᵢ wᵢ fᵢ(c, x) ) / Σ_{c′ ∈ C} exp( Σᵢ wᵢ fᵢ(c′, x) )
where the fᵢ are the feature functions, the wᵢ their weights, and C the set of candidate classes.
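A minimal sketch of this computation (the candidate tags, feature names, and weights below are invented for illustration):

```python
import math

def maxent_prob(features, weights, classes):
    """P(c | x) = exp(sum_i w_i * f_i(c, x)) / sum_c' exp(sum_i w_i * f_i(c', x))."""
    scores = {c: math.exp(sum(weights.get((feat, c), 0.0) for feat in features))
              for c in classes}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Features extracted for the word "book" in "to book a ticket" (illustrative).
features = ["word=book", "prev_word=to", "suffix=ook"]

# Learned weights for (feature, class) pairs; invented numbers.
weights = {
    ("word=book", "NN"): 1.0, ("word=book", "VB"): 0.8,
    ("prev_word=to", "VB"): 2.0, ("prev_word=to", "NN"): -0.5,
}

probs = maxent_prob(features, weights, classes=["NN", "VB"])
print(probs)                      # VB gets the higher probability here
print(max(probs, key=probs.get))  # -> 'VB'
```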
How Do Maximum Entropy Models Work for POS Tagging?
1. Feature Extraction:
• The model analyzes the context of a word, extracting
relevant features. These features can include the word
itself, its capitalization, surrounding words, previous and
next POS tags, and other contextual clues.
2. Feature Weighting:
• Each feature is assigned a weight based on its importance
in predicting the correct POS tag.
3. Probability Calculation:
• The model calculates the probability of each possible POS
tag for a given word by combining the weighted features and
applying a normalization factor.
4. Tag Assignment:
• The tag with the highest probability is then assigned to the
word.
Example of feature functions for POS tagging
Consider the word "running." Potential features could include:
• Current word: "running"
• Suffix: "-ing"
• Previous word: (e.g., "is")
• Previous tag: (e.g., verb)
These features, when combined with their corresponding weights, help
the model determine whether "running" is more likely to be a verb or a
noun in a particular context.
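As a sketch, a feature extractor along these lines might look as follows (the feature templates and names are arbitrary choices, not a fixed standard):

```python
def extract_features(words, tags, i):
    """Build a feature list for the word at position i (illustrative templates)."""
    word = words[i]
    return [
        f"word={word.lower()}",
        f"suffix3={word[-3:].lower()}",
        f"is_capitalized={word[0].isupper()}",
        f"prev_word={words[i - 1].lower() if i > 0 else '<s>'}",
        f"prev_tag={tags[i - 1] if i > 0 else '<s>'}",
    ]

words = ["He", "is", "running"]
tags = ["PRP", "VBZ"]          # tags already predicted for the preceding words
print(extract_features(words, tags, 2))
# -> ['word=running', 'suffix3=ing', 'is_capitalized=False',
#     'prev_word=is', 'prev_tag=VBZ']
```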
HMM vs. CRF
• In an HMM, the probability of a verb following a noun is static: wherever
that noun-to-verb transition occurs in the sentence, its probability
remains constant regardless of position, which is not what is observed in
natural language.
• Similarly, if a hidden state emits a word such as “eat”, that emission
probability also remains static.
• An HMM also has limited dependencies; for instance, the third hidden
state does not depend on the first hidden state, which is often not the
case in natural language.
• As for a CRF, one can draw as many connections (dependencies) as
needed.
Introduction
• A parser in NLP is a critical component that analyses the
grammatical structure of a sentence.
• It arranges words into specific groups such as nouns, verbs
and phrases.
• It breaks down a sentence into its constituent parts to
understand the syntactic relationships between them.
• POS tagging resolves lexical ambiguity, while parsing resolves
syntactic ambiguity.
Syntax
• It refers to the way in which words are arranged together to
form sentences.
• It involves rules to govern the structure of sentences and
how words are combined to convey meaning.
Parsing Ambiguity
• Parsing ambiguity refers to the phenomenon where a single
sentence can have multiple valid parse trees, each
representing a different syntactic interpretation.
• E.g.: “I saw the man with the telescope.” (two possible parses: “with the
telescope” can attach to the verb “saw” or to the noun phrase “the man”)
• The exponential increase in the number of possible parses as
sentences become more complex highlights the challenge of
syntactic ambiguity in NLP.
• Parsers must efficiently manage this ambiguity to generate the
correct parse tree for accurate language understanding.
• Advanced parsing techniques and algorithms, such as Probabilistic
context-free grammars and machine learning based parsers, are
often employed to handle this by assigning probabilities to
different parse trees and selecting the most likely one.
Modelling Constituency
• Involves representing the hierarchical structure of sentences
by identifying and labeling their constituent parts, such as
noun phrases (NP) and verb phrases (VP).
• This process aims to break down sentences into nested
phrases and construct parse trees that depict their syntactic
relationships.
• It enables machines to understand the grammatical structure
of sentences, facilitating accurate interpretation and
generation of natural language.
• Constituency parsers use parsing algorithms like top-down,
bottom-up, or statistical methods to generate parse trees.
• It also helps in semantic analysis and downstream
applications like named entity recognition and syntactic
ambiguity resolution.
• How do we arrange words into groups?
• How do we automatically find valid word arrangements when new
sentences are given?
• e.g., in English: “I play cricket” (correct) vs. “I cricket play” (incorrect)
• What is the formal tool to model this constituency, that is, how words
are arranged together?
• Which words come together, and which words do not?
• What groups make a sentence, what groups make a verb phrase, and
what group makes a noun phrase?
• The solution is Context-Free Grammar.
Context-Free Grammar (CFG)
• The most common approach to modelling
constituency involves using production
rules.
• These rules specify how the symbols of a
language can be grouped and arranged.
• For example, a noun phrase can consist of
either a proper noun or a determiner
followed by a nominal, where nominal can
include multiple nouns.
CFGs
• Provides rules that explain how to create grammatically
correct sentences in that language.
• A Context-Free Grammar (CFG) is defined by the tuple (T,
N, S, R) where:
• Terminal Symbols (T): These are the basic building blocks
of the language, such as words or punctuation marks.
• Non-terminal Symbols (N): These are categories of words
or phrases, like noun phrases or verb phrases. These don’t
represent actual words themselves but categories of words.
• Start Symbol (S): This is the starting point for any sentence.
It triggers the creation of a grammatically correct sentence.
• Production rules (R): These rules rewrite non-terminal
symbols into other symbols (terminal or non-terminal) to
ultimately generate a sentence.
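A small sketch of such a grammar, written here with NLTK's CFG helper (assuming NLTK is installed; the toy rules and lexicon cover only a couple of example sentences):

```python
import nltk

# A toy CFG: S is the start symbol, NP/VP/Det/... are non-terminals,
# and the quoted strings are terminal symbols (words).
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det Nominal | PropN
Nominal -> Noun | Nominal Noun
VP -> Verb NP
Det -> 'the' | 'a'
Noun -> 'flight' | 'ticket'
Verb -> 'book'
PropN -> 'I'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I book a flight".split()):
    print(tree)
# (S (NP (PropN I)) (VP (Verb book) (NP (Det a) (Nominal (Noun flight)))))
```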
What is Parsing?
• The process of taking a string and a grammar and returning
all possible parse trees for that string.
• Top-Down (Goal-Oriented)
• Bottom-up (Data Directed)
Top-Down Parsing
• Top-down parsing starts with the root node S and builds down to the
leaves, assuming the input can be derived from the start symbol S.
• Initial Trees: Identify all trees that can start with S by
checking grammar rules with S on the left-hand side.
• Grow Trees: Expand trees downward using these rules until
they reach the POS categories at the bottom.
• Match Input: Reject trees whose leaves do not match the
input words.
Top-Down Parsing
• This method searches for a parse tree by starting at the top
and expanding downward, verifying against the input at
each step.
• E.g., the Earley parser, predictive parsing.
• For top-down parsing we can use depth-first or breadth-first
search and goal ordering.
• Issues: left-recursive rules, e.g. NP -> NP PP, can lead to infinite
recursion.
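A minimal top-down (recursive-descent) sketch over a toy grammar with no left-recursive rules, for exactly the reason noted above (the grammar and sentence are illustrative):

```python
# Toy grammar: each non-terminal maps to a list of alternative right-hand sides.
GRAMMAR = {
    "S":       [["NP", "VP"], ["VP"]],
    "NP":      [["Det", "Nominal"]],
    "Nominal": [["Noun"]],
    "VP":      [["Verb", "NP"], ["Verb"]],
    "Det":     [["that"], ["the"]],
    "Noun":    [["flight"], ["book"]],
    "Verb":    [["book"]],
}

def parse(symbol, words, pos):
    """Try to derive `symbol` from words[pos:]; yield (tree, next_pos) for each way."""
    if symbol not in GRAMMAR:                      # terminal symbol
        if pos < len(words) and words[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:                    # try each production in turn
        yield from expand(rhs, words, pos, [symbol])

def expand(rhs, words, pos, tree):
    if not rhs:
        yield tuple(tree), pos
        return
    for subtree, next_pos in parse(rhs[0], words, pos):
        yield from expand(rhs[1:], words, next_pos, tree + [subtree])

sentence = "book that flight".split()
for tree, end in parse("S", sentence, 0):
    if end == len(sentence):                       # keep full-coverage parses only
        print(tree)
# ('S', ('VP', ('Verb', 'book'), ('NP', ('Det', 'that'), ('Nominal', ('Noun', 'flight')))))
```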
Parsing Example
Parse tree for “book that flight”:
(VP (Verb book)
    (NP (Det that)
        (Nominal (Noun flight))))
Bottom-Up Parsing
• The parser starts with the input words and builds trees
upward.
• Start with words: Begin with the words of the input.
• Apply Rules: Build trees by applying grammar rules one at
a time.
• Fit Rules: Look for places in the current parse where the
right-hand side of a rule can fit.
Bottom-up Parsers
• CYK (or CKY) parser: this algorithm works only on grammars in CNF
(Chomsky Normal Form). It also generates multiple trees when the
sentence is ambiguous.
• Shift-reduce parser: it does not require the grammar to be in CNF,
but it needs to handle backtracking.
• PCFG (Probabilistic Context-Free Grammar)
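A compact CYK sketch over a toy CNF grammar (the grammar below is an illustrative CNF fragment for the flight example; it only recognizes whether S can span the input rather than building the trees):

```python
from itertools import product

# Toy grammar in Chomsky Normal Form: A -> B C or A -> 'word'.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Verb", "NP"): {"VP", "S"},   # S -> Verb NP covers imperatives in this toy grammar
    ("Det", "Noun"): {"NP"},
}
LEXICAL_RULES = {
    "book":   {"Verb", "Noun"},
    "that":   {"Det"},
    "flight": {"Noun"},
}

def cyk_recognize(words):
    """Return True iff the CNF grammar above can derive `words` from S."""
    n = len(words)
    # table[i][j] = set of non-terminals spanning words[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i] = set(LEXICAL_RULES.get(w, set()))
    for length in range(2, n + 1):              # span length
        for i in range(n - length + 1):         # span start
            j = i + length - 1                  # span end
            for k in range(i, j):               # split point
                for b, c in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY_RULES.get((b, c), set())
    return "S" in table[0][n - 1]

print(cyk_recognize("book that flight".split()))   # True
print(cyk_recognize("that book flight".split()))   # False
```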