UNIT-III
Syntax analysis; Part-of-Speech (POS) tagging - Tag set for English (Penn Treebank), Rule-based POS tagging, Stochastic POS tagging, Issues - Multiple tags & words, Unknown words.
Introduction to CFG, Sequence labeling: Hidden Markov Model (HMM), Maximum Entropy,
and Conditional Random Field (CRF).
Syntactic analysis, also called parsing or syntax analysis, is the third phase of NLP. The purpose of this phase is to uncover the grammatical structure of the text rather than its exact or dictionary meaning. Syntax analysis checks the text for well-formedness against the rules of a formal grammar; checking for meaningfulness is left to the next phase. For example, a phrase like “hot ice-cream” is syntactically well formed but would be rejected by a semantic analyzer.
In this sense, syntactic analysis or parsing may be defined as the process of analyzing strings of symbols in natural language for conformance to the rules of a formal grammar. The word ‘parsing’ comes from the Latin word ‘pars’, which means ‘part’.
Concept of Parser
A parser is the software component used to implement the task of parsing: it takes input data (text) and produces a structural representation of the input after checking it for correct syntax as per a formal grammar. It usually builds a data structure, generally in the form of a parse tree, an abstract syntax tree or some other hierarchical structure.
The main roles of the parser include −
To report any syntax error.
To recover from commonly occurring errors so that processing of the remainder of the input can continue.
To create the parse tree.
To create the symbol table.
To produce intermediate representations (IR).
Types of Parsing
Derivation divides parsing into the following two types −
Top-down Parsing
Bottom-up Parsing
Top-down Parsing
In this kind of parsing, the parser starts constructing the parse tree from the start symbol and then tries to transform the start symbol into the input. The most common form of top-down parsing uses recursive procedures to process the input. The main disadvantage of recursive-descent parsing is backtracking.
Bottom-up Parsing
In this kind of parsing, the parser starts with the input symbols and tries to construct the parse tree up to the start symbol.
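As a concrete illustration (a minimal sketch assuming the NLTK library and a toy grammar, neither of which comes from the original text), NLTK provides both a top-down recursive-descent parser and a bottom-up shift-reduce parser:

```python
# Top-down vs. bottom-up parsing of the same sentence with a toy grammar.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

sentence = "the dog chased the cat".split()

# Top-down: starts from S and expands towards the input (may backtrack).
for tree in nltk.RecursiveDescentParser(grammar).parse(sentence):
    print(tree)

# Bottom-up: starts from the input words and reduces towards S.
for tree in nltk.ShiftReduceParser(grammar).parse(sentence):
    print(tree)
```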
Concept of Derivation
In order to derive the input string, we need a sequence of production rules; such a sequence of rule applications is called a derivation. During parsing, we need to decide which non-terminal is to be replaced, as well as the production rule by which that non-terminal will be replaced.
Types of Derivation
In this section, we will learn about the two types of derivation, which differ in the choice of the non-terminal to be replaced by a production rule −
Left-most Derivation
In the left-most derivation, the sentential form of an input is scanned and replaced from left to right. The sentential form in this case is called the left-sentential form.
Right-most Derivation
In the right-most derivation, the sentential form of an input is scanned and replaced from right to left. The sentential form in this case is called the right-sentential form.
Concept of Parse Tree
A parse tree may be defined as the graphical depiction of a derivation. The start symbol of the derivation serves as the root of the parse tree. In every parse tree, the leaf nodes are terminals and the interior nodes are non-terminals. A property of the parse tree is that reading its leaves from left to right (an in-order traversal) reproduces the original input string.
Concept of Grammar
Grammar is essential for describing the syntactic structure of well-formed programs. In the literary sense, grammars denote the syntactical rules for conversation in natural languages. Linguists have attempted to define grammars since the inception of natural languages like English and Hindi.
The theory of formal languages is also applicable in Computer Science, mainly in programming languages and data structures. For example, in the ‘C’ language, precise grammar rules state how functions are made out of lists and statements.
A mathematical model of grammar was given by Noam Chomsky in 1956, which is effective for describing computer languages.
Mathematically, a grammar G can be formally written as a 4-tuple (N, T, S, P), where −
N or VN = the set of non-terminal symbols, i.e., variables.
T or ∑ = the set of terminal symbols.
S = the start symbol, where S ∈ N.
P = the production rules for terminals as well as non-terminals. Each rule has the form α → β, where α and β are strings over VN ∪ ∑ and at least one symbol of α belongs to VN.
Phrase Structure or Constituency Grammar
Phrase structure grammar, introduced by Noam Chomsky, is based on the constituency relation.
That is why it is also called constituency grammar. It is the opposite of dependency grammar.
Example
Before giving an example of constituency grammar, we need to know the fundamental points
about constituency grammar and constituency relation.
All the related frameworks view the sentence structure in terms of the constituency relation.
The constituency relation is derived from the subject-predicate division of Latin as well as Greek grammar.
The basic clause structure is understood in terms of the noun phrase (NP) and the verb phrase (VP).
We can write the sentence “This tree is illustrating the constituency relation” as follows −
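Since the original tree diagram is not reproduced here, the following is a hedged sketch of one plausible constituency analysis (the bracketing and tags are our illustration, not taken from the source figure), drawn with NLTK:

```python
# One plausible phrase-structure analysis of the example sentence,
# written in bracketed notation and printed as a tree (assumes NLTK).
import nltk

tree = nltk.Tree.fromstring(
    "(S (NP (DT This) (NN tree))"
    "   (VP (VBZ is)"
    "       (VP (VBG illustrating)"
    "           (NP (DT the) (NN constituency) (NN relation)))))"
)
tree.pretty_print()
```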
Dependency Grammar
Dependency grammar is the opposite of constituency grammar and is based on the dependency relation. It was introduced by Lucien Tesniere. Dependency grammar (DG) differs from constituency grammar in that it lacks phrasal nodes.
Example
Before giving an example of Dependency grammar, we need to know the fundamental points
about Dependency grammar and Dependency relation.
In DG, the linguistic units, i.e., words are connected to each other by directed links.
The verb becomes the center of the clause structure.
Every other syntactic unit is connected to the verb by a directed link. These syntactic units are called dependencies.
We can write the sentence “This tree is illustrating the dependency relation” as follows;
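Since the original dependency diagram is not reproduced here, the following is a hedged sketch of one plausible dependency analysis (the relation labels follow common Universal Dependencies conventions and are our own illustration, not taken from the source figure):

```python
# One plausible dependency analysis of the example sentence, listed as
# (head, dependent, relation) triples with the verb "illustrating" as the root.
dependencies = [
    ("ROOT", "illustrating", "root"),
    ("illustrating", "tree", "nsubj"),       # subject of the verb
    ("illustrating", "is", "aux"),           # auxiliary verb
    ("illustrating", "relation", "obj"),     # direct object
    ("tree", "This", "det"),
    ("relation", "the", "det"),
    ("relation", "dependency", "compound"),  # noun modifier of "relation"
]

for head, dependent, relation in dependencies:
    print(f"{head:12s} -> {dependent:12s} ({relation})")
```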
A parse tree that uses constituency grammar is called a constituency-based parse tree, and a parse tree that uses dependency grammar is called a dependency-based parse tree.
Context Free Grammar
Context-free grammar, also called CFG, is a notation for describing languages and is a superset of regular grammar: every language that can be described by a regular grammar can also be described by a CFG.
Definition of CFG
CFG consists of finite set of grammar rules with the following four components −
Set of Non-terminals
It is denoted by V. The non-terminals are syntactic variables that denote sets of strings, which in turn help to define the language generated by the grammar.
Set of Terminals
Terminals, also called tokens, are denoted by Σ. Strings are formed from these basic terminal symbols.
Set of Productions
It is denoted by P. This set defines how the terminals and non-terminals can be combined. Every production consists of a non-terminal (the left-hand side of the production), an arrow, and a string of terminals and/or non-terminals (the right-hand side of the production).
Start Symbol
Derivations begin from the start symbol, denoted by S. A non-terminal symbol is always designated as the start symbol.
Putting the components together, G = (V, T, P, S), where P is the set of rules used to generate the sentences of the language.
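As a concrete illustration (a minimal sketch assuming NLTK; the toy grammar is our own and not taken from the original text), the four components of a small CFG can be written down and used to generate strings of its language:

```python
# A tiny CFG: V = {S, NP, VP, Det, N, V}, T = {'the', 'dog', 'cat', 'saw'},
# start symbol S, and the production rules P written below.
import nltk
from nltk.parse.generate import generate

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N  -> 'dog' | 'cat'
V  -> 'saw'
""")

print("Start symbol:", grammar.start())
for sentence in generate(grammar, n=4):   # a few strings derivable from S
    print(" ".join(sentence))
```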
Part of Speech (PoS) Tagging
Tagging is a kind of classification that may be defined as the automatic assignment of a descriptor to each token. Here the descriptor is called a tag, which may represent a part of speech, semantic information, and so on.
Now, if we talk about Part-of-Speech (PoS) tagging, it may be defined as the process of assigning one of the parts of speech to a given word; it is generally called POS tagging. In simple words, POS tagging is the task of labelling each word in a sentence with its appropriate part of speech. We already know that parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories.
Tag set for English (Penn Treebank):
Penn Treebank P.O.S. Tags (upenn.edu)
www.nltk.org/book_1ed/ch05.html
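As a quick illustration (a minimal sketch; it assumes NLTK together with its 'punkt' and 'averaged_perceptron_tagger' data packages, none of which is mentioned in the original text), NLTK's default tagger outputs Penn Treebank tags:

```python
# Tagging a sentence with NLTK's default (Penn Treebank) tagger.
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

tokens = nltk.word_tokenize("The quick brown fox jumped over the lazy dog")
print(nltk.pos_tag(tokens))
# Typically produces something like:
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#  ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
```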
Most POS tagging approaches fall under rule-based POS tagging, stochastic POS tagging or transformation-based tagging.
Rule-based POS Tagging
One of the oldest techniques of tagging is rule-based POS tagging. Rule-based taggers use a dictionary or lexicon to obtain the possible tags for each word. If a word has more than one possible tag, rule-based taggers use hand-written rules to identify the correct tag. Disambiguation can also be performed in rule-based tagging by analyzing the linguistic features of a word along with its preceding and following words. For example, if the preceding word is an article, then the word in question is likely to be a noun.
As the name suggests, all such information in rule-based POS tagging is coded in the form of rules. These rules may be either −
Context-pattern rules, or
Regular expressions compiled into finite-state automata, intersected with a lexically ambiguous sentence representation.
We can also understand rule-based POS tagging through its two-stage architecture (a code sketch of this two-stage idea follows below) −
First stage − In the first stage, it uses a dictionary to assign each word a list of potential parts of speech.
Second stage − In the second stage, it uses large lists of hand-written disambiguation rules to narrow the list down to a single part of speech for each word.
Example of a rule: if an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective.
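The following is a minimal sketch of the two-stage idea; the tiny lexicon and the single disambiguation rule are illustrative assumptions, not a real tagger:

```python
# Stage 1: dictionary lookup assigns each word its list of potential tags.
# Stage 2: hand-written rules narrow ambiguous/unknown words to a single tag.
LEXICON = {
    "the": ["DT"],
    "a": ["DT"],
    "light": ["JJ", "NN", "VB"],   # ambiguous word
    "book": ["NN", "VB"],
}

def stage1(words):
    return [LEXICON.get(w.lower(), ["UNK"]) for w in words]   # unknown words get "UNK"

def stage2(words, candidates):
    tags = []
    for i, cands in enumerate(candidates):
        tag = cands[0]                      # default: first candidate
        if len(cands) > 1 or cands == ["UNK"]:
            prev = tags[i - 1] if i > 0 else None
            nxt = candidates[i + 1] if i + 1 < len(candidates) else []
            # Rule from the text: determiner + X + noun  =>  tag X as adjective.
            if prev == "DT" and "NN" in nxt:
                tag = "JJ"
        tags.append(tag)
    return list(zip(words, tags))

words = "the light book".split()
print(stage2(words, stage1(words)))
# [('the', 'DT'), ('light', 'JJ'), ('book', 'NN')]
```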
Properties of Rule-Based POS Tagging
Rule-based POS taggers possess the following properties −
These taggers are knowledge-driven taggers.
The rules in rule-based POS tagging are built manually.
The information is coded in the form of rules.
The number of rules is limited, approximately around 1000.
Smoothing and language modeling are defined explicitly in rule-based taggers.
Stochastic POS Tagging
Another technique of tagging is stochastic POS tagging. The question that arises here is which kind of model can be called stochastic: a model that includes frequency or probability (statistics) can be called stochastic. Any of a number of different approaches to the problem of part-of-speech tagging can be referred to as stochastic tagging.
The simplest stochastic tagger applies the following approaches for POS tagging −
Word Frequency Approach
In this approach, the stochastic tagger disambiguates words based on the probability that a word occurs with a particular tag. In other words, the tag encountered most frequently with the word in the training set is the one assigned to an ambiguous instance of that word. The main issue with this approach is that it may yield inadmissible sequences of tags.
Tag Sequence Probabilities
It is another approach to stochastic tagging, where the tagger calculates the probability of a given sequence of tags occurring. It is also called the n-gram approach, because the best tag for a given word is determined by the probability with which it occurs with the n previous tags. A sketch combining both approaches is given below.
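A minimal sketch of both approaches using NLTK (it assumes NLTK with its 'treebank' corpus and an arbitrary 90/10 train/test split, none of which comes from the original text): a UnigramTagger realizes the word-frequency approach, and a BigramTagger with unigram backoff approximates the tag-sequence (n-gram) approach.

```python
# Word-frequency approach: UnigramTagger picks the most frequent tag per word.
# Tag-sequence approach: BigramTagger conditions on the previous tag, backing
# off to the unigram tagger when a context was never seen in training.
# Requires: nltk.download('treebank')
import nltk
from nltk.corpus import treebank

tagged_sents = treebank.tagged_sents()
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

unigram = nltk.UnigramTagger(train_sents)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

print("Unigram accuracy:", unigram.accuracy(test_sents))   # .evaluate() on older NLTK
print("Bigram accuracy:", bigram.accuracy(test_sents))
print(bigram.tag("the light book".split()))
```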
Properties of Stochastic POS Tagging
Stochastic POS taggers possess the following properties −
This POS tagging is based on the probability of a tag occurring.
It requires a training corpus.
There is no probability for words that do not exist in the training corpus.
It uses a testing corpus different from the training corpus.
In its simplest form, it chooses the most frequent tag associated with a word in the training corpus.
Transformation-based Tagging
Transformation-based tagging is also called Brill tagging. It is an instance of transformation-based learning (TBL), a rule-based algorithm for the automatic tagging of POS in the given text. TBL allows us to have linguistic knowledge in a readable form, and it transforms one state into another by applying transformation rules.
It draws inspiration from both of the previously explained taggers − rule-based and stochastic. Like the rule-based tagger, it is based on rules that specify which tags need to be assigned to which words. Like the stochastic tagger, it is a machine learning technique in which the rules are automatically induced from the data.
Working of Transformation Based Learning(TBL)
In order to understand the working and concept of transformation-based taggers, we need to understand the working of transformation-based learning. Consider the following steps to understand the working of TBL −
Start with a solution − TBL usually starts with some initial solution to the problem and works in cycles.
Most beneficial transformation chosen − In each cycle, TBL chooses the most beneficial transformation.
Apply to the problem − The transformation chosen in the previous step is applied to the problem.
The algorithm stops when the transformation selected in step 2 no longer adds value, or when there are no more transformations to select. Such learning is best suited to classification tasks.
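A minimal sketch of Brill-style TBL tagging with NLTK (it assumes NLTK's bundled fntbl37 rule templates and reuses the unigram baseline tagger and the train/test split from the earlier stochastic-tagging sketch; none of these details come from the original text):

```python
# Brill/TBL tagging: start from a baseline tagger, then learn a small set of
# readable transformation rules that correct the baseline's mistakes.
from nltk.tag import brill, brill_trainer

templates = brill.fntbl37()                        # standard rule templates
trainer = brill_trainer.BrillTaggerTrainer(unigram, templates, trace=0)
brill_tagger = trainer.train(train_sents, max_rules=20)

print("Brill accuracy:", brill_tagger.accuracy(test_sents))
for rule in brill_tagger.rules()[:5]:              # the learned rules are human-readable
    print(rule)
```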
Advantages of Transformation-based Learning (TBL)
The advantages of TBL are as follows −
We learn a small set of simple rules, and these rules are enough for tagging.
Development as well as debugging is very easy in TBL because the learned rules are easy to understand.
Complexity in tagging is reduced because TBL interlaces machine-learned and human-generated rules.
The transformation-based tagger is much faster than a Markov-model tagger.
Disadvantages of Transformation-based Learning (TBL)
The disadvantages of TBL are as follows −
Transformation-based learning (TBL) does not provide tag probabilities.
In TBL, the training time is very long, especially on large corpora.
Hidden Markov Model in POS tagging
HMM is a probabilistic sequence model. POS tagging is one of the sequence labeling problems.
A sequence model assigns a label to each component in a sequence. It computes a probability
distribution over possible sequences of labels and chooses the best label sequence. POS tagging
is a sequence labeling problem because we need to identify and assign each word the correct
POS tag.
A hidden Markov model (HMM) allows us to talk about both observed events (the words in the input sentence) and hidden events (the POS tags), unlike a Markov chain, which only models the probabilities of a state sequence that is directly observed rather than hidden.
Two important assumptions used by HMM
HMM uses two assumptions to simplify the calculations. They are:
Markov assumption: the probability of a state q_n (a POS tag, which is hidden in the tagging problem) depends only on the previous state q_n-1 (the previous POS tag):
P(q_n | q_1, q_2, …, q_n-1) = P(q_n | q_n-1)
Output independence: the probability of an observation o_n (a word in the tagging problem) depends only on the state q_n (the hidden state) that produced the observation, and not on any other states or observations:
P(o_n | q_1, …, q_n, …, q_T, o_1, …, o_n, …, o_T) = P(o_n | q_n)
Likelihood of the observation sequence (Forward Algorithm)
Let W be the word sequence (observations/emissions) and T the tag sequence (hidden states). W consists of a sequence of observations (words) w1, w2, …, wn, and T consists of a sequence of hidden states (POS tags) t1, t2, …, tn. Then the joint probability P(W, T) (also called the likelihood) can be calculated using the two assumptions discussed above as follows:
P(W, T) = P(W | T) * P(T)    … (Eq. 1)
Observation (word-sequence) probabilities, by the output independence assumption:
P(W | T) = P(w1 | t1) * P(w2 | t2) * … * P(wn | tn)
State-transition (POS-tag) probabilities, by the bigram assumption on tags (the probability of a tag depends only on its previous tag):
P(T) = P(t1 | t0) * P(t2 | t1) * … * P(tn | tn-1), where t0 is the start state.
So Eq. 1 can be expanded as follows, with observation probabilities followed by transition probabilities:
P(W | T) * P(T) = P(w1|t1) * P(w2|t2) * … * P(wn|tn) * P(t1|t0) * P(t2|t1) * … * P(tn|tn-1)
Likelihood estimation - Example:
Question:
Given the HMM λ = (A, B, π) and the word sequence W = “the light book”, find P(the light book | DT JJ NN), the probability of the word sequence (observation) given the tag sequence.
Solution:
Given are the initial state probabilities (π), the state transition probabilities (A), and the observation probabilities (B). Using the values supplied with the example:
P(the light book | DT JJ NN)
= P(the|DT) * P(light|JJ) * P(book|NN) * P(DT|start) * P(JJ|DT) * P(NN|JJ)
= 0.3 * 0.002 * 0.003 * 0.45 * 0.3 * 0.2
= 0.0000000486
= 4.86 × 10^-8
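The same computation in Python (a minimal sketch; the probability values are exactly the ones given in the example above, while the dictionary layout is our own):

```python
# Likelihood of observing "the light book" given the tag sequence DT JJ NN,
# using the emission (B) and transition/initial (A, pi) probabilities from the example.
emission = {("the", "DT"): 0.3, ("light", "JJ"): 0.002, ("book", "NN"): 0.003}
transition = {("start", "DT"): 0.45, ("DT", "JJ"): 0.3, ("JJ", "NN"): 0.2}

words, tags = ["the", "light", "book"], ["DT", "JJ", "NN"]

likelihood, prev = 1.0, "start"
for w, t in zip(words, tags):
    likelihood *= emission[(w, t)] * transition[(prev, t)]
    prev = t

print(likelihood)   # ~4.86e-08
```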
HMM (definition + Viterbi algorithm)
Refer to the slides uploaded.
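Since the slides are not reproduced here, the following is a minimal, self-contained sketch of Viterbi decoding for an HMM tagger (the toy probability tables are illustrative assumptions, not values taken from the slides):

```python
# Viterbi decoding: find the most probable tag sequence for a word sequence,
# given transition probabilities P(tag | previous tag) and emission
# probabilities P(word | tag).
def viterbi(words, tags, trans, emit, start):
    # best[i][t] = probability of the best tag path ending in tag t at position i
    best = [{t: start.get(t, 0.0) * emit.get((words[0], t), 0.0) for t in tags}]
    back = [{t: None for t in tags}]

    for i in range(1, len(words)):
        col, bp = {}, {}
        for t in tags:
            # pick the best previous tag for reaching tag t at position i
            prev = max(tags, key=lambda p: best[i - 1][p] * trans.get((p, t), 0.0))
            col[t] = (best[i - 1][prev] * trans.get((prev, t), 0.0)
                      * emit.get((words[i], t), 0.0))
            bp[t] = prev
        best.append(col)
        back.append(bp)

    # backtrace from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path)), best[-1][last]

# Toy example (illustrative numbers only):
tags = ["DT", "JJ", "NN"]
start = {"DT": 0.45, "JJ": 0.30, "NN": 0.25}
trans = {("DT", "JJ"): 0.3, ("DT", "NN"): 0.6, ("JJ", "NN"): 0.2, ("JJ", "JJ"): 0.1}
emit = {("the", "DT"): 0.3, ("light", "JJ"): 0.002, ("light", "NN"): 0.001,
        ("book", "NN"): 0.003, ("book", "JJ"): 0.0001}

print(viterbi("the light book".split(), tags, trans, emit, start))
# (['DT', 'JJ', 'NN'], ~4.86e-08)
```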
Issues with Markovian model Tagging:
1. Unknown Words:
At runtime, if we encounter unknown words, i.e., words not seen in the training corpus, their emission probabilities are not known either, so there is no way to apply the Viterbi decoding technique directly. In such cases we can use some sort of morphological cue. For example, if the word encountered ends in “ed”, we might guess that it is a past-tense verb form, and if we find capitalization in the middle of a sentence, we might guess that the word is a proper name. A small sketch of such heuristics follows.
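A minimal sketch of morphological cues for guessing the tag of an unknown word (the specific suffix rules and tags are illustrative assumptions):

```python
# Guess a POS tag for a word that never appeared in the training corpus,
# using simple morphological and orthographic cues.
def guess_unknown_tag(word, position_in_sentence):
    if word[0].isupper() and position_in_sentence > 0:
        return "NNP"      # capitalized mid-sentence: likely a proper noun
    if word.endswith("ed"):
        return "VBD"      # likely a past-tense verb form
    if word.endswith("ing"):
        return "VBG"      # likely a gerund / present participle
    if word.endswith("ly"):
        return "RB"       # likely an adverb
    return "NN"           # default guess: common noun

print(guess_unknown_tag("Gattaca", 3))       # NNP
print(guess_unknown_tag("frobnicated", 2))   # VBD
```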
2. Limited Context:
Under the first-order Markov assumption, each tag depends only on the previous tag, which does not take care of certain situations. For example, consider two sentences:
1) Is clearly marked
2) He clearly marked
In the first case, the tag sequence is “verb + past participle”, and in the second case it is “verb + past tense”. Just the immediately preceding context is not sufficient to indicate the most probable tag sequence. So we need to move to a higher-order HMM, say a second-order HMM, which conditions each tag on the previous two tags rather than on a single previous state.
Maximum Entropy Models:
Refer to the slides for the example.
Maximum Entropy Model
Similar to logistic regression, the maximum entropy (MaxEnt) model is a type of log-linear model. The MaxEnt model is more general than logistic regression: it handles multinomial distributions, whereas logistic regression handles binary classification.
The maximum entropy principle says that we should model a given set of data by choosing, among all distributions that satisfy the constraints imposed by our prior knowledge, the one with the highest entropy.
The feature functions of the MaxEnt model can be multi-class. For example, given (x, y), a feature function may return 0, 1 or 2.
The maximum entropy model is a conditional probability model p(y|x) that allows us to predict class labels given a set of features for a given data point. It does inference by taking the trained weights, performing a linear combination of the features, and finding the tag with the highest probability, i.e., the highest score for each tag.
The probability of each tag/class under the MaxEnt model is defined as:
p(y | x) = exp( Σ_{i=1..m} w_i * f_i(x, y) ) / Z(x)
where f_i is a feature function and w_i its weight. The summation over i = 1..m runs over all feature functions, m being the number of feature functions. The denominator Z(x) normalizes the probability:
Z(x) = Σ_{y'} exp( Σ_{i=1..m} w_i * f_i(x, y') )
where y' ranges over all possible classes.
The MaxEnt model thus makes use of the log-linear modelling approach with feature functions, but it does not take the sequential nature of the data into account.
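A minimal sketch of MaxEnt inference under the formula above (the feature functions, weights and tag set are illustrative assumptions):

```python
# MaxEnt / log-linear inference: p(y|x) = exp(sum_i w_i * f_i(x, y)) / Z(x)
import math

TAGS = ["DT", "JJ", "NN", "RB"]

def features(x, y):
    # Binary feature functions over (observation, candidate tag); illustrative only.
    return {
        "word=the&tag=DT": 1.0 if x == "the" and y == "DT" else 0.0,
        "suffix=ly&tag=RB": 1.0 if x.endswith("ly") and y == "RB" else 0.0,
        "capitalized&tag=NN": 1.0 if x[0].isupper() and y == "NN" else 0.0,
    }

weights = {"word=the&tag=DT": 2.0, "suffix=ly&tag=RB": 1.5, "capitalized&tag=NN": 0.8}

def score(x, y):
    return sum(weights.get(name, 0.0) * value for name, value in features(x, y).items())

def prob(y, x):
    z = sum(math.exp(score(x, other)) for other in TAGS)   # Z(x): sum over all tags
    return math.exp(score(x, y)) / z

for tag in TAGS:
    print(tag, round(prob(tag, "the"), 3))
```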
Maximum Entropy Markov Model (MEMM)
From the maximum entropy model we can extend to the Maximum Entropy Markov Model (MEMM). This approach allows us to keep the HMM idea of modelling a sequence of data and to combine it with the maximum entropy model's feature functions and normalization.
The MEMM has explicit dependencies between each state and the full observation sequence, which makes it more expressive than an HMM.
In the HMM model, we saw that it uses two probability matrices (state transition and emission probabilities). We need to predict a tag given an observation, but the HMM models the probability of a tag producing a certain observation; this is due to its generative approach. Instead of the transition and observation matrices of the HMM, the MEMM has only one transition probability matrix, which encapsulates all combinations of previous state y_i-1 and current observation x_i seen in the training data, mapping them to the current state y_i.
Our goal is to find p(y_1, y_2, …, y_n | x_1, x_2, …, x_n), which factorizes as:
p(y_1, …, y_n | x) = Π_{i=1..n} p(y_i | y_1, …, y_i-1, x)
Since, as in the HMM, each state depends only on the previous state, we can limit the condition to y_i given y_i-1. This is the Markov independence assumption:
p(y_1, …, y_n | x) = Π_{i=1..n} p(y_i | y_i-1, x)
So the Maximum Entropy Markov Model (MEMM) defines each local term using a log-linear model:
p(y_i | y_i-1, x) = exp( Σ_{k=1..m} w_k * f_k(y_i-1, y_i, x, i) ) / Z(y_i-1, x)
where x is the full sequence of inputs x_1 to x_n, y is the corresponding sequence of labels or tags, i is the position to be tagged, and n is the length of the sentence. The denominator Z(y_i-1, x) is the local normalizer, defined as:
Z(y_i-1, x) = Σ_{y'} exp( Σ_{k=1..m} w_k * f_k(y_i-1, y', x, i) )
MEMM can incorporate more features through its feature functions as input, whereas HMM requires the likelihood of each feature to be computed, since it is likelihood-based. The feature functions of MEMM also depend on the previous tag y_i-1. As an example (from the word-segmentation illustration): a feature function for the letter ‘e’ in ‘test’ returns 1 when the current tag is M and the previous tag is B, and 0 otherwise.
The MEMM has a richer set of observation features that can describe observations in terms of many overlapping features. For example, in the word-segmentation setting, we could have features like capitalization, vowel or consonant, or the type of the character.
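A minimal sketch of an MEMM-style feature function that depends on the previous tag as well as on the whole observation sequence (the tags B/M and the specific features are illustrative, following the word-segmentation example above):

```python
# MEMM feature function f(y_prev, y_cur, x, i): unlike a plain MaxEnt feature,
# it can inspect the previous tag and the entire observation sequence x.
def memm_features(y_prev, y_cur, x, i):
    ch = x[i]
    return {
        f"prev={y_prev}&cur={y_cur}": 1.0,                # tag-transition feature
        f"char={ch}&cur={y_cur}": 1.0,                    # emission-like feature
        f"is_vowel&cur={y_cur}": 1.0 if ch in "aeiou" else 0.0,
        f"left={x[i-1] if i > 0 else '<s>'}&cur={y_cur}": 1.0,
    }

# For the letter 'e' in 'test' (position i=1), previous tag B, current tag M:
print(memm_features("B", "M", "test", 1))
```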
Conditional Random Field:
CRF is a discriminative model for sequence data, similar to the MEMM. It models the dependency between each state and the entire input sequence. Unlike the MEMM, the CRF overcomes the label bias issue by using a global normalizer.
Here we focus on the linear-chain CRF, a special type of CRF that models the output variables as a sequence; this fits our use case of having sequential inputs.
Let x be the input sequence, y the label sequence, and w the weight vector. In the MEMM, we defined P(y|x) earlier as a product of locally normalized terms:
P(y | x) = Π_{j=1..n} exp( Σ_{i=1..m} w_i * f_i(y_j-1, y_j, x, j) ) / Z(y_j-1, x)
where each Z(y_j-1, x) normalizes over the current label only. In contrast, the Conditional Random Field is described as:
P(y | x) = exp( Σ_{j=1..n} Σ_{i=1..m} w_i * f_i(y_j-1, y_j, x, j) ) / Z(x)
with the global normalizer Z(x) defined as:
Z(x) = Σ_{y'} exp( Σ_{j=1..n} Σ_{i=1..m} w_i * f_i(y'_j-1, y'_j, x, j) )
The summation over j = 1..n runs over all positions of the input sequence; this is what is new compared with the Maximum Entropy Model, since the whole label sequence is considered in the prediction instead of a single label. The variable j specifies the position in the input sequence x.
The summation over i = 1..m runs over all feature functions.
The summation over y' in Z(x) runs over all possible label sequences; it is performed to obtain a properly normalized probability.
f_i is the feature function, with details below. Z(x) will be discussed next.
Feature Function
Similar to the MEMM, f(y_j-1, y_j, x, j) is the feature function. For example, if the index is j = 2, the current state is M (y_j = M), the previous state is B (y_j-1 = B), and x at that position is the character ‘e’, then the feature value is 1; otherwise it is 0. (This is the example feature for the letter ‘e’ in ‘test’ at j = 2.)
The feature function can take any real value, but often it just has the value 0 or 1, where 1 stands for a specific feature firing and 0 otherwise. Feature functions can overlap in many ways. In NLP, a feature can be whether the word is capitalized, whether it is punctuation, or a particular prefix or suffix. The feature function has access to the whole observation x, so we can also look at the word to the left or right. In the segmentation case, it can be the letter to the left or right, or the type of that letter; in Khmer text, the type can be consonant, vowel, independent vowel, diacritic, etc. A sketch of such overlapping word features follows.
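For the NLP case mentioned above (capitalization, affixes, neighbouring words), a minimal sketch of a per-position feature extractor for POS tagging might look as follows (all feature names are our own illustrative choices):

```python
# Overlapping, per-position features for a linear-chain CRF tagger.
# The feature function sees the whole sentence, so it can look left and right.
def word2features(sentence, j):
    word = sentence[j]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "prev_word": sentence[j - 1].lower() if j > 0 else "<s>",
        "next_word": sentence[j + 1].lower() if j < len(sentence) - 1 else "</s>",
    }

def sent2features(sentence):
    return [word2features(sentence, j) for j in range(len(sentence))]

print(sent2features("The light book".split())[1])
```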
Partition Function
To overcome the label bias problem, the CRF uses the global normalizer Z(x) instead of the local normalizer used in the MEMM. So Z(x) in the CRF takes a sum over all possible tag sequences y ∈ Y, as shown earlier:
Z(x) = Σ_{y'} exp( Σ_{j=1..n} Σ_{i=1..m} w_i * f_i(y'_j-1, y'_j, x, j) )
Note that the y summed over here is not the same as the y in the numerator; it is local to this calculation and is generally notated as y'.
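A minimal end-to-end sketch of training a linear-chain CRF tagger (this assumes the third-party sklearn-crfsuite package, the sent2features helper from the previous sketch, and the NLTK treebank train_sents split used earlier; none of this is prescribed by the original text):

```python
# Linear-chain CRF POS tagger using the sklearn-crfsuite package.
# X is a list of sentences, each a list of per-position feature dicts;
# y is the corresponding list of tag sequences.
import sklearn_crfsuite

X_train = [sent2features([w for w, t in sent]) for sent in train_sents]
y_train = [[t for w, t in sent] for sent in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)

print(crf.predict([sent2features("The light book".split())]))
```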
1. crf_intro.dvi (umass.edu)
2. Introduction to Conditional Random Fields (CRFs) - AI, ML, Data Science Articles |
Interviews | Insights | AI TIME JOURNAL
Differences between HMM, MEMM and CRF:
In brief: the HMM is a generative model of the joint probability P(W, T) with locally defined transition and emission probabilities; the MEMM is a discriminative, locally normalized log-linear model, which can suffer from the label bias problem; the CRF is a discriminative, globally normalized model and therefore avoids label bias. For detailed comparisons, see:
Conditional Random Fields for Sequence Prediction (davidsbatista.net)
HMM, MEMM, and CRF: A Comparative Analysis of Statistical Modeling Methods | by Alibaba Cloud | Medium