NLP_Module1-4
Module 1
Chapter 1: Introduction
What is Language?
Language is a structured system of communication that uses symbols (like words and
grammar) to convey meaning. In the context of NLP, we usually refer to natural languages,
such as English, Hindi, or Tamil—languages that evolved naturally among humans, as
opposed to artificial or programming languages.
A common confusion exists between Natural Language Processing (NLP) and Natural
Language Understanding (NLU). While NLP is the broader field that encompasses all
tasks involving natural language — including processing, generation, and understanding —
NLU specifically refers to systems that interpret or "understand" the meaning of human
language input. In other words, NLP includes both surface-level tasks (like tokenization,
part-of-speech tagging) and deep-level tasks (like sentiment analysis, intent detection),
whereas NLU focuses on the latter. NLP is sometimes referred to as Natural Language
Understanding because many of its core goals revolve around making machines
understand and interpret human language like humans do. However, this usage is more
informal and reflects the aspiration of the field rather than a strict equivalence.
Another important context for NLP is its relationship with Computational Linguistics, which
is often viewed as a bridge between theoretical linguistics and psycholinguistics.
Theoretical linguistics is concerned with understanding the abstract rules and structures of
language — including syntax, semantics, and phonology — without necessarily applying
them in practical systems. Psycholinguistics, on the other hand, deals with how language
Natural language processing notes 3
In terms of how computers process language, two broad types of computational models
have evolved: knowledge-driven and data-driven models. Knowledge-driven models rely
on hand-crafted rules and symbolic representations of grammar and meaning. These were
dominant in the early decades of NLP and often require linguistic expertise. However, they
struggle with ambiguity and scale. In contrast, data-driven models use statistical or
machine learning techniques to learn language patterns from large text corpora. These
models have become more dominant in recent years, especially with the rise of deep
learning, as they can automatically infer patterns from data without relying heavily on
human-written rules.
One of the earliest and most impactful applications of NLP was in Information Retrieval
(IR), the task of finding relevant documents or information from large collections. Search
engines like Google use sophisticated NLP techniques to understand user queries, correct
spelling, rank documents by relevance, and extract meaningful snippets. NLP helps in query
expansion, synonym detection, and relevance feedback, making IR systems more accurate
and responsive to natural human language rather than just keyword matching.
In the context of NLP, language refers to a system of communication that uses structured
symbols—spoken, written, or signed—to convey meaning. Natural languages like English,
Hindi, or Tamil are inherently ambiguous, complex, and context-dependent, which makes them challenging for machines to process automatically.
The first step in processing language is usually lexical analysis, which involves breaking
down the input text into tokens—individual units like words, punctuation, and symbols. It also
involves assigning categories like part-of-speech tags (e.g., noun, verb, adjective) to each
token. This stage helps structure the text so it can be processed more deeply in later stages.
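As a minimal sketch of lexical analysis, the following Python fragment tokenizes a sentence with a regular expression and assigns categories from a tiny made-up lookup table (the pattern and the table are illustrative assumptions, not a standard tag set):

import re

def tokenize(text):
    # Split text into word tokens and punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", text)

# A toy lookup standing in for a real POS lexicon (assumed for illustration).
TOY_TAGS = {"the": "DET", "boy": "NOUN", "ate": "VERB", "apple": "NOUN", ".": "PUNCT"}

tokens = tokenize("The boy ate the apple.")
tagged = [(tok, TOY_TAGS.get(tok.lower(), "UNK")) for tok in tokens]
print(tagged)   # [('The', 'DET'), ('boy', 'NOUN'), ('ate', 'VERB'), ('the', 'DET'), ('apple', 'NOUN'), ('.', 'PUNCT')]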
Next is word-level processing, where each word's meaning and properties are analyzed.
This includes looking up dictionary definitions, understanding synonyms or antonyms, and
checking word usage. A fundamental concept here is the morpheme, the smallest unit of
meaning in a language. For instance, in the word “unhappiness”, “un-”, “happy”, and “-ness”
are three morphemes. Identifying morphemes helps in understanding the structure and
meaning of words beyond their surface forms.
After word-level processing, the focus shifts to syntactic analysis (or parsing), which
involves determining the grammatical structure of a sentence. This includes analyzing how
words are grouped into phrases and how those phrases relate to each other in a hierarchy.
For example, in the sentence "The boy ate the apple," syntactic analysis identifies "The boy"
as the subject noun phrase and "ate the apple" as the verb phrase.
Once syntax is understood, semantic analysis aims to derive the meaning of a sentence.
This involves mapping syntactic structures to logical representations and identifying the roles
of words (e.g., who is doing what to whom). Semantic analysis tries to resolve word sense
disambiguation (e.g., the word “bank” could mean a financial institution or a riverbank) and
capture the intended meaning of phrases and sentences.
Moving beyond individual sentences, discourse analysis deals with the structure and
meaning of connected text or dialogue. It considers how one sentence relates to the next
and how information flows across sentences. For example, resolving anaphora (i.e.,
identifying what a pronoun refers to) is a key task in discourse analysis — in "John dropped
the glass. It broke," the word "it" refers to "the glass."
Finally, pragmatic analysis focuses on how context influences interpretation. This includes
speaker intention, tone, politeness, and real-world knowledge. For example, if someone
says, “Can you pass the salt?”, a pragmatic analysis understands it not as a question about
ability but as a polite request. Pragmatics allows machines to go beyond literal meanings
and engage in more natural communication.
Natural Language Processing (NLP) deals with the inherently complex and ambiguous
nature of human language. One of the key challenges is representation and
interpretation, which refers to how machines can represent the structure and meaning of
language in a formal way that computers can manipulate. Unlike numbers or code, natural
language involves abstract concepts, emotions, and context, making it difficult to represent
using fixed logical forms or algorithms. Interpretation becomes even harder when the same
sentence can carry different meanings depending on the speaker’s intent, cultural
background, or tone.
Another major challenge is identifying semantics, especially in the presence of idioms and
metaphors. Idioms such as "kick the bucket" or "spill the beans" have meanings that cannot
be derived from the literal meaning of the words. Similarly, metaphors like "time is a thief"
require deep contextual and cultural understanding, which machines struggle to grasp.
These figurative expressions pose a serious problem for semantic analysis since they don't
follow regular linguistic patterns.
Quantifier scoping is another subtle issue, dealing with how quantifiers (like “all,” “some,”
“none”) affect the meaning of sentences. For example, the sentence “Every student read a
book” can mean either that all students read the same book or that each student read a
different one. Disambiguating such sentences requires complex logical reasoning and
context awareness.
Ambiguity is one of the most persistent challenges in NLP. At the word level, there are two
main types: part-of-speech ambiguity and semantic ambiguity. In part-of-speech
ambiguity, a word like “book” can be a noun (“a book”) or a verb (“to book a ticket”), and the
correct tag must be determined based on context. This ties into the task of Part-of-Speech
(POS) tagging, where the system must assign correct grammatical labels to each word in a
sentence, often using probabilistic models like Hidden Markov Models or neural networks.
In terms of semantic ambiguity, many words have multiple meanings—a problem known as
polysemy. For instance, the word “bat” can refer to a flying mammal or a piece of sports
equipment. Resolving this is the goal of Word Sense Disambiguation (WSD), which
attempts to determine the most appropriate meaning of a word in a given context. WSD is
particularly difficult in resource-poor languages or when the context is vague.
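The intuition behind many WSD methods can be sketched with a simplified Lesk-style overlap count in Python; the two-sense inventory for "bank" below is an invented toy, not a real dictionary:

def simple_lesk(context_words, sense_glosses):
    # Pick the sense whose gloss shares the most words with the context.
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

senses = {
    "bank_financial": "an institution that accepts deposits of money and gives loans",
    "bank_river": "sloping land beside a body of water such as a river",
}
print(simple_lesk("I deposited money at the bank".split(), senses))   # bank_financial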
Another type of complexity arises from structural ambiguity, where a sentence can be
parsed in more than one grammatical way. For example, in “I saw the man with a telescope,”
it is unclear whether the telescope was used by the speaker or the man. Structural ambiguity
can lead to multiple interpretations and is a major hurdle in syntactic and semantic parsing.
Owing to the influence of world knowledge on both the selection of words (lexical items) and the conventions for structuring them, the boundary between syntax and semantics often becomes blurred. Nonetheless, maintaining a separation between the two is considered beneficial for ease of grammar writing and language processing.
One of the main challenges in defining the structure of natural language is its dynamic
nature and the presence of numerous exceptions that are difficult to capture formally. Over
time, several grammatical frameworks have been proposed to address these challenges.
Prominent among them are transformational grammar (Chomsky, 1957), lexical functional
grammar (Kaplan and Bresnan, 1982), government and binding theory (Chomsky, 1981),
generalized phrase structure grammar, dependency grammar, Paninian grammar, and
tree-adjoining grammar (Joshi, 1985). While some of these grammars focus on the
derivational aspects of sentence formation (e.g., phrase structure grammar), others
emphasize relational properties (e.g., dependency grammar, lexical functional grammar,
Paninian grammar, and link grammar).
The most significant contribution in this area has been made by Noam Chomsky, who
proposed a formal hierarchy of grammars based on their expressive power. These
grammars employ phrase structure or rewrite rules to generate well-formed sentences in a
language. The general framework proposed by Chomsky is referred to as generative
grammar, which consists of a finite set of rules capable of generating all and only the
grammatical sentences of a language. Chomsky also introduced transformational grammar,
asserting that natural languages cannot be adequately represented using phrase structure
rules alone. In his work Syntactic Structures (1957), he proposed that each sentence has
two levels of representation: the deep structure, which captures the sentence's core
meaning, and the surface structure, which represents the actual form of the sentence. The
transformation from deep to surface structure is accomplished through transformational
rules.
For example, for the sentence “The police will catch the snatcher,” the phrase structure rules generate a parse tree which, in bracketed form, looks like:
[S [NP The police] [VP [Aux will] [V catch] [NP the snatcher]]]
This tree represents the syntactic structure of the sentence as derived from phrase structure rules.
Transformational rules are applied to the output of the phrase structure grammar and are
used to modify sentence structures. These rules may have multiple symbols on the left-hand
side and enable transformations such as changing an active sentence into a passive one.
For example, Chomsky provided a rule for converting active to passive constructions:
NP₁ + Aux + V + NP₂ → NP₂ + Aux + be + en + V + by + NP₁.
This rule inserts the strings “be” and “en” and rearranges sentence constituents to reflect a
passive construction. Transformational rules can be either obligatory, ensuring grammatical
agreement (such as subject-verb agreement), or optional, allowing for structural variations
while preserving meaning.
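Applied to the earlier example, this rule maps the active sentence "The police will catch the snatcher" (NP₁ = the police, Aux = will, V = catch, NP₂ = the snatcher) onto the passive "The snatcher will be caught by the police," where the inserted "be + en" surfaces as the passive auxiliary and the past-participle ending of "caught."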
However, phrase structure rules often struggle to account for more complex linguistic
phenomena such as embedded noun phrases containing adjectives, modifiers, or relative
clauses. These phenomena give rise to what are known as long-distance dependencies,
where related elements like a verb and its object may be separated by arbitrary amounts of
intervening text. Such dependencies are not easily handled at the surface structure level. A
specific case of long-distance dependency is wh-movement, where interrogative words like
“what” or “who” are moved to the front of a sentence, creating non-local syntactic
relationships. These limitations highlight the need for more advanced grammatical
frameworks like tree-adjoining grammars (TAGs), which can effectively model such
syntactic phenomena due to their capacity to represent recursion and long-distance
dependencies more naturally than standard phrase structure rules.
Indian languages generally follow a Subject-Object-Verb (SOV) structure and also allow a non-linear (free) ordering of constituents.
Indian languages exhibit several unique linguistic properties that distinguish them from many
Western languages, especially English. One of the most prominent characteristics is their
free word order. This means that the words in a sentence can often be rearranged without
altering the core meaning of the sentence. This syntactic flexibility presents significant
challenges in tasks like parsing and machine translation.
Moreover, Indian languages display a rich morphological structure, with a large number of
morphological variants for nouns, verbs, and adjectives. These variants are used to convey
different grammatical categories such as tense, gender, number, and case.
Another notable feature of Indian languages is the extensive and productive use of
complex predicates (CPs). A complex predicate consists of a main verb and one or more
auxiliary verbs that together express a single verbal idea. For example, expressions like "जा
रहा है " (ja raha hai – "is going") and "खेल रही है " (khel rahi hai – "is playing") involve verb
complexes, where auxiliary verbs provide information related to tense, aspect, and modality.
Indian languages predominantly use postpositions, also known as karakas, instead of the
prepositions used in English. These postpositions appear after the noun or pronoun they
relate to and are used to indicate grammatical roles such as subject, object, instrument,
location, etc.
Languages like Hindi and Urdu are closely related in terms of phonology, morphology,
and syntax. Despite being written in different scripts and having different lexical
influences—Hindi borrowing largely from Sanskrit and Urdu from Persian and Arabic—both
languages are free-word-order languages, use postpositions, and share a significant portion
of vocabulary and grammatical structure.
Applications of NLP
Natural Language Processing (NLP) has a wide range of applications that aim to bridge the
gap between human language and computational systems. One of the major applications of
NLP is Machine Translation (MT), which involves automatically converting text or speech
from one language to another. MT systems analyze the source language for syntax and
semantics and generate equivalent content in the target language. Examples include Google
Translate and Microsoft Translator. The challenge in MT lies in handling grammar, idioms,
context, and word order, especially for Indian languages, which have a free word order.
Speech Synthesis, also known as Text-to-Speech (TTS), is the reverse process, where
written text is converted into spoken output. TTS systems are used in applications for
visually impaired users, public announcement systems, and interactive voice response (IVR)
systems. These systems require natural-sounding voice output, correct intonation, and
pronunciation.
Natural Language Interfaces to Databases (NLIDB) allow users to interact with databases
using natural language queries instead of structured query languages like SQL. For
example, a user can ask “What is the balance in my savings account?” and the system
translates it into a database query. This application requires robust parsing, semantic
interpretation, and domain understanding.
Information Retrieval (IR) deals with finding relevant documents or data in response to a
user query. Search engines like Google, Bing, and academic databases are practical
implementations of IR. NLP techniques help in query expansion, stemming, and ranking
results by relevance.
Question Answering (QA) systems provide direct answers to user questions instead of
listing documents. For example, a QA system can answer “Who is the President of India?”
by retrieving the exact answer from a knowledge base or corpus. These systems require
deep linguistic analysis, context understanding, and often integrate IR and IE.
N-gram model
The goal of statistical language models is to estimate the probability (likelihood) of a sentence. This is achieved by decomposing the sentence probability into a product of conditional probabilities using the chain rule as follows:
P(w₁, w₂, ..., wₙ) = P(w₁) · P(w₂ | w₁) · P(w₃ | w₁w₂) · ... · P(wₙ | w₁ ... wₙ₋₁)
In order to calculate the sentence probability, we need the probability of a word given the sequence of words preceding it. An n-gram model simplifies this task by approximating the probability of a word given all the previous words by the conditional probability given only the previous n−1 words:
P(wᵢ | w₁ ... wᵢ₋₁) ≈ P(wᵢ | wᵢ₋ₙ₊₁ ... wᵢ₋₁)
● A model that considers only the previous word is called a bigram model (n = 2).
● A model that considers the previous two words is called a trigram model (n = 3).
Using bigram and trigram models, the probability of a sentence w₁, w₂, ..., wₙ can be estimated as:
Bigram: P(w₁, w₂, ..., wₙ) ≈ P(w₁) · P(w₂ | w₁) · P(w₃ | w₂) · ... · P(wₙ | wₙ₋₁)
Trigram: P(w₁, w₂, ..., wₙ) ≈ P(w₁) · P(w₂ | w₁) · P(w₃ | w₁w₂) · ... · P(wₙ | wₙ₋₂wₙ₋₁)
A special pseudo-word <s> is introduced to mark the beginning of the sentence in bigram estimation, and the probability of the first word in a sentence is conditioned on <s>. Similarly, in trigram estimation, two pseudo-words <s1> and <s2> are introduced.
Estimation of probabilities is done by training the n-gram model on a training corpus. We estimate the n-gram parameters using the maximum likelihood estimation (MLE) technique, i.e., using relative frequencies: we count a particular n-gram in the training corpus and divide it by the sum of the counts of all n-grams that share the same (n−1)-word prefix. Since that sum is simply the count of the common prefix itself, the bigram estimate is
P(wᵢ | wᵢ₋₁) = C(wᵢ₋₁wᵢ) / C(wᵢ₋₁)
The model parameters obtained using these estimates maximize the probability of the
training set T given the model M, i.e., P(T|M). However, the frequency with which a word
occurs in a text may differ from its frequency in the training set. Therefore, the model only
provides the most likely solution based on the training data.
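As a minimal sketch, these MLE bigram estimates can be computed in Python over an invented two-sentence toy corpus:

from collections import Counter

corpus = [["<s>", "i", "like", "tea"], ["<s>", "i", "like", "coffee"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[k], sent[k + 1]) for sent in corpus for k in range(len(sent) - 1))

def p_mle(w, prev):
    # P(w | prev) = C(prev w) / C(prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("like", "i"))     # 1.0  ("like" always follows "i" in the toy corpus)
print(p_mle("tea", "like"))   # 0.5
print(p_mle("milk", "like"))  # 0.0  (an unseen bigram receives zero probability)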
Several improvements have been proposed for the standard n-gram model. Before
discussing these enhancements, let us illustrate the underlying ideas with the help of an
example.
The n-gram model suffers from data sparseness. n-grams not seen in the training data are
assigned zero probability, even in large corpora. This is due to the assumption that a word’s
probability depends only on the preceding word(s), which is often not true. Natural language
contains long-distance dependencies that this model cannot capture.
To handle data sparseness, various smoothing techniques have been developed, such as
add-one smoothing. As Jurafsky and Martin (2000) note, the term smoothing reflects the way these techniques adjust probabilities toward more uniform distributions.
Add-one Smoothing
Add-One Smoothing is a simple technique used to handle the data sparseness problem in
n-gram language models by avoiding zero probabilities for unseen n-grams.
In an n-gram model, if a particular n-gram (like a word or word pair) does not occur in the training data, it is assigned a probability of zero, which can negatively affect the overall probability of a sentence. Add-one smoothing helps by adding one to every count before normalizing, which assigns a small non-zero probability to these unseen events. For a bigram model:
P(wᵢ | wᵢ₋₁) = (C(wᵢ₋₁wᵢ) + 1) / (C(wᵢ₋₁) + V)
where V is the vocabulary size.
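Continuing the toy bigram sketch, add-one smoothing can be illustrated in Python as follows (again with invented data):

from collections import Counter

corpus = [["<s>", "i", "like", "tea"], ["<s>", "i", "like", "coffee"]]
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[k], sent[k + 1]) for sent in corpus for k in range(len(sent) - 1))
V = len(unigrams)   # vocabulary size

def p_add_one(w, prev):
    # P(w | prev) = (C(prev w) + 1) / (C(prev) + V)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_add_one("tea", "like"))    # (1 + 1) / (2 + 5) ≈ 0.286
print(p_add_one("milk", "like"))   # (0 + 1) / (2 + 5) ≈ 0.143, unseen but now non-zero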
Paninian Framework
Paninian Grammar is a highly influential grammatical framework, based on the ancient
Sanskrit grammarian Panini. It provides a rule-based structure for analyzing sentence
formation using deep linguistic features such as vibhakti (case suffixes) and karaka
(semantic roles). Unlike many Western grammars which focus on syntax, Paninian
grammar emphasizes the relationship between semantics and syntax, making it well-suited
for Indian languages with free word order.
The Paninian framework analyses a sentence at a series of levels. The first of these is the surface level, i.e., the sentence as it actually appears in speech or writing.
● It contains inflected words with suffixes (like tense, case markers, gender, number,
etc.).
● The sentence is in linear word form but doesn’t reveal deeper structure.
Example (Hindi):
राम ने फल खाया (Rām ne phal khāyā)
At surface level:
राम + ने | फल | खाया
Here, "ने" is a vibhakti marker.
● These vibhaktis provide syntactic cues about the role of each noun in the sentence.
● Different vibhaktis (e.g., ने, को, से, का) indicate nominative, accusative,
instrumental, genitive etc.
Example:
"को" → Dative/Accusative
"से" → Instrumental
● Karaka relations assign semantic roles to nouns, like agent, object, instrument,
source, etc.
Karaka roles are assigned based on the verb and semantic dependency, not fixed
word order.
Example:
राम ने फल खाया
● The semantic level captures the final meaning of the sentence by combining all the karaka roles.
Example meaning:
"Rām ate the fruit."
→ {Agent: Ram, Action: eat, Object: fruit}
Karaka Theory
Karaka Theory is a fundamental concept in Paninian grammar that explains the semantic
roles of nouns in relation to the verb in a sentence. It helps identify who is doing what to
whom, with what, for whom, where, and from where, etc.
Unlike English grammar which focuses on Subject/Object, karaka theory goes deeper into
semantic functions.
3. Karana – the instrument of the action; marked by the instrumental vibhakti (से), e.g., चाकू से काटा ("cut with a knife").
Features:
● They are language-independent roles; similar roles exist in many world languages.
For example, in राम ने फल खाया, राम is the Karta (agent) and फल is the Karma (object).
However, many issues remain unresolved, especially in cases of shared Karaka relations.
Another difficulty arises when the mapping between the Vibhakti (case markers and
postpositions) and the semantic relation (with respect to the verb) is not one-to-one. Two
different Vibhaktis can represent the same relation, or the same Vibhakti can represent
different relations in different contexts. The strategy to disambiguate the various senses of
words or word groupings remains a challenging issue.
As the system of rules differs across languages, the framework requires adaptation to
handle various applications in different languages. Only some general features of the PG
framework have been described here.
Module 2
We have all used simplified forms of regular expressions, such as the file search patterns used by MS-DOS, e.g., dir *.txt.
The use of regular expressions in computer science was made popular by a Unix-based
editor, 'ed'. Perl was the first language that provided integrated support for regular
expressions. It used a slash around each regular expression; we will follow the same
notation in this book. However, slashes are not part of the regular expression itself.
Regular expressions were originally studied as part of the theory of computation. They
were first introduced by Kleene (1956). A regular expression is an algebraic formula whose
value is a pattern consisting of a set of strings, called the language of the expression. The
simplest kind of regular expression contains a single symbol.
For example, the expression /a/ denotes the set containing the string 'a'. A regular
expression may specify a sequence of characters also. For example, the expression
/supernova/ denotes the set that contains the string "supernova" and nothing else.
In a search application, the first instance of each match to a regular expression is underlined, as in the following table:

RE | Example text
/book/ | The world is a book, and those who do not travel read only one page.
/book/ | Reporters, who do not read the stylebook, should not criticize their editors.
/faced/ | Not everything that is faced can be changed. But nothing can be changed until it is faced.
Character Classes
Characters are grouped by putting them between square brackets [ ]. Any character in the
class will match one character in the input. For example, the pattern /[abcd]/ will match a,
b, c, or d. This is called disjunction of characters.
● /[a-z]/ specifies any lowercase letter (you can use - for range).
Regular expressions can also specify what a character cannot be, using a caret (^) at the
beginning of the brackets.
● This interpretation is true only when the caret is the first character inside brackets.
Case Sensitivity:
● Regex is case-sensitive.
Anchors:
● Anchors tie a pattern to a position in the string: the caret (^) matches the start of a line and the dollar sign ($) matches the end of a line.
● For example, a pattern anchored with /berry$/ matches strings ending in "berry", such as:
○ strawberry
○ blackberry
○ sugarberry
RE | Description
\n | Newline character
\t | Tab character
\d | Any digit (0-9)
\D | Any non-digit
\w | Any alphanumeric character
\W | Any non-alphanumeric character
\s | Any whitespace character
\S | Any non-whitespace character
Real-world use:
A common real-world use of regular expressions is recognizing email addresses with a simple pattern. Such a specification works for most cases. However, it is not based on any standard and may not be accurate enough to match all correct email addresses: it may accept non-working addresses and reject working ones. Fine-tuning is required for an accurate characterization.
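A Python sketch of such a simple email pattern (the pattern below is an illustrative assumption, not a standard specification):

import re

# Rough pattern: local part of word characters, dots, '+' or '-', an '@', then a dotted domain.
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

for addr in ["alice@example.com", "bob.smith@mail.co.in", "not-an-email"]:
    print(addr, "->", bool(EMAIL.match(addr)))
# alice@example.com -> True, bob.smith@mail.co.in -> True, not-an-email -> False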
A regular expression may contain symbol pairs. For example, the regular expression /a:b/
represents a pair of strings. The regular expression /a:b/ actually denotes a regular
relation. A regular relation may be viewed as a mapping between two regular languages.
The a:b relation is simply the cross product of the languages denoted by the expressions
/a/ and /b/. To differentiate the two languages that are involved in a regular relation, we call
the first one the upper, and the second one the lower language of the relation. Similarly, in
the pair /a:b/, the first symbol, a, can be called the upper symbol, and the second symbol,
b, the lower symbol.
The two components of a symbol pair are separated in our notation by a colon (:), without
any whitespace before or after. To make the notation less cumbersome, we ignore the
distinction between the language A and the identity relation that maps every string of A to
itself. Therefore, we also write /a:a/ simply as /a/.
Similarly, regular relations can be represented using finite-state transducers. With this
representation, it is possible to derive new regular languages and relations by applying
regular operators, instead of re-writing the grammars.
Finite-state automata have been used in a wide variety of areas, including linguistics,
electrical engineering, computer science, mathematics, and logic. These are an important
tool in computational linguistics and have been used as a mathematical device to implement
regular expressions. Any regular expression can be represented by a finite automaton and
the language of any finite automaton can be described by a regular expression. Both have
the same expressive power.
The following formal definitions of the two types of finite state automaton, namely,
deterministic and non-deterministic finite automaton, are taken from Hopcroft and Ullman
(1979).
A deterministic finite-state automaton (DFA) is a 5-tuple (Q, Σ, δ, q₀, F), where:
● Q is a finite set of states,
● Σ is a finite input alphabet,
● δ : Q × Σ → Q is the transition function,
● q₀ ∈ Q is the start state, and
● F ⊆ Q is the set of final (accepting) states.
A non-deterministic finite automaton (NFA) differs only in its transition function, which maps a state and an input symbol to a set of states rather than to a single state.
Path is a sequence of transitions beginning with the start state. A path leading to one of the
final states is a successful path. The FSAs encode regular languages. The language that an
FSA encodes is the set of strings that can be formed by concatenating the symbols along
each successful path. Clearly, for automata with cycles, these sets are not finite.
We now examine what happens when various input strings are presented to a finite state automaton. Consider the deterministic automaton described in Example 3.2 and the input ac. We start in state q and, on input a, go to state q₁. The next input symbol is c, so we go to state q₃. There is no more input left and we have not reached a final state, i.e., we have an unsuccessful end. Hence, the string ac is not recognized by the automaton.
This example illustrates how a Finite State Automaton (FSA) can be used to accept or recognize a string. The set of all strings that lead the automaton to a final state is called the language accepted or defined by the FSA. In this sense, ac is not a word in the language defined by the automaton.
Now, consider the input string acb. We start from the initial state q and move to state q₁. The next input symbol is c, so we transition to state q₃. The following symbol is b, which leads us to a final state. Since there is no more input left and we have reached a final state, this is a successful termination. Hence, the string acb is a valid word in the language defined by the automaton.
The language defined by this automaton can also be described by a regular expression. The example considered here is quite simple. In practice, however, the list of transition rules can be extensive. Listing all the transition rules may be inconvenient, so automata are often represented using a state-transition table.
In such a table:
● each row corresponds to a state and each column to an input symbol, and
● the entries indicate the resulting state after applying a particular input in a given state.
This table contains all the information required by the FSA to function. The state-transition
table for the automaton considered in Example 3.2 is shown in Table 3.5.
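A state-transition table can be stored directly as a dictionary, giving a small DFA simulator in Python; the particular automaton below (accepting strings of the form ab*a) is an assumed example, not the one in the figures:

# Transition table: (state, input symbol) -> next state
DELTA = {("q0", "a"): "q1", ("q1", "b"): "q1", ("q1", "a"): "q2"}
START, FINALS = "q0", {"q2"}

def accepts(string):
    state = START
    for symbol in string:
        if (state, symbol) not in DELTA:
            return False              # no transition defined: reject
        state = DELTA[(state, symbol)]
    return state in FINALS            # successful path only if we end in a final state

print(accepts("abba"))   # True
print(accepts("ab"))     # False (ends in a non-final state)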
Two automata that define the same language are said to be equivalent. An NFA can be
converted to an equivalent DFA and vice versa. The equivalent DFA for the NFA shown in
Figure 3.3 is shown in Figure 3.4.
Morphological parsing
Morphology is a subdiscipline of linguistics. It studies word structure and the formation of
words from smaller units (morphemes). The goal of morphological parsing is to discover the
morphemes that build a given word. Morphemes are the smallest meaning bearing units in a
language. For example, the word bread consists of a single morpheme, while eggs consists of two morphemes: egg + -s.
There are two broad classes of morphemes: stems and affixes. The stem is the main
morpheme - the morpheme that contains the central meaning. Affixes modify the meaning
given by the stem. Affixes are divided into prefixes, suffixes, infixes and circumfixes.
Prefixes are morphemes which appear before the stem, and suffixes are morphemes
applied to the end of the stem. Circumfixes are morphemes that attach to both ends of the stem simultaneously, while infixes are morphemes that appear inside a stem.
Example: Here is a list of singular and plural forms of a few Telugu words. (Note: Kannada words may be used instead for the examples.)
There are three main ways of word formation:
1. Inflection: Here, a root word is combined with a grammatical morpheme to yield a
word of the same class as the original stem.
2. Derivation: It combines a word stem with a grammatical morpheme to yield a word
belonging to a different class, e.g., formation of the noun ‘computation’ from the verb
‘compute’.
The formation of a noun from a verb or adjective is called nominalization.
3. Compounding is the process of merging 2 or more words to form a new word. For
example, personal computer, desktop, overlook.
Morphological analysis and generation are essential to many NLP applications ranging from
spelling corrections to machine translations. In parsing, e.g., it helps to know the agreement
features of words. In IR, morphological analysis helps identify the presence of a query word
in a document in spite of different morphological variants.
Parsing in general, means taking an input and producing some sort of structures for it. In
NLP, this structure might be morphological, syntactic, semantic or pragmatic. Morphological
parsing takes as input the inflected surface form of each word in a text. As output, it
produces the parsed form consisting of a canonical form (or lemma) of the word and a set of
tags showing its syntactical category and morphological characteristics, e.g., possible part of
speech and/or inflectional properties (gender, number, person, tense, etc.). Morphological
generation is the inverse of this process. Both analysis and generation rely on two sources
of information: a dictionary of the valid lemmas of the language and a set of inflection
paradigms.
1. Lexicon
A lexicon lists stems and affixes together with basic information about them.
2. Morphotactics
There exists certain ordering among the morphemes that constitute a word. They cannot be
arranged arbitrarily. For example, rest-less-ness is a valid word in English but not
rest-ness-less. Morphotactics deals with the ordering of morphemes. It describes the way
morphemes are arranged or touch each other.
3. Orthographic rules
These are spelling rules that specify the changes that occur when two given morphemes combine. For example, the y → ier spelling rule changes easy to easier and not to easyer.
Morphological analysis can be avoided if an exhaustive lexicon is available that lists features
for all the word-forms of all the roots. Given a word, we simply consult the lexicon to get its
feature values. For example, suppose an exhaustive lexicon for Hindi contains the following
entries for the Hindi root word ghodhaa:
Given a word, say ghodhon, we can look up the lexicon to get its feature values.
However, this approach has several limitations. First, it puts a heavy demand on memory.
We have to list every form of the word, which results in a large number of, often redundant,
entries in the lexicon.
Second, an exhaustive lexicon fails to show the relationship between different roots having
similar word-forms. That means the approach fails to capture linguistic generalization, which
is essential to develop a system capable of understanding unknown words.
Third, for morphologically complex languages, like Turkish, the number of possible
word-forms may be theoretically infinite. It is not practical to list all possible word-forms in
these languages.
These limitations explain why morphological parsing is necessary. The complexity of the
morphological analysis varies widely among the world's languages, and is quite high even in
relatively simple cases, such as English.
The simplest morphological systems are stemmers that collapse morphological variations of
a given word (word-forms) to one lemma or stem. They do not require a lexicon. Stemmers
have been especially used in information retrieval. Two widely used stemming algorithms
have been developed by Lovins (1968) and Porter (1980).
Stemmers do not use a lexicon; instead, they make use of rewrite rules of the form:
● ational → ate
to transform words such as rotational into rotate.
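A sketch in Python of how such rewrite rules can be applied; the three rules listed are illustrative, not the full Porter rule set:

# Illustrative suffix rewrite rules of the form (suffix, replacement).
RULES = [("ational", "ate"), ("tional", "tion"), ("ies", "y")]

def stem(word):
    # Apply the first rewrite rule whose suffix matches the word.
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

print(stem("rotational"))   # rotate
print(stem("ponies"))       # pony
print(stem("bread"))        # bread (no rule applies)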
A limitation of Porter's algorithm is that it reduces only suffixes; prefixes and compounds are not handled.
Morphological parsing is viewed as a mapping from the surface level into morpheme and
feature sequences on the lexical level.
For example, the surface form ‘playing’ is represented in the lexical form as play+V+PP, as
shown in Figure 3.5. The lexical form consists of the stem ‘play’ followed by the
morphological information +V+PP, which tells us that ‘playing’ is the present participle form
of the verb.
Similarly, the surface form ‘books’ is represented in the lexical form as book+N+PL, where
the first component is the stem, and the second component ⟨N+PL⟩ is the morphological
information, which tells us that the surface level form is a plural noun.
This model is usually implemented with a kind of finite-state automaton called a finite-state transducer (FST). A transducer maps one set of symbols to another; a finite-state transducer does this through a finite-state automaton. An FST can be thought of as a two-tape automaton, which recognizes or generates a pair of strings.
An FST passes over the input string by consuming the input symbols on the tape it traverses
and converts it to the output string in the form of symbols.
Formally, an FST has been defined by Hopcroft and Ullman (1979) as a machine with the following components:
● Q, a finite set of states,
● Σ₁, a finite input alphabet,
● Σ₂, a finite output alphabet,
● q₀ ∈ Q, the start state,
● F ⊆ Q, the set of final states, and
● δ, a transition function that maps a state and an input symbol to a new state and an output symbol.
An FST can be seen as automata with transitions labelled with symbols from Σ₁ × Σ₂,
where Σ₁ and Σ₂ are the alphabets of input and output respectively. Thus, an FST is similar to
a nondeterministic finite automaton (NFA), except that:
● transitions are made on pairs of symbols rather than on single symbols.
Figure 3.6 shows a simple transducer that accepts two input strings, hot and cat, and maps
them onto cot and bat respectively. It is a common practice to represent a pair like a:a by a
single letter.
Just as FSAs encode regular languages, FSTs encode regular relations. A regular
relation is the relation between regular languages. The regular language encoded on the
upper side of an FST is called the upper language, and the one on the lower side is termed
the lower language.
If T is a transducer, and s is a string, then we use T(s) to represent the set of strings
encoded by T such that the pair (s, t) is in the relation.
FSTs are closed under union, concatenation, composition, and Kleene closure.
However, in general, they are not closed under intersection and complementation.
With this introduction, we can now implement the two-level morphology using FST. To get
from the surface form of a word to its morphological analysis, we proceed in two steps, as
illustrated in Figure 3.7:
First, we split the word up into its possible components. For example, we make bird + s out of birds, where '+' indicates a morpheme boundary. In this step, we also take spelling rules into account, so that there are two possible ways of splitting up boxes:
● box + es, and
● boxe + s.
The first one assumes that box is the stem and es the suffix, while the second assumes that boxe is the stem and the s has been introduced due to a spelling rule.
The output of this step is a concatenation of morphemes, i.e., stems and affixes. There
can be more than one representation for a given word.
A transducer that does the mapping (translation) required by this step for the surface form
lesser might look like Figure 3.8.
This FST represents the information that the comparative form of the adjective less is lesser, where ε here denotes the empty string.
In the second step, we use a lexicon to look up categories of the stems and meanings
of the affixes. So:
This tells us that splitting boxes into boxe+s is incorrect, and should therefore be
discarded.
This may not always be the case. We have words like spouses or parses, where splitting the
word into spouse+s or parse+s is correct.
Orthographic rules are used to handle these spelling variations. For instance, one such
rule states:
Each of these steps can be implemented with the help of a transducer. Thus, we need to
build two transducers:
1. One that maps the surface form to the intermediate form, and
2. Another that maps the intermediate form to the lexical form.
We now develop an FST-based morphological parser for singular and plural nouns in
English.
The plural form of regular nouns usually ends with s or -es. However, a word ending in s
need not necessarily be the plural form of a word. There are many singular words ending in
s, e.g., miss, ass.
One required translation is the deletion of the 'e' when introducing a morpheme boundary.
This deletion is usually required for words ending in xes, ses, zes (e.g., suffixes and boxes).
Figure 3.10 shows the possible sequences of states that the transducer undergoes, given
the surface forms birds and boxes as input.
The next step is to develop a transducer that does the mapping from the intermediate level
to the lexical level. The input to this transducer has one of the following forms:
In the first case (a regular singular noun stem), the transducer maps all symbols of the stem to themselves and outputs +N +SG (as shown in Figure 3.7). In the second case (a regular noun stem followed by s), it maps all symbols of the stem to themselves, outputs +N, and replaces the s with +PL. In the third case (an irregular singular noun stem), it does the same as in the first case. Finally, in the fourth case, the transducer maps the irregular plural noun stem to the corresponding singular stem (e.g., geese to goose) and then adds +N +PL. The general structure of this transducer looks like Figure 3.11.
The mapping from State 1 to State 2, 3, or 4 is carried out with the help of a transducer
encoding a lexicon. The transducer implementing the lexicon maps the individual regular and
irregular noun stems to their correct noun stem, replacing labels like regular noun form, etc.
This lexicon maps the surface form geese, which is an irregular noun, to its correct stem goose in the following way:
g → g, e → o, e → o, s → s, e → e.
Composing this transducer with the previous one, we get a single two-level transducer with one input tape and one output tape. This maps plural nouns into the stem plus the morphological marker +N +PL, and singular nouns into the stem plus +N +SG. Thus, the surface word form birds will be mapped to bird+N+PL as follows: each letter maps to itself, while ε maps to the morphological feature +N, and s maps to the feature +PL. Figure 3.12 shows the resulting composed transducer.
The power of the transducer lies in the fact that the same transducer can be used for
analysis and generation. That is, we can run it in the downward direction (input: surface form
and output: lexical form) or in the upward direction.
1. Substitution: Replacing one letter with another (e.g., cat → bat).
2. Insertion: Adding an extra letter (e.g., cat → catt).
3. Deletion: Omitting a letter (e.g., cat → ct).
4. Transposition: Swapping two adjacent letters (e.g., cat → cta).
5. Reversal errors: A specific case of transposition where letters are reversed.
● OCR (Optical Character Recognition) and similar devices introduce errors such as:
○ Substitution
○ Space deletion/insertion
● Speech recognition systems process phoneme strings and attempt to match them
to known words. These errors are often phonetic in nature, leading to non-trivial
distortions of words.
1. Non-word errors: The incorrect word does not exist in the language (e.g., freind
instead of friend).
2. Real-word errors: The incorrect word is a valid word, but incorrect in the given
context (e.g., their instead of there).
○ Context-based correction techniques are useful for real-word errors (e.g., correcting there to their based on sentence meaning).
We can associate a weight or cost with each operation. The Levenshtein distance between two sequences is obtained by assigning a unit cost to each operation. In general, several alignments with different total costs are possible between two sequences; the distance corresponds to the cheapest of these alignments.
Dynamic Programming algorithms can be quite useful for finding minimum edit distance
between two sequences. Dynamic programming refers to a class of algorithms that apply a
table-driven approach to solve problems by combining solutions to sub-problems. The
dynamic programming algorithm for minimum edit distance is implemented by creating an
edit distance matrix.
The matrix has one row for each symbol in the source string and one column for each symbol in the target string.
The (i, j)th cell in this matrix represents the distance between the first i characters of the source and the first j characters of the target string.
The value in each cell is computed in terms of three possible paths:
dist[i, j] = min( dist[i−1, j] + 1, dist[i, j−1] + 1, dist[i−1, j−1] + subst_cost )
where the substitution cost is 0 if the ith character in the source matches the jth character in the target, and 1 otherwise.
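A Python sketch of this dynamic programming computation (unit costs, substitution cost 0 on a match):

def min_edit_distance(source, target):
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # i deletions
    for j in range(n + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,             # deletion
                             dist[i][j - 1] + 1,             # insertion
                             dist[i - 1][j - 1] + sub_cost)  # substitution / match
    return dist[m][n]

print(min_edit_distance("intention", "execution"))   # 5 with unit substitution cost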
Minimum edit distance algorithms are also useful for determining accuracy in speech
recognition systems.
Part-of-Speech Tagging
Part-of-Speech tagging is the process of assigning an appropriate grammatical category
(such as noun, verb, adjective, etc.) to each word in a given sentence. It is a fundamental
task in Natural Language Processing (NLP), which plays a crucial role in syntactic parsing,
information extraction, machine translation, and other language processing tasks.
POS tagging helps in resolving syntactic ambiguity and understanding the grammatical
structure of a sentence. Since many words in English and other natural languages can serve
multiple grammatical roles depending on the context, POS tagging is necessary to identify
the correct category for each word.
There are several approaches to POS tagging, which are broadly categorized as: (i)
Rule-based POS tagging, (ii) Stochastic POS tagging, and (iii) Hybrid POS tagging.
Rule-based POS tagging uses a set of hand-written linguistic rules to determine the correct
tag for a word in a given context. The approach starts by assigning each word a set of
possible tags based on a lexicon. Then, contextual rules are applied to eliminate unlikely
tags.
The rule-based taggers make use of rules that consider the tags of neighboring words and
the morphological structure of the word. For example, a rule might state that if a word is
preceded by a determiner and is a noun or verb, it should be tagged as a noun. Another rule
might say that if a word ends in "-ly", it is likely an adverb.
The effectiveness of this approach depends heavily on the quality and comprehensiveness
of the hand-written rules. Although rule-based taggers can be accurate for specific domains,
they are difficult to scale and maintain, especially for languages with rich morphology or free
word order.
Stochastic or statistical POS tagging makes use of probabilistic models to determine the
most likely tag for a word based on its occurrence in a tagged corpus. These taggers are
trained on annotated corpora where each word has already been tagged with its correct part
of speech.
In the simplest form, a unigram tagger assigns the most frequent tag to a word, based on the
maximum likelihood estimate computed from the training data: P(t | w) = f(w, t) / f(w),
where f(w,t) is the frequency of word w being tagged as t, and f(w) is the total frequency of
the word w in the corpus. This approach, however, does not take into account the context in
which the word appears.
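A sketch of a unigram tagger in Python, trained on a tiny invented tagged corpus:

from collections import Counter, defaultdict

tagged_corpus = [("the", "DET"), ("book", "NOUN"), ("book", "VERB"),
                 ("book", "NOUN"), ("a", "DET"), ("flight", "NOUN")]

counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1              # f(w, t)

def unigram_tag(word):
    if word not in counts:
        return "NOUN"                   # simple default for unknown words
    return counts[word].most_common(1)[0][0]   # argmax over t of f(w, t)

print(unigram_tag("book"))     # NOUN (tagged NOUN twice, VERB once)
print(unigram_tag("unseen"))   # NOUN (fallback)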
To incorporate context, bigram and trigram models are used. In a bigram model, the tag assigned to a word depends on the tag of the previous word. The probability of a sequence of tags is approximated as:
P(t₁, t₂, ..., tₙ) ≈ ∏ᵢ P(tᵢ | tᵢ₋₁)
The probability of the word sequence given the tag sequence is approximated as:
P(w₁, ..., wₙ | t₁, ..., tₙ) ≈ ∏ᵢ P(wᵢ | tᵢ)
Thus, the best tag sequence is the one that maximizes the product:
argmax over t₁ ... tₙ of ∏ᵢ P(wᵢ | tᵢ) · P(tᵢ | tᵢ₋₁)
This is known as the Hidden Markov Model (HMM) approach to POS tagging. Since the
actual tag sequence is hidden and only the word sequence is observed, the Viterbi algorithm
is used to compute the most likely tag sequence.
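A compact sketch of the Viterbi computation in Python; the tag set, transition probabilities, and emission probabilities below are invented toy values, not estimates from a real corpus:

def viterbi(words, tags, trans, emit, start):
    # trans[(t_prev, t)] = P(t | t_prev), emit[(t, w)] = P(w | t), start[t] = P(t | <s>)
    V = [{t: start.get(t, 0.0) * emit.get((t, words[0]), 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            best_prev, best_p = None, 0.0
            for tp in tags:
                p = V[i - 1][tp] * trans.get((tp, t), 0.0) * emit.get((t, words[i]), 0.0)
                if p >= best_p:
                    best_prev, best_p = tp, p
            V[i][t] = best_p
            back[i][t] = best_prev
    # Trace back from the best final tag.
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["NOUN", "VERB", "DET"]
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.7, ("VERB", "DET"): 0.8, ("NOUN", "NOUN"): 0.1}
emit = {("DET", "the"): 0.5, ("NOUN", "dog"): 0.4, ("VERB", "barks"): 0.3, ("NOUN", "barks"): 0.01}

print(viterbi(["the", "dog", "barks"], tags, trans, emit, start))   # ['DET', 'NOUN', 'VERB']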
Bayesian inference is also used in stochastic tagging. Based on Bayes' theorem, the posterior probability of a tag given a word is:
P(t | w) = P(w | t) · P(t) / P(w)
Since P(w) is constant for all tags, we can choose the tag that maximizes P(w∣t)⋅P(t).
Statistical taggers can be trained automatically from large annotated corpora and tend to
generalize better than rule-based systems, especially in handling noisy or ambiguous data.
Hybrid approaches combine rule-based and statistical methods to take advantage of the
strengths of both. One of the most popular hybrid methods is Transformation-Based
Learning (TBL), introduced by Eric Brill, commonly referred to as Brill’s Tagger.
In this approach, an initial tagging is done using a simple method, such as assigning the
most frequent tag to each word. Then, a series of transformation rules are applied to
improve the tagging. These rules are automatically learned from the training data by
comparing the initial tagging to the correct tagging and identifying patterns where the tag
should be changed.
Each transformation is of the form: "Change tag A to tag B when condition C is met". For
example, a rule might say: "Change the tag from VB to NN when the word is preceded by a
determiner".
The transformation rules are applied iteratively to correct errors in the tagging, and each rule
is chosen based on how many errors it corrects in the training data. This approach is robust,
interpretable, and works well across different domains.
A Context-Free Grammar (CFG) is a type of formal grammar that is used to define the
syntactic structure of a language. It is particularly useful in natural language processing
for representing the hierarchical structure of sentences, such as phrases, clauses, and
their relationships.
A CFG consists of a set of production rules that describe how non-terminal symbols can
be replaced by combinations of non-terminal and terminal symbols. The grammar is termed
“context-free” because the application of rules does not depend on the context of the
non-terminal being replaced.
G = (N, Σ, P, S)
where:
● N: a finite set of non-terminal symbols,
● Σ: a finite set of terminal symbols, disjoint from N,
● P: a finite set of production rules of the form A → α, where A ∈ N and α ∈ (N ∪ Σ)*,
● S ∈ N: the designated start symbol.
Each rule in P defines how a non-terminal can be rewritten. The rewriting continues until only
terminal symbols are left, forming a string in the language.
Example: “Henna reads a book”
The rules of the grammar and the corresponding parse tree for this sentence are as follows.
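For instance, one small set of rules that generates this sentence is:
● S → NP VP
● NP → N | Det N
● VP → V NP
● Det → a
● N → Henna | book
● V → reads
giving the bracketed parse [S [NP [N Henna]] [VP [V reads] [NP [Det a] [N book]]]].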
Constituency
Constituency refers to the hierarchical organization of words into units or constituents,
where each constituent behaves as a single unit for syntactic purposes. A constituent can be
a word or a group of words that function together as a unit within a sentence. These
constituents are the building blocks of phrases and sentences.
Each constituent has a grammatical category (e.g., noun, verb, adjective) and can be
recursively expanded using grammar rules. Constituents combine to form phrase level
and sentence level constructions, forming the constituent structure or phrase structure
of the language.
Phrase-level constructions group words into syntactic units called phrases. These phrases
have internal structure and grammatical behavior. The key phrase types discussed are:
A Noun Phrase is centered around a noun and may include determiners, adjectives, and
prepositional phrases.
Example: "The tall boy", "A bouquet of flowers".
Structure:
NP → (Det) (AdjP) N (PP)
A Verb Phrase consists of a main verb and may include auxiliaries, objects,
complements, or adverbials.
Example: "is eating an apple", "has been sleeping".
Structure:
VP → (Aux) V (NP) (PP) (AdvP)
A Prepositional Phrase consists of a preposition followed by a noun phrase.
Example: "on the table", "with a telescope".
Structure:
PP → P NP
An Adjective Phrase is centered on an adjective, optionally preceded by a degree word (e.g., "very happy").
Structure:
AdjP → (Deg) Adj
An Adverb Phrase is likewise centered on an adverb.
Structure:
AdvP → (Deg) Adv
Each of these phrase types can function as a constituent in a larger phrase or sentence.
Grammatical Rules
S→NP VP
This rule indicates that a sentence (S) consists of a noun phrase followed by a verb phrase.
Coordination
Coordination involves linking two constituents of the same category using conjunctions such
as and, or, but.
Example: "The boy and the girl", "sang and danced".
NP→NP and NP
VP→VP or VP
Agreement
Agreement refers to the requirement that related constituents share grammatical features, for example subject-verb agreement in number and person ("She sings" vs. "They sing"). CFG alone does not handle agreement well, as it lacks a mechanism to enforce such constraints across distant parts of the sentence. This leads to the need for richer representations like feature structures.
Feature Structures
To handle complex agreement and other syntactic dependencies, feature structures are
used. A feature structure is a set of attribute-value pairs that represent syntactic or
semantic information.
Example: the noun phrase "the boy" might be described by the feature structure [CAT = NP, NUM = singular, PERSON = 3].
Unification-based grammars use these feature structures to ensure that elements such as
subjects and verbs agree in number, person, and gender.
Parsing
Parsing is the process of analyzing a string of symbols (typically a sentence) according to
the rules of a formal grammar. In Natural Language Processing (NLP), parsing determines
the syntactic structure of a sentence by identifying its grammatical constituents (like noun
phrases, verb phrases, etc.). It checks whether the sentence follows the grammatical rules
defined by a grammar, often a Context-Free Grammar (CFG). The result of parsing is
typically a parse tree or syntax tree, which shows how a sentence is hierarchically
structured. Parsing helps in disambiguating sentences with multiple meanings. It is essential
for understanding, translation, and information extraction. There are two main types:
syntactic parsing, which focuses on structure, and semantic parsing, which focuses on
meaning. Parsing algorithms include top-down, bottom-up, and chart parsing. Efficient
parsing is crucial for developing grammar-aware NLP applications.
Top-down Parsing
As the name suggests, top-down parsing starts its search from the root node S and works
downwards towards the leaves. The underlying assumption here is that the input can be
derived from the designated start symbol, S, of the grammar. The next step is to find all
sub-trees which can start with S. To generate the sub-trees of the second-level search, we
expand the root node using all the grammar rules with S on their left hand side. Likewise,
each non-terminal symbol in the resulting sub-trees is expanded next using the grammar
rules having a matching non-terminal symbol on their left hand side. The right hand side of
the grammar rules provide the nodes to be generated, which are then expanded recursively.
As the expansion continues, the tree grows downward and eventually reaches a state where
the bottom of the tree consists only of part-of-speech categories. At this point, all trees
whose leaves do not match words in the input sentence are rejected, leaving only trees that
represent successful parses. A successful parse corresponds to a tree which matches
exactly with the words in the input sentence.
Sample grammar
● S → NP VP
● S → VP
● NP → Det Nominal
● NP → NP PP
● Nominal → Noun
● Nominal → Nominal Noun
● VP → Verb
● VP → Verb NP
● VP → Verb NP PP
● PP → Preposition NP
● Det → this | that | a | the
● Noun → book | flight | meal | money
● Verb → book | include | prefer
● Pronoun → I | he | she | me | you
● Preposition → from | to | on | near | through
A top-down search begins with the start symbol of the grammar. Thus, the first level (ply)
search tree consists of a single node labelled S. The grammar in Table 4.2 has two rules
with S on their left hand side. These rules are used to expand the tree, which gives us two
partial trees at the second level search, as shown in Figure 4.4. The third level is generated
by expanding the non-terminal at the bottom of the search tree in the previous ply. Due to
space constraints, only the expansion corresponding to the left-most non-terminals has been
shown in the figure. The subsequent steps in the parse are left, as an exercise, to the
readers. The correct parse tree shown in Figure 4.4 is obtained by expanding the fifth parse
tree of the third level.
Bottom-up Parsing
A bottom-up parser starts with the words in the input sentence and attempts to construct a
parse tree in an upward direction towards the root. At each step, the parser looks for rules in
the grammar where the right hand side matches some of the portions in the parse tree
constructed so far, and reduces it using the left hand side of the production. The parse is
considered successful if the parser reduces the tree to the start symbol of the grammar.
Figure 4.5 shows some steps carried out by the bottom-up parser for sentence Paint the
door.
Each of these parsing strategies has its advantages and disadvantages. As the top-down
search starts generating trees with the start symbol of the grammar, it never wastes time
exploring a tree leading to a different root. However, it wastes considerable time exploring S
trees that eventually result in words that are inconsistent with the input. This is because a
top-down parser generates trees before seeing the input. On the other hand, a bottom-up
parser never explores a tree that does not match the input. However, it wastes time
generating trees that have no chance of leading to an S-rooted tree. The left branch of the
search space in Figure 4.5 that explores a sub-tree assuming paint as a noun, is an example
of wasted effort. We now present a basic search strategy that uses the top-down method to
generate trees and augments it with bottom-up constraints to filter bad parses.
The CYK algorithm requires the grammar to be in Chomsky Normal Form (CNF), in which every production has one of the forms:
A → BC
A → w, where w is a word.
The algorithm first builds parse trees of length one by considering all rules which could
produce words in the sentence being parsed. Then, it finds the most probable parse for all
the constituents of length two. The parse of shorter constituents constructed in earlier
iterations can now be used in constructing the parse of longer constituents.
A non-terminal A derives the substring of length j starting at position i, written A ⇒* w(i, j), if and only if, for some 1 ≤ k < j:
1. A → B C is a rule in the grammar,
2. B ⇒* w(i, k), and
3. C ⇒* w(i+k, j−k).
For a substring w(i, j) of length j starting at i, the algorithm therefore considers all possible ways of breaking it into two parts, w(i, k) and w(i+k, j−k). Finally, to accept the sentence, we verify that S ⇒* w(1, n), i.e., that the start symbol of the grammar derives the whole string.
CYK ALGORITHM
Let w = w₁ w₂ ... wₙ be the input string, and let chart[i, j] denote the set of
non-terminals that derive the substring of length j starting at position i.

// Initialization step
for i := 1 to n do
    chart[i, 1] := { A | A → wᵢ is a production }

// Recursive step
for j := 2 to n do
    for i := 1 to n−j+1 do
    begin
        chart[i, j] := ∅
        for k := 1 to j−1 do
            chart[i, j] := chart[i, j] ∪ { A | A → BC is a production,
                                           B ∈ chart[i, k] and C ∈ chart[i+k, j−k] }
    end

The input string is accepted if S ∈ chart[1, n].
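A Python sketch of this recognizer; the tiny CNF grammar below is an invented example:

from collections import defaultdict

# Grammar in Chomsky Normal Form: binary rules A -> B C and lexical rules A -> w.
BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
LEXICAL = {"the": {"Det"}, "boy": {"N"}, "apple": {"N"}, "ate": {"V"}}

def cyk_recognize(words, start="S"):
    n = len(words)
    # chart[(i, j)] = non-terminals deriving the span of length j starting at position i (1-based).
    chart = defaultdict(set)
    for i, w in enumerate(words, start=1):
        chart[(i, 1)] = set(LEXICAL.get(w, set()))
    for j in range(2, n + 1):                 # span length
        for i in range(1, n - j + 2):         # span start
            for k in range(1, j):             # split point
                for B in chart[(i, k)]:
                    for C in chart[(i + k, j - k)]:
                        chart[(i, j)] |= BINARY.get((B, C), set())
    return start in chart[(1, n)]

print(cyk_recognize(["the", "boy", "ate", "the", "apple"]))   # True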
Module 3
Introduction
One of the core goals in natural language processing is to build systems that can
understand, categorize, and respond to human language. Text classification, also called
text categorization, is the task of assigning a predefined label or class to a text segment.
Examples include identifying whether a review is positive or negative, whether a document is
about sports or politics, or whether an email is spam or not spam.
Such classification tasks are typically framed as supervised machine learning problems,
where we are given a set of labeled examples (training data) and must build a model that
generalizes to unseen examples. These tasks rely on representing text in a numerical
feature space and applying statistical models to predict classes.
One of the most commonly used classifiers in NLP is the Naive Bayes classifier, a
probabilistic model that applies Bayes’ theorem under strong independence assumptions.
Despite its simplicity, it is robust and surprisingly effective in many domains including spam
filtering, sentiment analysis, and language identification. Naive Bayes is categorized as
a generative model, because it models the joint distribution of inputs and classes to
"generate" data points, in contrast with discriminative models that directly estimate the class
boundary.
Before applying any classifier, text must be converted into a format suitable for machine
learning algorithms. In Naive Bayes, we represent a document as a bag of words (BoW),
which treats the document as an unordered collection of words, discarding grammar and
word order. This assumption simplifies modeling, reducing a complex structured input into a
feature vector.
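A minimal sketch of the bag-of-words representation with scikit-learn; the two example texts are hypothetical.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I loved this movie , the plot was great",      # hypothetical example texts
    "terrible movie , terrible plot",
]

# Each document becomes a vector of word counts; grammar and word order are discarded.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())

Passing binary=True to CountVectorizer records only word presence or absence, which corresponds to the binary features discussed further below.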
● Stop Words: Words like “the”, “is”, “and” that occur in all documents and contribute
little to discrimination can be excluded.
● Unknown Words: Words not present in the training vocabulary can either be ignored
or assigned a very small probability.
Instead of using raw frequencies, we often use binary features indicating word presence.
This reduces bias introduced by repeated terms and increases robustness in sentiment
classification.
Sentiment Lexicons
Lexicons are curated lists of words annotated with their sentiment polarity.
● Opinion Lexicon: Divides words into positive (e.g., “love”, “great”) and negative
(e.g., “bad”, “terrible”).
In spam classification, words like “free”, “win”, or “credit” tend to have high probabilities under the spam class. Naive Bayes can be trained on labeled corpora to distinguish spam from ham based on these word distributions.
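A minimal spam-versus-ham sketch with scikit-learn's MultinomialNB; the handful of labeled messages below are hypothetical stand-ins for a real training corpus.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free credit card now",          # spam
    "claim your free prize today",         # spam
    "meeting rescheduled to monday",       # ham
    "please review the attached report",   # ham
]
labels = [1, 1, 0, 0]                      # 1 = spam, 0 = ham

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free credit offer just for you"]))   # expected: [1]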
Language Identification
For this task, character-level n-grams are more effective than word-level features,
especially for short texts. Naive Bayes computes the likelihood of character trigrams in a
sentence under each language model.
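The same pipeline can be switched to character trigrams for language identification; the tiny English/French sentences below are hypothetical.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["the cat sat on the mat", "the dog barked loudly",
         "le chat est sur le tapis", "le chien aboie fort"]
labels = ["en", "en", "fr", "fr"]

# analyzer="char_wb" extracts character n-grams from inside word boundaries.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)),
    MultinomialNB(),
)
model.fit(texts, labels)

print(model.predict(["le chat dort"]))     # expected: ['fr']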
Feature Selection
Since text features are sparse and high-dimensional, not all features are informative.
Common selection metrics:
● Mutual Information
● Chi-square Test
● Information Gain
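A sketch of chi-square feature selection with scikit-learn's SelectKBest (mutual information can be used the same way via mutual_info_classif); the texts and labels are hypothetical.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["free credit offer", "win a free prize",
         "project meeting notes", "quarterly report attached"]
labels = [1, 1, 0, 0]

X = CountVectorizer().fit_transform(texts)

# Keep only the k features with the highest chi-square scores with respect to the labels.
selector = SelectKBest(score_func=chi2, k=5)
X_selected = selector.fit_transform(X, labels)

print(X.shape, "->", X_selected.shape)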
That is, each class c (e.g., positive or negative) defines its own unigram distribution P(w | c) over the vocabulary, and a document w₁ w₂ ... wₙ is assigned the class that maximizes P(c) · P(w₁ | c) · P(w₂ | c) · ... · P(wₙ | c).
Given unigram probabilities for selected words under each class, a document can then be scored as follows.
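A minimal sketch of this scoring in log space, using hypothetical class priors and per-class unigram probabilities (the numbers are illustrative only):

import math

priors = {"positive": 0.5, "negative": 0.5}             # hypothetical priors
likelihoods = {                                          # hypothetical P(word | class)
    "positive": {"great": 0.09, "boring": 0.01, "plot": 0.05},
    "negative": {"great": 0.01, "boring": 0.08, "plot": 0.05},
}

def classify(words):
    scores = {}
    for c in priors:
        # Summing log probabilities avoids numerical underflow on long documents.
        score = math.log(priors[c])
        for w in words:
            if w in likelihoods[c]:          # unknown words are simply skipped here
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("great plot".split()))        # 'positive' under these numbers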
Module 4
1. Indexing
Indexing is the process of organizing data to enable rapid search and retrieval. In IR, an
inverted index is commonly used. This structure maps each term in the document collection
to a list of documents (or document IDs) where that term occurs. It typically includes
additional information like term frequency, position, and weight (e.g., TF-IDF score).
Efficient indexing allows the system to avoid scanning all documents for every query,
dramatically reducing search time and computational cost. Index construction involves
tokenizing documents, normalizing text, and storing index entries in a sorted and optimized
structure, often with compression techniques to reduce storage requirements.
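A minimal in-memory sketch of inverted index construction over a hypothetical toy collection, storing (document ID, term frequency) postings for each term:

from collections import defaultdict

docs = {
    1: "information retrieval is important",
    2: "an inverted index maps terms to documents",
    3: "retrieval systems use an inverted index",
}   # hypothetical collection

index = defaultdict(dict)                   # term -> {doc_id: term frequency}
for doc_id, text in docs.items():
    for token in text.lower().split():      # tokenization + normalization (lowercasing)
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

# Postings sorted by document ID, ready for merging at query time.
postings = {term: sorted(freqs.items()) for term, freqs in index.items()}
print(postings["inverted"])                 # [(2, 1), (3, 1)]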
2. Stop Word Elimination
Stop words are extremely common words that appear in almost every document, such as
"the", "is", "at", "which", "on", and "and". These words usually add little value to
understanding the main content or differentiating between documents.
Removing stop words reduces the size of the index, speeds up the search process, and
minimizes noise in results. However, careful handling is required because some stop words
may be semantically important depending on the domain (e.g., "to be or not to be" in
literature, or "in" in legal texts). Most IR systems use a predefined stop word list, though it
can be customized based on corpus analysis.
3. Stemming
Stemming improves recall in IR systems by ensuring that different inflected or derived forms
of a word are matched to the same root term in the index. This is particularly important in
languages with rich morphology.
Common stemming algorithms include the Porter stemmer, the Lovins stemmer, and the Snowball (Porter2) stemmer, all of which strip suffixes using rule-based transformations.
Stemming is different from lemmatization, which uses vocabulary and grammar rules to
derive the base form.
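A short illustration with NLTK's Porter stemmer and WordNet lemmatizer (the lemmatizer requires the WordNet data to be downloaded once):

from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download("wordnet")     # one-time download for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["connect", "connected", "connection", "connecting"]:
    print(word, "->", stemmer.stem(word))          # all reduce to "connect"

# Lemmatization consults a vocabulary and the part of speech instead of stripping suffixes.
print(lemmatizer.lemmatize("better", pos="a"))     # "good"
print(lemmatizer.lemmatize("studies", pos="v"))    # "study"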
4. Zipf’s Law
Zipf’s Law is a statistical principle that describes the frequency distribution of words in
natural language corpora. It states that the frequency f of any word is inversely proportional
to its rank r:
f ∝ 1/r
This means that the most frequent word occurs roughly twice as often as the second most
frequent word, three times as often as the third, and so on.
For example, in English corpora, words like "the", "of", "and", and "to" dominate the
frequency list. Meanwhile, the majority of words occur rarely (called the "long tail").
In IR, Zipf’s Law justifies practices such as stop word removal and term weighting schemes (e.g., TF-IDF) that down-weight very frequent terms.
Understanding this law helps in designing efficient indexing and retrieval strategies that
focus on the more informative, lower-frequency words.
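A quick empirical check on any plain-text corpus (the file name below is only a placeholder): under Zipf's Law, the product rank × frequency should stay roughly constant for the top-ranked words.

from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:    # placeholder corpus file
    words = f.read().lower().split()

counts = Counter(words).most_common()

for rank, (word, freq) in enumerate(counts[:10], start=1):
    print(rank, word, freq, "rank*freq =", rank * freq)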
IR models
Information Retrieval (IR) models are frameworks used to retrieve and rank relevant
documents from a large collection in response to a user query. These models form the
foundation of search engines and digital information systems by determining which
documents best match the user's needs. The core idea behind any IR model is to represent
both documents and queries in a specific form and then compute their similarity or relevance
using well-defined rules or algorithms.
IR models are generally classified into three categories: Classic Models, Non-Classic
Models, and Alternative Models. Classic models include the Boolean Model, Vector
Space Model (VSM), and Probabilistic Model. These rely on mathematical and logical
principles—Boolean uses exact match logic, VSM represents documents as vectors and
ranks by cosine similarity, while Probabilistic models estimate the probability of a document
being relevant to a given query (e.g., BM25).
The Boolean Model is the simplest and most fundamental model used in Information
Retrieval (IR). It is based on the principles of Boolean algebra and set theory, where both
documents and user queries are represented as sets of indexed terms. This model classifies
documents in a binary manner, either as relevant or non-relevant, with no provision for
partial matching or ranking.
In the Boolean model, each document in the collection is represented as a binary vector over the set of all index terms. Each component of the vector indicates whether the corresponding term is present (1) or absent (0) in the document. Similarly, user queries are expressed as Boolean expressions that combine index terms with the operators AND, OR, and NOT:
● AND: A document is retrieved if it contains all the terms in the query. It corresponds to the intersection of sets.
● OR: A document is retrieved if it contains at least one of the terms in the query. It corresponds to the union of sets.
● NOT: A document is retrieved only if it does not contain the specified term. It corresponds to the set complement (difference).
3.1.3 Example
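A minimal sketch of Boolean retrieval over a hypothetical four-document collection, implementing the operators as set operations on each term's posting list:

# Hypothetical postings: term -> set of IDs of documents containing the term.
postings = {
    "information": {1, 2, 4},
    "retrieval":   {1, 3, 4},
    "model":       {2, 3},
}

# Query: information AND retrieval  ->  intersection of postings
print(postings["information"] & postings["retrieval"])   # {1, 4}

# Query: information OR model  ->  union of postings
print(postings["information"] | postings["model"])       # {1, 2, 3, 4}

# Query: retrieval AND NOT model  ->  set difference
print(postings["retrieval"] - postings["model"])          # {1, 4}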
Advantages:
● The model is simple, rests on clean set-theoretic semantics, and is easy to implement.
● Queries have precise, unambiguous interpretations, and retrieval can be performed efficiently with set operations on the inverted index.
Limitations:
● Retrieval is strictly binary: a document either matches or it does not, so there is no ranking and no partial matching.
● All terms are treated as equally important, and formulating good Boolean queries is difficult for ordinary users; result sets are often either too large or too small.
Before Boolean retrieval is applied, standard NLP preprocessing techniques are used, such
as tokenization, stop word removal, and stemming, to ensure only meaningful content
words are considered in the indexing and matching process.
The Boolean model, while limited in flexibility and effectiveness for large-scale retrieval
tasks, serves as a foundational approach in IR and provides the basis for more sophisticated
models like the Vector Space and Probabilistic models.
Tokenization
This process breaks raw text into individual units called tokens.
Example:
Text: “Information retrieval is important.”
Tokens: [Information, retrieval, is, important]
Stop Word Removal
Common words such as "is", "the", and "of" are removed, as they add little value to information retrieval.
Stemming
Stemming reduces words to their root form, improving matching between similar words.
Example:
connect, connected, connection → stemmed to "connect"
Applications:
b. Jaccard Coefficient
The Jaccard Coefficient is another commonly used similarity measure, but it is most
appropriate when documents and queries are represented as sets of terms (i.e., binary
representations indicating only presence or absence).
Applications:
Documents:
Step 2: Stemming
shaping → shape
computing → compute
future → future
technology → technology
history → history
AI → AI
Step 3: Vocabulary
Term | TF in D1 | TF in D2 | TF in Q
AI | 1 | 1 | 1
shape | 1 | 0 | 0
future | 1 | 0 | 1
technology | 1 | 0 | 0
history | 0 | 1 | 0
compute | 0 | 1 | 0
Document | Cosine Similarity with Q | Rank
D1 | 0.578 | 1
D2 | 0.000 | 2
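A minimal sketch that reproduces these similarities from the vocabulary table above, assuming TF-IDF weights with idf = log₁₀(N/df) followed by cosine similarity (the weighting scheme is an assumption here):

import math

terms = ["AI", "shape", "future", "technology", "history", "compute"]
tf = {
    "D1": [1, 1, 1, 1, 0, 0],
    "D2": [1, 0, 0, 0, 1, 1],
    "Q":  [1, 0, 1, 0, 0, 0],
}

N = 2                                                   # number of documents
df = [sum(1 for d in ("D1", "D2") if tf[d][i] > 0) for i in range(len(terms))]
idf = [math.log10(N / df_i) for df_i in df]

def tfidf(vec):
    return [t * w for t, w in zip(vec, idf)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = tfidf(tf["Q"])
for doc in ("D1", "D2"):
    print(doc, round(cosine(tfidf(tf[doc]), q), 3))
# D1 0.577 (matching the 0.578 above up to rounding), D2 0.0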
Probabilistic Model
Assumptions
1. For a given query, each document either is relevant or not relevant.
2. The goal is to maximize the probability of retrieving relevant documents over
non-relevant ones.
3. The ranking is based on computing P(R | D, Q) — the probability that document D is relevant (R) given the query Q.
Working Steps
1. Initial Retrieval: Use term matching to obtain an initial set of documents.
2. Probability Estimation: Estimate, for each query term, the probability of its occurrence in relevant and in non-relevant documents, initially by assumption and later refined using relevance feedback.
3. Re-ranking: Rank the documents by their estimated probability of relevance and repeat the estimation step until the ranking stabilizes.
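As a concrete instance of probabilistic ranking, here is a minimal sketch of the Okapi BM25 scoring function mentioned earlier; k1 and b are the usual tuning constants, and the example document statistics are hypothetical.

import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, doc_freqs, N, avgdl, k1=1.5, b=0.75):
    # doc_freqs: term -> number of documents containing the term
    # N: collection size, avgdl: average document length in tokens
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freqs.get(term, 0)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        length_norm = 1 - b + b * len(doc_tokens) / avgdl
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
    return score

doc = "probabilistic retrieval ranks documents by estimated relevance".split()
print(bm25_score(["probabilistic", "relevance"], doc,
                 doc_freqs={"probabilistic": 3, "relevance": 10},   # hypothetical statistics
                 N=100, avgdl=8.0))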
NON-CLASSICAL MODELS OF IR
Non-classical IR models diverge from traditional IR models that rely on similarity, probability,
or Boolean operations. Instead, these models are grounded in logic-based inference and
semantic understanding of language and information.
Information Logic Model
This model utilizes a special logic technique known as logical imaging for retrieval. The
process involves inferring the relevance of a document to a query, similar to how logic
deduces facts from premises. Unlike traditional models, this logic model assumes that if a
document does not contain a query term, it may still be relevant if semantic inference
supports it.
Key concept:
A term t (e.g., “dish”) is said to support a situation s if the presence of t implies that s is true or likely true. If so, we write t ⊨ s and say that t supports s.
Situation Theory Model
This model extends the logic model by treating retrieval as a flow of information from the
document to the query. Each document is modeled as a situation, and a structural calculus
is used to infer whether this situation supports the query.
In essence, this model emphasizes logical inferences and structural relationships, not just term occurrence.
Interaction Model
This model was developed by Ingwersen (1992, 1996) and draws from cognitive science.
It interprets retrieval as an interaction process between the user and the information
system, not a one-way search. Relevance is based on user context, and semantic
transformations (like synonym, hyponym, or conceptual linkages) are used to relate
documents to queries.
Artificial Neural Networks can also be used under this model to simulate how the human
brain processes relevance in interconnected networks. This model is user-centric and
emphasizes feedback loops, subjective relevance, and interaction history.
Introduction
Long Short-Term Memory Networks (LSTM) are a type of Recurrent Neural Network (RNN)
designed to retain long-term dependencies in sequential data. Unlike standard RNNs, which
suffer from the vanishing gradient problem and struggle to remember long-term information,
LSTMs were developed by Hochreiter and Schmidhuber to overcome this issue. LSTM
networks are commonly implemented using Python libraries such as Keras and TensorFlow.
Real-World Analogy
Consider watching a movie or reading a book—your brain retains context from earlier
scenes or chapters to understand the current situation. LSTM networks emulate this
behavior by maintaining memory across time steps in a sequence.
LSTM Architecture
LSTM networks consist of memory cells and three primary gates:
● Forget gate: decides which information to discard from the cell state.
● Input gate: decides which new information to add to the cell state.
● Output gate: decides which part of the cell state is exposed as the hidden state.
Each gate controls the flow of information into and out of the cell state, enabling LSTMs to
manage memory efficiently.
Key States:
● Cell state: the long-term memory carried across time steps.
● Hidden state: the short-term output produced at each time step.
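A minimal single-time-step sketch of these gates in NumPy, with randomly initialized placeholder weights rather than a trained model:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    z = np.concatenate([h_prev, x])           # previous hidden state + current input
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate: what to drop from the cell state
    i = sigmoid(W["i"] @ z + b["i"])          # input gate: what new information to store
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell contents
    c = f * c_prev + i * c_tilde              # new cell state (long-term memory)
    o = sigmoid(W["o"] @ z + b["o"])          # output gate: what to expose
    h = o * np.tanh(c)                        # new hidden state (short-term output)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3                            # arbitrary input and hidden sizes
W = {k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}

h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
print(h, c)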
Example:
Given two consecutive sentences, the first about Bob and the second introducing a new subject, Dan, the LSTM should forget Bob and shift context to Dan using the forget gate. It retains relevant features using the input gate and generates predictions using the output gate.
Visualizing Memory
Example Input:
"Bob knows swimming. He told me over the phone that he had served the navy for four long
years."
● Input Gate learns that “served in the navy” is more important than “told me over the
phone.”
Bidirectional LSTM
Bidirectional LSTM processes data in both forward and backward directions: one LSTM layer reads the sequence from start to end, a second layer reads it from end to start, and their outputs are combined at each time step.
Use Cases:
● Sentiment Analysis
● Machine Translation
Advantage:
Bidirectional LSTM captures full context from both sides of the sequence, enhancing
prediction accuracy.
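A minimal Keras sketch of a bidirectional LSTM classifier for a task such as sentiment analysis; the vocabulary size and layer widths are arbitrary placeholder choices.

import tensorflow as tf

vocab_size = 10_000    # placeholder vocabulary size

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),   # forward + backward pass over the sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),            # e.g. positive vs. negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])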
Challenges in Information Retrieval
1. Relevance of Results
● Core Problem: Determining the relevance of a document to a user's query.
● Challenge: Users often express their information need imprecisely, while relevance
is subjective and context-dependent.
● Example: A query like “jaguar” may refer to an animal, a car, or a sports team.
2. Vocabulary Mismatch
● Definition: Users and documents may use different words to describe the same
concept.
● Solution Approaches: query expansion, synonym and thesaurus lookup, stemming, and relevance feedback.
3. Handling Ambiguity
● Word Sense Disambiguation is essential in resolving ambiguous terms.
● Challenge: IR systems must guess the correct meaning based on limited query
context.
4. Ranking of Documents
● Issues:
○ Must balance precision (top results relevant) and recall (all relevant results
retrieved).
5. Scalability
● Challenge: Large-scale data processing for millions of documents and queries.
● Requirements: distributed indexing and storage, index compression, caching of frequent queries, and parallel query processing.