NLP_Module1-4

Natural Language Processing (NLP) is a subfield of AI and linguistics focused on enabling computers to understand and generate human language. It encompasses various tasks, including machine translation and sentiment analysis, and employs both rule-based and data-driven approaches. Challenges in NLP include ambiguity, idiomatic expressions, and the need for effective grammatical frameworks to process the complexities of natural languages.


Module 1
Chapter1: Introduction

What is Natural Language Processing (NLP)?


Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) and
Linguistics concerned with the interaction between computers and human (natural)
languages. It involves designing algorithms and systems that allow computers to process,
understand, interpret, and generate human language in a meaningful way.

What is Language?

Language is a structured system of communication that uses symbols (like words and
grammar) to convey meaning. In the context of NLP, we usually refer to natural languages,
such as English, Hindi, or Tamil—languages that evolved naturally among humans, as
opposed to artificial or programming languages.

Two Main Reasons for the Development of NLP:

1.​ To Develop Automated Tools for Language Processing:


○​ Machines should be able to read, understand, and generate human language
for tasks like machine translation, speech recognition, sentiment analysis,
chatbots, etc.
○​ Examples include Google Translate, Siri, Alexa, and text summarizers.​

2.​ To Gain a Better Understanding of Human Communication:


○​ NLP helps linguists and AI researchers understand how language works,
including syntax, semantics, pragmatics, and discourse.
○ It supports psychological and cognitive research into how humans comprehend and produce language.

Two Major Approaches in NLP:

1.​ Rational Approach (Rule-Based / Symbolic):


○​ Based on manually crafted linguistic rules and logic.
○​ Focuses on modeling grammar, syntax, and semantics using expert
knowledge.
○​ Example: Using Context-Free Grammars (CFGs) to parse sentences.​

2.​ Empirical Approach (Statistical / Data-Driven):


○​ Based on learning from large amounts of real-world language data (corpora).
○​ Uses machine learning, probability, and statistics to build models.
○​ Example: Training a sentiment analysis model on labeled movie reviews.
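As a minimal sketch of this data-driven style, the snippet below trains a tiny sentiment classifier with scikit-learn; the four reviews and their labels are invented purely for illustration, not a real corpus.

# Bag-of-words features + Naive Bayes: a classic data-driven (empirical) setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = ["a wonderful, moving film",
           "dull and far too long",
           "brilliant acting and a great story",
           "a boring waste of time"]
labels = ["pos", "neg", "pos", "neg"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(reviews, labels)                 # learn word/label statistics from the data
print(model.predict(["a great film"]))     # likely ['pos'] on this toy data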

Origins of Natural Language Processing (NLP)


Natural Language Processing (NLP) emerged as a discipline at the intersection of computer
science, linguistics, and artificial intelligence in the 1950s. It was initially motivated by the
goal of automating translation between languages, especially during the Cold War era. Over
time, its focus expanded from basic translation to a wide range of tasks such as parsing,
speech recognition, information retrieval, and dialogue systems.

A common confusion exists between Natural Language Processing (NLP) and Natural
Language Understanding (NLU). While NLP is the broader field that encompasses all
tasks involving natural language — including processing, generation, and understanding —
NLU specifically refers to systems that interpret or "understand" the meaning of human
language input. In other words, NLP includes both surface-level tasks (like tokenization,
part-of-speech tagging) and deep-level tasks (like sentiment analysis, intent detection),
whereas NLU focuses on the latter. NLP is sometimes referred to as Natural Language
Understanding because many of its core goals revolve around making machines
understand and interpret human language like humans do. However, this usage is more
informal and reflects the aspiration of the field rather than a strict equivalence.

Another important context for NLP is its relationship with Computational Linguistics, which
is often viewed as a bridge between theoretical linguistics and psycholinguistics.
Theoretical linguistics is concerned with understanding the abstract rules and structures of
language — including syntax, semantics, and phonology — without necessarily applying
them in practical systems. Psycholinguistics, on the other hand, deals with how language
Natural language processing notes​ ​ ​ ​ ​ ​ ​ ​ 3

is represented and processed in the human brain, focusing on language acquisition,


comprehension, and production in real time. Computational linguistics combines insights
from both these fields and aims to create computer models that simulate or analyze human
language. It adopts theories from linguistics and tests them through computational models,
while also being informed by how humans process language, as studied in psycholinguistics.

In terms of how computers process language, two broad types of computational models
have evolved: knowledge-driven and data-driven models. Knowledge-driven models rely
on hand-crafted rules and symbolic representations of grammar and meaning. These were
dominant in the early decades of NLP and often require linguistic expertise. However, they
struggle with ambiguity and scale. In contrast, data-driven models use statistical or
machine learning techniques to learn language patterns from large text corpora. These
models have become more dominant in recent years, especially with the rise of deep
learning, as they can automatically infer patterns from data without relying heavily on
human-written rules.

One of the earliest and most impactful applications of NLP was in Information Retrieval
(IR), the task of finding relevant documents or information from large collections. Search
engines like Google use sophisticated NLP techniques to understand user queries, correct
spelling, rank documents by relevance, and extract meaningful snippets. NLP helps in query
expansion, synonym detection, and relevance feedback, making IR systems more accurate
and responsive to natural human language rather than just keyword matching.

Language and Knowledge in Natural Language Processing

In the context of NLP, language refers to a system of communication that uses structured
symbols—spoken, written, or signed—to convey meaning. Natural languages like English,
Hindi, or Tamil are inherently ambiguous, complex, and context-dependent, which makes computational processing a challenging task. Knowledge, in NLP, is the information that supports the interpretation of language—this includes grammatical rules, semantic
relationships, world knowledge, and context awareness. The integration of language and
knowledge is crucial for enabling machines to understand and generate human-like
responses.

The first step in processing language is usually lexical analysis, which involves breaking
down the input text into tokens—individual units like words, punctuation, and symbols. It also
involves assigning categories like part-of-speech tags (e.g., noun, verb, adjective) to each
token. This stage helps structure the text so it can be processed more deeply in later stages.
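A minimal sketch of this stage using only the Python standard library; the token pattern and the crude suffix-based tags below are illustrative assumptions, not a real POS tagger.

# Split raw text into tokens with a regular expression, then attach rough labels.
import re

def tokenize(text):
    # words (with an optional internal apostrophe) or single punctuation marks
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[.,!?;]", text)

def crude_tag(token):
    if token in {".", ",", "!", "?", ";"}:
        return "PUNCT"
    if token.endswith("ing") or token.endswith("ed"):
        return "VERB?"     # a guess based only on the suffix
    return "WORD"

text = "The boy ate the apple, and the dog was barking."
print([(t, crude_tag(t)) for t in tokenize(text)])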

Next is word-level processing, where each word's meaning and properties are analyzed.
This includes looking up dictionary definitions, understanding synonyms or antonyms, and
checking word usage. A fundamental concept here is the morpheme, the smallest unit of
meaning in a language. For instance, in the word “unhappiness”, “un-”, “happy”, and “-ness”
are three morphemes. Identifying morphemes helps in understanding the structure and
meaning of words beyond their surface forms.

After word-level processing, the focus shifts to syntactic analysis (or parsing), which
involves determining the grammatical structure of a sentence. This includes analyzing how
words are grouped into phrases and how those phrases relate to each other in a hierarchy.
For example, in the sentence "The boy ate the apple," syntactic analysis identifies "The boy"
as the subject noun phrase and "ate the apple" as the verb phrase.

Once syntax is understood, semantic analysis aims to derive the meaning of a sentence.
This involves mapping syntactic structures to logical representations and identifying the roles
of words (e.g., who is doing what to whom). Semantic analysis tries to resolve word sense
disambiguation (e.g., the word “bank” could mean a financial institution or a riverbank) and
capture the intended meaning of phrases and sentences.

Moving beyond individual sentences, discourse analysis deals with the structure and
meaning of connected text or dialogue. It considers how one sentence relates to the next
and how information flows across sentences. For example, resolving anaphora (i.e.,
identifying what a pronoun refers to) is a key task in discourse analysis — in "John dropped
the glass. It broke," the word "it" refers to "the glass."

Finally, pragmatic analysis focuses on how context influences interpretation. This includes
speaker intention, tone, politeness, and real-world knowledge. For example, if someone
says, “Can you pass the salt?”, a pragmatic analysis understands it not as a question about
ability but as a polite request. Pragmatics allows machines to go beyond literal meanings
and engage in more natural communication.

Together, these layers—lexical, syntactic, semantic, discourse, and pragmatic—form a pipeline through which language is processed using both linguistic rules and background
knowledge. Mastery of these components is essential for building effective NLP applications.

Challenges in Natural Language Processing



Natural Language Processing (NLP) deals with the inherently complex and ambiguous
nature of human language. One of the key challenges is representation and
interpretation, which refers to how machines can represent the structure and meaning of
language in a formal way that computers can manipulate. Unlike numbers or code, natural
language involves abstract concepts, emotions, and context, making it difficult to represent
using fixed logical forms or algorithms. Interpretation becomes even harder when the same
sentence can carry different meanings depending on the speaker’s intent, cultural
background, or tone.

Another major challenge is identifying semantics, especially in the presence of idioms and
metaphors. Idioms such as "kick the bucket" or "spill the beans" have meanings that cannot
be derived from the literal meaning of the words. Similarly, metaphors like "time is a thief"
require deep contextual and cultural understanding, which machines struggle to grasp.
These figurative expressions pose a serious problem for semantic analysis since they don't
follow regular linguistic patterns.

Quantifier scoping is another subtle issue, dealing with how quantifiers (like “all,” “some,”
“none”) affect the meaning of sentences. For example, the sentence “Every student read a
book” can mean either that all students read the same book or that each student read a
different one. Disambiguating such sentences requires complex logical reasoning and
context awareness.

Ambiguity is one of the most persistent challenges in NLP. At the word level, there are two
main types: part-of-speech ambiguity and semantic ambiguity. In part-of-speech
ambiguity, a word like “book” can be a noun (“a book”) or a verb (“to book a ticket”), and the
correct tag must be determined based on context. This ties into the task of Part-of-Speech
(POS) tagging, where the system must assign correct grammatical labels to each word in a
sentence, often using probabilistic models like Hidden Markov Models or neural networks.

In terms of semantic ambiguity, many words have multiple meanings—a problem known as
polysemy. For instance, the word “bat” can refer to a flying mammal or a piece of sports
equipment. Resolving this is the goal of Word Sense Disambiguation (WSD), which
attempts to determine the most appropriate meaning of a word in a given context. WSD is
particularly difficult in resource-poor languages or when the context is vague.

Another type of complexity arises from structural ambiguity, where a sentence can be
parsed in more than one grammatical way. For example, in “I saw the man with a telescope,”
it is unclear whether the telescope was used by the speaker or the man. Structural ambiguity
can lead to multiple interpretations and is a major hurdle in syntactic and semantic parsing.

Language and Grammar


Automatic processing of natural language requires that the rules and exceptions of a
language be explicitly described so that a computer can understand and manipulate them.
Grammar plays a central role in this specification, as it provides a set of formal rules that
allow both parsing and generation of sentences. Grammar, thus, defines the structure of
language at the linguistic level, rather than at the level of world knowledge. However, due to
the influence of world knowledge on both the selection of words (lexical items) and the
conventions of structuring them, the boundary between syntax and semantics often
becomes blurred. Nonetheless, maintaining a separation between the two is considered
beneficial for ease of grammar writing and language processing.

One of the main challenges in defining the structure of natural language is its dynamic
nature and the presence of numerous exceptions that are difficult to capture formally. Over
time, several grammatical frameworks have been proposed to address these challenges.
Prominent among them are transformational grammar (Chomsky, 1957), lexical functional
grammar (Kaplan and Bresnan, 1982), government and binding theory (Chomsky, 1981),
generalized phrase structure grammar, dependency grammar, Paninian grammar, and
tree-adjoining grammar (Joshi, 1985). While some of these grammars focus on the
derivational aspects of sentence formation (e.g., phrase structure grammar), others
emphasize relational properties (e.g., dependency grammar, lexical functional grammar,
Paninian grammar, and link grammar).

The most significant contribution in this area has been made by Noam Chomsky, who
proposed a formal hierarchy of grammars based on their expressive power. These
grammars employ phrase structure or rewrite rules to generate well-formed sentences in a
language. The general framework proposed by Chomsky is referred to as generative
grammar, which consists of a finite set of rules capable of generating all and only the
grammatical sentences of a language. Chomsky also introduced transformational grammar,
asserting that natural languages cannot be adequately represented using phrase structure
rules alone. In his work Syntactic Structures (1957), he proposed that each sentence has
two levels of representation: the deep structure, which captures the sentence's core
meaning, and the surface structure, which represents the actual form of the sentence. The
transformation from deep to surface structure is accomplished through transformational
rules.

Transformational grammar comprises three main components: phrase structure grammar, transformational rules, and morphophonemic rules. Phrase structure grammar provides the
base structure of a sentence using rules such as:​
S → NP + VP,​
NP → Det + Noun,​
VP → V + NP,​
V → Aux + Verb, etc.​
Here, S represents a sentence, NP a noun phrase, VP a verb phrase, Det a determiner, and
so on. The lexicon includes determiners like “the,” “a,” and “an,” verbs like “catch,” “eat,”
“write,” nouns such as “police,” “snatcher,” and auxiliaries like “is,” “will,” and “can.”
Sentences generated using these rules are said to be grammatical, and the structure they
are assigned is known as constituent or phrase structure.

For example, for the sentence “The police will catch the snatcher,” the phrase structure rules
generate the following parse tree:
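In bracketed notation, the rules above assign the sentence the following structure:

(S (NP (Det The) (Noun police))
   (VP (V (Aux will) (Verb catch))
       (NP (Det the) (Noun snatcher))))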

This tree represents the syntactic structure of the sentence as derived from phrase structure
rules.

Transformational rules are applied to the output of the phrase structure grammar and are
used to modify sentence structures. These rules may have multiple symbols on the left-hand
side and enable transformations such as changing an active sentence into a passive one.
For example, Chomsky provided a rule for converting active to passive constructions:​
NP₁ + Aux + V + NP₂ → NP₂ + Aux + be + en + V + by + NP₁.​
This rule inserts the strings “be” and “en” and rearranges sentence constituents to reflect a
passive construction. Transformational rules can be either obligatory, ensuring grammatical
agreement (such as subject-verb agreement), or optional, allowing for structural variations
while preserving meaning.

The third component, morphophonemic rules, connects the sentence representation to a string of phonemes. For instance, in the transformation of the sentence “The police will catch
the snatcher,” the passive transformation results in “The snatcher will be caught by the
police.” A morphophonemic rule then modifies “catch + en” to its correct past participle form
“caught.”

However, phrase structure rules often struggle to account for more complex linguistic
phenomena such as embedded noun phrases containing adjectives, modifiers, or relative
clauses. These phenomena give rise to what are known as long-distance dependencies,
where related elements like a verb and its object may be separated by arbitrary amounts of
intervening text. Such dependencies are not easily handled at the surface structure level. A
specific case of long-distance dependency is wh-movement, where interrogative words like
“what” or “who” are moved to the front of a sentence, creating non-local syntactic
relationships. These limitations highlight the need for more advanced grammatical
frameworks like tree-adjoining grammars (TAGs), which can effectively model such
syntactic phenomena due to their capacity to represent recursion and long-distance
dependencies more naturally than standard phrase structure rules.

Processing of Indian Languages


There are a number of differences between Indian languages and English. This introduces
differences in their processing. Some of these differences are listed here.

Indian languages follow a subject-object-verb (SOV) structure and also allow a non-linear (flexible) ordering of constituents.

Indian languages exhibit several unique linguistic properties that distinguish them from many
Western languages, especially English. One of the most prominent characteristics is their
free word order. This means that the words in a sentence can often be rearranged without
altering the core meaning of the sentence. This syntactic flexibility presents significant
challenges in tasks like parsing and machine translation.

Spelling standardization in Indian languages, particularly Hindi, is also more nuanced compared to English. This is largely due to the phonetic nature of Indian scripts and the
presence of multiple acceptable spellings for the same word.

Moreover, Indian languages display a rich morphological structure, with a large number of
morphological variants for nouns, verbs, and adjectives. These variants are used to convey
different grammatical categories such as tense, gender, number, and case.

Another notable feature of Indian languages is the extensive and productive use of
complex predicates (CPs). A complex predicate consists of a main verb and one or more
auxiliary verbs that together express a single verbal idea. For example, expressions like "जा
रहा है " (ja raha hai – "is going") and "खेल रही है " (khel rahi hai – "is playing") involve verb
complexes, where auxiliary verbs provide information related to tense, aspect, and modality.

Indian languages predominantly use postpositions, also known as karakas, instead of the
prepositions used in English. These postpositions appear after the noun or pronoun they
relate to and are used to indicate grammatical roles such as subject, object, instrument,
location, etc.

Languages like Hindi and Urdu are closely related in terms of phonology, morphology,
and syntax. Despite being written in different scripts and having different lexical
influences—Hindi borrowing largely from Sanskrit and Urdu from Persian and Arabic—both
languages are free-word-order languages, use postpositions, and share a significant portion
of vocabulary and grammatical structure.

For computational modeling of Indian languages, Paninian grammar provides a powerful and linguistically motivated framework. This grammar model is especially suited to
free-word-order languages and focuses on the extraction of karaka relations
(syntactic-semantic roles) from a sentence. These relations play a key role in understanding
sentence structure and meaning, making Paninian grammar an effective tool for natural
language processing tasks in Indian languages.

Applications of NLP
Natural Language Processing (NLP) has a wide range of applications that aim to bridge the
gap between human language and computational systems. One of the major applications of
NLP is Machine Translation (MT), which involves automatically converting text or speech
from one language to another. MT systems analyze the source language for syntax and
semantics and generate equivalent content in the target language. Examples include Google
Translate and Microsoft Translator. The challenge in MT lies in handling grammar, idioms,
context, and word order, especially for Indian languages, which have a free word order.

Speech Recognition is another significant application where spoken language is converted into text. This is used in systems like voice assistants (e.g., Google Assistant, Siri) and
dictation tools. It involves acoustic modeling, language modeling, and phonetic transcription.
Speech recognition must account for accents, background noise, and spontaneous speech.

Speech Synthesis, also known as Text-to-Speech (TTS), is the reverse process, where
written text is converted into spoken output. TTS systems are used in applications for
visually impaired users, public announcement systems, and interactive voice response (IVR)
systems. These systems require natural-sounding voice output, correct intonation, and
pronunciation.

Natural Language Interfaces to Databases (NLIDB) allow users to interact with databases
using natural language queries instead of structured query languages like SQL. For
example, a user can ask “What is the balance in my savings account?” and the system
translates it into a database query. This application requires robust parsing, semantic
interpretation, and domain understanding.

Information Retrieval (IR) deals with finding relevant documents or data in response to a
user query. Search engines like Google, Bing, and academic databases are practical
implementations of IR. NLP techniques help in query expansion, stemming, and ranking
results by relevance.

Information Extraction (IE) refers to the automatic identification of structured information such as names, dates, locations, and relationships from unstructured text. IE is useful in
fields like journalism, business intelligence, and biomedical research. Named Entity
Recognition (NER) and Relation Extraction are key components of IE.

Question Answering (QA) systems provide direct answers to user questions instead of
listing documents. For example, a QA system can answer “Who is the President of India?”
by retrieving the exact answer from a knowledge base or corpus. These systems require
deep linguistic analysis, context understanding, and often integrate IR and IE.

Text Summarization involves automatically generating a condensed version of a given text while preserving its key information. Summarization can be extractive (selecting key
sentences) or abstractive (generating new sentences). It is useful in generating news
digests, executive summaries, and academic reviews. Summarization systems must
preserve coherence, grammaticality, and meaning.

Chapter2: Language Modeling


Statistical Language Model
A statistical language model is a probability distribution P(S) over all possible word sequences. A number of statistical language models have been proposed in the literature. The dominant approach in statistical language modelling is the n-gram model.

N-gram model
The goal of statistical language models is to estimate the probability (likelihood) of a
sentence. This is achieved by decomposing sentence probability into a product of conditional
probabilities using the chain rule as follows:
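P(S) = P(w_1, w_2, ..., w_n) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2) · ... · P(w_n | w_1, ..., w_{n-1})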

In order to calculate the sentence probability, we need to calculate the probability of a word,
given the sequence of words preceding it. An n-gram model simplifies this task by
approximating the probability of a word given all the previous words by the conditional
probability given the previous n-1 words only.
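In symbols: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-1}).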

Thus, an n-gram model calculates P(w_i | h_i) by modeling language as a Markov model of order (n-1), i.e., it looks at only the previous (n-1) words.

●​ A model that considers only the previous one word is called a bigram model (n =
2).​

●​ A model that considers the previous two words is called a trigram model (n = 3).​

Using bigram and trigram models, the probability of a sentence w_1, w_2, ..., w_n can be estimated as:
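For a bigram model (n = 2): P(w_1, ..., w_n) ≈ ∏ P(w_i | w_{i-1}); for a trigram model (n = 3): P(w_1, ..., w_n) ≈ ∏ P(w_i | w_{i-2}, w_{i-1}), with the product taken over i = 1, ..., n.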

A special word (pseudo-word) <s> is introduced to mark the beginning of the sentence in
bi-gram estimation. The probability of the first word in a sentence is conditioned on <s>.
Similarly, in tri-gram estimation, we introduce two pseudo-words <s1> and <s2>.
Estimation of probabilities is done by training the n-gram model on the training corpus. We
estimate n-gram parameters using the maximum likelihood estimation (MLE) technique, i.e.,
using relative frequencies. We count a particular n-gram in the training corpus and divide it
by the sum of all n-grams that share the same prefix.
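In symbols: P(w_i | w_{i-n+1}, ..., w_{i-1}) = C(w_{i-n+1} ... w_i) / Σ_w C(w_{i-n+1} ... w_{i-1} w), where C(·) denotes the count of a word sequence in the training corpus.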

The sum of the counts of all n-grams that share the same first n-1 words is simply the count of that common prefix, C(w_{i-n+1} ... w_{i-1}). So, we can rewrite the previous expression as:
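P(w_i | w_{i-n+1}, ..., w_{i-1}) = C(w_{i-n+1} ... w_i) / C(w_{i-n+1} ... w_{i-1})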

The model parameters obtained using these estimates maximize the probability of the
training set T given the model M, i.e., P(T|M). However, the frequency with which a word
occurs in a text may differ from its frequency in the training set. Therefore, the model only
provides the most likely solution based on the training data.

Several improvements have been proposed for the standard n-gram model. Before
discussing these enhancements, let us illustrate the underlying ideas with the help of an
example.
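As a small illustration, the Python sketch below estimates bigram probabilities by relative frequency from a made-up two-sentence corpus; both the corpus and the resulting numbers are illustrative only.

# Count bigrams and their prefixes, then estimate P(w_i | w_{i-1}) by relative frequency.
from collections import Counter

corpus = ["the boy ate the apple", "the boy read the book"]

unigram_counts = Counter()   # counts of bigram prefixes
bigram_counts = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigram_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    # MLE: C(prev, word) / C(prev); zero if the bigram was never seen
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "boy"))    # 2/4 = 0.5
print(bigram_prob("the", "apple"))  # 1/4 = 0.25
print(bigram_prob("boy", "ran"))    # unseen bigram -> 0.0 (data sparseness)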

The n-gram model suffers from data sparseness. n-grams not seen in the training data are
assigned zero probability, even in large corpora. This is due to the assumption that a word’s
probability depends only on the preceding word(s), which is often not true. Natural language
contains long-distance dependencies that this model cannot capture.

To handle data sparseness, various smoothing techniques have been developed, such as
add-one smoothing. As Jurafsky and Martin (2000) state:

"Smoothing refers to re-evaluating zero or low-probability n-grams and assigning
them non-zero values."

The term smoothing reflects how these techniques adjust probabilities toward more uniform
distributions.

Add-one Smoothing
Add-One Smoothing is a simple technique used to handle the data sparseness problem in
n-gram language models by avoiding zero probabilities for unseen n-grams.

In an n-gram model, if a particular n-gram (like a word or word pair) does not occur in the
training data, it is assigned a probability of zero, which can negatively affect the overall
probability of a sentence. Add-One Smoothing helps by assigning a small non-zero
probability to these unseen events.
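For a bigram model with vocabulary size V, the add-one (Laplace) estimate is P(w_i | w_{i-1}) = (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + V). For example, if C(the cat) = 0, C(the) = 100, and V = 1,000, the smoothed estimate is 1/1,100 instead of 0.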

Paninian Framework
Paninian Grammar is a highly influential grammatical framework, based on the ancient
Sanskrit grammarian Panini. It provides a rule-based structure for analyzing sentence
formation using deep linguistic features such as vibhakti (case suffixes) and karaka
(semantic roles). Unlike many Western grammars which focus on syntax, Paninian
grammar emphasizes the relationship between semantics and syntax, making it well-suited
for Indian languages with free word order.

Architecture of Paninian Grammar

Paninian grammar processes a sentence through four levels of representation:

1. Surface Level (Shabda or Morphological Level)

● This is the raw form of the sentence as it appears in speech or writing.

●​ It contains inflected words with suffixes (like tense, case markers, gender, number,
etc.).​

●​ The sentence is in linear word form but doesn’t reveal deeper structure.

Example (Hindi):​
राम ने फल खाया (Rām ne phal khāyā)​
At surface level:​
राम + ने | फल | खाया​
Here, "ने" is a vibhakti marker.

2. Vibhakti Level (Case-Level or Morphosyntactic Level)

●​ Focuses on vibhaktis (postpositions or case suffixes).​

●​ These vibhaktis provide syntactic cues about the role of each noun in the sentence.​

●​ Different vibhaktis (e.g., ने, को, से, का) indicate nominative, accusative,
instrumental, genitive etc.

Example:

"ने" → Ergative marker (agent in perfective aspect)

"को" → Dative/Accusative

"से" → Instrumental

3. Karaka Level (Semantic Role Level)

●​ Most critical part of Paninian framework.​

●​ Karaka relations assign semantic roles to nouns, like agent, object, instrument,
source, etc.​

Karaka roles are assigned based on the verb and semantic dependency, not fixed
word order.

Example:​
राम ने फल खाया

राम (Rām) → Karta (Agent)

फल (phal) → Karma (Object)

खाया (khāyā) → Verb (Action)

4. Semantics Level (Meaning Representation)



●​ Captures the final meaning of the sentence by combining all the karaka roles.​

●​ Helps in machine translation, question answering, and information extraction.​

●​ This level handles ambiguities, idioms, metaphors, and coreference.​

Example meaning:​
"Rām ate the fruit."​
→ {Agent: Ram, Action: eat, Object: fruit}

Karaka Theory

Karaka Theory is a fundamental concept in Paninian grammar that explains the semantic
roles of nouns in relation to the verb in a sentence. It helps identify who is doing what to
whom, with what, for whom, where, and from where, etc.

●​ The term "Karaka" means "causal relation".


●​ It refers to the role a noun (or noun phrase) plays in relation to the verb.
●​ The karaka relation is verb-dependent and semantic (not just syntactic).

Unlike English grammar which focuses on Subject/Object, karaka theory goes deeper into
semantic functions.

The Six Main Karakas:

1. Karta: Doer/Agent of the action; typical marker (vibhakti): Ergative/Instrumental (ने/Ø); example: राम ने खाया (Ram ate)

2. Karma: Object, or what the action is done on; typical marker: Accusative (को/Ø); example: फल खाया (Ate fruit)

3. Karana: Instrument of the action; typical marker: Instrumental (से); example: चाकू से काटा (Cut with a knife)

4. Sampradana: Recipient/Beneficiary; typical marker: Dative (को); example: मोहन को दिया (Gave to Mohan)

5. Apadana: Source/Point of separation; typical marker: Ablative (से); example: दिल्ली से आया (Came from Delhi)

6. Adhikarana: Location/Context; typical marker: Locative (में, पर); example: कुर्सी पर बैठा (Sat on the chair)

Features:

●​ One verb can have multiple karaka relations.​

●​ Karakas are decided based on verb meaning, not word position.​

●​ They are language-independent roles; similar roles exist in many world languages.​

Example Sentence (Hindi):

राम ने चाकू से फल मोहन को प्लेट में दिया।​


(Ram gave the fruit to Mohan with a knife in a plate.)

Word: Karaka (Role)

राम ने: Karta (Doer)

फल: Karma (Object)

मोहन को: Sampradana (Recipient)

चाकू से: Karana (Instrument)

प्लेट में: Adhikarana (Location/Context)



The two problems challenging linguists are:​


(i) Computational implementation of PG, and​
(ii) Adaptation of PG to Indian and other similar languages.

An approach to implementing PG has been discussed in Bharati et al. (1995). This is a multilayered implementation. The approach is named 'Utsarga-Apavāda' (default-exception),
where rules are arranged in multiple layers in such a way that each layer consists of rules
which are exceptions to the rules in the higher layer. Thus, as we go down the layers, more
specific information is derived. Rules may be represented in the form of charts (such as the
Karaka chart and Lakṣaṇa chart).

However, many issues remain unresolved, especially in cases of shared Karaka relations.
Another difficulty arises when the mapping between the Vibhakti (case markers and
postpositions) and the semantic relation (with respect to the verb) is not one-to-one. Two
different Vibhaktis can represent the same relation, or the same Vibhakti can represent
different relations in different contexts. The strategy to disambiguate the various senses of
words or word groupings remains a challenging issue.

As the system of rules differs across languages, the framework requires adaptation to
handle various applications in different languages. Only some general features of the PG
framework have been described here.

Module 2

Chapter1: Word Level Analysis


Regular Expressions
Regular expressions, or regexes for short, are a pattern-matching standard for string
parsing and replacement. They are a powerful way to find and replace strings that follow a
defined format. For example, regular expressions can be used to parse dates, URLs, email
addresses, log files, configuration files, command line switches, or programming
scripts. They are useful tools for the design of language compilers and have been used in
NLP for tokenization, describing lexicons, morphological analysis, etc.

We have all used simplified forms of regular expressions, such as the file search patterns
used by MS-DOS, e.g., dir*.txt.

The use of regular expressions in computer science was made popular by a Unix-based
editor, 'ed'. Perl was the first language that provided integrated support for regular
expressions. It used a slash around each regular expression; we will follow the same
notation in this book. However, slashes are not part of the regular expression itself.

Regular expressions were originally studied as part of the theory of computation. They
were first introduced by Kleene (1956). A regular expression is an algebraic formula whose
value is a pattern consisting of a set of strings, called the language of the expression. The
simplest kind of regular expression contains a single symbol.

For example, the expression /a/ denotes the set containing the string 'a'. A regular
expression may specify a sequence of characters also. For example, the expression
/supernova/ denotes the set that contains the string "supernova" and nothing else.

In a search application, the first instance of each match to a regular expression is shown in the table below.

Table: Some simple regular expressions

Regular expression    Example patterns matched

/book/        The world is a book, and those who do not travel read only one page

/book/        Reporters, who do not read the stylebook, should not criticize their editors

/faced/       Not everything that is faced can be changed. But nothing can be changed until it is faced.

/Science/     Reason, Observation, and Experience—the Holy Trinity of Science.

Character Classes

Characters are grouped by putting them between square brackets [ ]. Any character in the
class will match one character in the input. For example, the pattern /[abcd]/ will match a,
b, c, or d. This is called disjunction of characters.

●​ /[0123456789]/ specifies any single digit.​

●​ /[a-z]/ specifies any lowercase letter (you can use - for range).​

●​ /[5-9]/ matches any of the characters 5, 6, 7, 8, or 9.​

●​ /[mnop]/ matches any one of the letters m, n, o, or p.​

Regular expressions can also specify what a character cannot be, using a caret (^) at the
beginning of the brackets.

●​ /[^x]/ matches any character except x.​

●​ This interpretation is true only when the caret is the first character inside brackets.​

Table: Use of square brackets

RE        Match description                                     Example patterns matched

[abc]     Match any of a, b, or c                               Refresher course will start soon

[A-Z]     Match any character between A and Z                   TREC Conference

[^A-Z]    Match any character other than an uppercase letter    TREC Conference

[ate]     Match a, t, or e                                      The day has three different slots

[2]       Match 2                                               3+2=5

[or]      Match o or r                                          Match a vowel

Case Sensitivity:

●​ Regex is case-sensitive.​

●​ /s/ matches lowercase s but not uppercase S.​

●​ To solve this, use: [sS].​

○​ /[sS]ana/ matches both "sana" and "Sana".

To match both supernova and supernovas:

●​ Use /supernovas?/ — the ? makes "s" optional (0 or 1 time).​

Repetition using Kleene Star (*) and Plus (+):

●​ /b*/ → matches '', 'b', 'bb', 'bbb' — 0 or more occurrences​

●​ /b+/ → matches 'b', 'bb', 'bbb' — 1 or more occurrences​

●​ /[ab]*/ → matches combinations like 'aa', 'bb', 'abab'​

● /a+/ → matches one or more a's

●​ /[0-9]+/ → matches one or more digits​

Anchors:

●​ ^ → matches at beginning of line​

●​ $ → matches at end of line​

●​ Example: /^The nature.$/ matches the entire line “The nature.”​

Wildcard Character (.):

●​ Matches any single character​

● /.at/ → matches cat, mat, rat, etc.

●​ /.....berry/ → matches 10-letter words ending in berry like:​

○​ strawberry​

○​ blackberry​

○​ sugarberry​

○​ But not: blueberry (9 letters)​

Special Characters (Table 3.3)

RE Description

. Matches any single character

\n Newline character

\t Tab character

\d Digit (0-9)

\D Non-digit

\w Alphanumeric character (a-z, A-Z, 0-9, _)

\W Non-alphanumeric character

\s Whitespace

\S Non-whitespace

\ Escape special characters (e.g., \. matches a literal dot)

Disjunction using Pipe |:

●​ /blackberry|blackberries/ matches either one​

●​ Wrong: /blackberry|ies/ → matches either blackberry or ies, not blackberries​

●​ Correct way: /black(berry|berries)/​

Real-world use:

●​ Searching for Tanveer or Siddiqui → /Tanveer|Siddiqui/​

A regular expression can also be written to match email addresses. Such an expression works for most cases; however, the specification is not based on any standard and may not be accurate enough to match all correct email addresses. It may accept non-working email addresses and reject working ones. Fine-tuning is required for accurate characterization.
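The constructs above can be tried out with Python's re module; in the sketch below, the sample strings and the deliberately rough email pattern are illustrative assumptions only.

# Small demonstrations of character classes, optionality, repetition,
# disjunction, and anchors using Python's re module.
import re

print(re.findall(r"[sS]ana", "sana and Sana"))              # ['sana', 'Sana']
print(re.findall(r"supernovas?", "supernova supernovas"))   # optional 's'
print(re.findall(r"[0-9]+", "room 42, floor 7"))            # ['42', '7']
print(bool(re.fullmatch(r"black(?:berry|berries)", "blackberries")))  # True
print(bool(re.match(r"^The nature\.$", "The nature.")))     # anchors -> True

# A rough email pattern, for illustration only: it is not based on any standard,
# so it will accept some invalid addresses and reject some valid ones.
email = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
print(bool(email.fullmatch("first.last@example.co.in")))    # True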

A regular expression characterizes a particular kind of formal language, called a regular


language. The language of regular expressions is similar to formulas of Boolean logic. Like
logic formulas, regular expressions represent sets. A regular language is a set of strings
described by the regular expression. Regular languages may be encoded as finite state
networks.

A regular expression may contain symbol pairs. For example, the regular expression /a:b/
represents a pair of strings. The regular expression /a:b/ actually denotes a regular
relation. A regular relation may be viewed as a mapping between two regular languages.

The a:b relation is simply the cross product of the languages denoted by the expressions
/a/ and /b/. To differentiate the two languages that are involved in a regular relation, we call
the first one the upper, and the second one the lower language of the relation. Similarly, in
the pair /a:b/, the first symbol, a, can be called the upper symbol, and the second symbol,
b, the lower symbol.

The two components of a symbol pair are separated in our notation by a colon (:), without
any whitespace before or after. To make the notation less cumbersome, we ignore the
distinction between the language A and the identity relation that maps every string of A to
itself. Therefore, we also write /a:a/ simply as /a/.

Regular expressions have clean, declarative semantics (Voutilainen 1996). Mathematically, they are equivalent to finite automata, both having the same expressive power. This
makes it possible to encode regular languages using finite-state automata, leading to
easier manipulation of context-free and other complex languages.

Similarly, regular relations can be represented using finite-state transducers. With this
representation, it is possible to derive new regular languages and relations by applying
regular operators, instead of re-writing the grammars.

Finite State Automata


A finite automaton has the following properties:
1.​ A finite set of states, one of which is designated the initial or start state, and one or
more of which are designated as final states.
2.​ A finite alphabetic set, Σ, consisting of input symbols.
3.​ A finite set of transitions that specify for each state and each symbol of the input
alphabet, the state to which it next goes.
A finite automaton can be deterministic or non-deterministic. In a non-deterministic automaton, more than one transition out of a state is possible for the same input symbol, whereas in a deterministic automaton only one transition is possible for a given state and input symbol.

Finite-state automata have been used in a wide variety of areas, including linguistics,
electrical engineering, computer science, mathematics, and logic. These are an important
tool in computational linguistics and have been used as a mathematical device to implement
regular expressions. Any regular expression can be represented by a finite automaton and
the language of any finite automaton can be described by a regular expression. Both have
the same expressive power.

The following formal definitions of the two types of finite state automaton, namely,
deterministic and non-deterministic finite automaton, are taken from Hopcroft and Ullman
(1979).

A deterministic finite-state automaton (DFA) is defined as a 5-tuple (Q, ∑, δ, S, F), where:

●​ Q is a set of states,​

●​ ∑ is an alphabet,​

●​ S is the start state,​

●​ F ⊆ Q is a set of final states, and​

●​ δ is a transition function.​

The transition function δ defines a mapping from Q × ∑ → Q.​


That is, for each state q and symbol a, there is at most one transition possible as shown in
Figure 3.1.

Unlike DFA, the transition function of a non-deterministic finite-state automaton (NFA) maps Q × (∑ ∪ {ε}) → P(Q), where P(Q) is the power set of Q.
That is, for each state, there can be more than one transition on a given symbol, each
leading to a different state.​
This is shown in Figure 3.2, where there are two possible transitions from state q₀ on input
symbol a.

A path is a sequence of transitions beginning with the start state. A path leading to one of the
final states is a successful path. The FSAs encode regular languages. The language that an
FSA encodes is the set of strings that can be formed by concatenating the symbols along
each successful path. Clearly, for automata with cycles, these sets are not finite.

We now examine what happens to various input strings that are presented to finite state
automata. Consider the deterministic automaton described in Example 3.2 and the input, ac.
We start with state q, and go to state q₁. The next input symbol is c, so we go to state q₃. No
more input left and we have not reached the final state, i.e., we have an unsuccessful end.
Hence, the string ac is not recognized by the automaton.

This example illustrates how a Finite State Automaton (FSA) can be used to accept or
recognize a string. The set of all strings that lead the automaton to a final state is called the
language accepted or defined by the FSA. This means that “er” is not a word in the
language defined by the automaton shown in Figure 3.1.

Now, consider the input string “acb.” We start from the initial state q and move to state q1​.
The next input symbol is “c”, so we transition to state q₃. The following symbol is “b”, which leads us to a final state. Since there is no more input left and we have reached a final state, this
is considered a successful termination. Hence, the string “acb” is a valid word in the
language defined by the automaton.

The language defined by this automaton can also be described by an equivalent regular expression. The example considered here is quite simple. In practice, however, the list of
transition rules can be extensive. Listing all the transition rules may be inconvenient, so
automata are often represented using a state-transition table.

In such a table:

●​ The rows represent states,​

●​ The columns correspond to input symbols,​

●​ The entries indicate the resulting state after applying a particular input in a given
state,​

●​ A missing entry indicates the absence of a transition.​

This table contains all the information required by the FSA to function. The state-transition
table for the automaton considered in Example 3.2 is shown in Table 3.5.
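A state-transition table like this can be simulated directly. The sketch below uses a small hypothetical DFA (its transition table is an assumption, not the automaton of Example 3.2) that accepts "acb" and rejects "ac", mirroring the behaviour described above.

# DFA simulation driven by a transition table stored as a dictionary.
transitions = {
    ("q0", "a"): "q1",
    ("q1", "c"): "q3",
    ("q3", "b"): "q4",
}
start_state = "q0"
final_states = {"q4"}

def accepts(string):
    state = start_state
    for symbol in string:
        if (state, symbol) not in transitions:  # missing entry: no transition
            return False
        state = transitions[(state, symbol)]
    return state in final_states                # success only if we end in a final state

print(accepts("acb"))  # True  (q0 -> q1 -> q3 -> q4, a successful path)
print(accepts("ac"))   # False (ends in q3, which is not a final state)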

Two automata that define the same language are said to be equivalent. An NFA can be
converted to an equivalent DFA and vice versa. The equivalent DFA for the NFA shown in
Figure 3.3 is shown in Figure 3.4.

Morphological parsing
Morphology is a subdiscipline of linguistics. It studies word structure and the formation of
words from smaller units (morphemes). The goal of morphological parsing is to discover the
morphemes that build a given word. Morphemes are the smallest meaning bearing units in a
language. For example, the word bread consists of a single morpheme and eggs consist of
two morphemes: egg+s.
There are two broad classes of morphemes: stems and affixes. The stem is the main
morpheme - the morpheme that contains the central meaning. Affixes modify the meaning
given by the stem. Affixes are divided into prefixes, suffixes, infixes and circumfixes.
Prefixes are morphemes which appear before the stem, and suffixes are morphemes
applied to the end of the stem. Circumfixes are morphemes that attach to both ends of the stem simultaneously (one part before and one part after), while infixes are morphemes that appear inside a stem.
Example: Here is a list of singular and plural forms of a few Telugu words:

Note: Kannada words may be used here instead as examples.
There are three main ways of word formation:
1.​ Inflection: Here, a root word is combined with a grammatical morpheme to yield a
word of the same class as the original stem.
2.​ Derivation: It combines a word stem with a grammatical morpheme to yield a word
belonging to a different class, e.g., formation of the noun ‘computation’ from the verb
‘compute’.
The formation of a noun from a verb or an adjective is called nominalization.
3.​ Compounding is the process of merging 2 or more words to form a new word. For
example, personal computer, desktop, overlook.

Understanding morphology is important to understand the syntactic and semantic properties of new words in a language.

Morphological analysis and generation are essential to many NLP applications ranging from
spelling corrections to machine translations. In parsing, e.g., it helps to know the agreement
features of words. In IR, morphological analysis helps identify the presence of a query word
in a document in spite of different morphological variants.

Parsing in general, means taking an input and producing some sort of structures for it. In
NLP, this structure might be morphological, syntactic, semantic or pragmatic. Morphological
parsing takes as input the inflected surface form of each word in a text. As output, it
produces the parsed form consisting of a canonical form (or lemma) of the word and a set of
tags showing its syntactical category and morphological characteristics, e.g., possible part of
speech and/or inflectional properties (gender, number, person, tense, etc.). Morphological
generation is the inverse of this process. Both analysis and generation rely on two sources
of information: a dictionary of the valid lemmas of the language and a set of inflection
paradigms.

A morphological parser uses the following information sources:

1. Lexicon​
A lexicon lists stems and affixes together with basic information about them.

2. Morphotactics​
There exists certain ordering among the morphemes that constitute a word. They cannot be
arranged arbitrarily. For example, rest-less-ness is a valid word in English but not
rest-ness-less. Morphotactics deals with the ordering of morphemes. It describes the way
morphemes are arranged or touch each other.

3. Orthographic rules​
These are spelling rules that specify the changes that occur when two given morphemes
combine. For example, the y → ier spelling rule changes easy to easier and not to easyer.

Morphological analysis can be avoided if an exhaustive lexicon is available that lists features
for all the word-forms of all the roots. Given a word, we simply consult the lexicon to get its
feature values. For example, suppose an exhaustive lexicon for Hindi contains the following
entries for the Hindi root word ghodhaa:

Given a word, say ghodhon, we can look up the lexicon to get its feature values.

However, this approach has several limitations. First, it puts a heavy demand on memory.
We have to list every form of the word, which results in a large number of, often redundant,
entries in the lexicon.

Second, an exhaustive lexicon fails to show the relationship between different roots having
similar word-forms. That means the approach fails to capture linguistic generalization, which
is essential to develop a system capable of understanding unknown words.

Third, for morphologically complex languages, like Turkish, the number of possible
word-forms may be theoretically infinite. It is not practical to list all possible word-forms in
these languages.

These limitations explain why morphological parsing is necessary. The complexity of the
morphological analysis varies widely among the world's languages, and is quite high even in
relatively simple cases, such as English.

The simplest morphological systems are stemmers that collapse morphological variations of
a given word (word-forms) to one lemma or stem. They do not require a lexicon. Stemmers
have been especially used in information retrieval. Two widely used stemming algorithms
have been developed by Lovins (1968) and Porter (1980).

Stemmers do not use a lexicon; instead, they make use of rewrite rules of the form:

● ier → y (e.g., earlier → early)

● ing → ε (e.g., playing → play)

Stemming algorithms work in two steps:​


(i) Suffix removal: This step removes predefined endings from words.​
(ii) Recoding: This step adds predefined endings to the output of the first step.

These two steps can be performed sequentially, as in Lovins's stemmer, or simultaneously, as in Porter's stemmer. For example, Porter's stemmer makes use of the following
transformation rule:

●​ ational → ate​
to transform words such as rotational into rotate.​
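A minimal suffix-stripping sketch in this spirit; the tiny rule list is an illustrative assumption and is not the Lovins or Porter algorithm.

# Try a few (suffix, replacement) rewrite rules in order and apply the first that fits.
RULES = [
    ("ational", "ate"),   # rotational -> rotate
    ("ier", "y"),         # earlier    -> early
    ("ing", ""),          # playing    -> play
    ("s", ""),            # books      -> book
]

def stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + replacement
    return word

print([stem(w) for w in ["rotational", "earlier", "playing", "books", "miss"]])
# ['rotate', 'early', 'play', 'book', 'mis']  <- note the error on 'miss',
# the kind of over-stemming weakness discussed below.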

It is difficult to use stemming with morphologically rich languages. Even in English, stemmers are not perfect. Krovetz (1993) pointed out errors of omission and commission
in the Porter algorithm, such as transformation of the word organization into organ and noise
into noisy.

Another problem with Porter’s algorithm is that it reduces only suffixes; prefixes and
compounds are not reduced.

A more efficient two-level morphological model, first proposed by Koskenniemi (1983), can be used for highly inflected languages. In this model, a word is represented as a
correspondence between its lexical level form and its surface level form.

●​ The surface level represents the actual spelling of the word.​



●​ The lexical level represents the concatenation of its constituent morphemes.​

Morphological parsing is viewed as a mapping from the surface level into morpheme and
feature sequences on the lexical level.

For example, the surface form ‘playing’ is represented in the lexical form as play+V+PP, as
shown in Figure 3.5. The lexical form consists of the stem ‘play’ followed by the
morphological information +V+PP, which tells us that ‘playing’ is the present participle form
of the verb.

Surface level: Playing​


Lexical level: Play+V+PP

Figure 3.5: Surface and lexical forms of a word

Similarly, the surface form ‘books’ is represented in the lexical form as book+N+PL, where
the first component is the stem, and the second component ⟨N+PL⟩ is the morphological
information, which tells us that the surface level form is a plural noun.

This model is usually implemented with a kind of finite-state automata, called a finite-state
transducer (FST). A transducer maps a set of symbols to another. A finite-state transducer
does this through a finite-state automaton. An FST can be thought of as a two-tape automaton, which recognizes or generates a pair of strings.

An FST passes over the input string by consuming the input symbols on the tape it traverses
and converts it to the output string in the form of symbols.

Formally, an FST has been defined by Hopcroft and Ullman (1979) as follows:

A finite-state transducer is a 6-tuple (Σ₁, Σ₂, Q, q₀, F, δ), where

●​ Q is a set of states,​

●​ q₀ is the initial state,​

●​ F ⊆ Q is a set of final states,​

●​ Σ₁ is the input alphabet,​

●​ Σ₂ is the output alphabet, and​

● δ is a function mapping Q × (Σ₁ ∪ {ε}) × (Σ₂ ∪ {ε}) to the power set of Q (that is, each transition may lead to a set of possible states).

An FST can be seen as automata with transitions labelled with symbols from Σ₁ × Σ₂,
where Σ₁ and Σ₂ are the alphabets of input and output respectively. Thus, an FST is similar to
a nondeterministic finite automaton (NFA), except that:

●​ transitions are made on pairs of symbols rather than single symbols, and​

●​ they produce outputs along with state transitions.​

Figure 3.6 shows a simple transducer that accepts two input strings, hot and cat, and maps
them onto cot and bat respectively. It is a common practice to represent a pair like a:a by a
single letter.

Figure 3.6: Finite-state transducer

Just as FSAs encode regular languages, FSTs encode regular relations. A regular
relation is the relation between regular languages. The regular language encoded on the
upper side of an FST is called the upper language, and the one on the lower side is termed
the lower language.

If T is a transducer and s is a string, then we use T(s) to represent the set of strings t encoded by T such that the pair (s, t) is in the relation.

FSTs are closed under union, concatenation, composition, and Kleene closure.
However, in general, they are not closed under intersection and complementation.

With this introduction, we can now implement the two-level morphology using FST. To get
from the surface form of a word to its morphological analysis, we proceed in two steps, as
illustrated in Figure 3.7:

First, we split the word up into its possible components. For example, we split birds into bird+s, where '+' indicates a morpheme boundary. In this step, we also consider spelling rules.

Thus, there are two possible ways of splitting up boxes, namely:

●​ box+es, and​

●​ boxe+s.​

The first one assumes that box is the stem and es the suffix, while the second assumes that
boxe is the stem and s has been introduced due to a spelling rule.

The output of this step is a concatenation of morphemes, i.e., stems and affixes. There
can be more than one representation for a given word.

A transducer that does the mapping (translation) required by this step for the surface form
lesser might look like Figure 3.8.

This FST represents the information that the comparative form of the adjective less is
lesser, where ε denotes the empty string.

The automaton is inherently bi-directional:

●​ it can be used for analysis (surface input, "upward" application), or​

●​ for generation (lexical input, "downward" application).​

In the second step, we use a lexicon to look up categories of the stems and meanings
of the affixes. So:

●​ bird+s will be mapped to bird+N+PL,​

●​ boxe+s will be flagged as invalid (as boxe is not a legal stem).​

This tells us that splitting boxes into boxe+s is incorrect, and should therefore be
discarded.

This may not always be the case. We have words like spouses or parses, where splitting the
word into spouse+s or parse+s is correct.

Orthographic rules are used to handle these spelling variations. For instance, one such
rule states:

Add e after s, z, x, ch, sh before the s​


(e.g., dish → dishes, box → boxes)

Each of these steps can be implemented with the help of a transducer. Thus, we need to
build two transducers, as sketched in the example after this list:

1.​ One that maps the surface form to the intermediate form, and​

2.​ Another that maps the intermediate form to the lexical form.​
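
A minimal Python sketch of these two steps, assuming a toy lexicon of noun stems and a single e-insertion spelling rule; the helper names (split_surface, lookup) are illustrative, not taken from the text.

# Step 1: surface form -> candidate intermediate forms (stem + affix).
# Step 2: intermediate form -> lexical form, rejecting illegal stems.
STEMS = {"bird", "box", "cat", "spouse", "parse"}

def split_surface(word):
    """Surface form -> candidate intermediate forms, applying the e-insertion rule."""
    candidates = []
    if word.endswith("es"):
        candidates.append(word[:-2] + "+es")   # box+es  (e introduced by a spelling rule)
        candidates.append(word[:-1] + "+s")    # boxe+s / spouse+s
    elif word.endswith("s"):
        candidates.append(word[:-1] + "+s")    # bird+s
    else:
        candidates.append(word)                # singular form, no affix
    return candidates

def lookup(intermediate):
    """Intermediate form -> lexical form, using the lexicon of legal stems."""
    if "+" in intermediate:
        stem, _affix = intermediate.split("+", 1)
        return f"{stem}+N+PL" if stem in STEMS else None   # reject bad stems like 'boxe'
    return f"{intermediate}+N+SG" if intermediate in STEMS else None

for w in ["birds", "boxes", "spouses", "cat"]:
    parses = [p for p in (lookup(c) for c in split_surface(w)) if p]
    print(w, "->", parses)
# birds -> ['bird+N+PL'], boxes -> ['box+N+PL'], spouses -> ['spouse+N+PL'], cat -> ['cat+N+SG']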

We now develop an FST-based morphological parser for singular and plural nouns in
English.

The plural form of regular nouns usually ends with -s or -es. However, a word ending in s
need not necessarily be the plural form of a word. There are many singular words ending in
s, e.g., miss, ass.

One required translation is the deletion of the 'e' when introducing a morpheme boundary.
This deletion is usually required for words ending in xes, ses, zes (e.g., suffixes and boxes).

The transducer in Figure 3.9 performs this transformation.

Figure 3.10 shows the possible sequences of states that the transducer undergoes, given
the surface forms birds and boxes as input.

The next step is to develop a transducer that does the mapping from the intermediate level
to the lexical level. The input to this transducer has one of the following forms:

●​ Regular noun stem, e.g., bird, cat​

●​ Regular noun stem + s, e.g., bird+s​

●​ Singular irregular noun stem, e.g., goose​

●​ Plural irregular noun stem, e.g., geese

In the first case, the transducer maps all symbols of the stem to themselves and outputs
N+SG (as shown in Figure 3.7). In the second case, it maps all symbols of the stem to
themselves, but then outputs N and maps the plural morpheme s to PL. In the third case, it does
the same as in the first case. Finally, in the fourth case, the transducer maps the irregular
plural noun stem to the corresponding singular stem (e.g., geese to goose) and then adds
N and PL. The general structure of this transducer looks like Figure 3.11.


The mapping from State 1 to State 2, 3, or 4 is carried out with the help of a transducer
encoding a lexicon. The transducer implementing the lexicon maps the individual regular and
irregular noun stems to their correct noun stem, replacing labels like regular noun form, etc.
This lexicon maps the surface form geese, which is an irregular plural noun, to its correct stem
goose in the following way:

g → g, e → o, e → o, s → s, e → e.

The mapping for the regular surface form bird is b → b, i → i, r → r, d → d. Representing
identical pairs such as a:a by a single letter, these two representations reduce to goose and
bird respectively.


Composing this transducer with the previous one, we get a single two-level transducer with
one input tape and one output tape. This maps plural nouns into the stem plus the
morphological marker +PL and singular nouns into the stem plus the morpheme +SG. Thus, the
surface word form birds will be mapped to bird+N+PL as follows:

Each letter maps to itself, while ε maps to the morphological feature +N, and s maps to the
morphological feature +PL. Figure 3.12 shows the resulting composed transducer.

The power of the transducer lies in the fact that the same transducer can be used for
analysis and generation. That is, we can run it in the upward direction (input: surface form,
output: lexical form) for analysis, or in the downward direction (input: lexical form, output:
surface form) for generation.

Spelling Error Detection and Correction


In computer-based information systems, especially those involving text entry or
automatic recognition systems (like OCR or speech recognition), errors in typing and
spelling are a major source of variation between input strings.

Common Typing Errors (80% are single-error misspellings):

1.​ Substitution: Replacing one letter with another (e.g., cat → bat).​

2.​ Omission: Leaving out a letter (e.g., blue → bue).​

3.​ Insertion: Adding an extra letter (e.g., car → caar).​

4.​ Transposition: Switching two adjacent letters (e.g., form → from).​

5.​ Reversal errors: A specific case of transposition where letters are reversed.​

Errors from OCR and Speech Recognition:

●​ OCR (Optical Character Recognition) and similar devices introduce errors such as:​

○​ Substitution​

○​ Multiple substitutions (framing errors)​

○​ Space deletion/insertion​

○​ Character omission or duplication​

●​ Speech recognition systems process phoneme strings and attempt to match them
to known words. These errors are often phonetic in nature, leading to non-trivial
distortions of words.

Spelling Errors: Two Categories

1.​ Non-word errors: The incorrect word does not exist in the language (e.g., freind
instead of friend).​

2.​ Real-word errors: The incorrect word is a valid word, but incorrect in the given
context (e.g., their instead of there).

Spelling Correction Process

●​ Error Detection: Identifying words that are likely misspelled.​



●​ Error Correction: Suggesting valid alternatives for the detected errors.​

Two Approaches to Spelling Correction:

1.​ Isolated Error Detection and Correction:​

○​ Focuses on individual words without considering context.​

○​ Example: Spell-checker highlighting recieve and suggesting receive.​

2.​ Context-Dependent Error Detection and Correction:​

○​ Uses surrounding words to detect and correct errors.​

○​ Useful for real-word errors (e.g., correcting there to their based on sentence
meaning).

Spelling Correction Algorithms

1.​ Minimum Edit Distance:​

○​ Measures the least number of edit operations (insertions, deletions,


substitutions) to convert one word into another.
2.​ Similarity Key Techniques:​

○​ Generates a phonetic or structural key for a word and matches against


similarly keyed dictionary entries.
○​ Example: Soundex
3.​ N-gram Based Techniques:​

○​ Break words into sequences of characters (n-grams) and compare overlaps


to identify similar words.​

4.​ Neural Nets:​

○​ Use machine learning models (e.g., RNNs, Transformers) trained on large


corpora to detect and correct errors.​

○​ Can learn both spelling patterns and contextual usage.​

5.​ Rule-Based Techniques:​

○​ Apply handcrafted or learned rules about common misspellings, phonetic


confusions, and grammar patterns.​

○​ Often used in conjunction with dictionaries.​



Minimum Edit Distance


The minimum edit distance is the number of insertions, deletions, and substitutions required
to change one string into another. For example, the minimum edit distance between tutor and
tumour is 2: we substitute 'm' for 't' and insert 'u' before 'r'. Edit distance can be represented
as a binary function, ed, which maps two strings to their edit distance. Ed is symmetric: for
any two strings s and t, ed(s,t) is always equal to ed(t,s).
Edit distance can be viewed as a string alignment problem. By aligning two strings, we can
measure the degree to which they match. There may be more than one possible alignment
between two strings.
The alignment shown here, between tutor and tumour, has a distance of 2.

We can associate a weight or cost with each operation. The Levenshtein distance between
two sequences is obtained by assigning a unit cost to each operation. Another possible alignment
for this pair is:

This alignment has a cost of 3, so we already have a better alignment than this one.
Dynamic Programming algorithms can be quite useful for finding minimum edit distance
between two sequences. Dynamic programming refers to a class of algorithms that apply a
table-driven approach to solve problems by combining solutions to sub-problems. The
dynamic programming algorithm for minimum edit distance is implemented by creating an
edit distance matrix.
The matrix has one row for each symbol in the source string and one column for each symbol
in the target string.
The (i, j)th cell in this matrix represents the distance between the first i characters of the
source and the first j characters of the target string.
The value in each cell is computed in terms of three possible paths:

dist[i, j] = min( dist[i−1, j] + 1, dist[i, j−1] + 1, dist[i−1, j−1] + substitution cost )

where the substitution cost is 0 if the ith character in the source matches the jth character in the
target.
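
A minimal dynamic-programming sketch of this computation, assuming unit costs for insertion and deletion and a substitution cost of 2 for mismatches (the costs used in the worked example that follows):

def min_edit_distance(source, target, sub_cost=2):
    n, m = len(source), len(target)
    # dist[i][j] = distance between first i chars of source and first j chars of target
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                       # i deletions
    for j in range(1, m + 1):
        dist[0][j] = j                       # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return dist[n][m]

print(min_edit_distance("intention", "execution"))          # -> 8 (substitution cost 2)
print(min_edit_distance("tutor", "tumour", sub_cost=1))      # -> 2 (unit-cost Levenshtein)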

Minimum edit distance algorithms are also useful for determining accuracy in speech
recognition systems.

Ex: find minimum edit distance between execution and intention



Therefore, the minimum edit distance between the two words is 8 (using a substitution cost of 2).

Part-of-Speech Tagging
Part-of-Speech tagging is the process of assigning an appropriate grammatical category
(such as noun, verb, adjective, etc.) to each word in a given sentence. It is a fundamental
task in Natural Language Processing (NLP), which plays a crucial role in syntactic parsing,
information extraction, machine translation, and other language processing tasks.

POS tagging helps in resolving syntactic ambiguity and understanding the grammatical
structure of a sentence. Since many words in English and other natural languages can serve
multiple grammatical roles depending on the context, POS tagging is necessary to identify
the correct category for each word.

There are several approaches to POS tagging, which are broadly categorized as: (i)
Rule-based POS tagging, (ii) Stochastic POS tagging, and (iii) Hybrid POS tagging.

Rule-Based POS Tagging

Rule-based POS tagging uses a set of hand-written linguistic rules to determine the correct
tag for a word in a given context. The approach starts by assigning each word a set of
possible tags based on a lexicon. Then, contextual rules are applied to eliminate unlikely
tags.

The rule-based taggers make use of rules that consider the tags of neighboring words and
the morphological structure of the word. For example, a rule might state that if a word is
preceded by a determiner and is a noun or verb, it should be tagged as a noun. Another rule
might say that if a word ends in "-ly", it is likely an adverb.

The effectiveness of this approach depends heavily on the quality and comprehensiveness
of the hand-written rules. Although rule-based taggers can be accurate for specific domains,
they are difficult to scale and maintain, especially for languages with rich morphology or free
word order.

Stochastic POS Tagging



Stochastic or statistical POS tagging makes use of probabilistic models to determine the
most likely tag for a word based on its occurrence in a tagged corpus. These taggers are
trained on annotated corpora where each word has already been tagged with its correct part
of speech.

In the simplest form, a unigram tagger assigns the most frequent tag to a word, based on the
maximum likelihood estimate computed from the training data:

P(t | w) = f(w, t) / f(w)

where f(w,t) is the frequency of word w being tagged as t, and f(w) is the total frequency of
the word w in the corpus. This approach, however, does not take into account the context in
which the word appears.

To incorporate context, bigram and trigram models are used. In a bigram model, the tag
assigned to a word depends on the tag of the previous word. The probability of a sequence
of tags is approximated as:

P(t₁ t₂ … tₙ) ≈ ∏ᵢ P(tᵢ | tᵢ₋₁)

The probability of the word sequence given the tag sequence is:

P(w₁ w₂ … wₙ | t₁ t₂ … tₙ) ≈ ∏ᵢ P(wᵢ | tᵢ)

Thus, the best tag sequence is the one that maximizes the product:

t̂₁ … t̂ₙ = argmax ∏ᵢ P(wᵢ | tᵢ) · P(tᵢ | tᵢ₋₁)

This is known as the Hidden Markov Model (HMM) approach to POS tagging. Since the
actual tag sequence is hidden and only the word sequence is observed, the Viterbi algorithm
is used to compute the most likely tag sequence.
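
A minimal sketch of Viterbi decoding for such an HMM tagger, assuming small hand-specified transition and emission tables; the tag set and probabilities are illustrative, not estimated from a real corpus.

def viterbi(words, tags, trans, emit, start):
    # V[i][t]: probability of the best tag sequence ending in tag t at word i
    V = [{t: start.get(t, 0.0) * emit[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({}); back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: V[i - 1][p] * trans[p].get(t, 0.0))
            V[i][t] = (V[i - 1][best_prev] * trans[best_prev].get(t, 0.0)
                       * emit[t].get(words[i], 0.0))
            back[i][t] = best_prev
    # backtrace from the best final tag
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["DET", "NOUN", "VERB"]
start = {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2}
trans = {"DET": {"NOUN": 0.9, "VERB": 0.05, "DET": 0.05},
         "NOUN": {"VERB": 0.6, "NOUN": 0.3, "DET": 0.1},
         "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}}
emit = {"DET": {"the": 0.7, "a": 0.3},
        "NOUN": {"dog": 0.4, "book": 0.3, "flight": 0.3},
        "VERB": {"book": 0.5, "barks": 0.5}}

print(viterbi(["the", "dog", "barks"], tags, trans, emit, start))
# -> ['DET', 'NOUN', 'VERB']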

Bayesian inference is also used in stochastic tagging. Based on Bayes’ theorem, the
posterior probability of a tag given a word is:

P(t | w) = P(w | t) · P(t) / P(w)

Since P(w) is constant for all tags, we can choose the tag that maximizes P(w | t) · P(t).
Statistical taggers can be trained automatically from large annotated corpora and tend to
generalize better than rule-based systems, especially in handling noisy or ambiguous data.

Hybrid POS Tagging

Hybrid approaches combine rule-based and statistical methods to take advantage of the
strengths of both. One of the most popular hybrid methods is Transformation-Based
Learning (TBL), introduced by Eric Brill, commonly referred to as Brill’s Tagger.

In this approach, an initial tagging is done using a simple method, such as assigning the
most frequent tag to each word. Then, a series of transformation rules are applied to
improve the tagging. These rules are automatically learned from the training data by
comparing the initial tagging to the correct tagging and identifying patterns where the tag
should be changed.

Each transformation is of the form: "Change tag A to tag B when condition C is met". For
example, a rule might say: "Change the tag from VB to NN when the word is preceded by a
determiner".

The transformation rules are applied iteratively to correct errors in the tagging, and each rule
is chosen based on how many errors it corrects in the training data. This approach is robust,
interpretable, and works well across different domains.

POS tagging is an essential component in many natural language processing systems.


Rule-based taggers rely on linguistic knowledge encoded in rules, stochastic taggers use
statistical models learned from data, and hybrid taggers attempt to combine the two for
better performance. Among these, stochastic methods are particularly effective when large
annotated corpora are available, while rule-based methods are useful for low-resource
languages or when domain-specific rules are well known. Hybrid methods, especially those
using transformation-based learning, provide a good balance between accuracy and
interpretability.

Chapter2: Syntactic Analysis


Context Free Grammar
Context-Free Grammar (CFG)

A Context-Free Grammar (CFG) is a type of formal grammar that is used to define the
syntactic structure of a language. It is particularly useful in natural language processing
for representing the hierarchical structure of sentences, such as phrases, clauses, and
their relationships.

A CFG consists of a set of production rules that describe how non-terminal symbols can
be replaced by combinations of non-terminal and terminal symbols. The grammar is termed
“context-free” because the application of rules does not depend on the context of the
non-terminal being replaced.

A CFG is formally represented by a 4-tuple:

G = (N, Σ, P, S)

where:

●​ N: a finite set of non-terminal symbols (e.g., S, NP, VP),​

●​ Σ (sigma): a finite set of terminal symbols (words or tokens of the language),​

●​ P: a finite set of production rules of the form A → α, where A ∈ N and α ∈ (N ∪ Σ)*,​

●​ S: the start symbol, S ∈ N.

Each rule in P defines how a non-terminal can be rewritten. The rewriting continues until only
terminal symbols are left, forming a string in the language.
Example: “Henna reads a book”
The rules of the grammar and the corresponding parse tree for this sentence are as follows:
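
A small sketch of this example using NLTK, assuming a hand-written grammar for the sentence (the exact rules in the figure may differ):

# Parse "Henna reads a book" with a toy CFG using NLTK's chart parser.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> PropN | Det N
VP -> V NP
PropN -> 'Henna'
Det -> 'a'
V -> 'reads'
N -> 'book'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Henna reads a book".split()):
    print(tree)
# (S (NP (PropN Henna)) (VP (V reads) (NP (Det a) (N book))))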

Constituency
Constituency refers to the hierarchical organization of words into units or constituents,
where each constituent behaves as a single unit for syntactic purposes. A constituent can be

a word or a group of words that function together as a unit within a sentence. These
constituents are the building blocks of phrases and sentences.

Each constituent has a grammatical category (e.g., noun, verb, adjective) and can be
recursively expanded using grammar rules. Constituents combine to form phrase level
and sentence level constructions, forming the constituent structure or phrase structure
of the language.

1. Phrase Level Constructions

Phrase-level constructions group words into syntactic units called phrases. These phrases
have internal structure and grammatical behavior. The key phrase types discussed are:

Noun Phrase (NP)

A Noun Phrase is centered around a noun and may include determiners, adjectives, and
prepositional phrases.​
Example: "The tall boy", "A bouquet of flowers".

Structure:

NP → (Det) (Adj)* N (PP)*

A noun phrase can act as a subject, object, or complement in a sentence.

Verb Phrase (VP)

A Verb Phrase consists of a main verb and may include auxiliaries, objects,
complements, or adverbials.​
Example: "is eating an apple", "has been sleeping".

Structure:

VP→V (NP) (PP) (AdvP)

Adjective Phrase (AdjP)

An Adjective Phrase centers around an adjective, which may be modified by adverbs or


may serve as a complement.​
Example: "very tall", "interested in music".

Structure:

AdjP→(Adv) Adj (PP)

Adverb Phrase (AdvP)

An Adverb Phrase consists of an adverb optionally preceded by a degree modifier.​


Example: "very quickly", "too slowly".

Structure:

AdvP→(Deg) Adv

Each of these phrase types can function as a constituent in a larger phrase or sentence.

2. Sentence Level Constructions

At the sentence level, constituents combine according to grammatical rules to form


complete and meaningful sentences. Sentence-level constructions are governed by the
following elements:

Grammatical Rules

CFG-based grammatical rules define how phrases combine to form sentences.​


For example:

S→NP VP

This rule indicates that a sentence (S) consists of a noun phrase followed by a verb phrase.

Coordination

Coordination involves linking two constituents of the same category using conjunctions such
as and, or, but.​
Example: "The boy and the girl", "sang and danced".

Rules such as:

NP → NP and NP
VP → VP or VP

allow the formation of coordinated structures.

Agreement

Agreement refers to grammatical consistency between constituents, such as subject-verb


agreement in number and person.​
Example: "He eats", "They eat".

CFG alone does not handle agreement well, as it lacks the mechanism to enforce such
constraints across distant parts of the sentence. This leads to the need for richer
representations like feature structures.

Feature Structures

To handle complex agreement and other syntactic dependencies, feature structures are
used. A feature structure is a set of attribute-value pairs that represent syntactic or
semantic information.​
Example:

[Number: Singular, Person: 3rd, Gender: Masculine]

Unification-based grammars use these feature structures to ensure that elements such as
subjects and verbs agree in number, person, and gender.

Parsing
Parsing is the process of analyzing a string of symbols (typically a sentence) according to
the rules of a formal grammar. In Natural Language Processing (NLP), parsing determines
the syntactic structure of a sentence by identifying its grammatical constituents (like noun
phrases, verb phrases, etc.). It checks whether the sentence follows the grammatical rules
defined by a grammar, often a Context-Free Grammar (CFG). The result of parsing is
typically a parse tree or syntax tree, which shows how a sentence is hierarchically
structured. Parsing helps in disambiguating sentences with multiple meanings. It is essential
for understanding, translation, and information extraction. There are two main types:
syntactic parsing, which focuses on structure, and semantic parsing, which focuses on
meaning. Parsing algorithms include top-down, bottom-up, and chart parsing. Efficient
parsing is crucial for developing grammar-aware NLP applications.

Top-down Parsing
As the name suggests, top-down parsing starts its search from the root node S and works
downwards towards the leaves. The underlying assumption here is that the input can be
derived from the designated start symbol, S, of the grammar. The next step is to find all
sub-trees which can start with S. To generate the sub-trees of the second-level search, we

expand the root node using all the grammar rules with S on their left hand side. Likewise,
each non-terminal symbol in the resulting sub-trees is expanded next using the grammar
rules having a matching non-terminal symbol on their left hand side. The right hand side of
the grammar rules provide the nodes to be generated, which are then expanded recursively.
As the expansion continues, the tree grows downward and eventually reaches a state where
the bottom of the tree consists only of part-of-speech categories. At this point, all trees
whose leaves do not match words in the input sentence are rejected, leaving only trees that
represent successful parses. A successful parse corresponds to a tree which matches
exactly with the words in the input sentence.
Sample grammar
●​ S → NP VP
●​ S → VP
●​ NP → Det Nominal
●​ NP → NP PP
●​ Nominal → Noun
●​ Nominal → Nominal Noun
●​ VP → Verb
●​ VP → Verb NP
●​ VP → Verb NP PP
●​ PP → Preposition NP
●​ Det → this | that | a | the
●​ Noun → book | flight | meal | money
●​ Verb → book | include | prefer
●​ Pronoun → I | he | she | me | you
●​ Preposition → from | to | on | near | through

A top-down search begins with the start symbol of the grammar. Thus, the first level (ply)
search tree consists of a single node labelled S. The grammar in Table 4.2 has two rules
with S on their left hand side. These rules are used to expand the tree, which gives us two
partial trees at the second level search, as shown in Figure 4.4. The third level is generated
by expanding the non-terminal at the bottom of the search tree in the previous ply. Due to
space constraints, only the expansion corresponding to the left-most non-terminals has been
shown in the figure. The subsequent steps in the parse are left, as an exercise, to the
readers. The correct parse tree shown in Figure 4.4 is obtained by expanding the fifth parse
tree of the third level.
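
A small top-down (recursive-descent) parse of “book that flight” using NLTK, assuming the left-recursive rules (NP → NP PP, Nominal → Nominal Noun) are dropped, since a plain top-down parser loops on left recursion:

# Top-down parsing with a slightly adapted version of the sample grammar.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP | VP
NP -> Det Nominal | Pronoun
Nominal -> Noun
VP -> Verb | Verb NP | Verb NP PP
PP -> Preposition NP
Det -> 'this' | 'that' | 'a' | 'the'
Noun -> 'book' | 'flight' | 'meal' | 'money'
Verb -> 'book' | 'include' | 'prefer'
Pronoun -> 'I' | 'he' | 'she' | 'me' | 'you'
Preposition -> 'from' | 'to' | 'on' | 'near' | 'through'
""")

parser = nltk.RecursiveDescentParser(grammar)
for tree in parser.parse("book that flight".split()):
    print(tree)
# (S (VP (Verb book) (NP (Det that) (Nominal (Noun flight)))))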

Bottom-up Parsing
A bottom-up parser starts with the words in the input sentence and attempts to construct a
parse tree in an upward direction towards the root. At each step, the parser looks for rules in
the grammar where the right hand side matches some of the portions in the parse tree
constructed so far, and reduces it using the left hand side of the production. The parse is
considered successful if the parser reduces the tree to the start symbol of the grammar.
Figure 4.5 shows some steps carried out by the bottom-up parser for sentence Paint the
door.

Each of these parsing strategies has its advantages and disadvantages. As the top-down
search starts generating trees with the start symbol of the grammar, it never wastes time
exploring a tree leading to a different root. However, it wastes considerable time exploring S
trees that eventually result in words that are inconsistent with the input. This is because a
top-down parser generates trees before seeing the input. On the other hand, a bottom-up
parser never explores a tree that does not match the input. However, it wastes time
generating trees that have no chance of leading to an S-rooted tree. The left branch of the

search space in Figure 4.5 that explores a sub-tree assuming paint as a noun, is an example
of wasted effort. We now present a basic search strategy that uses the top-down method to
generate trees and augments it with bottom-up constraints to filter bad parses.

The CYK Parser


Like the Earley algorithm, the CYK (Cocke–Younger–Kasami) algorithm is a dynamic programming
parsing algorithm. However, it follows a bottom-up approach to parsing. It builds a parse tree
incrementally. Each entry in the table is based on previous entries. The process is iterated
until the entire sentence has been parsed. The CYK parsing algorithm assumes the
grammar to be in Chomsky Normal Form (CNF). A CFG is in CNF if all the rules are of only
two forms:

A → BC​
A → w, where w is a word.

The algorithm first builds parses of length one by considering all rules which could
produce the words in the sentence being parsed. Then it finds the parses for all the constituents
of length two. The parses of shorter constituents constructed in earlier iterations can now be
used in constructing the parses of longer constituents.

A non-terminal A derives the substring w[i, j] (extending from position i to position j) if:

1. A → B C is a rule in the grammar,
2. B derives w[i, k], and
3. C derives w[k+1, j], for some i ≤ k < j.

For a substring w[i, j], the algorithm considers all possible ways of breaking it into the two parts
w[i, k] and w[k+1, j]. Finally, the sentence is accepted if S derives w[1, n], i.e., the start symbol
of the grammar derives the entire input w₁ … wₙ.

CYK ALGORITHM

Let w = w₁ w₂ w₃ ... wₙ

// Initialization step
for i := 1 to n do
    chart[i, i] := { A | A → wᵢ is a rule in the grammar }

// Recursive step: len is the span length, i its start position, j its end position
for len := 2 to n do
    for i := 1 to n − len + 1 do
    begin
        j := i + len − 1
        chart[i, j] := ∅
        for k := i to j − 1 do
            chart[i, j] := chart[i, j] ∪ { A | A → BC is a production and
                                           B ∈ chart[i, k] and C ∈ chart[k+1, j] }
    end

if S ∈ chart[1, n] then accept else reject
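
A minimal Python sketch of CYK recognition, assuming the CNF grammar is supplied as lists of lexical (A → w) and binary (A → B C) rules; the grammar and sentence are illustrative.

def cyk_recognize(words, lexical, binary, start="S"):
    n = len(words)
    # chart[i][j]: set of non-terminals deriving words[i..j] (0-based, inclusive)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                       # initialization: A -> w
        chart[i][i] = {A for A, word in lexical if word == w}
    for length in range(2, n + 1):                      # span length
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):                       # split point
                for A, (B, C) in binary:                # A -> B C
                    if B in chart[i][k] and C in chart[k + 1][j]:
                        chart[i][j].add(A)
    return start in chart[0][n - 1]

words = "the dog barks".split()
lexical = [("Det", "the"), ("N", "dog"), ("VP", "barks")]   # A -> w rules
binary = [("S", ("NP", "VP")), ("NP", ("Det", "N"))]        # A -> B C rules
print(cyk_recognize(words, lexical, binary))                 # -> True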



MODULE-3

Introduction
One of the core goals in natural language processing is to build systems that can
understand, categorize, and respond to human language. Text classification, also called
text categorization, is the task of assigning a predefined label or class to a text segment.
Examples include identifying whether a review is positive or negative, whether a document is
about sports or politics, or whether an email is spam or not spam.

Such classification tasks are typically framed as supervised machine learning problems,
where we are given a set of labeled examples (training data) and must build a model that
generalizes to unseen examples. These tasks rely on representing text in a numerical
feature space and applying statistical models to predict classes.

One of the most commonly used classifiers in NLP is the Naive Bayes classifier, a
probabilistic model that applies Bayes’ theorem under strong independence assumptions.
Despite its simplicity, it is robust and surprisingly effective in many domains including spam
filtering, sentiment analysis, and language identification. Naive Bayes is categorized as
a generative model, because it models the joint distribution of inputs and classes to
"generate" data points, in contrast with discriminative models that directly estimate the class
boundary.

4.1 Naive Bayes Classifiers


Text Documents as Bags of Words

Before applying any classifier, text must be converted into a format suitable for machine
learning algorithms. In Naive Bayes, we represent a document as a bag of words (BoW),
which treats the document as an unordered collection of words, discarding grammar and
word order. This assumption simplifies modeling, reducing a complex structured input into a
feature vector.

Let a document d be represented by a set of features f₁, f₂, …, fₙ, where each fᵢ is a feature
(usually a word or term). These features could be binary (word presence), count-based (word
frequency), or even TF-IDF weighted.

Feature Engineering Considerations

●​ Stop Words: Words like “the”, “is”, “and” that occur in all documents and contribute
little to discrimination can be excluded.​

●​ Unknown Words: Words not present in the training vocabulary can either be ignored
or assigned a very small probability.

4.3 Worked Example


Given the following labeled training data:

Label Sentence

+ I love this fun film

+ What a fun and lovely movie

– This was a dull movie

– I hated this film
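
A minimal sketch of a multinomial Naive Bayes classifier with add-1 (Laplace) smoothing trained on these four sentences; the test sentence is an illustrative assumption.

from collections import Counter
import math

train = [("+", "I love this fun film"),
         ("+", "What a fun and lovely movie"),
         ("-", "This was a dull movie"),
         ("-", "I hated this film")]

docs = {c: [] for c in ("+", "-")}
for c, sent in train:
    docs[c].extend(sent.lower().split())

vocab = {w for words in docs.values() for w in words}
priors = {c: math.log(sum(1 for lab, _ in train if lab == c) / len(train)) for c in docs}
counts = {c: Counter(words) for c, words in docs.items()}

def log_likelihood(word, c):
    # P(w | c) with Laplace smoothing over the shared vocabulary
    return math.log((counts[c][word] + 1) / (len(docs[c]) + len(vocab)))

def classify(sentence):
    scores = {}
    for c in docs:
        scores[c] = priors[c] + sum(log_likelihood(w, c)
                                    for w in sentence.lower().split() if w in vocab)
    return max(scores, key=scores.get), scores

print(classify("a fun and lovely film"))   # expected to favour the + class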



4.4 Optimizing for Sentiment Analysis


Binary Naive Bayes

Instead of using raw frequencies, we often use binary features indicating word presence.
This reduces bias introduced by repeated terms and increases robustness in sentiment
classification.

Sentiment Lexicons

Lexicons are curated lists of words annotated with their sentiment polarity.

●​ General Inquirer: Annotates words with dozens of labels including “positive”,


“negative”, “strong”, etc.​

●​ Opinion Lexicon: Divides words into positive (e.g., “love”, “great”) and negative
(e.g., “bad”, “terrible”).​

These lexicons can be used to:

●​ Initialize feature weights​

●​ Enhance feature selection​

●​ Interpret models more transparently

4.5 Naive Bayes for Other Text Classification


Spam Detection

In spam classification, words like “free”, “win”, “credit” may have high probabilities in spam
class. Naive Bayes can be trained on labeled corpora to distinguish spam from ham based
on word distribution.

Language Identification

For this task, character-level n-grams are more effective than word-level features,
especially for short texts. Naive Bayes computes the likelihood of character trigrams in a
sentence under each language model.

Feature Selection

Since text features are sparse and high-dimensional, not all features are informative.
Common selection metrics:

●​ Mutual Information​

●​ Chi-square Test​

●​ Information Gain​

These help reduce dimensionality and improve performance.



4.6 Naive Bayes as a Language Model


Naive Bayes classifiers, as discussed in earlier sections, can be flexibly adapted to a wide
variety of feature types: from words and phrases to URLs, email headers, and other
metadata. However, when restricted specifically to using only individual word
features—and using all words in a document rather than a selected subset—Naive Bayes
begins to resemble a language model.

Naive Bayes as Class-Specific Language Models

In this restricted setting, Naive Bayes effectively acts as a collection of class-specific


unigram language models.

That is, each class (e.g., positive or negative) defines its own unigram distribution P(w | c) over
the vocabulary, assigning a probability to every word w.

Example: Unigram Class-Specific Language Models

Let us consider two sentiment classes:

●​ Positive sentiment (+)​

●​ Negative sentiment (−)​

We are given the following unigram probabilities for selected words under each class, and wish
to score the test sentence:

“I love this fun film”

Step 1: Compute Likelihood Under Each Class

Using the Naive Bayes likelihood formula (Eq. 4.15), P(s | c) = ∏ᵢ P(wᵢ | c), we compute the
likelihood of the sentence under each class:
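
A minimal sketch of this computation, assuming illustrative class-specific unigram probabilities (the values in the original table may differ):

# Class-specific unigram probabilities (illustrative assumptions).
P_pos = {"i": 0.1, "love": 0.1,   "this": 0.01, "fun": 0.05,  "film": 0.1}
P_neg = {"i": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

sentence = "I love this fun film".lower().split()

def likelihood(words, unigram):
    p = 1.0
    for w in words:
        p *= unigram[w]          # P(sentence | class) = product of P(w | class)
    return p

print("P(s|+):", likelihood(sentence, P_pos))   # the larger likelihood decides the class
print("P(s|-):", likelihood(sentence, P_neg))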



Module4

Chapter1: Information Retrieval


Information Retrieval (IR) is the science and practice of searching for and retrieving relevant
information from large collections of data, particularly unstructured text such as documents,
web pages, or multimedia content. The core goal of IR is to help users find the most useful
and relevant information that satisfies their information need, typically expressed as a query.
This process involves indexing the data, matching queries against indexed content, ranking
results by relevance, and presenting them in an accessible way. IR systems are used in a
wide range of applications including web search engines (like Google), digital libraries, legal
and medical databases, and enterprise information systems. Modern IR also incorporates
natural language processing, machine learning, and user modeling to improve accuracy and
relevance of search results.

Design Features in Information Retrieval


Information Retrieval (IR) systems aim to efficiently locate relevant documents or information
from large datasets. Several key design features play a crucial role in enhancing the
performance, efficiency, and relevance of such systems. These include Indexing, Stop
Word Elimination, Stemming, and understanding word distributions through Zipf’s Law.

1. Indexing

Indexing is the process of organizing data to enable rapid search and retrieval. In IR, an
inverted index is commonly used. This structure maps each term in the document collection
to a list of documents (or document IDs) where that term occurs. It typically includes
additional information like term frequency, position, and weight (e.g., TF-IDF score).​
Efficient indexing allows the system to avoid scanning all documents for every query,
dramatically reducing search time and computational cost. Index construction involves
tokenizing documents, normalizing text, and storing index entries in a sorted and optimized
structure, often with compression techniques to reduce storage requirements.
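
A minimal sketch of an inverted index, assuming documents are plain strings and terms are lower-cased whitespace tokens:

from collections import defaultdict

docs = {1: "AI and Robotics in Industry",
        2: "AI in Healthcare"}

index = defaultdict(set)            # term -> set of document IDs (postings list)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(index["ai"])        # {1, 2}
print(index["robotics"])  # {1}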

2. Eliminating Stop Words



Stop words are extremely common words that appear in almost every document, such as
"the", "is", "at", "which", "on", and "and". These words usually add little value to
understanding the main content or differentiating between documents.​
Removing stop words reduces the size of the index, speeds up the search process, and
minimizes noise in results. However, careful handling is required because some stop words
may be semantically important depending on the domain (e.g., "to be or not to be" in
literature, or "in" in legal texts). Most IR systems use a predefined stop word list, though it
can be customized based on corpus analysis.

3. Stemming

Stemming is a form of linguistic normalization used to reduce related words to a common


base or root form. For example:

●​ "connect", "connected", "connection", "connecting" → "connect"​

Stemming improves recall in IR systems by ensuring that different inflected or derived forms
of a word are matched to the same root term in the index. This is particularly important in
languages with rich morphology.​
Common stemming algorithms include:

●​ Porter Stemmer: Lightweight and widely used, based on heuristic rules.​

●​ Snowball Stemmer: An improvement over Porter, supporting multiple languages.​

●​ Lancaster Stemmer: More aggressive but sometimes over-stems words.​

Stemming is different from lemmatization, which uses vocabulary and grammar rules to
derive the base form.
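
A quick illustration using NLTK's Porter and Snowball stemmers:

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

for w in ["connect", "connected", "connection", "connecting"]:
    print(w, porter.stem(w), snowball.stem(w))
# all four forms reduce to the stem "connect"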

4. Zipf’s Law

Zipf’s Law is a statistical principle that describes the frequency distribution of words in
natural language corpora. It states that the frequency f of any word is inversely proportional
to its rank r:

f ∝ 1/r

This means that the most frequent word occurs roughly twice as often as the second most
frequent word, three times as often as the third, and so on.​
For example, in English corpora, words like "the", "of", "and", and "to" dominate the
frequency list. Meanwhile, the majority of words occur rarely (called the "long tail").​
In IR, Zipf’s Law justifies:

●​ Stop word elimination (high-frequency terms contribute little to relevance)​



●​ TF-IDF weighting (rare terms are more informative)​

●​ Optimizing index structures for space and search​

Understanding this law helps in designing efficient indexing and retrieval strategies that
focus on the more informative, lower-frequency words.

IR models
Information Retrieval (IR) models are frameworks used to retrieve and rank relevant
documents from a large collection in response to a user query. These models form the
foundation of search engines and digital information systems by determining which
documents best match the user's needs. The core idea behind any IR model is to represent
both documents and queries in a specific form and then compute their similarity or relevance
using well-defined rules or algorithms.

IR models are generally classified into three categories: Classic Models, Non-Classic
Models, and Alternative Models. Classic models include the Boolean Model, Vector
Space Model (VSM), and Probabilistic Model. These rely on mathematical and logical
principles—Boolean uses exact match logic, VSM represents documents as vectors and
ranks by cosine similarity, while Probabilistic models estimate the probability of a document
being relevant to a given query (e.g., BM25).

Non-classic models and alternative models expand beyond traditional methods.


Non-Classic Models include Fuzzy Models, which allow partial matching; Latent Semantic
Indexing (LSI), which captures deeper relationships through dimensionality reduction; and
Neural IR models that use early machine learning techniques. Alternative Models include
Language Models, where each document is seen as a probabilistic generator of queries;
Graph-Based Models, such as those using PageRank; and modern Deep Neural Models
like BERT-based systems that deeply understand language semantics for highly accurate
retrieval.

3.1 Boolean Model

The Boolean Model is the simplest and most fundamental model used in Information
Retrieval (IR). It is based on the principles of Boolean algebra and set theory, where both
documents and user queries are represented as sets of indexed terms. This model classifies
documents in a binary manner, either as relevant or non-relevant, with no provision for
partial matching or ranking.

3.1.1 Document and Query Representation

In the Boolean model, each document in the collection is represented as a binary vector
over the set of all index terms. Each component of the vector indicates whether the
corresponding term is present (1) or absent (0) in the document. Similarly, user queries are

formulated using Boolean expressions consisting of keywords connected by logical


operators such as AND, OR, and NOT.

3.1.2 Boolean Operators

●​ AND: A document is retrieved if it contains all the terms in the query. It corresponds
to the intersection of sets.​

●​ OR: A document is retrieved if it contains at least one of the query terms. It corresponds to the union of sets.​

●​ NOT: A document is excluded if it contains the specified term. It represents the complement of a set.​

3.1.3 Example

Consider the query:​


Q: AI AND Robotics

Suppose the document collection contains the following:

●​ D1: "AI and Robotics in Industry"​

●​ D2: "AI in Healthcare"​

Applying the Boolean model:

●​ D1 contains both "AI" and "Robotics" → retrieved.​

●​ D2 contains only "AI" → not retrieved.​

Thus, only D1 satisfies the query condition.
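
A minimal sketch of this Boolean retrieval over term sets, assuming the two documents above:

docs = {"D1": "AI and Robotics in Industry",
        "D2": "AI in Healthcare"}

# Each document is represented as a set of (lower-cased) index terms.
term_sets = {d: set(text.lower().split()) for d, text in docs.items()}

def boolean_and(*terms):
    return {d for d, terms_in_d in term_sets.items()
            if all(t.lower() in terms_in_d for t in terms)}

print(boolean_and("AI", "Robotics"))   # {'D1'}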

3.1.4 Advantages and Limitations

Advantages:

●​ Simple and computationally efficient.​

●​ Provides precise control over the retrieval process.​

Limitations:

●​ No ranking or ordering of results.​



●​ Does not support partial matching.​

●​ Requires users to formulate exact Boolean expressions, which may be complex or


unintuitive.​

3.1.5 Use in NLP

Before Boolean retrieval is applied, standard NLP preprocessing techniques are used, such
as tokenization, stop word removal, and stemming, to ensure only meaningful content
words are considered in the indexing and matching process.

The Boolean model, while limited in flexibility and effectiveness for large-scale retrieval
tasks, serves as a foundational approach in IR and provides the basis for more sophisticated
models like the Vector Space and Probabilistic models.

3.2 Vector Space Model


The Vector Space Model (VSM) is a fundamental algebraic model used in Information
Retrieval to represent text documents and user queries as vectors in a high-dimensional
space. Each dimension corresponds to a separate index term (keyword), and the vector’s
components represent the importance (weight) of the term in that document or query. Unlike
the Boolean model, VSM allows partial matching and supports ranking of retrieved
documents based on their relevance scores.

3.2.2 Text Preprocessing Steps


To construct vector representations, documents and queries undergo preprocessing:

Tokenization

This process breaks raw text into individual units called tokens.​
Example:​
Text: “Information retrieval is important.”​
Tokens: [Information, retrieval, is, important]

Stop Word Removal



Common words such as "is", "the", and "of" are removed as they add little value to
information retrieval.

Stemming

Stemming reduces words to their root form, improving matching between similar words.​
Example:​
connect, connected, connection → stemmed to "connect"

Common algorithms: Porter Stemmer, Lancaster Stemmer

3.2.4 Similarity Measures


In the Vector Space Model (VSM), determining the relevance of a document to a user’s
query involves comparing their respective vector representations. Since documents and
queries are modeled as vectors in a high-dimensional term space, similarity measures
provide a way to quantify how close these vectors are to each other. Two of the most
widely used similarity measures in Information Retrieval are Cosine Similarity and the
Jaccard Coefficient.

a. Cosine Similarity

Cosine similarity measures the cosine of the angle between the document vector d and the query vector q:

cos(d, q) = (d · q) / (|d| |q|)

A value close to 1 indicates high similarity, a value of 0 indicates that the document and query share no terms, and intermediate values imply moderate similarity.

Applications:

●​ Used in search engines, document clustering, and recommender systems.​



●​ Efficient in high-dimensional sparse spaces, especially with TF-IDF vectors.

b. Jaccard Coefficient

The Jaccard Coefficient is another commonly used similarity measure, most appropriate when
documents and queries are represented as sets of terms (i.e., binary representations indicating
only presence or absence). For two term sets A and B, it is defined as:

J(A, B) = |A ∩ B| / |A ∪ B|

Applications:

●​ Effective when dealing with binary term presence vectors.​

●​ Useful in duplicate detection, plagiarism checking, and set-based comparisons.​

3.2.5 Solved Example: Document Ranking using Vector Space Model

Query: "AI Future"

Documents:

●​ D1: “AI is shaping the future of technology”​

●​ D2: “History of AI and computing”

Step 1: Tokenization and Stop Word Removal

Document / Query    Original Text                               After Stop Word Removal
D1                  “AI is shaping the future of technology”    [AI, shaping, future, technology]
D2                  “History of AI and computing”               [History, AI, computing]
Q                   “AI Future”                                 [AI, future]

Step 2: Stemming

Term in Document or Query    Stemmed Term
shaping                      shape
computing                    compute
future                       future
technology                   technology
history                      history
AI                           AI

Step 3: Vocabulary

Final list of unique stemmed terms across D1, D2, and Q:

→ [AI, shape, future, technology, history, compute]

Step 4: Term Frequencies (TF)

Term D1 TF D2 TF Q TF

AI 1 1 1

shape 1 0 0

future 1 0 1

technology 1 0 0

history 0 1 0

compute 0 1 0
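
A minimal sketch of the remaining steps (TF-IDF weighting and cosine similarity), assuming idf = log₁₀(N/df); this reproduces the ranking shown in Step 8 below:

import math

vocab = ["ai", "shape", "future", "technology", "history", "compute"]
tf = {"D1": [1, 1, 1, 1, 0, 0],
      "D2": [1, 0, 0, 0, 1, 1],
      "Q":  [1, 0, 1, 0, 0, 0]}

N = 2  # number of documents (the query is not counted)
df = [sum(1 for d in ("D1", "D2") if tf[d][i] > 0) for i in range(len(vocab))]
idf = [math.log10(N / df_i) for df_i in df]

def tfidf(name):
    return [tf[name][i] * idf[i] for i in range(len(vocab))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = tfidf("Q")
for d in ("D1", "D2"):
    print(d, round(cosine(tfidf(d), q), 3))
# D1 ≈ 0.58, D2 = 0.0 — matching the ranking in Step 8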



Step 8: Ranking Based on Similarity


Document Cosine Similarity Rank

D1 0.578 1

D2 0.000 2

3.3 Probabilistic Model


The Probabilistic Model is a classic Information Retrieval (IR) approach that ranks
documents based on the probability of relevance to a user’s query. It assumes that
relevance is uncertain and models it using probabilistic reasoning. The core idea is to
assign each document a probability score, estimating how likely it is to be relevant to a
specific query.

Assumptions

1.​ For a given query, each document either is relevant or not relevant.​

2.​ The goal is to maximize the probability of retrieving relevant documents over
non-relevant ones.​

3.​ The ranking is based on computing the probability P(R∣D,Q) — the probability that
document D is relevant R given the query Q.

Binary Independence Model (BIM)



The most common implementation is the Binary Independence Model, which:

●​ Represents documents and queries as binary vectors (term present or absent).​

●​ Assumes that terms are independent of each other.​

●​ Computes a retrieval status value (RSV) for each document.

Working Steps

1.​ Initial Retrieval: Use term matching to get an initial document set.​

2.​ Relevance Feedback: User marks relevant/non-relevant documents.​

3.​ Score Calculation: Calculate the RSV for each document.​

4.​ Ranking: Rank documents based on decreasing RSV.​

NON-CLASSICAL MODELS OF IR
Non-classical IR models diverge from traditional IR models that rely on similarity, probability,
or Boolean operations. Instead, these models are grounded in logic-based inference and
semantic understanding of language and information.

1. Information Logic Model

This model utilizes a special logic technique known as logical imaging for retrieval. The
process involves inferring the relevance of a document to a query, similar to how logic
deduces facts from premises. Unlike traditional models, this logic model assumes that if a

document does not contain a query term, it may still be relevant if semantic inference
supports it.

Key concept:

A measure of uncertainty is introduced when the presence or absence of a term


does not clearly indicate the truth of a statement. This uncertainty is quantified
using principles from van Rijsbergen's work.

Example principle from van Rijsbergen:

Given a sentence s and a term t, a measure of uncertainty u reflects how


strongly t supports or contradicts s.

For example, in the sentence:

"Atul is serving a dish."

The term dish supports the situation s. The support is evaluated by whether t (e.g., “dish”)
implies that the sentence s is true or likely true. If yes, t ⊨ s, and t supports s.

2. Situational Logic Model

This model extends the logic model by treating retrieval as a flow of information from the
document to the query. Each document is modeled as a situation, and a structural calculus
is used to infer whether this situation supports the query.

In essence:

●​ An infon is a unit of information.​

●​ A document affirms an infon if it is relevant to it.​

●​ The polarity of an infon (1 or 0) indicates its truth in that document.​

This model emphasizes logical inferences and structural relationships, not just term
occurrence.

3. Information Interaction Model

This model was developed by Ingwersen (1992, 1996) and draws from cognitive science.
It interprets retrieval as an interaction process between the user and the information
system, not a one-way search. Relevance is based on user context, and semantic
transformations (like synonym, hyponym, or conceptual linkages) are used to relate
documents to queries.

Artificial Neural Networks can also be used under this model to simulate how the human
brain processes relevance in interconnected networks. This model is user-centric and
emphasizes feedback loops, subjective relevance, and interaction history.

ALTERNATIVE MODELS OF INFORMATION RETRIEVAL

LSTM (not in textbook)


Long Short-Term Memory (LSTM) Networks in Deep Learning

Introduction​
Long Short-Term Memory Networks (LSTM) are a type of Recurrent Neural Network (RNN)
designed to retain long-term dependencies in sequential data. Unlike standard RNNs, which
suffer from the vanishing gradient problem and struggle to remember long-term information,
LSTMs were developed by Hochreiter and Schmidhuber to overcome this issue. LSTM
networks are commonly implemented using Python libraries such as Keras and TensorFlow.

Real-World Analogy​
Consider watching a movie or reading a book—your brain retains context from earlier
scenes or chapters to understand the current situation. LSTM networks emulate this
behavior by maintaining memory across time steps in a sequence.

Why LSTM over RNN?​


RNNs suffer from short memory due to the vanishing gradient problem. LSTMs resolve this
by incorporating memory units that can learn when to forget or retain information.

LSTM Architecture​
LSTM networks consist of memory cells and three primary gates:

1.​ Forget Gate​

2.​ Input Gate​

3.​ Output Gate​

Each gate controls the flow of information into and out of the cell state, enabling LSTMs to
manage memory efficiently.
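
A minimal Keras sketch of an LSTM text classifier, assuming integer-encoded token sequences and a vocabulary of 10,000 words; the layer sizes are illustrative.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=10_000, output_dim=64),  # token ids -> dense vectors
    layers.LSTM(128),                                    # gated memory over the sequence
    layers.Dense(1, activation="sigmoid"),               # e.g., positive vs negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# A bidirectional variant (discussed below) simply wraps the recurrent layer:
#     layers.Bidirectional(layers.LSTM(128))
# model.fit(x_train, y_train, epochs=3)   # x_train: padded integer sequences (hypothetical data)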

Key States:

●​ Hidden state (H_t): carries the short-term context from one time step to the next.​

●​ Cell state (C_t): carries long-term memory across the sequence.​

Example:​
Given two sentences:

●​ "Bob is a nice person."​

●​ "Dan, on the other hand, is evil."​



The LSTM should forget Bob and shift context to Dan using the forget gate. It retains
relevant features using the input gate and generates predictions using the output gate.

Visualizing Memory

●​ Hidden State H_t: Short-term context​

●​ Cell State C_t: Long-term memory​

●​ Information is retained and passed across many time steps.​

Example Input:​
"Bob knows swimming. He told me over the phone that he had served in the navy for four long years."

●​ Input Gate learns that “served in the navy” is more important than “told me over the
phone.”​

Bidirectional LSTM​
Bidirectional LSTM processes data in both forward and backward directions:

●​ Forward pass: Past to future​

●​ Backward pass: Future to past​

Use Cases:

●​ Named Entity Recognition​

●​ Sentiment Analysis​

●​ Machine Translation​

Advantage:​
Bidirectional LSTM captures full context from both sides of the sequence, enhancing
prediction accuracy.

Major Issues in Information Retrieval


Information Retrieval (IR) is concerned with retrieving relevant information from large
collections of unstructured data, typically textual documents. Although IR has advanced
significantly, it still faces several critical challenges that affect its effectiveness and user
satisfaction.

1. Relevance of Results
●​ Core Problem: Determining the relevance of a document to a user's query.​

●​ Challenge: Users often express their information need imprecisely, while relevance
is subjective and context-dependent.​

●​ Example: A query like “jaguar” may refer to an animal, a car, or a sports team.​

2. Vocabulary Mismatch
●​ Definition: Users and documents may use different words to describe the same
concept.​

●​ Solution Approaches:​

○​ Stemming & Lemmatization: Reducing words to their base form.​

○​ Query Expansion: Adding synonyms or related terms (e.g., using WordNet


or embedding models).​

●​ Example: User queries for “automobile,” but documents use “car.”​

3. Handling Ambiguity
●​ Word Sense Disambiguation is essential in resolving ambiguous terms.​

●​ Challenge: IR systems must guess the correct meaning based on limited query
context.​

●​ Example: The word "python" may mean a programming language or a snake.​

4. Ranking of Documents

●​ Goal: Show most relevant documents at the top.​

●​ Issues:​

○​ Static ranking may not reflect current relevance.​

○​ Must balance precision (top results relevant) and recall (all relevant results
retrieved).​

●​ Solution: Use ranking algorithms (BM25, TF-IDF, neural ranking models).​

5. Scalability
●​ Challenge: Large-scale data processing for millions of documents and queries.​

●​ Requirements:​

○​ Efficient indexing (e.g., inverted index).​

○​ Distributed systems like Apache Lucene, Solr, or Elasticsearch.​

6. User Interface and Interaction


●​ Query formulation is often hard for users.​

●​ Need: Suggestion mechanisms, query reformulation, relevance feedback to improve


results iteratively.

7. Multilingual and Cross-lingual Retrieval


●​ Problem: Users may query in one language, but documents are in another.​

●​ Challenge: Translate and match semantics accurately.​

●​ Approach: Use translation models and multilingual embeddings.

8. Noise and Irrelevant Content


●​ Many web documents contain advertisements, boilerplate, or irrelevant metadata.​

●​ Solution: Use content filtering and noise detection techniques.​



9. Evaluation and Ground Truth


●​ Difficulty: Building test collections with labeled relevance judgments.​

●​ Metrics: Precision, Recall, F1-score, MAP, nDCG.​

●​ Human judgments are subjective and vary across users.​

10. Privacy and Ethical Concerns


●​ Query logs and personalization involve sensitive user data.​

●​ Need: Implement data protection, anonymization, and ethical guidelines.​



Chapter2: Lexical Resources

