NLP 2-5 unit notes

Unit 2 Language Modeling and Part-Of-Speech (PoS) Tagging

Language modeling is a fundamental task in natural language processing (NLP) that involves
predicting the next word in a sequence of words or characters. The goal of language modeling is
to capture the patterns, relationships, and probabilities between words in a given language.
Language models are crucial for various NLP applications, including machine translation, text
generation, speech recognition, and more.

There are different types of language models, such as:

1. N-gram Models: These models predict the next word based on the previous n-1 words.
For example, a trigram model (n=3) would predict the next word using the two preceding
words.

2. Recurrent Neural Networks (RNNs): RNNs are a type of neural network designed for
sequence data. They have a hidden state that captures information about previous words
in the sequence and uses it to predict the next word. However, traditional RNNs suffer
from the vanishing gradient problem and struggle to capture long-range dependencies.

3. Long Short-Term Memory (LSTM) Networks: LSTMs are a type of RNN that address
the vanishing gradient problem. They use memory cells to capture long-range
dependencies and are better suited for language modeling tasks.

4. Transformer Models: Transformers, introduced in the "Attention is All You Need"
paper, have revolutionized language modeling. They use self-attention mechanisms to
capture relationships between all words in a sequence simultaneously, making them
highly parallelizable and efficient. Models like GPT (Generative Pre-trained
Transformer) and BERT (Bidirectional Encoder Representations from Transformers) are
built upon the transformer architecture.

Part-of-Speech (PoS) Tagging:

Part-of-speech tagging, also known as POS tagging or grammatical tagging, is the process of
assigning a grammatical category (such as noun, verb, adjective, etc.) to each word in a sentence.
POS tagging is an essential step in many NLP tasks, as it provides insights into the syntactic
structure of a sentence and helps in extracting meaning.

POS tagging can be approached using rule-based methods, statistical methods, and machine
learning methods:

1. Rule-Based Approaches: These methods use predefined linguistic rules and patterns to
assign POS tags to words. For example, if a word ends in "-ing," it is likely a verb.
2. Statistical Methods: Statistical models use probabilities and patterns learned from large
text corpora to assign POS tags. Hidden Markov Models (HMMs) and Conditional
Random Fields (CRFs) are commonly used statistical methods for POS tagging.

3. Machine Learning Approaches: Machine learning methods, particularly neural
networks, have shown great success in POS tagging. They involve training models on
labeled datasets to learn the relationships between words and their corresponding POS
tags. Recurrent neural networks and transformers can be used for this purpose.

Modern POS tagging systems often use a combination of these approaches to achieve high
accuracy. They leverage large annotated corpora to train models and capture complex syntactic
patterns.

In summary, language modeling focuses on predicting the next word in a sequence, while POS
tagging involves assigning grammatical categories to each word in a sentence. Both tasks are
crucial for understanding and generating human language, and they play a significant role in
various NLP applications.

A unigram language model is a simple type of language model that predicts the next word in a
sequence based solely on the frequency of individual words in the training data. It assumes that
the probability of a word appearing next is independent of the context and is solely determined
by the frequency of that word in the training corpus. In other words, it treats each word as a
standalone entity.

Here's an example of how a unigram language model works:

Let's say we have a small training corpus with the following sentences:

1. The cat is on the mat.

2. The dog is in the yard.

3. A bird is on a branch.

We want to build a unigram language model to predict the next word after "The." Because a
unigram model ignores context, the probability it assigns to each candidate word is simply that
word's relative frequency in the training data:

 P(cat | The) = P(cat) = frequency("cat") / total_words_in_corpus

 P(dog | The) = P(dog) = frequency("dog") / total_words_in_corpus

 P(bird | The) = P(bird) = frequency("bird") / total_words_in_corpus

Assuming the word frequencies are as follows:


 frequency("cat") = 1

 frequency("dog") = 1

 frequency("bird") = 1

 total_words_in_corpus = 18 (total words in the training corpus)

Then, the probabilities would be:

 P(cat | The) = 1 / 18 ≈ 0.056

 P(dog | The) = 1 / 18 ≈ 0.056

 P(bird | The) = 1 / 18 ≈ 0.056

Since all the probabilities are the same, the unigram model would predict any of these words
with equal likelihood after "The."

So, if we input "The" into the unigram model, it might generate:

 "The cat"

 "The dog"

 "The bird"

As you can see, the unigram model doesn't consider context or word dependencies and relies
solely on word frequencies. While it's a simple model, it doesn't capture the richness of language
structure and context that more advanced models like RNNs or transformers can handle.
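
A minimal Python sketch of this idea, assuming a whitespace-tokenized, lowercased corpus (the
variable and function names are illustrative, not from any particular library):

from collections import Counter

corpus = "the cat is on the mat the dog is in the yard a bird is on a branch"
tokens = corpus.split()

counts = Counter(tokens)
total = len(tokens)   # 18 tokens

# Unigram probability: relative frequency, independent of any context.
def unigram_prob(word):
    return counts[word] / total

print(unigram_prob("cat"))   # 1/18 ~= 0.056
print(unigram_prob("the"))   # 4/18 ~= 0.222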

Counting Words in Corpora

Counting words in a corpus involves tallying the frequency of each unique word present in a
collection of texts. This is a fundamental step in many natural language processing tasks,
including language modeling, text analysis, and more. Let's walk through an example of
counting words in a small corpus:

Suppose we have the following three sentences as our corpus:

1. The cat chased the mouse.

2. The dog barked at the cat.

3. The mouse squeaked.

To count the words in this corpus, we follow these steps:


1. Tokenization: Break each sentence into individual words or tokens. This process involves
splitting the sentences based on spaces.

 Sentence 1: ["The", "cat", "chased", "the", "mouse."]

 Sentence 2: ["The", "dog", "barked", "at", "the", "cat."]

 Sentence 3: ["The", "mouse", "squeaked."]

2. Lowercasing: Convert all words to lowercase to ensure case-insensitive counting.

 Sentence 1: ["the", "cat", "chased", "the", "mouse."]

 Sentence 2: ["the", "dog", "barked", "at", "the", "cat."]

 Sentence 3: ["the", "mouse", "squeaked."]

3. Removing Punctuation: Remove any punctuation marks from the tokens.

 Sentence 1: ["the", "cat", "chased", "the", "mouse"]

 Sentence 2: ["the", "dog", "barked", "at", "the", "cat"]

 Sentence 3: ["the", "mouse", "squeaked"]

4. Counting Frequency: Count the frequency of each unique word in the corpus.

 "the": 5 occurrences

 "cat": 2 occurrences

 "chased": 1 occurrence

 "mouse": 2 occurrences

 "dog": 1 occurrence

 "barked": 1 occurrence

 "at": 1 occurrence

 "squeaked": 1 occurrence

So, the word frequency count for this corpus would be:

 "the": 5

 "cat": 2
 "chased": 1

 "mouse": 2

 "dog": 1

 "barked": 1

 "at": 1

 "squeaked": 1

This information is valuable for various analyses and tasks. For instance, it can help us
understand which words are most common, identify key terms, or even preprocess text for
further NLP tasks.

Keep in mind that in larger corpora, the process of counting words can become more complex
and may require optimizations to handle memory and efficiency constraints.
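
A short Python sketch of the four steps above (tokenize on whitespace, lowercase, strip
punctuation, count); the variable names are illustrative:

import string
from collections import Counter

sentences = [
    "The cat chased the mouse.",
    "The dog barked at the cat.",
    "The mouse squeaked.",
]

counts = Counter()
for sentence in sentences:
    for token in sentence.split():                  # 1. tokenization
        token = token.lower()                       # 2. lowercasing
        token = token.strip(string.punctuation)     # 3. removing punctuation
        if token:
            counts[token] += 1                      # 4. counting frequency

print(counts)
# Counter({'the': 5, 'cat': 2, 'mouse': 2, 'chased': 1, 'dog': 1, 'barked': 1, 'at': 1, 'squeaked': 1})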

Simple (Unsmoothed) N-grams

Simple (unsmoothed) N-grams are a type of language model that predict the next word in a
sequence based on the conditional probability of observing that word given the previous n-1
words. The "n" in n-grams refers to the number of words considered in the context. For example,
in a bigram model (2-gram), the prediction is based on the previous word, while in a trigram
model (3-gram), the prediction considers the two preceding words.

Let's walk through a simple example of a bigram (2-gram) model using the sentence "The cat
chased the mouse."

Step 1: Tokenization and Preprocessing: Tokenize the sentence and remove punctuation.

Input sentence: "The cat chased the mouse." Tokenized: ["The", "cat", "chased", "the", "mouse"]

Step 2: Creating Bigrams: Create all possible pairs of adjacent words.

 Bigrams: [("The", "cat"), ("cat", "chased"), ("chased", "the"), ("the", "mouse")]

Step 3: Counting Bigram Frequencies: Count the occurrences of each bigram in the training
data.

 ("The", "cat"): 1

 ("cat", "chased"): 1

 ("chased", "the"): 1
 ("the", "mouse"): 1

Step 4: Calculating Conditional Probabilities: Calculate the conditional probability of the next
word given the previous word.

 P("cat" | "The") = count("The", "cat") / count("The")

 P("chased" | "cat") = count("cat", "chased") / count("cat")

 P("the" | "chased") = count("chased", "the") / count("chased")

 P("mouse" | "the") = count("the", "mouse") / count("the")

Assuming that each word appears only once in the context:

 P("cat" | "The") = 1/1 = 1

 P("chased" | "cat") = 1/1 = 1

 P("the" | "chased") = 1/1 = 1

 P("mouse" | "the") = 1/1 = 1

Step 5: Making Predictions: Given the context word, choose the word with the highest
conditional probability as the next word prediction.

 Given "The," the model predicts "cat."

 Given "cat," the model predicts "chased."

 Given "chased," the model predicts "the."

 Given "the," the model predicts "mouse."

This simple example illustrates the basic mechanics of a bigram model. Keep in mind that in
practice, more sophisticated models and techniques are used to handle larger datasets, deal with
unseen words, and address issues like data sparsity and smoothing.
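
A minimal sketch of the bigram steps above for this single training sentence (function names are
illustrative):

from collections import Counter

tokens = ["The", "cat", "chased", "the", "mouse"]

bigrams = list(zip(tokens, tokens[1:]))      # adjacent word pairs
bigram_counts = Counter(bigrams)
unigram_counts = Counter(tokens)

# Conditional probability of a word given the previous word.
def bigram_prob(prev, word):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("The", "cat"))     # 1.0
print(bigram_prob("the", "mouse"))   # 1.0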

N-gram Smoothing

N-gram smoothing, also known as language model smoothing, is a technique used to address the
problem of data sparsity and improve the robustness of n-gram models, especially when dealing
with unseen or infrequent n-grams in the training data. Smoothing methods adjust the probability
estimates of n-grams to avoid zero probabilities and better generalize to unseen data.

Let's go through an example of n-gram smoothing using a bigram (2-gram) model and Laplace
(add-one) smoothing. Consider the following training corpus:
Training Corpus:

1. "The cat chased the mouse."

2. "The dog chased the cat."

Step 1: Tokenization and Preprocessing: Tokenize the sentences and remove punctuation.

Tokenized sentences:

1. ["The", "cat", "chased", "the", "mouse"]

2. ["The", "dog", "chased", "the", "cat"]

Step 2: Creating Bigrams: Create all possible pairs of adjacent words.

 Bigrams from sentence 1: [("The", "cat"), ("cat", "chased"), ("chased", "the"), ("the",
"mouse")]

 Bigrams from sentence 2: [("The", "dog"), ("dog", "chased"), ("chased", "the"), ("the",
"cat")]

Step 3: Counting Bigram Frequencies: Count the occurrences of each bigram in the training
data.

 ("The", "cat"): 1

 ("cat", "chased"): 1

 ("chased", "the"): 2

 ("the", "mouse"): 1

 ("The", "dog"): 1

 ("dog", "chased"): 1

Step 4: Applying Laplace Smoothing (Add-One Smoothing): Laplace smoothing adds one to the
count of every bigram and adds the vocabulary size V to each denominator, so that no bigram
receives zero probability.

Vocabulary (V = 6): ["The", "cat", "chased", "the", "mouse", "dog"]

Smoothed counts:

 ("The", "cat"): 1 + 1 = 2

 ("cat", "chased"): 1 + 1 = 2

 ("chased", "the"): 2 + 1 = 3
 ("the", "mouse"): 1 + 1 = 2

 ("The", "dog"): 1 + 1 = 2

 ("dog", "chased"): 1 + 1 = 2

Step 5: Calculating Smoothed Probabilities: Calculate the smoothed conditional probabilities
using the smoothed counts, where each denominator is the context word's count plus the
vocabulary size (V = 6):

 P(w | w_prev) = (count(w_prev, w) + 1) / (count(w_prev) + V)

 P("cat" | "The") = (1 + 1) / (2 + 6) = 2/8 = 0.25

 P("chased" | "cat") = (1 + 1) / (2 + 6) = 2/8 = 0.25

 P("the" | "chased") = (2 + 1) / (2 + 6) = 3/8 = 0.375

 P("mouse" | "the") = (1 + 1) / (2 + 6) = 2/8 = 0.25

 P("dog" | "The") = (1 + 1) / (2 + 6) = 2/8 = 0.25

 P("chased" | "dog") = (1 + 1) / (1 + 6) = 2/7 ≈ 0.29

Step 6: Making Predictions: Given the context word, choose the word with the highest
smoothed conditional probability as the next word prediction.

For example, given "The," the model predicts "cat" or "dog," each with a smoothed probability of 0.25.

N-gram smoothing, in this case, Laplace smoothing, has helped prevent zero probabilities and
provided more reasonable probability estimates for unseen or infrequent n-grams. This leads to
better generalization and improved performance of the language model on unseen data.
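
The add-one calculation above can be reproduced with a small sketch over the two tokenized
training sentences (names are illustrative):

from collections import Counter

sentences = [
    ["The", "cat", "chased", "the", "mouse"],
    ["The", "dog", "chased", "the", "cat"],
]

bigram_counts = Counter()
unigram_counts = Counter()
for s in sentences:
    unigram_counts.update(s)
    bigram_counts.update(zip(s, s[1:]))

V = len(set(unigram_counts))   # vocabulary size = 6

# Laplace (add-one) smoothed bigram probability.
def smoothed_prob(prev, word):
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(smoothed_prob("The", "cat"))     # 2/8 = 0.25
print(smoothed_prob("dog", "chased"))  # 2/7 ~= 0.29
print(smoothed_prob("The", "mouse"))   # unseen bigram: 1/8, not zero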

Back-off

Back-off is a technique used in language modeling to handle cases where the model encounters
an unseen n-gram in the test data. It involves falling back to lower-order n-grams (with fewer
context words) when the probability of an n-gram cannot be calculated due to data sparsity.

The idea behind back-off is that if a higher-order n-gram has a low probability (or zero
probability) due to lack of data, the model can "back off" to a lower-order n-gram to estimate the
probability of the next word more accurately.

Let's understand the concept of back-off using an example:

Suppose we have the following sentences in our training corpus:

1. "The cat chased the mouse."

2. "The dog chased the cat."


We'll use a trigram (3-gram) language model and apply back-off to calculate the probabilities of
certain trigrams in a sentence:

Step 1: Tokenization and Preprocessing: Tokenize the sentences and remove punctuation.

Tokenized sentences:

1. ["The", "cat", "chased", "the", "mouse"]

2. ["The", "dog", "chased", "the", "cat"]

Step 2: Creating Trigrams: Create all possible sequences of three consecutive words.

 Trigrams from sentence 1: [("The", "cat", "chased"), ("cat", "chased", "the"), ("chased",
"the", "mouse")]

 Trigrams from sentence 2: [("The", "dog", "chased"), ("dog", "chased", "the"), ("chased",
"the", "cat")]

Step 3: Counting Trigram Frequencies: Count the occurrences of each trigram in the training
data.

 ("The", "cat", "chased"): 1

 ("cat", "chased", "the"): 1

 ("chased", "the", "mouse"): 1

 ("The", "dog", "chased"): 1

 ("dog", "chased", "the"): 1

 ("chased", "the", "cat"): 1

Step 4: Calculating Trigram Probabilities: Calculate the probabilities of each trigram using
their counts and the counts of their preceding bigrams.

 P("chased" | "cat", "The") = count("The", "cat", "chased") / count("The", "cat")

 P("the" | "chased", "cat") = count("cat", "chased", "the") / count("cat", "chased")

 P("mouse" | "the", "chased") = count("chased", "the", "mouse") / count("chased", "the")

Step 5: Applying Back-Off: If the count of the higher-order n-gram is zero, we "back off" to a
lower-order n-gram and calculate its probability. We can assign a weight to the lower-order n-
gram's probability to account for the back-off.

For example:
 P("the" | "chased", "cat") = count("cat", "chased", "the") / count("cat", "chased")

 Back off to: P("the" | "chased")

 Calculate P("the" | "chased") using bigram counts and apply a back-off weight.

This way, if the trigram probability cannot be estimated accurately due to lack of data, the model
"backs off" to a lower-order n-gram for a more reasonable estimate.

Back-off is a technique that helps address data sparsity and improves the overall performance of
n-gram language models by allowing the model to make reasonable predictions even for unseen
or infrequent n-grams.
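
A minimal sketch of the back-off idea, using a single fixed back-off weight rather than a full
Katz back-off with discounting (the weight and all names are illustrative):

from collections import Counter

sentences = [
    ["The", "cat", "chased", "the", "mouse"],
    ["The", "dog", "chased", "the", "cat"],
]

trigrams, bigrams, unigrams = Counter(), Counter(), Counter()
for s in sentences:
    unigrams.update(s)
    bigrams.update(zip(s, s[1:]))
    trigrams.update(zip(s, s[1:], s[2:]))

total = sum(unigrams.values())
ALPHA = 0.4   # illustrative back-off weight

def backoff_prob(w1, w2, w3):
    # Use the trigram estimate if the trigram was seen ...
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    # ... otherwise back off to the bigram estimate ...
    if bigrams[(w2, w3)] > 0:
        return ALPHA * bigrams[(w2, w3)] / unigrams[w2]
    # ... otherwise back off to the unigram estimate.
    return ALPHA * ALPHA * unigrams[w3] / total

print(backoff_prob("The", "cat", "chased"))   # seen trigram: 1.0
print(backoff_prob("dog", "the", "mouse"))    # unseen trigram: 0.4 * 1/2 = 0.2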

Deleted Interpolation

Deleted Interpolation is a smoothing technique used in language modeling to address the
shortcomings of simple n-gram models. It combines probability estimates from various n-grams
(unigrams, bigrams, trigrams, etc.) to improve the overall prediction accuracy and handle data
sparsity. The "deleted" part refers to how the interpolation weights are estimated: portions of the
training data are held out ("deleted") in turn, and the weights are chosen to fit this held-out data.

Let's understand deleted interpolation using an example:

Suppose we have the following sentences in our training corpus:

1. "The cat chased the mouse."

2. "The dog chased the cat."

We'll use a trigram (3-gram) language model with deleted interpolation to calculate the
probabilities of certain trigrams in a sentence:

Step 1: Tokenization and Preprocessing: Tokenize the sentences and remove punctuation.

Tokenized sentences:

1. ["The", "cat", "chased", "the", "mouse"]

2. ["The", "dog", "chased", "the", "cat"]

Step 2: Creating Trigrams: Create all possible sequences of three consecutive words.

 Trigrams from sentence 1: [("The", "cat", "chased"), ("cat", "chased", "the"), ("chased",
"the", "mouse")]

 Trigrams from sentence 2: [("The", "dog", "chased"), ("dog", "chased", "the"), ("chased",
"the", "cat")]
Step 3: Counting Trigram Frequencies: Count the occurrences of each trigram in the training
data.

 ("The", "cat", "chased"): 1

 ("cat", "chased", "the"): 1

 ("chased", "the", "mouse"): 1

 ("The", "dog", "chased"): 1

 ("dog", "chased", "the"): 1

 ("chased", "the", "cat"): 1

Step 4: Calculating Trigram Probabilities: Calculate the probabilities of each trigram using
their counts and the counts of their preceding bigrams.

 P("chased" | "cat", "The") = count("The", "cat", "chased") / count("The", "cat")

 P("the" | "chased", "cat") = count("cat", "chased", "the") / count("cat", "chased")

 P("mouse" | "the", "chased") = count("chased", "the", "mouse") / count("chased", "the")

Step 5: Applying Deleted Interpolation: Deleted interpolation combines probabilities from
different n-grams, usually by weighting them. One common approach is a linear interpolation:

 Interpolated probability = λ1 * P(trigram) + λ2 * P(bigram) + λ3 * P(unigram)

Where λ1, λ2, and λ3 are the weights assigned to each n-gram's probability.

For example, let's assume:

 λ1 = 0.5

 λ2 = 0.3

 λ3 = 0.2

The interpolated probability of "chased" given the context "cat" and "The" would be:

 Interpolated P("chased" | "cat", "The") = 0.5 * P("chased" | "cat", "The") + 0.3 * P("chased" | "cat") + 0.2 * P("chased")

This way, deleted interpolation combines information from different n-grams to provide a more
accurate estimate of the next word's probability, overcoming the limitations of simple n-gram
models and addressing data sparsity.
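
A small sketch of the linear interpolation step over the same two training sentences; the λ values
below are fixed by hand for illustration, not tuned by deleted estimation:

from collections import Counter

sentences = [
    ["The", "cat", "chased", "the", "mouse"],
    ["The", "dog", "chased", "the", "cat"],
]

trigrams, bigrams, unigrams = Counter(), Counter(), Counter()
for s in sentences:
    unigrams.update(s)
    bigrams.update(zip(s, s[1:]))
    trigrams.update(zip(s, s[1:], s[2:]))

total = sum(unigrams.values())
L1, L2, L3 = 0.5, 0.3, 0.2   # interpolation weights, must sum to 1

def interp_prob(w1, w2, w3):
    p_tri = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p_bi = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p_uni = unigrams[w3] / total
    return L1 * p_tri + L2 * p_bi + L3 * p_uni

print(interp_prob("The", "cat", "chased"))   # 0.5*1 + 0.3*0.5 + 0.2*0.2 = 0.69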
N-grams for Spelling and Pronunciation

N-grams can also be used for tasks related to spelling and pronunciation, such as spelling
correction, phonetic transcription, and text-to-speech synthesis. Let's explore how n-grams can
be applied to these tasks with examples:

1. Spelling Correction:

N-grams can be used to identify and correct spelling errors in text by comparing the input words
with a vocabulary of correctly spelled words. This is done by calculating the similarity or edit
distance between the n-grams of the input word and the n-grams of words in the vocabulary.

Example: Input: "helo" Vocabulary: ["hello", "help", "house", "hold"]

In a bigram-based approach, the input "helo" would be split into bigrams: ["he", "el", "lo"].
Then, the system would calculate the similarity between these bigrams and the bigrams of words
in the vocabulary. In this case, "hello" would have the highest similarity, so "helo" would be
corrected to "hello."

2. Phonetic Transcription:

N-grams can be used to generate phonetic transcriptions of words by mapping their sequences of
n-grams to phonetic symbols. This is particularly useful for speech recognition and synthesis
tasks.

Example: Input: "pronunciation" Phonetic transcription: "p r ah n ah n s iy ey sh ah n"

Here, each bigram or trigram in the input word is mapped to a corresponding phonetic symbol to
create the phonetic transcription.

3. Text-to-Speech Synthesis:

For text-to-speech synthesis, n-grams can be used to predict the appropriate pronunciation for
words based on their context within a sentence. This helps generate more natural and fluent
speech.

Example: Input: "The quick brown fox jumps over the lazy dog." Text-to-speech synthesis: "dh
ah k w ih k b r aw n f aa k s jh ah m p s ow v er dh ah l ey z iy d ao g."

In this case, n-grams are used to predict the pronunciation of each word based on the surrounding
words and their context within the sentence.

N-grams play a role in these spelling and pronunciation-related tasks by capturing patterns in the
distribution of letters or phonemes. However, it's important to note that while n-grams can be a
helpful component of these systems, more sophisticated approaches often incorporate various
linguistic and statistical techniques to achieve higher accuracy and robustness.
Entropy Natural Language Generation

Entropy, in the context of natural language generation (NLG), refers to the measure of
uncertainty or randomness in the output generated by a language model. It quantifies the average
amount of information needed to represent the possible outcomes of the model's predictions. In
NLG, lower entropy indicates that the generated text is more predictable and less diverse, while
higher entropy suggests greater diversity and unpredictability.

Entropy can be calculated using the formula:

Entropy = -Σ P(x) * log2(P(x))

Where P(x) represents the probability of a particular outcome x, and the summation is over all
possible outcomes.

Let's see how entropy can be applied to a simple NLG example:

Example: Sentence Generation using a Language Model

Suppose we have a language model that generates sentences about weather conditions. Given a
specific context, the model predicts the next word in the sentence.

Context: "Today's weather is" Possible predictions: "sunny," "cloudy," "rainy," "windy"

Assume the probabilities of these predictions are:

 P("sunny") = 0.4

 P("cloudy") = 0.3

 P("rainy") = 0.2

 P("windy") = 0.1

Step 1: Calculating Entropy

Entropy = -Σ P(x) * log2(P(x))

Entropy = - (0.4 * log2(0.4) + 0.3 * log2(0.3) + 0.2 * log2(0.2) + 0.1 * log2(0.1)) Entropy ≈
1.8464

Step 2: Interpretation

The calculated entropy value (approximately 1.8464) indicates the average amount of uncertainty
or information needed to predict the next word in the sentence. In this case, the lower the
entropy, the more predictable the language model's predictions are.
A lower entropy value would imply that the model tends to generate the same type of word more
often, leading to less diverse and more expected output. Conversely, a higher entropy value
suggests that the model's predictions are more diverse and less predictable.

In practical NLG scenarios, considering entropy can help control the balance between generating
fluent and coherent text while introducing enough variation to make the generated output
interesting and diverse. It's important to find the right trade-off based on the specific application
and the desired characteristics of the generated content.
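
The entropy figure above can be checked with a direct transcription of the formula:

import math

probs = {"sunny": 0.4, "cloudy": 0.3, "rainy": 0.2, "windy": 0.1}

entropy = -sum(p * math.log2(p) for p in probs.values())
print(round(entropy, 4))   # 1.8464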

Parts of Speech Tagging

Part-of-speech (POS) tagging is the process of assigning grammatical categories or labels (such
as noun, verb, adjective, etc.) to each word in a sentence. POS tagging helps in understanding the
syntactic structure of a sentence and is a fundamental task in natural language processing. Let's
go through a simple example of POS tagging:

Example Sentence: "The cat chased the mouse."

Step 1: Tokenization Tokenize the sentence into individual words: ["The", "cat", "chased",
"the", "mouse"]

Step 2: POS Tagging Assign POS tags to each word in the sentence:

 "The": Determiner (DT)

 "cat": Noun (NN)

 "chased": Verb (VBD)

 "the": Determiner (DT)

 "mouse": Noun (NN)

So, the POS-tagged sentence becomes: "DT NN VBD DT NN"

In this example, each word in the sentence has been assigned a POS tag based on its grammatical
role. Here's what each POS tag represents:

 DT: Determiner (e.g., "the", "a")

 NN: Noun (e.g., "cat", "mouse")

 VBD: Past tense verb (e.g., "chased")


The POS tags provide information about the syntactic and grammatical structure of the sentence,
which is essential for various natural language processing tasks, such as parsing, machine
translation, and more.

Keep in mind that while this example is straightforward, POS tagging can become more complex
in cases where words have multiple possible POS tags due to their context. Advanced models use
statistical and machine learning techniques to predict the most likely POS tag for each word
based on the surrounding words and linguistic patterns.
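
In practice, an off-the-shelf tagger can be used instead of hand-written rules. A brief sketch with
NLTK, assuming the tokenizer and tagger resources have already been downloaded (resource names
vary slightly across NLTK versions); the exact tags depend on the tagger model:

import nltk

# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The cat chased the mouse.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('cat', 'NN'), ('chased', 'VBD'), ('the', 'DT'), ('mouse', 'NN'), ('.', '.')]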

Morphology

Morphology in linguistics refers to the study of the internal structure of words, including how
words are formed and how they can be modified to convey different meanings. Morphemes are
the smallest units of meaning within a word, and understanding morphology helps us analyze
how words are built and how their forms change based on grammatical and semantic factors.

Let's explore morphology with a simple example:

Example Word: "Unhappily"

Step 1: Identify Morphemes Break down the word into its constituent morphemes:

 "Un-" (prefix)

 "happy" (root or base)

 "-ly" (suffix)

Step 2: Analyze Meanings Understand the meanings of individual morphemes:

 "Un-" is a negative prefix, indicating the opposite or negation of the root word's meaning.

 "happy" is the root word, conveying a positive emotional state.

 "-ly" is an adverbial suffix, typically used to form adverbs.

Step 3: Combine Meanings Combine the meanings of individual morphemes to understand the
overall meaning of the word:

"Unhappily" = "Un-" (not) + "happy" (positive emotion) + "-ly" (adverbial)

So, "unhappily" means "in a not happy manner" or "unpleasantly."

This example showcases how morphemes come together to form complex words with specific
meanings. Morphological analysis helps linguists and NLP systems understand how words are
constructed, and it's important for various language-related tasks, such as language
understanding, language generation, and machine translation.
Named Entity Recognition

Named Entity Recognition (NER) is a natural language processing task that involves identifying
and classifying named entities (such as names of people, organizations, locations, dates, and
more) within text. The goal of NER is to extract structured information from unstructured text by
recognizing and categorizing these entities.

Let's go through an example of NER:

Example Text: "Apple Inc. was founded by Steve Jobs on April 1, 1976, in Cupertino,
California."

Step 1: Tokenization Tokenize the text into individual words: ["Apple", "Inc.", "was",
"founded", "by", "Steve", "Jobs", "on", "April", "1,", "1976,", "in", "Cupertino,", "California."]

Step 2: Named Entity Recognition Identify and classify named entities within the text:

 "Apple Inc.": Organization

 "Steve Jobs": Person

 "April 1, 1976": Date

 "Cupertino, California": Location

So, the NER-tagged text becomes: "Organization was founded by Person on Date, in Location."

In this example, NER has successfully identified and classified the named entities in the text:

 "Apple Inc." is recognized as an Organization.

 "Steve Jobs" is recognized as a Person.

 "April 1, 1976" is recognized as a Date.

 "Cupertino, California" is recognized as a Location.

Named Entity Recognition is crucial for various NLP applications, such as information
extraction, document summarization, question answering, and more. It helps in extracting
structured data from unstructured text and enabling higher-level understanding of the content.
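
A brief sketch of NER with spaCy's pretrained pipeline, assuming the en_core_web_sm model is
installed; the exact spans and labels depend on the model version:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs on April 1, 1976, in Cupertino, California.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. "Apple Inc." ORG, "Steve Jobs" PERSON, "April 1, 1976" DATE, "Cupertino" GPE, "California" GPE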

A Hidden Markov Model (HMM)

A Hidden Markov Model (HMM) is a statistical model used to represent and analyze sequences
of observations where the underlying process generating the observations is assumed to be a
Markov process with hidden states. HMMs are widely used in various applications, including
natural language processing (NLP), speech recognition, bioinformatics, and more.

Here's a simple overview of HMMs and an NLP-related example:

Components of an HMM:

1. States: These are the hidden states of the model, representing the underlying processes.
Each state emits an observation based on a certain probability distribution.

2. Observations: These are the observable outputs associated with each state. Observations
provide information about the hidden state sequence.

3. Transition Probabilities: These are the probabilities of moving from one state to another
in a sequence.

4. Emission Probabilities: These are the probabilities of generating a specific observation
from a given state.

Example: Part-of-Speech Tagging using HMM

Let's consider a simple example of using an HMM for part-of-speech tagging. In this scenario,
the hidden states represent different parts of speech (Noun, Verb, Adjective, etc.), and the
observations are individual words.

Step 1: Building the HMM

 States (Hidden States): Noun (N), Verb (V), Adjective (A)

 Observations (Words): "cat," "chased," "the," "mouse"

 Transition Probabilities:

 P(N | N) = 0.4 (Probability of transitioning from Noun to Noun)

 P(V | N) = 0.3 (Probability of transitioning from Noun to Verb)

 P(A | V) = 0.2 (Probability of transitioning from Verb to Adjective)

 Emission Probabilities:

 P("cat" | N) = 0.6 (Probability of observing "cat" given the state Noun)

 P("chased" | V) = 0.8 (Probability of observing "chased" given the state Verb)

 P("the" | A) = 0.5 (Probability of observing "the" given the state Adjective)

 P("mouse" | N) = 0.4 (Probability of observing "mouse" given the state Noun)


Step 2: Inference

Given the sequence of observations "The cat chased the mouse," we want to determine the most
likely sequence of hidden states (parts of speech).

Using the Viterbi algorithm, we can compute the most likely sequence of states:

1. Start with the initial probabilities for each state.

2. For each word in the observation sequence, compute the probabilities of transitioning
from the previous states to the current state and emitting the current observation.

3. Choose the state with the highest probability at each step.

Step 3: Results

For the observation sequence "The cat chased the mouse," the most likely sequence of hidden
states could be:

 Noun (N), Noun (N), Verb (V), Adjective (A), Noun (N)

This sequence represents the estimated sequence of parts of speech that generated the given
observation sequence.

This simple example demonstrates the basic mechanics of how Hidden Markov Models can be
used for sequence analysis tasks like part-of-speech tagging in natural language processing.
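
A compact sketch of the Viterbi algorithm for a toy HMM of this kind; the probability tables below
are illustrative placeholders (they include a determiner state) rather than the exact numbers from
the example above:

def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s] = probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p].get(s, 0.0) * emit_p[s].get(observations[t], 0.0), p)
                for p in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # Trace back the most likely state sequence.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

states = ["N", "V", "D"]   # noun, verb, determiner
start_p = {"N": 0.3, "V": 0.1, "D": 0.6}
trans_p = {"N": {"N": 0.2, "V": 0.6, "D": 0.2},
           "V": {"N": 0.3, "V": 0.1, "D": 0.6},
           "D": {"N": 0.9, "V": 0.05, "D": 0.05}}
emit_p = {"N": {"cat": 0.5, "mouse": 0.5},
          "V": {"chased": 1.0},
          "D": {"the": 1.0}}

print(viterbi(["the", "cat", "chased", "the", "mouse"], states, start_p, trans_p, emit_p))
# ['D', 'N', 'V', 'D', 'N']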

Let us calculate the emission and transition probabilities for the set of sentences below.

 Mary Jane can see Will

 Spot will see Mary

 Will Jane spot Mary?

 Mary will pat Spot

Note that Mary, Jane, Spot, and Will are all names.

In the above sentences, the word Mary appears four times as a noun. To calculate the emission
probabilities, let us create a counting table.

Words Noun Model Verb

Mary 4 0 0

Jane 2 0 0

Will 1 3 0
Spot 2 0 1

Can 0 1 0

See 0 0 2

pat 0 0 1

Now let us divide each column by the total number of appearances of the corresponding tag. For
example, 'noun' appears nine times in the above sentences, so divide each term in the noun column
by 9. We get the following table after this operation.

Words Noun Model Verb

Mary 4/9 0 0

Jane 2/9 0 0

Will 1/9 3/4 0

Spot 2/9 0 1/4

Can 0 1/4 0

See 0 0 2/4

pat 0 0 1/4

From the above table, we infer that

The probability that Mary is Noun = 4/9


The probability that Mary is Model = 0

The probability that Will is Noun = 1/9

The probability that Will is Model = 3/4

In a similar manner, you can figure out the rest of the probabilities. These are the emission
probabilities.

Next, we have to calculate the transition probabilities, so we define two more tags, <S> and <E>.
<S> is placed at the beginning of each sentence and <E> at the end.

Let us again create a table and fill it with the co-occurrence counts of the tags.

N M V <E>
<S> 3 1 0 0

N 1 3 1 4

M 1 0 3 0

V 4 0 0 0

In the above table, we can see that the <S> tag is followed by the N tag three times, thus the
first entry is 3. The Model tag follows <S> just once, thus the second entry is 1. In a similar
manner, the rest of the table is filled.

Next, we divide each term in a row of the table by the total number of co-occurrences of the tag
in consideration, for example, The Model tag is followed by any other tag four times as shown
below, thus we divide each element in the third row by four.
N M V <E>

<S> 3/4 1/4 0 0

N 1/9 3/9 1/9 4/9

M 1/4 0 3/4 0

V 4/4 0 0 0
These are the respective transition probabilities for the above four sentences. Now how does the
HMM determine the appropriate sequence of tags for a particular sentence from the above
tables? Let us find it out.

Take a new sentence and tag them with wrong tags. Let the sentence, ‘ Will can spot Mary’ be
tagged as-

 Will as a model
 Can as a verb
 Spot as a noun
 Mary as a noun
Now calculate the probability of this sequence being correct in the ffollowing
ollowing manner.

The probability that the tag Model (M) comes after the tag <S> is 1/4, as seen in the table. Also,
the probability that the word Will is a Model is 3/4. In the same manner, we calculate each and
every probability in the sequence. The product of these probabilities is the likelihood that this
sequence is right. Since the tags are not correct, the product is zero:

1/4 * 3/4 * 3/4 * 0 * 1 * 2/9 * 1/9 * 4/9 * 4/9 = 0

When these words are correctly tagged, we get a probability greater than zero. Calculating the
product of these terms we get:

3/4 * 1/9 * 3/9 * 1/4 * 3/4 * 1/4 * 1 * 4/9 * 4/9 = 0.00025720164

For our example, keeping in consideration just the three POS tags we have mentioned, 81 different
combinations of tags can be formed. In this case, calculating the probabilities of all 81
combinations seems achievable. But when the task is to tag a larger sentence and all the POS tags
in the Penn Treebank project are taken into consideration, the number of possible combinations
grows exponentially and this task seems impossible to achieve. Now let us visualize these 81
combinations as paths and, using the transition and emission probabilities, mark each vertex and
edge.

The next step is to delete all the vertices and edges with probability zero; the vertices which do
not lead to the endpoint are also removed.

Now there are only two paths that lead to the end. Let us calculate the probability associated with
each path.

<S>→N→M→N→N→<E> = 3/4 * 1/9 * 3/9 * 1/4 * 1/4 * 2/9 * 1/9 * 4/9 * 4/9 = 0.00000846754

<S>→N→M→N→V→<E> = 3/4 * 1/9 * 3/9 * 1/4 * 3/4 * 1/4 * 1 * 4/9 * 4/9 = 0.00025720164

Clearly, the probability of the second sequence is much higher and hence the HMM is going to
tag each word in the sentence according to this sequence.
UNIT 3 Words and Word Forms

In natural language processing (NLP), understanding the concepts of words and word forms is
essential for tasks like text analysis, language modeling, and machine learning. Let's break down
these concepts:

Words: A word is a basic unit of language that typically represents a single, meaningful unit of
speech or writing. Words are the building blocks of sentences and convey specific meanings. In
most languages, words are separated by spaces or punctuation marks.

For example, in the sentence "The cat chased the mouse," the words are "The," "cat," "chased,"
"the," and "mouse."

Word Forms: Word forms refer to different grammatical variations of a single word, which can
include inflections, tenses, plurals, etc. A single word can have multiple word forms based on its
grammatical role in a sentence.

For example, consider the word "run." Its different word forms can include "runs" (present tense,
third person singular), "running" (present participle), "ran" (past tense), and "runner" (noun
form).

In NLP, dealing with word forms is important for various tasks:

1. Text Normalization: Converting different word forms into their base or canonical form,
also known as lemmatization. For example, converting "running" and "runs" to "run."

2. Morphological Analysis: Analyzing the structure and formation of words, including
their prefixes, suffixes, and inflections.

3. POS Tagging: Identifying the part of speech of a word form, such as noun, verb,
adjective, etc.

4. Information Retrieval: Expanding queries to include related word forms to improve
search results.

5. Language Modeling: Taking into account different word forms to create more accurate
language models.

6. Machine Translation: Handling the translation of words with different forms in the
source and target languages.
Understanding both words and their various forms is crucial for developing robust NLP systems
that can handle the complexities of natural language.

Context-Free Grammars (CFGs) are formal grammars used to describe the syntax of
languages in linguistics and natural language processing. CFGs consist of a set of production
rules that define how sentences in a language can be generated by combining different syntactic
elements, such as nouns, verbs, and phrases. They are commonly used to model the hierarchical
structure of sentences in natural languages like English.

Here's a simple example of a CFG for English:

Grammar Rules:

1. S -> NP VP (A sentence consists of a noun phrase followed by a verb phrase.)

2. NP -> Det N (A noun phrase consists of a determiner followed by a noun.)

3. VP -> V NP (A verb phrase consists of a verb followed by a noun phrase.)

4. Det -> "The" | "A" (Determiners)

5. N -> "cat" | "dog" (Nouns)

6. V -> "chased" | "ate" (Verbs)

Using these rules, we can generate sentences like:

 "The cat chased the dog."

 "A dog chased a cat."

 "A cat ate the dog."

Parsing using the CFG: Let's parse the sentence "The cat chased a dog." using the CFG:

1. Start with the S rule: S -> NP VP

2. Apply the NP rule: NP -> Det N

3. Apply the Det rule: Det -> "The"


4. Apply the N rule: N -> "cat"

5. Apply the VP rule: VP -> V NP

6. Apply the V rule: V -> "chased"

7. Apply the NP rule: NP -> Det N

8. Apply the Det rule: Det -> "a"

9. Apply the N rule: N -> "dog"

The sentence "The cat chased a dog." has been successfully parsed using the given CFG.

Keep in mind that this is a simplified example, and natural languages like English are much more
complex. CFGs can model basic sentence structures, but they have limitations in capturing all the
intricacies of human language. More advanced formalisms, like dependency grammars and
probabilistic models, are used to handle the complexities of natural language syntax.
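
The toy grammar above can be written down and run with NLTK's chart parser; a brief sketch (the
determiners are duplicated in lowercase here so that the example sentence containing "a" parses):

import nltk

grammar = nltk.CFG.fromstring("""
  S -> NP VP
  NP -> Det N
  VP -> V NP
  Det -> 'The' | 'A' | 'the' | 'a'
  N -> 'cat' | 'dog'
  V -> 'chased' | 'ate'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(["The", "cat", "chased", "a", "dog"]):
    print(tree)
# e.g. (S (NP (Det The) (N cat)) (VP (V chased) (NP (Det a) (N dog))))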

Lexicalized Parsing and Probabilistic Parsing

Lexicalized Parsing and Probabilistic Parsing are more advanced approaches in natural
language processing that aim to improve the accuracy and realism of syntactic analysis. They
address some of the limitations of context-free grammars (CFGs) by incorporating more
linguistic and statistical information.

Lexicalized Parsing:

In lexicalized parsing, the grammatical rules take into account the specific words in the sentence,
not just the syntactic categories. This means that the behavior of a rule can vary based on the
words it combines. Lexicalized parsers associate each rule with a specific lexical item (word),
allowing for more fine-grained and accurate parsing.

Example: Consider the sentence "Time flies like an arrow." In a lexicalized parser, the verb
"flies" would have different parsing behavior than the noun "flies," even though they have the
same surface form.

Probabilistic Parsing:

Probabilistic parsing adds probabilities to the rules and productions of the grammar. It allows the
parser to select the most likely parse for a given sentence based on the probabilities assigned to
different rules and productions. This is particularly useful for disambiguating between multiple
valid parses of a sentence.

Example: In the sentence "I saw the man with the telescope," probabilistic parsing can help the
parser choose between "I saw the man using the telescope" and "I saw the man who had the
telescope."

Combining Lexicalized and Probabilistic Parsing:

Lexicalized parsing and probabilistic parsing can be combined to create more accurate and
linguistically realistic parsers. By using lexical information and probabilities, these parsers can
capture the nuanced relationships between words and their syntactic contexts.

Example: In the garden-path sentence "The old man the boats," a lexicalized probabilistic parser
can recognize that "man" is being used here as a verb and "the old" as a noun phrase (the old
people), leading to the correct reading: "The old [people] man (crew) the boats."

Overall, lexicalized and probabilistic parsing techniques enhance the accuracy of syntactic
analysis by considering the specific words in the sentence and incorporating statistical
information to make more informed parsing decisions.

Semantic Analysis

Word Vectors and Semantic Similarity:

In modern NLP, words are often represented as vectors in a high-dimensional space, where
words with similar meanings are closer to each other in this space. This representation captures
semantic relationships between words.

Let's consider a simplified example using word vectors and semantic similarity scores:

Suppose we have a set of words represented as word vectors in a 2-dimensional space. Each
vector represents the semantic meaning of a word:

 Vector("car") = [0.8, 0.6]

 Vector("fast") = [0.7, 0.9]

 Vector("vehicle") = [0.5, 0.4]

We can calculate the cosine similarity between vectors to measure their semantic similarity:
Cosine Similarity:

Cosine similarity between vectors A and B is calculated as:

Cosine Similarity(A, B) = (A ⋅ B) / (||A|| * ||B||)

Where:

 A ⋅ B is the dot product of vectors A and B.

 ||A|| and ||B|| are the Euclidean norms (magnitudes) of vectors A and B.

Let's calculate the cosine similarity between the word vectors:

 Cosine Similarity("car", "fast") = ([0.8, 0.6] ⋅ [0.7, 0.9]) / (√(0.8^2 + 0.6^2) * √(0.7^2 +
0.9^2)) ≈ 0.995

 Cosine Similarity("car", "vehicle") = ([0.8, 0.6] ⋅ [0.5, 0.4]) / (√(0.8^2 + 0.6^2) * √(0.5^2
+ 0.4^2)) ≈ 0.983

 Cosine Similarity("fast", "vehicle") = ([0.7, 0.9] ⋅ [0.5, 0.4]) / (√(0.7^2 + 0.9^2) * √(0.5^2
+ 0.4^2)) ≈ 0.966

Interpretation:

In this example, we've calculated the cosine similarity between word vectors representing "car,"
"fast," and "vehicle." The higher the cosine similarity, the more semantically similar the words
are. With these toy vectors, "car" and "vehicle" point in almost exactly the same direction
(similarity ≈ 1), "fast" and "vehicle" come next, and "car" and "fast" are the least similar of the
three, although all the values are high in such a low-dimensional example.

This simple mathematical example illustrates how word vectors and cosine similarity can be
used for semantic analysis to measure the similarity of words' meanings. In real NLP
applications, more advanced models and larger vector spaces are used to capture richer semantic
relationships.
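
The calculations above can be reproduced in a few lines of plain Python (no NLP library needed):

import math

vectors = {
    "car": [0.8, 0.6],
    "fast": [0.7, 0.9],
    "vehicle": [0.5, 0.4],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine(vectors["car"], vectors["fast"]), 3))      # 0.965
print(round(cosine(vectors["car"], vectors["vehicle"]), 3))   # 1.0 (about 0.9995)
print(round(cosine(vectors["fast"], vectors["vehicle"]), 3))  # 0.973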
WordNet

WordNet is a lexical database designed for natural language processing (NLP) and linguistic
research. It organizes words into synsets (sets of synonyms) and provides various lexical and
semantic relations between words. WordNet has been widely used in tasks like word sense
disambiguation, semantic similarity calculation, and information retrieval.

Some common lexical relations in WordNet include:

1. Synonymy: Words that have similar meanings are grouped together in synsets. For
example, the synset for the word "car" includes synonyms like "automobile," "auto,"
and "motorcar."

2. Hyponymy/Hypernymy: A hyponym is a more specific word that is a type of a broader
category known as a hypernym. For instance, "rose" is a hyponym of the hypernym
"flower."

3. Meronymy/Holonymy: Meronyms are parts of a whole, while holonyms refer to the
whole itself. For example, "wheel" is a meronym of "car," and "car" is a holonym of
"wheel."

4. Antonymy: Words that have opposite meanings are considered antonyms. For instance,
"happy" and "sad" are antonyms.

5. Entailment: This relation represents a situation where one action implies another action.
For example, "sleep" entails "rest."

6. Attribute: The attribute relation connects a noun to its adjective form, indicating a
characteristic of the noun. For example, "sweet" is an attribute of "cake."

7. Similarity: Words that are related in meaning but not exact synonyms are connected
through the similarity relation. For instance, "car" and "vehicle" are similar words.

Here's an example using the word "computer" and some of its relations in WordNet:

Word: "computer"

 Synonyms (Synonymy): computing machine, data processor, electronic computer, etc.

 Hypernym (Hypernymy): machine


 Hyponyms (Hyponymy): laptop, desktop, server, mainframe, etc.

 Meronyms (Meronymy): keyboard, monitor, CPU, mouse, etc.

 Holonyms (Holonymy): office, workstation (a computer is part of these)

 Antonyms (Antonymy): manual typewriter

These relations and examples demonstrate how WordNet captures various lexical and semantic
relationships between words, making it a valuable resource for NLP applications and linguistic
studies. Keep in mind that the examples and relations provided here are simplified for illustration
purposes. WordNet contains a more extensive network of words and relations.
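
WordNet can be queried programmatically through NLTK's wordnet corpus reader, assuming the
WordNet data has been downloaded with nltk.download("wordnet"); a brief sketch:

from nltk.corpus import wordnet as wn

# First noun synset for "computer"
computer = wn.synsets("computer")[0]

print(computer.lemma_names())        # synonyms, e.g. ['computer', 'computing_machine', ...]
print(computer.hypernyms())          # broader category, e.g. [Synset('machine.n.01')]
print(computer.hyponyms()[:3])       # a few more specific kinds of computer
print(computer.part_meronyms()[:3])  # a few parts listed for a computer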

Bag of words

The "Bag of Words" (BoW) is a fundamental concept in natural language processing (NLP) that
represents a text as an unordered collection of words, ignoring grammar and word order but
considering the frequency of each word. It's a simple and commonly used technique for text
analysis and information retrieval.

Here's a simple example of the Bag of Words representation:

Let's say we have two short sentences:

1. "The cat chased the mouse."

2. "The dog barked at the cat."

To create a bag of words representation, we first need to tokenize each sentence, which means
splitting them into individual words:

Sentence 1 tokens: ["The", "cat", "chased", "the", "mouse."]

Sentence 2 tokens: ["The", "dog", "barked", "at", "the", "cat."]

Next, we create a vocabulary, which is a unique set of words across both sentences:

Vocabulary: ["The", "cat", "chased", "mouse", "dog", "barked", "at"]

Now, for each sentence, we create a vector where each dimension represents a word in the
vocabulary, and the value in each dimension represents the frequency of that word in the
sentence:

BoW representation of Sentence 1: [2, 1, 1, 1, 0, 0, 0]

 "The" appears twice.


 "cat," "chased," "mouse" each appear once.

 "dog," "barked," "at" do not appear.

BoW representation of Sentence 2: [2, 1, 0, 0, 1, 1, 1]

 "The" appears twice (as "The" and "the").

 "cat," "dog," "barked," and "at" each appear once.

 "chased" and "mouse" do not appear.

As you can see, the Bag of Words representation simplifies text to a numerical vector where each
dimension corresponds to a word's frequency in the text. It doesn't consider the order of words or
any semantic meaning, but it can still be useful for tasks like text classification, sentiment
analysis, and information retrieval.

Keep in mind that in practice, BoW might be extended to include techniques like TF-IDF (Term
Frequency-Inverse Document Frequency) to account for the importance of words across a corpus
of documents and to reduce the impact of frequently occurring words that may not carry much
meaning (e.g., "the," "and").
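
A small sketch that builds the two vectors above by hand, lowercasing and stripping punctuation so
that "The" and "the" are counted together (names are illustrative):

import string
from collections import Counter

sentences = [
    "The cat chased the mouse.",
    "The dog barked at the cat.",
]

def tokenize(text):
    return [t.strip(string.punctuation).lower() for t in text.split()]

vocabulary = ["the", "cat", "chased", "mouse", "dog", "barked", "at"]

for sentence in sentences:
    counts = Counter(tokenize(sentence))
    print([counts[word] for word in vocabulary])
# [2, 1, 1, 1, 0, 0, 0]
# [2, 1, 0, 0, 1, 1, 1]

In practice, scikit-learn's CountVectorizer builds the same kind of count matrix over a whole
corpus, and TF-IDF weighting can be layered on top of it.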

Skip-gram

Skip-gram is a popular word embedding technique used in natural language processing (NLP)
and is often used to learn distributed representations of words in a continuous vector space. It's a
part of the Word2Vec family of models and aims to predict the context words (surrounding
words) given a target word. This helps in capturing the semantic relationships between words.

Here's how skip-gram works with a simple example:

Consider the sentence: "The quick brown fox jumps over the lazy dog."

In the skip-gram model, we choose a target word and try to predict the context words around it.
Let's choose "fox" as the target word and set a context window size of 2 (meaning we consider
two words on each side of the target word).

Target word: "fox"

Context words: ["quick", "brown", "jumps", "over"]

The skip-gram model tries to learn a representation for the target word ("fox") in such a way that
it's likely to predict the context words ("quick," "brown," "jumps," "over") given the target.

The training data for this example could look like:

Target Word Context Word

fox quick

fox brown

fox jumps

fox over

The skip-gram model will then update its internal parameters (the word embeddings) during
training to make the predictions better match the actual context words.

After training, the word embeddings can be extracted. These embeddings place words with
similar contexts close to each other in the vector space. This allows for capturing semantic
relationships, such as synonyms, antonyms, and analogies. For example, words like "quick" and
"brown" might end up being close to "fox" in the vector space due to their co-occurrence in
similar contexts.

These learned word embeddings can be used for various NLP tasks like text classification,
sentiment analysis, and machine translation, by leveraging the semantic relationships encoded in
the embeddings.

Keep in mind that the actual training process involves optimization techniques like stochastic
gradient descent and negative sampling to efficiently learn the word embeddings. Also, the
context window size, training data size, and other hyperparameters can impact the quality of the
learned embeddings.
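
A tiny sketch of how (target, context) training pairs are generated with a window of 2, which is
the data a skip-gram model is trained on; the full Word2Vec training loop with negative sampling
is omitted:

sentence = "The quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# Pairs generated for the target word "fox":
print([p for p in pairs if p[0] == "fox"])
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]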

Continuous Bag of Words (CBOW)

Continuous Bag of Words (CBOW) is another word embedding technique from the Word2Vec
family that aims to learn distributed representations of words in a continuous vector space.
Unlike Skip-gram, which predicts context words given a target word, CBOW predicts a target
word based on its surrounding context words. It's also used for capturing semantic relationships
between words.

Here's how CBOW works with a simple example:

Consider the same sentence as before: "The quick brown fox jumps over the lazy dog."

In the CBOW model, we choose a target word and use its surrounding context words to predict
that target word. Let's choose "fox" as the target word and again set a context window size of 2.

Context words: ["quick", "brown", "jumps", "over"]


The CBOW model tries to learn a representation for the target word ("fox") based on the context
words ("quick," "brown," "jumps," "over").

The training data for this example could look like:

Context Words Target Word

quick, brown, jumps, over fox

brown, fox, over, the jumps

The CBOW model will then update its parameters during training to improve its ability to predict
the target word given the context words.

After training, the word embeddings can be extracted. These embeddings aim to place words
with similar contexts close to each other in the vector space, just like Skip-gram. For example,
the embeddings might place words like "quick," "brown," and "jumps" close to each other due to
their co-occurrence in similar contexts.

These learned word embeddings can be used for various NLP tasks, similarly to the embeddings
learned through Skip-gram.

In summary, while Skip-gram predicts context words given a target word, CBOW predicts a
target word given its context words. Both techniques aim to capture the semantic relationships
between words in a continuous vector space, which can be utilized for various NLP tasks. The
choice between Skip-gram and CBOW often depends on the size of the dataset and the specific
task at hand.
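
Both variants are available in gensim's Word2Vec implementation (sg=1 selects skip-gram, sg=0
selects CBOW). A brief sketch, assuming gensim 4.x is installed; the single toy sentence is only
to show the API, since meaningful vectors require a large corpus:

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["fox"][:5])               # first few dimensions of the CBOW vector for "fox"
print(skipgram.wv.most_similar("fox"))  # nearest neighbours under the skip-gram model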

Embedding representations for words Lexical Semantics

Embedding representations for words play a crucial role in capturing lexical semantics, which
refers to the meaning relationships between words in a language. These representations allow
words to be mapped into a continuous vector space, where similar words are close to each other,
and their geometric relationships can reflect their semantic similarities and relationships. Here's
how embedding representations contribute to capturing lexical semantics:

1. Semantic Relationships: Word embeddings can capture various semantic relationships
between words, such as synonyms, antonyms, hypernyms (superordinate terms),
hyponyms (subordinate terms), and meronyms (part-whole relationships). Words that
share these relationships tend to have similar vector representations in the embedding
space.

2. Analogies: Word embeddings can capture analogical relationships like "king - man +
woman = queen." By performing vector arithmetic in the embedding space, these
relationships can be expressed mathematically and translated into meaningful semantic
relationships.

3. Semantic Similarity: Similar words tend to have similar embeddings. Words with similar
meanings, even if they are not synonyms, are likely to be positioned close to each other in
the embedding space. This similarity can be quantified using metrics like cosine
similarity.

4. Contextual Information: Some word embeddings, like those learned by Word2Vec
models (Skip-gram and CBOW), consider the co-occurrence patterns of words in a given
context. This means that words appearing in similar contexts tend to have similar
embeddings, capturing contextual semantics.

5. Polysemy and Word Senses: Words with multiple meanings (polysemous words) can
have distinct embeddings for different senses. This enables models to disambiguate word
senses based on context, aiding in tasks like word sense disambiguation.

6. Rare and Out-of-Vocabulary Words: Subword-based embeddings such as FastText can
build representations even for rare words or words not seen during training by
composing them from character n-grams. This helps generalize to words not present in
the training data.

7. Downstream NLP Tasks: Word embeddings serve as powerful features for downstream
NLP tasks like sentiment analysis, machine translation, text classification, and more.
Models can leverage the captured lexical semantics to improve their performance on
these tasks.

It's important to note that while word embeddings capture various aspects of lexical semantics,
they might not always capture very fine-grained nuances of meaning or cultural context.
Additionally, models should be chosen and evaluated based on their ability to capture specific
semantic relationships and perform well on the intended downstream tasks.

Word2Vec, GloVe, FastText, and contextual embeddings like BERT and GPT are examples of
techniques that contribute to capturing lexical semantics by providing meaningful and distributed
word representations.

Consider the words "king," "queen," "man," and "woman." We'll use a hypothetical two-
dimensional embedding space for simplicity, although in practice, embeddings are typically in
much higher-dimensional spaces.

Let's say our embeddings are arranged like this:

 "king" is represented as [0.9, 0.7]

 "queen" is represented as [0.8, 0.6]


 "man" is represented as [0.75, 0.8]

 "woman" is represented as [0.7, 0.85]

In this example, we can see that "king" and "queen" have embeddings that are relatively close to
each other in the vector space. Similarly, "man" and "woman" are also close to each other.

Now, let's explore the semantic relationships these embeddings capture:

1. Semantic Similarity: The cosine similarity between the embeddings of "king" and
"queen" is relatively high, indicating that they are similar in meaning. Similarly, "man"
and "woman" have a high cosine similarity.

2. Analogies: We can perform vector arithmetic to capture analogical relationships. For instance, to answer "man is to king as woman is to what?", we can calculate:

vector("king") − vector("man") + vector("woman") ≈ vector("queen")

If we perform this calculation using the embeddings above, the result lands close to the embedding of "queen" (a short code sketch at the end of this example works through the arithmetic).

3. Gender Relationship: We can observe that the vector from "king" to "man" is similar to
the vector from "queen" to "woman," indicating that the embeddings capture gender
relationships.

4. Hypernym-Hyponym Relationship: While we don't have explicit hypernym-hyponym relationships in this simple example, in a more extended embedding space, "king" and "queen" might be closer to the hypernym "royalty."

5. Semantic Clusters: The embeddings place similar words close to each other. In this case,
"king" and "queen" are close, and "man" and "woman" are close, forming semantic
clusters.

These simple embeddings demonstrate how the positions of words in the vector space can reflect
their semantic relationships. In practice, embeddings are learned from large corpora using
techniques like Word2Vec, GloVe, or contextual models like BERT. The learned embeddings
are much higher-dimensional and capture a more nuanced understanding of lexical semantics,
enabling them to be used effectively in various NLP tasks.
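As a minimal sketch of these ideas, the following Python/NumPy snippet reuses the hypothetical 2-D vectors from the example above to compute cosine similarities and the king − man + woman analogy. The vector values are illustrative only, and, as is standard practice, the input words are excluded from the analogy candidates.

import numpy as np

# Hypothetical 2-D embeddings from the example above (real embeddings
# have hundreds of dimensions; these values are purely illustrative).
emb = {
    "king":  np.array([0.9, 0.7]),
    "queen": np.array([0.8, 0.6]),
    "man":   np.array([0.75, 0.8]),
    "woman": np.array([0.7, 0.85]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantic similarity between related word pairs
print("sim(king, queen) =", round(cosine(emb["king"], emb["queen"]), 3))
print("sim(man, woman)  =", round(cosine(emb["man"], emb["woman"]), 3))

# Analogy: king - man + woman ≈ ?
target = emb["king"] - emb["man"] + emb["woman"]
# The input words are excluded from the candidates, as is standard practice.
candidates = {w: v for w, v in emb.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print("king - man + woman ≈", best)   # -> queen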

Word Sense Disambiguation (WSD)

Word Sense Disambiguation (WSD) is a natural language processing task that involves
determining the correct meaning or sense of a word in a given context. Many words in natural
language have multiple senses, and the intended sense of a word can vary based on the
surrounding words or the overall context of the sentence. WSD is important for improving the
accuracy of various NLP applications like machine translation, information retrieval, and text
summarization.

Here's a simple example of Word Sense Disambiguation:

Sentence: "I saw a bat hanging upside down in the cave."

In this sentence, the word "bat" has two main senses:

1. A flying mammal (animal)

2. A piece of sports equipment used in baseball (object)

The context of the sentence is crucial in determining which sense of "bat" is being referred to.

Possible Disambiguated Senses:

1. "I saw a [flying mammal] hanging upside down in the cave."

2. "I saw a [piece of sports equipment] hanging upside down in the cave."

WSD models aim to determine the correct sense based on the context. Here's how a simple rule-
based approach might work:

If the word "bat" is surrounded by words related to animals (e.g., "hanging," "cave"), the sense is
likely the flying mammal. If the word "bat" is surrounded by words related to sports (e.g.,
"sports," "equipment"), the sense is likely the sports equipment.

Modern WSD approaches, however, use machine learning techniques and large annotated
datasets to make more accurate sense predictions. These models may utilize features like part-of-
speech tags, surrounding words, and even pre-trained word embeddings to make sense
disambiguation decisions.

WSD is a challenging task, as it requires understanding the subtle contextual clues that
differentiate different word senses. It becomes even more complex when considering words with
a higher number of senses or when the context is ambiguous.

1. Knowledge-Based WSD: Knowledge-based WSD relies on external resources like lexical databases (e.g., WordNet) to disambiguate word senses. It involves mapping words to their senses based on the meanings and relationships present in these resources.

Example: Consider the sentence: "She caught a glimpse of the river bank."

In this sentence, the word "bank" can have multiple senses, such as:

1. Financial institution
2. Sloping land alongside a body of water

Using a knowledge-based approach with WordNet, we can analyze the context of "bank" and
choose the sense that makes more sense in the context. If the surrounding words are related to
nature or geography, we might disambiguate to the "sloping land alongside a body of water"
sense.

2. Supervised WSD: Supervised WSD involves training machine learning models on annotated
datasets, where each word in context is labeled with its correct sense. These models learn to
recognize patterns in the context that correspond to different senses.

Example: Let's say we have a labeled dataset like this:

Sentence                                  | Word to Disambiguate | Correct Sense
She caught a glimpse of the river bank.   | bank                 | Land
He deposited money in the bank.           | bank                 | Financial

Using this dataset, a supervised machine learning model (e.g., a classifier or neural network) can
be trained to recognize the features in the context that indicate the correct sense of the word. The
model might learn that in sentences mentioning "money" or "deposit," the correct sense of
"bank" is "Financial," while in sentences with "river" or "glimpse," the correct sense is "Land."

Supervised WSD requires a significant amount of labeled data, but it can be highly accurate
when trained properly.
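The sketch below trains a tiny classifier on the two labeled sentences from the table above using scikit-learn; with so little data it only illustrates the mechanics, not a realistic supervised WSD system.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labeled dataset from the table above; real supervised WSD needs
# many annotated examples per sense.
sentences = [
    "She caught a glimpse of the river bank.",
    "He deposited money in the bank.",
]
senses = ["Land", "Financial"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(sentences, senses)

print(model.predict(["They sat on the river bank and watched the water."]))  # likely 'Land'
print(model.predict(["They deposited money in the bank yesterday."]))        # likely 'Financial'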

In summary, knowledge-based WSD uses external resources to map words to senses, while
supervised WSD relies on machine learning models trained on annotated data to predict word
senses. Both approaches have their strengths and limitations, and their effectiveness can vary
based on the available resources, the complexity of the language, and the specific application.
Unit 4 Text Analysis, Summarization and Extractions

Text analysis, summarization, and extraction are essential tasks in natural language processing
(NLP) that involve processing and understanding textual data to extract valuable information,
identify key content, and generate concise summaries. Let's explore each of these tasks:

1. Text Analysis: Text analysis involves a range of techniques to understand and extract
information from text. This can include tasks such as:

 Tokenization: Splitting text into individual words or tokens.

 Part-of-Speech Tagging: Assigning grammatical parts of speech to each word.

 Named Entity Recognition (NER): Identifying named entities like names, dates,
locations, etc.

 Sentiment Analysis: Determining the sentiment or emotion expressed in a piece of text (positive, negative, neutral).

 Topic Modeling: Discovering the main topics or themes present in a collection of documents.

 Dependency Parsing: Analyzing the grammatical structure of sentences to identify relationships between words.

 Syntax and Grammar Analysis: Analyzing the grammatical structure and syntax of
sentences.

2. Summarization: Text summarization involves creating concise and coherent summaries of longer text documents. There are two main types of summarization:

 Extractive Summarization: In this approach, sentences or phrases are selected directly from the original text to create the summary. These sentences are usually the most important and relevant ones.

 Abstractive Summarization: Abstractive summarization involves generating new sentences that capture the main ideas of the original text but may not appear verbatim in the source text. This approach requires a deeper understanding of the content and often uses natural language generation techniques.

3. Extraction: Text extraction involves pulling out specific pieces of information from text
documents. This can include:

 Entity Extraction: Identifying and extracting entities like names, dates, locations, etc.
 Keyphrase Extraction: Extracting important keywords or phrases that represent the
main themes of a document.

 Information Extraction: Automatically extracting structured information from unstructured text, such as extracting relationships between entities.

These tasks are crucial for various NLP applications such as search engines, information
retrieval, content summarization, content recommendation, and more. Advanced techniques like
deep learning and transformer-based models have significantly improved the performance of
these tasks, allowing for more accurate and sophisticated analysis, summarization, and extraction
of textual information.

Sentiment mining, also known as sentiment analysis, is the process of determining the emotional
tone or sentiment expressed in a piece of text, whether it's positive, negative, or neutral. It's a
common natural language processing (NLP) task used to understand public opinion, customer
feedback, and social media sentiment. Let's look at a simple example of sentiment mining:

Example: Movie Reviews

Consider the following movie review:

"Wow, what an amazing movie! The acting was incredible, and the storyline kept me hooked
from start to finish. I couldn't have asked for a better film."

In this example, the sentiment expressed in the text is clearly positive. Sentiment mining aims to
quantitatively identify and label this sentiment.

Sentiment Categories:

 Positive

 Negative

 Neutral

Sentiment Label:

 Positive

Sentiment mining can involve both rule-based and machine learning approaches. Here's a
simplified explanation of how a machine learning model might work:
1. Data Preparation: A dataset of labeled examples (text and their corresponding
sentiments) is collected and prepared for training. Each example is labeled as positive,
negative, or neutral.

2. Feature Extraction: Text is preprocessed, tokenized, and converted into numerical representations (word embeddings or vectors).

3. Training: A machine learning algorithm, such as a classifier or a neural network, is trained on the prepared data. The algorithm learns patterns in the text that are indicative of different sentiment categories.

4. Prediction: When new text is presented to the trained model, it predicts the sentiment
category based on the patterns it learned during training.

For instance, if you input the review "I hated this movie, it was a waste of time," the sentiment
mining model would likely predict a negative sentiment.
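A minimal scikit-learn sketch of this workflow is shown below. The four training reviews are made up for illustration, and a real model would be trained on thousands of labeled examples.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative training data; a realistic system would use a much
# larger labeled corpus.
reviews = [
    "Wow, what an amazing movie! The acting was incredible.",
    "A fantastic film, I loved every minute of it.",
    "I hated this movie, it was a waste of time.",
    "Terrible acting and a boring plot.",
]
labels = ["positive", "positive", "negative", "negative"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reviews, labels)            # feature extraction + training in one step

print(clf.predict(["I loved the acting, what an amazing plot!"]))  # likely 'positive'
print(clf.predict(["What a boring waste of time."]))               # likely 'negative'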

In practice, sentiment mining can get more complex, especially when dealing with nuanced
sentiments, sarcasm, and the varying degrees of positivity and negativity. Additionally, models
can be trained on large datasets and leverage advanced techniques like transformer-based
architectures (e.g., BERT, GPT) to achieve state-of-the-art performance in sentiment analysis.

Sentiment mining has a wide range of applications, from gauging customer satisfaction and
product reviews to monitoring social media sentiment and analyzing public opinion on various
topics.

Entity linking

Entity linking, also known as named entity disambiguation, is a natural language processing
(NLP) task that involves identifying and linking named entities mentioned in text to their
corresponding real-world entities in a knowledge base or database. The goal is to disambiguate
which specific entity a mention refers to, especially when multiple entities share the same name.

Here's a simple example of entity linking:

Text: "Barack Obama was born in Hawaii."

In this sentence, "Barack Obama" is a named entity, specifically a person's name. The task of
entity linking involves linking this mention to the correct entity in a knowledge base, such as
linking "Barack Obama" to the corresponding entry for the former U.S. President Barack Obama.

Entity Linking Steps:

1. Mention Detection: Identify named entities in the text. In this case, "Barack Obama" is
detected as a named entity.
2. Candidate Generation: Generate a list of possible entities from a knowledge base that
match the detected named entity. For instance, a knowledge base might contain multiple
entries for individuals named "Barack Obama."

3. Disambiguation: Select the correct entity from the list of candidates that best
corresponds to the context of the sentence. This involves considering the context of the
mention, the surrounding words, and any available semantic information.

Entity Linking Example:

Given a knowledge base with entries for two "Barack Obamas":

1. Barack Obama (Former U.S. President)

2. Barack Obama (Artist)

The disambiguation process might analyze the context of the sentence to determine that the
former U.S. President is the relevant entity. This could be based on the fact that the sentence
mentions "born in Hawaii," which aligns with the biography of the former President.

In this example, entity linking helps disambiguate the mention "Barack Obama" and link it to the
correct entity in the knowledge base.
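The following toy sketch mimics the candidate generation and disambiguation steps using simple keyword overlap; the knowledge-base entries and their context keywords are hypothetical, not drawn from a real knowledge base.

# Toy knowledge base: candidate entities with illustrative context keywords.
KB = {
    "Barack Obama (Former U.S. President)": {"president", "hawaii", "politics", "white", "house"},
    "Barack Obama (Artist)": {"artist", "painting", "gallery", "exhibition"},
}

def link(mention, sentence):
    # Score each candidate by how many of its keywords appear in the sentence.
    context = set(sentence.lower().replace(".", "").split())
    scores = {entity: len(keywords & context) for entity, keywords in KB.items()}
    return max(scores, key=scores.get), scores

entity, scores = link("Barack Obama", "Barack Obama was born in Hawaii.")
print(entity)   # -> 'Barack Obama (Former U.S. President)'
print(scores)   # -> the President entry scores 1 (matches "hawaii"), the Artist entry scores 0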

Entity linking has practical applications in information retrieval, question answering systems,
and knowledge graph construction. It allows NLP systems to connect textual information to
structured knowledge, enhancing the understanding and interpretation of text.

Text classification

Text classification is an important natural language processing (NLP) task that involves
assigning predefined labels or categories to text documents based on their content. It's commonly
used for tasks like sentiment analysis, topic categorization, spam detection, and more. Here's a
simple example of text classification:

Task: Sentiment Analysis

Example Text: "I absolutely loved the movie! The acting was fantastic, and the plot kept me
engaged the entire time."

Possible Labels: Positive, Negative

In this example, the task is to determine whether the sentiment expressed in the text is positive or
negative.

Text Classification Steps:


1. Data Preparation: Gather a labeled dataset containing text examples and their
corresponding labels. Each example should be assigned to one of the predefined
categories.

2. Feature Extraction: Convert the text data into a numerical format that machine learning
models can understand. Common methods include word embeddings, TF-IDF (Term
Frequency-Inverse Document Frequency), or even bag-of-words representations.

3. Model Selection: Choose a suitable machine learning algorithm for text classification.
Common choices include logistic regression, support vector machines, and various types
of neural networks.

4. Training: Train the selected model on the labeled dataset. The model learns to recognize
patterns in the text that are indicative of different categories.

5. Prediction: When presented with new, unlabeled text, the trained model predicts the
category or label that best fits the content.

In the sentiment analysis example:

 If the model predicts "Positive," it means the sentiment expressed in the text is positive.

 If the model predicts "Negative," it means the sentiment expressed in the text is negative.

In practice, text classification can involve more complex scenarios with multiple labels,
imbalanced data, and various challenges related to the content and context of the text. Advanced
techniques, such as deep learning and transformer-based models, have significantly improved
text classification performance, enabling models to learn intricate patterns and nuances in text
data.

Text classification has a wide range of applications, including customer feedback analysis,
content recommendation, email categorization, and more, making it a fundamental task in NLP.

Text Classification and Content Recommendation

Text Classification and Content Recommendation are two important tasks in natural language
processing (NLP) that have significant applications in various domains. Let's take a closer look
at each of these tasks:

1. Text Classification: Text classification involves categorizing text documents into predefined classes or categories based on their content. This task is used to automatically label and organize text data, making it easier to manage and analyze large amounts of textual information. Some common applications of text classification include:
 Sentiment Analysis: Determining the sentiment (positive, negative, neutral) expressed in
a piece of text, such as customer reviews or social media posts.

 Topic Categorization: Assigning topics or themes to documents, which can aid in content
organization and information retrieval.

 Spam Detection: Identifying whether an email or message is spam or legitimate.

 Language Identification: Determining the language in which a given text is written.

 Intent Recognition: Recognizing the user's intent from their input, often used in chatbots
and virtual assistants.

2. Content Recommendation: Content recommendation involves suggesting relevant content to users based on their preferences, historical behavior, and content similarity. This task aims to enhance user engagement and satisfaction by delivering content that matches their interests. Some common applications of content recommendation include:

 News and Article Recommendation: Suggesting news articles or blog posts based on a
user's reading history.

 Product Recommendation: Recommending products to users based on their browsing and purchasing history.

 Movie and Music Recommendation: Suggesting movies, TV shows, or songs based on user preferences and viewing/listening history.

 Social Media Feed Personalization: Customizing a user's social media feed to show content from friends and pages they interact with the most.

 Content Discovery Platforms: Recommending relevant content to users on platforms like YouTube, Spotify, and Netflix.

Both text classification and content recommendation are enabled by machine learning and NLP
techniques. Text classification often involves supervised learning, where models are trained on
labeled data to recognize patterns that differentiate different classes. Content recommendation
involves collaborative filtering, content-based filtering, and hybrid methods that analyze user
preferences and content attributes.

These tasks play a crucial role in enhancing user experiences, improving content discovery, and
making sense of the vast amounts of text data available in various online platforms.

Content Recommendation Example: Movie Genre Classification and Recommendation

Suppose we have a small dataset of movie descriptions and their genres:


1. "An action-packed adventure of a skilled spy trying to save the world." (Genre: Action)

2. "A heartwarming story of friendship and love in a small town." (Genre: Drama)

3. "A hilarious comedy about a group of friends on a road trip." (Genre: Comedy)

User Preference: The user prefers movies of the "Action" genre.

Scoring System: We'll use a simple scoring system based on keyword matches to determine the
relevance of each movie description to the user's preferred genre:

 Action Score = Number of "action" occurrences / Total words in description

 Drama Score = Number of "drama" occurrences / Total words in description

 Comedy Score = Number of "comedy" occurrences / Total words in description

Calculations:

1. For the "Action" genre:

 "An action-packed adventure..." has 2 occurrences of "action" out of 10 total


words. Action Score = 2/10 = 0.2

 "Drama" Score = 0/10 = 0

 "Comedy" Score = 0/10 = 0

2. For the "Drama" genre:

 "An action-packed adventure..." has 0 occurrences of "drama." Drama Score =


0/10 = 0

 "Drama" Score = 1/10 = 0.1

 "Comedy" Score = 0/10 = 0

3. For the "Comedy" genre:

 "An action-packed adventure..." has 0 occurrences of "comedy." Comedy Score =


0/10 = 0

 "Drama" Score = 0/10 = 0

 "Comedy" Score = 2/10 = 0.2

Recommendation: Based on the scores, we recommend the movie with the highest score for the
user's preferred genre:
 User Prefers "Action": The movie "An action-packed adventure..." with an Action Score
of 0.2 is recommended.

In this simple mathematical example, we've used keyword-based scoring to determine the
relevance of each movie description to the user's preferred genre. In practice, more advanced
techniques, including machine learning models and collaborative filtering, are used to provide
personalized and accurate content recommendations.
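A short Python sketch of this keyword-based scoring is shown below; it counts keyword occurrences as substrings (so "action-packed" counts for "action"), matching the calculations above.

descriptions = [
    "An action-packed adventure of a skilled spy trying to save the world.",
    "A heartwarming story of friendship and love in a small town.",
    "A hilarious comedy about a group of friends on a road trip.",
]
genres = ["action", "drama", "comedy"]

def genre_scores(text):
    # Keyword occurrences (counted as substrings) divided by word count, per genre.
    lowered = text.lower()
    n_words = len(lowered.split())
    return {g: lowered.count(g) / n_words for g in genres}

preferred = "action"
best = max(descriptions, key=lambda d: genre_scores(d)[preferred])
print(genre_scores(best))    # -> {'action': 0.083..., 'drama': 0.0, 'comedy': 0.0}
print("Recommended:", best)  # the action-packed description scores highest for 'action'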

1. Latent Dirichlet Allocation (LDA):

LDA is a probabilistic model that assumes each document in a corpus is a mixture of a small
number of topics, and each word's occurrence in a document is attributable to one of the
document's topics. It's often used for topic modeling, where the goal is to uncover the underlying
themes within a collection of documents.

Let's say we have a collection of news articles, and we want to identify the main topics present in
these articles using LDA.

Example:

Documents:

1. "Scientists discover a new species in the Amazon rainforest."

2. "Political leaders discuss climate change in summit."

3. "Technology companies unveil new gadgets at expo."

Using LDA, we might discover topics like:

 Topic 1: Ecology and Nature

 Topic 2: Politics and Climate Change

 Topic 3: Technology and Innovation

LDA assigns a probability distribution of topics to each document and a probability distribution
of words to each topic. This allows us to interpret which topics are present in a document and
which words are representative of each topic.

1. Latent Dirichlet Allocation (LDA):

LDA involves complex probabilistic modeling, but I'll provide a simplified version of the
mathematical representation.

Assumptions:
 There are K topics.

 Each document is a mixture of these K topics.

 Each word in a document is attributed to one of these K topics.

Let's represent the generative process of LDA using mathematical notation:

 Documents are denoted as D, each containing a collection of N words.

 Topics are denoted as K.

 Words are denoted as W.

Parameters:

 θ represents the document-topic distribution.

 φ represents the topic-word distribution.

The generative process for a document d with words w:

1. For each topic k, sample a topic proportion θ_dk from the document's topic distribution.

2. For each word position n in the document: a. Sample a topic assignment z_dn from the
document's topic proportions θ_dk. b. Sample a word w_dn from the topic's word
distribution φ_zdn.

Mathematical representation:

 θ_dk ~ Dir(α), where α is the parameter controlling the document-topic distribution.

 φ_kw ~ Dir(β), where β is the parameter controlling the topic-word distribution.

 z_dn ~ Multinomial(θ_d), which assigns a topic to the word.

 w_dn ~ Multinomial(φ_zdn), which selects a word based on the topic.
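As an illustration (not part of the formulation above), the snippet below fits an LDA model to the three example news snippets with scikit-learn. With such a tiny corpus the topics are unstable, so this only demonstrates the API and the document-topic (θ) and topic-word (φ) outputs.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# The three example news snippets from above; real topic models need
# far larger corpora to produce stable topics.
docs = [
    "Scientists discover a new species in the Amazon rainforest.",
    "Political leaders discuss climate change in summit.",
    "Technology companies unveil new gadgets at expo.",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)            # document-topic distribution (theta)
terms = vectorizer.get_feature_names_out()

for k, topic in enumerate(lda.components_): # topic-word weights (unnormalized phi)
    top = [terms[i] for i in topic.argsort()[-3:][::-1]]
    print(f"Topic {k}: {top}")
print(doc_topics.round(2))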

2. Matrix Factorization:

Matrix Factorization is a technique used to decompose a matrix into a product of multiple matrices, which can help in reducing the dimensions of the original data while capturing important patterns. In NLP, it's often used for tasks like recommendation systems or collaborative filtering.

Let's say we have a user-item matrix where rows represent users, columns represent items, and
the cells contain ratings given by users to items. We can apply matrix factorization to find latent
features that explain the ratings.
Example:

User-Item Matrix:

        Item1   Item2   Item3
User1     5       3       -
User2     -       4       1
User3     2       -       5

By applying matrix factorization, we can decompose this matrix into two matrices: one
representing users' preferences and another representing items' features. These matrices can then
be multiplied to reconstruct the original matrix and fill in the missing values:

User-Preference Matrix:

        LatentFeature1   LatentFeature2   LatentFeature3
User1        0.9              0.6              0.3
User2        0.2              0.7              0.5
User3        0.8              0.4              0.6

Item-Feature Matrix:

        LatentFeature1   LatentFeature2   LatentFeature3
Item1        0.7              0.4              0.5
Item2        0.6              0.9              0.2
Item3        0.4              0.7              0.8

Matrix factorization enables us to make predictions for missing ratings and understand latent
patterns in user-item interactions.

Both LDA and Matrix Factorization are powerful techniques for extracting meaningful insights
from text data in NLP tasks. They are widely used in various applications to uncover hidden
structures and patterns within textual information.

Let's consider a simple example of matrix factorization using a user-item rating matrix.

Assumptions:

 There are N users.

 There are M items.

 Ratings are given in a matrix R, where R[i][j] represents the rating given by user i to item
j.

We want to factorize the matrix R into two lower-dimensional matrices U (user-latent feature)
and V (item-latent feature), such that R ≈ UV^T.
Mathematical representation:

 R is the user-item rating matrix of size NxM.

 U is the user-latent feature matrix of size NxK, where K is the number of latent features.

 V is the item-latent feature matrix of size MxK.

The goal is to find matrices U and V such that the reconstruction error is minimized, typically
using methods like gradient descent.

Matrix Factorization equation:

 R ≈ UV^T

The process involves finding latent features that explain the observed ratings by approximating
the original matrix with the product of the two factorized matrices.

Please note that the actual implementations and optimizations for these methods involve more
sophisticated techniques, but these mathematical formulations provide a basic understanding of
how LDA and Matrix Factorization work.
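Below is a minimal NumPy sketch of matrix factorization by gradient descent on the example rating matrix; the learning rate, regularization strength, and number of latent features are arbitrary illustrative choices.

import numpy as np

# User-item rating matrix from the example; 0 marks a missing rating.
R = np.array([
    [5, 3, 0],
    [0, 4, 1],
    [2, 0, 5],
], dtype=float)
mask = R > 0                       # only observed ratings contribute to the loss

K, lr, reg, epochs = 3, 0.01, 0.02, 5000
rng = np.random.default_rng(0)
U = rng.random((3, K))             # user-latent feature matrix (N x K)
V = rng.random((3, K))             # item-latent feature matrix (M x K)

for _ in range(epochs):
    E = mask * (R - U @ V.T)       # reconstruction error on observed entries only
    U += lr * (E @ V - reg * U)    # gradient step for user features
    V += lr * (E.T @ U - reg * V)  # gradient step for item features

print(np.round(U @ V.T, 2))        # reconstructed matrix, including predictions
                                   # for the originally missing cells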

Text summarization is the process of condensing a longer piece of text into a shorter version
while preserving its main ideas and key information. There are generally two types of text
summarization: extractive and abstractive. Extractive summarization involves selecting and
assembling existing sentences from the original text, while abstractive summarization involves
generating new sentences that capture the essence of the original text.

Here, I'll provide a simple example of extractive summarization:

Original Text:

The quick brown fox jumps over the lazy dog. This classic sentence is often used to showcase
fonts and typography. It contains all the letters of the English alphabet. The fox and the dog are
known to be traditional enemies in many folktales.

Extractive Summary:

"The quick brown fox jumps over the lazy dog. It contains all the letters of the English alphabet."

In this example, the extractive summarization algorithm selected the first and third sentences
from the original text to create a condensed version that still conveys the main idea: the
uniqueness of the sentence and its relevance to typography.

Keep in mind that this is a simplified example. Real-world text summarization involves more
advanced techniques, especially in abstractive summarization, where the system generates new
sentences that may not appear verbatim in the original text. These techniques often involve deep
learning models and sophisticated natural language processing approaches.

Original Text:

Sentence 1: The cat chased the mouse.
Sentence 2: The dog barked loudly.
Sentence 3: The mouse escaped into a hole.
Sentence 4: The cat gave up the chase.

For this example, let's consider a basic scoring mechanism based on the number of words in each
sentence. The idea is to select sentences with fewer words, assuming that they are more concise
and contain important information.

Step 1: Scoring Sentences We assign a score to each sentence based on the number of words it
contains:

 Sentence 1: The cat chased the mouse. (5 words) => Score: 5

 Sentence 2: The dog barked loudly. (4 words) => Score: 4

 Sentence 3: The mouse escaped into a hole. (6 words) => Score: 6

 Sentence 4: The cat gave up the chase. (6 words) => Score: 6

Step 2: Selecting Sentences Let's say we want to create a summary with a maximum of 10
words. We start by selecting the sentences with the lowest scores until we reach the word limit:

1. Sentence 2: The dog barked loudly. (4 words) => Selected

2. Sentence 1: The cat chased the mouse. (5 words) => Selected

The total word count of the selected sentences is 9, which is within the limit of 10 words.

Summary:

The dog barked loudly. The cat chased the mouse.
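The short Python sketch below implements the same word-count heuristic with a budget of 10 words and reproduces the selection above.

sentences = [
    "The cat chased the mouse.",
    "The dog barked loudly.",
    "The mouse escaped into a hole.",
    "The cat gave up the chase.",
]

# Score = number of words; shorter sentences are preferred in this toy heuristic.
scored = sorted(sentences, key=lambda s: len(s.split()))

summary, budget = [], 10
for s in scored:
    if sum(len(x.split()) for x in summary) + len(s.split()) <= budget:
        summary.append(s)   # keep adding the shortest remaining sentences that fit

print(" ".join(summary))    # -> "The dog barked loudly. The cat chased the mouse."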

Information Extraction

Information extraction (IE) in NLP involves identifying and extracting structured information
from unstructured text. One common form of information extraction is Named Entity
Recognition (NER), which involves identifying entities like names of people, organizations,
locations, dates, etc., from text. Let's go through a mathematical example of Named Entity
Recognition.

Original Text:

John works at XYZ Corporation in New York. He was born on January 15, 1985.
Named Entity Recognition (NER): In this example, we want to identify entities like names,
organizations, locations, and dates.

1. John (Person)

2. XYZ Corporation (Organization)

3. New York (Location)

4. January 15, 1985 (Date)

Now, let's represent this NER process in a simplified mathematical notation:

 T represents the input text.

 E represents the set of entities.

 NER(T) represents the function that extracts entities from the text.

Mathematical representation:

 T = "John works at XYZ Corporation in New York. He was born on January 15, 1985."

 E = {John (Person), XYZ Corporation (Organization), New York (Location), January 15,
1985 (Date)}

 NER(T) = {John, XYZ Corporation, New York, January 15, 1985}

The NER process involves pattern recognition, machine learning, or hybrid methods to identify
entities and classify them into predefined categories (such as Person, Organization, Location,
Date).

Original Text: John works at XYZ Corporation in New York. He was born on January 15, 1985.

1. Tokenization:

John | works | at | XYZ | Corporation | in | New | York | . | He | was | born | on | January | 15 | , | 1985 | .

The text is split into individual words or tokens.

2. Part-of-Speech Tagging:

John/NNP | works/VBZ | at/IN | XYZ/NNP | Corporation/NNP | in/IN | New/NNP | York/NNP | ./.
He/PRP | was/VBD | born/VBN | on/IN | January/NNP | 15/CD | ,/, | 1985/CD | ./.
Each token is tagged with its part of speech (e.g., NNP for proper noun, VBZ for verb).

3. NER Tagging:

John/PERSON | works/O | at/O | XYZ/ORG | Corporation/ORG | in/O | New/LOC | York/LOC | ./O
He/O | was/O | born/O | on/O | January/DATE | 15/DATE | ,/O | 1985/DATE | ./O

Tokens are tagged with named entity labels like PERSON, ORGANIZATION, LOCATION, and
DATE.

4. NER Visualization:

John (PERSON) works at XYZ Corporation (ORG) in New York (LOC). He was born on
January 15, 1985 (DATE).

Entities are highlighted with their respective labels for clear visualization.

5. Final Extracted Entities:

 PERSON: John

 ORGANIZATION: XYZ Corporation

 LOCATION: New York

 DATE: January 15, 1985
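In practice such a pipeline is rarely written by hand. The sketch below uses spaCy as one possible off-the-shelf NER tool, assuming the small English model has been downloaded beforehand (python -m spacy download en_core_web_sm). Note that spaCy labels geopolitical locations as GPE rather than LOC, and the exact spans it returns may differ slightly from the hand-annotated example above.

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the model is installed locally
doc = nlp("John works at XYZ Corporation in New York. He was born on January 15, 1985.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Expected output along the lines of:
#   John -> PERSON
#   XYZ Corporation -> ORG
#   New York -> GPE
#   January 15, 1985 -> DATE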

Relation Extraction

Relation Extraction is a Natural Language Processing (NLP) task that involves identifying and
extracting relationships or associations between entities mentioned in text. The goal is to
discover structured information from unstructured text by determining how different entities are
related to each other. These relationships can be hierarchical (e.g., parent-child), spatial (e.g.,
located-in), temporal (e.g., before-after), or more complex semantic relationships.

Here's a basic overview of the relation extraction process:

1. Entity Recognition: Before identifying relationships, named entities (such as people, organizations, locations) need to be recognized in the text using techniques like Named Entity Recognition (NER).

2. Dependency Parsing: The syntactic structure of the sentence is analyzed using dependency parsing to understand how words are related to each other. This helps in identifying which entities are connected and how.
3. Pattern Matching and Rules: Rule-based systems or pattern-matching techniques can
be used to identify common linguistic patterns that indicate specific relationships
between entities.

4. Machine Learning Approaches: Relation extraction can also be tackled using machine
learning models, such as deep learning models or traditional classifiers. These models are
trained on labeled data that contains examples of entity pairs and their corresponding
relationships.

5. Feature Extraction: Features are extracted from the text, often including linguistic
features, syntactic features, and sometimes even semantic features to represent the
context of the entities.

6. Training and Prediction: In supervised machine learning approaches, the model is trained on a labeled dataset to learn patterns and features associated with different relationships. Once trained, the model can predict relationships in new, unseen text.

7. Output: The output of relation extraction is a set of extracted relationships between entity pairs, often represented as triples: (Entity1, Relationship, Entity2).

Example:

Text: "Barack Obama was born in Hawaii." Entity1: Barack Obama (Person) Relationship: born
in Entity2: Hawaii (Location)

In this example, the relation "born in" connects the person "Barack Obama" with the location
"Hawaii."

Relation extraction has a wide range of applications, including information retrieval, knowledge
graph construction, question-answering systems, and more. It can provide structured knowledge
from large volumes of unstructured text data, enabling computers to understand and utilize
textual information more effectively.

Question Answering in Multilingual Setting

Question Answering (QA) in a multilingual setting involves building systems that can
understand and answer questions posed in different languages. Here's an example to illustrate
this concept:

Question (English):

What is the capital of France?

Question (French):

Quelle est la capitale de la France ?


Answer:

Paris

In this example, we have the same question posed in both English and French: "What is the
capital of France?" The correct answer, "Paris," is the same for both languages.

To implement multilingual question answering, a system would need to perform the following
steps:

1. Language Identification: The system needs to determine the language of the input
question.

2. Translation (Optional): If the question is not in the language the system is designed to
answer, it may need to translate the question into the target language.

3. Information Retrieval: Retrieve relevant information from a knowledge base or a corpus of text that contains facts and details. In this case, the system would search for information about the capital of France.

4. Answer Extraction: Extract the answer from the retrieved information. This could
involve named entity recognition or other techniques to identify relevant entities in the
text.

5. Language Generation: If the system is designed to respond in the same language as the
input question, it would generate the response in the appropriate language.

In a multilingual QA system, the core processes of information retrieval and answer extraction
are language-independent, while language identification, translation, and generation steps handle
the multilingual aspect.

This approach allows users to ask questions in their preferred language and receive accurate
answers, even if the system operates across multiple languages. Multilingual QA systems are
valuable for providing access to information to a diverse and global user base.

Information Retrieval (IR)

Natural Language Processing (NLP) is a crucial component in Information Retrieval (IR), which
focuses on retrieving relevant documents or information in response to user queries. NLP
techniques enhance the effectiveness of IR systems by enabling better understanding of user
queries and document content. Here's how NLP is applied in various aspects of Information
Retrieval:

1. Query Processing:

 Tokenization: Breaking down user queries into individual words or tokens.


 Stemming and Lemmatization: Reducing words to their root forms to handle variations
(e.g., "running" to "run").

 Stopword Removal: Eliminating common words (e.g., "and," "the") that don't contribute
much to the meaning.

2. Document Indexing:

 Inverted Index: Creating an index of terms from documents, associating each term with
the documents in which it appears.

 Term Frequency-Inverse Document Frequency (TF-IDF): Assigning weights to terms based on their importance in a document relative to their frequency in the entire collection.

3. Retrieval Models:

 Vector Space Model (VSM): Representing documents and queries as vectors in a high-
dimensional space to calculate similarity scores.

 Probabilistic Models: Estimating the probability that a document is relevant to a query.

4. Ranking and Scoring:

 Cosine Similarity: Measuring the cosine of the angle between query and document
vectors to rank documents by relevance.

 BM25: A ranking function based on term frequency and document length.

5. Relevance Feedback:

 Query Expansion: Automatically adding synonyms or related terms to the user's query
to capture more relevant documents.

 Rocchio Algorithm: Adjusting the query vector based on user feedback to improve
retrieval.

6. Language Understanding:

 Named Entity Recognition (NER): Identifying entities (e.g., names, locations) in documents and queries.

 Part-of-Speech Tagging: Labeling words in a sentence with their grammatical parts of speech.

 Dependency Parsing: Understanding syntactic relationships between words.

7. Cross-Lingual Information Retrieval (CLIR):

 Machine Translation: Translating queries or documents to another language for cross-lingual retrieval.

 Cross-Lingual Document Alignment: Aligning documents in different languages to facilitate retrieval.

Problem: Given a set of documents and a user query, we want to retrieve the most relevant
document using NLP techniques.

Documents:

1. Document A: "The cat chased the mouse."

2. Document B: "The dog barked loudly."

3. Document C: "The mouse escaped into a hole."

User Query: "cat chased mouse"

NLP Techniques: We'll tokenize the documents and the query and use the Vector Space Model
(VSM) for representation.

Step 1: Tokenization:

Document A: [The, cat, chased, the, mouse, .]
Document B: [The, dog, barked, loudly, .]
Document C: [The, mouse, escaped, into, a, hole, .]
Query: [cat, chased, mouse]

Step 2: Calculate Term Frequencies (TF): We calculate the frequency of each term in the
documents and the query.

 TF("cat", Document A) = 1

 TF("chased", Document A) = 1

 TF("mouse", Document A) = 1

 ... (similarly for other terms and documents)

Step 3: Calculate Inverse Document Frequencies (IDF): We calculate the inverse document
frequency for each term.

 IDF("cat") = log(3/1) = log(3)

 IDF("chased") = log(3/1) = log(3)


 IDF("mouse") = log(3/2) = log(1.5)

 ... (similarly for other terms)

Step 4: Calculate TF-IDF Weights: We calculate the TF-IDF weight for each term in each
document and the query.

 TF-IDF("cat", Document A) = TF("cat", Document A) * IDF("cat") = 1 * log(3)

 TF-IDF("chased", Document A) = TF("chased", Document A) * IDF("chased") = 1 *


log(3)

 TF-IDF("mouse", Document A) = TF("mouse", Document A) * IDF("mouse") = 1 *


log(1.5)

 ... (similarly for other terms and documents)

Step 5: Calculate Cosine Similarities: We calculate the cosine similarity between the TF-IDF
vectors of each document and the query.

 Cosine similarity(Document A, Query) = (TF-IDF vector(Document A) * TF-IDF vector(Query)) / (||TF-IDF vector(Document A)|| * ||TF-IDF vector(Query)||)

 ... (similarly for other documents)

Result: The document with the highest cosine similarity score is the most relevant to the query
and is the retrieved document.

This example demonstrates how NLP techniques, specifically tokenization, TF-IDF representation, and cosine similarity calculation, play a crucial role in information retrieval by allowing us to measure the relevance of documents to user queries.
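The same retrieval pipeline can be sketched with scikit-learn as below. Note that scikit-learn's TF-IDF uses a smoothed IDF formula, so the exact weights differ from the hand calculation above, but Document A should still rank first for this query.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The cat chased the mouse.",
    "The dog barked loudly.",
    "The mouse escaped into a hole.",
]
query = "cat chased mouse"

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)     # document vectors
q = vectorizer.transform([query])      # query vector in the same term space

scores = cosine_similarity(q, D).ravel()
for doc, score in sorted(zip(docs, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {doc}")
# "The cat chased the mouse." should appear at the top of the ranking.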

The Vector Space Model (VSM)

The Vector Space Model (VSM) is a fundamental concept in Natural Language Processing
(NLP) and Information Retrieval (IR). It's a mathematical framework used to represent text
documents and queries in a high-dimensional vector space, enabling the calculation of similarity
between them. VSM is a basis for many text-related tasks, such as document retrieval, text
classification, and clustering. Let's delve into the Vector Space Model:

Representation: In the VSM, each term in the vocabulary is represented as a dimension in the
vector space. Documents and queries are represented as vectors, where each dimension
corresponds to a term's frequency or some other term weighting scheme.

Steps:
1. Tokenization and Preprocessing: Convert documents and queries into tokens (words or
terms). Apply stemming, lemmatization, and other preprocessing steps.

2. Term Frequency (TF): Count the frequency of each term in a document or query. This
gives the term's raw frequency in the text.

3. Inverse Document Frequency (IDF): Calculate the IDF for each term in the corpus.
IDF measures the inverse of how often a term appears in all documents. It helps give
more weight to terms that are less common and potentially more informative.

4. TF-IDF Weighting: Multiply the TF of a term in a document/query by its IDF. This yields a TF-IDF weight, which indicates how important a term is to a document/query relative to its importance in the entire corpus.

5. Document and Query Vectors: Represent each document and query as a vector in the
vector space. The dimensions of the vector correspond to the terms, and the values are the
calculated TF-IDF weights.

6. Cosine Similarity: Calculate the cosine similarity between document vectors and the
query vector to measure their similarity. Cosine similarity is the cosine of the angle
between two vectors and ranges from -1 (dissimilar) to 1 (similar).

7. Ranking and Retrieval: Rank documents based on their cosine similarity scores with the
query vector. Higher scores indicate higher relevance.

Example: Consider two documents and a query:

Document 1: "The cat chased the mouse." Document 2: "The dog barked loudly." Query: "cat
chased"

Assuming appropriate preprocessing and TF-IDF weighting, the vectors might look like this:

Document 1 Vector: [0.5, 0.5, 0, 0, 0, 0.5]   # TF-IDF values
Document 2 Vector: [0, 0, 0.5, 0.5, 0.5, 0]   # TF-IDF values
Query Vector:      [0.5, 0.5, 0, 0, 0, 0]     # TF-IDF values

The cosine similarity between Document 1 and the Query could be calculated, and similarly for
Document 2. Higher cosine similarity indicates higher relevance to the query.

The Vector Space Model allows representing text data numerically, facilitating various NLP and
IR tasks by enabling the quantification of textual similarity and relevance.

Documents: We have three short documents:

1. Document 1: "The cat chased the mouse."

2. Document 2: "The dog barked loudly."


3. Document 3: "The mouse escaped into a hole."

Vocabulary: The vocabulary consists of unique terms from all the documents:

Vocabulary: ["The", "cat", "chased", "dog", "barked", "loudly", "mouse", "escaped", "into", "a",
"hole"]

Term Frequency (TF): Calculate the term frequency (TF) for each term in each document.
We'll use binary representation (0 if term is absent, 1 if present):

Document 1: [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
Document 2: [1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
Document 3: [1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

Inverse Document Frequency (IDF): Calculate the IDF for each term using the total number of
documents (3 in this case):

IDF("The") = log(3/3) = 0 IDF("cat") = log(3/1) = log(3) IDF("chased") = log(3/1) = log(3) ...


IDF("hole") = log(3/1) = log(3)

TF-IDF Weights: Calculate the TF-IDF weights for each term in each document:

TF-IDF(Document 1) = [0, log(3), log(3), 0, 0, 0, log(3/2), 0, 0, 0, 0]
TF-IDF(Document 2) = [0, 0, 0, log(3), log(3), log(3), 0, 0, 0, 0, 0]
TF-IDF(Document 3) = [0, 0, 0, 0, 0, 0, log(3/2), log(3), log(3), log(3), log(3)]

Query: Let's consider a query: "cat chased"

Query: [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

Cosine Similarity: Calculate the cosine similarity between the query and each document:

Cosine Similarity(Query, Document 1) = (Query * Document 1) / (||Query|| * ||Document 1||)
Cosine Similarity(Query, Document 2) = (Query * Document 2) / (||Query|| * ||Document 2||)
Cosine Similarity(Query, Document 3) = (Query * Document 3) / (||Query|| * ||Document 3||)

The document with the highest cosine similarity to the query is the most relevant one.

Please note that this is a simplified example, and actual implementations may use more advanced
techniques and normalization methods for cosine similarity calculation.

Cross-Lingual Information Retrieval

Cross-Lingual Information Retrieval (CLIR) involves retrieving relevant documents in one


language in response to queries expressed in a different language. It's particularly important for
users who need information in a language they might not be proficient in. Here's a mathematical
example to illustrate the concept of CLIR:
Documents: Let's consider a set of documents in English and their translations in French:

English Documents:

1. Document E1: "The cat chased the mouse."

2. Document E2: "The dog barked loudly."

French Documents (Translations):

1. Document F1: "Le chat a poursuivi la souris."

2. Document F2: "Le chien a aboyé bruyamment."

User Query: Suppose a user queries in French: "chien aboyé."

NLP Techniques: We'll use machine translation and the Vector Space Model (VSM) to perform
CLIR.

Machine Translation: We'll translate the French query into English using a translation model:

Translated Query: "dog barked"

Vector Space Model: Using the same steps as the previous example, we'll represent documents
and the translated query in the vector space.

Cosine Similarity: Calculate the cosine similarity between the translated query vector and
English document vectors:

Cosine Similarity(Translated Query, Document E1) = (Translated Query * Document E1) / (||Translated Query|| * ||Document E1||)
Cosine Similarity(Translated Query, Document E2) = (Translated Query * Document E2) / (||Translated Query|| * ||Document E2||)

Result: The document with the highest cosine similarity to the translated query is the most
relevant one.

In this example, Cross-Lingual Information Retrieval involved translating the user query from
French to English and then treating it as a regular query in English. This process allows users to
search for information in a language they are comfortable with, even if the documents are in a
different language. Please note that real-world CLIR systems are more sophisticated, using
advanced translation models and information retrieval techniques.
Unit 5 Machine Translation and Deep Learning

Need of MT
Machine Translation (MT) plays a crucial role in Natural Language Processing (NLP) due to its
ability to automatically translate text from one language to another. Here are some key reasons
highlighting the need for Machine Translation in NLP:

1. Global Communication: In our increasingly interconnected world, MT enables effective


communication across language barriers. People can share information, collaborate, and
communicate with individuals who speak different languages.

2. Information Access: MT allows individuals to access information, news, research, and


online content that may not be available in their native language. This democratizes
information and knowledge dissemination.

3. Business Expansion: Businesses can use MT to expand their reach to global markets.
They can translate product descriptions, marketing materials, and customer support
content to cater to international customers.

4. Cultural Exchange: MT fosters cultural exchange by making literature, art, music, and
other forms of cultural expression accessible to people around the world.

5. Humanitarian Efforts: In emergency situations, MT helps facilitate communication


between aid workers and affected communities who may not share a common language.

6. Language Preservation: MT can aid in the preservation of endangered languages by


allowing texts to be translated and documented for future generations.

7. Language Learning: MT tools can assist language learners in understanding texts in their
target language and comparing them with their native language.

8. Research and Collaboration: Academics and researchers can access work published in
other languages, aiding in cross-lingual research collaboration and knowledge exchange.

9. Legal and Diplomatic Affairs: In legal, diplomatic, and international contexts, MT helps
bridge linguistic gaps in negotiations, agreements, and documentation.

10. Real-time Communication: MT technologies are integrated into messaging apps and
platforms, allowing real-time translation during conversations, enabling global
communication.

11. Content Localization: MT is used to localize content for different regions by adapting
translations to reflect cultural norms and linguistic nuances.
12. Enhanced User Experience: In software applications and websites, MT can provide
localized content for users in different regions, enhancing user experience.

However, it's important to note that while MT has made significant advancements, challenges
such as accuracy, context understanding, idiomatic expressions, and maintaining the nuances of
the source language still exist. MT is most effective when used in conjunction with human
editing or when combined with other NLP techniques to ensure high-quality translations.

Problems of Machine Translation

Machine Translation (MT) in NLP faces several challenges that can impact the quality and
accuracy of translated text. Here are some common problems of Machine Translation:

1. Ambiguity: Many words and phrases have multiple meanings based on the context. MT
systems might struggle to choose the correct sense, leading to ambiguous translations.

2. Idiomatic Expressions: Languages contain idioms and culturally specific phrases that
don't translate literally. MT systems may produce translations that sound awkward or lose
the intended meaning.

3. Lack of Context: MT often struggles to understand the broader context of a sentence or


passage, resulting in inaccurate translations that don't capture the intended message.

4. Out-of-Vocabulary Words: MT systems may encounter words that aren't present in


their training data. Handling these words accurately can be challenging.

5. Syntax and Grammar: Different languages have varying word orders and grammatical
rules. MT systems might produce translations with incorrect syntax or grammar.

6. Rare or Domain-Specific Terms: If an MT system hasn't encountered certain


specialized terms or domain-specific vocabulary in its training data, it may not translate
them correctly.

7. Language Morphology: Languages vary in terms of morphology, inflections, and word


forms. MT systems need to handle these variations accurately.

8. Named Entities: Translating named entities like names of people, places, and
organizations accurately can be difficult, especially if they're not recognized by the
system.

9. Neologisms and Slang: New words, slang, and rapidly evolving language can challenge
MT systems that haven't been updated with recent language developments.

10. Domain Adaptation: MT models trained on general text might not perform well in
specialized domains (e.g., medical, legal) due to lack of domain-specific training data.
11. Low-Resource Languages: Languages with limited available training data present
challenges, as MT systems might not capture the full linguistic complexity.

12. Cultural Sensitivity: Translations might not account for cultural differences, leading to
misunderstandings or inappropriate translations.

13. Language Pairs: Some language pairs are more challenging for MT due to structural
differences between the languages.

14. Evaluation: Measuring the quality of machine-generated translations accurately can be


complex. Metrics might not capture fluency, coherence, or cultural appropriateness.

15. Multilingual Ambiguities: When translating between multiple languages, an MT system


might generate a translation that is grammatically correct in the target language but
conveys a different meaning in another language.

16. Context Discrepancies: MT systems may not recognize or adequately address changes
in context, leading to inconsistent translations within a single text.

Researchers and developers are continuously working to address these problems through
advancements in neural machine translation, improved training data, better pre-processing
techniques, and more sophisticated evaluation methods. However, human review and post-
editing remain important to ensure high-quality translations.

MT Approaches

Machine Translation (MT) in NLP employs various approaches to automatically translate text
from one language to another. These approaches have evolved over time, and modern MT
systems often leverage neural networks for improved performance. Here are some key MT
approaches:

1. Rule-Based Machine Translation (RBMT): RBMT relies on predefined linguistic rules


and dictionaries to translate text. It involves grammatical analysis and generates
translations based on linguistic knowledge. While accurate for certain language pairs and
specific domains, RBMT struggles with handling idioms, ambiguous phrases, and
languages with complex grammatical structures.

2. Statistical Machine Translation (SMT): SMT uses statistical models that learn
translation patterns from large parallel corpora (bilingual text). It involves identifying
word alignments and estimating translation probabilities. SMT performed well for a time
but struggled with handling syntax, long-distance dependencies, and domain adaptation.

3. Neural Machine Translation (NMT): NMT has revolutionized MT using neural


networks, particularly Recurrent Neural Networks (RNNs) and Transformer models.
NMT models learn to map source language sequences to target language sequences end-
to-end, capturing complex relationships and long-range dependencies. The attention
mechanism in Transformers improved the handling of context and made NMT more
powerful.

4. Transformer Architecture: The Transformer architecture, introduced in the "Attention


Is All You Need" paper, is central to many modern NMT systems. It employs self-
attention mechanisms to weigh the importance of different words in a sequence, enabling
better context understanding and improved translation quality.

5. Sequence-to-Sequence Models: Sequence-to-Sequence models, often built using the


Transformer architecture, take input sequences (source language) and generate output
sequences (target language). They're used in tasks like language translation, text
summarization, and more.

6. Transfer Learning and Pre-training: Transfer learning techniques, such as pre-trained


language models (e.g., BERT, GPT), have been applied to MT. These models learn
general linguistic features from massive text corpora and can be fine-tuned for specific
translation tasks.

7. Multilingual and Zero-Shot Translation: Some NMT models support multiple


languages and can perform translation between any language pair even without specific
training data. This is achieved through shared encoders and decoders.

8. Reinforcement Learning: Reinforcement Learning can be used to fine-tune MT models


based on translation quality feedback, optimizing for metrics like BLEU, which measures
translation quality.

9. Domain-Specific Adaptation: MT models can be adapted to specific domains (e.g.,


medical, legal) by fine-tuning on domain-specific data. This improves translation quality
for specialized vocabulary.

10. Hybrid Approaches: Combining multiple MT approaches, such as integrating rule-


based components into statistical or neural models, has been explored for improved
translation accuracy.

Modern MT approaches, particularly neural-based methods, have significantly improved


translation quality, making them more suitable for a wide range of practical applications.
However, choosing the right approach depends on factors like language pair, domain, available
training data, and desired translation quality.

Statistical Machine Translation (SMT)

Statistical Machine Translation (SMT) is an approach to machine translation in NLP that relies
on statistical models and probabilistic techniques to translate text from one language to another.
SMT was one of the dominant approaches to machine translation before the advent of neural
machine translation (NMT). Here's an overview of how SMT works:

Key Components of SMT:

1. Parallel Corpus: SMT requires a parallel corpus, which consists of sentences or texts in
the source language and their corresponding translations in the target language. These
aligned sentences are used to learn translation patterns and probabilities.

2. Word Alignments: Word alignments are established between the source language words
and their translations in the target language. These alignments help identify which words
correspond to each other in the translation process.

3. Translation Models: SMT uses translation models to estimate the likelihood of a


translation given the source sentence. These models are trained on the parallel corpus and
capture how words or phrases in the source language are likely to be translated into the
target language.

4. Language Models: Language models estimate the probability of a sequence of words in the target language. They help generate fluent and grammatically correct translations.

5. Decoding: Translating a sentence involves decoding, i.e., searching for the best translation option based on the translation and language models.

SMT Workflow:

1. Training:

 Create a parallel corpus of sentences in the source language and their translations
in the target language.

 Align words in the parallel corpus to establish word alignments.

 Train translation models using alignment information to learn translation probabilities.

2. Decoding:

 Given a source sentence in the source language, generate a set of possible translations.

 Estimate the probability of each translation option using translation and language
models.

 Choose the translation option with the highest overall probability.


Advantages of SMT:

 SMT can handle long-range dependencies and context by considering entire sentences.

 It can work well with limited training data, as long as word alignments are accurately
captured.

Challenges of SMT:

 SMT struggles with handling idiomatic expressions, word order differences, and
languages with complex grammar.

 It requires extensive feature engineering to model translation patterns, which can be linguistically challenging.

 It might produce fluent but incorrect translations if the training data lacks certain
language pairs or linguistic phenomena.

While SMT was a groundbreaking approach, modern NMT systems, especially those based on
Transformer architectures, have largely outperformed SMT in terms of translation quality and
handling complex linguistic phenomena. However, SMT paved the way for many of the concepts
and techniques that are still relevant in machine translation research today.

Example: English to French Translation

English Sentences:

1. "The cat chased the mouse."

2. "The dog barked loudly."

French Translations:

1. "Le chat a poursuivi la souris."

2. "Le chien a aboyé bruyamment."

SMT Steps:

1. Parallel Corpus: We have a parallel corpus containing the English sentences and their
corresponding French translations.

2. Bilingual Word Alignments: We create bilingual word alignments to identify which words in
the source language correspond to words in the target language. Let's assume the alignments are
as follows:
 "The" -> "Le"

 "cat" -> "chat"

 "chased" -> "a poursuivi"

 "the" -> "la"

 "mouse" -> "souris"

3. Translation Probabilities: Calculate translation probabilities based on word alignments and frequency counts from the parallel corpus.

Assuming we have the following translation probabilities for English to French:

 P("Le" | "The") = 0.8

 P("chat" | "cat") = 0.9

 P("a poursuivi" | "chased") = 0.7

 P("la" | "the") = 0.8

 P("souris" | "mouse") = 0.9

4. Calculating Translation Scores: For each English sentence, calculate the translation scores
for possible French translations using translation probabilities. The translation score for a
sentence is the product of the translation probabilities of its constituent words.

English Sentence: "The cat chased the mouse." Translation candidates:

 French Translation Candidate 1: "Le chat a poursuivi la souris."

 French Translation Candidate 2: "Le chat a poursuivi la la."

Calculate the translation scores:

 Score1 = P("Le" | "The") * P("chat" | "cat") * P("a poursuivi" | "chased") * P("la" | "the")
* P("souris" | "mouse")

 Score2 = P("Le" | "The") * P("chat" | "cat") * P("a poursuivi" | "chased") * P("la" | "the")
* P("la" | "mouse")

5. Choosing the Best Translation: Select the translation candidate with the highest translation
score as the final translation for the English sentence.
In this example, the candidate "Le chat a poursuivi la souris." has a higher translation score than "Le chat a poursuivi la la.", because P("la" | "mouse") does not appear in the probability table and would be close to zero. The first candidate is therefore chosen as the final translation.

Real-world SMT systems use more sophisticated techniques, incorporate language models, and
handle a larger vocabulary to improve translation accuracy.
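
To make the scoring step concrete, the following Python sketch multiplies the word-level translation probabilities assumed above and picks the candidate with the highest score. The small floor probability for unseen word pairs, and the fixed one-to-one alignment of positions, are assumptions of this sketch rather than part of the example.

# Toy word-by-word SMT scoring using the illustrative probabilities above.
trans_prob = {
    ("Le", "The"): 0.8,
    ("chat", "cat"): 0.9,
    ("a poursuivi", "chased"): 0.7,
    ("la", "the"): 0.8,
    ("souris", "mouse"): 0.9,
}
FLOOR = 1e-4  # probability assigned to word pairs not present in the table (an assumption)

def score(candidate_tokens, source_tokens):
    # Product of P(french_word | english_word) over aligned positions (monotone alignment assumed).
    s = 1.0
    for fr, en in zip(candidate_tokens, source_tokens):
        s *= trans_prob.get((fr, en), FLOOR)
    return s

source = ["The", "cat", "chased", "the", "mouse"]
candidates = [
    ["Le", "chat", "a poursuivi", "la", "souris"],
    ["Le", "chat", "a poursuivi", "la", "la"],
]

for c in candidates:
    print(" ".join(c), "->", score(c, source))
best = max(candidates, key=lambda c: score(c, source))
print("Best translation:", " ".join(best))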

Parameter Learning in Statistical Machine Translation (SMT)

Parameter learning in Statistical Machine Translation (SMT), specifically using IBM Models,
involves estimating the translation probabilities between words in a parallel corpus. Expectation-
Maximization (EM) is a common technique used to learn these parameters. The IBM Models,
often referred to as IBM Model 1 and IBM Model 2, are foundational models in SMT. Here's an
overview of parameter learning using EM in the context of IBM Models:

IBM Models Overview:

IBM Models are a series of generative models that focus on aligning words between source and
target languages. They provide a foundation for estimating translation probabilities and word
alignments.

Expectation-Maximization (EM) Algorithm:

EM is an iterative algorithm used for parameter estimation in probabilistic models. In the context
of IBM Models, it's used to learn the translation probabilities and word alignments that best
explain the observed parallel corpus.

Parameter Learning Steps using EM:

1. Initialization:

 Initialize the translation probabilities uniformly.

 Initialize the word alignment probabilities.

2. E-Step (Expectation Step):

 In the E-step, estimate the expected counts of word alignments based on the
current parameter estimates and the parallel corpus.

3. M-Step (Maximization Step):

 In the M-step, re-estimate the translation probabilities and alignment probabilities using the expected counts obtained from the E-step.

4. Iteration:
 Repeat the E-step and M-step iteratively until convergence or a predefined
number of iterations.

IBM Model 1:

 In IBM Model 1, the EM algorithm is used to estimate the translation probabilities for
each source word given a target word. These probabilities indicate how likely a source
word is to be translated as a target word.

IBM Model 2:

 IBM Model 2 extends Model 1 by introducing a fertility parameter that accounts for the
number of words in the source sentence aligned to a target word.

 The EM algorithm is used to estimate the translation probabilities, word alignment probabilities, and fertility probabilities.

Advantages of IBM Models and EM:

 IBM Models provide a foundational understanding of alignment and translation probabilities in SMT.

 EM helps learn parameters even with missing alignment information and noisy data.

Challenges and Limitations:

 EM can be computationally expensive, especially in later iterations.

 IBM Models are relatively simple and might struggle with complex linguistic
phenomena.

Modern SMT:

While IBM Models and EM have historical significance, modern SMT has transitioned to more
sophisticated models like Phrase-Based Models and Neural Machine Translation (NMT), which
often outperform IBM Models in terms of translation quality.

Overall, understanding parameter learning using EM in the context of IBM Models provides
insights into the early stages of statistical machine translation research.

Let's walk through Statistical Machine Translation (SMT) using the Expectation-Maximization (EM) algorithm for IBM Model 1. In this example, we'll focus on translating from English to French.

Example: English to French Translation

English Sentences:
1. "The cat chased the mouse."

2. "The dog barked loudly."

French Translations:

1. "Le chat a poursuivi la souris."

2. "Le chien a aboyé bruyamment."

Assumptions:

 Vocabulary: {"The", "cat", "chased", "dog", "barked", "loudly", "mouse", "a", "poursuivi", "la", "souris", "chien", "aboyé", "bruyamment"}

 Initial uniform translation probabilities: P("word" | "mot") = 1/14 for all word pairs.

IBM Model 1:

Initialization:

 Initialize translation probabilities uniformly: P("The" | "Le") = P("cat" | "chat") = ... = 1/14

E-Step (Expectation Step):

 For each sentence pair, estimate the expected count of each source word aligned to each
target word based on the current translation probabilities.

M-Step (Maximization Step):

 Re-estimate the translation probabilities using the expected counts obtained from the E-
step.

Iteration:

 Repeat the E-step and M-step iteratively until convergence or a predefined number of
iterations.

Example Calculation for E-Step (Sentence Pair 1):

Source Sentence: "The cat chased the mouse." Target Sentence: "Le chat a poursuivi la souris."

Expected Counts (illustrative; shown here only for alignments with the target word "Le", with the other target words treated analogously):

 P("The" | "Le") = (1/14) * (1 alignment)

 P("cat" | "Le") = (1/14) * (1 alignment)


 P("chased" | "Le") = (1/14) * (1 alignment)

 P("the" | "Le") = (1/14) * (0 alignment)

 P("mouse" | "Le") = (1/14) * (1 alignment)

M-Step: Re-estimate translation probabilities based on expected counts from the E-step.

Iteration: Repeat E-Step and M-Step until convergence.

Please note that this example is a simplified illustration for educational purposes. In practice,
IBM Model 1 and EM involve more complex calculations and iterations. Also, real-world SMT
systems use more advanced models and larger parallel corpora for training.
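
For readers who prefer code, below is a compact (and deliberately simplified) Python sketch of the EM loop for IBM Model 1 on the two sentence pairs above. It omits the special NULL source word used in the full model, so the learned values are illustrative only.

from collections import defaultdict

# Parallel corpus: (English source, French target) token lists.
corpus = [
    (["The", "cat", "chased", "the", "mouse"],
     ["Le", "chat", "a", "poursuivi", "la", "souris"]),
    (["The", "dog", "barked", "loudly"],
     ["Le", "chien", "a", "aboyé", "bruyamment"]),
]

french_vocab = {f for _, fr in corpus for f in fr}

# Initialization: uniform translation probabilities t(f | e).
t = defaultdict(lambda: 1.0 / len(french_vocab))

for iteration in range(10):
    count = defaultdict(float)   # expected counts c(f, e)
    total = defaultdict(float)   # expected counts c(e)

    # E-step: collect expected alignment counts under the current t(f | e).
    for en, fr in corpus:
        for f in fr:
            norm = sum(t[(f, e)] for e in en)
            for e in en:
                delta = t[(f, e)] / norm
                count[(f, e)] += delta
                total[e] += delta

    # M-step: re-estimate t(f | e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print("t('chat' | 'cat') =", round(t[("chat", "cat")], 3))
print("t('chat' | 'The') =", round(t[("chat", "The")], 3))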

The encoder-decoder architecture is a fundamental framework used in Natural Language Processing (NLP), particularly in tasks like machine translation, text summarization, and more. It
involves two main components: an encoder, which processes the input sequence, and a decoder,
which generates the output sequence. Let's explore this architecture with a mathematical example
in the context of machine translation.

Encoder-Decoder Architecture: Machine Translation

Encoder: The encoder processes the input sequence (source language) and produces a fixed-size
representation called the "context" or "thought vector." In this example, we'll use a simple
representation where the encoder sums the word embeddings.

Decoder: The decoder generates the output sequence (target language) based on the context
vector produced by the encoder. It does this step by step, predicting one word at a time,
considering both the context vector and previously generated words.

Example: English to French Translation

Source Sentence: "I love machine learning."

Target Sentence: "J'adore l'apprentissage automatique."

Vocabulary (for simplicity): English: {"I", "love", "machine", "learning"} French: {"J'adore",
"l'apprentissage", "automatique"}

Encoder: We'll represent each word using a one-hot encoding and then sum the word
embeddings. For this example, let's assume the word embeddings are:

"I"        -> [1, 0, 0, 0]
"love"     -> [0, 1, 0, 0]
"machine"  -> [0, 0, 1, 0]
"learning" -> [0, 0, 0, 1]

The context vector is the sum of these embeddings: [1, 1, 1, 1]


Decoder: We'll use a simple autoregressive approach where the decoder generates one word at a
time based on the context vector and previously generated words. For each time step, the decoder
calculates the probability distribution over the target vocabulary and selects the word with the
highest probability.

Generating the Translation:

 Time step 1: Given the context vector [1, 1, 1, 1], the decoder calculates probabilities for
each French word: P("J'adore" | context) = 0.7 P("l'apprentissage" | context) = 0.2
P("automatique" | context) = 0.1 The decoder selects "J'adore" as the first word.

 Time step 2: Now the decoder's input is the previous word "J'adore." It calculates
probabilities for the next word: P("J'adore" | context, "J'adore") = 0.1 P("l'apprentissage" |
context, "J'adore") = 0.5 P("automatique" | context, "J'adore") = 0.4 The decoder selects
"l'apprentissage" as the second word.

 Time step 3: Similarly, the decoder generates "automatique" as the third and final word (after which it would typically emit an end-of-sequence token).

The generated translation is "J'adore l'apprentissage automatique."

This example demonstrates the basic concept of the encoder-decoder architecture for machine
translation. In practice, more sophisticated techniques, like attention mechanisms and advanced
neural network models, are used to improve translation quality and handle longer sentences.
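
The toy example above can be written as a short NumPy sketch: the encoder sums one-hot word vectors, and a greedy decoder reads next-word probabilities from a lookup table. The table entries for the first two steps are taken from the example; the distribution after "l'apprentissage" is an assumption added here so the sketch can finish the sentence.

import numpy as np

# One-hot "embeddings" for the English vocabulary (as in the example above).
english_vocab = ["I", "love", "machine", "learning"]
embeddings = {w: np.eye(len(english_vocab))[i] for i, w in enumerate(english_vocab)}

def encode(sentence_tokens):
    # Toy encoder: the context vector is the sum of the word embeddings.
    return sum(embeddings[w] for w in sentence_tokens)

# Illustrative decoder distributions keyed by the previously generated word
# (None means "start of sentence"). The first two rows come from the example;
# the row for "l'apprentissage" is assumed here for illustration.
decoder_table = {
    None: {"J'adore": 0.7, "l'apprentissage": 0.2, "automatique": 0.1},
    "J'adore": {"J'adore": 0.1, "l'apprentissage": 0.5, "automatique": 0.4},
    "l'apprentissage": {"J'adore": 0.05, "l'apprentissage": 0.05, "automatique": 0.9},
}

def greedy_decode(context, max_len=3):
    # Pick the highest-probability word at each step (greedy decoding).
    # The toy lookup table ignores the context vector; a real decoder conditions on it.
    output, prev = [], None
    for _ in range(max_len):
        probs = decoder_table.get(prev, {})
        if not probs:
            break
        prev = max(probs, key=probs.get)
        output.append(prev)
    return output

context = encode(["I", "love", "machine", "learning"])
print("Context vector:", context)                      # [1. 1. 1. 1.]
print("Translation:", " ".join(greedy_decode(context)))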

Neural Machine Translation: Encoding and Decoding

Neural Machine Translation (NMT) is a modern approach that uses neural networks for both
encoding and decoding in machine translation tasks. It replaces the traditional statistical
approaches like IBM Models with neural networks, allowing for end-to-end training and more
complex translation modeling. Let's walk through the neural machine translation process with a
simplified mathematical example:

Neural Machine Translation: Encoder-Decoder Architecture

Encoder: In NMT, the encoder is typically a recurrent neural network (RNN) or a more advanced
architecture like the Transformer. It processes the input sequence and produces a fixed-size
representation (context vector) that captures the input's semantic information.

Decoder: The decoder is also an RNN or Transformer that generates the output sequence word by
word based on the context vector produced by the encoder. At each time step, the decoder
considers the previous generated word and context vector to predict the next word.

Example: English to French Translation

Source Sentence: "I love machine learning."


Target Sentence: "J'adore l'apprentissage automatique."

Vocabulary (for simplicity): English: {"I", "love", "machine", "learning"} French: {"J'adore",
"l'apprentissage", "automatique"}

Neural Network Components:

Word Embeddings: Each word is represented by a continuous vector called a word embedding.

Encoder: Let's assume a simple RNN encoder. It processes each word embedding sequentially and
updates its hidden state.

Encoder Hidden State at Time Step t: h_t = RNN(h_{t-1}, x_t)

Where h_0 is the initial hidden state and x_t is the word embedding at time step t.

Decoder: We'll use an RNN decoder with attention. The decoder's hidden state at each time step is
influenced by the previous hidden state, the previously generated word, and a weighted
combination of encoder hidden states.

Decoder Hidden State at Time Step t: s_t = RNN(s_{t-1}, y_{t-1}, c_t)

Where s_0 is the initial decoder hidden state, y_{t-1} is the previous generated word's embedding,
and c_t is the context vector.

Attention Mechanism: The attention mechanism calculates attention scores between the decoder's
current hidden state and the encoder's hidden states. These scores determine how much each
encoder hidden state contributes to the current context vector.

Attention Score at Time Step t: e_{t,i} = f(s_t, h_i)

Where f is a scoring function, s_t is the decoder hidden state, and h_i is the encoder hidden state at
position i.

Context Vector: The context vector is a weighted sum of encoder hidden states based on attention
scores.

Context Vector at Time Step t: c_t = \sum_i a_{t,i} * h_i

Where a_{t,i} is the attention weight for encoder hidden state h_i at time step t, obtained by applying a softmax over the attention scores e_{t,i}.

Output Generation: The decoder's hidden state and context vector are used to predict the next
word's probability distribution over the target vocabulary.

P(y_t | y_{<t}, x) = softmax(W_o * [s_t; c_t])

Where [s_t; c_t] is the concatenation of the decoder hidden state and context vector, and W_o is a
weight matrix.
The process is repeated until the decoder generates an end-of-sequence token or a predefined
maximum sequence length.
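
The attention computation above fits in a few lines of NumPy. This sketch uses dot-product scoring for f(s_t, h_i), which is one common choice; the text leaves the scoring function unspecified, so treat that as an assumption, and the hidden states here are random numbers used purely for illustration.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Encoder hidden states h_1 ... h_4 (one row per source word) and the current
# decoder hidden state s_t; values are random for illustration.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))   # 4 source positions, hidden size 8
s_t = rng.normal(size=8)

# Attention scores e_{t,i} = f(s_t, h_i); here f is a dot product.
scores = H @ s_t              # shape (4,)

# Attention weights a_{t,i} = softmax over the scores.
weights = softmax(scores)

# Context vector c_t = sum_i a_{t,i} * h_i.
c_t = weights @ H             # shape (8,)

print("attention weights:", np.round(weights, 3))
print("context vector shape:", c_t.shape)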

Kernels in NLP:

1. Linear Kernel: The simplest kernel, representing the original feature space. It's useful
when the data is already linearly separable.

2. Polynomial Kernel: Computes the similarity between data points as the polynomial of
the inner product of their original feature vectors. It can capture some non-linear
relationships.

3. Radial Basis Function (RBF) Kernel: Also known as the Gaussian kernel, it's widely
used due to its ability to capture complex non-linear relationships. It assigns higher
similarity to nearby points and lower similarity to distant points.

4. String Kernels: Designed for working with string or sequence data, like text. They
measure the similarity between strings based on sub-sequences or substring matches.
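
The three numeric kernels above can be written directly as functions of two feature vectors. A minimal NumPy sketch follows; the degree, coef0, and gamma values are illustrative defaults, not prescribed settings.

import numpy as np

def linear_kernel(x, y):
    # Plain inner product in the original feature space.
    return np.dot(x, y)

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    # (x . y + coef0)^degree captures some non-linear interactions.
    return (np.dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    # Gaussian similarity: close vectors score near 1, distant vectors near 0.
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 0.0, 1.0])
y = np.array([1.0, 1.0, 0.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))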

Support Vector Machines (SVM) with Kernels: Kernel methods are often used with Support
Vector Machines (SVMs) to perform tasks like text classification and sentiment analysis. The
kernelized SVM finds the hyperplane that best separates the data in the transformed space. The
SVM with a kernel can capture complex decision boundaries and handle non-linear data.

Kernelized Text Classification: In NLP, kernel methods can be used for text classification by
representing documents as feature vectors (e.g., bag-of-words or TF-IDF) and using a kernel
function to compute the similarity between documents. Kernelized SVMs can then classify
documents into classes based on these similarities.

Kernel Trick and Computational Efficiency: One of the advantages of kernel methods is the
"kernel trick." It allows us to compute the inner product in the higher-dimensional space without
explicitly transforming the data. This can save computational resources and memory.

Limitations:

 Kernel methods can be sensitive to parameter tuning, such as the choice of kernel and its
hyperparameters.

 They might not perform well on very high-dimensional data due to the "curse of
dimensionality."

 For certain problems, kernel methods can be computationally intensive compared to other
techniques.

Examples:
 Using an RBF kernel to classify text documents based on their content.

 Using string kernels to compare DNA sequences in bioinformatics.

Kernel methods offer a powerful way to handle non-linear relationships in NLP tasks, but
choosing the right kernel and tuning parameters is essential for optimal performance.

Example: Sentiment Analysis with RBF Kernel

Task: Given a collection of movie reviews, classify each review as either positive or negative
sentiment.

Dataset: A dataset containing movie reviews labeled as positive or negative.

Steps:

1. Data Preprocessing:

 Tokenize the text: Split each review into individual words.

 Convert text to numerical representation: Use techniques like bag-of-words or TF-IDF to represent text as feature vectors.

2. Feature Vectorization: Assume we have two movie reviews as follows:

 Review 1 (Positive): "I loved this movie. It was amazing!"

 Review 2 (Negative): "The movie was terrible. I hated it."

We represent these reviews using bag-of-words feature vectors:

 Review 1: [1, 1, 0, 1, 1, 0, 0, ...] (indicating presence/absence of each word)

 Review 2: [1, 0, 1, 0, 0, 1, 1, ...]

3. Kernelized SVM: We use SVM with the RBF kernel to classify the reviews. The RBF
kernel computes similarity between data points in a higher-dimensional space.

The SVM finds the optimal hyperplane that best separates the positive and negative reviews in
this higher-dimensional space.

4. Training and Testing:

 Split the dataset into a training set and a testing set.

 Train the SVM on the training set using the RBF kernel.

 Test the SVM on the testing set to predict sentiment labels.


Results: The trained SVM with the RBF kernel can accurately classify movie reviews into
positive or negative sentiment categories, even when the relationship between words and
sentiment is complex and non-linear. The RBF kernel implicitly captures these non-linear
relationships in the high-dimensional space.

Advantages of Kernel Methods in this Example:

 SVM with the RBF kernel can capture complex decision boundaries, which can be useful
when sentiment analysis involves intricate relationships between words.

 The kernel trick allows us to implicitly work in a higher-dimensional space without explicitly transforming feature vectors, which can save computational resources.

Note: In practice, NLP tasks can involve more complex preprocessing, feature engineering, and
model tuning. The example provided here is a simplified illustration to showcase the concept of
kernel methods in NLP.
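
As a small end-to-end sketch of the pipeline described above, assuming scikit-learn is installed and using four made-up one-line reviews, TF-IDF features can be fed into an SVM with an RBF kernel:

# Minimal sentiment-classification sketch with an RBF-kernel SVM
# (assumes scikit-learn is installed; the tiny dataset is illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

reviews = [
    "I loved this movie. It was amazing!",
    "A wonderful, touching film.",
    "The movie was terrible. I hated it.",
    "Boring plot and awful acting.",
]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf", gamma="scale"))
model.fit(reviews, labels)

print(model.predict(["What an amazing, wonderful movie!"]))
print(model.predict(["Awful. I hated every minute."]))

With realistic datasets, the SVM hyperparameters (C, gamma), the vectorizer settings, and a proper train/test split all need tuning and evaluation.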

Word-Context Matrix Factorization models are a class of methods used in Natural Language
Processing (NLP) for learning word embeddings or representations by factorizing a word-context
matrix. These models capture the co-occurrence patterns of words in a large corpus to represent
words in a dense vector space. One popular model in this category is the Singular Value
Decomposition (SVD) method for word embedding learning. Let's explore this concept with an
example:

Example: Learning Word Embeddings using SVD

Corpus: Consider a small corpus containing the following sentences:

1. "I love natural language processing."

2. "Machine learning is fascinating."

3. "NLP and machine learning are important."

Vocabulary: The vocabulary consists of unique words in the corpus:

 {"I", "love", "natural", "language", "processing", "Machine", "learning", "is",


"fascinating", "and", "are", "important"}

Word-Context Matrix: Create a word-context matrix where each cell (i, j) represents the co-
occurrence count of word i with context word j (within a certain window of words).

              I  love  natural  language  processing  Machine  learning  is  fascinating  and  are  important
I             0    1      1        1          0          0        0       0       0        0    0      0
love          1    0      0        0          0          0        0       0       0        0    0      0
natural       1    0      0        1          1          0        0       0       0        0    0      0
language      1    0      1        0          1          0        0       0       0        0    0      0
processing    0    0      1        1          0          0        0       0       0        0    0      0
Machine       0    0      0        0          0          1        1       0       1        1    0      0
learning      0    0      0        0          0          1        0       1       0        1    0      0
is            0    0      0        0          0          0        1       0       0        0    0      0
fascinating   0    0      0        0          0          1        0       0       1        0    0      0
and           0    0      0        0          0          0        1       0       0        0    1      0
are           0    0      0        0          0          0        0       0       0        1    0      1
important     0    0      0        0          0          0        0       0       0        0    1      0

SVD Factorization: Apply Singular Value Decomposition (SVD) on the word-context matrix.
SVD decomposes the matrix into three matrices: U (word vectors), Σ (diagonal matrix of
singular values), and V^T (context vectors).

Word-Context Matrix = U * Σ * V^T

Word Embeddings: The U matrix represents word embeddings, and each row corresponds to a
word's dense vector representation in the embedding space.

Example: Assuming the SVD decomposition yields U as follows:

U = [
    [0.2, 0.1, 0.3],
    [0.5, 0.2, 0.7],
    [0.8, 0.6, 0.4],
    ...
]

This means the word "love" (the second word in the vocabulary, and hence the second row of U) is represented by the vector [0.5, 0.2, 0.7] in the embedding space.

Application: Word embeddings obtained from SVD can be used in various NLP tasks like word similarity, text classification, and more. In practice, the word-context matrix is much larger, and SVD is often approximated using techniques like Truncated SVD or randomized SVD for efficiency.
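
A small NumPy sketch of the whole pipeline is given below. It re-counts co-occurrences with a symmetric window of two words (and lowercases the text), so the resulting matrix is illustrative rather than identical to the hand-built table above; the top-k singular vectors are kept as embeddings.

import numpy as np

corpus = [
    "I love natural language processing",
    "Machine learning is fascinating",
    "NLP and machine learning are important",
]
sentences = [s.lower().split() for s in corpus]

vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a window of 2 words.
window = 2
M = np.zeros((len(vocab), len(vocab)))
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                M[idx[w], idx[sent[j]]] += 1

# Keep the top-k singular vectors as dense word embeddings.
k = 3
U, S, Vt = np.linalg.svd(M)
embeddings = U[:, :k] * S[:k]

def most_similar(word, topn=3):
    # Cosine similarity between the chosen word and every other word.
    v = embeddings[idx[word]]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v) + 1e-9
    sims = embeddings @ v / norms
    return sorted(zip(vocab, sims), key=lambda p: -p[1])[1:topn + 1]

print(most_similar("learning"))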

Recurrent Neural Networks (RNNs) are a class of neural networks commonly used in Natural
Language Processing (NLP) for tasks that involve sequences of data, such as text. RNNs have a
unique ability to capture sequential dependencies and context, making them well-suited for tasks
like language modeling, machine translation, sentiment analysis, and more. Let's explore the
concept of RNNs in NLP with an example.

Example: Sentiment Analysis with RNN

Task: Given a sequence of words (a sentence), classify its sentiment as positive or negative.

Dataset: A dataset containing labeled movie reviews as positive or negative sentiment.

Architecture: We'll use a simple RNN for this example.

Steps:

1. Data Preprocessing:

 Tokenize the text: Split each review into individual words.

 Convert text to numerical representation: Use word embeddings to represent each word as a vector.

2. Embedding Layer:

 Transform each word in the sentence into a vector representation using pre-
trained word embeddings or by training embeddings on your dataset.

3. RNN Layer:

 The RNN processes the sequence of word embeddings, one word at a time, while
maintaining hidden states that capture context.

 At each time step, the RNN takes the current word embedding and the previous
hidden state to calculate the current hidden state.

4. Fully Connected Layer:

 After processing the entire sequence, the final hidden state is passed through a
fully connected layer to produce the sentiment prediction.

Example: Sentences:

1. "I loved the movie. It was amazing!"


2. "The movie was terrible. I hated it."

Word Embeddings: Assume we have word embeddings for each word in the vocabulary.

RNN Process:

 For each sentence, the RNN processes the sequence of word embeddings, updating the
hidden state at each time step.

 At the last time step, the final hidden state is used for sentiment prediction.

Sentiment Prediction: The fully connected layer takes the final hidden state and predicts
whether the sentiment is positive or negative.

Results: The RNN can learn the sequential patterns and dependencies in the data, allowing it to
capture the sentiment context in sentences and make accurate sentiment predictions.

Advantages of RNNs in this Example:

 RNNs are capable of modeling long-range dependencies in sequences, which is crucial for understanding context in language.

 They can handle variable-length sequences, making them suitable for tasks involving
sentences of different lengths.

Limitations:

 Standard RNNs can suffer from the "vanishing gradient" problem, where gradients
become very small during training, affecting the learning process.

 Long sequences can lead to computational inefficiencies and difficulties in learning long-
term dependencies.

In practice, more advanced RNN architectures like LSTM (Long Short-Term Memory) and GRU
(Gated Recurrent Unit) are often used to address the vanishing gradient problem and improve the
modeling of long-term dependencies.
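
A minimal Keras sketch of the sentiment pipeline described above (tokenize, embed, run an RNN over the sequence, classify from the final hidden state) is shown below, assuming TensorFlow/Keras is installed and using the two example reviews as a toy dataset. Swapping SimpleRNN for LSTM or GRU is a one-line change that addresses the vanishing-gradient issue mentioned above.

# Minimal RNN sentiment classifier sketch (assumes TensorFlow/Keras is installed;
# the two reviews and the tiny vocabulary are illustrative only).
import numpy as np
import tensorflow as tf

texts = ["i loved the movie it was amazing", "the movie was terrible i hated it"]
labels = np.array([1, 0])  # 1 = positive, 0 = negative

# Build a tiny word index and encode each review as a padded sequence of ids.
vocab = sorted({w for t in texts for w in t.split()})
word_to_id = {w: i + 1 for i, w in enumerate(vocab)}  # 0 is reserved for padding
max_len = max(len(t.split()) for t in texts)
X = np.array([[word_to_id[w] for w in t.split()] + [0] * (max_len - len(t.split()))
              for t in texts])

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(vocab) + 1, output_dim=8, mask_zero=True),
    tf.keras.layers.SimpleRNN(16),                 # final hidden state summarizes the sequence
    tf.keras.layers.Dense(1, activation="sigmoid") # probability of positive sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=30, verbose=0)

print(model.predict(X, verbose=0))  # predicted probabilities of positive sentiment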

Recurrent Neural Network (RNN) for Language Modeling

Let's walk through a Recurrent Neural Network (RNN) in Natural Language Processing (NLP) for a language modeling task. In this example, we'll predict the next word in a sentence given the previous words.

Language Modeling with RNN: Predicting the Next Word

Vocabulary: Consider a small vocabulary for simplicity:

 {"I", "love", "machine", "learning", "NLP"}


Training Sentence: "I love machine learning."

RNN Architecture: For this example, we'll use a simple RNN with a single hidden layer and a
softmax output layer. Let's define some parameters:

 Vocabulary size (V): 5

 Hidden layer size (H): 3

 Time steps (T): 4 (one for each word in the training sentence)

Word Embeddings: We'll represent each word with a one-hot encoded vector.

RNN Equations: At each time step t, the RNN computes the hidden state h_t and the predicted
output y_t using the following equations:

1. Hidden State:

h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)

Where W_h is the hidden state weight matrix, W_x is the input (word embedding) weight
matrix, and b is the bias term.

2. Output:

y_t = softmax(W_y * h_t + c)

Where W_y is the output weight matrix and c is the output bias term.

Example Calculation: Let's calculate the hidden states and predicted outputs step by step for the
training sentence "I love machine learning."

1. Step 1 (t=1):

 Input word: "I" (one-hot encoded vector)

 Initial hidden state: zeros (h_0)

 Compute hidden state: h_1 = tanh(W_h * h_0 + W_x * x_1 + b)

 Compute output: y_1 = softmax(W_y * h_1 + c)

2. Step 2 (t=2):

 Input word: "love" (one-hot encoded vector)


 Compute hidden state: h_2 = tanh(W_h * h_1 + W_x * x_2 + b)

 Compute output: y_2 = softmax(W_y * h_2 + c)

3. Step 3 (t=3):

 Input word: "machine" (one-hot encoded vector)

 Compute hidden state: h_3 = tanh(W_h * h_2 + W_x * x_3 + b)

 Compute output: y_3 = softmax(W_y * h_3 + c)

4. Step 4 (t=4):

 Input word: "learning" (one-hot encoded vector)

 Compute hidden state: h_4 = tanh(W_h * h_3 + W_x * x_4 + b)

 Compute output: y_4 = softmax(W_y * h_4 + c)

Training: During training, the network's goal is to minimize the difference between the predicted outputs (y_1, y_2, y_3, y_4) and the actual target words, i.e., the next word at each time step ("love", "machine", "learning", and an end-of-sentence token). This is typically done by minimizing a cross-entropy loss and updating the weights with backpropagation through time.
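
The forward pass described above fits in a few lines of NumPy. The weights below are random values used purely for illustration, so the "predicted" words are meaningless until the network is actually trained.

import numpy as np

vocab = ["I", "love", "machine", "learning", "NLP"]
V, H = len(vocab), 3
one_hot = {w: np.eye(V)[i] for i, w in enumerate(vocab)}

rng = np.random.default_rng(42)
W_h = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden weights
W_x = rng.normal(scale=0.1, size=(H, V))   # input-to-hidden weights
b   = np.zeros(H)                          # hidden bias
W_y = rng.normal(scale=0.1, size=(V, H))   # hidden-to-output weights
c   = np.zeros(V)                          # output bias

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

sentence = ["I", "love", "machine", "learning"]
h = np.zeros(H)                             # h_0
for t, word in enumerate(sentence, start=1):
    x = one_hot[word]
    h = np.tanh(W_h @ h + W_x @ x + b)      # h_t
    y = softmax(W_y @ h + c)                # y_t: distribution over the next word
    predicted = vocab[int(np.argmax(y))]
    print(f"step {t}: input={word!r}, predicted next word={predicted!r}")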
