NLP Lab Manual Final
Elective VI 410255(A):
Natural Language
Processing
Dataset to be used:
https://www.kaggle.com/datasets/CooperUnion/cardataset
Theory:
Tokenization:
Tokenization is the process of converting raw text into useful strings of data. It is used in NLP
to split paragraphs and sentences into smaller chunks that can be more easily assigned meaning.
Tokenization can be done at either the word level or the sentence level. Splitting text into words
is called word tokenization, and splitting it into sentences is called sentence tokenization.
In the tokenization process, unstructured data and natural language text are broken into chunks of
information that can be understood by a machine.
Tokenization is the first crucial step of the NLP pipeline, as it converts sentences into
understandable bits of data for the program to work with. Without proper tokenization, the NLP
process can quickly devolve into a chaotic task.
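For example, NLTK provides ready-made sentence and word tokenizers; a minimal sketch (the sample text is illustrative and assumes the punkt model has been downloaded):
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')

text = "Tokenization splits text. It works at sentence or word level."
print(sent_tokenize(text))  # sentence tokenization: two sentences
print(word_tokenize(text))  # word tokenization: individual words and punctuation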
Challenges of Tokenization:
1. Ambiguity: Language is inherently ambiguous. Consider the sentence "Flying planes can be
dangerous." Depending on how it's tokenized and interpreted, it could mean that the act of
piloting planes is risky or that planes in flight pose a danger. Such ambiguities can lead to
vastly different interpretations.
2. Languages without clear boundaries: Some languages, like Chinese or Japanese, don't have
clear spaces between words, making tokenization a more complex task. Determining where
one word ends and another begins can be a significant challenge in such languages.
Types of tokenization:
1. Word tokenization: This method breaks text down into individual words. It's the most
common approach and is particularly effective for languages with clear word boundaries, like
English.
2. Whitespace tokenization: A whitespace tokenizer splits on and discards only whitespace
characters. Such an implementation can return Word, CoreLabel, or other LexedToken objects,
and has a parameter for whether to make end-of-line (EOL) a token or to treat EOL characters
as whitespace.
3. Rule-Based Tokenization: In this approach, predefined rules are used to determine token
boundaries. Punctuation marks like periods, commas, and question marks are often used as
cues to split text into tokens.
4. Regular expression tokenizer: A RegexpTokenizer splits a string into substrings using a
regular expression. For example, a tokenizer can be defined that forms tokens out of alphabetic
sequences, money expressions, and any other non-whitespace sequences (a short sketch is given
after this list).
5. Penn Treebank tokenizer: It is a rule-based tokenization method that separates out clitics
(words that normally occur only in combination with another word, for example in I'm), keeps
hyphenated words together, and separates out all punctuation.
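As mentioned in point 4, a short sketch of NLTK's RegexpTokenizer (the pattern and sample text are illustrative):
from nltk.tokenize import RegexpTokenizer

# Pattern: words, money expressions, or any remaining non-whitespace run
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
print(tokenizer.tokenize("Good muffins cost $3.88 in New York."))
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.']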
Stemming:
Stemming, in Natural Language Processing (NLP), refers to the process of reducing a word to its
word stem by removing affixes such as suffixes and prefixes. A stemming algorithm is a
linguistic normalization process in which the variant forms of a word are reduced to a standard
form.
The process of removing affixes from a word so that we are left with the stem of that word is
called stemming. For example, the words 'run', 'running', and 'runs' all reduce to the root word
'run' after stemming is applied to them.
The goal of stemming is to simplify and standardize words, which helps improve the
performance of information retrieval, text classification, and other NLP tasks.
Stemming is a natural language processing technique that lowers inflection in words to their
root forms.
Martin Porter invented the Porter Stemmer, or Porter algorithm, in 1980. The method uses five
steps of word reduction, each with its own set of mapping rules. The Porter Stemmer is the
original stemmer and is renowned for its ease of use and rapidity. Frequently, the resultant stem
is a shorter word with the same root meaning.
PorterStemmer() is the class in NLTK that implements the Porter stemming technique.
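A minimal sketch of the Porter stemmer in NLTK (the example words are illustrative):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["run", "running", "runs", "connection", "connected"]:
    # Each inflected form is reduced to its stem
    print(word, "->", stemmer.stem(word))
# run -> run, running -> run, runs -> run, connection -> connect, connected -> connect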
Lemmatization:
Lemmatization is the process of grouping together different inflected forms of the same
word. It's used in computational linguistics, natural language processing (NLP)
and chatbots. Lemmatization links similar meaning words as one word, making tools
such as chatbots and search engine queries more effective and accurate.
The goal of lemmatization is to reduce a word to its root form, also called a lemma. For
example, the verb "running" would be identified as "run."
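A minimal sketch with NLTK's WordNetLemmatizer (the example words are illustrative; note that supplying a part-of-speech tag such as pos='v' helps the lemmatizer handle verbs):
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("mice"))              # mouse
print(lemmatizer.lemmatize("better", pos="a"))   # good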
Implementation:
import nltk
from nltk.tokenize import word_tokenize, TreebankWordTokenizer, TweetTokenizer
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')
# Sample sentence
sentence = ("NLTK is a powerful library for natural language processing in Python. "
            "It provides easy-to-use interfaces and tools for tokenization, stemming, and lemmatization.")
# Whitespace-based tokenization
whitespace_tokens = sentence.split()
# Punctuation-based tokenization
punctuation_tokens = nltk.word_tokenize(sentence)
# Treebank tokenization
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(sentence)
# Tweet tokenization
tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(sentence)
# Porter Stemmer
porter_stemmer = PorterStemmer()
porter_stemmed_words = [porter_stemmer.stem(word) for word in punctuation_tokens]
# Snowball Stemmer
snowball_stemmer = SnowballStemmer('english')
snowball_stemmed_words = [snowball_stemmer.stem(word) for word in punctuation_tokens]
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in punctuation_tokens]
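To inspect the results, the outputs of the script above can be printed (this snippet simply continues the implementation and assumes the variables defined above are in scope):
# Display the results of the different tokenizers, stemmers, and the lemmatizer
print("Whitespace tokens:", whitespace_tokens)
print("Punctuation-based tokens:", punctuation_tokens)
print("Treebank tokens:", treebank_tokens)
print("Tweet tokens:", tweet_tokens)
print("Porter stems:", porter_stemmed_words)
print("Snowball stems:", snowball_stemmed_words)
print("Lemmas:", lemmatized_words)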
Conclusion:
Tokenization, stemming, and lemmatization are preprocessing steps in natural language text
processing. Tokenization breaks text down into small units, while stemming and lemmatization
reduce those units to their root forms.
Assignment no 2
Title :
Perform bag-of-words approach (count occurrence, normalized count occurrence), TF-IDF on
data. Create embeddings using Word2Vec.
Dataset to be used: https://www.kaggle.com/datasets/CooperUnion/cardataset
Theory:
Bag of words:
The bag of words model is a simple way to convert words into a numerical representation in
natural language processing. It is a simple document embedding technique based on word
frequency. Conceptually, we think of the whole document as a “bag” of words rather than a
sequence, and we represent the document simply by the frequency of each word. Using this
technique, we can embed a whole set of documents and feed them into a variety of different
machine learning algorithms.
For example, if we have a vocabulary of 1000 words, then the whole document is represented by
a 1000-dimensional vector, where the vector’s ith entry holds the frequency of the ith vocabulary
word in the document.
In short, a bag of words is a representation of text that describes the occurrences of words within
a document.
The following models a text document using bag-of-words. Here are two simple text documents:
(1) John likes to watch movies. Mary likes movies too.
(2) Mary also likes to watch football games.
Based on these two text documents, a list of tokens is constructed for each document:
"John","likes","to","watch","movies","Mary","likes","movies","too"
"Mary","also","likes","to","watch","football","games"
BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};
Each key is the word, and each value is the number of occurrences of that word in the given text
document.
Combining the two documents gives a third document:
(3) John likes to watch movies. Mary likes movies too. Mary also likes to watch football games.
BoW3 = {"John":1,"likes":3,"to":2,"watch":2,"movies":2,"Mary":2,"too":1,"also":1,"football":1,"games":1};
So, as we see in the bag algebra, the "union" of two documents in the bags-of-words
representation is, formally, the disjoint union, summing the multiplicities of each element.
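As an illustrative sketch (not part of the original example), scikit-learn's CountVectorizer can produce both the raw count occurrence and a normalized count occurrence for the two documents above (get_feature_names_out assumes a recent scikit-learn version):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

docs = ["John likes to watch movies. Mary likes movies too.",
        "Mary also likes to watch football games."]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)             # raw count occurrence
norm_counts = normalize(counts, norm='l1', axis=1)  # normalized count occurrence (term frequency)

print(vectorizer.get_feature_names_out())
print(counts.toarray())
print(norm_counts.toarray())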
TF-IDF:
Words within a text document are transformed into importance numbers by a text
vectorization process. There are many different text vectorization scoring schemes, with TF-
IDF being one of the most common.
Term Frequency: TF of a term or word is the number of times the term appears in a
document compared to the total number of words in the document.
TF = (number of times the term appears in the document) / (total number of terms in the document)
Inverse Document Frequency: IDF of a term reflects the proportion of documents in the
corpus that contain the term. Words unique to a small percentage of documents (e.g.,
technical jargon terms) receive higher importance values than words common across all
documents (e.g., a, the, and). It is commonly computed as
IDF = log(total number of documents in the corpus / number of documents containing the term)
TF-IDF = TF * IDF
For example, if a term appears 3 times in a 100-word document, TF = 0.03; if it appears in 10 of
1,000 documents, IDF = log10(1000/10) = 2, so TF-IDF = 0.06.
Translated into plain English, importance of a term is high when it occurs a lot in a given
document and rarely in others. In short, commonality within a document measured by TF is
balanced by rarity between documents measured by IDF. The resulting TF-IDF score
reflects the importance of a term for a document in the corpus.
TF-IDF is useful in many natural language processing applications. For example, Search
Engines use TF-IDF to rank the relevance of a document for a query. TF-IDF is also
employed in text classification, text summarization, and topic modeling.
import pandas as pd
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# Load the dataset (adjust the filename/path to your downloaded copy of the Kaggle car dataset)
data = pd.read_csv('data.csv')

# Sample data preprocessing (you may need to adjust this based on your dataset)
def preprocess_text(text):
    # Remove non-alphanumeric characters and convert to lowercase
    text = re.sub(r'[^a-zA-Z0-9\s]', '', str(text)).lower()
    return text

data['Model'] = data['Model'].apply(preprocess_text)

# Tokenization
tokenized_text = [word_tokenize(text) for text in data['Model']]

# Bag of words (count occurrence)
count_vectorizer = CountVectorizer()
bow_matrix = count_vectorizer.fit_transform(data['Model'])

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data['Model'])

# Word2Vec
# Note: Word2Vec requires a list of sentences as input, where each sentence is
# represented as a list of words. Here 'tokenized_text' contains the preprocessed
# and tokenized sentences.
word2vec_model = Word2Vec(sentences=tokenized_text, vector_size=100, window=5,
                          min_count=1, workers=4)

print("Bag-of-Words matrix shape:", bow_matrix.shape)
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())
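Once trained, the Word2Vec model above can be queried for individual embeddings and similar words; a short continuation of the script (the sample word is simply taken from the learned vocabulary):
# Inspect the learned Word2Vec embeddings (continuation of the script above)
sample_word = word2vec_model.wv.index_to_key[0]      # any word from the vocabulary
print("Embedding for", sample_word, ":", word2vec_model.wv[sample_word][:10])
print("Most similar words:", word2vec_model.wv.most_similar(sample_word, topn=5))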
Conclusion:
1. Bag of words, TF-IDF, and Word2Vec are word embedding / text vectorization techniques.
2. Words are transformed into numerical form, which is necessary in order to apply machine
learning algorithms.
Text cleaning:
Text cleaning is the process of preparing raw text for NLP (Natural Language Processing) so
that machines can understand human language.
Clean text is human language rearranged into a format that machine models can understand.
Text cleaning can be performed using simple Python code that eliminates stop words, removes
Unicode characters, and simplifies complex words to their root form.
The most common methods for cleaning the data include removing HTML tags, removing stop
words, and lemmatization.
Lemmatization: groups words based on their root definition, and allows us to differentiate
between present, past, and indefinite forms.
So, ‘jumps’ and ‘jump’ are grouped into the present ‘jump’, as distinct from all uses of
‘jumped’, which are grouped together as past tense, and all instances of ‘jumping’, which are
grouped together as the indefinite (continuing/continuous) form.
So, if we are looking to find all instances of a product (say an engine) having any sort of ‘jump’-
related response, in order to analyze all responses, good or bad, we would use stemming.
But if we want to break this down further into the type of jump, i.e. whether it was in the past,
the present, or a continuing problem, and want to approach the three different instances with
distinct types of analysis, then we would use lemmatization.
INPUT:
"jump", "jumped", "jumps", "jumping"
PYTHON CODE:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["jump", "jumped", "jumps", "jumping"]:
    print(word, "=", stemmer.stem(word))
OUTPUT:
jump = jump
jumped = jump
jumps = jump
jumping = jump
import pandas as pd
import nltk
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

# Load a labelled text dataset (adjust the filename; 'text' and 'label' columns are assumed here)
df = pd.read_csv('dataset.csv')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Lemmatize each word and drop stop words; 'text' is the column containing the text data
    text = ' '.join([lemmatizer.lemmatize(word) for word in str(text).split()
                     if word.lower() not in stop_words])
    return text

df['cleaned_text'] = df['text'].apply(preprocess_text)

# Label Encoding
label_encoder = LabelEncoder()
df['encoded_labels'] = label_encoder.fit_transform(df['label'])

# Train/test split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # you can adjust the max_features parameter
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df['cleaned_text'])
X_test_tfidf = tfidf_vectorizer.transform(test_df['cleaned_text'])

# Save outputs
train_df.to_pickle("train_data_processed.pickle")
test_df.to_pickle("test_data_processed.pickle")
pickle.dump(tfidf_vectorizer, open("tfidf_vectorizer.pickle", "wb"))
pickle.dump(label_encoder, open("label_encoder.pickle", "wb"))
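As a quick sanity check (an illustrative addition, not part of the original script), the saved vectorizer and label encoder can be reloaded and applied to new text:
import pickle

# Reload the fitted artifacts saved above
tfidf_vectorizer = pickle.load(open("tfidf_vectorizer.pickle", "rb"))
label_encoder = pickle.load(open("label_encoder.pickle", "rb"))

new_text = ["a short example sentence to vectorize"]
features = tfidf_vectorizer.transform(new_text)
print(features.shape)            # (1, number of TF-IDF features)
print(label_encoder.classes_)    # the original label names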
Conclusion:
We studied concepts such as text cleaning, lemmatization, stop-word removal, and label
encoding.
The following imports set the foundation for defining the Transformer model's architecture,
managing data, and establishing the training process.
import torch
import torch.nn as nn
import math
import copy
The MultiHeadAttention code initializes the module with input parameters and linear
transformation layers. It calculates attention scores, reshapes the input tensor into multiple heads,
and combines the attention outputs from all heads. The forward method computes the multi-head
self-attention, allowing the model to attend to different aspects of the input sequence.
The Position Wise Feed Forward class extends PyTorch’s Module and implements a position-
wise feed-forward network. The class initializes with two linear transformation layers and a ReLU
activation function. The forward method applies these transformations and activation function
sequentially to compute the output. This process enables the model to consider the position of
input elements while making predictions.
The PositionalEncoding class initializes with input parameters d_model and max_seq_length,
creating a tensor to store positional encoding values. The class calculates sine and cosine values
for even and odd indices, respectively, based on the scaling factor div_term. The forward method
computes the positional encoding by adding the stored positional encoding values to the input
tensor, allowing the model to capture the position information of the input sequence.
A)Encoder Layer
An Encoder layer consists of a Multi-Head Attention layer, a Position-wise Feed-Forward layer,
and two Layer Normalization layers.
The Encoder Layer class initializes with input parameters and components, including a Multi
Head Attention module, a Position Wise Feed Forward module, two layer normalization modules,
and a dropout layer. The forward method computes the encoder layer output by applying self-
attention, adding the attention output to the input tensor, and normalizing the result. Then, it
computes the position-wise feed-forward output, combines it with the normalized self-attention
output, and normalizes the final result before returning the processed tensor.
B) Decoder Layer
A Decoder layer consists of two Multi-Head Attention layers, a Position-wise Feed-Forward
layer, and three Layer Normalization layers.
The DecoderLayer initializes with input parameters and components such as MultiHeadAttention
modules for masked self-attention and cross-attention, a PositionWiseFeedForward module, three
layer normalization modules, and a dropout layer.
The forward method computes the decoder layer output by performing the following steps:
1. Calculate the masked self-attention output and add it to the input tensor, followed by dropout
and layer normalization.
2. Calculate the cross-attention output between the decoder representation and the encoder
output, add it to the normalized masked self-attention output, followed by dropout and layer
normalization.
3. Calculate the position-wise feed-forward output and combine it with the normalized cross-
attention output, followed by dropout and layer normalization.
Transformer Model
The Transformer class combines the previously defined modules to create a complete
Transformer model. During initialization, the Transformer module sets up input parameters and
initializes various components, including embedding layers for source and target sequences, a
Positional Encoding module, Encoder Layer and Decoder Layer modules to create stacked layers,
a linear layer for projecting decoder output, and a dropout layer.
The generate_mask method creates binary masks for source and target sequences to ignore
padding tokens and prevent the decoder from attending to future tokens. The forward method
computes the Transformer model’s output through the following steps:
1. Generate source and target masks using the generate_mask method.
2. Compute source and target embeddings, and apply positional encoding and dropout.
3. Process the source sequence through encoder layers, updating the enc_output tensor.
4. Process the target sequence through decoder layers, using enc_output and masks, and
updating the dec_output tensor.
5. Apply the linear projection layer to the decoder output, obtaining output logits.
These steps enable the Transformer model to process input sequences and generate output
sequences based on the combined functionality of its components.
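The simplified implementation that follows omits generate_mask; as a hedged sketch, padding and look-ahead masks for a full encoder-decoder model are typically built along these lines (treating token id 0 as padding is an assumption):
import torch

def generate_mask(src, tgt, pad_token=0):
    # Source mask hides padding tokens: shape (batch, 1, 1, src_len)
    src_mask = (src != pad_token).unsqueeze(1).unsqueeze(2)
    # Target mask hides padding and future positions: shape (batch, 1, tgt_len, tgt_len)
    tgt_pad_mask = (tgt != pad_token).unsqueeze(1).unsqueeze(3)
    seq_len = tgt.size(1)
    no_peek = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    tgt_mask = tgt_pad_mask & no_peek
    return src_mask, tgt_mask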
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))  # shape: (1, max_len, d_model)

    def forward(self, x):
        # Add positional information to the token embeddings (batch-first input)
        return x + self.pe[:, :x.size(1)]

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.nhead = nhead
        self.head_dim = d_model // nhead
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch, seq_len, _ = x.size()
        # Project the input and split it into heads: (batch, nhead, seq_len, head_dim)
        q = self.q_linear(x).view(batch, seq_len, self.nhead, self.head_dim).transpose(1, 2)
        k = self.k_linear(x).view(batch, seq_len, self.nhead, self.head_dim).transpose(1, 2)
        v = self.v_linear(x).view(batch, seq_len, self.nhead, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = F.softmax(scores, dim=-1)
        # Combine the heads back into a single (batch, seq_len, d_model) tensor
        context = torch.matmul(attn, v).transpose(1, 2).contiguous().view(batch, seq_len, self.d_model)
        return self.out(context)

class Feedforward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(Feedforward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # Position-wise feed-forward network: linear -> ReLU -> dropout -> linear
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

class TransformerBlock(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super(TransformerBlock, self).__init__()
        self.attention = MultiHeadAttention(d_model, nhead)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.feedforward = Feedforward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention sub-layer with residual connection and layer normalization
        x = self.norm1(x + self.dropout(self.attention(x)))
        # Feed-forward sub-layer with residual connection and layer normalization
        x = self.norm2(x + self.dropout(self.feedforward(x)))
        return x

class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers, d_ff, max_len=512, dropout=0.1):
        super(Transformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_len)
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, nhead, d_ff, dropout) for _ in range(num_layers)
        ])
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # Embed the token ids and add positional encodings
        x = self.positional_encoding(self.embedding(x))
        for block in self.transformer_blocks:
            x = block(x)
        # Mean-pool over the sequence and project to vocabulary logits
        x = self.fc(x.mean(dim=1))
        return F.log_softmax(x, dim=-1)
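A short usage sketch of the simplified model above (the hyperparameters and dummy input are arbitrary):
# Sanity check: run a batch of random token ids through the model
model = Transformer(vocab_size=1000, d_model=128, nhead=8, num_layers=2, d_ff=512)
dummy_input = torch.randint(0, 1000, (4, 20))   # 4 sequences of 20 token ids
output = model(dummy_input)
print(output.shape)                              # torch.Size([4, 1000])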
Assignment : 05
Title : Morphology is the study of the way words are built up from smaller meaning bearing
units. Study and understand the concepts of morphology by the use of add delete table .
The "add-delete table" is not a standard term used in linguistics or morphology. However, I
assume you are referring to the concept of morphological operations in linguistics.
Morphological operations involve adding or deleting morphemes (the smallest units of meaning)
to create new words or to change the grammatical category or meaning of existing words.
Let's consider the base word "work" and explore how adding or deleting morphemes can create
different forms.
In this example:
1. Add Prefix "un-": The addition of the prefix "un-" negates the action, transforming "work" into
"unwork," and further addition results in "unworkable," meaning "not workable."
2. Add Suffix "-er": The addition of the suffix "-er" transforms "work" into "worker," indicating a
person who performs the action (in this case, someone who works).
3. Add Suffix "-ing": The addition of the suffix "-ing" transforms "work" into "working,"
indicating an ongoing action.
4. Delete Suffix "-k": The deletion of the final "-k" transforms "work" into "wor," resulting in a
change in the word form.
These operations illustrate how morphological processes involve adding or deleting morphemes
to convey different meanings, grammatical categories, or forms of words. Morphological
analysis helps linguists understand how words are constructed and how meaning is conveyed
through morphemes.
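Since the add-delete table itself is not reproduced here, a minimal illustrative Python sketch of such a table for the base word "work" may look as follows (the affixes and glosses simply mirror the examples above):
# Illustrative add-delete table for the base word "work"
base = "work"

add_delete_table = [
    # (operation, affix, result, remark)
    ("add prefix",  "un-",         "un" + base,          "negates the action"),
    ("add affixes", "un- + -able", "un" + base + "able", "'not workable'"),
    ("add suffix",  "-er",         base + "er",          "a person who works"),
    ("add suffix",  "-ing",        base + "ing",         "an ongoing action"),
    ("delete",      "-k",          base[:-1],            "changes the word form ('wor')"),
]

for operation, affix, result, remark in add_delete_table:
    print(f"{operation:<12} {affix:<12} {result:<12} {remark}")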
Conclusion: We studied and understood the concepts of morphology through the use of an
add-delete table.