
Laboratory Practice VI

Elective VI 410255(A):
Natural Language
Processing



INDEX

List of Experiments

1. Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using the NLTK library. Use Porter stemmer and Snowball stemmer for stemming. Use any technique for lemmatization.
   Input / Dataset: use any sample sentence.

2. Perform bag-of-words approach (count occurrence, normalized count occurrence), TF-IDF on data. Create embeddings using Word2Vec.
   Dataset to be used: https://www.kaggle.com/datasets/CooperUnion/cardataset

3. Perform text cleaning, perform lemmatization (any method), remove stop words (any method), label encoding. Create representations using TF-IDF. Save outputs.
   Dataset: https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle

4. Create a transformer from scratch using the Pytorch library.

5. Morphology is the study of the way words are built up from smaller meaning-bearing units. Study and understand the concepts of morphology by the use of add-delete table.

8. Mini Project on NLP application.


Assignment no 1
Title: Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using the NLTK library. Use Porter stemmer and Snowball stemmer for stemming. Use any technique for lemmatization.
Input / Dataset: use any sample sentence.

Theory:

 Tokenization:
Tokenization is the process of converting raw text into useful data strings. It is used in NLP to split paragraphs and sentences into smaller chunks that can be more easily assigned meaning.

Tokenization can be done either at the word level or at the sentence level. If the text is split into words, it is called word tokenization; the separation done for sentences is called sentence tokenization.

 Why is tokenization required?

In the tokenization process, unstructured data and natural language text are broken into chunks of information that can be understood by a machine.

Tokenization converts an unstructured string (text document) into a numerical data structure suitable for machine learning. This allows the machine to understand each of the words by themselves as well as how they function in the larger text. This is especially important for larger texts, as it allows the machine to count the frequencies of certain words as well as where they frequently appear.

Tokenization is the first crucial step of the NLP process, as it converts sentences into understandable bits of data for the program to work with. Without proper tokenization, the NLP process can quickly devolve into a chaotic task.

 Challenges of Tokenization:
1. Ambiguity: Language is inherently ambiguous. Consider the sentence "Flying planes can be dangerous." Depending on how it is tokenized and interpreted, it could mean that the act of piloting planes is risky or that planes in flight pose a danger. Such ambiguities can lead to vastly different interpretations.
2. Languages without clear boundaries: Some languages, like Chinese or Japanese, don't have clear spaces between words, making tokenization a more complex task. Determining where one word ends and another begins can be a significant challenge in such languages.



3. Handling special characters: Texts often contain more than just words. Email addresses, URLs, or special symbols can be tricky to tokenize. For instance, should an email address be treated as a single token or split at the period or the "@" symbol?
4. Use of different languages: Tokenization needs to adapt to different languages, scripts, and linguistic variations, adding complexity to the process.

 Types of tokenization:

1. Word tokenization: This method breaks text down into individual words. It is the most common approach and is particularly effective for languages with clear word boundaries, like English.
2. Whitespace tokenization: A whitespace tokenizer splits on, and discards, only whitespace characters. Such an implementation can return Word, CoreLabel, or other LexedToken objects, and it has a parameter for whether to make EOL a token or whether to treat EOL characters as whitespace.
3. Rule-based tokenization: In this approach, predefined rules are used to determine token boundaries. Punctuation marks like periods, commas, and question marks are often used as cues to split text into tokens.
4. Regular-expression tokenizer: A RegexpTokenizer splits a string into substrings using a regular expression. For example, a tokenizer can form tokens out of alphabetic sequences, money expressions, and any other non-whitespace sequences (see the sketch after this list).
5. Penn Treebank tokenizer: It is a rule-based tokenization method that separates out clitics (words that normally occur only in combination with another word, for example in "I'm"), keeps hyphenated words together, and separates out all punctuation.
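The regular-expression tokenizer mentioned in point 4 can be sketched as follows; the pattern and the sample sentence are illustrative choices, not part of the assignment:

from nltk.tokenize import RegexpTokenizer

# Tokens are word-character runs, money expressions, or any other non-whitespace runs
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

text = "Good muffins cost $3.88 in New York. Please buy me two of them."
print(tokenizer.tokenize(text))
# Expected (roughly): ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
#                      'Please', 'buy', 'me', 'two', 'of', 'them', '.']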

Stemming:
 Stemming, in Natural Language Processing (NLP), refers to reducing a word to its word stem by stripping the affixes (suffixes and prefixes) attached to the root. A stemming algorithm is a linguistic normalization process in which the variant forms of a word are reduced to a standard form.

 The process of removing affixes from a word so that we are left with the stem of that word is called stemming. For example, the words 'run', 'running', and 'runs' all convert into the root word 'run' after stemming is applied to them.

Why stemming is required:

 The goal of stemming is to simplify and standardize words, which helps improve the performance of information retrieval, text classification, and other NLP tasks.
 Stemming is a natural language processing technique that lowers inflection in words to their root forms, hence aiding in the preprocessing of text, words, and documents for text normalization.

 Types of Stemmer in NLTK


1. Porter stemmer
2. Snowball stemmer
3. Lancaster stemmer
4. Regexp stemmer

1. Porter Stemmer – PorterStemmer()

Martin Porter invented the Porter Stemmer, or Porter algorithm, in 1980. The method uses five steps of word reduction, each with its own set of mapping rules. The Porter Stemmer is the original stemmer and is renowned for its ease of use and speed. Frequently, the resultant stem is a shorter word with the same root meaning.

PorterStemmer() is the class in NLTK that implements the Porter stemming technique.
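A small illustrative comparison of the Porter and Snowball stemmers in NLTK (the word list is chosen only to show where the two differ):

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

for word in ["running", "generously", "fairly", "studies"]:
    print(word, "->", porter.stem(word), "|", snowball.stem(word))
# e.g. "generously" stems to "gener" with Porter but "generous" with Snowball,
# and "fairly" stems to "fairli" with Porter but "fair" with Snowball.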

Lemmatization:
 Lemmatization is the process of grouping together different inflected forms of the same
word. It's used in computational linguistics, natural language processing (NLP)
and chatbots. Lemmatization links similar meaning words as one word, making tools
such as chatbots and search engine queries more effective and accurate.

 The goal of lemmatization is to reduce a word to its root form, also called a lemma. For
example, the verb "running" would be identified as "run."
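A short sketch of the "running" example using NLTK's WordNetLemmatizer; note that this lemmatizer treats words as nouns unless a part-of-speech tag is supplied, so pos='v' is what maps "running" to "run":

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("running"))           # 'running' (treated as a noun by default)
print(lemmatizer.lemmatize("running", pos='v'))  # 'run'
print(lemmatizer.lemmatize("mice"))              # 'mouse'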

 Implementation:

import nltk
from nltk.tokenize import wordpunct_tokenize, TreebankWordTokenizer, TweetTokenizer
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')

# Sample sentence
sentence = "NLTK is a powerful library for natural language processing in Python. It provides tools for tokenization, stemming, lemmatization, and more."

# Whitespace-based tokenization
whitespace_tokens = sentence.split()

# Punctuation-based tokenization (splits on whitespace and punctuation)
punctuation_tokens = wordpunct_tokenize(sentence)

# Treebank tokenization
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(sentence)

# Tweet tokenization
tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(sentence)

# Multi-Word Expression (MWE) tokenization
mwe_tokenizer = MWETokenizer([('natural', 'language'), ('processing', 'in', 'Python')])
mwe_tokens = mwe_tokenizer.tokenize(sentence.split())

# Porter Stemmer
porter_stemmer = PorterStemmer()
porter_stemmed_words = [porter_stemmer.stem(word) for word in punctuation_tokens]

# Snowball Stemmer
snowball_stemmer = SnowballStemmer('english')
snowball_stemmed_words = [snowball_stemmer.stem(word) for word in punctuation_tokens]

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in punctuation_tokens]

# Print the results
print("Original Sentence:", sentence)
print("\nWhitespace Tokenization:", whitespace_tokens)
print("\nPunctuation Tokenization:", punctuation_tokens)
print("\nTreebank Tokenization:", treebank_tokens)
print("\nTweet Tokenization:", tweet_tokens)
print("\nMWE Tokenization:", mwe_tokens)
print("\nPorter Stemming:", porter_stemmed_words)
print("\nSnowball Stemming:", snowball_stemmed_words)
print("\nLemmatization:", lemmatized_words)

Conclusion:

Tokenization, stemming, and lemmatization are preprocessing steps in natural language text processing.

Text is broken down into small units by tokenization, while stemming and lemmatization reduce words to their root forms.

Assignment no 2

Title :
Perform bag-of-words approach (count occurrence, normalized count occurrence), TF-IDF on
data. Create embeddings using Word2Vec.
Dataset to be used: https://www.kaggle.com/datasets/CooperUnion/cardataset

Theory:

Bag of words:
The bag-of-words model is a simple way to convert words into a numerical representation in natural language processing. It is a simple document embedding technique based on word frequency. Conceptually, we think of the whole document as a "bag" of words, rather than a sequence, and represent the document simply by the frequency of each word. For example, if we have a vocabulary of 1000 words, then the whole document is represented by a 1000-dimensional vector, where the vector's ith entry represents the frequency of the ith vocabulary word in the document. Using this technique, we can embed a whole set of documents and feed them into a variety of different machine learning algorithms.

A bag of words is a representation of text that describes the occurrences of words within a document.

It involves two things:

1. A vocabulary of known words
2. A measure of the presence of the known words

It is called a bag of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document.

 Advantages of BoW approach:

The most significant advantage of the bag-of-words model is its simplicity and ease of use; it can be used to create an initial draft model before proceeding to more sophisticated word embeddings.

 Disadvantages of BoW approach:

Semantic meaning:
The basic BoW model doesn't consider a word's meaning in the document. It completely ignores the context in which the word is used, even though the same word may be used in different places with different meanings depending on the context or nearby words.
Vector size:
For a larger document, the vector size can be huge, which results in a lot of computational time. We should ignore words based on relevance to our use case.

The following models a text document using bag-of-words. Here are two simple text documents:

(1) John likes to watch movies. Mary likes movies too.


(2) Mary also likes to watch football games.

Based on these two text documents, a list is constructed as follows for each document:

"John","likes","to","watch","movies","Mary","likes","movies","too"

"Mary","also","likes","to","watch","football","games"

Representing each bag-of-words as a JSON object, and attributing it to the respective JavaScript variable:

BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};

Each key is the word, and each value is the number of occurrences of that word in the given text
document.



The order of elements is free, so, for
example {"too":1,"Mary":1,"movies":2,"John":1,"watch":1,"likes":2,"to":1} is also equivalent
to BoW1. It is also what we expect from a strict JSON object representation.

Note: if another document is like a union of these two,

(3) John likes to watch movies. Mary likes movies too. Mary also likes to watch football games.

its JavaScript representation will be:

BoW3 = {"John":1,"likes":3,"to":2,"watch":2,"movies":2,"Mary":2,"too":1,"also":1,"football":1,"games":1};

So, as we see in the bag algebra, the "union" of two documents in the bags-of-words
representation is, formally, the disjoint union, summing the multiplicities of each element.
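The same two toy documents can be vectorized with scikit-learn's CountVectorizer; this is only a small sketch of the count-occurrence representation (note that CountVectorizer lowercases the text and, by default, drops single-character tokens):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # one row of word counts per document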

TF-IDF:

Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical


method in natural language processing and information retrieval. It measures how important
a term is within a document relative to a collection of documents (i.e., relative to a corpus).

Words within a text document are transformed into importance numbers by a text
vectorization process. There are many different text vectorization scoring schemes, with TF-
IDF being one of the most common.

Term Frequency: TF of a term or word is the number of times the term appears in a document divided by the total number of words in the document.

TF = (number of times the term appears in the document) / (total number of terms in the document)

Inverse Document Frequency: IDF of a term reflects the proportion of documents in the corpus that contain the term. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance values than words common across all documents (e.g., a, the, and).

IDF = log((number of documents in the corpus) / (number of documents in the corpus that contain the term))

The TF-IDF of a term is calculated by multiplying its TF and IDF scores.

TF-IDF = TF * IDF

Translated into plain English, the importance of a term is high when it occurs a lot in a given document and rarely in others. In short, commonality within a document measured by TF is balanced by rarity between documents measured by IDF. The resulting TF-IDF score reflects the importance of a term for a document in the corpus.

TF-IDF is useful in many natural language processing applications. For example, Search
Engines use TF-IDF to rank the relevance of a document for a query. TF-IDF is also
employed in text classification, text summarization, and topic modeling.
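A short worked example under the definitions above (all numbers are chosen purely for illustration): suppose the word "hybrid" appears 3 times in a 100-word document, so TF = 3/100 = 0.03. If the corpus contains 1,000 documents and 10 of them contain "hybrid", then IDF = log(1000/10) = log(100) = 2 (using a base-10 logarithm), giving TF-IDF = 0.03 * 2 = 0.06. A word such as "the" that appears in every document gets IDF = log(1000/1000) = 0, so its TF-IDF score is 0 no matter how often it occurs.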

import re
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# Load the dataset
dataset_path = "path_to_your_dataset.csv"  # Update with the correct path
data = pd.read_csv(dataset_path)

# Sample data preprocessing (you may need to adjust this based on your dataset)
def preprocess_text(text):
    # Remove non-alphanumeric characters and convert to lowercase
    text = re.sub(r'[^a-zA-Z0-9\s]', '', str(text)).lower()
    return text

data['Model'] = data['Model'].apply(preprocess_text)

# Tokenization
tokenized_text = [word_tokenize(text) for text in data['Model']]

# Bag-of-Words (count occurrence)
count_vectorizer = CountVectorizer()
bow_matrix = count_vectorizer.fit_transform(data['Model'])

# Bag-of-Words (normalized count occurrence)
# Each row is divided by its total count, so the entries are relative frequencies.
normalized_bow_matrix = normalize(bow_matrix, norm='l1', axis=1)

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data['Model'])

# Word2Vec
# Note: Word2Vec requires a list of sentences as input, where each sentence is
# represented as a list of words. Here, 'tokenized_text' contains the
# preprocessed and tokenized sentences.
word2vec_model = Word2Vec(sentences=tokenized_text, vector_size=100, window=5,
                          min_count=1, workers=4)

# Example of obtaining the embedding for a specific word
# (pick a word that actually occurs in the 'Model' column to avoid a KeyError)
example_word = word2vec_model.wv.index_to_key[0]
embedding_example = word2vec_model.wv[example_word]

# Print the results
print("Bag-of-Words (Count occurrence) Matrix:")
print(bow_matrix.toarray())

print("\nBag-of-Words (Normalized count occurrence) Matrix:")
print(normalized_bow_matrix.toarray())

print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())

print(f"\nWord2Vec Embedding for '{example_word}':")
print(embedding_example)

Conclusion:
1. Bag-of-words, TF-IDF, and Word2Vec are techniques for representing text numerically (word embeddings in the case of Word2Vec).
2. Words are transformed into numerical form, which is necessary before machine learning algorithms can be applied.



Assignment 03:
Title: Perform text cleaning, perform lemmatization (any method), remove stop words (any method), label encoding. Create representations using TF-IDF. Save outputs.
Dataset: https://github.com/PICT-NLP/BE-NLP-Elective/blob/main/3-Preprocessing/News_dataset.pickle

Text cleaning:
Text cleaning is the process of preparing raw text for NLP (Natural Language Processing) so that machines can understand human language.

Clean text is human language rearranged into a format that machine models can understand. Text cleaning can be performed with simple Python code that eliminates stop words, removes unwanted characters such as stray Unicode, and reduces complex words to their root form.

Most common methods for cleaning the data (a small sketch follows this list):

 Removing HTML tags
 Removing & finding URLs
 Removing & finding email ids
 Removing stop words
 Standardizing and spell check
 Chat word correction
 Removing the frequent words
 Removing the less frequent words
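A minimal text-cleaning sketch covering a few of the steps listed above with regular expressions (the patterns are simple illustrations, not an exhaustive cleaner):

import re

def clean_text(text):
    text = re.sub(r'<[^>]+>', ' ', text)                # remove HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # remove URLs
    text = re.sub(r'\S+@\S+\.\S+', ' ', text)           # remove email ids
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)         # drop special characters
    text = re.sub(r'\s+', ' ', text).strip().lower()    # normalize whitespace and case
    return text

print(clean_text("<p>Visit https://example.com or mail [email protected]!</p>"))
# -> 'visit or mail'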

Lemmatization: groups words based on their root definition, and allows us to differentiate between present, past, and indefinite forms.

So, 'jumps' and 'jump' are grouped into the present 'jump', as different from all uses of 'jumped', which are grouped together as past tense, and all instances of 'jumping', which are grouped together as the indefinite (meaning continuing/continuous).

So, if we are looking to find all instances of a product (say an engine) having any sort of 'jump'-related response, to analyze all responses, good or bad, we would use stemming.

But if we want to break this down further to the type of jump, i.e. whether it was in the past, present, or a continuous problem, and want to approach the three different instances with distinct types of analysis, then we would use lemmatizing.

INPUT:

“jump”
“jumps”
“jumped”
“jumping”

PYTHON CODE:

import nltk
from nltk.stem.porter import PorterStemmer

words = ["jump", "jumped", "jumps", "jumping"]

stemmer = PorterStemmer()
for word in words:
    print(word + " = " + stemmer.stem(word))

OUTPUT:

jump = jump
jumped = jump
jumps = jump
jumping = jump
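For comparison, the same words can be lemmatized with NLTK's WordNetLemmatizer; passing pos='v' tells it to treat them as verbs (a small sketch, not part of the original listing):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
for word in ["jump", "jumped", "jumps", "jumping"]:
    print(word + " = " + lemmatizer.lemmatize(word, pos='v'))
# All four forms map to the lemma 'jump'.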

Ensure you have the necessary libraries installed by running:

Bash

pip install pandas scikit-learn nltk

import pickle
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

# Load your dataset
# Assuming the dataset contains columns like 'text' and 'label'
# Adjust this based on your actual dataset structure
dataset_path = "path_to_your_dataset.pickle"  # Update with the correct path
df = pd.read_pickle(dataset_path)

# Text cleaning, lemmatization, and stop words removal
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Assuming 'text' is the column containing the text data
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()
                     if word.lower() not in stop_words])
    return text

df['cleaned_text'] = df['text'].apply(preprocess_text)

# Label encoding
label_encoder = LabelEncoder()
df['encoded_labels'] = label_encoder.fit_transform(df['label'])

# Split the dataset into train and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # You can adjust the max_features parameter
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df['cleaned_text'])
X_test_tfidf = tfidf_vectorizer.transform(test_df['cleaned_text'])

# Save outputs
train_df.to_pickle("train_data_processed.pickle")
test_df.to_pickle("test_data_processed.pickle")
pickle.dump(tfidf_vectorizer, open("tfidf_vectorizer.pickle", "wb"))
pickle.dump(label_encoder, open("label_encoder.pickle", "wb"))

# Optionally, save the TF-IDF matrices as well
with open("X_train_tfidf.pickle", "wb") as f:
    pickle.dump(X_train_tfidf, f)

with open("X_test_tfidf.pickle", "wb") as f:
    pickle.dump(X_test_tfidf, f)

Conclusion:
We studied concepts such as text cleaning, lemmatization, stop-word removal, label encoding, and TF-IDF representation.



Assignment 04:

Title: Create a transformer from scratch using the Pytorch library

To build the Transformer model the following steps are necessary:

1. Importing the libraries and modules


2. Defining the basic building blocks - Multi-head Attention, Position-Wise Feed-Forward
Networks, Positional Encoding
3. Building the Encoder block
4. Building the Decoder block
5. Combining the Encoder and Decoder layers to create the complete Transformer network

1. Importing the necessary libraries and modules


We’ll start with importing the PyTorch library for core functionality, the neural network module
for creating neural networks, the optimization module for training networks, and the data utility
functions for handling data. Additionally, we’ll import the standard Python math module for
mathematical operations and the copy module for creating copies of complex objects.

These tools set the foundation for defining the model's architecture, managing data, and
establishing the training process.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy

Defining the basic building blocks: Multi-Head Attention, Position-wise Feed-Forward Networks, Positional Encoding

1. Multi-Head Attention



The Multi-Head Attention mechanism computes the attention between each pair of positions in a sequence. It consists of multiple "attention heads" that capture different aspects of the input sequence.

The MultiHeadAttention class initializes the module with input parameters and linear transformation layers. It calculates attention scores, reshapes the input tensor into multiple heads, and combines the attention outputs from all heads. The forward method computes the multi-head self-attention, allowing the model to focus on different aspects of the input sequence.

2. Position-wise Feed-Forward Networks

The position-wise feed-forward class extends PyTorch's Module and implements a position-wise feed-forward network. The class initializes with two linear transformation layers and a ReLU activation function. The forward method applies these transformations and the activation function sequentially to compute the output. This process enables the model to consider the position of input elements while making predictions.

3. Positional Encoding

The PositionalEncoding class initializes with input parameters d_model and max_seq_length, creating a tensor to store positional encoding values. The class calculates sine and cosine values for even and odd indices, respectively, based on the scaling factor div_term. The forward method computes the positional encoding by adding the stored positional encoding values to the input tensor, allowing the model to capture the position information of the input sequence.

A) Encoder Layer
An Encoder layer consists of a Multi-Head Attention layer, a Position-wise Feed-Forward layer, and two Layer Normalization layers.
The EncoderLayer class initializes with input parameters and components, including a MultiHeadAttention module, a position-wise feed-forward module, two layer normalization modules, and a dropout layer. The forward method computes the encoder layer output by applying self-attention, adding the attention output to the input tensor, and normalizing the result. Then it computes the position-wise feed-forward output, combines it with the normalized self-attention output, and normalizes the final result before returning the processed tensor.

B) Decoder Layer
A Decoder layer consists of two Multi-Head Attention layers, a Position-wise Feed-Forward
layer, and three Layer Normalization layers.
The DecoderLayer initializes with input parameters and components such as MultiHeadAttention
modules for masked self-attention and cross-attention, a PositionWiseFeedForward module, three
layer normalization modules, and a dropout layer.

The forward method computes the decoder layer output by performing the following steps (a minimal decoder-layer sketch follows this list):

1. Calculate the masked self-attention output and add it to the input tensor, followed by dropout and layer normalization.

2. Compute the cross-attention output between the decoder and encoder outputs, and add it to the normalized masked self-attention output, followed by dropout and layer normalization.

3. Calculate the position-wise feed-forward output and combine it with the normalized cross-attention output, followed by dropout and layer normalization.

4. Return the processed tensor.
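The code listing at the end of this assignment implements only an encoder-style block, so the following is a minimal decoder-layer sketch matching the description above. For brevity it uses PyTorch's built-in nn.MultiheadAttention (batch-first) rather than a hand-written attention module:

import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, tgt_mask=None, memory_mask=None):
        # 1. Masked self-attention over the target sequence, then add & normalize
        attn_out, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # 2. Cross-attention between decoder queries and encoder outputs, then add & normalize
        attn_out, _ = self.cross_attn(x, enc_output, enc_output, attn_mask=memory_mask)
        x = self.norm2(x + self.dropout(attn_out))
        # 3. Position-wise feed-forward network, then add & normalize
        x = self.norm3(x + self.dropout(self.feed_forward(x)))
        return x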

Transformer Model

The Transformer class combines the previously defined modules to create a complete
Transformer model. During initialization, the Transformer module sets up input parameters and
initializes various components, including embedding layers for source and target sequences, a
Positional Encoding module, Encoder Layer and Decoder Layer modules to create stacked layers,
a linear layer for projecting decoder output, and a dropout layer.

The generate_mask method creates binary masks for source and target sequences to ignore
padding tokens and prevent the decoder from attending to future tokens. The forward method
computes the Transformer model’s output through the following steps:
1. Generate source and target masks using the generate_mask method.
2. Compute source and target embeddings, and apply positional encoding and dropout.
3. Process the source sequence through encoder layers, updating the enc_output tensor.
4. Process the target sequence through decoder layers, using enc_output and masks, and
updating the dec_output tensor.
5. Apply the linear projection layer to the decoder output, obtaining output logits.
These steps enable the Transformer model to process input sequences and generate output
sequences based on the combined functionality of its components.

Preparing Sample Data


In this example, we will create a toy dataset for demonstration purposes. In practice, you would
use a larger dataset, preprocess the text, and create vocabulary mappings for source and target
languages.



Training the Model
Now we'll train the model using the sample data. In practice, you would use a larger dataset and split it into training and validation sets.
In this way we can build a simple Transformer from scratch in PyTorch. All large language models use these Transformer encoder or decoder blocks for training, so understanding the network that started it all is important for anyone looking to dive deep into LLMs.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model) for batch-first inputs
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x has shape (batch_size, seq_len, d_model)
        return x + self.pe[:, :x.size(1), :]

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, nhead):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.nhead = nhead
        self.head_dim = d_model // nhead

        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.out_linear = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        q = self.q_linear(q)
        k = self.k_linear(k)
        v = self.v_linear(v)

        # Split into heads: (batch, nhead, seq_len, head_dim)
        q = q.view(q.size(0), -1, self.nhead, self.head_dim).permute(0, 2, 1, 3)
        k = k.view(k.size(0), -1, self.nhead, self.head_dim).permute(0, 2, 1, 3)
        v = v.view(v.size(0), -1, self.nhead, self.head_dim).permute(0, 2, 1, 3)

        # Scaled dot-product attention scores
        attn_weights = torch.matmul(q, k.permute(0, 1, 3, 2)) / math.sqrt(self.head_dim)

        if mask is not None:
            attn_weights = attn_weights.masked_fill(mask == 0, float('-inf'))

        attn_weights = F.softmax(attn_weights, dim=-1)

        out = torch.matmul(attn_weights, v)
        # Merge the heads back: (batch, seq_len, d_model)
        out = out.permute(0, 2, 1, 3).contiguous().view(out.size(0), -1, self.d_model)
        out = self.out_linear(out)
        return out

class Feedforward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(Feedforward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        x = F.relu(self.linear1(x))
        x = self.dropout(x)
        x = self.linear2(x)
        return x

class TransformerBlock(nn.Module):
    def __init__(self, d_model, nhead, d_ff, dropout=0.1):
        super(TransformerBlock, self).__init__()
        self.attention = MultiHeadAttention(d_model, nhead)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.feedforward = Feedforward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        attn_output = self.attention(x, x, x, mask)
        x = x + self.dropout(attn_output)
        x = self.norm1(x)

        ff_output = self.feedforward(x)
        x = x + self.dropout(ff_output)
        x = self.norm2(x)

        return x

class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers, d_ff, max_len=512, dropout=0.1):
        super(Transformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_len)
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, nhead, d_ff, dropout) for _ in range(num_layers)
        ])
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x, mask=None):
        x = self.embedding(x)
        # PositionalEncoding already adds the encodings to its input
        x = self.positional_encoding(x)

        for transformer_block in self.transformer_blocks:
            x = transformer_block(x, mask)

        x = self.fc(x.mean(dim=1))
        return F.log_softmax(x, dim=-1)
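As described under "Preparing Sample Data" and "Training the Model", a toy training loop for the Transformer class defined above could look like the following sketch; the vocabulary size, random tensors, and hyperparameters are placeholders for demonstration only:

import torch
import torch.nn as nn
import torch.optim as optim

# Placeholder hyperparameters
vocab_size, d_model, nhead, num_layers, d_ff = 1000, 128, 8, 2, 512
seq_len, batch_size = 20, 16

model = Transformer(vocab_size, d_model, nhead, num_layers, d_ff)
criterion = nn.NLLLoss()  # the model returns log-probabilities
optimizer = optim.Adam(model.parameters(), lr=1e-4)

# Random token sequences and random target labels, for demonstration only
src = torch.randint(0, vocab_size, (batch_size, seq_len))
labels = torch.randint(0, vocab_size, (batch_size,))

model.train()
for epoch in range(5):
    optimizer.zero_grad()
    output = model(src)              # shape: (batch_size, vocab_size)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch + 1}, loss: {loss.item():.4f}")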

Assignment : 05

Title : Morphology is the study of the way words are built up from smaller meaning bearing
units. Study and understand the concepts of morphology by the use of add delete table .

The "add-delete table" is not a standard term used in linguistics or morphology. However, I
assume you are referring to the concept of morphological operations in linguistics.
Morphological operations involve adding or deleting morphemes (the smallest units of meaning)
to create new words or to change the grammatical category or meaning of existing words.

Here's a simplified example using an "add-delete table" to understand morphology:

Let's consider the base word "work" and explore how adding or deleting morphemes can create
different forms.

Base Word: WORK

Operation            Result    Example      Explanation
Add Prefix "un-"     unwork    unworkable   Negates the action, meaning "not workable"
Add Suffix "-er"     worker    worker       Indicates a person who performs the action
Add Suffix "-ing"    working   working      Indicates an ongoing action
Delete Suffix "-k"   wor       wor          Changes the word form

In this example:

1. Add Prefix "un-": The addition of the prefix "un-" negates the action, transforming "work" into
"unwork," and further addition results in "unworkable," meaning "not workable."
2. Add Suffix "-er": The addition of the suffix "-er" transforms "work" into "worker," indicating a
person who performs the action (in this case, someone who works).
3. Add Suffix "-ing": The addition of the suffix "-ing" transforms "work" into "working,"
indicating an ongoing action.
4. Delete Suffix "-k": The deletion of the final "-k" transforms "work" into "wor," resulting in a
change in the word form.

These operations illustrate how morphological processes involve adding or deleting morphemes
to convey different meanings, grammatical categories, or forms of words. Morphological
analysis helps linguists understand how words are constructed and how meaning is conveyed
through morphemes.
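A tiny illustrative sketch of such an add-delete table in code; the rules and glosses simply reproduce the table above and do not form a general morphological analyzer:

# Each entry: (operation, rule applied to the base word, explanation)
add_delete_table = [
    ("Add prefix 'un-'",   lambda w: "un" + w,  "negates the action"),
    ("Add suffix '-er'",   lambda w: w + "er",  "person who performs the action"),
    ("Add suffix '-ing'",  lambda w: w + "ing", "ongoing action"),
    ("Delete suffix '-k'", lambda w: w[:-1],    "changes the word form"),
]

base = "work"
for operation, rule, explanation in add_delete_table:
    print(f"{operation:22s} {base} -> {rule(base):8s} ({explanation})")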

Conclusion: We studied and understood the concepts of morphology through the use of an add-delete table.


