Report PRIEE
Submitted by
K. Karthik 111721102056
J.V.B. Sathwik 111721102044
G.B.R.S. Vinayak 11172112179
March 2024
R.M.K. ENGINEERING COLLEGE
(An Autonomous Institution)
R.S.M. Nagar, Kavaraipettai-601 206
BONAFIDE CERTIFICATE
Certified that this project report “Multi-Modal Approach for Brain MRI Image
Enhancement and Tumor Detection Using CNN” is the bonafide work of KARTHIK K
(111721102056), SATHWIK J.V.B (111721102044), VINAYAK G.B.R.S (111721102179)
who carried out the 20CS713 Project Phase II work under my supervision.
SIGNATURE SIGNATURE
ACKNOWLEDGEMENT
We earnestly express our sincere gratitude and regard to our beloved Chairman Shri. R. S.
Munirathinam, our Vice Chairman, Shri. R. M. Kishore and our Director, Shri. R. Jyothi
Naidu, for the interest and affection shown towards us throughout the course.
We convey our sincere thanks to our Principal, Dr. K. A. Mohamed Junaid, for being the source of inspiration and encouragement throughout the course.
We extend our sincere thanks to our Professor and Head of the Department, Computer Science and Engineering, for the support and guidance extended to us.
We would like to express our sincere gratitude to our Project Guide Ms. P. Baby Shamini,
Assistant Professor, for her valuable suggestions towards the successful completion of this project.
We take this opportunity to extend our thanks to all faculty members of the Department of
Computer Science and Engineering, our parents and friends for all that they meant to us during the course of this project work.
Neurological and brain-related cancers are among the main causes of death worldwide. A common diagnostic modality is magnetic resonance imaging (MRI), yet the manual evaluation of MRI images by medical experts presents difficulties due to time constraints and variability. This research introduces a novel, two-module computerized method to increase the speed and accuracy of brain tumor detection.
The first module, termed the Image Enhancement Technique, utilizes a trio of machine
learning and imaging strategies—adaptive Wiener filtering, neural networks, and independent
component analysis—to normalize images and combat issues such as noise and varying low
region contrast. The second module uses Support Vector Machines to validate the output of
the first module and perform tumor segmentation and classification. Applied to various types
of brain tumors, including meningiomas and pituitary tumors, our method exhibited
significant improvements in contrast and classification efficiency. It achieved an average
sensitivity and specificity of 0.991, an accuracy of 0.989, and a Dice score (DSC) of 0.981.
Furthermore, the processing time of our method, averaging 0.43 seconds, was markedly lower
compared to existing methods. These results underscore the superior performance of our
approach over current state-of-the-art methods in terms of sensitivity, specificity, precision,
and DSC. Future enhancements will seek to increase the robustness of the tumor classification
method by employing a standardized approach across a suite of classifiers.
Keywords: Magnetic resonance imaging (MRI), image enhancement technique, brain
tumor segmentation, neural networks, brain tumor classification.
TABLE OF CONTENTS
2 SYSTEM ANALYSIS 10
2.1 Existing System 10
2.1.1 Disadvantages of Existing System 10
2.2 Proposed System 12
2.2.1 Advantages of Proposed System 12
3 SYSTEM DESIGN 14
3.1 System Architecture 14
3.2 UML Diagrams 18
3.2.1 Use Case Diagram 18
3.2.2 Class Diagram 19
3.2.3 Data Flow Diagram 20
3.2.4 Activity Diagram 21
4 SYSTEM IMPLEMENTATION 22
4.1 Modules 22
4.2 Module description 23
4.2.1 Data Set Collection 23
4.3.1 NLP 28
4.3.2 Gensim 29
4.3.3 Term Frequency-Inverse Document Frequency 30
4.4 Testing 32
LIST OF FIGURES
4.3.2 Gensim 29
4.3.3 TF-IDF 30
01 Original Text 45
02 Tokenization 46
03 Removal of punctuation 47
04 Word folding 47
06 Word Stemming 48
4.1 Modules 19
LIST OF ABBREVIATIONS
03 AI Artificial Intelligence
04 NLTK Natural Language Toolkit
05 SVM Support Vector Machine
06 OS Operating System
07 PCA Principal Component Analysis
08 STS Semantic Textual Similarity
09 RAM Random Access Memory
10 LSA Latent Semantic Analysis
CHAPTER 1 INTRODUCTION
For many years, retrieving information has been a demanding task for any organization because the volume of data keeps growing continuously. It is an uphill battle for a user to retrieve the required information from such an immense amount of data. In most organizations, information is stored in the form of documents, and the volume of documents available in digital form roughly doubles every five years. Identifying similar documents is therefore beneficial, and similarity among data plays a crucial role in information retrieval. Segregating documents helps the user in many ways, but manual segregation of documents takes a great deal of time and human resources. This project therefore demonstrates a technological approach that simplifies the task by automatically segregating data. To carry out this idea, various documents are used as input: the user supplies a query document, and the system automatically ranks the documents against the query based on similarity. This chapter briefly introduces the work by first discussing the overview of the project and the problem statement, followed by the objective, existing systems, the significance, and finally the limitations. Document similarity is a method of retrieving similar documents from large document collections for a single query. Ranking the documents in the result based on similarity helps the user access the required information efficiently. Document similarity is mainly used in information retrieval, which is a major task in any organization: it helps users formulate solutions to present problems by analyzing past results. Information is mainly stored in the form of documents or portable document formats.
Because of the increasing amount of data, search engines encounter various difficulties in fetching results that are relevant to users' search queries. Traditional document ranking methods are mostly based on similarity computations between documents and queries. In many cases, users may want to retrieve documents that are not only similar but also general or broad with regard to a certain topic. Therefore, to rank documents efficiently and accurately, document ranking needs a semantic similarity measure.
[1]. Benzi Xu et al. (2021) proposed a similarity measure based on the pseudo-longest-common-subsequence (pseudo-LCS), the Jaccard similarity coefficient and principal component analysis (PCA), Volume 132, 2022.
This work combined the pseudo-longest-common-subsequence (pseudo-LCS) and the Jaccard similarity coefficient on the basis of principal component analysis (PCA), and additionally applied K-medoids clustering to handle the soft constraints. To effectively measure the similarity of operation sequences, a deep analysis was performed to determine the information requirements and characteristics of the operation sequence similarity problem. A modified pseudo-LCS is proposed to record the first two pieces of information, and a corresponding backtracking algorithm is also presented. The Jaccard similarity coefficient is used to measure the last piece of information. These two similarity coefficients are combined using PCA to generate a novel comprehensive similarity coefficient. The numerical illustration shows that it can distinguish all the different cases with reasonable similarity values. Typical process route discovery is a practical problem; two conflicting soft constraints are introduced and solved with the K-medoids method. The proposed method effectively measures the similarity of operation sequences, and the resulting similarity coefficients allow the user to rank the sequences. The proposed methodology also handles soft constraints while clustering the documents. The reported advantage of this approach is improved product quality.
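As a generic illustration only (not the authors' exact implementation), the Jaccard similarity coefficient mentioned above compares two sequences by the ratio of shared elements to all distinct elements. The following minimal Python sketch assumes two hypothetical, already tokenized operation sequences seq_a and seq_b.

def jaccard_similarity(seq_a, seq_b):
    # Treat the sequences as sets of elements and compare their overlap.
    set_a, set_b = set(seq_a), set(seq_b)
    if not set_a and not set_b:
        return 1.0  # two empty sequences are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Example with two hypothetical operation sequences
seq_a = ["milling", "drilling", "tapping", "grinding"]
seq_b = ["milling", "drilling", "grinding", "polishing"]
print(jaccard_similarity(seq_a, seq_b))  # 3 shared / 5 distinct = 0.6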
[2]. Qian Liu et al. (2021) proposed the use of association rules for measuring word similarity at a global level and fuzzy similarity for measuring the top-k words, in IEEE Access, vol. 9, pp. 126801-126821, 2021.
This work proposed the use of association rules for measuring word similarity at a global level and fuzzy similarity for measuring the top-k words. For the top-k words, the authors proposed a similarity measure over word embeddings in which both local and global information is considered. The global information is measured with association rules and the local information is measured with word embeddings; the authors also compared the proposed method against eight state-of-the-art baselines. The datasets used by the authors are TREC disks 4&5, WT10G, and RCV1. The authors designed a fuzzy logic system which overcomes the problems associated with combining the two types of measures by inferring the similarity between words and then returning the top-k selected words. The advantage of fuzzy logic is that it provides a flexible and convenient way to transform expert knowledge expressed in natural language into fuzzy rules. The approach contains three components, local, global and fuzzy systems, but there is no component-wise validation.
[3]. N. Kumar, S. K. Yadav and D. S. Yadav, "Similarity Measure Approaches Applied in
Text Document Clustering for Information Retrieval," 2020 Sixth International
Conference on Parallel, Distributed and Grid Computing (PDGC), 2020, pp. 88-921.
This work used the standard K-means algorithm, which relies on a distance measure. The experiments used datasets for tasks such as clustering similar news articles, analysis of customer feedback, text mining, duplicate content detection, and finding similar documents. In each dataset, the number of assigned categories is the same as the number of clusters. Two evaluation measures, purity and entropy, are used to assess the quality of a clustering result. Euclidean distance measures are used to perform document clustering more effectively, while the Pearson and Jaccard measures are more suitable for finding rational clusters with high purity values, i.e., clusters dominated by documents from a single group. The advantage of the experiment is that the frequency is calculated according to the typed dataset, but the accuracy is very low.
[4]. Shuaizhang et al. (2019) proposed an extended citation model for scientific document clustering, in IEEE Access, vol. 9, pp. 150865-150877, 2021, doi: 10.1109/ACCESS.2021.3125729.
This work proposed an extended citation model for scientific document clustering, combining a citation network and a textual similarity network to enhance the performance of scientific document clustering. The experiments were conducted using the PMC and PubMed databases, two popular databases in the biomedical field that provide a large number of open-access, full-text scientific documents; 10,996 scientific documents were used. Java and R were used to implement the experiment. The data were preprocessed before the experiment, and a textual similarity network was constructed and integrated with the citation network, co-citation network and bibliographic coupling network. The authors demonstrated the practicability of the proposed extended citation model by comparing it with the traditional bibliographic coupling model and the textual similarity model for scientific document clustering, using the R programming language. They used a random walk algorithm as the community detection algorithm, whose input is the similarity network and whose output is the clustering result. The advantage is the efficient clustering of scientific documents by considering the frequency of the documents. The limitation is that a limited dataset was used.
In this model, the authors used Maximum Entropy Principle based Document Ranking with Term Selection Analysis (MEPDR-TSA) for cross-lingual information retrieval (CLIR). The user query in the Tamil language is translated into English. Then, the MEPDR technique is employed for ranking the documents and TSA is used for choosing a set of retrieved documents for each query. Finally, the retrieved English documents are converted back into Tamil using Google Translate, and the results are evaluated against precision, recall and F-score. The authors thus developed a maximum entropy based document ranking method with term selection analysis that converts the Tamil query into English and converts the retrieved English documents back into Tamil. The proposed method performs better than existing methods such as Okapi BM25, IB-MLIR, MULM, KNN and n-gram approaches. However, the method could be further improved with advanced document ranking techniques.
[6]. Mohamed Attia, Manal A. Abdel-Fattah, Ayman E. Khedr, A proposed multi criteria
indexing and ranking model for documents and web pages on large scale data, Journal
of King Saud University - Computer and Information Sciences, 2021.
This work proposed a multi-criteria indexing and retrieval model for web pages and documents. It uses different retrieval methods to obtain accurate documents and addresses issues with page ranking algorithms; the model utilizes the top seven criteria for indexing and retrieving results. In the first phase, users enter the required queries. The MCIR model then crawls online or offline: it first finds pages or documents existing on the web when working online, or in stored files on a machine when working offline. Once the system finds a page URL or document path, it visits it and evaluates it against the user's search query. In the second phase, the weight model generates a weight for each criterion according to user preferences, and then rank statistics are called to return the final page weight to the user. This model showed that ranking with multiple criteria produces different results than ranking with one or two criteria, as in previous algorithms. It retrieves documents and web pages based on the top seven criteria: page or document votes; keywords in the domain, page content and URL; page publish date; page modified date; number of links; page load time; and bad links.
[7]. Hikmat Ullah Khan, Shumaila Nasir, Kishwar Nasim, Danial Shabbir, Ahsan
Mahmood, Twitter trends: A ranking algorithm analysis on real time data, Expert
Systems with Applications, Volume 164, 2021.
This work explored Term Frequency-Inverse Document Frequency (TF-IDF), the Combined Component Approach (CCA) and the Biterm Topic Model (BTM) for finding topics and the terms within given topics. Data set: data is collected using the Twitter application programming interface (API), which extracts data from Twitter sources. First, the data is extracted from the Twitter API and the results are stored in an .xlsx file. In the next step, data cleaning involves deleting unused data and duplicates, doing a spell check, and other modifications that make the data easier to understand. Stemming reduces inflected word forms (for example, past tense verbs) to a common base, tokenization produces tokens according to the roles of words in a sentence, and normalization transforms the text into a consistent form. Data integration is a procedure that gathers information from several sources and unifies it. The data reduction procedure assists with large-scale dataset analysis by condensing high-volume datasets while retaining all relevant information. In the last phase, the dataset is used to apply various models, including TF-IDF, CCA and BTM, to extract the subjects from the tweet collection.
[8]. Dimitris Pappas and Ion Androutsopoulos, A Neural Model for Joint Document and
Snippet Ranking in Question Answering for Large Document Collections, Department
of Informatics, Athens University of Economics and Business, Greece, Institute for
Language and Speech Processing, Research Center ‘Athena’, Greece, 2021.
This work used POSIT-DRMM (PDRMM), a differentiable extension of DRMM, and proposed an architecture to jointly rank documents and snippets with respect to a question, the two particularly important stages in QA for large document collections. The proposed architecture was instantiated using a recent neural relevance model (PDRMM) and a BERT-based ranker. Using biomedical data (from BIOASQ), the authors showed that the two resulting joint models (PDRMM-based and BERT-based) vastly outperform the corresponding pipelines in snippet retrieval, the main goal in QA for document collections, using fewer parameters, while also remaining competitive in document retrieval. They provided a modified version of the Natural Questions dataset, suitable for document and snippet retrieval. The documents are retrieved using fewer parameters, such as snippets of the document for the questions. The advantage of this method is that document retrieval results are better than DRMM and several other neural rankers. However, the dataset should be extended to a multi-granular task; BIOASQ already includes this multi-granular task, but exact answers are provided only for factoid questions and are freely written by humans, as in MS-MARCO, with similar limitations. Hence, appropriately modified versions of the BIOASQ datasets are needed.
[9]. Tianrun Cai, Zeling He, Chuan Hong, Yichi Zhang, Yuk-Lam Ho, Jacqueline Honerlaw, Alon Geva, Vidul Ayakulangara Panickan, Amanda King, David R Gagnon, Michael Gaziano, Kelly Cho, Katherine Liao, Tianxi Cai, Scalable relevance ranking algorithm via semantic similarity assessment improves efficiency of medical chart review, Journal of Biomedical Informatics, Volume 132, 2022.
The authors used the pGUESS algorithm, a prior guided semantic similarity measure, to quantify the informativeness of a clinical note with respect to a given phenotype. The algorithm scores the relevance of a note as the cosine similarity between SEVnote and SEVref. The pGUESS algorithm is fully knowledge-based except for assigning notes into three categories via clustering. The results on note ranking for CAD performed at both VHA and PHS suggest high transportability of the pGUESS algorithm across two different healthcare systems, since the pGUESS algorithm does not require local EHR data but only knowledge sources and embedding vectors. The algorithm reduced the burden of chart review and improved the efficiency and accuracy of human annotation. Determining patient disease status via chart review is a critical yet labour-intensive task needed for training or validating robust EHR-based prediction algorithms.
The advantage of this algorithm is that the overall ranking quality, as measured by the rank correlation, was the highest for pGUESS compared to all other methods. However, the publicly available TMGuassian algorithm does not allow out-of-sample prediction, so only the portability of LDAgibbs and LEAvem was tested.
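As a rough illustration of the scoring step described above (not the pGUESS implementation itself), the cosine similarity between a note embedding and a reference embedding can be computed as follows; the vectors sev_note and sev_ref are hypothetical stand-ins for SEVnote and SEVref.

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embedding vectors standing in for SEVnote and SEVref
sev_note = np.array([0.2, 0.7, 0.1, 0.5])
sev_ref = np.array([0.25, 0.6, 0.0, 0.55])
print(round(cosine_similarity(sev_note, sev_ref), 3))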
This work used a combination of a traditional statistical method and a deep learning model, as well as a novel model based on multi-model nonlinear fusion proposed in the paper. The Ant Financial dataset comes from the Chinese text similarity contest held by Alipay; its data mainly come from Alipay's customer service and contain 100,000 records. Semantic textual similarity (STS) datasets from 2012 to 2015 are also used in the experiment. The model uses the Jaccard coefficient based on part of speech, Term Frequency-Inverse Document Frequency (TF-IDF) and the word2vec-CNN algorithm to measure the similarity of sentences, respectively. The model combines the traditional statistics-based sentence similarity calculation methods and completes the coarse-grained extraction of the sentence. The results of the Jaccard algorithm, the TF-IDF algorithm and word2vec-CNN are fed into a shallow fully connected neural network to train the model and give an ideal classification result. The Jaccard algorithm captures grammatical information, TF-IDF calculates sentence similarity from term frequency and inverse document frequency, and in the word2vec-CNN algorithm the sentence feature matrix is weighted by a multi-feature attention mechanism, which increases performance. The experimental results show that the proposed sentence similarity calculation method based on multi-feature fusion can balance the calculation results of multiple models. The matching accuracy of the sentence similarity calculation method based on multi-model nonlinear fusion is 84%, and the F1 value of the model is 75%. In this experiment, the similarity of a sentence is measured with respect to the meaning of the words in the sentence. The limitations are that the word vector given by the word2vec model is static and cannot describe the dynamic change of semantics, and that the accuracy can still be improved.
The feasibility of the project is analysed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements of the system is essential.
ECONOMICAL FEASIBILITY:
The economic feasibility of document ranking depends on various factors, primarily the specific application and the value it brings to users or businesses. Document ranking, often associated with information retrieval and search engines, can have economic benefits in different contexts.
Here are some considerations:
• User satisfaction and engagement
• Productivity and efficiency
• Adaptability and scalability
TECHNICAL FEASIBILITY:
The technical feasibility of document ranking involves assessing whether the
implementation of a document ranking system is technically viable given the available
technology, resources and infrastructure.
Here are key considerations for evaluating the technical feasibility of document ranking:
• Algorithm complexity
• Data availability and Quality
• Feature Engineering
• Testing and Evaluation
OPERATIONAL FEASIBILITY:
The operational feasibility for document ranking refers to whether the system can be
effectively integrated into existing operations and processes. It assesses the practicality of
implementing the document ranking system within the operational context of an organization.
Here are key considerations for evaluating the operational feasibility of document ranking:
• User acceptance
• Data input and output
• Resource availability
• Legal and regulatory compliance
CHAPTER 2 SYSTEM ANALYSIS
2.1 Existing System:
The existing system is based on natural language processing (NLP), a field within computer science and artificial intelligence. Natural language processing deals with the interaction between computers and human language and focuses on how computers process and analyze large amounts of data. NLP techniques help computers understand the context of a document. Owing to the increase of data in digital formats such as documents, classifying the data becomes a difficult task, so different methods are used to classify the data based on the user's requirements. Traditional methods such as manual retrieval of data take a huge amount of time and effort; therefore, automatic retrieval of data is used. Initially, the raw data is converted into a numerical format. Secondly, the data is preprocessed by performing tokenization and removal of stop words, which do not carry any meaning. Finally, the similarity metrics are computed according to the user's query. These steps are carried out with the help of NLP techniques. Existing methods mainly use hybrid models to obtain results. A hybrid model combines two or more techniques, since the accuracy of a single technique is lower than that of hybrid techniques. Hybrid techniques can be used in the analysis of large and complex documents, the entertainment industry, resume ranking and many more applications. However, the use of a hybrid model makes the process complex. The hybrid method is a combination of various NLP techniques such as TF-IDF, the Jaya and grey wolf optimizers, and the longest common subsequence, possibly combined with machine learning methods such as CNN. The hybrid method also increases the time complexity of the work, and most existing work does not extend the similarity metrics or the dictionary.
2.1.1 Disadvantages of Existing System:
Limited multimedia handling:
• Many existing document ranking methods are primarily designed for text-based content and may not effectively handle multimedia content such as images, videos, or audio.
• This can limit the usefulness of search engines for users seeking diverse types of information.
Scalability challenges:
• As the volume of documents and user queries continues to grow, existing document
ranking methods may struggle to maintain scalable performance without sacrificing
relevance or quality.
• This can result in slower response times or degraded search experiences for users.
Limited personalization:
• This lack of personalization can lead to suboptimal search experiences for users who
have diverse needs and preferences.
2.2 Proposed System:
In this model, ranking the documents based on the similarity to the user's query is done using natural language processing. Ranking the documents helps in the efficient retrieval of information in any firm, and this model helps in identifying the most relevant and important information based on the user's requirement.
The proposed work is implemented in Python 3.8 with the libraries Gensim, spaCy, NumPy and NLTK, the corpora, models and similarities modules, and other mandatory libraries. There are many applications of natural language processing, such as information retrieval, classification of documents, spell checking, estimation of similarity, keyword extraction, language translation and many other information retrieval problems. Most information retrieval problems are solved using NLP techniques, and in these techniques data preprocessing plays an important role. The proposed model is implemented using an unsupervised learning algorithm and the well-known Python library Gensim, which ranks the documents based on similarity to the user's query with better accuracy. The dataset, which contains simple text, is preprocessed with different steps such as word tokenization, removal of punctuation, stop word removal and word stemming; word stemming plays a crucial role in finding the accurate similarity. The preprocessed documents are split into source and query documents, the proposed model is applied to the preprocessed documents, the source is compared with the other target documents, a score is obtained for each document, and the documents are ranked based on similarity to the user's query. The performance of the proposed method is better than that of the traditional methods.
2.2.1 Advantages of Proposed System:
• Time-efficient:
The automatic generation of solutions to any natural language problem reduces the time the user has to spend on the task.
• Market Research and Analysis:
Automatic summarization of research papers, Extraction of keywords, and clustering
of similar documents in query help the researchers.
• Streamlined processes:
Traditional communication channels such as help centres and customer care have been replaced by chatbots to improve customer satisfaction.
• E-learning:
NLP machine learning technology can examine the language used in a classroom to
define the mental states of both teachers and students.
CHAPTER 3 SYSTEM DESIGN
3.1 System Architecture
Figure 3.1 System Architecture
• Preprocessing:
Preprocessing consists of term folding, term tokenization, removal of punctuation, stop term elimination and term stemming. Preprocessing is the first thing to do to a document so that further processing is easier; it raises reliability and accuracy. Preprocessing data can increase the correctness and quality of a dataset, making it more reliable by removing missing or inconsistent data values brought on by human or computer mistakes. It ensures consistency in the data.
• Term Folding:
Preprocessing is applied to the data once it has been received. Term folding is a preprocessing technique that lowercases words that are currently in uppercase, so that two terms differing only in case are treated as the same lowercase term.
• Removal of Punctuation:
Punctuation marks are used in written natural language, but they carry little meaning for similarity analysis. The punctuation and extra spaces in the data are therefore eliminated for simpler processing.
• Term Tokenization:
Term tokenization then separates the raw text into tokens, that is, words and sentences. These tokens help in determining the context and analyzing the meaning of the text.
• Stop Term Elimination:
Elimination of stop terms occurs in the next preprocessing stage. In any language, stop terms are a group of frequently used terms; stop words in English include "the", "is" and "and", for instance. Stop word removal discards such unnecessary words so that subsequent processing can concentrate on the crucial ones. It is one of the most commonly used preprocessing steps across different NLP applications.
• Term Stemming:
Word stemming, the final preprocessing stage, removes the final few characters from
a word, frequently resulting in inaccurate spelling and meaning. Lemmatization
considers the context and converts the word to its meaningful base form, which is
called Lemma. For instance, stemming the word 'Eating' would return 'Eat'.
After preprocessing is finished, the input source document and the set of documents in the dataset are updated as preprocessed data.
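A minimal sketch of this preprocessing chain is shown below. It is assembled from the libraries imported in the appendix code (Gensim's simple_preprocess, NLTK stop words and the Porter stemmer), and the sample sentence is only illustrative; the project's own preprocessing may differ in detail.

import nltk
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download("stopwords", quiet=True)  # stop word list used below

def preprocess(text):
    # simple_preprocess lowercases the text (term folding), tokenizes it
    # and drops punctuation in a single step.
    tokens = simple_preprocess(text)
    # stop term elimination
    stop_terms = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_terms]
    # term stemming with the Porter stemmer
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The documents are ranked based on the similarity of the user's query."))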
Gensim is an open-source Python natural language processing library used for unsupervised modelling. The main features of Gensim are its scalability, robustness, platform independence and efficient multicore implementations. Gensim supports fastText, Word2vec, LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation) and tf-idf (term frequency-inverse document frequency). fastText, which uses a neural network for word embedding, is a library for learning word embeddings and text classification. Word2vec, used to produce word embeddings, is a group of shallow, two-layer neural network models. LSA is a technique in NLP (Natural Language Processing) that allows us to analyse the relationships between a set of documents and the terms they contain. LDA is a technique in NLP that allows sets of observations to be explained by unobserved groups; these unobserved groups explain why some parts of the data are similar. tf-idf, a numeric statistic in information retrieval, reflects how important a word is to a document in a corpus; it is often used by search engines to score and rank a document's relevance given a user query.
The facilities provided by Gensim for building topic models and word embeddings are unparalleled, and it also provides convenient facilities for text processing. It can handle large text files even without loading the whole file into memory, and it does not require costly annotations or hand tagging of documents because it uses unsupervised models.
The core concepts of Gensim are document, corpus, vector and model. A document is an object of the text sequence type, known as 'str' in Python 3. A corpus may be defined as a large, structured set of machine-readable texts produced in a natural communicative setting; in Gensim, a collection of document objects is called a corpus. The corpus serves both as the input for training a model and as the collection from which topics are extracted. A vector is a mathematical representation of a document, and a model refers to an algorithm used for transforming vectors from one representation to another. For working on text documents, Gensim also requires the words, i.e. tokens, to be converted to their unique ids. For this, it provides the Dictionary object, which maps each word to a unique integer id; it does this by converting the input text to a list of words and then passing it to the corpora.Dictionary() object. In Gensim, the dictionary object is used to create a bag-of-words (BoW) corpus, which is further used as the input to topic modelling and other models.
The Term Frequency-Inverse Document Frequency model is also a bag-of-words model. It differs from the regular corpus because it down-weights tokens, i.e. words, that appear frequently across documents. During initialisation, the tf-idf model algorithm expects a training corpus with integer values (such as a bag-of-words model). Then, at transformation time, it takes a vector representation and returns another vector representation; the output vector has the same dimensionality, but the value of the rare features (rare at training time) is increased. It basically converts integer-valued vectors into real-valued vectors.
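The following minimal sketch, built on a small hypothetical corpus of already preprocessed token lists, shows how the Dictionary, the bag-of-words corpus and the tf-idf model described above fit together; it mirrors the general Gensim workflow rather than the project's exact code.

from gensim import corpora, models

# Hypothetical corpus: each inner list is one preprocessed document
texts = [
    ["document", "rank", "similar", "queri"],
    ["search", "engin", "retriev", "inform"],
    ["document", "similar", "retriev", "inform"],
]

dictionary = corpora.Dictionary(texts)               # maps each word to a unique integer id
bow_corpus = [dictionary.doc2bow(t) for t in texts]  # sparse (id, count) vectors
tfidf_model = models.TfidfModel(bow_corpus)          # trained on the integer-valued vectors

# Transforming a bag-of-words vector returns a real-valued tf-idf vector
print(tfidf_model[bow_corpus[0]])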
After the Gensim model is built, the input data is compared with the dataset to measure similarity: a score is computed for each target document (i.e., each document in the dataset) with respect to the source document, and the target documents are then sorted according to the scores obtained, meaning the documents are ranked by their scores.
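A short sketch of this scoring and ranking step is given below. For self-containment it rebuilds the same small hypothetical corpus as the previous sketch and then uses a MatrixSimilarity index to score and sort the target documents; the query tokens are illustrative only.

from gensim import corpora, models, similarities

texts = [
    ["document", "rank", "similar", "queri"],
    ["search", "engin", "retriev", "inform"],
    ["document", "similar", "retriev", "inform"],
]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]
tfidf_model = models.TfidfModel(bow_corpus)

# Similarity index over the tf-idf representation of every target document
index = similarities.MatrixSimilarity(tfidf_model[bow_corpus],
                                      num_features=len(dictionary))

# Score a preprocessed source/query document against all targets and rank them
query_vec = dictionary.doc2bow(["document", "similar", "rank"])
scores = index[tfidf_model[query_vec]]
for doc_id, score in sorted(enumerate(scores), key=lambda p: p[1], reverse=True):
    print(doc_id, round(float(score), 3))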
3.2 UML Diagrams
3.2.1 Use Case Diagram
Figure 3.2.1 Use Case Diagram
3.2.2 Class Diagram
In software engineering, a class diagram in the Unified Modelling Language (UML) is a type
of static structure diagram that describes the structure of a system by showing the system's
classes, attributes, operations (or methods), and the relationships among the classes. It
explains which class contains what information. In this class diagram, there is a document with a document type, as well as a folder and a document version associated with it.
Figure 3.2.2 Class Diagram
3.2.3 Data Flow Diagram
The Data Flow Diagram (DFD) shows the information flow in the system: the user uploads and views the data, and the system evaluates it and provides the result.
Figure 3.2.3 Data Flow Diagram
CHAPTER 4 SYSTEM IMPLEMENTATION
4.1 Modules:
Ranking the documents:
- scoring_documents
- ranking_algorithms
- learning_to_Rank models
- result_presentation
Computing the evaluation parameters (illustrated in the sketch below):
- Precision
- Accuracy
- F1 score
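As an illustrative sketch only (the report does not fix an exact evaluation protocol), precision, recall and the F1 score for a ranked result list can be computed from a set of known relevant documents; the document identifiers below are hypothetical.

def evaluate(retrieved, relevant):
    # retrieved: ranked document ids returned by the system (top-k)
    # relevant:  document ids judged relevant for the query
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical top-5 ranked output and relevance judgements
retrieved_ids = [3, 1, 7, 2, 9]
relevant_ids = [1, 2, 3, 4]
print(evaluate(retrieved_ids, relevant_ids))  # precision 0.6, recall 0.75, F1 about 0.67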
4.2 Module Description:
Removal of Punctuation:
Word folding:
Stop Word Removal:
Stop word removal follows word folding. In natural language processing, stop word removal is a common technique used for text preprocessing. Stop words are words that are commonly used in a language but do not contribute to the meaning of a sentence. Examples of stop words in English include "a", "an", "the", "is", "are", "of", and "in". These words are usually removed from the text during the preprocessing stage as they do not provide any value for the analysis of the text. The main purpose of stop word removal is to reduce the size of the dataset and improve the accuracy of the downstream analysis of text in the model. In this stage, the preprocessed text from the dataset is further processed with stop word removal, and the resulting text is passed to the next stage. In conclusion, stop word removal is a common technique used in natural language processing for preprocessing text data, and it is useful in reducing the size of the dataset and improving the accuracy of the model.
Word Stemming:
Finally, the crucial step in the preprocessing is stemming, which follows stop word removal. Stemming is a common technique used in Natural Language Processing (NLP) for text pre-processing. It is the process of reducing a word to its base or root form, called a stem or lemma. This is done by removing the suffixes and prefixes from the word, which results in the stem being derived. Stemming is useful because it helps to reduce the dimensionality of the text data, making it easier to analyze and process. It also helps to normalize the text, allowing similar words to be treated as the same, which can improve the accuracy of the models used for information retrieval.
For example, consider the word "running". The Porter Stemming Algorithm removes the suffix "ing" to get "runn" and then reduces the double consonant to get "run". In the samples of the dataset, the preprocessed data is passed to the stemming stage, which returns the root of each word in the sample. In conclusion, stemming is a valuable technique in NLP that helps to normalize text and reduce dimensionality. However, it is important to use it judiciously and to combine it with other techniques to achieve the best results in text processing.
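A short sketch of this step with NLTK's PorterStemmer (the class imported in the appendix code) follows; the word list is illustrative only.

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# "running" -> "run": the "ing" suffix is removed and the double consonant reduced
for word in ["running", "eating", "documents", "ranking"]:
    print(word, "->", stemmer.stem(word))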
In conclusion, text data processing is a crucial step in natural language processing that
involves the conversion of raw textual data into a structured format that can be used for
analysis and modelling. This process involves several techniques such as tokenization,
stop word removal, stemming/lemmatization, and part-of-speech tagging, which can be
used individually or in combination depending on the specific requirements of the
analysis.
Initially, the implementation imports the Gensim and natural language processing libraries. The corpora, models, and similarities modules are imported from the Gensim library; these modules are used to create a dictionary of words from the corpus, train a TF-IDF model, and calculate document similarities. Several sample documents are read from text files and stored in a list called documents: the first document is stored in doc1, the second document in doc2, and so on. The documents are collected from different domains from Wikipedia. The text of each document is split into words and stored in another list called text_corpus; this creates a list of lists, where each inner list contains the words of a single document.
A Dictionary object is created using the text_corpus list, which creates a mapping between words and unique integer IDs. A corpus object is created by converting each document in text_corpus to a bag-of-words representation using the doc2bow method of the Dictionary object; this creates a list of sparse vectors, where each vector represents the frequency of each word in a single document. A TfidfModel object is trained on the corpus object, which creates a TF-IDF representation of each document in the corpus and assigns a weight to each word in each document based on how important it is to the document relative to the other documents in the corpus. A MatrixSimilarity object is created from the TF-IDF corpus; this object allows us to calculate the similarity between any two documents in the corpus. A query document is defined as the first document in documents. The text of the query document is converted to a bag-of-words representation using the same Dictionary object that was used to create the corpus. The similarity between the query document and each document in the corpus is calculated using the MatrixSimilarity object, which produces a list of similarity scores, where each score represents the similarity between the query document and a single document in the corpus. The similarity scores are sorted in descending order, and the document ID and similarity score of each document in the corpus are printed to the console.
The accuracy of the proposed Gensim model is compared with other similar methods; the accuracy of the Gensim model is higher than that of the other traditional models.
4.3 Algorithms:
4.3.1 Natural Language Processing:
Natural language processing (NLP) is a field within computer science and artificial intelligence. It deals with the interaction between computers and human language and focuses on how computers process and analyze large amounts of data. NLP techniques help computers understand the context of a document. Natural Language Processing (NLP) plays a
significant role in document ranking, especially in information retrieval systems such as
search engines. Document ranking refers to the process of determining the relevance of
documents to a user's query and presenting them in a ranked order.
Figure 4.3.1 NLP
4.3.2 Gensim:
Gensim is a popular open source natural language processing library used for
unsupervised topic modelling and that specializes in creating and manipulating vector
space models of natural language data. Vector space models represent text documents as
high-dimensional vectors, which can be analysed using various mathematical operations
to discover patterns, similarities, and relationships between them. Gensim provides a
suite of tools for building, training, and using vector space models, with a focus on
scalability, performance,and ease of use. One of the main features of Gensim is its
support for multiple text corpus formats, includingplain text, CSV, and preprocessed
corpus formats such as MMCorpus and LDA-C. Gensim provides a flexibleand efficient
way to preprocess text data, which involves tokenizing, stemming, stop-word removal,
and other tasks that are necessary to convert raw text into a form that can be used to
build vector space models. Preprocessing is typically done using Gensim's built-in
functions or custom pipelines, which can be configured to meet the specific needs of the
user. Gensim supports several popular vector space models, including bag-ofwords,
TFIDF, LSI (Latent Semantic Indexing), LDA (Latent Dirichlet Allocation), and
word2vec. These models differ in their underlying assumptions and mathematical
techniques, but they all share the goal of representing text documents as vectors in a
high-dimensional space. For example, the bag-of-words model represents each
document as a vector of term frequencies, where each term corresponds to a dimension
in the vector space.
Figure 4.3.2 Gensim
4.3.3 Term Frequency-Inverse Document Frequency:
Figure 4.3.3 TF-IDF
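Because this section is summarized by the figure above, a brief illustrative sketch of the standard TF-IDF weighting is added here. It uses the common textbook formulation (term frequency multiplied by the logarithm of the inverse document frequency) and may differ in detail from Gensim's internal weighting and normalization; the corpus is hypothetical.

import math

def tf_idf(term, doc_tokens, all_docs):
    # term frequency: how often the term occurs in this document
    tf = doc_tokens.count(term) / len(doc_tokens)
    # inverse document frequency: rarer terms across the corpus get higher weight
    df = sum(1 for d in all_docs if term in d)
    idf = math.log(len(all_docs) / df) if df else 0.0
    return tf * idf

docs = [
    ["document", "ranking", "similarity"],
    ["document", "retrieval", "query"],
    ["news", "article", "sports"],
]
# "document" appears in 2 of 3 documents, so it is down-weighted
print(tf_idf("document", docs[0], docs))
print(tf_idf("similarity", docs[0], docs))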
4.3.4 Bag of words:
The doc2bow method is a function provided by the Gensim library for converting a
document (list of words) into a bag-of-words format. Bag-of-words (BOW) is a
commonly used representation of text in natural language processing. In BOW, a
document is represented as a sparse vector of word frequencies, where each dimension
corresponds to a unique word in the vocabulary. The doc2bow method takes a list of
tokens as input and returns a list of tuples. Each tuple represents a word in the document
and its frequency count. The first element of the tuple is the word's index in the
vocabulary, and the second element is the word's frequency count in the document. Overall, the doc2bow method is an important tool for text processing in
natural language processing. It provides a simple and efficient way to convert
documents to a bag-of-words format, which can be used in various downstream tasks
such as topic modeling, clustering, and classification.
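A tiny sketch of the doc2bow output format described above follows; the token list is illustrative, and the exact integer ids depend on how the Dictionary assigns them.

from gensim import corpora

tokens = ["the", "cat", "sat", "on", "the", "mat"]
dictionary = corpora.Dictionary([tokens])
bow = dictionary.doc2bow(tokens)
# Each tuple is (word id, frequency); "the" occurs twice, so its tuple has count 2
print(bow)
print({dictionary[word_id]: count for word_id, count in bow})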
Figure 4.3.4 Bag of words
4.4 Testing:
Software testing techniques are methods used to design and execute tests to evaluate
software applications. It involves rigorous unit testing to validate the functionality of
individual modules, comprehensive integration testing to ensure seamless interaction
between components, and manual testing to assess overall system performance,
usability, and accessibility.
Integration testing plays a crucial role in ensuring the seamless interaction between various
components of the system in our project.
Determine the key integration points in your document ranking system. These might include
the interaction between document tokenization, term weighting, similarity calculation, and
the ranking algorithm. Verify the flow of data between different components. Ensure that data
is passed correctly from one module to another and that the transformations are applied as
intended.
Use mock objects or stubs to simulate external dependencies, such as databases or external
APIs. This allows you to control the input and focus on the interactions between the internal
components. Test how different components interact with each other. For example, ensure
that the term weights calculated during term weighting are correctly used in the similarity
calculation, and that the results are then appropriately considered in the ranking algorithm. If
your document ranking system interacts with external systems (e.g., a search engine platform,
database, or caching system), perform tests to ensure a smooth integration. Test scenarios like
data retrieval, updates, and error handling.
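A small sketch of how such an integration check could look in practice is given below, using plain asserts and stub components; the function names preprocess and rank_documents are hypothetical stand-ins for the project's own modules, not the actual implementation.

def preprocess(text):
    # Stub of the preprocessing module: folding plus whitespace tokenization only.
    return text.lower().split()

def rank_documents(query_tokens, docs_tokens):
    # Stub of the ranking module: score each document by the number of shared tokens.
    scores = [len(set(query_tokens) & set(d)) for d in docs_tokens]
    return sorted(range(len(docs_tokens)), key=lambda i: scores[i], reverse=True)

def test_pipeline_integration():
    docs = ["Document ranking with NLP", "Cooking recipes and food"]
    docs_tokens = [preprocess(d) for d in docs]
    query_tokens = preprocess("NLP document ranking")
    ranking = rank_documents(query_tokens, docs_tokens)
    # The document sharing the most terms with the query must be ranked first.
    assert ranking[0] == 0, "data did not flow correctly between components"

test_pipeline_integration()
print("integration check passed")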
System testing plays a crucial role in evaluating the entire document ranking system as a whole
to ensure that it meets the specified requirements and functions correctly in a real-world
environment. Identify and define various test scenarios that represent typical and edge use
cases. These scenarios should cover a range of queries, document types, and user interactions.
Perform end-to-end testing to simulate the entire user journey, from submitting a query to
receiving and displaying the ranked document results. Ensure that the system behaves as
expected at every step.
CHAPTER 5
Document ranking based on similarity involves the use of NLP techniques and the application of machine learning algorithms. The model applies NLP techniques for preprocessing, which includes word tokenization, removal of punctuation, stop word removal and word stemming. The dataset contains documents from different fields, with categories including biography, news and hand-written chapters; each category contains one source document and several target documents. The input of the proposed Gensim model is the preprocessed documents, and the output of the proposed system is the documents ranked by similarity with respect to the source document. The proposed method calculated the similarity score both before and after stemming; the similarity score is more accurate with stemming, while without stemming the similarity scores are not accurate. The accuracy of the proposed method is higher than that of the other traditional methods. A sample output of the proposed method is shown below.
SAMPLE
Initially, the dataset is collected from the sources and the documents contain
information about the biography of Dhoni.
The dataset with plain text is preprocessed with following steps such as word
tokenization, removal of punctuation, word folding, stop word removal and word
stemming.
The preprocessed documents are processed by the proposed method and the ranked documents are obtained.
5.1 Similarity Scores before and after stemming
CHAPTER 6
CONCLUSION
This project, “Document Ranking Based on Similarity using Natural Language Processing Technique”, ranks documents based on their similarity score with respect to the source document, which plays a crucial role in information retrieval. In the era of the digital world, digital information has been increasing widely and doubles roughly every five years. Manual accessing of data is a difficult and time-consuming process, and the traditional methods for accessing documents are not accurate. The preprocessing of text plays an important role in NLP-based models, yet most of the existing methods do not focus on preprocessing of text. The proposed Gensim model processes the text with five different methods: tokenization, removal of punctuation, word folding, stop word removal and word stemming. Word stemming plays an important role in measuring the similarity score. The existing models only focused on finding the similarity of documents and clustering them, whereas the proposed method ranks the documents based on the similarity score with respect to the user query. The proposed model is more accurate than other traditional models, with an accuracy of 1 after stemming and 0.86 before stemming. The proposed model helps in information retrieval applications such as web and search engines, the entertainment and news industries, and many more. Our work can be further improved by considering the ambiguity of homonyms in the documents; the homonyms of words can then be resolved to further improve the ranking accuracy.
REFERENCES
[1]. Benzi Xu et al. (2021), a similarity measure combining the pseudo-longest-common-subsequence (pseudo-LCS) and the Jaccard similarity coefficient with principal component analysis (PCA), Volume 132, 2022.
[2]. Qian Liu et al. (2021), association rules for measuring word similarity at a global level and fuzzy similarity for measuring the top-k words, in IEEE Access, vol. 9, pp. 126801-126821, 2021.
[4]. Shuaizhang et al. (2019), an extended citation model for scientific document clustering, in IEEE Access, vol. 9, pp. 150865-150877, 2021, doi: 10.1109/ACCESS.2021.3125729.
[6]. Mohamed Attia, Manal A. Abdel-Fattah, Ayman E. Khedr, A proposed multi criteria
indexing and ranking model for documents and web pages on large scale data, Journal of
King Saud University – Computer and Information Sciences, 2021.
[7]. Hikmat Ullah Khan, Shumaila Nasir, Kishwar Nasim, Danial Shabbir, Ahsan Mahmood,
Twitter trends: A ranking algorithm analysis on real time data, Expert Systems with
Applications, Volume 164, 2021.
[8]. Dimitris Pappas and Ion Androutsopoulos, A Neural Model for Joint Document and
Snippet Ranking in Question Answering for Large Document Collections, Department of
Informatics, Athens University of Economics and Business, Greece, Institute for Language
and Speech Processing Research Center ‘Athena’, Greece, 2021.
[9]. Tianrun Cai, Zeling He, Chuan Hong, Yichi Zhang, Yuk-Lam Ho, Jacqueline Honerlaw,
Alon Geva, Vidul Ayakulangara Panickan, Amanda King, David R Gagnon, Michael
Gaziano, Kelly Cho, Katherine Liao, Tianxi Cai, Scalable relevance ranking algorithm via
semantic similarity assessment improves efficiency of medical chart review, Journal of
Biomedical Informatics, Volume 132, 2022.
[11]. Bo Xu, Hongfei Lin, Yuan Lin, Kan Xu, Two-stage supervised ranking for emotion
cause extraction, Knowledge-Based Systems, Volume 228, 2021.
[12]. M. F. Bashir, H. Arshad, A. R. Javed, N. Kryvinska and S. S. Band, "Subjective
Answers Evaluation Using Machine Learning and Natural Language Processing," in
IEEE Access, vol. 9, pp. 158972-158983, 2021.
[13]. M. AbuSafiya, "Measuring Documents Similarity using Finite State Automata," 2020
2nd International Conference on Mathematics and Information Technology (ICMIT), 2020,
pp. 208-211.
[16]. F. Ye, X. Zhao, W. Luo, D. Li and W. Min, "Query-Adaptive Remote Sensing Image
Retrieval Based on Image Rank Similarity and Image-to-Query Class Similarity," in IEEE
Access, vol. 8, pp. 116824-116839, 2020.
[18]. Yun Li, Yongyao Jiang, Chaowei Yang, Manzhu Yu, Lara Kamal, Edward M.
Armstrong, Thomas Huang, David Moroni, Lewis J. McGibbney, Improving search ranking
of geospatial data based on deep learning using user behavior data, Computers &
Geosciences, Volume 142, 2020.
[23]. J. Kim, "A Document Ranking Method with Query-Related Web Context," in IEEE
Access, vol. 7, pp. 150168-150174, 2019.
[24]. C. Xia, T. He, W. Li, Z. Qin and Z. Zou, "Similarity Analysis of Law Documents Based
on Word2vec," 2019 IEEE 19th International Conference on Software Quality, Reliability and
Security Companion (QRS-C), 2019, pp. 354-357.
[25]. Y. Ma, P. Zhang and J. Ma, "An Ontology Driven Knowledge Block Summarization
Approach for Chinese Judgment Document Classification," in IEEE Access, vol. 6, pp.
71327-71338, 2018.
[27]. M. Liu, B. Lang, Z. Gu and A. Zeeshan, "Measuring similarity of academic articles with
semantic profile and joint word embedding," in Tsinghua Science and Technology, vol. 22,
no. 6, pp. 619-632, December 2017.
[28]. Olga Vechtomova, Murat Karamuftuoglu, Lexical cohesion and term proximity in document ranking, Information Processing & Management, Volume 44, Issue 4, 2008.
[29]. Czesław Daniłowicz, Jarosław Baliński, Document ranking based upon Markov chains, Information Processing & Management, Volume 37, Issue 4, 2001.
[30]. H. Shen, L. Xue, H. Wang, L. Zhang and J. Zhang, "B+-Tree Based Multi-Keyword
Ranked Similarity Search Scheme Over Encrypted Cloud Data," in IEEE Access, vol. 9, pp.
150865-150877, 2021, doi: 10.1109/ACCESS.2021.3125729.
APPENDIX I - SOURCE CODE
1. Preprocessing
import spacy
from nltk.stem.porter import PorterStemmer
from gensim.utils import simple_preprocess
lemmatized_text = " ".join(lemmatized_tokens)
# Print the resulting text
print(lemmatized_text)
doc2 = nlp(lemmatized_text)
# Convert the query to a bag-of-words vector using the corpus dictionary
query_vec = dictionary.doc2bow(query.lower().split())
# Calculate the similarities between the query vector and each document in the corpus
similarities = similarity_index[tfidf_model[query_vec]]
3. Accuracy
relevant_docs = [doc1, doc2, doc3, doc4, doc5, doc6]
# the relevant documents are assumed to be doc1, doc2, and doc7
num_relevant_docs = len(relevant_docs)
num_correct = 0
for doc_id, sim_score in result_docs:
    if documents[doc_id] in relevant_docs:
        num_correct += 1
accuracy = num_correct / num_relevant_docs
print(f"Accuracy: {accuracy:.2f}")
APPENDIX II - SCREENSHOTS
Figure 01 Original Text
Figure 02 Tokenization
Figure 05 Stop word removal
Figure 08 Document ranking based on the similarity score after stemming