Report PRIEE
Submitted by
K. Karthik 111721102056
J.V.B. Sathwik 111721102044
G.B.R.S. Vinayak 11172112179
March 2024
R.M.K. ENGINEERING COLLEGE
(An Autonomous Institution)
R.S.M. Nagar, Kavaraipettai-601 206
BONAFIDE CERTIFICATE
Certified that this project report “Multi-Modal Approach for Brain MRI Image
Enhancement and Tumor Detection Using CNN” is the bonafide work of KARTHIK K
(111721102056), SATHWIK J.V.B (111721102044), VINAYAK G.B.R.S (111721102179)
who carried out the 20CS713 Project Phase II work under my supervision.
SIGNATURE SIGNATURE
ACKNOWLEDGEMENT
We earnestly express our sincere gratitude and regard to our beloved Chairman Shri. R. S.
Munirathinam, our Vice Chairman, Shri. R. M. Kishore and our Director, Shri. R. Jyothi
Naidu, for the interest and affection shown towards us throughout the course.
We convey our sincere thanks to our Principal, Dr. K. A. Mohamed Junaid, for being the source of inspiration and encouragement throughout the course.
We extend our sincere thanks to our Professor and Head of the Department, Computer Science and Engineering, for the support and guidance extended to us.
We would like to express our sincere gratitude to our Project Guide Ms. P. Baby Shamini,
Assistant Professor, for her valuable suggestions towards the successful completion of this project.
We take this opportunity to extend our thanks to all faculty members of the Department of
Computer Science and Engineering, our parents and friends for all that they meant to us during the course of this project work.
Neurological and brain-related cancers are among the main causes of death worldwide. A common diagnostic modality is magnetic resonance imaging (MRI), yet the manual evaluation of MRI images by medical experts presents difficulties due to time constraints and variability. This research introduces a novel, two-module computerized method to increase the speed and accuracy of brain tumor detection.
The first module, termed the Image Enhancement Technique, utilizes a trio of machine
learning and imaging strategies—adaptive Wiener filtering, neural networks, and independent
component analysis—to normalize images and combat issues such as noise and varying low
region contrast. The second module uses Support Vector Machines to validate the output of
the first module and perform tumor segmentation and classification. Applied to various types
of brain tumors, including meningiomas and pituitary tumors, our method exhibited
significant improvements in contrast and classification efficiency. It achieved an average
sensitivity and specificity of 0.991, an accuracy of 0.989, and a Dice score (DSC) of 0.981.
Furthermore, the processing time of our method, averaging 0.43 seconds, was markedly lower
compared to existing methods. These results underscore the superior performance of our
approach over current state-of-the-art methods in terms of sensitivity, specificity, precision,
and DSC. Future enhancements will seek to increase the robustness of the tumor classification
method by employing a standardized approach across a suite of classifiers.
Keywords: Magnetic resonance imaging (MRI), image enhancement technique, brain
tumor segmentation, neural networks, brain tumor classification.
TABLE OF CONTENTS
2 SYSTEM ANALYSIS 10
2.1 Existing System 10
2.1.1 Disadvantages of Existing System 10
2.2 Proposed System 12
2.2.1 Advantages of Proposed System 12
3 SYSTEM DESIGN 14
3.1 System Architecture 14
3.2 UML Diagrams 18
3.2.1 Use Case Diagram 18
3.2.2 Class Diagram 19
3.2.3 Data Flow Diagram 20
3.2.4 Activity Diagram 21
4 SYSTEM IMPLEMENTATION 22
4.1 Modules 22
4.2 Module description 23
4.2.1 Data Set Collection 23
4.3.1 NLP 28
4.3.2 Gensim 29
4.3.3 Term Frequency-Inverse Document Frequency 30
4.4 Testing 32
LIST OF FIGURES
4.3.2 Gensim 29
4.3.3 TF-IDF 30
01 Original Text 45
02 Tokenization 46
03 Removal of punctuation 47
04 Word folding 47
06 Word Stemming 48
4.1 Modules 19
LIST OF ABBREVIATIONS
03 AI Artificial Intelligence
04 NLTK Natural Language Toolkit
05 SVM Support Vector Machine
06 OS Operating System
07 PCA Principal Component Analysis
08 STS Semantic Textual Similarity
09 RAM Random Access Memory
10 LSA Latent Semantic Analysis
CHAPTER 1 INTRODUCTION
For many years, retrieving information has been a demanding task for any organization because the volume of data keeps growing continuously. It is an uphill battle for a user to retrieve the required information from such an immense amount of data. In most organizations, information is stored in the form of documents, and the volume of documents available in digital form roughly doubles every five years. Identifying similar documents is therefore beneficial, and similarity among data plays a crucial role in information retrieval. Segregating documents helps the user in many ways, but manual segregation of documents takes a great deal of time and human resources. This project therefore demonstrates a technological approach that simplifies the task by automatically segregating data. To carry out this idea, various documents are used as input: the user supplies a query document, and the system automatically ranks the documents against the query based on similarity. This chapter briefly introduces the work by first discussing the overview of the project and the problem statement, followed by the objective, existing systems, the significance, and finally the limitations. Document similarity is a method of retrieving similar documents from large document collections for a single query. Ranking the documents in the result based on similarity helps the user access the required information efficiently. Document similarity is mainly used in information retrieval, which is a major task in any organization: it helps users formulate solutions to present problems by analyzing past results. Information is mainly stored in the form of documents or portable document formats.
Because of the increasing amount of data, search engines encounter various difficulties in fetching results that are relevant to users' search queries. Traditional document ranking methods are mostly based on similarity computations between documents and queries. In many cases, users may want to retrieve documents that are not only similar but also general or broad with regard to a certain topic. Therefore, to rank documents efficiently and accurately, document ranking needs a semantic similarity measure.
[1]. Benzi Xu et al. (2021) proposed a similarity measure based on the pseudo-longest-common-subsequence (pseudo-LCS), the Jaccard similarity coefficient and principal component analysis (PCA), Volume 132, 2022.
This work combined the pseudo-longest-common-subsequence (pseudo-LCS) and the Jaccard similarity coefficient on the basis of principal component analysis (PCA), and additionally applied K-medoids clustering to handle the soft constraints. To effectively measure the similarity of operation sequences, a deep analysis was performed to determine the information requirements and characteristics of the operation sequence similarity problem. A modified pseudo-LCS is proposed to record the first two pieces of information, and a corresponding backtracking algorithm is also presented. The Jaccard similarity coefficient is used to measure the last piece of information. These two similarity coefficients are combined using PCA to generate a novel comprehensive similarity coefficient. The numerical illustration shows that it can distinguish all the different cases with reasonable similarity values. Typical process route discovery is a practical problem; two conflicting soft constraints are introduced and solved with the K-medoids method. The proposed method effectively measures the similarity of operation sequences, and the resulting similarity coefficients allow the user to rank the sequences. The proposed methodology also handles soft constraints while clustering the documents. The reported advantage of this approach is improved product quality.
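As a generic illustration only (not the authors' exact implementation), the Jaccard similarity coefficient mentioned above compares two sequences by the ratio of shared elements to all distinct elements. The following minimal Python sketch assumes two hypothetical, already tokenized operation sequences seq_a and seq_b.

def jaccard_similarity(seq_a, seq_b):
    # Treat the sequences as sets of elements and compare their overlap.
    set_a, set_b = set(seq_a), set(seq_b)
    if not set_a and not set_b:
        return 1.0  # two empty sequences are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Example with two hypothetical operation sequences
seq_a = ["milling", "drilling", "tapping", "grinding"]
seq_b = ["milling", "drilling", "grinding", "polishing"]
print(jaccard_similarity(seq_a, seq_b))  # 3 shared / 5 distinct = 0.6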
[2]. Qian Liu et al. (2021) proposed the use of association rules for measuring word similarity at a global level and fuzzy similarity for measuring the top-k words, in IEEE Access, vol. 9, pp. 126801-126821, 2021.
This work proposed the use of association rules for measuring word similarity at a global level and fuzzy similarity for measuring the top-k words. For the top-k words, the authors proposed a similarity measure over word embeddings in which both local and global information is considered. The global information is measured with association rules and the local information is measured with word embeddings; the authors also compared the proposed method against eight state-of-the-art baselines. The datasets used by the authors are TREC disks 4&5, WT10G, and RCV1. The authors designed a fuzzy logic system which overcomes the problems associated with combining the two types of measures by inferring the similarity between words and then returning the top-k selected words. The advantage of fuzzy logic is that it provides a flexible and convenient way to transform expert knowledge expressed in natural language into fuzzy rules. The approach contains three components, local, global and fuzzy systems, but there is no component-wise validation.
[3]. N. Kumar, S. K. Yadav and D. S. Yadav, "Similarity Measure Approaches Applied in
Text Document Clustering for Information Retrieval," 2020 Sixth International
Conference on Parallel, Distributed and Grid Computing (PDGC), 2020, pp. 88-921.
This work used the standard K-means algorithm, which relies on a distance measure. The experiments used datasets for tasks such as clustering similar news articles, analysis of customer feedback, text mining, duplicate content detection, and finding similar documents. In each dataset, the number of assigned categories is the same as the number of clusters. Two evaluation measures, purity and entropy, are used to assess the quality of a clustering result. Euclidean distance measures are used to perform document clustering more effectively, while the Pearson and Jaccard measures are more suitable for finding rational clusters with high purity values, i.e., clusters dominated by documents from a single group. The advantage of the experiment is that the frequency is calculated according to the typed dataset, but the accuracy is very low.
[4]. Shuaizhang et al. (2019) proposed an extended citation model for scientific document clustering, in IEEE Access, vol. 9, pp. 150865-150877, 2021, doi: 10.1109/ACCESS.2021.3125729.
This work proposed an extended citation model for scientific document clustering, combining a citation network and a textual similarity network to enhance the performance of scientific document clustering. The experiments were conducted using the PMC and PubMed databases, two popular databases in the biomedical field that provide a large number of open-access, full-text scientific documents; 10,996 scientific documents were used. Java and R were used to implement the experiment. The data were preprocessed before the experiment, and a textual similarity network was constructed and integrated with the citation network, co-citation network and bibliographic coupling network. The authors demonstrated the practicability of the proposed extended citation model by comparing it with the traditional bibliographic coupling model and the textual similarity model for scientific document clustering, using the R programming language. They used a random walk algorithm as the community detection algorithm, whose input is the similarity network and whose output is the clustering result. The advantage is the efficient clustering of scientific documents by considering the frequency of the documents. The limitation is that a limited dataset was used.
In this model, the authors used Maximum Entropy Principle based Document Ranking with Term Selection Analysis (MEPDR-TSA) for cross-lingual information retrieval (CLIR). The user query in the Tamil language is translated into English. Then, the MEPDR technique is employed for ranking the documents and TSA is used for choosing a set of retrieved documents for each query. Finally, the retrieved English documents are converted back into Tamil using Google Translate, and the results are evaluated against precision, recall and F-score. The authors thus developed a maximum entropy based document ranking method with term selection analysis that converts the Tamil query into English and converts the retrieved English documents back into Tamil. The proposed method performs better than existing methods such as Okapi BM25, IB-MLIR, MULM, KNN and n-gram approaches. However, the method could be further improved with advanced document ranking techniques.
[6]. Mohamed Attia, Manal A. Abdel-Fattah, Ayman E. Khedr, A proposed multi criteria
indexing and ranking model for documents and web pages on large scale data, Journal
of King Saud University - Computer and Information Sciences, 2021.
This work proposed a multi-criteria indexing and retrieval model for web pages and documents. It uses different retrieval methods to obtain accurate documents and addresses issues with page ranking algorithms; the model utilizes the top seven criteria for indexing and retrieving results. In the first phase, users enter the required queries. The MCIR model then crawls online or offline: it first finds pages or documents existing on the web when working online, or in stored files on a machine when working offline. Once the system finds a page URL or document path, it visits it and evaluates it against the user's search query. In the second phase, the weight model generates a weight for each criterion according to user preferences, and then rank statistics are called to return the final page weight to the user. This model showed that ranking with multiple criteria produces different results than ranking with one or two criteria, as in previous algorithms. It retrieves documents and web pages based on the top seven criteria: page or document votes; keywords in the domain, page content and URL; page publish date; page modified date; number of links; page load time; and bad links.
[7]. Hikmat Ullah Khan, Shumaila Nasir, Kishwar Nasim, Danial Shabbir, Ahsan
Mahmood, Twitter trends: A ranking algorithm analysis on real time data, Expert
Systems with Applications, Volume 164, 2021.
This work explored Term Frequency-Inverse Document Frequency (TF-IDF), the Combined Component Approach (CCA) and the Biterm Topic Model (BTM) for finding topics and the terms within given topics. Data set: data is collected using the Twitter application programming interface (API), which extracts data from Twitter sources. First, the data is extracted from the Twitter API and the results are stored in an .xlsx file. In the next step, data cleaning involves deleting unused data and duplicates, doing a spell check, and other modifications that make the data easier to understand. Stemming reduces inflected word forms (for example, past tense verbs) to a common base, tokenization produces tokens according to the roles of words in a sentence, and normalization transforms the text into a consistent form. Data integration is a procedure that gathers information from several sources and unifies it. The data reduction procedure assists with large-scale dataset analysis by condensing high-volume datasets while retaining all relevant information. In the last phase, the dataset is used to apply various models, including TF-IDF, CCA and BTM, to extract the subjects from the tweet collection.
[8]. Dimitris Pappas and Ion Androutsopoulos, A Neural Model for Joint Document and
Snippet Ranking in Question Answering for Large Document Collections, Department
of Informatics, Athens University of Economics and Business, Greece, Institute for
Language and Speech Processing, Research Center ‘Athena’, Greece, 2021.
This work used POSIT-DRMM (PDRMM), a differentiable extension of DRMM, and proposed an architecture to jointly rank documents and snippets with respect to a question, the two particularly important stages in QA for large document collections. The proposed architecture was instantiated using a recent neural relevance model (PDRMM) and a BERT-based ranker. Using biomedical data (from BIOASQ), the authors showed that the two resulting joint models (PDRMM-based and BERT-based) vastly outperform the corresponding pipelines in snippet retrieval, the main goal in QA for document collections, using fewer parameters, while also remaining competitive in document retrieval. They provided a modified version of the Natural Questions dataset, suitable for document and snippet retrieval. The documents are retrieved using fewer parameters, such as snippets of the document for the questions. The advantage of this method is that document retrieval results are better than DRMM and several other neural rankers. However, the dataset should be extended to a multi-granular task; BIOASQ already includes this multi-granular task, but exact answers are provided only for factoid questions and are freely written by humans, as in MS-MARCO, with similar limitations. Hence, appropriately modified versions of the BIOASQ datasets are needed.
[9]. Tianrun Cai, Zeling He, Chuan Hong, Yichi Zhang, Yuk-Lam Ho, Jacqueline Honerlaw, Alon Geva, Vidul Ayakulangara Panickan, Amanda King, David R Gagnon, Michael Gaziano, Kelly Cho, Katherine Liao, Tianxi Cai, Scalable relevance ranking algorithm via semantic similarity assessment improves efficiency of medical chart review, Journal of Biomedical Informatics, Volume 132, 2022.
The authors used the pGUESS algorithm, a prior guided semantic similarity measure, to quantify the informativeness of a clinical note with respect to a given phenotype. The algorithm scores the relevance of a note as the cosine similarity between SEVnote and SEVref. The pGUESS algorithm is fully knowledge-based except for assigning notes into three categories via clustering. The results on note ranking for CAD performed at both VHA and PHS suggest high transportability of the pGUESS algorithm across two different healthcare systems, since the pGUESS algorithm does not require local EHR data but only knowledge sources and embedding vectors. The algorithm reduced the burden of chart review and improved the efficiency and accuracy of human annotation. Determining patient disease status via chart review is a critical yet labour-intensive task needed for training or validating robust EHR-based prediction algorithms.
The advantage of this algorithm is that the overall ranking quality, as measured by the rank correlation, was the highest for pGUESS compared to all other methods. However, the publicly available TMGuassian algorithm does not allow out-of-sample prediction, so only the portability of LDAgibbs and LEAvem was tested.
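As a rough illustration of the scoring step described above (not the pGUESS implementation itself), the cosine similarity between a note embedding and a reference embedding can be computed as follows; the vectors sev_note and sev_ref are hypothetical stand-ins for SEVnote and SEVref.

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embedding vectors standing in for SEVnote and SEVref
sev_note = np.array([0.2, 0.7, 0.1, 0.5])
sev_ref = np.array([0.25, 0.6, 0.0, 0.55])
print(round(cosine_similarity(sev_note, sev_ref), 3))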
This work used a combination of a traditional statistical method and a deep learning model, as well as a novel model based on multi-model nonlinear fusion proposed in the paper. The Ant Financial dataset comes from the Chinese text similarity contest held by Alipay; its data mainly come from Alipay's customer service and contain 100,000 records. Semantic textual similarity (STS) datasets from 2012 to 2015 are also used in the experiment. The model uses the Jaccard coefficient based on part of speech, Term Frequency-Inverse Document Frequency (TF-IDF) and the word2vec-CNN algorithm to measure the similarity of sentences, respectively. The model combines the traditional statistics-based sentence similarity calculation methods and completes the coarse-grained extraction of the sentence. The results of the Jaccard algorithm, the TF-IDF algorithm and word2vec-CNN are fed into a shallow fully connected neural network to train the model and give an ideal classification result. The Jaccard algorithm captures grammatical information, TF-IDF calculates sentence similarity from term frequency and inverse document frequency, and in the word2vec-CNN algorithm the sentence feature matrix is weighted by a multi-feature attention mechanism, which increases performance. The experimental results show that the proposed sentence similarity calculation method based on multi-feature fusion can balance the calculation results of multiple models. The matching accuracy of the sentence similarity calculation method based on multi-model nonlinear fusion is 84%, and the F1 value of the model is 75%. In this experiment, the similarity of a sentence is measured with respect to the meaning of the words in the sentence. The limitations are that the word vector given by the word2vec model is static and cannot describe the dynamic change of semantics, and that the accuracy can still be improved.
The feasibility of the project is analysed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements of the system is essential.
ECONOMICAL FEASIBILITY:
The economic feasibility of document ranking depends on various factors, primarily the specific application and the value it brings to users or businesses. Document ranking, often associated with information retrieval and search engines, can have economic benefits in different contexts.
Here are some considerations:
• User satisfaction and engagement
• Productivity and efficiency
• Adaptability and scalability
TECHNICAL FEASIBILITY:
The technical feasibility of document ranking involves assessing whether the
implementation of a document ranking system is technically viable given the available
technology, resources and infrastructure.
Here are key considerations for evaluating the technical feasibility of document ranking:
• Algorithm complexity
• Data availability and Quality
• Feature Engineering
• Testing and Evaluation
OPERATIONAL FEASIBILITY:
The operational feasibility for document ranking refers to whether the system can be
effectively integrated into existing operations and processes. It assesses the practicality of
implementing the document ranking system within the operational context of an organization.
Here are key considerations for evaluating the operational feasibility of document ranking:
• User acceptance
• Data input and output
• Resource availability
• Legal and regulatory compliance
CHAPTER 2 SYSTEM ANALYSIS
2.1 Existing System:
The existing system is based on natural language processing (NLP), a field within computer science and artificial intelligence. Natural language processing deals with the interaction between computers and human language and focuses on how computers process and analyze large amounts of data. NLP techniques help computers understand the context of a document. Owing to the increase of data in digital formats such as documents, classifying the data becomes a difficult task, so different methods are used to classify the data based on the user's requirements. Traditional methods such as manual retrieval of data take a huge amount of time and effort; therefore, automatic retrieval of data is used. Initially, the raw data is converted into a numerical format. Secondly, the data is preprocessed by performing tokenization and removal of stop words, which do not carry any meaning. Finally, the similarity metrics are computed according to the user's query. These steps are carried out with the help of NLP techniques. Existing methods mainly use hybrid models to obtain results. A hybrid model combines two or more techniques, since the accuracy of a single technique is lower than that of hybrid techniques. Hybrid techniques can be used in the analysis of large and complex documents, the entertainment industry, resume ranking and many more applications. However, the use of a hybrid model makes the process complex. The hybrid method is a combination of various NLP techniques such as TF-IDF, the Jaya and grey wolf optimizers, and the longest common subsequence, possibly combined with machine learning methods such as CNN. The hybrid method also increases the time complexity of the work, and most existing work does not extend the similarity metrics or the dictionary.
2.1.1 Disadvantages of Existing System:
Limited multimedia handling:
• Many existing document ranking methods are primarily designed for text-based content and may not effectively handle multimedia content such as images, videos, or audio.
• This can limit the usefulness of search engines for users seeking diverse types of information.
Scalability challenges:
• As the volume of documents and user queries continues to grow, existing document
ranking methods may struggle to maintain scalable performance without sacrificing
relevance or quality.
• This can result in slower response times or degraded search experiences for users.
Limited personalization:
• This lack of personalization can lead to suboptimal search experiences for users who
have diverse needs and preferences.
2.2 Proposed System:
In this model, ranking the documents based on the similarity to the user's query is done using natural language processing. Ranking the documents helps in the efficient retrieval of information in any firm, and this model helps in identifying the most relevant and important information based on the user's requirement.
The proposed work is implemented in Python 3.8 with the libraries Gensim, spaCy, NumPy and NLTK, the corpora, models and similarities modules, and other mandatory libraries. There are many applications of natural language processing, such as information retrieval, classification of documents, spell checking, estimation of similarity, keyword extraction, language translation and many other information retrieval problems. Most information retrieval problems are solved using NLP techniques, and in these techniques data preprocessing plays an important role. The proposed model is implemented using an unsupervised learning algorithm and the well-known Python library Gensim, which ranks the documents based on similarity to the user's query with better accuracy. The dataset, which contains simple text, is preprocessed with different steps such as word tokenization, removal of punctuation, stop word removal and word stemming; word stemming plays a crucial role in finding the accurate similarity. The preprocessed documents are split into source and query documents, the proposed model is applied to the preprocessed documents, the source is compared with the other target documents, a score is obtained for each document, and the documents are ranked based on similarity to the user's query. The performance of the proposed method is better than that of the traditional methods.
2.2.1 Advantages of Proposed System:
• Time-efficient:
The automatic generation of solutions to any natural language problem reduces the time the user has to spend on the task.
• Market Research and Analysis:
Automatic summarization of research papers, Extraction of keywords, and clustering
of similar documents in query help the researchers.
• Streamlined processes:
Traditional communication channels such as help centres and customer care have been replaced by chatbots to improve customer satisfaction.
• E-learning:
NLP machine learning technology can examine the language used in a classroom to
define the mental states of both teachers and students.
CHAPTER 3 SYSTEM DESIGN
3.1 System Architecture
Figure 3.1 System Architecture
• Preprocessing:
Preprocessing consists of term folding, term tokenization, removal of punctuation, stop term elimination and term stemming. Preprocessing is the first thing to do to a document so that further processing is easier; it raises reliability and accuracy. Preprocessing data can increase the correctness and quality of a dataset, making it more reliable by removing missing or inconsistent data values brought on by human or computer mistakes. It ensures consistency in the data.
• Term Folding:
Preprocessing is applied to the data once it has been received. Term folding is a preprocessing technique that lowercases words that are currently in uppercase, so that two terms differing only in case are treated as the same lowercase term.
• Removal of Punctuation:
Punctuation marks are used in written natural language, but they carry little meaning for similarity analysis. The punctuation and extra spaces in the data are therefore eliminated for simpler processing.
• Term Tokenization:
Term tokenization then separates the raw text into tokens, that is, words and sentences. These tokens help in determining the context and analyzing the meaning of the text.
• Stop Term Elimination:
Elimination of stop terms occurs in the next preprocessing stage. In any language, stop terms are a group of frequently used terms; stop words in English include "the", "is" and "and", for instance. Stop word removal discards such unnecessary words so that subsequent processing can concentrate on the crucial ones. It is one of the most commonly used preprocessing steps across different NLP applications.
• Term Stemming:
Word stemming, the final preprocessing stage, removes the final few characters from
a word, frequently resulting in inaccurate spelling and meaning. Lemmatization
considers the context and converts the word to its meaningful base form, which is
called Lemma. For instance, stemming the word 'Eating' would return 'Eat'.
After preprocessing is finished, the input source document and the set of documents in the dataset are updated as preprocessed data.
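A minimal sketch of this preprocessing chain is shown below. It is assembled from the libraries imported in the appendix code (Gensim's simple_preprocess, NLTK stop words and the Porter stemmer), and the sample sentence is only illustrative; the project's own preprocessing may differ in detail.

import nltk
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download("stopwords", quiet=True)  # stop word list used below

def preprocess(text):
    # simple_preprocess lowercases the text (term folding), tokenizes it
    # and drops punctuation in a single step.
    tokens = simple_preprocess(text)
    # stop term elimination
    stop_terms = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_terms]
    # term stemming with the Porter stemmer
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The documents are ranked based on the similarity of the user's query."))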
Gensim is an open-source Python natural language processing library used for unsupervised modelling. The main features of Gensim are its scalability, robustness, platform independence and efficient multicore implementations. Gensim supports fastText, Word2vec, LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation) and tf-idf (term frequency-inverse document frequency). fastText, which uses a neural network for word embedding, is a library for learning word embeddings and text classification. Word2vec, used to produce word embeddings, is a group of shallow, two-layer neural network models. LSA is a technique in NLP (Natural Language Processing) that allows us to analyse the relationships between a set of documents and the terms they contain. LDA is a technique in NLP that allows sets of observations to be explained by unobserved groups; these unobserved groups explain why some parts of the data are similar. tf-idf, a numeric statistic in information retrieval, reflects how important a word is to a document in a corpus; it is often used by search engines to score and rank a document's relevance given a user query.
The facilities provided by Gensim for building topic models and word embeddings are unparalleled, and it also provides convenient facilities for text processing. It can handle large text files even without loading the whole file into memory, and it does not require costly annotations or hand tagging of documents because it uses unsupervised models.
The core concepts of Gensim are document, corpus, vector and model. A document is an object of the text sequence type, known as 'str' in Python 3. A corpus may be defined as a large, structured set of machine-readable texts produced in a natural communicative setting; in Gensim, a collection of document objects is called a corpus. The corpus serves both as the input for training a model and as the collection from which topics are extracted. A vector is a mathematical representation of a document, and a model refers to an algorithm used for transforming vectors from one representation to another. For working on text documents, Gensim also requires the words, i.e. tokens, to be converted to their unique ids. For this, it provides the Dictionary object, which maps each word to a unique integer id; it does this by converting the input text to a list of words and then passing it to the corpora.Dictionary() object. In Gensim, the dictionary object is used to create a bag-of-words (BoW) corpus, which is further used as the input to topic modelling and other models.
The Term Frequency-Inverse Document Frequency model is also a bag-of-words model. It differs from the regular corpus because it down-weights tokens, i.e. words, that appear frequently across documents. During initialisation, the tf-idf model algorithm expects a training corpus with integer values (such as a bag-of-words model). Then, at transformation time, it takes a vector representation and returns another vector representation; the output vector has the same dimensionality, but the value of the rare features (rare at training time) is increased. It basically converts integer-valued vectors into real-valued vectors.
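The following minimal sketch, built on a small hypothetical corpus of already preprocessed token lists, shows how the Dictionary, the bag-of-words corpus and the tf-idf model described above fit together; it mirrors the general Gensim workflow rather than the project's exact code.

from gensim import corpora, models

# Hypothetical corpus: each inner list is one preprocessed document
texts = [
    ["document", "rank", "similar", "queri"],
    ["search", "engin", "retriev", "inform"],
    ["document", "similar", "retriev", "inform"],
]

dictionary = corpora.Dictionary(texts)               # maps each word to a unique integer id
bow_corpus = [dictionary.doc2bow(t) for t in texts]  # sparse (id, count) vectors
tfidf_model = models.TfidfModel(bow_corpus)          # trained on the integer-valued vectors

# Transforming a bag-of-words vector returns a real-valued tf-idf vector
print(tfidf_model[bow_corpus[0]])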
After the Gensim model is built, the input data is compared with the dataset to measure similarity: a score is computed for each target document (i.e., each document in the dataset) with respect to the source document, and the target documents are then sorted according to the scores obtained, meaning the documents are ranked by their scores.
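A short sketch of this scoring and ranking step is given below. For self-containment it rebuilds the same small hypothetical corpus as the previous sketch and then uses a MatrixSimilarity index to score and sort the target documents; the query tokens are illustrative only.

from gensim import corpora, models, similarities

texts = [
    ["document", "rank", "similar", "queri"],
    ["search", "engin", "retriev", "inform"],
    ["document", "similar", "retriev", "inform"],
]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]
tfidf_model = models.TfidfModel(bow_corpus)

# Similarity index over the tf-idf representation of every target document
index = similarities.MatrixSimilarity(tfidf_model[bow_corpus],
                                      num_features=len(dictionary))

# Score a preprocessed source/query document against all targets and rank them
query_vec = dictionary.doc2bow(["document", "similar", "rank"])
scores = index[tfidf_model[query_vec]]
for doc_id, score in sorted(enumerate(scores), key=lambda p: p[1], reverse=True):
    print(doc_id, round(float(score), 3))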
3.2 UML Diagrams
3.2.1 Use Case Diagram
Figure 3.2.1 Use Case Diagram
3.2.2 Class Diagram
In software engineering, a class diagram in the Unified Modelling Language (UML) is a type
of static structure diagram that describes the structure of a system by showing the system's
classes, attributes, operations (or methods), and the relationships among the classes. It
explains which class contains what information. In this class diagram, there is a document with a document type, as well as a folder and a document version associated with it.
Figure 3.2.2 Class Diagram
3.2.3 Data Flow Diagram
The Data Flow Diagram (DFD) shows the information flow in the system: the user uploads and views the data, and the system evaluates it and provides the result.
Figure 3.2.3 Data Flow Diagram
CHAPTER 4 SYSTEM IMPLEMENTATION
4.1 Modules:
Ranking the documents:
- scoring_documents
- ranking_algorithms
- learning_to_Rank models
- result_presentation
Computing the evaluation parameters (illustrated in the sketch below):
- Precision
- Accuracy
- F1 score
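As an illustrative sketch only (the report does not fix an exact evaluation protocol), precision, recall and the F1 score for a ranked result list can be computed from a set of known relevant documents; the document identifiers below are hypothetical.

def evaluate(retrieved, relevant):
    # retrieved: ranked document ids returned by the system (top-k)
    # relevant:  document ids judged relevant for the query
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical top-5 ranked output and relevance judgements
retrieved_ids = [3, 1, 7, 2, 9]
relevant_ids = [1, 2, 3, 4]
print(evaluate(retrieved_ids, relevant_ids))  # precision 0.6, recall 0.75, F1 about 0.67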
4.2 Module Description:
Removal of Punctuation:
Word folding:
Stop Word Removal:
Stop word removal follows word folding. In natural language processing, stop word removal is a common technique used for text preprocessing. Stop words are words that are commonly used in a language but do not contribute to the meaning of a sentence. Examples of stop words in English include "a", "an", "the", "is", "are", "of", and "in". These words are usually removed from the text during the preprocessing stage as they do not provide any value for the analysis of the text. The main purpose of stop word removal is to reduce the size of the dataset and improve the accuracy of the downstream analysis of text in the model. In this stage, the preprocessed text from the dataset is further processed with stop word removal, and the resulting text is passed to the next stage. In conclusion, stop word removal is a common technique used in natural language processing for preprocessing text data, and it is useful in reducing the size of the dataset and improving the accuracy of the model.
Word Stemming:
Finally, the crucial step in the preprocessing is stemming, which follows stop word removal. Stemming is a common technique used in Natural Language Processing (NLP) for text pre-processing. It is the process of reducing a word to its base or root form, called a stem or lemma. This is done by removing the suffixes and prefixes from the word, which results in the stem being derived. Stemming is useful because it helps to reduce the dimensionality of the text data, making it easier to analyze and process. It also helps to normalize the text, allowing similar words to be treated as the same, which can improve the accuracy of the models used for information retrieval.
For example, consider the word "running". The Porter Stemming Algorithm removes the suffix "ing" to get "runn" and then reduces the double consonant to get "run". In the samples of the dataset, the preprocessed data is passed to the stemming stage, which returns the root of each word in the sample. In conclusion, stemming is a valuable technique in NLP that helps to normalize text and reduce dimensionality. However, it is important to use it judiciously and to combine it with other techniques to achieve the best results in text processing.
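A short sketch of this step with NLTK's PorterStemmer (the class imported in the appendix code) follows; the word list is illustrative only.

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# "running" -> "run": the "ing" suffix is removed and the double consonant reduced
for word in ["running", "eating", "documents", "ranking"]:
    print(word, "->", stemmer.stem(word))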
In conclusion, text data processing is a crucial step in natural language processing that
involves the conversion of raw textual data into a structured format that can be used for
analysis and modelling. This process involves several techniques such as tokenization,
stop word removal, stemming/lemmatization, and part-of-speech tagging, which can be
used individually or in combination depending on the specific requirements of the
analysis.
Initially, the implementation imports the Gensim and natural language processing libraries. The corpora, models, and similarities modules are imported from the Gensim library; these modules are used to create a dictionary of words from the corpus, train a TF-IDF model, and calculate document similarities. Several sample documents are read from text files and stored in a list called documents: the first document is stored in doc1, the second document in doc2, and so on. The documents are collected from different domains from Wikipedia. The text of each document is split into words and stored in another list called text_corpus; this creates a list of lists, where each inner list contains the words of a single document.
A Dictionary object is created using the text_corpus list, which creates a mapping between words and unique integer IDs. A corpus object is created by converting each document in text_corpus to a bag-of-words representation using the doc2bow method of the Dictionary object; this creates a list of sparse vectors, where each vector represents the frequency of each word in a single document. A TfidfModel object is trained on the corpus object, which creates a TF-IDF representation of each document in the corpus and assigns a weight to each word in each document based on how important it is to the document relative to the other documents in the corpus. A MatrixSimilarity object is created from the TF-IDF corpus; this object allows us to calculate the similarity between any two documents in the corpus. A query document is defined as the first document in documents. The text of the query document is converted to a bag-of-words representation using the same Dictionary object that was used to create the corpus. The similarity between the query document and each document in the corpus is calculated using the MatrixSimilarity object, which produces a list of similarity scores, where each score represents the similarity between the query document and a single document in the corpus. The similarity scores are sorted in descending order, and the document ID and similarity score of each document in the corpus are printed to the console.
The accuracy of the proposed Gensim model is compared with other similar methods; the accuracy of the Gensim model is higher than that of the other traditional models.
4.3 Algorithms:
4.3.1 Natural Language Processing:
Natural language processing (NLP) is a field within computer science and artificial intelligence. It deals with the interaction between computers and human language and focuses on how computers process and analyze large amounts of data. NLP techniques help computers understand the context of a document. Natural Language Processing (NLP) plays a
significant role in document ranking, especially in information retrieval systems such as
search engines. Document ranking refers to the process of determining the relevance of
documents to a user's query and presenting them in a ranked order.
Figure 4.3.1 NLP
4.3.2 Gensim:
Gensim is a popular open source natural language processing library used for
unsupervised topic modelling and that specializes in creating and manipulating vector
space models of natural language data. Vector space models represent text documents as
high-dimensional vectors, which can be analysed using various mathematical operations
to discover patterns, similarities, and relationships between them. Gensim provides a
suite of tools for building, training, and using vector space models, with a focus on
scalability, performance,and ease of use. One of the main features of Gensim is its
support for multiple text corpus formats, includingplain text, CSV, and preprocessed
corpus formats such as MMCorpus and LDA-C. Gensim provides a flexibleand efficient
way to preprocess text data, which involves tokenizing, stemming, stop-word removal,
and other tasks that are necessary to convert raw text into a form that can be used to
build vector space models. Preprocessing is typically done using Gensim's built-in
functions or custom pipelines, which can be configured to meet the specific needs of the
user. Gensim supports several popular vector space models, including bag-ofwords,
TFIDF, LSI (Latent Semantic Indexing), LDA (Latent Dirichlet Allocation), and
word2vec. These models differ in their underlying assumptions and mathematical
techniques, but they all share the goal of representing text documents as vectors in a
high-dimensional space. For example, the bag-of-words model represents each
document as a vector of term frequencies, where each term corresponds to a dimension
in the vector space.
Figure 4.3.2 Gensim
4.3.3 Term Frequency-Inverse Document Frequency:
Figure 4.3.3 TF-IDF
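Because this section is summarized by the figure above, a brief illustrative sketch of the standard TF-IDF weighting is added here. It uses the common textbook formulation (term frequency multiplied by the logarithm of the inverse document frequency) and may differ in detail from Gensim's internal weighting and normalization; the corpus is hypothetical.

import math

def tf_idf(term, doc_tokens, all_docs):
    # term frequency: how often the term occurs in this document
    tf = doc_tokens.count(term) / len(doc_tokens)
    # inverse document frequency: rarer terms across the corpus get higher weight
    df = sum(1 for d in all_docs if term in d)
    idf = math.log(len(all_docs) / df) if df else 0.0
    return tf * idf

docs = [
    ["document", "ranking", "similarity"],
    ["document", "retrieval", "query"],
    ["news", "article", "sports"],
]
# "document" appears in 2 of 3 documents, so it is down-weighted
print(tf_idf("document", docs[0], docs))
print(tf_idf("similarity", docs[0], docs))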
4.3.4 Bag of words:
The doc2bow method is a function provided by the Gensim library for converting a
document (list of words) into a bag-of-words format. Bag-of-words (BOW) is a
commonly used representation of text in natural language processing. In BOW, a
document is represented as a sparse vector of word frequencies, where each dimension
corresponds to a unique word in the vocabulary. The doc2bow method takes a list of
tokens as input and returns a list of tuples. Each tuple represents a word in the document
and its frequency count. The first element of the tuple is the word's index in the
vocabulary, and the second element is the word's frequency count in the document. Overall, the doc2bow method is an important tool for text processing in
natural language processing. It provides a simple and efficient way to convert
documents to a bag-of-words format, which can be used in various downstream tasks
such as topic modeling, clustering, and classification.
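A tiny sketch of the doc2bow output format described above follows; the token list is illustrative, and the exact integer ids depend on how the Dictionary assigns them.

from gensim import corpora

tokens = ["the", "cat", "sat", "on", "the", "mat"]
dictionary = corpora.Dictionary([tokens])
bow = dictionary.doc2bow(tokens)
# Each tuple is (word id, frequency); "the" occurs twice, so its tuple has count 2
print(bow)
print({dictionary[word_id]: count for word_id, count in bow})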
Figure 4.3.4 Bag of words
4.4 Testing:
Software testing techniques are methods used to design and execute tests to evaluate
software applications. It involves rigorous unit testing to validate the functionality of
individual modules, comprehensive integration testing to ensure seamless interaction
between components, and manual testing to assess overall system performance,
usability, and accessibility.
Integration testing plays a crucial role in ensuring the seamless interaction between various
components of the system in our project.
Determine the key integration points in your document ranking system. These might include
the interaction between document tokenization, term weighting, similarity calculation, and
the ranking algorithm. Verify the flow of data between different components. Ensure that data
is passed correctly from one module to another and that the transformations are applied as
intended.
Use mock objects or stubs to simulate external dependencies, such as databases or external
APIs. This allows you to control the input and focus on the interactions between the internal
components. Test how different components interact with each other. For example, ensure
that the term weights calculated during term weighting are correctly used in the similarity
calculation, and that the results are then appropriately considered in the ranking algorithm. If
your document ranking system interacts with external systems (e.g., a search engine platform,
database, or caching system), perform tests to ensure a smooth integration. Test scenarios like
data retrieval, updates, and error handling.
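A small sketch of how such an integration check could look in practice is given below, using plain asserts and stub components; the function names preprocess and rank_documents are hypothetical stand-ins for the project's own modules, not the actual implementation.

def preprocess(text):
    # Stub of the preprocessing module: folding plus whitespace tokenization only.
    return text.lower().split()

def rank_documents(query_tokens, docs_tokens):
    # Stub of the ranking module: score each document by the number of shared tokens.
    scores = [len(set(query_tokens) & set(d)) for d in docs_tokens]
    return sorted(range(len(docs_tokens)), key=lambda i: scores[i], reverse=True)

def test_pipeline_integration():
    docs = ["Document ranking with NLP", "Cooking recipes and food"]
    docs_tokens = [preprocess(d) for d in docs]
    query_tokens = preprocess("NLP document ranking")
    ranking = rank_documents(query_tokens, docs_tokens)
    # The document sharing the most terms with the query must be ranked first.
    assert ranking[0] == 0, "data did not flow correctly between components"

test_pipeline_integration()
print("integration check passed")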
System testing plays a crucial role in evaluating the entire document ranking system as a whole
to ensure that it meets the specified requirements and functions correctly in a real-world
environment. Identify and define various test scenarios that represent typical and edge use
cases. These scenarios should cover a range of queries, document types, and user interactions.
Perform end-to-end testing to simulate the entire user journey, from submitting a query to
receiving and displaying the ranked document results. Ensure that the system behaves as
expected at every step.
CHAPTER 5
Document ranking based on similarity involves the use of NLP techniques and the application of machine learning algorithms. The model applies NLP techniques for preprocessing, which includes word tokenization, removal of punctuation, stop word removal and word stemming. The dataset contains documents from different fields, with categories including biography, news and hand-written chapters; each category contains one source document and several target documents. The input of the proposed Gensim model is the preprocessed documents, and the output of the proposed system is the documents ranked by similarity with respect to the source document. The proposed method calculated the similarity score both before and after stemming; the similarity score is more accurate with stemming, while without stemming the similarity scores are not accurate. The accuracy of the proposed method is higher than that of the other traditional methods. A sample output of the proposed method is shown below.
SAMPLE
Initially, the dataset is collected from the sources and the documents contain
information about the biography of Dhoni.
The dataset with plain text is preprocessed with following steps such as word
tokenization, removal of punctuation, word folding, stop word removal and word
stemming.
The preprocessed documents are processed by the proposed method and the ranked documents are obtained.
5.1 Similarity Scores before and after stemming
CHAPTER 6
CONCLUSION
This project, “Document Ranking Based on Similarity using Natural Language Processing Technique”, ranks documents based on their similarity score with respect to the source document, which plays a crucial role in information retrieval. In the era of the digital world, digital information has been increasing widely and doubles roughly every five years. Manual accessing of data is a difficult and time-consuming process, and the traditional methods for accessing documents are not accurate. The preprocessing of text plays an important role in NLP-based models, yet most of the existing methods do not focus on preprocessing of text. The proposed Gensim model processes the text with five different methods: tokenization, removal of punctuation, word folding, stop word removal and word stemming. Word stemming plays an important role in measuring the similarity score. The existing models only focused on finding the similarity of documents and clustering them, whereas the proposed method ranks the documents based on the similarity score with respect to the user query. The proposed model is more accurate than other traditional models, with an accuracy of 1 after stemming and 0.86 before stemming. The proposed model helps in information retrieval applications such as web and search engines, the entertainment and news industries, and many more. Our work can be further improved by considering the ambiguity of homonyms in the documents; the homonyms of words can then be resolved to further improve the ranking accuracy.
REFERENCES
[1]. Benzi Xu et al. (2021), a similarity measure combining the pseudo-longest-common-subsequence (pseudo-LCS) and the Jaccard similarity coefficient with principal component analysis (PCA), Volume 132, 2022.
[2]. Qian Liu et al. (2021), association rules for measuring word similarity at a global level and fuzzy similarity for measuring the top-k words, in IEEE Access, vol. 9, pp. 126801-126821, 2021.
[4]. Shuaizhang et al. (2019), an extended citation model for scientific document clustering, in IEEE Access, vol. 9, pp. 150865-150877, 2021, doi: 10.1109/ACCESS.2021.3125729.
[6]. Mohamed Attia, Manal A. Abdel-Fattah, Ayman E. Khedr, A proposed multi criteria
indexing and ranking model for documents and web pages on large scale data, Journal of
King Saud University – Computer and Information Sciences, 2021.
[7]. Hikmat Ullah Khan, Shumaila Nasir, Kishwar Nasim, Danial Shabbir, Ahsan Mahmood,
Twitter trends: A ranking algorithm analysis on real time data, Expert Systems with
Applications, Volume 164, 2021.
[8]. Dimitris Pappas and Ion Androutsopoulos, A Neural Model for Joint Document and
Snippet Ranking in Question Answering for Large Document Collections, Department of
Informatics, Athens University of Economics and Business, Greece, Institute for Language
and Speech Processing Research Center ‘Athena’, Greece, 2021.
[9]. Tianrun Cai, Zeling He, Chuan Hong, Yichi Zhang, Yuk-Lam Ho, Jacqueline Honerlaw,
Alon Geva, Vidul Ayakulangara Panickan, Amanda King, David R Gagnon, Michael
Gaziano, Kelly Cho, Katherine Liao, Tianxi Cai, Scalable relevance ranking algorithm via
semantic similarity assessment improves efficiency of medical chart review, Journal of
Biomedical Informatics, Volume 132, 2022.
[11]. Bo Xu, Hongfei Lin, Yuan Lin, Kan Xu, Two-stage supervised ranking for emotion
cause extraction, Knowledge-Based Systems, Volume 228, 2021.
[12]. M. F. Bashir, H. Arshad, A. R. Javed, N. Kryvinska and S. S. Band, "Subjective
Answers Evaluation Using Machine Learning and Natural Language Processing," in
IEEE Access, vol. 9, pp. 158972-158983, 2021.
[13]. M. AbuSafiya, "Measuring Documents Similarity using Finite State Automata," 2020
2nd International Conference on Mathematics and Information Technology (ICMIT), 2020,
pp. 208-211.
[16]. F. Ye, X. Zhao, W. Luo, D. Li and W. Min, "Query-Adaptive Remote Sensing Image
Retrieval Based on Image Rank Similarity and Image-to-Query Class Similarity," in IEEE
Access, vol. 8, pp. 116824-116839, 2020.
[18]. Yun Li, Yongyao Jiang, Chaowei Yang, Manzhu Yu, Lara Kamal, Edward M.
Armstrong, Thomas Huang, David Moroni, Lewis J. McGibbney, Improving search ranking
of geospatial data based on deep learning using user behavior data, Computers &
Geosciences, Volume 142, 2020.
[23]. J. Kim, "A Document Ranking Method with Query-Related Web Context," in IEEE
Access, vol. 7, pp. 150168-150174, 2019.
[24]. C. Xia, T. He, W. Li, Z. Qin and Z. Zou, "Similarity Analysis of Law Documents Based
on Word2vec," 2019 IEEE 19th International Conference on Software Quality, Reliability and
Security Companion (QRS-C), 2019, pp. 354-357.
[25]. Y. Ma, P. Zhang and J. Ma, "An Ontology Driven Knowledge Block Summarization
Approach for Chinese Judgment Document Classification," in IEEE Access, vol. 6, pp.
71327-71338, 2018.
[27]. M. Liu, B. Lang, Z. Gu and A. Zeeshan, "Measuring similarity of academic articles with
semantic profile and joint word embedding," in Tsinghua Science and Technology, vol. 22,
no. 6, pp. 619-632, December 2017.
[28]. Olga Vechtomova, Murat Karamuftuoglu, Lexical cohesion and term proximity in document ranking, Information Processing & Management, Volume 44, Issue 4, 2008.
[29]. Czesław Daniłowicz, Jarosław Baliński, Document ranking based upon Markov chains, Information Processing & Management, Volume 37, Issue 4, 2001.
[30]. H. Shen, L. Xue, H. Wang, L. Zhang and J. Zhang, "B+-Tree Based Multi-Keyword
Ranked Similarity Search Scheme Over Encrypted Cloud Data," in IEEE Access, vol. 9, pp.
150865-150877, 2021, doi: 10.1109/ACCESS.2021.3125729.
APPENDIX I - SOURCE CODE
1. Preprocessing
import spacy
from nltk.stem.porter import PorterStemmer
from gensim.utils import simple_preprocess
lemmatized_text = " ".join(lemmatized_tokens)
# Print the resulting text
print(lemmatized_text)
doc2 = nlp(lemmatized_text)
# Convert the query to a bag-of-words vector using the corpus dictionary
query_vec = dictionary.doc2bow(query.lower().split())
# Calculate the similarities between the query vector and each document in the corpus
similarities = similarity_index[tfidf_model[query_vec]]
3. Accuracy
relevant_docs = [doc1, doc2, doc3, doc4, doc5, doc6]
# the relevant documents are assumed to be doc1, doc2, and doc7
num_relevant_docs = len(relevant_docs)
num_correct = 0
for doc_id, sim_score in result_docs:
    if documents[doc_id] in relevant_docs:
        num_correct += 1
accuracy = num_correct / num_relevant_docs
print(f"Accuracy: {accuracy:.2f}")
APPENDIX II - SCREENSHOTS
Figure 01 Original Text
Figure 02 Tokenization
Figure 05 Stop word removal
Figure 08 Document ranking based on the similarity score after stemming