
“Effective Information Retrieval Using Semantic Model”


Abstract
Information Retrieval (IR) involves retrieving relevant information from large datasets. Semantic models are considered promising tools for enhancing the effectiveness of IR systems. However, IR systems face various challenges in answering even unambiguous queries, and in serving users who require more sophisticated knowledge of the relevant information. Furthermore, an IR system's performance largely depends on the models and algorithms used to compare documents with queries. This study therefore aims to distinguish between semantic and non-semantic models and to compare their efficiency in terms of precision and average precision. It seeks to illustrate the capability of semantic models to increase the accuracy of information retrieval and to help advance the IR field. The proposed methodology comprises three phases: the first is the query process for Latent Semantic Query generation, and the second involves using different models for comparison and dataset selection. A two-step evaluation phase is employed for the abstraction and comparison of outcomes. The findings show that Semantic Query Generation combined with a semantic model such as Word2Vec is superior to the other models in both evaluation processes and can effectively capture semantic relationships. Moreover, adding semantics to query processing raises retrieval accuracy by up to 90%.
Keywords: Information Retrieval, Semantic Model, Cosine Similarity, Vector Space Model, Word2Vec, Latent Semantic Query generation, COVID-19 ontology.

Introduction:
In today's world, the rapid expansion of digital data across all aspects of life brings both promising possibilities and significant hurdles. This data drives diverse knowledge creation and opens doors for innovation, yet it also poses a growing challenge: managing vast amounts of unstructured data effectively [1]. Information Retrieval (IR) systems are crucial in addressing data management challenges by sifting through vast data repositories to deliver relevant information to users. These systems are instrumental in organizing unstructured data, from text documents and multimedia content to web pages and social media posts. However, IR systems often fall short in accurately extracting the most relevant information and consistently providing the knowledge users seek [2]. Early IR systems relied heavily on keyword-based search engines, which could retrieve relevant documents from vast data repositories only when exact phrase matches were found [3]. A major drawback of keyword-based techniques is their inability to grasp the contextual meaning behind user queries, which often leads to irrelevant search results and, in turn, user dissatisfaction [3].
More advanced IR systems go beyond simple keyword matching, aiming to interpret user intent and meaning in a way that approximates human understanding. Despite these improvements, however, IR technology still has limitations. In today's data retrieval landscape, both traditional and semantic systems often struggle to retrieve information that aligns with a query's underlying intent and context. As a result, the choice of model and the design of the algorithm significantly affect a system's ability to interpret the semantics within documents, ultimately determining the precision and relevance of the results delivered to users [4]. Information retrieval models often struggle to accurately interpret user queries and intentions, resulting in an overload of irrelevant or loosely related documents. These inefficiencies in the retrieval process can be particularly frustrating for users. As shown in the study by Yordan Kalmukov [5], Cosine Similarity alone does not achieve strong accuracy in text similarity matching.
In the evolving field of information retrieval, a growing body of research emphasizes the role of semantics in enhancing relevance and accuracy. This distinction underlines the differences between traditional IR models and newer semantic IR models. Semantic approaches, like Word2Vec and Latent Semantic Analysis (LSA), aim to capture the context and meaning behind queries. Despite these advances, semantic models still face limitations in consistently retrieving highly relevant results. The system in [6] achieves an average precision of 73%, which needs to be improved for better retrieval.
This study aims to examine the limits of current IR models and highlight how semantic IR models differ from traditional ones. We do this by critically analyzing both types and comparing them on precision and average precision. This research aims to add real value to the field of information retrieval by offering insights into how these models compare, especially models like cosine similarity, vector space models, and Word2Vec. Ultimately, the findings could help tackle some of the big challenges around accessing and retrieving relevant information in today's data-packed world.
The paper is structured as follows: Section 2 discusses the literature review, while Section 3 presents the research methodology. Section 4 conducts a critical analysis to compare the models. Finally, the conclusions drawn from the findings are presented in the paper's final section, along with a summary of the research to guide future work in this field.
Literature Review:
This section breaks down the different types of semantic and non-semantic models used in information retrieval (IR). New models are constantly being developed to address the weaknesses of traditional methods, making IR a diverse field with models like vector space, probabilistic, and Boolean models. Contextual IR models, which improve relevance by factoring in the query's context, are also gaining popularity. Each model has its strengths and weaknesses suited to specific IR tasks. There is also a wide range of semantic models, each designed for a particular purpose. Some of the newer approaches include knowledge graph embeddings, distributional semantic models, and ontology-based methods. Word embeddings, a type of distributional semantic model, can learn semantic relationships from large text datasets.
Many of these semantic and non-semantic models are easy to integrate into any IR system, helping to boost the accuracy and relevance of search results. The following are some of the key models that have been proposed to enhance IR systems.

Figure 1 Literature Review


 NeuIR:
Neural networks are shaking things up in the field of information retrieval (IR). More and more Neural Information Retrieval (NeuIR) models are being adopted, transforming the way modern retrieval strategies work [7]. NeuIR is shaping the future of how we search for and use information.
 Ranking Models:
The era of simple keyword matching is behind us. Re-ranking models that understand the relationships and deeper meanings between words are now leading to big jumps in retrieval accuracy. Take the MS MARCO passage ranking competition, for example: methods powered by LLM-based re-ranking models like BERT, RoBERTa, and ELECTRA are dominating the field [7].
 BERT and Transformers:
Recently, several Transformer-based models have been designed to connect semantic data across both visual and textual domains [8]. These models excel at capturing relationships within the context of text, allowing systems to truly grasp the meaning behind queries and documents. The result? More relevant and meaningful search outcomes.
 Learning to Rank:
The science of learning to rank (LTR) search results has become so advanced it's almost a black art. LTR systems use data from previous user actions during a search session to fine-tune and boost the performance of search queries [9]. These models learn from user interactions and feedback, constantly improving how they present results. The goal is to ensure the most relevant information appears right at the top.
 Interactivity and User-Centric Retrieval:
Recent research focuses on understanding user behavior and preferences to create more interactive, user-focused retrieval systems. Behavioral signals, like clicks or time spent on a page, act as implicit feedback, allowing search engines to incorporate behind-the-scenes user data to improve their performance [9]. Personalization and context-awareness are now at the heart of search technology. These systems adapt to individual needs, delivering tailored search experiences that feel more intuitive and relevant.
 Efficient and Scalable Retrieval:
As the digital world keeps growing at an incredible pace, managing information overload is becoming a bigger challenge. Semantic technology tackles this by creating scalable, efficient systems for finding and organizing data. These systems help improve scalability, especially in areas that aren't well represented in the training data [8]. Thanks to better index structures and smarter algorithms, they can handle massive amounts of data while keeping search speeds fast.
 Explainable AI:
When it comes to using AI in information retrieval (IR), building transparency and trust is crucial. Researchers suggest that trust grows when algorithms and their outputs, both digital and physical, are accurate and reliable. It also helps if machine learning focuses on cause-and-effect relationships rather than relying on surface-level correlations [10].
 Reinforcement Learning:
Reinforcement learning is quickly becoming a game-changer for improving information retrieval. These techniques let systems learn and adapt by analyzing user interactions and feedback, fine-tuning ranking algorithms to deliver better search results over time. Behavioral signals also act as subtle feedback tools, using hidden patterns in user behavior to enhance various parts of search engines without users even noticing [9].
 Context-Aware Retrieval:
Grasping the context behind a search is essential for providing truly relevant results. Factors like the user's location, device, and past interactions all play a role in shaping a hyper-personalized search experience. This kind of adaptation helps users feel more confident that the search engine "gets" what they're looking for [11].
 Ethical Considerations:
As semantic technologies reshape information retrieval, it's crucial to keep ethics front and center. Digital ethics, in simple terms, means the moral principles and value systems that guide how we interact in the online world [10].
The ambition is for computers to fully understand what users are searching for and to provide answers that are ethical, personalized, and relevant to the context. This marks a new era for information retrieval, placing it at the cutting edge of the future. As the field keeps evolving, finding exactly what you need could soon be as effortless as thinking about it.
Integrating Semantics in Information Retrieval:
Traditional information retrieval has mainly relied on keyword matching, but, let's be honest, it often leads to frustrating, irrelevant search results. That's where semantic integration steps in. By adding semantic models to the process, it bridges the gap between what you're looking for and how it's understood. Unlike old-school keyword methods, semantic approaches focus on uncovering the meaning and connections within text. Key tools like ontology-based representations, semantic annotation, and semantic indexing play a big role in making this happen. By factoring in these elements, search systems can dive deeper into queries and documents for more accurate results.
 Word relationships:
Semantic models go beyond just matching words: they capture relationships like synonyms, antonyms, and hypernyms/hyponyms. This helps systems understand the true meaning and intent behind a query, even when the exact keywords aren't in the documents. A step up from this is Relation Extraction (RE), which focuses on uncovering connections between words. Unlike Terminology Extraction or Topic Modeling, which mostly deal with fixed relationships (like synonymy or relatedness), RE dives deeper, identifying a broader range of links between entities, such as "born-in," "married-to," or "interacts-with." It's like giving systems a richer toolkit for understanding how concepts connect in the real world [12].
 Entity recognition and linking:
To help systems grasp real-world entities and their relationships, semantic models can identify and connect entities like people, places, and organizations in both queries and documents. This process involves tasks like Entity Recognition, Entity Disambiguation, and Entity Linking, all of which work within the Semantic Web framework to make these connections smarter and more accurate [12].
 Ontology-based reasoning:
Thanks to ontologies, which provide organized knowledge about specific topics or domains, semantic models can uncover hidden connections between ideas. This means computers can find relevant information even if a query isn't clear. By reasoning through semantic links between concepts and entities, these systems not only boost search accuracy but also contribute to building new knowledge. Take BIGOWL4DQ as an example: it expands on BIGOWL by adding features focused on data quality, improving reasoning capabilities and helping integrate data quality tasks directly into Big Data processes, making everything more efficient and accurate [7].
 Semantic Annotation:
Documents come with metadata, like tags, that help convey their meaning. In large-scale systems, annotated datasets are often used to create gold standards for evaluation. For example, this data can train machine learning tools to predict sentence structure, extract key arguments from case texts, or even build a summarization system that condenses the original text. These annotated datasets play a crucial role in improving the accuracy and functionality of machine learning models [13].
 Semantic Indexing:
Latent Semantic Indexing (LSI) is a popular method for uncovering similarities in collections of unstructured data. It works by applying a mathematical technique called Singular Value Decomposition (SVD), which helps identify patterns and relationships within the data, even when they aren't obvious on the surface [14]. This greatly improves the accuracy of matching between the user's search and the documents, even when the match is not on specific phrases. A major quality of semantic integration is its ability to deliver a range of key advantages.
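The SVD step behind LSI can be illustrated with a small sketch. This is not the study's implementation; the toy term-document matrix, the use of NumPy, and the rank k = 2 are all illustrative assumptions:

```python
import numpy as np

# Toy term-document count matrix (terms x documents); illustrative only.
# Rows stand for the terms ["virus", "vaccine", "symptom", "market"].
A = np.array([
    [2, 1, 0],
    [1, 2, 0],
    [1, 1, 0],
    [0, 0, 3],
], dtype=float)

# Truncated SVD: keep k latent dimensions (the core of LSI).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the latent space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents 0 and 1 share latent structure; document 2 does not.
sim_01 = cosine(doc_vecs[0], doc_vecs[1])
sim_02 = cosine(doc_vecs[0], doc_vecs[2])
print(round(sim_01, 3), round(sim_02, 3))
```

In the latent space the two virus-related documents end up nearly identical, while the unrelated "market" document stays orthogonal, even though no exact phrase is shared.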
Analysis of Semantic Integration Approaches:
There are plenty of ways to add semantic knowledge to improve how information retrieval systems work. Let's break down the pros and cons of these approaches. On the one hand, incorporating semantic knowledge can boost accuracy and user experience. For instance, search engines become smarter at understanding what users mean, even when they phrase things differently. Think about how a search engine can figure out that "books," "novels," and "romance novels" might all lead to the same types of results, depending on the context. This makes searches more relevant and user-friendly. But it's not all smooth sailing: in certain applications, using semantic models can introduce challenges, like added complexity or inefficiencies. A lot of research has gone into figuring out the best ways to weave semantic knowledge into these systems, with plenty of methods already outlined in the literature. Let's take a closer look at some of those approaches:
 Query expansion:
Query expansion and enhancement models are like giving search engines a boost. They work by adding related terms, synonyms, or even specific entities using semantic models. The result? Better search results. These models improve both precision (finding the right results) and recall (finding more of them). Researchers have put a lot of effort into studying how well query expansion (QE) works for reformulating searches. The goal is simple: make searches smarter and more effective at pulling up the best results. And it seems to be doing the job [15].
 Semantic ranking:
These models rank documents by analyzing how similar their meanings are to the search query: not just by matching keywords, but by looking at the deeper relationships and context behind them. It's like finding connections that go beyond the surface. To take it up a notch, semantic models for re-ranking can use more advanced and complex architectures to reach even higher levels of precision when delivering results. It's a smarter way to make sure users get exactly what they're looking for [16].
 Conversational search:
Semantic models are a game-changer for conversational interfaces. They allow users to interact with information retrieval systems in a way that feels natural and intuitive, like having a real conversation instead of typing rigid search terms. For task-oriented systems (think virtual assistants or customer service chatbots), semantic models make the magic happen. They help these systems understand the context, intent, and nuances of what users are asking, making them more effective and user-friendly. This work lays the foundation for smoother, smarter conversations [17].
 Rule-Based Systems:
By relying on predefined semantic rules or ontologies, these systems create a clear framework for integrating semantic knowledge. A good example comes from a case study using Stardog 6.014 [18], a tool designed to handle semantic inference. Stardog works by applying preset rules to evaluate things like energy performance. It pulls data from three main sources: OWL ontologies, RDF instances in the ABox, and SWRL inference rules in the TBox. What makes it unique is its "lazy" reasoning approach: it doesn't process everything upfront. Instead, it performs reasoning at query time, offering flexibility and efficiency when responding to queries.
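The query-expansion approach described above can be sketched in a few lines. This is an illustrative toy, not the paper's system; the synonym table is a made-up stand-in for a real thesaurus or ontology:

```python
# Minimal sketch of query expansion: each query term is looked up in a
# hand-made synonym table and the related terms are appended to the query.
SYNONYMS = {  # hypothetical resource, not from the paper
    "books": ["novels", "literature"],
    "covid": ["coronavirus", "sars-cov-2"],
}

def expand_query(query: str) -> list[str]:
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("covid books"))
# ['covid', 'coronavirus', 'sars-cov-2', 'books', 'novels', 'literature']
```

A real system would draw the related terms from an ontology or embedding model rather than a static dictionary, but the retrieval engine downstream sees the same thing: a longer, semantically richer query.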
 Performance Comparison:
When evaluating retrieval models, researchers often look at metrics like precision, recall, and user satisfaction. These assessments are typically carried out across various benchmarks and datasets to get a comprehensive picture of performance. The study [16] explores the current state of first-stage retrieval models and offers a rigorous comparison of different approaches. It examines early semantic retrieval techniques, which focus on basic semantic understanding, alongside neural semantic retrieval methods that use advanced AI and deep learning for smarter matching. It also considers conventional term-based retrieval techniques, which rely on straightforward keyword matching. This analysis sheds light on how these methods perform in terms of accuracy and user experience.
Challenges:
Using different models in information retrieval systems can lead to great results, but there are still some common challenges that need attention. These challenges are both conceptual, like understanding the meaning and context behind a query, and technical, involving how systems process and retrieve information effectively. To tackle these issues, ongoing research and innovation are essential. Overcoming these hurdles is key to unlocking the full potential of information retrieval systems and making them even more useful for users.
1. Data Hurdles:
Gathering and managing massive amounts of high-quality, domain-specific data to train strong models isn't easy. It's a tricky process that demands time, resources, and expertise.
Even one of the most well-known models, Latent Semantic Analysis (LSA), isn't without its flaws. While widely used and effective in many cases, LSA has faced criticism. For one, it overlooks the importance of word transitions, which are crucial for understanding how terms connect in context. It's also been flagged for breaking certain rules about connective power (the strength of word relationships) and for lacking an incremental learning mechanism, meaning it can't adapt or improve dynamically as new data comes in. These limitations highlight the need for continued improvement in semantic modeling [19].
2. Computational Worth:
Complex models often demand a lot of computing power, especially when they're working with massive datasets. This can make them tough to use in situations where speed is critical. Take semantic reasoning as an example: its high processing cost means it's not ideal for real-time monitoring in large-scale applications. Imagine trying to use it for something like early epidemic detection, where every second counts, and you can see why it might fall short. These kinds of limitations make it clear that finding the right balance between model complexity and practicality is essential [20]. Scalability and practical implementation need efficient algorithms and infrastructure.
3. Model Complexity:
Understanding how complex models make retrieval decisions isn't always easy, thanks to their intricate internal workings. Plus, any biases or flaws in the training data can creep into the models, leading to unfair or discriminatory outcomes. Models like COIL are a good example: they're more advanced but also more complex and costly to run than simpler matching retrievers [16]. To make these systems more user-friendly, it's essential to keep them as clear and straightforward as possible.
Conceptual Challenges:
 Ambiguity:
Resolving ambiguity in user queries and content remains a tricky challenge. Traditional search methods often add to the confusion, struggling to refine the search field effectively. Even with modern BERT-based models that aim to better understand queries, many issues still need to be addressed to fully eliminate ambiguity. There's progress, but there's still work to do [21].
 Scaling up semantics:
Making sure semantic integration can scale effectively is essential as data keeps expanding. The process of combining data from different databases is called heterogeneous database integration, and it's no easy task. There are three major challenges when integrating databases within the same domain: structural heterogeneity, syntactic heterogeneity, and semantic heterogeneity [22]. These issues make solving the heterogeneity problem tricky. Tackling it requires studying distributed systems and fine-tuning algorithms to handle the complexity.
 Ontology integration:
Bringing together different semantic models and standards is a big deal for researchers, developers, and users alike. Right now, there's no systematic way to integrate domain ontologies from different sources, which makes life harder for developers and users [23]. Shared frameworks and ontologies will be key to enabling smooth interaction across systems.
 Evaluation Success:
To truly measure the effectiveness of semantic integration, we need evaluation criteria that go beyond traditional metrics like recall and precision. This is a significant challenge for researchers, as it involves estimating semantic similarity between text data, often relying on unclear rule-based methods [24].
Methodology:
Developing Information Retrieval (IR) systems powered by semantic models is a major step toward transforming how we access and retrieve information. This chapter explores the design and creation of such systems, focusing on integrating semantic models to improve retrieval accuracy and effectiveness. Based on previous research, the primary aim is to address the ongoing challenges in traditional IR systems. Although many studies have tried to tackle the issues in semantic information retrieval, the persistent limitations and weaknesses of current systems are well documented. While researchers have made valuable progress, the ultimate goal of delivering consistently satisfying search results remains unsolved.
This research seeks to bridge those gaps by proposing new solutions and methodologies for extracting semantic meaning from documents to enhance information retrieval. The goal is to create reliable systems that can extract and use semantic information effectively. By leveraging advanced semantic algorithms and insights from modern studies, the foundation is laid for substantial improvements in retrieval performance, pushing the field forward with innovative techniques.
High-Level Architecture:
The proposed architecture is built around four main components. The Information Retrieval system starts with simple keyword-based queries and advances to Latent Semantic Query Generation, which involves creating more detailed and relevant queries. Evaluation methods are used to compare the various models, and Result Evaluation procedures determine how useful and relevant the retrieved information is, ensuring the system effectively meets user needs. The components are shown in Figure 2.

Figure 2 Research Architecture
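As a rough illustration of how the four components could fit together, the following sketch wires up hypothetical placeholders for query generation, retrieval, and result evaluation. None of these functions come from the paper; each is a stand-in for the corresponding architectural component:

```python
# Hypothetical end-to-end pipeline mirroring the four components above.
def generate_semantic_query(query: str) -> str:
    # Placeholder: a real system would enrich the query from the ontology.
    return query + " coronavirus"

def retrieve(query: str, docs: list[str]) -> list[str]:
    # Placeholder model: rank documents by terms shared with the query.
    terms = set(query.split())
    return sorted(docs, key=lambda d: len(terms & set(d.split())), reverse=True)

def evaluate(ranked: list[str], relevant: set[str]) -> float:
    # Result Evaluation component: precision over the returned list.
    return sum(d in relevant for d in ranked) / len(ranked)

docs = ["covid symptoms fever", "coronavirus vaccine", "stock market news"]
ranked = retrieve(generate_semantic_query("covid"), docs)
print(evaluate(ranked[:2], {"covid symptoms fever", "coronavirus vaccine"}))
# 1.0
```

The point of the sketch is the data flow: the keyword query is semantically enriched before retrieval, and the ranked output is scored by a separate evaluation step.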


The goal of this study is to review different models used in information retrieval (IR) and evaluate how effective they are. To do this, the system processes basic queries three times, once with each of the following models: the cosine similarity model, the vector space model, and the Word2Vec model. The results from these models are then compared in terms of their precision and effectiveness, helping to highlight their strengths and weaknesses. This comparison sets the stage for further research by exposing the limits of each approach.
Before being used in the IR system, the queries are pre-processed and then run through each model. This allows for a head-to-head comparison of how well the models perform under the same conditions. The evaluation focuses on precision measures, which are key to assessing the quality of the retrieval results.
The study is designed to make the differences between the models' precision scores statistically significant, giving a clearer picture of how they stack up. By carefully conducting iterative experiments and evaluations, this research provides a detailed analysis of each model's effectiveness in retrieving accurate and relevant information.
Semantic Query Process:
A semantic algorithm was designed to handle textual queries by diving deeper into their meanings, relationships, and contexts. This approach aims to enhance the accuracy and relevance of information retrieval. The process was broken down into several sub-phases, each contributing to the overall effectiveness of the algorithm.
Ontology Selection:
The review process for this research is highly meticulous, focusing on identifying ontologies that not only provide essential concepts but are also detailed enough to capture complex semantic relationships. Each ontology is evaluated for its relevance based on key factors like coverage, expressiveness, and suitability for semantic modeling. These factors are carefully scrutinized to ensure that any selected ontology is capable of meeting the specific goals of the research.
The ultimate aim is to choose one or more ontologies that align closely with the research objectives. For instance, the selected ontology [25] offers a clear and organized framework of classes, properties, and instances. It is specifically designed to structure COVID-19-related knowledge into a coherent system, covering key aspects of the virus such as symptoms, safety precautions, transmission modes, variants, treatments, and vaccines, and ensuring each area is well represented.
By defining classes with attributes and relationships among different COVID-19 elements, this ontology creates a standardized model for data exchange and interoperability. This structured framework is a valuable tool for researchers, medical professionals, policymakers, and others involved in combating the pandemic, making it easier to access, share, and analyze organized data.
The Development of the Semantic Algorithm:
Developing a semantic algorithm starts with a clear understanding of what it needs to achieve. The first step is defining the specific goals and functions: what semantic elements need to be extracted from the text, and how these elements will improve information retrieval. This phase lays the foundation, ensuring the design addresses all the requirements and objectives. It also includes creating methods to encode contextual details, word relationships, and even the implicit meanings hidden within the text.
Once the design phase is complete, the focus shifts to implementing the algorithm. This involves turning the theoretical framework into functional code using appropriate programming languages and libraries. The choice of tools and computing platforms is crucial here; they must align with the algorithm's requirements and provide the necessary capabilities for effective implementation. Selecting the right programming languages and resources is key to ensuring the algorithm can handle the complexities of semantic extraction.
After implementation, the algorithm undergoes rigorous validation and testing. This step ensures it performs as intended, accurately identifying semantic features within the given textual data and doing so efficiently. These tests are critical to fine-tuning the algorithm, making it a reliable and powerful tool for improving information retrieval. Through this iterative process, the algorithm is strengthened to meet the demands of real-world applications.
Algorithm listings (pseudocode not reproduced in this copy):
1: Semantic Algorithm
1.1: Query_Tokenization
1.2: Token_List_Creation
1.3: Semantic Computation
1.4: Latent Semantic Query Generation
3.1: Class_Match
3.2: Instances_Match
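A minimal sketch of the pipeline these listings outline: tokenize the query, match tokens against ontology classes, collect the instances of the matched classes, and emit the enriched query. The tiny ontology dictionary is a made-up stand-in for the COVID-19 ontology [25], and the function names only loosely mirror the listing titles:

```python
# Made-up class -> instances mapping; a real system would load the ontology.
ONTOLOGY = {
    "symptom": ["fever", "cough", "fatigue"],
    "vaccine": ["pfizer", "moderna"],
}

def tokenize(query: str) -> list[str]:
    # Query_Tokenization / Token_List_Creation steps, reduced to a split.
    return query.lower().split()

def match_classes(tokens: list[str]) -> list[str]:
    # Class_Match step: keep tokens that name an ontology class.
    return [t for t in tokens if t in ONTOLOGY]

def generate_latent_semantic_query(query: str) -> str:
    # Instances_Match + Latent Semantic Query Generation: append the
    # instances of every matched class to the original query terms.
    tokens = tokenize(query)
    extra = [inst for cls in match_classes(tokens) for inst in ONTOLOGY[cls]]
    return " ".join(tokens + extra)

print(generate_latent_semantic_query("vaccine symptom"))
# 'vaccine symptom pfizer moderna fever cough fatigue'
```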

Latent Semantic Query Generation:
The semantic program works by using
ontology to enrich a query’s content and
generate new, more detailed queries.
Here’s how it operates: First, it extracts all
instances from the ontology and organizes
them in a way that makes them usable for
search purposes. Once this groundwork is
done, the program compares the terms from the original query with the ontology's classes to determine their semantic relevance.
After identifying matching classes, the system digs deeper to find related instance details within those classes. It then links these new terms and instances back to the original query. By doing this, the program essentially expands and enhances the query, incorporating additional semantic insights that might not have been obvious at first.
The result is a much richer and more context-aware query. This expanded query doesn't just find more results; it finds more relevant and meaningful ones. By refining the relevance of terms and classes based on the ontology, the program boosts both the accuracy and contextual understanding of information retrieval. This ensures that the retrieved information aligns more closely with the query's intended meaning.
IR System:
Phase Two, focused on the research methodology, involves selecting the models that will form the backbone of the information retrieval (IR) framework. This decision is crucial, as it sets the stage for the experimental and evaluation phases that follow.
From the variety of available IR models, two classical ones have been chosen: the vector space model and cosine similarity. These will be paired with a more advanced semantic model, Word2Vec, to create a robust comparative framework.
Cosine similarity, known for its simplicity and effectiveness, calculates the cosine of the angle between document vectors in a high-dimensional space. It's widely used for measuring how similar documents are. The vector space model, on the other hand, approaches retrieval by treating documents and queries as vectors in a multidimensional space and calculating their similarity to find relevant matches. To enhance the system's retrieval performance, a semantic layer is added with the Word2Vec model. Word2Vec is a powerful tool in natural language processing (NLP) because it captures the meanings and relationships between words. This combination of classical and semantic models allows the study to compare how well these techniques perform and to contribute to more accurate and meaningful information retrieval.
Evaluation:
Evaluating information retrieval (IR) systems is a critical step in understanding how effectively they can find relevant documents within a collection. This phase helps identify strengths and weaknesses and ensures the system is fine-tuned for better performance. To do this, testing is carried out using various queries during the development process to assess the system's effectiveness under different conditions.
When it comes to measuring similarity between text documents, several techniques are used, each based on different paradigms and grounded in distinct theoretical approaches. In this evaluation, three popular models (Cosine Similarity, the Vector Space Model (VSM), and Word2Vec) are compared to assess their performance.
To make the comparisons meaningful, two key measures are used: precision and average precision. These metrics are applied to investigate how well the three models perform on specific tasks. This structured comparison not only highlights
making it an ideal choice for this project. how each model handles the problem of
document similarity but also provides deeper insights into their strengths and limitations, making it easier to determine which model works best for particular applications.
Mathematically, precision is calculated as:
Precision = (Number of Relevant Documents Retrieved) / (Total Number of Documents Retrieved)    (4.1)
Their precision scores are averaged to compare the performance of different models in detail. This is done using a straightforward formula: the precision scores for each model are added together, and the total is then divided by the number of models. Calculating this mean precision gives a clear, overall view of how each model performs against the others. It is a simple yet effective way to summarize and compare performance, especially when working with multiple models in an information retrieval system.
Mathematically, average precision is calculated as:
Average Precision = (Total Precision Percentage) / (Total Number of Queries)    (4.2)
The first evaluation uses general queries on each IR system to measure precision for similar documents. This gives an initial sense of how well each model performs with basic queries.
The second evaluation uses latent semantic queries across all three models. This provides more detailed precision scores and allows for a better comparison of their effectiveness.
Result and Discussion:
This is the most crucial part of the analysis. It is where we carefully evaluate how models like Cosine Similarity, Vector Space Model, and Word2Vec perform when tested on both general and latent semantic queries. The results show how efficient these methods are, and where they fall short, by zooming in on precision scores. These range from small percentages for cosine similarity to Word2Vec hitting 100% precision in most cases. What is the big takeaway? This evaluation not only helps us see where these methods shine (and where they do not) but also lays the foundation for future breakthroughs in information retrieval.
Retrieval Model Evaluation:
During the evaluation phase, the selected retrieval models were put to the test using a specific dataset. The goal? To see how well these models could measure the similarity between queries and documents for both general queries and latent semantic queries.
The study used three models (Cosine Similarity, Vector Space Model, and Word2Vec) on a dataset of 100 research papers covering a variety of topics. Each model was tasked with comparing documents to match them with queries that reflected user search intentions. In short, the models had to show how accurately they could find relevant matches based on the queries.
Cosine Similarity Performance Metrics Using General and Latent Semantic Queries:
The retrieval model generates a list of documents for each query, along with a similarity percentage for each one. To figure out how accurate these results are, the documents are evaluated using reliable relevance assessments.
Precision is the key metric here: it measures how well the model performs. In simple terms, precision shows the percentage of relevant documents retrieved out of all the documents in the dataset. The
higher the precision, the better the model is at finding what truly matters.
Graph 4-1 Cosine Similarity
The first model, Cosine Similarity, is used to compare queries. The graph above shows its evaluation results, with two sets of values for each query: one for general queries and one for latent semantic queries. The graph highlights how similar the queries are based on this method.
Overall, the graph demonstrates how useful Cosine Similarity is for measuring semantic relationships in textual data. It clearly illustrates how the similarity scores vary depending on the type of query. These results offer valuable insights: the latent semantic query (LSQ) method does a better job capturing deeper semantic meanings between queries compared to general queries.
Higher similarity scores point to a stronger semantic connection between the queries and the hidden information they contain. On the other hand, lower scores indicate greater dissimilarity, often revealing irrelevant or mismatched information in the dataset.
Vector Space Model Performance Metrics Using General and Latent Semantic Queries:
In the Vector Space Model (VSM), the retrieval process works by calculating a similarity score for every query in the dataset. Each document is assigned a score that reflects how closely it matches the query. This helps measure how relevant a document is to the specific query it is paired with.
To evaluate how effective this model is, precision assessments are used. These evaluations give a clear picture of how well the model retrieves relevant documents for each query. Ultimately, the results provide a real measure of the model's performance and allow for a deeper exploration of its capabilities in information retrieval.
Graph 4-2 Vector Space Model
The graph above shows how the Vector Space Model (VSM) performed on five specific queries (Q1 to Q5), comparing general queries with latent semantic queries. It focuses on the similarity percentage between each query and the dataset, giving us a sense of how well VSM retrieves relevant documents.
What stands out? The evaluation reveals that different queries and query types produce varying levels of precision. In simpler terms, some queries are matched more effectively than others. This tells us that VSM is generally good at finding relevant documents but has room for improvement depending on the type of query.
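Both classical scores discussed so far reduce to the same vector arithmetic: documents and queries become term vectors, and relevance is the cosine of the angle between them. The following is a minimal sketch on toy data, using raw term counts rather than whatever weighting the study's implementation applied.

```python
# Minimal vector-space / cosine-similarity scoring sketch (illustration only).
import math
from collections import Counter

def cosine_similarity(query, document):
    """Treat both texts as term-frequency vectors; return the cosine of the angle."""
    q, d = Counter(query.lower().split()), Counter(document.lower().split())
    dot = sum(q[t] * d[t] for t in set(q) & set(d))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

# Rank two toy documents against a query: a higher score means a closer vector.
docs = ["covid vaccine trial results", "weather forecast for monday"]
scores = [cosine_similarity("vaccine trial", doc) for doc in docs]
print(scores)  # the first document scores higher than the second
```

In a production system the raw counts would normally be replaced by TF-IDF or embedding-based weights, which is exactly the gap the Word2Vec layer described later is meant to close.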
Higher precision scores mean VSM
successfully captures and represents the
meaning behind the queries, retrieving
more relevant documents. On the flip side,
low precision scores suggest there are gaps
in the retrieval process, pointing to
potential issues with data extraction.
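The precision behind these scores is the measure defined in equation 4.1, and the per-model summary is the average from equation 4.2. A small worked example, with hypothetical document IDs and relevance judgments chosen only for illustration:

```python
# Computing the measures of equations 4.1 and 4.2 on toy data.

def precision(retrieved, relevant):
    """Eq. 4.1: relevant documents retrieved / total documents retrieved."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)

def average_precision(per_query_precisions):
    """Eq. 4.2: total precision / total number of queries."""
    return sum(per_query_precisions) / len(per_query_precisions)

# Two toy queries with assumed relevance judgments.
p1 = precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})  # 2 of 4 retrieved are relevant
p2 = precision(["d5", "d6"], {"d5", "d6"})              # all retrieved are relevant
print(average_precision([p1, p2]))  # 0.75
```

Note that this "average precision" is the study's mean of per-query precision scores, not the rank-weighted average precision used in TREC-style evaluation.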
Overall, the graph gives a well-rounded view of VSM's strengths and weaknesses in information retrieval. These results provide valuable insights for improving and fine-tuning the model, helping it better meet users' needs in different query scenarios.
Word2Vec Model Performance Metrics Using General and Latent Semantic Queries:
The Word2Vec model takes a slightly different approach. For each query in the dataset, it generates a list of documents along with relevancy scores. These scores reflect how closely the documents align with the given query, helping determine how effective Word2Vec is at retrieving relevant content.
The primary metric for measuring effectiveness here is precision: how many of the retrieved documents are relevant? This evaluation helps researchers see how well Word2Vec performs in matching documents to the underlying meaning of queries.
What sets Word2Vec apart is its ability to dive deep into semantic relationships. By understanding the nuanced connections between words and texts, Word2Vec captures richer meanings and context. This allows for a more detailed evaluation of how well the model retrieves documents that not only match the query but also align with its deeper semantics.
Graph 4-3 Word2Vec Model
The graph highlights how the Word2Vec model performed with two types of queries: general queries and latent semantic queries. Each query is assigned a percentage, representing how accurately the model retrieves relevant information.
The results? They are impressive. The Word2Vec model consistently delivers outstanding precision, with most scores above 95%. This demonstrates its strong ability to identify relevant information. For latent semantic queries specifically, the model often achieves a perfect precision score of 100%.
This performance shows that Word2Vec excels at capturing the deeper, underlying meanings of queries, making it highly effective at retrieving documents that closely align with the semantic context of the search. This model is a top contender when it comes to accuracy and relevance.
Comparison of General Queries and Latent Semantic Queries:
Below are two tables that summarize the performance of the three retrieval models, Cosine Similarity, Vector Space Model (VSM), and Word2Vec, under two different scenarios: one for general queries and the other for latent semantic queries:
Comparison of General Queries:
Table 4.1 General Queries
For general queries, when cosine similarity is applied, the precision score is relatively low. The issue lies in its methodology, which does not account for semantic relationships between words and relies mostly on term frequency within documents. This leads to less effective semantic matching and, therefore, low accuracy in generic searches.
The Vector Space Model shows better precision than Cosine Similarity for general queries. It represents the relationships between documents and queries more effectively and captures semantic similarity since both are represented as vectors in a high-dimensional space. The results show that this model achieves higher precision scores than Cosine Similarity, as it better understands and retrieves relevant documents.
For general queries, the Word2Vec model achieves higher precision than both the Vector Space Model and Cosine Similarity. Word2Vec uses techniques that capture subtle semantic relationships, enabling it to extract the true meaning of queries more effectively. This results in highly relevant documents being retrieved, with precision consistently scoring higher. The Word2Vec model performs excellently for general queries, with strong document ranking capabilities based on term-document cosine similarity. Although Cosine Similarity underperforms compared to the Vector Space Model, its results highlight the importance of adopting advanced approaches like Word2Vec for more accurate and efficient document retrieval.
Comparison of Latent Semantic Queries:
Table 4.2 Latent Semantic Query
Cosine similarity shows lower precision when applied to latent semantic queries. Its inability to capture complex semantic links limits its effectiveness in retrieving relevant information in these scenarios. This creates a gap between cosine similarity and more advanced models, which deliver higher precision.
The Vector Space Model performs better than cosine similarity for latent semantic queries, offering improved precision. It captures the similarity of latent semantics more effectively by integrating semantic understanding into text representation and query processing. However, while it outperforms cosine similarity, its precision still falls short of what the Word2Vec model achieves.
The Word2Vec model stands out with significantly higher precision compared to both cosine similarity and the Vector Space Model. It excels in understanding the deeper, underlying meaning of latent semantic queries, producing highly accurate results. This model highlights how leveraging semantic meaning can greatly improve information retrieval, making it a strong tool for finding relevant documents.
Average Precision:
Precision alone may sometimes give a misleading picture of a system's performance. To get a clearer comparison, average precision (AP) is used. AP evaluates the system's ability to retrieve relevant documents by averaging precision
scores across all relevant results. This metric provides a more detailed and accurate assessment of retrieval performance. By using average precision, the study gains better insights into how well each system performs under both types of queries. The graph below visually represents these comparison results.
Graph 4.4 Average Precision
The data presented highlights the average precision scores for two types of queries, general and semantic, across three retrieval models: Word2Vec, Vector Space, and Cosine Similarity. Higher percentages in the results reflect better performance, serving as key indicators of how effectively each model retrieves relevant information.
One clear trend emerges from the data: semantic queries consistently achieve higher precision than general, keyword-based searches. This difference is significant and suggests that retrieval techniques perform far better when they account for context and meaning rather than simply matching keywords. Semantic queries allow retrieval models to tap into deeper relationships between words and documents, making it possible to explore the true intent or context behind the search. By doing so, these models can pinpoint and retrieve documents that align closely with the semantic elements of the query, resulting in more accurate results.
On the other hand, generic queries often fall short in precision because they are limited to basic keyword matching. This inability to grasp the nuances and complexity of the desired information can lead to less effective document retrieval. This finding underscores how essential it is to consider semantic context during information retrieval tasks. By leveraging semantic understanding, retrieval systems can deliver more accurate, meaningful, and relevant results, ultimately enhancing the overall effectiveness of the retrieval process. This deeper, context-aware approach transforms search accuracy and highlights the value of moving beyond simple keyword-based methods.
Conclusion:
This research explored the potential of traditional and semantic models to enhance information retrieval effectiveness. Through a comprehensive analysis of existing IR models, it has become evident that traditional approaches often struggle to capture the underlying meaning of queries and documents, leading to irrelevant retrieval results. By differentiating between semantic and traditional IR approaches and conducting a comparison, this study highlights the benefits of using semantics in retrieval systems. The study followed a systematic methodology consisting of different phases to improve retrieval effectiveness by merging semantics into the retrieval process. The evaluation phase consists of two types of evaluation and a discussion of the results. The results show that even the traditional models perform better when using ontology-based Latent Semantic Queries. However, semantic models such as Word2Vec yield a significant improvement in retrieval precision, reaching 99% average precision. The comparison of non-semantic and semantic retrieval methods has demonstrated the superiority of semantic models, with Word2Vec performing better than other techniques in capturing semantic relationships. Moreover, the use of COVID-19 ontology-enabled semantic processing has resulted in a very high
improvement in precision during the retrieval of information, thus reiterating the need for semantics in information retrieval. This research opens the way for the development of more efficient and contextually relevant IR solutions, with implications extending to various domains where information retrieval plays an important role.
References:
[1] J. Grossman and A. Pedahzur, "Political Science and Big Data: Structured Data, Unstructured Data, and How to Use Them," Polit Sci Q, vol. 135, no. 2, pp. 225–257, Jun. 2020, doi: 10.1002/polq.13032.
[2] C. Shah and E. M. Bender, "Envisioning Information Access Systems: What Makes for Good Tools and a Healthy Web?," 2024.
[3] S. Shaukat, A. Shaukat, K. Shahzad, and A. Daud, "Using TREC for developing semantic information retrieval benchmark for Urdu," Inf Process Manag, vol. 59, no. 3, May 2022, doi: 10.1016/j.ipm.2022.102939.
[4] M. M. Almustafa, M. Sheikh Oghli, and M. M. Almustafa, "Comparison of basic Information Retrieval Models," 2021. [Online]. Available: https://www.researchgate.net/publication/358509603
[5] Y. Kalmukov, "Comparison of Latent Semantic Analysis and Vector Space Model for Automatic Identification of Competent Reviewers to Evaluate Papers," International Journal of Advanced Computer Science and Applications, vol. 13, no. 2, pp. 77–85, 2022, doi: 10.14569/IJACSA.2022.0130209.
[6] S. A. Savitri, A. Amalia, and M. A. Budiman, "A relevant document search system model using word2vec approaches," in Journal of Physics: Conference Series, IOP Publishing Ltd, Jun. 2021, doi: 10.1088/1742-6596/1898/1/012008.
[7] C. Barba-González, I. Caballero, Á. J. Varela-Vaca, J. A. Cruz-Lemus, M. T. Gómez-López, and I. Navas-Delgado, "BIGOWL4DQ: Ontology-driven approach for Big Data quality meta-modeling, selection and reasoning," Inf Softw Technol, vol. 167, Mar. 2024, doi: 10.1016/j.infsof.2023.107378.
[8] R. Mao et al., "A Survey on Semantic Processing Techniques," Oct. 2023. [Online]. Available: http://arxiv.org/abs/2310.18345
[9] S. Muhammad and S. Aloteibi, "A user-centered approach to information retrieval."
[10] M. Ashok, R. Madan, A. Joha, and U. Sivarajah, "Ethical Framework for Artificial Intelligence and Digital technologies," International Journal of Information Management, vol. 62, Elsevier Ltd, Feb. 01, 2022, doi: 10.1016/j.ijinfomgt.2021.102433.
[11] H. Zamani, S. Dumais, N. Craswell, P. Bennett, and G. Lueck, "Generating Clarifying Questions for Information Retrieval," in The Web Conference 2020 - Proceedings of the World Wide Web Conference, WWW 2020, Association for Computing Machinery, Inc, Apr. 2020, pp. 418–428, doi: 10.1145/3366423.3380126.
[12] A. Hotho, D. Ba Nguyen, M. Cheatham, J. L. Martinez-Rodriguez, A. Hogan, and I. Lopez-Arevalo, "Information Extraction meets the Semantic Web: A Survey," 2016. [Online]. Available: http://prefix.cc.
[13] A. Loreggia, S. Mosco, and A. Zerbinati, "SenTag: A Web-Based Tool for Semantic Annotation of Textual Documents," 2022. [Online]. Available: www.aaai.org
[14] P. K. Sadineni, "Comparative study on query processing and indexing techniques in big data," in Proceedings of the 3rd International Conference on Intelligent Sustainable Systems, ICISS 2020, Institute of Electrical and Electronics Engineers Inc., Dec. 2020, pp. 933–939, doi: 10.1109/ICISS49785.2020.9315935.
[15] M. A. Khedr, F. A. El-Licy, and A. Salah, "Ontology-based Semantic Query Expansion for Searching Queries in Programming Domain," International Journal of Advanced Computer Science and Applications, vol. 12, no. 8, pp. 449–455, 2021, doi: 10.14569/IJACSA.2021.0120852.
[16] J. Guo, Y. Cai, Y. Fan, F. Sun, R. Zhang, and X. Cheng, "Semantic Models for the First-Stage Retrieval: A Comprehensive Review," ACM Trans Inf Syst, vol. 40, no. 4, Oct. 2022, doi: 10.1145/3486250.
[17] A. Aghajanyan et al., "Conversational Semantic Parsing," Sep. 2020. [Online]. Available: http://arxiv.org/abs/2009.13655
[18] S. Hu, J. Wang, C. Hoare, Y. Li, P. Pauwels, and J. O'Donnell, "Building energy performance assessment using linked data and cross-domain semantic reasoning," Autom Constr, vol. 124, Apr. 2021, doi: 10.1016/j.autcon.2021.103580.
[19] A. A. Kumar, "Semantic memory: A review of methods, models, and current challenges," Psychonomic Bulletin and Review, vol. 28, no. 1, Springer, pp. 40–80, Feb. 01, 2021, doi: 10.3758/s13423-020-01792-x.
[20] R. Zgheib et al., "A scalable semantic framework for IoT healthcare applications," J Ambient Intell Humaniz Comput, p. 10, 2020, doi: 10.1007/s12652-020-02136-2.
[21] D. Ortega, "Institute for Natural Language Processing (IMS) Pfaffenwaldring 5B 70569 Stuttgart."
[22] M. Asfand-E-Yar and R. Ali, "Semantic integration of heterogeneous databases of the same domain using an ontology," IEEE Access, vol. 8, pp. 77903–77919, 2020, doi: 10.1109/ACCESS.2020.2988685.
[23] Y. He et al., "DeepOnto: A Python Package for Ontology Engineering with Deep Learning," Jul. 2023. [Online]. Available: http://arxiv.org/abs/2307.03067
[24] D. Chandrasekaran and V. Mago, "Evolution of Semantic Similarity -- A Survey," Apr. 2020, doi: 10.1145/3440755.
[25] A. Rajasinghe, "AnuttaraR/Covid19_Ontology," GitHub. [Online]. Available: https://github.com/AnuttaraR/Covid19_Ontology/tree/main [Accessed: Feb. 19, 2024]