IR Unit II

The document discusses two fundamental models in information retrieval: the Vector Space Model (VSM) and the Boolean Model. VSM represents documents and queries as vectors in a multi-dimensional space, utilizing term weighting and similarity measurement for relevance ranking, while the Boolean Model retrieves documents based on exact matches to Boolean queries using operators like AND, OR, and NOT. Both models have their advantages and limitations, with VSM offering flexibility and ranking, and the Boolean Model providing precise control but lacking in ranking capabilities.

Uploaded by

mohamedfarookali

The Vector Space Model (VSM)

The Vector Space Model (VSM) is a fundamental model in information retrieval (IR) that represents
documents and queries as vectors in a multi-dimensional space. It is widely used in search engines and
other IR systems to measure the relevance of documents to a given query.

Put differently, the VSM is a mathematical framework for representing text-based data (e.g., documents
and queries) in a form that computers can process for tasks such as searching and ranking.

Key Concepts

1. Vector Representation:

o Each document and query is represented as a vector in a space where each dimension
corresponds to a term in the vocabulary (e.g., words or tokens).

o For example, a document Di with terms t1,t2,...,tn is represented as:

Di=(Wi1,Wi2,...,Win)

o where Wij is the weight of term tj in document Di.

2. Term Weighting:

o Terms are assigned weights to reflect their importance. Common weighting schemes
include:

 Binary weighting: Wij=1 if the term tj appears in Di, otherwise 0.

 Term Frequency (TF): Wij=tfij, the number of times tj appears in Di.

 TF-IDF (Term Frequency-Inverse Document Frequency): A popular scheme that
considers both term frequency and how rare the term is across all documents.
A common form is:

Wij = tfij × log(N / nj)

where N is the total number of documents and nj is the number of documents
containing tj.
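
The weighting schemes can be illustrated with a minimal Python sketch. It assumes the common TF-IDF variant Wij = tfij × log(N / nj) with a natural logarithm; the three-document toy corpus, the whitespace tokenizer, and the function name `tfidf_vectors` are illustrative assumptions, not part of the original notes.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors), one vector per document."""
    # Assumption: lowercase + whitespace split is enough for a toy corpus.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # n_j: number of documents that contain term t_j
    doc_freq = {t: sum(1 for tokens in tokenized if t in tokens) for t in vocab}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf_ij: raw count of t_j in D_i
        # W_ij = tf_ij * log(N / n_j)
        vectors.append([tf[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, vectors

# Hypothetical toy corpus.
docs = [
    "data science is fun",
    "machine learning and data science",
    "deep learning for science",
]
vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:10s} {weight:.3f}")
```

With this variant, a term that occurs in every document (here "science") gets weight 0, because log(N / nj) = log(1) = 0, while a term unique to one document (here "fun") gets the largest IDF factor.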

3. Similarity Measurement:

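o To rank documents, the similarity between the query vector Q and each document vector
Di is calculated. The most common metric is cosine similarity:

sim(Q, Di) = (Q · Di) / (|Q| × |Di|)

where Q · Di is the sum over all terms tj of Wqj × Wij (Wqj being the weight of tj in the
query), and |Q| and |Di| are the Euclidean lengths of the two vectors.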

4. Dimensionality:

o The dimensionality of the vector space corresponds to the size of the vocabulary
(number of unique terms). High dimensionality can be reduced using techniques like
Latent Semantic Analysis (LSA) or Principal Component Analysis (PCA).
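
To make the ranking step concrete, here is a small Python sketch of cosine-similarity ranking. It is only an illustration under simplifying assumptions: raw term frequencies serve as weights, vectors are stored sparsely as term-to-weight mappings, and the corpus, query, and function names (`cosine`, `rank`) are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    q_vec = Counter(query.lower().split())
    scored = [(cosine(q_vec, Counter(d.lower().split())), i)
              for i, d in enumerate(docs)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical toy corpus and query.
docs = [
    "data science is fun",
    "machine learning and data science",
    "deep learning for science",
]
ranking = rank("deep learning", docs)
for score, i in ranking:
    print(f"D{i + 1}: {score:.3f}")
```

The third document shares both query terms and ranks first, the second shares one term and ranks next, and the first shares none and scores 0. Swapping the raw counts for TF-IDF weights changes the scores but not the structure of the computation.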

Boolean Model

The Boolean Model is one of the simplest and earliest models used in information retrieval (IR). It is
based on set theory and logic, where documents are retrieved based on whether they exactly satisfy a
Boolean query. The model operates on binary decisions—either a document is relevant or it is not.

Key Concepts

1. Representation of Documents and Queries:

o Each document is represented as a set of terms (keywords or tokens).

o Queries are expressed using Boolean operators:

 AND: Retrieves documents containing all specified terms.

 OR: Retrieves documents containing at least one of the specified terms.

 NOT: Excludes documents containing specific terms.

2. Boolean Retrieval:

o The model retrieves documents based on whether they match the query exactly.

o The result is a binary outcome (relevant or not relevant), with no ranking or partial
relevance.

Query Example

Suppose we have the following documents:

 Document 1 (D1): "data science is fun"

 Document 2 (D2): "machine learning and data science"

 Document 3 (D3): "deep learning for science"

Boolean Query:

 Query 1: "data AND science"

Matches: D1, D2
Explanation: Both "data" and "science" appear in D1 and D2.

 Query 2: "data OR machine"

Matches: D1, D2
Explanation: "data" appears in D1 and D2, and "machine" appears in D2.

 Query 3: "data AND NOT machine"

Matches: D1
Explanation: Both D1 and D2 contain "data", but D2 also contains "machine", so it is excluded.
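
The three queries can be reproduced with Python sets, since AND, OR, and AND NOT correspond to set intersection, union, and difference. This is a minimal sketch; the helper name `containing` and the whitespace tokenization are assumptions made for illustration.

```python
docs = {
    "D1": "data science is fun",
    "D2": "machine learning and data science",
    "D3": "deep learning for science",
}
# Each document becomes a set of terms.
term_sets = {doc_id: set(text.split()) for doc_id, text in docs.items()}

def containing(term):
    """Return the set of document IDs whose term set includes `term`."""
    return {doc_id for doc_id, terms in term_sets.items() if term in terms}

query1 = containing("data") & containing("science")   # data AND science
query2 = containing("data") | containing("machine")   # data OR machine
query3 = containing("data") - containing("machine")   # data AND NOT machine

print(sorted(query1), sorted(query2), sorted(query3))
```

Note that the lowercase word "and" inside D2 is an ordinary document term here, not a query operator.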

Advantages

1. Simplicity: Easy to understand and implement.

2. Exact Matching: Useful for applications where precise results are needed (e.g., legal or patent
search).

3. Boolean Logic: Queries can be structured logically to filter results effectively.

The Boolean model is a foundational model in information retrieval (IR) that uses Boolean logic to
retrieve documents that match specific criteria defined by a user's query. It's based on the simple idea
that documents and queries can be represented as sets of terms, and retrieval is based on whether a
document contains the query terms and satisfies the Boolean conditions specified in the query.

Key Concepts:

 Boolean Logic: The model employs Boolean operators like AND, OR, and NOT to combine terms
in a query.

o AND: Retrieves documents that contain all the specified terms.

o OR: Retrieves documents that contain at least one of the specified terms.

o NOT: Excludes documents that contain a specific term.

 Term-Document Matrix: This matrix represents the relationship between terms and documents.
Each row corresponds to a term, and each column corresponds to a document. A cell in the
matrix is 1 if the term appears in the document and 0 otherwise.

 Inverted Index: An inverted index is a data structure that maps each term to a list of documents
containing that term. For large collections it is more efficient than the term-document matrix,
because it stores only the term-document pairs that actually occur, whereas the matrix is mostly zeros.
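
Both structures can be built in a few lines of Python. The sketch below is illustrative only: the toy corpus, the 1-based document IDs, and the function names `build_index` and `incidence_matrix` are assumptions, not part of the original notes.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> sorted list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs, start=1):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def incidence_matrix(docs, index):
    """Build the term-document matrix: one row per term, one column per document."""
    terms = sorted(index)
    rows = [[1 if doc_id in index[t] else 0
             for doc_id in range(1, len(docs) + 1)]
            for t in terms]
    return terms, rows

# Hypothetical toy corpus.
docs = ["the cat sat", "the dog barked", "cat and dog play"]
index = build_index(docs)
terms, matrix = incidence_matrix(docs, index)

print(index["cat"])                  # documents containing "cat"
print(matrix[terms.index("cat")])    # the matrix row for "cat"
```

The matrix needs one cell for every term-document pair, while the index stores an entry only where a term actually occurs, which is why the index scales better as the vocabulary and the collection grow.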

How it Works:

1. Query Formulation: The user expresses their information need as a Boolean query using terms
and operators.

2. Index Lookup: The system uses the inverted index to retrieve the set of documents associated
with each term in the query.

3. Boolean Evaluation: The system applies the Boolean operators to the sets of documents to
determine the final set of documents that match the query.

Example:

Consider the query "cat AND dog". The system would retrieve the set of documents containing the term
"cat" and the set of documents containing the term "dog". Then, it would find the intersection of these
two sets to retrieve the documents that contain both "cat" and "dog".
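
When the document lists for each term are kept sorted by document ID, the intersection can be computed with a single linear merge. The following Python sketch shows one way to do this; the postings lists are made-up values and the function name `intersect` is an assumption.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document appears in both lists: keep it and advance both pointers.
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return result

# Hypothetical postings lists for the two query terms.
postings = {"cat": [1, 3, 4, 7], "dog": [2, 3, 7, 9]}
matches = intersect(postings["cat"], postings["dog"])
print(matches)
```

Each pointer only moves forward, so the merge takes time proportional to the combined length of the two lists.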

Advantages:

 Simple and Understandable: The Boolean model is easy to understand and implement.

 Precise Control: It allows users to precisely control the retrieval process using Boolean operators.

 Efficient for Simple Queries: Evaluating a simple query reduces to a few set operations on the
document lists of its terms, which can be very efficient.

Disadvantages:

 Limited Flexibility: It can be difficult to express complex information needs using only Boolean
operators.

 No Ranking: The Boolean model retrieves documents without ranking, making it challenging to
distinguish between highly relevant and marginally relevant documents.

 Sensitivity to Term Choice: The effectiveness of the model depends heavily on the user's ability
to choose the right terms.

Extensions:

 Phrase Queries: Allow users to search for specific phrases or sequences of words.

 Proximity Queries: Allow users to specify the distance between terms in a document.

 Fuzzy Queries: Allow for variations in spelling or word forms.
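
Phrase and proximity queries are commonly supported by recording word positions in the index. The Python sketch below illustrates the idea for two-term queries; the toy corpus, the 0-based positions, and the function names are assumptions for illustration, and fuzzy matching is not covered.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} using 0-based word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs, start=1):
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def proximity(index, term_a, term_b, k):
    """IDs of documents where the two terms occur within k positions of each other."""
    docs_a, docs_b = index.get(term_a, {}), index.get(term_b, {})
    return [d for d in sorted(docs_a.keys() & docs_b.keys())
            if any(abs(pa - pb) <= k for pa in docs_a[d] for pb in docs_b[d])]

def phrase(index, first, second):
    """IDs of documents where `second` immediately follows `first`."""
    docs_a, docs_b = index.get(first, {}), index.get(second, {})
    return [d for d in sorted(docs_a.keys() & docs_b.keys())
            if any(pb - pa == 1 for pa in docs_a[d] for pb in docs_b[d])]

# Hypothetical toy corpus.
docs = [
    "deep learning for science",
    "learning from deep data",
    "machine learning and data science",
]
index = build_positional_index(docs)
print(phrase(index, "deep", "learning"))
print(proximity(index, "deep", "learning", 2))
```

Only the first document contains the exact phrase "deep learning", while the first two both contain the terms within two positions of each other, so the proximity query is the looser of the two.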

Despite its limitations, the Boolean model remains a valuable tool in IR, especially for tasks that require
precise control over the retrieval process or when dealing with small collections of documents. It also
serves as a foundation for more sophisticated IR models.
