
CAIM: Cerca i Anàlisi d’Informació Massiva
(Search and Analysis of Massive Information)

FIB, Grau en Enginyeria Informàtica
(Degree in Informatics Engineering)

Slides by Marta Arias, José Luis Balcázar,
Ramon Ferrer-i-Cancho, Ricard Gavaldà
Department of Computer Science, UPC

Fall 2018
http://www.cs.upc.edu/~caim

2. Information Retrieval Models
Information Retrieval Models, I
Setting the stage to think about IR

What is an Information Retrieval Model?


We need to clarify:

- A proposal for a logical view of documents
  (what info is stored/indexed about each document?),
- a query language
  (what kinds of queries will be allowed?),
- and a notion of relevance
  (how to handle each document, given a query?).

Information Retrieval Models, II
A couple of IR models

Focus for this course:

- Boolean model:
  - Boolean queries, exact answers;
  - extension: phrase queries.
- Vector model:
  - weights on terms and documents;
  - similarity queries, approximate answers, ranking.

Boolean Model of Information Retrieval
Relevance assumed binary

Documents:
A document is completely identified by the set of terms that it
contains.
- Order of occurrence considered irrelevant,
- number of occurrences considered irrelevant
  (but a closely related model, called bag-of-words or BoW,
  does consider the number of occurrences relevant).

Thus, for a set of terms T = {t_1, ..., t_T}, a document is just a
subset of T.

Each document can be seen as a bit vector of length T,
d = (d_1, ..., d_T), where
- d_i = 1 if and only if t_i appears in d, or, equivalently,
- d_i = 0 if and only if t_i does not appear in d.

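As a quick illustration, a minimal Python sketch of this representation (the vocabulary and helper name are ours, not from the slides; the document is d1 from the toy example a few slides ahead):

    # Encode a document (a set of terms) as a 0/1 vector over the vocabulary.
    vocabulary = ["five", "four", "one", "six", "three", "two"]

    def to_bit_vector(doc_terms, vocabulary):
        return [1 if t in doc_terms else 0 for t in vocabulary]

    d1 = {"one", "three"}
    print(to_bit_vector(d1, vocabulary))  # [0, 0, 1, 0, 1, 0]
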
Queries in the Boolean Model, I
Boolean queries, exact answers

Atomic query:
a single term.
The answer is the set of documents that contain it.

Combining queries:
- OR, AND: operate as union or intersection of answers;
- set difference: t1 BUTNOT t2 ≡ t1 AND NOT t2;
  motivation: avoid unmanageably large answer sets.

In Lucene: +/− signs on query terms, Boolean operators.

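A minimal sketch of these operators as set operations over an inverted index (the index contents and names here are illustrative, consistent with the toy collection shown later):

    # `index` maps each term to the set of documents that contain it.
    index = {
        "one": {"d1", "d3", "d4"},
        "two": {"d2", "d4"},
    }

    def answer(term):
        return index.get(term, set())

    print(sorted(answer("one") & answer("two")))  # AND: intersection -> ['d4']
    print(sorted(answer("one") | answer("two")))  # OR: union -> ['d1', 'd2', 'd3', 'd4']
    print(sorted(answer("one") - answer("two")))  # BUTNOT: difference -> ['d1', 'd3']
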
Queries in the Boolean Model, II
A close relative to propositional logic

Analogy:
- Terms act as propositional variables;
- documents act as propositional models;
- a document is relevant for a term if it contains the term,
  that is, if, as a propositional model, it satisfies the variable;
- queries are propositional formulas
  (with a syntactic condition of avoiding global negation);
- a document is relevant for a query if, as a propositional
  model, it satisfies the propositional formula.

Example, I
A very simple toy case

Consider 7 documents with a vocabulary of 6 terms:

d1 = one three
d2 = two two three
d3 = one three four five five five
d4 = one two two two two three six six
d5 = three four four four six
d6 = three three three six six
d7 = four five

Example, II
Our documents in the Boolean model

       five  four  one  six  three  two

d1 = [   0     0    1    0     1     0 ]
d2 = [   0     0    0    0     1     1 ]
d3 = [   1     1    1    0     1     0 ]
d4 = [   0     0    1    1     1     1 ]
d5 = [   0     1    0    1     1     0 ]
d6 = [   0     0    0    1     1     0 ]
d7 = [   1     1    0    0     0     0 ]

(Invent some queries and compute their answers!)
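
One possible answer to that invitation, as a small Python sketch over the toy collection (the document sets come from the previous slide; the query choices are ours):

    docs = {
        "d1": {"one", "three"},
        "d2": {"two", "three"},
        "d3": {"one", "three", "four", "five"},
        "d4": {"one", "two", "three", "six"},
        "d5": {"three", "four", "six"},
        "d6": {"three", "six"},
        "d7": {"four", "five"},
    }

    def matches(term):
        """Answer set of an atomic query: documents containing the term."""
        return {name for name, terms in docs.items() if term in terms}

    print(sorted(matches("three") & matches("four")))  # three AND four -> ['d3', 'd5']
    print(sorted(matches("three") - matches("six")))   # three BUTNOT six -> ['d1', 'd2', 'd3']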

Queries in the Boolean Model, III
No ranking of answers

Answers are not quantified:

A document either
- matches the query (is fully relevant),
- or does not match the query (is fully irrelevant).

Depending on user needs and application, this feature may be
good or may be bad.

Phrase Queries, I
Slightly beyond the Boolean model

Phrase queries: conjunction plus adjacency

Ability to answer with the set of documents that contain the
terms of the query consecutively.
- A user querying “Keith Richards” may not want a document
  that mentions both Keith Emerson and Emil Richards.
- Requires extending the notion of “basic query” to include
  adjacency.

Phrase Queries, II
Options to “hack them in”

Options:
- Run as a conjunctive query, then double-check the whole
  answer set to filter out non-adjacent cases.
  This option may be very slow when there are many
  “false positives”.
- Keep dedicated information in the index about the adjacency
  of any two terms in a document (e.g. term positions); see the
  sketch below.
- Keep dedicated information in the index about a choice of
  “interesting pairs” of words.

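A minimal sketch of the positional-index option, assuming tokenized documents (all names here are illustrative):

    def positions(tokens):
        """Map each term to the list of positions where it occurs."""
        index = {}
        for i, t in enumerate(tokens):
            index.setdefault(t, []).append(i)
        return index

    def phrase_match(tokens, phrase):
        """True iff the phrase terms occur consecutively in the document."""
        idx = positions(tokens)
        return any(
            all(p + k in idx.get(t, []) for k, t in enumerate(phrase[1:], 1))
            for p in idx.get(phrase[0], [])
        )

    doc = "keith emerson and emil richards".split()
    print(phrase_match(doc, ["keith", "richards"]))  # False: terms not adjacent
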
Vector Space Model of Information Retrieval, I
Basis of all successful approaches

- Order of words is still irrelevant.
- Frequency is relevant.
- Not all words are equally important.
- For a set of terms T = {t_1, ..., t_T}, a document is a vector
  d = (w_1, ..., w_T) of floats instead of bits.
- w_i is the weight of t_i in d.

Vector Space Model of Information Retrieval, II
Moving to vector space

- A document is now a vector in R^T.
- The document collection conceptually becomes a
  terms × documents matrix,
  but we never compute the matrix explicitly.
- Queries may also be seen as vectors in R^T.

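Since the matrix is never materialized, a natural concrete representation keeps, per document, only its nonzero weights. A sketch (names are ours; the numbers anticipate the tf-idf weights defined next):

    # Sparse document vectors: terms absent from a document implicitly weigh 0.
    doc_vectors = {
        "d1": {"one": 1.22, "three": 0.22},
        "d7": {"five": 1.81, "four": 1.22},
    }

    def weight(doc, term):
        return doc_vectors[doc].get(term, 0.0)

    print(weight("d1", "two"))  # 0.0: term absent from d1
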
The tf-idf scheme
A way to assign weight vectors to documents

Two principles:
- The more frequent t is in d, the higher its weight should be.
- The more frequent t is in the whole collection, the less it
  discriminates among documents, so the lower its weight
  should be in all documents.

The tf-idf scheme, II
The formula
A document is a vector of weights

    d = [w_{d,1}, ..., w_{d,i}, ..., w_{d,T}].

Each weight is the product of two factors:

    w_{d,i} = tf_{d,i} · idf_i.

The term frequency factor tf is

    tf_{d,i} = f_{d,i} / max_j f_{d,j},

where f_{d,j} is the frequency of t_j in d.

And the inverse document frequency factor idf is

    idf_i = log2(D / df_i),

where D = number of documents
and df_i = number of documents that contain term t_i.
Example, I

       five  four  one  six  three  two   maxf

d1 = [   0     0    1    0     1     0 ]    1
d2 = [   0     0    0    0     1     2 ]    2
d3 = [   3     1    1    0     1     0 ]    3
d4 = [   0     0    1    2     1     4 ]    4
d5 = [   0     3    0    1     1     0 ]    3
d6 = [   0     0    0    2     3     0 ]    3
d7 = [   1     1    0    0     0     0 ]    1

df =     2     3    3    3     6     2

Example, II

df =     2     3    3    3     6     2
d3 = [   3     1    1    0     1     0 ]   (maxf = 3)

d3 = [ (3/3)·log2(7/2)  (1/3)·log2(7/3)  (1/3)·log2(7/3)  (0/3)·log2(7/3)  (1/3)·log2(7/6)  (0/3)·log2(7/2) ]
   = [ 1.81  0.41  0.41  0  0.07  0 ]

d4 = [   0     0    1    2     1     4 ]   (maxf = 4)

d4 = [ (0/4)·log2(7/2)  (0/4)·log2(7/3)  (1/4)·log2(7/3)  (2/4)·log2(7/3)  (1/4)·log2(7/6)  (4/4)·log2(7/2) ]
   = [ 0  0  0.31  0.61  0.06  1.81 ]

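A short Python check of these numbers under the tf-idf formula of the previous slides (helper names are ours):

    from math import log2

    # Columns: five four one six three two; df from the previous slide.
    df = [2, 3, 3, 3, 6, 2]
    D = 7  # number of documents

    def tf_idf(f):
        """tf-idf weights of a document given its term frequencies f."""
        m = max(f)
        return [round((fi / m) * log2(D / dfi), 2) for fi, dfi in zip(f, df)]

    print(tf_idf([3, 1, 1, 0, 1, 0]))  # d3: [1.81, 0.41, 0.41, 0.0, 0.07, 0.0]
    print(tf_idf([0, 0, 1, 2, 1, 4]))  # d4: [0.0, 0.0, 0.31, 0.61, 0.06, 1.81]
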
Similarity of Documents in the Vector Space Model
The cosine similarity measure

- “Similar vectors” may happen to have very different sizes.
- We had better compare only their directions.
- Equivalently, we normalize them to the same Euclidean length
  before comparing them.

    sim(d1, d2) = (d1 · d2) / (|d1| |d2|) = (d1 / |d1|) · (d2 / |d2|)

where

    v · w = sum_i v_i · w_i,   and   |v| = sqrt(v · v) = sqrt(sum_i v_i^2).

- Our weights are all nonnegative.
- Therefore, all cosines / similarities are between 0 and 1.
Cosine similarity, Example

d3 = [ 1.81  0.41  0.41  0  0.07  0 ]
d4 = [ 0  0  0.31  0.61  0.06  1.81 ]

Then

    |d3| = 1.898,   |d4| = 1.933,   d3 · d4 = 0.13,

and sim(d3, d4) = 0.035 (i.e., small similarity).

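Recomputing this in Python from the rounded weights above (a sketch; the last digit differs slightly because of the rounding):

    from math import sqrt

    d3 = [1.81, 0.41, 0.41, 0, 0.07, 0]
    d4 = [0, 0, 0.31, 0.61, 0.06, 1.81]

    def dot(v, w):
        return sum(vi * wi for vi, wi in zip(v, w))

    def sim(v, w):
        """Cosine similarity of two weight vectors."""
        return dot(v, w) / (sqrt(dot(v, v)) * sqrt(dot(w, w)))

    print(round(sim(d3, d4), 3))  # 0.036 (0.035 with unrounded weights)
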
Query Answering

- Queries can be transformed to vectors too.
- Sometimes, tf-idf weights; often, binary weights.
- sim(doc, query) ∈ [0, 1].
- Answer: list of documents sorted by decreasing similarity.
- We will find uses for comparing sim(d1, d2) too.
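
A final sketch tying this together, assuming binary query weights and the two tf-idf vectors computed earlier (all names are illustrative):

    from math import sqrt

    terms = ["five", "four", "one", "six", "three", "two"]
    docs = {
        "d3": [1.81, 0.41, 0.41, 0, 0.07, 0],
        "d4": [0, 0, 0.31, 0.61, 0.06, 1.81],
    }

    def sim(v, w):
        dot = sum(a * b for a, b in zip(v, w))
        return dot / (sqrt(sum(a * a for a in v)) * sqrt(sum(b * b for b in w)))

    # Binary query vector for the query "one two".
    query = [1 if t in {"one", "two"} else 0 for t in terms]

    ranking = sorted(docs, key=lambda d: sim(docs[d], query), reverse=True)
    print(ranking)  # ['d4', 'd3']: d4 mentions "two" heavily, so it ranks first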
