L04: The Vector-Space Model

Overview: This lecture discusses the Vector-Space Model, which represents documents as vectors in a multi-dimensional space based on distinct index terms. It explains how to calculate the similarity between documents and queries using measures such as the inner product, cosine similarity, the Jaccard coefficient, and the Dice coefficient, and it covers operations on vectors and relevance feedback for improving query results based on user feedback.

The Vector-Space Model

• T distinct terms are available; call them index terms or the vocabulary
• The index terms represent important terms for an application
– What might be the index terms for a computer science library?

[Figure: a computer science collection and its index terms (architecture, bus, computer, database, …, xml), also called the vocabulary or the vector space of the collection]
For now: we only consider single terms (no phrases)
Representing a Vector Space

• Previously, we saw that a document can be represented by a list of keywords:
  – D1 = 〈 text 1.0; retrieval 1.0; database 0.5; computer 0.8; information 0.2 〉
  or
  – D2 = 〈 database 1.0; architecture 0.8; retrieval 0.8; management 0.2 〉
• A collection of N documents can be represented as a document-term matrix
– A document is a term vector
– An entry in the matrix corresponds to the “weight” of a term in the document; zero
means the term has no significance or simply doesn't exist in the document

        T1    T2    …    Tt
  D1    d11   d12   …    d1t
  D2    d21   d22   …    d2t
  :     :     :          :
  Dn    dn1   dn2   …    dnt

  e.g.  D1 = 1.0  0.5  0.0  0.8  1.0  0.0  0.2
        D2 = 0.8  1.0  0.8  0.0  0.0  0.2  0.0
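The sketch below (Python; the alphabetical term order is an assumption for illustration, while the term names and weights come from the D1 and D2 examples above) shows one way such a matrix can be built from per-document weight lists:

```python
# A minimal sketch: build a document-term matrix from per-document
# term-weight dictionaries. The alphabetical term order is assumed.
vocabulary = ["architecture", "computer", "database", "information",
              "management", "retrieval", "text"]

docs = {
    "D1": {"text": 1.0, "retrieval": 1.0, "database": 0.5,
           "computer": 0.8, "information": 0.2},
    "D2": {"database": 1.0, "architecture": 0.8, "retrieval": 0.8,
           "management": 0.2},
}

# Each row is one document vector; a zero entry means the term has no
# significance in, or is absent from, that document.
matrix = {name: [weights.get(t, 0.0) for t in vocabulary]
          for name, weights in docs.items()}

for name, row in matrix.items():
    print(name, row)
```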
The Vector-Space Model
• A vocabulary of 2 terms forms a 2D space; a document may contain 0, 1 or 2 of the terms
  – di = 〈 0, 0 〉 (contains none of the index terms)
  – dj = 〈 0, 0.7 〉 (contains one of the two index terms)
  – dk = 〈 1, 2 〉 (contains both index terms)
• Likewise, a vocabulary of 3 terms forms a 3D space
• A vocabulary of n terms forms an n-dimensional space
• A document or query can be represented as a linear combination of the term vectors

[Figure: di, dj and dk plotted on axes t1 and t2]

Question: why do we bother about the empty document?
Graphic Representation

Example:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2 and Q plotted in the 3D space spanned by T1, T2 and T3]

• Is D1 or D2 more similar to Q?
• How do we measure the degree of similarity? Distance? Angle? Projection?
Similarity Measure
• A similarity measure is a function which computes the degree of
similarity between a pair of vectors
– since queries and documents are both vectors, similarity can be computed
between two documents, two queries, or a document and a query

• There are a large number of similarity measures proposed in the literature, because the best similarity measure doesn't exist (yet!)

• With a similarity measure between the query and the documents:
  – it is possible to rank the retrieved documents in order of presumed importance
  – it is possible to enforce a threshold so that the size of the retrieved set can be controlled
  – the results can be used to reformulate the original query in relevance feedback (e.g., combining a document vector with the query vector)
Similarity Measure: Inner Product

sim(D_i, Q) = \sum_{k=1}^{t} d_{ik} q_k

Inner Product -- Examples

Binary:
  D = 〈 1, 1, 1, 0, 1, 1, 0 〉
  Q = 〈 1, 0, 1, 0, 0, 1, 1 〉
  sim(D, Q) = 3
• Size of vector = size of vocabulary = 7
• 0 means the corresponding term is not found in the document or query

Weighted:
  D1 = 2T1 + 3T2 + 5T3        D2 = 3T1 + 7T2 + T3
  Q = 0T1 + 0T2 + 2T3
  sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
  sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
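A minimal Python sketch of the inner product, reproducing both examples above (the function name is illustrative):

```python
# Inner product similarity: the sum of products of corresponding weights.
# Works for both binary and weighted term vectors.
def inner_product(d, q):
    return sum(dk * qk for dk, qk in zip(d, q))

# Binary example (7-term vocabulary):
print(inner_product([1, 1, 1, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1, 1]))  # 3

# Weighted example: D1 = 2T1+3T2+5T3, D2 = 3T1+7T2+T3, Q = 2T3
print(inner_product([2, 3, 5], [0, 0, 2]))   # 10
print(inner_product([3, 7, 1], [0, 0, 2]))   # 2
```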
Properties of Inner Product

• The inner product similarity is unbounded
• It favors long documents
  – a long document has a large number of unique terms, each of which may occur many times
  – it measures how many terms are matched, but not how many terms are not matched
Cosine Similarity Measure

• Cosine similarity measures the cosine of the angle between two vectors
• It is the inner product normalized by the vector lengths:

CosSim(D_i, Q) = \frac{\sum_{k=1}^{t} d_{ik} q_k}{\sqrt{\sum_{k=1}^{t} d_{ik}^2} \sqrt{\sum_{k=1}^{t} q_k^2}}

where the two square roots in the denominator are the document length and the query length.

[Figure: D1, D2 and Q in term space; θ1 is the angle between D1 and Q, and θ2 the angle between D2 and Q]

  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q = 0T1 + 0T2 + 2T3

  CosSim(D1, Q) = 5*2 / (√38 √4) = 0.81
  CosSim(D2, Q) = 1*2 / (√59 √4) = 0.13

D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
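A small Python sketch of the computation, checked against the example above:

```python
import math

# Cosine similarity: inner product divided by the product of the
# Euclidean lengths of the two vectors.
def cos_sim(d, q):
    dot = sum(dk * qk for dk, qk in zip(d, q))
    return dot / (math.sqrt(sum(x * x for x in d)) *
                  math.sqrt(sum(x * x for x in q)))

print(round(cos_sim([2, 3, 5], [0, 0, 2]), 2))   # 0.81
print(round(cos_sim([3, 7, 1], [0, 0, 2]), 2))   # 0.13
```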
Jaccard Coefficient

• Introduced by Paul Jaccard, Professor of Botany and Plant Physiology, in 1901

Sim(D_i, Q) = \frac{\sum_{k=1}^{t} d_{ik} q_k}{\sum_{k=1}^{t} d_{ik}^2 + \sum_{k=1}^{t} q_k^2 - \sum_{k=1}^{t} d_{ik} q_k}

  D1 = 2T1 + 3T2 + 5T3        Sim(D1, Q) = 10 / (38 + 4 - 10) = 10/32 = 0.31
  D2 = 3T1 + 7T2 + T3         Sim(D2, Q) = 2 / (59 + 4 - 2) = 2/61 = 0.03
  Q = 0T1 + 0T2 + 2T3

• D1 is 9.5 times better than D2
• What is the difference between Jaccard and CosSim?
Dice Coefficient

• Ranges from 0 to 1 but does not satisfy the triangle inequality

Sim(D_i, Q) = \frac{2 \sum_{k=1}^{t} d_{ik} q_k}{\sum_{k=1}^{t} d_{ik}^2 + \sum_{k=1}^{t} q_k^2}

  D1 = 2T1 + 3T2 + 5T3        Sim(D1, Q) = 2*10 / (38 + 4) = 20/42 = 0.48
  D2 = 3T1 + 7T2 + T3         Sim(D2, Q) = 2*2 / (59 + 4) = 4/63 = 0.06
  Q = 0T1 + 0T2 + 2T3

• D1 is about 8 times better than D2
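Both coefficients are simple to compute; a Python sketch reproducing the two examples:

```python
# Weighted Jaccard and Dice coefficients, as defined above.
def jaccard(d, q):
    dot = sum(dk * qk for dk, qk in zip(d, q))
    return dot / (sum(x * x for x in d) + sum(x * x for x in q) - dot)

def dice(d, q):
    dot = sum(dk * qk for dk, qk in zip(d, q))
    return 2 * dot / (sum(x * x for x in d) + sum(x * x for x in q))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(jaccard(D1, Q), 2), round(jaccard(D2, Q), 2))   # 0.31 0.03
print(round(dice(D1, Q), 2), round(dice(D2, Q), 2))         # 0.48 0.06
```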
Binary Versions of Similarity Measures

• Inner product
  – Non-binary weights: \sum_{k=1}^{t} d_{ik} q_k
  – Binary weights: |D_i \cap Q|
• Cosine
  – Non-binary weights: \frac{\sum_{k=1}^{t} d_{ik} q_k}{\sqrt{\sum_{k=1}^{t} d_{ik}^2} \sqrt{\sum_{k=1}^{t} q_k^2}}
  – Binary weights: \frac{|D_i \cap Q|}{\sqrt{|D_i|} \sqrt{|Q|}}
• Jaccard
  – Non-binary weights: \frac{\sum_{k=1}^{t} d_{ik} q_k}{\sum_{k=1}^{t} d_{ik}^2 + \sum_{k=1}^{t} q_k^2 - \sum_{k=1}^{t} d_{ik} q_k}
  – Binary weights: \frac{|D_i \cap Q|}{|D_i \cup Q|}

• In the non-binary formulas, D_i and Q are vectors; in the binary formulas, they are sets of keywords, and |D_i| or |Q| is the number of non-zero elements in the set (note: since the weights are binary, the squares in the non-binary formulas can be ignored)
• Jaccard with binary weights is the size of the intersection of the query and document divided by the size of their union
  – When D_i = Q, the Jaccard similarity is 1 (the highest possible)
  – When D_i ⊃ Q, the Jaccard similarity decreases as |D_i| increases (it penalizes long documents)
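With binary weights, documents and queries can be handled as Python sets; a sketch (the keyword sets are illustrative):

```python
import math

# Binary versions of the measures, with documents and queries as sets.
def inner_bin(d, q):
    return len(d & q)

def cos_bin(d, q):
    return len(d & q) / (math.sqrt(len(d)) * math.sqrt(len(q)))

def jaccard_bin(d, q):
    return len(d & q) / len(d | q)

D = {"database", "retrieval", "architecture", "text", "management"}
Q = {"retrieval", "architecture", "information", "management"}
print(inner_bin(D, Q))              # 3
print(round(cos_bin(D, Q), 2))      # 0.67
print(round(jaccard_bin(D, Q), 2))  # 0.5
```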
Similarity between a Query and a Document

• Given a query containing the terms 123, 345, 544, 642 and 850, the corresponding DFs, and the document vector below
• Assume the collection contains 10,000 documents

Query (Term-ID, DF):
  123   50
  345   540
  544   1300
  642   35
  850   250

Document (Term-ID, TF):
  123 5, 145 2, 320 3, 344 1, 390 1, 450 1, 482 1, 544 1, 580 1,
  590 3, 610 1, 630 1, 661 1, 702 2, 758 1, 850 1, 887 1, 950 2

Term-ID   tf   idf                          tf*idf
123       5    log2(10000/50) = 7.64        38.2
345       0    no need to calculate it!     0
544       1    log2(10000/1300) = 2.94      2.94
642       0    no need to calculate it!     0
850       1    log2(10000/250) = 5.32       5.32

Inner(D, Q) = 38.2 + 2.94 + 5.32 = 46.46
How do we obtain the cosine?
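A Python sketch of the same computation (the term IDs and frequencies are those from the example):

```python
import math

N = 10_000                              # collection size
df = {123: 50, 345: 540, 544: 1300, 642: 35, 850: 250}  # query-term DFs
doc_tf = {123: 5, 145: 2, 320: 3, 344: 1, 390: 1, 450: 1, 482: 1,
          544: 1, 580: 1, 590: 3, 610: 1, 630: 1, 661: 1, 702: 2,
          758: 1, 850: 1, 887: 1, 950: 2}

score = 0.0
for term in [123, 345, 544, 642, 850]:
    tf = doc_tf.get(term, 0)
    if tf == 0:
        continue                        # no need to calculate idf
    score += tf * math.log2(N / df[term])

# Prints 46.48; the slide's 46.46 comes from rounding each term first.
print(round(score, 2))
```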
Interesting Things We Can Do in the Vector Space Model

Operations on Vectors
• Query and document vectors are structurally the same (i.e., they are vectors of the same dimension)
  – Query and document vectors can be added, subtracted, etc.
• Adding two vectors means adding the corresponding elements of the two vectors; likewise for subtraction
  – Imagine D1 is about soccer/sport, D2 is about politics, and Q is a two-term query
  – What are the meanings of the following operations (D1, D2 and Q are vectors)?
    • D1 + D2
    • D1 - D2
    • Q + D1
    • Q - D1
• The last two examples illustrate relevance feedback
  – Good/bad terms in the feedback document D1 can be added to or subtracted from the original query
Relevance Feedback in the Vector Space Model

• Recall: relevance feedback tries to modify a query into a "better" one based on documents that users have identified as relevant (or irrelevant)
  – Good/bad terms in a feedback document can be added to/subtracted from the original query

Q = 〈 retrieval 0, database 1, architecture 0, computer 0, text 1, management 0, information 1 〉

Search ⇒

Di = 〈 retrieval 1.0, database 0.5, architecture 1.2, computer 0.0, text 0.8, management 0.9, information 0.0 〉

Suppose Di is marked as relevant:
  Good query terms: database, text
  Good query terms missed: architecture, retrieval, management
  Bad (query) term: information

Q′ = 〈 retrieval 0.8, database 1, architecture 0.8, computer 0, text 1, management 0.8, information 0 〉

• Query term weights in the revised query can be set differently; here, the added query terms are given lower weights than the original query terms
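A minimal Python sketch of this feedback step (the 0.8 weight for newly added terms and the function name are assumptions for illustration):

```python
# Revise a query given a relevant feedback document: terms present in
# the document but missing from the query are added with a lower weight
# (0.8 here, an assumed value), and bad terms are zeroed out.
def feedback(q, d_rel, bad_terms, new_weight=0.8):
    q2 = dict(q)
    for term, w in d_rel.items():
        if w > 0 and q2.get(term, 0) == 0:
            q2[term] = new_weight        # good query term missed
    for term in bad_terms:
        q2[term] = 0                     # bad query term
    return q2

Q = {"retrieval": 0, "database": 1, "architecture": 0, "computer": 0,
     "text": 1, "management": 0, "information": 1}
Di = {"retrieval": 1.0, "database": 0.5, "architecture": 1.2,
      "computer": 0.0, "text": 0.8, "management": 0.9, "information": 0.0}
print(feedback(Q, Di, bad_terms=["information"]))   # matches Q′ above
```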
• Boolean queries are difficult to modify
  – What does it mean to add two Boolean queries together?
    • Does "add" mean AND or OR?
  – Even worse, what does it mean to add a Boolean query and a document (a bag of words) together?
• Suppose QB = database AND text AND information, and
  D1 = 〈 text 1.0; retrieval 1.0; database 0.5; computer 0.8; information 0.2 〉
• How can you improve QB based on the fact that D1 is relevant?
  – database AND text AND architecture AND retrieval?
  – database AND text AND architecture AND retrieval AND NOT information?
  – database AND text AND (information OR architecture OR retrieval)?
  – (database AND architecture) OR (text AND retrieval)?
  – ……
• Revised queries become more complex and remain difficult to modify
Document Clusters and Centroids

• Similarity can be computed:
  – between document and query vectors ⇒ search engine (ranking)
  – between document and document vectors ⇒ document clustering
  – between query and query vectors ⇒ query suggestion, query expansion, etc.
• Comparing document against document:
  D1 = 〈 1, 1, 1, 0, 1, 1, 0 〉
  D2 = 〈 1, 0, 1, 0, 0, 1, 1 〉
  D3 = 〈 0.5, 2, 0, 0, 1, 1, 0.5 〉
• What are the similarities between these documents?
• How close are these documents?
  – total or average similarity over each possible pair of documents (needs n*(n-1)/2 similarity computations), as in the sketch below
• If the documents are close enough, they can form a cluster
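A Python sketch of the pairwise comparison (average cosine similarity over all n*(n-1)/2 pairs):

```python
import itertools
import math

def cos_sim(d, q):
    dot = sum(a * b for a, b in zip(d, q))
    return dot / (math.sqrt(sum(a * a for a in d)) *
                  math.sqrt(sum(b * b for b in q)))

docs = [
    [1, 1, 1, 0, 1, 1, 0],        # D1
    [1, 0, 1, 0, 0, 1, 1],        # D2
    [0.5, 2, 0, 0, 1, 1, 0.5],    # D3
]

# All n*(n-1)/2 document pairs, then the average pairwise similarity.
pairs = list(itertools.combinations(docs, 2))
avg = sum(cos_sim(a, b) for a, b in pairs) / len(pairs)
print(len(pairs), round(avg, 2))   # 3 pairs for n = 3 documents
```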
Document Centroid Example

• Given three documents:
  – di = 〈 0, 0 〉
  – dj = 〈 0, 0.7 〉
  – dk = 〈 1, 2 〉
• The x and y coordinates of the centroid are the averages of the x and y coordinates of the documents:
  C = 〈 0.33, 0.9 〉
• To compute the similarity of the three documents, we can compute the similarity of each document to the centroid instead of all possible document pairs

[Figure: di, dj, dk and the centroid C plotted on axes t1 and t2]
Centroid of Multiple Documents

• Given a set of m document vectors:
  d1 = [ T_{1,1} … T_{1,k} … T_{1,t} ]
  d2 = [ T_{2,1} … T_{2,k} … T_{2,t} ]
  d3 = [ T_{3,1} … T_{3,k} … T_{3,t} ]
  d4 = [ T_{4,1} … T_{4,k} … T_{4,t} ]
• The centroid of the m document vectors is a "virtual" document vector C = (c1, c2, …, ck, …, ct), where

c_k = \frac{\sum_{i=1}^{m} T_{i,k}}{m}

  – The centroid is a convenient and compact representation of a set of documents
  – The closeness of a document set can be measured by the distances of the documents to their centroid
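A one-function Python sketch of the centroid formula, checked against the 2D example above:

```python
# The centroid: average each coordinate over the m document vectors.
def centroid(doc_vectors):
    m = len(doc_vectors)
    return [sum(coords) / m for coords in zip(*doc_vectors)]

di, dj, dk = [0, 0], [0, 0.7], [1, 2]
print(centroid([di, dj, dk]))   # [0.333..., 0.9], i.e., C = <0.33, 0.9>
```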
Some Issues about Similarity Measures

Geometric Interpretation of Similarity Measures
• Cosine similarity: we all know by now that it is the angle between the two vectors (query and document vectors)
• Inner product: what is its geometric interpretation?
• What about Euclidean distance?
Choice of Similarity Measures

• The vector space model can match queries against documents, or documents against documents, but the interpretations and match criteria are different

D = 〈 1, 1, 1, 0, 1, 1, 0 〉
Q = 〈 1, 0, 1, 0, 0, 1, 1 〉
  • Q wants "retrieval", "architecture", "information" and "management"; it doesn't care about other things
  • D meets three out of the four "wants": a good match

D1 = 〈 1, 1, 1, 0, 1, 1, 0 〉
D2 = 〈 1, 0, 1, 0, 0, 1, 1 〉
  • D1 talks about "database", "retrieval", "architecture", "text" and "management", but not about "information"
  • D2 talks about "retrieval", "architecture", "information" and "management", but not about "database" and "text"
  • Are D1 and D2 very similar? Three matches, three mismatches, and one common absence (i.e., "computer")
Similarity Measures for Document Comparison

             Document / Document    Document / Query
  Euclidean  OK                     Bad
  Cosine     OK                     OK
  Inner      Fair                   OK
  Jaccard    Fair+                  OK
Query Term Weights

• The weight of a query term is assumed to be 1 if the term is present in the query, and 0 otherwise
• A weight may be given by the user to each query term
• A natural language query can be regarded as a document
  – The query "MP3 songs sung by Aaron Kwok" may be transformed into:
    〈 MP3, song, sung, Aaron, Kwok 〉
  – The query "I am interested in gold medals or gold investment" may be transformed into:
    〈 gold 2, medal 1, investment 1 〉 after filtering out "I am interested in"
• The similarity of two documents can be computed in the same way

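A Python sketch of turning a natural-language query into a weighted term vector; the stopword list is illustrative, and unlike the example above it does no stemming (so "medals" stays plural):

```python
from collections import Counter

# An illustrative stopword list; a real system would use a fuller list
# plus stemming (e.g., "medals" -> "medal").
STOPWORDS = {"i", "am", "interested", "in", "or", "by"}

def query_to_vector(query):
    terms = [t.lower() for t in query.split()
             if t.lower() not in STOPWORDS]
    return Counter(terms)    # term -> weight (occurrence count)

print(query_to_vector("I am interested in gold medals or gold investment"))
# Counter({'gold': 2, 'medals': 1, 'investment': 1})
```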

Information Abundance on the Web

• Question: is the vector space model more like "OR" or "AND"?
• Given the answer to the above question, do you think the vector space model is good or bad for
  – a professional database (e.g., legal or medical datasets)?
  – web retrieval?

Just type in as many relevant words as you can think of and you will find something useful on the web (e.g., via Google), especially for popular topics (e.g., a presidential election), since there are many sources of information on popular topics.
Term Independence Assumption (I)

• Each term i is identified as Ti
• A document is represented as a set of unordered terms, called the bag-of-words (BOW) representation
• The terms are assumed to be independent (or uncorrelated, or orthogonal) to each other, and they form an orthogonal vector space

[Figure: on the left, x and y are orthogonal, so the projection of y onto x is x cos 90° = 0; on the right, x and y are not orthogonal, and the projection is x cos θ]

• What does this mean for retrieval?
  – D1 = 〈 computer, CPU, Intel 〉
  – D2 = 〈 computer, analysis, price 〉
  – Q = 〈 computer 〉
  – Intuitively, is D1 or D2 a better match? What does the vector space model tell you?
  – Should D1 be represented as 〈 computer 2.5 〉?
• How can we measure term dependence and incorporate it into retrieval?
Term Independence Assumption (II)

• In real life, it is hard to find two terms that are absolutely independent of each other
• Independence can be judged with respect to a document collection

[Figure: "computer", "science" and "business" as important words in a CS collection]

• In reality, these three terms are correlated with each other!
  – When you find "computer" in a document, there is a very high chance that you will also find "science" in it
  – When you find "computer" in a document, there is a medium chance that you will also find "business" in it
  – When you find "business" in a document, there is a small chance that you will also find "science" in it
• We can judge term independence by checking whether or not two terms frequently co-occur in a document (or a sentence, a window of text, etc.)
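A Python sketch of this co-occurrence check over a toy collection (the four documents are invented for illustration); it estimates how often t2 appears in documents that contain t1:

```python
# Toy collection: each document is its set of terms.
docs = [
    {"computer", "science", "architecture"},
    {"computer", "science", "database"},
    {"computer", "business", "investment"},
    {"business", "management"},
]

# P(t2 in document | t1 in document), estimated from the collection.
def cooccurrence(t1, t2, collection):
    with_t1 = [d for d in collection if t1 in d]
    if not with_t1:
        return 0.0
    return sum(t2 in d for d in with_t1) / len(with_t1)

print(round(cooccurrence("computer", "science", docs), 2))   # 0.67: high
print(round(cooccurrence("business", "science", docs), 2))   # 0.0: low
```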
Synonyms

• Two terms may have (more or less) the same meaning, i.e., synonyms
  – D1 = 〈 company, earning, invest 〉
  – Q = 〈 corporation, earning 〉
  – Straight keyword matching will cause a mismatch between "company" and "corporation"

  – D1 = 〈 furniture, table, desk, chair, lamp 〉
  – Q = 〈 furniture 〉
  – Straight keyword matching will find only one matching keyword, so D1 has a small similarity to Q
Unbalanced Property of the Vector Space Model

• If Q = 〈 x, y 〉, then D2 = 〈 x, x 〉 (talks only about x, but a lot of it) and D1 = 〈 x, y 〉 (talks about both x and y) can have the same similarity to Q

Example 1: Inner product
  D2 = 〈 2, 0 〉
  D1 = 〈 1, 1 〉
  Q = 〈 1, 1 〉
  Sim(Q, D1) = Sim(Q, D2)

Example 2: Cosine
  Suppose Q = 〈 1, 0.414 〉
  Sim(Q, D1) = Sim(Q, D2)

[Figure: D1, D2 and Q plotted on axes x and y]
What Problems Does It Cause?

• What are the impacts of the unbalanced property on the following queries?
  – "education university": documents with a high tf of "education" alone or "university" alone, vs. a balance of both words
  – "travel moon space walk"
How to Favor "Balanced" Documents?

• Multi-step filtering
  – first select all documents containing both x and y, and rank them higher
  – then select all documents containing either x or y, and rank them lower
• Compute the similarity as usual, but weight it by the fraction of query terms matched:
  – Sim′(D, Q) = Sim(D, Q) * [ |D ∩ Q| / |D ∪ Q| ]
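A Python sketch of the Sim′ weighting (treating D and Q as the sets of their terms; the base similarity value is a placeholder):

```python
# Scale a base similarity score by the overlap ratio |D ∩ Q| / |D ∪ Q|,
# so documents matching more query terms are favored.
def balanced_sim(base_sim, d_terms, q_terms):
    return base_sim * len(d_terms & q_terms) / len(d_terms | q_terms)

# With Q = {x, y}: the balanced D1 keeps its score, while the one-sided
# D2 (which talks only about x) is halved, even if base similarities tie.
q = {"x", "y"}
print(balanced_sim(2.0, {"x", "y"}, q))   # 2.0
print(balanced_sim(2.0, {"x"}, q))        # 1.0
```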
Combining the Boolean and Vector Space Models

Ranking with Boolean Operators
• The statistical model is more like the "OR" operator
• The Boolean and vector space models can be combined by doing Boolean filtering first, followed by ranking:
  – first evaluate the Boolean query as usual
  – collect all documents satisfying the Boolean query
  – rank the results using the vector space query

[Figure: documents → Boolean filter (Boolean query) → ranking (vector-space query) → results]

• AltaVista Advanced Search separated the Boolean query and the ranking criteria:
  Query: gold AND investment
  Ranking criteria: precious metal
• In most other systems, the vector space query is the same as the Boolean query with the Boolean operators ignored
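A Python sketch of the combined scheme (the document weights, the AND filter and the ranking criteria are illustrative, echoing the AltaVista example):

```python
import math

def cos_sim(d, q):
    dot = sum(d.get(t, 0) * w for t, w in q.items())
    nd = math.sqrt(sum(v * v for v in d.values()))
    nq = math.sqrt(sum(v * v for v in q.values()))
    return dot / (nd * nq) if nd and nq else 0.0

docs = {
    "D1": {"gold": 3, "investment": 1, "metal": 2},
    "D2": {"gold": 1, "medal": 4},
    "D3": {"gold": 2, "investment": 2, "precious": 1},
}
required = {"gold", "investment"}        # Boolean part: gold AND investment
ranking_q = {"precious": 1, "metal": 1}  # vector-space ranking criteria

# Step 1: Boolean filter; step 2: rank the surviving documents.
hits = {name: d for name, d in docs.items() if required <= set(d)}
ranked = sorted(hits, key=lambda n: cos_sim(hits[n], ranking_q),
                reverse=True)
print(ranked)   # D2 fails the filter; D1 and D3 are ranked by the criteria
```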
Summary

• Cosine similarity is independent of document sizes
• The model is based on occurrence frequencies only
• It considers both local and global occurrence frequencies
• Advantages
  – simplicity
  – able to handle weighted terms
  – easy to modify term vectors
• Disadvantages
  – assumption of term independence
  – lacks the control of the Boolean model (e.g., requiring a term to appear in a document); for instance, two documents Di = {x, x} and Dj = {x, y} may yield the same similarity score for Q = {x, y} if x and y have the same weight, which may not be desirable
