L04: The Vector-Space Model

Overview: This lecture discusses the Vector-Space Model, which represents documents as vectors in a multi-dimensional space based on distinct index terms. It explains how to calculate the similarity between documents and queries using measures such as the inner product, cosine similarity, the Jaccard coefficient, and the Dice coefficient, and it covers operations on vectors and relevance feedback for improving query results based on user feedback.

The Vector-Space Model

• T distinct terms are available; call them index terms or the vocabulary
• The index terms represent important terms for an application
– What might be the index terms for a computer science library?

[Figure: a computer science collection and its index terms (architecture, bus, computer, database, …, xml), also called the vocabulary or the vector space of the collection]
For now: we only consider single terms (no phrases)
Representing a Vector Space

• Previously, we saw that a document can be represented by a list of keywords:
  – D1 = 〈 text 1.0; retrieval 1.0; database 0.5; computer 0.8; information 0.2 〉
  or
  – D2 = 〈 database 1.0; architecture 0.8; retrieval 0.8; management 0.2 〉
• A collection of N documents can be represented as a document-term matrix
– A document is a term vector
– An entry in the matrix corresponds to the “weight” of a term in the document; zero
means the term has no significance or simply doesn't exist in the document

        T1    T2    …    Tt
  D1    d11   d12   …    d1t
  D2    d21   d22   …    d2t
  :     :     :          :
  Dn    dn1   dn2   …    dnt

  e.g.  D1 = 1.0  0.5  0.0  0.8  1.0  0.0  0.2
        D2 = 0.8  1.0  0.8  0.0  0.0  0.2  0.0
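The sketch below (Python; the alphabetical term order is an assumption for illustration, while the term names and weights come from the D1 and D2 examples above) shows one way such a matrix can be built from per-document weight lists:

```python
# A minimal sketch: build a document-term matrix from per-document
# term-weight dictionaries. The alphabetical term order is assumed.
vocabulary = ["architecture", "computer", "database", "information",
              "management", "retrieval", "text"]

docs = {
    "D1": {"text": 1.0, "retrieval": 1.0, "database": 0.5,
           "computer": 0.8, "information": 0.2},
    "D2": {"database": 1.0, "architecture": 0.8, "retrieval": 0.8,
           "management": 0.2},
}

# Each row is one document vector; a zero entry means the term has no
# significance in, or is absent from, that document.
matrix = {name: [weights.get(t, 0.0) for t in vocabulary]
          for name, weights in docs.items()}

for name, row in matrix.items():
    print(name, row)
```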
The Vector-Space Model
• A vocabulary of 2 terms forms a 2D space; a document may contain 0, 1 or 2 of the terms
  – di = 〈 0, 0 〉 (contains none of the index terms)
  – dj = 〈 0, 0.7 〉 (contains one of the two index terms)
  – dk = 〈 1, 2 〉 (contains both index terms)
• Likewise, a vocabulary of 3 terms forms a 3D space
• A vocabulary of n terms forms an n-dimensional space
• A document or query can be represented as a linear combination of the term vectors

[Figure: di, dj and dk plotted on axes t1 and t2]

Question: why do we bother about the empty document?
Graphic Representation

Example:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2 and Q plotted in the 3D space spanned by T1, T2 and T3]

• Is D1 or D2 more similar to Q?
• How do we measure the degree of similarity? Distance? Angle? Projection?
Similarity Measure
• A similarity measure is a function which computes the degree of
similarity between a pair of vectors
– since queries and documents are both vectors, similarity can be computed
between two documents, two queries, or a document and a query

• There are a large number of similarity measures proposed in the literature, because the best similarity measure doesn't exist (yet!)

• With a similarity measure between the query and the documents:
  – it is possible to rank the retrieved documents in order of presumed importance
  – it is possible to enforce a threshold so that the size of the retrieved set can be controlled
  – the results can be used to reformulate the original query in relevance feedback (e.g., combining a document vector with the query vector)
Similarity Measure: Inner Product

sim(D_i, Q) = \sum_{k=1}^{t} d_{ik} q_k

Inner Product -- Examples

Binary:
  D = 〈 1, 1, 1, 0, 1, 1, 0 〉
  Q = 〈 1, 0, 1, 0, 0, 1, 1 〉
  sim(D, Q) = 3
• Size of vector = size of vocabulary = 7
• 0 means the corresponding term is not found in the document or query

Weighted:
  D1 = 2T1 + 3T2 + 5T3        D2 = 3T1 + 7T2 + T3
  Q = 0T1 + 0T2 + 2T3
  sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
  sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
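A minimal Python sketch of the inner product, reproducing both examples above (the function name is illustrative):

```python
# Inner product similarity: the sum of products of corresponding weights.
# Works for both binary and weighted term vectors.
def inner_product(d, q):
    return sum(dk * qk for dk, qk in zip(d, q))

# Binary example (7-term vocabulary):
print(inner_product([1, 1, 1, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1, 1]))  # 3

# Weighted example: D1 = 2T1+3T2+5T3, D2 = 3T1+7T2+T3, Q = 2T3
print(inner_product([2, 3, 5], [0, 0, 2]))   # 10
print(inner_product([3, 7, 1], [0, 0, 2]))   # 2
```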
Properties of Inner Product

• The inner product similarity is unbounded
• It favors long documents
  – a long document has a large number of unique terms, each of which may occur many times
  – it measures how many terms are matched, but not how many terms are not matched
Cosine Similarity Measure

• Cosine similarity measures the cosine of the angle between two vectors
• It is the inner product normalized by the vector lengths:

CosSim(D_i, Q) = \frac{\sum_{k=1}^{t} d_{ik} q_k}{\sqrt{\sum_{k=1}^{t} d_{ik}^2} \sqrt{\sum_{k=1}^{t} q_k^2}}

where the two square roots in the denominator are the document length and the query length.

[Figure: D1, D2 and Q in term space; θ1 is the angle between D1 and Q, and θ2 the angle between D2 and Q]

  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q = 0T1 + 0T2 + 2T3

  CosSim(D1, Q) = 5*2 / (√38 √4) = 0.81
  CosSim(D2, Q) = 1*2 / (√59 √4) = 0.13

D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
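A small Python sketch of the computation, checked against the example above:

```python
import math

# Cosine similarity: inner product divided by the product of the
# Euclidean lengths of the two vectors.
def cos_sim(d, q):
    dot = sum(dk * qk for dk, qk in zip(d, q))
    return dot / (math.sqrt(sum(x * x for x in d)) *
                  math.sqrt(sum(x * x for x in q)))

print(round(cos_sim([2, 3, 5], [0, 0, 2]), 2))   # 0.81
print(round(cos_sim([3, 7, 1], [0, 0, 2]), 2))   # 0.13
```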
Jaccard Coefficient

• Introduced by Paul Jaccard, Professor of Botany and Plant Physiology, in 1901

Sim(D_i, Q) = \frac{\sum_{k=1}^{t} d_{ik} q_k}{\sum_{k=1}^{t} d_{ik}^2 + \sum_{k=1}^{t} q_k^2 - \sum_{k=1}^{t} d_{ik} q_k}

  D1 = 2T1 + 3T2 + 5T3        Sim(D1, Q) = 10 / (38 + 4 - 10) = 10/32 = 0.31
  D2 = 3T1 + 7T2 + T3         Sim(D2, Q) = 2 / (59 + 4 - 2) = 2/61 = 0.03
  Q = 0T1 + 0T2 + 2T3

• D1 is 9.5 times better than D2
• What is the difference between Jaccard and CosSim?
Dice Coefficient

• Ranges from 0 to 1 but does not satisfy the triangle inequality

Sim(D_i, Q) = \frac{2 \sum_{k=1}^{t} d_{ik} q_k}{\sum_{k=1}^{t} d_{ik}^2 + \sum_{k=1}^{t} q_k^2}

  D1 = 2T1 + 3T2 + 5T3        Sim(D1, Q) = 2*10 / (38 + 4) = 20/42 = 0.48
  D2 = 3T1 + 7T2 + T3         Sim(D2, Q) = 2*2 / (59 + 4) = 4/63 = 0.06
  Q = 0T1 + 0T2 + 2T3

• D1 is about 8 times better than D2
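Both coefficients are simple to compute; a Python sketch reproducing the two examples:

```python
# Weighted Jaccard and Dice coefficients, as defined above.
def jaccard(d, q):
    dot = sum(dk * qk for dk, qk in zip(d, q))
    return dot / (sum(x * x for x in d) + sum(x * x for x in q) - dot)

def dice(d, q):
    dot = sum(dk * qk for dk, qk in zip(d, q))
    return 2 * dot / (sum(x * x for x in d) + sum(x * x for x in q))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(jaccard(D1, Q), 2), round(jaccard(D2, Q), 2))   # 0.31 0.03
print(round(dice(D1, Q), 2), round(dice(D2, Q), 2))         # 0.48 0.06
```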
Binary Versions of Similarity Measures

• Inner product
  – Non-binary weights: \sum_{k=1}^{t} d_{ik} q_k
  – Binary weights: |D_i \cap Q|
• Cosine
  – Non-binary weights: \frac{\sum_{k=1}^{t} d_{ik} q_k}{\sqrt{\sum_{k=1}^{t} d_{ik}^2} \sqrt{\sum_{k=1}^{t} q_k^2}}
  – Binary weights: \frac{|D_i \cap Q|}{\sqrt{|D_i|} \sqrt{|Q|}}
• Jaccard
  – Non-binary weights: \frac{\sum_{k=1}^{t} d_{ik} q_k}{\sum_{k=1}^{t} d_{ik}^2 + \sum_{k=1}^{t} q_k^2 - \sum_{k=1}^{t} d_{ik} q_k}
  – Binary weights: \frac{|D_i \cap Q|}{|D_i \cup Q|}

• In the non-binary formulas, D_i and Q are vectors; in the binary formulas, they are sets of keywords, and |D_i| or |Q| is the number of non-zero elements in the set (note: since the weights are binary, the squares in the non-binary formulas can be ignored)
• Jaccard with binary weights is the size of the intersection of the query and document divided by the size of their union
  – When D_i = Q, the Jaccard similarity is 1 (the highest possible)
  – When D_i ⊃ Q, the Jaccard similarity decreases as |D_i| increases (it penalizes long documents)
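With binary weights, documents and queries can be handled as Python sets; a sketch (the keyword sets are illustrative):

```python
import math

# Binary versions of the measures, with documents and queries as sets.
def inner_bin(d, q):
    return len(d & q)

def cos_bin(d, q):
    return len(d & q) / (math.sqrt(len(d)) * math.sqrt(len(q)))

def jaccard_bin(d, q):
    return len(d & q) / len(d | q)

D = {"database", "retrieval", "architecture", "text", "management"}
Q = {"retrieval", "architecture", "information", "management"}
print(inner_bin(D, Q))              # 3
print(round(cos_bin(D, Q), 2))      # 0.67
print(round(jaccard_bin(D, Q), 2))  # 0.5
```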
Similarity between a Query and a Document

• Given a query containing the terms 123, 345, 544, 642 and 850, the corresponding DFs, and the document vector below
• Assume the collection contains 10,000 documents

Query (Term-ID, DF):
  123   50
  345   540
  544   1300
  642   35
  850   250

Document (Term-ID, TF):
  123 5, 145 2, 320 3, 344 1, 390 1, 450 1, 482 1, 544 1, 580 1,
  590 3, 610 1, 630 1, 661 1, 702 2, 758 1, 850 1, 887 1, 950 2

Term-ID   tf   idf                          tf*idf
123       5    log2(10000/50) = 7.64        38.2
345       0    no need to calculate it!     0
544       1    log2(10000/1300) = 2.94      2.94
642       0    no need to calculate it!     0
850       1    log2(10000/250) = 5.32       5.32

Inner(D, Q) = 38.2 + 2.94 + 5.32 = 46.46
How do we obtain the cosine?
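A Python sketch of the same computation (the term IDs and frequencies are those from the example):

```python
import math

N = 10_000                              # collection size
df = {123: 50, 345: 540, 544: 1300, 642: 35, 850: 250}  # query-term DFs
doc_tf = {123: 5, 145: 2, 320: 3, 344: 1, 390: 1, 450: 1, 482: 1,
          544: 1, 580: 1, 590: 3, 610: 1, 630: 1, 661: 1, 702: 2,
          758: 1, 850: 1, 887: 1, 950: 2}

score = 0.0
for term in [123, 345, 544, 642, 850]:
    tf = doc_tf.get(term, 0)
    if tf == 0:
        continue                        # no need to calculate idf
    score += tf * math.log2(N / df[term])

# Prints 46.48; the slide's 46.46 comes from rounding each term first.
print(round(score, 2))
```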
Interesting Things We Can Do in the Vector Space Model

Operations on Vectors
• Query and document vectors are structurally the same (i.e., they are vectors of the same dimension)
  – Query and document vectors can be added, subtracted, etc.
• Adding two vectors means adding the corresponding elements of the two vectors; likewise for subtraction
  – Imagine D1 is about soccer/sport, D2 is about politics, and Q is a two-term query
  – What are the meanings of the following operations (D1, D2 and Q are vectors)?
    • D1 + D2
    • D1 - D2
    • Q + D1
    • Q - D1
• The last two examples illustrate relevance feedback
  – Good/bad terms in the feedback document D1 can be added to or subtracted from the original query
Relevance Feedback in the Vector Space Model

• Recall: relevance feedback tries to modify a query into a "better" one based on documents that users have identified as relevant (or irrelevant)
  – Good/bad terms in a feedback document can be added to/subtracted from the original query

Q = 〈 retrieval 0, database 1, architecture 0, computer 0, text 1, management 0, information 1 〉

Search ⇒

Di = 〈 retrieval 1.0, database 0.5, architecture 1.2, computer 0.0, text 0.8, management 0.9, information 0.0 〉

Suppose Di is marked as relevant:
  Good query terms: database, text
  Good query terms missed: architecture, retrieval, management
  Bad (query) term: information

Q′ = 〈 retrieval 0.8, database 1, architecture 0.8, computer 0, text 1, management 0.8, information 0 〉

• Query term weights in the revised query can be set differently; here, the added query terms are given lower weights than the original query terms
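A minimal Python sketch of this feedback step (the 0.8 weight for newly added terms and the function name are assumptions for illustration):

```python
# Revise a query given a relevant feedback document: terms present in
# the document but missing from the query are added with a lower weight
# (0.8 here, an assumed value), and bad terms are zeroed out.
def feedback(q, d_rel, bad_terms, new_weight=0.8):
    q2 = dict(q)
    for term, w in d_rel.items():
        if w > 0 and q2.get(term, 0) == 0:
            q2[term] = new_weight        # good query term missed
    for term in bad_terms:
        q2[term] = 0                     # bad query term
    return q2

Q = {"retrieval": 0, "database": 1, "architecture": 0, "computer": 0,
     "text": 1, "management": 0, "information": 1}
Di = {"retrieval": 1.0, "database": 0.5, "architecture": 1.2,
      "computer": 0.0, "text": 0.8, "management": 0.9, "information": 0.0}
print(feedback(Q, Di, bad_terms=["information"]))   # matches Q′ above
```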
• Boolean queries are difficult to modify
  – What does it mean to add two Boolean queries together?
    • Does "add" mean AND or OR?
  – Even worse, what does it mean to add a Boolean query and a document (a bag of words) together?
• Suppose QB = database AND text AND information, and
  D1 = 〈 text 1.0; retrieval 1.0; database 0.5; computer 0.8; information 0.2 〉
• How can you improve QB based on the fact that D1 is relevant?
  – database AND text AND architecture AND retrieval?
  – database AND text AND architecture AND retrieval AND NOT information?
  – database AND text AND (information OR architecture OR retrieval)?
  – (database AND architecture) OR (text AND retrieval)?
  – ……
• Revised queries become more complex and remain difficult to modify
Document Clusters and Centroids

• Similarity can be computed:
  – between document and query vectors ⇒ search engine (ranking)
  – between document and document vectors ⇒ document clustering
  – between query and query vectors ⇒ query suggestion, query expansion, etc.
• Comparing document against document:
  D1 = 〈 1, 1, 1, 0, 1, 1, 0 〉
  D2 = 〈 1, 0, 1, 0, 0, 1, 1 〉
  D3 = 〈 0.5, 2, 0, 0, 1, 1, 0.5 〉
• What are the similarities between these documents?
• How close are these documents?
  – total or average similarity over each possible pair of documents (needs n*(n-1)/2 similarity computations), as in the sketch below
• If the documents are close enough, they can form a cluster
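A Python sketch of the pairwise comparison (average cosine similarity over all n*(n-1)/2 pairs):

```python
import itertools
import math

def cos_sim(d, q):
    dot = sum(a * b for a, b in zip(d, q))
    return dot / (math.sqrt(sum(a * a for a in d)) *
                  math.sqrt(sum(b * b for b in q)))

docs = [
    [1, 1, 1, 0, 1, 1, 0],        # D1
    [1, 0, 1, 0, 0, 1, 1],        # D2
    [0.5, 2, 0, 0, 1, 1, 0.5],    # D3
]

# All n*(n-1)/2 document pairs, then the average pairwise similarity.
pairs = list(itertools.combinations(docs, 2))
avg = sum(cos_sim(a, b) for a, b in pairs) / len(pairs)
print(len(pairs), round(avg, 2))   # 3 pairs for n = 3 documents
```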
Document Centroid Example

• Given three documents:
  – di = 〈 0, 0 〉
  – dj = 〈 0, 0.7 〉
  – dk = 〈 1, 2 〉
• The x and y coordinates of the centroid are the averages of the x and y coordinates of the documents:
  C = 〈 0.33, 0.9 〉
• To compute the similarity of the three documents, we can compute the similarity of each document to the centroid instead of all possible document pairs

[Figure: di, dj, dk and the centroid C plotted on axes t1 and t2]
Centroid of Multiple Documents

• Given a set of m document vectors:
  d1 = [ T_{1,1} … T_{1,k} … T_{1,t} ]
  d2 = [ T_{2,1} … T_{2,k} … T_{2,t} ]
  d3 = [ T_{3,1} … T_{3,k} … T_{3,t} ]
  d4 = [ T_{4,1} … T_{4,k} … T_{4,t} ]
• The centroid of the m document vectors is a "virtual" document vector C = (c1, c2, …, ck, …, ct), where

c_k = \frac{\sum_{i=1}^{m} T_{i,k}}{m}

  – The centroid is a convenient and compact representation of a set of documents
  – The closeness of a document set can be measured by the distances of the documents to their centroid
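A one-function Python sketch of the centroid formula, checked against the 2D example above:

```python
# The centroid: average each coordinate over the m document vectors.
def centroid(doc_vectors):
    m = len(doc_vectors)
    return [sum(coords) / m for coords in zip(*doc_vectors)]

di, dj, dk = [0, 0], [0, 0.7], [1, 2]
print(centroid([di, dj, dk]))   # [0.333..., 0.9], i.e., C = <0.33, 0.9>
```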
Some Issues about Similarity Measures

Geometric Interpretation of Similarity Measures
• Cosine similarity: we all know by now that it is the angle between the two vectors (query and document vectors)
• Inner product: what is its geometric interpretation?
• What about Euclidean distance?
Choice of Similarity Measures

• The vector space model can match queries against documents, or documents against documents, but the interpretations and match criteria are different

D = 〈 1, 1, 1, 0, 1, 1, 0 〉
Q = 〈 1, 0, 1, 0, 0, 1, 1 〉
  • Q wants "retrieval", "architecture", "information" and "management"; it doesn't care about other things
  • D meets three out of the four "wants": a good match

D1 = 〈 1, 1, 1, 0, 1, 1, 0 〉
D2 = 〈 1, 0, 1, 0, 0, 1, 1 〉
  • D1 talks about "database", "retrieval", "architecture", "text" and "management", but not about "information"
  • D2 talks about "retrieval", "architecture", "information" and "management", but not about "database" and "text"
  • Are D1 and D2 very similar? Three matches, three mismatches, and one common absence (i.e., "computer")
Similarity Measures for Document Comparison

             Document / Document    Document / Query
  Euclidean  OK                     Bad
  Cosine     OK                     OK
  Inner      Fair                   OK
  Jaccard    Fair+                  OK
Query Term Weights

• The weight of a query term is assumed to be 1 if the term is present in the query, and 0 otherwise
• A weight may be given by the user to each query term
• A natural language query can be regarded as a document
  – The query "MP3 songs sung by Aaron Kwok" may be transformed into:
    〈 MP3, song, sung, Aaron, Kwok 〉
  – The query "I am interested in gold medals or gold investment" may be transformed into:
    〈 gold 2, medal 1, investment 1 〉 after filtering out "I am interested in"
• The similarity of two documents can be computed in the same way

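A Python sketch of turning a natural-language query into a weighted term vector; the stopword list is illustrative, and unlike the example above it does no stemming (so "medals" stays plural):

```python
from collections import Counter

# An illustrative stopword list; a real system would use a fuller list
# plus stemming (e.g., "medals" -> "medal").
STOPWORDS = {"i", "am", "interested", "in", "or", "by"}

def query_to_vector(query):
    terms = [t.lower() for t in query.split()
             if t.lower() not in STOPWORDS]
    return Counter(terms)    # term -> weight (occurrence count)

print(query_to_vector("I am interested in gold medals or gold investment"))
# Counter({'gold': 2, 'medals': 1, 'investment': 1})
```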

Information Abundance on the Web

• Question: is the vector space model more like "OR" or "AND"?
• Given the answer to the above question, do you think the vector space model is good or bad for
  – a professional database (e.g., legal or medical datasets)?
  – web retrieval?

Just type in as many relevant words as you can think of and you will find something useful on the web (e.g., via Google), especially for popular topics (e.g., a presidential election), since there are many sources of information on popular topics.
Term Independence Assumption (I)

• Each term i is identified as Ti
• A document is represented as a set of unordered terms, called the bag-of-words (BOW) representation
• The terms are assumed to be independent (or uncorrelated, or orthogonal) to each other, and they form an orthogonal vector space

[Figure: on the left, x and y are orthogonal, so the projection of y onto x is x cos 90° = 0; on the right, x and y are not orthogonal, and the projection is x cos θ]

• What does this mean for retrieval?
  – D1 = 〈 computer, CPU, Intel 〉
  – D2 = 〈 computer, analysis, price 〉
  – Q = 〈 computer 〉
  – Intuitively, is D1 or D2 a better match? What does the vector space model tell you?
  – Should D1 be represented as 〈 computer 2.5 〉?
• How can we measure term dependence and incorporate it into retrieval?
Term Independence Assumption (II)

• In real life, it is hard to find two terms that are absolutely independent of each other
• Independence can be judged with respect to a document collection

[Figure: "computer", "science" and "business" as important words in a CS collection]

• In reality, these three terms are correlated with each other!
  – When you find "computer" in a document, there is a very high chance that you will also find "science" in it
  – When you find "computer" in a document, there is a medium chance that you will also find "business" in it
  – When you find "business" in a document, there is a small chance that you will also find "science" in it
• We can judge term independence by checking whether or not two terms frequently co-occur in a document (or a sentence, a window of text, etc.)
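A Python sketch of this co-occurrence check over a toy collection (the four documents are invented for illustration); it estimates how often t2 appears in documents that contain t1:

```python
# Toy collection: each document is its set of terms.
docs = [
    {"computer", "science", "architecture"},
    {"computer", "science", "database"},
    {"computer", "business", "investment"},
    {"business", "management"},
]

# P(t2 in document | t1 in document), estimated from the collection.
def cooccurrence(t1, t2, collection):
    with_t1 = [d for d in collection if t1 in d]
    if not with_t1:
        return 0.0
    return sum(t2 in d for d in with_t1) / len(with_t1)

print(round(cooccurrence("computer", "science", docs), 2))   # 0.67: high
print(round(cooccurrence("business", "science", docs), 2))   # 0.0: low
```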
Synonyms

• Two terms may have (more or less) the same meaning, i.e., synonyms
  – D1 = 〈 company, earning, invest 〉
  – Q = 〈 corporation, earning 〉
  – Straight keyword matching will cause a mismatch between "company" and "corporation"

  – D1 = 〈 furniture, table, desk, chair, lamp 〉
  – Q = 〈 furniture 〉
  – Straight keyword matching will find only one matching keyword, so D1 has a small similarity to Q
Unbalanced Property of the Vector Space Model

• If Q = 〈 x, y 〉, then D2 = 〈 x, x 〉 (talks only about x, but a lot of it) and D1 = 〈 x, y 〉 (talks about both x and y) can have the same similarity to Q

Example 1: Inner product
  D2 = 〈 2, 0 〉
  D1 = 〈 1, 1 〉
  Q = 〈 1, 1 〉
  Sim(Q, D1) = Sim(Q, D2)

Example 2: Cosine
  Suppose Q = 〈 1, 0.414 〉
  Sim(Q, D1) = Sim(Q, D2)

[Figure: D1, D2 and Q plotted on axes x and y]
What Problems Does It Cause?

• What are the impacts of the unbalanced property on the following queries?
  – "education university": documents with a high tf of "education" alone or "university" alone, vs. a balance of both words
  – "travel moon space walk"
How to Favor "Balanced" Documents?

• Multi-step filtering
  – first select all documents containing both x and y, and rank them higher
  – then select all documents containing either x or y, and rank them lower
• Compute the similarity as usual, but weight it by the fraction of query terms matched:
  – Sim′(D, Q) = Sim(D, Q) * [ |D ∩ Q| / |D ∪ Q| ]
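A Python sketch of the Sim′ weighting (treating D and Q as the sets of their terms; the base similarity value is a placeholder):

```python
# Scale a base similarity score by the overlap ratio |D ∩ Q| / |D ∪ Q|,
# so documents matching more query terms are favored.
def balanced_sim(base_sim, d_terms, q_terms):
    return base_sim * len(d_terms & q_terms) / len(d_terms | q_terms)

# With Q = {x, y}: the balanced D1 keeps its score, while the one-sided
# D2 (which talks only about x) is halved, even if base similarities tie.
q = {"x", "y"}
print(balanced_sim(2.0, {"x", "y"}, q))   # 2.0
print(balanced_sim(2.0, {"x"}, q))        # 1.0
```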
Combining the Boolean and Vector Space Models

Ranking with Boolean Operators
• The statistical model is more like the "OR" operator
• The Boolean and vector space models can be combined by doing Boolean filtering first, followed by ranking:
  – first evaluate the Boolean query as usual
  – collect all documents satisfying the Boolean query
  – rank the results using the vector space query

[Figure: documents → Boolean filter (Boolean query) → ranking (vector-space query) → results]

• AltaVista Advanced Search separated the Boolean query and the ranking criteria:
  Query: gold AND investment
  Ranking criteria: precious metal
• In most other systems, the vector space query is the same as the Boolean query with the Boolean operators ignored
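A Python sketch of the combined scheme (the document weights, the AND filter and the ranking criteria are illustrative, echoing the AltaVista example):

```python
import math

def cos_sim(d, q):
    dot = sum(d.get(t, 0) * w for t, w in q.items())
    nd = math.sqrt(sum(v * v for v in d.values()))
    nq = math.sqrt(sum(v * v for v in q.values()))
    return dot / (nd * nq) if nd and nq else 0.0

docs = {
    "D1": {"gold": 3, "investment": 1, "metal": 2},
    "D2": {"gold": 1, "medal": 4},
    "D3": {"gold": 2, "investment": 2, "precious": 1},
}
required = {"gold", "investment"}        # Boolean part: gold AND investment
ranking_q = {"precious": 1, "metal": 1}  # vector-space ranking criteria

# Step 1: Boolean filter; step 2: rank the surviving documents.
hits = {name: d for name, d in docs.items() if required <= set(d)}
ranked = sorted(hits, key=lambda n: cos_sim(hits[n], ranking_q),
                reverse=True)
print(ranked)   # D2 fails the filter; D1 and D3 are ranked by the criteria
```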
Summary

• Cosine similarity is independent of document sizes
• The model is based on occurrence frequencies only
• It considers both local and global occurrence frequencies
• Advantages
  – simplicity
  – able to handle weighted terms
  – easy to modify term vectors
• Disadvantages
  – assumption of term independence
  – lacks the control of the Boolean model (e.g., requiring a term to appear in a document); for instance, two documents Di = {x, x} and Dj = {x, y} may yield the same similarity score for Q = {x, y} if x and y have the same weight, which may not be desirable
