L04
• T distinct terms are available; call them index terms or the vocabulary
• The index terms represent important terms for an application
– What might be the index terms for a computer science library?
architecture
bus
computer
database
….
xml
[Figure: a computer science collection mapped to its index terms (the vocabulary, or vector space, of the collection)]
For now: we only consider single terms (no phrases)
Representing a Vector Space
     T1   T2   …   Tt
D1   d11  d12  …   d1t
D2   d21  d22  …   d2t
 :    :    :        :
Dn   dn1  dn2  …   dnt

Example rows (t = 7):
D1 = 1.0 0.5 0.0 0.8 1.0 0.0 0.2
D2 = 0.8 1.0 0.8 0.0 0.0 0.2 0.0
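The matrix above can be held as a simple list of rows; a minimal sketch in Python, using the two example rows shown above over an assumed 7-term vocabulary:

```python
# Term-document matrix: row i holds the weights d_i1 .. d_it for document Di.
# The vocabulary size t = 7 and the weights are the example rows above.
D1 = [1.0, 0.5, 0.0, 0.8, 1.0, 0.0, 0.2]
D2 = [0.8, 1.0, 0.8, 0.0, 0.0, 0.2, 0.0]
matrix = [D1, D2]

# Every document vector has exactly one weight per vocabulary term.
assert all(len(row) == 7 for row in matrix)
```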
The Vector-Space Model
• A vocabulary of 2 terms forms a 2D space; a document may contain 0, 1, or 2 of the terms
– di = 〈 0, 0 〉 (contains neither of the index terms)
– dj = 〈 0, 0.7 〉 (contains one of the two index terms)
– dk = 〈 1, 2 〉 (contains both index terms)
• Likewise, a vocabulary of 3 terms forms a 3D space
• A vocabulary of t terms forms a t-dimensional space
• A document or query can be represented as a linear combination of the terms Ti
[Figure: di, dj, and dk plotted in the 2D space with axes t1 and t2]
Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q plotted in the 3D space with axes T1, T2, T3]
Binary:
– D = 〈 1, 1, 1, 0, 1, 1, 0 〉
– Q = 〈 1, 0, 1, 0, 0, 1, 1 〉
• Size of vector = size of vocabulary = 7
• 0 means the corresponding term is not found in the document or query
sim(D, Q) = 3
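For 0/1 vectors the inner product simply counts the terms shared by the document and the query; a minimal sketch:

```python
# Binary vectors over the 7-term vocabulary from the slide above.
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]

def sim(d, q):
    # Inner product; for 0/1 vectors this is the number of shared terms.
    return sum(dk * qk for dk, qk in zip(d, q))

print(sim(D, Q))  # → 3
```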
Weighted:
D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3

CosSim(Di, Q) = ∑_{k=1..t} (d_ik · q_k) / ( √(∑_{k=1..t} d_ik²) · √(∑_{k=1..t} q_k²) )

where √(∑ d_ik²) is the document length and √(∑ q_k²) is the query length
[Figure: D1, D2, and Q as vectors; θ2 is the angle between Q and D2]
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3

CosSim(D1, Q) = 5*2 / (√38 √4) ≈ 0.81
CosSim(D2, Q) = 1*2 / (√59 √4) ≈ 0.13

D1 is about 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
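The two scores can be reproduced directly from the definition; a minimal sketch (vectors are over the axes T1, T2, T3):

```python
import math

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [0, 0, 2]

def inner(d, q):
    # Inner product of two term-weight vectors.
    return sum(dk * qk for dk, qk in zip(d, q))

def cos_sim(d, q):
    # Inner product normalized by document length and query length.
    return inner(d, q) / (math.sqrt(inner(d, d)) * math.sqrt(inner(q, q)))

print(round(cos_sim(D1, Q), 2))    # → 0.81
print(round(cos_sim(D2, Q), 2))    # → 0.13
print(inner(D1, Q), inner(D2, Q))  # → 10 2
```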
Jaccard Coefficient
Jaccard Coefficient:

Jaccard(Di, Q) = ∑_{k=1..t} (d_ik · q_k) / ( ∑_{k=1..t} d_ik² + ∑_{k=1..t} q_k² − ∑_{k=1..t} (d_ik · q_k) )

For binary vectors, the weighted measures reduce to set operations on the terms of Di and Q:

Inner product: ∑_{k=1..t} (d_ik · q_k) ⇒ |Di ∩ Q|
Cosine: ∑_{k=1..t} (d_ik · q_k) / ( √(∑ d_ik²) · √(∑ q_k²) ) ⇒ |Di ∩ Q| / √(|Di| · |Q|)
Jaccard: ∑_{k=1..t} (d_ik · q_k) / ( ∑ d_ik² + ∑ q_k² − ∑ (d_ik · q_k) ) ⇒ |Di ∩ Q| / ( |Di| + |Q| − |Di ∩ Q| ) = |Di ∩ Q| / |Di ∪ Q|
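The weighted Jaccard coefficient can be sketched the same way, reusing the D1, D2, and Q vectors from the cosine example:

```python
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [0, 0, 2]

def inner(d, q):
    return sum(dk * qk for dk, qk in zip(d, q))

def jaccard(d, q):
    # Inner product divided by |d|^2 + |q|^2 minus the overlap.
    return inner(d, q) / (inner(d, d) + inner(q, q) - inner(d, q))

print(jaccard(D1, Q))            # → 0.3125  (10 / (38 + 4 - 10))
print(round(jaccard(D2, Q), 3))  # → 0.033   (2 / (59 + 4 - 2))
```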
• Recall: relevance feedback tries to modify a query into a “better” one based on documents that users have identified as relevant (or irrelevant)
– Good/bad terms in a feedback document can be added to/subtracted from the original query
Q = 〈 retrieval 0, database 1, architecture 0, computer 0, text 1, management 0, information 1 〉
Searching with Q retrieves Di, which the user marks as relevant:
Di = 〈 retrieval 1.0, database 0.5, architecture 1.2, computer 0.0, text 0.8, management 0.9, information 0.0 〉
Q’ = 〈 retrieval 0.8, database 1, architecture 0.8, computer 0, text 1, management 0.8, information 0 〉
• Query term weights in revised query can be set differently; here, added query terms
are given lower weights than original query terms
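The Q to Q’ revision can be sketched with a simple assumed rule: terms that appear in the relevant document but not in the query are added with a fixed lower weight of 0.8, and query terms absent from the relevant document are dropped. Both the rule and the 0.8 weight are illustrative choices matching the example above, not a standard algorithm:

```python
Q  = {"retrieval": 0, "database": 1, "architecture": 0, "computer": 0,
      "text": 1, "management": 0, "information": 1}
Di = {"retrieval": 1.0, "database": 0.5, "architecture": 1.2, "computer": 0.0,
      "text": 0.8, "management": 0.9, "information": 0.0}

ADDED_WEIGHT = 0.8  # assumed: lower than the original query-term weight of 1

def revise(q, rel_doc):
    q_new = dict(q)
    for term, w in rel_doc.items():
        if w > 0 and q.get(term, 0) == 0:
            q_new[term] = ADDED_WEIGHT  # add a good term from the relevant doc
        elif w == 0 and q.get(term, 0) > 0:
            q_new[term] = 0             # drop a query term the relevant doc lacks
    return q_new

Qp = revise(Q, Di)
print(Qp["architecture"], Qp["information"])  # → 0.8 0
```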
• Boolean queries are difficult to modify
– What does it mean to add two Boolean queries together?
• Does “add” mean AND or OR?
– Even worse, what does it mean to add a Boolean query and a document (a bag of words) together?
• Suppose: QB = database AND text AND information
D1= 〈 text 1.0; retrieval 1.0; database 0.5; computer 0.8; information 0.2 〉
• How can you improve QB given that D1 is relevant?
– Database AND text AND architecture AND retrieval?
– Database AND text AND architecture AND retrieval AND NOT information?
– Database AND text AND (information OR architecture OR retrieval)?
– (Database AND architecture) OR (text AND retrieval)?
– ……
• Revised queries are more complex and difficult to modify
Document Clusters and Centroids
• The vector space model can match queries against documents, or documents
against documents, but the interpretations and match criteria are different
Just type in as many relevant words as you can think of and you will find something useful on the web (e.g., via Google), especially for popular topics (e.g., a presidential election), since there are many sources of information on popular topics.
Term Independence Assumption (I)
• Each term i is identified as Ti
• A document is represented as a set of unordered terms, called the bag-of-words (bow) representation
• The terms are assumed to be independent (or uncorrelated or orthogonal)
to each other and form an orthogonal vector space
• On the left: x and y are orthogonal; the angle between them is 90°, so x · y = |x| |y| cos 90° = 0
• On the right: x and y are not orthogonal; x · y = |x| |y| cos θ ≠ 0
[Figure: orthogonal axes x and y on the left; non-orthogonal x and y separated by angle θ on the right]
• What does this mean for retrieval?
– D1 = < computer, CPU, Intel >
– D2 = < computer, analysis, price >
– Q = < computer >
– Intuitively, is D1 or D2 a better match? What does the vector space model tell you?
– Should D1 be represented as < computer 2.5 > ?
• How to measure term dependence and incorporate it into retrieval?
Term Independence Assumption (II)
• In real life, it is hard to find two terms that are absolutely independent of each other
• Independence can be judged with respect to a document collection.
Example 1: Inner product
D1 = 〈 1, 1 〉
D2 = 〈 2, 0 〉
Q = 〈 1, 1 〉
Sim(Q, D1) = Sim(Q, D2) = 2

Example 2: Cosine
Suppose Q = 〈 1, 0.414 〉
Sim(Q, D1) = Sim(Q, D2)
[Figure: D1, D2, and Q plotted in the x–y plane]
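Both ties can be checked numerically; a small sketch (0.414 ≈ tan 22.5°, putting Q halfway in angle between D1 and D2):

```python
import math

def inner(d, q):
    return sum(dk * qk for dk, qk in zip(d, q))

def cos_sim(d, q):
    return inner(d, q) / (math.sqrt(inner(d, d)) * math.sqrt(inner(q, q)))

D1, D2 = [1, 1], [2, 0]

# Example 1: the inner product ties D1 and D2 for Q = <1, 1>.
print(inner([1, 1], D1), inner([1, 1], D2))  # → 2 2

# Example 2: cosine ties D1 and D2 for Q = <1, 0.414>.
Q = [1, 0.414]
print(abs(cos_sim(Q, D1) - cos_sim(Q, D2)) < 1e-3)  # → True
```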
What Problems does it Cause?
• What are the impacts of the unbalanced property on the following queries?
– “education university”: documents with a high tf of “education” alone (or “university” alone), vs. a balance of both words
– “travel moon space walk”
How to Favor “Balanced” Documents?
• Multi-step filtering
– First select the documents containing both x and y, and rank them higher
– Then select the documents containing either x or y (but not both), and rank them lower
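The two-step filtering can be sketched over boolean term sets; the document collection here is invented for illustration:

```python
# Hypothetical toy collection: each document is its set of terms.
docs = {
    "d1": {"education", "university", "ranking"},
    "d2": {"education", "funding"},
    "d3": {"university"},
}

def filtered_ranking(docs, x, y):
    # Step 1: documents containing both x and y are ranked higher.
    both = [d for d, terms in docs.items() if x in terms and y in terms]
    # Step 2: documents containing exactly one of the two are ranked lower.
    one = [d for d, terms in docs.items() if (x in terms) != (y in terms)]
    return both + one

print(filtered_ranking(docs, "education", "university"))  # → ['d1', 'd2', 'd3']
```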
In most other systems, the vector space query is the same as the
Boolean query with Boolean operators ignored
Summary