
Vector Space Model

Jaime Arguello
INLS 509: Information Retrieval
[email protected]

February 13, 2013



The Search Task

• Given a query and a corpus, find relevant items

‣ query: a textual description of the user’s information need
‣ corpus: a repository of textual documents
‣ relevance: satisfaction of the user’s information need



What is a Retrieval Model?

• A formal method that predicts the degree of relevance of a document to a query



Basic Information Retrieval Process
• The retrieval model is responsible for performing this comparison and for retrieving objects that are likely to satisfy the user

[diagram: information need → query → representation; document collection → indexed objects → representation; the two representations are compared, objects are retrieved, and results are evaluated]


Boolean Retrieval Models
• The user describes their information need using boolean
constraints (e.g., AND, OR, and AND NOT)
• Unranked Boolean Retrieval Model: retrieves documents
that satisfy the constraints in no particular order
• Ranked Boolean Retrieval Model: retrieves documents
that satisfy the constraints and ranks them based on the
number of ways they satisfy the constraints
• Also known as ‘exact-match’ retrieval models

• Advantages and disadvantages?



Boolean Retrieval Models
• Advantages:

‣ Easy for the system


‣ Users get transparency: it is easy to understand why a
document was or was not retrieved
‣ Users get control: it is easy to determine whether the query is too specific (few results) or too broad (many results)
• Disadvantages:

‣ The burden is on the user to formulate a good boolean query


Relevance

• Many factors affect whether a document satisfies a particular user’s information need
• Topicality, novelty, freshness, authority, formatting,
reading level, assumed level of prior knowledge/expertise
• Topical relevance: the document is on the same topic as
the query
• User relevance: everything else!

• For now, we will only try to predict topical relevance



Relevance

• Focusing on topical relevance does not mean we’re ignoring everything else!
• It only means we’re focusing on one (of many) criteria by
which users judge relevance
• And, it’s an important criterion



Introduction to Best-Match Retrieval Models

• So far, we’ve discussed ‘exact-match’ models

• Today, we start discussing ‘best-match’ models

• Best-match models predict the degree to which a document is relevant to a query
• Ideally, this would be expressed as RELEVANT(q,d)

• In practice, it is expressed as SIMILAR(q,d)

• How might you compute the similarity between q and d?



Vector Space Model



What is a Vector Space?
• Formally, a vector space is defined by a set of linearly
independent basis vectors
• The basis vectors correspond to the dimensions or
directions of the vector space

[figure: basis vectors for 2-dimensional space (X, Y) and for 3-dimensional space (X, Y, Z)]


What is a Vector?

• A vector is a point in a vector space and has length (from the origin to the point) and direction

[figure: a vector drawn in 2-dimensional and 3-dimensional space]


What is a Vector?

• A 2-dimensional vector can be written as [x,y]

• A 3-dimensional vector can be written as [x,y,z]

[figure: the components of a vector: [x, y] in 2-dimensional space and [x, y, z] in 3-dimensional space]


What is a Vector Space?

• The basis vectors are linearly independent because knowing a vector’s value on one dimension doesn’t say anything about its value along another dimension

[figure: basis vectors for 2-dimensional and 3-dimensional space]


Binary Text Representation

         a   aardvark   abacus   abba   able   ...   zoom
doc_1    1      0          0       0      0    ...     1
doc_2    0      0          0       0      1    ...     1
 ::     ::     ::         ::      ::     ::    ...    ::
doc_m    0      0          1       1      0    ...     0

• 1 = the word appears in the document

• 0 = the word does not appear in the document

• Does not represent word frequency, word location, or word order information

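A minimal sketch (in Python, not from the original slides) of how such a binary term-presence vector could be built; the vocabulary and documents below are the toy dog/man/bite examples used later in the deck.

# Build a 0/1 vector over a fixed vocabulary: 1 if the term appears at least once.
def binary_vector(text, vocabulary):
    terms = set(text.lower().split())
    return [1 if term in terms else 0 for term in vocabulary]

vocabulary = ["dog", "man", "bite"]                # V = 3 index terms
print(binary_vector("dog bite man", vocabulary))   # [1, 1, 1]
print(binary_vector("dog bite", vocabulary))       # [1, 0, 1]
print(binary_vector("man bite", vocabulary))       # [0, 1, 1]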


Vector Space Representation

• Let V denote the size of the indexed vocabulary

‣ V = the number of unique terms,


‣ V = the number of unique terms excluding stopwords,
‣ V = the number of unique stems, etc...
• Any arbitrary span of text (i.e., a document, or a query)
can be represented as a vector in V-dimensional space
• For simplicity, let’s assume three index terms: dog, bite,
man (i.e., V=3)
• Why? Because it’s easy to visualize 3-D space



Vector Space Representation
with binary weights

• 1 = the term appears at least once

• 0 = the term does not appear


        dog   man   bite
doc_1    1     1     1
doc_2    1     0     1
doc_3    0     1     1

[figure: doc_1 "dog bite man" plotted as [1, 1, 1] in the 3-D dog/man/bite space]


Vector Space Representation
with binary weights

• 1 = the term appears at least once

• 0 = the term does not appear


        dog   man   bite
doc_1    1     1     1
doc_2    1     0     1
doc_3    0     1     1

[figure: "dog bite man" = [1, 1, 1] and "dog bite" = [1, 0, 1] plotted in the 3-D dog/man/bite space]


Vector Space Representation
with binary weights

• 1 = the term appears at least once

• 0 = the term does not appear


        dog   man   bite
doc_1    1     1     1
doc_2    1     0     1
doc_3    0     1     1

[figure: "dog bite man" = [1, 1, 1], "man bite" = [0, 1, 1], and "dog bite" = [1, 0, 1] plotted in the 3-D dog/man/bite space]


Vector Space Representation
with binary weights

• What span(s) of text does this vector represent?

[figure: a single binary vector plotted in the 3-D dog/man/bite space]


Vector Space Representation
with binary weights

• What span(s) of text does this vector represent?

[figure: a single binary vector plotted in the 3-D dog/man/bite space]


Vector Space Representation
with binary weights

• What span(s) of text does this vector represent?

[figure: a single binary vector plotted in the 3-D dog/man/bite space]


Vector Space Representation

• Any span of text is a vector in V-dimensional space, where V is the size of the vocabulary

[figure: doc1 "man bite dog" = [1,1,1], doc2 "man bite" = [0,1,1], and doc3 "dog bite" = [1,0,1] plotted in the dog/man/bite space]


Vector Space Representation

• A query is a vector in V-dimensional space, where V is the number of terms in the vocabulary

[figure: the query "man dog" = [1,1,0] and doc3 "dog bite" = [1,0,1] plotted in the dog/man/bite space]


Vector Space Similarity

• The vector space model ranks documents based on the vector-space similarity between the query vector and the document vector
• There are many ways to compute the similarity
between two vectors
• One way is to compute the inner product

$\sum_{i=1}^{V} x_i \times y_i$



The Inner Product
• Multiply corresponding components and then sum those products

$\sum_{i=1}^{V} x_i \times y_i$

            x_i   y_i   x_i × y_i
a            1     1        1
aardvark     0     1        0
abacus       1     1        1
abba         1     0        0
able         0     1        0
 ::         ::    ::       ::
zoom         0     0        0
                 inner product => 2

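A small Python sketch of the inner product described above: multiply corresponding components and sum the products. With binary vectors this simply counts the terms the two spans of text share; the example vectors are illustrative.

def inner_product(x, y):
    # Multiply corresponding components and sum the products.
    return sum(xi * yi for xi, yi in zip(x, y))

query = [1, 1, 0]   # e.g. "man dog" over the vocabulary [dog, man, bite]
doc   = [1, 0, 1]   # e.g. "dog bite"
print(inner_product(query, doc))   # 1: one term in common ("dog")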


The Inner Product
• When using 0’s and 1’s, this is just the number of terms in common between the query and the document

$\sum_{i=1}^{V} x_i \times y_i$

            x_i   y_i   x_i × y_i
a            1     1        1
aardvark     0     1        0
abacus       1     1        1
abba         1     0        0
able         0     1        0
 ::         ::    ::       ::
zoom         0     0        0
                 inner product => 2



The Inner Product

• 1 = the term appears at least once

• 0 = the term does not appear


        dog   man   bite
doc_1    1     1     1
doc_2    1     0     1
doc_3    0     1     1
doc_4    0     1     0

[figure: "man" = [0, 1, 0], "dog bite man" = [1, 1, 1], "man bite" = [0, 1, 1], and "dog bite" = [1, 0, 1] plotted in the dog/man/bite space]


The Inner Product

• Multiply corresponding components and then sum those products
• Using a binary representation, the inner product
corresponds to the number of terms appearing (at least
once) in both spans of text
• Scoring documents based on their inner-product with
the query has one major issue. Any ideas?



The Inner Product
• What is more relevant to a query?

‣ A 50-word document which contains 3 of the query-terms?
‣ A 100-word document which contains 3 of the query-terms?
• The inner-product doesn’t account for the fact that
documents have widely varying lengths
• All things being equal, longer documents are more
likely to have the query-terms
• So, the inner-product favors long documents



The Cosine Similarity

• The numerator is the inner product

• The denominator is the product of the two vector-lengths

• Ranges from 0 to 1 (equals 1 if the vectors are identical)

$\frac{\sum_{i=1}^{V} x_i \times y_i}{\sqrt{\sum_{i=1}^{V} x_i^2} \times \sqrt{\sum_{i=1}^{V} y_i^2}}$

(the denominator is the length of vector x times the length of vector y)
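A Python sketch of the cosine similarity just defined: the inner product divided by the product of the two vector lengths. This is a straightforward transcription of the formula, not code from the slides.

import math

def cosine(x, y):
    dot   = sum(xi * yi for xi, yi in zip(x, y))
    len_x = math.sqrt(sum(xi ** 2 for xi in x))
    len_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (len_x * len_y) if len_x and len_y else 0.0

print(cosine([1, 0, 1], [1, 1, 0]))   # 0.5, matching the worked example later in the deck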


In Class Exercise

$\frac{\sum_{i=1}^{V} x_i \times y_i}{\sqrt{\sum_{i=1}^{V} x_i^2} \times \sqrt{\sum_{i=1}^{V} y_i^2}}$

• For each document, compute the inner-product and cosine similarity score for the query: Jill

doc_1 Jack and Jill went up the hill


doc_2 To fetch a pail of water.
doc_3 Jack fell down and broke his crown,
doc_4 And Jill came tumbling after.
doc_5 Up Jack got, and home did trot,
doc_6 As fast as he could caper,
doc_7 To old Dame Dob, who patched his nob
doc_8 With vinegar and brown paper.



In Class Exercise

$\frac{\sum_{i=1}^{V} x_i \times y_i}{\sqrt{\sum_{i=1}^{V} x_i^2} \times \sqrt{\sum_{i=1}^{V} y_i^2}}$

• For each document, compute the inner-product and cosine similarity score for the query: Jack

doc_1 Jack and Jill went up the hill


doc_2 To fetch a pail of water.
doc_3 Jack fell down and broke his crown,
doc_4 And Jill came tumbling after.
doc_5 Up Jack got, and home did trot,
doc_6 As fast as he could caper,
doc_7 To old Dame Dob, who patched his nob
doc_8 With vinegar and brown paper.

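One way to check the exercise is to script it. The sketch below uses binary vectors over each line of the rhyme; the tokenization (lowercasing, stripping punctuation) is an assumption, since the slides don't specify one.

import math, re

docs = {
    "doc_1": "Jack and Jill went up the hill",
    "doc_2": "To fetch a pail of water.",
    "doc_3": "Jack fell down and broke his crown,",
    "doc_4": "And Jill came tumbling after.",
    "doc_5": "Up Jack got, and home did trot,",
    "doc_6": "As fast as he could caper,",
    "doc_7": "To old Dame Dob, who patched his nob",
    "doc_8": "With vinegar and brown paper.",
}

def terms(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def scores(query, doc):
    q, d = terms(query), terms(doc)
    inner = len(q & d)                                 # binary inner product
    denom = math.sqrt(len(q)) * math.sqrt(len(d))      # product of vector lengths
    return inner, (inner / denom if denom else 0.0)

for name, text in docs.items():
    print(name, scores("Jill", text))                  # swap in "Jack" for the second exercise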


Vector Space Representation
         a   aardvark   abacus   abba   able   ...   zoom
doc_1    1      0          0       0      0    ...     1
doc_2    0      0          0       0      1    ...     1
 ::     ::     ::         ::      ::     ::    ...    ::
doc_m    0      0          1       1      0    ...     0

         a   aardvark   abacus   abba   able   ...   zoom
query    0      1          0       0      1    ...     1

• So far, we’ve assumed binary vectors

• 0’s and 1’s indicate whether the term occurs (at least
once) in the document/query
• Let’s explore a more sophisticated representation



Term-Weighting
what are the most important terms?

• Movie: Rocky (1976)

• Plot:
Rocky Balboa is a struggling boxer trying to make the big time. Working in a meat factory in Philadelphia for a
pittance, he also earns extra cash as a debt collector. When heavyweight champion Apollo Creed visits
Philadelphia, his managers want to set up an exhibition match between Creed and a struggling boxer, touting the
fight as a chance for a "nobody" to become a "somebody". The match is supposed to be easily won by Creed, but
someone forgot to tell Rocky, who sees this as his only shot at the big time. Rocky Balboa is a small-time boxer
who lives in an apartment in Philadelphia, Pennsylvania, and his career has so far not gotten off the canvas. Rocky
earns a living by collecting debts for a loan shark named Gazzo, but Gazzo doesn't think Rocky has the
viciousness it takes to beat up deadbeats. Rocky still boxes every once in a while to keep his boxing skills sharp,
and his ex-trainer, Mickey, believes he could've made it to the top if he was willing to work for it. Rocky, goes to a
pet store that sells pet supplies, and this is where he meets a young woman named Adrian, who is extremely shy,
with no ability to talk to men. Rocky befriends her. Adrain later surprised Rocky with a dog from the pet shop that
Rocky had befriended. Adrian's brother Paulie, who works for a meat packing company, is thrilled that someone
has become interested in Adrian, and Adrian spends Thanksgiving with Rocky. Later, they go to Rocky's apartment,
where Adrian explains that she has never been in a man's apartment before. Rocky sets her mind at ease, and they
become lovers. Current world heavyweight boxing champion Apollo Creed comes up with the idea of giving an
unknown a shot at the title. Apollo checks out the Philadelphia boxing scene, and chooses Rocky. Fight promoter
Jergens gets things in gear, and Rocky starts training with Mickey. After a lot of training, Rocky is ready for the
match, and he wants to prove that he can go the distance with Apollo. The 'Italian Stallion', Rocky Balboa, is an
aspiring boxer in downtown Philadelphia. His one chance to make a better life for himself is through his boxing
and Adrian, a girl who works in the local pet store. Through a publicity stunt, Rocky is set up to fight Apollo Creed,
the current heavyweight champion who is already set to win. But Rocky really needs to triumph, against all the
odds...


Term-Frequency
how important is a term?
rank   term      freq.      rank   term           freq.
1      a          22        16     creed            5
2      rocky      19        17     philadelphia     5
3      to         18        18     has              4
4      the        17        19     pet              4
5      is         11        20     boxing           4
6      and        10        21     up               4
7      in         10        22     an               4
8      for         7        23     boxer            4
9      his         7        24     s                3
10     he          6        25     balboa           3
11     adrian      6        26     it               3
12     with        6        27     heavyweight      3
13     who         6        28     champion         3
14     that        5        29     fight            3
15     apollo      5        30     become           3




Inverse Document Frequency (IDF)
how important is a term?

$idf_t = \log\left(\frac{N}{df_t}\right)$

• N = number of documents in the collection

• df_t = number of documents in which term t appears

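A tiny Python sketch of this IDF formula. The log base isn't stated on the slide; the numbers that appear later in the deck are consistent with the natural log, which is assumed here.

import math

def idf(df_t, N):
    # Inverse document frequency: log(N / df_t)
    return math.log(N / df_t)

# Using the collection statistics shown later in the deck (N = 230721, df('rocky') = 1420):
print(round(idf(1420, 230721), 2))   # 5.09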


Inverse Document Frequency (IDF)
how important is a term?
rank   term          idf        rank   term            idf
1      doesn         11.66      16     creed           6.84
2      adrain        10.96      17     paulie          6.82
3      viciousness    9.95      18     packing         6.81
4      deadbeats      9.86      19     boxes           6.75
5      touting        9.64      20     forgot          6.72
6      jergens        9.35      21     ease            6.53
7      gazzo          9.21      22     thanksgiving    6.52
8      pittance       9.05      23     earns           6.51
9      balboa         8.61      24     pennsylvania    6.50
10     heavyweight    7.18      25     promoter        6.43
11     stallion       7.17      26     befriended      6.38
12     canvas         7.10      27     exhibition      6.31
13     ve             6.96      28     collecting      6.23
14     managers       6.88      29     philadelphia    6.19
15     apollo         6.84      30     gear            6.18


TF.IDF
how important is a term?

$tf_t \times idf_t$

(tf_t is greater when the term is frequent in the document; idf_t is greater when the term is rare in the collection, i.e., does not appear in many documents)
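A short Python sketch of tf.idf weighting for the terms of one document, assuming per-term document frequencies for the collection are available. The counts below are the 'rocky' and 'for' figures given later in the deck.

import math

def tf_idf(term_counts, df, N):
    # Weight each term by tf * log(N / df).
    return {t: tf * math.log(N / df[t]) for t, tf in term_counts.items()}

weights = tf_idf({"rocky": 19, "for": 7}, {"rocky": 1420, "for": 117137}, N=230721)
print({t: round(w, 2) for t, w in weights.items()})   # rocky ~96.72, for ~4.75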


TF.IDF
how important is a term?
rank   term          tf.idf     rank   term           tf.idf
1      rocky         96.72      16     meat           11.76
2      apollo        34.20      17     doesn          11.66
3      creed         34.18      18     adrain         10.96
4      philadelphia  30.95      19     fight          10.02
5      adrian        26.44      20     viciousness     9.95
6      balboa        25.83      21     deadbeats       9.86
7      boxing        22.37      22     touting         9.64
8      boxer         22.19      23     current         9.57
9      heavyweight   21.54      24     jergens         9.35
10     pet           21.17      25     s               9.29
11     gazzo         18.43      26     struggling      9.21
12     champion      15.08      27     training        9.17
13     match         13.96      28     pittance        9.05
14     earns         13.01      29     become          8.96
15     apartment     11.82      30     mickey          8.96


TF.IDF/Caricature Analogy

• TF.IDF: accentuates terms that are frequent in the document, but not frequent in general
• Caricature: exaggerates traits that are characteristic of
the person (compared to the average)


TF, IDF, or TF.IDF?

[figure: word cloud of the Rocky plot terms, weighted by one of TF, IDF, or TF.IDF]


TF, IDF, or TF.IDF?

[figure: word cloud of the Rocky plot terms, weighted by one of TF, IDF, or TF.IDF]


TF, IDF, or TF.IDF?
[figure: word cloud of the Rocky plot terms, weighted by one of TF, IDF, or TF.IDF]


Queries as TF.IDF Vectors

• Terms tend to appear only once in the query

• TF usually equals 1

• IDF is computed using the collection statistics

$idf_t = \log\left(\frac{N}{df_t}\right)$

• Terms appearing in fewer documents get a higher weight

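A sketch of turning a query into a tf.idf vector: tf is taken to be 1 per query term and idf comes from the collection statistics. The df values below are illustrative placeholders, not the actual IMDB counts.

import math

def query_vector(query_terms, df, N):
    # tf = 1 for each query term; weight = 1 * log(N / df_t)
    return {t: math.log(N / df[t]) for t in query_terms}

df = {"the": 200000, "rainmaker": 30}        # placeholder document frequencies
print(query_vector(["the", "rainmaker"], df, N=230721))
# a rare term like 'rainmaker' gets a much higher weight than 'the'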


Queries as TF.IDF Vectors
examples from AOL queries with clicks on IMDB results
term 1 tf.idf term 2 tf.idf term 3 tf.idf
central 4.89 casting 6.05 ny 5.99
wizard 6.04 of 0.18 oz 6.14
sam 2.80 jones 3.15 iii 2.26
film 2.31 technical 6.34 advisors 8.74
edie 7.41 sands 5.88 singer 3.88
high 3.09 fidelity 7.66 quotes 8.11
quotes 8.11 about 1.61 brides 6.71
title 4.71 wave 5.68 pics 10.96
saw 4.87 3 2.43 trailers 7.83
the 0.03 rainmaker 9.09 movie 0.00
nancy 5.50 and 0.09 sluggo 9.46
audrey 6.30 rose 4.52 movie 0.00
mark 2.43 sway 7.53 photo 5.14
piece 4.59 of 0.18 cheese 6.38
date 3.93 movie 0.00 cast 0.00



Putting Everything Together

• Rank documents based on cosine similarity to the query

[figure: a query vector and two document vectors (doc_1, doc_2) plotted in the man/bite plane]


Vector Space Model
another cosine similarity example (binary weights)

$\frac{\sum_{i=1}^{V} x_i \times y_i}{\sqrt{\sum_{i=1}^{V} x_i^2} \times \sqrt{\sum_{i=1}^{V} y_i^2}}$

cosine( [1,0,1] , [1,1,0] ) =

$\frac{(1 \times 1) + (0 \times 1) + (1 \times 0)}{\sqrt{1^2 + 0^2 + 1^2} \times \sqrt{1^2 + 1^2 + 0^2}} = \frac{1}{\sqrt{2} \times \sqrt{2}} = 0.5$


Independence Assumption
• The basis vectors (X, Y, Z) are linearly independent
because knowing a vector’s value on one dimension
doesn’t say anything about its value along another
dimension
• Does this hold true for natural language text?

[figure: basis vectors for 3-dimensional space, labeled Y = man, X = bite, Z = dog]


Mutual Information
IMDB Corpus
• If this were true, what would these mutual information
values be?
w1 w2 MI w1 w2 MI
francisco san ? dollars million ?
angeles los ? brooke rick ?
prime minister ? teach lesson ?
united states ? canada canadian ?
9 11 ? un ma ?
winning award ? nicole roman ?
brooke taylor ? china chinese ?
con un ? japan japanese ?
un la ? belle roman ?
belle nicole ? border mexican ?



Mutual Information
IMDB Corpus
• If terms really were independent, these mutual information values should be zero!

w1 w2 MI w1 w2 MI
francisco san 6.619 dollars million 5.437
angeles los 6.282 brooke rick 5.405
prime minister 5.976 teach lesson 5.370
united states 5.765 canada canadian 5.338
9 11 5.639 un ma 5.334
winning award 5.597 nicole roman 5.255
brooke taylor 5.518 china chinese 5.231
con un 5.514 japan japanese 5.204
un la 5.512 belle roman 5.202
belle nicole 5.508 border mexican 5.186

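The slides don't give the estimator behind these numbers, so the sketch below is a hedged illustration of one common formulation, pointwise mutual information from document co-occurrence counts: independent words score near zero, while associated pairs like 'san'/'francisco' score well above it.

import math

def pmi(n_w1, n_w2, n_both, n_docs):
    # log of the observed co-occurrence probability over the probability expected under independence
    p_w1, p_w2, p_both = n_w1 / n_docs, n_w2 / n_docs, n_both / n_docs
    return math.log(p_both / (p_w1 * p_w2))

# Hypothetical counts: two words that almost always occur together
print(pmi(n_w1=50, n_w2=60, n_both=45, n_docs=100000))     # strongly positive
# Two words whose co-occurrence matches chance: PMI ~ 0
print(pmi(n_w1=1000, n_w2=2000, n_both=20, n_docs=100000))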


Independence Assumption

• The vector space model assumes that terms are independent
• The fact that one occurs says nothing about another one occurring
• This is viewed as a limitation
• However, the implications of this limitation are still debated
• A very popular solution


TF.IDF
$tf_t \times \log\left(\frac{N}{df_t}\right)$

term tf N df idf tf.idf


rocky 19 230721 1420 5.09 96.72
philadelphia 5 230721 473 6.19 30.95
boxer 4 230721 900 5.55 22.19
fight 3 230721 8170 3.34 10.02
mickey 2 230721 2621 4.48 8.96
for 7 230721 117137 0.68 4.75



TF.IDF

• Many variants of this formula have been proposed

• However, they all have two components in common:

‣ TF: favors terms that are frequent in the document
‣ IDF: favors terms that do not occur in many documents

$tf_t \times \log\left(\frac{N}{df_t}\right)$



Sub-linear TF Scaling

• Suppose ‘rocky’ occurs twice in document A and once in document B
• Is A twice as much about rocky as B?

• Suppose ‘rocky’ occurs 20 times in document A and 10 times in document B
• Is A twice as much about rocky as B?



Sub-linear TF Scaling

• It turns out that IR systems are more effective when they assume this is not the case



Sub-linear TF Scaling

• Assumption:

‣ A document that contains ‘rocky’ 5 times is more about rocky than one that contains ‘rocky’ 1 time
‣ How much more?
‣ Roughly, 5 times more
‣ A document that contains ‘rocky’ 50 times is more about rocky than one that contains ‘rocky’ 10 times
‣ How much more?
‣ Not 5 times more. Less.


Sub-linear TF Scaling
[figure: plot of y = x versus y = 1 + log(x); the logarithmic curve grows much more slowly]


TF.IDF
what are the most important terms?
$(1 + \log(tf_t)) \times \log\left(\frac{N}{df_t}\right)$

term           tf   1+log(tf)   N        df       idf   tf.idf
rocky          19     3.94      230721   1420     5.09   20.08
philadelphia    5     2.61      230721   473      6.19   16.15
boxer           4     2.39      230721   900      5.55   13.24
fight           3     2.10      230721   8170     3.34    7.01
mickey          2     1.69      230721   2621     4.48    7.58
for             7     2.95      230721   117137   0.68    2.00

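A sketch of the sub-linear variant in Python, again assuming natural logs (the slide's numbers are consistent with that). Compare the damped weight for 'rocky' with the linear-tf value from the earlier table.

import math

def sublinear_tf_idf(tf, df, N):
    # (1 + log(tf)) * log(N / df), with a zero weight for absent terms
    return (1 + math.log(tf)) * math.log(N / df) if tf > 0 else 0.0

print(round(sublinear_tf_idf(19, 1420, 230721), 2))   # ~20.08 for 'rocky' (vs 96.72 with linear tf)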


TF.IDF
what are the most important terms?

$tf_t \times \log\left(\frac{N}{df_t}\right)$   versus   $(1 + \log(tf_t)) \times \log\left(\frac{N}{df_t}\right)$
term tf.idf (linear tf) tf.idf (sub-linear tf)
rocky 96.72 20.08
philadelphia 30.95 16.15
boxer 22.19 13.24
fight 10.02 7.01
mickey 8.96 7.58
for 4.75 2.00



Remember Heaps’ Law?
• As we see more and more text, the rate at which we encounter new (previously unseen) words decreases



Remember Heaps’ Law?
• Put differently, as we see more text, it becomes more rare
to encounter previously unseen words
• This means that the text mentions the same words over
and over
• Once we see a word, we’re likely to see it again

• This may be a motivation for sub-linear TF scaling

• Explanations are good. But, IR is an empirical science

• This works in practice



Vector Space Model

• Any text can be seen as a vector in V-dimensional space

‣ a document
‣ a query
‣ a sentence
‣ a word
‣ an entire encyclopedia
• Rank documents based on their cosine similarity to query

• If a document is similar to the query, it is likely to be relevant (remember: topical relevance!)



Vector Space Representation

• A power tool!

• A lot of problems in IR can be cast as:

‣ Find me _____ that is similar to _____ !
• As long as _____ and _____ are associated with text, one potential solution is:
‣ represent these items as tf.idf term-weight vectors and compute their cosine similarity
‣ return the items with the highest similarity

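To make the "find me ___ similar to ___" recipe concrete, here is a small end-to-end Python sketch: represent every item as a tf.idf vector and rank by cosine similarity. The tokenization and the brute-force scoring loop are simplifications; a real system would use an inverted index.

import math
from collections import Counter

def tf_idf_vec(tokens, df, n_docs):
    # tf * log(N / df) for each term in this span of text
    return {t: tf * math.log(n_docs / df[t]) for t, tf in Counter(tokens).items() if t in df}

def cosine(x, y):
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx  = math.sqrt(sum(w * w for w in x.values()))
    ny  = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

docs      = ["dog bite man", "man bite", "dog bite"]
tokenized = [d.lower().split() for d in docs]
df        = Counter(t for toks in tokenized for t in set(toks))
doc_vecs  = [tf_idf_vec(toks, df, len(docs)) for toks in tokenized]
query_vec = tf_idf_vec("man dog".lower().split(), df, len(docs))

ranked = sorted(range(len(docs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
for i in ranked:
    print(docs[i], round(cosine(query_vec, doc_vecs[i]), 3))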


Vector Space Representation
• Find me documents that are similar to this query



Vector Space Representation
• Find me ads that are similar to these results



Vector Space Representation
• Find me queries that are similar to this query



Vector Space Representation
• Find me search engines that are similar to this query

news

shopping

images


Vector Space Representation
• Topic categorization: automatically assigning a
document to a category



Vector Space Representation
• Find me documents (with a known category assignment)
that are similar to this document



Vector Space Representation
• Find me documents (with a known category assignment)
that are similar to this document

computers
sports
politics



Vector Space Representation

So, does the vector space representation solve all problems?



Advertisement Placement
• Find me ads similar to this document



Summary

• Any text can be seen as a vector in V-dimensional space

‣ a document
‣ a query
‣ a sentence
‣ a word
‣ an entire encyclopedia
• Rank documents based on their cosine similarity to query

• If a document is similar to the query, it is likely to be relevant (remember: topical relevance!)

