Week 5 - Latent Semantic Indexing
4 EMBEDDING TECHNIQUES
1.4.1 Latent Semantic Indexing
Vector space retrieval is vague and noisy
– Based on index terms
– Unrelated documents might be included in the answer set
• apple (company) vs. apple (fruit)
– Relevant documents that share no index term with the query are not retrieved
• car vs. automobile
Observation
– The user information need is more related to concepts and
ideas than to index terms
The Problem
Vector space retrieval handles the following two situations poorly:
– Polysemy: the same term has different meanings (apple the company vs. apple the fruit), so unrelated documents end up in the answer set
– Synonymy: different terms express the same concept (car vs. automobile), so relevant documents are missed
Example: 3 documents
doc1         doc2         doc3
apple        apple        iOS
blackberry   blackberry   iPad
orange       smartphone   RIM
vitamine     orange       mobile
fruit        carrier      handy
health       tablet       telcom
tablet       swisscom     provider
Dimensionality reduction
– Retrieval (and clustering) in a reduced concept space might
be superior to retrieval in the high-dimensional space of
index terms
Using Concepts for Retrieval
[Figure: instead of associating documents d1-d4 directly with terms t1-t3, documents and terms are both associated with a small set of concepts c1, c2.]
Example: Concept Space
[Figure: the terms of the three example documents placed in a concept space with four concepts (fruit, health, device, telco); doc2 and doc3 lie near the concepts to which their terms belong.]
Similarity Computation in Concept Space
Concept represented by terms, e.g.
device = {iOS, iPad, RIM, mobile, handy,
tablet, apple, blackberry}
Document represented by a concept vector over (fruit, health, device, telco), counting
the number of its terms that belong to each concept, e.g.
doc1 = (4, 3, 3, 1)
doc3 = (0, 0, 5, 2)
Similarity computed by scalar product of normalized
concept vectors
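As a small illustration (a hedged sketch, assuming numpy; the vectors are just the counts from the example above), the similarity of doc1 and doc3 in the concept space can be computed as follows:

import numpy as np

# Concept order: (fruit, health, device, telco); counts from the example above
doc1 = np.array([4.0, 3.0, 3.0, 1.0])
doc3 = np.array([0.0, 0.0, 5.0, 2.0])

# Scalar product of the normalized concept vectors (cosine similarity)
sim = np.dot(doc1, doc3) / (np.linalg.norm(doc1) * np.linalg.norm(doc3))
print(sim)  # > 0 although doc1 and doc3 share no index terms

The two documents get a non-zero similarity even though they share no index terms, which is exactly what the concept representation is meant to achieve.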
Result: concept vectors with components (fruit, health, device, telco)
Computing the Ranking Using M
Let M be the m x n term-document matrix (m terms as rows, n documents as columns) and q a query vector over the m terms. All scalar products between the query and the documents are obtained at once:
Mt . q = (query . doc1, ..., query . docn)t
i.e., the i-th component of Mt . q is the score of document i (for example, query . doc6 for doc6).
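A minimal sketch of this ranking step (the matrix and query below are toy placeholders, not the slide's data):

import numpy as np

# Toy term-document matrix M: m terms (rows) x n documents (columns),
# with columns normalized so that Mt.q gives cosine-like scores
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])
M = M / np.linalg.norm(M, axis=0)

q = np.array([1.0, 0.0, 1.0])           # query over the same m terms
q = q / np.linalg.norm(q)

scores = M.T @ q                        # Mt.q: one scalar product per document
print(np.argsort(-scores))              # documents ranked by decreasing score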
In vector space retrieval each row of the matrix
M corresponds to
A. A document
B. A concept
C. A query
D. A term
Identifying Top Concepts
Key Idea: extract the essential features of Mt and
approximate it by the most important ones
[Figure: Mt is transformed into a low-rank approximation that keeps only the top concepts.]
Construction of SVD
The term-document matrix M is decomposed as M = K.S.Dt, where
– K is the matrix of eigenvectors derived from M.Mt
– D is the matrix of eigenvectors derived from Mt.M
– S is the diagonal matrix of singular values, the square roots of the common non-zero eigenvalues of M.Mt and Mt.M
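A small sketch (toy matrix, assuming numpy) that checks these properties: numpy.linalg.svd returns K, the singular values, and Dt, and the squared singular values coincide with the eigenvalues of M.Mt:

import numpy as np

# Toy m x n term-document matrix (m = 3 terms, n = 4 documents)
M = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 2.0]])

K, sv, Dt = np.linalg.svd(M, full_matrices=False)
S = np.diag(sv)                          # diagonal matrix of singular values

assert np.allclose(M, K @ S @ Dt)        # M = K.S.Dt

# The squared singular values are the eigenvalues of M.Mt (and of Mt.M)
eig = np.sort(np.linalg.eigvalsh(M @ M.T))[::-1]
assert np.allclose(eig, sv**2)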
Interpretation of SVD
We can write M = K.S.Dt as a sum of outer vector products:
M = Σ_{i=1}^{r} s_i k_i d_i^t        (r = rank of M)
[Figure: M (m x n) = K (m x r) . S (r x r) . Dt (r x n), assuming m ≤ n; the rows of K represent the terms in the concept space, and the columns of Dt represent the documents in the concept space.]
Illustration of SVD – Another Perspective
[Figure: M (m x n) = K (m x r) . S (r x r) . Dt (r x n), assuming m ≤ n; the columns of K represent the concepts in the term space, and the rows of Dt represent the concepts in the document space.]
Latent Semantic Indexing (LSI)
In the matrix S, select only the s largest singular values
– Keep the corresponding columns in K and D
The resulting matrix is called Ms and is given by
– Ms = Ks.Ss.Dst, where s < r is the dimensionality of the concept space (a sketch of this construction follows below)
The parameter s should be
– large enough to allow fitting the characteristics of
the data
– small enough to filter out the non-relevant
representational details
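A hedged sketch of the truncation step (toy random counts; numpy assumed):

import numpy as np

rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(6, 8)).astype(float)   # toy m x n term-document counts

K, sv, Dt = np.linalg.svd(M, full_matrices=False)

s = 2                                             # dimensionality of the concept space, s < r
Ks, Ss, Dst = K[:, :s], np.diag(sv[:s]), Dt[:s, :]

Ms = Ks @ Ss @ Dst                                # Ms = Ks.Ss.Dst, the rank-s approximation of M
print(np.linalg.norm(M - Ms))                     # approximation error shrinks as s grows towards r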
Illustration of Latent Semantic Indexing
[Figure: Ms (m x n) = Ks (m x s) . Ss (s x s) . Dst (s x n); the rows of Ks are the term vectors and the columns of Dst are the document vectors in the reduced concept space.]
Answering Queries
Documents can be compared by computing cosine
similarity in the concept space, i.e., comparing their
columns (Dst)i and (Dst)j in matrix Dst
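For illustration (a sketch on a toy matrix, assuming numpy), two documents are compared through the corresponding columns of Dst:

import numpy as np

# Toy term-document matrix (3 terms x 4 documents)
M = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 2.0]])
_, sv, Dt = np.linalg.svd(M, full_matrices=False)
Dst = Dt[:2, :]                                   # s = 2: one s-dimensional column per document

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(Dst[:, 0], Dst[:, 1]))               # similarity of doc 1 and doc 2 in concept space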
Mapping Queries
Mapping of M to D
M = K.S.Dt
S-1.Kt.M = Dt (since Kt.K = I)
D = Mt.K.S-1
Apply same transformation to q:
q* = qt.Ks.Ss-1
Then compare the transformed query with the document vectors using the standard
cosine measure:
sim(q*, di) = (q* . (Dst)i) / (|q*| |(Dst)i|)
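A minimal sketch of the whole query mapping (toy matrix; note that Ss-1 is simply the inverse of the diagonal matrix of the kept singular values):

import numpy as np

# Toy term-document matrix: 4 terms (rows) x 5 documents (columns)
M = np.array([[2.0, 0.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0, 1.0],
              [0.0, 2.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 2.0, 1.0]])

K, sv, Dt = np.linalg.svd(M, full_matrices=False)
s = 2
Ks, Ss_inv, Dst = K[:, :s], np.diag(1.0 / sv[:s]), Dt[:s, :]

q = np.array([1.0, 1.0, 0.0, 0.0])        # toy query using the first two terms
q_star = q @ Ks @ Ss_inv                  # q* = qt.Ks.Ss-1, an s-dimensional vector

# Cosine similarity of q* with every document column of Dst
scores = (q_star @ Dst) / (np.linalg.norm(q_star) * np.linalg.norm(Dst, axis=0))
print(np.argsort(-scores))                # documents ranked by decreasing similarity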
Illustration of LSI Querying
[Figure: the query q (over the m terms) is mapped to the s-dimensional vector q* and compared against the document vectors, i.e., the columns of Dst.]
Example: Documents
B1 A Course on Integral Equations
B2 Attractors for Semigroups and Evolution Equations
B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application
B4 Geometrical Aspects of Partial Differential Equations
B5 Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra
B6 Introduction to Hamiltonian Dynamical Systems and the N-Body Problem
B7 Knapsack Problems: Algorithms and Computer Implementations
B8 Methods of Solving Singular Systems of Ordinary Differential Equations
B9 Nonlinear Systems
B10 Ordinary Differential Equations
B11 Oscillation Theory for Neutral Differential Equations with Delay
B12 Oscillation Theory of Delay Differential Equations
B13 Pseudodifferential Operators and Nonlinear Partial Differential Equations
B14 Sinc Methods for Quadrature and Differential Equations
B15 Stability of Stochastic Differential Equations with Respect to Semi-Martingales
B16 The Boundary Integral Approach to Static and Dynamic Contact Problems
B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory
Implementation in Python
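The original code from the slide is not reproduced in this text; the following is a minimal sketch of what such an implementation could look like (using scikit-learn's CountVectorizer to build the term-document matrix is an assumption, not necessarily the tooling used in the lecture):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "A Course on Integral Equations",
    "Attractors for Semigroups and Evolution Equations",
    "Automatic Differentiation of Algorithms: Theory, Implementation, and Application",
    "Geometrical Aspects of Partial Differential Equations",
    "Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra",
    "Introduction to Hamiltonian Dynamical Systems and the N-Body Problem",
    "Knapsack Problems: Algorithms and Computer Implementations",
    "Methods of Solving Singular Systems of Ordinary Differential Equations",
    "Nonlinear Systems",
    "Ordinary Differential Equations",
    "Oscillation Theory for Neutral Differential Equations with Delay",
    "Oscillation Theory of Delay Differential Equations",
    "Pseudodifferential Operators and Nonlinear Partial Differential Equations",
    "Sinc Methods for Quadrature and Differential Equations",
    "Stability of Stochastic Differential Equations with Respect to Semi-Martingales",
    "The Boundary Integral Approach to Static and Dynamic Contact Problems",
    "The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory",
]

# Term-document matrix M: terms as rows, the 17 titles as columns
vec = CountVectorizer(stop_words="english")
M = vec.fit_transform(titles).T.toarray().astype(float)

K, sv, Dt = np.linalg.svd(M, full_matrices=False)
s = 2                                         # keep the two largest singular values
Ks, Ss, Dst = K[:, :s], np.diag(sv[:s]), Dt[:s, :]

# 2-d coordinates of terms (rows of Ks.Ss) and of documents (columns of Ss.Dst),
# e.g., for plotting both in the concept space
term_coords = Ks @ Ss
doc_coords = (Ss @ Dst).T
print(doc_coords.round(2))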
Results (s=2)
Plot of Terms and Documents in 2-d Concept Space
When applying SVD to a term-document matrix M, each
concept is represented in K
A. as a singular value
B. as a linear combination of terms of the vocabulary
C. as a linear combination of documents in the document
collection
D. as a least squares approximation of the matrix M
The number of term vectors in the matrix Ks
used for LSI
A. Is smaller than the number of rows in the matrix M
B. Is the same as the number of rows in the matrix M
C. Is larger than the number of rows in the matrix M
A query transformed into the concept space for
LSI has …
A. s components (number of singular values)
B. m components (size of vocabulary)
C. n components (number of documents)
Discussion of Latent Semantic Indexing
Latent semantic indexing provides an interesting
conceptualization of the IR problem
Advantages
– It allows reducing the complexity of the underlying
concept representation
– Facilitates interfacing with the user
Disadvantages
– Computationally expensive
– Lacks a well-founded statistical interpretation
Alternative Techniques
Probabilistic Latent Semantic Analysis
– Based on Bayesian Networks
A topic is a mixture (distribution) over words; a document is a mixture over topics.
[Figure: two topics and two documents. TOPIC 1 puts high probability on words such as money, loan, bank; TOPIC 2 on words such as river, stream, bank. DOCUMENT 1 is generated mostly from TOPIC 1, DOCUMENT 2 mostly from TOPIC 2, and each word occurrence is labelled with the topic it was drawn from.]
Document Generation using a
Probabilistic Process
For each document, choose a mixture of topics.
For every word position, sample a topic from the document's topic mixture, then
sample a word from the chosen topic.
[Figure: generative process, from the topic mixture at the top down to one sampled word per position.]
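A tiny sketch of this generative process (the vocabulary, topic-word probabilities, and Dirichlet prior below are all made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["money", "loan", "bank", "river", "stream"]

# Two topics: each topic is a probability distribution over the vocabulary (made-up values)
topics = np.array([[0.35, 0.30, 0.30, 0.03, 0.02],   # a "finance" topic
                   [0.02, 0.03, 0.30, 0.35, 0.30]])  # a "water" topic

def generate_document(n_words=16, alpha=(1.0, 1.0)):
    theta = rng.dirichlet(alpha)                 # choose the document's topic mixture
    words = []
    for _ in range(n_words):                     # for every word position:
        z = rng.choice(len(topics), p=theta)     #   pick a topic from the mixture
        w = rng.choice(len(vocab), p=topics[z])  #   sample a word from that topic
        words.append(vocab[w])
    return words

print(" ".join(generate_document()))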
LDA: Topic Identification
Approach: invert the generative process, i.e., given only a document
collection, reconstruct the topic model that could have generated it
[Figure: the same two documents, but now the topic of every word occurrence (money?, bank?, river?, stream?, ...) and the topic distributions themselves are unknown and must be inferred from the collection.]
Latent Dirichlet Allocation
Use of Topic Models
Unsupervised Learning of topics
– Understanding the main topics of a document collection
– Organizing the document collection
Use for document retrieval: use topic vectors instead of term vectors
to represent documents and queries (see the sketch below)
Document classification (Supervised Learning): use topics as features
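A hedged sketch of this idea with scikit-learn (the toy corpus and the number of topics are made up; LatentDirichletAllocation is one possible implementation, not necessarily the one used in the course):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the bank approved the loan and transferred the money",
        "the river bank was flooded by the rising stream",
        "money and loans are handled by the bank",
        "we walked along the stream to the river bank"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)                         # document-term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                   # one dense topic vector per document

# A query is mapped the same way and can be compared to the document topic vectors
q_topics = lda.transform(vec.transform(["loan from the bank"]))
print(doc_topics.round(2), q_topics.round(2))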
Summary
[Figure: a sparse 0/1 term-document representation contrasted with a dense document-topic representation.]
The VS model maps documents to sparse term vectors (one dimension per term in the vocabulary).
A topic model maps documents to dense vectors (one dimension per topic); each topic corresponds to a distribution over terms.