
Week 5 - Latent Semantic Indexing

The document discusses embedding techniques, particularly focusing on Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) for improving information retrieval. LSI addresses issues of synonymy and homonymy by mapping documents and queries into a lower-dimensional concept space, while LDA uses a probabilistic model to identify topics within a document collection. Both methods aim to enhance document representation and retrieval efficiency by reducing dimensionality and improving concept identification.

1.4 EMBEDDING TECHNIQUES

Introduction - 1
1.4.1 Latent Semantic Indexing
Vector space retrieval is vague and noisy
– Based on index terms
– Unrelated documents might be included in the answer set
  • apple (company) vs. apple (fruit)
– Relevant documents that do not contain at least one index term are not retrieved
  • car vs. automobile

Observation
– The user's information need is more related to concepts and ideas than to index terms

Introduction - 2
The Problem
Vector space retrieval handles the following two situations poorly:

1. Synonymy: different terms refer to the same concept,
   e.g. car and automobile
   – Result: poor recall
2. Homonymy: the same term may have different meanings,
   e.g. apple, model, bank
   – Result: poor precision

Introduction - 3
Example: 3 documents

doc1: apple, blackberry, orange, vitamine, fruit, health, tablet
doc2: apple, blackberry, smartphone, orange, carrier, tablet, swisscom
doc3: iOS, iPad, RIM, mobile, handy, telcom, provider

doc1 vs. doc2: high similarity    doc2 vs. doc3: no similarity


Introduction - 4
Key Idea
Map documents and queries into a lower-dimensional space composed of higher-level concepts
– Each concept is represented by a combination of terms
– There are fewer concepts than terms
– e.g. vehicle = {car, automobile, wheels, auto, sportscar}

Dimensionality reduction
– Retrieval (and clustering) in a reduced concept space might be superior to retrieval in the high-dimensional space of index terms

Introduction - 5
Using Concepts for Retrieval

[Figure: terms t1–t3 linked directly to documents d1–d4 (left) vs. linked indirectly through concepts c1 and c2 (right)]

Introduction - 6
Example: Concept Space

[Figure: the terms of doc1, doc2 and doc3 grouped under four concepts in the concept space: fruit, health, device, telco]

Introduction - 7
Similarity Computation in Concept Space
Concept represented by terms, e.g.
  device = {iOS, iPad, RIM, mobile, handy, tablet, apple, blackberry}
Document represented by a concept vector, counting the number of concept terms, e.g.
  doc1 = (4, 3, 3, 1)
  doc3 = (0, 0, 5, 2)
Similarity computed as the scalar product of the normalized concept vectors

Introduction - 8
Result
Concept vector (fruit, health, device, telco)

doc1 = (4, 3, 3, 1)    doc2 = (3, 1, 3, 3)    doc3 = (0, 0, 5, 2)

doc1: apple, blackberry, orange, vitamine, fruit, health, tablet
doc2: apple, blackberry, smartphone, orange, carrier, tablet, swisscom
doc3: iOS, iPad, RIM, mobile, handy, telcom, provider

Similarity(doc1, doc2) = 0.245    Similarity(doc2, doc3) = 0.3    Similarity(doc1, doc3) = 0.22
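The slides do not state which normalization is used; a minimal Python sketch, assuming the concept vectors are normalized by the sum of their components (L1 normalization), which reproduces the similarity values above:

import numpy as np

# Concept vectors over (fruit, health, device, telco), counting concept terms per document
docs = {
    "doc1": np.array([4, 3, 3, 1], dtype=float),
    "doc2": np.array([3, 1, 3, 3], dtype=float),
    "doc3": np.array([0, 0, 5, 2], dtype=float),
}

def similarity(a, b):
    # Scalar product of the vectors normalized by their component sums (assumption: L1 normalization)
    return np.dot(a / a.sum(), b / b.sum())

print(round(similarity(docs["doc1"], docs["doc2"]), 3))  # 0.245
print(round(similarity(docs["doc2"], docs["doc3"]), 3))  # 0.3
print(round(similarity(docs["doc1"], docs["doc3"]), 3))  # 0.221
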
Introduction - 9
Basic Definitions
Problem: how to identify and compute "concepts"?

Consider the term-document matrix
– Let M = (Mij) be a term-document matrix with m rows (terms) and n columns (documents)
– To each element of this matrix is assigned a weight wij associated with term ti and document dj
– The weight wij can be based on a tf-idf weighting scheme

Introduction - 10
Computing the Ranking Using M

[Figure: the ranking is obtained by the scalar products Mt·q; multiplying the transposed term-document matrix (n x m) with the query vector q (m x 1) yields an n-vector whose i-th component is the scalar product query·doci]
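A minimal numpy sketch of this ranking step; the toy matrix M and query q below are made-up values, not from the slides:

import numpy as np

# Toy term-document matrix: m terms (rows) x n documents (columns), hypothetical weights
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])
q = np.array([1.0, 0.0, 1.0])    # query vector over the same m terms

scores = M.T @ q                 # i-th entry = scalar product query . doc_i
ranking = np.argsort(-scores)    # document indices sorted by decreasing score
print(scores, ranking)
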

Introduction - 11
In vector space retrieval each row of the matrix M corresponds to
A. A document
B. A concept
C. A query
D. A term

Introduction - 12
Identifying Top Concepts
Key Idea: extract the essential features of Mt and approximate it by the most important ones ("top concepts")

[Figure: the transformation Mt maps the unit ball to a transformed, ellipsoid-shaped ball; the longest semi-axis corresponds to the top concept]

Introduction - 13
Singular Value Decomposition (SVD)
Represent matrix M as M = K.S.Dt
– K and D are matrices with orthonormal columns: Kt.K = I = Dt.D
– S is an r x r diagonal matrix of the singular values sorted in decreasing order, where r ≤ min(m, n) is the rank of M
– Such a decomposition always exists and is unique (up to sign)

Introduction - 14
Construction of SVD
K is the matrix of eigenvectors derived from M.Mt
D is the matrix of eigenvectors derived from Mt.M

Algorithms for constructing the SVD of an m x n matrix have complexity O(n³) if m ≤ n
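In practice the decomposition can be computed with numpy; a minimal sketch on a made-up term-document matrix (not from the slides):

import numpy as np

# Hypothetical term-document matrix: m terms (rows) x n documents (columns)
M = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 2.0]])

# full_matrices=False gives the "thin" SVD: K is m x r, S holds the r singular values, Dt is r x n
K, S, Dt = np.linalg.svd(M, full_matrices=False)

print(S)                                    # singular values, sorted in decreasing order
print(np.allclose(M, K @ np.diag(S) @ Dt))  # True: M = K.S.Dt
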

Introduction - 15
Interpretation of SVD
We can write M = K.S.Dt as a sum of outer vector products

  M = Σi=1..r si · ki · dit

The si are ordered in decreasing size

By taking only the largest ones we obtain a «good» approximation of M (least squares approximation)

The singular values si are the lengths of the semi-axes of the hyperellipsoid E defined by

  E = { M·x : ||x||2 = 1 }
Introduction - 16
Illustration of SVD

[Figure: M (m x n) = K (m x r) · S (r x r) · Dt (r x n), assuming m ≤ n. The row vectors of K represent the terms in concept space; the column vectors of Dt represent the documents in concept space.]
Introduction - 17
Illustration of SVD – Another Perspective

[Figure: the same decomposition M (m x n) = K (m x r) · S (r x r) · Dt (r x n), assuming m ≤ n, read the other way: the column vectors of K represent the concepts in term space; the row vectors of Dt represent the concepts in document space.]
Introduction - 18
Latent Semantic Indexing (LSI)
In the matrix S, select only the s largest singular values
– Keep the corresponding columns in K and D
The resulting matrix is called Ms and is given by
– Ms = Ks.Ss.Dst where s, s < r, is the dimensionality of the concept space
The parameter s should be
– large enough to allow fitting the characteristics of the data
– small enough to filter out the non-relevant representational details
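A minimal sketch of this truncation in numpy; the helper name lsi and the example matrix are ours, not from the slides:

import numpy as np

def lsi(M, s):
    """Return the rank-s factors Ks, Ss, Dst of the LSI approximation Ms = Ks.Ss.Dst."""
    K, S, Dt = np.linalg.svd(M, full_matrices=False)
    Ks = K[:, :s]          # keep the first s columns of K
    Ss = np.diag(S[:s])    # s x s diagonal matrix of the s largest singular values
    Dst = Dt[:s, :]        # keep the first s rows of Dt (i.e. columns of D)
    return Ks, Ss, Dst

# Example with a hypothetical 3 x 4 term-document matrix and s = 2
M = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 2.0]])
Ks, Ss, Dst = lsi(M, s=2)
Ms = Ks @ Ss @ Dst         # best rank-2 (least squares) approximation of M
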
Introduction - 19
Illustration of Latent Semantic Indexing

[Figure: Ms (m x n) = Ks (m x s) · Ss (s x s) · Dst (s x n). The rows of Ks are the term vectors and the columns of Dst are the document vectors in the s-dimensional concept space.]

Introduction - 20
Answering Queries
Documents can be compared by computing cosine similarity in the concept space, i.e., comparing their columns (Dst)i and (Dst)j of matrix Dst

A query q is treated like one further document
– it is added as an additional column to matrix M
– the same transformation is applied to this column as for mapping M to D

Introduction - 21
Mapping Queries
Mapping of M to D
  M = K.S.Dt
  S-1.Kt.M = Dt (since Kt.K = I)
  D = Mt.K.S-1
Apply the same transformation to q:
  q* = qt.Ks.Ss-1
Then compare the transformed vector with the document vectors using the standard cosine measure

  sim(q*, di) = q* · (Dst)i / ( |q*| · |(Dst)i| )
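A minimal, self-contained sketch of query folding and cosine ranking in the concept space, reusing the same made-up matrix and s = 2 (the query values are illustrative):

import numpy as np

# Same hypothetical 3 x 4 term-document matrix and s = 2 as in the previous sketches
M = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 2.0]])
K, S, Dt = np.linalg.svd(M, full_matrices=False)
s = 2
Ks, Ss, Dst = K[:, :s], np.diag(S[:s]), Dt[:s, :]

# Fold the query into the concept space: q* = qt.Ks.Ss^-1
q = np.array([1.0, 0.0, 1.0])             # hypothetical query over the same 3 terms
q_star = q @ Ks @ np.linalg.inv(Ss)

# Cosine similarity between q* and each document column of Dst
doc_vectors = Dst.T                        # one row per document in concept space
sims = doc_vectors @ q_star / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_star))
print(np.argsort(-sims))                   # documents ranked by decreasing similarity
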
Introduction - 22
Illustration of LSI Querying

[Figure: the query qt is added as an additional column to M and mapped to its concept-space representation q*, which is compared against the document vectors (columns of Dst) in Ms = Ks (m x s) · Ss (s x s) · Dst (s x n)]
Introduction - 23
Example: Documents
B1 A Course on Integral Equations
B2 Attractors for Semigroups and Evolution Equations
B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application
B4 Geometrical Aspects of Partial Differential Equations
B5 Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra
B6 Introduction to Hamiltonian Dynamical Systems and the N-Body Problem
B7 Knapsack Problems: Algorithms and Computer Implementations
B8 Methods of Solving Singular Systems of Ordinary Differential Equations
B9 Nonlinear Systems
B10 Ordinary Differential Equations
B11 Oscillation Theory for Neutral Differential Equations with Delay
B12 Oscillation Theory of Delay Differential Equations
B13 Pseudodifferential Operators and Nonlinear Partial Differential Equations
B14 Sinc Methods for Quadrature and Differential Equations
B15 Stability of Stochastic Differential Equations with Respect to Semi-Martingales
B16 The Boundary Integral Approach to Static and Dynamic Contact Problems
B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory

Introduction - 24
Implementation in Python
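The code on the original slide is not reproduced in this extract; a minimal sketch of what such an implementation could look like, assuming numpy and scikit-learn's CountVectorizer are used (the use of scikit-learn and all variable names are our assumptions):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "A Course on Integral Equations",
    "Attractors for Semigroups and Evolution Equations",
    "Automatic Differentiation of Algorithms: Theory, Implementation, and Application",
    # ... remaining titles B4-B17 from the previous slide
]

# Term-document matrix: terms as rows, documents as columns
vectorizer = CountVectorizer()
M = vectorizer.fit_transform(titles).T.toarray().astype(float)

# LSI with s = 2 concepts
K, S, Dt = np.linalg.svd(M, full_matrices=False)
s = 2
Ks, Ss, Dst = K[:, :s], np.diag(S[:s]), Dt[:s, :]

# 2-d concept-space coordinates of terms (rows of Ks) and documents (columns of Dst)
print(Ks[:5])
print(Dst[:, :5].T)
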

Introduction - 25
Results (s=2)

Introduction - 26
Plot of Terms and Documents in 2-d Concept Space

Introduction - 27
Applying SVD to a term-document matrix M. Each concept is represented in K
A. as a singular value
B. as a linear combination of terms of the vocabulary
C. as a linear combination of documents in the document collection
D. as a least squares approximation of the matrix M

Introduction - 28
The number of term vectors in the matrix Ks used for LSI
A. Is smaller than the number of rows in the matrix M
B. Is the same as the number of rows in the matrix M
C. Is larger than the number of rows in the matrix M

Introduction - 29
A query transformed into the concept space for LSI has …
A. s components (number of singular values)
B. m components (size of vocabulary)
C. n components (number of documents)

Introduction - 30
Discussion of Latent Semantic Indexing
Latent semantic indexing provides an interesting conceptualization of the IR problem
Advantages
– It allows reducing the complexity of the underlying concept representation
– Facilitates interfacing with the user
Disadvantages
– Computationally expensive
– Poor statistical explanation

Introduction - 31
Alternative Techniques
Probabilistic Latent Semantic Analysis
– Based on Bayesian Networks

Latent Dirichlet Allocation
– Based on the Dirichlet distribution
– State-of-the-art method for concept extraction

Same objective of creating a lower-dimensional concept space based on the term-document matrix
– Better explained mathematical foundation
– Better experimental results
Introduction - 32
1.4.2 Latent Dirichlet Allocation (LDA)
Idea: assume a document collection is (randomly) generated from a known set of topics (probabilistic generative model)

[Figure: a topic is a mixture of words, a document is a mixture of topics. TOPIC 1 = {money, loan, bank}, TOPIC 2 = {river, stream, bank}. DOCUMENT 1 mixes the topics with weights .8 / .2 (e.g. "money1 bank1 bank1 loan1 river2 stream2 bank1 ..."), DOCUMENT 2 with weights .3 / .7 (e.g. "river2 stream2 bank2 stream2 bank2 money1 loan1 ..."); the index on each word indicates the topic it was sampled from]

Introduction - 33
Document Generation using a Probabilistic Process
For each document, choose a mixture of topics
For every word position, sample a topic from the topic mixture
For every word position, sample a word from the chosen topic

[Figure: generative process: each document has its own topic mixture; each word position gets a topic sampled from that mixture and a word sampled from that topic]
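A minimal numpy sketch of this generative process; the vocabulary, topic-word distributions and Dirichlet parameter are made-up illustrative values:

import numpy as np

rng = np.random.default_rng(0)

vocabulary = ["money", "loan", "bank", "river", "stream"]
# Hypothetical topic-word distributions (each row sums to 1)
topics = np.array([[0.35, 0.25, 0.40, 0.00, 0.00],    # "finance" topic
                   [0.00, 0.00, 0.40, 0.30, 0.30]])   # "water" topic
alpha = np.array([0.5, 0.5])                          # Dirichlet prior on topic mixtures

def generate_document(n_words=20):
    theta = rng.dirichlet(alpha)                      # 1. choose a mixture of topics
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)          # 2. sample a topic for this word position
        w = rng.choice(len(vocabulary), p=topics[z])  # 3. sample a word from the chosen topic
        words.append(vocabulary[w])
    return words

print(" ".join(generate_document()))
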
Introduction - 34
LDA: Topic Identification
Approach: inverting the process: given a document collection, reconstruct the topic model

[Figure: the same two documents, but every word carries an unknown topic assignment "?"; the topic-word distributions and the per-document topic mixtures have to be inferred from the collection]

Introduction - 35
Latent Dirichlet Allocation
• Topics are interpretable, unlike the arbitrary dimensions of LSI
• Distributions follow a Dirichlet distribution
• Construction of the topic model is mathematically involved, but computationally feasible
• Considered the state-of-the-art method for topic identification

Introduction - 36
Use of Topic Models
Unsupervised learning of topics
– Understanding the main topics of a document collection
– Organizing the document collection
Use for document retrieval: use topic vectors instead of term vectors to represent documents and queries
Document classification (supervised learning): use topics as features
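A minimal sketch, assuming scikit-learn is available, of fitting an LDA topic model and using the resulting topic vectors as document representations (the toy corpus and parameter values are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "money bank loan bank money loan",
    "river stream bank river stream",
    "bank money loan river bank",
]

# Bag-of-words counts: one row per document
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Fit an LDA model with 2 topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_vectors = lda.fit_transform(X)      # documents as mixtures over the 2 topics

print(topic_vectors)                      # dense topic vectors, usable instead of term vectors
print(lda.components_)                    # per-topic word weights (distribution over terms)
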

Introduction - 37
Summary

[Figure: the VS model maps documents to sparse term vectors (dimension = term in vocabulary); a topic model maps documents to dense vectors (dimension = topic); each topic corresponds to a distribution over terms]
Introduction - 38
