
Week 5 - Latent Semantic Indexing

The document discusses embedding techniques, particularly focusing on Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) for improving information retrieval. LSI addresses issues of synonymy and homonymy by mapping documents and queries into a lower-dimensional concept space, while LDA uses a probabilistic model to identify topics within a document collection. Both methods aim to enhance document representation and retrieval efficiency by reducing dimensionality and improving concept identification.

1.4 EMBEDDING TECHNIQUES

Introduction - 1
1.4.1 Latent Semantic Indexing
Vector space retrieval is vague and noisy
– Based on index terms
– Unrelated documents might be included in the answer set
  • apple (company) vs. apple (fruit)
– Relevant documents that do not contain at least one index term are not retrieved
  • car vs. automobile

Observation
– The user's information need is more related to concepts and ideas than to index terms

Introduction - 2
The Problem
Vector space retrieval handles the following two situations poorly:

1. Synonymy: different terms refer to the same concept,
   e.g. car and automobile
   – Result: poor recall
2. Homonymy: the same term may have different meanings,
   e.g. apple, model, bank
   – Result: poor precision

Introduction - 3
Example: 3 documents

doc1: apple, blackberry, orange, vitamine, fruit, health, tablet
doc2: apple, blackberry, smartphone, orange, carrier, tablet, swisscom
doc3: iOS, iPad, RIM, mobile, handy, telcom, provider

doc1 vs. doc2: high similarity    doc2 vs. doc3: no similarity


Introduction - 4
Key Idea
Map documents and queries into a lower-dimensional space composed of higher-level concepts
– Each concept is represented by a combination of terms
– There are fewer concepts than terms
– e.g. vehicle = {car, automobile, wheels, auto, sportscar}

Dimensionality reduction
– Retrieval (and clustering) in a reduced concept space might be superior to retrieval in the high-dimensional space of index terms

Introduction - 5
Using Concepts for Retrieval

[Figure: terms t1–t3 linked directly to documents d1–d4 (left) vs. linked indirectly through concepts c1 and c2 (right)]

Introduction - 6
Example: Concept Space

[Figure: the terms of doc1, doc2 and doc3 grouped under four concepts in the concept space: fruit, health, device, telco]

Introduction - 7
Similarity Computation in Concept Space
Concept represented by terms, e.g.
  device = {iOS, iPad, RIM, mobile, handy, tablet, apple, blackberry}
Document represented by a concept vector, counting the number of concept terms, e.g.
  doc1 = (4, 3, 3, 1)
  doc3 = (0, 0, 5, 2)
Similarity computed as the scalar product of the normalized concept vectors

Introduction - 8
Result
Concept vector (fruit, health, device, telco)

doc1 = (4, 3, 3, 1)    doc2 = (3, 1, 3, 3)    doc3 = (0, 0, 5, 2)

doc1: apple, blackberry, orange, vitamine, fruit, health, tablet
doc2: apple, blackberry, smartphone, orange, carrier, tablet, swisscom
doc3: iOS, iPad, RIM, mobile, handy, telcom, provider

Similarity(doc1, doc2) = 0.245    Similarity(doc2, doc3) = 0.3    Similarity(doc1, doc3) = 0.22
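The slides do not state which normalization is used; a minimal Python sketch, assuming the concept vectors are normalized by the sum of their components (L1 normalization), which reproduces the similarity values above:

import numpy as np

# Concept vectors over (fruit, health, device, telco), counting concept terms per document
docs = {
    "doc1": np.array([4, 3, 3, 1], dtype=float),
    "doc2": np.array([3, 1, 3, 3], dtype=float),
    "doc3": np.array([0, 0, 5, 2], dtype=float),
}

def similarity(a, b):
    # Scalar product of the vectors normalized by their component sums (assumption: L1 normalization)
    return np.dot(a / a.sum(), b / b.sum())

print(round(similarity(docs["doc1"], docs["doc2"]), 3))  # 0.245
print(round(similarity(docs["doc2"], docs["doc3"]), 3))  # 0.3
print(round(similarity(docs["doc1"], docs["doc3"]), 3))  # 0.221
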
Introduction - 9
Basic Definitions
Problem: how to identify and compute "concepts"?

Consider the term-document matrix
– Let M = (Mij) be a term-document matrix with m rows (terms) and n columns (documents)
– To each element of this matrix is assigned a weight wij associated with term ti and document dj
– The weight wij can be based on a tf-idf weighting scheme

Introduction - 10
Computing the Ranking Using M

[Figure: the ranking is obtained by the scalar products Mt·q; multiplying the transposed term-document matrix (n x m) with the query vector q (m x 1) yields an n-vector whose i-th component is the scalar product query·doci]
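A minimal numpy sketch of this ranking step; the toy matrix M and query q below are made-up values, not from the slides:

import numpy as np

# Toy term-document matrix: m terms (rows) x n documents (columns), hypothetical weights
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])
q = np.array([1.0, 0.0, 1.0])    # query vector over the same m terms

scores = M.T @ q                 # i-th entry = scalar product query . doc_i
ranking = np.argsort(-scores)    # document indices sorted by decreasing score
print(scores, ranking)
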

Introduction - 11
In vector space retrieval each row of the matrix M corresponds to
A. A document
B. A concept
C. A query
D. A term

Introduction - 12
Identifying Top Concepts
Key Idea: extract the essential features of Mt and approximate it by the most important ones ("top concepts")

[Figure: the transformation Mt maps the unit ball to a transformed, ellipsoid-shaped ball; the longest semi-axis corresponds to the top concept]

Introduction - 13
Singular Value Decomposition (SVD)
Represent matrix M as M = K.S.Dt
– K and D are matrices with orthonormal columns: Kt.K = I = Dt.D
– S is an r x r diagonal matrix of the singular values sorted in decreasing order, where r ≤ min(m, n) is the rank of M
– Such a decomposition always exists and is unique (up to sign)

Introduction - 14
Construction of SVD
K is the matrix of eigenvectors derived from M.Mt
D is the matrix of eigenvectors derived from Mt.M

Algorithms for constructing the SVD of an m x n matrix have complexity O(n³) if m ≤ n
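In practice the decomposition can be computed with numpy; a minimal sketch on a made-up term-document matrix (not from the slides):

import numpy as np

# Hypothetical term-document matrix: m terms (rows) x n documents (columns)
M = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 2.0]])

# full_matrices=False gives the "thin" SVD: K is m x r, S holds the r singular values, Dt is r x n
K, S, Dt = np.linalg.svd(M, full_matrices=False)

print(S)                                    # singular values, sorted in decreasing order
print(np.allclose(M, K @ np.diag(S) @ Dt))  # True: M = K.S.Dt
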

Introduction - 15
Interpretation of SVD
We can write M = K.S.Dt as a sum of outer vector products

  M = Σi=1..r si · ki · dit

The si are ordered in decreasing size

By taking only the largest ones we obtain a «good» approximation of M (least squares approximation)

The singular values si are the lengths of the semi-axes of the hyperellipsoid E defined by

  E = { M·x : ||x||2 = 1 }
Introduction - 16
Illustration of SVD

[Figure: M (m x n) = K (m x r) · S (r x r) · Dt (r x n), assuming m ≤ n. The row vectors of K represent the terms in concept space; the column vectors of Dt represent the documents in concept space.]
Introduction - 17
Illustration of SVD – Another Perspective

[Figure: the same decomposition M (m x n) = K (m x r) · S (r x r) · Dt (r x n), assuming m ≤ n, read the other way: the column vectors of K represent the concepts in term space; the row vectors of Dt represent the concepts in document space.]
Introduction - 18
Latent Semantic Indexing (LSI)
In the matrix S, select only the s largest singular values
– Keep the corresponding columns in K and D
The resulting matrix is called Ms and is given by
– Ms = Ks.Ss.Dst where s, s < r, is the dimensionality of the concept space
The parameter s should be
– large enough to allow fitting the characteristics of the data
– small enough to filter out the non-relevant representational details
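A minimal sketch of this truncation in numpy; the helper name lsi and the example matrix are ours, not from the slides:

import numpy as np

def lsi(M, s):
    """Return the rank-s factors Ks, Ss, Dst of the LSI approximation Ms = Ks.Ss.Dst."""
    K, S, Dt = np.linalg.svd(M, full_matrices=False)
    Ks = K[:, :s]          # keep the first s columns of K
    Ss = np.diag(S[:s])    # s x s diagonal matrix of the s largest singular values
    Dst = Dt[:s, :]        # keep the first s rows of Dt (i.e. columns of D)
    return Ks, Ss, Dst

# Example with a hypothetical 3 x 4 term-document matrix and s = 2
M = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 2.0]])
Ks, Ss, Dst = lsi(M, s=2)
Ms = Ks @ Ss @ Dst         # best rank-2 (least squares) approximation of M
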
Introduction - 19
Illustration of Latent Semantic Indexing

[Figure: Ms (m x n) = Ks (m x s) · Ss (s x s) · Dst (s x n). The rows of Ks are the term vectors and the columns of Dst are the document vectors in the s-dimensional concept space.]

Introduction - 20
Answering Queries
Documents can be compared by computing cosine similarity in the concept space, i.e., comparing their columns (Dst)i and (Dst)j of matrix Dst

A query q is treated like one further document
– it is added as an additional column to matrix M
– the same transformation is applied to this column as for mapping M to D

Introduction - 21
Mapping Queries
Mapping of M to D
  M = K.S.Dt
  S-1.Kt.M = Dt (since Kt.K = I)
  D = Mt.K.S-1
Apply the same transformation to q:
  q* = qt.Ks.Ss-1
Then compare the transformed vector with the document vectors using the standard cosine measure

  sim(q*, di) = q* · (Dst)i / ( |q*| · |(Dst)i| )
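A minimal, self-contained sketch of query folding and cosine ranking in the concept space, reusing the same made-up matrix and s = 2 (the query values are illustrative):

import numpy as np

# Same hypothetical 3 x 4 term-document matrix and s = 2 as in the previous sketches
M = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 2.0]])
K, S, Dt = np.linalg.svd(M, full_matrices=False)
s = 2
Ks, Ss, Dst = K[:, :s], np.diag(S[:s]), Dt[:s, :]

# Fold the query into the concept space: q* = qt.Ks.Ss^-1
q = np.array([1.0, 0.0, 1.0])             # hypothetical query over the same 3 terms
q_star = q @ Ks @ np.linalg.inv(Ss)

# Cosine similarity between q* and each document column of Dst
doc_vectors = Dst.T                        # one row per document in concept space
sims = doc_vectors @ q_star / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_star))
print(np.argsort(-sims))                   # documents ranked by decreasing similarity
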
Introduction - 22
Illustration of LSI Querying

[Figure: the query qt is added as an additional column to M and mapped to its concept-space representation q*, which is compared against the document vectors (columns of Dst) in Ms = Ks (m x s) · Ss (s x s) · Dst (s x n)]
Introduction - 23
Example: Documents
B1 A Course on Integral Equations
B2 Attractors for Semigroups and Evolution Equations
B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application
B4 Geometrical Aspects of Partial Differential Equations
B5 Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra
B6 Introduction to Hamiltonian Dynamical Systems and the N-Body Problem
B7 Knapsack Problems: Algorithms and Computer Implementations
B8 Methods of Solving Singular Systems of Ordinary Differential Equations
B9 Nonlinear Systems
B10 Ordinary Differential Equations
B11 Oscillation Theory for Neutral Differential Equations with Delay
B12 Oscillation Theory of Delay Differential Equations
B13 Pseudodifferential Operators and Nonlinear Partial Differential Equations
B14 Sinc Methods for Quadrature and Differential Equations
B15 Stability of Stochastic Differential Equations with Respect to Semi-Martingales
B16 The Boundary Integral Approach to Static and Dynamic Contact Problems
B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory

Introduction - 24
Implementation in Python
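The code on the original slide is not reproduced in this extract; a minimal sketch of what such an implementation could look like, assuming numpy and scikit-learn's CountVectorizer are used (the use of scikit-learn and all variable names are our assumptions):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "A Course on Integral Equations",
    "Attractors for Semigroups and Evolution Equations",
    "Automatic Differentiation of Algorithms: Theory, Implementation, and Application",
    # ... remaining titles B4-B17 from the previous slide
]

# Term-document matrix: terms as rows, documents as columns
vectorizer = CountVectorizer()
M = vectorizer.fit_transform(titles).T.toarray().astype(float)

# LSI with s = 2 concepts
K, S, Dt = np.linalg.svd(M, full_matrices=False)
s = 2
Ks, Ss, Dst = K[:, :s], np.diag(S[:s]), Dt[:s, :]

# 2-d concept-space coordinates of terms (rows of Ks) and documents (columns of Dst)
print(Ks[:5])
print(Dst[:, :5].T)
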

Introduction - 25
Results (s=2)

Introduction - 26
Plot of Terms and Documents in 2-d Concept Space

Introduction - 27
Applying SVD to a term-document matrix M. Each concept is represented in K
A. as a singular value
B. as a linear combination of terms of the vocabulary
C. as a linear combination of documents in the document collection
D. as a least squares approximation of the matrix M

Introduction - 28
The number of term vectors in the matrix Ks used for LSI
A. Is smaller than the number of rows in the matrix M
B. Is the same as the number of rows in the matrix M
C. Is larger than the number of rows in the matrix M

Introduction - 29
A query transformed into the concept space for LSI has …
A. s components (number of singular values)
B. m components (size of vocabulary)
C. n components (number of documents)

Introduction - 30
Discussion of Latent Semantic Indexing
Latent semantic indexing provides an interesting conceptualization of the IR problem
Advantages
– It allows reducing the complexity of the underlying concept representation
– Facilitates interfacing with the user
Disadvantages
– Computationally expensive
– Poor statistical explanation

Introduction - 31
Alternative Techniques
Probabilistic Latent Semantic Analysis
– Based on Bayesian Networks

Latent Dirichlet Allocation
– Based on the Dirichlet distribution
– State-of-the-art method for concept extraction

Same objective of creating a lower-dimensional concept space based on the term-document matrix
– Better explained mathematical foundation
– Better experimental results
Introduction - 32
1.4.2 Latent Dirichlet Allocation (LDA)
Idea: assume a document collection is (randomly) generated from a known set of topics (probabilistic generative model)

[Figure: a topic is a mixture of words, a document is a mixture of topics. TOPIC 1 = {money, loan, bank}, TOPIC 2 = {river, stream, bank}. DOCUMENT 1 mixes the topics with weights .8 / .2 (e.g. "money1 bank1 bank1 loan1 river2 stream2 bank1 ..."), DOCUMENT 2 with weights .3 / .7 (e.g. "river2 stream2 bank2 stream2 bank2 money1 loan1 ..."); the index on each word indicates the topic it was sampled from]

Introduction - 33
Document Generation using a Probabilistic Process
For each document, choose a mixture of topics
For every word position, sample a topic from the topic mixture
For every word position, sample a word from the chosen topic

[Figure: generative process: each document has its own topic mixture; each word position gets a topic sampled from that mixture and a word sampled from that topic]
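A minimal numpy sketch of this generative process; the vocabulary, topic-word distributions and Dirichlet parameter are made-up illustrative values:

import numpy as np

rng = np.random.default_rng(0)

vocabulary = ["money", "loan", "bank", "river", "stream"]
# Hypothetical topic-word distributions (each row sums to 1)
topics = np.array([[0.35, 0.25, 0.40, 0.00, 0.00],    # "finance" topic
                   [0.00, 0.00, 0.40, 0.30, 0.30]])   # "water" topic
alpha = np.array([0.5, 0.5])                          # Dirichlet prior on topic mixtures

def generate_document(n_words=20):
    theta = rng.dirichlet(alpha)                      # 1. choose a mixture of topics
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)          # 2. sample a topic for this word position
        w = rng.choice(len(vocabulary), p=topics[z])  # 3. sample a word from the chosen topic
        words.append(vocabulary[w])
    return words

print(" ".join(generate_document()))
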
Introduction - 34
LDA: Topic Identification
Approach: inverting the process: given a document collection, reconstruct the topic model

[Figure: the same two documents, but every word carries an unknown topic assignment "?"; the topic-word distributions and the per-document topic mixtures have to be inferred from the collection]

Introduction - 35
Latent Dirichlet Allocation
• Topics are interpretable, unlike the arbitrary dimensions of LSI
• Distributions follow a Dirichlet distribution
• Construction of the topic model is mathematically involved, but computationally feasible
• Considered the state-of-the-art method for topic identification

Introduction - 36
Use of Topic Models
Unsupervised learning of topics
– Understanding the main topics of a document collection
– Organizing the document collection
Use for document retrieval: use topic vectors instead of term vectors to represent documents and queries
Document classification (supervised learning): use topics as features
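A minimal sketch, assuming scikit-learn is available, of fitting an LDA topic model and using the resulting topic vectors as document representations (the toy corpus and parameter values are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "money bank loan bank money loan",
    "river stream bank river stream",
    "bank money loan river bank",
]

# Bag-of-words counts: one row per document
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Fit an LDA model with 2 topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_vectors = lda.fit_transform(X)      # documents as mixtures over the 2 topics

print(topic_vectors)                      # dense topic vectors, usable instead of term vectors
print(lda.components_)                    # per-topic word weights (distribution over terms)
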

Introduction - 37
Summary

[Figure: the VS model maps documents to sparse term vectors (dimension = term in vocabulary); a topic model maps documents to dense vectors (dimension = topic); each topic corresponds to a distribution over terms]
Introduction - 38
