Spoken Content Retrieval
Beyond Cascading Speech
Recognition and Text Retrieval
Lin-shan Lee and Hung-yi Lee
National Taiwan University
Focus of this Tutorial
 New frontiers and directions towards the
future of speech technologies
 Not skills and experiences in optimizing
performance in evaluation programs
Text Content Retrieval
Voice Search
Spoken Content Retrieval
Lectures Broadcast Program
Multimedia Content
Spoken Content
Spoken Content Retrieval
 Spoken content retrieval: the machine listens to the data and
extracts the desired information for each individual user.
 Nobody is able to go through the data.
300 hours of multimedia are
uploaded per minute.
(2015.01)
1874 courses on coursera
(2016.04)
 In such multimedia, the spoken part carries very
important information about the content
• Just as Google does on text data
 Basic goal: Identify the time spans in which the query
occurs in an audio database
 This is called “Spoken Term Detection”
Spoken Content Retrieval – Goal
time 1:01
time 2:05
…
……
“Obama”
user
 Basic goal: Identify the time spans in which the query
occurs in an audio database
 This is called “Spoken Term Detection”
 Advanced goal: Semantic retrieval of spoken content
Spoken Content Retrieval – Goal
user
“US President”
The user is also looking
for utterances including
“Obama”.
Retrieval system
It is natural to think ……
Spoken Content Retrieval
Speech Recognition
+
Text Retrieval
=
It is natural to think ……
Speech
Recognition Models
Text
Acoustic Models
Language Model
Spoken
Content
 Transcribe spoken content into text by speech recognition
It is natural to think ……
Speech
Recognition Models
Text
 Transcribe spoken content into text by speech recognition
Spoken
Content
RNN/LSTM DNN
It is natural to think ……
 Transcribe spoken content into text by speech recognition
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
 Use text retrieval approaches to search over the
transcriptions
Spoken
Content
Black Box
It is natural to think ……
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
user
Spoken
Content
 For spoken queries, transcribe them into text by speech
recognition.
Black Box
Our point in this tutorial
Spoken Content Retrieval
Speech Recognition
+
Text Retrieval
=
Outline
 Introduction: Conventional Approach:
Spoken Content Retrieval =
Speech Recognition + Text Retrieval
 Core: Beyond Cascading Speech
Recognition and Text Retrieval
Five new directions
Introduction:
Spoken Content Retrieval =
Speech Recognition + Text Retrieval
It is natural to think ……
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
user
Spoken
Content
Speech Recognition always produces errors.
Lattices
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
Spoken
Content
Lattices
 Keep many possible recognition outputs
 Each path has a weight (the confidence of being
correct)
M. Larson and G. J. F. Jones, “Spoken content retrieval: A
survey of techniques and technologies,” Foundations and
Trends in Information Retrieval, vol. 5, no. 4-5, 2012.
Lattices
Spoken
Archive
Speech
Recognition
System
Acoustic &
Language
Models
Lattices
Retrieval
Result
Text
Retrieval
user
Text Query
Each path is a possible recognition result
time
Horizontal scale is the time
Lattices
Spoken
Archive
Speech
Recognition
System
Acoustic &
Language
Models
Lattices
Retrieval
Result
Text
Retrieval
user
Text Query
time
Higher probability of including the correct words
But more noisy words are inevitably included
Higher memory/computation requirements
Searching over Lattices
 Consider the basic goal: Spoken Term Detection
“Obama”
user
Searching over Lattices
 Consider the basic goal: Spoken Term Detection
 Find the arcs hypothesized to be the query term
Obama
“Obama”
user
Obama
x1
x2
 Consider the basic goal: Spoken Term Detection
 Posterior probabilities computed from the lattices are used as
confidence scores
Searching over Lattices
Obama
x1
R(x1)=0.9
Obama
x2
R(x2)=0.3
Two ways to display the results:
unranked and ranked.
 Consider the basic goal: Spoken Term Detection
 Unranked: Return the results with the scores higher than
a threshold
Searching over Lattices
Obama
x1
R(x1)=0.9
Obama
x2
R(x2)=0.3
Set the threshold as 0.6
Return x1
 Consider the basic goal: Spoken Term Detection
 Unranked: Return the results with the scores higher than
a threshold
Searching over Lattices
Obama
x1
R(x1)=0.9
Obama
x2
R(x2)=0.3
The threshold can be determined
automatically and can be query-specific.
[Miller, Interspeech 07][Can, HLT 09][Mamou,
ICASSP 13][Karakos, ASRU 13][Zhang, Interspeech
12][Pham, ICASSP 14]
Actual Term Weighted Value
(ATWV)
 Evaluating unranked result
ATWV = 1 − P_miss − β·P_FA
P_miss = 1 − N_correct / N_ref
P_FA = N_spurious / N_NT
N_ref: number of times the query term appears in the audio database
N_correct: the number of retrieved objects that are actually correct
N_spurious: the number of retrieved objects that are incorrect
N_NT: audio duration (in seconds) − N_ref
[Figure: ranked hits (time 1:01, 1.0; time 2:05, 0.9; time 1:31, 0.7; ……) with a threshold separating retrieved from non-retrieved hits]
Maximum Term Weighted Value (MTWV): tune the threshold to
obtain the best ATWV
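A small worked example makes the bookkeeping concrete. The sketch below (Python; β = 999.9 is an illustrative setting, the value commonly used in the NIST evaluations) computes the term-weighted value for one query term; ATWV averages this quantity over all query terms.

def term_weighted_value(n_correct, n_spurious, n_ref, audio_seconds, beta=999.9):
    """Term-weighted value for a single query term, per the definitions above."""
    p_miss = 1.0 - n_correct / n_ref          # fraction of true occurrences missed
    n_nt = audio_seconds - n_ref              # number of "non-target" trials
    p_fa = n_spurious / n_nt                  # false alarm probability
    return 1.0 - p_miss - beta * p_fa

# Example: the query occurs 10 times in one hour of audio; above the threshold
# the system returns 8 correct hits and 2 spurious ones.
print(term_weighted_value(n_correct=8, n_spurious=2, n_ref=10, audio_seconds=3600))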
 Consider the basic goal: Spoken Term Detection
 Ranked: results ranked according to the scores
Searching over Lattices
Obama
x1
R(x1)=0.9
Obama
x2
R(x2)=0.3
x1 0.9
x2 0.3
…
 Consider the basic goal: Spoken Term Detection
 Ranked: The results are ranked according to the scores
Searching on Lattices
Obama
x1
R(x1)=0.9
Obama
x2
R(x2)=0.3
x1 0.9
x2 0.3
… user
Mean Average Precision (MAP)
 Evaluating a ranked list
 Area under the recall-precision curve
 Recall: percentage of ground truth results retrieved
 Precision: percentage of retrieved results being
correct
 Higher threshold gives higher precision but lower
recall, etc.
[Figure: recall-precision curves (precision vs. recall, both from 0 to 1) for two systems, MAP = 0.484 and MAP = 0.586]
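For a ranked list, the average precision can be computed directly from the relevance of each retrieved item; MAP is its mean over all queries. A minimal sketch assuming binary relevance judgments:

def average_precision(ranked_relevance, num_relevant):
    """ranked_relevance: booleans in ranked order; num_relevant: total relevant items."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank      # precision at each recall point
    return precision_sum / num_relevant

# Relevant items are retrieved at ranks 1, 3 and 4; 4 relevant items exist in total.
print(average_precision([True, False, True, True, False], num_relevant=4))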
Examples of Lattice Indexing
Approaches
• Position Specific Posterior Lattices (PSPL)[Chelba, ACL
05][Chelba, Computer Speech and Language 07]
• Confusion Networks (CN)[Mamou, SIGIR 06][Hori, ICASSP
07][Mamou, SIGIR 07]
• Time-based Merging for Indexing (TMI)[Zhou, HLT 06][Seide,
ASRU 07]
• Time-anchored Lattice Expansion (TALE)[Seide, ASRU
07][Seide, ICASSP 08]
• WFST: directly compile the lattice into a weighted finite
state transducer [Allauzen, HLT 04][Parlak, ICASSP 08][Can, ICASSP
09][Parada, ASRU 09]
Out-of-Vocabulary (OOV) Problem
 Speech recognition is based on a lexicon
 Words not in the lexicon can never be transcribed
 Many informative words are out-of-vocabulary
(OOV)
 Many query terms are new or special words or
named entities
 All OOV words are composed of subword units
 Generate subword lattices
 Transform word lattices into subword lattices
 Can also be directly generated by speech recognition
using a subword-based lexicon and language model
Subword-based Retrieval
Retrieval
An arc in the
word lattice
Corresponding
subword sequence
/rɪ/ /trɪ/ /vǝl/
word lattices subword lattices
 Subword-based retrieval
 Generate subword lattices
 Transform user query into subword sequence
Obama → /au/ /ba/ /mǝ/
 Text retrieval techniques are equally useful, except now
based on subword lattices and subword queries
(replace words by subword units)
 OOV words can be retrieved by matching over the
subword units without being recognized
Subword-based Retrieval
Subword-based Retrieval
- Frequently Used Subword Units
• Linguistically motivated units
– phonemes, syllables/characters, morphemes, etc.
[Ng, MIT 00][Wallace, Interspeech 07][Chen & Lee, IEEE T. SAP 02]
[Pan & Lee, ASRU 07][Meng, ASRU 07][Meng, Interspeech 08]
[Mertens, ICASSP 09][Itoh, Interspeech 07][Itoh, Interspeech 11]
[Pan & Lee, IEEE T. ASL 10]
• Data-driven units
– particles, word fragments, phone multigrams, morphs,
etc.
[Turunen, SIGIR 07] [Turunen, Interspeech 08]
[Parlak, ICASSP 08][Logan, IEEE T. Multimedia 05]
[Gouvea, Interspeech 10][Gouvea, Interspeech 11][Lee & Lee, ASRU 09]
Integrating Different Clues
from Recognition
 Similar to system combination in ASR
 Consistency very often implies accuracy
 Integrating the outputs from different
recognition systems [Natori, Interspeech 10]
 Integrating results based on different subword
units [S.-w. Lee, ICASSP 05][Pan & Lee, Interspeech 07][Meng,
Interspeech 10][Itoh, Interspeech 11]
 Weights of different clues estimated by
optimizing some retrieval related criteria [Meng &
Lee, ICASSP 09][Chen & Lee, ICASSP 10][Meng, Interspeech 10][Wollmer, ICASSP
09]
Integrating Different Clues
from Recognition
 Weights for Integrating 1,2,3-grams for different
word/subword units and different indices
syllable
Confusion
Network
Position
Specific
Posterior
Lattice
word
character
syllable
word
character
1-gram
2-gram
3-gram
1-gram
2-gram
3-gram
integrated with
different weights
maximizing the lower bound of MAP by SVM-MAP
[Meng & Lee, ICASSP 09]
Training Retrieval Model
Parameters
 Integrating different n-grams,
word/subword units and indices
single clue integrated
[Meng & Lee, ICASSP 09] [Chen & Lee, ICASSP 10]
ASR Accuracy vs. Retrieval Performance
 Spoken Term Detection, Lectures
Speaker Dependent:
10 hours of speech from the instructor
ASR Accuracy vs. Retrieval Performance
 Spoken Term Detection, Lectures
Improved Speaker Adaptation
ASR Accuracy vs. Retrieval Performance
 Spoken Term Detection, Lectures
Initial Speaker Adaptation
ASR Accuracy vs. Retrieval Performance
 Spoken Term Detection, Lectures
Speaker Independent
Realistic!
ASR Accuracy
MAP
ASR Accuracy vs. Retrieval Performance
 Precision at 10: Percentage of the correct items among
the top 10 selected
Speaker Independent:
Only 60% of results are correct
Retrieve
YouTube?!
 Did lattices solve the problem?
 Need high quality recognition models to produce better
lattices and accurately estimate the confidence scores
 Spoken content over the Internet is produced in different
languages on different domains in different parts of the
world under varying acoustic conditions
 High quality recognition models for such content don't
exist yet
 Retrieval performance limited by ASR accuracy
Is the problem solved?
 Desired spoken content retrieval
Less constrained by ASR accuracy
Existing approaches limited by ASR
accuracy because of the cascading of speech
recognition and text retrieval
 Go beyond the cascading concept
Is the problem solved?
Our point in this tutorial
Spoken Content Retrieval
Speech Recognition
+
Text Retrieval
=
Core:
Beyond Cascading Speech
Recognition and Text Retrieval
New Directions
1. Modified ASR for Retrieval Purposes
2. Incorporating Information Lost in ASR
3. No Speech Recognition!
4. Special Semantic Retrieval Techniques for Spoken
Content
5. Spoken Content is Difficult to Browse!
Overview Paper
 Lin-shan Lee, James Glass, Hung-yi Lee,
Chun-an Chan, "Spoken Content Retrieval —
Beyond Cascading Speech Recognition with
Text Retrieval," IEEE/ACM Transactions on
Audio, Speech, and Language Processing,
vol.23, no.9, pp.1389-1420, Sept. 2015
 http://speech.ee.ntu.edu.tw/~tlkagk/paper/Over
view.pdf
 This tutorial includes information updated after
this paper was published.
New Direction 1:
Modified ASR
for Retrieval Purposes
Retrieval Performance
vs. Recognition Accuracy
 Intuition: Higher recognition accuracy, better
retrieval performance
Not always true!
In Taiwan, the need of …
Recognition
System A
Recognition
System B
In Taiwan, a need of … In Thailand, the need of …
Same recognition accuracy
Retrieval Performance
vs. Recognition Accuracy
 Intuition: Higher recognition accuracy, better
retrieval performance
Not always true!
In Taiwan, the need of …
Recognition
System A
Recognition
System B
In Taiwan, a need of … In Thailand, the need of …
Not important
for retrieval
Serious problem
for retrieval
Retrieval Performance
vs. Recognition Accuracy
 Retrieval performance is more correlated with the ASR errors on
named entities than on normal terms [Garofolo, TREC-7 99][L. van der
Werff, SSCS 07]
 Expected error rate defined on lattices is a better predictor of
retrieval performance than one-best transcriptions [Olsson, SSCS
07]
 lattices used in retrieval
 For retrieval, substitution errors have more influence than
insertions and deletions [Johnson, ICASSP 99]
 Language models that reduce ASR errors do not always yield
better retrieval performance [Cui, ICASSP 13][Shao, Interspeech 08][Wallace, SSCS 09]
 Query terms are usually topic-specific, with lower n-gram
probabilities
ASR models learned by
Optimizing Retrieval Performance
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
Spoken
Content
ASR models learned by
Optimizing Retrieval Performance
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
Spoken
Content
Optimized for recognition accuracy
ASR models learned by
Optimizing Retrieval Performance
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
Spoken
Content
Optimized for recognition accuracy
Spoken Content Retrieval
Retrieval Performance
New Direction 1-1:
Modified ASR
for Retrieval Purposes
Acoustic Modeling
Acoustic Modeling
 Acoustic Model Training
θ̂ = argmax_θ F(θ)
θ: acoustic model parameters
F(θ): objective function
The objective function F(θ) is usually defined
to optimize ASR accuracy.
Design a new objective function for
optimizing retrieval performance.
Acoustic Modeling
 Objective function for optimizing ASR
performance
θ̂ = argmax_θ F(θ)
F(θ) = Σ_u Σ_{s_u ∈ L(u)} A(r_u, s_u) · P_θ(s_u | u)
[Figure: lattice of utterance u with word arcs wA, wB, wC, …]
Summation over all the
utterances u in the training data
L(u): all the word sequences in
the lattice of u
Acoustic Modeling
 Objective function for optimizing ASR
performance
θ̂ = argmax_θ F(θ)
F(θ) = Σ_u Σ_{s_u ∈ L(u)} A(r_u, s_u) · P_θ(s_u | u)
[Figure: lattice of utterance u]
s_u: a word sequence in the lattice of u
A(r_u, s_u): the accuracy of word or phoneme
sequence s_u compared with the reference r_u
P_θ(s_u | u): posterior probability of word
sequence s_u given acoustic model θ
Examples: MCE, MPE, sMBR; θ can be an HMM or a DNN
Acoustic Modeling
 Objective function for optimizing retrieval
performance
θ̂ = argmax_θ F(θ)
F(θ) = Σ_u Σ_{s_u ∈ L(u)} A(r_u, s_u) · P_θ(s_u | u)
If the possible query terms are known in advance,
they can be weighted higher in A(r_u, s_u):
W-MCE [Fu, ASRU 07][Weng, Interspeech 12][Weng, ICASSP 13],
keyword-boosted sMBR [Chen, Interspeech 14]
 In most cases, the query terms are not known in
advance
 Collect feedback data on-line
 Use the information to optimize search engines
Feedback can be implicit
Training Data collected from User
time 1:10 F
time 2:01 F
time 3:04 F
time 5:31 T
time 1:10 F
time 2:01 T
time 3:04 F
time 5:31 T
time 1:10 F
time 2:01 T
time 3:04
time 5:31
Query Q1 Query Q2 Query Qn
……
ASR models learned by
Optimizing Retrieval Performance
Speech
Recognition Models
Text
Retrieval
Query user
Spoken
Content
Lattices
Retrieval
Result
re-estimate
optimize
[Lee & Lee, ICASSP 10]
[Lee & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
[Lee & Lee, IEEE T. ASL 12]
time 1:10 F
time 2:01 F
time 3:04 F
time 5:31 T
time 1:10 F
time 2:01 T
time 3:04 F
time 5:31 T
time 1:10 F
time 2:01 T
time 3:04
time 5:31
Query Q1 Query Q2 Query Qn
……
Updated Retrieval Process
 Each retrieval result x has a confidence score
R(x)
 R(x) depends on the recognition model θ
 R(x) should be R(x;θ)
Re-estimate
recognition
model θ
Update the
scores R(x; θ)
The retrieval
results can be
re-ranked.
Considering some
retrieval criterion
Basic Form
 Basic Form:
F(θ) = Σ_{x+} R(x+; θ) − Σ_{x−} R(x−; θ)
θ̂ = argmax_θ F(θ)
x+: a positive example (labeled relevant in the user feedback)
x−: a negative example
R(x+; θ): confidence score of the positive example
R(x−; θ): confidence score of the negative example
Maximizing F(θ) increases the confidence scores of
the positive examples and decreases the confidence
scores of the negative examples.
Consider Ranking
 The basic form can increase the objective while producing a worse
ranking: under the new model θ̂, the scores R(x+; θ̂) of positive examples
may rise and the scores R(x−; θ̂) of negative examples may fall, yet a
negative example can still be ranked above a positive one.
[Figure: confidence scores of positive and negative examples under the original model θ and the new model θ̂; in one case the new model ranks perfectly, in the other the ranking is worse even though the basic objective increases]
 Considering the ranking order:
δ(x+, x−) = 1 if R(x+; θ) > R(x−; θ), and 0 otherwise
F(θ) = Σ_{x+} Σ_{x−} δ(x+, x−)
 If the confidence score of a positive example exceeds that of a
negative example, the objective function adds 1.
 δ(x+, x−) is approximated by a sigmoid function during optimization.
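To make the idea concrete, the sketch below optimizes the ranking objective with the sigmoid approximation by gradient ascent. It is only a toy illustration: R(x; θ) is taken as a linear scorer over hypothetical per-hypothesis features, whereas the cited work re-estimates the HMM/DNN acoustic model parameters themselves.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(x, theta):                           # stand-in for the confidence R(x; theta)
    return float(np.dot(theta, x))

def ranking_objective(theta, positives, negatives):
    # F(theta) ~ sum over (x+, x-) pairs of sigmoid(R(x+; theta) - R(x-; theta))
    return sum(sigmoid(score(xp, theta) - score(xn, theta))
               for xp in positives for xn in negatives)

def gradient_step(theta, positives, negatives, lr=0.1):
    grad = np.zeros_like(theta)
    for xp in positives:
        for xn in negatives:
            d = sigmoid(score(xp, theta) - score(xn, theta))
            grad += d * (1.0 - d) * (xp - xn)  # derivative of each sigmoid term
    return theta + lr * grad

theta = np.zeros(3)
positives = [np.array([1.0, 0.2, 0.0])]        # results the user clicked (relevant)
negatives = [np.array([0.1, 0.9, 0.3])]        # results the user skipped
for _ in range(50):
    theta = gradient_step(theta, positives, negatives)
print(ranking_objective(theta, positives, negatives))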
Little feedback data? Treat the unlabeled examples as negative examples.
Acoustic Models - Experiments
 Lecture recording (80 queries, each with 5 clicks) [Lee & Lee, IEEE T. ASL 12]
[Figure: MAP from 0.46 to 0.52 comparing the Baseline, the Basic Form, the Ranking objective, and the Ranking objective with unlabelled examples treated as negative]
New Direction 1-2:
Modified ASR
for Retrieval Purposes
Besides Acoustic Modeling
Language Modeling
 The query terms are usually very specific. Their
probabilities are underestimated.
 Boosting the probabilities of n-grams including query
terms
 By repeating the sentences including the query terms
in training corpora
 Helpful in DARPA’s RATS program [Mandal, Interspeech 13]
and NIST OpenKWS13 evaluation [Chen, ISCSLP 14]
 NN-based LM: Modifying training criterion, so the key
terms are weighted more during training
 Helpful in NIST OpenKWS13 evaluation [Gandhe, ICASSP
14]
Decoding
 Give different words different pruning thresholds during
decoding
 The keywords are given lower pruning thresholds than
normal terms
 Called white listing [Zhang, Interspeech 12] or keyword-
aware pruning [Mandal, Interspeech 13]
 OOV words are never correctly recognized
 Two stage approach [Shao, Interspeech 08]
 Identify the lattices probably containing OOV (by
subword-based approach)
 Insert the word arcs of OOV words into lattices and
rescore
Confusion Models
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
Spoken
Content
Confusion
Model
A B C A’ B’ C’
The ASR produces systematic errors, so it is possible to
learn a confusion model to offer better retrieval results
[Karanasou, Interspeech 12][Wallace, ICASSP 10]
Jointly Optimizing Speech
Recognition and Retrieval Modules
Complex
Model
query
A spoken segment
Yes, the segment
contains the query.
No, …….
End-to-end model performing speech recognition
and retrieval jointly (learned jointly) in one step
Sounds crazy?
Much information lost during
ASR
Much information lost during ASR
Transcriptions:
using syntax vectors surge ……
Lattice:
ASR
Spoken Content
New Direction 2:
Incorporating Information Lost in ASR
Information beyond Speech
Recognition Output
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
Spoken
Content
Black Box
Incorporating information lost in
standard speech recognition to help retrieval
New Direction 2-1:
Incorporating Information Lost in ASR
What kind of information can be helpful?
Information beyond Speech
Recognition Output
 Phoneme or syllable duration [Wollmer, ICASSP 09][Naoyuki
Kanda, SLT 12][Teppei Ohno, SLT 12]
 Pitch & Energy [Tejedor, Interspeech 10]
 Landmark and attribute detection with prosodic cues
can reduce false alarms [Ma, Interspeech 2007]
[Naoyuki Kanda, SLT 12]
Example: the query is the Japanese word "fu-ji-sa-N"; a hypothesized
occurrence that is very short is likely a false alarm.
Query-specific Information
 "Jack of all trades, master of none“
Speech recognition: needs to correctly recognize all the words
Spoken term detection: only needs higher detector accuracy
on the specific query
Retrieval
System
Query-specific Detector
Query Q
Lattices
First-pass Retrieval Result
x1
x2 x3
Examples of Q
Compute Similarity
Exemplar-based approach also used in speech recognition
[Demuynck, ICASSP 2011][Heigold, ICASSP 2012][Nancy Chen, ICASSP 2016]
Similarities
between Audio Segments
Dynamic Time
Warping (DTW)
Retrieval
System
Query-specific Detector
Query Q
Lattices
First-pass Retrieval Result
x1
x2 x3
Examples of Q
Learn a
model
Model for Q
Evaluate
confidence
Retrieval
System
Query-specific Detector
Query Q
Lattices
First-pass Retrieval Result
x1
x2 x3
positive examples
Learn a discriminative model
negative examples
SVM
[Tu & Lee, ASRU 11]
[I.-F. Chen, Interspeech 13]
Query-specific Detector
 The input of SVM or MLP has to be a fixed-length vector
 Representing an audio segment of variable length as
a fixed-length vector
[Figure: variable-length sequences of feature vectors mapped to fixed-length vectors]
[Tu & Lee, ASRU 11]
[I.-F. Chen, Interspeech 13]
Retrieval
System
Query-specific Detector
Query Q
Lattices
First-pass Retrieval Result
x1
x2 x3
positive examples
negative examples
 Is it realistic to have those examples?
Pseudo-relevance Feedback (PRF)
User Feedback
New Direction 2-2:
Incorporating Information Lost in ASR
Pseudo Relevance Feedback
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
Lattices
First-pass Retrieval Result
x1
x2 x3
[Chen & Lee,
Interspeech 11]
[Lee & Lee, CSL 14]
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
Confidence scores from lattices
Lattices
R(x1)
First-pass Retrieval Result
x1
x2 x3
R(x2) R(x3)
Not shown to the user
[Chen & Lee,
Interspeech 11]
[Lee & Lee, CSL 14]
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
Lattices
R(x1)
First-pass Retrieval Result
x1
x2 x3
R(x2) R(x3)
Assume the results with high confidence scores are correct
Examples of Q
Considered as examples of Q
[Chen & Lee,
Interspeech 11]
[Lee & Lee, CSL 14]
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
Lattices
R(x1)
First-pass Retrieval Result
x1
x2 x3
R(x2) R(x3)
similar dissimilar
Examples of Q
[Chen & Lee,
Interspeech 11]
[Lee & Lee, CSL 14]
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
Lattices
R(x1)
First-pass Retrieval Result
x1
x2 x3
R(x2) R(x3)
time 1:01
time 2:05
time 1:45
…
time 2:16
time 7:22
time 9:01
Rank according to new scores
Examples of Q
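A minimal sketch of the PRF re-scoring step, assuming the first-pass result is a list of (segment_features, lattice_confidence) pairs and that dtw_distance is an available helper (e.g. DTW over MFCC or posteriorgram frames); the systems cited above differ in how the two scores are combined and normalized.

def prf_rescore(first_pass, dtw_distance, num_pseudo=3, alpha=0.5):
    ranked = sorted(first_pass, key=lambda hit: hit[1], reverse=True)
    pseudo_examples = [feats for feats, _ in ranked[:num_pseudo]]   # assumed correct
    rescored = []
    for feats, lattice_score in first_pass:
        # smaller DTW distance to the pseudo examples means more similar to the query
        similarity = -min(dtw_distance(feats, ex) for ex in pseudo_examples)
        # in practice both scores are normalized before interpolation
        new_score = (1 - alpha) * lattice_score + alpha * similarity
        rescored.append((feats, new_score))
    return sorted(rescored, key=lambda hit: hit[1], reverse=True)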
(A) (B)
 Lecture recording [Lee & Lee, CSL 14]
Pseudo Relevance Feedback (PRF)
- Experiments
Evaluation Measure: MAP (Mean Average Precision)
(A) (B)
Pseudo Relevance Feedback (PRF)
- Experiments
(B): speaker independent (50% recognition accuracy)
(A): speaker dependent (84% recognition accuracy)
(A) and (B) use different speech recognition systems
(A) (B)
 PRF (red bars) improved the first-pass retrieval
results with lattices (blue bars)
Pseudo Relevance Feedback (PRF)
- Experiments
New Direction 2-3:
Incorporating Information Lost in ASR
Graph-based Approach
Graph-based Approach
 PRF
 Each result considers its similarity to the audio examples
 Makes an assumption to find the examples
 Graph-based approach [Chen & Lee, ICASSP 11][Lee & Lee, APSIPA 11][Lee & Lee, CSL 14]
 Does not assume some results are correct
 Considers the similarity between all results
Graph Construction
 The first-pass results are considered as a graph.
 Each retrieval result is a node
First-pass Retrieval
Result from lattices
x1
x2
x3
x2
x3
x1
x4
x5
Graph Construction
 The first-pass results are considered as a graph.
 Nodes are connected if their retrieval results are similar.
 DTW similarities are considered as edge weights
x2
x3
x1
x4
x5
Dynamic Time Warping
(DTW) Similarity
similar
Changing Confidence Scores by Graph
 The score of each node depends on its neighbors.
x2
x3
x1
x4
x5
G(x1)
G(x2)
G(x3)
G(x5)
G(x4)
high
high
The results are ranked according to new scores G(xi).
“You are known by the company you keep”
Changing Confidence Scores by Graph
 The score of each node depends on its neighbors.
x2
x3
x1
x4
x5
G(x1)
G(x2)
G(x3)
G(x5)
G(x4)
low
low
The results are ranked according to new scores G(xi).
“You are known by the company you keep”
Graph-based Re-ranking - Formulation
G(x_i) = (1 − α) · R(x_i) + α · Σ_{x_j ∈ N(x_i)} Ŵ(x_j, x_i) · G(x_j)
 R(x_i): original score of x_i (from lattices)
 The second term considers the graph structure
 N(x_i): neighbors of x_i (nodes connected to x_i)
 W(x_i, x_j): edge weight between x_i and x_j; Ŵ(x_j, x_i) is W(x_j, x_i)
normalized by the weights of all the edges connected to x_j
 The score of x_i is pulled closer to the scores of the nodes x_j with
larger edge weights
 α: interpolation weight between the original score and the graph term
 Assign a score G(x) to each hit region based on the graph structure:
G(x1) depends on G(x2) and G(x3); G(x2) depends on G(x1) and G(x3); ……
 How to find G(x1), G(x2), G(x3) …… satisfying the equation above?
This is a random walk.
 G(x_i) is uniquely and efficiently obtainable.
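One way to see why G(x_i) is efficiently obtainable: the fixed point of the equation above can be reached by simple iteration (a random walk with restart). A minimal sketch, assuming W is the symmetric DTW-similarity matrix between the first-pass results and R holds their lattice confidence scores:

import numpy as np

def graph_rescore(R, W, alpha=0.5, iterations=100):
    R = np.asarray(R, dtype=float)
    W = np.asarray(W, dtype=float)
    col_sums = W.sum(axis=0)                         # total edge weight connected to each x_j
    W_hat = W / np.where(col_sums > 0, col_sums, 1.0)
    G = R.copy()
    for _ in range(iterations):
        G = (1 - alpha) * R + alpha * W_hat.dot(G)   # the re-ranking equation above
    return G

R = [0.9, 0.3, 0.8]                                  # lattice confidence scores
W = [[0.0, 0.1, 0.7],                                # DTW similarities between hit regions
     [0.1, 0.0, 0.2],
     [0.7, 0.2, 0.0]]
print(graph_rescore(R, W))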
 Lecture recording [Lee & Lee, CSL 14]
(A) (B)
Graph-based Approach -
Experiments
(B): speaker independent (low recognition accuracy)
(A): speaker dependent (high recognition accuracy)
(A) (B)
 Graph-based re-ranking (green bars) outperformed PRF (red
bars)
Graph-based Approach -
Experiments
[Figure: ATWV from 0.25 to 0.35 on Assamese, Bengali, and Lao, comparing the first pass (on lattices) with graph-based re-ranking]
Graph-based Approach –
Experiments on OpenKWS
[Lee & Glass,
Interspeech 14]
Graph-based Approach –
More Experiments
 13% relative improvement on OOV queries on
another lecture recording (several speakers) [Jansen,
ICASSP 13][Norouzian, ICASSP 13]
 14% relative improvement on AMI Meeting Corpus
[Norouzian, Interspeech 13]
 Graph Spectral Clustering
 Optimizing evaluation measure and considering the
graph structure at the same time [Audhkhasi, ICASSP 2014]
 11% relative improvement with subword-based
system on OpenKWS15 (Swahili) [Van Tung Pham, ICASSP,
2016]
New Direction 3:
No Speech Recognition!
Why Spoken Content Retrieval
without Speech Recognition?
 Bypassing ASR to avoid information loss and all problems
with ASR (errors, OOV words, background noise, etc. )
 Just to identify the query, no need to find out which words the
query includes
 Audio files on the Internet in hundreds of different languages
 Too limited annotated data for training reliable speech
recognition systems for most languages
 A written form doesn't even exist for some languages
 Many audio files are code-switched across several different
languages
Spoken Content Retrieval
without Speech Recognition
user
“US President”
spoken
query
Compute similarity between spoken queries and audio
files on acoustic level, and find the query term
Spoken Content
“US President” “US President”
Is it possible?
Approach Categories
 DTW-based Approaches
 Matching sequences with DTW
 Audio Segment Representation
 Representing audio segments by fixed length vector
representations
 Unsupervised ASR (or model-based approach)
 Training word- or subword-like acoustic patterns (or
tokens) from target audio archive
 Transcribing both the audio archive and the query into
word- or subword-like token sequences
 Matching based on the tokens, just like text retrieval
New Direction 3-1:
No Speech Recognition!
DTW-based Approaches
DTW-based Approach
 Conventional DTW
Audio Segment
Audio Segment
DTW-based Approach
 DTW for query-by-example
 Whether a spoken query is in an utterance
Spoken
Query
Utterance
Segmental DTW [Zhang,
ICASSP 10], Subsequence
DTW [Anguera, ICME 13][Calvo,
MediaEval 14]
Acoustic Feature Vectors
 Gaussian posteriorgram [Zhang, ICASSP 10][Wang,
MediaEval 14]
 Phonetic posteriors [Hazen, ASRU 09]
 MLP trained from another corpus (probably in a
different language)
 Bottle-neck feature generated from MLP [Kesiraju,
MediaEval 14]
 RBM posteriorgram [Zhang, ICASSP 12]
 Performance comparison [Carlin, Interspeech 11]
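A minimal sketch of subsequence DTW for query-by-example: the query (a sequence of feature vectors, e.g. Gaussian posteriorgrams) may start and end anywhere inside the utterance, so the first query frame can align with any utterance frame at no start-up cost. The Euclidean local distance is only an illustration; other distances (e.g. negative log inner product for posteriorgrams) are common.

import numpy as np

def subsequence_dtw(query, utterance):
    """query: (Tq, D) array; utterance: (Tu, D) array; returns the best matching cost."""
    Tq, Tu = len(query), len(utterance)
    dist = np.linalg.norm(query[:, None, :] - utterance[None, :, :], axis=2)
    D = np.full((Tq, Tu), np.inf)
    D[0, :] = dist[0, :]                      # free start anywhere in the utterance
    for i in range(1, Tq):
        for j in range(Tu):
            best_prev = D[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, D[i, j - 1], D[i - 1, j - 1])
            D[i, j] = dist[i, j] + best_prev
    return D[-1, :].min()                     # free end anywhere in the utterance

query = np.random.rand(20, 13)                # e.g. 20 frames of 13-dim features
utterance = np.random.rand(200, 13)
print(subsequence_dtw(query, utterance))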
Speed-up Approaches for DTW
 Segment-based matching [Chan & Lee, Interspeech
10][Chan & Lee, ICASSP 11]
Spoken
Query
Utterance
Group consecutive acoustically similar feature
vectors into a segment
Speed-up Approaches for DTW
 Segment-based matching
Hierarchical
Agglomerative
Clustering (HAC)
Step 1: build a tree
Step 2: pick a
threshold
Group consecutive acoustically similar feature
vectors into a segment
Speed-up Approaches for DTW
 Segment-based matching [Chan & Lee, Interspeech
10][Chan & Lee, ICASSP 11]
Spoken
Query
Utterance
Compute similarities between segments only
Speed-up Approaches for DTW
 Segment-based matching [Chan & Lee, Interspeech
10][Chan & Lee, ICASSP 11]
 Lower bound estimation [Zhang, ICASSP 11][Zhang,
Interspeech 11]
 Indexing the frames in the target audio file [Jansen,
ASRU 11][Jansen, Interspeech 12]
 Information Retrieval based DTW [Anguera, Interspeech
13]
New Direction 3-2:
No Speech Recognition!
Audio Segment Representation
Framework
Audio archive divided into variable-
length audio segments
Audio
Segment to
Vector
Audio
Segment to
Vector
Similarity
Search
Result
Spoken
Query
Off-line
On-line
[Chung & Lee, Interspeech 16][Chen, ICASSP 15]
[Levin, ICASSP 15][Levin, ASRU 13]
Audio Word to Vector
 The audio segments corresponding to words
with similar pronunciations are close to each
other.
ever ever
never
never
never
dog
dog
dogs
Audio Word to Vector -
Segmental Acoustic Indexing
 Basic idea
[Levin, ICASSP 15][Levin, ASRU 13]
[Figure: DTW distances (e.g. 0.5, 0.8, 0.3, ……) between the audio segment and a set of template audio segments form a fixed-length vector]
Audio Word to Vector –
Sequence Auto-encoder
[Chung & Lee,
Interspeech 16]
RNN Encoder
x1 x2 x3 x4
audio segment
acoustic features
Representation for the whole
audio segment
Audio Word to Vector –
Sequence Auto-encoder
RNN Decoder
x1 x2 x3 x4
y1 y2 y3 y4
x1 x2 x3 x4
RNN Encoder
[Chung & Lee,
Interspeech 16]
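A minimal sketch of the sequence auto-encoder idea, written with PyTorch and GRUs for brevity (the cited work may differ in the recurrent unit and training details): the encoder's last hidden state is the fixed-length embedding of the audio segment, and the decoder is trained to reconstruct the input acoustic features from it.

import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, feat_dim=39, hidden_dim=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):                        # x: (batch, T, feat_dim)
        _, h = self.encoder(x)                   # h: (1, batch, hidden) = segment embedding
        # teacher forcing: feed the shifted input frames to the decoder
        dec_in = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        dec_out, _ = self.decoder(dec_in, h)
        return self.output(dec_out), h.squeeze(0)

model = SeqAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 50, 39)                       # a batch of audio segments
recon, embedding = model(x)                      # embedding: one vector per segment
loss = nn.functional.mse_loss(recon, x)          # reconstruction objective
optimizer.zero_grad()
loss.backward()
optimizer.step()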
Sequence Auto-encoder –
Experimental Results
[Figure: pairs of audio segments (e.g. "never" vs. "ever") compared by the cosine similarity of their embeddings and by the edit distance between their phoneme sequences]
More similar pronunciation gives higher cosine similarity.
Sequence Auto-encoder –
Experimental Results
 Projecting the embedding vectors to 2-D
[Figure: 2-D projection in which words with similar pronunciations, such as "day"/"days" and "say"/"says", fall close together]
Sequence Auto-encoder –
Experimental Results
 Audio story (LibriSpeech corpus)
MAP
training epochs for
sequence auto-encoder
SA: sequence
auto-encoder
DSA: de-noising
sequence auto-encoder
Input: clean speech
+ noise
output: clean speech
New Direction 3-3:
No Speech Recognition!
Unsupervised ASR
Conventional ASR
… Hello World
…
ASR
unknown speech signal
Unsupervised ASR
ASR
unknown speech signal
Used in Query by example
Spoken Term Detection
Unsupervised ASR:
Learn the models for a set of acoustic patterns (tokens)
directly from the corpus (target spoken archive)
t0t1t2, t1t3,
t2t3,
t2t1t3t3t2 …
Acoustic Tokens
Unsupervised ASR - Acoustic
Token
utterance
acoustic
feature
acoustic tokens: chunks of acoustically similar feature
vectors with token ids
t0 t1 t2 t1
[Zhang & Glass, ASRU 09]
[Huijbregts, ICASSP 11]
[Chan & Lee, Interspeech 11]
Unsupervised ASR
- Overall Framework
Initialization
feature
sequence
model training
token decoding
initial token
sequence
final token
sequence
X: feature sequence for the whole corpus
ω_i: token sequences for X
θ_i: model (e.g. HMM) parameters
i: training iteration
simple segmentation
and clustering
Unsupervised ASR
- Initialization
Get Token ID
Extract acoustic
features for every
utterance
Unsupervised ASR
- Overall Framework
Initialization → initial token sequence → (model training ⇄ token decoding) → final token sequence
 model training: optimize the HMM parameters with the Baum–Welch
algorithm on token sequence ω_{i−1} to get new models θ_i
 token decoding: decode the acoustic features into a new token
sequence ω_i by Viterbi decoding with θ_i
 iterate until the token sequences (including the token boundaries)
converge
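The loop can be summarized in a few lines. The sketch below is only structural: train_hmms (e.g. Baum–Welch over the current token labels) and viterbi_decode are hypothetical helpers, and the initialization here is a simple frame-clustering stand-in for the segmentation step described above.

from sklearn.cluster import KMeans

def initialize_tokens(features, num_tokens):
    """Label every frame by k-means and merge consecutive identical labels into tokens."""
    frame_labels = KMeans(n_clusters=num_tokens, n_init=10).fit_predict(features)
    tokens = [int(frame_labels[0])]
    for lab in frame_labels[1:]:
        if int(lab) != tokens[-1]:
            tokens.append(int(lab))
    return tokens

def unsupervised_asr(features, num_tokens, train_hmms, viterbi_decode, max_iter=20):
    token_seq = initialize_tokens(features, num_tokens)
    models = None
    for _ in range(max_iter):
        models = train_hmms(features, token_seq)       # optimize token HMMs (Baum-Welch)
        new_seq = viterbi_decode(features, models)     # re-decode the corpus into tokens
        if new_seq == token_seq:                       # labels and boundaries converged
            break
        token_seq = new_seq
    return models, token_seq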
Acoustic Token in Query by
Example Spoken Term
Detection
 Compute the similarity between the models of
two tokens
Model of
token A
Model of
token B
KL divergence of the
Gaussian mixtures in the
first state of two models
Acoustic Token in Query by
Example Spoken Term
Detection
 Compute the similarity between the models of
two tokens
Model of
token A
Model of
token B
Sum of the KL
divergence over the
states of the two token
models
Token-based DTW
subsequence matching Token-based DTW
Tokens
in query
Tokens in an
utterance
 Signal-level DTW is more sensitive to signal variation (e.g. the same phoneme
across different speakers), while token models can better cover the
distribution of signal variation
 Much lower on-line computation load
[Figure: token-based DTW grid matching the token sequence of the query against the token sequence of an utterance]
Multi-granularity Space
for Acoustic Tokens
• Unknown hyperparameters for the token models
• Number of HMM states per token (m): token length
• Number of distinct tokens (n)
• Multiple layers of intrinsic representations of speech
Multi-granularity Space
for Acoustic Tokens
• From short to long (Temporal Granularity)
– phoneme
– syllable
– word
– phrase
• From coarse units to fine units (Phonetic Granularity)
– general phoneme set
– gender dependent phoneme set
– speaker specific phoneme set
Number of distinct HMMs (n)
Number of states per HMM (m)
Multi-granularity Space
for Acoustic Tokens
Training multiple sets of HMMs with different granularities
[Chung & Lee, ICASSP 14 ]
phoneme syllable word phrase
general
gender
dependent
speaker
specific
n
m
Multi-granularity Space
for Acoustic Tokens
 Token-based DTW using tokens with different
granularities (m, n), averaged together, gave much better
performance
 One example
 Frame-level DTW: MAP = 10%
 Using only the token set with the best
performance: MAP = 11%
 Using 20 sets of tokens (number of states per
HMM m = 3,5,7,9,11, number of distinct HMMs
n=50,100,200,300): MAP = 26%
Hierarchical Paradigm
 Typical ASR:
 Acoustic Model: models for the phonemes
 Lexicon: the pronunciation of every word as a
phoneme sequence
 Language Model: the transition between words
Word 1
Phoneme 1 Phoneme 4
Word 2
Phoneme 2 Phoneme 1 Phoneme 3
Lexicon
word 1
word 2
word 3
word 4
Language Model
Phoneme 1
Phoneme 2
Phoneme 3
Acoustic Model
word-like
token 1
word-like
token 1
word-like
token 1
Hierarchical Paradigm
 Similarly, in unsupervised ASR:
 Acoustic Model: the phoneme-like token HMMs
 Lexicon: the pronunciation of every word-like token as
a sequence of phoneme-like tokens
 Language Model: the transition between word-like
tokens
word-like token 1
phoneme-
like token 1
phoneme-
like token 4
word-like token 2
phoneme-
like token 2
phoneme-
like token 1
phoneme-
like token 3
Lexicon
word-like
token 1
Language Model
phoneme-like token 1
Acoustic Model
phoneme-like token 2
phoneme-like token 3
word-like
token 1
word-like
token 1
word-like
token 1
Hierarchical Paradigm
 Similarly, in unsupervised ASR:
 Acoustic Model: the phoneme-like token HMMs
 Lexicon: the pronunciation of every word-like token as
a sequence of phoneme-like tokens
 Language Model: the transition between word-like
tokens
word-like token 1
phoneme-
like token 1
phoneme-
like token 4
word-like token 2
phoneme-
like token 2
phoneme-
like token 1
phoneme-
like token 3
Lexicon
word-like
token 1
Language Model
phoneme-like token 1
Acoustic Model
phoneme-like token 2
phoneme-like token 3
Bottom-up Construction
Top Down Constraint
Bottom Up Construction
3 stages during training
focus on different constraints:
stage1: Acoustic Model
stage2: Language Model
stage3: Lexicon
stage1
stage2
stage3
this part alone would be the
HMM training we described
earlier
[Chung & Lee, ICASSP 13]
word-like
token 1
word-like
token 1
word-like
token 1
Hierarchical Paradigm
 Similarly, in unsupervised ASR:
 Acoustic Model: the phoneme-like token HMMs
 Lexicon: the pronunciation of every word-like token as
a sequence of phoneme-like tokens
 Language Model: the transition between word-like
tokens
word-like token 1
phoneme-
like token 1
phoneme-
like token 4
word-like token 2
phoneme-
like token 2
phoneme-
like token 1
phoneme-
like token 3
Lexicon
word-like
token 1
Language Model
phoneme-like token 1
Acoustic Model
phoneme-like token 2
phoneme-like token 3
Bottom-up Construction
Top Down Constraint
Top-down Constraints [Jansen, ICASSP 13]
This figure is from Aren
Jansen’s ICASSP paper.
 Signals of the same phoneme may be very different on phoneme
level, but the global structures of signals of the same word are very
often very similar on word level
 Global structures help in building the hierarchical model
Multi-layered Acoustic Tokenizing Deep Neural
Networks (MAT-DNN) [Chung & Lee, ASRU 15]
 Jointly learn high quality frame-level features (much better than MFCCs) and
acoustic tokens in an unsupervised way
 Unsupervised training of a multi-target DNN (MDNN) using the unsupervised
token labels as training targets
[Figure: a Multi-layered Acoustic Tokenizer (MAT) performs token model and token label optimization over granularities 𝟁 = (m, n); the multi-layered token labels are the targets of the MDNN; the bottleneck features of the MDNN are concatenated with the acoustic features and fed back to the MAT; the features and the (sub)word-like tokens are both evaluated]
In the first iteration, MFCCs are used as the initial acoustic features.
In the later iterations, the bottleneck features are concatenated with the MFCCs.
Multi-layered Acoustic Tokenizing Deep Neural
Networks (MAT-DNN) - Experimental Results
 Query-by-example spoken term detection on Tsonga [Chung & Lee, ASRU 15]

Approach          Features / Tokens   MAP
Frame-based DTW   MFCC                 9.0
Frame-based DTW   New Feature         28.7
Token-based DTW   New Tokens          26.2
New Direction 4:
Special Semantic Retrieval Techniques
for Spoken Content
Semantic Retrieval
 User expects semantic retrieval of spoken content.
 User asks “US President”, system also finds “Obama”
 Widely studied in text retrieval
 Take query expansion as example
user
“US President”
Search both
“US President” or “Obama”
“Obama” and “US
President” are related
Retrieval
system
Semantic Retrieval
of Spoken Content
 User expects semantic retrieval of spoken content.
 User asks “US President”, system also finds “Obama”
 Widely studied in text retrieval
 Take query expansion as example
 The techniques developed for text can be directly
applied on semantic retrieval of spoken content
 Are there any special techniques for spoken content?
 Both query Q and document d are represented as
unigram language models θQ and θd
Review: Language Modeling Retrieval
Approach
Query model θ_Q: P(w | θ_Q), a distribution over words w1 w2 w3 w4 w5 …
Document model θ_d: P(w | θ_d), a distribution over words w1 w2 w3 w4 w5 …
The KL divergence between the two models can be evaluated.
Review: Language Modeling Retrieval
Approach
 Given query Q, rank document d according to a
relevance score function SLM(Q,d):
 Inverse of KL divergence between query model θQ and
document model θd
 The documents with document models θd similar to
query model θQ are more likely to be relevant.
S_LM(Q, d) = −KL(θ_Q || θ_d)
Review: Basic Query/Document
Models in Text Retrieval
 Query model θ_Q for text:
P(w | θ_Q) = N(w, Q) / Σ_{w'} N(w', Q)
N(w, Q): term frequency of word w in query Q, normalized into a probability
 Document model θ_d for text:
P(w | θ_d) = N(w, d) / Σ_{w'} N(w', d)
N(w, d): term frequency of word w in document d, normalized into a probability
These basic models can be enhanced by query/document
expansion to handle the problem of semantic retrieval.
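A minimal sketch of ranking by this relevance score, with maximum-likelihood unigram models as above; the tiny smoothing constant for unseen words in the document model is an assumption added only so the toy example avoids log(0) (real systems use proper smoothing).

import math
from collections import Counter

def unigram_model(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}   # P(w | theta)

def lm_score(query_tokens, doc_tokens, vocab, eps=1e-6):
    q, d = unigram_model(query_tokens), unigram_model(doc_tokens)
    kl = 0.0
    for w in vocab:
        pq, pd = q.get(w, 0.0), d.get(w, eps)          # smooth the document model
        if pq > 0:
            kl += pq * math.log(pq / pd)
    return -kl                                         # S_LM(Q, d) = -KL(theta_Q || theta_d)

docs = {"d1": "obama visited taiwan".split(),
        "d2": "the weather in taiwan".split()}
query = "obama taiwan".split()
vocab = set(query) | {w for toks in docs.values() for w in toks}
print(sorted(docs, key=lambda name: lm_score(query, docs[name], vocab), reverse=True))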
Review: Query Expansion [Tao, SIGIR 06]
 First pass: the query model P(w | θ_Q) for text query Q is matched against
the document models θ_d, and the retrieval engine returns a first-pass
retrieval result (doc 101, doc 205, doc 145, ……).
 The document models P(w | θ_d) of the top-N retrieved documents are
collected, and the common patterns in these document models are found
(by the EM algorithm).
 A new query model P'(w | θ_Q) is formed from the original query model
and the common patterns.
 The retrieval engine is run again with the new query model to produce
the final result.
Review: Document Expansion [Wei, SIGIR 06]
 Find the topics behind a document (realized by PLSA, LDA, etc.) and
modify the document model P(w | θ_d) accordingly.
 Example: for a document containing "airplane", the new document
model θ_d also gives probability to related words such as "aircraft".
“aircraft”
Semantic Retrieval on Lattices
 Take the basic language modeling retrieval approach as an
example
 Modify the retrieval model for lattices:

Original Retrieval Model of Text   For Lattices
Term Frequency                     Expected Term Frequency
Document Length                    Expected Document Length
……                                 ……
Document Model from Lattices
 Document model θ_d for text:
P(w | θ_d) = N(w, d) / Σ_{w'} N(w', d)
 (Spoken) document model θ_d from a lattice:
P(w | θ_d) = E(w, d) / Σ_{w'} E(w', d)
Replace the term frequency N(w, d) with the expected
term frequency E(w, d) computed from the lattices.
 Expected term frequency E(w, d) for word w in spoken
document d based on its lattice:
Expected Term Frequency
E(w, d) = Σ_{u ∈ L(d)} N(w, u) · P(u | d)
[Figure: lattice of spoken document d with word arcs wA, wB, wC, …]
L(d): all the word sequences in the lattice of d
u: a word sequence in the lattice of d
N(w, u): the number of times word w appears in word sequence u
P(u | d): posterior probability of word sequence u
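A minimal sketch of E(w, d), assuming the lattice has already been expanded into (word sequence, posterior) pairs; real systems compute the same quantity directly on the lattice (e.g. from word posteriors) without enumerating paths.

from collections import Counter, defaultdict

def expected_term_frequencies(lattice_paths):
    """lattice_paths: list of (list_of_words, posterior P(u|d)) for one spoken document d."""
    E = defaultdict(float)
    for words, posterior in lattice_paths:
        for w, n in Counter(words).items():            # N(w, u)
            E[w] += n * posterior                      # E(w,d) = sum_u N(w,u) P(u|d)
    return dict(E)

lattice = [(["us", "president", "obama"], 0.6),
           (["us", "present", "obama"], 0.3),
           (["a", "president", "alabama"], 0.1)]
E = expected_term_frequencies(lattice)
total = sum(E.values())
doc_model = {w: e / total for w, e in E.items()}       # P(w | theta_d) from the lattice
print(doc_model)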
New Direction 4-1:
Special Semantic Retrieval Techniques
for Spoken Content
Better Estimation of Term Frequencies
Better Estimation of Term
Frequencies
 Expected term frequency E(w,d) from lattices can be
inaccurate
 Context of each term in the lattices [Tu & Lee, ICASSP 12]
 The same terms usually have similar context [Schneider,
Interspeech 10]
 Graph-based approach
 Graph-based approach using acoustic feature similarity
improved spoken term detection
 It can also improve semantic retrieval of spoken content
based on language modeling retrieval approach
 Idea: Replace expected term frequency E(w,d) with scores
from graph-based approach [Lee & Lee, SLT 12] [Lee & Lee,
IEEE/ACM T. ASL 14]
Graph-based Approach for
Semantic Retrieval
 For each word w in the lexicon
… …
… …
… …
x1 x2
x3
x4
spoken document
spoken document
spoken document
Find the occurrence regions of word w
Graph-based Approach for
Semantic Retrieval
 For each word w in the lexicon
… …
… …
… …
x1 x2
x3
x4
spoken document
spoken document
spoken document
Connect the occurrence regions by acoustic feature similarities
Graph-based Approach for
Semantic Retrieval
 For each word w in the lexicon
… …
… …
… …
x1 x2
x3
x4
spoken document
spoken document
spoken document
Obtain new score G(x) by random walk
G(x1) G(x2)
G(x3)
G(x4)
Graph-based Approach for
Semantic Retrieval
 For each word w in the lexicon
… …
… …
… …
x1 x2
x3
x4
spoken document
spoken document
spoken document
G(x1) G(x2)
G(x3)
G(x4)
Repeat this process for all the words w in the lexicon
Graph-based Approach for
Semantic Retrieval
 The scores G(x) obtained from the graph give a better estimation of
the term frequencies for each word w in spoken document d.
 Lattice-based document model:
P(w | θ_d) = E(w, d) / Σ_{w'} E(w', d)
 Graph-enhanced document model: replace the expected term frequency
E(w, d) with an estimate computed from the graph scores G(x) of the
occurrence regions of word w in document d.
 Query and document expansion borrowed from text retrieval
can be equally applied.
Graph-based Approach for Semantic
Retrieval - Experiments
 Experiments on TV News [Lee & Lee, IEEE/ACM T. ASL 14]
[Figure: MAP from 0.43 to 0.51 for Basic LM, Query Expansion, Document Expansion, and Query + Document Expansion, each with the lattice-based and the graph-enhanced document models]
New Direction 4-2:
Special Semantic Retrieval Techniques
for Spoken Content
Exploiting Acoustic Tokens
Acoustic Tokens
 We can identify “acoustic tokens” in direction 3
Token 1
Token 1
Token 1
Token 2
Token 2 Token 3
Token 3
Token 3
Query expansion with acoustic tokens
Can be useful in semantic
retrieval of spoken content:
Unsupervised semantic retrieval of spoken content
“US President” “Obama”
Query Expansion – Never Appear?
If "Obama" is not in the lexicon, "Obama" will never appear in
the lattices, so query expansion can never learn that "Obama"
co-occurs with "US President".
Typical approach: using subwords
Query expansion with acoustic tokens
The two approaches are complementary to each other.
Query Expansion
with Acoustic Tokens
Original Text Query:
“US President”
d100: …… US President …
d205: … US President ……
First pass: Retrieve spoken documents containing “US
President” in the transcriptions
[Lee & Lee, ICASSP 13]
Query Expansion
with Acoustic Tokens
d100: …… US President …
d205: … US President ……
Find acoustic tokens frequently appear in the signals of
these retrieved documents
Original Text Query:
“US President”
Query Expansion
with Acoustic Tokens
d100: …… US President …
d205: … US President ……
Obama?
Obama?
Original Text Query:
“US President”
Even if the terms related to the query are OOV, as long as they
really co-occur with the query in the speech signals, the acoustic
tokens corresponding to these terms can be found.
Query Expansion
with Acoustic Tokens
d100: …… US President …
d205: … US President ……
Expanded Query:
“US President” +
Original Text Query:
“US President”
Query Expansion
with Acoustic Tokens
user
“US President”
Retrieval
system
Expanded Query:
Lattices
Find the same tokens
“White House”
“US President” and
By expanding the text query with acoustic tokens,
more semantically related audio files can be retrieved.
Query Expansion
– Acoustic Patterns
 Experiments on TV News [Lee & Lee, ICASSP 13]
Unsupervised Semantic Retrieval
 Unsupervised Semantic Retrieval [Li & Lee, ASRU
13][Oard, FIRE 13]
 No speech recognition as query-by-example
spoken term detection
 But find spoken documents semantically related to
the spoken queries
 New task, not too much previous work
 Below is just a very preliminary study [Li & Lee, ASRU 13]
Unsupervised Semantic Retrieval
Spoken Queries
1. Find spoken documents containing the spoken query
database
Spoken Document 1
Spoken Document 2
Spoken Document 3
Done by the query-by-example spoken term
detection approaches (e.g. DTW)
Unsupervised Semantic Retrieval
2. Find acoustic tokens frequently co-occurring with
the spoken queries in the same document
Spoken Queries
Unsupervised Semantic Retrieval
3. Use the acoustic tokens to expand the
original spoken query
Expanded
Queries
Unsupervised Semantic Retrieval
4. Retrieve again by the expanded queries
Expanded
Queries
Can retrieve spoken documents not
containing the original spoken queries
Unsupervised Semantic Retrieval -
Experiments
 Broadcast news, MAP as evaluation measure
 Using only DTW to retrieve spoken queries:
Spoken term detection: 28.30%
Semantic retrieval: 8.76%
User only wants to find
spoken documents
containing query term.
User wants to find all spoken documents
semantically related to query term.
[Li & Lee, ASRU 13]
Unsupervised Semantic Retrieval -
Experiments
 Broadcast news, MAP as evaluation measure
 Using only DTW to retrieve spoken queries:
Spoken term detection: 28.30%
Semantic retrieval: 8.76%
Exactly the same retrieval
results, but what user wants to
find is different
Lots of semantically related
spoken documents cannot be
retrieved [Li & Lee, ASRU 13]
Unsupervised Semantic Retrieval -
Experiments
 Broadcast news, MAP as evaluation measure
 Using only DTW to retrieve spoken queries:
Spoken term detection: 28.30%
Semantic retrieval: 8.76%
 Expanded spoken queries:
MAP was improved from 8.76% to 9.70%
 Unsupervised semantic retrieval has a long way to go
[Li & Lee, ASRU 13]
New Direction 5:
Spoken Content is Difficult to Browse!
Audio is hard to browse
 When the system returns the retrieval results,
the user doesn't know what he/she gets at first
glance
Retrieval Result
Audio is hard to browse
Interactive spoken content retrieval
Summarization, Key term extraction,
Title Generation
Organizing retrieved results
Question answering
New Direction 5-1:
Spoken Content is Difficult to Browse!
Interactive Spoken Content Retrieval
Interactive
Spoken Content Retrieval
 Conventional Retrieval Process
user
US President
Here are what
you are looking
for:
Doc3
Doc1
Doc2
…
Can be noisy
 Interactive retrieval
Interactive
Spoken Content Retrieval
user
US President
Not clear enough
……
More precisely, please.
 Interactive retrieval
Is it related to “Election”?
Interactive
Spoken Content Retrieval
user
US President
Still not clear
enough ……
More precisely, please.
Obama
Here are what you are
looking for.
 Interactive retrieval
Is it related to “Election”?
Interactive
Spoken Content Retrieval
user
US President
I see!
More precisely, please.
Obama
Yes.
Interactive
Spoken Content Retrieval
 Given the information entered by the users at
present, which action should be taken?
“Give me an example.”
“Is it relevant to XXX?”
“Can you give me another query?”
“Show the results.”
MDP for Interactive Retrieval
 MDP
 Widely used in dialogue systems (air ticket booking,
city guides, …)
 The system is in certain states.
 Which action should be taken depends on the state the
system is in.
 MDP for Interactive retrieval [Wen & Lee, Interspeech
12][Wen & Lee, ICASSP 13]
 State: the degree of clarity of the user’s information
need
Ambiguous Clear
state space
S1
Spoken
Archive
Search
Engine
Query
US President.
Doc3
Doc1
Doc2
…
Ambiguous Clear
state space
State
Estimator
[Cronen-Townsen,
SIGIR 02]
[Zhou, SIGIR 07]
State Estimator: Estimate the degree
of clarity from the retrieval results
S1
A
1
A
2
A
3
A
4
 A set of candidate actions
 System: “More precise, please.”
 System: “Is it relevant to XXX?”
 …..
Ambiguous Clear
state space
 There is an action “show results”
 When the system decides to show the results, the
retrieval session is ended
S1
A
1
A
2
A
3
A
4
 Choose the actions by intrinsic policy π(S)
 The policy is a function
 Input: state S, output: action A
π(S)=“More
precise, please”
π(S)=Show Results
Ambiguous Clear
state space
S1
Spoken
Archive
Search
Engine
Doc3
Doc1
Doc2
…
A
1
A1: More
precise, please.
Obama
C1
User response The system gets a cost
C1 due to user labor.
Ambiguous Clear
state space
π(S1) = A1
S1
Spoken
Archive
Search
Engine
Doc2
Doc1
Doc3
…
A
1
Update Results
State
Estimator
S2
C1
Ambiguous Clear
state space
Interact with Users - MDP
 Good interaction:
 The quality of the final retrieval results shown to the
user is as good as possible
 The user labor (C1, C2) is as small as possible
[Figure: an interaction episode S1 → (action A1, cost C1) → S2 → (action A2, cost C2) → S3 → Show → End]
Interact with Users - MDP
 Learn the policy π maximizing:
Retrieval Quality − User Labor
 The policy π can be learned from historical
interactions by fitted value iteration [Chandramohan,
Interspeech 10]
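A minimal sketch of learning the policy from logged interactions with fitted value iteration, here in its fitted Q-iteration form. Each logged turn is (state features, action id, reward, next state features, done), where the reward encodes retrieval quality minus user labor; the regressor choice and the feature design are illustrative assumptions, not the setup of the cited work.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(transitions, num_actions, iterations=20, gamma=0.95):
    X = np.array([np.append(s, a) for s, a, r, s2, done in transitions])
    q = None
    for _ in range(iterations):
        targets = []
        for s, a, r, s2, done in transitions:
            if done or q is None:
                targets.append(r)
            else:
                future = max(q.predict([np.append(s2, a2)])[0] for a2 in range(num_actions))
                targets.append(r + gamma * future)     # bootstrapped return
        q = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q

def policy(q, state, num_actions):                     # pi(s): pick the highest-value action
    return max(range(num_actions), key=lambda a: q.predict([np.append(state, a)])[0])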
Deep Reinforcement Learning
 Replacing MDP by deep reinforcement
learning
Spoken
Content
Retrieval
Result
s
Spoken
Content
query
State
Estimation
Action
Decision
state
The degree of clarity from
the retrieval results
action
features
 The policy π(s) is a function
 Input: state s, output: action a
Decide the actions by intrinsic
policy π(S)
[Wen & Lee, Interspeech 12]
[Wen & Lee, ICASSP 13]
Spoken
Content
Retrieval
Result
s
Spoken
Content
query
features
…
…
…
DNN
State Estimation
Action Decision
Is it relevant to XX?
Give me an example.
Show the results.
Max
[Wu & Lee, Interspeech 16]
Spoken
Content
Retrieval
Result
s
Spoken
Content
query
features
…
…
…
DNN
Is it relevant to XX?
Give me an example.
Show the results.
Max
Experimental Results
 Broadcast news, semantic retrieval [Wu & Lee,
Interspeech 16]
Retrieval Quality (MAP)
Optimization Target:
Retrieval Quality - User labor
Hand-crafted Deep Reinforcement
Learning
Previous Method
(state + decision)
New Direction 5-2:
Spoken Content is Difficult to Browse!
Summarization, Key Term Extraction
& Title Generation
Introduction
Retrieved
Audio File
Summary
Deep Learning,
Neural Network ….
10 minutes
30 seconds
Extractive
Summarization
Title Generation
Key Term
Extraction
“Introduction of
Deep Learning”
Summarization
 Unsupervised Approach: Maximum Margin
Relevance (MMR) and Graph-based Approach
 Supervised approach: the summarization problem can
be formulated as a binary classification problem
 Included in the summary or not
utterance 1
utterance 2
utterance 3
utterance 4
Binary
Classifier
-1
+1
+1
-1
utterance 2
utterance 3
classification
result
summary
Binary
Classifier
Binary
Classifier
Binary
Classifier
Lecture
Summarization
– Binary Classification
 Binary classifier individually considers each utterance
 To generate a good summary, “global information” should be
considered
 Example: summary should be concise
More advanced machine learning techniques
(Figure: the spoken document repeats near-duplicate utterances such as
“LSA is useful for summarization” and “LSA is helpful for summarization”;
selecting each utterance independently puts all of them into the summary,
making it redundant. A summary should be succinct.)
Summarization
- Whole spoken document
 Learn a special model by structured learning
techniques
 Input: whole lecture
 Output: summary
Special
Model
spoken
document
Summary
Consider the
whole lecture
3 utterances
selected in
summary
[Lee & Lee, ICASSP 13]
[Lee & Lee, Interspeech 12]
Evaluation Function
 Evaluation function of utterance set F(s)
 s: utterance set in a lecture
F(s): a score (e.g., F(s) = 10) for an
utterance set s, indicating how suitable it is to
consider the utterance set s as the summary
Properties:
• Concise?
• Include
keyword?
• Short enough?
……….
How good is it to take this
utterance set as the summary?
Lecture
Evaluation Function
– How to Summarize
 With F(s), we can do summarization on new
lectures now
Lecture
s1
s2
s3
s4
s5
s6
s7
Compute F(s) for
all utterance sets
If s6
maximizes
F(s)
summary
Enumerate all
the possible
utterance sets s
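A minimal sketch of this inference step (the scoring function below is a hand-written stand-in; in the cited work F(s) is learned with structured learning): enumerate the candidate utterance sets of a new lecture and return the one maximizing F(s).

from itertools import combinations

def F(utterance_set, lecture):
    # hand-written stand-in for the learned evaluation function: reward covering
    # many distinct words but penalize summaries that are too long
    covered = set(w for i in utterance_set for w in lecture[i].split())
    length = sum(len(lecture[i].split()) for i in utterance_set)
    return len(covered) - 0.5 * length

def summarize(lecture, max_utterances=2):
    # enumerate all utterance sets up to max_utterances and take the argmax of F(s)
    candidates = [s for k in range(1, max_utterances + 1)
                  for s in combinations(range(len(lecture)), k)]
    return max(candidates, key=lambda s: F(s, lecture))

lecture = ["LSA is latent semantic analysis",
           "LSA is useful for summarization",
           "deep learning is introduced"]
print(summarize(lecture))   # indices of the utterances selected as the summary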
Evaluation Function
 Evaluation function of utterance set F(s)
 s: utterance set in a lecture
F(s): a score (e.g., F(s) = 10) for an
utterance set s, indicating how suitable it is to
consider the utterance set s as the summary
Properties:
• Concise?
• Include
keyword?
• Short enough?
……….
How good is it to take this
utterance set as the summary?
Lecture
What properties
should F(s) check?
 Learn F(s) from training data
Reference
summary
Reference
summary
Evaluation Function - Training
(Figure: training lectures come with reference summaries; F(s) is learned
so that reference summaries receive high scores (e.g., 9, 7) while other
utterance sets receive low scores (e.g., -4).)
Structured SVM: I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support
Vector Learning for Interdependent and Structured Output Spaces, ICML, 2004.
Summarization
- Structure of spoken document
 Temporal structure helps summarization
 Long summary: consecutive utterances in a
paragraph are more likely to be selected together
 Short summary: one utterance is selected on behalf
of a paragraph.
(Figure: utterances x_{i-2} … x_{i+6} grouped into Paragraphs 1–3; for a
long summary, consecutive utterances in an important paragraph are selected
together, while for a short summary one representative utterance is selected
on behalf of each paragraph.)
Summarization
- Structure of spoken document
 Add structure information into evaluation function
of utterance set F(s)
F(s): a score (e.g., F(s) = 100)
Properties:
• Concise?
• Include
keyword?
• Short enough?
……….
utterances
Given the
information of
structure
Paragraph 1 Paragraph 2
[Shiang & Lee, Interspeech 13]
Summarization
- Structure of spoken document
 Structure in text is clear
Paragraph boundaries are directly known
 For spoken content, there is no obvious
structure
Here the structure is treated as “hidden
variables”
Structured learning with hidden variables
Summarization
- Structure of spoken document
 Evaluation Measure: ROUGE-1 and ROUGE-2
 Larger scores mean the machine-generated summaries
are more similar to human-generated summaries.
[Shiang & Lee, Interspeech 13]
Key Term Extraction
 TF-IDF is a good measure for identifying key
terms [E. D’Avanzo, DUC 04][Jiang, SIGIR 09]
 Feature parameters from latent topic models
[Hazen, Interspeech 11] [Chen & Lee, SLT 10]
 Key terms are usually focused on a small number
of topics
 Prosodic Features [Chen & Lee, ICASSP 12]
 slightly lower speed, higher energy, wider pitch
range
 Machine Learning methods
 Input: a term, output: key term or not [Liu, SLT 08][Chen
& Lee, SLT 10]
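As a concrete illustration of the TF-IDF idea in the first bullet above (the tokenization and the smoothed IDF formula are illustrative assumptions): terms frequent in the target transcription but rare in the rest of the collection receive high scores and become key-term candidates.

import math
from collections import Counter

def tfidf_keyterms(doc, collection, top_k=5):
    # doc: target transcription as a list of words
    # collection: list of transcriptions (each a list of words) used for the IDF
    tf = Counter(doc)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in collection if term in d)          # document frequency
        idf = math.log((len(collection) + 1) / (df + 1)) + 1  # smoothed inverse document frequency
        scores[term] = (count / len(doc)) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]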
Key Term Extraction
– Deep Learning
(Figure: each word x1 … xT of the document is mapped by an embedding layer
to vectors V1 … VT; attention weights α1 … αT form a document representation
ΣαiVi, which passes through a hidden layer to an output layer predicting
which terms in the keyword set are key terms of the document.)
Keyword Set:
SVM, Regression, Python,
DNN, Fourier Transform,
Speech Processing,
LSTM, Bubble Sort, etc.
[Shen & Lee, Interspeech 16]
Title Generation
 Deep Learning based Approach [Alexander M Rush,
EMNLP 15][Chopra, NAACL 16][Lopyrev, arXiv 2015][Shen, arXiv 2016]
 Sequence-to-sequence learning
 Input: a document (word sequence), output:
its title (shorter word sequence)
https://arxiv.org/pdf/1512.01712v1.pdf
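A minimal sequence-to-sequence sketch of this idea, assuming PyTorch (the vocabulary size, dimensions and greedy decoder are illustrative; the cited systems add attention and are trained on large document/title pairs):

import torch
import torch.nn as nn

class TitleGenerator(nn.Module):
    # encode the document word sequence with a GRU, then decode a shorter title
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, doc_ids, max_title_len=10, bos_id=1):
        _, h = self.encoder(self.emb(doc_ids))               # encode the whole document
        token = torch.full((doc_ids.size(0), 1), bos_id, dtype=torch.long)
        title = []
        for _ in range(max_title_len):                       # greedy decoding, word by word
            out, h = self.decoder(self.emb(token), h)
            token = self.out(out).argmax(dim=-1)              # most probable next word
            title.append(token)
        return torch.cat(title, dim=1)

model = TitleGenerator(vocab_size=1000)                       # untrained toy model
document = torch.randint(0, 1000, (1, 50))                    # a 50-word toy "document"
print(model(document).shape)                                  # (1, 10): generated title word ids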
New Direction 5-3:
Spoken Content is Difficult to Browse!
Visualizing Retrieved Results
Introduction
 Visualizing the retrieval results on an intuitive
interface helps users know what is retrieved
 Take retrieving on-line lectures as example
 Searching spoken lectures is a very good
application for spoken content retrieval
 The speech of the instructors conveys most
knowledge in the lectures
Retrieving One Course
 NTU Virtual Instructor
Searching the course Digital Speech Processing of NTU
Massive Open On-line Courses
(MOOCs)
 Enormous on-line courses
Today’s Retrieval Techniques
A list of related courses
Today’s Retrieval Techniques
More is less …...
 Given all the related lectures from different courses
learner
Which lecture should I
go first?
Learning Map
 Nodes: lectures on the same
topic
 Edges: suggested learning
order
[Shen & Lee,
Interspeech 15]
Learning Map
lectures in the
same topic
Lectures in the same topic
same topic?
 Compute the similarity of each pair of lectures
 Lexical and topical similarity of the audio transcriptions
 Lexical similarity and syntactic parsing tree similarity of
the titles of the lectures
 Take a weighted sum of the similarity measures (a sketch follows below)
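A minimal sketch of this similarity combination (the Jaccard similarity and the weights are illustrative stand-ins; the cited work also uses topical and syntactic parse-tree similarities with tuned weights):

def lecture_similarity(lecture_a, lecture_b, w_transcript=0.6, w_title=0.4):
    # lecture_a, lecture_b: dicts with 'transcript' and 'title', each a list of words
    def jaccard(x, y):
        x, y = set(x), set(y)
        return len(x & y) / len(x | y) if x | y else 0.0
    return (w_transcript * jaccard(lecture_a["transcript"], lecture_b["transcript"])
            + w_title * jaccard(lecture_a["title"], lecture_b["title"]))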
Lectures in the same topic
same topic?
(Figure: lectures a1–a3 and b1–b3 from two different courses; lecture pairs
with higher combined similarity are more likely to be on the same topic,
pairs with lower similarity less likely.)
Learning Map
suggested
learning order
Prerequisite
Lectures in
different courses
Prerequisite?
Learning a binary classifier
Training data: lectures in
existing courses
An existing course
…
Lecture 1
Lecture 2
Lecture 3
Assumption:
Lecture 1 is a prerequisite of Lecture 2,
Lecture 2 is a prerequisite of Lecture 3,
……
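A minimal sketch of how training pairs for the prerequisite classifier can be harvested from existing courses under the assumption above (the feature extraction and the classifier itself are omitted):

def build_prerequisite_pairs(courses):
    # courses: list of courses, each a list of lecture ids in teaching order;
    # consecutive lectures give positive (prerequisite) pairs, reversed order gives negatives
    positives, negatives = [], []
    for lectures in courses:
        for first, second in zip(lectures, lectures[1:]):
            positives.append((first, second))   # first is assumed a prerequisite of second
            negatives.append((second, first))   # the reversed pair is a negative example
    return positives, negatives

pos, neg = build_prerequisite_pairs([["Lecture 1", "Lecture 2", "Lecture 3"]])
print(pos)   # [('Lecture 1', 'Lecture 2'), ('Lecture 2', 'Lecture 3')]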
Demo
Vision: Personalized Courses
I want to learn “XXX”.
I am a graduate student of
computer science.
I can spend 10 hours.
Learner
I open a course for you.
on-line learning
material
 Spoken Language Processing techniques can be very
helpful.
 The spoken content in courses plays the most
important role in conveying the knowledge.
New Direction 5-4:
Spoken Content is Difficult to Browse!
Speech Question Answering
Speech Question Answering
 Machine answers questions based on the
information in spoken content
What is a possible
origin of Venus’
clouds?
volcanic activity
Speech Question Answering
 Reference: Chapter 6, “Spoken Question Answering” (Sophie
Rosset, Olivier Galibert and Lori Lamel), in G. Tur and R.
De Mori, Spoken Language Understanding: Systems for
Extracting Semantic Information from Speech.
 Question Answering in Speech Transcripts (QAST) has
been a well-known evaluation program of SQA.
 2007, 2008, 2009
 Previous studies focused on factoid questions
 E.g. “What is the name of the highest mountain in
Taiwan?”
 How about more difficult questions?
Preliminary Study
 TOEFL Listening Comprehension Test by
Machine
Question: “ What is a possible origin of Venus’ clouds? ”
Audio Story:
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere
(The original story is 5 min long.)
[Tseng & Lee, Interspeech 16]
Simple Baselines
(Figure: accuracy (%) of naive approaches (1)–(7), e.g. (2) selecting the
shortest choice as the answer and (4) selecting the choice semantically most
similar to the others, compared with random guessing.)
Experimental setup:
717 questions for training,
124 for validation, 122 for testing
[Tseng & Lee, Interspeech 16]
Model Architecture
“what is a possible
origin of Venus‘ clouds?"
Question:
Question
Semantics
…… It be quite possible that this be
due to volcanic eruption because
volcanic eruption often emit gas. If
that be the case volcanism could
very well be the root cause of
Venus 's thick cloud cover. And also
we have observe burst of radio
energy from the planet 's surface.
These burst be similar to what we
see when volcano erupt on earth
……
Audio Story:
Speech
Recognition
Semantic
Analysis
Semantic
Analysis
Answer
Select the choice
most similar to the
answer
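A minimal sketch of this final selection step (the semantic vectors are assumed to be produced by the question/story analysis modules in the figure; cosine similarity is one reasonable choice): pick the choice whose representation is closest to the answer representation extracted from the story.

import numpy as np

def select_choice(answer_vec, choice_vecs):
    # answer_vec:  semantic vector distilled from the question and the audio story
    # choice_vecs: one semantic vector per choice (A)-(D)
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return int(np.argmax([cosine(answer_vec, c) for c in choice_vecs]))

rng = np.random.default_rng(0)
print(select_choice(rng.normal(size=16), [rng.normal(size=16) for _ in range(4)]))   # index 0-3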
Experimental Results
(Figure: accuracy (%) of approaches (1)–(7); the naive approaches are
outperformed by the memory network (39.2%, proposed by the FB AI group)
and by the word-based attention model (48.3%).)
[Tseng & Lee, Interspeech 16]
Concluding Remarks
Concluding Remarks
 New research directions for spoken content retrieval
 Modified ASR for Retrieval Purposes
 Incorporating Those Information Lost in ASR
 No Speech Recognition!
 Special Semantic Retrieval Techniques for Spoken
Content
 Spoken Content is Difficult to Browse!
Take-Home Message
Spoken Content Retrieval
Speech Recognition
+
Text Retrieval
=
Spoken Content Retrieval
 Spoken content retrieval: Machine listens to the data, and
extracts the desired information for each individual user.
 Nobody is able to go through the data.
300 hrs multimedia is
uploaded per minute.
(2015.01)
1874 courses on coursera
(2016.04)
 In these multimedia, the spoken part carries very
important information about the content
• Just as Google does on text data
Overview Paper
 Lin-shan Lee, James Glass, Hung-yi Lee,
Chun-an Chan, "Spoken Content Retrieval —
Beyond Cascading Speech Recognition with
Text Retrieval," IEEE/ACM Transactions on
Audio, Speech, and Language Processing,
vol.23, no.9, pp.1389-1420, Sept. 2015
 http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf
 This tutorial includes information updated after
this paper was published.
Thank You for Your Attention

Spoken Content Retrieval

  • 1.
    Spoken Content Retrieval BeyondCascading Speech Recognition and Text Retrieval Lin-shan Lee and Hung-yi Lee National Taiwan University
  • 2.
    Focus of thisTutorial  New frontiers and directions towards the future of speech technologies  Not skills and experiences in optimizing performance in evaluation programs
  • 3.
  • 4.
    Spoken Content Retrieval LecturesBroadcast Program Multimedia Content Spoken Content
  • 5.
    Spoken Content Retrieval Spoken content retrieval: Machine listens to the data, and extract the desired information for each individual user.  Nobody is able to go through the data. 300 hrs multimedia is uploaded per minute. (2015.01) 1874 courses on coursera (2016.04)  In these multimedia, the spoken part carries very important information about the content • Just as Google does on text data
  • 6.
     Basic goal:Identify the time spans that the query occurs in an audio database  This is called “Spoken Term Detection” Spoken Content Retrieval – Goal time 1:01 time 2:05 … …… “Obama” user
  • 7.
     Basic goal:Identify the time spans that the query occurs in an audio database  This is called “Spoken Term Detection”  Advanced goal: Semantic retrieval of spoken content Spoken Content Retrieval – Goal user “US President” The user is also looking for utterances including “Obama”. Retrieval system
  • 8.
    It is naturalto think …… Spoken Content Retrieval Speech Recognition + Text Retrieval =
  • 9.
    It is naturalto think …… Speech Recognition Models Text Acoustic Models Language Model Spoken Content  Transcribe spoken content into text by speech recognition
  • 10.
    It is naturalto think …… Speech Recognition Models Text  Transcribe spoken content into text by speech recognition Spoken Content RNN/LSTM DNN
  • 11.
    It is naturalto think ……  Transcribe spoken content into text by speech recognition Speech Recognition Models Text Retrieval Result Text Retrieval Query user  Use text retrieval approaches to search over the transcriptions Spoken Content Black Box
  • 12.
    It is naturalto think …… Speech Recognition Models Text Retrieval Result Text Retrieval user Spoken Content  For spoken queries, transcribe them into text by speech recognition. Black Box
  • 13.
    Our point inthis tutorial Spoken Content Retrieval Speech Recognition + Text Retrieval =
  • 14.
    Outline  Introduction: ConventionalApproach: Spoken Content Retrieval = Speech Recognition + Text Retrieval  Core: Beyond Cascading Speech Recognition and Text Retrieval Five new directions
  • 15.
    Introduction: Spoken Content Retrieval= Speech Recognition + Text Retrieval
  • 16.
    It is naturalto think …… Speech Recognition Models Text Retrieval Result Text Retrieval user Spoken Content Speech Recognition always produces errors.
  • 17.
    Lattices Speech Recognition Models Text Retrieval Result Text Retrieval Query user Spoken Content Lattices Keep most possible recognition output  Each path has a weight (confidence to be correct) M. Larson and G. J. F. Jones, “Spoken content retrieval: A survey of techniques and technologies,” Foundations and Trends in Information Retrieval, vol. 5, no. 4-5, 2012.
  • 18.
  • 19.
  • 20.
  • 21.
    Lattices Spoken Archive Speech Recognition System Acoustic & Language Models Lattices Retrieval Result Text Retrieval user Text Query time Higherprobability to include the correct words More noisy words included inevitably Higher memory/computation requirements
  • 22.
    Searching over Lattices Consider the basic goal: Spoken Term Detection “Obama” user
  • 23.
    Searching over Lattices Consider the basic goal: Spoken Term Detection  Find the arcs hypothesized to be the query term Obama “Obama” user Obama x1 x2
  • 24.
     Consider thebasic goal: Spoken Term Detection  Posterior probabilities computed from lattices used as confidence scores Searching over Lattices Obama x1 R(x1)=0.9 Obama x2 R(x2)=0.3 Two ways to display the results: unranked and ranked.
  • 25.
     Consider thebasic goal: Spoken Term Detection  Unranked: Return the results with the scores higher than a threshold Searching over Lattices Obama x1 R(x1)=0.9 Obama x2 R(x2)=0.3 Set the threshold as 0.6 Return x1
  • 26.
     Consider thebasic goal: Spoken Term Detection  Unranked: Return the results with the scores higher than a threshold Searching over Lattices Obama x1 R(x1)=0.9 Obama x2 R(x2)=0.3 The threshold can be determined automatically and query specific. [Miller, Interspeech 07][Can, HLT 09][Mamou, ICASSP 13][Karakos, ASRU 13][Zhang, Interspeech 12][Pham, ICASSP 14]
  • 27.
    Actual Term WeightedValue (ATWV)  Evaluating unranked result 𝐴𝑇𝑊𝑉 = 1 − 𝑃𝑚𝑖𝑠𝑠 − 𝛽𝑃𝐹𝐴 𝑃𝑚𝑖𝑠𝑠 = 1 − 𝑁𝑐𝑜𝑟𝑟𝑒𝑐𝑡 𝑁𝑟𝑒𝑓 𝑃𝐹𝐴 = 𝑁𝑠𝑝𝑢𝑟𝑖𝑜𝑢𝑠 𝑁𝑁𝑇 time 1:01 1.0 time 2:05 0.9 time 1:31 0.7 …… retrieved 𝑁𝑟𝑒𝑓: number of times the query term appears in audio database threshold Maximum Term Weighted Value (MTWV): tune the threshold to obtain the best ATWV 𝑁𝑐𝑜𝑟𝑟𝑒𝑐𝑡: the number of retrieved objects that are actually correct 𝑁𝑠𝑝𝑢𝑟𝑖𝑜𝑢𝑠: the number of retrieved objects that are incorrect 𝑁𝑁𝑇: audio duration (in seconds) – 𝑁𝑟𝑒𝑓
  • 28.
     Consider thebasic goal: Spoken Term Detection  Ranked: results ranked according to the scores Searching over Lattices Obama x1 R(x1)=0.9 Obama x2 R(x2)=0.3 x1 0.9 x2 0.3 …
  • 29.
     Consider thebasic goal: Spoken Term Detection  Ranked: The results are ranked according to the scores Searching on Lattices Obama x1 R(x1)=0.9 Obama x2 R(x2)=0.3 x1 0.9 x2 0.3 … user
  • 30.
    Mean Average Precision(MAP)  Evaluating ranked list  area under recall-precision curve  Recall: percentage of ground truth results retrieved  Precision: percentage of retrieved results being correct  Higher threshold gives higher precision but lower recall, etc. 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Precision Recall MAP = 0.484 MAP = 0.586
  • 31.
    Examples of LatticeIndexing Approaches • Position Specific Posterior Lattices (PSPL)[Chelba, ACL 05][Chelba, Computer Speech and Language 07] • Confusion Networks (CN)[Mamou, SIGIR 06][Hori, ICASSP 07][Mamou, SIGIR 07] • Time-based Merging for Indexing (TMI)[Zhou, HLT 06][Seide, ASRU 07] • Time-anchored Lattice Expansion (TALE)[Seide, ASRU 07][Seide, ICASSP 08] • WFST: directly compile the lattice into a weighted finite state transducer [Allauzen, HLT 04][Parlak, ICASSP 08][Can, ICASSP 09][Parada, ASRU 09]
  • 32.
    Out-of-Vocabulary (OOV) Problem Speech recognition is based on a lexicon  Words not in the lexicon can never be transcribed  Many informative words are out-of-vocabulary (OOV)  Many query terms are new or special words or named entities
  • 33.
     All OOVwords composed of subword units  Generate subword lattices  Transform word lattices into subword lattices  Can also be directly generated by speech recognition using subword-based lexicon and language model Subword-based Retrieval Retrieval An arc in the word lattice Corresponding subword sequence /rɪ/ /trɪ/ /vǝl/ word lattices subword lattices
  • 34.
     Subword-based retrieval Generate subword lattices  Transform user query into subword sequence Obama → /au/ /ba/ /mǝ/  Text retrieval techniques equally useful except based on subword lattices and subword query Replace words by subword units  OOV words can be retrieved by matching over the subword units without being recognized Subword-based Retrieval
  • 35.
    Subword-based Retrieval - FrequentlyUsed Subword Units • Linguistically motivated units – phonemes, syllables/characters, morphemes, etc. [Ng, MIT 00][Wallace, Interspeech 07][Chen & Lee, IEEE T. SAP 02] [Pan & Lee, ASRU 07][Meng, ASRU 07][Meng, Interspeech 08] [Mertens, ICASSP 09][Itoh, Interspeech 07][Itoh, Interspeech 11] [Pan & Lee, IEEE T. ASL 10] • Data-driven units – particles, word fragments, phone multigrams, morphs, etc. [Turunen, SIGIR 07] [Turunen, Interspeech 08] [Parlak, ICASSP 08][Logan, IEEE T. Multimedia 05] [Gouvea, Interspeech 10][Gouvea, Interspeech 11][Lee & Lee, ASRU 09]
  • 36.
    Integrating Different Clues fromRecognition  Similar to system combination in ASR  Consistency very often implies accuracy  Integrating the outputs from different recognition systems [Natori, Interspeech 10]  Integrating results based on different subword units [S.-w. Lee, ICASSP 05][Pan & Lee, Interspeech 07][Meng, Interspeech 10][Itoh, Interspeech 11]  Weights of different clues estimated by optimizing some retrieval related criteria [Meng & Lee, ICASSP 09][Chen & Lee, ICASSP 10][Meng, Interspeech 10][Wollmer, ICASSP 09]
  • 37.
    Integrating Different Clues fromRecognition  Weights for Integrating 1,2,3-grams for different word/subword units and different indices syllable Confusion Network Position Specific Posterior Lattice word character syllable word character 1-gram 2-gram 3-gram 1-gram 2-gram 3-gram integrated with different weights maximizing the lower bound of MAP by SVM-MAP [Meng & Lee, ICASSP 09]
  • 38.
    Training Retrieval Model Parameters Integrating different n-grams, word/subword units and indices single clue integrated [Meng & Lee, ICASSP 09] [Chen & Lee, ICASSP 10]
  • 39.
    ASR Accuracy v.s.Retrieval Performance  Spoken Term Detection, Lectures Speaker Dependent: 10 hours of speech from the instructor
  • 40.
    ASR Accuracy v.s.Retrieval Performance  Spoken Term Detection, Lectures Improved Speaker Adaptation
  • 41.
    ASR Accuracy v.s.Retrieval Performance  Spoken Term Detection, Lectures Initial Speaker Adaptation
  • 42.
    ASR Accuracy v.s.Retrieval Performance  Spoken Term Detection, Lectures Speaker Independent Realistic! ASR Accuracy MAP
  • 43.
    ASR Accuracy v.s.Retrieval Performance  Precision at 10: Percentage of the correct items among the top 10 selected Speaker Independent: Only 60% of results are correct Retrieve YouTube?!
  • 44.
     Did latticessolve the problem?  Need high quality recognition models to produce better lattices and accurately estimate the confidence scores  Spoken content over the Internet is produced in different languages on different domains in different parts of the world under varying acoustic conditions  High quality recognition models for such content doesn’t exist yet  Retrieval performance limited by ASR accuracy Is the problem solved?
  • 45.
     Desired spokencontent retrieval Less constrained by ASR accuracy Existing approaches limited by ASR accuracy because of the cascading of speech recognition and text retrieval  Go beyond the cascading concept Is the problem solved?
  • 46.
    Our point inthis tutorial Spoken Content Retrieval Speech Recognition + Text Retrieval =
  • 47.
  • 48.
    New Directions 1. ModifiedASR for Retrieval Purposes 2. Incorporating Those Information Lost in ASR 3. No Speech Recognition! 4. Special Semantic Retrieval Techniques for Spoken Content 5. Spoken Content is Difficult to Browse!
  • 49.
    Overview Paper  Lin-shanLee, James Glass, Hung-yi Lee, Chun-an Chan, "Spoken Content Retrieval — Beyond Cascading Speech Recognition with Text Retrieval," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, no.9, pp.1389-1420, Sept. 2015  http://speech.ee.ntu.edu.tw/~tlkagk/paper/Over view.pdf  This tutorial includes updated information after this paper is published.
  • 50.
    New Direction 1: ModifiedASR for Retrieval Purposes
  • 51.
    Retrieval Performance v.s. RecognitionAccuracy  Intuition: Higher recognition accuracy, better retrieval performance Not always true! In Taiwan, the need of … Recognition System A Recognition System B In Taiwan, a need of … In Thailand, the need of … Same recognition accuracy
  • 52.
    Retrieval Performance v.s. RecognitionAccuracy  Intuition: Higher recognition accuracy, better retrieval performance Not always true! In Taiwan, the need of … Recognition System A Recognition System B In Taiwan, a need of … In Thailand, the need of … Not important for retrieval Serious problem for retrieval
  • 53.
    Retrieval Performance v.s. RecognitionAccuracy  Retrieval performance is more correlated to the ASR errors of name entities than normal terms [Garofolo, TREC-7 99][L. van der Werff, SSCS 07]  Expected error rate defined on lattices is a better predictor of retrieval performance than one-best transcriptions [Olsson, SSCS 07]  lattices used in retrieval  For retrieval, substitution errors have more influence than insertions and deletions [Johnson, ICASSP 99]  The language models reducing ASR errors do not always yield better retrieval performance [Cui, ICASSP, 13][Shao, Interspeech, 08][ Wallace, SSCS 09]  Query terms usually topic-specific with lower n-gram probabilities
  • 54.
    ASR models learnedby Optimizing Retrieval Performance Speech Recognition Models Text Retrieval Result Text Retrieval Query user Spoken Content
  • 55.
    ASR models learnedby Optimizing Retrieval Performance Speech Recognition Models Text Retrieval Result Text Retrieval Query user Spoken Content Optimized for recognition accuracy
  • 56.
    ASR models learnedby Optimizing Retrieval Performance Speech Recognition Models Text Retrieval Result Text Retrieval Query user Spoken Content Optimized for recognition accuracy Spoken Content Retrieval Retrieval Performance
  • 57.
    New Direction 1-1: ModifiedASR for Retrieval Purposes Acoustic Modeling
  • 58.
    Acoustic Modeling  AcousticModel Training 𝜃 = 𝑎𝑟𝑔 max 𝜃 𝐹 𝜃 𝜃: acoustic model parameters 𝐹 𝜃 : objective function The objective function 𝐹 𝜃 usually defined to optimize ASR accuracy Design a new objective function for optimizing retrieval performance.
  • 59.
    Acoustic Modeling  ObjectiveFunction for optimizing ASR performance 𝜃 = 𝑎𝑟𝑔 max 𝜃 𝐹 𝜃 lattice of utterance u wA wB wC wA wA wA wC 𝐹 𝜃 = 𝑢 𝑠𝑢∈𝐿 𝑢 𝐴 𝑟𝑢, 𝑠𝑢 𝑃𝜃 𝑠𝑢|𝑢 Summation over all the utterances u in the training data L(u): all the word sequence in the lattice of x
  • 60.
    Acoustic Modeling  ObjectiveFunction for optimizing ASR performance 𝜃 = 𝑎𝑟𝑔 max 𝜃 𝐹 𝜃 𝐹 𝜃 = 𝑢 𝑠𝑢∈𝐿 𝑢 𝐴 𝑟𝑢, 𝑠𝑢 𝑃𝜃 𝑠𝑢|𝑢 wA wB wC wA wA wA wC 𝐴 𝑟𝑢, 𝑠𝑢 : the accuracy of word or phoneme sequence 𝑠𝑢 comparing with reference 𝑟𝑢 𝑠𝑢: a word sequence in the lattice of x 𝑃𝜃 𝑠𝑢|𝑢 : posterior probability of word sequence 𝑠𝑢 given acoustic model 𝜃 MCE, MPE, sMBR 𝜃 can be HMM or DNN lattice of utterance u
  • 61.
    Acoustic Modeling  ObjectiveFunction for optimizing ASR performance 𝜃 = 𝑎𝑟𝑔 max 𝜃 𝐹 𝜃 𝐹 𝜃 = 𝑢 𝑠𝑢∈𝐿 𝑢 𝐴 𝑟𝑢, 𝑠𝑢 𝑃𝜃 𝑠𝑢|𝑢 retrieval W-MCE, [Fu, ASRU 07][Weng, Interspeech 12][Weng, ICASSP, 13] keyword-boosted sMBR [Chen, Interspeech 14] If the possible query terms are known in advance, they can be weighted higher in 𝐴 𝑟𝑢, 𝑠𝑢
  • 62.
     In mostcases, the query terms are not known in advance  Collect feedback data on-line  Use the information to optimize search engines Feedback can be implicit Training Data collected from User time 1:10 F time 2:01 F time 3:04 F time 5:31 T time 1:10 F time 2:01 T time 3:04 F time 5:31 T time 1:10 F time 2:01 T time 3:04 time 5:31 Query Q1 Query Q2 Query Qn ……
  • 63.
    ASR models learnedby Optimizing Retrieval Performance Speech Recognition Models Text Retrieval Query user Spoken Content Lattices Retrieval Result re-estimate optimize [Lee & Lee, ICASSP 10] [Lee & Lee, Interspeech 10] [Lee & Lee, SLT 10] [Lee & Lee, IEEE T. ASL 12] time 1:10 F time 2:01 F time 3:04 F time 5:31 T time 1:10 F time 2:01 T time 3:04 F time 5:31 T time 1:10 F time 2:01 T time 3:04 time 5:31 Query Q1 Query Q2 Query Qn ……
  • 64.
    Updated Retrieval Process Each retrieval result x has a confidence score R(x)  R(x) depends on the recognition model θ  R(x) should be R(x;θ) Re-estimate recognition model θ Update the scores R(x; θ) The retrieval results can be re-ranked. Considering some retrieval criterion
  • 65.
                 x x x R x R F    ; ; Basic Form  Basic Form: : confidence score of the positive example    ;  x R : confidence score of the negative example    ;  x R : a positive example  x : a negative example  x      F max arg ˆ 
  • 66.
                 x x x R x R F    ; ; Basic Form Increase the confidence scores of the positive examples  Basic Form:      F max arg ˆ  Decrease the confidence scores of the negative examples
  • 67.
    Consider Ranking positive example negativeexample Confidence score : Original Model  : New Model  ˆ    ; R x    ˆ ; R x
  • 68.
    Consider Ranking positive example negativeexample Confidence score : Original Model  : New Model  ˆ    ; R x    ˆ ; R x
  • 69.
                 x x x R x R F    ; ; Consider Ranking positive example negative example Confidence score : Original Model  : New Model  ˆ    ; R x    ˆ ; R x Increase the basic objective function
  • 70.
    Consider Ranking positive example negativeexample Confidence score : Original Model  : New Model  ˆ    ; x R    ˆ ; R x Rank perfectly Worse ranking Considering the ranking order
  • 71.
                  otherwise x x x x 0 ; R ; R 1 ,    Consider Ranking If the confidence score for a positive example exceed that for a negative example the objective function adds 1.           x x x x F , ,  
  • 72.
    Consider Ranking  δ(x+,x-)approximated by a sigmoid function during optimization.           x x x x F , ,                  otherwise x x x x 0 ; R ; R 1 ,    Little feedback data? The unlabeled examples as negative examples
  • 73.
    0.46 0.47 0.48 0.49 0.50 0.51 0.52 MAP Baseline Basic Form Rankingunlabelled as negative Acoustic Models - Experiments  Lecture recording (80 queries, each has 5 clicks) [Lee & Lee, IEEE T. ASL 1
  • 74.
    New Direction 1-2: ModifiedASR for Retrieval Purposes Besides Acoustic Modeling
  • 75.
    Language Modeling  Thequery terms are usually very specific. Their probabilities are underestimated.  Boosting the probabilities of n-grams including query terms  By repeating the sentences including the query terms in training corpora  Helpful in DARPA’s RATS program [Mandal, Interspeech 13] and NIST OpenKWS13 evaluation [Chen, ISCSLP 14]  NN-based LM: Modifying training criterion, so the key terms are weighted more during training  Helpful in NIST OpenKWS13 evaluation [Gandhe, ICASSP 14]
  • 76.
    Decoding  Give differentwords different pruning thresholds during decoding  The keywords given lower pruning thresholds than normal terms  Called white listing [Zhang, Interspeech 12] or keyword- aware pruning [Mandal, Interspeech 13]  OOV words never correctly recognized  Two stage approach [Shao, Interspeech 08]  Identify the lattices probably containing OOV (by subword-based approach)  Insert the word arcs of OOV words into lattices and rescore
  • 77.
    Confusion Models Speech Recognition Models Text Retrieval Result Text Retrieval Queryuser Spoken Content Confusion Model A B C A’ B’ C’ The ASR produces systematic errors, so it is possible to learn a confusion model to offer better retrieval results [Karanasou, Interspeech 12][Wallace, ICASSP 10]
  • 78.
    Jointly Optimizing Speech Recognitionand Retrieval Modules Complex Model query A spoken segment Yes, the segment contains the query. No, ……. End-to-end model performing speech recognition and retrieval jointly (learned jointly) in one step Sounds crazy?
  • 79.
    Much information lostduring ASR Much information lost during ASR Transcriptions: using syntax vectors surge …… Lattice: ASR Spoken Content
  • 80.
    New Direction 2: Incorporating ThoseInformation Lost in ASR
  • 81.
    Information beyond Speech RecognitionOutput Speech Recognition Models Text Retrieval Result Text Retrieval Query user Spoken Content Black Box Incorporating information lost in standard speech recognition to help retrieval
  • 82.
    New Direction 2-1: Incorporating ThoseInformation Lost in ASR What kind of information can be helpful?
  • 83.
    Information beyond Speech RecognitionOutput  Phoneme or syllable duration [Wollmer, ICASSP 09][Naoyuki Kanda, SLT 12][Teppei Ohno, SLT 12]  Pitch & Energy [Tejedor, Interspeech 10]  Landmark and attribute detection with prosodic cues includes can reduce the false alarm [Ma, Interspeech 2007] [Naoyuki Kanda, SLT 12] Query is Japanese word “fu-ji-sa-N” very short! False alarm!
  • 84.
    Query-specific Information  "Jackof all trades, master of none“ Speech Recognition Spoken Term Detection Correctly recognized all the words higher detector accuracy on specific query
  • 85.
    Retrieval System Query-specific Detector Query Q Lattices First-passRetrieval Result x1 x2 x3 Examples of Q Compute Similarity Exemplar-based approach also used in speech recognition [Demuynck, ICASSP 2011][Heigold, ICASSP 2012][Nancy Chen, ICASSP 2016]
  • 86.
  • 87.
    Retrieval System Query-specific Detector Query Q Lattices First-passRetrieval Result x1 x2 x3 Examples of Q Learn a model Model for Q Evaluate confidence
  • 88.
    Retrieval System Query-specific Detector Query Q Lattices First-passRetrieval Result x1 x2 x3 positive examples Learn a discriminative model negative examples SVM [Tu & Lee, ASRU 11] [I.-F. Chen, Interspeech 13]
  • 89.
    Query-specific Detector  Theinput of SVM or MLP has to be a fixed-length vector  Representing an audio segment with different length into a fixed-length vector … … … … … … … … … … … … … … [Tu & Lee, ASRU 11] [I.-F. Chen, Interspeech 13]
  • 90.
    Retrieval System Query-specific Detector Query Q Lattices First-passRetrieval Result x1 x2 x3 positive examples negative examples  Is it realistic to have those examples? Pseudo-relevance Feedback (PRF) User Feedback
  • 91.
    New Direction 2-2: Incorporating ThoseInformation Lost in ASR Pseudo Relevance Feedback
  • 92.
    Retrieval System Pseudo Relevance Feedback(PRF) Query Q Lattices First-pass Retrieval Result x1 x2 x3 [Chen & Lee, Interspeech 11] [Lee & Lee, CSL 14]
  • 93.
    Retrieval System Pseudo Relevance Feedback(PRF) Query Q Confidence scores from lattices Lattices R(x1) First-pass Retrieval Result x1 x2 x3 R(x2) R(x3) Not shown to the user [Chen & Lee, Interspeech 11] [Lee & Lee, CSL 14]
  • 94.
    Retrieval System Pseudo Relevance Feedback(PRF) Query Q Lattices R(x1) First-pass Retrieval Result x1 x2 x3 R(x2) R(x3) Assume the results with high confidence scores as correct Examples of Q Considered as examples of Q [Chen & Lee, Interspeech 11] [Lee & Lee, CSL 14]
  • 95.
    Retrieval System Pseudo Relevance Feedback(PRF) Query Q Lattices R(x1) First-pass Retrieval Result x1 x2 x3 R(x2) R(x3) similar dissimilar Examples of Q [Chen & Lee, Interspeech 11] [Lee & Lee, CSL 14]
  • 96.
    Retrieval System Pseudo Relevance Feedback(PRF) Query Q Lattices R(x1) First-pass Retrieval Result x1 x2 x3 R(x2) R(x3) time 1:01 time 2:05 time 1:45 … time 2:16 time 7:22 time 9:01 Rank according to new scores Examples of Q
  • 97.
    (A) (B)  Lecturerecording [Lee & Lee, CSL 14] Pseudo Relevance Feedback (PRF) - Experiments Evaluation Measure: MAP (Mean Average Precision)
  • 98.
    (A) (B) Pseudo RelevanceFeedback (PRF) - Experiments (B): speaker independent (50% recognition accuracy) (A): speaker dependent (84% recognition accuracy) (A) and (B) use different speech recognition systems
  • 99.
    (A) (B)  PRF(red bars) improved the first-pass retrieval results with lattices (blue bars) Pseudo Relevance Feedback (PRF) - Experiments
  • 100.
    New Direction 2-3: Incorporating ThoseInformation Lost in ASR Graph-based Approach
  • 101.
    Graph-based Approach  PRF Each result considers the similarity to the audio examples  Make some assumption to find the examples  Graph-based approach [Chen & Lee, ICASSP 11][Lee & Lee, APSIPA 11][Lee & Lee, CSL 14]  Not assume some results are correct  Consider the similarity between all results
  • 102.
    Graph Construction  Thefirst-pass results is considered as a graph.  Each retrieval result is a node First-pass Retrieval Result from lattices x1 x2 x3 x2 x3 x1 x4 x5
  • 103.
    Graph Construction  Thefirst-pass results is considered as a graph.  Nodes are connected if their retrieval results are similar.  DTW similarities are considered as edge weights x2 x3 x1 x4 x5 Dynamic Time Warping (DTW) Similarity similar
  • 104.
    Changing Confidence Scoresby Graph  The score of each node depends on its neighbors. x2 x3 x1 x4 x5 G(x1) G(x2) G(x3) G(x5) G(x4) high high The results are ranked according to new scores G(xi). “You are known by the company you keep”
  • 105.
    Changing Confidence Scoresby Graph  The score of each node depends on its neighbors. x2 x3 x1 x4 x5 G(x1) G(x2) G(x3) G(x5) G(x4) low low The results are ranked according to new scores G(xi). “You are known by the company you keep”
  • 106.
    Graph-based Re-ranking -Formulation xi xj G(xi)                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 107.
                    i j x x i j j i i x x x G x x G N , Ŵ R 1   Graph-based Re-ranking - Formulation xi xj G(xi) original score considering graph structure (from lattices)
  • 108.
                    i j x x i j j i i x x x G x x G N , Ŵ R 1   Graph-based Re-ranking - Formulation xi xj G(xi) xj: neighbors of xi (nodes connected to xi) N(xi): neighbors of xi (nodes connected to xi)
  • 109.
                    i j x x i j j i i x x x G x x G N , Ŵ R 1   Graph-based Re-ranking - Formulation xi xj G(xi )
  • 110.
    Graph-based Re-ranking -Formulation xi xj W(xi,xj)                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 111.
    Graph-based Re-ranking -Formulation xi xj Normalized by the weights of all the edges connected to xj                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 112.
    Graph-based Re-ranking -Formulation xi xj W(xi,xj) The score of xi would be more close to the nodes xj with larger edge weights.                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 113.
                    i j x x i j j i i x x x G x x G N , Ŵ R 1   Graph-based Re-ranking - Formulation xi xj W(xi,xj) interpolation
  • 114.
     Assign scoreG(x) for each hit region based on the graph structure x1 x3 x2 x4 x5 G(x1) G(x2) G(x3) G(x4) G(x5)  G(x1) depends on G(x2) and G(x3) Graph-based Re-ranking - Formulation                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 115.
     Assign scoreG(x) for each hit region based on the graph structure x1 x3 x2 x4 x5 G(x1) G(x2) G(x3) G(x4) G(x5)  G(x1) depends on G(x2) and G(x3)  G(x2) depends on G(x1) and G(x3) …… Graph-based Re-ranking - Formulation                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 116.
     Assign scoreG(x) for each hit region based on the graph structure x1 x3 x2 x4 x5 G(x1) G(x2) G(x3) G(x4) G(x5)  G(x1) depends on G(x2) and G(x3)  G(x2) depends on G(x1) and G(x3) ……  …… Graph-based Re-ranking - Formulation                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 117.
     Assign scoreG(x) for each hit region based on the graph structure x1 x3 x2 x4 x5 G(x1) G(x2) G(x3) G(x4) G(x5)  How to find G(x1), G(x2), G(x3) …… satisfying the following equation?  This is random walk. Graph-based Re-ranking - Formulation                  i j x x i j j i i x x x G x x G N , Ŵ R 1   G(xi) is uniquely and efficiently obtainable
  • 118.
     Lecture recording[Lee & Lee, CSL 14] (A) (B) Graph-based Approach - Experiments (B): speaker independent (low recognition accuracy) (A): speaker dependent (high recognition accuracy)
  • 119.
    (A) (B)  Graph-basedre-ranking (green bars) outperformed PRF (red bars) Graph-based Approach - Experiments
  • 120.
    0.25 0.27 0.29 0.31 0.33 0.35 Assamese Bengali Lao ATWV FirstPass (on lattices) Graph Graph-based Approach – Experiments on OpenKWS [Lee & Glass, Interspeech 14]
  • 121.
    Graph-based Approach – MoreExperiments  13% relative improvement on OOV queries on another lecture recording (several speakers) [Jansen, ICASSP 13][Norouzian, ICASSP 13]  14% relative improvement on AMI Meeting Corpus [Norouzian, Interspeech 13]  Graph Spectral Clustering  Optimizing evaluation measure and considering the graph structure at the same time [Audhkhasi, ICASSP 2014]  11% relative improvement with subword-based system on OpenKWS15 (Swahili) [Van Tung Pham, ICASSP, 2016]
  • 122.
    New Direction 3: NoSpeech Recognition!
  • 123.
    Why Spoken ContentRetrieval without Speech Recognition?  Bypassing ASR to avoid information loss and all problems with ASR (errors, OOV words, background noise, etc. )  Just to identify the query, no need to find out which words the query includes  Audio files on the Internet in hundreds of different languages  Too limited annotated data for training reliable speech recognition systems for most languages  Written form even doesn’t exist for some languages  Many audio files are code-switched across several different languages
  • 124.
    Spoken Content Retrieval withoutSpeech Recognition user “US President” spoken query Compute similarity between spoken queries and audio files on acoustic level, and find the query term Spoken Content “US President” “US President” Is it possible?
  • 125.
    Approach Categories  DTW-basedApproaches  Matching sequences with DTW  Audio Segment Representation  Representing audio segments by fixed length vector representations  Unsupervised ASR (or model-based approach)  Training word- or subword-like acoustic patterns (or tokens) from target audio archive  Transcribing both the audio archive and the query into word- or subword-like token sequences  Matching based on the tokens, just like text retrieval
  • 126.
    New Direction 3-1: NoSpeech Recognition! DTW-based Approaches
  • 127.
    DTW-based Approach  ConventionalDTW Audio Segment Audio Segment
  • 128.
    DTW-based Approach  DTWfor query-by-example  Whether a spoken query is in an utterance Spoken Query Utterance Segmental DTW [Zhang, ICASSP 10], Subsequence DTW [Anguera, ICME 13][Calvo, MediaEval 14]
  • 129.
    Acoustic Feature Vectors Gaussian posteriorgram [Zhang, ICASSP 10][Wang, MediaEval 14]  Phonetic posteriors [Hazen, ASRU 09]  MLP trained from another corpus (probably in a different language)  Bottle-neck feature generated from MLP [Kesiraju, MediaEval 14]  RBM posteriorgram [Zhang, ICASSP 12]  Performance comparison [Carlin, Interspeech 11]
  • 130.
    Speed-up Approaches forDTW  Segment-based matching [Chan & Lee, Interspeech 10][Chan & Lee, ICASSP 11] Spoken Query Utterance Group consecutive acoustically similar feature vectors into a segment
  • 131.
    Speed-up Approaches forDTW  Segment-based matching Hierarchical Agglomerative Clustering (HAC) Step 1: build a tree Step 2: pick a threshold Group consecutive acoustically similar feature vectors into a segment
  • 132.
    Speed-up Approaches forDTW  Segment-based matching [Chan & Lee, Interspeech 10][Chan & Lee, ICASSP 11] Spoken Query Utterance Compute similarities between segments only
  • 133.
    Speed-up Approaches forDTW  Segment-based matching [Chan & Lee, Interspeech 10][Chan & Lee, ICASSP 11]  Lower bound estimation [Zhang, ICASSP 11][Zhang, Interspeech 11]  Indexing the frames in the target audio file [Jansen, ASRU 11][Jansen, Interspeech 12]  Information Retrieval based DTW [Anguera, Interspeech 13]
  • 134.
    New Direction 3-2: NoSpeech Recognition! Audio Segment Representation
  • 135.
    Framework Audio archive dividedinto variable- length audio segments Audio Segment to Vector Audio Segment to Vector Similarity Search Result Spoken Query Off-line On-line [Chung & Lee, Interspeech 16][Chen, ICASSP 15] [Levin, ICASSP 15][Levin, ASRU 13]
  • 136.
    Audio Word toVector  The audio segments corresponding to words with similar pronunciations are close to each other. ever ever never never never dog dog dogs
  • 137.
    Audio Word toVector - Segmental Acoustic Indexing  Basic idea [Levin, ICASSP 15][Levin, ASRU 13] A set of template audio segments 0.5 …… 0.8 0.3 0.5 0.8 0.3 ⋮ DTW
  • 138.
    Audio Word toVector – Sequence Auto-encoder [Chung & Lee, Interspeech 16] RNN Encoder x1 x2 x3 x4 audio segment acoustic features Representation for the whole audio segment
  • 139.
    Audio Word toVector – Sequence Auto-encoder RNN Decoder x1 x2 x3 x4 y1 y2 y3 y4 x1 x2 x3 x4 RNN Encoder [Chung & Lee, Interspeech 16]
  • 140.
    Sequence Auto-encoder – ExperimentalResults never ever Cosine Similarity Edit Distance between Phoneme sequences Deep Learning Deep Learning
  • 141.
  • 142.
    Sequence Auto-encoder – ExperimentalResults  Projecting the embedding vectors to 2-D day days say s say
  • 143.
    Sequence Auto-encoder – ExperimentalResults  Audio story (LibriSpeech corpus) MAP training epochs for sequence auto-encoder SA: sequence auto-encoder DSA: de-noising sequence auto-encoder Input: clean speech + noise output: clean speech
  • 144.
    New Direction 3-3: NoSpeech Recognition! Unsupervised ASR
  • 145.
    Conventional ASR … HelloWorld … ASR unknown speech signal
  • 146.
    Unsupervised ASR ASR unknown speechsignal Used in Query by example Spoken Term Detection Unsupervised ASR: Learn the models for a set of acoustic patterns (tokens) directly from the corpus (target spoken archive) t0t1t2, t1t3, t2t3, t2t1t3t3t2 … Acoustic Tokens
  • 147.
    Unsupervised ASR -Acoustic Token utterance acoustic feature acoustic tokens: chunks of acoustically similar feature vectors with token ids t0 t1 t2 t1 [Zhang & Glass, ASRU 09] [Huijbregts, ICASSP 11] [Chan & Lee, Interspeech 11]
  • 148.
    Unsupervised ASR - OverallFramework Initialization feature sequence model training token decoding initial token sequence final token sequence : feature sequence for the whole corpus : token sequences for X : Model (e.g. HMM) parameters : training iteration simple segmentation and clustering
  • 149.
    Unsupervised ASR - Initialization GetToken ID Extract acoustic features for every utterance
  • 150.
    Unsupervised ASR - OverallFramework Initialization feature sequence model training token decoding initial token sequence final token sequence : feature sequence for the whole corpus : Model (e.g. HMM) parameters : token sequences for X : training iteration simple segmentation and clustering
  • 151.
    Unsupervised ASR - OverallFramework Initialization feature sequence model training token decoding initial token sequence final token sequence optimize HMM parameters using Baum–Welch algorithm on token sequence 𝜔𝑖−1 to get new models 𝜃𝑖 decode acoustic features into a new token sequence 𝜔𝑖 using Viterbi decoding 𝜔𝑖−1
  • 152.
    Unsupervised ASR - OverallFramework iterate until the token sequences (including token boudaries) converge Initialization feature sequence model training token decoding initial token sequence final token sequence
  • 153.
    Acoustic Token inQuery by Example Spoken Term Detection  Compute the similarity between the models of two tokens Model of token A Model of token B KL divergence of the Gaussian mixtures in the first state of two models
  • 154.
    Acoustic Token inQuery by Example Spoken Term Detection  Compute the similarity between the models of two tokens Model of token A Model of token B Sum of the KL divergence over the states of the two token models
  • 155.
    Token-based DTW subsequence matchingToken-based DTW Tokens in query Tokens in an utterance  Signal-level DTW is more sensitive to signal variation (e.g. same phoneme across different speakers), while token models are able to cover better the distribution of signal variation  Much lower on-line computation load a b c d a b g h b d b g d b a
  • 156.
    Multi-granularity Space for AcousitcTokens • Unknown hyperparameters for the token models • Number of HMM states per token (m): token length • Number of distinct tokens (n) • Multiple layers of intrinsic representations of speech
  • 157.
    Multi-granularity Space for AcousitcTokens • From short to long (Temporal Granularity) – phoneme – syllable – word – phrase • From coarse units to fine units (Phonetic Granularity) – general phoneme set – gender dependent phoneme set – speaker specific phoneme set Number of distinct HMMs (n) Number of states per HMM (m)
  • 158.
    Multi-granularity Space for AcousitcTokens Training multiple sets of HMMs for with different granularity [Chung & Lee, ICASSP 14 ] phoneme syllable word phrase general gender dependent speaker specific n m
  • 159.
    Multi-granularity Space for AcousitcTokens  Token-based DTW using tokens with different granularity (m,n) averaged gave much better performance  One example  Frame-level DTW: MAP = 10%  Using only the token set with the best performance: MAP = 11%  Using 20 sets of tokens (number of states per HMM m = 3,5,7,9,11, number of distinct HMMs n=50,100,200,300): MAP = 26%
  • 160.
    Hierarchical Paradigm  TypicalASR:  Acoustic Model: models for the phonemes  Lexicon: the pronunciation of every word as a phoneme sequence  Language Model: the transition between words Word 1 Phoneme 1 Phoneme 4 Word 2 Phoneme 2 Phoneme 1 Phoneme 3 Lexicon word 1 word 2 word 3 word 4 Language Model Phoneme 1 Phoneme 2 Phoneme 3 Acoustic Model
  • 161.
    word-like token 1 word-like token 1 word-like token1 Hierarchical Paradigm  Similarly, in unsupervised ASR:  Acoustic Model: the phoneme-like token HMMs  Lexicon: the pronunciation of every word-like token as a sequence of phoneme-like tokens  Language Model: the transition between word-like tokens word-like token 1 phoneme- like token 1 phoneme- like token 4 word-like token 2 phoneme- like token 2 phoneme- like token 1 phoneme- like token 3 Lexicon word-like token 1 Language Model phoneme-like token 1 Acoustic Model phoneme-like token 2 phoneme-like token 3
  • 162.
    word-like token 1 word-like token 1 word-like token1 Hierarchical Paradigm  Similarly, in unsupervised ASR:  Acoustic Model: the phoneme-like token HMMs  Lexicon: the pronunciation of every word-like token as a sequence of phoneme-like tokens  Language Model: the transition between word-like tokens word-like token 1 phoneme- like token 1 phoneme- like token 4 word-like token 2 phoneme- like token 2 phoneme- like token 1 phoneme- like token 3 Lexicon word-like token 1 Language Model phoneme-like token 1 Acoustic Model phoneme-like token 2 phoneme-like token 3 Bottom-up Construction Top Down Constraint
  • 163.
    Bottom Up Construction 3stages during training focus on different constraints: stage1: Acoustic Model stage2: Language Model stage3: Lexicon stage1 stage2 stage3 this part alone would be the HMM training we described earlier [Chung & Lee, ICASSP 13]
  • 164.
    word-like token 1 word-like token 1 word-like token1 Hierarchical Paradigm  Similarly, in unsupervised ASR:  Acoustic Model: the phoneme-like token HMMs  Lexicon: the pronunciation of every word-like token as a sequence of phoneme-like tokens  Language Model: the transition between word-like tokens word-like token 1 phoneme- like token 1 phoneme- like token 4 word-like token 2 phoneme- like token 2 phoneme- like token 1 phoneme- like token 3 Lexicon word-like token 1 Language Model phoneme-like token 1 Acoustic Model phoneme-like token 2 phoneme-like token 3 Bottom-up Construction Top Down Constraint
  • 165.
    Top-down Constraints [Jansen,ICASSP 13] This figure is from Aren Jansen’s ICASSP paper.  Signals of the same phoneme may be very different on phoneme level, but the global structures of signals of the same word are very often very similar on word level  Global structures help in building the hierarchical model
  • 166.
    Token Model Optimization Token Label Optimization 𝟁= (m,n) Multi-target DNN (MDNN) Multi-layered Acoustic Tokenizer (MAT) o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o Bottleneck Features concatenation Multi- layered Token labels as MDNN targets Bottleneck Features Initial Acoustic Features (iteration 1) Concatenated Features(iteration 2,3,...) Bottleneck Features (iterations 2,3,...) feature evaluation (sub)word evaluation Initial Acoustic Features (iteration 1) In the first iteration, we use MFCC as the initial features In the other iterations, we concatenate the bottleneck features with the MFCC Multi-layered Acoustic Tokenizing Deep Neural Networks (MAT-DNN) [Chung & Lee, ASRU 15]  Jointly learn high quality frame-level features (much better than MFCCs) and acoustic tokens in an unsupervised way  Unsupervised training of multi-target DNN using unsupervised token labels as training target
  • 167.
    Token Model Optimization Token Label Optimization 𝟁= (m,n) Multi-target DNN (MDNN) Multi-layered Acoustic Tokenizer (MAT) o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o Bottleneck Features concatenation Multi- layered Token labels as MDNN targets Bottleneck Features Initial Acoustic Features (iteration 1) Concatenated Features(iteration 2,3,...) Bottleneck Features (iterations 2,3,...) feature evaluation (sub)word evaluation Initial Acoustic Features (iteration 1) Multi-layered Acoustic Tokenizing Deep Neural Networks (MAT-DNN) In the first iteration, we use MFCC as the initial features In the other iterations, we concatenate the bottleneck features with the MFCC  Jointly learn high quality frame-level features (much better than MFCCs) and acoustic tokens in an unsupervised way  Unsupervised training of multi-target DNN using unsupervised token labels as training target [Chung & Lee, ASRU 15]
  • 168.
     Experimental Results Query by Example Spoken Term Detection on Tsonga Multi-layered Acoustic Tokenizing Deep Neural Networks (MAT-DNN) [Chung & Lee, ASRU 13] Approach MAP Frame-based DTW MFCC 9.0 New Feature 28.7 Token-based DTW New Tokens 26.2
  • 169.
    New Direction 4: SpecialSemantic Retrieval Techniques for Spoken Content
  • 170.
    Semantic Retrieval  Userexpects semantic retrieval of spoken content.  User asks “US President”, system also finds “Obama”  Widely studies on text retrieval  Take query expansion as example user “US President” Search both “US President” or “Obama” “Obama” and “US President” are related Retrieval system
  • 171.
    Semantic Retrieval of SpokenContent  User expects semantic retrieval of spoken content.  User asks “US President”, system also finds “Obama”  Widely studies on text retrieval  Take query expansion as example  The techniques developed for text can be directly applied on semantic retrieval of spoken content  Are there any special techniques for spoken content?
  • 172.
     Both queryQ and document d are represented as unigram language models θQ and θd Review: Language Modeling Retrieval Approach w1 w2 w3 w4 w5 …   Q w P  | Query model θQ   d w P  | Document model θd w1 w2 w3 w4 w5 … KL divergence between the two models can be evaluated.
  • 173.
    Review: Language ModelingRetrieval Approach  Given query Q, rank document d according to a relevance score function SLM(Q,d):  Inverse of KL divergence between query model θQ and document model θd  The documents with document models θd similar to query model θQ are more likely to be relevant.     d Q LM KL d Q S   | ,  
  • 174.
           ' , , | w d d w N d w N w P   Query model θQ for text:  Document model θd for text :         ' , , | w Q Q w N Q w N w P  Review: Basic Query/Document Models in Text Retrieval Normalize into probability N(w,Q): term frequency of word w in query Q Normalize into probability N(w,d): term frequency of word w in document d Those basic models can be enhanced by query/document expansion to handle the problem of semantic retrieval.
  • 175.
    Review: Query Expansion Document Model θd Retrieval Engine doc101 doc 205 doc 145 … … Text Query Q w1 w2 w3 w4 w5   Q w P  | Query model First-pass Retrieval Result [Tao, SIGIR 06]
  • 176.
    Review: Query Expansion Document Model θd Retrieval Engine doc101 doc 205 doc 145 … … Document model for doc 101 w1 w2 w3 w4 w5 ……   d w P  | w1 w2 w3 w4 w5 ……   d w P  | Text Query Q Top N documents Document model for doc 205 w1 w2 w3 w4 w5   Q w P  | Query model First-pass Retrieval Result [Tao, SIGIR 06]
  • 177.
    Review: Query Expansion Document Model θd doc101 doc 205 doc 145 … … w1 w2 w3 w4 w5 ……   d w P  | w1 w2 w3 w4 w5 ……   d w P  | Text Query Q w1 w2 w3 w4 w5   Q w P  | common patterns in document models New Query Model Query model w1 w2 w3 w4 w5 ……   Q w P ' | Retrieval Engine First-pass Retrieval Result Top N documents [Tao, SIGIR 06] (by EM algorithm)
  • 178.
    Review: Query Expansion Document Model θd doc101 doc 205 doc 145 … … w1 w2 w3 w4 w5 ……   d w P  | w1 w2 w3 w4 w5 ……   d w P  | Text Query Q w1 w2 w3 w4 w5   Q w P  | Query model Retrieval Engine Final Result w1 w2 w3 w4 w5 ……   Q w P ' | Retrieval Engine First-pass Retrieval Result New Query Model Top N documents [Tao, SIGIR 06]
  • 179.
    Review: Document Expansion  d w P  | Document model θd w1 w2 w3 w4 w5 … Topic Find topics behind document Modify document model This is realized by PLSA, LDA, etc. Topic Topic [Wei, SIGIR 06] “airplane”   d w P  | New Document model θd w1 w2 w3 w4 w5 … “airplane” “aircraft”
  • 180.
    Semantic Retrieval onLattices Original Retrieval Model of Text For Lattices Term Frequency Expected Term Frequency Document Length Expected Document Length …… ……  Take the basic language modeling retrieval approach as example  Modify retrieval model for lattices:
  • 181.
    Document Model fromLattices  Document model θd for text  (Spoken) Document model θd from lattice         ' , , | w d d w N d w N w P          ' , , | w d d w E d w E w P  Replace term frequency N(w,d) with expected term frequency E(w,d) computed from lattices
  • 182.
     Expected termfrequency E(w,d) for word w in spoken document d based on lattice Expected Term Frequency lattice of spoken document d wA wB wC wA wA wA wC            d L u d u P u w N d w E | , ,
  • 183.
    Expected Term Frequency Expected term frequency E(w,d) for word w in spoken document d based on lattice L(d): all the word sequences in the lattice of d N(w,u): the number of word w appearing in word sequence u u: a word sequence in the lattice of d P(u|d): posterior probability of word sequence u wA wB wC wA wA wA wC lattice of spoken document d            d L u d u P u w N d w E | , ,
  • 184.
    New Direction 4-1: SpecialSemantic Retrieval Techniques for Spoken Content Better Estimation of Term Frequencies
  • 185.
    Better Estimation ofTerm Frequencies  Expected term frequency E(w,d) from lattices can be inaccurate  Context of each term in the lattices [Tu & Lee, ICASSP 12]  The same terms usually have similar context [Schneider, Interspeech 10]  Graph-based approach  Graph-based approach using acoustic feature similarity improved spoken term detection  It can also improve semantic retrieval of spoken content based on language modeling retrieval approach  Idea: Replace expected term frequency E(w,d) with scores from graph-based approach [Lee & Lee, SLT 12] [Lee & Lee, IEEE/ACM T. ASL 14]
  • 186.
    Graph-based Approach for SemanticRetrieval  For each word w in the lexicon … … … … … … x1 x2 x3 x4 spoken document spoken document spoken document Find the occurrence regions of word w
  • 187.
Graph-based Approach for Semantic Retrieval  For each word w in the lexicon: connect the occurrence regions x1, x2, x3, x4, … by acoustic feature similarities
  • 188.
Graph-based Approach for Semantic Retrieval  For each word w in the lexicon: obtain new scores G(x1), G(x2), G(x3), G(x4), … for the occurrence regions by random walk over the graph
  • 189.
Graph-based Approach for Semantic Retrieval  Repeat this process for all the words w in the lexicon
  • 190.
Graph-based Approach for Semantic Retrieval  Better estimation of the term frequency of each word w in spoken document d  Lattice-based document model: P(w|θd) = E(w,d) / Σw' E(w',d)  Graph-enhanced document model: replace the expected term frequency E(w,d) with scores derived from the graph scores G(x) of the occurrence regions of w in d (a rough sketch follows)
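A rough sketch of the random-walk scoring over one word's occurrence regions is shown below. The restart weight, the column normalization, and the use of lattice posteriors as the prior are illustrative assumptions; the cited papers define the graph construction and normalization in detail.

```python
import numpy as np

def random_walk_scores(similarity, prior, alpha=0.85, iters=100):
    """PageRank-style score propagation over occurrence regions of one word.

    similarity: (n, n) non-negative acoustic similarity matrix between regions
    prior:      length-n initial scores (e.g. lattice posteriors of the regions)
    Returns the new scores G(x) for the n occurrence regions.
    """
    sim = np.asarray(similarity, dtype=float)
    np.fill_diagonal(sim, 0.0)
    # Column-normalize so each column is a transition distribution
    col_sums = sim.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0.0] = 1.0
    trans = sim / col_sums
    p = np.asarray(prior, dtype=float)
    p = p / p.sum()
    g = np.full(len(p), 1.0 / len(p))
    for _ in range(iters):
        g = (1 - alpha) * p + alpha * trans @ g   # random walk with restart
    return g

# Toy example: 3 occurrence regions; regions 0 and 1 sound very similar
sim = [[0.0, 0.9, 0.1],
       [0.9, 0.0, 0.1],
       [0.1, 0.1, 0.0]]
prior = [0.5, 0.2, 0.3]
print(random_walk_scores(sim, prior))
```

Regions that are acoustically similar to many confidently recognized regions end up with higher scores, which is the intuition behind replacing E(w,d) with graph-derived estimates.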
  • 191.
Graph-based Approach for Semantic Retrieval  Lattice-based document model: P(w|θd) = E(w,d) / Σw' E(w',d)  Graph-enhanced document model: P(w|θd) = E'(w,d) / Σw' E'(w',d), where E'(w,d) uses the graph scores  Query and document expansion borrowed from text retrieval can be equally applied
  • 192.
Graph-based Approach for Semantic Retrieval – Experiments  Experiments on TV News  (figure: MAP, roughly 0.43–0.51, comparing the lattice-based and graph-enhanced document models under Basic LM, Query Expansion, Document Expansion, and Query + Document Expansion) [Lee & Lee, IEEE/ACM T. ASL 14]
  • 193.
New Direction 4-2: Special Semantic Retrieval Techniques for Spoken Content — Exploiting Acoustic Tokens
  • 194.
    Acoustic Tokens  Wecan identify “acoustic tokens” in direction 3 Token 1 Token 1 Token 1 Token 2 Token 2 Token 3 Token 3 Token 3 Query expansion with acoustic tokens Can be useful in semantic retrieval of spoken content: Unsupervised semantic retrieval of spoken content
  • 195.
Query Expansion – Never Appear?  If “Obama” is not in the lexicon, it will never appear in the lattices, so we can never learn that “Obama” co-occurs with “US President” during query expansion  Typical approach: using subwords  Query expansion with acoustic tokens: complementary to the subword approach
  • 196.
Query Expansion with Acoustic Tokens  Original text query: “US President”  First pass: retrieve the spoken documents containing “US President” in the transcriptions (d100: …… US President …, d205: … US President ……) [Lee & Lee, ICASSP 13]
  • 197.
Query Expansion with Acoustic Tokens  Original text query: “US President”  Find acoustic tokens that frequently appear in the signals of these retrieved documents (d100, d205, …)
  • 198.
Query Expansion with Acoustic Tokens  Original text query: “US President”  Even if the terms related to the query are OOV (e.g. “Obama”), as long as they really co-occur with the query in the speech signals, we can find the acoustic tokens corresponding to these terms
  • 199.
Query Expansion with Acoustic Tokens  Expanded query: the original text query “US President” plus the discovered acoustic tokens
  • 200.
Query Expansion with Acoustic Tokens  The retrieval system searches the lattices with the expanded query and also looks for the same acoustic tokens in other audio files (e.g. ones mentioning “White House”)  By expanding the text query with acoustic tokens, more semantically related audio files can be retrieved (a rough sketch of the expansion step follows)
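A rough sketch of the expansion step, assuming each document already comes with a sequence of discovered acoustic token IDs. The scoring heuristic for picking tokens over-represented in the first-pass retrieved documents is a placeholder, not the formulation of [Lee & Lee, ICASSP 13].

```python
from collections import Counter

def frequent_acoustic_tokens(retrieved_docs_tokens, all_docs_tokens, top_k=5):
    """Pick acoustic tokens over-represented in the first-pass retrieved docs.

    Each element of the inputs is one document's sequence of acoustic
    token IDs (obtained from unsupervised acoustic token discovery).
    """
    in_retrieved = Counter(t for doc in retrieved_docs_tokens for t in set(doc))
    in_all = Counter(t for doc in all_docs_tokens for t in set(doc))
    ratio = {t: in_retrieved[t] / in_all[t] for t in in_retrieved}
    return sorted(ratio, key=ratio.get, reverse=True)[:top_k]

def token_match_score(doc_tokens, expansion_tokens):
    """Crude second-pass score: how often the expansion tokens occur in a doc."""
    return sum(doc_tokens.count(t) for t in expansion_tokens)

# Toy token sequences (integers stand for discovered acoustic tokens)
retrieved = [[3, 7, 7, 12], [7, 9, 12]]
collection = retrieved + [[1, 2, 3], [9, 10, 11], [7, 12, 12]]
expansion = frequent_acoustic_tokens(retrieved, collection)
print(expansion, token_match_score([7, 12, 12], expansion))
```

Documents that were not matched by the text query but contain the expansion tokens (e.g. tokens covering "Obama") can then be scored and retrieved.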
  • 201.
Query Expansion – Acoustic Patterns  Experiments on TV News [Lee & Lee, ICASSP 13]
  • 202.
    Unsupervised Semantic Retrieval Unsupervised Semantic Retrieval [Li & Lee, ASRU 13][Oard, FIRE 13]  No speech recognition as query-by-example spoken term detection  But find spoken documents semantically related to the spoken queries  New task, not too much previous work  Below is just a very preliminary study [Li & Lee, ASRU 13]
  • 203.
Unsupervised Semantic Retrieval  Spoken queries  1. Find the spoken documents in the database containing the spoken query (Spoken Document 1, 2, 3, …)  Done by query-by-example spoken term detection approaches (e.g. DTW), as sketched below
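A minimal DTW sketch for this first pass: it computes a length-normalized alignment cost between the spoken query's feature sequence and a candidate region, with lower cost meaning a better match. Real query-by-example systems use subsequence or segmental DTW to locate matches inside long documents; this is only the alignment core, and the feature representation is assumed given.

```python
import numpy as np

def dtw_distance(query_feats, region_feats):
    """Plain DTW alignment cost between two feature sequences (frames x dims)."""
    q, r = np.asarray(query_feats), np.asarray(region_feats)
    n, m = len(q), len(r)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(q[i - 1] - r[j - 1])    # frame-level distance
            cost[i, j] = d + min(cost[i - 1, j],        # skip a query frame
                                 cost[i, j - 1],        # skip a region frame
                                 cost[i - 1, j - 1])    # match the two frames
    return cost[n, m] / (n + m)   # length-normalized alignment cost

# Toy 1-D "feature" sequences standing in for MFCC frames
print(dtw_distance([[0.0], [1.0], [2.0]], [[0.1], [0.9], [1.1], [2.0]]))
```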
  • 204.
Unsupervised Semantic Retrieval  2. Find acoustic tokens frequently co-occurring with the spoken queries in the same documents
  • 205.
Unsupervised Semantic Retrieval  3. Use the acoustic tokens to expand the original spoken query (expanded queries)
  • 206.
Unsupervised Semantic Retrieval  4. Retrieve again with the expanded queries  This can retrieve spoken documents not containing the original spoken queries
  • 207.
Unsupervised Semantic Retrieval – Experiments  Broadcast news, MAP as the evaluation measure  Using only DTW to retrieve the spoken queries: spoken term detection 28.30% (the user only wants spoken documents containing the query term); semantic retrieval 8.76% (the user wants all spoken documents semantically related to the query term) [Li & Lee, ASRU 13]
  • 208.
Unsupervised Semantic Retrieval – Experiments  Broadcast news, MAP as the evaluation measure  Using only DTW to retrieve the spoken queries: spoken term detection 28.30%, semantic retrieval 8.76%  Exactly the same retrieval results, but what the user wants to find is different: many semantically related spoken documents cannot be retrieved [Li & Lee, ASRU 13]
  • 209.
Unsupervised Semantic Retrieval – Experiments  Broadcast news, MAP as the evaluation measure  Using only DTW to retrieve the spoken queries: spoken term detection 28.30%, semantic retrieval 8.76%  With the expanded spoken queries, MAP was improved from 8.76% to 9.70%  Unsupervised semantic retrieval still has a long way to go [Li & Lee, ASRU 13]
  • 210.
New Direction 5: Speech Content is Difficult to Browse!
  • 211.
    Audio is hardto browse  When the system returns the retrieval results, user doesn’t know what he/she get at the first glance Retrieval Result
  • 212.
    Audio is hardto browse Interactive spoken content retrieval Summarization, Key term extraction, Title Generation Organizing retrieved results Question answering
  • 213.
New Direction 5-1: Speech Content is Difficult to Browse! — Interactive Spoken Content Retrieval
  • 214.
    Interactive Spoken Content Retrieval Conventional Retrieval Process user US President Here are what you are looking for: Doc3 Doc1 Doc2 … Can be noisy
  • 215.
Interactive Spoken Content Retrieval  Interactive retrieval  User: “US President”  System (not clear enough ……): “More precisely, please.”
  • 216.
Interactive Spoken Content Retrieval  Interactive retrieval  User: “US President” → System: “More precisely, please.” → User: “Obama” → System (still not clear enough ……): “Is it related to ‘Election’?”
  • 217.
Interactive Spoken Content Retrieval  Interactive retrieval  User: “US President” → System: “More precisely, please.” → User: “Obama” → System: “Is it related to ‘Election’?” → User: “Yes.” → System: “I see! Here is what you are looking for.”
  • 218.
    Interactive Spoken Content Retrieval Given the information entered by the users at present, which action should be taken? “Give me an example.” “Is it relevant to XXX?” “Can you give me another query?” “Show the results.”
  • 219.
MDP for Interactive Retrieval  MDP  Widely used in dialogue systems (air ticket booking, city guides, …)  The system is in certain states  Which action should be taken depends on the state the system is in  MDP for interactive retrieval [Wen & Lee, Interspeech 12][Wen & Lee, ICASSP 13]  State: the degree of clarity of the user’s information need (a state space ranging from ambiguous to clear)
  • 220.
The query “US President” is sent to the search engine over the spoken archive; the retrieval results (Doc3, Doc1, Doc2, …) are fed to a state estimator that estimates the degree of clarity from the retrieval results, placing the system at a state S1 in the ambiguous–clear state space [Cronen-Townsend, SIGIR 02][Zhou, SIGIR 07]
  • 221.
    S1 A 1 A 2 A 3 A 4  A setof candidate actions  System: “More precise, please.”  System: “Is it relevant to XXX?”  ….. Ambiguous Clear state space  There is an action “show results”  When the system decides to show the results, the retrieval session is ended
  • 222.
    S1 A 1 A 2 A 3 A 4  Choose theactions by intrinsic policy π(S)  The policy is a function  Input: state S, output: action A π(S)=“More precise, please” π(S)=Show Results Ambiguous Clear state space
  • 223.
At state S1, π(S1) = A1 (“More precisely, please.”); the user responds “Obama”, which is sent to the search engine over the spoken archive again; the system incurs a cost C1 due to user labor
  • 224.
  • 225.
    Interact with Users- MDP  Good interaction:  The quality of final retrieval results shown to the users are as good as possible  The user labors (C1, C2) are as small as possible S1 S3 S2 A 1 C1 C2 A 2 En d Sho w
  • 226.
    Interact with Users- MDP  Learn polity π maximizing: Retrieval Quality - User labor  The polity π can be learned from historical interaction by fitted value iteration [Chandramohan, Interspeech 10] S1 S3 S2 A 1 C1 C2 A 2 En d Sho w
  • 227.
    Deep Reinforcement Learning Replacing MDP by deep reinforcement learning
  • 228.
The query retrieves from the spoken content; features of the retrieval results go to state estimation (the state reflects the degree of clarity of the retrieval results), and the state goes to action decision  The policy π(s) is a function: input a state s, output an action a  Decide the actions by the intrinsic policy π(s) [Wen & Lee, Interspeech 12][Wen & Lee, ICASSP 13]
  • 229.
With deep reinforcement learning, the features of the retrieval results are fed to a DNN that performs state estimation and action decision jointly, outputting a value for each action (“Is it relevant to XX?”, “Give me an example.”, “Show the results.”, …); the action with the maximum value is taken, as in the sketch below [Wu & Lee, Interspeech 16]
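A minimal sketch of such an action-decision network in PyTorch, assuming some fixed-size vector of retrieval-derived state features; the feature dimension, layer sizes, and action wording are illustrative, and the deep reinforcement learning training loop is omitted.

```python
import torch
import torch.nn as nn

ACTIONS = ["Is it relevant to XX?", "Give me an example.",
           "Can you give me another query?", "Show the results."]

class ActionValueNet(nn.Module):
    """Maps retrieval-result features to an estimated value per dialogue action."""
    def __init__(self, n_features=20, n_actions=len(ACTIONS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, state_features):
        return self.net(state_features)

# Usage: features summarizing the first-pass retrieval (clarity estimate,
# score distribution, ...); values here are random for illustration only.
policy = ActionValueNet()
state = torch.randn(1, 20)
action_id = policy(state).argmax(dim=1).item()
print(ACTIONS[action_id])
```

Training would adjust the network so that, over whole sessions, the chosen actions maximize the return "retrieval quality minus user labor" described on the earlier slides.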
  • 230.
  • 231.
    Experimental Results  Broadcastnews, semantic retrieval [Wu & Lee, Interspeech 16] Retrieval Quality (MAP) Optimization Target: Retrieval Quality - User labor Hand-crafted Deep Reinforcement Learning Previous Method (state + decision)
  • 232.
New Direction 5-2: Speech Content is Difficult to Browse! — Summarization, Key Term Extraction & Title Generation
  • 233.
Introduction  From a retrieved audio file (e.g. 10 minutes long): extractive summarization produces a 30-second summary, key term extraction produces key terms (“Deep Learning”, “Neural Network”, ….), and title generation produces a title (“Introduction of Deep Learning”)
  • 234.
    Summarization  Unsupervised Approach:Maximum Margin Relevance (MMR) and Graph-based Approach  Supervised approach: Summarization problem can be formulated as a binary classification program  Included in the summary or not utterance 1 utterance 2 utterance 3 utterance 4 Binary Classifier -1 +1 +1 -1 utterance 2 utterance 3 classification result summary Binary Classifier Binary Classifier Binary Classifier Lecture
  • 235.
Summarization – Binary Classification  A binary classifier considers each utterance individually  To generate a good summary, “global information” should be considered; for example, the summary should be concise: if the spoken document says “LSA is Latent Semantic Analysis”, “LSA is useful for summarization”, “LSA is helpful for summarization”, and then repeats again, a per-utterance classifier may select all the repetitions, whereas a succinct summary should keep only one  This calls for more advanced machine learning techniques
  • 236.
Summarization – Whole Spoken Document  Learn a special model by structured learning techniques  Input: the whole spoken document (lecture); output: the summary (e.g. 3 utterances selected)  The model considers the whole lecture at once [Lee & Lee, ICASSP 13][Lee & Lee, Interspeech 12]
  • 237.
    Evaluation Function  Evaluationfunction of utterance set F(s)  s: utterance set in a lecture F(s) 10 score utterance set s how suitable it is to consider utterance set s as the summary Properties: • Concise? • Include keyword? • Short enough? ………. How good it is to take this utterance set as summary? Lecture
  • 238.
Evaluation Function – How to Summarize  With F(s), we can summarize new lectures: enumerate all the possible utterance sets s1, s2, …, s7, compute F(s) for each, and take the set that maximizes F(s) (e.g. s6) as the summary (a brute-force sketch follows)
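A brute-force sketch of this inference step: enumerate small utterance sets and keep the one with the highest F(s). The stand-in F(s) below only rewards keyword coverage and penalizes total length; the real F(s) is learned, and structured prediction replaces exhaustive enumeration for realistic lecture lengths.

```python
from itertools import combinations

def summarize(utterances, score_fn, max_utts=3):
    """Return the utterance-index set s maximizing the evaluation function F(s)."""
    best_set, best_score = (), float("-inf")
    for k in range(1, max_utts + 1):
        for s in combinations(range(len(utterances)), k):
            score = score_fn([utterances[i] for i in s])
            if score > best_score:
                best_set, best_score = s, score
    return best_set

# Stand-in F(s): reward keyword coverage, penalize total length (conciseness).
KEYWORDS = {"lsa", "summarization"}
def f(selected):
    words = {w for u in selected for w in u.lower().split()}
    return 2 * len(KEYWORDS & words) - 0.1 * sum(len(u.split()) for u in selected)

lecture = ["Hello everyone", "LSA is latent semantic analysis",
           "LSA is helpful for summarization", "Repeat again"]
print(summarize(lecture, f))   # picks the single utterance covering both keywords
```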
  • 239.
    Evaluation Function  Evaluationfunction of utterance set F(s)  s: utterance set in a lecture F(s) 10 score utterance set s how suitable it is to consider utterance set s as the summary Properties: • Concise? • Include keyword? • Short enough? ………. How good it is to take this utterance set as summary? Lecture What properties should F(s) check?
  • 240.
Evaluation Function – Training  Learn F(s) from training data: lectures with reference summaries  Find F(s) such that the reference summaries receive high scores (e.g. 9, 7) and other utterance sets receive low scores (e.g. −4)  Structured SVM: I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, “Support Vector Machine Learning for Interdependent and Structured Output Spaces,” ICML, 2004.
  • 241.
Summarization – Structure of Spoken Document  Temporal structure helps summarization  Long summary: consecutive utterances in an important paragraph are more likely to be selected together  Short summary: one utterance is selected as the representative of a paragraph
  • 242.
Summarization – Structure of Spoken Document  Add the structure information (paragraph boundaries) into the evaluation function F(s): F(s) scores an utterance set given the structure, in addition to the properties above (concise? includes keywords? short enough? ……) [Shiang & Lee, Interspeech 13]
  • 243.
Summarization – Structure of Spoken Document  The structure of text is clear: paragraph boundaries are directly known  For spoken content, there is no obvious structure, so here the structure is treated as hidden variables: structured learning with hidden variables
  • 244.
Summarization – Structure of Spoken Document  Evaluation measures: ROUGE-1 and ROUGE-2  Larger scores mean the machine-generated summaries are more similar to the human-generated summaries [Shiang & Lee, Interspeech 13]
  • 245.
    Key Term Extraction TF-IDF is a food measure for identifying key terms [E. D’Avanzo, DUC 04][Jiang, SIGIR 09]  Feature parameters from latent topic models [Hazen, Interspeech 11] [Chen & Lee, SLT 10]  Key terms are usually focused on small number of topics  Prosodic Features [Chen & Lee, ICASSP 12]  slightly lower speed, higher energy, wider pitch range  Machine Learning methods  Input: a term, output: key term or not [Liu, SLT 08][Chen & Lee, SLT 10]
  • 246.
Key Term Extraction – Deep Learning  The document words x1 … xT are mapped by an embedding layer to vectors V1 … VT, attention weights α1 … αT are computed, and the weighted sum Σ αi Vi is passed through hidden and output layers to predict keywords from a keyword set (SVM, Regression, Python, DNN, Fourier Transform, Speech Processing, LSTM, Bubble Sort, etc.) [Shen & Lee, Interspeech 16]
  • 247.
    Title Generation  DeepLearning based Approach [Alexander M Rush, EMNLP 15][Chopra, NAACL 16][Lopyrev, arXiv 2015][Shen, arXiv 2016]  Sequence-to-sequence learning  Input: a document (word sequence), output: its title (shorter word sequence) https://arxiv.org/pdf/1512.01712v1.pdf
  • 248.
New Direction 5-3: Speech Content is Difficult to Browse! — Visualizing Retrieved Results
  • 249.
    Introduction  Visualizing theretrieval results on an intuitive interface helps users know what is retrieved  Take retrieving on-line lectures as example  Searching spoken lectures is a very good application for spoken content retrieval  The speech of the instructors conveys most knowledge in the lectures
  • 250.
    Retrieving One Course NTU Virtual Instructor Searching the course Digital Speech Processing of NTU
  • 251.
Massive Open On-line Courses (MOOCs)  An enormous number of on-line courses
  • 252.
Today’s Retrieval Techniques  A list of related courses
  • 253.
  • 254.
More is less……  Given all the related lectures from different courses, the learner asks: “Which lecture should I go to first?”  Learning Map  Nodes: lectures on the same topics  Edges: suggested learning order [Shen & Lee, Interspeech 15]
  • 255.
  • 256.
    Lectures in thesame topic same topic?  Compute the similarity of each pair of lectures  Lexical and topical similarity of the audio transcriptions  Lexical similarity and syntactic parsing tree similarity of the titles of the lectures  Weighted sum the similarity measures
  • 257.
Lectures on the Same Topic  (figure: two courses with lectures a1 a2 a3 and b1 b2 b3; some pairings of their lectures are more likely to be on the same topic, others less likely)
  • 258.
  • 259.
Prerequisite  Is a lecture in one course a prerequisite of a lecture in a different course? Learn a binary classifier  Training data: lectures in existing courses  Assumption: within an existing course ordered Lecture 1, Lecture 2, Lecture 3, …, Lecture 1 is a prerequisite of Lecture 2, Lecture 2 is a prerequisite of Lecture 3, ……
  • 260.
  • 261.
    Vision: Personalized Courses Iwant to learn “XXX”. I am a graduate student of computer science. I can spend 10 hours. Learner I open a course for you. on-line learning material  Spoken Language Processing techniques can be very helpful.  The spoken content in courses plays the most important role in conveying the knowledge.
  • 262.
New Direction 5-4: Speech Content is Difficult to Browse! — Speech Question Answering
  • 263.
    Speech Question Answering Machine answers questions based on the information in spoken content What is a possible origin of Venus’ clouds? volcanic activity
  • 264.
    Speech Question Answering Reference: 6 Spoken Question Answering (Sophie Rosset, Olivier Galibert and Lori Lamel). G. Tur and R. DeMori, Spoken Language Understanding: Systems for Extracting Semantic Information from Speech.  Question Answering in Speech Transcripts (QAST) has been a well-known evaluation program of SQA.  2007, 2008, 2009  Focused on factoid questions in the previous study  E.g. “What is name of the highest mountain in Taiwan?”.  How about more difficult questions?
  • 265.
    Preliminary Study  TOEFLListening Comprehension Test by Machine Question: “ What is a possible origin of Venus’ clouds? ” Audio Story: Choices: (A) gases released as a result of volcanic activity (B) chemical reactions caused by high surface temperatures (C) bursts of radio energy from the plane's surface (D) strong winds that blow dust into the atmosphere (The original story is 5 min long.) [Tseng & Lee, Interspeech 16]
  • 266.
Simple Baselines  Experimental setup: 717 questions for training, 124 for validation, 122 for testing  (figure: accuracy (%) of naive approaches (1)–(7), including random guessing, selecting the choice semantically most similar to the others, and selecting the shortest choice as the answer) [Tseng & Lee, Interspeech 16]
  • 267.
Model Architecture  Question: “What is a possible origin of Venus’ clouds?” → semantic analysis → question semantics  Audio story → speech recognition → semantic analysis (transcription excerpt: “…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano erupt on earth ……”) → answer representation  Select the choice most similar to the answer (a rough sketch follows)
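A rough sketch of the final choice-selection step, assuming sentence and choice embeddings are already produced by some semantic analysis; the question-weighted average below is only a crude stand-in for the attention mechanism in the cited model.

```python
import numpy as np

def answer_by_similarity(question_vec, choice_vecs, story_vecs):
    """Pick the choice most similar to an "answer" representation of the story."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    # attention-like weights: how relevant each story sentence is to the question
    weights = np.array([cos(question_vec, s) for s in story_vecs])
    weights = np.exp(weights) / np.exp(weights).sum()
    answer_vec = (weights[:, None] * np.asarray(story_vecs)).sum(axis=0)
    scores = [cos(answer_vec, c) for c in choice_vecs]
    return int(np.argmax(scores))   # index of the selected choice

# Toy 3-d embeddings for 2 story sentences and 4 answer choices
story = [np.array([1.0, 0.2, 0.0]), np.array([0.1, 0.9, 0.3])]
choices = [np.array([1.0, 0.3, 0.1]), np.array([0.0, 1.0, 0.0]),
           np.array([0.2, 0.2, 0.9]), np.array([0.5, 0.5, 0.5])]
question = np.array([0.9, 0.1, 0.0])
print(answer_by_similarity(question, choices, story))
```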
  • 268.
Experimental Results  Memory Network (proposed by the Facebook AI group): 39.2%  Word-based attention: 48.3%  (figure: accuracy (%) of the naive approaches (1)–(7) compared with these models) [Tseng & Lee, Interspeech 16]
  • 269.
  • 270.
    Conclusion Remarks  Newresearch directions for spoken content retrieval  Modified ASR for Retrieval Purposes  Incorporating Those Information Lost in ASR  No Speech Recognition!  Special Semantic Retrieval Techniques for Spoken Content  Spoken Content is Difficult to Browse!
  • 271.
Take-Home Message  Spoken Content Retrieval ≠ Speech Recognition + Text Retrieval
  • 272.
    Spoken Content Retrieval Spoken content retrieval: Machine listens to the data, and extract the desired information for each individual user.  Nobody is able to go through the data. 300 hrs multimedia is uploaded per minute. (2015.01) 1874 courses on coursera (2016.04)  In these multimedia, the spoken part carries very important information about the content • Just as Google does on text data
  • 273.
    Overview Paper  Lin-shanLee, James Glass, Hung-yi Lee, Chun-an Chan, "Spoken Content Retrieval — Beyond Cascading Speech Recognition with Text Retrieval," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, no.9, pp.1389-1420, Sept. 2015  http://speech.ee.ntu.edu.tw/~tlkagk/paper/Over view.pdf  This tutorial includes updated information after this paper is published.
  • 274.
Thank You for Your Attention

Editor's Notes

  • #5 Course recordings, TV programs
  • #6 Today the Internet has become an everyday part of human life. Internet content is indexed, retrieved, searched, and browsed primarily based on text. Although multimedia Internet content is growing rapidly, with shared videos, social media, broadcasts, etc., as of today, it still tends to be processed primarily based on the textual descriptions of the content offered by the multimedia providers.
  • #8 Find the audio files that contain only “Obama” but not “US President”
  • #12 1:00
  • #13 1:00
  • #14 Speak more slowly 4:20
  • #15 Not going to review the Babel !!!!!
  • #21 1:30
  • #22 1:30
  • #25 I have to explain posterior probability
  • #26 I have to explain posterior probability
  • #27 I have to explain posterior probability
  • #28 http://www.itl.nist.gov/iad/mig/publications/storage_paper/Interspeech07-STD06-v13.pdf https://www.icsi.berkeley.edu/pubs/speech/taoatwv13.pdf if we weight each query by its occurrence times in the audio collections (the queries appear frequently in the corpus may have larger probabilities to be requested), then we have Actual Occurrence-Weighted Value. Pick a threshold (how to pick threshold has large influence to the results)
  • #35 1:20
  • #45 1:30 = 7:00
  • #46 In Chinese: going beyond the existing framework Last: 10:00
  • #47 6:00
  • #49 Interactive spoken content retrieval Summarization & Key term extraction Organizing retrieved results Question answering Interactive retrieval, Summarization, Key term extraction, Organizing retrieved results, Question answering Visualization
  • #73 0:40
  • #77 97 135 110
  • #84 Fuji-san 夫機桑 Naoyuki Kanda, Ryu Takeda, Yasunari Obuchi: Using rhythmic features for Japanese spoken term detection. SLT 2012: 170-175
  • #100 3:00
  • #102 Not select examples Formulated as a problem on graph
  • #121 7:20
  • #122 14% is semi, with one label data [Kartik Audhkhasi, ICASSP, 2014] http://www.redes.unb.br/lasp/files/events/ICASSP2014/papers/p7919-audhkhasi.pdf
  • #124 [Hazen, ASRU 09][Zhang & Glass, ASRU 09][Zhang & Glass, ICASSP 11][Zhang & Glass, Interspeech 11][Zhang, Salakhutdinov, ICASSP 12]
  • #125 We have to mention some evolutions here!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!????????????????????? Also known as unsupervised spoken term detection, zero-resource spoken content retrieval, etc. Evaluation program
  • #130 How’s the comparison result?
  • #132 https://www.semanticscholar.org/paper/Integrating-frame-based-and-segment-based-dynamic-Chan-Lee/7d8eaf612500420a125b3cfca98701114cf68873/pdf
  • #134 How do they do
  • #137 With color
  • #138 “Segmental acoustic indexing for zero resource keyword search https://www.semanticscholar.org/paper/Segmental-acoustic-indexing-for-zero-resource-Levin-Jansen/a11b4da42e6392e8dba76ec6175da856a2e143b4/pdf Indexing raw acoustic features for scalable zero resource search,” http://ttic.uchicago.edu/~haotang/speech/IS2012a.pdf
  • #166 encyclopedia https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester2201213/ICASSP2013b.pdf
  • #170 Consider as P35
  • #180 Transportation 交通工具
  • #185 Consider as P35
  • #187 as spoken term detection
  • #195 How to do acoustic pattern discovery will be covered in the following talk; we just focus on what happens after we get the patterns — how can they be helpful in semantic retrieval
  • #196 “lexicon” (dictionary)
  • #202 5:00
  • #211 QA doen Prere Summary RL
  • #214 52:30 02:00
  • #226 We want the Return as large as possible
  • #229 [Cronen-Townsen, SIGIR 02] [Zhou, SIGIR 07] A set of candidate actions System: “More precise, please.” System: “Is it relevant to XXX?” …..
  • #230 [Cronen-Townsen, SIGIR 02] [Zhou, SIGIR 07] A set of candidate actions System: “More precise, please.” System: “Is it relevant to XXX?” …..
  • #231 [Cronen-Townsen, SIGIR 02] [Zhou, SIGIR 07] A set of candidate actions System: “More precise, please.” System: “Is it relevant to XXX?” …..
  • #232 Put More Results?
  • #233 52:30 02:00
  • #243 The relation between the training data and the structure is learned from the training data
  • #249 52:30 02:00
  • #251 Retrieving online lectures is never true But only search one course
  • #253 On the well-known online lecture platform Type language model into the look up wind
  • #254 Video clips The system can find the lectures related to the learning need. Although this on-line lecture platform does not do that, it is possible to do so
  • #256 Identify the lectures teaching the same content
  • #260 Feature representation:
  • #261 52:30 02:00
  • #263 52:30 02:00
  • #265 information retrieval (IR) techniques [3] or relied on knowledge bases [4] May be answered by simply extracting the key terms from a properly chosen utterance [3] S.-R. Shiang, H.-y. Lee, and L.-s. Lee, “Spoken question answering using tree-structured conditional random fields and two-layer random walk.” in INTERSPEECH, 2014, pp. 263–267. [4] B. Hixon, P. Clark, and H. Hajishirzi, “Learning knowledge graphs for question answering through conversational dialog.” [5] P. R. Comas, J. Turmo, and L. M`arquez, “Sibyl, a factoid question-answering system for spoken documents,” ACM Trans. Inf. Syst., 2012. [6] J. Turmo, P. R. Comas, S. Rosset, O. Galibert, N. Moreau, D. Mostefa, P. Rosso, and D. Buscaldi, Multilingual Information Access Evaluation I. Text Retrieval Experiments: 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009, Corfu, Greece, September 30 - October 2, 2009, Revised Selected Papers. Springer Berlin Heidelberg, 2010, ch. Overview of QAST 2009, pp. 197–211. [7] J. Turmo, P. Comas, S. Rosset, L. Lamel, N. Moreau, and D. Mostefa, “Overview of QAST 2008,” in Working Notes for the CLEF 2008 Workshop,, 2008. [8] D. Giampiccolo, P. Forner, J. Herrera, A. Pe˜nas, C. Ayache, C. Forascu, V. Jijkoun, P. Osenova, P. Rocha, B. Sacaleanu, and R. Sutcliffe, Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Evaluation Forum. Springer Berlin Heidelberg, 2008, ch. Overview of the CLEF 2007 Multilingual Question Answering Track, pp. 200– 236.
  • #268 Memory network proposed by FB’s AI team
  • #273 Today the Internet has become an everyday part of human life. Internet content is indexed, retrieved, searched, and browsed primarily based on text. Although multimedia Internet content is growing rapidly, with shared videos, social media, broadcasts, etc., as of today, it still tends to be processed primarily based on the textual descriptions of the content offered by the multimedia providers.