Spoken Content Retrieval
Beyond Cascading Speech
Recognition and Text Retrieval
Lin-shan Lee and Hung-yi Lee
National Taiwan University
Focus of this Tutorial
 New frontiers and directions towards the
future of speech technologies
 Not skills and experiences in optimizing
performance in evaluation programs
Text Content Retrieval
Voice Search
Spoken Content Retrieval
Lectures Broadcast Program
Multimedia Content
Spoken Content
Spoken Content Retrieval
 Spoken content retrieval: the machine listens to the data and
extracts the desired information for each individual user.
 Nobody is able to go through the data.
300 hours of multimedia are
uploaded per minute.
(2015.01)
1874 courses on coursera
(2016.04)
 In such multimedia, the spoken part carries very
important information about the content
• Just as Google does on text data
 Basic goal: Identify the time spans in which the query
occurs in an audio database
 This is called “Spoken Term Detection”
Spoken Content Retrieval – Goal
time 1:01
time 2:05
…
……
“Obama”
user
 Basic goal: Identify the time spans in which the query
occurs in an audio database
 This is called “Spoken Term Detection”
 Advanced goal: Semantic retrieval of spoken content
Spoken Content Retrieval – Goal
user
“US President”
The user is also looking
for utterances including
“Obama”.
Retrieval system
It is natural to think ……
Spoken Content Retrieval
Speech Recognition
+
Text Retrieval
=
It is natural to think ……
Speech
Recognition Models
Text
Acoustic Models
Language Model
Spoken
Content
 Transcribe spoken content into text by speech recognition
It is natural to think ……
Speech
Recognition Models
Text
 Transcribe spoken content into text by speech recognition
Spoken
Content
RNN/LSTM DNN
It is natural to think ……
 Transcribe spoken content into text by speech recognition
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
 Use text retrieval approaches to search over the
transcriptions
Spoken
Content
Black Box
It is natural to think ……
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
user
Spoken
Content
 For spoken queries, transcribe them into text by speech
recognition.
Black Box
Our point in this tutorial
Spoken Content Retrieval
Speech Recognition
+
Text Retrieval
=
Outline
 Introduction: Conventional Approach:
Spoken Content Retrieval =
Speech Recognition + Text Retrieval
 Core: Beyond Cascading Speech
Recognition and Text Retrieval
Five new directions
Introduction:
Spoken Content Retrieval =
Speech Recognition + Text Retrieval
It is natural to think ……
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
user
Spoken
Content
Speech Recognition always produces errors.
Lattices
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
Spoken
Content
Lattices
 Keep many possible recognition outputs
 Each path has a weight (the confidence of being
correct)
M. Larson and G. J. F. Jones, “Spoken content retrieval: A
survey of techniques and technologies,” Foundations and
Trends in Information Retrieval, vol. 5, no. 4-5, 2012.
Lattices
Spoken
Archive
Speech
Recognition
System
Acoustic &
Language
Models
Lattices
Retrieval
Result
Text
Retrieval
user
Text Query
Each path is a possible recognition result
time
Horizontal scale is the time
Lattices
Spoken
Archive
Speech
Recognition
System
Acoustic &
Language
Models
Lattices
Retrieval
Result
Text
Retrieval
user
Text Query
time
Higher probability of including the correct words
But more noisy words are inevitably included
Higher memory/computation requirements
Searching over Lattices
 Consider the basic goal: Spoken Term Detection
“Obama”
user
Searching over Lattices
 Consider the basic goal: Spoken Term Detection
 Find the arcs hypothesized to be the query term
Obama
“Obama”
user
Obama
x1
x2
 Consider the basic goal: Spoken Term Detection
 Posterior probabilities computed from the lattices are used as
confidence scores
Searching over Lattices
Obama
x1
R(x1)=0.9
Obama
x2
R(x2)=0.3
Two ways to display the results:
unranked and ranked.
 Consider the basic goal: Spoken Term Detection
 Unranked: Return the results with the scores higher than
a threshold
Searching over Lattices
Obama
x1
R(x1)=0.9
Obama
x2
R(x2)=0.3
Set the threshold as 0.6
Return x1
 Consider the basic goal: Spoken Term Detection
 Unranked: Return the results with the scores higher than
a threshold
Searching over Lattices
Obama
x1
R(x1)=0.9
Obama
x2
R(x2)=0.3
The threshold can be determined
automatically and can be query-specific.
[Miller, Interspeech 07][Can, HLT 09][Mamou,
ICASSP 13][Karakos, ASRU 13][Zhang, Interspeech
12][Pham, ICASSP 14]
Actual Term Weighted Value
(ATWV)
 Evaluating unranked result
ATWV = 1 − P_miss − β·P_FA
P_miss = 1 − N_correct / N_ref
P_FA = N_spurious / N_NT
N_ref: number of times the query term appears in the audio database
N_correct: the number of retrieved objects that are actually correct
N_spurious: the number of retrieved objects that are incorrect
N_NT: audio duration (in seconds) − N_ref
[Figure: ranked hits (time 1:01, 1.0; time 2:05, 0.9; time 1:31, 0.7; ……) with a threshold separating retrieved from non-retrieved hits]
Maximum Term Weighted Value (MTWV): tune the threshold to
obtain the best ATWV
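A small worked example makes the bookkeeping concrete. The sketch below (Python; β = 999.9 is an illustrative setting, the value commonly used in the NIST evaluations) computes the term-weighted value for one query term; ATWV averages this quantity over all query terms.

def term_weighted_value(n_correct, n_spurious, n_ref, audio_seconds, beta=999.9):
    """Term-weighted value for a single query term, per the definitions above."""
    p_miss = 1.0 - n_correct / n_ref          # fraction of true occurrences missed
    n_nt = audio_seconds - n_ref              # number of "non-target" trials
    p_fa = n_spurious / n_nt                  # false alarm probability
    return 1.0 - p_miss - beta * p_fa

# Example: the query occurs 10 times in one hour of audio; above the threshold
# the system returns 8 correct hits and 2 spurious ones.
print(term_weighted_value(n_correct=8, n_spurious=2, n_ref=10, audio_seconds=3600))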
 Consider the basic goal: Spoken Term Detection
 Ranked: results ranked according to the scores
Searching over Lattices
Obama
x1
R(x1)=0.9
Obama
x2
R(x2)=0.3
x1 0.9
x2 0.3
…
 Consider the basic goal: Spoken Term Detection
 Ranked: The results are ranked according to the scores
Searching on Lattices
Obama
x1
R(x1)=0.9
Obama
x2
R(x2)=0.3
x1 0.9
x2 0.3
… user
Mean Average Precision (MAP)
 Evaluating a ranked list
 Area under the recall-precision curve
 Recall: percentage of ground truth results retrieved
 Precision: percentage of retrieved results being
correct
 Higher threshold gives higher precision but lower
recall, etc.
[Figure: recall-precision curves (precision vs. recall, both from 0 to 1) for two systems, MAP = 0.484 and MAP = 0.586]
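For a ranked list, the average precision can be computed directly from the relevance of each retrieved item; MAP is its mean over all queries. A minimal sketch assuming binary relevance judgments:

def average_precision(ranked_relevance, num_relevant):
    """ranked_relevance: booleans in ranked order; num_relevant: total relevant items."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank      # precision at each recall point
    return precision_sum / num_relevant

# Relevant items are retrieved at ranks 1, 3 and 4; 4 relevant items exist in total.
print(average_precision([True, False, True, True, False], num_relevant=4))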
Examples of Lattice Indexing
Approaches
• Position Specific Posterior Lattices (PSPL)[Chelba, ACL
05][Chelba, Computer Speech and Language 07]
• Confusion Networks (CN)[Mamou, SIGIR 06][Hori, ICASSP
07][Mamou, SIGIR 07]
• Time-based Merging for Indexing (TMI)[Zhou, HLT 06][Seide,
ASRU 07]
• Time-anchored Lattice Expansion (TALE)[Seide, ASRU
07][Seide, ICASSP 08]
• WFST: directly compile the lattice into a weighted finite
state transducer [Allauzen, HLT 04][Parlak, ICASSP 08][Can, ICASSP
09][Parada, ASRU 09]
Out-of-Vocabulary (OOV) Problem
 Speech recognition is based on a lexicon
 Words not in the lexicon can never be transcribed
 Many informative words are out-of-vocabulary
(OOV)
 Many query terms are new or special words or
named entities
 All OOV words are composed of subword units
 Generate subword lattices
 Transform word lattices into subword lattices
 Can also be directly generated by speech recognition
using a subword-based lexicon and language model
Subword-based Retrieval
Retrieval
An arc in the
word lattice
Corresponding
subword sequence
/rɪ/ /trɪ/ /vǝl/
word lattices subword lattices
 Subword-based retrieval
 Generate subword lattices
 Transform user query into subword sequence
Obama → /au/ /ba/ /mǝ/
 Text retrieval techniques are equally useful, except now
based on subword lattices and subword queries
(replace words by subword units)
 OOV words can be retrieved by matching over the
subword units without being recognized
Subword-based Retrieval
Subword-based Retrieval
- Frequently Used Subword Units
• Linguistically motivated units
– phonemes, syllables/characters, morphemes, etc.
[Ng, MIT 00][Wallace, Interspeech 07][Chen & Lee, IEEE T. SAP 02]
[Pan & Lee, ASRU 07][Meng, ASRU 07][Meng, Interspeech 08]
[Mertens, ICASSP 09][Itoh, Interspeech 07][Itoh, Interspeech 11]
[Pan & Lee, IEEE T. ASL 10]
• Data-driven units
– particles, word fragments, phone multigrams, morphs,
etc.
[Turunen, SIGIR 07] [Turunen, Interspeech 08]
[Parlak, ICASSP 08][Logan, IEEE T. Multimedia 05]
[Gouvea, Interspeech 10][Gouvea, Interspeech 11][Lee & Lee, ASRU 09]
Integrating Different Clues
from Recognition
 Similar to system combination in ASR
 Consistency very often implies accuracy
 Integrating the outputs from different
recognition systems [Natori, Interspeech 10]
 Integrating results based on different subword
units [S.-w. Lee, ICASSP 05][Pan & Lee, Interspeech 07][Meng,
Interspeech 10][Itoh, Interspeech 11]
 Weights of different clues estimated by
optimizing some retrieval related criteria [Meng &
Lee, ICASSP 09][Chen & Lee, ICASSP 10][Meng, Interspeech 10][Wollmer, ICASSP
09]
Integrating Different Clues
from Recognition
 Weights for Integrating 1,2,3-grams for different
word/subword units and different indices
syllable
Confusion
Network
Position
Specific
Posterior
Lattice
word
character
syllable
word
character
1-gram
2-gram
3-gram
1-gram
2-gram
3-gram
integrated with
different weights
maximizing the lower bound of MAP by SVM-MAP
[Meng & Lee, ICASSP 09]
Training Retrieval Model
Parameters
 Integrating different n-grams,
word/subword units and indices
single clue integrated
[Meng & Lee, ICASSP 09] [Chen & Lee, ICASSP 10]
ASR Accuracy vs. Retrieval Performance
 Spoken Term Detection, Lectures
Speaker Dependent:
10 hours of speech from the instructor
ASR Accuracy vs. Retrieval Performance
 Spoken Term Detection, Lectures
Improved Speaker Adaptation
ASR Accuracy vs. Retrieval Performance
 Spoken Term Detection, Lectures
Initial Speaker Adaptation
ASR Accuracy vs. Retrieval Performance
 Spoken Term Detection, Lectures
Speaker Independent
Realistic!
ASR Accuracy
MAP
ASR Accuracy vs. Retrieval Performance
 Precision at 10: Percentage of the correct items among
the top 10 selected
Speaker Independent:
Only 60% of results are correct
Retrieve
YouTube?!
 Did lattices solve the problem?
 Need high quality recognition models to produce better
lattices and accurately estimate the confidence scores
 Spoken content over the Internet is produced in different
languages on different domains in different parts of the
world under varying acoustic conditions
 High quality recognition models for such content don't
exist yet
 Retrieval performance limited by ASR accuracy
Is the problem solved?
 Desired spoken content retrieval
Less constrained by ASR accuracy
Existing approaches limited by ASR
accuracy because of the cascading of speech
recognition and text retrieval
 Go beyond the cascading concept
Is the problem solved?
Our point in this tutorial
Spoken Content Retrieval
Speech Recognition
+
Text Retrieval
=
Core:
Beyond Cascading Speech
Recognition and Text Retrieval
New Directions
1. Modified ASR for Retrieval Purposes
2. Incorporating Information Lost in ASR
3. No Speech Recognition!
4. Special Semantic Retrieval Techniques for Spoken
Content
5. Spoken Content is Difficult to Browse!
Overview Paper
 Lin-shan Lee, James Glass, Hung-yi Lee,
Chun-an Chan, "Spoken Content Retrieval —
Beyond Cascading Speech Recognition with
Text Retrieval," IEEE/ACM Transactions on
Audio, Speech, and Language Processing,
vol.23, no.9, pp.1389-1420, Sept. 2015
 http://speech.ee.ntu.edu.tw/~tlkagk/paper/Over
view.pdf
 This tutorial includes information updated after
this paper was published.
New Direction 1:
Modified ASR
for Retrieval Purposes
Retrieval Performance
vs. Recognition Accuracy
 Intuition: Higher recognition accuracy, better
retrieval performance
Not always true!
In Taiwan, the need of …
Recognition
System A
Recognition
System B
In Taiwan, a need of … In Thailand, the need of …
Same recognition accuracy
Retrieval Performance
vs. Recognition Accuracy
 Intuition: Higher recognition accuracy, better
retrieval performance
Not always true!
In Taiwan, the need of …
Recognition
System A
Recognition
System B
In Taiwan, a need of … In Thailand, the need of …
Not important
for retrieval
Serious problem
for retrieval
Retrieval Performance
vs. Recognition Accuracy
 Retrieval performance is more correlated with the ASR errors on
named entities than on normal terms [Garofolo, TREC-7 99][L. van der
Werff, SSCS 07]
 Expected error rate defined on lattices is a better predictor of
retrieval performance than one-best transcriptions [Olsson, SSCS
07]
 lattices used in retrieval
 For retrieval, substitution errors have more influence than
insertions and deletions [Johnson, ICASSP 99]
 Language models that reduce ASR errors do not always yield
better retrieval performance [Cui, ICASSP 13][Shao, Interspeech 08][Wallace, SSCS 09]
 Query terms are usually topic-specific, with lower n-gram
probabilities
ASR models learned by
Optimizing Retrieval Performance
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
Spoken
Content
ASR models learned by
Optimizing Retrieval Performance
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
Spoken
Content
Optimized for recognition accuracy
ASR models learned by
Optimizing Retrieval Performance
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
Spoken
Content
Optimized for recognition accuracy
Spoken Content Retrieval
Retrieval Performance
New Direction 1-1:
Modified ASR
for Retrieval Purposes
Acoustic Modeling
Acoustic Modeling
 Acoustic Model Training
θ̂ = argmax_θ F(θ)
θ: acoustic model parameters
F(θ): objective function
The objective function F(θ) is usually defined
to optimize ASR accuracy.
Design a new objective function for
optimizing retrieval performance.
Acoustic Modeling
 Objective function for optimizing ASR
performance
θ̂ = argmax_θ F(θ)
F(θ) = Σ_u Σ_{s_u ∈ L(u)} A(r_u, s_u) · P_θ(s_u | u)
[Figure: lattice of utterance u with word arcs wA, wB, wC, …]
Summation over all the
utterances u in the training data
L(u): all the word sequences in
the lattice of u
Acoustic Modeling
 Objective function for optimizing ASR
performance
θ̂ = argmax_θ F(θ)
F(θ) = Σ_u Σ_{s_u ∈ L(u)} A(r_u, s_u) · P_θ(s_u | u)
[Figure: lattice of utterance u]
s_u: a word sequence in the lattice of u
A(r_u, s_u): the accuracy of word or phoneme
sequence s_u compared with the reference r_u
P_θ(s_u | u): posterior probability of word
sequence s_u given acoustic model θ
Examples: MCE, MPE, sMBR; θ can be an HMM or a DNN
Acoustic Modeling
 Objective function for optimizing retrieval
performance
θ̂ = argmax_θ F(θ)
F(θ) = Σ_u Σ_{s_u ∈ L(u)} A(r_u, s_u) · P_θ(s_u | u)
If the possible query terms are known in advance,
they can be weighted higher in A(r_u, s_u):
W-MCE [Fu, ASRU 07][Weng, Interspeech 12][Weng, ICASSP 13],
keyword-boosted sMBR [Chen, Interspeech 14]
 In most cases, the query terms are not known in
advance
 Collect feedback data on-line
 Use the information to optimize search engines
Feedback can be implicit
Training Data collected from User
time 1:10 F
time 2:01 F
time 3:04 F
time 5:31 T
time 1:10 F
time 2:01 T
time 3:04 F
time 5:31 T
time 1:10 F
time 2:01 T
time 3:04
time 5:31
Query Q1 Query Q2 Query Qn
……
ASR models learned by
Optimizing Retrieval Performance
Speech
Recognition Models
Text
Retrieval
Query user
Spoken
Content
Lattices
Retrieval
Result
re-estimate
optimize
[Lee & Lee, ICASSP 10]
[Lee & Lee, Interspeech 10]
[Lee & Lee, SLT 10]
[Lee & Lee, IEEE T. ASL 12]
time 1:10 F
time 2:01 F
time 3:04 F
time 5:31 T
time 1:10 F
time 2:01 T
time 3:04 F
time 5:31 T
time 1:10 F
time 2:01 T
time 3:04
time 5:31
Query Q1 Query Q2 Query Qn
……
Updated Retrieval Process
 Each retrieval result x has a confidence score
R(x)
 R(x) depends on the recognition model θ
 R(x) should be R(x;θ)
Re-estimate
recognition
model θ
Update the
scores R(x; θ)
The retrieval
results can be
re-ranked.
Considering some
retrieval criterion
Basic Form
 Basic Form:
F(θ) = Σ_{x+} R(x+; θ) − Σ_{x−} R(x−; θ)
θ̂ = argmax_θ F(θ)
x+: a positive example (labeled relevant in the user feedback)
x−: a negative example
R(x+; θ): confidence score of the positive example
R(x−; θ): confidence score of the negative example
Maximizing F(θ) increases the confidence scores of
the positive examples and decreases the confidence
scores of the negative examples.
Consider Ranking
 The basic form can increase the objective while producing a worse
ranking: under the new model θ̂, the scores R(x+; θ̂) of positive examples
may rise and the scores R(x−; θ̂) of negative examples may fall, yet a
negative example can still be ranked above a positive one.
[Figure: confidence scores of positive and negative examples under the original model θ and the new model θ̂; in one case the new model ranks perfectly, in the other the ranking is worse even though the basic objective increases]
 Considering the ranking order:
δ(x+, x−) = 1 if R(x+; θ) > R(x−; θ), and 0 otherwise
F(θ) = Σ_{x+} Σ_{x−} δ(x+, x−)
 If the confidence score of a positive example exceeds that of a
negative example, the objective function adds 1.
 δ(x+, x−) is approximated by a sigmoid function during optimization.
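To make the idea concrete, the sketch below optimizes the ranking objective with the sigmoid approximation by gradient ascent. It is only a toy illustration: R(x; θ) is taken as a linear scorer over hypothetical per-hypothesis features, whereas the cited work re-estimates the HMM/DNN acoustic model parameters themselves.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(x, theta):                           # stand-in for the confidence R(x; theta)
    return float(np.dot(theta, x))

def ranking_objective(theta, positives, negatives):
    # F(theta) ~ sum over (x+, x-) pairs of sigmoid(R(x+; theta) - R(x-; theta))
    return sum(sigmoid(score(xp, theta) - score(xn, theta))
               for xp in positives for xn in negatives)

def gradient_step(theta, positives, negatives, lr=0.1):
    grad = np.zeros_like(theta)
    for xp in positives:
        for xn in negatives:
            d = sigmoid(score(xp, theta) - score(xn, theta))
            grad += d * (1.0 - d) * (xp - xn)  # derivative of each sigmoid term
    return theta + lr * grad

theta = np.zeros(3)
positives = [np.array([1.0, 0.2, 0.0])]        # results the user clicked (relevant)
negatives = [np.array([0.1, 0.9, 0.3])]        # results the user skipped
for _ in range(50):
    theta = gradient_step(theta, positives, negatives)
print(ranking_objective(theta, positives, negatives))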
Little feedback data? Treat the unlabeled examples as negative examples.
Acoustic Models - Experiments
 Lecture recording (80 queries, each with 5 clicks) [Lee & Lee, IEEE T. ASL 12]
[Figure: MAP from 0.46 to 0.52 comparing the Baseline, the Basic Form, the Ranking objective, and the Ranking objective with unlabelled examples treated as negative]
New Direction 1-2:
Modified ASR
for Retrieval Purposes
Besides Acoustic Modeling
Language Modeling
 The query terms are usually very specific. Their
probabilities are underestimated.
 Boosting the probabilities of n-grams including query
terms
 By repeating the sentences including the query terms
in training corpora
 Helpful in DARPA’s RATS program [Mandal, Interspeech 13]
and NIST OpenKWS13 evaluation [Chen, ISCSLP 14]
 NN-based LM: Modifying training criterion, so the key
terms are weighted more during training
 Helpful in NIST OpenKWS13 evaluation [Gandhe, ICASSP
14]
Decoding
 Give different words different pruning thresholds during
decoding
 The keywords are given lower pruning thresholds than
normal terms
 Called white listing [Zhang, Interspeech 12] or keyword-
aware pruning [Mandal, Interspeech 13]
 OOV words are never correctly recognized
 Two stage approach [Shao, Interspeech 08]
 Identify the lattices probably containing OOV (by
subword-based approach)
 Insert the word arcs of OOV words into lattices and
rescore
Confusion Models
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
Spoken
Content
Confusion
Model
A B C A’ B’ C’
The ASR produces systematic errors, so it is possible to
learn a confusion model to offer better retrieval results
[Karanasou, Interspeech 12][Wallace, ICASSP 10]
Jointly Optimizing Speech
Recognition and Retrieval Modules
Complex
Model
query
A spoken segment
Yes, the segment
contains the query.
No, …….
End-to-end model performing speech recognition
and retrieval jointly (learned jointly) in one step
Sounds crazy?
Much information lost during
ASR
Much information lost during ASR
Transcriptions:
using syntax vectors surge ……
Lattice:
ASR
Spoken Content
New Direction 2:
Incorporating Information Lost in ASR
Information beyond Speech
Recognition Output
Speech
Recognition Models
Text
Retrieval
Result
Text
Retrieval
Query user
Spoken
Content
Black Box
Incorporating information lost in
standard speech recognition to help retrieval
New Direction 2-1:
Incorporating Information Lost in ASR
What kind of information can be helpful?
Information beyond Speech
Recognition Output
 Phoneme or syllable duration [Wollmer, ICASSP 09][Naoyuki
Kanda, SLT 12][Teppei Ohno, SLT 12]
 Pitch & Energy [Tejedor, Interspeech 10]
 Landmark and attribute detection with prosodic cues
can reduce false alarms [Ma, Interspeech 2007]
[Naoyuki Kanda, SLT 12]
Example: the query is the Japanese word "fu-ji-sa-N"; a hypothesized
occurrence that is very short is likely a false alarm.
Query-specific Information
 "Jack of all trades, master of none“
Speech recognition: needs to correctly recognize all the words
Spoken term detection: only needs higher detector accuracy
on the specific query
Retrieval
System
Query-specific Detector
Query Q
Lattices
First-pass Retrieval Result
x1
x2 x3
Examples of Q
Compute Similarity
Exemplar-based approach also used in speech recognition
[Demuynck, ICASSP 2011][Heigold, ICASSP 2012][Nancy Chen, ICASSP 2016]
Similarities
between Audio Segments
Dynamic Time
Warping (DTW)
Retrieval
System
Query-specific Detector
Query Q
Lattices
First-pass Retrieval Result
x1
x2 x3
Examples of Q
Learn a
model
Model for Q
Evaluate
confidence
Retrieval
System
Query-specific Detector
Query Q
Lattices
First-pass Retrieval Result
x1
x2 x3
positive examples
Learn a discriminative model
negative examples
SVM
[Tu & Lee, ASRU 11]
[I.-F. Chen, Interspeech 13]
Query-specific Detector
 The input of SVM or MLP has to be a fixed-length vector
 Representing an audio segment of variable length as
a fixed-length vector
[Figure: variable-length sequences of feature vectors mapped to fixed-length vectors]
[Tu & Lee, ASRU 11]
[I.-F. Chen, Interspeech 13]
Retrieval
System
Query-specific Detector
Query Q
Lattices
First-pass Retrieval Result
x1
x2 x3
positive examples
negative examples
 Is it realistic to have those examples?
Pseudo-relevance Feedback (PRF)
User Feedback
New Direction 2-2:
Incorporating Information Lost in ASR
Pseudo Relevance Feedback
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
Lattices
First-pass Retrieval Result
x1
x2 x3
[Chen & Lee,
Interspeech 11]
[Lee & Lee, CSL 14]
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
Confidence scores from lattices
Lattices
R(x1)
First-pass Retrieval Result
x1
x2 x3
R(x2) R(x3)
Not shown to the user
[Chen & Lee,
Interspeech 11]
[Lee & Lee, CSL 14]
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
Lattices
R(x1)
First-pass Retrieval Result
x1
x2 x3
R(x2) R(x3)
Assume the results with high confidence scores are correct
Examples of Q
Considered as examples of Q
[Chen & Lee,
Interspeech 11]
[Lee & Lee, CSL 14]
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
Lattices
R(x1)
First-pass Retrieval Result
x1
x2 x3
R(x2) R(x3)
similar dissimilar
Examples of Q
[Chen & Lee,
Interspeech 11]
[Lee & Lee, CSL 14]
Retrieval
System
Pseudo Relevance Feedback (PRF)
Query Q
Lattices
R(x1)
First-pass Retrieval Result
x1
x2 x3
R(x2) R(x3)
time 1:01
time 2:05
time 1:45
…
time 2:16
time 7:22
time 9:01
Rank according to new scores
Examples of Q
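A minimal sketch of the PRF re-scoring step, assuming the first-pass result is a list of (segment_features, lattice_confidence) pairs and that dtw_distance is an available helper (e.g. DTW over MFCC or posteriorgram frames); the systems cited above differ in how the two scores are combined and normalized.

def prf_rescore(first_pass, dtw_distance, num_pseudo=3, alpha=0.5):
    ranked = sorted(first_pass, key=lambda hit: hit[1], reverse=True)
    pseudo_examples = [feats for feats, _ in ranked[:num_pseudo]]   # assumed correct
    rescored = []
    for feats, lattice_score in first_pass:
        # smaller DTW distance to the pseudo examples means more similar to the query
        similarity = -min(dtw_distance(feats, ex) for ex in pseudo_examples)
        # in practice both scores are normalized before interpolation
        new_score = (1 - alpha) * lattice_score + alpha * similarity
        rescored.append((feats, new_score))
    return sorted(rescored, key=lambda hit: hit[1], reverse=True)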
(A) (B)
 Lecture recording [Lee & Lee, CSL 14]
Pseudo Relevance Feedback (PRF)
- Experiments
Evaluation Measure: MAP (Mean Average Precision)
(A) (B)
Pseudo Relevance Feedback (PRF)
- Experiments
(B): speaker independent (50% recognition accuracy)
(A): speaker dependent (84% recognition accuracy)
(A) and (B) use different speech recognition systems
(A) (B)
 PRF (red bars) improved the first-pass retrieval
results with lattices (blue bars)
Pseudo Relevance Feedback (PRF)
- Experiments
New Direction 2-3:
Incorporating Information Lost in ASR
Graph-based Approach
Graph-based Approach
 PRF
 Each result considers its similarity to the audio examples
 Makes an assumption to find the examples
 Graph-based approach [Chen & Lee, ICASSP 11][Lee & Lee, APSIPA 11][Lee & Lee, CSL 14]
 Does not assume some results are correct
 Considers the similarity between all results
Graph Construction
 The first-pass results are considered as a graph.
 Each retrieval result is a node
First-pass Retrieval
Result from lattices
x1
x2
x3
x2
x3
x1
x4
x5
Graph Construction
 The first-pass results are considered as a graph.
 Nodes are connected if their retrieval results are similar.
 DTW similarities are considered as edge weights
x2
x3
x1
x4
x5
Dynamic Time Warping
(DTW) Similarity
similar
Changing Confidence Scores by Graph
 The score of each node depends on its neighbors.
x2
x3
x1
x4
x5
G(x1)
G(x2)
G(x3)
G(x5)
G(x4)
high
high
The results are ranked according to new scores G(xi).
“You are known by the company you keep”
Changing Confidence Scores by Graph
 The score of each node depends on its neighbors.
x2
x3
x1
x4
x5
G(x1)
G(x2)
G(x3)
G(x5)
G(x4)
low
low
The results are ranked according to new scores G(xi).
“You are known by the company you keep”
Graph-based Re-ranking - Formulation
G(x_i) = (1 − α) · R(x_i) + α · Σ_{x_j ∈ N(x_i)} Ŵ(x_j, x_i) · G(x_j)
 R(x_i): original score of x_i (from lattices)
 The second term considers the graph structure
 N(x_i): neighbors of x_i (nodes connected to x_i)
 W(x_i, x_j): edge weight between x_i and x_j; Ŵ(x_j, x_i) is W(x_j, x_i)
normalized by the weights of all the edges connected to x_j
 The score of x_i is pulled closer to the scores of the nodes x_j with
larger edge weights
 α: interpolation weight between the original score and the graph term
 Assign a score G(x) to each hit region based on the graph structure:
G(x1) depends on G(x2) and G(x3); G(x2) depends on G(x1) and G(x3); ……
 How to find G(x1), G(x2), G(x3) …… satisfying the equation above?
This is a random walk.
 G(x_i) is uniquely and efficiently obtainable.
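One way to see why G(x_i) is efficiently obtainable: the fixed point of the equation above can be reached by simple iteration (a random walk with restart). A minimal sketch, assuming W is the symmetric DTW-similarity matrix between the first-pass results and R holds their lattice confidence scores:

import numpy as np

def graph_rescore(R, W, alpha=0.5, iterations=100):
    R = np.asarray(R, dtype=float)
    W = np.asarray(W, dtype=float)
    col_sums = W.sum(axis=0)                         # total edge weight connected to each x_j
    W_hat = W / np.where(col_sums > 0, col_sums, 1.0)
    G = R.copy()
    for _ in range(iterations):
        G = (1 - alpha) * R + alpha * W_hat.dot(G)   # the re-ranking equation above
    return G

R = [0.9, 0.3, 0.8]                                  # lattice confidence scores
W = [[0.0, 0.1, 0.7],                                # DTW similarities between hit regions
     [0.1, 0.0, 0.2],
     [0.7, 0.2, 0.0]]
print(graph_rescore(R, W))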
 Lecture recording [Lee & Lee, CSL 14]
(A) (B)
Graph-based Approach -
Experiments
(B): speaker independent (low recognition accuracy)
(A): speaker dependent (high recognition accuracy)
(A) (B)
 Graph-based re-ranking (green bars) outperformed PRF (red
bars)
Graph-based Approach -
Experiments
[Figure: ATWV from 0.25 to 0.35 on Assamese, Bengali, and Lao, comparing the first pass (on lattices) with graph-based re-ranking]
Graph-based Approach –
Experiments on OpenKWS
[Lee & Glass,
Interspeech 14]
Graph-based Approach –
More Experiments
 13% relative improvement on OOV queries on
another lecture recording (several speakers) [Jansen,
ICASSP 13][Norouzian, ICASSP 13]
 14% relative improvement on AMI Meeting Corpus
[Norouzian, Interspeech 13]
 Graph Spectral Clustering
 Optimizing evaluation measure and considering the
graph structure at the same time [Audhkhasi, ICASSP 2014]
 11% relative improvement with subword-based
system on OpenKWS15 (Swahili) [Van Tung Pham, ICASSP,
2016]
New Direction 3:
No Speech Recognition!
Why Spoken Content Retrieval
without Speech Recognition?
 Bypassing ASR to avoid information loss and all problems
with ASR (errors, OOV words, background noise, etc. )
 Just to identify the query, no need to find out which words the
query includes
 Audio files on the Internet in hundreds of different languages
 Too limited annotated data for training reliable speech
recognition systems for most languages
 A written form doesn't even exist for some languages
 Many audio files are code-switched across several different
languages
Spoken Content Retrieval
without Speech Recognition
user
“US President”
spoken
query
Compute similarity between spoken queries and audio
files on acoustic level, and find the query term
Spoken Content
“US President” “US President”
Is it possible?
Approach Categories
 DTW-based Approaches
 Matching sequences with DTW
 Audio Segment Representation
 Representing audio segments by fixed length vector
representations
 Unsupervised ASR (or model-based approach)
 Training word- or subword-like acoustic patterns (or
tokens) from target audio archive
 Transcribing both the audio archive and the query into
word- or subword-like token sequences
 Matching based on the tokens, just like text retrieval
New Direction 3-1:
No Speech Recognition!
DTW-based Approaches
DTW-based Approach
 Conventional DTW
Audio Segment
Audio Segment
DTW-based Approach
 DTW for query-by-example
 Whether a spoken query is in an utterance
Spoken
Query
Utterance
Segmental DTW [Zhang,
ICASSP 10], Subsequence
DTW [Anguera, ICME 13][Calvo,
MediaEval 14]
Acoustic Feature Vectors
 Gaussian posteriorgram [Zhang, ICASSP 10][Wang,
MediaEval 14]
 Phonetic posteriors [Hazen, ASRU 09]
 MLP trained from another corpus (probably in a
different language)
 Bottle-neck feature generated from MLP [Kesiraju,
MediaEval 14]
 RBM posteriorgram [Zhang, ICASSP 12]
 Performance comparison [Carlin, Interspeech 11]
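A minimal sketch of subsequence DTW for query-by-example: the query (a sequence of feature vectors, e.g. Gaussian posteriorgrams) may start and end anywhere inside the utterance, so the first query frame can align with any utterance frame at no start-up cost. The Euclidean local distance is only an illustration; other distances (e.g. negative log inner product for posteriorgrams) are common.

import numpy as np

def subsequence_dtw(query, utterance):
    """query: (Tq, D) array; utterance: (Tu, D) array; returns the best matching cost."""
    Tq, Tu = len(query), len(utterance)
    dist = np.linalg.norm(query[:, None, :] - utterance[None, :, :], axis=2)
    D = np.full((Tq, Tu), np.inf)
    D[0, :] = dist[0, :]                      # free start anywhere in the utterance
    for i in range(1, Tq):
        for j in range(Tu):
            best_prev = D[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, D[i, j - 1], D[i - 1, j - 1])
            D[i, j] = dist[i, j] + best_prev
    return D[-1, :].min()                     # free end anywhere in the utterance

query = np.random.rand(20, 13)                # e.g. 20 frames of 13-dim features
utterance = np.random.rand(200, 13)
print(subsequence_dtw(query, utterance))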
Speed-up Approaches for DTW
 Segment-based matching [Chan & Lee, Interspeech
10][Chan & Lee, ICASSP 11]
Spoken
Query
Utterance
Group consecutive acoustically similar feature
vectors into a segment
Speed-up Approaches for DTW
 Segment-based matching
Hierarchical
Agglomerative
Clustering (HAC)
Step 1: build a tree
Step 2: pick a
threshold
Group consecutive acoustically similar feature
vectors into a segment
Speed-up Approaches for DTW
 Segment-based matching [Chan & Lee, Interspeech
10][Chan & Lee, ICASSP 11]
Spoken
Query
Utterance
Compute similarities between segments only
Speed-up Approaches for DTW
 Segment-based matching [Chan & Lee, Interspeech
10][Chan & Lee, ICASSP 11]
 Lower bound estimation [Zhang, ICASSP 11][Zhang,
Interspeech 11]
 Indexing the frames in the target audio file [Jansen,
ASRU 11][Jansen, Interspeech 12]
 Information Retrieval based DTW [Anguera, Interspeech
13]
New Direction 3-2:
No Speech Recognition!
Audio Segment Representation
Framework
Audio archive divided into variable-
length audio segments
Audio
Segment to
Vector
Audio
Segment to
Vector
Similarity
Search
Result
Spoken
Query
Off-line
On-line
[Chung & Lee, Interspeech 16][Chen, ICASSP 15]
[Levin, ICASSP 15][Levin, ASRU 13]
Audio Word to Vector
 The audio segments corresponding to words
with similar pronunciations are close to each
other.
ever ever
never
never
never
dog
dog
dogs
Audio Word to Vector -
Segmental Acoustic Indexing
 Basic idea
[Levin, ICASSP 15][Levin, ASRU 13]
[Figure: DTW distances (e.g. 0.5, 0.8, 0.3, ……) between the audio segment and a set of template audio segments form a fixed-length vector]
Audio Word to Vector –
Sequence Auto-encoder
[Chung & Lee,
Interspeech 16]
RNN Encoder
x1 x2 x3 x4
audio segment
acoustic features
Representation for the whole
audio segment
Audio Word to Vector –
Sequence Auto-encoder
RNN Decoder
x1 x2 x3 x4
y1 y2 y3 y4
x1 x2 x3 x4
RNN Encoder
[Chung & Lee,
Interspeech 16]
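A minimal sketch of the sequence auto-encoder idea, written with PyTorch and GRUs for brevity (the cited work may differ in the recurrent unit and training details): the encoder's last hidden state is the fixed-length embedding of the audio segment, and the decoder is trained to reconstruct the input acoustic features from it.

import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, feat_dim=39, hidden_dim=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):                        # x: (batch, T, feat_dim)
        _, h = self.encoder(x)                   # h: (1, batch, hidden) = segment embedding
        # teacher forcing: feed the shifted input frames to the decoder
        dec_in = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        dec_out, _ = self.decoder(dec_in, h)
        return self.output(dec_out), h.squeeze(0)

model = SeqAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 50, 39)                       # a batch of audio segments
recon, embedding = model(x)                      # embedding: one vector per segment
loss = nn.functional.mse_loss(recon, x)          # reconstruction objective
optimizer.zero_grad()
loss.backward()
optimizer.step()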
Sequence Auto-encoder –
Experimental Results
[Figure: pairs of audio segments (e.g. "never" vs. "ever") compared by the cosine similarity of their embeddings and by the edit distance between their phoneme sequences]
More similar pronunciation gives higher cosine similarity.
Sequence Auto-encoder –
Experimental Results
 Projecting the embedding vectors to 2-D
[Figure: 2-D projection in which words with similar pronunciations, such as "day"/"days" and "say"/"says", fall close together]
Sequence Auto-encoder –
Experimental Results
 Audio story (LibriSpeech corpus)
MAP
training epochs for
sequence auto-encoder
SA: sequence
auto-encoder
DSA: de-noising
sequence auto-encoder
Input: clean speech
+ noise
output: clean speech
New Direction 3-3:
No Speech Recognition!
Unsupervised ASR
Conventional ASR
… Hello World
…
ASR
unknown speech signal
Unsupervised ASR
ASR
unknown speech signal
Used in Query by example
Spoken Term Detection
Unsupervised ASR:
Learn the models for a set of acoustic patterns (tokens)
directly from the corpus (target spoken archive)
t0t1t2, t1t3,
t2t3,
t2t1t3t3t2 …
Acoustic Tokens
Unsupervised ASR - Acoustic
Token
utterance
acoustic
feature
acoustic tokens: chunks of acoustically similar feature
vectors with token ids
t0 t1 t2 t1
[Zhang & Glass, ASRU 09]
[Huijbregts, ICASSP 11]
[Chan & Lee, Interspeech 11]
Unsupervised ASR
- Overall Framework
Initialization
feature
sequence
model training
token decoding
initial token
sequence
final token
sequence
X: feature sequence for the whole corpus
ω_i: token sequences for X
θ_i: model (e.g. HMM) parameters
i: training iteration
simple segmentation
and clustering
Unsupervised ASR
- Initialization
Get Token ID
Extract acoustic
features for every
utterance
Unsupervised ASR
- Overall Framework
Initialization → initial token sequence → (model training ⇄ token decoding) → final token sequence
 model training: optimize the HMM parameters with the Baum–Welch
algorithm on token sequence ω_{i−1} to get new models θ_i
 token decoding: decode the acoustic features into a new token
sequence ω_i by Viterbi decoding with θ_i
 iterate until the token sequences (including the token boundaries)
converge
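The loop can be summarized in a few lines. The sketch below is only structural: train_hmms (e.g. Baum–Welch over the current token labels) and viterbi_decode are hypothetical helpers, and the initialization here is a simple frame-clustering stand-in for the segmentation step described above.

from sklearn.cluster import KMeans

def initialize_tokens(features, num_tokens):
    """Label every frame by k-means and merge consecutive identical labels into tokens."""
    frame_labels = KMeans(n_clusters=num_tokens, n_init=10).fit_predict(features)
    tokens = [int(frame_labels[0])]
    for lab in frame_labels[1:]:
        if int(lab) != tokens[-1]:
            tokens.append(int(lab))
    return tokens

def unsupervised_asr(features, num_tokens, train_hmms, viterbi_decode, max_iter=20):
    token_seq = initialize_tokens(features, num_tokens)
    models = None
    for _ in range(max_iter):
        models = train_hmms(features, token_seq)       # optimize token HMMs (Baum-Welch)
        new_seq = viterbi_decode(features, models)     # re-decode the corpus into tokens
        if new_seq == token_seq:                       # labels and boundaries converged
            break
        token_seq = new_seq
    return models, token_seq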
Acoustic Token in Query by
Example Spoken Term
Detection
 Compute the similarity between the models of
two tokens
Model of
token A
Model of
token B
KL divergence of the
Gaussian mixtures in the
first state of two models
Acoustic Token in Query by
Example Spoken Term
Detection
 Compute the similarity between the models of
two tokens
Model of
token A
Model of
token B
Sum of the KL
divergence over the
states of the two token
models
Token-based DTW
subsequence matching Token-based DTW
Tokens
in query
Tokens in an
utterance
 Signal-level DTW is more sensitive to signal variation (e.g. the same phoneme
across different speakers), while token models can better cover the
distribution of signal variation
 Much lower on-line computation load
[Figure: token-based DTW grid matching the token sequence of the query against the token sequence of an utterance]
Multi-granularity Space
for Acoustic Tokens
• Unknown hyperparameters for the token models
• Number of HMM states per token (m): token length
• Number of distinct tokens (n)
• Multiple layers of intrinsic representations of speech
Multi-granularity Space
for Acoustic Tokens
• From short to long (Temporal Granularity)
– phoneme
– syllable
– word
– phrase
• From coarse units to fine units (Phonetic Granularity)
– general phoneme set
– gender dependent phoneme set
– speaker specific phoneme set
Number of distinct HMMs (n)
Number of states per HMM (m)
Multi-granularity Space
for Acoustic Tokens
Training multiple sets of HMMs with different granularities
[Chung & Lee, ICASSP 14 ]
phoneme syllable word phrase
general
gender
dependent
speaker
specific
n
m
Multi-granularity Space
for Acoustic Tokens
 Token-based DTW using tokens with different
granularities (m, n), averaged together, gave much better
performance
 One example
 Frame-level DTW: MAP = 10%
 Using only the token set with the best
performance: MAP = 11%
 Using 20 sets of tokens (number of states per
HMM m = 3,5,7,9,11, number of distinct HMMs
n=50,100,200,300): MAP = 26%
Hierarchical Paradigm
 Typical ASR:
 Acoustic Model: models for the phonemes
 Lexicon: the pronunciation of every word as a
phoneme sequence
 Language Model: the transition between words
Word 1
Phoneme 1 Phoneme 4
Word 2
Phoneme 2 Phoneme 1 Phoneme 3
Lexicon
word 1
word 2
word 3
word 4
Language Model
Phoneme 1
Phoneme 2
Phoneme 3
Acoustic Model
word-like
token 1
word-like
token 1
word-like
token 1
Hierarchical Paradigm
 Similarly, in unsupervised ASR:
 Acoustic Model: the phoneme-like token HMMs
 Lexicon: the pronunciation of every word-like token as
a sequence of phoneme-like tokens
 Language Model: the transition between word-like
tokens
word-like token 1
phoneme-
like token 1
phoneme-
like token 4
word-like token 2
phoneme-
like token 2
phoneme-
like token 1
phoneme-
like token 3
Lexicon
word-like
token 1
Language Model
phoneme-like token 1
Acoustic Model
phoneme-like token 2
phoneme-like token 3
word-like
token 1
word-like
token 1
word-like
token 1
Hierarchical Paradigm
 Similarly, in unsupervised ASR:
 Acoustic Model: the phoneme-like token HMMs
 Lexicon: the pronunciation of every word-like token as
a sequence of phoneme-like tokens
 Language Model: the transition between word-like
tokens
word-like token 1
phoneme-
like token 1
phoneme-
like token 4
word-like token 2
phoneme-
like token 2
phoneme-
like token 1
phoneme-
like token 3
Lexicon
word-like
token 1
Language Model
phoneme-like token 1
Acoustic Model
phoneme-like token 2
phoneme-like token 3
Bottom-up Construction
Top Down Constraint
Bottom Up Construction
3 stages during training
focus on different constraints:
stage1: Acoustic Model
stage2: Language Model
stage3: Lexicon
stage1
stage2
stage3
this part alone would be the
HMM training we described
earlier
[Chung & Lee, ICASSP 13]
word-like
token 1
word-like
token 1
word-like
token 1
Hierarchical Paradigm
 Similarly, in unsupervised ASR:
 Acoustic Model: the phoneme-like token HMMs
 Lexicon: the pronunciation of every word-like token as
a sequence of phoneme-like tokens
 Language Model: the transition between word-like
tokens
word-like token 1
phoneme-
like token 1
phoneme-
like token 4
word-like token 2
phoneme-
like token 2
phoneme-
like token 1
phoneme-
like token 3
Lexicon
word-like
token 1
Language Model
phoneme-like token 1
Acoustic Model
phoneme-like token 2
phoneme-like token 3
Bottom-up Construction
Top Down Constraint
Top-down Constraints [Jansen, ICASSP 13]
This figure is from Aren
Jansen’s ICASSP paper.
 Signals of the same phoneme may be very different on phoneme
level, but the global structures of signals of the same word are very
often very similar on word level
 Global structures help in building the hierarchical model
Multi-layered Acoustic Tokenizing Deep Neural
Networks (MAT-DNN) [Chung & Lee, ASRU 15]
 Jointly learn high quality frame-level features (much better than MFCCs) and
acoustic tokens in an unsupervised way
 Unsupervised training of a multi-target DNN (MDNN) using the unsupervised
token labels as training targets
[Figure: a Multi-layered Acoustic Tokenizer (MAT) performs token model and token label optimization over granularities 𝟁 = (m, n); the multi-layered token labels are the targets of the MDNN; the bottleneck features of the MDNN are concatenated with the acoustic features and fed back to the MAT; the features and the (sub)word-like tokens are both evaluated]
In the first iteration, MFCCs are used as the initial acoustic features.
In the later iterations, the bottleneck features are concatenated with the MFCCs.
Multi-layered Acoustic Tokenizing Deep Neural
Networks (MAT-DNN) - Experimental Results
 Query-by-example spoken term detection on Tsonga [Chung & Lee, ASRU 15]

Approach          Features / Tokens   MAP
Frame-based DTW   MFCC                 9.0
Frame-based DTW   New Feature         28.7
Token-based DTW   New Tokens          26.2
New Direction 4:
Special Semantic Retrieval Techniques
for Spoken Content
Semantic Retrieval
 User expects semantic retrieval of spoken content.
 User asks “US President”, system also finds “Obama”
 Widely studied in text retrieval
 Take query expansion as example
user
“US President”
Search both
“US President” or “Obama”
“Obama” and “US
President” are related
Retrieval
system
Semantic Retrieval
of Spoken Content
 User expects semantic retrieval of spoken content.
 User asks “US President”, system also finds “Obama”
 Widely studied in text retrieval
 Take query expansion as example
 The techniques developed for text can be directly
applied on semantic retrieval of spoken content
 Are there any special techniques for spoken content?
 Both query Q and document d are represented as
unigram language models θQ and θd
Review: Language Modeling Retrieval
Approach
Query model θ_Q: P(w | θ_Q), a distribution over words w1 w2 w3 w4 w5 …
Document model θ_d: P(w | θ_d), a distribution over words w1 w2 w3 w4 w5 …
The KL divergence between the two models can be evaluated.
Review: Language Modeling Retrieval
Approach
 Given query Q, rank document d according to a
relevance score function SLM(Q,d):
 Inverse of KL divergence between query model θQ and
document model θd
 The documents with document models θd similar to
query model θQ are more likely to be relevant.
S_LM(Q, d) = −KL(θ_Q || θ_d)
Review: Basic Query/Document
Models in Text Retrieval
 Query model θ_Q for text:
P(w | θ_Q) = N(w, Q) / Σ_{w'} N(w', Q)
N(w, Q): term frequency of word w in query Q, normalized into a probability
 Document model θ_d for text:
P(w | θ_d) = N(w, d) / Σ_{w'} N(w', d)
N(w, d): term frequency of word w in document d, normalized into a probability
These basic models can be enhanced by query/document
expansion to handle the problem of semantic retrieval.
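A minimal sketch of ranking by this relevance score, with maximum-likelihood unigram models as above; the tiny smoothing constant for unseen words in the document model is an assumption added only so the toy example avoids log(0) (real systems use proper smoothing).

import math
from collections import Counter

def unigram_model(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}   # P(w | theta)

def lm_score(query_tokens, doc_tokens, vocab, eps=1e-6):
    q, d = unigram_model(query_tokens), unigram_model(doc_tokens)
    kl = 0.0
    for w in vocab:
        pq, pd = q.get(w, 0.0), d.get(w, eps)          # smooth the document model
        if pq > 0:
            kl += pq * math.log(pq / pd)
    return -kl                                         # S_LM(Q, d) = -KL(theta_Q || theta_d)

docs = {"d1": "obama visited taiwan".split(),
        "d2": "the weather in taiwan".split()}
query = "obama taiwan".split()
vocab = set(query) | {w for toks in docs.values() for w in toks}
print(sorted(docs, key=lambda name: lm_score(query, docs[name], vocab), reverse=True))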
Review: Query Expansion [Tao, SIGIR 06]
 First pass: the query model P(w | θ_Q) for text query Q is matched against
the document models θ_d, and the retrieval engine returns a first-pass
retrieval result (doc 101, doc 205, doc 145, ……).
 The document models P(w | θ_d) of the top-N retrieved documents are
collected, and the common patterns in these document models are found
(by the EM algorithm).
 A new query model P'(w | θ_Q) is formed from the original query model
and the common patterns.
 The retrieval engine is run again with the new query model to produce
the final result.
Review: Document Expansion [Wei, SIGIR 06]
 Find the topics behind a document (realized by PLSA, LDA, etc.) and
modify the document model P(w | θ_d) accordingly.
 Example: for a document containing "airplane", the new document
model θ_d also gives probability to related words such as "aircraft".
“aircraft”
Semantic Retrieval on Lattices
 Take the basic language modeling retrieval approach as an
example
 Modify the retrieval model for lattices:

Original Retrieval Model of Text   For Lattices
Term Frequency                     Expected Term Frequency
Document Length                    Expected Document Length
……                                 ……
Document Model from Lattices
 Document model θ_d for text:
P(w | θ_d) = N(w, d) / Σ_{w'} N(w', d)
 (Spoken) document model θ_d from a lattice:
P(w | θ_d) = E(w, d) / Σ_{w'} E(w', d)
Replace the term frequency N(w, d) with the expected
term frequency E(w, d) computed from the lattices.
 Expected term frequency E(w, d) for word w in spoken
document d based on its lattice:
Expected Term Frequency
E(w, d) = Σ_{u ∈ L(d)} N(w, u) · P(u | d)
[Figure: lattice of spoken document d with word arcs wA, wB, wC, …]
L(d): all the word sequences in the lattice of d
u: a word sequence in the lattice of d
N(w, u): the number of times word w appears in word sequence u
P(u | d): posterior probability of word sequence u
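A minimal sketch of E(w, d), assuming the lattice has already been expanded into (word sequence, posterior) pairs; real systems compute the same quantity directly on the lattice (e.g. from word posteriors) without enumerating paths.

from collections import Counter, defaultdict

def expected_term_frequencies(lattice_paths):
    """lattice_paths: list of (list_of_words, posterior P(u|d)) for one spoken document d."""
    E = defaultdict(float)
    for words, posterior in lattice_paths:
        for w, n in Counter(words).items():            # N(w, u)
            E[w] += n * posterior                      # E(w,d) = sum_u N(w,u) P(u|d)
    return dict(E)

lattice = [(["us", "president", "obama"], 0.6),
           (["us", "present", "obama"], 0.3),
           (["a", "president", "alabama"], 0.1)]
E = expected_term_frequencies(lattice)
total = sum(E.values())
doc_model = {w: e / total for w, e in E.items()}       # P(w | theta_d) from the lattice
print(doc_model)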
New Direction 4-1:
Special Semantic Retrieval Techniques
for Spoken Content
Better Estimation of Term Frequencies
Better Estimation of Term
Frequencies
 Expected term frequency E(w,d) from lattices can be
inaccurate
 Context of each term in the lattices [Tu & Lee, ICASSP 12]
 The same terms usually have similar context [Schneider,
Interspeech 10]
 Graph-based approach
 Graph-based approach using acoustic feature similarity
improved spoken term detection
 It can also improve semantic retrieval of spoken content
based on language modeling retrieval approach
 Idea: Replace expected term frequency E(w,d) with scores
from graph-based approach [Lee & Lee, SLT 12] [Lee & Lee,
IEEE/ACM T. ASL 14]
Graph-based Approach for
Semantic Retrieval
 For each word w in the lexicon
… …
… …
… …
x1 x2
x3
x4
spoken document
spoken document
spoken document
Find the occurrence regions of word w
Graph-based Approach for
Semantic Retrieval
 For each word w in the lexicon
… …
… …
… …
x1 x2
x3
x4
spoken document
spoken document
spoken document
Connect the occurrence regions by acoustic feature similarities
Graph-based Approach for
Semantic Retrieval
 For each word w in the lexicon
… …
… …
… …
x1 x2
x3
x4
spoken document
spoken document
spoken document
Obtain new score G(x) by random walk
G(x1) G(x2)
G(x3)
G(x4)
Graph-based Approach for
Semantic Retrieval
 For each word w in the lexicon
… …
… …
… …
x1 x2
x3
x4
spoken document
spoken document
spoken document
G(x1) G(x2)
G(x3)
G(x4)
Repeat this process for all the words w in the lexicon
Graph-based Approach for
Semantic Retrieval
 The scores G(x) obtained from the graph give a better estimation of
the term frequencies for each word w in spoken document d.
 Lattice-based document model:
P(w | θ_d) = E(w, d) / Σ_{w'} E(w', d)
 Graph-enhanced document model: replace the expected term frequency
E(w, d) with an estimate computed from the graph scores G(x) of the
occurrence regions of word w in document d.
 Query and document expansion borrowed from text retrieval
can be equally applied.
Graph-based Approach for Semantic
Retrieval - Experiments
 Experiments on TV News [Lee & Lee, IEEE/ACM T. ASL 14]
[Figure: MAP from 0.43 to 0.51 for Basic LM, Query Expansion, Document Expansion, and Query + Document Expansion, each with the lattice-based and the graph-enhanced document models]
New Direction 4-2:
Special Semantic Retrieval Techniques
for Spoken Content
Exploiting Acoustic Tokens
Acoustic Tokens
 We can identify “acoustic tokens” in direction 3
Token 1
Token 1
Token 1
Token 2
Token 2 Token 3
Token 3
Token 3
Query expansion with acoustic tokens
Can be useful in semantic
retrieval of spoken content:
Unsupervised semantic retrieval of spoken content
“US President” “Obama”
Query Expansion – Never Appear?
If "Obama" is not in the lexicon, "Obama" will never appear in
the lattices, so query expansion can never learn that "Obama"
co-occurs with "US President".
Typical approach: using subwords
Query expansion with acoustic tokens
The two approaches are complementary to each other.
Query Expansion
with Acoustic Tokens
Original Text Query:
“US President”
d100: …… US President …
d205: … US President ……
First pass: Retrieve spoken documents containing “US
President” in the transcriptions
[Lee & Lee, ICASSP 13]
Query Expansion
with Acoustic Tokens
d100: …… US President …
d205: … US President ……
Find acoustic tokens frequently appear in the signals of
these retrieved documents
Original Text Query:
“US President”
Query Expansion
with Acoustic Tokens
d100: …… US President …
d205: … US President ……
Obama?
Obama?
Original Text Query:
“US President”
Even if the terms related to the query are OOV, as long as they
really co-occur with the query in the speech signals, the acoustic
tokens corresponding to these terms can be found.
Query Expansion
with Acoustic Tokens
d100: …… US President …
d205: … US President ……
Expanded Query:
“US President” +
Original Text Query:
“US President”
Query Expansion
with Acoustic Tokens
user
“US President”
Retrieval
system
Expanded Query:
Lattices
Find the same tokens
“White House”
“US President” and
By expanding the text query with acoustic tokens,
more semantically related audio files can be retrieved.
Query Expansion
– Acoustic Patterns
 Experiments on TV News [Lee & Lee, ICASSP 13]
Unsupervised Semantic Retrieval
 Unsupervised Semantic Retrieval [Li & Lee, ASRU
13][Oard, FIRE 13]
 No speech recognition as query-by-example
spoken term detection
 But find spoken documents semantically related to
the spoken queries
 New task, not too much previous work
 Below is just a very preliminary study [Li & Lee, ASRU 13]
Unsupervised Semantic Retrieval
Spoken Queries
1. Find spoken documents containing the spoken query
database
Spoken Document 1
Spoken Document 2
Spoken Document 3
Done by the query-by-example spoken term
detection approaches (e.g. DTW)
Unsupervised Semantic Retrieval
2. Find acoustic tokens frequently co-occurring with
the spoken queries in the same document
Spoken Queries
Unsupervised Semantic Retrieval
3. Use the acoustic tokens to expand the
original spoken query
Expanded
Queries
Unsupervised Semantic Retrieval
4. Retrieve again by the expanded queries
Expanded
Queries
Can retrieve spoken documents not
containing the original spoken queries
Unsupervised Semantic Retrieval -
Experiments
 Broadcast news, MAP as evaluation measure
 Using only DTW to retrieve spoken queries:
Spoken term detection: 28.30%
Semantic retrieval: 8.76%
User only wants to find
spoken documents
containing query term.
User wants to find all spoken documents
semantically related to query term.
[Li & Lee, ASRU 13]
Unsupervised Semantic Retrieval -
Experiments
 Broadcast news, MAP as evaluation measure
 Using only DTW to retrieve spoken queries:
Spoken term detection: 28.30%
Semantic retrieval: 8.76%
Exactly the same retrieval
results, but what user wants to
find is different
Lots of semantically related
spoken documents cannot be
retrieved [Li & Lee, ASRU 13]
Unsupervised Semantic Retrieval -
Experiments
 Broadcast news, MAP as evaluation measure
 Using only DTW to retrieve spoken queries:
Spoken term detection: 28.30%
Semantic retrieval: 8.76%
 Expanded spoken queries:
MAP was improved from 8.76% to 9.70%
 Unsupervised semantic retrieval has a long way to go
[Li & Lee, ASRU 13]
New Direction 5:
Spoken Content is Difficult to Browse!
Audio is hard to browse
 When the system returns the retrieval results,
the user doesn't know what he/she gets at first
glance
Retrieval Result
Audio is hard to browse
Interactive spoken content retrieval
Summarization, Key term extraction,
Title Generation
Organizing retrieved results
Question answering
New Direction 5-1:
Spoken Content is Difficult to Browse!
Interactive Spoken Content Retrieval
Interactive
Spoken Content Retrieval
 Conventional Retrieval Process
user
US President
Here are what
you are looking
for:
Doc3
Doc1
Doc2
…
Can be noisy
 Interactive retrieval
Interactive
Spoken Content Retrieval
user
US President
Not clear enough
……
More precisely, please.
 Interactive retrieval
Is it related to “Election”?
Interactive
Spoken Content Retrieval
user
US President
Still not clear
enough ……
More precisely, please.
Obama
Here are what you are
looking for.
 Interactive retrieval
Is it related to “Election”?
Interactive
Spoken Content Retrieval
user
US President
I see!
More precisely, please.
Obama
Yes.
Interactive
Spoken Content Retrieval
 Given the information entered by the users at
present, which action should be taken?
“Give me an example.”
“Is it relevant to XXX?”
“Can you give me another query?”
“Show the results.”
MDP for Interactive Retrieval
 MDP
 Widely used in dialogue systems (air ticket booking,
city guides, …)
 The system is in certain states.
 Which action should be taken depends on the state the
system is in.
 MDP for Interactive retrieval [Wen & Lee, Interspeech
12][Wen & Lee, ICASSP 13]
 State: the degree of clarity of the user’s information
need
Ambiguous Clear
state space
S1
Spoken
Archive
Search
Engine
Query
US President.
Doc3
Doc1
Doc2
…
Ambiguous Clear
state space
State
Estimator
[Cronen-Townsen,
SIGIR 02]
[Zhou, SIGIR 07]
State Estimator: Estimate the degree
of clarity from the retrieval results
S1
A
1
A
2
A
3
A
4
 A set of candidate actions
 System: “More precise, please.”
 System: “Is it relevant to XXX?”
 …..
Ambiguous Clear
state space
 There is an action “show results”
 When the system decides to show the results, the
retrieval session is ended
S1
A
1
A
2
A
3
A
4
 Choose the actions by intrinsic policy π(S)
 The policy is a function
 Input: state S, output: action A
π(S)=“More
precise, please”
π(S)=Show Results
Ambiguous Clear
state space
S1
Spoken
Archive
Search
Engine
Doc3
Doc1
Doc2
…
A
1
A1: More
precise, please.
Obama
C1
User response The system gets a cost
C1 due to user labor.
Ambiguous Clear
state space
π(S1) = A1
S1
Spoken
Archive
Search
Engine
Doc2
Doc1
Doc3
…
A
1
Update Results
State
Estimator
S2
C1
Ambiguous Clear
state space
Interact with Users - MDP
 Good interaction:
 The quality of the final retrieval results shown to the
user is as good as possible
 The user labor (C1, C2) is as small as possible
[Figure: an interaction episode S1 → (action A1, cost C1) → S2 → (action A2, cost C2) → S3 → Show → End]
Interact with Users - MDP
 Learn the policy π maximizing:
Retrieval Quality − User Labor
 The policy π can be learned from historical
interactions by fitted value iteration [Chandramohan,
Interspeech 10]
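A minimal sketch of learning the policy from logged interactions with fitted value iteration, here in its fitted Q-iteration form. Each logged turn is (state features, action id, reward, next state features, done), where the reward encodes retrieval quality minus user labor; the regressor choice and the feature design are illustrative assumptions, not the setup of the cited work.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(transitions, num_actions, iterations=20, gamma=0.95):
    X = np.array([np.append(s, a) for s, a, r, s2, done in transitions])
    q = None
    for _ in range(iterations):
        targets = []
        for s, a, r, s2, done in transitions:
            if done or q is None:
                targets.append(r)
            else:
                future = max(q.predict([np.append(s2, a2)])[0] for a2 in range(num_actions))
                targets.append(r + gamma * future)     # bootstrapped return
        q = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q

def policy(q, state, num_actions):                     # pi(s): pick the highest-value action
    return max(range(num_actions), key=lambda a: q.predict([np.append(state, a)])[0])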
Deep Reinforcement Learning
 Replacing MDP by deep reinforcement
learning
Spoken
Content
Retrieval
Result
s
Spoken
Content
query
State
Estimation
Action
Decision
state
The degree of clarity from
the retrieval results
action
features
 The policy π(s) is a function
 Input: state s, output: action a
Decide the actions by intrinsic
policy π(S)
[Wen & Lee, Interspeech 12]
[Wen & Lee, ICASSP 13]
Spoken
Content
Retrieval
Result
s
Spoken
Content
query
features
…
…
…
DNN
State Estimation
Action Decision
Is it relevant to XX?
Give me an example.
Show the results.
Max
[Wu & Lee, Interspeech 16]
Spoken
Content
Retrieval
Result
s
Spoken
Content
query
features
…
…
…
DNN
Is it relevant to XX?
Give me an example.
Show the results.
Max
Experimental Results
 Broadcast news, semantic retrieval [Wu & Lee,
Interspeech 16]
Retrieval Quality (MAP)
Optimization Target:
Retrieval Quality - User labor
Hand-crafted Deep Reinforcement
Learning
Previous Method
(state + decision)
New Direction 5-2:
Spoken Content is Difficult to Browse!
Summarization, Key Term Extraction
& Title Generation
Introduction
Retrieved
Audio File
Summary
Deep Learning,
Neural Network ….
10 minutes
30 seconds
Extractive
Summarization
Title Generation
Key Term
Extraction
“Introduction of
Deep Learning”
Summarization
 Unsupervised Approach: Maximum Margin
Relevance (MMR) and Graph-based Approach
 Supervised approach: the summarization problem can
be formulated as a binary classification problem
 Included in the summary or not
utterance 1
utterance 2
utterance 3
utterance 4
Binary
Classifier
-1
+1
+1
-1
utterance 2
utterance 3
classification
result
summary
Binary
Classifier
Binary
Classifier
Binary
Classifier
Lecture
Summarization
– Binary Classification
 Binary classifier individually considers each utterance
 To generate a good summary, “global information” should be
considered
 Example: summary should be concise
More advanced machine learning techniques
(Figure: the spoken document repeats near-duplicate utterances such as
“LSA is useful for summarization” and “LSA is helpful for summarization”;
selecting each utterance independently puts all of them into the summary,
making it redundant. A summary should be succinct.)
Summarization
- Whole spoken document
 Learn a special model by structured learning
techniques
 Input: whole lecture
 Output: summary
Special
Model
spoken
document
Summary
Consider the
whole lecture
3 utterances
selected in
summary
[Lee & Lee, ICASSP 13]
[Lee & Lee, Interspeech 12]
Evaluation Function
 Evaluation function of utterance set F(s)
 s: utterance set in a lecture
F(s): a score (e.g., F(s) = 10) for an
utterance set s, indicating how suitable it is to
consider the utterance set s as the summary
Properties:
• Concise?
• Include
keyword?
• Short enough?
……….
How good is it to take this
utterance set as the summary?
Lecture
Evaluation Function
– How to Summarize
 With F(s), we can do summarization on new
lectures now
Lecture
s1
s2
s3
s4
s5
s6
s7
Compute F(s) for
all utterance sets
If s6
maximizes
F(s)
summary
Enumerate all
the possible
utterance sets s
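A minimal sketch of this inference step (the scoring function below is a hand-written stand-in; in the cited work F(s) is learned with structured learning): enumerate the candidate utterance sets of a new lecture and return the one maximizing F(s).

from itertools import combinations

def F(utterance_set, lecture):
    # hand-written stand-in for the learned evaluation function: reward covering
    # many distinct words but penalize summaries that are too long
    covered = set(w for i in utterance_set for w in lecture[i].split())
    length = sum(len(lecture[i].split()) for i in utterance_set)
    return len(covered) - 0.5 * length

def summarize(lecture, max_utterances=2):
    # enumerate all utterance sets up to max_utterances and take the argmax of F(s)
    candidates = [s for k in range(1, max_utterances + 1)
                  for s in combinations(range(len(lecture)), k)]
    return max(candidates, key=lambda s: F(s, lecture))

lecture = ["LSA is latent semantic analysis",
           "LSA is useful for summarization",
           "deep learning is introduced"]
print(summarize(lecture))   # indices of the utterances selected as the summary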
Evaluation Function
 Evaluation function of utterance set F(s)
 s: utterance set in a lecture
F(s): a score (e.g., F(s) = 10) for an
utterance set s, indicating how suitable it is to
consider the utterance set s as the summary
Properties:
• Concise?
• Include
keyword?
• Short enough?
……….
How good is it to take this
utterance set as the summary?
Lecture
What properties
should F(s) check?
 Learn F(s) from training data
Reference
summary
Reference
summary
Evaluation Function - Training
(Figure: training lectures come with reference summaries; F(s) is learned
so that reference summaries receive high scores (e.g., 9, 7) while other
utterance sets receive low scores (e.g., -4).)
Structured SVM: I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support
Vector Learning for Interdependent and Structured Output Spaces, ICML, 2004.
Summarization
- Structure of spoken document
 Temporal structure helps summarization
 Long summary: consecutive utterances in a
paragraph are more likely to be selected together
 Short summary: one utterance is selected on behalf
of a paragraph.
(Figure: utterances x_{i-2} … x_{i+6} grouped into Paragraphs 1–3; for a
long summary, consecutive utterances in an important paragraph are selected
together, while for a short summary one representative utterance is selected
on behalf of each paragraph.)
Summarization
- Structure of spoken document
 Add structure information into evaluation function
of utterance set F(s)
F(s): a score (e.g., F(s) = 100)
Properties:
• Concise?
• Include
keyword?
• Short enough?
……….
utterances
Given the
information of
structure
Paragraph 1 Paragraph 2
[Shiang & Lee, Interspeech 13]
Summarization
- Structure of spoken document
 Structure in text is clear
Paragraph boundaries are directly known
 For spoken content, there is no obvious
structure
Here the structure is treated as “hidden
variables”
Structured learning with hidden variables
Summarization
- Structure of spoken document
 Evaluation Measure: ROUGE-1 and ROUGE-2
 Larger scores mean the machine-generated summaries
are more similar to human-generated summaries.
[Shiang & Lee, Interspeech 13]
Key Term Extraction
 TF-IDF is a good measure for identifying key
terms [E. D’Avanzo, DUC 04][Jiang, SIGIR 09]
 Feature parameters from latent topic models
[Hazen, Interspeech 11] [Chen & Lee, SLT 10]
 Key terms are usually focused on a small number
of topics
 Prosodic Features [Chen & Lee, ICASSP 12]
 slightly lower speed, higher energy, wider pitch
range
 Machine Learning methods
 Input: a term, output: key term or not [Liu, SLT 08][Chen
& Lee, SLT 10]
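As a concrete illustration of the TF-IDF idea in the first bullet above (the tokenization and the smoothed IDF formula are illustrative assumptions): terms frequent in the target transcription but rare in the rest of the collection receive high scores and become key-term candidates.

import math
from collections import Counter

def tfidf_keyterms(doc, collection, top_k=5):
    # doc: target transcription as a list of words
    # collection: list of transcriptions (each a list of words) used for the IDF
    tf = Counter(doc)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in collection if term in d)          # document frequency
        idf = math.log((len(collection) + 1) / (df + 1)) + 1  # smoothed inverse document frequency
        scores[term] = (count / len(doc)) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]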
Key Term Extraction
– Deep Learning
(Figure: each word x1 … xT of the document is mapped by an embedding layer
to vectors V1 … VT; attention weights α1 … αT form a document representation
ΣαiVi, which passes through a hidden layer to an output layer predicting
which terms in the keyword set are key terms of the document.)
Keyword Set:
SVM, Regression, Python,
DNN, Fourier Transform,
Speech Processing,
LSTM, Bubble Sort, etc.
[Shen & Lee, Interspeech 16]
Title Generation
 Deep Learning based Approach [Alexander M Rush,
EMNLP 15][Chopra, NAACL 16][Lopyrev, arXiv 2015][Shen, arXiv 2016]
 Sequence-to-sequence learning
 Input: a document (word sequence), output:
its title (shorter word sequence)
https://arxiv.org/pdf/1512.01712v1.pdf
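A minimal sequence-to-sequence sketch of this idea, assuming PyTorch (the vocabulary size, dimensions and greedy decoder are illustrative; the cited systems add attention and are trained on large document/title pairs):

import torch
import torch.nn as nn

class TitleGenerator(nn.Module):
    # encode the document word sequence with a GRU, then decode a shorter title
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, doc_ids, max_title_len=10, bos_id=1):
        _, h = self.encoder(self.emb(doc_ids))               # encode the whole document
        token = torch.full((doc_ids.size(0), 1), bos_id, dtype=torch.long)
        title = []
        for _ in range(max_title_len):                       # greedy decoding, word by word
            out, h = self.decoder(self.emb(token), h)
            token = self.out(out).argmax(dim=-1)              # most probable next word
            title.append(token)
        return torch.cat(title, dim=1)

model = TitleGenerator(vocab_size=1000)                       # untrained toy model
document = torch.randint(0, 1000, (1, 50))                    # a 50-word toy "document"
print(model(document).shape)                                  # (1, 10): generated title word ids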
New Direction 5-3:
Spoken Content is Difficult to Browse!
Visualizing Retrieved Results
Introduction
 Visualizing the retrieval results on an intuitive
interface helps users know what is retrieved
 Take retrieving on-line lectures as example
 Searching spoken lectures is a very good
application for spoken content retrieval
 The speech of the instructors conveys most
knowledge in the lectures
Retrieving One Course
 NTU Virtual Instructor
Searching the course Digital Speech Processing of NTU
Massive Open On-line Courses
(MOOCs)
 Enormous on-line courses
Today’s Retrieval Techniques
A list of related courses
Today’s Retrieval Techniques
More is less …...
 Given all the related lectures from different courses
learner
Which lecture should I
go first?
Learning Map
 Nodes: lectures on the same
topic
 Edges: suggested learning
order
[Shen & Lee,
Interspeech 15]
Learning Map
lectures in the
same topic
Lectures in the same topic
same topic?
 Compute the similarity of each pair of lectures
 Lexical and topical similarity of the audio transcriptions
 Lexical similarity and syntactic parsing tree similarity of
the titles of the lectures
 Take a weighted sum of the similarity measures (a sketch follows below)
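A minimal sketch of this similarity combination (the Jaccard similarity and the weights are illustrative stand-ins; the cited work also uses topical and syntactic parse-tree similarities with tuned weights):

def lecture_similarity(lecture_a, lecture_b, w_transcript=0.6, w_title=0.4):
    # lecture_a, lecture_b: dicts with 'transcript' and 'title', each a list of words
    def jaccard(x, y):
        x, y = set(x), set(y)
        return len(x & y) / len(x | y) if x | y else 0.0
    return (w_transcript * jaccard(lecture_a["transcript"], lecture_b["transcript"])
            + w_title * jaccard(lecture_a["title"], lecture_b["title"]))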
Lectures in the same topic
same topic?
(Figure: lectures a1–a3 and b1–b3 from two different courses; lecture pairs
with higher combined similarity are more likely to be on the same topic,
pairs with lower similarity less likely.)
Learning Map
suggested
learning order
Prerequisite
Lectures in
different courses
Prerequisite?
Learning a binary classifier
Training data: lectures in
existing courses
An existing course
…
Lecture 1
Lecture 2
Lecture 3
Assumption:
Lecture 1 is a prerequisite of Lecture 2,
Lecture 2 is a prerequisite of Lecture 3,
……
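A minimal sketch of how training pairs for the prerequisite classifier can be harvested from existing courses under the assumption above (the feature extraction and the classifier itself are omitted):

def build_prerequisite_pairs(courses):
    # courses: list of courses, each a list of lecture ids in teaching order;
    # consecutive lectures give positive (prerequisite) pairs, reversed order gives negatives
    positives, negatives = [], []
    for lectures in courses:
        for first, second in zip(lectures, lectures[1:]):
            positives.append((first, second))   # first is assumed a prerequisite of second
            negatives.append((second, first))   # the reversed pair is a negative example
    return positives, negatives

pos, neg = build_prerequisite_pairs([["Lecture 1", "Lecture 2", "Lecture 3"]])
print(pos)   # [('Lecture 1', 'Lecture 2'), ('Lecture 2', 'Lecture 3')]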
Demo
Vision: Personalized Courses
I want to learn “XXX”.
I am a graduate student of
computer science.
I can spend 10 hours.
Learner
I open a course for you.
on-line learning
material
 Spoken Language Processing techniques can be very
helpful.
 The spoken content in courses plays the most
important role in conveying the knowledge.
New Direction 5-4:
Spoken Content is Difficult to Browse!
Speech Question Answering
Speech Question Answering
 Machine answers questions based on the
information in spoken content
What is a possible
origin of Venus’
clouds?
volcanic activity
Speech Question Answering
 Reference: Chapter 6, “Spoken Question Answering” (Sophie
Rosset, Olivier Galibert and Lori Lamel), in G. Tur and R.
De Mori, Spoken Language Understanding: Systems for
Extracting Semantic Information from Speech.
 Question Answering in Speech Transcripts (QAST) has
been a well-known evaluation program of SQA.
 2007, 2008, 2009
 Previous studies focused on factoid questions
 E.g. “What is the name of the highest mountain in
Taiwan?”
 How about more difficult questions?
Preliminary Study
 TOEFL Listening Comprehension Test by
Machine
Question: “ What is a possible origin of Venus’ clouds? ”
Audio Story:
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere
(The original story is 5 min long.)
[Tseng & Lee, Interspeech 16]
Simple Baselines
(Figure: accuracy (%) of naive approaches (1)–(7), e.g. (2) selecting the
shortest choice as the answer and (4) selecting the choice semantically most
similar to the others, compared with random guessing.)
Experimental setup:
717 questions for training,
124 for validation, 122 for testing
[Tseng & Lee, Interspeech 16]
Model Architecture
“what is a possible
origin of Venus‘ clouds?"
Question:
Question
Semantics
…… It be quite possible that this be
due to volcanic eruption because
volcanic eruption often emit gas. If
that be the case volcanism could
very well be the root cause of
Venus 's thick cloud cover. And also
we have observe burst of radio
energy from the planet 's surface.
These burst be similar to what we
see when volcano erupt on earth
……
Audio Story:
Speech
Recognition
Semantic
Analysis
Semantic
Analysis
Answer
Select the choice
most similar to the
answer
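A minimal sketch of this final selection step (the semantic vectors are assumed to be produced by the question/story analysis modules in the figure; cosine similarity is one reasonable choice): pick the choice whose representation is closest to the answer representation extracted from the story.

import numpy as np

def select_choice(answer_vec, choice_vecs):
    # answer_vec:  semantic vector distilled from the question and the audio story
    # choice_vecs: one semantic vector per choice (A)-(D)
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return int(np.argmax([cosine(answer_vec, c) for c in choice_vecs]))

rng = np.random.default_rng(0)
print(select_choice(rng.normal(size=16), [rng.normal(size=16) for _ in range(4)]))   # index 0-3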
Experimental Results
(Figure: accuracy (%) of approaches (1)–(7); the naive approaches are
outperformed by the memory network (39.2%, proposed by the FB AI group)
and by the word-based attention model (48.3%).)
[Tseng & Lee, Interspeech 16]
Concluding Remarks
Concluding Remarks
 New research directions for spoken content retrieval
 Modified ASR for Retrieval Purposes
 Incorporating Those Information Lost in ASR
 No Speech Recognition!
 Special Semantic Retrieval Techniques for Spoken
Content
 Spoken Content is Difficult to Browse!
Take-Home Message
Spoken Content Retrieval
Speech Recognition
+
Text Retrieval
=
Spoken Content Retrieval
 Spoken content retrieval: Machine listens to the data, and
extracts the desired information for each individual user.
 Nobody is able to go through the data.
300 hrs multimedia is
uploaded per minute.
(2015.01)
1874 courses on coursera
(2016.04)
 In these multimedia, the spoken part carries very
important information about the content
• Just as Google does on text data
Overview Paper
 Lin-shan Lee, James Glass, Hung-yi Lee,
Chun-an Chan, "Spoken Content Retrieval —
Beyond Cascading Speech Recognition with
Text Retrieval," IEEE/ACM Transactions on
Audio, Speech, and Language Processing,
vol.23, no.9, pp.1389-1420, Sept. 2015
 http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf
 This tutorial includes information updated after
this paper was published.
Thank You for Your Attention

Spoken Content Retrieval

  • 1.
    Spoken Content Retrieval BeyondCascading Speech Recognition and Text Retrieval Lin-shan Lee and Hung-yi Lee National Taiwan University
  • 2.
    Focus of thisTutorial  New frontiers and directions towards the future of speech technologies  Not skills and experiences in optimizing performance in evaluation programs
  • 3.
  • 4.
    Spoken Content Retrieval LecturesBroadcast Program Multimedia Content Spoken Content
  • 5.
    Spoken Content Retrieval Spoken content retrieval: Machine listens to the data, and extract the desired information for each individual user.  Nobody is able to go through the data. 300 hrs multimedia is uploaded per minute. (2015.01) 1874 courses on coursera (2016.04)  In these multimedia, the spoken part carries very important information about the content • Just as Google does on text data
  • 6.
     Basic goal:Identify the time spans that the query occurs in an audio database  This is called “Spoken Term Detection” Spoken Content Retrieval – Goal time 1:01 time 2:05 … …… “Obama” user
  • 7.
     Basic goal:Identify the time spans that the query occurs in an audio database  This is called “Spoken Term Detection”  Advanced goal: Semantic retrieval of spoken content Spoken Content Retrieval – Goal user “US President” The user is also looking for utterances including “Obama”. Retrieval system
  • 8.
    It is naturalto think …… Spoken Content Retrieval Speech Recognition + Text Retrieval =
  • 9.
    It is naturalto think …… Speech Recognition Models Text Acoustic Models Language Model Spoken Content  Transcribe spoken content into text by speech recognition
  • 10.
    It is naturalto think …… Speech Recognition Models Text  Transcribe spoken content into text by speech recognition Spoken Content RNN/LSTM DNN
  • 11.
    It is naturalto think ……  Transcribe spoken content into text by speech recognition Speech Recognition Models Text Retrieval Result Text Retrieval Query user  Use text retrieval approaches to search over the transcriptions Spoken Content Black Box
  • 12.
    It is naturalto think …… Speech Recognition Models Text Retrieval Result Text Retrieval user Spoken Content  For spoken queries, transcribe them into text by speech recognition. Black Box
  • 13.
    Our point inthis tutorial Spoken Content Retrieval Speech Recognition + Text Retrieval =
  • 14.
    Outline  Introduction: ConventionalApproach: Spoken Content Retrieval = Speech Recognition + Text Retrieval  Core: Beyond Cascading Speech Recognition and Text Retrieval Five new directions
  • 15.
    Introduction: Spoken Content Retrieval= Speech Recognition + Text Retrieval
  • 16.
    It is naturalto think …… Speech Recognition Models Text Retrieval Result Text Retrieval user Spoken Content Speech Recognition always produces errors.
  • 17.
    Lattices Speech Recognition Models Text Retrieval Result Text Retrieval Query user Spoken Content Lattices Keep most possible recognition output  Each path has a weight (confidence to be correct) M. Larson and G. J. F. Jones, “Spoken content retrieval: A survey of techniques and technologies,” Foundations and Trends in Information Retrieval, vol. 5, no. 4-5, 2012.
  • 18.
  • 19.
  • 20.
  • 21.
    Lattices Spoken Archive Speech Recognition System Acoustic & Language Models Lattices Retrieval Result Text Retrieval user Text Query time Higherprobability to include the correct words More noisy words included inevitably Higher memory/computation requirements
  • 22.
    Searching over Lattices Consider the basic goal: Spoken Term Detection “Obama” user
  • 23.
    Searching over Lattices Consider the basic goal: Spoken Term Detection  Find the arcs hypothesized to be the query term Obama “Obama” user Obama x1 x2
  • 24.
     Consider thebasic goal: Spoken Term Detection  Posterior probabilities computed from lattices used as confidence scores Searching over Lattices Obama x1 R(x1)=0.9 Obama x2 R(x2)=0.3 Two ways to display the results: unranked and ranked.
  • 25.
     Consider thebasic goal: Spoken Term Detection  Unranked: Return the results with the scores higher than a threshold Searching over Lattices Obama x1 R(x1)=0.9 Obama x2 R(x2)=0.3 Set the threshold as 0.6 Return x1
  • 26.
     Consider thebasic goal: Spoken Term Detection  Unranked: Return the results with the scores higher than a threshold Searching over Lattices Obama x1 R(x1)=0.9 Obama x2 R(x2)=0.3 The threshold can be determined automatically and query specific. [Miller, Interspeech 07][Can, HLT 09][Mamou, ICASSP 13][Karakos, ASRU 13][Zhang, Interspeech 12][Pham, ICASSP 14]
  • 27.
    Actual Term WeightedValue (ATWV)  Evaluating unranked result 𝐴𝑇𝑊𝑉 = 1 − 𝑃𝑚𝑖𝑠𝑠 − 𝛽𝑃𝐹𝐴 𝑃𝑚𝑖𝑠𝑠 = 1 − 𝑁𝑐𝑜𝑟𝑟𝑒𝑐𝑡 𝑁𝑟𝑒𝑓 𝑃𝐹𝐴 = 𝑁𝑠𝑝𝑢𝑟𝑖𝑜𝑢𝑠 𝑁𝑁𝑇 time 1:01 1.0 time 2:05 0.9 time 1:31 0.7 …… retrieved 𝑁𝑟𝑒𝑓: number of times the query term appears in audio database threshold Maximum Term Weighted Value (MTWV): tune the threshold to obtain the best ATWV 𝑁𝑐𝑜𝑟𝑟𝑒𝑐𝑡: the number of retrieved objects that are actually correct 𝑁𝑠𝑝𝑢𝑟𝑖𝑜𝑢𝑠: the number of retrieved objects that are incorrect 𝑁𝑁𝑇: audio duration (in seconds) – 𝑁𝑟𝑒𝑓
  • 28.
     Consider thebasic goal: Spoken Term Detection  Ranked: results ranked according to the scores Searching over Lattices Obama x1 R(x1)=0.9 Obama x2 R(x2)=0.3 x1 0.9 x2 0.3 …
  • 29.
     Consider thebasic goal: Spoken Term Detection  Ranked: The results are ranked according to the scores Searching on Lattices Obama x1 R(x1)=0.9 Obama x2 R(x2)=0.3 x1 0.9 x2 0.3 … user
  • 30.
    Mean Average Precision(MAP)  Evaluating ranked list  area under recall-precision curve  Recall: percentage of ground truth results retrieved  Precision: percentage of retrieved results being correct  Higher threshold gives higher precision but lower recall, etc. 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Precision Recall MAP = 0.484 MAP = 0.586
  • 31.
    Examples of LatticeIndexing Approaches • Position Specific Posterior Lattices (PSPL)[Chelba, ACL 05][Chelba, Computer Speech and Language 07] • Confusion Networks (CN)[Mamou, SIGIR 06][Hori, ICASSP 07][Mamou, SIGIR 07] • Time-based Merging for Indexing (TMI)[Zhou, HLT 06][Seide, ASRU 07] • Time-anchored Lattice Expansion (TALE)[Seide, ASRU 07][Seide, ICASSP 08] • WFST: directly compile the lattice into a weighted finite state transducer [Allauzen, HLT 04][Parlak, ICASSP 08][Can, ICASSP 09][Parada, ASRU 09]
  • 32.
    Out-of-Vocabulary (OOV) Problem Speech recognition is based on a lexicon  Words not in the lexicon can never be transcribed  Many informative words are out-of-vocabulary (OOV)  Many query terms are new or special words or named entities
  • 33.
     All OOVwords composed of subword units  Generate subword lattices  Transform word lattices into subword lattices  Can also be directly generated by speech recognition using subword-based lexicon and language model Subword-based Retrieval Retrieval An arc in the word lattice Corresponding subword sequence /rɪ/ /trɪ/ /vǝl/ word lattices subword lattices
  • 34.
     Subword-based retrieval Generate subword lattices  Transform user query into subword sequence Obama → /au/ /ba/ /mǝ/  Text retrieval techniques equally useful except based on subword lattices and subword query Replace words by subword units  OOV words can be retrieved by matching over the subword units without being recognized Subword-based Retrieval
  • 35.
    Subword-based Retrieval - FrequentlyUsed Subword Units • Linguistically motivated units – phonemes, syllables/characters, morphemes, etc. [Ng, MIT 00][Wallace, Interspeech 07][Chen & Lee, IEEE T. SAP 02] [Pan & Lee, ASRU 07][Meng, ASRU 07][Meng, Interspeech 08] [Mertens, ICASSP 09][Itoh, Interspeech 07][Itoh, Interspeech 11] [Pan & Lee, IEEE T. ASL 10] • Data-driven units – particles, word fragments, phone multigrams, morphs, etc. [Turunen, SIGIR 07] [Turunen, Interspeech 08] [Parlak, ICASSP 08][Logan, IEEE T. Multimedia 05] [Gouvea, Interspeech 10][Gouvea, Interspeech 11][Lee & Lee, ASRU 09]
  • 36.
    Integrating Different Clues fromRecognition  Similar to system combination in ASR  Consistency very often implies accuracy  Integrating the outputs from different recognition systems [Natori, Interspeech 10]  Integrating results based on different subword units [S.-w. Lee, ICASSP 05][Pan & Lee, Interspeech 07][Meng, Interspeech 10][Itoh, Interspeech 11]  Weights of different clues estimated by optimizing some retrieval related criteria [Meng & Lee, ICASSP 09][Chen & Lee, ICASSP 10][Meng, Interspeech 10][Wollmer, ICASSP 09]
  • 37.
    Integrating Different Clues fromRecognition  Weights for Integrating 1,2,3-grams for different word/subword units and different indices syllable Confusion Network Position Specific Posterior Lattice word character syllable word character 1-gram 2-gram 3-gram 1-gram 2-gram 3-gram integrated with different weights maximizing the lower bound of MAP by SVM-MAP [Meng & Lee, ICASSP 09]
  • 38.
    Training Retrieval Model Parameters Integrating different n-grams, word/subword units and indices single clue integrated [Meng & Lee, ICASSP 09] [Chen & Lee, ICASSP 10]
  • 39.
    ASR Accuracy v.s.Retrieval Performance  Spoken Term Detection, Lectures Speaker Dependent: 10 hours of speech from the instructor
  • 40.
    ASR Accuracy v.s.Retrieval Performance  Spoken Term Detection, Lectures Improved Speaker Adaptation
  • 41.
    ASR Accuracy v.s.Retrieval Performance  Spoken Term Detection, Lectures Initial Speaker Adaptation
  • 42.
    ASR Accuracy v.s.Retrieval Performance  Spoken Term Detection, Lectures Speaker Independent Realistic! ASR Accuracy MAP
  • 43.
    ASR Accuracy v.s.Retrieval Performance  Precision at 10: Percentage of the correct items among the top 10 selected Speaker Independent: Only 60% of results are correct Retrieve YouTube?!
  • 44.
     Did latticessolve the problem?  Need high quality recognition models to produce better lattices and accurately estimate the confidence scores  Spoken content over the Internet is produced in different languages on different domains in different parts of the world under varying acoustic conditions  High quality recognition models for such content doesn’t exist yet  Retrieval performance limited by ASR accuracy Is the problem solved?
  • 45.
     Desired spokencontent retrieval Less constrained by ASR accuracy Existing approaches limited by ASR accuracy because of the cascading of speech recognition and text retrieval  Go beyond the cascading concept Is the problem solved?
  • 46.
    Our point inthis tutorial Spoken Content Retrieval Speech Recognition + Text Retrieval =
  • 47.
  • 48.
    New Directions 1. ModifiedASR for Retrieval Purposes 2. Incorporating Those Information Lost in ASR 3. No Speech Recognition! 4. Special Semantic Retrieval Techniques for Spoken Content 5. Spoken Content is Difficult to Browse!
  • 49.
    Overview Paper  Lin-shanLee, James Glass, Hung-yi Lee, Chun-an Chan, "Spoken Content Retrieval — Beyond Cascading Speech Recognition with Text Retrieval," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, no.9, pp.1389-1420, Sept. 2015  http://speech.ee.ntu.edu.tw/~tlkagk/paper/Over view.pdf  This tutorial includes updated information after this paper is published.
  • 50.
    New Direction 1: ModifiedASR for Retrieval Purposes
  • 51.
    Retrieval Performance v.s. RecognitionAccuracy  Intuition: Higher recognition accuracy, better retrieval performance Not always true! In Taiwan, the need of … Recognition System A Recognition System B In Taiwan, a need of … In Thailand, the need of … Same recognition accuracy
  • 52.
    Retrieval Performance v.s. RecognitionAccuracy  Intuition: Higher recognition accuracy, better retrieval performance Not always true! In Taiwan, the need of … Recognition System A Recognition System B In Taiwan, a need of … In Thailand, the need of … Not important for retrieval Serious problem for retrieval
  • 53.
    Retrieval Performance v.s. RecognitionAccuracy  Retrieval performance is more correlated to the ASR errors of name entities than normal terms [Garofolo, TREC-7 99][L. van der Werff, SSCS 07]  Expected error rate defined on lattices is a better predictor of retrieval performance than one-best transcriptions [Olsson, SSCS 07]  lattices used in retrieval  For retrieval, substitution errors have more influence than insertions and deletions [Johnson, ICASSP 99]  The language models reducing ASR errors do not always yield better retrieval performance [Cui, ICASSP, 13][Shao, Interspeech, 08][ Wallace, SSCS 09]  Query terms usually topic-specific with lower n-gram probabilities
  • 54.
    ASR models learnedby Optimizing Retrieval Performance Speech Recognition Models Text Retrieval Result Text Retrieval Query user Spoken Content
  • 55.
    ASR models learnedby Optimizing Retrieval Performance Speech Recognition Models Text Retrieval Result Text Retrieval Query user Spoken Content Optimized for recognition accuracy
  • 56.
    ASR models learnedby Optimizing Retrieval Performance Speech Recognition Models Text Retrieval Result Text Retrieval Query user Spoken Content Optimized for recognition accuracy Spoken Content Retrieval Retrieval Performance
  • 57.
    New Direction 1-1: ModifiedASR for Retrieval Purposes Acoustic Modeling
  • 58.
    Acoustic Modeling  AcousticModel Training 𝜃 = 𝑎𝑟𝑔 max 𝜃 𝐹 𝜃 𝜃: acoustic model parameters 𝐹 𝜃 : objective function The objective function 𝐹 𝜃 usually defined to optimize ASR accuracy Design a new objective function for optimizing retrieval performance.
  • 59.
    Acoustic Modeling  ObjectiveFunction for optimizing ASR performance 𝜃 = 𝑎𝑟𝑔 max 𝜃 𝐹 𝜃 lattice of utterance u wA wB wC wA wA wA wC 𝐹 𝜃 = 𝑢 𝑠𝑢∈𝐿 𝑢 𝐴 𝑟𝑢, 𝑠𝑢 𝑃𝜃 𝑠𝑢|𝑢 Summation over all the utterances u in the training data L(u): all the word sequence in the lattice of x
  • 60.
    Acoustic Modeling  ObjectiveFunction for optimizing ASR performance 𝜃 = 𝑎𝑟𝑔 max 𝜃 𝐹 𝜃 𝐹 𝜃 = 𝑢 𝑠𝑢∈𝐿 𝑢 𝐴 𝑟𝑢, 𝑠𝑢 𝑃𝜃 𝑠𝑢|𝑢 wA wB wC wA wA wA wC 𝐴 𝑟𝑢, 𝑠𝑢 : the accuracy of word or phoneme sequence 𝑠𝑢 comparing with reference 𝑟𝑢 𝑠𝑢: a word sequence in the lattice of x 𝑃𝜃 𝑠𝑢|𝑢 : posterior probability of word sequence 𝑠𝑢 given acoustic model 𝜃 MCE, MPE, sMBR 𝜃 can be HMM or DNN lattice of utterance u
  • 61.
    Acoustic Modeling  ObjectiveFunction for optimizing ASR performance 𝜃 = 𝑎𝑟𝑔 max 𝜃 𝐹 𝜃 𝐹 𝜃 = 𝑢 𝑠𝑢∈𝐿 𝑢 𝐴 𝑟𝑢, 𝑠𝑢 𝑃𝜃 𝑠𝑢|𝑢 retrieval W-MCE, [Fu, ASRU 07][Weng, Interspeech 12][Weng, ICASSP, 13] keyword-boosted sMBR [Chen, Interspeech 14] If the possible query terms are known in advance, they can be weighted higher in 𝐴 𝑟𝑢, 𝑠𝑢
  • 62.
     In mostcases, the query terms are not known in advance  Collect feedback data on-line  Use the information to optimize search engines Feedback can be implicit Training Data collected from User time 1:10 F time 2:01 F time 3:04 F time 5:31 T time 1:10 F time 2:01 T time 3:04 F time 5:31 T time 1:10 F time 2:01 T time 3:04 time 5:31 Query Q1 Query Q2 Query Qn ……
  • 63.
    ASR models learnedby Optimizing Retrieval Performance Speech Recognition Models Text Retrieval Query user Spoken Content Lattices Retrieval Result re-estimate optimize [Lee & Lee, ICASSP 10] [Lee & Lee, Interspeech 10] [Lee & Lee, SLT 10] [Lee & Lee, IEEE T. ASL 12] time 1:10 F time 2:01 F time 3:04 F time 5:31 T time 1:10 F time 2:01 T time 3:04 F time 5:31 T time 1:10 F time 2:01 T time 3:04 time 5:31 Query Q1 Query Q2 Query Qn ……
  • 64.
    Updated Retrieval Process Each retrieval result x has a confidence score R(x)  R(x) depends on the recognition model θ  R(x) should be R(x;θ) Re-estimate recognition model θ Update the scores R(x; θ) The retrieval results can be re-ranked. Considering some retrieval criterion
  • 65.
                 x x x R x R F    ; ; Basic Form  Basic Form: : confidence score of the positive example    ;  x R : confidence score of the negative example    ;  x R : a positive example  x : a negative example  x      F max arg ˆ 
  • 66.
                 x x x R x R F    ; ; Basic Form Increase the confidence scores of the positive examples  Basic Form:      F max arg ˆ  Decrease the confidence scores of the negative examples
  • 67.
    Consider Ranking positive example negativeexample Confidence score : Original Model  : New Model  ˆ    ; R x    ˆ ; R x
  • 68.
    Consider Ranking positive example negativeexample Confidence score : Original Model  : New Model  ˆ    ; R x    ˆ ; R x
  • 69.
                 x x x R x R F    ; ; Consider Ranking positive example negative example Confidence score : Original Model  : New Model  ˆ    ; R x    ˆ ; R x Increase the basic objective function
  • 70.
    Consider Ranking positive example negativeexample Confidence score : Original Model  : New Model  ˆ    ; x R    ˆ ; R x Rank perfectly Worse ranking Considering the ranking order
  • 71.
                  otherwise x x x x 0 ; R ; R 1 ,    Consider Ranking If the confidence score for a positive example exceed that for a negative example the objective function adds 1.           x x x x F , ,  
  • 72.
    Consider Ranking  δ(x+,x-)approximated by a sigmoid function during optimization.           x x x x F , ,                  otherwise x x x x 0 ; R ; R 1 ,    Little feedback data? The unlabeled examples as negative examples
  • 73.
    0.46 0.47 0.48 0.49 0.50 0.51 0.52 MAP Baseline Basic Form Rankingunlabelled as negative Acoustic Models - Experiments  Lecture recording (80 queries, each has 5 clicks) [Lee & Lee, IEEE T. ASL 1
  • 74.
    New Direction 1-2: ModifiedASR for Retrieval Purposes Besides Acoustic Modeling
  • 75.
    Language Modeling  Thequery terms are usually very specific. Their probabilities are underestimated.  Boosting the probabilities of n-grams including query terms  By repeating the sentences including the query terms in training corpora  Helpful in DARPA’s RATS program [Mandal, Interspeech 13] and NIST OpenKWS13 evaluation [Chen, ISCSLP 14]  NN-based LM: Modifying training criterion, so the key terms are weighted more during training  Helpful in NIST OpenKWS13 evaluation [Gandhe, ICASSP 14]
  • 76.
    Decoding  Give differentwords different pruning thresholds during decoding  The keywords given lower pruning thresholds than normal terms  Called white listing [Zhang, Interspeech 12] or keyword- aware pruning [Mandal, Interspeech 13]  OOV words never correctly recognized  Two stage approach [Shao, Interspeech 08]  Identify the lattices probably containing OOV (by subword-based approach)  Insert the word arcs of OOV words into lattices and rescore
  • 77.
    Confusion Models Speech Recognition Models Text Retrieval Result Text Retrieval Queryuser Spoken Content Confusion Model A B C A’ B’ C’ The ASR produces systematic errors, so it is possible to learn a confusion model to offer better retrieval results [Karanasou, Interspeech 12][Wallace, ICASSP 10]
  • 78.
    Jointly Optimizing Speech Recognitionand Retrieval Modules Complex Model query A spoken segment Yes, the segment contains the query. No, ……. End-to-end model performing speech recognition and retrieval jointly (learned jointly) in one step Sounds crazy?
  • 79.
    Much information lostduring ASR Much information lost during ASR Transcriptions: using syntax vectors surge …… Lattice: ASR Spoken Content
  • 80.
    New Direction 2: Incorporating ThoseInformation Lost in ASR
  • 81.
    Information beyond Speech RecognitionOutput Speech Recognition Models Text Retrieval Result Text Retrieval Query user Spoken Content Black Box Incorporating information lost in standard speech recognition to help retrieval
  • 82.
    New Direction 2-1: Incorporating ThoseInformation Lost in ASR What kind of information can be helpful?
  • 83.
    Information beyond Speech RecognitionOutput  Phoneme or syllable duration [Wollmer, ICASSP 09][Naoyuki Kanda, SLT 12][Teppei Ohno, SLT 12]  Pitch & Energy [Tejedor, Interspeech 10]  Landmark and attribute detection with prosodic cues includes can reduce the false alarm [Ma, Interspeech 2007] [Naoyuki Kanda, SLT 12] Query is Japanese word “fu-ji-sa-N” very short! False alarm!
  • 84.
    Query-specific Information  "Jackof all trades, master of none“ Speech Recognition Spoken Term Detection Correctly recognized all the words higher detector accuracy on specific query
  • 85.
    Retrieval System Query-specific Detector Query Q Lattices First-passRetrieval Result x1 x2 x3 Examples of Q Compute Similarity Exemplar-based approach also used in speech recognition [Demuynck, ICASSP 2011][Heigold, ICASSP 2012][Nancy Chen, ICASSP 2016]
  • 86.
  • 87.
    Retrieval System Query-specific Detector Query Q Lattices First-passRetrieval Result x1 x2 x3 Examples of Q Learn a model Model for Q Evaluate confidence
  • 88.
    Retrieval System Query-specific Detector Query Q Lattices First-passRetrieval Result x1 x2 x3 positive examples Learn a discriminative model negative examples SVM [Tu & Lee, ASRU 11] [I.-F. Chen, Interspeech 13]
  • 89.
    Query-specific Detector  Theinput of SVM or MLP has to be a fixed-length vector  Representing an audio segment with different length into a fixed-length vector … … … … … … … … … … … … … … [Tu & Lee, ASRU 11] [I.-F. Chen, Interspeech 13]
  • 90.
    Retrieval System Query-specific Detector Query Q Lattices First-passRetrieval Result x1 x2 x3 positive examples negative examples  Is it realistic to have those examples? Pseudo-relevance Feedback (PRF) User Feedback
  • 91.
    New Direction 2-2: Incorporating ThoseInformation Lost in ASR Pseudo Relevance Feedback
  • 92.
    Retrieval System Pseudo Relevance Feedback(PRF) Query Q Lattices First-pass Retrieval Result x1 x2 x3 [Chen & Lee, Interspeech 11] [Lee & Lee, CSL 14]
  • 93.
    Retrieval System Pseudo Relevance Feedback(PRF) Query Q Confidence scores from lattices Lattices R(x1) First-pass Retrieval Result x1 x2 x3 R(x2) R(x3) Not shown to the user [Chen & Lee, Interspeech 11] [Lee & Lee, CSL 14]
  • 94.
    Retrieval System Pseudo Relevance Feedback(PRF) Query Q Lattices R(x1) First-pass Retrieval Result x1 x2 x3 R(x2) R(x3) Assume the results with high confidence scores as correct Examples of Q Considered as examples of Q [Chen & Lee, Interspeech 11] [Lee & Lee, CSL 14]
  • 95.
    Retrieval System Pseudo Relevance Feedback(PRF) Query Q Lattices R(x1) First-pass Retrieval Result x1 x2 x3 R(x2) R(x3) similar dissimilar Examples of Q [Chen & Lee, Interspeech 11] [Lee & Lee, CSL 14]
  • 96.
    Retrieval System Pseudo Relevance Feedback(PRF) Query Q Lattices R(x1) First-pass Retrieval Result x1 x2 x3 R(x2) R(x3) time 1:01 time 2:05 time 1:45 … time 2:16 time 7:22 time 9:01 Rank according to new scores Examples of Q
  • 97.
    (A) (B)  Lecturerecording [Lee & Lee, CSL 14] Pseudo Relevance Feedback (PRF) - Experiments Evaluation Measure: MAP (Mean Average Precision)
  • 98.
    (A) (B) Pseudo RelevanceFeedback (PRF) - Experiments (B): speaker independent (50% recognition accuracy) (A): speaker dependent (84% recognition accuracy) (A) and (B) use different speech recognition systems
  • 99.
    (A) (B)  PRF(red bars) improved the first-pass retrieval results with lattices (blue bars) Pseudo Relevance Feedback (PRF) - Experiments
  • 100.
    New Direction 2-3: Incorporating ThoseInformation Lost in ASR Graph-based Approach
  • 101.
    Graph-based Approach  PRF Each result considers the similarity to the audio examples  Make some assumption to find the examples  Graph-based approach [Chen & Lee, ICASSP 11][Lee & Lee, APSIPA 11][Lee & Lee, CSL 14]  Not assume some results are correct  Consider the similarity between all results
  • 102.
    Graph Construction  Thefirst-pass results is considered as a graph.  Each retrieval result is a node First-pass Retrieval Result from lattices x1 x2 x3 x2 x3 x1 x4 x5
  • 103.
    Graph Construction  Thefirst-pass results is considered as a graph.  Nodes are connected if their retrieval results are similar.  DTW similarities are considered as edge weights x2 x3 x1 x4 x5 Dynamic Time Warping (DTW) Similarity similar
  • 104.
    Changing Confidence Scoresby Graph  The score of each node depends on its neighbors. x2 x3 x1 x4 x5 G(x1) G(x2) G(x3) G(x5) G(x4) high high The results are ranked according to new scores G(xi). “You are known by the company you keep”
  • 105.
    Changing Confidence Scoresby Graph  The score of each node depends on its neighbors. x2 x3 x1 x4 x5 G(x1) G(x2) G(x3) G(x5) G(x4) low low The results are ranked according to new scores G(xi). “You are known by the company you keep”
  • 106.
    Graph-based Re-ranking -Formulation xi xj G(xi)                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 107.
                    i j x x i j j i i x x x G x x G N , Ŵ R 1   Graph-based Re-ranking - Formulation xi xj G(xi) original score considering graph structure (from lattices)
  • 108.
                    i j x x i j j i i x x x G x x G N , Ŵ R 1   Graph-based Re-ranking - Formulation xi xj G(xi) xj: neighbors of xi (nodes connected to xi) N(xi): neighbors of xi (nodes connected to xi)
  • 109.
                    i j x x i j j i i x x x G x x G N , Ŵ R 1   Graph-based Re-ranking - Formulation xi xj G(xi )
  • 110.
    Graph-based Re-ranking -Formulation xi xj W(xi,xj)                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 111.
    Graph-based Re-ranking -Formulation xi xj Normalized by the weights of all the edges connected to xj                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 112.
    Graph-based Re-ranking -Formulation xi xj W(xi,xj) The score of xi would be more close to the nodes xj with larger edge weights.                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 113.
                    i j x x i j j i i x x x G x x G N , Ŵ R 1   Graph-based Re-ranking - Formulation xi xj W(xi,xj) interpolation
  • 114.
     Assign scoreG(x) for each hit region based on the graph structure x1 x3 x2 x4 x5 G(x1) G(x2) G(x3) G(x4) G(x5)  G(x1) depends on G(x2) and G(x3) Graph-based Re-ranking - Formulation                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 115.
     Assign scoreG(x) for each hit region based on the graph structure x1 x3 x2 x4 x5 G(x1) G(x2) G(x3) G(x4) G(x5)  G(x1) depends on G(x2) and G(x3)  G(x2) depends on G(x1) and G(x3) …… Graph-based Re-ranking - Formulation                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 116.
     Assign scoreG(x) for each hit region based on the graph structure x1 x3 x2 x4 x5 G(x1) G(x2) G(x3) G(x4) G(x5)  G(x1) depends on G(x2) and G(x3)  G(x2) depends on G(x1) and G(x3) ……  …… Graph-based Re-ranking - Formulation                  i j x x i j j i i x x x G x x G N , Ŵ R 1  
  • 117.
     Assign scoreG(x) for each hit region based on the graph structure x1 x3 x2 x4 x5 G(x1) G(x2) G(x3) G(x4) G(x5)  How to find G(x1), G(x2), G(x3) …… satisfying the following equation?  This is random walk. Graph-based Re-ranking - Formulation                  i j x x i j j i i x x x G x x G N , Ŵ R 1   G(xi) is uniquely and efficiently obtainable
  • 118.
     Lecture recording[Lee & Lee, CSL 14] (A) (B) Graph-based Approach - Experiments (B): speaker independent (low recognition accuracy) (A): speaker dependent (high recognition accuracy)
  • 119.
    (A) (B)  Graph-basedre-ranking (green bars) outperformed PRF (red bars) Graph-based Approach - Experiments
  • 120.
    0.25 0.27 0.29 0.31 0.33 0.35 Assamese Bengali Lao ATWV FirstPass (on lattices) Graph Graph-based Approach – Experiments on OpenKWS [Lee & Glass, Interspeech 14]
  • 121.
    Graph-based Approach – MoreExperiments  13% relative improvement on OOV queries on another lecture recording (several speakers) [Jansen, ICASSP 13][Norouzian, ICASSP 13]  14% relative improvement on AMI Meeting Corpus [Norouzian, Interspeech 13]  Graph Spectral Clustering  Optimizing evaluation measure and considering the graph structure at the same time [Audhkhasi, ICASSP 2014]  11% relative improvement with subword-based system on OpenKWS15 (Swahili) [Van Tung Pham, ICASSP, 2016]
  • 122.
    New Direction 3: NoSpeech Recognition!
  • 123.
    Why Spoken ContentRetrieval without Speech Recognition?  Bypassing ASR to avoid information loss and all problems with ASR (errors, OOV words, background noise, etc. )  Just to identify the query, no need to find out which words the query includes  Audio files on the Internet in hundreds of different languages  Too limited annotated data for training reliable speech recognition systems for most languages  Written form even doesn’t exist for some languages  Many audio files are code-switched across several different languages
  • 124.
    Spoken Content Retrieval withoutSpeech Recognition user “US President” spoken query Compute similarity between spoken queries and audio files on acoustic level, and find the query term Spoken Content “US President” “US President” Is it possible?
  • 125.
    Approach Categories  DTW-basedApproaches  Matching sequences with DTW  Audio Segment Representation  Representing audio segments by fixed length vector representations  Unsupervised ASR (or model-based approach)  Training word- or subword-like acoustic patterns (or tokens) from target audio archive  Transcribing both the audio archive and the query into word- or subword-like token sequences  Matching based on the tokens, just like text retrieval
  • 126.
    New Direction 3-1: NoSpeech Recognition! DTW-based Approaches
  • 127.
    DTW-based Approach  ConventionalDTW Audio Segment Audio Segment
  • 128.
    DTW-based Approach  DTWfor query-by-example  Whether a spoken query is in an utterance Spoken Query Utterance Segmental DTW [Zhang, ICASSP 10], Subsequence DTW [Anguera, ICME 13][Calvo, MediaEval 14]
  • 129.
    Acoustic Feature Vectors Gaussian posteriorgram [Zhang, ICASSP 10][Wang, MediaEval 14]  Phonetic posteriors [Hazen, ASRU 09]  MLP trained from another corpus (probably in a different language)  Bottle-neck feature generated from MLP [Kesiraju, MediaEval 14]  RBM posteriorgram [Zhang, ICASSP 12]  Performance comparison [Carlin, Interspeech 11]
  • 130.
    Speed-up Approaches forDTW  Segment-based matching [Chan & Lee, Interspeech 10][Chan & Lee, ICASSP 11] Spoken Query Utterance Group consecutive acoustically similar feature vectors into a segment
  • 131.
    Speed-up Approaches forDTW  Segment-based matching Hierarchical Agglomerative Clustering (HAC) Step 1: build a tree Step 2: pick a threshold Group consecutive acoustically similar feature vectors into a segment
  • 132.
    Speed-up Approaches forDTW  Segment-based matching [Chan & Lee, Interspeech 10][Chan & Lee, ICASSP 11] Spoken Query Utterance Compute similarities between segments only
  • 133.
    Speed-up Approaches forDTW  Segment-based matching [Chan & Lee, Interspeech 10][Chan & Lee, ICASSP 11]  Lower bound estimation [Zhang, ICASSP 11][Zhang, Interspeech 11]  Indexing the frames in the target audio file [Jansen, ASRU 11][Jansen, Interspeech 12]  Information Retrieval based DTW [Anguera, Interspeech 13]
  • 134.
    New Direction 3-2: NoSpeech Recognition! Audio Segment Representation
  • 135.
    Framework Audio archive dividedinto variable- length audio segments Audio Segment to Vector Audio Segment to Vector Similarity Search Result Spoken Query Off-line On-line [Chung & Lee, Interspeech 16][Chen, ICASSP 15] [Levin, ICASSP 15][Levin, ASRU 13]
  • 136.
    Audio Word toVector  The audio segments corresponding to words with similar pronunciations are close to each other. ever ever never never never dog dog dogs
  • 137.
    Audio Word toVector - Segmental Acoustic Indexing  Basic idea [Levin, ICASSP 15][Levin, ASRU 13] A set of template audio segments 0.5 …… 0.8 0.3 0.5 0.8 0.3 ⋮ DTW
  • 138.
    Audio Word toVector – Sequence Auto-encoder [Chung & Lee, Interspeech 16] RNN Encoder x1 x2 x3 x4 audio segment acoustic features Representation for the whole audio segment
  • 139.
    Audio Word toVector – Sequence Auto-encoder RNN Decoder x1 x2 x3 x4 y1 y2 y3 y4 x1 x2 x3 x4 RNN Encoder [Chung & Lee, Interspeech 16]
  • 140.
    Sequence Auto-encoder – ExperimentalResults never ever Cosine Similarity Edit Distance between Phoneme sequences Deep Learning Deep Learning
  • 141.
  • 142.
    Sequence Auto-encoder – ExperimentalResults  Projecting the embedding vectors to 2-D day days say s say
  • 143.
    Sequence Auto-encoder – ExperimentalResults  Audio story (LibriSpeech corpus) MAP training epochs for sequence auto-encoder SA: sequence auto-encoder DSA: de-noising sequence auto-encoder Input: clean speech + noise output: clean speech
  • 144.
    New Direction 3-3: NoSpeech Recognition! Unsupervised ASR
  • 145.
    Conventional ASR … HelloWorld … ASR unknown speech signal
  • 146.
    Unsupervised ASR ASR unknown speechsignal Used in Query by example Spoken Term Detection Unsupervised ASR: Learn the models for a set of acoustic patterns (tokens) directly from the corpus (target spoken archive) t0t1t2, t1t3, t2t3, t2t1t3t3t2 … Acoustic Tokens
  • 147.
    Unsupervised ASR -Acoustic Token utterance acoustic feature acoustic tokens: chunks of acoustically similar feature vectors with token ids t0 t1 t2 t1 [Zhang & Glass, ASRU 09] [Huijbregts, ICASSP 11] [Chan & Lee, Interspeech 11]
  • 148.
    Unsupervised ASR - OverallFramework Initialization feature sequence model training token decoding initial token sequence final token sequence : feature sequence for the whole corpus : token sequences for X : Model (e.g. HMM) parameters : training iteration simple segmentation and clustering
  • 149.
    Unsupervised ASR - Initialization GetToken ID Extract acoustic features for every utterance
  • 150.
    Unsupervised ASR - OverallFramework Initialization feature sequence model training token decoding initial token sequence final token sequence : feature sequence for the whole corpus : Model (e.g. HMM) parameters : token sequences for X : training iteration simple segmentation and clustering
  • 151.
    Unsupervised ASR - OverallFramework Initialization feature sequence model training token decoding initial token sequence final token sequence optimize HMM parameters using Baum–Welch algorithm on token sequence 𝜔𝑖−1 to get new models 𝜃𝑖 decode acoustic features into a new token sequence 𝜔𝑖 using Viterbi decoding 𝜔𝑖−1
  • 152.
    Unsupervised ASR - OverallFramework iterate until the token sequences (including token boudaries) converge Initialization feature sequence model training token decoding initial token sequence final token sequence
  • 153.
    Acoustic Token inQuery by Example Spoken Term Detection  Compute the similarity between the models of two tokens Model of token A Model of token B KL divergence of the Gaussian mixtures in the first state of two models
  • 154.
    Acoustic Token inQuery by Example Spoken Term Detection  Compute the similarity between the models of two tokens Model of token A Model of token B Sum of the KL divergence over the states of the two token models
  • 155.
    Token-based DTW subsequence matchingToken-based DTW Tokens in query Tokens in an utterance  Signal-level DTW is more sensitive to signal variation (e.g. same phoneme across different speakers), while token models are able to cover better the distribution of signal variation  Much lower on-line computation load a b c d a b g h b d b g d b a
  • 156.
    Multi-granularity Space for AcousitcTokens • Unknown hyperparameters for the token models • Number of HMM states per token (m): token length • Number of distinct tokens (n) • Multiple layers of intrinsic representations of speech
  • 157.
    Multi-granularity Space for AcousitcTokens • From short to long (Temporal Granularity) – phoneme – syllable – word – phrase • From coarse units to fine units (Phonetic Granularity) – general phoneme set – gender dependent phoneme set – speaker specific phoneme set Number of distinct HMMs (n) Number of states per HMM (m)
  • 158.
    Multi-granularity Space for AcousitcTokens Training multiple sets of HMMs for with different granularity [Chung & Lee, ICASSP 14 ] phoneme syllable word phrase general gender dependent speaker specific n m
  • 159.
    Multi-granularity Space for AcousitcTokens  Token-based DTW using tokens with different granularity (m,n) averaged gave much better performance  One example  Frame-level DTW: MAP = 10%  Using only the token set with the best performance: MAP = 11%  Using 20 sets of tokens (number of states per HMM m = 3,5,7,9,11, number of distinct HMMs n=50,100,200,300): MAP = 26%
  • 160.
    Hierarchical Paradigm  TypicalASR:  Acoustic Model: models for the phonemes  Lexicon: the pronunciation of every word as a phoneme sequence  Language Model: the transition between words Word 1 Phoneme 1 Phoneme 4 Word 2 Phoneme 2 Phoneme 1 Phoneme 3 Lexicon word 1 word 2 word 3 word 4 Language Model Phoneme 1 Phoneme 2 Phoneme 3 Acoustic Model
  • 161.
    word-like token 1 word-like token 1 word-like token1 Hierarchical Paradigm  Similarly, in unsupervised ASR:  Acoustic Model: the phoneme-like token HMMs  Lexicon: the pronunciation of every word-like token as a sequence of phoneme-like tokens  Language Model: the transition between word-like tokens word-like token 1 phoneme- like token 1 phoneme- like token 4 word-like token 2 phoneme- like token 2 phoneme- like token 1 phoneme- like token 3 Lexicon word-like token 1 Language Model phoneme-like token 1 Acoustic Model phoneme-like token 2 phoneme-like token 3
  • 162.
    word-like token 1 word-like token 1 word-like token1 Hierarchical Paradigm  Similarly, in unsupervised ASR:  Acoustic Model: the phoneme-like token HMMs  Lexicon: the pronunciation of every word-like token as a sequence of phoneme-like tokens  Language Model: the transition between word-like tokens word-like token 1 phoneme- like token 1 phoneme- like token 4 word-like token 2 phoneme- like token 2 phoneme- like token 1 phoneme- like token 3 Lexicon word-like token 1 Language Model phoneme-like token 1 Acoustic Model phoneme-like token 2 phoneme-like token 3 Bottom-up Construction Top Down Constraint
  • 163.
    Bottom Up Construction 3stages during training focus on different constraints: stage1: Acoustic Model stage2: Language Model stage3: Lexicon stage1 stage2 stage3 this part alone would be the HMM training we described earlier [Chung & Lee, ICASSP 13]
  • 164.
    word-like token 1 word-like token 1 word-like token1 Hierarchical Paradigm  Similarly, in unsupervised ASR:  Acoustic Model: the phoneme-like token HMMs  Lexicon: the pronunciation of every word-like token as a sequence of phoneme-like tokens  Language Model: the transition between word-like tokens word-like token 1 phoneme- like token 1 phoneme- like token 4 word-like token 2 phoneme- like token 2 phoneme- like token 1 phoneme- like token 3 Lexicon word-like token 1 Language Model phoneme-like token 1 Acoustic Model phoneme-like token 2 phoneme-like token 3 Bottom-up Construction Top Down Constraint
  • 165.
    Top-down Constraints [Jansen,ICASSP 13] This figure is from Aren Jansen’s ICASSP paper.  Signals of the same phoneme may be very different on phoneme level, but the global structures of signals of the same word are very often very similar on word level  Global structures help in building the hierarchical model
  • 166.
    Token Model Optimization Token Label Optimization 𝟁= (m,n) Multi-target DNN (MDNN) Multi-layered Acoustic Tokenizer (MAT) o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o Bottleneck Features concatenation Multi- layered Token labels as MDNN targets Bottleneck Features Initial Acoustic Features (iteration 1) Concatenated Features(iteration 2,3,...) Bottleneck Features (iterations 2,3,...) feature evaluation (sub)word evaluation Initial Acoustic Features (iteration 1) In the first iteration, we use MFCC as the initial features In the other iterations, we concatenate the bottleneck features with the MFCC Multi-layered Acoustic Tokenizing Deep Neural Networks (MAT-DNN) [Chung & Lee, ASRU 15]  Jointly learn high quality frame-level features (much better than MFCCs) and acoustic tokens in an unsupervised way  Unsupervised training of multi-target DNN using unsupervised token labels as training target
  • 167.
    Token Model Optimization Token Label Optimization 𝟁= (m,n) Multi-target DNN (MDNN) Multi-layered Acoustic Tokenizer (MAT) o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o Bottleneck Features concatenation Multi- layered Token labels as MDNN targets Bottleneck Features Initial Acoustic Features (iteration 1) Concatenated Features(iteration 2,3,...) Bottleneck Features (iterations 2,3,...) feature evaluation (sub)word evaluation Initial Acoustic Features (iteration 1) Multi-layered Acoustic Tokenizing Deep Neural Networks (MAT-DNN) In the first iteration, we use MFCC as the initial features In the other iterations, we concatenate the bottleneck features with the MFCC  Jointly learn high quality frame-level features (much better than MFCCs) and acoustic tokens in an unsupervised way  Unsupervised training of multi-target DNN using unsupervised token labels as training target [Chung & Lee, ASRU 15]
  • 168.
     Experimental Results Query by Example Spoken Term Detection on Tsonga Multi-layered Acoustic Tokenizing Deep Neural Networks (MAT-DNN) [Chung & Lee, ASRU 13] Approach MAP Frame-based DTW MFCC 9.0 New Feature 28.7 Token-based DTW New Tokens 26.2
  • 169.
    New Direction 4: SpecialSemantic Retrieval Techniques for Spoken Content
  • 170.
    Semantic Retrieval  Userexpects semantic retrieval of spoken content.  User asks “US President”, system also finds “Obama”  Widely studies on text retrieval  Take query expansion as example user “US President” Search both “US President” or “Obama” “Obama” and “US President” are related Retrieval system
  • 171.
    Semantic Retrieval of SpokenContent  User expects semantic retrieval of spoken content.  User asks “US President”, system also finds “Obama”  Widely studies on text retrieval  Take query expansion as example  The techniques developed for text can be directly applied on semantic retrieval of spoken content  Are there any special techniques for spoken content?
  • 172.
     Both queryQ and document d are represented as unigram language models θQ and θd Review: Language Modeling Retrieval Approach w1 w2 w3 w4 w5 …   Q w P  | Query model θQ   d w P  | Document model θd w1 w2 w3 w4 w5 … KL divergence between the two models can be evaluated.
  • 173.
    Review: Language ModelingRetrieval Approach  Given query Q, rank document d according to a relevance score function SLM(Q,d):  Inverse of KL divergence between query model θQ and document model θd  The documents with document models θd similar to query model θQ are more likely to be relevant.     d Q LM KL d Q S   | ,  
  • 174.
           ' , , | w d d w N d w N w P   Query model θQ for text:  Document model θd for text :         ' , , | w Q Q w N Q w N w P  Review: Basic Query/Document Models in Text Retrieval Normalize into probability N(w,Q): term frequency of word w in query Q Normalize into probability N(w,d): term frequency of word w in document d Those basic models can be enhanced by query/document expansion to handle the problem of semantic retrieval.
  • 175.
    Review: Query Expansion Document Model θd Retrieval Engine doc101 doc 205 doc 145 … … Text Query Q w1 w2 w3 w4 w5   Q w P  | Query model First-pass Retrieval Result [Tao, SIGIR 06]
  • 176.
    Review: Query Expansion Document Model θd Retrieval Engine doc101 doc 205 doc 145 … … Document model for doc 101 w1 w2 w3 w4 w5 ……   d w P  | w1 w2 w3 w4 w5 ……   d w P  | Text Query Q Top N documents Document model for doc 205 w1 w2 w3 w4 w5   Q w P  | Query model First-pass Retrieval Result [Tao, SIGIR 06]
  • 177.
    Review: Query Expansion Document Model θd doc101 doc 205 doc 145 … … w1 w2 w3 w4 w5 ……   d w P  | w1 w2 w3 w4 w5 ……   d w P  | Text Query Q w1 w2 w3 w4 w5   Q w P  | common patterns in document models New Query Model Query model w1 w2 w3 w4 w5 ……   Q w P ' | Retrieval Engine First-pass Retrieval Result Top N documents [Tao, SIGIR 06] (by EM algorithm)
  • 178.
    Review: Query Expansion Document Model θd doc101 doc 205 doc 145 … … w1 w2 w3 w4 w5 ……   d w P  | w1 w2 w3 w4 w5 ……   d w P  | Text Query Q w1 w2 w3 w4 w5   Q w P  | Query model Retrieval Engine Final Result w1 w2 w3 w4 w5 ……   Q w P ' | Retrieval Engine First-pass Retrieval Result New Query Model Top N documents [Tao, SIGIR 06]
  • 179.
    Review: Document Expansion  d w P  | Document model θd w1 w2 w3 w4 w5 … Topic Find topics behind document Modify document model This is realized by PLSA, LDA, etc. Topic Topic [Wei, SIGIR 06] “airplane”   d w P  | New Document model θd w1 w2 w3 w4 w5 … “airplane” “aircraft”
  • 180.
    Semantic Retrieval onLattices Original Retrieval Model of Text For Lattices Term Frequency Expected Term Frequency Document Length Expected Document Length …… ……  Take the basic language modeling retrieval approach as example  Modify retrieval model for lattices:
  • 181.
    Document Model fromLattices  Document model θd for text  (Spoken) Document model θd from lattice         ' , , | w d d w N d w N w P          ' , , | w d d w E d w E w P  Replace term frequency N(w,d) with expected term frequency E(w,d) computed from lattices
  • 182.
     Expected termfrequency E(w,d) for word w in spoken document d based on lattice Expected Term Frequency lattice of spoken document d wA wB wC wA wA wA wC            d L u d u P u w N d w E | , ,
  • 183.
    Expected Term Frequency Expected term frequency E(w,d) for word w in spoken document d based on lattice L(d): all the word sequences in the lattice of d N(w,u): the number of word w appearing in word sequence u u: a word sequence in the lattice of d P(u|d): posterior probability of word sequence u wA wB wC wA wA wA wC lattice of spoken document d            d L u d u P u w N d w E | , ,
  • 184.
    New Direction 4-1: SpecialSemantic Retrieval Techniques for Spoken Content Better Estimation of Term Frequencies
  • 185.
    Better Estimation ofTerm Frequencies  Expected term frequency E(w,d) from lattices can be inaccurate  Context of each term in the lattices [Tu & Lee, ICASSP 12]  The same terms usually have similar context [Schneider, Interspeech 10]  Graph-based approach  Graph-based approach using acoustic feature similarity improved spoken term detection  It can also improve semantic retrieval of spoken content based on language modeling retrieval approach  Idea: Replace expected term frequency E(w,d) with scores from graph-based approach [Lee & Lee, SLT 12] [Lee & Lee, IEEE/ACM T. ASL 14]
  • 186.
    Graph-based Approach for SemanticRetrieval  For each word w in the lexicon … … … … … … x1 x2 x3 x4 spoken document spoken document spoken document Find the occurrence regions of word w
  • 187.
Graph-based Approach for Semantic Retrieval  For each word w in the lexicon: connect the occurrence regions x1, x2, x3, x4, … by acoustic feature similarities
  • 188.
Graph-based Approach for Semantic Retrieval  For each word w in the lexicon: obtain new scores G(x1), G(x2), G(x3), G(x4), … for the occurrence regions by random walk over the graph
  • 189.
Graph-based Approach for Semantic Retrieval  Repeat this process for all the words w in the lexicon
  • 190.
Graph-based Approach for Semantic Retrieval  Better estimation of the term frequency of each word w in spoken document d  Lattice-based document model: P(w|θd) = E(w,d) / Σw' E(w',d)  Graph-enhanced document model: replace the expected term frequency E(w,d) with scores derived from the graph scores G(x) of the occurrence regions of w in d (a rough sketch follows)
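A rough sketch of the random-walk scoring over one word's occurrence regions is shown below. The restart weight, the column normalization, and the use of lattice posteriors as the prior are illustrative assumptions; the cited papers define the graph construction and normalization in detail.

```python
import numpy as np

def random_walk_scores(similarity, prior, alpha=0.85, iters=100):
    """PageRank-style score propagation over occurrence regions of one word.

    similarity: (n, n) non-negative acoustic similarity matrix between regions
    prior:      length-n initial scores (e.g. lattice posteriors of the regions)
    Returns the new scores G(x) for the n occurrence regions.
    """
    sim = np.asarray(similarity, dtype=float)
    np.fill_diagonal(sim, 0.0)
    # Column-normalize so each column is a transition distribution
    col_sums = sim.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0.0] = 1.0
    trans = sim / col_sums
    p = np.asarray(prior, dtype=float)
    p = p / p.sum()
    g = np.full(len(p), 1.0 / len(p))
    for _ in range(iters):
        g = (1 - alpha) * p + alpha * trans @ g   # random walk with restart
    return g

# Toy example: 3 occurrence regions; regions 0 and 1 sound very similar
sim = [[0.0, 0.9, 0.1],
       [0.9, 0.0, 0.1],
       [0.1, 0.1, 0.0]]
prior = [0.5, 0.2, 0.3]
print(random_walk_scores(sim, prior))
```

Regions that are acoustically similar to many confidently recognized regions end up with higher scores, which is the intuition behind replacing E(w,d) with graph-derived estimates.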
  • 191.
Graph-based Approach for Semantic Retrieval  Lattice-based document model: P(w|θd) = E(w,d) / Σw' E(w',d)  Graph-enhanced document model: P(w|θd) = E'(w,d) / Σw' E'(w',d), where E'(w,d) uses the graph scores  Query and document expansion borrowed from text retrieval can be equally applied
  • 192.
Graph-based Approach for Semantic Retrieval – Experiments  Experiments on TV News  (figure: MAP, roughly 0.43–0.51, comparing the lattice-based and graph-enhanced document models under Basic LM, Query Expansion, Document Expansion, and Query + Document Expansion) [Lee & Lee, IEEE/ACM T. ASL 14]
  • 193.
New Direction 4-2: Special Semantic Retrieval Techniques for Spoken Content — Exploiting Acoustic Tokens
  • 194.
    Acoustic Tokens  Wecan identify “acoustic tokens” in direction 3 Token 1 Token 1 Token 1 Token 2 Token 2 Token 3 Token 3 Token 3 Query expansion with acoustic tokens Can be useful in semantic retrieval of spoken content: Unsupervised semantic retrieval of spoken content
  • 195.
Query Expansion – Never Appear?  If “Obama” is not in the lexicon, it will never appear in the lattices, so we can never learn that “Obama” co-occurs with “US President” during query expansion  Typical approach: using subwords  Query expansion with acoustic tokens: complementary to the subword approach
  • 196.
Query Expansion with Acoustic Tokens  Original text query: “US President”  First pass: retrieve the spoken documents containing “US President” in the transcriptions (d100: …… US President …, d205: … US President ……) [Lee & Lee, ICASSP 13]
  • 197.
Query Expansion with Acoustic Tokens  Original text query: “US President”  Find acoustic tokens that frequently appear in the signals of these retrieved documents (d100, d205, …)
  • 198.
Query Expansion with Acoustic Tokens  Original text query: “US President”  Even if the terms related to the query are OOV (e.g. “Obama”), as long as they really co-occur with the query in the speech signals, we can find the acoustic tokens corresponding to these terms
  • 199.
Query Expansion with Acoustic Tokens  Expanded query: the original text query “US President” plus the discovered acoustic tokens
  • 200.
Query Expansion with Acoustic Tokens  The retrieval system searches the lattices with the expanded query and also looks for the same acoustic tokens in other audio files (e.g. ones mentioning “White House”)  By expanding the text query with acoustic tokens, more semantically related audio files can be retrieved (a rough sketch of the expansion step follows)
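A rough sketch of the expansion step, assuming each document already comes with a sequence of discovered acoustic token IDs. The scoring heuristic for picking tokens over-represented in the first-pass retrieved documents is a placeholder, not the formulation of [Lee & Lee, ICASSP 13].

```python
from collections import Counter

def frequent_acoustic_tokens(retrieved_docs_tokens, all_docs_tokens, top_k=5):
    """Pick acoustic tokens over-represented in the first-pass retrieved docs.

    Each element of the inputs is one document's sequence of acoustic
    token IDs (obtained from unsupervised acoustic token discovery).
    """
    in_retrieved = Counter(t for doc in retrieved_docs_tokens for t in set(doc))
    in_all = Counter(t for doc in all_docs_tokens for t in set(doc))
    ratio = {t: in_retrieved[t] / in_all[t] for t in in_retrieved}
    return sorted(ratio, key=ratio.get, reverse=True)[:top_k]

def token_match_score(doc_tokens, expansion_tokens):
    """Crude second-pass score: how often the expansion tokens occur in a doc."""
    return sum(doc_tokens.count(t) for t in expansion_tokens)

# Toy token sequences (integers stand for discovered acoustic tokens)
retrieved = [[3, 7, 7, 12], [7, 9, 12]]
collection = retrieved + [[1, 2, 3], [9, 10, 11], [7, 12, 12]]
expansion = frequent_acoustic_tokens(retrieved, collection)
print(expansion, token_match_score([7, 12, 12], expansion))
```

Documents that were not matched by the text query but contain the expansion tokens (e.g. tokens covering "Obama") can then be scored and retrieved.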
  • 201.
Query Expansion – Acoustic Patterns  Experiments on TV News [Lee & Lee, ICASSP 13]
  • 202.
    Unsupervised Semantic Retrieval Unsupervised Semantic Retrieval [Li & Lee, ASRU 13][Oard, FIRE 13]  No speech recognition as query-by-example spoken term detection  But find spoken documents semantically related to the spoken queries  New task, not too much previous work  Below is just a very preliminary study [Li & Lee, ASRU 13]
  • 203.
Unsupervised Semantic Retrieval  Spoken queries  1. Find the spoken documents in the database containing the spoken query (Spoken Document 1, 2, 3, …)  Done by query-by-example spoken term detection approaches (e.g. DTW), as sketched below
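A minimal DTW sketch for this first pass: it computes a length-normalized alignment cost between the spoken query's feature sequence and a candidate region, with lower cost meaning a better match. Real query-by-example systems use subsequence or segmental DTW to locate matches inside long documents; this is only the alignment core, and the feature representation is assumed given.

```python
import numpy as np

def dtw_distance(query_feats, region_feats):
    """Plain DTW alignment cost between two feature sequences (frames x dims)."""
    q, r = np.asarray(query_feats), np.asarray(region_feats)
    n, m = len(q), len(r)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(q[i - 1] - r[j - 1])    # frame-level distance
            cost[i, j] = d + min(cost[i - 1, j],        # skip a query frame
                                 cost[i, j - 1],        # skip a region frame
                                 cost[i - 1, j - 1])    # match the two frames
    return cost[n, m] / (n + m)   # length-normalized alignment cost

# Toy 1-D "feature" sequences standing in for MFCC frames
print(dtw_distance([[0.0], [1.0], [2.0]], [[0.1], [0.9], [1.1], [2.0]]))
```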
  • 204.
Unsupervised Semantic Retrieval  2. Find acoustic tokens frequently co-occurring with the spoken queries in the same documents
  • 205.
Unsupervised Semantic Retrieval  3. Use the acoustic tokens to expand the original spoken query (expanded queries)
  • 206.
Unsupervised Semantic Retrieval  4. Retrieve again with the expanded queries  This can retrieve spoken documents not containing the original spoken queries
  • 207.
Unsupervised Semantic Retrieval – Experiments  Broadcast news, MAP as the evaluation measure  Using only DTW to retrieve the spoken queries: spoken term detection 28.30% (the user only wants spoken documents containing the query term); semantic retrieval 8.76% (the user wants all spoken documents semantically related to the query term) [Li & Lee, ASRU 13]
  • 208.
Unsupervised Semantic Retrieval – Experiments  Broadcast news, MAP as the evaluation measure  Using only DTW to retrieve the spoken queries: spoken term detection 28.30%, semantic retrieval 8.76%  Exactly the same retrieval results, but what the user wants to find is different: many semantically related spoken documents cannot be retrieved [Li & Lee, ASRU 13]
  • 209.
Unsupervised Semantic Retrieval – Experiments  Broadcast news, MAP as the evaluation measure  Using only DTW to retrieve the spoken queries: spoken term detection 28.30%, semantic retrieval 8.76%  With the expanded spoken queries, MAP was improved from 8.76% to 9.70%  Unsupervised semantic retrieval still has a long way to go [Li & Lee, ASRU 13]
  • 210.
New Direction 5: Speech Content is Difficult to Browse!
  • 211.
    Audio is hardto browse  When the system returns the retrieval results, user doesn’t know what he/she get at the first glance Retrieval Result
  • 212.
    Audio is hardto browse Interactive spoken content retrieval Summarization, Key term extraction, Title Generation Organizing retrieved results Question answering
  • 213.
New Direction 5-1: Speech Content is Difficult to Browse! — Interactive Spoken Content Retrieval
  • 214.
    Interactive Spoken Content Retrieval Conventional Retrieval Process user US President Here are what you are looking for: Doc3 Doc1 Doc2 … Can be noisy
  • 215.
Interactive Spoken Content Retrieval  Interactive retrieval  User: “US President”  System (not clear enough ……): “More precisely, please.”
  • 216.
Interactive Spoken Content Retrieval  Interactive retrieval  User: “US President” → System: “More precisely, please.” → User: “Obama” → System (still not clear enough ……): “Is it related to ‘Election’?”
  • 217.
Interactive Spoken Content Retrieval  Interactive retrieval  User: “US President” → System: “More precisely, please.” → User: “Obama” → System: “Is it related to ‘Election’?” → User: “Yes.” → System: “I see! Here is what you are looking for.”
  • 218.
    Interactive Spoken Content Retrieval Given the information entered by the users at present, which action should be taken? “Give me an example.” “Is it relevant to XXX?” “Can you give me another query?” “Show the results.”
  • 219.
MDP for Interactive Retrieval  MDP  Widely used in dialogue systems (air ticket booking, city guides, …)  The system is in certain states  Which action should be taken depends on the state the system is in  MDP for interactive retrieval [Wen & Lee, Interspeech 12][Wen & Lee, ICASSP 13]  State: the degree of clarity of the user’s information need (a state space ranging from ambiguous to clear)
  • 220.
The query “US President” is sent to the search engine over the spoken archive; the retrieval results (Doc3, Doc1, Doc2, …) are fed to a state estimator that estimates the degree of clarity from the retrieval results, placing the system at a state S1 in the ambiguous–clear state space [Cronen-Townsend, SIGIR 02][Zhou, SIGIR 07]
  • 221.
    S1 A 1 A 2 A 3 A 4  A setof candidate actions  System: “More precise, please.”  System: “Is it relevant to XXX?”  ….. Ambiguous Clear state space  There is an action “show results”  When the system decides to show the results, the retrieval session is ended
  • 222.
    S1 A 1 A 2 A 3 A 4  Choose theactions by intrinsic policy π(S)  The policy is a function  Input: state S, output: action A π(S)=“More precise, please” π(S)=Show Results Ambiguous Clear state space
  • 223.
At state S1, π(S1) = A1 (“More precisely, please.”); the user responds “Obama”, which is sent to the search engine over the spoken archive again; the system incurs a cost C1 due to user labor
  • 224.
  • 225.
    Interact with Users- MDP  Good interaction:  The quality of final retrieval results shown to the users are as good as possible  The user labors (C1, C2) are as small as possible S1 S3 S2 A 1 C1 C2 A 2 En d Sho w
  • 226.
    Interact with Users- MDP  Learn polity π maximizing: Retrieval Quality - User labor  The polity π can be learned from historical interaction by fitted value iteration [Chandramohan, Interspeech 10] S1 S3 S2 A 1 C1 C2 A 2 En d Sho w
  • 227.
    Deep Reinforcement Learning Replacing MDP by deep reinforcement learning
  • 228.
The query retrieves from the spoken content; features of the retrieval results go to state estimation (the state reflects the degree of clarity of the retrieval results), and the state goes to action decision  The policy π(s) is a function: input a state s, output an action a  Decide the actions by the intrinsic policy π(s) [Wen & Lee, Interspeech 12][Wen & Lee, ICASSP 13]
  • 229.
With deep reinforcement learning, the features of the retrieval results are fed to a DNN that performs state estimation and action decision jointly, outputting a value for each action (“Is it relevant to XX?”, “Give me an example.”, “Show the results.”, …); the action with the maximum value is taken, as in the sketch below [Wu & Lee, Interspeech 16]
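A minimal sketch of such an action-decision network in PyTorch, assuming some fixed-size vector of retrieval-derived state features; the feature dimension, layer sizes, and action wording are illustrative, and the deep reinforcement learning training loop is omitted.

```python
import torch
import torch.nn as nn

ACTIONS = ["Is it relevant to XX?", "Give me an example.",
           "Can you give me another query?", "Show the results."]

class ActionValueNet(nn.Module):
    """Maps retrieval-result features to an estimated value per dialogue action."""
    def __init__(self, n_features=20, n_actions=len(ACTIONS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, state_features):
        return self.net(state_features)

# Usage: features summarizing the first-pass retrieval (clarity estimate,
# score distribution, ...); values here are random for illustration only.
policy = ActionValueNet()
state = torch.randn(1, 20)
action_id = policy(state).argmax(dim=1).item()
print(ACTIONS[action_id])
```

Training would adjust the network so that, over whole sessions, the chosen actions maximize the return "retrieval quality minus user labor" described on the earlier slides.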
  • 230.
  • 231.
    Experimental Results  Broadcastnews, semantic retrieval [Wu & Lee, Interspeech 16] Retrieval Quality (MAP) Optimization Target: Retrieval Quality - User labor Hand-crafted Deep Reinforcement Learning Previous Method (state + decision)
  • 232.
New Direction 5-2: Speech Content is Difficult to Browse! — Summarization, Key Term Extraction & Title Generation
  • 233.
Introduction  From a retrieved audio file (e.g. 10 minutes long): extractive summarization produces a 30-second summary, key term extraction produces key terms (“Deep Learning”, “Neural Network”, ….), and title generation produces a title (“Introduction of Deep Learning”)
  • 234.
    Summarization  Unsupervised Approach:Maximum Margin Relevance (MMR) and Graph-based Approach  Supervised approach: Summarization problem can be formulated as a binary classification program  Included in the summary or not utterance 1 utterance 2 utterance 3 utterance 4 Binary Classifier -1 +1 +1 -1 utterance 2 utterance 3 classification result summary Binary Classifier Binary Classifier Binary Classifier Lecture
  • 235.
Summarization – Binary Classification  A binary classifier considers each utterance individually  To generate a good summary, “global information” should be considered; for example, the summary should be concise: if the spoken document says “LSA is Latent Semantic Analysis”, “LSA is useful for summarization”, “LSA is helpful for summarization”, and then repeats again, a per-utterance classifier may select all the repetitions, whereas a succinct summary should keep only one  This calls for more advanced machine learning techniques
  • 236.
Summarization – Whole Spoken Document  Learn a special model by structured learning techniques  Input: the whole spoken document (lecture); output: the summary (e.g. 3 utterances selected)  The model considers the whole lecture at once [Lee & Lee, ICASSP 13][Lee & Lee, Interspeech 12]
  • 237.
    Evaluation Function  Evaluationfunction of utterance set F(s)  s: utterance set in a lecture F(s) 10 score utterance set s how suitable it is to consider utterance set s as the summary Properties: • Concise? • Include keyword? • Short enough? ………. How good it is to take this utterance set as summary? Lecture
  • 238.
Evaluation Function – How to Summarize  With F(s), we can summarize new lectures: enumerate all the possible utterance sets s1, s2, …, s7, compute F(s) for each, and take the set that maximizes F(s) (e.g. s6) as the summary (a brute-force sketch follows)
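A brute-force sketch of this inference step: enumerate small utterance sets and keep the one with the highest F(s). The stand-in F(s) below only rewards keyword coverage and penalizes total length; the real F(s) is learned, and structured prediction replaces exhaustive enumeration for realistic lecture lengths.

```python
from itertools import combinations

def summarize(utterances, score_fn, max_utts=3):
    """Return the utterance-index set s maximizing the evaluation function F(s)."""
    best_set, best_score = (), float("-inf")
    for k in range(1, max_utts + 1):
        for s in combinations(range(len(utterances)), k):
            score = score_fn([utterances[i] for i in s])
            if score > best_score:
                best_set, best_score = s, score
    return best_set

# Stand-in F(s): reward keyword coverage, penalize total length (conciseness).
KEYWORDS = {"lsa", "summarization"}
def f(selected):
    words = {w for u in selected for w in u.lower().split()}
    return 2 * len(KEYWORDS & words) - 0.1 * sum(len(u.split()) for u in selected)

lecture = ["Hello everyone", "LSA is latent semantic analysis",
           "LSA is helpful for summarization", "Repeat again"]
print(summarize(lecture, f))   # picks the single utterance covering both keywords
```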
  • 239.
    Evaluation Function  Evaluationfunction of utterance set F(s)  s: utterance set in a lecture F(s) 10 score utterance set s how suitable it is to consider utterance set s as the summary Properties: • Concise? • Include keyword? • Short enough? ………. How good it is to take this utterance set as summary? Lecture What properties should F(s) check?
  • 240.
Evaluation Function – Training  Learn F(s) from training data: lectures with reference summaries  Find F(s) such that the reference summaries receive high scores (e.g. 9, 7) and other utterance sets receive low scores (e.g. −4)  Structured SVM: I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, “Support Vector Machine Learning for Interdependent and Structured Output Spaces,” ICML, 2004.
  • 241.
Summarization – Structure of Spoken Document  Temporal structure helps summarization  Long summary: consecutive utterances in an important paragraph are more likely to be selected together  Short summary: one utterance is selected as the representative of a paragraph
  • 242.
Summarization – Structure of Spoken Document  Add the structure information (paragraph boundaries) into the evaluation function F(s): F(s) scores an utterance set given the structure, in addition to the properties above (concise? includes keywords? short enough? ……) [Shiang & Lee, Interspeech 13]
  • 243.
Summarization – Structure of Spoken Document  The structure of text is clear: paragraph boundaries are directly known  For spoken content, there is no obvious structure, so here the structure is treated as hidden variables: structured learning with hidden variables
  • 244.
Summarization – Structure of Spoken Document  Evaluation measures: ROUGE-1 and ROUGE-2  Larger scores mean the machine-generated summaries are more similar to the human-generated summaries [Shiang & Lee, Interspeech 13]
  • 245.
    Key Term Extraction TF-IDF is a food measure for identifying key terms [E. D’Avanzo, DUC 04][Jiang, SIGIR 09]  Feature parameters from latent topic models [Hazen, Interspeech 11] [Chen & Lee, SLT 10]  Key terms are usually focused on small number of topics  Prosodic Features [Chen & Lee, ICASSP 12]  slightly lower speed, higher energy, wider pitch range  Machine Learning methods  Input: a term, output: key term or not [Liu, SLT 08][Chen & Lee, SLT 10]
  • 246.
Key Term Extraction – Deep Learning  The document words x1 … xT are mapped by an embedding layer to vectors V1 … VT, attention weights α1 … αT are computed, and the weighted sum Σ αi Vi is passed through hidden and output layers to predict keywords from a keyword set (SVM, Regression, Python, DNN, Fourier Transform, Speech Processing, LSTM, Bubble Sort, etc.) [Shen & Lee, Interspeech 16]
  • 247.
    Title Generation  DeepLearning based Approach [Alexander M Rush, EMNLP 15][Chopra, NAACL 16][Lopyrev, arXiv 2015][Shen, arXiv 2016]  Sequence-to-sequence learning  Input: a document (word sequence), output: its title (shorter word sequence) https://arxiv.org/pdf/1512.01712v1.pdf
  • 248.
New Direction 5-3: Speech Content is Difficult to Browse! — Visualizing Retrieved Results
  • 249.
    Introduction  Visualizing theretrieval results on an intuitive interface helps users know what is retrieved  Take retrieving on-line lectures as example  Searching spoken lectures is a very good application for spoken content retrieval  The speech of the instructors conveys most knowledge in the lectures
  • 250.
    Retrieving One Course NTU Virtual Instructor Searching the course Digital Speech Processing of NTU
  • 251.
Massive Open On-line Courses (MOOCs)  An enormous number of on-line courses
  • 252.
Today’s Retrieval Techniques  A list of related courses
  • 253.
  • 254.
More is less……  Given all the related lectures from different courses, the learner asks: “Which lecture should I go to first?”  Learning Map  Nodes: lectures on the same topics  Edges: suggested learning order [Shen & Lee, Interspeech 15]
  • 255.
  • 256.
    Lectures in thesame topic same topic?  Compute the similarity of each pair of lectures  Lexical and topical similarity of the audio transcriptions  Lexical similarity and syntactic parsing tree similarity of the titles of the lectures  Weighted sum the similarity measures
  • 257.
Lectures on the Same Topic  (figure: two courses with lectures a1 a2 a3 and b1 b2 b3; some pairings of their lectures are more likely to be on the same topic, others less likely)
  • 258.
  • 259.
Prerequisite  Is a lecture in one course a prerequisite of a lecture in a different course? Learn a binary classifier  Training data: lectures in existing courses  Assumption: within an existing course ordered Lecture 1, Lecture 2, Lecture 3, …, Lecture 1 is a prerequisite of Lecture 2, Lecture 2 is a prerequisite of Lecture 3, ……
  • 260.
  • 261.
    Vision: Personalized Courses Iwant to learn “XXX”. I am a graduate student of computer science. I can spend 10 hours. Learner I open a course for you. on-line learning material  Spoken Language Processing techniques can be very helpful.  The spoken content in courses plays the most important role in conveying the knowledge.
  • 262.
New Direction 5-4: Speech Content is Difficult to Browse! — Speech Question Answering
  • 263.
    Speech Question Answering Machine answers questions based on the information in spoken content What is a possible origin of Venus’ clouds? volcanic activity
  • 264.
    Speech Question Answering Reference: 6 Spoken Question Answering (Sophie Rosset, Olivier Galibert and Lori Lamel). G. Tur and R. DeMori, Spoken Language Understanding: Systems for Extracting Semantic Information from Speech.  Question Answering in Speech Transcripts (QAST) has been a well-known evaluation program of SQA.  2007, 2008, 2009  Focused on factoid questions in the previous study  E.g. “What is name of the highest mountain in Taiwan?”.  How about more difficult questions?
  • 265.
    Preliminary Study  TOEFLListening Comprehension Test by Machine Question: “ What is a possible origin of Venus’ clouds? ” Audio Story: Choices: (A) gases released as a result of volcanic activity (B) chemical reactions caused by high surface temperatures (C) bursts of radio energy from the plane's surface (D) strong winds that blow dust into the atmosphere (The original story is 5 min long.) [Tseng & Lee, Interspeech 16]
  • 266.
Simple Baselines  Experimental setup: 717 questions for training, 124 for validation, 122 for testing  (figure: accuracy (%) of naive approaches (1)–(7), including random guessing, selecting the choice semantically most similar to the others, and selecting the shortest choice as the answer) [Tseng & Lee, Interspeech 16]
  • 267.
Model Architecture  Question: “What is a possible origin of Venus’ clouds?” → semantic analysis → question semantics  Audio story → speech recognition → semantic analysis (transcription excerpt: “…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano erupt on earth ……”) → answer representation  Select the choice most similar to the answer (a rough sketch follows)
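A rough sketch of the final choice-selection step, assuming sentence and choice embeddings are already produced by some semantic analysis; the question-weighted average below is only a crude stand-in for the attention mechanism in the cited model.

```python
import numpy as np

def answer_by_similarity(question_vec, choice_vecs, story_vecs):
    """Pick the choice most similar to an "answer" representation of the story."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    # attention-like weights: how relevant each story sentence is to the question
    weights = np.array([cos(question_vec, s) for s in story_vecs])
    weights = np.exp(weights) / np.exp(weights).sum()
    answer_vec = (weights[:, None] * np.asarray(story_vecs)).sum(axis=0)
    scores = [cos(answer_vec, c) for c in choice_vecs]
    return int(np.argmax(scores))   # index of the selected choice

# Toy 3-d embeddings for 2 story sentences and 4 answer choices
story = [np.array([1.0, 0.2, 0.0]), np.array([0.1, 0.9, 0.3])]
choices = [np.array([1.0, 0.3, 0.1]), np.array([0.0, 1.0, 0.0]),
           np.array([0.2, 0.2, 0.9]), np.array([0.5, 0.5, 0.5])]
question = np.array([0.9, 0.1, 0.0])
print(answer_by_similarity(question, choices, story))
```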
  • 268.
Experimental Results  Memory Network (proposed by the Facebook AI group): 39.2%  Word-based attention: 48.3%  (figure: accuracy (%) of the naive approaches (1)–(7) compared with these models) [Tseng & Lee, Interspeech 16]
  • 269.
  • 270.
    Conclusion Remarks  Newresearch directions for spoken content retrieval  Modified ASR for Retrieval Purposes  Incorporating Those Information Lost in ASR  No Speech Recognition!  Special Semantic Retrieval Techniques for Spoken Content  Spoken Content is Difficult to Browse!
  • 271.
Take-Home Message  Spoken Content Retrieval ≠ Speech Recognition + Text Retrieval
  • 272.
    Spoken Content Retrieval Spoken content retrieval: Machine listens to the data, and extract the desired information for each individual user.  Nobody is able to go through the data. 300 hrs multimedia is uploaded per minute. (2015.01) 1874 courses on coursera (2016.04)  In these multimedia, the spoken part carries very important information about the content • Just as Google does on text data
  • 273.
    Overview Paper  Lin-shanLee, James Glass, Hung-yi Lee, Chun-an Chan, "Spoken Content Retrieval — Beyond Cascading Speech Recognition with Text Retrieval," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, no.9, pp.1389-1420, Sept. 2015  http://speech.ee.ntu.edu.tw/~tlkagk/paper/Over view.pdf  This tutorial includes updated information after this paper is published.
  • 274.
Thank You for Your Attention

Editor's Notes

  • #5 Course recordings, TV programs
  • #6 Today the Internet has become an everyday part of human life. Internet content is indexed, retrieved, searched, and browsed primarily based on text. Although multimedia Internet content is growing rapidly, with shared videos, social media, broadcasts, etc., as of today, it still tends to be processed primarily based on the textual descriptions of the content offered by the multimedia providers.
  • #8 Find the audio files that contain only “Obama” but not “US President”
  • #12 1:00
  • #13 1:00
  • #14 Speak more slowly 4:20
  • #15 Not going to review the Babel !!!!!
  • #21 1:30
  • #22 1:30
  • #25 I have to explain posterior probability
  • #26 I have to explain posterior probability
  • #27 I have to explain posterior probability
  • #28 http://www.itl.nist.gov/iad/mig/publications/storage_paper/Interspeech07-STD06-v13.pdf https://www.icsi.berkeley.edu/pubs/speech/taoatwv13.pdf if we weight each query by its occurrence times in the audio collections (the queries appear frequently in the corpus may have larger probabilities to be requested), then we have Actual Occurrence-Weighted Value. Pick a threshold (how to pick threshold has large influence to the results)
  • #35 1:20
  • #45 1:30 = 7:00
  • #46 In Chinese: going beyond the existing framework Last: 10:00
  • #47 6:00
  • #49 Interactive spoken content retrieval Summarization & Key term extraction Organizing retrieved results Question answering Interactive retrieval, Summarization, Key term extraction, Organizing retrieved results, Question answering Visualization
  • #73 0:40
  • #77 97 135 110
  • #84 Fuji-san 夫機桑 Naoyuki Kanda, Ryu Takeda, Yasunari Obuchi: Using rhythmic features for Japanese spoken term detection. SLT 2012: 170-175
  • #100 3:00
  • #102 Not select examples Formulated as a problem on graph
  • #121 7:20
  • #122 14% is semi, with one label data [Kartik Audhkhasi, ICASSP, 2014] http://www.redes.unb.br/lasp/files/events/ICASSP2014/papers/p7919-audhkhasi.pdf
  • #124 [Hazen, ASRU 09][Zhang & Glass, ASRU 09][Zhang & Glass, ICASSP 11][Zhang & Glass, Interspeech 11][Zhang, Salakhutdinov, ICASSP 12]
  • #125 We have to mention some evolutions here!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!????????????????????? Also known as unsupervised spoken term detection, zero-resource spoken content retrieval, etc. Evaluation program
  • #130 How’s the comparison result?
  • #132 https://www.semanticscholar.org/paper/Integrating-frame-based-and-segment-based-dynamic-Chan-Lee/7d8eaf612500420a125b3cfca98701114cf68873/pdf
  • #134 How do they do
  • #137 With color
  • #138 “Segmental acoustic indexing for zero resource keyword search https://www.semanticscholar.org/paper/Segmental-acoustic-indexing-for-zero-resource-Levin-Jansen/a11b4da42e6392e8dba76ec6175da856a2e143b4/pdf Indexing raw acoustic features for scalable zero resource search,” http://ttic.uchicago.edu/~haotang/speech/IS2012a.pdf
  • #166 encyclopedia https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester2201213/ICASSP2013b.pdf
  • #170 Consider as P35
  • #180 Transportation 交通工具
  • #185 Consider as P35
  • #187 as spoken term detection
  • #195 How to do acoustic pattern discovery will be covered in the following talk; we just focus on what happens after we get the patterns — how can they be helpful in semantic retrieval
  • #196 “lexicon” (dictionary)
  • #202 5:00
  • #211 QA doen Prere Summary RL
  • #214 52:30 02:00
  • #226 We want the Return as large as possible
  • #229 [Cronen-Townsen, SIGIR 02] [Zhou, SIGIR 07] A set of candidate actions System: “More precise, please.” System: “Is it relevant to XXX?” …..
  • #230 [Cronen-Townsen, SIGIR 02] [Zhou, SIGIR 07] A set of candidate actions System: “More precise, please.” System: “Is it relevant to XXX?” …..
  • #231 [Cronen-Townsen, SIGIR 02] [Zhou, SIGIR 07] A set of candidate actions System: “More precise, please.” System: “Is it relevant to XXX?” …..
  • #232 Put More Results?
  • #233 52:30 02:00
  • #243 The relation between the training data and the structure is learned from the training data
  • #249 52:30 02:00
  • #251 Retrieving online lectures is never true But only search one course
  • #253 On the well-known online lecture platform Type language model into the look up wind
  • #254 Video clips The system can find the lectures related to the learning need. Although this on-line lecture platform does not do that, it is possible to do so
  • #256 Identify the lectures teaching the same content
  • #260 Feature representation:
  • #261 52:30 02:00
  • #263 52:30 02:00
  • #265 information retrieval (IR) techniques [3] or relied on knowledge bases [4] May be answered by simply extracting the key terms from a properly chosen utterance [3] S.-R. Shiang, H.-y. Lee, and L.-s. Lee, “Spoken question answering using tree-structured conditional random fields and two-layer random walk.” in INTERSPEECH, 2014, pp. 263–267. [4] B. Hixon, P. Clark, and H. Hajishirzi, “Learning knowledge graphs for question answering through conversational dialog.” [5] P. R. Comas, J. Turmo, and L. M`arquez, “Sibyl, a factoid question-answering system for spoken documents,” ACM Trans. Inf. Syst., 2012. [6] J. Turmo, P. R. Comas, S. Rosset, O. Galibert, N. Moreau, D. Mostefa, P. Rosso, and D. Buscaldi, Multilingual Information Access Evaluation I. Text Retrieval Experiments: 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009, Corfu, Greece, September 30 - October 2, 2009, Revised Selected Papers. Springer Berlin Heidelberg, 2010, ch. Overview of QAST 2009, pp. 197–211. [7] J. Turmo, P. Comas, S. Rosset, L. Lamel, N. Moreau, and D. Mostefa, “Overview of QAST 2008,” in Working Notes for the CLEF 2008 Workshop,, 2008. [8] D. Giampiccolo, P. Forner, J. Herrera, A. Pe˜nas, C. Ayache, C. Forascu, V. Jijkoun, P. Osenova, P. Rocha, B. Sacaleanu, and R. Sutcliffe, Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Evaluation Forum. Springer Berlin Heidelberg, 2008, ch. Overview of the CLEF 2007 Multilingual Question Answering Track, pp. 200– 236.
  • #268 Memory network proposed by FB’s AI team
  • #273 Today the Internet has become an everyday part of human life. Internet content is indexed, retrieved, searched, and browsed primarily based on text. Although multimedia Internet content is growing rapidly, with shared videos, social media, broadcasts, etc., as of today, it still tends to be processed primarily based on the textual descriptions of the content offered by the multimedia providers.