Focus on spoken content in multimedia retrieval

Maria Eskevich
Centre for Next Generation Localisation
School of Computing, Dublin City University,
Dublin, Ireland
April 16, 2013

Outline
Spoken Content Retrieval: historical perspective
MediaEval Benchmark:
3 years of Spoken Content Retrieval experiments:
Rich Speech Retrieval and Search and Hyperlinking tasks
Dataset collection creation issues for multimedia retrieval:
crowdsourcing aspect
Interesting observations on results:
Segmentation methods
Evaluation metrics
Numbers

Outline
Interesting observations on results: segmentation aspect

Information Retrieval (IR)
Speech Processing (Automatic Speech Recognition (ASR))

Standard IR System
[Diagram: Information Request, Queries, IR System (IR Model, Indexed Documents), Retrieval, Results; Audio Data Collection, ASR System, Transcripts of Audio Data]

Spoken Content Retrieval (SCR)
[Diagram: Information Request, Queries, SCR System (IR Model, Indexed Documents, Indexed Transcripts), Retrieval, Audio Files]

Outline: Spoken Content
[Diagram: Spoken Content, ASR System, ASR Transcript, Indexing, Indexed Transcript, Retrieval, Ranked Result List (1, 2, ...); parts of the talk marked on the diagram: Data, Experiments, Evaluation Metrics]
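
The stages named on this slide (ASR transcript, indexing, retrieval into a ranked result list) can be illustrated with a toy sketch. This is only an illustration, not the system used in the talk: the transcripts are assumed to be already produced by some ASR system, and the function names, the whitespace tokenizer, and the TF-IDF weighting are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer (assumption; real systems normalize further).
    return text.lower().split()


def build_index(transcripts):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in transcripts.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Return doc ids ranked by a simple TF-IDF score (highest first)."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Ties are broken by doc id so the ranking is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical ASR transcripts keyed by audio file id.
transcripts = {
    "a": "weather forecast rain tomorrow",
    "b": "football match results",
    "c": "rain rain and more rain",
}
index = build_index(transcripts)
ranked = search(index, len(transcripts), "rain forecast")
```

ASR errors enter this pipeline at the transcript stage: a misrecognized word never reaches the index, so it cannot be matched by the query.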

Spoken Content
Prepared Speech
Informal Conversational Speech
Broadcast News
Lectures
Meetings
Informal Content
Internet TV, Podcast, Interview
Broadcast News:
Data
High quality recordings:
Often soundproof studio
Speaker - professional presenter
Well-defined structure
Query is on a certain topic:
User is ready to listen to the whole section
Experiments: TREC SDR (1997-2000)
Known-item search and ad-hoc retrieval
Search with and without fixed story boundaries
Evaluation: interest in rank position
HIGHLIGHT: "Success story" (Garofolo et al., 2000):
Performance on ASR Transcript ≈ Manual Transcript
ASR good: large amounts of training data
Data structure
CHALLENGE:
Speech data in broadcast news is close to written text, and differs from the informal content of spontaneous speech

Spoken Content
Prepared Speech
Informal Conversational Speech
Broadcast News
Lectures
Meetings
Informal Content
Internet TV, Podcast, Interview
Lectures:
Data:
Prepared presentations containing conversational style features: hesitations, mispronunciations
Specialized vocabulary
Out-Of-Vocabulary words
Lecture-specific words may have low probability scores in the ASR language model
Additional information available: presentation slides, textbooks
Experiments:
Lectures browsing: e.g. TalkMiner, MIT lectures, eLectures
SpokenDoc(2) Tasks at NTCIR-9, NTCIR-10: e.g. IR experiments, evaluation metrics that assess topic segmentation methods
HIGHLIGHT/CHALLENGE:
Focus on segmentation methods, jump-in points

Spoken Content
Prepared Speech
Informal Conversational Speech
Broadcast News
Lectures
Meetings
Informal Content
Internet TV, Podcast, Interview
Meetings:
Data features:
Mixture of semi-formal and prepared spoken content
Additional data: slides, minutes
Possible real life motivated scenario:
Jump-in points where discussion on topic started or a decision point is reached
Opinion of a certain person or person with a certain role
Search for all relevant (parts of) meetings where topic was discussed
Experiments:
topic segmentation, browsing
summarization
No unified search scenario
We created a test retrieval collection on the basis of the AMI corpus and set up a task scenario ourselves

Spoken Content
Prepared Speech
Informal Conversational Speech
Broadcast News
Lectures
Meetings
Informal Content
Internet TV, Podcast, Interview
Informal Content (Interviews, Internet TV):
Data features:
Varying quality: semi- and non-professional data creators
Additional data: professionally or user-generated metadata
Experiments:
CLEF CL-SR: MALACH collection
un/known-boundaries, ad-hoc task
MediaEval'11,'12,'13: retrieval of semi-professional multimedia content
known-item task, unknown boundaries
Metrics: focus on ranking and penalize distance from the jump-in point
Metric does not always take into account how much time the user needs to spend listening to access the relevant content
Diversity of the informal multimedia content
Search scenario no longer limited to factual information

Spoken Content
Prepared Speech
Informal Conversational Speech
Broadcast News
Lectures
Meetings
Internet TV, Podcast, Interview
Review of the challenges/our work for Informal SCR:
Framework of retrieval experiment has to be set up: retrieval collections to be created
Our work: We collected new multimodal retrieval collections via crowdsourcing
ASR errors decrease IR results
Our work: We examined more closely the relationship between ASR performance and result ranking
Suitable segmentation is vital
Our work: We carry out experiments with varying methods
Need for metrics that reflect all aspects of user experience
Our work: We created a new set of metrics

Outline
Rich Speech Retrieval and Search and Hyperlinking
tasks
Evaluation metrics
Numbers

MediaEval
Multimedia Evaluation benchmarking initiative
Evaluate new algorithms for multimedia access and retrieval.
Emphasize the "multi" in multimedia: speech, audio, visual content, tags, users, context.
Innovate new tasks and techniques focusing on the human and social aspects of multimedia content.

MediaEval 2011
Rich Speech Retrieval (RSR) Task
Task Goal:
Information to be found - combination of required
audio and visual content, and speaker’s intention

MediaEval 2011
Task Goal:
Transcript 1 = Transcript 2
Meaning 1 = Meaning 2
Conventional retrieval

MediaEval 2011
Task Goal:
Transcript 1 = Transcript 2
Meaning 1 = Meaning 2
Speech act 1 = Speech act 2
Extended speech retrieval

MediaEval 2012-2013:
Search and Hyperlinking (S&H) Task Background

MediaEval 2012-2013:
S&H Task

MediaEval 2012-2013: S&H Task and Crowdsourcing

Outline
Dataset collection creation issues for multimedia
retrieval: crowdsourcing aspect
Evaluation metrics
Numbers

What is crowdsourcing?
Crowdsourcing is a form of human computation.
Human computation is a method of having people do
things that we might consider assigning to a computing
device, e.g. a language translation task.
A crowdsourcing system facilitates a crowdsourcing
process.

Factors to take into account:
Sufficient number of workers
Level of payment
Clear instructions
Possible cheating

Results assessment
Number of accepted HITs = number of collected queries
No overlap of workers in dev and test sets
Creative work - Creative Cheating:
Copy and paste provided examples
-> Examples should be pictures, not texts
Choose the option of no speech act found in the video
-> Manual assessment by requester needed
Workers rarely find noteworthy content later than the third minute from the start of playback point in the video

Crowdsourcing issues for multimedia retrieval collection creation
It is possible to crowdsource extensive and complex tasks to support speech and language resources
Use concepts and vocabulary familiar to the workers
Pay attention to technical issues of watching the video
Video preprocessing into smaller segments
Creative work demands higher reward level, or just more flexible system
High level of wastage due to task complexity

Outline
Evaluation metrics
Numbers

Dataset segment representation

Approach 1: Fixed length segmentation
Fixed length segmentation
Number of words (including/excluding stop words)
Time slots

Fixed length segmentation with sliding window:
Post-processing:
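
As an illustration of fixed length segmentation over time slots with a sliding window, here is a minimal sketch. The input format (a time-sorted list of `(start_time_sec, token)` pairs), the function name, and the window and step values are assumptions for the example, not the settings used in the experiments.

```python
def sliding_window_segments(words, window, step):
    """Cut a timed transcript into overlapping fixed-length time segments.

    words:  list of (start_time_sec, token) pairs sorted by time (assumed format)
    window: segment length in seconds
    step:   shift between consecutive segment starts; step < window gives overlap
    Returns a list of (segment_start, segment_end, tokens); empty windows are skipped.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if not words:
        return []
    segments = []
    last_time = words[-1][0]
    start = 0.0
    while start <= last_time:
        end = start + window
        tokens = [tok for t, tok in words if start <= t < end]
        if tokens:
            segments.append((start, end, tokens))
        start += step
    return segments


# Hypothetical timed transcript.
words = [(0.0, "a"), (10.0, "b"), (20.0, "c"), (35.0, "d")]
segs = sliding_window_segments(words, window=20.0, step=10.0)
```

Because consecutive windows overlap, the same relevant passage can appear in several segments, which is one reason a post-processing step over the ranked list may be needed.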

Approach 2: Flexible length segmentation
Speech or Video units of varying length
Speech: sentence, speech segment, silence points, changes of speakers
Video: shots
Topical segmentation
Lexical cohesion - C99, TextTiling
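
The idea behind lexical cohesion segmentation can be sketched as follows. This is a deliberately simplified illustration and is neither C99 nor TextTiling as published: it compares bag-of-words vectors of the sentence blocks on either side of each gap and places a boundary where their cosine similarity falls below a fixed threshold. The function names, the block size, and the threshold are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cohesion_boundaries(sentences, block=2, threshold=0.1):
    """Return gap indices i such that a topic boundary is placed before sentence i."""
    bags = [Counter(s.lower().split()) for s in sentences]
    boundaries = []
    for gap in range(1, len(bags)):
        # Merge up to `block` sentences on each side of the gap.
        left = sum(bags[max(0, gap - block):gap], Counter())
        right = sum(bags[gap:gap + block], Counter())
        if cosine(left, right) < threshold:
            boundaries.append(gap)
    return boundaries


sentences = [
    "the cat sat on the mat",
    "the cat chased a mouse",
    "stock markets fell sharply today",
    "investors sold stock quickly",
]
boundaries = cohesion_boundaries(sentences)
```

On ASR transcripts the vocabulary overlap between blocks is affected by recognition errors, so thresholds tuned on written text may not transfer directly.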

Outline
Evaluation metrics
Numbers

Evaluation: Search sub-task

Mean Reciprocal Rank (MRR):
$RR = \frac{1}{RANK}$
Mean Generalized Average Precision (mGAP):
$GAP = \frac{1}{RANK} \cdot PENALTY$
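
A small sketch of how these two scores could be computed for a known-item task. The slide does not define PENALTY, so the linear penalty below (decreasing from 1 to 0 as the distance between the retrieved start point and the true jump-in point grows to `window_sec`) is an assumption, as are the function names and the 60-second window.

```python
def reciprocal_rank(rank):
    """RR = 1 / RANK; rank is None when the known item was not retrieved."""
    return 0.0 if rank is None else 1.0 / rank


def linear_penalty(distance_sec, window_sec=60.0):
    """Assumed penalty: 1 at the exact jump-in point, 0 at window_sec or beyond."""
    return max(0.0, 1.0 - abs(distance_sec) / window_sec)


def generalized_ap(rank, distance_sec, window_sec=60.0):
    """GAP = (1 / RANK) * PENALTY for a single query."""
    if rank is None:
        return 0.0
    return reciprocal_rank(rank) * linear_penalty(distance_sec, window_sec)


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical results for three queries: (rank of the relevant segment,
# distance in seconds between retrieved start and true jump-in point).
results = [(1, 0.0), (2, 30.0), (None, 0.0)]
mrr = mean(reciprocal_rank(rank) for rank, _ in results)
mgap = mean(generalized_ap(rank, dist) for rank, dist in results)
```

With any penalty in the range 0 to 1, mGAP can never exceed MRR, and the gap between the two indicates how far the retrieved start points are from the jump-in points.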

Mean Average Segment Precision (MASP):
Ranking + Length of (ir)relevant content
Segment Precision (SP[r]) at rank r:

Average Segment Precision:
$ASP = \frac{1}{n} \sum_{r=1}^{N} SP[r] \cdot rel(s_r)$
$rel(s_r) = 1$ if relevant content is present, otherwise $rel(s_r) = 0$
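
The ASP formula can be illustrated with a short sketch. The definition of SP[r] is not preserved on the slide, so the code assumes SP[r] is the total length of relevant content retrieved up to rank r divided by the total length of all segments retrieved up to rank r. It also assumes n is the number of retrieved segments that contain relevant content. Both are assumptions, as are the function name and the input format.

```python
def average_segment_precision(ranked):
    """Compute ASP for one query.

    ranked: list of (segment_length_sec, relevant_length_sec) in rank order,
            where relevant_length_sec <= segment_length_sec (assumed format).
    """
    # n: number of retrieved segments containing relevant content (assumption).
    n = sum(1 for _, rel_len in ranked if rel_len > 0)
    if n == 0:
        return 0.0
    total_len = 0.0
    total_rel = 0.0
    asp = 0.0
    for seg_len, rel_len in ranked:
        total_len += seg_len
        total_rel += rel_len
        if rel_len > 0:  # rel(s_r) = 1
            asp += total_rel / total_len  # assumed SP[r]
    return asp / n


# Hypothetical ranked list: 30 s relevant in a 60 s segment, an irrelevant
# 60 s segment, then a fully relevant 40 s segment.
ranked = [(60.0, 30.0), (60.0, 0.0), (40.0, 40.0)]
asp = average_segment_precision(ranked)
```

MASP would then be the mean of ASP over all queries. Under these assumptions long segments with little relevant content lower the score even when they are ranked first, which is the listening-time effect the metric is meant to capture.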

Focus on Precision/Recall of the relevant content within the
retrieved segment.

Outline
Evaluation metrics
Numbers

Experiments (RSR): Spontaneous Speech Search
Relationship Between
Retrieval Effectiveness and Segmentation Methods
Segment:
100 % Recall of the relevant content
High Precision (30, 56 %) of the relevant content
Topic consistency

Experiments (S&H)
Fixed length segmentation with sliding window
2 transcripts (LIMSI, LIUM)
LIMSI LIUM

Segmentation requirements for effective SCR
Segmentation plays a significant role in retrieving relevant content.
High recall and precision of the relevant content within the segment lead to good segment ranking.
Related metadata can be useful to improve the ranking of segments that have high recall but also contain non-relevant content.
Influence of ASR quality:
The effect of errors is not straightforward; it can be smoothed by the use of context and by query-dependent treatment of the transcript.
ASR system vocabulary variability: longer segments have higher MRR scores with the transcript of lower language variability (LIMSI), whereas shorter segments perform better with the transcript of higher language variability (LIUM).
Multimodal queries: addition of visual information decreases performance.

Thank you for your attention!
Questions?
