Doing Something We Never Could
with
Spoken Language Technologies
Lin-shan Lee
National Taiwan University
• Some research effort tries to do something Better
– having aircraft fly faster
– having images look more beautiful
– always very good
• Some tries to do something we Never could
– developing the Internet to connect everyone over the world
– selecting information from Internet with Google
– usually challenging
What and Why
• Those we could Never do before
– very far from realization
– wait for the right industry to appear at the right time
– new generations of technologies used very different from the
earlier solutions found in research
• Interesting and Exciting !
What and Why
• Three examples
– (1) Teaching machines to listen to Mandarin Chinese
– (2) Towards a Spoken Version of Google
– (3) Unsupervised ASR (phoneme recognition)
(1) Teaching Machines to Listen to
Mandarin Chinese
Talk to Machines in Writing – Typewriting
• English typewriter
• Chinese typewriter
Chinese Language is Not Alphabetic
• Every Chinese character is a square graph
– many thousands of such characters
Age of Chinese Typewriters (1980)
• Many people tried to represent Chinese characters by
code sequences
– radicals (字根) : 口 火 木 山 人 手
– phonetic symbols
– corner codes (四角號碼)
• A Voice-driven Typewriter for Chinese ?
– monosyllable per character
– total number of distinct monosyllables limited
– something we could Never do before
• To Teach Machines to Listen to Mandarin ?
– too difficult (1980)
– to teach machines to speak Mandarin first
→Speech Synthesis (TTS)
Hardware-assisted Voice Synthesizer
• Real-time requirement
– producing 1 sec of voice data within 1 sec
– from Linear Predictive Coding (LPC) coefficients
• Bit-slice microprocessor far too weak
• Completed 1981, used in next several years
Initial Effort in Chinese Text-to-speech Synthesis
• Calculated and stored the LPC coefficients and tone
patterns for all Mandarin monosyllables
– concatenating isolated monosyllables into utterances directly
– Never worked, stuck for long
• Learned from a linguist
– the prosody (pitch, energy, duration, etc.) of the same
monosyllable is different in different context in continuous
speech (context dependency)
– made sentences, recorded voice, analyzed the prosody
– prosody rules for each monosyllable in continuous speech based
on context
– synthesized voice very poor, but intelligible
– data science in early years
• Lesson learned
– linguistics important for engineers
Prof. Chiu-yu Tseng
Mandarin Prosody Rules
• Complete prosody rules (1984)
[ICASSP 1986][IEEE Trans ASSP 1989]
Tone Sandhi Rule Example
• Concatenation rule for Tone 3: two adjacent Tone-3 syllables become Tone 2 followed by Tone 3 (3 3 → 2 3)
• Example sentence: 我 有 好 幾 把 小 雨 傘
[Figure: syntactic parse tree of the sentence, with the tone values before and after applying the rule]
– the rule applies across certain syntactic boundaries, but not others
[ICASSP 1986][IEEE Trans ASSP 1989]
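A minimal sketch of the Tone-3 sandhi rule stated above (3 3 → 2 3), assuming the syntactic/prosodic grouping of syllables is already given; the rules in [ICASSP 1986][IEEE Trans ASSP 1989] condition on richer syntactic structure than this simplification.

```python
def apply_tone3_sandhi(tones, groups):
    """tones: tone number per syllable, e.g. [3, 3, 2, 3]
    groups: prosodic/syntactic group id per syllable; the rule is applied
    only within a group (a stand-in for the syntactic-boundary conditions)."""
    out = list(tones)
    for i in range(len(tones) - 1):
        # a Tone-3 syllable followed by another Tone-3 syllable in the same
        # group surfaces as Tone 2 (a simplification of the real rule ordering)
        if tones[i] == 3 and tones[i + 1] == 3 and groups[i] == groups[i + 1]:
            out[i] = 2
    return out

# 我 有 好 幾 把 小 雨 傘: all underlying Tone 3, with one illustrative grouping
# (not necessarily the parse shown on the slide)
print(apply_tone3_sandhi([3] * 8, [0, 0, 1, 1, 1, 2, 2, 2]))   # [2, 3, 2, 2, 3, 2, 2, 3]
```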
Chinese Text-to-speech Synthesis
• Concatenating stored monosyllables with adjusted prosody
(1984)
– reproduced from a tape in 1984
[ICASSP 1986] [IEEE Trans ASSP 1989]
Speech Prosody Depends on Sentence Structure
• Chinese Sentence Grammar and Parser (1986)
• Empty category
被李四叫去吃飯的小孩
(the children who are asked to go to dinner by Lisi (李四))
[AAAI 1986] [Computational Linguistics 1991]
– natural language
processing in early
years
Mandarin Speech Recognition
• Very Large Vocabulary and Unlimited Text
• Isolated Monosyllable Input
– each character pronounced as a monosyllable
– limited number of distinct monosyllables
– a voice-driven typewriter
• Decomposing the problem
– recognition of consonants/vowels/tones
– identifying the character out of many homonym characters
• System Integration very hard
– software very weak, each part assisted by different hardware
[Concept IJCAI 1987] [Initial Results ICASSP 1990]
An Example Circuit Diagram
• 1989
Photo of Completed Hardware
• Completed in 1989, but Never worked
Chip Design
• LPC Analysis Chip and Viterbi Chip (1992)
– far from a complete system
Chip Design
• Implemented with Transputer (10 CPUs in parallel,
everything in software) purchased in 1990
• March 1992
• 金聲玉振,金玉之聲 (roughly: "the sound of metal bells and jade chimes; a voice of gold and jade")
• Isolated monosyllable input
• Several seconds per monosyllable
• Lesson learned: software more powerful than hardware
Golden Mandarin I (金聲一號 )
[ICASSP 1993] [IEEE Trans SAP 1993]
Golden Mandarin III (金聲三號)
• March 1995
• Continuous read speech input
• Limited by computational resources
– Version(a) on PC 486 for short utterance input
– Version(b) on Workstation for long utterance input
[ICASSP 1995] [IEEE Trans SAP 1997] [IEEE SP Magazine 1997]
Mini-Remarks
• Today, a dream of many years ago has been realized by the right
industry at the right time
– with new generations of technologies
• No need to worry about realization during research
– someone will solve the problem in the future
(2) Towards
a Spoken Version of Google
Spoken Content Retrieval
• Spoken term detection
– to detect if a target term was spoken in any of the utterances
in an audio data set
[Figure: utterances in the audio archive scored for similarity against the target term "COVID-19"]
Spoken Content Retrieval – Basic Approach
[Block diagram: spoken content → Recognition Engine (Acoustic Models, Lexicon, Language Model) → transcriptions → text-based Search Engine, which answers the user's query with retrieval results]
• Transcribe the spoken content
• Search over the transcriptions as if they were text
• Recognition errors cause serious performance degradation
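A minimal sketch of this cascaded baseline, using openai-whisper purely as a modern stand-in recognizer (not the engines discussed here); file names and the query are placeholders. Any word the recognizer gets wrong is simply invisible to the text search, which is the degradation noted above.

```python
import whisper   # openai-whisper, used only as a modern stand-in ASR

# Transcribe the spoken content once, then search the transcriptions as ordinary text.
model = whisper.load_model("base")
archive = {f: model.transcribe(f)["text"].lower() for f in ["talk1.wav", "talk2.wav"]}

def spoken_term_detection(query):
    # any recognition error in the transcription directly hurts retrieval
    return [f for f, text in archive.items() if query.lower() in text]

print(spoken_term_detection("COVID-19"))
```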
• Low recognition accuracies for speech signals under
various acoustic conditions
– considering lattices rather than 1-best output
– lattices of subword units
Lattices for Spoken Content Retrieval
[Figure: a word lattice between the start node and the end node along the time index; Wi denote word hypotheses]
[Ng 2000, Scarclar 2004, Chelba 2005, etc.]
• Recognition stage cascaded with retrieval stage
– retrieval performance limited by recognition accuracy
• Considering recognition and retrieval processes as a
whole (2010)
– acoustic models re-estimated by optimizing overall retrieval
performance
ASR Error Problem Comes from Cascading Two
Stages
[Block diagram: spoken content → Recognition Engine (Acoustic Models) → lattices → Search Engine (Retrieval Model) → retrieval output for the user's query Q; the recognition and retrieval stages are cascaded]
[ICASSP 2010]
• End-to-End Spoken Content Retrieval in early years
What can Spoken Content Retrieval do for us ?
• Google reads all text over the Internet
– can find any text over the Internet for the user
• All Roles of Text can be realized by Voice
• Machines can listen to all voices over the Internet
– can find any utterances over the Internet for the user
• A Spoken Version of Google
• Machines may be able to listen to and understand the entire
multimedia knowledge archive over the Internet
– extracting desired information for each individual user
300hrs of videos
uploaded per min
(2015.01)
Roughly 2000 online
courses on Coursera
(2016.04)
• Multimedia Content exponentially increasing over the Internet
– best archive of global human knowledge is here
– desired information deeply buried under huge quantities
of unrelated information
• Nobody can go through so much multimedia information, but Machines can
What can we do with a Spoken Version of Google ?
A Target Application Example : Personalized
Education Environment
• For each individual user
– user: "I wish to learn about Wolfgang Amadeus Mozart and his music.
I can spend 3 hrs to learn."
– system (drawing on information from the Internet): "This is the 3-hr
personalized course for you. I'll be your personalized teaching assistant.
Ask me when you have questions."
• Understanding, Summarization and Question Answering for
Spoken Content
– something we could Never do (even today)
– semantic analysis for spoken content
Probabilistic Latent Semantic Analysis (PLSA)
[Graphical model: documents Di, latent topics Tk, terms tj, with parameters P(Tk|Di) and P(tj|Tk), i.e. P(z|d) and P(w|z)]
– P(tj|Di) = Σk P(tj|Tk) P(Tk|Di)
• Unsupervised Learning of Topics from text corpus
d: document z: topic w: word
[Hofmann 1999]
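A compact numpy sketch of PLSA's EM updates for the mixture above, purely illustrative and not the toolkit used in the cited work; the term-document count matrix N and the number of topics K are the only inputs.

```python
import numpy as np

def plsa(N, K, iters=50, seed=0):
    """Minimal PLSA via EM. N: (D, V) term-count matrix; K: number of latent topics."""
    rng = np.random.default_rng(seed)
    D, V = N.shape
    p_w_z = rng.random((K, V)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # P(w|z)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)   # P(z|d)
    for _ in range(iters):
        # E-step: responsibilities P(z|d,w) ∝ P(w|z) P(z|d), shape (D, V, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post = joint / joint.sum(axis=2, keepdims=True).clip(min=1e-12)
        # M-step: re-estimate P(w|z) and P(z|d) from N(d,w) P(z|d,w)
        weighted = N[:, :, None] * post
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True).clip(min=1e-12)
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True).clip(min=1e-12)
    return p_w_z, p_z_d
```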
Latent Dirichlet Allocation (LDA)
𝑃(𝜑𝑘 | 𝛽): Dirichlet Distribution
𝑃(𝜃𝑚 | 𝛼): Dirichlet Distribution
𝑃(𝑧𝑚,𝑛 | 𝜃𝑚)
𝑃(𝑤𝑚,𝑛 | 𝑧𝑚,𝑛, 𝜑𝑘)
[Blei 2003]
• Unsupervised Learning of Topics from text corpus
Key Term Extraction and Summarization for
Spoken Content
• Key term extraction based on semantic features from
PLSA or LDA
– key terms usually focused on smaller number of topics
[Figure: distribution P(Tk|ti) over topics k – peaked over a few topics for a key term, flat over many topics for a non-key term]
[ICASSP 2006] [SLT 2010]
• Summarization
– selecting most representative utterances but avoiding redundancy
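A small sketch of the key-term criterion above (terms whose topic distribution P(Tk|ti) is concentrated on a few topics): score each term by the entropy of its topic distribution and keep low-entropy terms. The threshold and the dictionary format are assumptions; the cited work uses richer features than entropy alone.

```python
import numpy as np

def topic_entropy(p_topic_given_term):
    """Entropy of P(Tk|ti) over topics; key terms tend to have low entropy (peaked)."""
    p = np.asarray(p_topic_given_term, dtype=float)
    p = np.clip(p / p.sum(), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def select_key_terms(term_topic_probs, threshold):
    """term_topic_probs: dict term -> P(Tk|ti) vector from PLSA/LDA; keep low-entropy terms."""
    return [t for t, dist in term_topic_probs.items() if topic_entropy(dist) < threshold]
```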
• Constructing the Semantic Structures of the Spoken Content
• Example Approach 1: Spoken Content categorized by Topics
and organized in a Two-dimensional Tree Structure (2005)
– each category labeled by a set of key terms (topic) located on a map
– categories nearby on the map are more related semantically
– each category expanded into another map in the next layer
Semantic Structuring of Spoken Content (1/2)
[Eurospeech 2005]
• Broadcast News Browser (2006)
An Example of Two-dimensional Trees
[Interspeech 2006]
• Sequential knowledge transfer lecture by lecture
• When a lecture in an online course is retrieved for a
user
– difficult for the user to understand this lecture without
listening to previous related lectures
– not easy to find out background or related knowledge
Online Courses
• Example Approach 2: Key Term Graph (2009)
– each spoken slide labeled by a set of key terms (topics)
– relationships between key terms represented by a graph
Semantic Structuring of Spoken Content (2/2)
[Figure: spoken slides (plus audio/video), each labeled by key terms, linked into a key term graph with nodes such as Acoustic Modeling, Viterbi search, HMM, Language Modeling, Perplexity]
[ICASSP 2009][IEEE Trans ASL 2014]
• Very Similar to Knowledge Graph
• Course : Digital Speech Processing (2009)
– Query : “triphone”
– retrieved utterances shown with the spoken slides they belong to
specified by the titles and key terms
An Example of Retrieving with an Online Course
Browser (1/2)
[ICASSP 2009][IEEE Trans ASL 2014]
• User clicks to view the spoken slide (2009)
– including a summary, key terms and related key terms from the graph
– recommended learning path for a specific key term
[ICASSP 2009][IEEE Trans ASL 2014]
An Example of Retrieving with an Online Course
Browser (2/2)
A Huge Number of Online Courses
• A user enters a keyword or key phrase into Coursera – 752 matches returned
Having Machines Listen to all the Online Courses
Lectures with very
similar content
three courses on
some similar topic
[Interspeech 2015]
sequential order for
learning (prerequisite
conditions)
three courses on
some similar topic
[Interspeech 2015]
Having Machines Listen to all the Online Courses
Demonstration
Learning Map Produced by Machine
(2014)
[Interspeech 2015]
Question Answering
• Machine answering questions from the user
[Diagram: the user's question goes to a Question Answering system, which draws on a knowledge source (unstructured documents, a search engine, spoken content) to return the answer]
Text v.s. Spoken QA (Cascading v.s. End-to-end)
• Text QA: question → Question Answering over retrieved text → answer
• Spoken QA (cascading): retrieved spoken content → Speech Recognition (ASR) →
Question Answering → answer
– ASR errors propagate into the QA stage
• Spoken QA (end-to-end): question → End-to-end Spoken Question Answering → answer
[Interspeech 2020]
Audio-and-Text Jointly Learned SpeechBERT
• Pre-training: reconstruction of randomly masked input
• Fine-tuning: predicting the answer's start/end positions
[Interspeech 2020]
• End-to-end Globally Optimized for Overall QA Performance
– not limited by ASR errors (no ASR here)
– extracting semantics directly from speech, not from words via ASR
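SpeechBERT itself is described in [Interspeech 2020]; the sketch below only illustrates the generic start/end span-prediction head used when fine-tuning an extractive QA model, with the hidden dimension and the upstream audio-and-text encoder as assumptions.

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """Minimal span-prediction head: one start logit and one end logit per position."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 2)

    def forward(self, hidden_states, start_positions=None, end_positions=None):
        # hidden_states: (batch, seq_len, hidden_dim) from the joint encoder
        start_logits, end_logits = self.proj(hidden_states).split(1, dim=-1)
        start_logits, end_logits = start_logits.squeeze(-1), end_logits.squeeze(-1)
        if start_positions is None:
            return start_logits, end_logits
        loss_fn = nn.CrossEntropyLoss()
        return (loss_fn(start_logits, start_positions) +
                loss_fn(end_logits, end_positions)) / 2
```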
• The Internet is the single largest archive of global human
knowledge
– spoken language technologies offer a bridge to that knowledge
• Still a long way to go towards the personalized education
environment considered
– many small steps may lead to the final destination some day
– someone may realize it at the right time in the future
• Machines are capable of handling huge quantities of Data
– use machines for what they are especially capable of and
efficient at
Mini-Remarks
(3) Unsupervised ASR
(Phoneme Recognition)
Hung-yi Lee (left) and Lin-shan Lee
Supervised/Unsupervised ASR
• Supervised ASR
– has been very successful
– problem: requires a huge quantity of annotated data
(audio paired with text, e.g. "How are you.", "He thinks it's…", "Thanks for…")
• Unsupervised ASR
– train without annotated data: unpaired audio data and text data only
– unlabeled, unpaired data are easier to collect
– thousands of languages are spoken over the world; most are
low-resourced without enough annotated data
– something we could Never do before (2018)
Use of Generative Adversarial Networks (GAN)
[Diagram: acoustic features from audio data → Generator (ASR) → generated phoneme sequences; unpaired text data ("How are you.", "He thinks it's…", "Thanks for…") → real phoneme sequences ("hh aw aa r y uw", "hh ih hh ih ng kcl …", "th ae ng k s f ao r …"); the Discriminator compares generated against real sequences]
Use of Generative Adversarial Networks (GAN)
• Generator (ASR): tries to "fool" the Discriminator
• Discriminator: tries to distinguish real from generated phoneme sequences
• Generator and Discriminator trained iteratively, each improving individually
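A minimal, illustrative PyTorch sketch of the adversarial training idea on these slides, not the authors' models; the feature dimension, phoneme inventory size, network shapes, and the use of per-frame phoneme posteriors are all assumptions.

```python
import torch
import torch.nn as nn

feat_dim, n_phones = 39, 48
# generator: acoustic features -> per-frame phoneme probabilities
gen = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_phones))
# discriminator: phoneme-probability sequence -> real/generated score
disc = nn.Sequential(nn.Conv1d(n_phones, 64, 3, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, 1))
g_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def step(audio_feats, real_phone_onehot):
    # audio_feats: (B, T, feat_dim); real_phone_onehot: (B, T, n_phones) from unpaired text
    fake = torch.softmax(gen(audio_feats), dim=-1)
    # discriminator update: real sequences -> 1, generated -> 0
    d_opt.zero_grad()
    d_real = disc(real_phone_onehot.transpose(1, 2))
    d_fake = disc(fake.detach().transpose(1, 2))
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    d_loss.backward(); d_opt.step()
    # generator update: try to fool the discriminator
    g_opt.zero_grad()
    g_fake = disc(fake.transpose(1, 2))
    g_loss = bce(g_fake, torch.ones_like(g_fake))
    g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```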
[Figure: Audio Word2Vec – acoustic features divided into segments X1, X2, X3, …, XM, each mapped to an audio embedding z1, z2, z3, …, zM]
Model 1 (2018)
• Waveform segmentation and embedding
– divide the features into acoustically similar segments of different lengths
– transform each segment into a fixed-length vector (audio embedding)
[Interspeech 2018]
[Figure: K-means clusters the audio embedding sequence z1, z2, z3, …, zM into a cluster index sequence, e.g. 16, 25, 2, …, 2]
Model 1 (2018)
• Cluster the embeddings into groups
[Interspeech 2018]
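A minimal sketch of this clustering step using scikit-learn's KMeans; the embedding dimension and number of clusters are illustrative choices, not the settings in [Interspeech 2018].

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 128))        # stand-in for audio word2vec outputs
kmeans = KMeans(n_clusters=300, n_init=10, random_state=0).fit(embeddings)
# each segment embedding is replaced by its cluster index, e.g. [16, 25, 2, ...]
cluster_index_sequence = kmeans.predict(embeddings[:20])
print(cluster_index_sequence)
```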
Model 1 (2018)
• Learning the mapping between cluster indices and phonemes with a GAN
– embedding clustering followed by (cascaded with) a GAN
– Generator: a lookup table mapping the cluster index sequence
(e.g. 16, 25, 2, …, 2) to a generated phoneme sequence
(e.g. hh, ih, sil, …, sil); this mapping is the ASR ! (phoneme recognition)
– Discriminator: a CNN network deciding whether a phoneme sequence is
real or generated
[Interspeech 2018]
Model 2 (2019)
• Generator consists of two parts
(a) Phoneme Classifier (DNN)
(b) Sampling Process
• Discriminator is a two-layer 1-D CNN
• A GAN (Generator and Discriminator) trained end-to-end on unpaired
audio data and text data
– the DNN is trained in an unsupervised way
[Interspeech 2019]
Model 2 (2019)
• GAN iterated with HMMs
1. Obtain the initial phoneme boundaries
2. Train the GAN in an unsupervised way
3. Obtain transcriptions (pseudo labels) from the GAN model
4. Train the HMMs on the pseudo labels
5. Obtain new phoneme boundaries from the HMMs, then return to step 2
[Interspeech 2019]
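The same loop written as hedged Python-style pseudocode; every helper function here is hypothetical and named only to mirror the five steps above.

```python
# Hypothetical helpers: initial_segmentation, train_gan, decode_with_gan,
# train_hmm, realign_with_hmm are placeholders, not a real API.
boundaries = initial_segmentation(audio)              # step 1: initial phoneme boundaries
for iteration in range(3):
    gan = train_gan(audio, boundaries, text)          # step 2: unsupervised GAN training
    pseudo_labels = decode_with_gan(gan, audio)       # step 3: transcriptions from the GAN
    hmm = train_hmm(audio, pseudo_labels)             # step 4: HMMs on the pseudo labels
    boundaries = realign_with_hmm(hmm, audio)         # step 5: new boundaries, then iterate
```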
Experimental Results for Models 1, 2 on TIMIT
• Phoneme Error Rate (PER)

  Approach                                      Matched   Unrelated
  Supervised: RNN Transducer                      17.7        -
  Supervised: Standard HMMs                       21.5        -
  Completely unsupervised (no label at all):
    Model 1                                       76.0        -
    Model 2, Iteration 1, GAN                     48.6       50.0
    Model 2, Iteration 1, HMM                     30.7       39.5
    Model 2, Iteration 2, GAN                     41.0       44.3
    Model 2, Iteration 2, HMM                     27.0       35.5
    Model 2, Iteration 3, GAN                     38.4       44.2
    Model 2, Iteration 3, HMM                     26.1       33.1

[Interspeech 2018] [Interspeech 2019]
• Model 1 cascaded clustering with a GAN, while Model 2 did
everything end-to-end with a GAN
The Progress of Supervised Learning on TIMIT
• Milestones in phone recognition accuracy
[Figure: phone recognition accuracy on TIMIT over the years, annotated: "Unsupervised learning is as good as supervised learning 30 years ago."]
[Phone recognition on the TIMIT database, Lopes, C. and Perdigão, F., 2011. Speech
Technologies, Vol 1, pp. 285--302.]
– Will it take another 30 years for unsupervised learning to achieve the
performance of supervised learning today ?
Model 3 (2019)
• A Cascading Recognition-Synthesis Framework (ASR cascaded with TTS
through a vector space of basic sound units)
– learning basic sound units by listening and reproducing, the way
babies learn to speak from parents
– ASR ! (phoneme recognition)
– audio data only (no text) plus a small set of paired data
[ICASSP 2020]
Model 3 (2019)
• Implemented with a Sequential Quantization AutoEncoder
[Diagram: Encoder (RNNs) → frame-synchronous encoder output sequence → Quantizer → phoneme representation sequence → Decoder (seq-to-seq model)]
[ICASSP 2020]
Model 3 (2019)
• The Quantizer: each frame-synchronous encoder output vector is
assigned, by L2-distance argmin over a learnable codebook, to a
codeword index (e.g. 4, 2) (phonetic clustering), then looked up to
form the phoneme representation sequence, e.g. [e4, e4, e4, e2, e2]
(temporal segmentation)
[ICASSP 2020]
Model 3 (2019)
• Sequential Quantization AutoEncoder – Phonetic Clustering
– with a small quantity of paired data, train the codebook to match real phonemes
1. Assign one phoneme to each codeword e_v of the learnable codebook
2. Define the probability that an encoder output vector h_t belongs to phoneme v:
   Pr(h_t belongs to v) = exp(−‖h_t − e_v‖²) / Σ_u exp(−‖h_t − e_u‖²)
3. Match with the paired data by maximizing this probability
[ICASSP 2020]
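A hedged PyTorch sketch of steps 1-3 above: a softmax over negative squared L2 distances to the codewords, then cross-entropy against phoneme labels taken from the small paired set; the tensor shapes and the frame-level alignment of the targets are assumptions.

```python
import torch
import torch.nn.functional as F

def phoneme_match_loss(h, codebook, phone_targets):
    """h: (T, dim) encoder outputs; codebook: (n_codewords, dim), one phoneme per codeword;
    phone_targets: (T,) codeword/phoneme indices aligned from the paired data."""
    dist_sq = torch.cdist(h, codebook).pow(2)   # (T, n_codewords): ||h_t - e_v||^2
    logits = -dist_sq                           # Pr(v | h_t) = softmax over -dist_sq
    # minimizing cross-entropy maximizes the probability of the labeled phoneme
    return F.cross_entropy(logits, phone_targets)
```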
Model 3 (2019)
• Implemented with a Sequential Quantization AutoEncoder
– end-to-end trained (with enough audio data plus a small amount of paired data)
[Diagram: Encoder → h_t → Quantizer (L2-distance argmin over the learnable codebook, phonetic clustering and temporal segmentation, lookup) → Decoder; gradients flow back through the quantizer by straight-through gradient estimation]
– the Encoder plus Quantizer alone is the ASR ! (phoneme recognition)
[ICASSP 2020]
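A minimal sketch of such a quantizer with straight-through gradient estimation (in the spirit of VQ-VAE); the codebook size and dimension are illustrative, not the settings in [ICASSP 2020].

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-codeword lookup; the backward pass copies gradients from the
    quantized output to the encoder output (straight-through estimation)."""
    def __init__(self, n_codewords=64, dim=128):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_codewords, dim))

    def forward(self, h):                      # h: (T, dim) frame-sync encoder outputs
        dist = torch.cdist(h, self.codebook)   # (T, n_codewords) L2 distances
        idx = dist.argmin(dim=-1)              # argmin -> codeword / phoneme index per frame
        quantized = self.codebook[idx]         # lookup -> phoneme representation sequence
        return h + (quantized - h).detach(), idx
```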
Experiment Results for Model 3 on LJSpeech
• Phoneme Error Rate (PER)
– when different amounts of paired data are available (single speaker)

                    Unlabeled Audio   20 min. Paired   15 min. Paired   10 min. Paired   5 min. Paired
  Proposed Method      22 hrs.            25.5             29.0             35.2             49.3

[ICASSP 2020]
– initial recognition observed with only 20 min of paired data
– related work reported recently
• Deep learning makes tasks that were impossible before realizable today
• Many more crazy concepts are yet to be explored !
• Unpaired data may be more useful than we thought
• End-to-end training is more attractive than cascading in the
context of deep learning
– overall performance is optimized globally rather than locally
Mini-Remarks
• Doing something we never could is interesting and
exciting !
• Though challenging, the difficulties may be reduced by
fast-advancing technologies, including deep learning
• 15 years ago we never knew what kind of technologies
we could have today
– today we never know what kind of technologies we may have 15
years from now
– anything in our mind could be possible
• This may be the golden age we never had for research
– very deep learning, very big data, very powerful machines, very
strong industry
– which we never had before
– possible to do something we Never could !
Concluding Remarks