Building your own ASR engine
Speech as a modality
• High throughput (130 words per minute)
• Natural
• Hands-free
• Need to be mindful that ASR is errorful
• NLP on top of ASR output needs to be able to correct errors
Dialogue systems
[Diagram: speech → ASR → text → Language Understanding (NLP/NLU) → user intents + confidence → Dialogue Manager → system intent → Generation → text → Speech synthesis (TTS) → speech; the Dialogue Manager can also produce other outputs]
Inside the recognizer
[Diagram: a feature extractor feeds the acoustic model, which is trained on transcribed speech data (audio paired with text such as "This is a pen."); the lexicon (words and pronunciations) and the language model (trained on text data) act as constraints on the search, which outputs text]
Inside the recognizer
[The same diagram with the blocks made concrete: the feature extractor computes MFCCs; the lexicon and language model are compiled into an HCLG FST (graph); the search is a beam search, which can output text, an N-best list, or a lattice]
The feature - MFCC (Mel Frequency Cepstral Coefficient)
[Pipeline: |STFT| → log → DCT → … → coefficient selection → ASR features]
(Ear image: https://www.webmd.com/cold-and-flu/ear-infection/picture-of-the-ear#1)
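To make the chain concrete, below is a minimal numpy/scipy sketch of the standard MFCC computation (the mel filterbank step is implied by the name, though the slide elides it). The frame size, hop, filter count, and the choice of 13 coefficients are common defaults I've assumed, not values from the slides.

```python
# A minimal numpy/scipy sketch of the MFCC chain above. All parameter
# values are common defaults, not prescribed by the slides.
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sample_rate=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    # |STFT|: power spectrum of short, windowed, overlapping frames
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sample_rate / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # log of the filterbank energies, then DCT to decorrelate
    log_mel = np.log(spec @ fbank.T + 1e-10)
    # Coefficient selection: keep the first n_ceps cepstral coefficients
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]

feats = mfcc(np.random.randn(16000))  # one second of fake 16 kHz audio
print(feats.shape)                    # (frames, 13)
```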
Acoustic modeling
[Diagram: the waveform of "This is a cat" (the utterance-level transcription) aligned against its phonemes /th/ /iy/ /s/ /iy/ /s/ /a/ /k/ /ae/ /t/]
• Forced alignment: the process of assigning a segment of the audio to each sound
• Automatic if you already have an ASR
• Phoneme: a distinct unit of speech sound
• Transcriptions are utterance-level, but ASR typically models at the phoneme level
HMM and GMM in speech
• Each phoneme is separated into parts and modeled separately
• This is modeled by a Hidden Markov Model whose emission probabilities are Gaussian Mixture Models (a sketch follows below)
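As a rough illustration, here is how one phoneme's GMM-HMM might be set up with the hmmlearn library. The toolkit choice is my assumption (the slides don't name one), and the random "MFCC" features and hyperparameters are purely illustrative.

```python
# A sketch of one phoneme's model: a 3-state HMM with GMM emissions,
# using hmmlearn (assumed installed). Data and sizes are illustrative.
import numpy as np
from hmmlearn.hmm import GMMHMM

# Pretend we have 10 example segments of the same phoneme,
# each a (frames x 13) matrix of MFCCs
segments = [np.random.randn(40, 13) for _ in range(10)]
X = np.concatenate(segments)
lengths = [s.shape[0] for s in segments]

# 3 states (roughly: beginning/middle/end of the phoneme),
# 4 Gaussians per state
model = GMMHMM(n_components=3, n_mix=4, covariance_type="diag", n_iter=20)
model.fit(X, lengths)

# Log-likelihood of a new segment under this phoneme's model
print(model.score(np.random.randn(40, 13)))
```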
Decoding
[Diagram: a tree of competing phoneme hypotheses with per-step acoustic scores, e.g. /th/ 10 vs. /ae/ 50 at the first step, then /iy/ 20 vs. /iy/ 10, /s/ 1 vs. /s/ 2, and so on; one complete path totals 84, a competing path totals 105]
• The acoustic model gives scores to possible sequences of phonemes
• Scoring every sequence is intractable: use beam search (a trade-off between accuracy and compute; toy sketch below)
• How do we go back to words?
• Can we trust the AM that much?
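The following toy beam search over per-frame phoneme scores shows the pruning idea. The score table and beam width are made up, and a real decoder searches the HCLG graph rather than raw phoneme strings.

```python
# A toy beam search over per-frame phoneme scores.
import heapq

def beam_search(frame_scores, beam_width=2):
    """frame_scores: one dict per frame, mapping phoneme -> score."""
    beams = [((), 0)]  # (phoneme sequence so far, total score)
    for scores in frame_scores:
        candidates = [(seq + (ph,), total + s)
                      for seq, total in beams
                      for ph, s in scores.items()]
        # Keep only the best few hypotheses; the rest are pruned.
        # Small beams are fast but may prune the true best path.
        beams = heapq.nlargest(beam_width, candidates, key=lambda b: b[1])
    return beams

frames = [{"/th/": 10, "/ae/": 50},
          {"/iy/": 20, "/t/": 2},
          {"/s/": 1, "/k/": 1}]
for seq, score in beam_search(frames):
    print(seq, score)
```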
The lexicon
• A dictionary saying how a word can be pronounced
• Can have multiple pronunciations
• Must use the same phoneme units as the AM
Phoneme lexicon (Thai examples: กรรไกร 'scissors', กรณี 'case', เพลา 'axle/time'):
กรรไกร : k a n^ kr ai z^
กรณี : k a z^ r a z^ n ii z^
กรณี : k or z^ r a z^ n ii z^
เพลา : p ae z^ l aa z^
เพลา : pl ao z^
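As a toy illustration, the entries above can be held in a dictionary keyed by word, with a sanity check that every pronunciation uses only the AM's units. (In Kaldi this would live in a lexicon.txt file; that file name is the usual Kaldi convention, not something stated on the slide.)

```python
# A toy in-memory lexicon built from the slide's entries.
AM_PHONEMES = {"k", "a", "n^", "kr", "ai", "z^", "r", "n", "ii", "or",
               "p", "ae", "l", "aa", "pl", "ao"}

LEXICON = {
    "กรรไกร": [["k", "a", "n^", "kr", "ai", "z^"]],
    "กรณี": [["k", "a", "z^", "r", "a", "z^", "n", "ii", "z^"],
             ["k", "or", "z^", "r", "a", "z^", "n", "ii", "z^"]],
    "เพลา": [["p", "ae", "z^", "l", "aa", "z^"],
             ["pl", "ao", "z^"]],
}

# The lexicon must use the same phoneme units as the AM
for word, prons in LEXICON.items():
    for pron in prons:
        assert set(pron) <= AM_PHONEMES, (word, pron)
```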
The grapheme lexicon
• Represents letters (graphemes) as the sound units
• The pronunciation is just the letter sequence of the spelling
• Works quite well for many languages
• Thai is somewhat problematic
[Side-by-side example: phoneme lexicon vs. grapheme lexicon]
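Building a grapheme lexicon is mechanical, which is its main appeal; a minimal sketch:

```python
# A grapheme lexicon needs no linguist: each word's "pronunciation" is
# just its letter sequence.
def grapheme_lexicon(words):
    return {w: [list(w.lower())] for w in words}

print(grapheme_lexicon(["pen", "cat"]))
# {'pen': [['p', 'e', 'n']], 'cat': [['c', 'a', 't']]}
```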
Grapheme vs Phoneme

Language    Phoneme lexicon (WER)    Grapheme lexicon (WER)    diff
Kazakh      76.8%                    77.0%                     +0.2
Kurmanji    85.5%                    85.1%                     -0.4
Telugu      86.3%                    87.0%                     +0.7
Cebuano     75.7%                    75.9%                     +0.2
Lao         67.3%                    69.9%                     +2.6
Haitian     52.0%                    52.3%                     +0.3
Assamese    58.6%                    58.5%                     -0.1
English     8.0%                     8.5%                      +0.5
• Not the end of the world if you do not have a lexicon
• Can be slightly improved with some knowledge about the language (rule-based)
• The more regular the spelling, the closer the gap
D. Harwath and J. Glass. Speech recognition without a lexicon - bridging the gap between graphemic and phonetic systems. In Proc. InterSpeech, 2014.
V. Le, L. Lamel, A. Messaoudi, W. Hartmann, J. Gauvain, C. Woehrling, J. Despres, and A. Roy. Developing STT and KWS systems using limited language resources. In Proc. InterSpeech, 2014.
E. Chuangsuwanich. Multilingual Techniques for Low Resource Automatic Speech Recognition. MIT, June 2016.
Automatic Grapheme-to-Phoneme (G2P)
• Given a small lexicon generated by linguists, learn a model to predict the pronunciation of new words
• Sometimes called an L2S (Letter-to-Sound) model
• Acoustic data can be used to improve the quality of the generated pronunciations
G2P example
• Trained with 5k pronunciations using Sequitur
• Can produce multiple candidate pronunciations
• PyThaiNLP also has G2P support (?)
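If PyThaiNLP's G2P support pans out, usage might look like the sketch below; the module path and engine name are my assumptions and should be checked against the PyThaiNLP documentation.

```python
# Assumed API - the module path and "thaig2p" engine name are guesses
# to verify against the PyThaiNLP docs, not something the slides confirm.
from pythainlp.transliterate import transliterate

print(transliterate("กรรไกร", engine="thaig2p"))
```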
Language Model (LM)
• Specifies the grammar of a valid sentence
• Can be strict
• Or probabilistic n-grams (toy sketch below)
[Example grammar: ขอ (ถอน | ฝาก) เงิน, i.e. "withdraw/deposit money"]
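To make "probabilistic n-grams" concrete, here is a tiny bigram model with add-one smoothing over the slide's withdraw/deposit example; the three-sentence corpus is made up.

```python
# A tiny bigram LM with add-one smoothing.
from collections import Counter

corpus = [["ขอ", "ถอน", "เงิน"], ["ขอ", "ถอน", "เงิน"], ["ขอ", "ฝาก", "เงิน"]]
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
V = len(unigrams)  # vocabulary size

def p(w2, w1):
    # Add-one smoothed P(w2 | w1)
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p("ถอน", "ขอ"))  # seen twice -> higher probability
print(p("ฝาก", "ขอ"))  # seen once  -> lower probability
```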
Deep learning in ASR?
[The recognizer diagram again: MFCC feature extractor, acoustic model, lexicon, language model, HCLG FST (graph), and a beam search producing text, N-best lists, or lattices. Which blocks can deep learning replace?]
Simpler features
[Pipeline: |STFT| → log → DCT → … → coefficient selection → traditional ASR features (MFCC). For deep learning, the pipeline can stop much earlier, e.g. right after |STFT|, and let the network learn the rest.]
Deep learning in AM
Three main approaches
• Hybrid DNN-HMM
• Tandem
• End-to-end
Hybrid DNN-HMM approach
• A typical speech recognizer uses the GMM-HMM framework
• Emission probabilities are modeled by a GMM
• Instead, model the emission probabilities with a DNN
• The DNN gives posteriors, while the GMM gives likelihoods. Convert DNN outputs to likelihoods by removing the priors (sketched below).
[Diagram: an HMM with states s1, s2, s3 and emission probabilities b1, b2, b3; a DNN/LSTM/etc. replaces the GMMs that compute the emissions]
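Here is a sketch of the posterior-to-likelihood conversion; the posteriors and priors below are made-up numbers.

```python
# The hybrid trick: the HMM decoder wants likelihoods p(x|s), but a DNN
# trained on frame labels outputs posteriors p(s|x). Dividing by the
# state priors p(s) gives scaled likelihoods p(x|s) ~ p(s|x) / p(s).
import numpy as np

def scaled_loglik(log_posteriors, state_priors, eps=1e-10):
    """log_posteriors: (frames, states) from the DNN's log-softmax.
    state_priors: (states,) state frequencies counted from alignments."""
    return log_posteriors - np.log(state_priors + eps)

log_post = np.log(np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.8, 0.1]]))
priors = np.array([0.5, 0.3, 0.2])  # counted from forced alignments
print(scaled_loglik(log_post, priors))
```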
Tandem approach
• Use the DNN to generate good features to feed into the usual GMM-HMM framework
• Typically done by placing a narrow hidden layer (a bottleneck layer) in the network; see the sketch below
[Diagram: input features pass through a network with a bottleneck layer; the bottleneck activations become the input features of a typical GMM-HMM]
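A minimal sketch of a bottleneck network, assuming PyTorch; the layer sizes are illustrative and the training loop (phoneme-state classification) is omitted.

```python
# Tandem features: train the network to classify phoneme states, then
# read the narrow layer's activations and feed those to the GMM-HMM.
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    def __init__(self, n_in=13, n_hidden=512, n_bottleneck=40, n_states=120):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_bottleneck),  # the narrow layer
        )
        self.back = nn.Sequential(nn.ReLU(), nn.Linear(n_bottleneck, n_states))

    def forward(self, x):                 # used during training
        return self.back(self.front(x))

    def bottleneck_features(self, x):     # used after training
        return self.front(x)

net = BottleneckNet()
frames = torch.randn(100, 13)                 # 100 frames of input features
print(net.bottleneck_features(frames).shape)  # (100, 40) tandem features
```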
Deep learning in LM
• Use neural language models instead of n-grams
• The neural LM can only be used in rescoring (the first pass still decodes with n-grams; sketch below)
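A sketch of what N-best rescoring looks like; `toy_neural_lm_logprob` is a hypothetical stand-in for a real neural LM, and the interpolation weights are illustrative.

```python
# N-best rescoring: the decoder emits an N-best list; a neural LM
# rescores each hypothesis and we re-rank.
def toy_neural_lm_logprob(text):
    # Dummy scorer: real code would run the sentence through a neural LM
    return -2.0 * len(text.split())

def rescore_nbest(nbest, am_weight=1.0, lm_weight=0.5):
    """nbest: list of (hypothesis_text, acoustic_score) from the decoder."""
    scored = [(text, am_weight * am + lm_weight * toy_neural_lm_logprob(text))
              for text, am in nbest]
    return max(scored, key=lambda h: h[1])

nbest = [("this is a pen", -10.0), ("this is open", -9.5)]
print(rescore_nbest(nbest))
```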
End-to-End models
• One BIG network that goes from waveform to characters
• Still needs LM rescoring
• Needs large amounts of data (1000+ hours)
• People have also tried it on smaller data (100 hours)**
Chan, W. Listen, Attend and Spell, 2015
GitHub available (transformer): https://github.com/tensorflow/tensor2tensor
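To make "network outputs → characters" concrete, here is a toy greedy decode in the CTC style (collapse repeats, drop blanks). Note this is one common end-to-end formulation; the LAS model cited above is attention-based rather than CTC, and the probabilities below are made up.

```python
# Greedy CTC-style decoding of per-frame character probabilities.
import numpy as np

CHARS = ["_", "a", "c", "t"]  # "_" is the CTC blank symbol

def greedy_ctc_decode(logits):
    ids = logits.argmax(axis=1)
    out, prev = [], None
    for i in ids:
        # Emit a character only when it changes and is not the blank
        if i != prev and CHARS[i] != "_":
            out.append(CHARS[i])
        prev = i
    return "".join(out)

logits = np.array([[0.1, 0.1, 0.7, 0.1],    # c
                   [0.6, 0.2, 0.1, 0.1],    # _ (blank)
                   [0.1, 0.8, 0.05, 0.05],  # a
                   [0.1, 0.8, 0.05, 0.05],  # a (repeat, collapsed)
                   [0.1, 0.1, 0.1, 0.7]])   # t
print(greedy_ctc_decode(logits))  # "cat"
```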
Inside the recognizer
[The original recognizer diagram once more: feature extractor, acoustic model (trained on transcribed speech data), lexicon (words and pronunciations), and language model (trained on text data) feeding the search, which outputs text, with the lexicon and LM acting as constraints]
Factors making ASR hard (easier → harder)
Vocabulary size: small (10) → large (1000+)
Type of speech: disconnected words → news broadcast → conversational
Domain: fixed domain → open domain
Quality: close talking → far field
Speakers: speaker dependent → speaker independent
Why build your own ASR?
• You can better specify your domain and vocabulary
  • The Google speech API only supports word/phrase emphasis
• You can better specify your type of recordings
  • Google now has specialized models for English
• You can adapt to the speaker
  • Google prefers speaker-independent models
• You can run locally, on device
Powered by Kaldi
https://www.nist.gov/itl/iad/mig/openkws16-evaluation
Smartvid.io
https://blogs.nvidia.com/blog/2017/05/10/these-six-ai-startups-just-snagged-a-share-of-1-5-million-in-cash-prizes/
Snowboy (kitt.ai)
https://techcrunch.com/2017/07/05/baidu-acquires-natural-language-startup-kitt-ai-maker-of-chatbot-engine-chatflow/
How much data?
E. Chuangsuwanich, Multilingual Techniques for Low Resource Automatic Speech Recognition, MIT, June 2016.
But no open dataset
• English has a 1000-hour open dataset (LibriSpeech)
  • Gives deployment-worthy models
  • At least four startups I know of started from this
  • Transfer learning gives huge leverage
• Let's build shared resources
  • Mozilla Common Voice: https://voice.mozilla.org/
  • Recordings from the ASR class: https://github.com/ekapolc/gowajee_corpus/
  • 40hr spelling corpora planned for release soon(™)
  • Thai crowdsourcing platform solution soon(™)
Links
• ASR course at Chula (all materials + videos)
• https://github.com/ekapolc/ASR_course
• How to create your own Kaldi ASR and deploy it
• Starting AM, G2P, and docker images
• https://github.com/ekapolc/ASR_classproject
• Kaldi
• http://kaldi-asr.org/
• Snowboy for hotword detection
• https://snowboy.kitt.ai/
Commercial time
Chula Engineering is accepting applicants for Master and PhD programs:
• M Computer Science
• M Computer Engineering
• M Software Engineering
• PhD Computer Science
Wide range of Big data/AI courses being offered: https://www.cp.eng.chula.ac.th/
Talks from the faculty on what they are working on: https://www.youtube.com/watch?v=XU1PWNeLv4o
Current projects
Thai NLP basic capabilities:
• Sentence segmentation
• Word correction/normalization
• Word segmentation
• NER
Domain focus: social media, chat, financial info
Current projects
• National platform to assess students' English, both written (essays) and spoken
• Speech rehab platform for stroke patients
• Biomedical text mining
• Precision medicine research
Precision medicine
[Figure: synthetic neoantigen vaccines and engineered T cells. Sahin and Tureci, Science (2018)]
Precision medicine
• Model the peptide as a string (ABEASOEWL) and the receptor also as a string (…AEOAENWOAIRPEERW....)
• Will the receptor accept the peptide?
• Use deep learning
(Kobayashi and van den Elsen, Nature Reviews Immunology, 2012)