This document provides an overview of building an automatic speech recognition (ASR) engine. It discusses speech as a natural modality with high throughput that needs to account for errors in ASR output. It describes the components of a dialogue system including the ASR, natural language understanding, text-to-speech, and a dialogue manager. The document then discusses the components inside the recognizer including the acoustic model, language model, feature extraction using MFCC, and decoding using techniques like beam search. It also discusses topics like building the lexicon, acoustic modeling, and using deep learning approaches in ASR.
Focus on constructing an ASR engine utilizing speech as a natural modality (130 WPM), error management, and the need for NLP correction.
Overview of dialogue systems, components like ASR, TTS, and language models, as well as recognizing text through lexicons and acoustic models.
Description of MFCC as a feature for ASR and alignment processes, emphasizing phonemes and their modeling using HMM and GMM.
Decoding phoneme sequences using acoustic models, with a focus on N-best lists and beam search for optimizing accuracy.
Explanation of phoneme and grapheme lexicons, their application, and comparison of performance in different languages.
The G2P model predicts pronunciations from lexicons, illustrating its application with examples of generated pronunciations.
Discusses language models’ roles in grammar specification, the integration of deep learning, and different model approaches in ASR.
Various deep learning strategies in acoustic modeling and language modeling including hybrid systems, tandem approaches, and end-to-end networks.
Factors complicating ASR like vocabulary size and recording type, and reasons to build custom ASR engines tailored to specific needs.
Mention of various platforms and resources like Kaldi, Snowboy, and Mozilla's Common Voice for developing ASR.
Information on academic programs in Chula engineering and current NLP projects focusing on Thai language, education assessments, and precision medicine.
Speech as a modality
High throughput (130 words per minute)
Natural
Hands-free
Need to be mindful that ASR is errorful
NLP on top of ASR output needs to be able to correct errors
Inside the recognizer
(Diagram: speech → feature extractor → search → text. The search is constrained by the acoustic model, trained on transcribed speech data such as "This is a pen."; the lexicon, listing words and pronunciations; and the language model, trained on text data.)
5.
Inside the recognizer
(Diagram: speech → feature extractor (MFCC) → beam search over an HCLG FST graph → text, an N-best list, or a lattice. The search is constrained by the acoustic model, trained on transcribed speech data such as "This is a pen."; the lexicon, listing words and pronunciations; and the language model, trained on text data.)
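The beam search and N-best ideas on this slide can be sketched in a few lines. This is a toy example, not Kaldi's actual decoder; the tokens and probabilities are invented:

```python
import math

def beam_search(step_probs, beam_width=2):
    """Toy beam search: keep the beam_width best partial hypotheses at each step.

    step_probs: list of dicts mapping token -> probability at that time step.
    Returns an N-best list of (token sequence, log-probability), best first.
    """
    beam = [([], 0.0)]  # (token sequence, accumulated log-probability)
    for probs in step_probs:
        candidates = []
        for seq, score in beam:
            for tok, p in probs.items():
                candidates.append((seq + [tok], score + math.log(p)))
        # prune: keep only the beam_width highest-scoring hypotheses
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beam

# Toy 3-step distribution over two tokens
steps = [{"a": 0.6, "b": 0.4}, {"a": 0.5, "b": 0.5}, {"a": 0.1, "b": 0.9}]
nbest = beam_search(steps, beam_width=2)
```

The surviving `beam` at the end is exactly the N-best list the slide refers to; a real decoder searches over HMM states in the HCLG graph rather than raw tokens.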
6.
The feature: MFCC (Mel Frequency Cepstral Coefficient)
|STFT| → log → DCT → … → coefficient selection → ASR features
https://www.webmd.com/cold-and-flu/ear-infection/picture-of-the-ear#1
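The pipeline above can be sketched end to end in numpy. This is a minimal illustration; the frame size, filter count, and coefficient count are common defaults, not the exact parameters of any particular toolkit:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Sketch of the MFCC pipeline: |STFT| -> mel filterbank -> log -> DCT -> keep first coefficients."""
    # 1. Magnitude STFT: frame the signal, window it, take |FFT| of each frame
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.abs(np.fft.rfft(frame)))
    spec = np.array(frames)                      # (n_frames, n_fft//2 + 1)

    # 2. Mel filterbank: triangular filters spaced evenly on the mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_energies = spec @ fbank.T

    # 3. log compression, 4. DCT-II, 5. coefficient selection (keep first n_ceps)
    log_mel = np.log(mel_energies + 1e-10)
    n = np.arange(n_mels)
    dct_basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1)) / (2 * n_mels))
    return log_mel @ dct_basis.T                 # (n_frames, n_ceps)
```

One second of 16 kHz audio yields 97 frames of 13 coefficients with these settings; production systems usually append delta and delta-delta features on top.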
7.
Acoustic modeling
"This is a cat"
/th/ /ih/ /s/ /ih/ /z/ /ah/ /k/ /ae/ /t/
Forced alignment: the process of assigning a segment of the audio to each sound
Automatic if you already have an ASR
Phoneme: a distinct unit of speech sound
Utterance-level transcription
ASR typically models at the phoneme level
8.
HMM and GMM in speech
• Each phoneme is split into parts and each part is modeled separately
• This is modeled by a Hidden Markov Model and a Gaussian Mixture Model
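As a minimal illustration of the GMM half, here is how a diagonal-covariance mixture scores one feature frame for one HMM state. All numbers are toy values, not trained parameters:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of frame x under a diagonal-covariance Gaussian mixture.

    Each HMM state (a sub-phoneme part) owns one such mixture; decoding asks
    every state how well it explains each feature frame.
    """
    x = np.asarray(x, dtype=float)
    # log N(x; mu_k, diag(var_k)) for each mixture component k
    log_comp = -0.5 * (np.log(2 * np.pi * variances) + (x - means) ** 2 / variances).sum(axis=1)
    # log sum_k w_k N_k(x), computed stably in the log domain
    log_w = np.log(weights) + log_comp
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).sum())

# Two-component mixture in 2-D (toy numbers)
w = np.array([0.7, 0.3])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
var = np.array([[1.0, 1.0], [1.0, 1.0]])
ll_near = gmm_log_likelihood([0.1, -0.2], w, mu, var)   # close to a component mean
ll_far = gmm_log_likelihood([10.0, 10.0], w, mu, var)   # far from both means
```

A frame near a component mean scores much higher than a distant one, which is exactly the signal the HMM uses when aligning frames to states.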
The lexicon
• A dictionary saying how a word can be pronounced
• Can have multiple pronunciations
• Must use the same phoneme units as the AM
Phoneme lexicon
กรรไกร : k a n^ kr ai z^
กรณี : k a z^ r a z^ n ii z^
กรณี : k or z^ r a z^ n ii z^
เพลา : p ae z^ l aa z^
เพลา : pl ao z^
11.
The grapheme lexicon
• Represents letters (graphemes) as sound units
• The pronunciation is just the sequence of letters in the spelling
• Works quite well for many languages
• Thai is somewhat problematic
(Side-by-side example: phoneme lexicon vs. grapheme lexicon)
12.
Grapheme vs Phoneme
Language (WER)   Phoneme lexicon   Grapheme lexicon   diff
Kazakh           76.8%             77.0%              +0.2
Kurmanji         85.5%             85.1%              -0.4
Telugu           86.3%             87.0%              +0.7
Cebuano          75.7%             75.9%              +0.2
Lao              67.3%             69.9%              +2.6
Haitian          52.0%             52.3%              +0.3
Assamese         58.6%             58.5%              -0.1
English          8.0%              8.5%               +0.5
• Not the end of the world if you do not have a lexicon
• Can be slightly improved with some knowledge about the language (rule-based)
• The more regular the spelling is, the closer the gap
D. Harwath and J. Glass. Speech recognition without a lexicon-bridging the gap between graphemic and phonetic systems. In Proc. InterSpeech, 2014.
V. Le, L. Lamel, A. Messaoudi, W. Hartmann, J. Gauvain, C. Woehrling, J. Despres, and A. Roy. Developing STT and KWS systems using limited language resources. In Proc. InterSpeech, 2014.
E. Chuangsuwanich, Multilingual Techniques for Low Resource Automatic Speech Recognition, MIT, June 2016.
13.
Automatic Grapheme-to-Phoneme (G2P)
• Given a small lexicon generated by linguists, learn a model to predict the pronunciation of new words
• Sometimes called an L2S (Letter-to-Sound) model
• Can use acoustic data to improve the quality of generated pronunciations
14.
G2P example
• Trained with 5k pronunciations using Sequitur
• Can produce multiple candidate pronunciations
• PyThaiNLP also has G2P support (?)
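Sequitur learns a statistical joint-sequence model; as a stand-in for intuition only, here is a toy G2P that falls back from lexicon lookup to per-letter rules. The lexicon entries and letter-to-sound rules below are invented, not a real linguistic resource:

```python
# Minimal G2P fallback: exact lexicon lookup first, then per-letter rules.
# Both tables are toy examples for illustration only.
LEXICON = {"cat": ["k", "ae", "t"], "pen": ["p", "eh", "n"]}
LETTER_RULES = {"c": "k", "a": "ae", "t": "t", "p": "p", "e": "eh",
                "n": "n", "s": "s", "o": "ow"}

def g2p(word):
    """Return a phoneme sequence for word."""
    if word in LEXICON:                 # trust the hand-built lexicon when possible
        return LEXICON[word]
    # grapheme-style fallback: one sound unit per letter
    return [LETTER_RULES.get(ch, ch) for ch in word.lower()]

print(g2p("cat"))   # lexicon hit
print(g2p("cats"))  # out-of-vocabulary word, letter-rule fallback
```

A trained G2P model replaces the crude per-letter fallback with learned, context-sensitive mappings, and can emit several ranked candidate pronunciations per word.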
15.
Language Model (LM)
• Specifies the grammar of a valid sentence
• Can be strict
• Or probabilistic n-grams
Example: ขอ (ถอน | ฝาก) เงิน, i.e. "request to (withdraw | deposit) money"
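The strict-versus-probabilistic distinction can be illustrated with a toy maximum-likelihood bigram model over the banking example on this slide (words transliterated: kho = request, thon = withdraw, faak = deposit, ngern = money; the counts are invented):

```python
import math
from collections import Counter

# Toy training corpus of three banking utterances
corpus = [
    ["<s>", "kho", "thon", "ngern", "</s>"],
    ["<s>", "kho", "thon", "ngern", "</s>"],
    ["<s>", "kho", "faak", "ngern", "</s>"],
]
bigram_counts = Counter(b for sent in corpus for b in zip(sent, sent[1:]))
context_counts = Counter(w for sent in corpus for w in sent[:-1])

def sentence_logprob(sent):
    """Maximum-likelihood bigram log-probability; -inf for unseen bigrams (a strict grammar)."""
    lp = 0.0
    for prev, cur in zip(sent, sent[1:]):
        if bigram_counts[(prev, cur)] == 0:
            return float("-inf")        # strict: the grammar rejects this sentence
        lp += math.log(bigram_counts[(prev, cur)] / context_counts[prev])
    return lp
```

"kho thon ngern" gets probability 2/3 (withdraw appears twice as often as deposit), while a scrambled word order is rejected outright; smoothing would soften that hard rejection into a small probability.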
16.
Deep learning in ASR?
(Same recognizer diagram as slide 5: feature extractor (MFCC) → beam search over the HCLG FST graph → text / N-best / lattice, constrained by the acoustic model, lexicon, and language model.)
Deep learning in AM
Three main approaches
• Hybrid DNN-HMM
• Tandem
• End-to-end
20.
Hybrid DNN-HMM approach
• A typical speech recognizer uses the GMM-HMM framework
• Emission probabilities are modeled by a GMM
• Instead, model the emission probabilities with a DNN
• The DNN gives posteriors, while the GMM gives likelihoods. Convert DNN outputs to likelihoods by removing the priors.
(Diagram: a DNN (DNN/LSTM/etc.) replaces the GMM, producing emission probabilities b1, b2, b3 for HMM states s1, s2, s3.)
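The "remove the priors" step is just Bayes' rule applied in the log domain. A sketch with invented numbers (real systems estimate the priors from state alignments over the training data):

```python
import numpy as np

# Toy numbers: a DNN posterior over 3 HMM states for one frame,
# and the state priors counted from training alignments.
posteriors = np.array([0.7, 0.2, 0.1])   # p(state | frame) from the DNN softmax
priors     = np.array([0.5, 0.3, 0.2])   # p(state) from the training data

# Bayes' rule: p(frame | state) is proportional to p(state | frame) / p(state).
# The constant p(frame) is the same for every state, so it cancels in decoding.
scaled_log_likelihoods = np.log(posteriors) - np.log(priors)
```

These scaled log-likelihoods slot into the HMM exactly where the GMM log-likelihoods used to go, which is why the rest of the decoder is unchanged.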
21.
Tandem approach
• Use the DNN to generate good features to feed into the general GMM-HMM framework.
• Typically done by placing a narrow hidden layer in the network.
(Diagram: input features pass through a network with a narrow bottleneck layer; the bottleneck activations are fed as features into a typical GMM-HMM system.)
22.
Deep learning in LM
• Use neural language models instead of n-grams
• The neural LM can only be used in rescoring
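Rescoring can be sketched as re-ranking an N-best list with an interpolated second score. The hypotheses, scores, and LM weight below are all invented; a real system would take first-pass scores from the decoder and query an actual neural LM:

```python
# N-best rescoring sketch: combine the first-pass score with a second LM score.
nbest = [
    ("this is a pen", -12.0),
    ("this is open",  -11.5),   # best under the first pass
    ("this is a pan", -12.3),
]
# Stand-in for a neural LM: fixed log-probabilities per hypothesis
neural_lm = {"this is a pen": -3.0, "this is open": -6.0, "this is a pan": -4.0}
lm_weight = 0.8

# Re-rank by interpolated score: first-pass + weighted neural LM
rescored = sorted(
    ((hyp, score + lm_weight * neural_lm[hyp]) for hyp, score in nbest),
    key=lambda h: h[1], reverse=True,
)
best = rescored[0][0]
```

The neural LM demotes the acoustically tempting but ungrammatical "this is open"; doing this on a fixed N-best list (or lattice) is what makes expensive neural LMs affordable.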
23.
End-to-End models
• BIG network that goes from waveform to characters
• Still needs LM rescoring
• Needs large amounts of data (1000+ hours)
• People have also tried smaller data (100 hours)
Chan, W. Listen, Attend and Spell, 2015.
GitHub available (transformer): https://github.com/tensorflow/tensor2tensor
24.
Inside the recognizer
(Recap of the recognizer diagram from slide 4: speech → feature extractor → search → text, with the acoustic model, lexicon, and language model as constraints.)
25.
Factors making ASR hard (easier → harder)
• Vocabulary size: small (10) → large (1000+)
• Type of speech: disconnected words → news broadcast → conversational
• Domain: fixed domain → open domain
• Quality: close talking → far field
• Speakers: speaker dependent → speaker independent
26.
Why build your own ASR?
• You can better specify your domain and vocabulary
• The Google Speech API only supports word/phrase emphasis
• You can better specify your type of recordings
• Google now has specialized models for English
• You can adapt to the speaker
• Google prefers speaker-independent models
• You can run locally on device
How much data?
E.Chuangsuwanich, Multilingual Techniques for Low Resource Automatic Speech Recognition, MIT, June 2016.
33.
But no open dataset
• English has a 1000-hour open dataset
• Gives deployment-worthy models
• At least four startups I know start from this
• Transfer learning gives huge leverage
• Let's build shared resources
• Mozilla Common Voice https://voice.mozilla.org/
• Recordings from the ASR class https://github.com/ekapolc/gowajee_corpus/
• 40-hour spelling corpora planned for release soon(™)
• Thai crowdsourcing platform solution soon(™)
34.
Links
• ASR course at Chula (all materials + videos)
• https://github.com/ekapolc/ASR_course
• How to create your own Kaldi ASR and deploy it
• Starting AM, G2P, and docker images
• https://github.com/ekapolc/ASR_classproject
• Kaldi
• http://kaldi-asr.org/
• Snowboy for hotword detection
• https://snowboy.kitt.ai/
35.
Commercial time
Chula engineering is accepting applicants for Master's and PhD programs
M Computer Science
M Computer Engineering
M Software Engineering
PhD Computer Science
Wide range of Big data/AI courses being offered
https://www.cp.eng.chula.ac.th/
Talks from the faculty on what they are working on
https://www.youtube.com/watch?v=XU1PWNeLv4o
36.
Current projects
Thai NLP basic capabilities
Sentence segmentation
Word correction/normalization
Word segmentation
NER
Domain focus: social media, chat, financial info
37.
Current projects
National platform to assess students’ English
Both written (essays) and spoken
Speech rehab platform for stroke patients
Biomedical text mining
Precision medicine research
38.
Precision medicine
Sahin and Tureci. Science (2018).
(Figure: synthetic neoantigen vaccines; engineered T cells.)
39.
Kobayashi and van den Elsen. Nature Reviews Immunology (2012).
Precision medicine
• Model peptides as a string: ABEASOEWL
• Will the receptor (also a string: …AEOAENWOAIRPEERW…) accept the peptide?
• Use deep learning