
QUERY-BY-EXAMPLE ON-DEVICE KEYWORD SPOTTING

Byeonggeun Kim, Mingu Lee, Jinkyu Lee, Yeonseok Kim, and Kyuwoong Hwang

Qualcomm AI Research
Hakdong-ro, Gangnam-gu, Seoul, Republic of Korea

ABSTRACT

A keyword spotting (KWS) system determines the existence of a, usually predefined, keyword in a continuous speech stream. This paper presents a query-by-example on-device KWS system which is user-specific. The proposed system consists of two main steps: query enrollment and testing. In the query enrollment step, phonetic posteriors are output by a small-footprint automatic speech recognition model based on connectionist temporal classification. Using the phonetic-level posteriorgram, a hypothesis graph of a finite-state transducer (FST) is built, so any keyword can be enrolled, avoiding the out-of-vocabulary problem. In testing, a log-likelihood is scored for input audio using the FST. We propose a threshold prediction method that uses the user-specific keyword hypothesis only. The system generates query-specific negatives by rearranging each query utterance in waveform. The threshold is decided based on the enrollment queries and the generated negatives. We tested two keywords in English, and the proposed work shows promising performance while preserving simplicity.

Index Terms— keyword spotting, user-specific, query-by-example, on-device, threshold prediction

1. INTRODUCTION

Keyword spotting (KWS) has widely been used in personal devices like mobile phones and home appliances for detecting keywords which are usually compounded of one or two words. The goal is to detect the keywords in a real-time audio stream. For practical use, it is required to achieve a low false rejection rate (FRR) while keeping false alarms (FAs) per hour low.

Many previous works consider predefined keywords to reach promising performance. Keywords such as “Alexa”, “Okay/Hey Google”, “Hey Siri” and “Xiaovi Xiaovi” are examples. They collect numerous variations of a specific keyword utterance and train neural networks (NNs), which have been a promising method in the field. [1, 2] have an acoustic encoder and a sequence matching decoder as separate modules. The NN-based acoustic models (AMs) predict senone-level posteriors. Sequence matching, traditionally modeled by hidden Markov models (HMMs), interprets the AM outputs into keyword and background parts. Meanwhile, [3, 4, 5, 6] have end-to-end NN architectures to directly determine the presence of keywords. They use recurrent neural networks (RNNs) with attention layers [3, 4], a dilated convolution network [5], or filters based on singular value decomposition [6].

On the other hand, there have been query-by-example approaches which detect query keywords of any kind. Early approaches use automatic speech recognition (ASR) phonetic posteriors as a posteriorgram and exploit dynamic time warping (DTW) to compare keyword samples and test utterances [7, 8, 9]. [10, 11] also used posteriorgrams while using connectionist temporal classification (CTC) ASR: [10] used an edit distance metric, and [11] directly used the posteriors of the ASR. Furthermore, [12] computes simple similarity scores of LSTM output vectors between enrollment and test utterances. Recently, end-to-end NN based query-by-example systems have been suggested [13, 14]. [13] uses a recurrent neural network transducer (RNN-T) model biased with attention over the keyword. [14] suggests using a text query instead of audio.

Meanwhile, other groups have explored further aspects of the keyword spotting problem. [15, 16, 17, 18] solve multiple keyword detection. [19, 20] focus on KWS tasks with small datasets: [19] uses DTW to augment the data, and [20] suggests a few-shot meta-learning approach.

In this paper, we propose a simple yet powerful query-by-example on-device KWS approach using user-specific queries. Our system provides a user-specific model by utilizing a few keyword utterances spoken by a single user. The system uses a posteriorgram-based graph matching algorithm with a small-footprint ASR. A CTC-based ASR [21] outputs phonetic posteriors, and we build a hypothesis graph of a finite-state transducer (FST). The posteriorgram consists of phonetic outputs, which frees the model from the out-of-vocabulary problem. In testing, the system determines whether an input audio contains the keyword through a log-likelihood score under the graph, which encodes the constraints of the phonetic hypothesis. Despite score normalization, score-based query-by-example on-device KWS systems usually suffer from threshold decision, because there are not enough negative examples in an on-device system. We predict a user-specific threshold from the keyword hypothesis graphs. We generate query-specific negatives by rearranging positives in waveform, then predict a threshold using the positives and the generated negatives. While keeping this simplicity, our approach shows comparable performance with recent KWS systems.

(Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.)
The rest of the paper is organized as follows. In Section
2, the KWS system is described including the acoustic model,
the FST in the decoder, and the threshold prediction method.
The performance evaluation results are discussed in Section 3
followed by the conclusion in Section 4.

2. QUERY-BY-EXAMPLE KWS SYSTEM

Our system consists of three parts: the acoustic model, the decoder, and the threshold prediction part. In the subsections, we denote the acoustic model input features as X = x_1, x_2, · · · , x_T, where x_t ∈ R^M and t is a time frame index. The corresponding labels are Y = y_1, y_2, · · · , y_K, and usually K < T.

2.1. Acoustic model

We exploit a CTC acoustic model [21]. We denote the activation of the ASR as O = o_1, o_2, · · · , o_T, where o_t ∈ R^N, and let o_t^n be the activation of unit n at time t. Thus o_t^n is the probability of observing n at time t. CTC uses an extra blank output φ. We denote L′ = L ∪ {φ, space}, where L is the set of 39 context-independent phonemes. The space output implies a short pause between words. We let L′(T) be the set of sequences of length T whose elements are in L′. Then the conditional probability of a path P given X is

    p(P|X) = ∏_{t=1}^{T} p(o_t^{P_t}),  ∀P ∈ L′(T).

[21] suggests a many-to-one mapping B which maps the activation O to the label sequence Y. The mapping collapses repeats and removes the blank output φ, e.g. B(xφyyφz) = B(xφφyzφ) = xyz. The conditional probability p(Y|X) marginalizes over the possible paths for Y and is defined as

    p(Y|X) = Σ_{P ∈ B⁻¹(Y)} p(P|X).   (1)
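To make the mapping B and the max-decoding of Section 2.2.1 concrete, here is a minimal Python sketch (ours, not from the paper); the `BLANK` index and the function names are assumptions for illustration.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank output (phi)

def max_decode(O: np.ndarray) -> list[int]:
    """Greedy max-decoding: pick the most probable unit at each frame.

    O: (T, N) matrix of per-frame phonetic posteriors from the CTC AM.
    Returns the frame-level path P of length T.
    """
    return O.argmax(axis=1).tolist()

def collapse(path: list[int]) -> list[int]:
    """The many-to-one mapping B: collapse repeats, then drop blanks,
    e.g. B(x, phi, y, y, phi, z) = B(x, phi, phi, y, z, phi) = (x, y, z)."""
    hypothesis = []
    prev = None
    for unit in path:
        if unit != prev and unit != BLANK:
            hypothesis.append(unit)
        prev = unit
    return hypothesis
```

Applying `collapse(max_decode(O))` to the AM output of a clean query utterance yields the phonetic hypothesis used to build the left-to-right FST in Section 2.2.1.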
2.2. Keyword spotting decoder

The keyword spotting decoder operates in two phases: an enrollment step and testing. In the enrollment step, using the AM output of the query utterance, the model finds the hypothesis and builds FSTs for the path. In testing, the model calculates the score and determines whether the input utterance contains the keyword using the hypothesis.

2.2.1. Query enrollment

In the enrollment step, the system uses a few clean utterances of a keyword spoken by a single user. We use a simple, heuristic method: max-decoding. We follow the component of maximum posterior at each time frame. For each time step t, we choose argmax_n(o_t^n), n = 1, · · · , N, and get a path P. The hypothesis is defined by the mapping B, as B(P).

A keyword ‘Hey Snapdragon’ gives a hypothesis like ‘HH.EY. .S.N.AE.P.T. .A.AE.G.AH.N’. With the hypothesis as a sequential phonetic constraint, we generate left-to-right FST systems. (Snapdragon is a registered trademark of Qualcomm Incorporated.)

2.2.2. Keyword spotting

In testing, the system calculates a score of a test utterance for the hypothesis FSTs. Assume that the FST has L distinct possible states S = [s^(i)], i = 1, 2, · · · , L, where s^(φ) denotes the blank state. The FST is left-to-right and therefore has an ordered label hypothesis Y′ = y′_1, y′_2, · · · , y′_K, where y′_k ∈ S, ∀k. Given the hypothesis, the score is the log likelihood of a test input, X′ = x′_1, x′_2, · · · , x′_T.

At time step t, the activation of the AM is o_t and we denote the corresponding FST state as q_t ∈ S. The transition probability a_ij is p(q_t = s^(j) | q_{t−1} = s^(i)). The hypothesis limits the transition probability as in Eq. (2), where q_{t−1} = y′_{l−1}. If q_t = s^(φ), then q_t = q_{t−1}, i.e. the model remains in the previous state. The hypothesis Y′ is usually shorter than X′ because we use the mapping B to get Y′. Therefore it is more likely to remain in the current state than to move to the next. We naively choose the transition probabilities to reflect this scenario:

    a_ij(t) = 1/3, if q_t ∈ {y′_l, y′_{l−1}, s^(φ)};  0, otherwise.   (2)

A log likelihood is
    log p(X′|Y′) = log{ Σ_q p(q|Y′) p(X′|Y′, q) }
                 ≈ max_{q,t₀} [ log{ π · ∏_{t=t₀+1}^{T} a_{q_{t−1} q_t} · ∏_{t=t₀}^{T} p(q_t|x′_t) p(x′_t) / p(q_t) } ]
                 ∝ max_{q,t₀} [ log{ π · ∏_{t=t₀+1}^{T} a_{q_{t−1} q_t} · ∏_{t=t₀}^{T} p(q_t|x′_t) } ],   (3)

where π denotes the initial state probability, and π = p(q_1 = y′_1) = 1 for a given path. The term p(q|Y′) is the product of the transition probabilities, and the likelihood p(X′|Y′, q) is proportional to the posteriors of the AM. Here p(x′_t) and the state prior p(q_t) are assumed to be uniform.

We normalize the score by dividing Eq. (3) by the number of non-blank states, |{q_t | t = 1, · · · , T, q_t ≠ s^(φ)}|. We find the q and t₀ which maximize Eq. (3) by beam search. During the search, we consider each time step t as a candidate initial time t₀. By doing this, the system can spot the keyword in a long audio stream.
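As an illustration of Eqs. (2)–(3), here is a minimal Viterbi-style sketch (ours, not the authors' implementation): the blank state is folded into a 'stay' move, every frame is treated as a candidate start time t₀, and the per-path non-blank normalization is approximated by dividing by the hypothesis length K. The function and variable names are our own.

```python
import numpy as np

LOG_TRANS = np.log(1.0 / 3.0)  # Eq. (2): each allowed move has probability 1/3

def fst_score(log_post: np.ndarray, hyp: list[int], blank: int = 0) -> float:
    """Score a test utterance against a hypothesis FST, in the spirit of Eq. (3).

    log_post: (T, N) log phonetic posteriors log p(q_t | x'_t) from the AM.
    hyp:      enrolled hypothesis Y' as a list of K phoneme indices.
    Returns a length-normalized log likelihood (higher = more keyword-like).
    """
    T, K = len(log_post), len(hyp)
    NEG = -np.inf
    dp = np.full(K, NEG)  # dp[l] = best log score of any path now in state l
    best = NEG
    for t in range(T):
        # stay in the same state, emitting its phoneme or the blank
        stay = dp + LOG_TRANS + np.maximum(log_post[t, hyp], log_post[t, blank])
        # advance to the next hypothesis state
        move = np.full(K, NEG)
        move[1:] = dp[:-1] + LOG_TRANS + log_post[t, hyp[1:]]
        dp = np.maximum(stay, move)
        # treating every frame as a candidate start time t_0 lets the
        # system spot the keyword anywhere inside a longer stream
        dp[0] = max(dp[0], log_post[t, hyp[0]])
        best = max(best, dp[-1])  # a detection must end in the final state
    return best / K  # crude stand-in for the non-blank-state normalization
```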
2.3. On-device threshold prediction

In this section, the query set is Q = {X′_1, X′_2, · · · , X′_A}, and the corresponding hypothesis set is H = {Y′_1, Y′_2, · · · , Y′_A}. F_Y(X) is a mapping from a test utterance X to the log likelihood score for a hypothesis Y. We denote negative utterances as Z_1, Z_2, · · · , Z_B. Each hypothesis computes positive scores from the other queries. A threshold δ is defined as

    δ(Q, H) = τ / (A(A−1)) · Σ_{(a,a′)} F_{Y′_a}(X′_{a′})|_{a≠a′} + (1−τ) / (A·B) · Σ_{(a,b)} F_{Y′_a}(Z_b),   (4)

where τ is a hyperparameter in [0, 1], a, a′ ∈ [A] and b ∈ [B]. Eq. (4) places the threshold between the mean of the positive scores and that of the negative scores.

We generate query-specific negatives from the queries. Figure 1 shows an example for the keyword ‘Hey Snapdragon’. Each positive is divided into sub-parts and shuffled in waveform. We overlap 16 samples at each part boundary and apply one-sided triangular windows to them, to guarantee smooth waveform transitions and to prevent undesirable discontinuities, i.e. impulsive noises. Figure 2 plots an example of the histograms of queries, negatives, and generated negatives for the hypothesis FSTs of a single speaker. A probability distribution is drawn over each histogram, assuming a Gaussian distribution, for better visualization. We use the generated negatives as {Z_b}.

[Fig. 1: Example of a generated negative from a query utterance, ‘Hey Snapdragon’. The query utterance is divided into three parts in waveform and shuffled.]

[Fig. 2: A histogram of query, negative, and generated-negative log likelihood scores for the hypothesis FSTs of a single speaker. The colored histogram shows generated negatives.]
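The negative generation and the threshold of Eq. (4) are simple enough to sketch directly. The following is our illustration, keeping the 16-sample overlap from the text but otherwise using assumed names; the linear ramps play the role of the one-sided triangular windows.

```python
import itertools
import random
import numpy as np

def make_negative(wave: np.ndarray, parts: int = 3, overlap: int = 16) -> np.ndarray:
    """Rearrange a query waveform into a query-specific negative (Fig. 1).

    The waveform is cut into equal parts (each assumed longer than
    `overlap`), the parts are shuffled into an order different from the
    original, and neighbors are crossfaded over `overlap` samples with
    one-sided triangular (linear) windows to avoid impulsive noises.
    """
    chunks = np.array_split(wave, parts)
    orders = [p for p in itertools.permutations(range(parts))
              if p != tuple(range(parts))]
    order = random.choice(orders)  # 5 valid shuffles when parts == 3
    fade_out = np.linspace(1.0, 0.0, overlap)
    fade_in = 1.0 - fade_out
    out = chunks[order[0]].copy()
    for idx in order[1:]:
        nxt = chunks[idx].copy()
        out[-overlap:] = out[-overlap:] * fade_out + nxt[:overlap] * fade_in
        out = np.concatenate([out, nxt[overlap:]])
    return out

def predict_threshold(pos_scores, neg_scores, tau: float) -> float:
    """Eq. (4): interpolate between mean positive and mean negative score."""
    return tau * np.mean(pos_scores) + (1.0 - tau) * np.mean(neg_scores)
```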
3. EXPERIMENTS

3.1. Experimental setup

3.1.1. Query and testing data

Many previous works experiment with their own data, which are not accessible. In some of the literature, only relative performances are reported, so the results are hard to compare with each other and are not reproducible. To be free from this issue, we use public and well-known data.

We use two query keywords in English, ‘Hey Snapdragon’ and ‘Hey Snips’. The audio data of ‘Hey Snips’ is introduced in [5]. We select 61 speakers who have at least 11 ‘Hey Snips’ utterances each, and use 993 utterances from the data. ‘Hey Snapdragon’ utterances are from a publicly available dataset (https://developer.qualcomm.com/project/keyword-speech-dataset). There are 50 speakers, and each of them speaks the keyword 22 or 23 times; in total, there are 1,112 ‘Hey Snapdragon’ utterances. In each user-specific test, 3 query utterances are randomly picked and the rest are used as positive test samples. We augment the positive utterances using five types of noise, {babble, car, music, office, typing}, at three signal-to-noise ratios (SNRs), {10 dB, 6 dB and 0 dB}.

We use WSJ-SI200 [22] as negative samples. We sampled 24 hours of WSJ-SI200 and segmented the whole audio stream into 2-second-long pieces. We augment each segment with one of the five noise types, {babble, car, music, office, typing}, and one SNR among {10 dB, 6 dB and 0 dB}. The noise type and SNR are randomly selected.
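The paper does not specify its mixing procedure; as a hedged sketch under the usual power-based SNR definition, adding a noise clip to speech at a target SNR might look like this (`mix_at_snr` is our name, not from the paper):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that speech power / noise power matches `snr_db`,
    then add it to `speech`. Assumes float waveforms of equal length."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```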
Table 1: FRR (%) at 0.05 FAs per hour for clean and SNR levels {10 dB, 6 dB, 0 dB} of positives.

    Method   Keyword           clean   10 dB   6 dB    0 dB    Avg.
    S-DTW    Hey Snapdragon     1.35    3.84    8.01   21.6     8.70
    S-DTW    Hey Snips         10.5    15.8    20.7    32.8    19.9
    FST      Hey Snapdragon     0.53    0.83    3.22   12.2     4.19
    FST      Hey Snips          1.85    5.36    8.59   24.7    10.13

Table 2: Comparison of FRR (%) of various KWS systems at given FAs per hour levels.

    Method              Keyword            Params   SNR     FRR @ 1 FA/hr   FRR @ 0.5 FA/hr   FRR @ 0.05 FA/hr
    Shan et al. [3]     Xiao ai tong xue    84 k    -            1.02             -                  -
    Coucke et al. [5]   Hey snips          222 k    5 dB*         -               1.60               -
    Wang et al. [4]     Hai xiao wen         -      -            4.17             -                  -
    He et al. [13]      Personal Name**      -      -             -               -                  8.9
    S-DTW               Hey Snapdragon     211 k    6 dB         3.12             4.46               8.01
    S-DTW               Hey Snips          211 k    6 dB        13.30            15.07              20.69
    FST                 Hey Snapdragon     211 k    6 dB         0.62             1.04               3.22
    FST                 Hey Snips          211 k    6 dB         2.79             3.77               8.58

    * Coucke et al. [5] augmented the positive dev and test datasets at 5 dB only, while our 6 dB applies only to the positive dev set; our test set is augmented at {10, 6, 0} dB.
    ** He et al. [13] used queries like ‘Olivia’ and ‘Erica’.

3.1.2. Acoustic model details

The model is trained with Librispeech [23] data. Noises, {babble, music, street}, are added at uniform random SNRs in the [−3, 15] dB range. For a more generalized model, we distorted the data by speech rate, power, and reverberation. We changed the speech rate with uniform random rates between 0.9 and 1.2. For reverberation, we used several room impulse responses measured in moderately reverberant meeting rooms in an office building. By ‘power’ we mean input level augmentation, for which we changed the peak amplitudes of the input waveforms to a random value between 0 dB and −40 dB in the normalized full scale.

Input features are 40-dimensional per-channel energy normalization (PCEN) mel-filterbank energies [24] with a 30 ms window and a 10 ms frame shift. The model has two convolutional layers followed by five unidirectional LSTM layers. Each convolutional layer is followed by batch normalization and an activation function. Each LSTM layer has 256 LSTM cells. On top, there are a fully-connected layer and a softmax layer. Through the trade-off between ASR performance and network size, the model has 211 k parameters and shows 16.61 % phoneme error rate (PER) and 48.04 % word error rate (WER) on the Librispeech test-clean dataset without prior linguistic information.
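As a rough PyTorch sketch of the described topology: the paper fixes only the layer counts, the 256-cell LSTM width, the 40-dim PCEN input, and the phonetic output set (39 phonemes plus space and blank), so the kernel sizes, channels, and strides below are our assumptions, and the sketch does not reproduce the 211 k parameter count.

```python
import torch
import torch.nn as nn

class CTCAcousticModel(nn.Module):
    """Sketch of the described AM: 2 conv layers (each followed by batch
    norm and an activation), 5 unidirectional LSTM layers of 256 cells,
    and a fully-connected softmax output layer."""

    def __init__(self, n_mels: int = 40, n_out: int = 41):  # 39 phones + space + blank
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32 * n_mels, 256, num_layers=5, batch_first=True)
        self.fc = nn.Linear(256, n_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_mels) PCEN mel-filterbank features
        h = self.conv(x.unsqueeze(1))           # (batch, 32, frames, n_mels)
        h = h.permute(0, 2, 1, 3).flatten(2)    # (batch, frames, 32 * n_mels)
        h, _ = self.lstm(h)
        return self.fc(h).log_softmax(dim=-1)   # per-frame phonetic log posteriors
```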


3.2. Results

We tested 111 user-specific KWS systems: 50 from the query ‘Hey Snapdragon’ and the rest from ‘Hey Snips’. We used three queries from a given speaker for an enrollment. When we use only one or two queries instead, the relative increase in FRR (%) at 0.5 FA per hour is 222.05 % and 2.47 % respectively at 6 dB SNR. The scores from the three hypotheses are averaged for each test.

3.2.1. Baseline

Some previous works exploit DTW to compare the query and a test sample [7, 8, 9]. We exploit DTW as our baseline while using the CTC-based AM. We use the KL divergence as the DTW distance, and allow a subsequence as an optimal path, which is referred to as subsequence DTW (S-DTW) [25]. The score is normalized by the DTW input length corresponding to the optimal path.
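A compact sketch of this baseline (our reading of subsequence DTW [25], not the authors' code): KL divergence between posterior frames as the local cost, free start and end points on the test axis, and normalization by the matched path length.

```python
import numpy as np

def kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-10) -> float:
    """KL divergence between two posterior vectors, used as the local cost."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def sdtw_score(query: np.ndarray, test: np.ndarray) -> float:
    """Subsequence DTW between a query posteriorgram (K, N) and a test
    posteriorgram (T, N). The query may match any contiguous region of
    the test; the accumulated cost is normalized by the match length.
    Lower is better (a distance, unlike the FST log likelihood)."""
    K, T = len(query), len(test)
    cost = np.array([[kl(query[k], test[t]) for t in range(T)] for k in range(K)])
    acc = np.full((K, T), np.inf)
    length = np.zeros((K, T))
    acc[0], length[0] = cost[0], 1           # the match may start at any test frame
    for k in range(1, K):
        for t in range(1, T):
            steps = [(acc[k-1, t-1], length[k-1, t-1]),   # diagonal
                     (acc[k-1, t],   length[k-1, t]),     # query step
                     (acc[k, t-1],   length[k, t-1])]     # test step
            prev_cost, prev_len = min(steps, key=lambda s: s[0])
            acc[k, t] = prev_cost + cost[k, t]
            length[k, t] = prev_len + 1
    end = int(np.argmin(acc[-1]))            # the match may end at any test frame
    return acc[-1, end] / max(length[-1, end], 1)
```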
3.2.2. FST constrained by phonetic hypothesis

[Fig. 3: Comparison of the baseline, the S-DTW, with the FST constrained by the phonetic hypothesis.]

We build 3 hypothesis FSTs for each system. We tested all 111 user-specific models and average them by keyword. Table 1 compares the baseline, the S-DTW, with the FST method, and we average the performances over the four SNR levels to plot a ROC curve, shown in Figure 3. The FST method consistently outperforms the S-DTW while using the same query, and ‘Hey Snapdragon’ stands out over ‘Hey Snips’.
The query word ‘Hey Snips’ is short, so false alarms are more likely to occur. The performance is heavily influenced by the type of keyword, and this effect is also noted in [13]. In Figure 4, we plot a histogram which shows the FRR by user. Most user models show low FRR except for some outliers.

[Fig. 4: Histograms of FRRs (%) at 0.05 FA/hr per user model.]

Due to the limited data access, direct result comparison with previous works is difficult. Nevertheless, we compare our results with others in Table 2 to show that they are comparable to those of predefined KWS systems [3, 5, 4] and a query-by-example system [13]. Blanks in the table imply unknown information.

3.2.3. On-device threshold prediction

We tested a naive threshold prediction approach as a baseline. The baseline assumes a scenario in which a device stores 100 randomly chosen general negatives: 50 negatives from clean data and the rest from the augmented data mentioned in Section 3.1.1, i.e. A = 3 and B = 100 in Eq. (4).

The proposed method exploits query-specific negatives. For each query, we divide the waveform into three parts of the same length, so there are five ways to shuffle it that differ from the original signal. There are three queries for each enrollment, therefore we have 15 generated negatives. Each hypothesis from a query uses the other two queries as positives and their generated negatives as negatives, thus A = 3 and B = 10.

[Fig. 5: Comparison of the baseline with query-specific generated negatives. The graphs show the relationship between the mean positives and the mean negatives, with their best-fit lines.]

Figure 5 shows the mean of positive scores and that of negative scores for the 111 user-specific models. The baseline shows low and even negative correlation coefficient (R) values: R for ‘Hey Snapdragon’ and ‘Hey Snips’ is −0.04 and −0.21 respectively. Meanwhile, the proposed method shows positive R values, 0.25 for ‘Hey Snapdragon’ and 0.40 for ‘Hey Snips’. If there are common tendencies between positives and negatives across keywords, we can expect useful threshold decision rules from them. Here we tried the simple linear interpolation introduced in Section 2.3.

We search for τ in Eq. (4) by brute force to get near 0.05 FAs per hour on average over the 111 models. We set τ to 0.82 for the baseline and 0.38 for the proposed method; the resulting FAs per hour are 0.049 for the baseline and 0.050 for the proposed method on average.

Both methods find a τ that reaches the target FAs per hour level; however, the two methods differ dramatically across keywords. The inter-keyword difference should be small for a query-by-example system to work on any kind of keyword. For the baseline, ‘Hey Snapdragon’ shows 0.001 FAs per hour while ‘Hey Snips’ shows 0.088. Despite using a 6 to 7 times smaller B, the proposed method shows exactly the same 0.050 FAs per hour for both keywords, ‘Hey Snapdragon’ and ‘Hey Snips’. The baseline shows 17.77 % FRR on 6 dB noisy positives due to its low FAs per hour, while the proposed method shows 3.95 % FRR for ‘Hey Snapdragon’. These results differ from Table 2 because Table 2 uses a given FAs-per-hour level for each model, while this section uses the FAs per hour averaged over models.

4. CONCLUSIONS

In this paper, we suggest a simple and powerful approach to the query-by-example on-device keyword spotting task. Our system uses user-specific queries, and a CTC-based AM outputs a phonetic posteriorgram. We decode the output and build left-to-right FSTs as a hypothesis. The log likelihood is calculated as a score for testing. For the on-device test, we suggest a method to predict a proper user- and query-specific threshold with the hypothesis. We generate query-specific negatives by shuffling the query in waveform. While many previous KWS approaches are not reproducible due to limited data access, we tested our methods on public and well-known data. In the experiments, our approach showed promising performance, comparable to the latest predefined and query-by-example methods. This work is limited by the lack of public data, and we suggest a naive approach for utilizing the generated negatives. As future work, we will study more advanced ways to predict the threshold using the query-specific negatives, and test various keywords.

5. REFERENCES

[1] J. Guo, K. Kumatani, M. Sun, M. Wu, A. Raju, N. Strom, and A. Mandal, “Time-delayed bottleneck highway networks using a DFT feature for keyword spotting,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. IEEE, 2018, pp. 5489–5493.

[2] M. Chen, S. Zhang, M. Lei, Y. Liu, H. Yao, and J. Gao, “Compact feedforward sequential memory networks for small-footprint keyword spotting,” in INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, 2018, pp. 2663–2667.

[3] C. Shan, J. Zhang, Y. Wang, and L. Xie, “Attention-based end-to-end models for small-footprint keyword spotting,” in INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, 2018, pp. 2037–2041.

[4] X. Wang, S. Sun, C. Shan, J. Hou, L. Xie, S. Li, and X. Lei, “Adversarial examples for improving end-to-end attention-based small-footprint keyword spotting,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. IEEE, 2019.

[5] A. Coucke, M. Chlieh, T. Gisselbrecht, D. Leroy, M. Poumeyrol, and T. Lavril, “Efficient keyword spotting using dilated convolutions and gating,” arXiv preprint arXiv:1811.07684, 2018.

[6] A. Raziel and H. Park, “End-to-end streaming keyword spotting,” arXiv preprint arXiv:1812.02802, 2019.

[7] T. J. Hazen, W. Shen, and C. White, “Query-by-example spoken term detection using phonetic posteriorgram templates,” in 2009 IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE, 2009, pp. 421–426.

[8] Y. Zhang and J. R. Glass, “Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams,” in 2009 IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE, 2009, pp. 398–403.

[9] X. Anguera and M. Ferrarons, “Memory efficient subsequence DTW for query-by-example spoken term detection,” in 2013 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2013, pp. 1–6.

[10] Y. Zhuang, X. Chang, Y. Qian, and K. Yu, “Unrestricted vocabulary keyword spotting using LSTM-CTC,” in Interspeech, 2016, pp. 938–942.
[11] L. Lugosch, S. Myer, and V. S. Tomar, “DONUT: CTC-based query-by-example keyword spotting,” arXiv preprint arXiv:1811.10736, 2018.

[12] G. Chen, C. Parada, and T. N. Sainath, “Query-by-example keyword spotting using long short-term memory networks,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. IEEE, 2015, pp. 5236–5240.

[13] Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw, “Streaming small-footprint keyword spotting using sequence-to-sequence models,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 474–481.

[14] K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran, and B. Kingsbury, “End-to-end ASR-free keyword search from speech,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1351–1359, 2017.

[15] S. Myer and V. S. Tomar, “Efficient keyword spotting using time delay neural networks,” in INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, 2018, pp. 1264–1268.

[16] R. Tang and J. Lin, “Deep residual learning for small-footprint keyword spotting,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. IEEE, 2018, pp. 5484–5488.

[17] L. Pandey and K. Nathwani, “LSTM based attentive fusion of spectral and prosodic information for keyword spotting in Hindi language,” in INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, 2018, pp. 112–116.

[18] S. Fernández, A. Graves, and J. Schmidhuber, “An application of recurrent neural networks to discriminative keyword spotting,” in International Conference on Artificial Neural Networks. Springer, 2007, pp. 220–229.

[19] R. Menon, H. Kamper, J. Quinn, and T. Niesler, “Fast ASR-free and almost zero-resource keyword spotting using DTW and CNNs for humanitarian monitoring,” in INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, 2018, pp. 2608–2612.

[20] Y. Chen, T. Ko, L. Shang, X. Chen, X. Jiang, and Q. Li, “Meta learning for few-shot keyword spotting,” arXiv preprint arXiv:1812.10233, 2018.

[21] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.

[22] D. B. Paul and J. M. Baker, “The design for the Wall Street Journal-based CSR corpus,” in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357–362.

[23] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.

[24] Y. Wang, P. Getreuer, T. Hughes, R. F. Lyon, and R. A. Saurous, “Trainable frontend for robust and far-field keyword spotting,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. IEEE, 2017, pp. 5670–5674.

[25] M. Müller, “Dynamic time warping,” Information Retrieval for Music and Motion, pp. 69–84, 2007.
