Query-by-Example On-Device Keyword Spotting
Byeonggeun Kim, Mingu Lee, Jinkyu Lee, Yeonseok Kim, and Kyuwoong Hwang
Qualcomm AI Research
Hakdong-ro, Gangnam-gu, Seoul, Republic of Korea
2.1. Acoustic model

We exploit a CTC acoustic model [21]. We denote the activation of the ASR model as $O = o_1, o_2, \cdots, o_T$, where $o_t \in \mathbb{R}^N$, and let $o_t^n$ be the activation of unit $n$ at time $t$; thus $o_t^n$ is the probability of observing $n$ at time $t$. CTC uses an extra blank output $\phi$. We denote $L' = L \cup \{\phi, \mathrm{space}\}$, where $L$ is the set of 39 context-independent phonemes; the space output implies a short pause between words. We let $L'(T)$ be the set of sequences of length $T$ whose elements are in $L'$. Then the conditional probability of a path $P$ given $X$ is $p(P|X) = \prod_{t=1}^{T} p(o_t^{P_t})$.

The hypothesis is taken from the maximum-posterior unit at each time frame: for each time step $t$, we choose $\operatorname{argmax}_n o_t^n$, $n = 1, \cdots, N$, and obtain a path $P$. The hypothesis is then defined through the mapping $B$ as $B(P)$. For example, the keyword ‘Hey Snapdragon’ yields a hypothesis like ‘HH.EY. .S.N.AE.P.T. .A.AE.G.AH.N’. Using the hypothesis as a sequential phonetic constraint, we generate left-to-right FSTs.
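To make the decoding concrete, the sketch below computes the path $P$ as the frame-wise argmax and applies the CTC mapping $B$ (collapse repeats, drop blanks); the array shapes and unit names here are our assumptions, not the paper’s code:

```python
import numpy as np

def greedy_ctc_hypothesis(posteriors, units, blank="<blank>"):
    """Greedy decoding: path P is the frame-wise argmax; B(P) collapses
    repeated units and removes blanks.

    posteriors: (T, N) array of per-frame probabilities (the o_t^n).
    units: the N output unit names (39 phonemes, 'space', and the blank).
    """
    path = [units[n] for n in np.argmax(posteriors, axis=1)]  # path P
    hypothesis, prev = [], None
    for u in path:
        if u != prev and u != blank:  # B(P): collapse repeats, drop blanks
            hypothesis.append(u)
        prev = u
    return hypothesis
```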
2.2.2. Keyword spotting

Table 2: Comparison of FRR (%) of various KWS systems at given FAs-per-hour levels.

Method             Keyword           Params   SNR    FRR @ 1 FA/hr   FRR @ 0.5 FA/hr   FRR @ 0.05 FA/hr
Shan et al. [3]    Xiao ai tong xue  84 k     -      1.02            -                 -
Coucke et al. [5]  Hey Snips         222 k    5 dB   -               1.60              -
Wang et al. [4]    Hai xiao wen      -        -      4.17            -                 -
He et al. [13]     Personal Name     -        -      -               -                 8.9
S-DTW              Hey Snapdragon    211 k    6 dB   3.12            4.46              8.01
S-DTW              Hey Snips         211 k    6 dB   13.30           15.07             20.69
FST                Hey Snapdragon    211 k    6 dB   0.62            1.04              3.22
FST                Hey Snips         211 k    6 dB   2.79            3.77              8.58
… ‘Hey Snapdragon’. At each user-specific test, three query utterances are randomly picked and the rest are used as positive test samples. We augment the positive utterances with five types of noise, {babble, car, music, office, typing}, at three signal-to-noise ratios (SNRs), {10 dB, 6 dB, 0 dB}.
We use WSJ-SI200 [22] as negative samples. We sampled 24 hours of WSJ-SI200 and segmented the whole audio stream into 2-second-long pieces. We augment each segment with one of the five noise types, {babble, car, music, office, typing}, at one SNR among {10 dB, 6 dB, 0 dB}; the noise type and SNR are randomly selected.
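A minimal sketch of this noise augmentation, assuming additive mixing at a target SNR computed from average signal and noise power (the function and its parameters are our own illustration, not the paper’s code):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at a target signal-to-noise ratio in dB."""
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]   # match lengths
    p_speech = np.mean(speech ** 2)               # signal power
    p_noise = np.mean(noise ** 2) + 1e-12         # noise power
    # Scale noise so 10*log10(p_speech / p_scaled_noise) equals snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: pick a random noise type and SNR, as in the augmentation above.
# rng = np.random.default_rng()
# noisy = mix_at_snr(segment, noises[rng.integers(5)], rng.choice([10, 6, 0]))
```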
The acoustic model achieves … % word-error-rate (WER) on the LibriSpeech test-clean dataset without prior linguistic information.

3.2. Results

We tested 111 user-specific KWS systems: 50 for the query ‘Hey Snapdragon’ and the rest for ‘Hey Snips’. We used three queries from a given speaker for enrollment. When only one or two queries are used instead, the relative increase in FRR (%) at 0.5 FA per hour is 222.05 % and 2.47 %, respectively, at 6 dB SNR. The scores from the three hypotheses are averaged for each test.

The query word ‘Hey Snips’ is short, so false alarms are more likely to occur. Performance is heavily influenced by the type of keyword; this effect is also reported in [13]. In Figure 4, we plot a histogram of the FRR across users. Most user models show low FRR, apart from a few outliers.

Due to limited data access, direct comparison with previous works is difficult. Nevertheless, we compare our results with others in Table 2 to show that they are comparable to those of predefined KWS systems [3, 5, 4] and a query-by-example system [13]. Blanks in the table indicate unknown information.
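The operating points in Table 2 can be reproduced from raw detection scores; the sketch below shows one common recipe (our own illustration with hypothetical names; the paper does not spell out this computation):

```python
import numpy as np

def frr_at_fa_rate(pos_scores, neg_scores, neg_hours, target_fa_per_hr):
    """FRR (%) at the loosest threshold meeting a false-alarm budget.

    pos_scores: detection scores of positive (keyword) utterances.
    neg_scores: detection scores of negative audio segments.
    neg_hours:  total duration of the negative stream in hours.
    """
    neg_sorted = np.sort(np.asarray(neg_scores))[::-1]  # descending
    budget = int(target_fa_per_hr * neg_hours)          # allowed FAs
    # Fire when score > thr; at most `budget` negatives exceed thr.
    thr = neg_sorted[budget] if budget < len(neg_sorted) else -np.inf
    frr = 100.0 * np.mean(np.asarray(pos_scores) <= thr)
    return frr, thr
```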
3.2.3. On-device threshold prediction

We tested a naive threshold-prediction approach as a baseline. The baseline assumes a scenario in which a device stores 100 randomly chosen general negatives: 50 negatives come from clean data and the rest from the augmented data described in Section 3.1.1, so A = 3 and B = 100 in Eq. (4).
The proposed method exploits query-specific negatives. For each query, we divide the waveform into three parts of equal length, so there are five ways to shuffle the parts such that the result differs from the original signal. Since there are three queries per enrollment, we obtain 15 generated negatives. Each hypothesis from a query uses the other two queries as positives and their generated negatives as negatives, so A = 3 and B = 10.
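A minimal sketch of this negative generation, assuming equal-length thirds and all non-identity orderings (the names are illustrative):

```python
import itertools
import numpy as np

def query_specific_negatives(waveform):
    """Return the 5 shuffled versions of a query waveform.

    The waveform is cut into three equal-length parts; of the 3! = 6
    orderings, the identity is dropped, leaving five negatives.
    """
    third = len(waveform) // 3
    parts = [waveform[:third], waveform[third:2 * third], waveform[2 * third:]]
    negatives = []
    for perm in itertools.permutations(range(3)):
        if perm == (0, 1, 2):          # skip the original ordering
            continue
        negatives.append(np.concatenate([parts[i] for i in perm]))
    return negatives
```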
Fig. 5: Comparison of the baseline with query-specific generated negatives. The graphs show the relationship between the mean positive and mean negative scores, with their best-fit lines.

Figure 5 shows the mean positive and mean negative scores for the 111 user-specific models. The baseline shows low, and even negative, correlation coefficient (R) values: the R values for ‘Hey Snapdragon’ and ‘Hey Snips’ are -0.04 and -0.21, respectively. Meanwhile, the proposed method shows positive R values, 0.25 for ‘Hey Snapdragon’ and 0.40 for ‘Hey Snips’. If there is a common tendency between positives and negatives across keywords, we can expect useful threshold-decision rules from them. Here we tried the simple linear interpolation introduced in Section 2.3.

We search for τ in Eq. (4) by brute force so as to reach near 0.05 FAs per hour on average over the 111 models. We set τ to 0.82 for the baseline and 0.38 for the proposed method; the resulting FAs per hour are 0.049 for the baseline and 0.050 for the proposed method on average.

Both methods find a τ that reaches the target FAs-per-hour level; however, they differ dramatically across keywords. The inter-keyword difference should be small for a query-by-example system to work on arbitrary keywords. For the baseline, ‘Hey Snapdragon’ shows 0.001 FAs per hour while ‘Hey Snips’ shows 0.088. Despite using a 6-to-7-times smaller B, the proposed method shows exactly the same 0.050 FAs per hour for both ‘Hey Snapdragon’ and ‘Hey Snips’. The baseline shows 17.77 % FRR on 6 dB noisy positives because of its low FAs per hour, while the proposed method shows 3.95 % FRR for ‘Hey Snapdragon’. These numbers differ from Table 2 because Table 2 fixes the FAs-per-hour level for each model, whereas this section uses the FAs per hour averaged over models.
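Eq. (4) itself is not reproduced in this excerpt; assuming it interpolates the mean positive and mean negative scores with weight τ, a sketch of the threshold prediction looks like this (illustrative only):

```python
import numpy as np

def predict_threshold(pos_scores, neg_scores, tau):
    """User- and query-specific threshold by linear interpolation.

    pos_scores: A scores from enrollment positives (A = 3 here).
    neg_scores: B scores from stored or generated negatives.
    tau: weight found by brute-force search, e.g. 0.38 for the
         proposed method and 0.82 for the baseline.
    """
    mu_pos = np.mean(pos_scores)    # mean positive score
    mu_neg = np.mean(neg_scores)    # mean negative score
    return tau * mu_pos + (1.0 - tau) * mu_neg
```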
4. CONCLUSIONS

In this paper, we suggest a simple and powerful approach to the query-by-example on-device keyword spotting task. Our system uses user-specific queries, and a CTC-based AM outputs a phonetic posteriorgram. We decode the output and build left-to-right FSTs as the hypothesis. The log-likelihood is calculated as a score for testing. For the on-device setting, we suggest a method to predict a proper user- and query-specific threshold with the hypothesis, generating query-specific negatives by shuffling the query waveform. While many previous KWS approaches are not reproducible due to limited data access, we tested our methods on public and well-known data. In the experiments, our approach showed promising performance, comparable to the latest predefined and query-by-example methods. This work is limited by the lack of public data, and we suggest only a naive approach to utilizing the generated negatives. As future work, we will study more advanced ways to predict the threshold using the query-specific negatives and test a wider variety of keywords.

5. REFERENCES

[1] J. Guo, K. Kumatani, M. Sun, M. Wu, A. Raju, N. Strom, and A. Mandal, “Time-delayed bottleneck highway networks using a DFT feature for keyword spotting,” in ICASSP 2018 – IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2018, pp. 5489–5493.

[2] M. Chen, S. Zhang, M. Lei, Y. Liu, H. Yao, and J. Gao, “Compact feedforward sequential memory networks for small-footprint keyword spotting,” in INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, 2018, pp. 2663–2667.

[3] C. Shan, J. Zhang, Y. Wang, and L. Xie, “Attention-based end-to-end models for small-footprint keyword spotting,” in INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, 2018, pp. 2037–2041.

[4] X. Wang, S. Sun, C. Shan, J. Hou, L. Xie, S. Li, and X. Lei, “Adversarial examples for improving end-to-end attention-based small-footprint keyword spotting,” in ICASSP 2019 – IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2019.

[5] A. Coucke, M. Chlieh, T. Gisselbrecht, D. Leroy, M. Poumeyrol, and T. Lavril, “Efficient keyword spotting using dilated convolutions and gating,” arXiv preprint arXiv:1811.07684, 2018.