Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting

Jung, Youngmoon; Lee, Yong-Hyeok; Jung, Myunghun; Roh, Jaeyoung; Han, Chang Woo; Cho, Hoon-Young

arXiv:2505.16735 (eess)

[Submitted on 22 May 2025 (v1), last revised 23 May 2025 (this version, v2)]

Abstract: For text enrollment-based open-vocabulary keyword spotting (KWS), acoustic and text embeddings are typically compared at either the phoneme or utterance level. To facilitate this, we optimize acoustic and text encoders using deep metric learning (DML), enabling direct comparison of multi-modal embeddings in a shared embedding space. However, the inherent heterogeneity between audio and text modalities presents a significant challenge. To address this, we propose Modality Adversarial Learning (MAL), which reduces the domain gap in heterogeneous modality representations. Specifically, we train a modality classifier adversarially to encourage both encoders to generate modality-invariant embeddings. Additionally, we apply DML to achieve phoneme-level alignment between audio and text, and conduct extensive comparisons across various DML objectives. Experiments on the Wall Street Journal (WSJ) and LibriPhrase datasets demonstrate the effectiveness of the proposed approach.
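The abstract does not specify the loss functions, so the following is only a minimal sketch of how a metric-learning term and an adversarial modality term might be combined. It assumes an InfoNCE-style contrastive loss as a stand-in for the DML objective, a linear modality classifier scored with binary cross-entropy, and a weighted difference as the encoder objective. All function names and the `temperature` and `adv_weight` parameters are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(audio, text, temperature=0.1):
    """InfoNCE-style loss: audio[i] should match text[i] and no other text[j].

    Assumed stand-in for a DML objective, not the paper's loss.
    """
    total = 0.0
    for i, a in enumerate(audio):
        logits = [cosine(a, t) / temperature for t in text]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(audio)

def modality_bce(audio, text, weights, bias=0.0):
    """Binary cross-entropy of a linear modality classifier (audio=1, text=0)."""
    def prob_audio(x):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        return 1.0 / (1.0 + math.exp(-z))
    eps = 1e-12  # guards against log(0)
    losses = [-math.log(prob_audio(a) + eps) for a in audio]
    losses += [-math.log(1.0 - prob_audio(t) + eps) for t in text]
    return sum(losses) / len(losses)

def encoder_objective(audio, text, weights, adv_weight=0.5):
    """Assumed encoder objective: low metric loss, high classifier loss."""
    metric = contrastive_loss(audio, text)
    adversarial = modality_bce(audio, text, weights)
    return metric - adv_weight * adversarial

# Toy phoneme-level embeddings: row i of each list is the same phoneme.
audio_emb = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

print(contrastive_loss(audio_emb, text_aligned))
print(contrastive_loss(audio_emb, text_swapped))
print(modality_bce(audio_emb, text_aligned, weights=[0.0, 0.0]))
```

In this toy setup the aligned pairs give a much smaller contrastive loss than the swapped pairs, and a classifier with zero weights sits at chance level, which is the state the encoders would push toward if their embeddings were modality-invariant. In an actual training loop the classifier would be updated to minimize its own loss while the encoders are updated against it; how the paper implements that alternation is not stated in the abstract.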

Comments:	5 pages, 1 figure, Accepted at Interspeech 2025
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2505.16735 [eess.AS]
	(or arXiv:2505.16735v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2505.16735

Submission history

From: Youngmoon Jung
[v1] Thu, 22 May 2025 14:49:46 UTC (167 KB)
[v2] Fri, 23 May 2025 02:53:38 UTC (167 KB)
