Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR

Lu, Xugang; Shen, Peng; Tsao, Yu; Kawai, Hisashi

arXiv:2505.13079 (eess)

[Submitted on 19 May 2025]

Abstract: Transferring linguistic knowledge from a pretrained language model (PLM) to acoustic feature learning has proven effective in enhancing end-to-end automatic speech recognition (E2E-ASR). However, aligning representations between the linguistic and acoustic modalities remains a challenge due to inherent modality gaps. Optimal transport (OT) has shown promise in mitigating these gaps by minimizing the Wasserstein distance (WD) between linguistic and acoustic feature distributions. Previous OT-based methods, however, overlook structural relationships and treat feature vectors as unordered sets. To address this, we propose Graph Matching Optimal Transport (GM-OT), which models linguistic and acoustic sequences as structured graphs. Nodes represent feature embeddings, while edges capture temporal and sequential relationships. GM-OT minimizes both the WD (between nodes) and the Gromov-Wasserstein distance (GWD) (between edges), leading to a fused Gromov-Wasserstein distance (FGWD) formulation. This enables structured alignment and more efficient knowledge transfer than existing OT-based approaches. Theoretical analysis further shows that prior OT-based methods for linguistic knowledge transfer can be viewed as a special case within our GM-OT framework. We evaluate GM-OT on Mandarin ASR using a CTC-based E2E-ASR system with a PLM for knowledge transfer. Experimental results demonstrate significant performance gains over state-of-the-art models, validating the effectiveness of our approach.
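To make the fused objective described in the abstract concrete, the following is a minimal sketch of how a fused Gromov-Wasserstein cost can be evaluated for a given transport plan. It uses the textbook FGW form with a squared loss, not necessarily the exact formulation in the paper. The function names (`pairwise_sq_dists`, `temporal_structure`, `fgw_cost`), the normalized temporal-distance edge weights, the trade-off weight `alpha`, and the uniform coupling `T` are all assumptions made for illustration.

```python
import numpy as np

def pairwise_sq_dists(x, y):
    """Node cost: squared Euclidean distance between every row of x and y."""
    diff = x[:, None, :] - y[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def temporal_structure(n):
    """Edge weights (assumed): normalized temporal distance between positions."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) / max(n - 1, 1)

def fgw_cost(M, C1, C2, T, alpha):
    """Fused Gromov-Wasserstein cost of a coupling T (squared loss).

    (1 - alpha) * sum_ij M[i,j] T[i,j]
      + alpha * sum_ijkl (C1[i,k] - C2[j,l])^2 T[i,j] T[k,l]
    """
    wd_term = np.sum(M * T)
    # L[i, j, k, l] = (C1[i, k] - C2[j, l]) ** 2
    L = (C1[:, None, :, None] - C2[None, :, None, :]) ** 2
    gw_term = np.einsum("ijkl,ij,kl->", L, T, T)
    return (1.0 - alpha) * wd_term + alpha * gw_term

# Toy data (hypothetical): 6 acoustic frames and 4 token embeddings, dim 8.
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(6, 8))
linguistic = rng.normal(size=(4, 8))

M = pairwise_sq_dists(acoustic, linguistic)   # node-to-node costs
C_a = temporal_structure(6)                   # acoustic graph edges
C_l = temporal_structure(4)                   # linguistic graph edges
T = np.full((6, 4), 1.0 / 24)                 # uniform coupling, not optimized

wd_only = fgw_cost(M, C_a, C_l, T, alpha=0.0)  # node term only
fused = fgw_cost(M, C_a, C_l, T, alpha=0.5)    # node and edge terms
print(f"alpha=0.0: {wd_only:.4f}  alpha=0.5: {fused:.4f}")
```

Setting `alpha=0.0` drops the edge term and leaves only the node-level transport cost, which is in the spirit of the abstract's remark that earlier OT-based transfer methods are a special case. In practice the coupling `T` would be optimized rather than fixed; this sketch only evaluates the objective.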

Comments:	To appear in Interspeech 2025
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2505.13079 [eess.AS]
	(or arXiv:2505.13079v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2505.13079

Submission history

From: Yu Tsao
[v1] Mon, 19 May 2025 13:13:18 UTC (933 KB)