Crossmodal Knowledge Distillation with WordNet-Relaxed Text Embeddings for Robust Image Classification

Guo, Chenqi; Rong, Mengshuo; Feng, Qianli; Feng, Rongfan; Ma, Yinglong

arXiv:2503.24017 (cs)

[Submitted on 31 Mar 2025]

Abstract: Crossmodal knowledge distillation (KD) aims to enhance a unimodal student using a multimodal teacher model. In particular, when the teacher's modalities include the student's, additional complementary information can be exploited to improve knowledge transfer. In supervised image classification, image datasets typically include class labels that represent high-level concepts, suggesting a natural avenue to incorporate textual cues for crossmodal KD. However, these labels rarely capture the deeper semantic structures in real-world visuals and can lead to label leakage if used directly as inputs, ultimately limiting KD performance. To address these issues, we propose a multi-teacher crossmodal KD framework that integrates CLIP image embeddings with learnable WordNet-relaxed text embeddings under a hierarchical loss. By avoiding direct use of exact class names and instead using semantically richer WordNet expansions, we mitigate label leakage and introduce more diverse textual cues. Experiments show that this strategy significantly boosts student performance, whereas noisy or overly precise text embeddings hinder distillation efficiency. Interpretability analyses confirm that WordNet-relaxed prompts encourage heavier reliance on visual features over textual shortcuts, while still effectively incorporating the newly introduced textual cues. Our method achieves state-of-the-art or second-best results on six public datasets, demonstrating its effectiveness in advancing crossmodal KD.
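
The abstract does not spell out the hierarchical loss, so the following is only a rough sketch of the generic building block that multi-teacher distillation usually rests on: the temperature-softened KL term from standard knowledge distillation, averaged over several teachers. It is not the authors' loss. The function names, the equal default weights, and the temperature of 4.0 are all assumptions made for illustration; in the setting described above, the two teacher logit vectors could stand in for an image-based teacher and a text-informed teacher.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def multi_teacher_kd_loss(student_logits, teacher_logits_list,
                          weights=None, temperature=4.0):
    """Weighted sum of KL(teacher || student) over several teachers.

    Hypothetical helper: equal weights and temperature=4.0 are
    illustrative defaults, not values taken from the paper.
    """
    if weights is None:
        weights = [1.0 / len(teacher_logits_list)] * len(teacher_logits_list)
    student_probs = softmax(student_logits, temperature)
    loss = 0.0
    for weight, teacher_logits in zip(weights, teacher_logits_list):
        teacher_probs = softmax(teacher_logits, temperature)
        # The T^2 factor keeps gradient magnitudes comparable across temperatures.
        loss += weight * temperature ** 2 * kl_divergence(teacher_probs, student_probs)
    return loss


if __name__ == "__main__":
    image_teacher = [2.0, 0.5, -1.0]   # made-up logits for a 3-class toy problem
    text_teacher = [1.5, 1.0, -0.5]    # made-up logits from a second teacher
    student = [0.0, 0.0, 0.0]
    print(multi_teacher_kd_loss(student, [image_teacher, text_teacher]))
```

The loss is zero when the student reproduces every teacher's softened distribution and grows as the student diverges from them. A full training objective would typically add a supervised cross-entropy term on the ground-truth labels.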

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2503.24017 [cs.CV]
	(or arXiv:2503.24017v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.24017

Submission history

From: Dr. Chenqi Guo
[v1] Mon, 31 Mar 2025 12:41:26 UTC (347 KB)
