Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data

Shi, Yucheng; Li, Quanzheng; Sun, Jin; Li, Xiang; Liu, Ninghao

arXiv:2502.14044 (cs)

[Submitted on 19 Feb 2025 (v1), last revised 24 Feb 2025 (this version, v2)]

Abstract: Large Multimodal Models (LMMs), or Vision-Language Models (VLMs), have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific objectives or to provide justifiable explanations for their predictions. To address this challenge, we propose a novel visual rejection sampling framework that improves the cognition and explainability of LMMs using self-synthesized data. Specifically, visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are based on expert-defined concepts and are carefully selected according to their alignment with the image content. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers for the next round of tuning. This iterative process of synthetic data generation and fine-tuning progressively improves the model's ability to generate accurate and reasonable explanations. Experimental results demonstrate the effectiveness of our method in improving both the accuracy and explainability of specialized visual classification tasks.
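
The iterative loop described in the abstract (synthesize candidate answers, filter them without a reward model, fine-tune, repeat) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `ToyModel`, `alignment`, `select_best`, `fine_tune`, and `run` are hypothetical names, images are reduced to sets of visible concepts, and the overlap-based score merely stands in for whatever selection criterion the paper actually uses.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Hypothetical stand-in for an LMM: it only stores a weight per concept."""
    weights: dict = field(default_factory=dict)

    def propose(self, concepts, rng, k=3):
        # Draw k expert-defined concepts (with replacement), favouring
        # concepts that earlier fine-tuning rounds reinforced.
        w = [1.0 + self.weights.get(c, 0.0) for c in concepts]
        return frozenset(rng.choices(concepts, weights=w, k=k))


def alignment(answer, image_concepts):
    """Assumed score: fraction of cited concepts that are visible in the image."""
    return len(answer & image_concepts) / len(answer)


def select_best(candidates, image_concepts, threshold=0.5):
    """Rejection step: keep the best-aligned candidate, or None if it scores below threshold."""
    best = max(candidates, key=lambda a: alignment(a, image_concepts))
    return best if alignment(best, image_concepts) >= threshold else None


def fine_tune(model, accepted):
    """Toy 'fine-tuning': reinforce every concept cited in an accepted answer."""
    for answer in accepted:
        for concept in answer:
            model.weights[concept] = model.weights.get(concept, 0.0) + 1.0


def run(dataset, concepts, rounds=3, n_candidates=4, seed=0):
    """Alternate between self-synthesis, filtering, and tuning for a few rounds."""
    rng = random.Random(seed)
    model = ToyModel()
    for _ in range(rounds):
        accepted = []
        for image_concepts in dataset:
            candidates = [model.propose(concepts, rng) for _ in range(n_candidates)]
            best = select_best(candidates, image_concepts)
            if best is not None:
                accepted.append(best)
        fine_tune(model, accepted)
    return model
```

In this sketch the filter needs no learned reward model because the score is computed directly from the overlap between the cited concepts and the image's concepts; a real system would need some other way to judge that alignment from pixels.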

Comments:	Accepted by ICLR 2025. Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2502.14044 [cs.CV]
	(or arXiv:2502.14044v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.14044

Submission history

From: Yucheng Shi
[v1] Wed, 19 Feb 2025 19:05:45 UTC (10,762 KB)
[v2] Mon, 24 Feb 2025 20:46:48 UTC (8,199 KB)
