Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

Zong, Yongshuo; Zhang, Qin; An, Dongsheng; Li, Zhihua; Xu, Xiang; Xu, Linghan; Tu, Zhuowen; Xing, Yifan; Dabeer, Onkar

arXiv:2505.13788 (cs)

[Submitted on 20 May 2025]

Abstract: This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. In particular, we address five critical real-world challenges in text-instruction-based grounding: hallucinated references, multi-object scenarios, reasoning, multi-granularity, and part-level references. By leveraging knowledge distillation from a pre-trained teacher model, our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations, minimizing the need for costly human annotation. The resulting dataset, Ground-V, captures rich object localization knowledge and nuanced pixel-level referring expressions. Experimental results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training yields an average accuracy boost of 4.4% for LISA and 7.9% for PSALM across six benchmarks on the gIoU metric. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state-of-the-art by more than 20%.

Comments:	Accepted to CVPR'25
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2505.13788 [cs.CV]
	(or arXiv:2505.13788v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2505.13788

Submission history

From: Linghan Xu
[v1] Tue, 20 May 2025 00:37:19 UTC (20,722 KB)
