Towards Visual Text Grounding of Multimodal Large Language Model

Li, Ming; Zhang, Ruiyi; Chen, Jian; Gu, Jiuxiang; Zhou, Yufan; Dernoncourt, Franck; Zhu, Wanrong; Zhou, Tianyi; Sun, Tong

Computer Science > Computer Vision and Pattern Recognition

arXiv:2504.04974 (cs)

[Submitted on 7 Apr 2025]

Title:Towards Visual Text Grounding of Multimodal Large Language Model

Authors:Ming Li, Ruiyi Zhang, Jian Chen, Jiuxiang Gu, Yufan Zhou, Franck Dernoncourt, Wanrong Zhu, Tianyi Zhou, Tong Sun

View PDF HTML (experimental)

Abstract:Despite the existing evolution of Multimodal Large Language Models (MLLMs), a non-neglectable limitation remains in their struggle with visual text grounding, especially in text-rich images of documents. Document images, such as scanned forms and infographics, highlight critical challenges due to their complex layouts and textual content. However, current benchmarks do not fully address these challenges, as they mostly focus on visual grounding on natural images, rather than text-rich document images. Thus, to bridge this gap, we introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question-answering. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90$ synthetic data based on four diverse datasets. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, we propose two simple and effective TRIG methods based on general instruction tuning and plug-and-play efficient embedding, respectively. By finetuning MLLMs on our synthetic dataset, they promisingly improve spatial reasoning and grounding capabilities.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2504.04974 [cs.CV]
	(or arXiv:2504.04974v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.04974

Submission history

From: Ming Li [view email]
[v1] Mon, 7 Apr 2025 12:01:59 UTC (30,040 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:Towards Visual Text Grounding of Multimodal Large Language Model

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:Towards Visual Text Grounding of Multimodal Large Language Model

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators