MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Yue, Xiang; Zheng, Tianyu; Ni, Yuansheng; Wang, Yubo; Zhang, Kai; Tong, Shengbang; Sun, Yuxuan; Yu, Botao; Zhang, Ge; Sun, Huan; Su, Yu; Chen, Wenhu; Neubig, Graham


arXiv:2409.02813 (cs)

[Submitted on 4 Sep 2024 (v1), last revised 22 May 2025 (this version, v3)]

Abstract: This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, with drops ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.
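
As a rough illustration of steps (1) and (2), the sketch below drops questions that text-only models mostly answer correctly and computes the random-guess baseline for a given number of options. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the dictionary layout, and the 0.5 threshold are all hypothetical, and the abstract does not specify the actual filtering criterion.

```python
from typing import Dict, List


def filter_text_answerable(
    answers: Dict[str, str],
    text_only_preds: Dict[str, List[str]],
    max_correct_fraction: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Return IDs of questions that text-only models mostly fail to answer.

    answers maps question ID -> gold option letter.
    text_only_preds maps question ID -> predictions made without the image.
    """
    kept = []
    for qid, gold in answers.items():
        preds = text_only_preds.get(qid, [])
        # Fraction of text-only predictions that match the gold answer.
        correct = sum(p == gold for p in preds)
        fraction = correct / len(preds) if preds else 0.0
        # Keep the question only if text alone is usually not enough.
        if fraction <= max_correct_fraction:
            kept.append(qid)
    return kept


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 1.0 / num_options
```

With more candidate options, `chance_accuracy` falls, which is one reason augmenting the options makes guessing less effective.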

Comments:	ACL 2025 Main
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2409.02813 [cs.CL]
	(or arXiv:2409.02813v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2409.02813

Submission history

From: Xiang Yue
[v1] Wed, 4 Sep 2024 15:31:26 UTC (2,289 KB)
[v2] Tue, 10 Sep 2024 12:55:31 UTC (2,289 KB)
[v3] Thu, 22 May 2025 08:22:02 UTC (16,647 KB)
