MMMU-Pro: A More Robust Multi-discipline Multimodal
Understanding Benchmark

Xiang Yue*, Tianyu Zheng*, Yuansheng Ni*,
Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang,
Huan Sun, Yu Su, Wenhu Chen, Graham Neubig
MMMU Team
https://mmmu-benchmark.github.io/#leaderboard
*Equal contributions. Contact: [email protected]

Abstract

This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models’ true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly “see” and “read” simultaneously, testing a core human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, with drops ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future multimodal research.

1 Introduction

Recent advances in multimodal large language models (MLLMs) have led to progress in tackling complex reasoning tasks that combine textual and visual information (Yin et al., 2023a; Jin et al., 2024). Models like GPT-4o (OpenAI, 2024b) have achieved impressive results, e.g., on the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark (Yue et al., 2024), reaching an accuracy of 69.1% on college-level questions that integrate text and images.

While these achievements are significant, they raise a critical question: Do the current benchmark results truly reflect a deep, multifaceted understanding of diverse subjects, or are these models exploiting subtle shortcuts and statistical patterns to arrive at correct answers without genuine comprehension and reasoning?

This question has profound implications for the development and deployment of AI systems in real-world applications. If models rely on superficial cues rather than true multimodal understanding (Du et al., 2023; Yuksekgonul et al., 2023), we risk overestimating their capabilities and potentially deploying systems that fail in unpredictable ways when faced with novel scenarios (Wu and Xie, 2024).

To address this concern and push the boundaries of multimodal AI evaluation, we introduce MMMU-Pro, a more robust and challenging version of the MMMU benchmark. MMMU-Pro is designed to more accurately and rigorously assess a model’s true multimodal understanding and reasoning capabilities across a wide range of academic disciplines. The development of MMMU-Pro is motivated by key observations, including the text-only solvability of some benchmark questions, limited option space in multiple-choice formats (Wang et al., 2024), and the need to challenge models’ ability to jointly understand different modalities in a more integrated way.

MMMU-Pro employs a rigorous three-step construction process (as shown in Figure 1) that builds upon MMMU (Yue et al., 2024): (1) filtering out questions answerable by text-only language models, (2) augmenting candidate options to reduce the effectiveness of guessing based on the options, and (3) introducing a vision-only input setting (as shown in Figure 4) where models are presented with questions embedded in a screenshot or photo.

Figure 1: An overview of the construction process of MMMU-Pro.

The introduction of the vision-only input setting is particularly crucial, as it tests a fundamental human cognitive ability: the seamless integration and switching between visual and textual information. This setting challenges models to develop the capability to truly “see” and “read” simultaneously, mirroring how humans effortlessly process complex scenes where text and images are intertwined. This ability is crucial for tasks ranging from interpreting scientific diagrams (Li et al., 2024d) to navigating graphical user interfaces (Liu et al., 2024b; Zheng et al., 2024; Koh et al., 2024). Moreover, this approach aligns with how users naturally interact with AI systems, often sharing screenshots or photos rather than separating text and images.

Our experimental results demonstrate the effectiveness of MMMU-Pro in providing a more rigorous evaluation of multimodal models. We observe significant performance drops across all tested models when compared to the original MMMU benchmark, with decreases ranging from 16.8% to 26.9%. These results highlight the limitations of current state-of-the-art models in true multimodal understanding and reasoning. Furthermore, our analysis reveals that while CoT (Wei et al., 2022) prompting generally improves performance, the benefits vary across models and settings.

Interestingly, we find that explicit OCR prompts do not significantly impact performance for most models, suggesting that advanced multimodal models have already developed robust text extraction capabilities from images. However, this result also underscores that simple OCR is insufficient for the challenges presented by MMMU-Pro’s vision-only input setting. Our further qualitative analysis indicates that when text is embedded within images, it significantly increases the overall complexity of the visual input, requiring models to not only recognize text but also understand its context, relationship to visual elements, and relevance to the question. These findings not only provide a more accurate assessment of current multimodal AI capabilities but also highlight the need for more sophisticated multimodal reasoning abilities.

2 MMMU-Pro: A More Robust Version of MMMU

2.1 Revisiting the MMMU Benchmark

The Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark (Yue et al., 2024) is a comprehensive dataset designed to evaluate multimodal AI models on college-level tasks that require subject-specific knowledge and deliberate reasoning. MMMU consists of 11.5K carefully curated multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines across 30 subjects and 183 subfields. Each question in MMMU is a multimodal image-text pair with 4 multiple-choice options, featuring 30 diverse image types such as charts, diagrams, maps, and chemical structures. MMMU has rapidly established itself as a standard evaluation framework for testing prominent multimodal models upon their release (OpenAI, 2024b, a; Anthropic, 2024; Reid et al., 2024; Li et al., 2024a).

However, we find that text-only LLMs can accurately answer some questions without requiring any visual input. We take a closer look at these questions and identify two main issues: 1) Text-Only Dependency: Certain questions are relatively independent or irrelevant to the corresponding images. 2) Shortcut Exploitation: Even when questions require images for humans to answer correctly, models often find shortcuts or correlations within the candidate options, leveraging their pre-existing knowledge (from pre-training) to arrive at the correct answer. Two examples that are answered correctly by Llama-3-70B Instruct (Dubey et al., 2024) are shown in Figure 2.

2.2 Methods

To address these issues and build a more robust benchmark, we implemented a three-step approach.

Filtering Questions: We begin by filtering out questions that can be answered by text-only LLMs. We select four strong open-source LLMs, namely Llama3-70B-Instruct (Dubey et al., 2024), Qwen2-72B-Instruct (Yang et al., 2024), Yi-1.5-34B-Chat (Young et al., 2024), and Mixtral-8$\times$22B-Instruct, and task them with answering the MMMU questions without access to images. The models are required to provide answers even when they indicate that visual input is necessary. We repeat this process ten times for each model, considering a question “answerable” by a model if that model answers it correctly more than five times. We then exclude any question that at least three of the four models can answer in this sense. From the remaining pool, we randomly sample 1800 questions, evenly distributed across 30 subjects (60 questions per subject).
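The filtering rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released code: the record keys (`id`, `subject`, `correct_counts`) and the function names are hypothetical, and each question is assumed to carry one count of correct text-only trials per model.

```python
import random
from collections import defaultdict

TRIALS = 10            # each text-only model answers every question ten times
ANSWERABLE_MIN = 6     # "more than five" correct trials out of ten
MODELS_TO_EXCLUDE = 3  # "at least three out of the four models"


def is_text_answerable(correct_counts):
    """correct_counts: number of correct text-only trials for each model."""
    answerable_models = sum(1 for c in correct_counts if c >= ANSWERABLE_MIN)
    return answerable_models >= MODELS_TO_EXCLUDE


def filter_and_sample(questions, per_subject=60, seed=0):
    """Drop text-answerable questions, then sample evenly across subjects.

    `questions` is a list of dicts with the hypothetical keys "id",
    "subject", and "correct_counts" (one count per text-only model).
    """
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for q in questions:
        if not is_text_answerable(q["correct_counts"]):
            by_subject[q["subject"]].append(q)
    sampled = []
    for subject in sorted(by_subject):
        pool = by_subject[subject]
        # Take at most `per_subject` questions from each subject's pool.
        sampled.extend(rng.sample(pool, min(per_subject, len(pool))))
    return sampled
```

With 30 subjects and `per_subject=60`, this yields the 1800 questions described above, provided every subject retains at least 60 questions after filtering.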

Augmenting Candidate Options: Despite the filtering, some questions can still be answered by text-only LLMs, often exploiting subtle hints within the candidate options. To counteract this, we increase the number of candidate options from four to ten, making it more challenging for models to rely on guessing. This augmentation is done by human experts with the help of GPT-4o, with additional validation steps to ensure the quality and diversity of the options. Specifically, GPT-4o generates and Claude 3.5 filters the options, followed by two rounds of human review to refine and verify the augmented options. During this process, experts also review the original annotated questions to ensure their relevance to the images and to eliminate any questions that lack a clear connection or coherence. This step filters out 70 questions, and we obtain 1730 questions in total.

As illustrated in Figure 3, these two steps significantly reduce the accuracy of text-only models attempting to guess the answers.

Enhancing Evaluation with a Vision-Only Setting: To further challenge the multimodal understanding of models, we introduce a vision-only input setting in MMMU-Pro. In this setting, the model is presented with a question embedded within a screenshot or photo, without any text explicitly fed into the model. To implement this setting, we ask the human annotators to manually capture photos and screenshots over a simulated display environment. This process involves varying the backgrounds, font styles, and font sizes to replicate the diversity of real-world conditions. By using different combinations of these elements, we create a broad range of visual contexts, ensuring that the models are not only challenged by the integration of text and images but also by the variability in how this content is presented. Examples of the vision-only input setting are shown in Figure 4.

The motivation for this setting comes from real-world usage and human cognition. Users often capture screenshots of questions with both text and images instead of inputting text separately, reflecting a natural tendency to process information holistically. Humans excel at understanding integrated visual-textual content, and this setting encourages models to develop similar comprehension. By mimicking this behavior, the vision-only input setting enhances realism and prepares models for real-world multimodal tasks. Ultimately, we obtain 3,460 questions—1,730 in standard format and 1,730 as screenshots or photos.

3 Experiments

3.1 Experimental Setups

Baselines. To establish a comprehensive understanding of MMMU-Pro’s difficulty and to provide reference points for future research, we evaluate a diverse set of state-of-the-art multimodal models as baselines. These models represent a range of training approaches and capabilities in the field of multimodal AI. Our baseline models include:

Proprietary Models: GPT-4o (0513) (OpenAI, 2024b) and GPT-4o mini (OpenAI, 2024a), Claude 3.5 Sonnet (Anthropic, 2024), and Gemini 1.5 Pro (0801 and 0523 versions) (Team et al., 2023; Reid et al., 2024). These models represent the cutting edge of multimodal AI capabilities.

Open-source models: We evaluate a range of open-source models, including InternVL2 (8B, 40B, and Llama3-76B versions) (Chen et al., 2024), LLaVA (OneVision-7B, OneVision-72B, and various NeXT versions) (Li et al., 2024a; Liu et al., 2024a), VILA-1.5-40B (Lin et al., 2024), MiniCPM-V2.6 (Yao et al., 2024), Phi-3.5-Vision (Abdin et al., 2024), and Idefics3-8B-Llama3 (Laurençon et al., 2024). These models showcase the current state of publicly available multimodal AI systems. We evaluate these models across three different settings: 1) Standard setting without augmented options (usually 4 options); 2) Standard setting with augmented options (usually 10 options); 3) Vision-only input setting.

The overall performance score for MMMU-Pro is calculated as the average of scores from settings (2) and (3). We include setting (1) and report the original MMMU validation set performance solely for comparison purposes, to highlight the increased difficulty of MMMU-Pro.

We evaluate the models with both Direct and CoT prompts (as shown in Appendix A), and report the higher of the two scores in the overall results. We also discuss the influence of the CoT prompt in Section 3.3.

| Model | MMMU-Pro Standard (4 Opts) | MMMU-Pro Standard (10 Opts) | MMMU-Pro Vision | MMMU (Val) | $\Delta_{1}$ | $\Delta_{2}$ |
|---|---|---|---|---|---|---|
| Random Choice | 24.9 | 12.8 | 12.4 | 22.1 | -9.3 | -9.7 |
| Frequent Choice | 27.8 | 12.1 | - | 26.8 | -14.7 | - |
| Human Expert (Low) | 75.4 | 73.0 | - | 76.2 | -3.2 | - |
| Human Expert (Medium) | 82.1 | 80.8 | - | 82.6 | -1.8 | - |
| Human Expert (High) | 88.6 | 85.4 | - | 88.6 | -3.2 | - |
| GPT-4o (0513) (OpenAI, 2024b) | **64.7** | _54.0_ | **49.7** | **69.1** | _-15.1_ (↑) | **-19.4** (-) |
| Claude 3.5 Sonnet (Anthropic, 2024) | _63.7_ | **55.0** | _48.0_ | _68.3_ | **-13.3** (↓) | _-20.3_ (-) |
| Gemini 1.5 Pro (0801) (Reid et al., 2024) | 60.6 | 49.4 | 44.4 | 65.8 | -16.4 (-) | -21.4 (-) |
| Gemini 1.5 Pro (0523) (Reid et al., 2024) | 57.6 | 46.5 | 40.5 | 62.2 | -15.7 (-) | -21.7 (-) |
| GPT-4o mini (OpenAI, 2024a) | 55.3 | 39.9 | 35.2 | 59.4 | -19.5 (↑) | -24.2 (↑) |
| Qwen2-VL-72B (Qwen, 2024) | **59.3** | **49.2** | **43.3** | **64.5** | **-15.3** (-) | -21.2 (-) |
| InternVL2-Llama3-76B (Chen et al., 2024) | _55.0_ | _41.9_ | _38.0_ | 58.3 | -16.4 (↓) | **-20.3** (↓) |
| InternVL2-40B (Chen et al., 2024) | 47.4 | 36.3 | 32.1 | 55.2 | -18.9 (-) | -23.1 (↓) |
| LLaVA-OneVision-72B (Li et al., 2024a) | 52.3 | 38.0 | 24.0 | 56.8 | -18.8 (-) | -32.8 (↑) |
| Qwen2-VL-7B (Qwen, 2024) | 46.6 | 34.1 | 27.0 | 54.1 | -20.0 (↑) | -27.1 (↓) |
| Pixtral-12B (Mistral, 2024) | 47.5 | 33.4 | 25.0 | 52.5 | -19.1 (↑) | -27.5 (-) |
| InternVL2-8B (Chen et al., 2024) | 42.6 | 32.5 | 25.4 | 51.2 | -18.7 (-) | -25.8 (↓) |
| MiniCPM-V2.6 (Yao et al., 2024) | 40.6 | 30.2 | 24.2 | 49.8 | -19.6 (↑) | -25.6 (↓) |
| VILA-1.5-40B (Lin et al., 2024) | 46.8 | 35.9 | 14.1 | 51.9 | -16.0 (↓) | -37.8 (↑) |
| LLaVA-NEXT-72B (Liu et al., 2024a) | 43.0 | 31.0 | 19.2 | 49.9 | -18.9 (-) | -30.7 (-) |
| LLaVA-OneVision-7B (Li et al., 2024a) | 42.8 | 29.5 | 18.7 | 48.8 | -19.3 (↑) | -30.1 (↓) |
| LLaVA-NeXT-34B (Liu et al., 2024a) | 44.5 | 30.3 | 17.2 | 48.1 | -17.8 (↓) | -30.9 (↓) |
| Idefics3-8B-Llama3 (Laurençon et al., 2024) | 40.8 | 30.1 | 15.6 | 46.6 | -16.5 (↓) | -31.0 (-) |
| Qwen2-VL-2B (Qwen, 2024) | 34.8 | 25.3 | 17.2 | 41.1 | _-15.8_ (-) | -23.9 (↓) |
| Phi-3.5-Vision (Abdin et al., 2024) | 37.8 | 26.3 | 13.1 | 43.0 | -16.7 (-) | -29.9 (↑) |
| LLaVA-NeXT-7B (Liu et al., 2024a) | 33.7 | 19.4 | 14.6 | 35.3 | -15.9 (-) | _-20.7_ (↓) |
| LLaVA-NeXT-13B (Liu et al., 2024a) | 33.9 | 19.8 | 14.5 | 36.2 | -16.4 (-) | -21.7 (↓) |

Table 1: Results of models on MMMU-Pro and MMMU (Val). $\Delta_{1}$: Standard (10 options) - MMMU (Val); $\Delta_{2}$: Vision - MMMU (Val). (↓) represents a decrease in ranking, while (↑) indicates an increase. The best-performing model in each category is in **bold**, and the second best is _underlined_.

Approximating Human Expert Performance. While rigorous human evaluation of MMMU-Pro provides valuable insights, conducting such an assessment is both time-consuming and costly. Instead, we develop an approach to approximate human expert performance based on the original MMMU human evaluation data. This approximation is justified by several key factors. Firstly, the core content and difficulty of the questions remain unchanged in MMMU-Pro, supporting the validity of using the original human performance data as a close approximation. Secondly, in the original MMMU evaluation, human experts are required to write out their problem-solving processes, significantly reducing the likelihood of random guessing. For questions without detailed solving processes, we randomly select one option from the augmented candidates and recalculate the accuracy. Finally, human experts, with their innate ability to seamlessly integrate visual and textual information, are expected to perform similarly in the vision-only input setting as they do in the original format. Based on these considerations, we posit that human expert performance on MMMU-Pro closely aligns with the original MMMU results, allowing us to maintain a human performance benchmark without incurring the substantial costs of a new expert evaluation. More details of the human estimation performance can be found in Appendix B.
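The recalculation step can be illustrated with a short Python sketch. It is only an approximation of the procedure described above (the full details are in Appendix B): the record keys `has_solution_process`, `was_correct`, `num_augmented_options`, and `answer_index` are hypothetical names introduced here for illustration.

```python
import random


def approximate_human_accuracy(records, seed=0):
    """Estimate human accuracy (in %) on the augmented-option questions.

    Each record is a dict with hypothetical keys:
      has_solution_process  - whether the expert wrote out a solution
      was_correct           - whether the original MMMU answer was correct
      num_augmented_options - number of options after augmentation
      answer_index          - index of the correct augmented option
    """
    rng = random.Random(seed)
    hits = 0
    for r in records:
        if r["has_solution_process"]:
            # A documented solution is assumed to carry over unchanged.
            hits += int(r["was_correct"])
        else:
            # Without a documented solution, simulate a uniform random
            # guess over the augmented candidate options.
            guess = rng.randrange(r["num_augmented_options"])
            hits += int(guess == r["answer_index"])
    return 100.0 * hits / len(records)
```

Because the guess is simulated, the estimate for undocumented questions depends on the seed; averaging over several seeds would reduce that variance.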

3.2 Overall Results

Table 1 presents the overall results of different models on MMMU-Pro.

Effect of Increased Candidate Options: The shift from 4 to 10 candidate options (the Standard (4 Opts) and Standard (10 Opts) columns of Table 1) reveals a significant drop in performance for all models. GPT-4o (0513) experienced a decrease of 10.7 points, from 64.7% to 54.0%. This indicates that increasing the number of options effectively reduces the likelihood of models guessing the correct answer, forcing them to engage more deeply with the multimodal content.

Impact of Vision-Only Setting: The introduction of the vision-only input setting further challenges models, as evidenced by the additional drop in performance when comparing the vision-only results to the 10-option standard results. For instance, GPT-4o (0513) dropped another 4.3 points (from 54.0% to 49.7%) when evaluated in the vision-only setting, and LLaVA-OneVision-72B saw a dramatic 14.0-point decrease (from 38.0% to 24.0%). This suggests that the vision-only setting successfully tests the models’ ability to integrate visual and textual information, highlighting their limitations when the text is not explicitly provided.

Combined Effects on MMMU-Pro: The overall drop $\Delta_{3}$, the difference between the MMMU-Pro overall score (the average of the Standard (10 options) and Vision scores) and MMMU (Val), shows a significant decrease across the board. For instance, models like Gemini 1.5 Pro (0801) and Claude 3.5 Sonnet exhibited declines of 18.9 and 16.8 points, respectively, while more drastic drops were seen in models like VILA-1.5-40B with a 26.9-point decrease.

This significant reduction in accuracy across the board suggests that MMMU-Pro successfully mitigates the shortcuts and guessing strategies that models could exploit in the original benchmark.
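As a worked check of these numbers, the sketch below recomputes the overall MMMU-Pro score (the average of the Standard (10 options) and Vision scores, as defined in Section 3.1) and its gap to MMMU (Val) for the three models mentioned above. The input values are copied from Table 1; the function names are ours.

```python
# (Standard 10 options, Vision, MMMU Val) scores copied from Table 1.
TABLE1 = {
    "Claude 3.5 Sonnet": (55.0, 48.0, 68.3),
    "Gemini 1.5 Pro (0801)": (49.4, 44.4, 65.8),
    "VILA-1.5-40B": (35.9, 14.1, 51.9),
}


def overall_score(standard10, vision):
    """MMMU-Pro overall score: mean of the two MMMU-Pro settings."""
    return (standard10 + vision) / 2


def delta3(standard10, vision, mmmu_val):
    """Difference between the MMMU-Pro overall score and MMMU (Val)."""
    return overall_score(standard10, vision) - mmmu_val


for model, (s10, vis, val) in TABLE1.items():
    print(f"{model}: overall={overall_score(s10, vis):.1f}, "
          f"delta3={delta3(s10, vis, val):.1f}")
```

The computed gaps are -16.8, -18.9, and -26.9 points, matching the declines reported for Claude 3.5 Sonnet, Gemini 1.5 Pro (0801), and VILA-1.5-40B.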

3.3 Impact of CoT Prompting

Figure 5 examines the effectiveness of Chain of Thought (CoT) prompting on the MMMU-Pro benchmark, in both Standard and Vision Input settings. Across both settings, CoT prompts generally improved performance, though the extent varied significantly. For instance, Claude 3.5 Sonnet saw a substantial increase in the Standard setting, rising from 42.7% to 55.0%, while models like LLaVA-OneVision-72B showed only minimal gains.

Interestingly, we observed a significant performance drop with CoT prompting for some models, such as VILA-1.5-40B. This decline might be attributed to challenges in instruction-following abilities. When a model struggles to follow instructions accurately, generating CoT explanations becomes more difficult. Additionally, these models may face issues with maintaining the correct response format, leading to what is known as “boiled response format” problems. These findings highlight the potential of CoT to enhance model performance in complex, real-world tasks that require nuanced reasoning and integration of multiple information sources. However, they also underscore the importance of robust instruction-following capabilities as a prerequisite for effective CoT implementation.

The effectiveness of CoT prompting across disciplines is summarized in Table 6 and Figure 9, comparing CoT and direct accuracy for GPT-4o and LLaVA-OneVision 72B. CoT shows significant improvements in reasoning-intensive fields like Tech and Engineering (e.g., a 14.49% gain for GPT-4o) and Science (8.22% gain). Smaller yet consistent gains are observed for LLaVA-OneVision 72B, such as 2.33% in Tech and Engineering. However, CoT’s benefits are limited or negative in fields like Art and Design, where GPT-4o gains only 1.58%, and LLaVA-OneVision 72B sees a 17.12% decline. These results underscore CoT’s strengths in structured reasoning tasks but its reduced effectiveness in domains requiring subjective interpretation.

3.4 Does OCR Help in the Vision Setting?

In the Vision Input setting, one natural question is whether Optical Character Recognition (OCR) helps improve model performance on MMMU-Pro. We answer this question by first calculating the OCR accuracy of different models. Specifically, we ask the model to extract the full text of the question and answer choices. Then the OCR accuracy is calculated by comparing the text extracted with the original text using Levenshtein distance, which measures the difference between the two strings. The similarity between the extracted and original text is computed as:

$\displaystyle\text{OCR Accuracy}=1-\frac{\text{Levenshtein.distance}(\text{text1},\text{text2})}{\max(\text{len}(\text{text1}),\text{len}(\text{text2}))}$
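The similarity score can be computed as in the following sketch. It uses a plain dynamic-programming edit distance so that it runs without third-party packages; the paper's formula suggests a library routine, and the handling of two empty strings (returning 1.0) is an assumption made here to avoid division by zero.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def ocr_accuracy(extracted: str, original: str) -> float:
    """1 - normalized edit distance between extracted and original text."""
    longest = max(len(extracted), len(original))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as a perfect match
    return 1.0 - levenshtein_distance(extracted, original) / longest
```

For example, `ocr_accuracy("kitten", "sitting")` gives 1 - 3/7, since three edits separate the two strings and the longer one has seven characters.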

Table 2 shows that most of the models demonstrate strong OCR capabilities, as indicated by high similarity scores. Based on this result, we then explore whether explicitly asking the model to first extract the question and then solve it (with an OCR prompt shown in Appendix A) could help improve performance within the Vision Input setting of MMMU-Pro. Across the models evaluated, the inclusion of OCR prompts did not significantly alter performance. These minimal differences suggest that capable models are already proficient at extracting and understanding textual information from images, even without explicit OCR prompts.

| Model | OCR Acc. | Vision Setting Acc. (w/ OCR Prompt) | Vision Setting Acc. (w/o OCR Prompt) |
|---|---|---|---|
| GPT-4o | 92.3 | 49.7 | 49.4 |
| Gemini 1.5 Pro (0801) | 89.7 | 44.4 | 43.6 |
| GPT-4o mini | 89.6 | 35.2 | 35.6 |
| InternVL2-Llama3-76B | 88.1 | 38.0 | 37.9 |
| InternVL2-40B | 85.5 | 32.1 | 28.9 |
| Pixtral-12B | 83.1 | 25.0 | 24.1 |
| LLaVA-OneVision-72B | 87.8 | 24.0 | 23.8 |
| InternVL2-8B | 85.2 | 25.4 | 24.6 |
| MiniCPM-V2.6 | 67.0 | 24.2 | 21.1 |
| LLaVA-NEXT-72B | 62.0 | 19.2 | 20.0 |
| Idefics3-8B-Llama3 | 68.5 | 15.6 | 14.1 |
| LLaVA-NeXT-7B | 36.6 | 14.6 | 14.3 |
| LLaVA-NeXT-13B | 51.1 | 14.5 | 12.8 |

Table 2: Model performance in the Vision Input setting: OCR accuracy, and Vision-setting accuracy with and without OCR prompts.

Interestingly, Figure 6 shows that high OCR accuracy doesn’t always translate to strong multimodal reasoning. For example, LLaVA-OneVision-72B matches InternVL2-Llama3-76B and GPT-4o mini in OCR accuracy but lags significantly in MMMU-Pro Vision performance, indicating that OCR accuracy alone is insufficient for robust reasoning. Conversely, top-performing models like GPT-4o consistently excel in both areas. Despite GPT-4o’s high OCR accuracy, its MMMU-Pro Vision performance drops notably compared to MMMU (Val), revealing that even advanced models struggle to fully integrate and reason over multimodal inputs in the vision-only setting.

3.5 Qualitative Analysis

To gain deeper insights into model performance beyond quantitative metrics, we conducted a thorough qualitative analysis of MMMU-Pro results, focusing on two key scenarios: 1) Correct answers with four options but failure with ten options in the standard setting; 2) Success in the standard ten-option setting but failure in the vision input setting. Our analysis revealed several critical factors affecting model performance:

Challenges with Increased Options. Models often select the closest answer rather than arriving at a definitive choice, leading to increased errors with more options, as shown in Figure 11. Conceptually similar options, particularly in nuanced questions, can cause confusion. For instance, in conceptual questions, models struggled to differentiate subtle distinctions within a subject area, revealing limitations in fine-grained understanding.

Increased Cognitive Load in Vision-Text Integration. Processing visual and textual inputs simultaneously increases the cognitive load on models. An example is shown in Figure 10. The model perfectly extracted the text from the image but still failed to answer the question correctly. Another case is shown in Figure 21. The graph’s similar lines and overlapping data points may distract the model from distinguishing between the two unemployment categories, leading to the error.

Overemphasis on Visual Cues in Multimodal Reasoning. When visual cues dominate over textual reasoning, models may incorrectly prioritize less relevant information from the images. In the Figure 33 example, the model in the Vision setting incorrectly chose the League of Nations by focusing on the World War I image, missing the broader context of World War II and the United Nations. A proper balance between visual and textual information is essential to avoid such mistakes.

Impact of Context Switching. Rapid transitions between visual and textual information can cause models to lose focus or misinterpret key data. For example, in Figure 26, the model initially correctly defined both the objective function and the algebraic constraints. However, due to context switching between the textual description and the geometric figure, it misinterpreted the feasible region.

3.6 Error Analysis

Following the MMMU error analysis, we analyze 60 error cases from GPT-4o in the Vision setting to better understand the error reasons (Figure 7). Consistent with MMMU findings, the errors are broadly categorized into three main types: perception errors, knowledge errors, and reasoning errors. However, reasoning errors account for 46% of cases, a significant increase from the original MMMU distribution (26%). Within perception errors, text recognition and OCR do not prove to be the primary bottleneck. Instead, the main challenges lie in the integration and interpretation of visual and textual information. This shift in error distribution highlights the increased difficulty for models in transitioning from accurate perception to complex multimodal reasoning.

3.7 Response Length Comparison

One interesting observation from the previous qualitative examples is that GPT-4o’s responses (especially the reasoning sentences) under the Vision Input setting seem to be shorter than under the Standard setting. We quantify this phenomenon by asking another LLM (Qwen2-72B-Instruct (Yang et al., 2024)) to classify the sentences in GPT-4o’s responses as “Descriptive” or “Analytical”. As shown in Figure 8, under the Vision Input setting GPT-4o generates significantly shorter responses and devotes more tokens to “Descriptive” than to “Analytical” sentences. One possible reason is that the increased cognitive workload of the vision inputs requires the model to focus more on visual processing, which distracts it from generating extensive reasoning chains.

4 Guide for Future Model Training

The results of MMMU-Pro provide valuable insights into the challenges faced by current multimodal models and suggest several promising directions for future model development.

Scaling of LLM Backbones. As demonstrated in Table 1, increasing the scale of large language model (LLM) backbones consistently enhances both perception and reasoning capabilities. For example, larger models such as GPT-4o outperform their smaller counterparts like GPT-4o mini, while LLaVA-OneVision-72B achieves better results than LLaVA-OneVision-7B. Similarly, InternVL2-Llama3-76B demonstrates superior performance compared to InternVL2-8B. This trend underscores the importance of scaling as a critical factor in improving multimodal understanding and reasoning.

More Capable Vision Encoders that Highlight Visual Representation Learning. We train two Cambrian (Tong et al., 2024a) models on 1M Cambrian data with two different vision encoders to explore their impact (more details of the setup are in Appendix E). As shown in Table 3, encoders such as Siglip ViT-SO400M-14 (Zhai et al., 2023), trained with extensive language supervision, perform well on MMMU (Val) but struggle on MMMU-Pro (Vision). In comparison, self-supervised encoders like DINOv2 ViT-G-14 (Oquab et al., 2023) achieve better results on the Vision input setting. These findings suggest that future work may focus on further enhancing visual feature learning while exploring the integration of language-based training objectives with self-supervised training objectives.

Better Integration of Vision and Text Modalities. Integration of visual and textual information remains a key challenge for multimodal models. Current architectures often struggle with tasks requiring deep cross-modal understanding. Developing models with better cross-modal attention and effective feature fusion is critical to bridge this gap.

CoT Data Generation. The CoT prompting technique shows significant benefits in reasoning-heavy domains within MMMU-Pro, as reflected in Figure 5 and Table 6. While domains like Tech and Engineering and Business see notable improvements, CoT performance remains weak or even detrimental in areas such as Art and Design. To address these gaps, future efforts could focus on synthesizing more diverse reasoning-intensive CoT data and tailoring strategies for domains where CoT impact is minimal. Leveraging inference-compute concepts (Welleck et al., 2024) could further enhance CoT capabilities, enabling models to generalize more effectively across varied reasoning tasks.

Text-Rich Image Generation in Reasoning Scenarios. Our analysis shows that strong OCR accuracy and reasoning performance on traditional benchmarks do not always translate to success on MMMU-Pro Vision. A potential reason is the lack of training data with text-rich images in reasoning-intensive contexts. To address this, we developed a tool leveraging the MMMU-Pro Vision human annotation process. This tool processes a JSON file with questions and images and outputs screenshots embedding both. Such tools can further generate similar datasets at scale, enhancing models’ ability to integrate visual and textual information in real-world scenarios.
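A rough idea of how such a tool might work is sketched below. This is not the authors' tool: it assumes a hypothetical JSON schema (a list of items with `question`, `options`, and `image` keys) and stops at producing HTML pages with randomized font, font size, and background, which a headless browser or manual capture could then turn into screenshots.

```python
import html
import json
import random

# Hypothetical style pools used to vary how each question is presented.
FONTS = ["Arial", "Georgia", "Courier New"]
FONT_SIZES_PX = [14, 18, 22]
BACKGROUNDS = ["#ffffff", "#f5f0e1", "#e8eef7"]


def render_question_page(item, rng):
    """Return an HTML page embedding the question, its image, and options."""
    options = "".join(f"<li>{html.escape(opt)}</li>" for opt in item["options"])
    style = (
        f"font-family:{rng.choice(FONTS)};"
        f"font-size:{rng.choice(FONT_SIZES_PX)}px;"
        f"background:{rng.choice(BACKGROUNDS)};"
    )
    image_src = html.escape(item["image"], quote=True)
    return (
        f'<html><body style="{style}">'
        f"<p>{html.escape(item['question'])}</p>"
        f'<img src="{image_src}">'
        f'<ol type="A">{options}</ol>'
        "</body></html>"
    )


def render_pages(json_text, seed=0):
    """Render one HTML page per question in a JSON list of items."""
    rng = random.Random(seed)
    return [render_question_page(item, rng) for item in json.loads(json_text)]
```

Fixing the seed keeps the style assignment reproducible, so the same JSON file always yields the same set of pages.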

| Method | MMMU (Val) | MMMU-Pro (Vision) |
|---|---|---|
| DINOv2 ViT-G-14 | 37.1 | 17.4 |
| Siglip ViT-SO400M-14 | 37.9 | 16.7 |

Table 3: Performance of an MLLM with different vision encoders on MMMU and MMMU-Pro.

5 Related Work

Multimodal Large Language Models. Recent progress in multimodal AI has been marked by innovative training approaches (Lu et al., 2019; Chen et al., 2020; Zhou et al., 2020; Zhang et al., 2021; Li et al., 2020; Alayrac et al., 2022; Awadalla et al., 2023). Inspired by the success of large language models, researchers have developed various models with improved instruction-following capabilities (Liu et al., 2023c, b, 2024a; Li et al., 2024a; Dai et al., 2023; Zhu et al., 2023; Zhang et al., 2023; Gao et al., 2023; Ye et al., 2023a, b; Zhao et al., 2023; Li et al., 2023; Monajatipoor et al., 2023; Zhao et al., 2024; Li et al., 2024c; Lin et al., 2024; Zhang et al., 2024a). Proprietary models such as GPT-4V (OpenAI, 2023), GPT-4o (OpenAI, 2024b), Gemini (Team et al., 2023), and Claude-3.5 (Anthropic, 2024) have demonstrated strong performance across various vision-language tasks. However, a significant challenge remains in accurately evaluating the capabilities of these advanced MLLMs, highlighting the need for more robust and comprehensive benchmarks.

MLLM Benchmarks. The rise of more advanced multimodal pre-training and instruction tuning has exposed the limitations of earlier benchmarks like VQA (Antol et al., 2015; Goyal et al., 2017), OK-VQA (Marino et al., 2019), and MSCOCO (Lin et al., 2014), which no longer suffice to evaluate the full spectrum of LMMs' capabilities. To address this, recent benchmarks such as LAMM (Yin et al., 2023b), LVLM-eHub (Xu et al., 2023), SEED (Li et al., 2024b), MMBench (Liu et al., 2023d), CV-Bench (Tong et al., 2024a), MM-Vet (Yu et al., 2024), Mantis (Jiang et al., 2024), and BLINK (Fu et al., 2024) have emerged, covering aspects from basic perception to hallucination detection (Cui et al., 2023; Liu et al., 2023a). However, existing benchmarks often fall short in evaluating expert-level domain knowledge and complex reasoning (Lu et al., 2023a; Zhang et al., 2024b). While MMMU (Yue et al., 2024) made strides by incorporating multimodal, college-level questions, it still permits text-only models to find shortcuts (Lu et al., 2023b; Zhang et al., 2024b). To address these limitations, we introduce MMMU-Pro, a more robust benchmark that removes text-only answerable questions, expands candidate options, and includes a vision-only input setting to better reflect real-world multimodal scenarios.

6 Conclusion

MMMU-Pro offers a stronger multimodal understanding and reasoning benchmark than its predecessor MMMU. Our results show MMMU-Pro’s effectiveness in exposing current state-of-the-art model limitations, with significant performance drops across all tested systems. MMMU-Pro highlights critical research directions: 1) Developing models with consistent performance across settings, particularly bridging standard and vision-only input gaps. 2) Enhancing vision-text integration for complex mixed-format inputs. 3) Advancing reasoning techniques to address MMMU-Pro’s heightened question complexity.

Ethical Statement

The MMMU-Pro benchmark is designed with ethical considerations to ensure fair and responsible AI evaluation. The dataset excludes sensitive content, and the assessment focuses on testing multimodal capabilities without introducing bias. We aim for transparency in reporting model limitations and encourage further research to address any societal impacts related to the use of these models in real-world applications.

Limitations

While MMMU-Pro improves upon existing benchmarks by filtering out text-only solvable questions and introducing a vision-only setting, some limitations remain. The dataset may still contain subtle statistical shortcuts that models can exploit, and its scope is limited to predefined disciplines and question formats. Additionally, while the vision-only input setting increases difficulty, it does not fully capture the complexities of human perception. Lastly, our reliance on approximated human performance rather than direct evaluation introduces potential biases in reporting accurate human expert performance.

References

Abdin et al. (2024) Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. ArXiv preprint, abs/2404.14219.
Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems.
Anthropic (2024) Anthropic. 2024. Claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5-sonnet.
Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: visual question answering. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2425–2433. IEEE Computer Society.
Awadalla et al. (2023) Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. 2023. Openflamingo: An open-source framework for training large autoregressive vision-language models. ArXiv preprint, abs/2308.01390.
Chen et al. (2020) Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Uniter: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120.
Chen et al. (2024) Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. 2024. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. ArXiv preprint, abs/2404.16821.
Cui et al. (2023) Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. 2023. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges. ArXiv preprint, abs/2311.03287.
Dai et al. (2023) Wenliang Dai, Junnan Li, DONGXU LI, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems, volume 36, pages 49250–49267. Curran Associates, Inc.
Du et al. (2023) Mengnan Du, Fengxiang He, Na Zou, Dacheng Tao, and Xia Hu. 2023. Shortcut learning of large language models in natural language understanding. Communications of the ACM, 67(1):110–120.
Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. ArXiv preprint, abs/2407.21783.
Fu et al. (2024) Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024. Blink: Multimodal large language models can see but not perceive. ArXiv preprint, abs/2404.12390.
Gao et al. (2023) Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. 2023. Llama-adapter v2: Parameter-efficient visual instruction model. ArXiv preprint, abs/2304.15010.
Goyal et al. (2017) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6325–6334. IEEE Computer Society.
Mistral AI (2024) Mistral AI. 2024. Cheaper, better, faster, stronger. https://mistral.ai/news/mixtral-8x22b/.
Jiang et al. (2024) Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. 2024. Mantis: Interleaved multi-image instruction tuning. ArXiv preprint, abs/2405.01483.
Jin et al. (2024) Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, et al. 2024. Efficient multimodal large language models: A survey. ArXiv preprint, abs/2405.10739.
Koh et al. (2024) Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. 2024. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, Bangkok, Thailand. Association for Computational Linguistics.
Laurençon et al. (2024) Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon. 2024. Building and better understanding vision-language models: insights and future directions. ArXiv preprint, abs/2408.12637.
Li et al. (2023) Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023. Otter: A multi-modal model with in-context instruction tuning. ArXiv preprint, abs/2305.03726.
Li et al. (2024a) Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024a. Llava-onevision: Easy visual task transfer. ArXiv preprint, abs/2408.03326.
Li et al. (2024b) Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. 2024b. Seed-bench: Benchmarking multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13299–13308.
Li et al. (2024c) Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. 2024c. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. ArXiv preprint, abs/2407.07895.
Li et al. (2020) Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 121–137. Springer.
Li et al. (2024d) Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, et al. 2024d. Mmsci: A multimodal multi-discipline dataset for phd-level scientific comprehension. ArXiv preprint, abs/2407.04903.
Lin et al. (2024) Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. 2024. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699.
Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.
Liu et al. (2023a) Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. 2023a. Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v (ision), llava-1.5, and other multi-modality models. ArXiv preprint, abs/2310.14566.
Liu et al. (2023b) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023b. Improved baselines with visual instruction tuning. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
Liu et al. (2024a) Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024a. Llava-next: Improved reasoning, ocr, and world knowledge.
Liu et al. (2023c) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023c. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc.
Liu et al. (2024b) Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, and Xiang Yue. 2024b. Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding? Conference on Language Modeling.
Liu et al. (2023d) Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. 2023d. Mmbench: Is your multi-modal model an all-around player? ArXiv preprint, abs/2307.06281.
Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13–23.
Lu et al. (2023a) Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2023a. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. ArXiv preprint, abs/2310.02255.
Lu et al. (2023b) Yujie Lu, Xiujun Li, William Yang Wang, and Yejin Choi. 2023b. Vim: Probing multimodal large language models for visual embedded instruction following. ArXiv preprint, abs/2311.17647.
Marino et al. (2019) Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. OK-VQA: A visual question answering benchmark requiring external knowledge. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 3195–3204. Computer Vision Foundation / IEEE.
Mistral (2024) Mistral. 2024. Pixtral-12b. https://mistral.ai/news/pixtral-12b.
Monajatipoor et al. (2023) Masoud Monajatipoor, Liunian Harold Li, Mozhdeh Rouhsedaghat, Lin Yang, and Kai-Wei Chang. 2023. MetaVL: Transferring in-context learning ability from language models to vision-language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 495–508, Toronto, Canada. Association for Computational Linguistics.
OpenAI (2023) OpenAI. 2023. Gpt-4v(ision) system card.
OpenAI (2024a) OpenAI. 2024a. Gpt-4o mini: advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/.
OpenAI (2024b) OpenAI. 2024b. Hello gpt4-o. https://openai.com/index/hello-gpt-4o/.
Oquab et al. (2023) Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. 2023. Dinov2: Learning robust visual features without supervision. ArXiv preprint, abs/2304.07193.
Qwen (2024) Qwen. 2024. Qwen2-vl: To see the world more clearly. https://qwenlm.github.io/blog/qwen2-vl/.
Reid et al. (2024) Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv preprint, abs/2403.05530.
Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. ArXiv preprint, abs/2312.11805.
Tong et al. (2024a) Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. 2024a. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. ArXiv preprint, abs/2406.16860.
Tong et al. (2024b) Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. 2024b. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578.
Wang et al. (2024) Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. 2024. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. ArXiv preprint, abs/2406.01574.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
Welleck et al. (2024) Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, and Zaid Harchaoui. 2024. From decoding to meta-generation: Inference-time algorithms for large language models. ArXiv preprint, abs/2406.16838.
Wu and Xie (2024) Penghao Wu and Saining Xie. 2024. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094.
Xu et al. (2023) Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. 2023. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. ArXiv preprint, abs/2306.09265.
Yang et al. (2024) An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. 2024. Qwen2 technical report. ArXiv preprint, abs/2407.10671.
Yao et al. (2024) Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. 2024. Minicpm-v: A gpt-4v level mllm on your phone. ArXiv preprint, abs/2408.01800.
Ye et al. (2023a) Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. 2023a. mplug-owl: Modularization empowers large language models with multimodality. ArXiv preprint, abs/2304.14178.
Ye et al. (2023b) Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. 2023b. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. ArXiv preprint, abs/2311.04257.
Yin et al. (2023a) Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2023a. A survey on multimodal large language models. ArXiv preprint, abs/2306.13549.
Yin et al. (2023b) Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Xiaoshui Huang, Zhiyong Wang, Lu Sheng, LEI BAI, Jing Shao, and Wanli Ouyang. 2023b. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. In Advances in Neural Information Processing Systems, volume 36, pages 26650–26685. Curran Associates, Inc.
Young et al. (2024) Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. 2024. Yi: Open foundation models by 01. ai. ArXiv preprint, abs/2403.04652.
Yu et al. (2024) Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. 2024. MM-vet: Evaluating large multimodal models for integrated capabilities. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 57730–57754. PMLR.
Yue et al. (2024) Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. 2024. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567.
Yuksekgonul et al. (2023) Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. 2023. When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations.
Zhai et al. (2023) Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986.
Zhang et al. (2024a) Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, Dahua Lin, and Jiaqi Wang. 2024a. Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output. ArXiv preprint, abs/2407.03320.
Zhang et al. (2021) Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 5579–5588. Computer Vision Foundation / IEEE.
Zhang et al. (2023) Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. 2023. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. ArXiv preprint, abs/2303.16199.
Zhang et al. (2024b) Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. 2024b. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? ArXiv preprint, abs/2403.14624.
Zhao et al. (2023) Bo Zhao, Boya Wu, and Tiejun Huang. 2023. Svit: Scaling up visual instruction tuning. ArXiv preprint, abs/2307.04087.
Zhao et al. (2024) Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, and Baobao Chang. 2024. Mmicl: Empowering vision-language model with multi-modal in-context learning. The Twelfth International Conference on Learning Representations.
Zheng et al. (2024) Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. Gpt-4v(ision) is a generalist web agent, if grounded. In Forty-first International Conference on Machine Learning.
Zhou et al. (2020) Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2020. Unified vision-language pre-training for image captioning and VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, 34, pages 13041–13049. AAAI Press.
Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. ArXiv preprint, abs/2304.10592.


Appendix A Evaluation Prompts

Appendix B Approximating Human Expert Performance

Establishing a reliable benchmark for human performance on MMMU-Pro is crucial to evaluating the true capabilities of multimodal AI models. Conducting new and rigorous human evaluations, however, is both time-consuming and expensive. To address this issue, we developed an approximation method based on the existing human evaluation data from the original MMMU. The resulting estimates are presented in Table 4.

Estimate	Overall	Art & Design	Business	Science	Health & Medicine	Human & Social Sci.	Tech & Eng.
Low	73.0	77.4	77.9	78.5	65.2	63.6	73.5
Medium	80.8	83.3	88.4	84.9	72.8	75.8	78.2
High	85.4	85.7	89.5	86.0	84.8	81.8	84.4

Table 4: Estimated human performance on MMMU-Pro across different disciplines, based on the original MMMU evaluation data. The table presents low, medium, and high performance estimates in terms of overall accuracy and discipline-specific breakdowns.

The validity of using this approximation method relies on several key factors. Firstly, the core content and difficulty of the questions in MMMU-Pro remain unchanged from those in the original MMMU, supporting the use of the original human performance data as a valid proxy. Secondly, in the initial MMMU evaluation, human experts were required to document their problem-solving processes, which significantly reduced the likelihood of random guessing. For questions lacking detailed solution processes, we simulated random selection from expanded candidate options and recalculated the accuracy. Finally, human experts inherently excel at seamlessly integrating visual and textual information, suggesting that their performance in a purely visual input setting would be analogous to their performance in the original format.

Given that the 577 questions in MMMU-Pro are sourced from the MMMU validation set, we extracted the corresponding data from the evaluations of the 90 human experts involved in the original MMMU assessment. We categorized and counted these questions based on whether they included a detailed solution process (w/ Solution) or were subjected to guessing due to the lack of a detailed solution process (w/o Solution). We then counted the correct and incorrect answers in each category, as summarized in Table 5. Specifically, the categorization is defined in Equation 1:

\begin{equation}
\begin{split}
\text{Num}_{\text{total}} &= \text{Num}_{\text{w/o Solution}} + \text{Num}_{\text{w/ Solution}} \\
&= \text{Num}_{\text{w/o Solution(wrong)}} + \text{Num}_{\text{w/o Solution(correct)}} \\
&\quad + \text{Num}_{\text{w/ Solution(wrong)}} + \text{Num}_{\text{w/ Solution(correct)}}
\end{split}
\tag{1}
\end{equation}

Using these counts, we can estimate the lower bound of human performance on MMMU-Pro with Equation 2:

\begin{equation}
\text{Num}_{\text{Estimate(correct)}} = \text{Num}_{\text{w/ Solution(correct)}} + \left\lfloor \left( \frac{\text{Num}_{\text{w/o Solution}}}{\text{Num}_{\text{total}}} \right) \times \text{Num}_{\text{w/o Solution}} \right\rceil
\tag{2}
\end{equation}

This formula considers the number of correctly solved questions with detailed solution processes and the proportion of correctly guessed questions without detailed solution processes, ensuring a conservative estimate.
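As a sanity check on Equations 1 and 2, the short sketch below applies them to a single row of counts. It is an illustration under two assumptions: the nearest-integer brackets in Equation 2 are read as round-half-up, and Num_{w/o Solution} is taken to be the sum of the wrong and correct counts in the "w/o Sol." column. The function and argument names are ours.

```python
import math


def estimate_correct(wo_wrong, wo_correct, w_wrong, w_correct):
    """Apply Equations 1 and 2 to one row; return (estimated correct, accuracy %)."""
    num_wo = wo_wrong + wo_correct             # Num_{w/o Solution}
    num_total = num_wo + w_wrong + w_correct   # Equation 1
    # Round-half-up stands in for the nearest-integer brackets (assumption).
    guessed = math.floor(num_wo / num_total * num_wo + 0.5)
    est = w_correct + guessed                  # Equation 2
    return est, round(100 * est / num_total, 1)


# Medium "Overall" row of Table 5: w/o Sol. 33/12, w/ Sol. 70/462.
print(estimate_correct(33, 12, 70, 462))  # → (466, 80.8)
```

With these inputs the sketch reproduces the Medium "Overall" entry of Table 5 (466 estimated correct answers, 80.8% accuracy); other rows are not checked here.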

Discipline	Low				Medium				High			
Discipline	w/o Sol. (w/c)	w/ Sol. (w/c)	Est. (w/c)	Acc	w/o Sol. (w/c)	w/ Sol. (w/c)	Est. (w/c)	Acc	w/o Sol. (w/c)	w/ Sol. (w/c)	Est. (w/c)	Acc
Art & Design	4/11	11/64	19/65	77.4	5/1	8/70	14/70	83.3	4/2	6/72	12/72	85.7
Art	2/2	2/14	4/14	77.8	1/0	1/16	2/16	88.9	0/1	0/17	1/17	94.4
Art Theory	1/2	2/18	5/18	78.3	1/1	2/19	4/19	82.6	1/1	3/18	5/18	78.3
Design	1/4	4/10	5/10	66.7	1/0	2/12	3/12	80.0	1/0	1/13	2/13	86.7
Music	0/3	3/22	6/22	78.6	2/0	3/23	5/23	82.1	2/0	2/24	4/24	85.7
Business	4/11	11/73	21/74	77.9	4/1	6/84	11/84	88.4	2/3	5/85	10/85	89.5
Accounting	0/3	3/19	6/19	76.0	2/0	1/22	3/22	88.0	0/2	1/22	3/22	88.0
Economics	0/4	4/13	5/13	72.2	1/0	1/16	2/16	88.9	1/0	0/17	1/17	94.4
Finance	1/2	2/15	4/15	78.9	0/0	1/18	1/18	94.7	0/0	2/17	2/17	89.5
Manage	2/2	2/8	4/9	69.2	1/1	2/9	4/9	69.2	1/1	2/9	4/9	69.2
Marketing	1/0	0/18	2/18	90.0	0/0	1/19	1/19	95.0	0/0	0/20	0/20	100.0
Science	3/12	12/72	20/73	78.5	3/1	10/79	14/79	84.9	3/1	9/80	13/80	86.0
Biology	0/5	5/13	7/13	65.0	2/0	5/13	7/13	65.0	1/1	5/13	7/13	65.0
Chemistry	0/3	3/14	4/14	77.8	0/1	2/15	3/15	83.3	1/0	2/15	3/15	83.3
Geography	2/0	0/8	2/8	80.0	0/0	1/9	1/9	90.0	0/0	1/9	1/9	90.0
Math	1/4	4/14	7/14	66.7	1/0	1/19	2/19	90.5	1/0	1/19	2/19	90.5
Physics	0/0	0/23	1/23	95.8	0/0	1/23	1/23	95.8	0/0	0/24	0/24	100.0
Health & Med.	3/22	22/58	32/60	65.2	9/0	17/66	25/67	72.8	5/4	6/77	14/78	84.8
Basic Med.	2/2	2/9	4/10	71.4	1/0	2/11	3/11	78.6	1/0	1/12	2/12	85.7
Clinical Med.	1/6	6/8	9/9	50.0	3/0	5/10	7/11	61.1	2/1	1/14	3/15	83.3
Diagnostics	0/6	6/14	9/14	60.9	3/0	4/16	7/16	69.6	2/1	2/18	5/18	78.3
Pharmacy	0/3	3/13	4/13	76.5	1/0	3/13	4/13	76.5	0/1	1/15	2/15	88.2
Public Health	0/5	5/14	6/14	70.0	1/0	3/16	4/16	80.0	0/1	1/18	2/18	90.0
Humani. & Soc.	5/14	14/40	24/42	63.6	3/5	9/49	16/50	75.8	5/3	5/53	12/54	81.8
History	1/4	4/4	6/4	40.0	1/0	1/8	2/8	80.0	0/1	1/8	2/8	80.0
Literature	2/2	2/15	5/16	76.2	1/2	2/16	5/16	76.2	2/1	0/18	3/18	85.7
Sociology	0/5	5/8	7/9	56.3	1/2	4/9	6/10	62.5	2/1	2/11	4/12	75.0
Psychology	2/3	3/13	6/13	68.4	0/1	2/16	3/16	84.2	1/0	2/16	3/16	84.2
Tech & Eng.	3/25	25/106	39/108	73.5	9/4	20/114	32/115	78.2	6/7	10/124	23/124	84.4
Agriculture	0/6	6/10	9/10	52.6	1/2	5/11	8/11	57.9	2/1	2/14	5/14	73.7
Archi. Eng.	2/2	2/17	5/17	77.3	1/1	2/18	4/18	81.8	1/1	0/20	2/20	90.9
Computer Sci.	0/0	0/17	2/17	89.5	1/0	1/17	2/17	89.5	0/1	2/16	3/16	84.2
Electronics	0/0	0/8	1/8	88.9	0/0	0/9	0/9	100.0	0/0	0/9	0/9	100.0
Energy Power	0/4	4/20	6/20	76.9	2/0	4/20	6/20	76.9	1/1	1/23	3/23	88.5
Materials	0/3	3/22	5/22	81.5	1/1	3/22	5/22	81.5	1/1	2/23	4/23	85.2
Mechanical Eng.	1/10	10/12	13/12	48.0	3/0	5/17	8/17	68.0	1/2	3/19	6/19	76.0
Overall	22/95	95/413	156/421	73.0	33/12	70/462	111/466	80.8	25/20	41/491	84/493	85.4

Table 5: Detailed breakdown of estimated human performance on MMMU-Pro for low, medium, and high performance levels across various disciplines. Abbreviations: "w/o Sol." (without Solution), "w/ Sol." (with Solution), "Est." (Estimate), and "w/c" (number of wrong/correct answers).

In summary, by leveraging the original MMMU human evaluation data and applying our estimation method, we provide a reasonable approximation of human performance on MMMU-Pro. This approach maintains the human performance benchmark without incurring the substantial costs associated with new expert evaluations.

Appendix C Ensuring Quality and Diversity of Expanded Options

Expanding the number of answer options naturally increases the difficulty of the benchmark, but its effectiveness relies heavily on the quality, diversity, and contextual relevance of these additional options. To ensure this, we implemented a rigorous multi-stage validation process, combining automated and human efforts to produce high-quality results.

Initial Model-Based Option Augmentation and Filtering. We began by leveraging large language models (LLMs) to automate the initial generation and filtering of expanded options. Specifically, GPT-4o was used to generate additional options, while Claude 3.5 acted as a preliminary filter to remove options that were contextually irrelevant or logically inconsistent. This step significantly reduced the workload for human reviewers by pre-screening the candidates.

Two Rounds of Human Review. To further enhance quality and eliminate potential issues, we conducted two rounds of meticulous human validation:

•

First Round of Review: Individual reviewers assessed the expanded options for each question. They ensured that the options were diverse, logically distinct, and free from ambiguity. If any flaws were identified, reviewers were instructed to correct the issues or create new options to maintain the integrity of the question.
•

Second Round of Review: A double-check process followed, involving two additional human experts who cross-validated each question and its options. This iterative step eliminated any residual inconsistencies or errors and provided an additional layer of assurance.

By combining automated methods with multi-stage human validation, we ensured that each expanded option met high standards of quality, robustness, and alignment with the intended challenges of the benchmark. This approach not only addressed potential weaknesses in automated generation but also significantly improved the reliability of the dataset.
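The automated stage described above can be sketched as a generate-then-filter loop. This is an illustrative outline, not the pipeline used to build the benchmark: `generate` and `keep` are hypothetical callables standing in for the generator model (GPT-4o) and the filter model (Claude 3.5), and the `target` option count is a parameter of the sketch.

```python
def augment_options(question, options, generate, keep, target):
    """Grow `options` to `target` entries using a generator and a filter model."""
    result = list(options)
    for candidate in generate(question, result):
        if len(result) >= target:
            break
        # Skip duplicates and anything the filter model rejects.
        if candidate not in result and keep(question, candidate):
            result.append(candidate)
    # Questions still short of the target are flagged for human reviewers.
    return result, len(result) < target


# Toy stand-ins for the two models (hypothetical, for illustration only).
def fake_generate(question, existing):
    return ["12", "7", "", "15", "9"]


def fake_keep(question, candidate):
    return candidate != ""


opts, needs_review = augment_options(
    "3 + 4 = ?", ["7", "8"], fake_generate, fake_keep, target=4
)
```

Returning a review flag alongside the options keeps the hand-off to the two human review rounds explicit: anything the models could not complete goes to a reviewer rather than being silently padded.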

Appendix D Analysis of CoT’s Impact

Appendix E Experimental Setup of Vision Encoder Impact

To evaluate the influence of vision encoders on model performance, we conduct experiments using the open-source Cambrian-1 architecture. These experiments fix both the training data (Cambrian-1 1M SFT data) and the large language model (Llama 3.1 8B) to isolate the impact of the vision encoder. Following Cambrian-1 (Tong et al., 2024a), we interpolate visual features to a fixed number of tokens (576) and concatenate them along the feature dimension.
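The token-count normalization can be illustrated with a dependency-free sketch. Several points here are assumptions made for illustration: the interpolation is taken to be bilinear over a 2D patch grid, 576 tokens are laid out as a 24 x 24 grid, and concatenation is shown for two hypothetical feature maps with different grid sizes. An actual implementation would use a tensor library rather than nested lists.

```python
def resize_grid(grid, side=24):
    """Bilinearly resize an H x W x C nested-list grid to side x side x C."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(side):
        y = i * (in_h - 1) / (side - 1)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(side):
            x = j * (in_w - 1) / (side - 1)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Blend the four neighbouring patch vectors channel by channel.
            row.append([
                (1 - wy) * ((1 - wx) * a + wx * b) + wy * ((1 - wx) * c + wx * d)
                for a, b, c, d in zip(
                    grid[y0][x0], grid[y0][x1], grid[y1][x0], grid[y1][x1]
                )
            ])
        out.append(row)
    return out


def fuse(grids, side=24):
    """Resize each grid, flatten to side*side tokens, and concatenate channels."""
    resized = [resize_grid(g, side) for g in grids]
    return [
        [v for r in resized for v in r[i][j]]
        for i in range(side)
        for j in range(side)
    ]


# Two hypothetical feature maps: 16x16 with 2 channels, 27x27 with 3 channels.
enc_a = [[[1.0, 2.0] for _ in range(16)] for _ in range(16)]
enc_b = [[[0.5] * 3 for _ in range(27)] for _ in range(27)]
tokens = fuse([enc_a, enc_b])  # 576 tokens, each with 2 + 3 = 5 features
```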

Appendix F Comparison of GPT-4o’s responses between Standard and Vision Input settings

Appendix G CoT vs. Direct Acc: Model Differences Across Disciplines

Discipline	LLaVA-OneVision-72B			GPT-4o		
Discipline	CoT Acc	Direct Acc	Difference	CoT Acc	Direct Acc	Difference
Art and Design	20.42%	37.53%	-17.12%	63.14%	61.55%	1.58%
Science	23.89%	22.61%	1.28%	46.67%	38.46%	8.22%
Business	29.26%	24.50%	4.76%	57.45%	42.79%	14.66%
Humanities and Social Science	32.14%	36.60%	-4.46%	60.08%	57.87%	2.21%
Health and Medicine	19.22%	20.78%	-1.56%	49.68%	44.34%	5.34%
Tech and Engineering	22.98%	20.65%	2.33%	37.72%	23.23%	14.49%

Table 6: Comparison of CoT and direct accuracy of two representative models across disciplines in the Vision Input setting. Difference = CoT Acc. - Direct Acc.
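The Difference column can be recomputed directly from the table, which also makes it easy to list the disciplines where CoT lowers accuracy. The sketch below does this for the LLaVA-OneVision-72B columns; the variable names are ours, and recomputed differences can deviate from the printed ones in the last digit, presumably because the table was derived from unrounded accuracies.

```python
# (CoT Acc, Direct Acc) for LLaVA-OneVision-72B, copied from Table 6.
llava = {
    "Art and Design": (20.42, 37.53),
    "Science": (23.89, 22.61),
    "Business": (29.26, 24.50),
    "Humanities and Social Science": (32.14, 36.60),
    "Health and Medicine": (19.22, 20.78),
    "Tech and Engineering": (22.98, 20.65),
}

# Difference = CoT Acc - Direct Acc, in percentage points.
diff = {name: round(cot - direct, 2) for name, (cot, direct) in llava.items()}

# Disciplines where CoT underperforms direct answering for this model.
cot_hurts = [name for name, (cot, direct) in llava.items() if cot < direct]
print(cot_hurts)
```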