
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?

Kexian Tang1,2*  Junyao Gao1,2*  Yanhong Zeng1†  Haodong Duan1†
Yanan Sun1  Zhening Xing1  Wenran Liu1  Kaifeng Lyu3‡  Kai Chen1‡
1 Shanghai AI Laboratory   2 Tongji University   3 Simons Institute, UC Berkeley
{tangkexian, gaojunyao, zengyanhong, duanhaodong, sunyanan}@pjlab.org.cn
{xingzhening, liuwenran, chenkai}@pjlab.org.cn, [email protected]

* Equal contribution; work done during internships in Shanghai AI Laboratory.
† Project Leads.
‡ Corresponding Authors.

arXiv:2503.19990v1 [cs.AI] 25 Mar 2025

Abstract

Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce LEGO-Puzzles, a scalable benchmark designed to evaluate both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. In addition to VQA tasks, we evaluate MLLMs' abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.

1. Introduction

Spatial intelligence [5] has attracted growing attention due to its significance in various applications, including robotics control [22, 28], autonomous driving [18, 52], and automated assembly [12]. These complex real-world applications inherently require advanced multi-step spatial reasoning capabilities, which involve perceiving 3D-aware spatial relationships and reasoning about them across multiple sequential steps [5, 44, 58]. With the rapid advancement of Large Language Models (LLMs) [3, 17, 41, 49], Multimodal Large Language Models (MLLMs) [8, 34, 42, 48, 50] have also witnessed significant progress in perceiving visual information and interacting with humans through natural language. While MLLMs have made remarkable strides in fundamental tasks such as object recognition [14, 29] and optical character recognition [16, 36, 39, 46], existing evaluations [31, 37] suggest that their spatial reasoning abilities are still limited.

Research on evaluating MLLMs' multi-step spatial reasoning capabilities remains largely unexplored. Existing studies primarily focus on assessing spatial understanding, which pertains to the comprehension of a static scene. Some works [21, 30, 53] employ synthetic environments to render multiple simple 3D objects and then query the spatial relationships between them. However, such question-answering (QA) tasks tend to be overly simplistic for MLLMs to handle, lacking the diversity and complexity of real-world scenarios. Other studies [31, 38] construct spatial understanding tasks based on natural images, but this approach often involves manual annotations, which may limit scalability. Moreover, most existing evaluations rarely assess reasoning over sequences of spatial transformations or actions, leaving the multi-step aspect of spatial reasoning largely unaddressed.

In this work, we take inspiration from a common recreational activity, LEGO construction, to design a comprehensive evaluation framework for assessing the multi-step spatial reasoning capabilities of MLLMs. The assembly process of a complete LEGO model typically encompasses dozens or even hundreds of discrete construction steps, providing an ideal foundation for testing sequential reasoning abilities.
Each step requires accurate comprehension of the geometry, orientation, and connection mechanisms of LEGO pieces to successfully follow the provided illustrations. Based on publicly available LEGO projects with detailed step-by-step assembly instructions, we introduce LEGO-Puzzles, a novel benchmark specifically engineered to evaluate MLLMs' multi-step spatial reasoning capabilities. In total, LEGO-Puzzles encompasses a diverse collection of over 1,100 carefully curated visual question-answering (VQA) pairs spanning 11 distinct tasks, which fall into three major categories. First, we develop a set of fundamental tests to assess MLLMs' basic spatial understanding capabilities, including recognition of height relationships, rotational transformations, adjacency patterns, and viewpoints within 3D space. Building upon this foundation, we construct both single-step and multi-step sequential reasoning evaluations based on LEGO assembly sequences to examine models' sequential reasoning ability. These advanced tests include identifying the configuration of intermediate assembly states (single-step) or determining the correct order of multiple intermediate LEGO states (multi-step).

LEGO-Puzzles offers several distinctive advantages compared to existing spatial understanding benchmarks: 1) Enhanced visual richness. Unlike synthetic datasets such as CLEVR [21, 30], which utilize rendered primitive shapes, LEGO-based questions present significantly greater visual complexity and diversity. 2) Superior scalability. A single LEGO assembly instruction manual can generate hundreds of unique evaluation questions, which enables efficient benchmark expansion with minimal additional resource investment.

Leveraging LEGO-Puzzles, we conduct comprehensive evaluations of 20 state-of-the-art MLLMs, including proprietary models such as GPT-4o and Gemini-2.0-Flash, as well as leading open-source alternatives [8, 43, 50, 56]. Our experimental results reveal a substantial gap between current MLLMs and human-level proficiency. Even the strongest models struggle with basic spatial understanding tasks, such as accurately identifying the height of LEGO pieces and determining adjacency relationships in 3D space. Among open-source models, only a few achieve performance notably above random guessing across different tasks.

Beyond VQA tasks, LEGO-Puzzles also enables the assessment of spatially grounded image generation. For instance, given an assembly illustration, an MLLM is tasked with generating an image of the intermediate state following the specified assembly operation. In these generation tests, most of the evaluated models fail completely, either disregarding the provided instructions or generating images that are entirely irrelevant to the intended LEGO configuration.

In summary, our novel benchmark LEGO-Puzzles provides a comprehensive evaluation of the spatial understanding and sequential reasoning capabilities of MLLMs. Our main contributions are as follows:
• A novel benchmark for spatial understanding. Based on LEGO constructions, our benchmark LEGO-Puzzles offers natural and diverse test cases for evaluating the spatial understanding capabilities of MLLMs, with improved visual richness and scalability over existing datasets.
• Evaluation of multi-step spatial reasoning. Built upon LEGO's step-by-step building process, LEGO-Puzzles is the first benchmark explicitly designed to assess multi-step spatial reasoning, where each task requires reasoning over up to 7 LEGO construction steps.
• Comprehensive assessment on visual question answering and image generation. LEGO-Puzzles assesses the spatial reasoning capability of MLLMs across both VQA and image generation tasks, providing a comprehensive assessment of their ability to comprehend and process spatial information in a human-like manner.

2. Related Work

General Multi-Modal Evaluation Benchmarks. Recent years have seen significant advancements in multimodal large language models (MLLMs), accompanied by a surge in benchmark datasets evaluating their visual understanding. Several comprehensive benchmarks have been introduced to assess various multimodal capabilities. MME [14] provides a systematic evaluation of 14 image-centric tasks, revealing persistent challenges such as object hallucination and spatial reasoning failures. MMBench [37] introduces a bilingual multiple-choice format for fine-grained multimodal assessment. Moving beyond static images, SEED-Bench [25] evaluates generative comprehension across 19K Q&A pairs spanning both image and video reasoning, showing that temporal understanding remains a major limitation. For expert-level reasoning, MMMU [60] presents a discipline-specific benchmark across 183 subtopics, revealing substantial knowledge gaps even in leading MLLMs such as GPT-4o and Gemini. Overall, these benchmarks reveal that while MLLMs have made progress, they still struggle with spatial understanding, temporal coherence, multimodal integration, and high-level reasoning, presenting clear directions for future research.

Visual-Spatial Understanding in MLLMs. Multimodal large language models (MLLMs) have made significant strides in vision-and-language tasks, yet they still struggle with 3D spatial understanding. Benchmarks such as 3DSRBench [38] show that even the most advanced models achieve only 45–50% accuracy on 3D spatial tasks and experience substantial performance drops under unusual camera angles. To enhance spatial reasoning, several studies have explored Chain-of-Thought (CoT) prompting.
For example, Park et al. [44] demonstrate that combining CoT with explicit image-to-text conversion can improve generalization from simple to hard visual reasoning tasks. However, beyond such tailored interventions, traditional CoT prompting alone has generally failed to improve spatial reasoning performance [58]. In response, alternative approaches have emerged. Spatially enriched datasets, such as Spatial Aptitude Training (SAT) [45], significantly boost zero-shot performance across real-image benchmarks. Architectural innovations like CAD-GPT [51], which embeds 3D coordinates into language representations, and MVoT [27], which introduces visual sketching during inference, further expand the solution space. Additionally, lightweight strategies like Coarse Correspondences [33] improve spatial understanding without requiring model fine-tuning. Despite these advances, achieving human-level 3D spatial reasoning in MLLMs remains an open challenge.

3. LEGO-Puzzles

In this section, we introduce LEGO-Puzzles, a diverse and comprehensive benchmark designed to evaluate the multi-step spatial reasoning capability of MLLMs in detail. Specifically, we first introduce the motivation and definition of each task in Section 3.1. Then, we introduce our dataset curation process, including data collection, question-answer generation, and quality control, in Section 3.2.

3.1. Task Definition

To enhance the evaluation of multi-step spatial reasoning for Multimodal Large Language Models (MLLMs), we define three primary categories of tasks based on insights from cognitive psychology and human experience in developing relevant skills [5, 40, 55]. Using LEGO building as a concrete example of how humans develop spatial intelligence, we find that individuals typically engage in the following processes: First, they must understand the spatial relationships between each LEGO piece and how these pieces relate from different perspectives in 3D space. Next, they need to reason through the dependencies and assembly logic of each block at every step of the building process. Finally, they extend their reasoning to multi-step reasoning across the entire assembly sequence. Accordingly, our tasks range from fundamental spatial understanding (36.4%) to single-step sequential reasoning (36.4%) and, ultimately, to multi-step sequential reasoning (27.3%), as illustrated in Figure 1. Below, we provide further details on each task.

Figure 1. Problem Statistics in LEGO-Puzzles.

Task 1: Spatial Understanding. (1) Height: Distinguish the relative heights of LEGO objects. (2) Adjacency: Determine whether LEGO objects are adjacent or separated. (3) Rotation: Calculate the angle of rotation between a LEGO object and its corresponding rotated version. (4) Multiview: Predict the current LEGO status from different viewpoints.
Task 2: Single-Step Sequential Reasoning. (5) Rotation Status: Assess the rotation status of the next LEGO pieces during assembly. (6) Position: Identify the correct assembly position for the next LEGO pieces. (7) Next-Step: Determine the next LEGO status based on the current status and the upcoming pieces. (8) Dependency: Identify the necessary LEGO pieces required to transition from the current to the next status.
Task 3: Multi-Step Sequential Reasoning. (9) Backwards: Identify the correct LEGO status in the assembly pipeline of the LEGO object. (10) Ordering: Predict the correct assembly order of the provided LEGO images. (11) Outlier: Detect the LEGO status that does not belong to the provided assembly sequence.

In conclusion, LEGO-Puzzles consists of over 1,100 visual question-answering (VQA) pairs derived from 407 LEGO building instructions, encompassing 11 tasks across spatial understanding, single-step, and multi-step sequential reasoning. In addition to VQA tasks, we further extend several sub-tasks of spatial understanding (Rotation* and Multiview*) and single-step sequential reasoning (Position*, Dependency*, and Next-Step*) to include image generation following [57], as part of the visual Generation evaluation of MLLMs.

3.2. Dataset Curation

As illustrated in Figure 3, our pipeline consists of three key steps: data collection, question-answer generation, and quality control. This design ensures the scalability, accuracy, and reliability of our data.

Data Collection. Data collection consists of three stages. First, we collect a diverse set of open-source LEGO source files from the Internet, which include comprehensive step-by-step LEGO building instructions, visualizations, and the required LEGO pieces for each step. Notably, the camera perspective remains consistent across all steps within a specific instruction, ensuring temporal, spatial, and logical coherence throughout the building process.
Figure 2. Task examples of LEGO-Puzzles. From left to right, the columns represent tasks in Spatial Understanding, Single-Step Sequential Reasoning, and Multi-Step Sequential Reasoning. Note: The questions above are slightly simplified for clarity and brevity.

Figure 3. Data curation pipeline. Our pipeline first collects a diverse set of LEGO building instructions to render and extract LEGO
images in a unified format. Next, we generate question-answer pairs by using a combination of human annotation and predefined question
templates. Finally, we implement three quality control strategies to ensure the accuracy, consistency, and reliability of the data.

Second, we render the LEGO building instruction files as PDF files using a publicly available rendering software, Studio (https://www.bricklink.com/v3/studio/download.page). This tool enables us to adjust default rendering settings to construct tasks that evaluate spatial relationships at varying levels of complexity. Specifically, for the Rotation and Multiview tasks, we utilize the Persistence of Vision Raytracer (POV-Ray) style and modify the lighting strength to generate realistic LEGO images from different angles. For the Backwards task, we also edit attributes such as color, quantity, and assembly positions of pieces to create erroneous images. Finally, we use PDF-Extract (https://github.com/opendatalab/PDF-Extract-Kit) to extract all LEGO pieces and objects of interest from the rendered PDF files.
All images are systematically organized according to a unified naming standard and prepared for question-answer generation across different tasks.

Question-Answer Generation. To ensure the scalability of our pipeline, we design several task-specific templates for question-answer generation. See Figure 4 for an example. Each data example includes an instruction, a question, and an answer. To meet the requirements of different tasks, we create LEGO sequences of varying lengths. Note that the image token <image x> serves as a placeholder for the corresponding image input here.

Task-Specific Template (here for task Position)
Instruction: You are a LEGO 3D assembly position analyzer. Your primary task is to determine the correct assembly point of a given LEGO piece based on the current state and the next part to be installed. You will be provided with images representing the current state (x_0), the part to install (x_1), the state after installation (x_2), and installation options (A, B, C, D). Your goal is to analyze the given images and determine which of the four options (A, B, C, D) shows the correct assembly point for the next part. Your answer should be based solely on the provided LEGO 3D data, without any additional assumptions. Keep your responses clear, direct, and focused on the question. Please respond with only the letter corresponding to your choice (A, B, C, or D).
Question: Based on the current state (x_0), the next part (x_1) to install, and the state after installation (x_2), which of the following images shows the correct installation point? Current state (x_0):<image 1> Part to install (x_1):<image 2> State after installation (x_2):<image 3> Options: \n A.<image 4> B.<image 5> C.<image 6> D.<image 7> Please select the correct answer from the options above.\n
Answer: Ground Truth

Figure 4. Task-specific template. Our question-answer template includes instructions, questions, and answers. Here, we provide an example from the Position task for reference.
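As a concrete illustration of how such templates scale to many samples, the sketch below assembles a Position-style VQA pair from the template text above and a list of rendered image files. It is a minimal sketch only: the function name, field names, and file-naming scheme are hypothetical and do not come from the released benchmark code.

```python
# Hypothetical sketch of template-driven QA generation (not the released benchmark code).
def build_position_sample(image_files, ground_truth):
    """Assemble one Position VQA pair; <image x> tokens are placeholders for the inputs."""
    assert len(image_files) == 7  # current state, part to install, result, and 4 option images
    question = (
        "Based on the current state (x_0), the next part (x_1) to install, and the state "
        "after installation (x_2), which of the following images shows the correct "
        "installation point? Current state (x_0):<image 1> Part to install (x_1):<image 2> "
        "State after installation (x_2):<image 3> Options:\n"
        "A.<image 4> B.<image 5> C.<image 6> D.<image 7>\n"
        "Please select the correct answer from the options above.\n"
    )
    return {"question": question, "image_list": image_files, "answer": ground_truth}

# Example usage with images named under a unified scheme, e.g. "509072_1.png" ... "509072_7.png":
# sample = build_position_sample([f"509072_{i}.png" for i in range(1, 8)], "C")
```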
Quality Control. We implement a rigorous human review process to maintain high quality and minimize errors. Specifically, we carefully examine the consistency between LEGO objects in source files and rendered PDF files, conducting checks for duplication, adherence to standards, and formatting in the generated images. Additionally, we apply cross-validation to ensure that each question-answer pair aligns with its task-specific template and that the image annotations are accurate, as verified by multiple annotators. This rigorous process ensures high-quality and reliable evaluation.

4. Evaluation on LEGO-Puzzles

4.1. Experimental Setting

Benchmark Models. We extensively evaluate 20 models, covering a diverse range of architectures, sizes, and training processes, for Spatial Understanding and Sequential Reasoning tasks. For open-source models, we evaluate MiniCPM-V2.6 [59], Qwen2-VL-[7B/72B] [50], Qwen2.5-VL-[7B/72B] [4], InternVL2.5-[8B/78B] [7], VILA1.5-13B [32], Idefics3-8B [24], DeepSeek-VL2-[Tiny/Small] [56], Pixtral-12B [1], LLaVA-OneVision-7B [26], and EMU3 [54]. For proprietary models, we evaluate Claude-3.5-Sonnet [2], Gemini-1.5-Flash, Gemini-1.5-Pro, Gemini-2.0-Flash [48], GPT-4o (20241120), and GPT-4o-mini [42]. For the additional image Generation evaluation, we evaluate the open-source models Emu2 [47], GILL [23], and Anole [9], as well as the proprietary models GPT-4o and Gemini-2.0-Flash, all of which support long-range sequence input and image output. Moreover, all evaluations are conducted in a zero-shot setting for a fair comparison.

Baselines. We provide two baselines for comparison:
• Random indicates the accuracy of random selection for each question, assuming equal probability for all options.
• p-value-based critical value indicates the minimum accuracy required to statistically surpass random guessing at a given significance level (p = 0.05).
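The critical value can be reproduced with a one-sided binomial test: for a task with n questions and per-question chance level p0, it is the smallest accuracy that pure guessing reaches with probability below 0.05. The snippet below is a minimal sketch of this computation under the assumption of roughly 100 questions per task (the 1,100 samples span 11 tasks); it is illustrative rather than the exact evaluation code.

```python
# Minimal sketch of the p-value-based critical value (assumes ~100 questions per task).
from scipy.stats import binom

def critical_accuracy(chance_level, n_questions=100, alpha=0.05):
    """Smallest accuracy (%) that random guessing attains with probability < alpha."""
    for correct in range(n_questions + 1):
        # binom.sf(k - 1, n, p) = P(X >= k) under Binomial(n, p)
        if binom.sf(correct - 1, n_questions, chance_level) < alpha:
            return 100.0 * correct / n_questions
    return 100.0

# Example: a 4-option task (chance level 25%) requires roughly 33% accuracy,
# matching the "Random (p < 0.05)" baseline row reported in Table 1.
print(critical_accuracy(0.25))
```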
Evaluation Metrics. We calculate the accuracy (%) of verbal answers to multiple-choice questions using exact matching as our primary metric, following Duan et al. [11]. When models fail to generate an answer in the required format, we utilize the ChatGPT-0125 [13] method from VLMEvalKit [11] as a fallback option. Additionally, we randomly select 20 questions from each task to create LEGO-Puzzles-Lite, resulting in a total of 220 question-answer pairs. This dataset is designed to investigate the performance gap between human intelligence, as evaluated by additional human annotators, and current models in Spatial Understanding and Sequential Reasoning. In the Generation evaluation, traditional metrics such as FID [20], CLIPScore [15, 19], and X-IQE [6] are inadequate for assessing interleaved outputs in visual answers. Therefore, we enlist human experts to evaluate performance based on appearance similarity and instruction following, using a scoring scale from 0 to 3. This approach is necessary due to biases present in VLM-based scoring [35].
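A minimal sketch of the exact-matching metric is shown below. The regex-based letter extraction stands in for the ChatGPT-based answer extraction that VLMEvalKit uses when a reply is not already a bare option letter, so this is an approximation of the protocol rather than the evaluation pipeline itself.

```python
# Approximate sketch of exact-match scoring for multiple-choice answers (not VLMEvalKit itself).
import re

def extract_choice(response, options="ABCDE"):
    """Return the option letter if the reply is (or clearly contains) one; otherwise None.
    The actual pipeline falls back to a ChatGPT-based extractor when this fails."""
    reply = response.strip().upper()
    if len(reply) == 1 and reply in options:
        return reply
    match = re.search(rf"\b([{options}])\b", reply)
    return match.group(1) if match else None

def accuracy(predictions, ground_truths):
    """Exact-match accuracy (%) over a list of model replies and gold option letters."""
    hits = sum(extract_choice(p) == gt for p, gt in zip(predictions, ground_truths))
    return 100.0 * hits / len(ground_truths)
```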
4.2. Main Results

We include evaluation results for spatial understanding, sequential reasoning, and generation in Table 1, Table 2, and Table 3. We summarize key findings below.

Challenges of LEGO-Puzzles. Our findings indicate that human experts consistently achieve significantly higher overall performance (93.6%), as shown in Table 2. In contrast, current MLLMs fall short, with even the most advanced models, Gemini-2.0-Flash and GPT-4o, trailing over 30% behind human performance across all tasks. This persistent gap highlights the need for comprehensive and substantial improvements on our LEGO-Puzzles.

Gap between Open-source and Proprietary Models. There is a significant gap between open-source and proprietary MLLMs in both spatial understanding and sequential reasoning abilities.
Columns: Spatial Understanding (Height, Adjacency, Rotation, Multiview) | Single-Step Reasoning (Next-Step, Dependency, Rotation Stat., Position) | Multi-Step Reasoning (Backwards, Ordering, Outlier) | Overall

Models               Height Adjacency Rotation Multiview Next-Step Dependency Rotation Stat. Position Backwards Ordering Outlier Overall
Proprietary
Claude-3.5-Sonnet    39.0 60.0 42.0 48.0 61.0 78.0 58.0 37.0 49.0 54.0 64.0 53.6
Gemini-1.5-Flash     29.0 58.0 28.0 45.0 57.0 77.0 57.0 32.0 28.0 20.0 51.0 43.8
Gemini-1.5-Pro       35.0 58.0 38.0 56.0 59.0 84.0 61.0 39.0 35.0 44.0 59.0 51.6
Gemini-2.0-Flash     35.0 70.0 49.0 45.0 69.0 81.0 54.0 46.0 56.0 46.0 43.0 54.0
GPT-4o               49.0 66.0 41.0 51.0 65.0 87.0 51.0 51.0 53.0 72.0 49.0 57.7
GPT-4o-mini          31.0 53.0 26.0 51.0 27.0 71.0 57.0 32.0 50.0 7.0 27.0 39.3
Open-source
MiniCPM-V2.6         26.0 56.0 22.0 44.0 34.0 50.0 51.0 29.0 23.0 0.0 19.0 32.2
Qwen2-VL-7B          31.0 57.0 30.0 40.0 44.0 70.0 48.0 26.0 13.0 9.0 28.0 36.0
Qwen2.5-VL-7B        35.0 60.0 22.0 27.0 26.0 60.0 49.0 25.0 24.0 5.0 13.0 31.5
InternVL2.5-8B       35.0 53.0 23.0 37.0 38.0 48.0 64.0 25.0 35.0 0.0 29.0 35.2
VILA1.5-13B          26.0 55.0 26.0 35.0 17.0 34.0 48.0 26.0 12.0 4.0 22.0 27.7
Idefics3-8B          29.0 51.0 23.0 23.0 18.0 20.0 47.0 30.0 24.0 4.0 24.0 26.6
InternVL2.5-78B      41.0 62.0 32.0 47.0 60.0 79.0 58.0 32.0 40.0 15.0 37.0 45.7
Qwen2-VL-72B         40.0 62.0 37.0 51.0 57.0 79.0 49.0 43.0 34.0 26.0 31.0 46.3
Qwen2.5-VL-72B       30.0 61.0 27.0 27.0 55.0 72.0 58.0 47.0 60.0 33.0 43.0 46.6
DeepSeek-VL2-Small   31.0 52.0 36.0 41.0 38.0 57.0 59.0 28.0 41.0 3.0 26.0 37.5
DeepSeek-VL2-Tiny    32.0 52.0 36.0 24.0 27.0 25.0 47.0 27.0 26.0 4.0 16.0 28.7
Pixtral-12B          31.0 68.0 24.0 24.0 21.0 38.0 53.0 21.0 24.0 3.0 37.0 31.3
LLaVA-OneVision-7B   42.0 59.0 21.0 41.0 30.0 50.0 59.0 26.0 20.0 0.0 22.0 33.6
EMU3                 31.0 52.0 24.0 25.0 17.0 25.0 47.0 25.0 24.0 0.0 20.0 26.4
Baseline
Random Guessing      33.0 50.0 25.0 25.0 20.0 25.0 50.0 25.0 25.0 4.2 20.0 27.5
↑ Random (p < 0.05)  42.0 59.0 33.0 33.0 28.0 33.0 59.0 33.0 33.0 9.0 28.0 35.5

Table 1. Full Evaluation Results of 20 MLLMs on LEGO-Puzzles. Dark Gray indicates the best performance for each task among all models and Light Gray indicates the best result among open-source models. We also highlight the top three models based on their overall performance, using Dark Green, Medium Green, and Light Green, respectively.

Columns: Spatial Understanding (Height, Adjacency, Rotation, Multiview) | Single-Step Reasoning (Next-Step, Dependency, Rotation Stat., Position) | Multi-Step Reasoning (Backwards, Ordering, Outlier) | Overall

Models (LEGO-Puzzles-Lite)  Height Adjacency Rotation Multiview Next-Step Dependency Rotation Stat. Position Backwards Ordering Outlier Overall
Human proficiency    70.0 95.0 95.0 100.0 90.0 100.0 100.0 95.0 95.0 95.0 95.0 93.6
Claude-3.5-Sonnet    40.0 55.0 50.0 50.0 60.0 75.0 55.0 35.0 60.0 55.0 60.0 54.1
Gemini-2.0-Flash     30.0 65.0 55.0 40.0 80.0 85.0 60.0 40.0 60.0 50.0 45.0 55.5
GPT-4o               35.0 75.0 45.0 50.0 60.0 85.0 60.0 60.0 55.0 60.0 65.0 59.1
InternVL2.5-78B      40.0 55.0 30.0 45.0 60.0 85.0 55.0 30.0 25.0 20.0 50.0 45.0
Qwen2-VL-72B         30.0 65.0 45.0 50.0 55.0 80.0 45.0 35.0 30.0 15.0 35.0 44.1
Qwen2.5-VL-72B       25.0 70.0 25.0 35.0 65.0 70.0 65.0 45.0 55.0 20.0 55.0 48.2

Table 2. Comparing Top-Performing MLLMs with Human Proficiency on LEGO-Puzzles-Lite. The best results are marked in bold. The top three overall performances are highlighted in Dark Green, Medium Green, and Light Green, respectively.

Most open-source MLLMs perform only marginally better than Random, while leading proprietary models, such as Gemini-2.0-Flash and GPT-4o, exhibit strong spatial reasoning capabilities, achieving overall accuracies of 54.0% and 57.7%, respectively.

Model Performance in Different Tasks. In the Height task, where height relationships are complicated by the interplay between 2D and 3D perspectives, most models (11/20) perform worse than Random, with even human experts achieving significantly lower scores than in other tasks.
Task \ MLLM    Gemini-2.0-Flash   GPT-4o        Emu2          GILL          Anole
               App    IF          App    IF     App    IF     App    IF     App    IF
Rotation*      2.30   1.65        0.95   0.80   2.10   0.00   0.00   0.00   0.10   0.00
Multiview*     1.80   1.35        2.25   0.45   2.10   0.00   0.00   0.00   0.05   0.00
Position*      3.00   1.40        3.00   1.10   0.65   0.00   0.00   0.00   0.00   0.00
Dependency*    1.85   1.25        0.55   0.25   0.65   0.00   0.00   0.00   0.00   0.00
Next-Step*     1.80   0.20        0.55   0.20   2.10   0.00   0.00   0.00   0.00   0.00
Overall        2.15   1.17        1.46   0.56   1.52   0.00   0.00   0.00   0.03   0.00

Table 3. Evaluation on Generation. We conduct human-based evaluation to assess the "Appearance" (App) and "Instruction Following" (IF) scores of Gemini-2.0-Flash, GPT-4o, Emu2, GILL, and Anole, using a scoring scale from 0 to 3 for both dimensions.

Setting   GPT-4o              Gemini-2.0-Flash     Qwen-2.5-72B         Internvl-2.5-78B
          w/o CoT   w. CoT    w/o CoT   w. CoT     w/o CoT   w. CoT     w/o CoT   w. CoT
k=1       45.0      75.0      85.0      60.0       65.0      65.0       35.0      55.0
k=2       15.0      25.0      45.0      50.0       60.0      55.0       30.0      20.0
k=3       5.0       5.0       35.0      40.0       75.0      75.0       10.0      20.0
k=4       5.0       0.0       35.0      50.0       65.0      65.0       20.0      5.0
k=5       5.0       0.0       20.0      25.0       65.0      65.0       25.0      10.0

Table 4. Evaluation on Next-k-Step. k represents the number of steps, and CoT refers to adding a "Think step by step before answering" instruction in QA pairs, similar to those in LLMs.
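The Overall row in Table 3 is consistent with a plain average of the five per-task scores for each model and dimension; a quick check using the Gemini-2.0-Flash column:

```python
# Per-task Appearance (App) and Instruction Following (IF) scores for Gemini-2.0-Flash (Table 3),
# in the order Rotation*, Multiview*, Position*, Dependency*, Next-Step*.
app  = [2.30, 1.80, 3.00, 1.85, 1.80]
inst = [1.65, 1.35, 1.40, 1.25, 0.20]

print(round(sum(app) / len(app), 2))    # 2.15, the reported Overall App score
print(round(sum(inst) / len(inst), 2))  # 1.17, the reported Overall IF score
```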

In the Rotation and Rotation Status tasks, our findings indicate that models exhibit limited sensitivity to rotation-related recognition, achieving low scores, with 7 out of 20 models performing below Random in both tasks. Conversely, most models achieve accuracy above the critical threshold in the Multiview task, indicating that existing MLLMs possess basic spatial modeling capabilities. However, MLLMs' performance is notably weaker in multi-step sequential reasoning tasks such as Ordering and Outlier compared to single-step sequential reasoning tasks like Dependency and Next-Step. This disparity highlights the models' limitations in capturing long-range dependencies and executing effective sequential reasoning.

In conclusion, our LEGO-Puzzles highlights both the spatial understanding and sequential reasoning abilities of MLLMs. Results in Table 1 demonstrate that GPT-4o achieves the highest performance. However, the overall results suggest significant room for improvement, particularly in domains involving relative relationships, rotation perception, and long-range sequential reasoning.

4.3. Image Generation Evaluation

As mentioned in Section 3.1, we evaluate image generation ability across several tasks related to spatial understanding (Rotation* and Multiview*) and single-step sequential reasoning (Position*, Dependency*, and Next-Step*) as part of the Generation assessment. As shown in Table 3, all MLLMs struggle to simultaneously maintain appearance identity and strictly adhere to user instructions when generating image answers across all tasks. This poor performance indicates that existing MLLMs are ineffective at visualizing spatial understanding and sequential reasoning capabilities, underscoring the challenge of integrating multimodal information effectively.

4.4. Exploring Multi-Step Sequential Reasoning

Experiments in Section 4.2 show that current MLLMs perform poorly when extending single-step sequential reasoning QA to multi-step tasks. To further investigate the underlying reasons for these performance variations in sequential reasoning tasks, we design a fine-grained sequential reasoning task called Next-k-Step, which explicitly controls the number of steps required to complete the task.

Experimental Setup. Next-k-Step builds upon our single-step sequential reasoning task, Next-Step, and requires MLLMs to identify the correct LEGO object by sequentially adding k additional LEGO pieces to the current LEGO object. We set k = 1, 2, 3, 4, 5 and construct 20 test cases for each k value. Specifically, we input the current LEGO object (x_1), the next k LEGO pieces (x_2, x_3, ..., x_{k+1}), and the target LEGO object (x_{k+2}), along with the corresponding text instructions, into MLLMs, expecting them to generate the correct answer from four options (A, B, C, D). Additionally, to investigate the effectiveness of the widely adopted Chain-of-Thought (CoT) approach from the LLM community in enhancing multi-step sequential reasoning, we design experiments comparing model performance under two conditions: standard prompting (without CoT) and explicit step-by-step reasoning (with CoT). We conduct experiments using the four top-performing models on the Next-Step task: GPT-4o, Gemini-2.0-Flash, Qwen-2.5-72B, and InternVL-2.5-78B.
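The sketch below illustrates how a Next-k-Step query could be assembled from the image sequence described above, with the CoT variant simply prepending a "Think step by step before answering" instruction. The exact wording and the helper name are hypothetical and are not the benchmark's released prompts.

```python
# Hypothetical sketch of Next-k-Step prompt construction (wording is illustrative only).
def next_k_step_prompt(k, use_cot=False):
    """Build the text prompt for one Next-k-Step question with k added pieces.
    Images supplied alongside: current object x_1, pieces x_2..x_{k+1}, target x_{k+2},
    followed by the four candidate option images (A-D)."""
    pieces = ", ".join(f"<image {i}>" for i in range(2, k + 2))
    prompt = (
        f"Current state: <image 1>. The next {k} parts to add, in order: {pieces}. "
        f"Target state: <image {k + 2}>. "
        "Which option (A, B, C, or D) shows the object after all parts are added? "
        f"Options: A.<image {k + 3}> B.<image {k + 4}> C.<image {k + 5}> D.<image {k + 6}>"
    )
    if use_cot:
        prompt = "Think step by step before answering. " + prompt
    return prompt
```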
Performance Degradation when k Increases. As shown in Table 4, the relationship between accuracy and the number of reasoning steps varies across models. GPT-4o and Gemini-2.0-Flash exhibit a clear performance decline as k increases. These results align with the performance discrepancy between single-step and multi-step sequential reasoning in Table 1, further demonstrating that current MLLMs struggle to handle multi-step sequential relationships requiring iterative reasoning. A key challenge lies in the accumulation of errors as reasoning steps increase. Each intermediate inference introduces potential deviations that can compound over multiple steps, leading to significant inconsistencies in the final predictions. Additionally, MLLMs may lack an explicit visual memory mechanism, in contrast to the language memory of LLMs, making it difficult to coherently track and integrate positional changes throughout the reasoning process. Surprisingly, we observe that Qwen-2.5-72B achieves stable and relatively consistent accuracy scores (around 65.0%) across all values of k, even when k is increased to seven. For InternVL-2.5-78B, accuracy scores are close to random guessing.

Limited Effectiveness of Chain-of-Thought (CoT). By applying CoT prompting, we observe significant improvements when k = 1 for GPT-4o and InternVL-2.5-78B. However, this effect diminishes for k ≥ 2, where accuracy even declines dramatically.
This is because these MLLMs perform worse than random guessing (25%) when dealing with longer k steps, with accuracies of 5% for GPT-4o and 10% for InternVL-2.5-78B. For other MLLMs like Gemini-2.0-Flash and Qwen-2.5-72B, CoT prompting does not provide obvious benefits, as they fail to perform genuine step-by-step reasoning in their CoT responses.

Task        PCC    P-value
Height      0.93   0.00723
Adjacency   0.98   0.00046

Table 5. Pearson Correlation Coefficients (PCC) and P-values for the Height and Adjacency Tasks.

4.5. Consistency Compared with Natural Dataset

Besides its high scalability as a virtual framework, LEGO-Puzzles also demonstrates strong consistency with the natural environment. To verify this, we compare LEGO-Puzzles with 3DSRBench [38], which includes several similar tasks (Height in LEGO-Puzzles and 3DSRBench, Adjacency in LEGO-Puzzles and Location in 3DSRBench) but focuses on real-world domain images. Specifically, we evaluate all proprietary models tested on LEGO-Puzzles within the 3DSRBench dataset and compute the Pearson correlation coefficient [10] to measure the accuracy correlation between these two benchmarks. The results in Table 5 indicate that model performance on LEGO-Puzzles reliably reflects trends observed in natural data.

4.6. Task Similarity

In this subsection, we analyze task similarity in our benchmark by calculating the average rank correlation between each task and all others, as proposed by Zhang et al. [61]. The similarity score for each task is derived from the average correlation between its ranking and those of all other tasks, using three metrics: Spearman Rank Correlation Coefficient (SRCC), Pearson Linear Correlation Coefficient (PLCC), and R² Score (R-squared Coefficient of Determination).

The results in Figure 5 show that only a few task pairs have strong correlations, such as Next Step to Dependency, Multi-step to Single-step Sequential Reasoning, and Spatial Understanding. This is because they either share similar image inputs or integrate step-wise logical progression and spatial comprehension. However, most tasks exhibit moderate to low correlations, ensuring benchmark diversity. More than half of the task pairs have an SRCC between 0.3 and 0.6, indicating limited dependency among tasks.

Overall, our benchmark offers a balanced assessment of MLLMs across various reasoning skills. While some tasks show strong correlations due to conceptual overlap, the majority remain sufficiently independent, providing a comprehensive and distinctive evaluation framework.
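Both analyses above reduce to standard correlation computations over per-model accuracy vectors. The sketch below illustrates them with scipy and scikit-learn; the first array contains the six proprietary models' Height accuracies from Table 1, while the second is a placeholder standing in for their accuracies on the matched natural-image task, so the printed numbers are illustrative rather than those reported in Table 5 or Figure 5.

```python
# Illustrative only: the correlation measures used in Sections 4.5 and 4.6.
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import r2_score

# Height accuracies of the six proprietary models on LEGO-Puzzles (Table 1).
lego_height = [39.0, 29.0, 35.0, 35.0, 49.0, 31.0]
# Placeholder stand-ins for the same models' accuracies on the matched 3DSRBench task.
natural_height = [41.0, 30.0, 36.0, 37.0, 50.0, 33.0]

pcc, p_value = pearsonr(lego_height, natural_height)   # Section 4.5 (Table 5-style PCC and p-value)
srcc, _ = spearmanr(lego_height, natural_height)       # Section 4.6 rank correlation (SRCC)
r2 = r2_score(lego_height, natural_height)             # Section 4.6 R² score
print(f"PCC={pcc:.2f} (p={p_value:.5f}), SRCC={srcc:.2f}, R2={r2:.2f}")
```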
Figure 5. Task Similarity Heatmap. The heatmap illustrates the pairwise correlation between tasks in our benchmark, measured using SRCC, PLCC, and R² scores.

Figure 6. Visualization of sample failure cases in Height and Ordering. The Ground Truth answer is marked in blue, while the MLLM's answer is marked in red. Note: The questions above are slightly simplified for clarity and brevity.

Figure 7. Qualitative image generation results for Rotation* and Multiview* tasks. Note: The questions above are slightly simplified for clarity and brevity.

5. Error Analysis

In this section, we conduct a detailed error analysis of the evaluation process on our benchmark, providing insights into model behaviors and shortcomings.

Failure in Ordering Task. As shown in Table 1, several open-source models (4/14) completely fail in the Ordering task, scoring zero. The Ordering task requires MLLMs to perform multi-step reasoning. Despite explicitly specifying the required answer format in the prompt (e.g., a sequence such as "BACD"), some models were unable to generate a valid response and instead produced arbitrary outputs. For instance, several models (InternVL2.5-8B, Emu3, MiniCPM-V2.6, and LLaVA-OneVision-7B) exhibited strong biases, frequently defaulting to a single letter rather than a complete sequence. In extreme cases, models provided nearly identical responses across all test instances, such as Emu3, which answered "B" for 98 out of 100 test cases, demonstrating a lack of genuine reasoning ability. These results indicate that many open-source models struggle with sequence generation and constrained output formatting, suggesting potential issues in their ability to follow structured prompts for reasoning tasks. These response biases further suggest that models may be overly reliant on spurious correlations in training data rather than understanding the stepwise dependencies of a given sequence.

Challenges in Height Perception: 2D vs. 3D Understanding. In the Height task, we observe that most models (11/20) achieve scores lower than Random. We provide some failure cases in Figure 6, which exhibit noticeable 2D and 3D optical illusions. Since MLLMs are primarily trained on images with a predominantly 2D viewpoint, the discrepancy between 2D and 3D spatial understanding in our Height task often causes MLLMs to answer questions based on a 2D projection rather than a true 3D perspective. Even when we construct instruction prompts that explicitly require MLLMs to comprehend 3D spatial relationships, GPT-4o, the top-performing model, still fails to achieve human-level performance. This observation highlights the tendency of MLLMs to rely on 2D spatial priors during inference, suggesting the need for further research on 3D understanding training.

Weak Appearance Consistency and Instruction Following in Image Generation. Our evaluation of MLLM-generated images reveals substantial differences in instruction following and reasoning-based image synthesis when processing sequential visual inputs. As shown in Table 3, open-source models struggle significantly in both appearance consistency (App) and instruction following (IF), while proprietary models demonstrate varying degrees of success. Among the proprietary models, Gemini-2.0-Flash exhibits the strongest performance in both appearance and instruction adherence. It effectively follows input constraints and maintains high appearance fidelity, often editing the given image rather than generating a completely new one. This suggests that Gemini-2.0-Flash has a stronger spatial consistency mechanism, enabling it to make precise modifications while preserving structural coherence. For GPT-4o, the results suggest that it may not directly edit the input image but instead interpret its semantic content and generate a new image based on textual understanding. The differences in appearance fidelity and instruction following indicate that GPT-4o's generation process might involve reconstructing the scene conceptually rather than modifying the original image step by step. While this allows it to maintain conceptual relevance, its output often deviates in style and structure from the original input, leading to lower appearance fidelity compared to Gemini-2.0-Flash. Moreover, its instruction-following ability remains inconsistent, particularly for tasks requiring fine-grained reasoning. Among open-source models, Emu2 exhibits some capability in preserving visual appearance but fails entirely in instruction following, treating the task as mere image reconstruction rather than reasoning-based generation. It struggles with spatial dependencies and sequential modifications, making it ineffective for reasoning-intensive tasks. GILL and Anole perform the worst, failing to generate relevant outputs in nearly all cases. Their instruction-following scores are close to zero, and their generated images are often completely unrelated to the expected result. This highlights a fundamental limitation in their ability to process sequential visual transformations, making them unsuitable for complex, instruction-driven image generation. We provide some failure cases in Figure 7. These findings emphasize the fundamental challenges in spatial and sequential reasoning within open-source MLLMs. While Gemini-2.0-Flash shows the most precise adherence to instructions and image editing capabilities, GPT-4o tends to generate semantically relevant but visually divergent outputs.
Open-source models, by contrast, still lack robust mechanisms for sequential reasoning, underscoring the need for advancements in instruction-following and reasoning-aware image generation techniques.

6. Conclusion

We introduce LEGO-Puzzles, a novel benchmark specifically designed to evaluate spatial understanding, as well as single-step and multi-step sequential reasoning, in MLLMs. Inspired by human cognitive patterns in LEGO construction, we create a dataset that includes over 1,100 carefully curated visual question-answering (VQA) samples across 11 distinct tasks, providing diverse scenarios to assess multimodal visual reasoning. We conduct comprehensive experiments with 20 advanced MLLMs, revealing substantial performance gaps compared to humans, particularly in extended sequential reasoning and the generation of spatially coherent visual outputs. These findings underscore the urgent need to enhance the spatial understanding and sequential reasoning capabilities of multimodal AI.

References

[1] Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, et al. Pixtral 12b. arXiv preprint arXiv:2410.07073, 2024. 5
[2] Anthropic. The claude 3 model family: Opus, sonnet, haiku. 2024. 5
[3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023. 1
[4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 5
[5] Marc H Bornstein. Frames of mind: The theory of multiple intelligences. Journal of Aesthetic Education, 20(2), 1986. 1, 3
[6] Yixiong Chen, Li Liu, and Chris Ding. X-iqe: explainable image quality evaluation for text-to-image generation with visual large language models. arXiv preprint arXiv:2305.10843, 2023. 5
[7] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024. 5
[8] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, and Wenhai Wang. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites, 2024. 1, 2
[9] Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. Anole: An open, autoregressive, native large multimodal model for interleaved image-text generation. arXiv preprint arXiv:2407.06135, 2024. 5
[10] Israel Cohen, Yiteng Huang, Jingdong Chen, Jacob Benesty, Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. Noise reduction in speech processing, pages 1–4, 2009. 8
[11] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM international conference on multimedia, pages 11198–11201, 2024. 5
[12] Jianguo Duan, Liwen Zhuang, Qinglei Zhang, Ying Zhou, and Jiyun Qin. Multimodal perception-fusion-control and human–robot collaboration in manufacturing: A review. The International Journal of Advanced Manufacturing Technology, 132(3):1071–1093, 2024. 1
[13] Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020. 5
[14] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. ArXiv, abs/2306.13394, 2023. 1, 2
[15] Junyao Gao, Yanchen Liu, Yanan Sun, Yinhao Tang, Yanhong Zeng, Kai Chen, and Cairong Zhao. Styleshot: A snapshot on any style. arXiv preprint arXiv:2407.01414, 2024. 5
[16] Junyao Gao, Yanan Sun, Fei Shen, Xin Jiang, Zhening Xing, Kai Chen, and Cairong Zhao. Faceshot: Bring any character into life. arXiv preprint arXiv:2503.00740, 2025. 1
[17] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1
[18] Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Chenming Zhang, Shuai Liu, and Long Chen. Drivemllm: A benchmark for spatial understanding with multimodal large language models in autonomous driving. arXiv preprint arXiv:2411.13112, 2024. 1
[19] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021. 5
[20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. 5
[21] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017. 1, 2
[22] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 1
[23] Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. Generating images with multimodal language models. Advances in Neural Information Processing Systems, 36:21487–21506, 2023. 5
[24] Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon. Building and better understanding vision-language models: insights and future directions. In Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models, 2024. 5
[25] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. 2
[26] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 5
[27] Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542, 2025. 3
[28] Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. Manipllm: Embodied multimodal large language model for object-centric robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18061–18070, 2024. 1
[29] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 1
[30] Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan L Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14963–14973, 2023. 1, 2
[31] Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, and David Acuna. Reasoning paths with reference objects elicit quantitative spatial reasoning in large vision-language models. arXiv preprint arXiv:2409.09788, 2024. 1
[32] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26689–26699, 2024. 5
[33] Benlin Liu, Yuhao Dong, Yiqin Wang, Zixian Ma, Yansong Tang, Luming Tang, Yongming Rao, Wei-Chiu Ma, and Ranjay Krishna. Coarse correspondences boost spatial-temporal reasoning in multimodal language models. arXiv preprint arXiv:2408.00754, 2024. 3
[34] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023. 1
[35] Minqian Liu, Zhiyang Xu, Zihao Lin, Trevor Ashby, Joy Rimchala, Jiaxin Zhang, and Lifu Huang. Holistic evaluation for interleaved text-and-image generation. arXiv preprint arXiv:2406.14643, 2024. 5
[36] Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023. 1
[37] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024. 1, 2
[38] Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Alan Yuille, and Jieneng Chen. 3DSRBench: A comprehensive 3d spatial reasoning benchmark. arXiv preprint arXiv:2412.07825, 2024. 1, 2, 8
[39] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019. 1
[40] Nora S Newcombe and Andrea Frick. Early education for spatial intelligence: Why, what, and how. Mind, Brain, and Education, 4(3):102–111, 2010. 3
[41] OpenAI. Chatgpt. https://openai.com/blog/chatgpt, 2023. 1
[42] OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023. 1, 5
[43] OpenBMB. Minicpm: Unveiling the potential of end-side large language models, 2024. 2
[44] Simon Park, Abhishek Panigrahi, Yun Cheng, Dingli Yu, Anirudh Goyal, and Sanjeev Arora. Generalizing from simple to hard visual reasoning: Can we mitigate modality imbalance in vlms? arXiv preprint arXiv:2501.02669, 2025. 1, 2
[45] Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Mingyu Ding, Linjie Li, et al. Mmie: Massive mul-
Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. timodal interleaved comprehension benchmark for large
Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. vision-language models. arXiv preprint arXiv:2410.10139,
SAT: Spatial aptitude training for multimodal language mod- 2024. 3
els. arXiv preprint arXiv:2412.07755, 2024. 3 [58] Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han,
[46] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Li Fei-Fei, and Saining Xie. Thinking in space: How mul-
Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus timodal large language models see, remember, and recall
Rohrbach. Towards vqa models that can read. In Proceedings spaces. arXiv preprint arXiv:2412.14171, 2024. 1, 3
of the IEEE/CVF conference on computer vision and pattern [59] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui,
recognition, pages 8317–8326, 2019. 1 Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He,
[47] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv
Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing preprint arXiv:2408.01800, 2024. 5
Liu, Tiejun Huang, et al. Generative multimodal models are [60] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi
in-context learners. arXiv preprint arXiv:2312.13286, 2023. Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming
5 Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline
[48] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui multimodal understanding and reasoning benchmark for ex-
Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan pert agi. In Proceedings of the IEEE/CVF Conference
Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a on Computer Vision and Pattern Recognition, pages 9556–
family of highly capable multimodal models. arXiv preprint 9567, 2024. 2
arXiv:2312.11805, 2023. 1, 5 [61] Zicheng Zhang, Xiangyu Zhao, Xinyu Fang, Chunyi Li, Xi-
[49] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, aohong Liu, Xiongkuo Min, Haodong Duan, Kai Chen, and
Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Guangtao Zhai. Redundancy principles for mllms bench-
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. marks. arXiv preprint arXiv:2501.13953, 2025. 8
Llama 2: Open foundation and fine-tuned chat models. arXiv
preprint arXiv:2307.09288, 2023. 1
[50] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan,
Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin
Ge, et al. Qwen2-vl: Enhancing vision-language model’s
perception of the world at any resolution. arXiv preprint
arXiv:2409.12191, 2024. 1, 2, 5
[51] Siyu Wang, Cailian Chen, Xinyi Le, Qimin Xu, Lei Xu,
Yanzhou Zhang, and Jie Yang. CAD-GPT: Synthesising cad
construction sequence with spatial reasoning-enhanced mul-
timodal llms. In Proceedings of the AAAI Conference on
Artificial Intelligence (AAAI), 2025. 3
[52] Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou,
Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming
Deng, Zhiqi Li, et al. Drivemlm: Aligning multi-modal large
language models with behavioral planning states for au-
tonomous driving. arXiv preprint arXiv:2312.09245, 2023.
1
[53] Xingrui Wang, Wufei Ma, Zhuowan Li, Adam Kortylewski,
and Alan L Yuille. 3d-aware visual question answering about
parts, poses and occlusions. Advances in Neural Information
Processing Systems, 36:58717–58735, 2023. 1
[54] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan
Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang,
Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is
all you need. arXiv preprint arXiv:2409.18869, 2024. 5
[55] Marcy Willard. What is sequential reasoning in childhood?,
2022. 3
[56] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu,
Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue
Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-
experts vision-language models for advanced multimodal
understanding. arXiv preprint arXiv:2412.10302, 2024. 2, 5
[57] Peng Xia, Siwei Han, Shi Qiu, Yiyang Zhou, Zhaoyang
Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui,