GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Fang, Rongyao; Duan, Chengqi; Wang, Kun; Huang, Linjiang; Li, Hao; Yan, Shilin; Tian, Hao; Zeng, Xingyu; Zhao, Rui; Dai, Jifeng; Liu, Xihui; Li, Hongsheng

arXiv:2503.10639 (cs)

[Submitted on 13 Mar 2025]

Abstract: Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at this https URL.

Comments:	Dataset and models are released in this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.10639 [cs.CV]
	(or arXiv:2503.10639v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.10639

Submission history

From: Rongyao Fang
[v1] Thu, 13 Mar 2025 17:59:59 UTC (24,768 KB)