Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

Lv, Zheqi; Chen, Junhao; Tian, Qi; Yin, Keting; Zhang, Shengyu; Wu, Fei

Computer Science > Computer Vision and Pattern Recognition

arXiv:2505.20053 (cs)

[Submitted on 26 May 2025]

Title:Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

Authors:Zheqi Lv, Junhao Chen, Qi Tian, Keting Yin, Shengyu Zhang, Fei Wu

View PDF HTML (experimental)

Abstract:Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability. However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process. Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies-making them ineffective in providing actionable guidance for correcting the generative trajectory. As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, severely compromising prompt-image alignment and image quality. To tackle these challenges, we propose MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a novel framework that, for the first time, introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. PPAD performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. The framework supports both inference-only and training-enhanced settings, and performs semantic correction at only extremely few diffusion steps, offering strong generality and scalability. Extensive experiments demonstrate PPAD's significant improvements.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Cite as:	arXiv:2505.20053 [cs.CV]
	(or arXiv:2505.20053v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2505.20053

Submission history

From: Zheqi Lv [view email]
[v1] Mon, 26 May 2025 14:42:35 UTC (3,118 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators