Step1X-Edit: A Practical Framework for General Image Editing

Liu, Shiyu; Han, Yucheng; Xing, Peng; Yin, Fukun; Wang, Rui; Cheng, Wei; Liao, Jiaqi; Wang, Yingming; Fu, Honghao; Han, Chunrui; Li, Guopeng; Peng, Yuang; Sun, Quan; Wu, Jingwei; Cai, Yan; Ge, Zheng; Ming, Ranchen; Xia, Lei; Zeng, Xianfang; Zhu, Yibo; Jiao, Binxing; Zhang, Xiangyu; Yu, Gang; Jiang, Daxin

arXiv:2504.17761 (cs)

[Submitted on 24 Apr 2025 (v1), last revised 6 May 2025 (this version, v3)]

Abstract: Image editing models have developed rapidly in recent years. Recently unveiled multimodal models such as GPT-4o and Gemini2 Flash have introduced highly promising image editing capabilities. These models fulfill the vast majority of user-driven editing requirements, marking a significant advance in image manipulation. However, a large gap remains between open-source algorithms and these closed-source models. In this paper, we therefore aim to release a state-of-the-art image editing model, called Step1X-Edit, which provides performance comparable to closed-source models such as GPT-4o and Gemini2 Flash. Specifically, we adopt a multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline that produces a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making a significant contribution to the field of image editing.
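
The data flow described in the abstract (a multimodal LLM encodes the reference image and instruction into a latent embedding, which then conditions an image decoder) can be illustrated with a toy sketch. Everything below is a hypothetical stand-in and not the authors' implementation: `EditRequest`, `toy_mllm_encode`, `toy_diffusion_decode`, and `edit_image` are invented names, the "image" is a short list of floats, and a simple blending loop takes the place of real diffusion denoising. It only shows how the two stages might connect.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class EditRequest:
    """Hypothetical container: a toy 'image' (list of floats) plus an instruction."""
    reference_image: List[float]
    instruction: str


def toy_mllm_encode(request: EditRequest, dim: int = 8) -> List[float]:
    """Stand-in for the multimodal LLM (assumption, not the real model).

    Maps the (image, instruction) pair to a fixed-size latent embedding.
    """
    digest = hashlib.sha256(request.instruction.encode("utf-8")).digest()
    text_part = [b / 255.0 for b in digest[:dim]]
    image_mean = sum(request.reference_image) / len(request.reference_image)
    # Mix image and text information into one embedding vector.
    return [t + image_mean for t in text_part]


def toy_diffusion_decode(reference_image: List[float],
                         embedding: List[float],
                         steps: int = 4) -> List[float]:
    """Stand-in for the diffusion image decoder (assumption).

    A real decoder would iteratively denoise; here each step just nudges
    every pixel toward a value taken from the conditioning embedding.
    """
    image = list(reference_image)
    for _ in range(steps):
        image = [0.9 * px + 0.1 * embedding[i % len(embedding)]
                 for i, px in enumerate(image)]
    return image


def edit_image(request: EditRequest) -> List[float]:
    """Two-stage flow: encode the request, then decode conditioned on it."""
    embedding = toy_mllm_encode(request)
    return toy_diffusion_decode(request.reference_image, embedding)


request = EditRequest(reference_image=[0.0, 0.5, 1.0, 0.25],
                      instruction="make the sky pink")
edited = edit_image(request)
print(len(edited))
```

The point of the sketch is the separation of concerns: the encoder sees both inputs and produces one conditioning vector, while the decoder only sees the reference image and that vector.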

Comments:	code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2504.17761 [cs.CV]
	(or arXiv:2504.17761v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.17761

Submission history

From: Xianfang Zeng
[v1] Thu, 24 Apr 2025 17:25:12 UTC (11,536 KB)
[v2] Mon, 28 Apr 2025 09:56:08 UTC (11,541 KB)
[v3] Tue, 6 May 2025 15:58:40 UTC (12,347 KB)
