REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training

Wang, Ziqiao; Zhao, Wangbo; Zhou, Yuhao; Li, Zekai; Liang, Zhiyuan; Shi, Mingjia; Zhao, Xuanlei; Zhou, Pengfei; Zhang, Kaipeng; Wang, Zhangyang; Wang, Kai; You, Yang

arXiv:2505.16792 (cs)

[Submitted on 22 May 2025]

Abstract: Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA), which matches DiT hidden features to those of a non-generative teacher (e.g., DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs one-shot termination, deactivating the alignment loss once a simple trigger, such as a fixed iteration count, is hit; this frees the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256×256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28× reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, showing that it is a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at this https URL.
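
The two-phase schedule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the form of the alignment terms, so the cosine distance for features, the cross-entropy for attention rows, the function names, and the weights `lambda_feat` and `lambda_attn` are all assumptions chosen for illustration. Only the overall structure (alignment terms added to the denoising loss in Phase I, then switched off once a fixed step is reached) follows the text.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors (assumed feature term)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def attention_cross_entropy(teacher_row, student_row, eps=1e-12):
    """Cross-entropy of a student attention row against a teacher row (assumed attention term).

    Both rows are expected to be probability distributions over tokens.
    """
    return -sum(t * math.log(s + eps) for t, s in zip(teacher_row, student_row))


def haste_style_loss(denoise_loss, student_feat, teacher_feat,
                     student_attn, teacher_attn, step, stop_step,
                     lambda_feat=0.5, lambda_attn=0.5):
    """Combine the denoising loss with gated alignment terms.

    Phase I (step < stop_step): add feature and attention alignment.
    Phase II (step >= stop_step): one-shot termination, denoising loss only.
    The weights and stop_step are hypothetical hyperparameters.
    """
    if step >= stop_step:
        return denoise_loss
    feat_term = cosine_distance(student_feat, teacher_feat)
    attn_term = attention_cross_entropy(teacher_attn, student_attn)
    return denoise_loss + lambda_feat * feat_term + lambda_attn * attn_term
```

In a real training loop the vectors would be batched tensors taken from mid-level DiT layers and from the frozen teacher, and the gate would be evaluated once per optimization step; the scalar version here only shows how the trigger removes the alignment terms entirely rather than decaying them.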

Comments:	24 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2505.16792 [cs.CV]
	(or arXiv:2505.16792v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2505.16792

Submission history

From: Yuhao Zhou
[v1] Thu, 22 May 2025 15:34:33 UTC (30,429 KB)
