DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation

Chen, Hong; Zhang, Yipeng; Wu, Simin; Wang, Xin; Duan, Xuguang; Zhou, Yuwei; Zhu, Wenwu

arXiv:2305.03374 (cs)

[Submitted on 5 May 2023 (v1), last revised 27 Feb 2024 (this version, v4)]

Abstract: Subject-driven text-to-image generation aims to generate customized images of a given subject based on text descriptions, and it has drawn increasing attention. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may cause subject-driven text-to-image generation to fail in two ways: (i) the identity-irrelevant information hidden in the entangled embedding may dominate the generation process, so that the generated images depend heavily on the irrelevant information while ignoring the given text descriptions; (ii) the identity-relevant information carried in the entangled embedding cannot be appropriately preserved, resulting in identity change of the subject in the generated images. To tackle these problems, we propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. Specifically, DisenBooth finetunes the pretrained diffusion model in the denoising process. Unlike previous works that use an entangled embedding to denoise each image, DisenBooth instead uses disentangled embeddings to respectively preserve the subject identity and capture the identity-irrelevant information. We further design novel weak denoising and contrastive embedding auxiliary tuning objectives to achieve the disentanglement. Extensive experiments show that, with the identity-preserved embedding, our proposed DisenBooth framework outperforms baseline models for subject-driven text-to-image generation. Additionally, by combining the identity-preserved embedding and the identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability.
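
The abstract names three tuning signals (denoising with disentangled embeddings, a weak denoising objective, and a contrastive embedding objective) without giving their formulas. The following is only a minimal sketch of how such terms might be combined, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the use of mean squared error for both denoising terms, the use of cosine similarity as the contrastive term, and the placeholder weights `weak_weight` and `contrastive_weight`.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def disentangled_tuning_loss(
    noise,               # ground-truth noise added to the image latent
    pred_joint,          # noise predicted from both embeddings (hypothetical input)
    pred_identity_only,  # noise predicted from the identity embedding alone (hypothetical input)
    f_identity,          # identity-preserved embedding
    f_irrelevant,        # identity-irrelevant embedding
    weak_weight=0.01,          # placeholder value, not taken from the source
    contrastive_weight=0.001,  # placeholder value, not taken from the source
):
    """Hypothetical combination of the three objectives named in the abstract."""
    # Main term: denoise using both embeddings together.
    denoise = mse(noise, pred_joint)
    # Weak term: the identity embedding alone should still denoise, with a small weight.
    weak_denoise = mse(noise, pred_identity_only)
    # Contrastive term (assumed form): penalize similarity between the two
    # embeddings so that they carry different information.
    contrastive = cosine_similarity(f_identity, f_irrelevant)
    return denoise + weak_weight * weak_denoise + contrastive_weight * contrastive
```

In this sketch the small weight on the identity-only term reflects the word "weak" in the abstract: the identity embedding is encouraged, but not required, to explain each image by itself, which leaves room for the identity-irrelevant embedding to absorb background and pose.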

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2305.03374 [cs.CV]
	(or arXiv:2305.03374v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.03374

Submission history

From: Hong Chen
[v1] Fri, 5 May 2023 09:08:25 UTC (5,820 KB)
[v2] Thu, 18 May 2023 15:36:08 UTC (13,002 KB)
[v3] Mon, 26 Feb 2024 03:53:58 UTC (13,624 KB)
[v4] Tue, 27 Feb 2024 02:45:34 UTC (13,624 KB)
