MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

Berman, William; Peysakhovich, Alexander


arXiv:2406.18790 (cs)

[Submitted on 26 Jun 2024 (v1), last revised 11 Sep 2024 (this version, v2)]

Abstract: We train a model to generate images from multimodal prompts of interleaved text and images such as "a <picture of a man> man and his <picture of a dog> dog in an <picture of a cartoon> animated style." We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being trained only on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes to tasks such as style transfer and character consistency. Our results show the promise of using multimodal models as general-purpose controllers for image generation.
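To make the interleaved prompt format concrete, here is a minimal sketch of how a caption and word-level crop boxes might be combined into a sequence of text and image segments. It is an illustration only, not the authors' pipeline: the `Crop` type, the `interleave` and `render` helpers, the box coordinates, and the assumption that each crop is placed directly before its word (as in the example prompt in the abstract) are all hypothetical. The detection step that produces the boxes is assumed to exist and is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# Hypothetical types for illustration; none of these names come from the paper.
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class Crop:
    """Stand-in for an image crop tied to one caption word."""
    word: str
    box: Box


Segment = Union[str, Crop]


def interleave(caption: str, detections: Dict[str, Box]) -> List[Segment]:
    """Insert a Crop before each caption word that has a detected box."""
    segments: List[Segment] = []
    for token in caption.split():
        # Normalize the token so "Dog," still matches the key "dog".
        key = token.strip(".,!?\"'").lower()
        if key in detections:
            segments.append(Crop(key, detections[key]))
        segments.append(token)
    return segments


def render(segments: List[Segment]) -> str:
    """Show the prompt as text, with a placeholder where each crop would go."""
    return " ".join(
        f"<picture of a {s.word}>" if isinstance(s, Crop) else s
        for s in segments
    )


# Assumed inputs: a caption plus boxes from some word-grounded detector.
caption = "a man and his dog"
detections = {"man": (10, 20, 110, 220), "dog": (120, 150, 200, 230)}

prompt = interleave(caption, detections)
print(render(prompt))
```

In a real system each `Crop` would carry pixel data for the vision-language encoder rather than a text placeholder; the sketch only shows the ordering of segments.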

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.18790 [cs.CV]
	(or arXiv:2406.18790v2 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2406.18790

Submission history

From: Alexander Peysakhovich
[v1] Wed, 26 Jun 2024 23:21:42 UTC (42,291 KB)
[v2] Wed, 11 Sep 2024 21:56:02 UTC (42,291 KB)
