🌟 Awesome Unified Multimodal Models

Curated papers on unifying multimodal understanding and generation from my regular reading.

📬 Have a new paper or collaboration idea? Reach out to me at: [email protected].
🤝 Seeking Opportunities: I'm eager to discuss and collaborate, especially on industry internships in unified multimodal models!

📄 Papers

2023

[2023-08-12] SEED: Planting a SEED of Vision in Large Language Model

Pioneering vision integration into LLMs.

2024

[2024-03-22] LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
[2024-04-22] SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
[2024-05-08] Emu: Generative Pretraining in Multimodality
[2024-05-08] Emu2: Generative Multimodal Models are In-Context Learners
[2024-05-16] Chameleon: Mixed-Modal Early-Fusion Foundation Models
[2024-08-20] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
[2024-09-27] Emu3: Next-Token Prediction is All You Need
[2024-10-15] MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
[2024-10-17] Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
[2024-10-21] PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
[2024-10-21] Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
[2024-10-23] VILA-U: A Unified Foundation Model Integrating Visual Understanding and Generation
[2024-10-31] MIO: A Foundation Model on Multimodal Tokens
[2024-11-12] JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
[2024-11-28] Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads
[2024-12-04] TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
[2024-12-05] Liquid: Language Models are Scalable Multi-modal Generators
[2024-12-05] MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
[2024-12-08] SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation
[2024-12-09] ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
[2024-12-09] Visual Lexicon: Rich Image Features in Language Space
[2024-12-11] Multimodal Latent Language Modeling with Next-Token Diffusion
[2024-12-12] SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
[2024-12-18] MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
⭐ Highly recommended! The easiest training approach!
[2024-12-26] LMFusion: Adapting Pretrained Language Models for Multimodal Generation
[2024-12-31] Dual Diffusion for Unified Image Generation and Understanding

2025

[2025-01-21] VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
[2025-01-21] Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
[2025-02-07] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
[2025-02-17] HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
[2025-02-27] UniTok: A Unified Tokenizer for Visual Generation and Understanding
[2025-03-08] USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
[2025-03-09] SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
[2025-03-10] WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

⭐ Highly recommended! The only study that examines whether understanding benefits generation at the level of world knowledge. (And yes, it's my work! haha)
[2025-03-18] Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
[2025-03-19] DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
[2025-03-20] Unified Multimodal Discrete Diffusion
[2025-03-27] UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning
[2025-03-27] Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
[2025-03-27] ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
[2025-04-03] VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
[2025-04-03] UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
[2025-04-09] Transfer between Modalities with MetaQueries

⭐ Highly recommended!
[2025-04-20] Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
[2025-04-24] Step1X-Edit: A Practical Framework for General Image Editing
[2025-04-29] X-Fusion: Introducing New Modality to Frozen Large Language Models
[2025-04-29] YoChameleon: Personalized Vision and Language Generation
[2025-05-01] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
[2025-05-08] Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
[2025-05-09] Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
[2025-05-09] TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
[2025-05-12] Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning
[2025-05-15] BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
⭐ Highly recommended!
[2025-05-16] End-to-End Vision Tokenizer Tuning
[2025-05-16] Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
[2025-05-21] Emerging Properties in Unified Multimodal Pretraining
⭐ Highly recommended!
[2025-05-21] UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
[2025-05-30] Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
[2025-05-30] UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
[2025-05-30] Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
⭐ Highly recommended!
[2025-05-30] OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation
[2025-05-30] R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation

🔗 Useful Links

(TMLR 2025) Awesome-Autoregressive-Models-in-Vision
You can refer to this page for related unified models.
Awesome-Unified-Multimodal-Models
Another excellent resource for multimodal research.
SOTA-Paper-Rating
Compare and evaluate state-of-the-art papers.

Contributions welcome! Add new papers via pull requests or email me directly.
