
📖 This is a repository for organizing papers, code, and other resources related to unified multimodal models.


Purshow/Awesome-Unified-Multimodal


🌟 Awesome Unified Multimodal Models


A curated list of papers on unifying multimodal understanding and generation, drawn from my regular reading.

📬 Have a new paper or collaboration idea? Reach out to me at: [email protected].
🤝 Seeking opportunities: I'm eager to discuss and collaborate, especially on industry internships in unified multimodal models!


📑 Table of Contents


📄 Papers

2023

  • [2023-08-12] SEED: Planting a SEED of Vision in Large Language Model
    Pioneering vision integration into LLMs.

2024

  • [2024-03-22] LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
  • [2024-04-22] SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
  • [2024-05-08] Emu: Generative Pretraining in Multimodality
  • [2024-05-08] Emu2: Generative Multimodal Models are In-Context Learners
  • [2024-05-16] Chameleon: Mixed-Modal Early-Fusion Foundation Models
  • [2024-08-20] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
  • [2024-09-27] Emu3: Next-Token Prediction is All You Need
  • [2024-10-15] MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
  • [2024-10-17] Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
  • [2024-10-21] PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
  • [2024-10-21] Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  • [2024-10-23] VILA-U: A Unified Foundation Model Integrating Visual Understanding and Generation
  • [2024-10-31] MIO: A Foundation Model on Multimodal Tokens
  • [2024-11-12] JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
  • [2024-11-28] Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads
  • [2024-12-04] TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
  • [2024-12-05] Liquid: Language Models are Scalable Multi-modal Generators
  • [2024-12-05] MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
  • [2024-12-08] SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation
  • [2024-12-09] ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
  • [2024-12-09] Visual Lexicon: Rich Image Features in Language Space
  • [2024-12-11] Multimodal Latent Language Modeling with Next-Token Diffusion
  • [2024-12-12] SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
  • [2024-12-18] MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
    ⭐ Highly recommended! The easiest way to train!
  • [2024-12-26] LMFusion: Adapting Pretrained Language Models for Multimodal Generation
  • [2024-12-31] Dual Diffusion for Unified Image Generation and Understanding

2025

  • [2025-01-21] VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
  • [2025-01-21] Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
  • [2025-02-07] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
  • [2025-02-17] HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
  • [2025-02-27] UniTok: A Unified Tokenizer for Visual Generation and Understanding
  • [2025-03-08] USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
  • [2025-03-09] SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
  • [2025-03-10] WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
    ⭐ Highly recommended! The only study that examines whether understanding benefits generation at the level of world knowledge (and yes, it's my work, haha).
  • [2025-03-18] Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
  • [2025-03-19] DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
  • [2025-03-20] Unified Multimodal Discrete Diffusion
  • [2025-03-27] UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning
  • [2025-03-27] Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
  • [2025-03-27] ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
  • [2025-04-03] VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
  • [2025-04-03] UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
  • [2025-04-09] Transfer between Modalities with MetaQueries
    ⭐ Highly recommended!
  • [2025-04-20] Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
  • [2025-04-24] Step1X-Edit: A Practical Framework for General Image Editing
  • [2025-04-29] X-Fusion: Introducing New Modality to Frozen Large Language Models
  • [2025-04-29] YoChameleon: Personalized Vision and Language Generation
  • [2025-05-01] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
  • [2025-05-08] Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
  • [2025-05-09] Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
  • [2025-05-09] TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
  • [2025-05-12] Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning
  • [2025-05-15] BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
    ⭐ Highly recommended!
  • [2025-05-16] End-to-End Vision Tokenizer Tuning
  • [2025-05-16] Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
  • [2025-05-21] Emerging Properties in Unified Multimodal Pretraining
    ⭐ Highly recommended!
  • [2025-05-21] UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation
  • [2025-05-30] Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
  • [2025-05-30] UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
  • [2025-05-30] Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
    ⭐ Highly recommended!
  • [2025-05-30] OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation
  • [2025-05-30] R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation

🔗 Useful Links


Contributions welcome! Add new papers via pull requests or email me directly.
