The first systematic survey dedicated to MLLM alignment! ✨
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | arXiv | 2025-02 | GitHub | Page |
| Qwen2.5-VL Technical Report | arXiv | 2025-02 | GitHub | Page |
| OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | arXiv | 2025-02 | GitHub | - |
| Multimodal Preference Data Synthetic Alignment with Reward Model | arXiv | 2024-12 | GitHub | Page |
| Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization | arXiv | 2024-11 | GitHub | Page |
| vVLM: Exploring Visual Reasoning in VLMs against Language Priors | OpenReview | 2024-10 | - | - |
| LLaVA-Critic: Learning to Evaluate Multimodal Models | arXiv | 2024-10 | GitHub | Page |
| CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs | arXiv | 2024-08 | - | - |
| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | arXiv | 2024-05 | - | - |
| Silkie: Preference Distillation for Large Visual Language Models | arXiv | 2023-12 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| LMM-R1 | GitHub | 2025-02 | GitHub | - |
| Open-R1-Video | GitHub | 2025-02 | GitHub | - |
| VLM-R1: A stable and generalizable R1-style Large Vision-Language Model | GitHub | 2025-02 | GitHub | - |
| R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 | GitHub | 2025-02 | GitHub | - |
| EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework | GitHub | 2025-02 | GitHub | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models | arXiv | 2024-10 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization | arXiv | 2024-11 | GitHub | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | arXiv | 2025-02 | GitHub | Page |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | arXiv | 2024-11 | GitHub | - |
| LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | arXiv | 2024-07 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | arXiv | 2024-10 | - | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| SQuBa: Speech Mamba Language Model with Querying-Attention for Efficient Summarization | OpenReview | 2024-10 | - | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| 3D-CT-GPT++: Enhancing 3D Radiology Report Generation with Direct Preference Optimization and Large Vision-Language Models | OpenReview | 2024-10 | - | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine | arXiv | 2024-07 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| HomieBot: An Adaptive System for Embodied Mobile Manipulation in Open Environments | OpenReview | 2024-10 | - | - |
| InteractiveCOT: Aligning Dynamic Chain-of-Thought Planning for Embodied Decision-Making | OpenReview | 2024-10 | - | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | arXiv | 2025-02 | GitHub | Page |
| Aligning Visual Contrastive Learning Models via Preference Optimization | arXiv | 2024-11 | GitHub | - |
| AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization | OpenReview | 2024-10 | - | - |
| Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | arXiv | 2024-02 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning | arXiv | 2024-05 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | arXiv | 2025-02 | GitHub | Page |
| Llama 3.1: An In-Depth Analysis of the Next Generation Large Language Model | arXiv | 2024-07 | GitHub | Page |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09 | GitHub | Page |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-09 | GitHub | Page |
| Detecting and Preventing Hallucinations in Large Vision Language Models | arXiv | 2023-08 | GitHub | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| LLaVA-Critic: Learning to Evaluate Multimodal Models | arXiv | 2024-10 | GitHub | Page |
| InteractiveCOT: Aligning Dynamic Chain-of-Thought Planning for Embodied Decision-Making | OpenReview | 2024-10 | - | - |
| CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs | arXiv | 2024-08 | - | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| vVLM: Exploring Visual Reasoning in VLMs against Language Priors | OpenReview | 2024-10 | - | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization | OpenReview | 2024-10 | - | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-Level Vision | ICLR | 2023-09 | GitHub | Page |
| Improved Baselines with Visual Instruction Tuning | arXiv | 2023-10 | GitHub | Page |
| LiveBench: A Challenging, Contamination-Free LLM Benchmark | arXiv | 2024-06 | GitHub | Page |
| Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models | arXiv | 2024-05 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | arXiv | 2025-02 | GitHub | Page |
| M-RewardBench: Evaluating Reward Models in Multilingual Settings | arXiv | 2024-10 | GitHub | Page |
| VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models | arXiv | 2024-11 | GitHub | Page |
| RewardBench: Evaluating Reward Models for Language Modeling | arXiv | 2024-03 | GitHub | - |
| MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? | arXiv | 2024-07 | GitHub | Page |
| MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark | arXiv | 2024-02 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | arXiv | 2025-02 | GitHub | - |
| From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline | arXiv | 2024-06 | GitHub | Page |
| Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators | arXiv | 2024-04 | GitHub | - |
| AlignBench: Benchmarking Chinese Alignment of Large Language Models | arXiv | 2023-11 | GitHub | - |