The first systematic survey dedicated to MLLM alignment! ✨
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | arXiv | 2025-02 | GitHub | Page |
| Qwen2.5-VL Technical Report | arXiv | 2025-02 | GitHub | Page |
| OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | arXiv | 2025-02 | GitHub | - |
| Multimodal Preference Data Synthetic Alignment with Reward Model | arXiv | 2024-12 | GitHub | Page |
| Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization | arXiv | 2024-11 | GitHub | Page |
| vVLM: Exploring Visual Reasoning in VLMs against Language Priors | OpenReview | 2024-10 | - | - |
| LLaVA-Critic: Learning to Evaluate Multimodal Models | arXiv | 2024-10 | GitHub | Page |
| CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs | arXiv | 2024-08 | - | - |
| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | arXiv | 2024-05 | - | - |
| Silkie: Preference Distillation for Large Visual Language Models | arXiv | 2023-12 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| LMM-R1 | GitHub | 2025-02 | GitHub | - |
| Open-R1-Video | GitHub | 2025-02 | GitHub | - |
| VLM-R1: A stable and generalizable R1-style Large Vision-Language Model | GitHub | 2025-02 | GitHub | - |
| R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 | GitHub | 2025-02 | GitHub | - |
| EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework | GitHub | 2025-02 | GitHub | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models | arXiv | 2024-10 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization | arXiv | 2024-11 | GitHub | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | arXiv | 2025-02 | GitHub | Page |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | arXiv | 2024-11 | GitHub | - |
| LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | arXiv | 2024-07 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | arXiv | 2024-10 | - | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| SQuBa: Speech Mamba Language Model with Querying-Attention for Efficient Summarization | OpenReview | 2024-10 | - | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| 3D-CT-GPT++: Enhancing 3D Radiology Report Generation with Direct Preference Optimization and Large Vision-Language Models | OpenReview | 2024-10 | - | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine | arXiv | 2024-07 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| HomieBot: An Adaptive System for Embodied Mobile Manipulation in Open Environments | OpenReview | 2024-10 | - | - |
| InteractiveCOT: Aligning Dynamic Chain-of-Thought Planning for Embodied Decision-Making | OpenReview | 2024-10 | - | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | arXiv | 2025-02 | GitHub | Page |
| Aligning Visual Contrastive Learning Models via Preference Optimization | arXiv | 2024-11 | GitHub | - |
| AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization | OpenReview | 2024-10 | - | - |
| Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | arXiv | 2024-02 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning | arXiv | 2024-05 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | arXiv | 2025-02 | GitHub | Page |
| Llama 3.1: An In-Depth Analysis of the Next Generation Large Language Model | arXiv | 2024-07 | GitHub | Page |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09 | GitHub | Page |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-09 | GitHub | Page |
| Detecting and Preventing Hallucinations in Large Vision Language Models | arXiv | 2023-08 | GitHub | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| LLaVA-Critic: Learning to Evaluate Multimodal Models | arXiv | 2024-10 | GitHub | Page |
| InteractiveCOT: Aligning Dynamic Chain-of-Thought Planning for Embodied Decision-Making | OpenReview | 2024-10 | - | - |
| CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs | arXiv | 2024-08 | - | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| vVLM: Exploring Visual Reasoning in VLMs against Language Priors | OpenReview | 2024-10 | - | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization | OpenReview | 2024-10 | - | - |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-Level Vision | ICLR | 2023-09 | GitHub | Page |
| Improved Baselines with Visual Instruction Tuning | arXiv | 2023-10 | GitHub | Page |
| LiveBench: A Challenging, Contamination-Free LLM Benchmark | arXiv | 2024-06 | GitHub | Page |
| Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models | arXiv | 2024-05 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | arXiv | 2025-02 | GitHub | Page |
| M-RewardBench: Evaluating Reward Models in Multilingual Settings | arXiv | 2024-10 | GitHub | Page |
| VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models | arXiv | 2024-11 | GitHub | Page |
| RewardBench: Evaluating Reward Models for Language Modeling | arXiv | 2024-03 | GitHub | - |
| MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? | arXiv | 2024-07 | GitHub | Page |
| MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark | arXiv | 2024-02 | GitHub | Page |
| Title | Venue | Date | Code | Page |
|---|---|---|---|---|
| OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | arXiv | 2025-02 | GitHub | - |
| From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline | arXiv | 2024-06 | GitHub | Page |
| Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators | arXiv | 2024-04 | GitHub | - |
| AlignBench: Benchmarking Chinese Alignment of Large Language Models | arXiv | 2023-11 | GitHub | - |