BradyFU/Awesome-Multimodal-Large-Language-Models

 
 

Aligning Multimodal LLM with Human Preference: A Survey

The first systematic survey dedicated to MLLM alignment! ✨


Application Scenarios

General Image Understanding

Mitigating Hallucinations

Title Venue Date Code Page
Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization
arXiv 2025-02 Github Page
DAMA: Data- and Model-aware Alignment of Multi-modal LLMs
arXiv 2025-02 Github -
Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
arXiv 2025-01 Github Page
Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
arXiv 2025-01 Github -
CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs
arXiv 2025-01 Github -
Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization
arXiv 2024-11 - -
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
arXiv 2024-08 Github Page
mDPO: Conditional Preference Optimization for Multimodal Large Language Models
arXiv 2024-06 Github Page
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
arXiv 2024-05 Github -
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
arXiv 2023-11 Github Page
Aligning Large Multimodal Models with Factually Augmented RLHF
arXiv 2023-09 Github Page
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
arXiv 2023-09 Github Page
Detecting and Preventing Hallucinations in Large Vision Language Models
arXiv 2023-08 Github -

Enhancing Comprehensive Capabilities

Title Venue Date Code Page
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
arXiv 2025-02 Github Page
Qwen2.5-VL Technical Report
arXiv 2025-02 Github Page
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
arXiv 2025-02 Github -
Multimodal Preference Data Synthetic Alignment with Reward Model
arXiv 2024-12 Github Page
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
arXiv 2024-11 Github Page
vVLM: Exploring Visual Reasoning in VLMs against Language Priors
openreview 2024-10 - -
LLaVA-Critic: Learning to Evaluate Multimodal Models
arXiv 2024-10 Github Page
CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs
arXiv 2024-08 - -
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
arXiv 2024-05 - -
Silkie: Preference Distillation for Large Visual Language Models
arXiv 2023-12 Github Page

Multi-modal O1 Development

Title Venue Date Code Page
LMM-R1
github 2025-02 Github -
Open-R1-Video
github 2025-02 Github -
VLM-R1: A stable and generalizable R1-style Large Vision-Language Model
github 2025-02 Github -
R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3
github 2025-02 Github -
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework
github 2025-02 Github -

Multi-Image, Video, and Audio

Multi-Image

Title Venue Date Code Page
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
arXiv 2024-10 Github Page

ICL

Title Venue Date Code Page
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
arXiv 2024-11 Github -

Video

Title Venue Date Code Page
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
arXiv 2025-02 Github Page
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
arXiv 2024-11 Github -
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
arXiv 2024-07 Github Page

Audio-Visual

Title Venue Date Code Page
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization
arXiv 2024-10 - Page

Audio-Text

Title Venue Date Code Page
SQuBa: Speech Mamba Language Model with Querying-Attention for Efficient Summarization
openreview 2024-10 - -

Extended Multimodal Applications

Medicine

Title Venue Date Code Page
3D-CT-GPT++: Enhancing 3D Radiology Report Generation with Direct Preference Optimization and Large Vision-Language Models
openreview 2024-10 - -

Mathematics

Title Venue Date Code Page
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
arXiv 2024-07 Github Page

Embodied Intelligence

Title Venue Date Code Page
HomieBot: an Adaptive System for Embodied Mobile Manipulation in Open Environments
openreview 2024-10 - -
InteractiveCOT: Aligning Dynamic Chain-of-Thought Planning for Embodied Decision-Making
openreview 2024-10 - -

Safety

Title Venue Date Code Page
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
arXiv 2025-02 Github Page
Aligning Visual Contrastive learning models via Preference Optimization
arXiv 2024-11 Github -
AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
openreview 2024-10 - -
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
arXiv 2024-02 Github Page

Agent

Title Venue Date Code Page
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
arXiv 2024-05 Github Page

Dataset Construction

Using External Knowledge

Human Annotation

Title Venue Date Code Page
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
arXiv 2025-02 Github Page
Llama 3.1: An In-Depth Analysis of the Next Generation Large Language Model
arXiv 2024-07 Github Page
Aligning Large Multimodal Models with Factually Augmented RLHF
arXiv 2023-09 Github Page
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
arXiv 2023-09 Github Page
Detecting and Preventing Hallucinations in Large Vision Language Models
arXiv 2023-08 Github -

Closed-Source LLM/MLLM

Title Venue Date Code Page
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
arXiv 2025-02 Github -
Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization
arXiv 2024-11 - -
HomieBot: an Adaptive System for Embodied Mobile Manipulation in Open Environments
openreview 2024-10 - -
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization
arXiv 2024-10 - Page
Phantom of Latent for Large Language and Vision Models
arXiv 2024-09 Github -
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
arXiv 2024-07 Github Page
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
arXiv 2024-02 Github Page
Silkie: Preference Distillation for Large Visual Language Models
arXiv 2023-12 Github Page
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
arXiv 2023-11 Github Page
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
arXiv 2023-06 Github Page

Open-Source LLM/MLLM

Title Venue Date Code Page
LLaVA-Critic: Learning to Evaluate Multimodal Models
arXiv 2024-10 Github Page
InteractiveCOT: Aligning Dynamic Chain-of-Thought Planning for Embodied Decision-Making
openreview 2024-10 - -
CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs
arXiv 2024-08 - -

Self-Annotation

Single Text Modality

Title Venue Date Code Page
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
arXiv 2024-11 Github -
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
arXiv 2024-11 Github Page
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
arXiv 2024-10 Github Page
SQuBa: Speech Mamba Language Model with Querying-Attention for Efficient Summarization
openreview 2024-10 - -
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
arXiv 2024-05 - -

Single Image Modality

Title Venue Date Code Page
vVLM: Exploring Visual Reasoning in VLMs against Language Priors
openreview 2024-10 - -

Image-Text Mixed Modality

Title Venue Date Code Page
AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
openreview 2024-10 - -

Evaluation Benchmark

General Knowledge

Title Venue Date Code Page
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
arXiv 2024-08 Github Page
Are We on the Right Way for Evaluating Large Vision-Language Models?
NeurIPS 2024-03 Github Page
MMBench: Is Your Multi-modal Model an All-around Player?
NeurIPS 2023-07 Github -
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
arXiv 2024-04 Github Page
BLINK: Multimodal Large Language Models Can See but Not Perceive
ECCV 2024-04 Github Page
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
ICLR 2023-10 Github Page
SQA3D: Situated Question Answering in 3D Scenes
ICLR 2022-10 Github Page
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
CVPR 2024-11 Github Page
MVBench: A Comprehensive Multi-Modal Video Understanding Benchmark
CVPR 2023-11 Github -
MANTIS: Interleaved Multi-Image Instruction Tuning
arXiv 2024-05 Github Page

Hallucination

Title Venue Date Code Page
Object Hallucination in Image Captioning
arXiv 2018-09 Github -
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
arXiv 2024-06 Github Page
VALOR-Eval: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models
ACL Findings 2024-04 Github -
Evaluating Object Hallucination in Large Vision-Language Models
EMNLP 2023-05 Github -
Evaluation and Analysis of Hallucination in Large Vision-Language Models
arXiv 2023-08 Github -
MOCHA: Multi-Objective Reinforcement Mitigating Caption Hallucinations
arXiv 2023-12 Github -
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
ICLR 2023-06 Github -
An LLM-Free Multi-Dimensional Benchmark for MLLMs Hallucination Evaluation
arXiv 2023-11 Github -
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
arXiv 2024-01 Github -
Aligning Large Multimodal Models with Factually Augmented RLHF
arXiv 2023-09 Github Page
VLIND-Bench: Measuring Language Priors in Large Vision-Language Models
arXiv 2024-06 - -
Detecting and Preventing Hallucinations in Large Vision Language Models
arXiv 2023-08 Github -
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
arXiv 2023-10 Github -
Visual Hallucinations of Multi-Modal Large Language Models
arXiv 2024-02 Github -
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
arXiv 2024-05 Github -
Unified Hallucination Detection for Multimodal Large Language Models
ACL 2024-02 Github -
PHD: A Prompted Visual Hallucination Evaluation Dataset
arXiv 2024-05 Github -
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
AAAI 2019-06 Github -
Evaluating and Analyzing Relationship Hallucinations in LVLMs
ICML 2024-06 Github -
mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation
arXiv 2024-10 Github -
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
arXiv 2023-11 Github -
Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models
arXiv 2024-06 Github -

Safety

Title Venue Date Code Page
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
arXiv 2025-02 Github Page
Efficiently Adversarial Examples Generation for Visual-Language Models under Targeted Transfer Scenarios Using Diffusion Models
arXiv 2024-04 - -
Red Teaming Visual Language Models
arXiv 2024-01 Github -
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
arXiv 2024-02 Github Page
MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models
arXiv 2024-06 Github Page
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
arXiv 2023-11 Github -
MossBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?
arXiv 2024-06 Github Page

Conversation

Title Venue Date Code Page
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-Level Vision
ICLR 2023-09 Github Page
Improved Baselines with Visual Instruction Tuning
arXiv 2023-10 Github Page
LiveBench: A Challenging, Contamination-Free LLM Benchmark
arXiv 2024-06 Github Page
Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
arXiv 2024-05 Github Page

Reward Model

Title Venue Date Code Page
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
arXiv 2025-02 Github Page
M-RewardBench: Evaluating Reward Models in Multilingual Settings
arXiv 2024-10 Github Page
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
arXiv 2024-11 Github Page
RewardBench: Evaluating Reward Models for Language Modeling
arXiv 2024-03 Github -
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
arXiv 2024-07 Github Page
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
arXiv 2024-02 Github Page

Alignment

Title Venue Date Code Page
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
arXiv 2025-02 Github -
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
arXiv 2024-06 Github Page
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
arXiv 2024-04 Github -
AlignBench: Benchmarking Chinese Alignment of Large Language Models
arXiv 2023-11 Github -