LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation

Fangxun Shu1    Yue Liao2,4    Le Zhuo2    Chenning Xu122footnotemark: 2    Lei Zhang322footnotemark: 2    Guanghao Zhang122footnotemark: 2
Haonan Shi122footnotemark: 2    Long Chen1    Tao Zhong1    Wanggui He1    Siming Fu1    Haoyuan Li1
Bolin Li1    Zhelun Yu1    Si Liu533footnotemark: 3    Hongsheng Li2,433footnotemark: 3    Hao Jiang1
1Alibaba    2The Chinese University of Hong Kong    3Zhejiang University
4Centre for Perceptual and Interactive Intelligence, Hong Kong   5Beihang University
Equal ContributionCore MembersCorresponding Authors
Abstract

We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models (s-MLLM) distilling knowledge from large-scale MLLM (l-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, striking a balance between computational efficiency and model expressiveness. Second, we propose a progressive knowledge transfer strategy for comprehensive knowledge transfer. This strategy begins with mimic distillation, where we minimize the Kullback-Leibler (KL) divergence between output distributions to enable s-MLLM to emulate l-MLLM’s understanding. Following this, we introduce preference distillation via Preference Optimization (PO), where the key lies in treating l-MLLM as the reference model. During this phase, the s-MLLM’s ability to discriminate between superior and inferior examples is significantly enhanced beyond l-MLLM, leading to a better s-MLLM that surpasses l-MLLM, particularly in hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD surpasses existing works across various benchmarks while maintaining minimal activated parameters and low computational costs. Remarkably, LLaVA-MoD-2B surpasses Qwen-VL-Chat-7B with an average gain of 8.8%, using merely 0.3%percent0.30.3\%0.3 % of the training data and 23% trainable parameters. The results underscore LLaVA-MoD’s ability to effectively distill comprehensive knowledge from its teacher model, paving the way for developing efficient MLLMs. The code is available at https://github.com/shufangxun/LLaVA-MoD.

1 Introduction

Multimodal Large Language Models (MLLM) (Bai et al., 2023b; Liu et al., 2024; Lin et al., 2024b; Chen et al., 2023b; Shu et al., 2023; Lu et al., 2024) have shown promising results in multimodal tasks by integrating visual encoders with Large Language Models (LLM) (Achiam et al., 2023; Bai et al., 2023a; Team et al., 2023; Touvron et al., 2023). However, the large model size and extensive training data present significant computational challenges. For example, the largest version of LLaVA-NeXT employs Qwen-1.5-110B and trains with 128 H800 GPUs for 18 hours. Additionally, the substantial number of parameters requires advanced hardware and results in slow inference speed, complicating real-world deployment, especially on mobile devices. Therefore, exploring a small-scale MLLM (s-MLLM) that balances performance and efficiency is a crucial topic.

Refer to caption
Figure 1: Comparisons of training cost and performance. LLaVA-MoD achieves comparable performance with advanced MLLMs using significantly lower training costs while outperforming current small-scale MLLMs by a large margin.

Previous works on s-MLLM (Zhou et al., 2024a; Yuan et al., 2023; Shao et al., 2024; He et al., 2024; Chu et al., 2023; Yao et al., 2024) have focused on high-quality data collection and filtering protocols. While effective, they are inherently limited by the model capacity. Knowledge Distillation (KD) from large-scale MLLM (l-MLLM) offers a promising yet unexplored strategy for enhancing s-MLLM performance. By aligning small models’ output distributions with large models, KD allows s-MLLM to leverage the rich knowledge embedded in l-MLLM. We explore a distillation framework for MLLM and consider two primary challenges. The first challenge is to design a lightweight architecture for s-MLLM while still maintaining strong learning and expressive capabilities to effectively absorb complex knowledge from l-MLLM. The second challenge is to ensure an effective transfer of complex, multi-task knowledge from l-MLLM to s-MLLM. We present LLaVA-MoD111MoD denotes Mixture-of-Expert Knowledge Distillation for MLLMs to address the challenges through Mixture-of-Expert (MoE) and Knowledge Distillation.

To address the first challenge, one straightforward approach is to create a small model by simply scaling down l-MLLM. However, this direct reduction in size can significantly diminish the model’s expressive capabilities, resulting in a decline in performance for handling complex multimodal tasks. Drawing inspiration from the recent success of MoE (Dai et al., 2024; Jiang et al., 2024) in language modeling, We design an MoE structure into s-MLLM to balance the scale reduction while maintaining the ability to capture complex multimodal knowledge through sparsely activated experts during distillation. Specifically, we equip s-MLLM with multiple feedforward networks (FFNs) and a linear gate within the LLM block. Each FFN serves as an expert to capture fine-grained knowledge from l-MLLM, while the gate dynamically selects top-k𝑘kitalic_k experts to identify the optimal knowledge transfer pathway and maintain the training and inference efficiency.

To address the second challenge, we propose a progressive distillation strategy. The process begins by aligning the vision encoder with the LLM using a learnable adapter, thereby initializing a dense s-MLLM. Subsequently, we employ two consecutive distillation stages, where s-MLLM evolves from mimicking and approximating l-MLLM to ultimately surpassing it: Mimic Distillation. This stage is divided into two steps, i.e., dense-to-dense (D2D) and dense-to-sparse (D2S). The rationale for this two-step process is to facilitate the transfer of complex knowledge, where the first step targets general knowledge to build a solid foundation, while the second step targets specialized knowledge to handle multi-task tasks. In D2D, we employ standard KD loss to align the output logits distribution between the initialized dense s-MLLM and l-MLLM, using general captioning and conversation datasets. Next, in D2S, s-MLLM is first transformed from dense to sparse. We then employ the standard KD loss and LM loss to distill the MoE block, using the complex multi-task datasets. Preference Distillation. In this stage, l-MLLM provides knowledge regarding what constitutes “good” and “bad” samples, establishing a foundational reference for the student model. The s-MLLM leverages this knowledge to adjust its probability distribution, ensuring that good samples have a higher probability than those from l-MLLM, while bad samples are assigned a lower probability. This process enhances the s-MLLM’s ability to mitigate hallucinations by improving its judgment capabilities beyond those of l-MLLM.

LLaVA-MoD exhibits impressive performance across various multimodal benchmarks while maintaining minimal activated parameters and low computational resources. As illustrated in Figure 1, LLaVA-MoD-2B exceeds Qwen-VL-Chat-7B by an average of 8.8%percent8.88.8\%8.8 % on these benchmarks, utilizing just 0.3%percent0.30.3\%0.3 % of the training data and 23%percent2323\%23 % trainable parameters. Furthermore, it matches the performance of RLHF-based methods with 7B and 13B parameters on several hallucination benchmarks. Specifically, LLaVA-MoD-2B surpasses RLHF-V (Yu et al., 2024a) by 8.2%percent8.28.2\%8.2 % in response-level hallucination rate and by 21.3%percent21.321.3\%21.3 % in mention-level hallucination rate on the Object HalBench. The impressive results demonstrate the effectiveness of our distillation framework in transferring knowledge from l-MLLM to s-MLLM.

2 Related Work

Multimodal Large Language Models. LLMs have greatly advanced the field of natural language processing. Connecting visual information to LLM is crucial for promoting the unified comprehension of vision and language. BLIP-2 (Li et al., 2023b) adds additional intermediary structures to adapt visual features to LLM. Flamingo (Alayrac et al., 2022) incorporates additional cross-attention modules into LLM to handle arbitrarily interleaved multimodal sequences. Recently, models like LLaVA (Liu et al., 2024) and MiniGPT-4 (Zhu et al., 2023) have tried to enhance the models’ instruction-following ability through visual instruction tuning. In addition, some works focus on a stronger vision encoder (Chen et al., 2023b; Li et al., 2024) or more powerful fine-grained visual understanding capabilities (Bai et al., 2023b; Wang et al., 2024). Unlike these approaches, our method does not need to significantly enlarge the model. We aim to combine distillation and MoE to improve computational and storage efficiency while maintaining advanced performance.

Knowledge Distillation. The enormous sizes and high inference costs of LLMs limit their application in low-resource environments. Knowledge distillation (Hinton et al., 2015) uses a large model as the teacher to transfer its advanced knowledge to a smaller student model, which plays a critical role in compressing model size and enables smaller models to self-improve. MiniLLM (Gu et al., 2023) minimizes the reverse KL divergence to prevent students from overestimating low-probability regions in the teacher distribution, while GKD (Agarwal et al., 2023) introduces generalized knowledge distillation and promotes the integration of distillation with RLHF. Additionally, some works adopt context distillation (Huang et al., 2022) and CoT (Chain-of-Thought) distillation (Mukherjee et al., 2023; Li et al., 2022; Ho et al., 2022) to enhance specific skills of small models. Our approach innovatively distills the knowledge of MLLM into a smaller sparse MoE architecture, significantly enhancing the multimodal processing capabilities of small models at a minimal cost.

Mixture-of-Experts The mixture of experts (MoE) architecture, introduced by (Jacobs et al., 1991), enhances performance by leveraging independent experts to handle diverse samples. In transformer-based architecture, the feed-forward neural network (FFN) layers serve as the experts, sparsely activated through a gating strategy (Lepikhin et al., 2020; Fedus et al., 2022). This design effectively enhances model capacity while keeping computational overhead low. Additionally, the sparse upcycling method (Komatsuzaki et al., 2022) which initializes experts using those from a well-trained dense model is proposed to further mitigate training costs. MoE has shown great potential not only in language models (Jiang et al., 2024; Dai et al., 2024) but also in vision models (Riquelme et al., 2021) and vision-language models (Lin et al., 2024a; Shen et al., 2023). Our approach integrates MoE with knowledge distillation techniques to provide stronger signals for sparse training, which remarkably decreases the training cost associated with sparse models.

3 Method

Refer to caption
Figure 2: Progressive Distillation of LLaVA-MoD. (a). Mimic Distillation: Aligning the student’s response probabilities pSsubscript𝑝𝑆p_{S}italic_p start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT with those of the teacher pTsubscript𝑝𝑇p_{T}italic_p start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, via the Kullback–Leibler (KL) loss. (b). Preference Distillation: Increasing the student’s positive response probabilities pS+subscriptsuperscript𝑝𝑆p^{+}_{S}italic_p start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT to surpass those of the teacher pT+subscriptsuperscript𝑝𝑇p^{+}_{T}italic_p start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, while decreasing the student’s negative response probabilities pSsubscriptsuperscript𝑝𝑆p^{-}_{S}italic_p start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT to fall below those of the teacher pTsubscriptsuperscript𝑝𝑇p^{-}_{T}italic_p start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, via the Preference-Optimization (PO) loss.

We introduce LLaVA-MoD, a novel framework for building efficient s-MLLM using mixture-of-experts (MoE) and knowledge distillation. Our framework consists of two main components: (a). Architecture Design of s-MLLM: As shown in Fig. 3, we design a sparse s-MLLM with MoE, enhancing the ability to acquire specialized expert knowledge while maintaining training and inference efficiency. (b). Distillation Mechanism: We design a progressive distillation mechanism as shown in Fig. 2 to transfer knowledge from l-MLLM to sparse s-MLLM. This process involves two stages: mimic distillation followed by preference distillation.

3.1 Architecture Design of Sparse s-MLLM

In this section, we describe the architecture design of our sparse s-MLLM, which serves as the student model in the distillation process.

s-MLLM Definition. As illustrated in Fig. 3, the basic architecture of s-MLLM consists of three primary components: a vision encoder, a large language model (LLM), and a vision-language (VL) adaptor. Given a multimodal instruction conversation (x,y)𝑥𝑦(x,y)( italic_x , italic_y ), we define our s-MLLM to process response y𝑦yitalic_y as follows:

y=LLMϕ(Projω(ViTχ(xv)),xi),𝑦subscriptLLMitalic-ϕsubscriptProj𝜔subscriptViT𝜒subscript𝑥𝑣subscript𝑥𝑖y=\text{LLM}_{\phi}(\text{Proj}_{\omega}(\text{ViT}_{\chi}(x_{v})),x_{i}),italic_y = LLM start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( Proj start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT ( ViT start_POSTSUBSCRIPT italic_χ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ) ) , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , (1)

where xvsubscript𝑥𝑣x_{v}italic_x start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT is the input image, and xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the text instruction. The input image is resized to 336 ×\times× 336 and patched into 576 image tokens, each of size 14 ×\times× 14. ViTχsubscriptViT𝜒\text{ViT}_{\chi}ViT start_POSTSUBSCRIPT italic_χ end_POSTSUBSCRIPT is the CLIP vision encoder with parameters χ𝜒\chiitalic_χ which first extracts image features from xvsubscript𝑥𝑣x_{v}italic_x start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT. ProjωsubscriptProj𝜔\text{Proj}_{\omega}Proj start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT is the vision-language adaptor with parameters ω𝜔\omegaitalic_ω, serving as the vision tokenizer to align the image features with the word embedding space. LLMϕsubscriptLLMitalic-ϕ\text{LLM}_{\phi}LLM start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT is the large language model with parameters ϕitalic-ϕ\phiitalic_ϕ, which produces the response y𝑦yitalic_y based on the multimodal tokens of x=[xv,xi]𝑥subscript𝑥𝑣subscript𝑥𝑖x=[x_{v},x_{i}]italic_x = [ italic_x start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ].

Refer to caption
Figure 3: Sparsification of s-MLLM. The VL Adaptor and Vision Encoder remain unchanged, while the LLM is upcycled from dense to sparse.

Sparsify s-MLLM. The principle of building our s-MLLM is downsizing the LLM while leaving the vision encoder and vision-language adaptor unmodified. To achieve this downsizing goal, we sparsify the dense s-MLLM by incorporating an MoE architecture. Specifically, Fig. 3 illustrates the process, where we apply the sparse upcycling technique (Komatsuzaki et al., 2022) to replicate N𝑁Nitalic_N feedforward networks (FFNs) as the expert modules. Additionally, we introduce a linear layer as the router, which dynamically activates the appropriate experts by predicting the probability of expert assignment. Given each token x𝑥xitalic_x in the sequence, we first compute the routing value of N𝑁Nitalic_N experts:

r=Softmax(xWr),𝑟Softmax𝑥subscript𝑊𝑟r=\text{Softmax}(x\cdot W_{r}),italic_r = Softmax ( italic_x ⋅ italic_W start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) , (2)

where Wrsubscript𝑊𝑟W_{r}italic_W start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT denotes the weight matrix of the router and each element risubscript𝑟𝑖r_{i}italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in r𝑟ritalic_r represents the probability of activating the i𝑖iitalic_i-th expert. After that, we apply the Top-k𝑘kitalic_k strategy to determine the activated experts with the highest k𝑘kitalic_k routing values:

r~=Top-k(r)={ri,if ik0,otherwise~𝑟Top-𝑘𝑟casessubscript𝑟𝑖if 𝑖𝑘0otherwise\tilde{r}=\text{Top-}k({r})=\begin{cases}r_{i},&\text{if }i\in k\\ 0,&\text{otherwise}\\ \end{cases}over~ start_ARG italic_r end_ARG = Top- italic_k ( italic_r ) = { start_ROW start_CELL italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , end_CELL start_CELL if italic_i ∈ italic_k end_CELL end_ROW start_ROW start_CELL 0 , end_CELL start_CELL otherwise end_CELL end_ROW (3)

where the routing values for the inactivated experts are set to zero, effectively excluding them from contributing to the final output. The output y𝑦yitalic_y is calculated by aggregating the contributions of the activated experts, weighted by their corresponding routing values:

y=i=1Nr~Ei(x).𝑦superscriptsubscript𝑖1𝑁~𝑟subscript𝐸𝑖𝑥y=\sum_{i=1}^{N}\tilde{r}\cdot E_{i}(x).italic_y = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT over~ start_ARG italic_r end_ARG ⋅ italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x ) . (4)

3.2 Progressive Distillation

Our progressive distillation consists of two distinct stages, i.e., mimic distillation (Fig. 2 (a)) and preference distillation (Fig. 2 (b)). In the mimic distillation stage, the s-MLLM πSsubscript𝜋S\pi_{\text{S}}italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT imitates the general and specific knowledge from the l-MLLM πTsubscript𝜋T\pi_{\text{T}}italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT. In the preference distillation stage, πSsubscript𝜋S\pi_{\text{S}}italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT gains the preference knowledge of πTsubscript𝜋T\pi_{\text{T}}italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT to further refine its output and reduce hallucinations. Both πSsubscript𝜋S\pi_{\text{S}}italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT and πTsubscript𝜋T\pi_{\text{T}}italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT are from the same LLM family. This ensures a consistent vocabulary space, which is essential for accurate imitation.

Initialization. Before distillation, we first align the vision encoder with the LLM via a learnable adapter, aiming to obtain a well-initialized dense version of πSsubscript𝜋S\pi_{\text{S}}italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT. The LLMϕsubscriptLLMitalic-ϕ\text{LLM}_{\phi}LLM start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT and ViTχsubscriptViT𝜒\text{ViT}_{\chi}ViT start_POSTSUBSCRIPT italic_χ end_POSTSUBSCRIPT are maintained frozen since their pre-trained parameters have already captured rich visual and language knowledge. Only ProjωsubscriptProj𝜔\text{Proj}_{\omega}Proj start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT is optimized to bridge the gap between the vision and language domain. For the initialization, we utilize common image-caption pairs from a widely used and curated dataset, which covers a diverse range of subjects and visual entities. The training objective is to minimize the cross-entropy of the generated tokens. The objective function is:

Init(πS)=𝔼(yky<k,x)πS[logπS(yky<k,x)],subscriptInitsubscript𝜋Ssubscript𝔼similar-toconditionalsubscript𝑦𝑘subscript𝑦absent𝑘𝑥subscript𝜋Sdelimited-[]subscript𝜋Sconditionalsubscript𝑦𝑘subscript𝑦absent𝑘𝑥\mathcal{L}_{\text{Init}}(\pi_{\text{S}})=-\mathbb{E}_{(y_{k}\mid y_{<k},x)% \sim\pi_{\text{S}}}\left[\log\pi_{\text{S}}(y_{k}\mid y_{<k},x)\right],caligraphic_L start_POSTSUBSCRIPT Init end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ) = - blackboard_E start_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∣ italic_y start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT , italic_x ) ∼ italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∣ italic_y start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT , italic_x ) ] , (5)

where πS(yky<k,x)subscript𝜋Sconditionalsubscript𝑦𝑘subscript𝑦absent𝑘𝑥\pi_{\text{S}}(y_{k}\mid y_{<k},x)italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∣ italic_y start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT , italic_x ) represents the probability of the predicted token yksubscript𝑦𝑘y_{k}italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT conditioned on x𝑥xitalic_x and the previous tokens y<k=(y1,y2,,yk1)subscript𝑦absent𝑘subscript𝑦1subscript𝑦2subscript𝑦𝑘1y_{<k}=(y_{1},y_{2},\dots,y_{k-1})italic_y start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT = ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT ).

Mimic Distillation. We decompose the comprehensive knowledge within πTsubscript𝜋T\pi_{\text{T}}italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT into general and specialized aspects to address the challenges posed by their structural differences, which can complicate simultaneous learning. Subsequently, we conduct a general-to-specialized mimic distillation, which consists of two steps: dense-to-dense (D2D) and dense-to-sparse (D2S) distillation, to transfer the knowledge to πSsubscript𝜋S\pi_{\text{S}}italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT. This two-step approach balances the transfer of general and specialized knowledge through the progressive distillation, thereby enhancing overall performance. As illustrated in Fig. 3, we utilize the dense structure of πSsubscript𝜋S\pi_{\text{S}}italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT during D2D to acquire general knowledge and transform it into a sparse structure during D2S to acquire complex specialized knowledge. Throughout both stages, πTsubscript𝜋T\pi_{\text{T}}italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT remains unchanged.

a). Dense-to-Dense Distillation. In this stage, we aim to replicate the general knowledge of l-MLLM. Acquiring general knowledge first is crucial, as it establishes a broad foundation across various scenarios, enabling s-MLLM to develop a basic framework before advancing to multiple specialized tasks. To achieve this, we maintain ViTχsubscriptViT𝜒\text{ViT}_{\chi}ViT start_POSTSUBSCRIPT italic_χ end_POSTSUBSCRIPT frozen, and jointly optimize LLMϕsubscriptLLMitalic-ϕ\text{LLM}_{\phi}LLM start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT and ProjωsubscriptProj𝜔\text{Proj}_{\omega}Proj start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT, with trainable parameters θ={ω,ϕ}𝜃𝜔italic-ϕ\theta=\{\omega,\phi\}italic_θ = { italic_ω , italic_ϕ }. We leverage common image-caption pairs and conversation datasets. The training objective is to minimize the Kullback-Leibler divergence (KLD) between the output logits of s-MLLM and l-MLLM. The objective function is:

D2D(πS;πT)=𝔼(x,yk)πT[logπT(yky<k,x)πS(yky<k,x)],subscriptD2Dsubscript𝜋Ssubscript𝜋Tsubscript𝔼similar-to𝑥subscript𝑦𝑘subscript𝜋Tdelimited-[]subscript𝜋Tconditionalsubscript𝑦𝑘subscript𝑦absent𝑘𝑥subscript𝜋Sconditionalsubscript𝑦𝑘subscript𝑦absent𝑘𝑥\mathcal{L}_{\text{D2D}}(\pi_{\text{S}};\pi_{\text{T}})=-\mathbb{E}_{(x,y_{k})% \sim\pi_{\text{T}}}\left[\log\frac{\pi_{\text{T}}(y_{k}\mid y_{<k},x)}{\pi_{% \text{S}}(y_{k}\mid y_{<k},x)}\right],caligraphic_L start_POSTSUBSCRIPT D2D end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ; italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ) = - blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∼ italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log divide start_ARG italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∣ italic_y start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT , italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∣ italic_y start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT , italic_x ) end_ARG ] , (6)

where V𝑉Vitalic_V denotes the vocabulary, while πS(yky<k,x)subscript𝜋Sconditionalsubscript𝑦𝑘subscript𝑦absent𝑘𝑥\pi_{\text{S}}(y_{k}\mid y_{<k},x)italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∣ italic_y start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT , italic_x ) and πT(yky<k,x)subscript𝜋Tconditionalsubscript𝑦𝑘subscript𝑦absent𝑘𝑥\pi_{\text{T}}(y_{k}\mid y_{<k},x)italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∣ italic_y start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT , italic_x ) denote the probability of the predicted tokens for s-MLLM and l-MLLM.

b). Dense-to-Sparse Distillation. In this stage, our focus shifts to transfer the specialized knowledge of l-MLLM into s-MLLM to obtain advanced capabilities and achieve superior performance in complex tasks. However, directly learning this knowledge in the dense form of s-MLLM could be insufficient due to the diverse knowledge structure of different tasks. Therefore, we sparsify the dense s-MLLM by introducing multiple experts. As detailed in Section 3.1, we replicate N𝑁Nitalic_N feedforward networks (FFNs) within LLMϕsubscriptLLMitalic-ϕ\text{LLM}_{\phi}LLM start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT and add an MLP layer as a router, forming the experts with parameters ϕesubscriptitalic-ϕ𝑒\phi_{e}italic_ϕ start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT. This sparse architecture enables s-MLLM to activate the most relevant experts based on different inputs selectively, offering significant advantages in emulating the specialized knowledge of l-MLLM. For training, we leverage multi-task data, updating only the experts and the adaptor. We employ a Top-k𝑘kitalic_k routing strategy to select the experts. The trainable parameters are θ={ω,ϕe}𝜃𝜔subscriptitalic-ϕ𝑒\theta=\{\omega,\phi_{e}\}italic_θ = { italic_ω , italic_ϕ start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT }. Similar to the previous stage, we adopt the KLD as the training objective. Additionally, we include the standard next-token prediction objective, i.e., minimizing the cross-entropy loss of generated tokens by s-MLLM, to inject supervision from ground truth data which can reduce the existing bias in l-MLLM. The objective function is:

D2S(πS;πT)=𝔼(x,yk)πS[logπS(yky<k,x)]𝔼(x,yk)πT[logπT(yky<k,x)πS(yky<k,x)].subscriptD2Ssubscript𝜋Ssubscript𝜋Tsubscript𝔼similar-to𝑥subscript𝑦𝑘subscript𝜋Sdelimited-[]subscript𝜋Sconditionalsubscript𝑦𝑘subscript𝑦absent𝑘𝑥subscript𝔼similar-to𝑥subscript𝑦𝑘subscript𝜋Tdelimited-[]subscript𝜋Tconditionalsubscript𝑦𝑘subscript𝑦absent𝑘𝑥subscript𝜋Sconditionalsubscript𝑦𝑘subscript𝑦absent𝑘𝑥\mathcal{L}_{\text{D2S}}(\pi_{\text{S}};\pi_{\text{T}})=-\mathbb{E}_{(x,y_{k})% \sim\pi_{\text{S}}}\left[\log\pi_{\text{S}}(y_{k}\mid y_{<k},x)\right]-\mathbb% {E}_{(x,y_{k})\sim\pi_{\text{T}}}\left[\log\frac{\pi_{\text{T}}(y_{k}\mid y_{<% k},x)}{\pi_{\text{S}}(y_{k}\mid y_{<k},x)}\right].caligraphic_L start_POSTSUBSCRIPT D2S end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ; italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ) = - blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∼ italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∣ italic_y start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT , italic_x ) ] - blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∼ italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log divide start_ARG italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∣ italic_y start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT , italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∣ italic_y start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT , italic_x ) end_ARG ] . (7)

Preference Distillation. In this stage, our goal is to distill the preference knowledge from l-MLLM to guide s-MLLM towards generating not only accurate but also reasonable responses, which is crucial in reducing hallucinations. For the training process, we effectively use preference data, which comprises meticulously paired positive responses y+superscript𝑦y^{+}italic_y start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT and negative responses ysuperscript𝑦y^{-}italic_y start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT for the identical prompt x𝑥xitalic_x. Our preference distillation strategy is inspired by recent advancements in Preference Optimization (PO) (Rafailov et al., 2024; Ethayarajh et al., 2024), which bypasses the need for training a reward model by directly training on an offline preference dataset. Our key insight is to treat l-MLLM as the reference model to provide insights on what constitutes “good” and “bad”, thereby establishing a fundamental reference for s-MLLM.

Specifically, the training objective is to optimize s-MLLM to assign higher probabilities to positive responses and lower probabilities to negative ones compared to l-MLLM. This training process involves two key optimization aspects: First, s-MLLM aims to align with the teacher model in distinguishing positive from negative responses. Second, s-MLLM seeks to surpass l-MLLM by assigning higher probabilities to positive responses and lower probabilities to negative responses. We only train the experts and adaptor in s-MLLM and employ a Top-k𝑘kitalic_k routing strategy to select the experts. The trainable parameters are θ={ω,ϕe}𝜃𝜔subscriptitalic-ϕ𝑒\theta=\{\omega,\phi_{e}\}italic_θ = { italic_ω , italic_ϕ start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT }. The objective function is:

PD(πS;πT)=𝔼(x,y+,y)𝒟[logσ(βlogπS(y+x)πT(y+x)βlogπS(yx)πT(yx))],subscriptPDsubscript𝜋Ssubscript𝜋Tsubscript𝔼similar-to𝑥superscript𝑦superscript𝑦𝒟delimited-[]𝜎𝛽subscript𝜋Sconditionalsuperscript𝑦𝑥subscript𝜋Tconditionalsuperscript𝑦𝑥𝛽subscript𝜋Sconditionalsuperscript𝑦𝑥subscript𝜋Tconditionalsuperscript𝑦𝑥\mathcal{L}_{\text{PD}}(\pi_{\text{S}};\pi_{\text{T}})=-\mathbb{E}_{(x,y^{+},y% ^{-})\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_{\text{S}}(y^{+}% \mid x)}{\pi_{\text{T}}(y^{+}\mid x)}-\beta\log\frac{\pi_{\text{S}}(y^{-}\mid x% )}{\pi_{\text{T}}(y^{-}\mid x)}\right)\right],caligraphic_L start_POSTSUBSCRIPT PD end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ; italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ) = - blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ) ∼ caligraphic_D end_POSTSUBSCRIPT [ roman_log italic_σ ( italic_β roman_log divide start_ARG italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ∣ italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ∣ italic_x ) end_ARG - italic_β roman_log divide start_ARG italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ∣ italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ∣ italic_x ) end_ARG ) ] , (8)

where πS(y+x)subscript𝜋Sconditionalsuperscript𝑦𝑥\pi_{\text{S}}(y^{+}\mid x)italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ∣ italic_x ) and πS(yx)subscript𝜋Sconditionalsuperscript𝑦𝑥\pi_{\text{S}}(y^{-}\mid x)italic_π start_POSTSUBSCRIPT S end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ∣ italic_x ) denote the probabilities of positive and negative responses for s-MLLM, and πT(y+x)subscript𝜋Tconditionalsuperscript𝑦𝑥\pi_{\text{T}}(y^{+}\mid x)italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ∣ italic_x ) and πT(yx)subscript𝜋Tconditionalsuperscript𝑦𝑥\pi_{\text{T}}(y^{-}\mid x)italic_π start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ∣ italic_x ) denote the same for l-MLLM.

4 Experiments

4.1 Experimental Settings

Implementation Details. We employ the “ViT-MLP-LLM” architecture to demonstrate the effectiveness of LLaVA-MoD. A pre-trained CLIP-ViT-L/14 is utilized as the vision encoder and a 2-layer MLP is utilized as the adaptor. Qwen-1.5/2 with different sizes are utilized as the LLM for l-MLLM and s-MLLM. Specifically, l-MLLM is configured with 7B parameters, while s-MLLM is configured with 1.8B and 0.5B parameters. The performance of l-MLLM on multimodal benchmarks is presented in Tab. 1. We use the same series of LLMs for distillation, i.e. employing Qwen-1.5 7B to distill Qwen-1.5 1.8B and Qwen-1.5 0.5B. Each stage uses distinct training configurations. The detailed training strategy and hyperparameter are illustrated in Appendix A.1.

Table 1: Performance of l-MLLM on multimodal benchmarks.
LLM VisEnc GQA VisWiz SQAII{}^{\text{I}}start_FLOATSUPERSCRIPT I end_FLOATSUPERSCRIPT VQATT{}^{\text{T}}start_FLOATSUPERSCRIPT T end_FLOATSUPERSCRIPT MME MMB MMBCNCN{}^{\text{CN}}start_FLOATSUPERSCRIPT CN end_FLOATSUPERSCRIPT AVG
Qwen-1.5-7B CLIP-L 62.3 40.7 70.9 55.3 72.1 68.5 64.9 62.1
Qwen-2-7B CLIP-L 62.5 37.9 73.2 57.2 78.0 71.6 71.1 64.5

Training Datasets. The training data consists of 5M samples from the open-source datasets, with each training stage utilizing a distinct dataset. During Initialization, 0.6M general captioning samples are used to bridge the gap between visual and language modalities. In mimic distillation, 2.4M general captioning and conversation samples are used to distill general knowledge from l-MLLM, and 1.4M multi-tasks data, including VQA, documents, science, and OCR, are used to distill specialized knowledge from l-MLLM. For preference distillation, 80K preference data samples are used to transfer preference knowledge from l-MLLM. The detailed dataset of each training stage is illustrated in Appendix A.2.

Evaluation Benchmarks. We conduct experiments on MME (Fu et al., 2023), MMB (Liu et al., 2023c), and MMBCNCN{}^{\text{CN}}start_FLOATSUPERSCRIPT CN end_FLOATSUPERSCRIPT. Each encompasses various sub-tasks, enabling comprehensive evaluation of multimodal understanding and reasoning capabilities. Additionally, we carry out experiments across a broad spectrum of VQA tasks, which include general VQA, text-oriented VQA, and science VQA. Specifically, for general VQA tasks, we use VizWiz (Gurari et al., 2018) and GQA (Hudson & Manning, 2019) to test general visual understanding and relational reasoning. TextVQA (Singh et al., 2019) is employed for text-oriented VQA tasks, focusing on fine-grained visual recognition and understanding of text within images. ScienceQA (Lu et al., 2022b) is utilized to measure scientific knowledge. Moreover, we conduct experiments on several hallucination benchmarks such as POPE (Li et al., 2023c), Object HalBench (Yu et al., 2024a), MMHal-Bench (Sun et al., 2023).

Table 2: Comparison with state-of-the-art MLLMs on the commonly-used multimodal benchmarks for MLLMs. #Sample: Training data sample. #Param: Trainable parameters. SQAII{}^{\text{I}}start_FLOATSUPERSCRIPT I end_FLOATSUPERSCRIPT: ScienceQA test, VQATT{}^{\text{T}}start_FLOATSUPERSCRIPT T end_FLOATSUPERSCRIPT: TextVQA val, MME: MME Benchmark, normalized to percentage, MMB: MMBench dev, MMBCNCN{}^{\text{CN}}start_FLOATSUPERSCRIPT CN end_FLOATSUPERSCRIPT: MMBench-Chinese dev. The best result for model sizes 1B/2B is shown in bold, and the second-best result is underlined. Our LLaVA-MoD achieves the best average result for both.
Method LLM #Sample #Param GQA VisWiz SQAII{}^{\text{I}}start_FLOATSUPERSCRIPT I end_FLOATSUPERSCRIPT VQATT{}^{\text{T}}start_FLOATSUPERSCRIPT T end_FLOATSUPERSCRIPT MME MMB MMBCNCN{}^{\text{CN}}start_FLOATSUPERSCRIPT CN end_FLOATSUPERSCRIPT AVG
BLIP-2 Vicuna-13B 129M \geq7B 41.0 19.6 61.0 42.5 64.7 - - -
VILA-7B LLaMA-7B 50M 62.3 57.8 68.2 64.4 76.7 68.9 61.7 65.7
CogVLM Vicuna-7B 1500M 64.9 - 65.6 78.2 71.8 63.7 53.8 -
InstructBLIP Vicuna-13B 130M 49.5 33.4 63.1 50.7 60.6 - - -
Qwen-VL-Chat Qwen-7B 1450M 57.5 38.9 68.2 61.5 74.4 60.6 56.7 56.7
Deepseek-VL-7B DLLM-7B 2000M 61.3 49.9 74.0 64.7 73.4 74.1 72.8 67.2
LLaVA-1.5-7B Vicuna-1.5-7B 1.2M 62.0 50.0 66.8 58.2 75.5 64.3 58.3 62.1
LLaVA-NeXT Vicuna-1.5-13B 1.3M 65.4 60.5 73.6 67.1 78.7 70.0 64.4 68.5
Imp-3B Phi-2-2.7B 1.6M similar-to\sim3B 63.5 54.1 72.8 59.8 72.3 72.9 46.7 63.2
Bunny-3B Phi-2-2.7B 2.7M 62.5 43.8 70.9 56.7 74.4 68.6 37.2 59.2
VILA-3B LLaMA-2.7B 51M 61.5 53.5 69.0 60.4 72.1 63.4 52.7 61.8
MobileVLM MLLaMA-2.7B 1.3M 59.0 - 61.0 47.5 64.4 59.6 - -
MobileVLMv2 MLLaMA-2.7B 3.6M 61.1 - 70.0 57.5 72.0 63.2 - -
MoE-LLaVA-3B Phi-2-2.7B 2.2M 61.4 43.9 68.5 51.4 71.1 65.2 41.8 57.6
MiniCPM-V MiniCPM-2.4B 570M 51.5 50.5 74.4 56.6 68.9 64.0 62.7 61.2
MiniCPM-V-2 MiniCPM-2.4B 570M 52.1 60.2 76.3 73.2 70.5 68.5 67.2 66.9
Imp-2B Qwen-1.5-1.8B 1.6M similar-to\sim2B 61.9 39.6 66.1 54.5 65.2 63.8 61.2 58.9
Bunny-2B Qwen-1.5-1.8B 2.7M 59.6 34.2 64.6 53.2 65.0 59.1 58.5 56.3
Mini-Gemini-2B Gemma-2B 2.7M 60.7 41.5 63.1 56.2 67.0 59.8 51.3 57.1
MoE-LLaVA-2B Qwen-1.5-1.8B 2.2M 61.5 32.6 63.1 48.0 64.6 59.7 57.3 55.3
DeepSeek-VL-1.3B DLLM-1.3B 2000M 59.3 36.8 64.2 58.4 65.3 64.6 61.0 58.5
LLaVA-MoD-2B Qwen-1.5-1.8B 5M 58.7 39.2 68.0 58.5 67.6 66.3 61.9 59.9
LLaVA-MoD-2B Qwen-2-1.5B 5M 58.8 40.4 69.2 59.9 69.2 68.9 64.4 61.6
SPHINX-Tiny TLLaMA-1.1B 15M similar-to\sim1B 58.0 49.2 21.5 57.8 63.1 56.6 37.8 49.2
LLaVA-MoD-1B Qwen-1.5-0.5B 5M similar-to\sim1B 56.2 31.6 62.8 53.9 65.3 58.8 50.4 54.1
LLaVA-MoD-1B Qwen-2-0.5B 5M 56.6 35.1 61.1 57.1 67.0 58.7 54.1 55.7

4.2 Main Results

In this section, we conduct experiments of LLaVA-MoD to highlight its advantages in two aspects: performance and efficiency. For performance, we evaluate comprehension-oriented benchmarks (Tab. 2) and hallucination-oriented benchmarks (Tab. 3). For efficiency, we present a comparison in terms of data samples and model size. The performance is obtained under the same series of LLMs, i.e. employing Qwen-1.5 7B to distill Qwen-1.5 1.8B and Qwen-1.5 0.5B.

Comprehension-Oriented Benchmarks. As indicated in Tab. 2, LLaVA-MoD achieves SOTA average results among the models of 1B and 2B size on comprehension-oriented benchmarks. The 2B-sized LLaVA-MoD surpasses Mini-Gemini-2B (Li et al., 2024) by 8.1%, while using a lower image resolution (336 vs. 768). The 1B-sized LLaVA-MoD surpasses SPHINX-Tiny (Gao et al., 2024) by 13.2%, using fewer data samples (5M vs. 15M). Furthermore, LLaVA-MoD-2B matches and even surpasses the performance of large-scale MLLMs. The 2B-sized LLaVA-MoD surpasses Qwen-VL-Chat-7B (Bai et al., 2023b) by 8.8% and matches the performance of VILA-3B (Lin et al., 2024b) and MiniCPM-V (Yao et al., 2024). These results highlight that our approach trains small-scale MLLMs effectively by distilling the sparse MoE architecture from large-scale MLLMs.

Table 3: Comparison with state-of-the-art MLLMs on the hallucination benchmarks. We compare LLaVA-MoD with SFT-based works and RLHF-based works . Hall: Hallucination Rate Resp: response-level hallucination rate, Ment: mention-level hallucination rate. The best result is in bold, and the second-best result is underlined.
Model LLM #Param Object HalBench POPE MMHal-Bench
Resp \downarrow Ment \downarrow F1 \uparrow Score \uparrow Hall \downarrow
Teacher MLLM Qwen-1.5-7B 7B 50.1 24.8 84.9 2.60 20.7
Teacher MLLM Qwen-2-7B 7B 29.7 23.4 85.7 2.88 14.5
Qwen-VL-Chat Qwen-7B 9.6B 40.4 20.7 74.9 2.76 38.5
LLaVA-1.5-7B Vicuna-7B 7B 53.6 25.2 86.1 2.36 51.0
VCD Vicuna-1.5-7B 7B 48.8 24.3 84.5 2.12 54.2
OPERA Vicuna-1.5-7B 7B 45.1 22.3 85.4 2.15 54.2
HA-DPO Vicuna-1.5-7B 7B 39.9 19.9 86.8 1.98 60.4
POVID Vicuna-1.5-7B 7B 48.1 24.4 86.3 2.08 56.2
LLaVA-RLHF Vicuna-1.5-13B 13B 38.1 18.9 82.7 2.02 62.5
LURE Vicuna-1.5-7B 7B 27.7 17.3 - 1.64 60.4
RLHF-V Vicuna-13B 13B 12.2 7.5 86.2 2.45 51.0
RLAIF-V Vicuna-1.5-7B 7B 8.5 4.3 - 3.06 29.2
MiniCPM-V MiniCPM-2.4B 2.8B 21.6 11.5 79.5 3.70 24.9
MiniCPM-V-2 MiniCPM-2.4B 2.8B 14.5 7.8 86.3 4.09 18.2
Mini-Gemini-2B Gemma-2B 2B 29.7 21.1 85.6 2.83 18.8
Bunny-2B Qwen-1.5-1.8 2.2B 50.2 23.4 85.8 2.72 19.3
LLaVA-MoD-2B Qwen-1.5-1.8B 2.2B 11.4 7.2 87.0 2.76 17.2
LLaVA-MoD-2B Qwen-2-1.5B 1.9B 11.2 5.9 87.2 2.91 13.8

Hallucination-Oriented Benchmarks. As indicated in Tab. 3, LLaVA-MoD shows remarkable performance in mitigating hallucination, even beating its teacher model. It can be attributed to preference distillation from two aspects: Firstly, by assigning a higher probability for the positive response, preference distillation encourages LLaVA-MoD to focus on providing correct and relevant information. Secondly, by assigning a lower probability for the negative response, preference distillation discourages incorrect or unsubstantiated information. By using the teacher model as a reference to adjust response probabilities, preference distillation enables LLaVA-MoD to handle hallucination issues more accurately and reliably, thereby surpassing its teacher model. Moreover, LLaVA-MoD even surpasses recent RLHF-based models (Sun et al., 2023; Zhou et al., 2024b; Huang et al., 2024; Leng et al., 2024; Zhou et al., 2023). On the Object HalBench, it outperforms RLHF-V (Yu et al., 2024a) by 8.2% in response-level hallucination rate and by 21.3% in mention-level hallucination rate. This demonstrates that preference distillation is an effective task to mitigate hallucination.

Efficiency Comparison. We compare the efficiency of LLaVA-MoD with recent SOTA MLLMs in terms of training data and trainable parameters. As indicated in Tab. 2, The 2B-sized LLaVA-MoD requires only 5M samples and activates merely a subset of its total 2.2B parameters during training and inference. It achieves 8.8% higher performance than Qwen-VL-Chat, using only 23% trainable parameters and 0.3% of training data samples. Moreover, compared to the small-sized model MiniCPM-V with 2.8B trainable parameters, LLaVA-MoD also achieves superior performance with just 1.6% of the data samples and 2B parameters, underscoring its efficiency. LLaVA-MoD enables more efficient utilization of data, parameters, and computational resources. This not only results in efficient training and inference but also provides a practical solution for deploying high-performance models in resource-constrained environments.

4.3 Ablation Study

In this section, we thoroughly explore the effect of the distillation strategy and model design. Firstly, we investigate the preference distillation on comprehension and hallucination benchmarks. Since our focus is on the training strategy and model design, we carry out subsequent ablation experiments specifically on mimic distillation. We use CLIP-ViT-L/14 as the default vision encoder, and Qwen-1.5-1.8B and Qwen-1.5-7B as the default student and teacher LLMs respectively.

4.3.1 Impact of Preference Distillation

Table 4: The impact of preference distillation on the comprehension-oriented benchmarks.
Distillation GQA VizWiz SQAII{}^{\text{I}}start_FLOATSUPERSCRIPT I end_FLOATSUPERSCRIPT VQATT{}^{\text{T}}start_FLOATSUPERSCRIPT T end_FLOATSUPERSCRIPT MME MMB MMBCNCN{}^{\text{CN}}start_FLOATSUPERSCRIPT CN end_FLOATSUPERSCRIPT AVG
Mimic 59.3 40.0 68.4 58.7 68.4 65.1 61.5 60.2
Mimic+Preference 58.7 39.2 68.0 58.5 66.7 66.3 61.9 59.9
Table 5: The impact of preference distillation on the hallucination-oriented benchmarks.
Distillation Object HalBench POPE MMHal-Bench
Resp \downarrow Ment \downarrow F1 \uparrow Score \uparrow Hall \downarrow
Mimic 39.1 22.6 86.7 2.75 17.8
Mimic+Preference 11.4 7.2 87.0 2.76 17.2

Preference Distillation Mitigates Hallucination. We explore the effect of preference distillation on comprehension capability and hallucination mitigation. As shown in Tab. 4 and Tab. 5, preference distillation remarkably reduces the hallucination of s-MLLM, while it does not yield consistent improvements in comprehension capability. This observation aligns with that in LLMs (Achiam et al., 2023), where preference optimization tends to prioritize reducing hallucinations, often resulting in a certain degree of decline in comprehension gains.

KTO Stabilizes Preference Distillation. We explore the training stability of various preference-optimization losses. We employ KTO (Ethayarajh et al., 2024) and DPO (Rafailov et al., 2024). As shown in Appendix B.2, both methods enhance hallucination mitigation. However, KTO has a significant advantage over DPO. The result implies that KTO, which indicates whether a sample is good or bad, provides a better signal for s-MLLM to surpass l-MLLM compared to DPO, which determines which one is better by comparing two samples.

4.3.2 Impact of Training Strategy

KD Facilitates MoE Training. We explore the effect of knowledge distillation (KD) in MoE training by comparing it with supervised fine-tuning (SFT). The ablation experiments are carried out using the same sparse configuration of E4T2, where four experts are initialized and the top two experts are activated. As shown in Tab. 6, KD surpasses SFT on all benchmarks, achieving an 8.1% average gain over SFT. Additionally, it achieves notable gains in complex multi-task scenarios, such as a +8.2% gain on MMB and a +10.0% gain on MME. These results indicate that KD provides a better optimization signal for effectively training the MoE architecture.

Table 6: Comparison between KD and SFT. The MoE architecture is E4T2.
Training GQA VizWiz SQAII{}^{\text{I}}start_FLOATSUPERSCRIPT I end_FLOATSUPERSCRIPT VQATT{}^{\text{T}}start_FLOATSUPERSCRIPT T end_FLOATSUPERSCRIPT MME MMB MMBCNCN{}^{\text{CN}}start_FLOATSUPERSCRIPT CN end_FLOATSUPERSCRIPT AVG
SFT 58.6 31.2 66.2 55.6 63.2 59.2 56.1 55.7
KD 59.3 40.0 68.4 58.7 68.4 65.1 61.5 60.2

General-to-Specialized KD Boosts Performance. We explore the effect of the general-to-specialized approach in mimic distillation by comparing the proposed D2D+D2S with D2S. The D2D+D2S involves a two-phase process. Firstly, it distills a dense s-MLLM using general data. Subsequently, it transforms the s-MLLM from dense to sparse and further distills it using multi-task data. The D2S directly distills the sparse s-MLLM using a combined of general and multi-task dataset. Tab. 7 shows that D2D+D2S surpasses D2S with an average gain of 4.0%. The result indicates that D2D+D2S enables effective knowledge transfer, striking a balance between general and specialized competencies. Additionally, D2D+D2S is more computationally efficient, consuming only 62.5% of the GPU hours required by D2S.

Table 7: Comparison between different distillation strategies in mimic distillation. #GPU indicates the training GPU hours.
Strategy #GPU GQA VizWiz SQAII{}^{\text{I}}start_FLOATSUPERSCRIPT I end_FLOATSUPERSCRIPT VQATT{}^{\text{T}}start_FLOATSUPERSCRIPT T end_FLOATSUPERSCRIPT MME MMB MMBCNCN{}^{\text{CN}}start_FLOATSUPERSCRIPT CN end_FLOATSUPERSCRIPT AVG
D2S 1536 57.9 36.5 66.3 57.5 65.7 61.4 60.0 57.9
D2D+D2S 960 59.3 40.0 68.4 58.7 68.4 65.1 61.5 60.2

Focus on Response Distillation Improves Generalization. We explores the effect of the distillation target by comparing response distillation with response+instruction distillation. Tab. 8 shows that response distillation surpasses response+instruction distillation with an average gain of 3.1%. The result indicates that focusing on response is more effective for knowledge distillation in auto-regressive modeling. A possible reason for this is that mimicking additional instructions can lead to overfitting the specific instruction patterns in the training data, potentially reducing the generalization to unseen instructions.

Table 8: Comparison between different distilled tokens. Response indicates distilling solely on the response tokens. Response+Instruction incorporates additional distillation of instruction tokens.
Distilled Tokens GQA VizWiz SQAII{}^{\text{I}}start_FLOATSUPERSCRIPT I end_FLOATSUPERSCRIPT VQATT{}^{\text{T}}start_FLOATSUPERSCRIPT T end_FLOATSUPERSCRIPT MME MMB MMBCNCN{}^{\text{CN}}start_FLOATSUPERSCRIPT CN end_FLOATSUPERSCRIPT AVG
Response + Instruction 58.5 35.0 66.8 58.3 68.5 61.8 59.0 58.4
Response 59.3 40.0 68.4 58.7 68.4 65.1 61.5 60.2

4.3.3 Impact of Model Architecture

Sparse Architecture Facilitates Knowledge Transfer.

We explore the effect of sparse architecture in distillation by comparing it with dense architecture. The configuration is E4T1 where four experts are initialized and the top-1 expert is activated to ensure that the activated parameters are the same as those of the dense architecture. Tab. 9 shows that sparse architecture surpasses dense architecture with an average gain of 3.7%, The superior performance is prominent in complex multi-task benchmarks, such as MME and MMB. The result indicates that sparse architecture leverages MoE to effectively transfer diverse knowledge from l-MLLM while maintaining computational efficiency.

Table 9: Comparison between sparse and dense architecture within the distillation.
Architecture GQA VizWiz SQAII{}^{\text{I}}start_FLOATSUPERSCRIPT I end_FLOATSUPERSCRIPT VQATT{}^{\text{T}}start_FLOATSUPERSCRIPT T end_FLOATSUPERSCRIPT MME MMB MMBCNCN{}^{\text{CN}}start_FLOATSUPERSCRIPT CN end_FLOATSUPERSCRIPT AVG
Dense 57.6 32.7 67.3 56.8 65.2 61.8 58.2 57.1
Sparse 58.7 36.9 67.9 58.2 66.1 64.5 61.7 59.2

Teacher Capacity Matters in Knowledge Distillation. We explore the effect of the teacher capacity in the distillation. We employ Qwen-1.5-7B as the strong teacher and Qwen-1.5-4B as the weaker one. The MoE configuration is set as E4T2. Tab. 10 shows that a well-suited teacher can boost performance. For Qwen-1.5-1.8B, using a 7B-sized teacher yields an average gain of 2.2% compared to a 4B-sized teacher. However, when the student and teacher have a large capacity disparity, it can hinder effective knowledge transfer. For Qwen-1.5-0.5B, using a 7B-sized teacher yields a marginal average gain compared to a 4B-sized teacher. Utilizing a “middle teacher” with an intermediate capacity can bridge the gap, facilitating smooth knowledge transfer.

Table 10: Comparison between the strong and weak teachers. We set the strong teacher as the LLM is Qwen-1.5-7B and the weak teacher as the LLM is Qwen-1.5-4B.
Student Teacher GQA VizWiz SQAII{}^{\text{I}}start_FLOATSUPERSCRIPT I end_FLOATSUPERSCRIPT VQATT{}^{\text{T}}start_FLOATSUPERSCRIPT T end_FLOATSUPERSCRIPT MME MMB MMBCNCN{}^{\text{CN}}start_FLOATSUPERSCRIPT CN end_FLOATSUPERSCRIPT AVG
Qwen-1.5-0.5B Qwen-1.5-4B 56.0 25.3 64.7 53.8 63.3 62.2 50.8 53.7
Qwen-1.5-7B 56.1 29.8 63.8 54.5 64.2 58.5 49.7 53.8
Qwen-1.5-1.8B Qwen-1.5-4B 58.7 34.6 67.9 57.7 67.6 64.9 60.7 58.9
Qwen-1.5-7B 59.3 40.0 68.4 58.7 68.4 65.1 61.5 60.2

5 Conclusion

In this paper, we introduce LLaVA-MoD, a novel framework for efficient training of small-scale multimodal language models via knowledge distillation from large-scale ones. It addresses two key challenges in MLLM distillation: enhancing s-MLLM architecture with MoE design for efficiency and expressiveness balance and implementing a progressive knowledge transfer strategy. Extensive experiments show that LLaVA-MoD outperforms existing models with low activation parameters and computational costs. Notably, it outperforms Qwen-VL-Chat-7B by 8.8% with only 2 billion activation parameters, 0.3% training data, and 23% trainable parameters, highlighting its effectiveness in distilling knowledge and driving more efficient MLLM development.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Agarwal et al. (2023) Rishabh Agarwal, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. Gkd: Generalized knowledge distillation for auto-regressive sequence models. arXiv preprint arXiv:2306.13649, 2023.
  • Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
  • Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023a.
  • Bai et al. (2023b) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023b.
  • Cao & Xiao (2022) Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In COLING, pp.  1511–1520, 2022.
  • Chen et al. (2024) Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024.
  • Chen et al. (2023a) Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023a.
  • Chen et al. (2023b) Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023b.
  • Chu et al. (2023) X Chu, L Qiao, X Lin, S Xu, Y Yang, Y Hu, F Wei, X Zhang, B Zhang, X Wei, et al. Mobilevlm: A fast, strong and open vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
  • Clark & Gardner (2018) Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. In ACL, pp.  845–855, 2018.
  • Dai et al. (2024) Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.
  • Diederik (2014) P Kingma Diederik. Adam: A method for stochastic optimization. (No Title), 2014.
  • Ethayarajh et al. (2024) Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
  • Fedus et al. (2022) William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
  • Fu et al. (2023) Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
  • Gao et al. (2024) Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, et al. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935, 2024.
  • Goyal et al. (2017) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, pp.  6904–6913, 2017.
  • Gu et al. (2023) Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2023.
  • Gurari et al. (2018) Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  3608–3617, 2018.
  • He et al. (2024) Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, and Bo Zhao. Efficient multimodal learning from data-centric perspective. arXiv preprint arXiv:2402.11530, 2024.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Ho et al. (2022) Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. arXiv preprint arXiv:2212.10071, 2022.
  • Huang et al. (2024) Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  13418–13427, 2024.
  • Huang et al. (2022) Yukun Huang, Yanda Chen, Zhou Yu, and Kathleen McKeown. In-context learning distillation: Transferring few-shot learning ability of pre-trained language models. arXiv preprint arXiv:2212.10670, 2022.
  • Hudson & Manning (2019) Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  6700–6709, 2019.
  • Jacobs et al. (1991) Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991. doi: 10.1162/neco.1991.3.1.79.
  • Jiang et al. (2024) Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  • Kafle et al. (2018) Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In CVPR, pp.  5648–5656, 2018.
  • Kembhavi et al. (2016) Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pp.  235–251, 2016.
  • Kim et al. (2022) Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. In ECCV, 2022.
  • Komatsuzaki et al. (2022) Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055, 2022.
  • Krishna et al. (2017) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123:32–73, 2017.
  • Leng et al. (2024) Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  13872–13882, 2024.
  • Lepikhin et al. (2020) Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
  • Li et al. (2023a) Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023a.
  • Li et al. (2023b) Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  19730–19742. PMLR, 2023b. URL https://proceedings.mlr.press/v202/li23q.html.
  • Li et al. (2022) Shiyang Li, Jianshu Chen, Yelong Shen, Zhiyu Chen, Xinlu Zhang, Zekun Li, Hong Wang, Jing Qian, Baolin Peng, Yi Mao, Wenhu Chen, and Xifeng Yan. Explanations from large language models make small reasoners better. CoRR, abs/2210.06726, 2022. doi: 10.48550/ARXIV.2210.06726. URL https://doi.org/10.48550/arXiv.2210.06726.
  • Li et al. (2024) Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. CoRR, abs/2403.18814, 2024. doi: 10.48550/ARXIV.2403.18814. URL https://doi.org/10.48550/arXiv.2403.18814.
  • Li et al. (2023c) Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023c.
  • Lin et al. (2024a) Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024a.
  • Lin et al. (2024b) Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  26689–26699, 2024b.
  • Liu et al. (2023a) Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a.
  • Liu et al. (2023b) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023b.
  • Liu et al. (2024) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
  • Liu et al. (2023c) Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023c.
  • Lu et al. (2024) Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
  • Lu et al. (2022a) Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS, 35:2507–2521, 2022a.
  • Lu et al. (2022b) Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022b.
  • Mao et al. (2016) Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, pp.  11–20, 2016.
  • Marino et al. (2019) Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, pp.  3195–3204, 2019.
  • Masry et al. (2022) Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, pp.  2263–2279, 2022.
  • Mishra et al. (2019) Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, pp.  947–952, 2019.
  • Mukherjee et al. (2023) Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of GPT-4. CoRR, abs/2306.02707, 2023. doi: 10.48550/ARXIV.2306.02707. URL https://doi.org/10.48550/arXiv.2306.02707.
  • Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
  • Riquelme et al. (2021) Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583–8595, 2021.
  • Schwenk et al. (2022) Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In ECCV, pp.  146–162, 2022.
  • Shao et al. (2024) Zhenwei Shao, Zhou Yu, Jun Yu, Xuecheng Ouyang, Lihao Zheng, Zhenbiao Gai, Mingyang Wang, and Jiajun Ding. Imp: Highly capable large multimodal models for mobile devices. arXiv preprint arXiv:2405.12107, 2024.
  • Shen et al. (2023) Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, and Yuxiong He. Scaling vision-language models with sparse mixture of experts. arXiv preprint arXiv:2303.07226, 2023.
  • Shu et al. (2023) Fangxun Shu, Lei Zhang, Hao Jiang, and Cihang Xie. Audio-visual llm for video understanding. arXiv preprint arXiv:2312.06720, 2023.
  • Sidorov et al. (2020) Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In ECCV, pp.  742–758, 2020.
  • Singh et al. (2019) Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  8317–8326, 2019.
  • Sun et al. (2023) Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023.
  • Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Wang et al. (2023) Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, and Yu-Gang Jiang. To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023.
  • Wang et al. (2024) Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. Advances in Neural Information Processing Systems, 36, 2024.
  • Yao et al. (2024) Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
  • Yu et al. (2016) Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In ECCV, pp.  69–85, 2016.
  • Yu et al. (2024a) Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  13807–13816, 2024a.
  • Yu et al. (2024b) Tianyu Yu, Haoye Zhang, Yuan Yao, Yunkai Dang, Da Chen, Xiaoman Lu, Ganqu Cui, Taiwen He, Zhiyuan Liu, Tat-Seng Chua, et al. Rlaif-v: Aligning mllms through open-source ai feedback for super gpt-4v trustworthiness. arXiv preprint arXiv:2405.17220, 2024b.
  • Yuan et al. (2023) Zhengqing Yuan, Zhaoxu Li, and Lichao Sun. Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv preprint arXiv:2312.16862, 2023.
  • Zhao et al. (2023) Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023.
  • Zhou et al. (2024a) Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289, 2024a.
  • Zhou et al. (2023) Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754, 2023.
  • Zhou et al. (2024b) Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Aligning modalities in vision large language models via preference fine-tuning. arXiv preprint arXiv:2402.11411, 2024b.
  • Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix A Implementation Details

A.1 Training Strategy and Hyperparameters

We first freeze the vision encoder and LLM while optimizing the VL adaptor to align image tokens with the word embedding space during initialization. This stage employs a cross-entropy loss with a batch size of 512 and a learning rate of 1e-4. In mimic distillation, the vision encoder remains frozen, while the LLM and VL adaptor are co-optimized to distill general knowledge from l-MLLM in dense-to-dense distillation. Then, the FFN in the LLM is first transformed into a sparse architecture with a mixture of FFNs co-optimized with the VL adaptor to distill specialized knowledge from l-MLLM in dense-to-sparse distillation. This stage employs a KL-divergence loss, with cross-entropy loss added for dense-to-sparse distillation. The batch size is 256, and the learning rate is adjusted to 2e-5. In preference distillation, the model inherits from the mimic distillation. The vision encoder remains frozen, and a mixture of FFN experts is co-optimized with the VL adaptor to distill preference knowledge from the teacher MLLM. This stage employs a KTO (Ethayarajh et al., 2024) loss to optimize the probability of s-MLLM for positive responses to be greater than that of l-MLLM and the probability of s-MLLM for negative responses to be lower than that of l-MLLM. The batch size is 256, and the learning rate is adjusted to 2e-6. Throughout all stages, we employ the adam optimizer (Diederik, 2014) and train on 16 NVIDIA A100 GPUs for one epoch each, totaling approximately 960 GPU hours. The detailed training hyperparameters are shown in Tab. 11.

Table 11: Training hyperparameters of each stage.
Configuration Initialization Mimic Distillation Preference Distillation
LLM
VL Adaptor
ViT
LLM init. Qwen-1.5-1.8B Qwen-1.5-1.8B 2nd-stage
VL Adaptor init. MLP 1st-stage 2nd-stage
ViT init. CLIP-Large@336
Image resolution 336 ×\times× 336
ViT sequence length 576
LLM sequence length 2048
Optimizer AdamW
Optimizer hyperparameter β1=0.9,β2=0.98formulae-sequencesubscript𝛽10.9subscript𝛽20.98\beta_{1}=0.9,\beta_{2}=0.98italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.9 , italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.98
Learning rate 1e-4 2e-5 2e-5
Learning rate schedule Cosine decay
Weight decay 0.0
Training epoch 1
Warm-up ratio 0.03
Global batch size 256 128 128
Numerical precision Bfloat16
Model parallelism Zero2 Zero2 offload Zero2 offload
Table 12: Training dataset of each stage. #Sample means the training samples
Stage Task Dataset
Initialization Captioning LLaVA-1.5-Pretrain (Liu et al., 2023b)
Dense-to-Dense Distillation Captioning ALLaVA-Caption-4V (Chen et al., 2024), ShareGPT4V-PT (Chen et al., 2023a)
Conversation MIMIC-IT (Li et al., 2023a), LVIS (Wang et al., 2023), LRV (Liu et al., 2023a), SViT (Zhao et al., 2023)
Dense-to-Sparse Distillation Captioning ShareGPT4V-100K (Chen et al., 2023a), TextCaps (Sidorov et al., 2020)
Conversation LLaVA-1.5-Instruct (Liu et al., 2023b)
General QA GQA (Hudson & Manning, 2019), VQAv2 (Goyal et al., 2017), OKVQA (Marino et al., 2019)
Grounding Visual Genome (Krishna et al., 2017), RefCOCO (Yu et al., 2016; Mao et al., 2016)
Science AI2D (Kembhavi et al., 2016), ScienceQA (Lu et al., 2022a)
Chart & Doc DVQA (Kafle et al., 2018), ChartQA (Masry et al., 2022), DocQA (Clark & Gardner, 2018)
OCR OCRVQA (Mishra et al., 2019), SynthDoG-EN (Kim et al., 2022)
Knowledge A-OKVQA (Schwenk et al., 2022), GeoQA+ (Cao & Xiao, 2022)
Preference Distillation Preference RLAIF-V (Yu et al., 2024b)

A.2 Training Datasets

We curate a comprehensive open-source dataset of 5 million samples encompassing both general and expert tasks. In the Initialization stage, we utilize 0.6M general captioning samples from LLaVA-1.5-pretrain dataset (Liu et al., 2023b) to bridge the gap between visual and language modalities. In the mimic distillation stage, we utilize 2.4M general captioning and conversation samples to distill general knowledge from the teacher MLLM and 1.4M multi-tasks data samples, including VQA, documents, science, and OCR to distill specialized knowledge from the teacher MLLM. In the preference distillation stage, we utilize 80K preference data samples to transfer preference knowledge from the teacher MLLM. The detailed dataset of each training stage is illustrated in Tab. 12.

Appendix B Additional Experiments

B.1 The Effect of Expert and Router Capacity

We explore the effects of expert and router capacity on the results. Specifically, we utilize CLIP-ViT-L/14 as the vision encoder and the Qwen-1.5-1.8B as the LLM. We experiment with experts of 4 and 8, and router strategies of Top-1 and Top-2. All experiments are conducted under the same datasets and training hyperparameters to ensure a fair comparison. Tab. 14 shows that scaling experts from 4 to 8 doesn’t improve performance. Tab. 14 shows that scaling the router strategy from Top-1 to Top-2 improves performance. This indicates that increasing the capacity of the MoE block could introduce training difficulty. As the capacity grows, the model might struggle more with convergence and optimization, potentially making it harder to effectively learn from the given data. Thus, a careful balance must be achieved between capacity and training efficiency to ensure optimal performance.

Table 13: Ablations with the number of experts. We set the gating strategy as Top-2. The number of experts is set to 4 and 8.
Expert GQA SQAII{}^{\text{I}}start_FLOATSUPERSCRIPT I end_FLOATSUPERSCRIPT VQATT{}^{\text{T}}start_FLOATSUPERSCRIPT T end_FLOATSUPERSCRIPT MME MMB AVG
4 59.3 68.4 58.7 68.4 65.1 64.0
8 58.3 67.4 58.4 69.1 64.6 63.6
Table 14: Ablations with the router strategy. We set the number of experts as 4. The Top-K strategy of routers is set to Top-1 and Top-2.
Top-K GQA SQAII{}^{\text{I}}start_FLOATSUPERSCRIPT I end_FLOATSUPERSCRIPT VQATT{}^{\text{T}}start_FLOATSUPERSCRIPT T end_FLOATSUPERSCRIPT MME MMB AVG
1 58.2 67.2 58.3 66.1 64.7 62.6
2 59.3 68.4 58.7 68.4 65.1 64.0

B.2 The Effect of Preference-Optimization Loss in Preference Distillation

We explore the training stability of various preference-optimization losses. We employ KTO (Ethayarajh et al., 2024) and DPO (Rafailov et al., 2024). As shown in Tab. 15, while both methods have an impact on reducing hallucination, KTO is significantly superior to DPO. This is because KTO, which indicates whether a sample is good or bad, offers a better signal for the s-MLLM to surpass the l-MLLM compared to DPO, which determines relative quality by comparing two samples.

Table 15: Comparison between different preference-optimization loss on comprehension-oriented benchmarks.
PO Loss GQA VizWiz SQAII{}^{\text{I}}start_FLOATSUPERSCRIPT I end_FLOATSUPERSCRIPT VQATT{}^{\text{T}}start_FLOATSUPERSCRIPT T end_FLOATSUPERSCRIPT MME MMB MMBCNCN{}^{\text{CN}}start_FLOATSUPERSCRIPT CN end_FLOATSUPERSCRIPT AVG
DPO 58.5 37.3 68.2 58.6 67.8 65.4 61.5 59.6
KTO 58.7 39.2 68.0 58.5 67.6 66.3 61.9 59.9
Table 16: Comparison between different preference-optimization loss on hallucination-oriented benchmarks.
PO Loss Object HalBench POPE MMHal-Bench
Resp \downarrow Ment \downarrow F1 \uparrow Score \uparrow Hall \downarrow
DPO 20.0 10.5 86.8 2.68 18.9
KTO 11.4 7.2 87.0 2.76 17.2

Appendix C Limitations and Future Works

LLaVA-MoD requires that both s-MLLM and l-MLLM belong to the same series of LLMs to ensure consistency in the vocabulary space. Future research can explore distillation techniques involving heterogeneous model families. Moreover, the requirement to load both s-MLLM and l-MLLM leads to substantial memory consumption. To enable efficient distillation, a possible solution could be to pre-extract the logits of the l-MLLM. This would allow only the s-MLLM to be loaded during training, reducing memory requirements and potentially speeding up the training process.