R1-Reward for MLLMs: Translation and Commentary on "R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning"

Overview: The core contribution of this paper is a multimodal reward model (MRM) called R1-Reward, trained with an improved reinforcement learning algorithm, StableReinforce, to boost the performance of multimodal large language models (MLLMs).

>> Background and pain points:

● Why multimodal reward models (MRMs) matter: MRMs play a crucial role in both the training and inference of MLLMs, affecting training stability and final performance. A high-quality MRM improves data filtering, supports test-time scaling strategies, and simplifies evaluation.

● Limitations of existing MRMs: current work focuses mainly on improving the model architecture and training data of MRMs, with little exploration of using reinforcement learning (RL) to elicit long-term reasoning for reward modeling or of how to activate such capabilities. Directly applying existing RL algorithms (such as Reinforce++) to reward modeling often leads to training instability or even collapse. Specific issues include:

●● Limitations of PPO and related algorithms: simple loss clipping cannot handle the case where the advantage is negative and the current policy differs significantly from the reference policy.

●● Instability of advantage normalization: late in training, when the reward distribution is extremely imbalanced, the commonly used advantage normalization produces extremely large or small advantage values for some samples, destabilizing training.

●● Inconsistency between reasoning and results: the model's reasoning process and its final output can contradict each other, because rule-based RL only scores the result and does not supervise the reasoning process.

>> Proposed solution: the paper introduces a reinforcement learning algorithm called StableReinforce and a multimodal reward model called R1-Reward.

● StableReinforce algorithm: it improves the stability of RL training through the following changes:

●● Pre-CLIP: clip the (log) probability ratio before exponentiation to prevent numerical overflow and mitigate the impact of negative advantages.

●● Advantage Filter: apply a 3-sigma rule to discard outlier advantages, avoiding the instability caused by an extremely imbalanced advantage distribution.

●● Consistency Reward: introduce an additional MLLM as a referee that checks whether the model's reasoning process and final result agree, ensuring the reasoning supports the output.

● R1-Reward model: trained with the StableReinforce algorithm using the following strategies:

●● Progressive difficulty training strategy: first run supervised fine-tuning (SFT) on 200K preference samples drawn from several public datasets, with reasoning traces generated by GPT-4o, then run RL on samples for which GPT-4o needs multiple attempts to reach the correct answer.

●● Carefully designed reward function: a Formatting Reward, a Result Reward, and a Consistency Reward encourage the model to reason sensibly and to give answers consistent with its reasoning (see the sketch below).
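
The three reward terms can be combined into a single scalar per rollout. The sketch below is only an illustration under stated assumptions: the tag convention, the equal weighting, and the judge prompt are not taken from the paper; only the three components themselves are.

```python
import re

def format_reward(response: str) -> float:
    """1 if the response wraps reasoning and verdict in the expected tags (assumed convention)."""
    return 1.0 if re.search(r"<think>.+</think>\s*<answer>.+</answer>", response, re.S) else 0.0

def result_reward(predicted_choice: int, ground_truth_choice: int) -> float:
    """1 if the final verdict matches the human preference label."""
    return 1.0 if predicted_choice == ground_truth_choice else 0.0

def consistency_reward(reasoning: str, answer: str, judge) -> float:
    """Ask an external MLLM judge whether the reasoning actually supports the final answer."""
    verdict = judge(f"Does this reasoning support the answer?\nReasoning: {reasoning}\nAnswer: {answer}")
    return 1.0 if verdict.strip().lower().startswith("yes") else 0.0

def total_reward(response, predicted, label, reasoning, answer, judge) -> float:
    # Equal, additive weights are an assumption; the paper may combine or gate these terms differently.
    return (format_reward(response)
            + result_reward(predicted, label)
            + consistency_reward(reasoning, answer, judge))
```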

>> Core pipeline:

● Data collection and preprocessing: collect 200K multimodal preference samples and use GPT-4o to generate reasoning traces as cold-start SFT data.

● Supervised fine-tuning (SFT): train the base model on the GPT-4o-generated SFT data.

● Reinforcement learning: train the model with the StableReinforce algorithm and the designed reward function, focusing on the harder samples.

● Evaluation: evaluate the model on benchmarks such as VL Reward-Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench.

>> Advantages:

● Strong performance: it outperforms existing SOTA models on multiple benchmarks while being data-efficient.

● Improved training stability: StableReinforce resolves the training instability that existing RL algorithms run into on reward modeling tasks.

● Better inference efficiency: after RL training, the average response length drops by roughly 15%.

● Good test-time scalability: performance keeps improving as the number of inference samples increases.

>> Conclusions and takeaways:

● Reinforcement learning can be applied effectively to multimodal reward modeling and significantly improves model performance.

● StableReinforce improves training stability and model performance by refining the loss function, the advantage estimation strategy, and the reward design.

● R1-Reward achieves SOTA results on multiple benchmarks, demonstrating strong data efficiency and generalization.

● Increasing the number of inference samples further improves R1-Reward, showing its potential for test-time scaling.

● The paper also points out future directions, such as more advanced test-time scaling strategies and better training strategies.

Contents

Translation and Commentary on "R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning"

Abstract

Figure 1: R1-Reward performance on multimodal reward benchmarks

Figure 2: Detailed comparison between StableReinforce and Reinforce++

1. Introduction

Conclusion


Translation and Commentary on "R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning"

Paper: https://arxiv.org/abs/2505.02835

Date: May 5, 2025

Authors: Institute of Automation, Chinese Academy of Sciences; Tsinghua University; Kuaishou; Nanjing University

Abstract

Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves a  improvement on the VL Reward-Bench and a  improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.

Figure 1: R1-Reward performance on multimodal reward benchmarks. Performance improves significantly when using a majority voting strategy (Voting@5/15) over multiple inference samples.

Figure 2: Detailed comparison between StableReinforce and Reinforce++. (a) StableReinforce exhibits faster and more stable convergence of the policy loss during training. (b) StableReinforce continuously performs length compression, improving efficiency. Reinforce++ collapses around step 150, whereas StableReinforce remains stable, demonstrating its enhanced robustness. Additionally, after RL training with StableReinforce, the average response length is reduced by approximately 15% compared to the base model, suggesting potential improvements in reasoning token efficiency.

1. Introduction

High-quality Multimodal Reward Models (MRMs) [37, 3, 56, 51, 62] play a crucial role in the development of Multimodal Large Language Models (MLLMs) [50, 10, 4, 8, 1, 13]. In the training phase, from an algorithmic perspective, the MRM provides reward signals for RL [47, 34], directly influencing the stability and final outcomes of training. From a data perspective, a powerful MRM enables high-quality data filtering, improving data quality by removing noisy samples  [66, 29]. In the inference phase, the MRM facilitates test-time scaling strategies, such as the best-of-N strategy, to select the optimal responses  [51]. In the evaluation phase, a good MRM can serve as an evaluator to simplify the evaluation process, especially in open-ended scenarios [56].

Recently, reinforcement learning [9, 33] has gained widespread application in the post-training process of MLLMs [59], achieving remarkable improvements in traditional vision tasks [27, 44], multimodal reasoning tasks [17, 36, 31], video understanding tasks [11], and omni-tasks [69]. Compared to traditional post-training strategies such as supervised fine-tuning and direct preference optimization [38], RL offers better generalization [6] and demonstrates the ability to induce long-term reasoning capabilities [9]. However, recent improvements in MRMs have primarily focused on data [56, 62] and structural aspects [66], with little discussion on whether RL can be used to introduce long-term reasoning in order to improve multimodal reward modeling performance.

In this paper, we investigate whether RL algorithms can be applied to multimodal reward modeling tasks. Intuitively, the reward modeling problem can be transformed into a rule-based RL task, where the input consists of a given question and two answers. The target of the policy is to decide which answer is better. The reward during training can be obtained by comparing whether the model's judgment is consistent with the ground truth. Our goal is to enable the model to perform long-term reasoning and then provide the correct judgment. However, RL for reward modeling presents several unique challenges, and directly using traditional RL methods can easily cause training to collapse:

1. Limitation of PPO [40] and Related Algorithms [42]. PPO and related algorithms rely on clipping the loss function to ensure training stability. However, we observe that when the advantage is negative and the current policy differs significantly from the reference policy, simple clipping fails to prevent instability, which may cause the training process to diverge or even crash.

2. Instability of Advantage Normalization. We observe that in the later stages of training, when the majority of rewards in a single batch are either 1 or 0 with very low variance, the commonly used advantage normalization technique (subtracting the mean and dividing by the standard deviation) in algorithms such as GRPO [42] and Reinforce++ [15] can lead to extremely large or small advantage values for some samples. This can cause significant instability during training (see the numeric illustration after this list).

3. Inconsistency Between Reasoning and Results. During training, we frequently observe inconsistencies between the model’s reasoning process and its final output. The model may judge one answer as better during reasoning but ultimately output an opposite answer. This happens because rule-based RL only scores the result without supervising the reasoning process, leading the model to learn to generate correct answers without coherent reasoning.
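
To make the second failure mode concrete, here is a small synthetic computation (illustrative numbers, not taken from the paper): when nearly every reward in a batch is 1, the rare 0-reward sample receives an enormous negative normalized advantage.

```python
import numpy as np

# Synthetic late-stage batch: 255 rewards of 1 and a single reward of 0.
rewards = np.array([1.0] * 255 + [0.0])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print(adv.max())   # ~0.063 for each of the 255 "correct" samples
print(adv.min())   # ~-15.97 for the single "incorrect" sample
```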

To this end, at the algorithm level, we propose StableReinforce, which introduces several key modifications to traditional RL methods. Specifically, we refine the clipping operation to mitigate numerical instability caused by large updates and introduce a robust advantage normalization technique that limits the impact of outliers. On the reward function design front, StableReinforce introduces a novel mechanism: the use of an MLLM as a referee. This referee evaluates the consistency between the model’s reasoning process and the final result, ensuring that the reasoning aligns with the output. This consistency reward promotes more accurate and logically coherent decision-making.
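
Below is a minimal PyTorch-style sketch of the two loss-side stabilizers described above. The threshold values, the PPO-style surrogate form, and the choice to standardize advantages before filtering are assumptions based on this description, not the paper's exact equations.

```python
import torch

def stable_reinforce_loss(logp_new, logp_old, advantages,
                          delta: float = 10.0, eps: float = 0.2, sigma_k: float = 3.0):
    """Policy loss with Pre-CLIP on the log-ratio and a 3-sigma advantage filter (illustrative)."""
    # Pre-CLIP: bound the log-ratio before exponentiation so the ratio cannot overflow and a
    # strongly off-policy sample with a negative advantage cannot dominate the update.
    log_ratio = torch.clamp(logp_new - logp_old, min=-delta, max=delta)
    ratio = torch.exp(log_ratio)

    # Advantage Filter: standardize the advantages, then drop samples whose standardized
    # advantage falls outside the 3-sigma band instead of letting outliers destabilize training.
    std_adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    mask = (std_adv.abs() <= sigma_k).float()

    # PPO-style clipped surrogate, averaged only over the samples that survive the filter.
    surr1 = ratio * std_adv
    surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * std_adv
    loss = -(torch.min(surr1, surr2) * mask).sum() / mask.sum().clamp(min=1.0)
    return loss
```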

During the training phase, directly training the MLLM using reinforcement learning yields suboptimal results. Therefore, a progressive difficulty training strategy is adopted. Initially, 200K preference samples are collected from publicly available datasets, and GPT-4o generates corresponding thinking processes, referred to as R1-Reward-200K, to serve as cold-start SFT data. Meanwhile, for each sample, the number of sampling attempts GPT-4o requires to infer a conclusion matching the ground truth is recorded, which is considered the difficulty level of that sample. In the reinforcement learning phase, samples where GPT-4o requires at least two sampling attempts to arrive at the correct answer, or fails to answer correctly even after three attempts, are selected as training data. These samples are then used to train the model with the enhanced StableReinforce algorithm. As shown in Figure 2, the reinforcement learning phase effectively performs token compression and also yields a noticeable performance improvement in our experiments.
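
The difficulty-based selection can be written as a simple filter. The record layout and field name below are hypothetical; only the thresholds (at least two attempts, or failure within three attempts) come from the paragraph above.

```python
def select_rl_samples(records):
    """Keep the hard samples: GPT-4o needed at least two sampling attempts to match the
    ground truth, or never matched it within three attempts (recorded here as None)."""
    selected = []
    for r in records:
        attempts = r["gpt4o_attempts"]  # hypothetical field: attempts needed, None if all 3 failed
        if attempts is None or attempts >= 2:
            selected.append(r)
    return selected
```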

R1-Reward performs excellently on common multimodal reward modeling benchmarks. As shown in Figure 1, R1-Reward outperforms the state-of-the-art (SOTA) on all three benchmarks. Furthermore, R1-Reward exhibits strong inference-time scalability. By sampling only five times and selecting the most frequent answer as the correct one, the accuracy of reward modeling improves substantially. On the MM-RLHF Reward Bench [66], VL Reward-Bench [21], and Multimodal Reward Bench [57], R1-Reward achieves improvements of 3.5%, 13.5%, and 14.6%, respectively, compared to SOTA. As the number of samples increases, performance continues to improve, demonstrating the potential of RL for multimodal reward modeling.
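
The Voting@5/15 strategy in Figure 1 amounts to plain majority voting over independent judgments. A minimal sketch follows, where `judge_once` is a placeholder for a single R1-Reward inference call returning a verdict such as "A" or "B".

```python
from collections import Counter

def vote_at_k(judge_once, question, answer_a, answer_b, k: int = 5):
    """Sample k independent verdicts and return the most frequent one (majority voting)."""
    verdicts = [judge_once(question, answer_a, answer_b) for _ in range(k)]
    return Counter(verdicts).most_common(1)[0][0]
```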

Conclusion

In this paper, we introduce R1-Reward, an MRM trained using the StableReinforce algorithm. We demonstrate that RL can be effectively applied to reward modeling, significantly enhancing its performance. Our approach addresses key challenges, including training instability, the limitations of advantage normalization, and inconsistencies between reasoning and results. By incorporating techniques such as pre-clipping, advantage filtering, a consistency reward, and a progressive difficulty training strategy, StableReinforce stabilizes training and improves model performance. Experiments show that R1-Reward outperforms SOTA models on several multimodal reward model benchmarks, with significant improvements in accuracy and data efficiency.

Furthermore, R1-Reward demonstrates excellent test-time scaling capabilities, and paves the way for future research on integrating reinforcement learning into MRMs. Looking ahead, there are still many areas to explore in RL for reward modeling. For example, we only test a simple majority voting strategy for test-time scaling; more advanced methods could potentially improve performance further [26]. Additionally, improving training strategies to further enhance the foundational capabilities of reward models is also a meaningful open problem.
