Better Estimation of the KL Divergence Between Language Models

Amini, Afra; Vieira, Tim; Cotterell, Ryan

arXiv:2504.10637 (cs)

[Submitted on 14 Apr 2025 (v1), last revised 2 May 2025 (this version, v2)]

Abstract: Estimating the Kullback-Leibler (KL) divergence between language models has many applications, e.g., reinforcement learning from human feedback (RLHF), interpretability, and knowledge distillation. However, computing the exact KL divergence between two arbitrary language models is intractable. Thus, practitioners often resort to the use of sampling-based estimators. While it is easy to fashion a simple Monte Carlo (MC) estimator that provides an unbiased estimate of the KL divergence between language models, this estimator notoriously suffers from high variance, and can even result in a negative estimate of the KL divergence, a non-negative quantity. In this paper, we introduce a Rao-Blackwellized estimator that is also unbiased and provably has variance less than or equal to that of the standard Monte Carlo estimator. In an empirical study on sentiment-controlled fine-tuning, we show that our estimator provides more stable KL estimates and reduces variance substantially in practice. Additionally, we derive an analogous Rao-Blackwellized estimator of the gradient of the KL divergence, which leads to more stable training and produces models that more frequently appear on the Pareto frontier of reward vs. KL compared to the ones trained with the MC estimator of the gradient.
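
To make the contrast between the two estimators concrete, the following is a minimal sketch, not the authors' implementation. It assumes one natural way to Rao-Blackwellize: at each step of a sampled sequence, the sampled per-token log-ratio is replaced by the exact KL divergence between the two next-token distributions. The abstract itself does not spell out the construction. The toy bigram "language models", the fixed sequence length, and all function names (`make_bigram_model`, `kl_estimates`, `exact_kl`) are hypothetical and chosen only so that the exact KL is computable for comparison.

```python
import numpy as np


def make_bigram_model(rng, vocab_size, scale):
    """Hypothetical toy model: row i is the next-token distribution after token i."""
    logits = rng.normal(scale=scale, size=(vocab_size, vocab_size))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)


def kl_estimates(p, q, num_samples, seq_len, rng, start_token=0):
    """Per-sample MC and Rao-Blackwellized estimates of KL(p || q) over sequences."""
    vocab_size = p.shape[0]
    mc = np.zeros(num_samples)
    rb = np.zeros(num_samples)
    for n in range(num_samples):
        prev = start_token
        for _ in range(seq_len):
            p_next, q_next = p[prev], q[prev]
            # Rao-Blackwellized term: exact KL between the next-token
            # distributions given the sampled prefix (always non-negative).
            rb[n] += np.sum(p_next * (np.log(p_next) - np.log(q_next)))
            # MC term: log-ratio of the single sampled token (can be negative).
            tok = rng.choice(vocab_size, p=p_next)
            mc[n] += np.log(p_next[tok]) - np.log(q_next[tok])
            prev = tok
    return mc, rb


def exact_kl(p, q, seq_len, start_token=0):
    """Exact sequence-level KL for the toy bigram models (tractable only here)."""
    step_kl = np.sum(p * (np.log(p) - np.log(q)), axis=1)  # KL for each context
    state = np.zeros(p.shape[0])
    state[start_token] = 1.0  # distribution over the previous token
    total = 0.0
    for _ in range(seq_len):
        total += state @ step_kl
        state = state @ p
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = make_bigram_model(rng, vocab_size=5, scale=1.0)
    q = make_bigram_model(rng, vocab_size=5, scale=1.0)
    mc, rb = kl_estimates(p, q, num_samples=2000, seq_len=8, rng=rng)
    print("exact KL      :", exact_kl(p, q, seq_len=8))
    print("MC  mean / var:", mc.mean(), mc.var())
    print("RB  mean / var:", rb.mean(), rb.var())
    print("negative MC samples:", int((mc < 0).sum()))
```

In this sketch both estimators draw the same sequences from `p`, so any difference in spread comes only from replacing the sampled log-ratio with its conditional expectation. Each Rao-Blackwellized sample is a sum of exact per-step KL values and therefore cannot be negative, whereas individual MC samples can be.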

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2504.10637 [cs.CL]
	(or arXiv:2504.10637v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2504.10637

Submission history

From: Afra Amini
[v1] Mon, 14 Apr 2025 18:40:02 UTC (1,027 KB)
[v2] Fri, 2 May 2025 23:58:03 UTC (1,014 KB)
