Distribution Learning with Valid Outputs Beyond the Worst-Case

Nick Rittler
University of California - San Diego
[email protected]
Kamalika Chaudhuri
University of California - San Diego
[email protected]
Abstract

Generative models at times produce “invalid” outputs, such as images with generation artifacts and unnatural sounds. Validity-constrained distribution learning attempts to address this problem by requiring that the learned distribution have a provably small fraction of its mass in invalid parts of space – something which standard loss minimization does not always ensure. To this end, a learner in this model can guide the learning via “validity queries”, which allow it to ascertain the validity of individual examples. Prior work on this problem takes a worst-case stance, showing that proper learning requires an exponential number of validity queries, and demonstrating an improper algorithm which – while generating guarantees in a wide range of settings – makes an atypical polynomial number of validity queries. In this work, we take a first step towards characterizing regimes where guaranteeing validity is easier than in the worst case. We show that when the data distribution lies in the model class and the log-loss is minimized, the number of samples required to ensure validity has a weak dependence on the validity requirement. Additionally, we show that when the validity region belongs to a VC-class, a limited number of validity queries is often sufficient.

1 Introduction

When sampling from a generative model, it is highly desirable that its outputs meet some basic criteria of quality. In the case of text, this may mean that generated sentences respect grammar rules, or avoid the use of biased or offensive language Perez et al. (2022); Abid et al. (2021). When generating code, a criterion may be that the generated code successfully compiles Hanneke et al. (2018). In image generation, we might wish to avoid blurry outputs, or those possessing generation artifacts which clearly distinguish them from natural images Kaneko and Harada (2021); Odena et al. (2016).

In this paper, we examine the statistical cost of ensuring that learned distributions produce such “valid” outputs. To do so, we consider an elegant formulation of the problem of learning such valid models due to Hanneke et al. (2018). In their work, training data are generated according to a probability distribution $P$, and the binary “validity” of examples is determined by some unknown “validity function” $v$. Given sample access to $P$ and query access to $v$, a learner attempts to identify a probability distribution which outputs invalid examples with probability at most $\epsilon_{2}$. At the same time, the distribution should have a loss which is at most $\epsilon_{1}$ worse than that of the minimum loss model in a class $\mathcal{Q}$ which outputs valid examples with probability 1. Here, query access to $v$ captures the idea that collecting samples is often cheap, but verifying validity is often less so, possibly requiring a human-in-the-loop.

The initial work of Hanneke et al. (2018) suggests that choosing such a low-loss, high-validity distribution $\hat{q}$ may require a large number of validity queries. Under the assumption that $P$ is “fully-valid”, i.e. outputs a valid example with probability 1, they show that in the worst case, $2^{\Omega(1/\epsilon_{1})}$ validity queries are required to choose such a model $\hat{q}$ from the class $\mathcal{Q}$. They follow this result with an improper learning algorithm for choosing $\hat{q}$ which, while achieving polynomial bounds on the number of validity queries, uses a relatively large number of validity queries $\tilde{O}\left(\log(|\mathcal{Q}|)/\epsilon_{1}^{2}\epsilon_{2}\right)$.

The somewhat pessimistic picture painted by these fascinating complexity-theoretic results can be traced to their generality. Firstly, it is possible that $\mathcal{Q}$ and $P$ are significantly “mismatched”, i.e. the support of each model $q\in\mathcal{Q}$ has only a small overlap with the support of the distribution $P$, in which case the validity information contained in valid training samples is unhelpful to a proper learner. Secondly, their improper learning algorithm is largely loss-agnostic, in that it generates guarantees for a wide class of bounded loss functions. Finally, nothing is assumed about the form of the validity function $v$, precluding provable estimation.

In this work, we offer a counterbalance to this picture, beginning an investigation into learning settings where guaranteeing validity is cheaper than such results might indicate. We first consider learning under complete elimination of model class mismatch, where $\mathcal{Q}$ is rich enough to contain the fully-valid data distribution $P$, and the loss is the log-loss $l(f(x))=\log(1/f(x))$. It is intuitive that in this setting, loss minimization alone should guarantee validity. Somewhat less intuitively, we demonstrate an algorithm closely related to empirical risk minimization which uses just $\tilde{O}\left(\log(|\mathcal{Q}|)/\min(\epsilon_{1}^{2},\epsilon_{2})\right)$ samples to guarantee its output meets loss and validity requirements – in other words, validity comes quickly under random sampling from $P$ in this setting.

Secondly, we consider learning under a different realizability assumption, namely that the validity region is a member of a VC-class of dimension $D$. In this setting, we provide an analysis of the natural scheme of restricting the empirical risk minimizer to an estimate of the valid part of space. We show that when small-loss models $q\in\mathcal{Q}$ have at least constant validity, this scheme uses $\tilde{O}\left(D/\epsilon_{2}\right)$ validity queries, implying a query cost reduction over the general-purpose algorithm of Hanneke et al. (2018). We also show that learning under the capped log-loss can be used to relax the assumption of constant validity at the cost of an extra factor of $1/\epsilon_{1}$.

Our results suggest the existence of a rich web of settings in which validity may be cheaper than in the general case. They also suggest that the choice of the loss plays an important role in guaranteeing valid outputs, compelling further investigation of the log-loss in particular.

2 Related Work

The framing of learning distributions in terms of PAC guarantees similar to Valiant (1984) dates back to Kearns et al. (1994), who consider the learnability of specific classes of discrete distributions under a realizability assumption. A significant body of work on distribution learning has been developed over time, often focusing on algorithms for learning over parametric families or under specific “structural” assumptions Haghtalab et al. (2019); Daskalakis et al. (2013); Kalai et al. (2010); Daskalakis et al. (2012). The only theoretical contribution to validity-constrained distribution learning under the formulation posed by Hanneke et al. (2018) that we are aware of is that work itself.

Loss functions for the evaluation of probabilistic models have often been studied through the lens of “scoring rules” in the forecasting literature Brier (1950); Good (1952); Gneiting and Raftery (2007). There are some notable recent contributions towards expanding the understanding of when loss functions for distribution learning display desirable properties, e.g. “properness”, which designates that the loss is minimized by the true data distribution Haghtalab et al. (2019); Frongillo et al. (2022).

The first half of this paper draws on intuition from hypothesis testing to evaluate the performance of empirical risk minimization. Hypothesis testing is a major focus of the classical statistics literature Lehmann et al. (1986). The bounds in the first half of the paper are due to analysis inspired by the Neyman-Pearson lemma Neyman and Pearson (1933); Rebeschini (2021), and rely on the approximation of total variational distance between product measures Reiss (1981).

The applied literature on generative modeling has consistently noted the problem of learned models producing “invalid” examples Kusner et al. (2017); Janz et al. (2017); Isola et al. (2017); Kaneko and Harada (2021). Various techniques have been proposed for mitigating invalidity, both generally and in domain-specific settings Aitken et al. (2017); Kusner et al. (2017); Kong and Chaudhuri (2023). While working under the assumption that the validity function lies in a VC-class, the strategy we introduce has some rough semblance to a “post-editing” procedure proposed by Kong and Chaudhuri (2023).

3 Preliminaries

3.1 Problem Setup

Let $\mathcal{X}$ be a subset of Euclidean space $\mathbb{R}^{d}$ with finite Lebesgue measure $\lambda$. Let $\mathcal{P}$ denote the set of all probability distributions on the measurable space $(\mathcal{X},\mathcal{F}_{\mathcal{X}})$, where $\mathcal{F}_{\mathcal{X}}$ arises from Lebesgue measurable sets intersected with $\mathcal{X}$. Let $P\in\mathcal{P}$ be the data-generating distribution.

In the eyes of the learner, the function $v:\mathcal{X}\to\{0,1\}$ is a fixed and unknown “validity function”, measurable with respect to the relevant distributions. The validity function denotes whether or not an example $x\in\mathcal{X}$ is considered a valid output for a learned approximation of $P$. The learner is given a model class $\mathcal{Q}\subset\mathcal{P}$ of probability distributions on $\mathcal{X}$, each with density $f_{q}$ with respect to $\lambda$, and is afforded the knowledge that at least one $q\in\mathcal{Q}$ is “fully-valid”, i.e. that there is some $q\in\mathcal{Q}$ with validity $V(q):=\textnormal{Pr}_{X\sim q}\left(v(X)=1\right)=1$. We at times use the notion of “invalidity” of a model, by which we mean $I(q)=1-V(q)$. Following the main exposition of Hanneke et al. (2018), we assume $\mathcal{Q}$ is of finite cardinality.

The goodness-of-fit of a model $q\in\mathcal{P}$ is governed by a decreasing “local” loss function $l:\mathbb{R}^{\geq 0}\to\mathbb{R}\cup\{\infty\}$. Such a loss function gives rise to the loss of a model via $L_{P}(q;l):=\mathbb{E}_{X\sim P}\left[l\left(f_{q}(X)\right)\right]$. Given an i.i.d. sample $S$ from $P$, we let the empirical estimate of the loss of a model be $L_{S}(q;l)=\sum_{x_{i}\in S}l(f_{q}(x_{i}))/|S|$. We use the shorthand $L_{P}(q)$ and $L_{S}(q)$ to denote the true and empirical losses of $q$ under the log-loss $l(f_{q}(x))=\log(1/f_{q}(x))$, where $\log$ denotes the natural logarithm. We take the log-loss to be infinite at points where $f_{q}(x)=0$.
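As an illustration of these definitions, the following minimal sketch computes the true and empirical log-losses. It assumes, purely for illustration, a finite domain on which models are probability mass functions stored as dictionaries, standing in for densities with respect to $\lambda$; the function names are hypothetical and not part of the original formulation.

```python
import math


def log_loss(mass: float) -> float:
    """Log-loss l(f) = log(1/f), taken to be infinite where the model puts no mass."""
    if mass <= 0.0:
        return math.inf
    return -math.log(mass)


def true_loss(data_dist: dict, model: dict) -> float:
    """L_P(q): expected log-loss of `model` under the data distribution P."""
    # Points with P(x) = 0 are skipped so that 0 * inf never arises.
    return sum(p * log_loss(model.get(x, 0.0)) for x, p in data_dist.items() if p > 0)


def empirical_loss(model: dict, sample: list) -> float:
    """L_S(q): average log-loss of `model` over the sample S."""
    return sum(log_loss(model.get(x, 0.0)) for x in sample) / len(sample)
```

For example, with `P = {"a": 0.5, "b": 0.5}`, the call `true_loss(P, P)` returns the entropy of $P$, and any model that assigns zero mass to a sampled point receives an infinite empirical loss.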

3.2 Goal of Learning

The goal of the learner is to choose some $\hat{q}\in\mathcal{P}$ which has a loss $L_{P}(\hat{q};l)$ similar to that of the lowest-loss fully-valid model in $\mathcal{Q}$, while simultaneously maintaining near full-validity. Explicitly, consider the model

$$q^{*}:=\operatorname*{arg\,min}_{q\in\mathcal{Q}:V(q)=1}L_{P}(q;l).$$

To describe the quality of an outputted model, we consider two learning parameters $\epsilon_{1}$ and $\epsilon_{2}$, where $\epsilon_{1}$ is used to control the loss sub-optimality, and $\epsilon_{2}$ to control the invalidity. Formally then, the goal of the learner is to output $\hat{q}\in\mathcal{P}$ satisfying $L(\hat{q})\leq L(q^{*})+\epsilon_{1}$ and $I(\hat{q})\leq\epsilon_{2}$. To accomplish this goal, the learner has sample access to $P$ and query access to $v$, i.e. a learner can draw any finite number of i.i.d. samples from $P$, and can request the value of the validity function $v$ at any finite number of inputs in $\mathcal{X}$.
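The bi-criteria objective can be sketched as a simple check. The snippet below is an illustration only: it assumes a finite domain with mass functions stored as dictionaries, uses the log-loss as the loss, and takes full knowledge of $P$ and $v$ for granted, which a learner does not have. All names are hypothetical.

```python
import math


def expected_log_loss(data_dist: dict, model: dict) -> float:
    """L_P(q) under the log-loss, infinite if the model misses a point P can emit."""
    total = 0.0
    for x, p in data_dist.items():
        if p > 0:
            mass = model.get(x, 0.0)
            total += p * (math.inf if mass <= 0.0 else -math.log(mass))
    return total


def invalidity(model: dict, is_valid) -> float:
    """I(q) = 1 - V(q): total mass the model places on invalid points."""
    return sum(mass for x, mass in model.items() if not is_valid(x))


def meets_objective(q_hat, q_star, data_dist, is_valid, eps1, eps2) -> bool:
    """Check L(q_hat) <= L(q_star) + eps1 and I(q_hat) <= eps2."""
    loss_ok = expected_log_loss(data_dist, q_hat) <= expected_log_loss(data_dist, q_star) + eps1
    valid_ok = invalidity(q_hat, is_valid) <= eps2
    return loss_ok and valid_ok
```

Here `q_star` plays the role of $q^{*}$; in this toy setting it is passed in directly rather than computed by a search over $\mathcal{Q}$.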

At a minimum, we are interested in algorithms which require a number of samples from $P$ and a number of validity queries that are polynomial in $\log(|\mathcal{Q}|)$, $1/\epsilon_{1}$ and $1/\epsilon_{2}$. Ideally, we would like to minimize the number of validity queries given some polynomial number of samples from $P$. The motivation for this goal is similar to the minimization of label queries in active learning for classification Hanneke (2014), where samples from the marginal over instances are often cheap, but labeling such examples is assumed expensive.

3.3 Full-Validity of $P$

We assume that all samples from the data-generating distribution $P$ are valid, i.e. that $V(P)=1$. Under such an assumption, the query demand of a learning algorithm can be conceptualized as the overhead number of queries sufficient for choosing a good model under the standard procedure of removing invalid examples from the training set.

If the data distribution is not fully-valid, and valid samples are required by an algorithm, the question of minimizing the overall number of queries is dependent on the sample complexity of learning – if one assumes that $P$ has been constructed by “accepting” valid samples from some underlying distribution which outputs a valid sample with constant probability, then the overall query cost incurred by an algorithm is on the order of the larger of the number of samples and the number of “overhead” validity queries it uses.
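The accounting in this paragraph can be made concrete with a small sketch of the “accepting” procedure. The helper below is hypothetical and only illustrates the bookkeeping: each draw from the underlying stream costs one validity query, so if a draw is valid with constant probability $c$, collecting $n$ valid samples takes about $n/c$ queries in expectation.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def accept_valid(
    stream: Iterable[T], is_valid: Callable[[T], bool], n: int
) -> Tuple[List[T], int]:
    """Collect up to n valid samples from `stream`, spending one query per draw."""
    accepted: List[T] = []
    queries = 0
    for x in stream:
        if len(accepted) == n:
            break
        queries += 1  # one validity query for this draw
        if is_valid(x):
            accepted.append(x)
    # If the stream runs out early, fewer than n samples are returned.
    return accepted, queries
```

The returned query count corresponds to the cost of building the training set, which is separate from the “overhead” queries an algorithm spends afterwards.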

In this paper, we are primarily interested in the “overhead” number of queries, which we refer to as the “number of validity queries” of a given scheme. In most cases, algorithm sample requirements are similar to $O(\log(|\mathcal{Q}|)/\epsilon_{1}^{2})$, which allows for accurate loss estimation in many settings.

3.4 Summary of Previous Results

The learning problem above is due to Hanneke et al. (2018), who considered the possibility of specifying learning algorithms meeting the above bi-criteria objective for any choice of bounded, decreasing, local loss function.

This work gives some interesting insight into the difficulty of selecting such a low-loss, high-validity model. They begin by giving a negative result, namely that any proper learning algorithm, i.e. one outputting $\hat{q}\in\mathcal{Q}$, must make $2^{\Omega(1/\epsilon_{1})}$ validity queries in the worst case, regardless of the number of samples available from $P$. This result arises from a specific problem instance wherein every $q\in\mathcal{Q}$ has a significant amount of mass outside of the support of $P$, in which case samples from a fully-valid $P$ do not give information about $v$ in parts of space relevant to the choice of $\hat{q}\in\mathcal{Q}$.

On the other hand, they demonstrate an improper learning algorithm which achieves polynomial bounds on samples and validity queries for any choice of loss meeting the above criteria. Their algorithm harnesses a constrained ERM oracle, iteratively querying the validity of samples from the model $q\in\mathcal{Q}$ which is the empirical loss minimizer putting no mass on points known to be invalid. In particular, their scheme uses $\tilde{O}(\log(|\mathcal{Q}|)/\epsilon_{1}^{2})$ samples and $\tilde{O}(\log(|\mathcal{Q}|)/\epsilon_{1}^{2}\epsilon_{2})$ validity queries.

4 Learning Without Model Class Mismatch Under the Log-Loss

We first consider the problem of selecting a low-loss, high-validity model under a relaxation of two of the main sources of difficulty in the original problem formulation: the misalignment of the model class $\mathcal{Q}$ with the data distribution $P$, and the lack of assumptions on the loss.

In particular, we consider the problem under a realizability assumption, namely that $P\in\mathcal{Q}$, further investigating the power of the log-loss. Such a setting is arguably more closely aligned with contemporary learning settings with rich model classes that appropriately capture features of the underlying data distribution, where the validity information contained in samples from $P$ can be exploited by convergence to the best information-theoretic representation of $P$ in $\mathcal{Q}$.

The log-loss is by far the most widely-used loss in practice Haghtalab et al. (2019). It is a classic result of the proper scoring rule literature that the log-loss is the unique strictly-proper local loss, i.e. the only local loss under which for all distributions $q\neq P$, it holds that $L_{P}(P;l)<L_{P}(q;l)$. This highly desirable property – implying that convergence to the optimum over $\mathcal{P}$ coincides with convergence to $P$ – makes the choice of an alternative outside of capped variants preferable only under specialized circumstances.

4.1 Towards Validity without Validity Queries

Given that samples are assumed to be valid, and the log-loss permits convergence to the data generating distribution, one would hope that simply selecting a model $\hat{q}\in\mathcal{Q}$ which is a sufficiently good representation of $P$ under the log-loss would yield validity guarantees in this setting. Simply utilizing empirical risk minimization (ERM) is the canonical approach to this end, and one which, given sufficient data from $P$, uses exactly zero validity queries.

Note that any model $q$ with invalidity $I(q)>\epsilon_{2}$ necessarily has $d_{TV}(q,P)>\epsilon_{2}$, since in this case $q$ must have more than $\epsilon_{2}$ mass in the invalid part of space, where $P$ has none. Thus, if one can guarantee that $\hat{q}$ has $d_{TV}(\hat{q},P)\leq\epsilon_{2}$, the validity requirement is met. Recalling the Pinsker inequality $d_{TV}(q,q^{\prime})\leq O(\sqrt{d_{KL}(q,q^{\prime})})$ relating total variational distance and KL-divergence, and that the log-loss sub-optimality of a model relative to $P$ is exactly its KL-divergence from $P$, it follows that obtaining a model $\hat{q}$ which is at most $\epsilon_{2}^{2}$ sub-optimal in log-loss yields a model meeting the validity requirement.
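The chain of inequalities in this argument, $I(q)\leq d_{TV}(q,P)\leq\sqrt{d_{KL}(P,q)/2}$ for fully-valid $P$, can be checked numerically. The sketch below assumes a finite domain with mass functions as dictionaries and uses Pinsker's inequality with its standard constant for the natural logarithm; the example distributions are made up for illustration.

```python
import math


def total_variation(p: dict, q: dict) -> float:
    """d_TV(p, q) = (1/2) * sum_x |p(x) - q(x)| on a finite domain."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def kl_divergence(p: dict, q: dict) -> float:
    """d_KL(p, q) = sum_x p(x) log(p(x)/q(x)), infinite if q misses mass of p."""
    total = 0.0
    for x, px in p.items():
        if px > 0:
            qx = q.get(x, 0.0)
            if qx <= 0.0:
                return math.inf
            total += px * math.log(px / qx)
    return total


# P is fully valid: it only emits the valid points 0 and 1; point 2 is invalid.
P = {0: 0.5, 1: 0.5}
q = {0: 0.7, 1: 0.1, 2: 0.2}

invalid_mass = q[2]                                   # I(q)
tv = total_variation(P, q)                            # d_TV(q, P)
pinsker_bound = math.sqrt(kl_divergence(P, q) / 2.0)  # sqrt(d_KL(P, q) / 2)
```

In this example the invalid mass is bounded by the total variation, which is in turn bounded by the Pinsker term, so driving the KL-divergence down forces the invalidity down as well.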

While this illustrates useful intuition for the setting, it glosses over two main issues. Firstly, empirical estimates of the log-loss do not admit concentration guarantees – one can construct simple examples where $\mathbb{E}_{X\sim P}[\log(1/f_{q}(X))]$ is unbounded above, but with high probability, the empirical estimate $L_{S}(q)=\sum_{x\in S}\log(1/f_{q}(x))/|S|$ is approximately that of $P$ Haghtalab et al. (2019). Thus, selecting models with low empirical loss does not by itself yield loss guarantees. Secondly, this application of Pinsker’s inequality demands $\epsilon_{2}^{2}$ loss sub-optimality, suggesting that ensuring validity via the selection of a good model under the log-loss is even harder than guaranteeing a small loss.
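The first issue can be seen in a two-point example. The following sketch is an illustrative construction, not the one from Haghtalab et al. (2019): $P$ emits a rare outcome `"b"` with probability `eta`, and the model $q$ places no mass on it, so the true log-loss of $q$ is infinite while a typical sample never reveals this.

```python
import math


def log_loss(mass: float) -> float:
    """log(1/mass), infinite where the model puts no mass."""
    return math.inf if mass <= 0.0 else -math.log(mass)


eta = 1e-6   # probability of the rare outcome under P (illustrative value)
n = 100      # sample size (illustrative value)

P = {"a": 1.0 - eta, "b": eta}
q = {"a": 1.0}  # q never emits "b"

# True loss of q under P: infinite, because P emits "b" with positive probability.
true_loss_q = sum(p * log_loss(q.get(x, 0.0)) for x, p in P.items())

# Probability that an i.i.d. sample of size n from P contains no "b" at all.
prob_sample_misses_b = (1.0 - eta) ** n

# On such a typical sample, the empirical loss of q is finite and close to that of P.
typical_sample = ["a"] * n
empirical_loss_q = sum(log_loss(q.get(x, 0.0)) for x in typical_sample) / n
empirical_loss_P = sum(log_loss(P[x]) for x in typical_sample) / n
```

With these values, `prob_sample_misses_b` is very close to 1, and on the typical sample the empirical loss of $q$ is no larger than that of $P$ even though its true loss is unbounded.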

We would hope that, in the case that zero-query learning is possible, guaranteeing validity arises somewhat coincidentally with convergence to $P$, meaning that the sample complexity is not much worse given a validity requirement than without one. Thus, the path towards satisfaction of the learning objectives requires subtle handling, and compels particular attention to the sample complexity dependence on the validity parameter $\epsilon_{2}$.

4.2 Analysis of Empirical Risk Minimization

As indicated above, it is not possible to guarantee that empirical risk minimization (ERM) outputs a model with small log-loss. It is, however, possible to guarantee that it outputs a model which closely resembles $P$ and inherits validity guarantees with a small number of samples.

In particular, it is possible to show that given sufficient samples, ERM yields a model with small total variation to $P$ when $P\in\mathcal{Q}$. This is due to the following folklore theorem Haghtalab et al. (2019), which we prove under the assumption of density existence in the Appendix.

Lemma 4.

Fix $0<\epsilon,\delta<1$ arbitrarily, and let $P,q\in\mathcal{P}$ be distributions with densities with respect to $\lambda$. Then if $d_{TV}(q,P)\geq\epsilon$, and $S\sim P^{n}$ for $n\geq\Omega(\log(1/\delta)/\epsilon^{2})$, it holds with probability $\geq 1-\delta$ that

$$L_{S}(P)<L_{S}(q).$$

Thus, at the statistical cost of estimating a coin bias, any distribution $q$ with total variation $\geq\epsilon$ from the data distribution will reveal itself to be empirically inferior when the log-loss is used. This can be easily leveraged to generate guarantees for ERM over $\mathcal{Q}$ in terms of total variation.
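The behavior described by the lemma can be observed in a small simulation. The sketch below is an illustration on a two-point domain, not a proof or the analysis from the Appendix: it assumes both distributions put positive mass on every outcome of $P$, and the sample size and seeds are arbitrary choices.

```python
import math
import random


def empirical_log_loss(model: dict, sample: list) -> float:
    """L_S(q) under the log-loss; assumes the model has positive mass on every sample."""
    return sum(-math.log(model[x]) for x in sample) / len(sample)


def truth_beats_alternative(P: dict, q: dict, n: int, seed: int) -> bool:
    """Draw n i.i.d. samples from P and test whether L_S(P) < L_S(q)."""
    rng = random.Random(seed)
    outcomes = list(P)
    sample = rng.choices(outcomes, weights=[P[x] for x in outcomes], k=n)
    return empirical_log_loss(P, sample) < empirical_log_loss(q, sample)


P = {0: 0.5, 1: 0.5}
q = {0: 0.7, 1: 0.3}  # d_TV(q, P) = 0.2

# Count in how many independent trials the data distribution wins empirically.
wins = sum(truth_beats_alternative(P, q, n=2000, seed=s) for s in range(20))
```

With a sample this large relative to $1/\epsilon^{2}$, the data distribution is expected to have the smaller empirical loss in essentially every trial.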

It is tempting to think that this is the entire story when it comes to guaranteeing validity. After all, we argued above that small total variation from $P$ is sufficient for $\epsilon_{2}$ invalidity. That said, simply looking at total variation ignores a particular structural feature of distributions $q$ with $I(q)>\epsilon_{2}$ – in particular, such distributions have mass in parts of space in which $P$ does not.

This observation can be used to construct tight lower bounds on the total variational distance between product measures arising from $P$ and $q$ with $I(q)>\epsilon_{2}$. This leads to the following result, which states that ERM yields a faithful representation of the data generating distribution that is at most $\epsilon_{2}$ invalid given a number of samples with a modest dependence on the validity parameter $\epsilon_{2}$.

Lemma 5.

Fix $0<\delta,\epsilon_{1},\epsilon_{2}<1$ arbitrarily, and suppose $P\in\mathcal{Q}$. If $P$ is fully-valid under $v$, and $S\sim P^{n}$ for $n\geq\Omega\left(\frac{\log(|\mathcal{Q}|)+\log(1/\delta)}{\min(\epsilon_{1}^{2},\epsilon_{2})}\right)$, then with probability $\geq 1-\delta$ over $S\sim P^{n}$, the ERM solution

$$\hat{q}=\operatorname*{arg\,min}_{q\in\mathcal{Q}}\sum_{x_{i}\in S}\log(1/q(x_{i})),$$

satisfies both

$$d_{TV}(\hat{q},P)\leq\epsilon_{1}\quad\text{and}\quad I(\hat{q})\leq\epsilon_{2}.$$

Note that this guarantee is not redundant – having $d_{TV}(\hat{q},P)\leq\epsilon_1$ does not imply $I(\hat{q})\leq\epsilon_2$ when $\epsilon_2<\epsilon_1$.
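A short worked construction makes this concrete. It is our own illustration rather than one taken from the source, and it assumes that $I(q)$ denotes the mass $q$ places on the invalid region $\{x : v(x)=0\}$. Let $r$ be any distribution supported entirely on the invalid region, and consider the mixture $q=(1-\epsilon_1)P+\epsilon_1 r$. Since $P$ is fully-valid, $P$ and $r$ have disjoint supports, so $d_{TV}(r,P)=1$, $I(P)=0$ and $I(r)=1$:

```latex
\begin{align*}
d_{TV}(q, P) &= \sup_{A}\,\bigl|\epsilon_1\,(r(A) - P(A))\bigr|
              = \epsilon_1\, d_{TV}(r, P) = \epsilon_1, \\
I(q) &= (1-\epsilon_1)\, I(P) + \epsilon_1\, I(r) = \epsilon_1 > \epsilon_2 .
\end{align*}
```

Such a $q$ meets the total variation requirement exactly, yet violates the validity requirement whenever $\epsilon_2<\epsilon_1$.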

4.3 Attaining Log-Loss Guarantees

Algorithm 1 Modifying ERM to Yield Log-Loss Guarantees
1: procedure finite_log_loss(Distribution Class $\mathcal{Q}$, $S$, $\epsilon_\wedge=\min(\epsilon_1,\epsilon_2)$)
2:     $\hat{q}_{\text{ERM}}\leftarrow\operatorname*{arg\,min}_{q\in\mathcal{Q}}\sum_{x_i\in S}\log\left(1/f_q(x_i)\right)$
3:     return $\hat{q}=(1-\epsilon_\wedge/8)\cdot\hat{q}_{\text{ERM}}+\epsilon_\wedge/8\cdot u$   ▷ Mix ERM, uniform distribution
4: end procedure

This result can be interpreted as a vote of confidence for the naive training of generative models under the log-loss. Nevertheless, from a learning-theoretic perspective, the question remains whether it is possible to guarantee low log-loss while maintaining validity with zero validity queries.

While ERM cannot possibly furnish log-loss guarantees, it turns out that it is possible to modify the output of ERM to generate log-loss guarantees at the cost of an extra polylogarithmic factor in the sample complexity, at least when the densities $f_q$ are bounded above and below in their support.¹

¹ This does not yield uniform convergence over $\mathcal{Q}$, given that the support of $q\in\mathcal{Q}$ need not align with $P$.

The idea, formalized in Algorithm 1, is simply to mix the output of ERM with the uniform distribution. Giving the uniform distribution a mixture component on the order of $\min(\epsilon_1,\epsilon_2)$ can be shown to ensure that the validity guarantees of the ERM output are preserved, while also giving the outputted distribution support across the entire space. This leads to the following theorem.

Theorem 1.

Fix $0<\delta,\epsilon_1,\epsilon_2<1$ arbitrarily, and suppose $P\in\mathcal{Q}$ and that $P$ is fully-valid under $v$. If for each $q\in\mathcal{Q}$ it holds that $\alpha\leq f_q(x)\leq\beta$ for all $x\in\text{supp}(q)$, then there is an

$$N\leq\tilde{O}\left(\frac{\log^2\left(1/\min(\epsilon_1,\epsilon_2,\alpha)\right)\cdot\big(\log(|\mathcal{Q}|)+\log(1/\delta)\big)}{\min(\epsilon_1^2,\epsilon_2)}\right),$$

such that for all $n\geq N$, with probability $\geq 1-\delta$, the output $\hat{q}$ of Algorithm 1 satisfies

$$L_P(\hat{q})\leq L_P(q^*)+\epsilon_1\quad\text{and}\quad I(\hat{q})\leq\epsilon_2.$$

Here the $\tilde{O}$ notation hides a polylogarithmic dependence on $1/\beta$, which is insignificant in most regimes, and treats the density of the uniform distribution over $\mathcal{X}$ as a constant, which would otherwise also enter polylogarithmically.

Theorem 1 shows that guarantees with respect to the unbounded log-loss are attainable improperly, i.e. when the learner can choose $q\notin\mathcal{Q}$. It’s an interesting question whether the logarithmic dependence on $\min(\epsilon_1,\epsilon_2)$ can be removed with a more subtle strategy.
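The mixing step of Algorithm 1 can be sketched on a toy problem. The snippet below is a minimal illustration under assumptions that are ours, not the paper's: the domain is finite, $\{0,\dots,K-1\}$, each model in $\mathcal{Q}$ is a length-$K$ probability vector, and the instance (`P`, `Q`, `valid`) is hypothetical.

```python
import numpy as np

def finite_log_loss(models, sample, eps1, eps2):
    """Sketch of Algorithm 1 on a finite domain {0, ..., K-1}.

    models: list of length-K probability vectors (the class Q).
    sample: 1-D integer array of draws from P.
    """
    eps_min = min(eps1, eps2)
    # Empirical log-loss of each model; zero-probability points give +inf.
    with np.errstate(divide="ignore"):
        losses = [-np.log(q[sample]).sum() for q in models]
    q_erm = models[int(np.argmin(losses))]
    # Mix the ERM output with the uniform distribution u over the domain.
    u = np.full_like(q_erm, 1.0 / len(q_erm))
    return (1 - eps_min / 8) * q_erm + (eps_min / 8) * u


# Hypothetical toy instance: points 0..2 are valid, point 3 is invalid.
valid = np.array([1, 1, 1, 0], dtype=bool)
P = np.array([0.5, 0.3, 0.2, 0.0])           # fully valid, and P is in Q
Q = [P, np.array([0.25, 0.25, 0.25, 0.25])]
rng = np.random.default_rng(0)
S = rng.choice(4, size=200, p=P)

q_hat = finite_log_loss(Q, S, eps1=0.1, eps2=0.1)
invalidity = q_hat[~valid].sum()              # mass placed on invalid points
```

In this toy instance the mixture has full support, so its log-loss is finite at every point, while the mass it places on the invalid point comes only from the uniform component, which carries total weight $\epsilon_\wedge/8$.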

4.4 Discussion of Optimality

One might suspect that achieving a smaller dependence than $1/\epsilon_2$ on the validity parameter should be impossible. We confirm this is true at least for proper learners, showing that the analysis of ERM in Lemma 5 is tight in its dependence on $\epsilon_2$. This lemma is used to generate the validity guarantee in Theorem 1.

Theorem 2.

For all $\epsilon_2<1/4$ and for all proper learners $L:S\to\mathcal{P}$, if the sample $S\sim P^n$ is of size $n\leq 1/(8\epsilon_2)$, then there exists a triple $(P,\mathcal{Q},v)$ with $P\in\mathcal{Q}$ and $P$ fully-valid, on which $L(S)$ has invalidity $I(L(S))>\epsilon_2$ with probability $\geq 1/4$.

The intuition here is that while any invalid $q\in\mathcal{Q}$ has at least $\epsilon_2$ total variation from $P$, in the worst case, the total variation between $q$ and $P$ is upper bounded by $O(\epsilon_2)$ as well. This makes distinguishing between $P$ and some $\epsilon_2$-invalid distribution hard enough to generate such a lower bound.

The sample requirement of $1/\epsilon_1^2$, both in our guarantees and in previous work, is a standard offshoot of loss estimation, irrespective of the search for a valid model. In general, one cannot expect improvements to this end – this is the standard dependence one finds for estimating the means of bounded random variables. This suggests that the “realizable complexity” for this setting is $1/\min(\epsilon_1^2,\epsilon_2)$ – while non-zero losses should not be generally estimable using “realizable” techniques, small invalidity can be guaranteed with such techniques when $P\in\mathcal{Q}$.

5 Utilizing Estimates of the Validity Function in Training

In the general formulation of the problem, the learner is given an arbitrary bounded, decreasing loss, a model class $\mathcal{Q}$ which is mismatched with $P$, and has no a priori information about the validity function $v$. In such a setting, it is clear that validity queries are necessary.

In this section, we consider a setting where it is known to the learner that $v$ can be found in a hypothesis class $\mathcal{V}$ of bounded complexity. Under such an assumption, we would hope to be able to lower the number of validity queries beyond the bounds of Hanneke et al. (2018).

5.1 Algorithm

Algorithm 2 Post-Hoc Restriction of ERM to an Estimate of Valid Outputs
1: procedure valid_restriction(Distribution Class $\mathcal{Q}$, Validity Class $\mathcal{V}$, $\epsilon_1$, $\epsilon_2$, $\delta$, $\gamma$)
2:     $S\leftarrow\Omega\left(\frac{M^2\left(\log(|\mathcal{Q}|)+\log(1/\delta)\right)}{\epsilon_1^2}\right)$ i.i.d. samples $\sim P$
3:     $\hat{q}_{\text{ERM}}\leftarrow\operatorname*{arg\,min}_{q\in\mathcal{Q}}\sum_{x\in S}l(q(x))$
4:     $S_P\leftarrow\Omega\left(\frac{M\left(D\log(M/\epsilon_1)+\log(1/\delta)\right)}{\epsilon_1}\right)$ i.i.d. samples $\sim P$,
       $S_{\hat{q}_{\text{ERM}}}\leftarrow\Omega\left(\frac{D\log(1/\gamma\epsilon_2)+\log(1/\delta)}{\gamma\epsilon_2}\right)$ i.i.d. samples $\sim\hat{q}_{\text{ERM}}$
5:     $\hat{v}\leftarrow\operatorname*{arg\,min}_{h\in\mathcal{V}}\sum_{x\in S_P\cup S_{\hat{q}_{\text{ERM}}}}\mathbb{1}[h(x)\neq v(x)]$   ▷ Label $S_{\hat{q}_{\text{ERM}}}$ via queries to $v$
6:     return $f_{\hat{q}}\propto f_{\hat{q}_{\text{ERM}}}(x)\cdot\mathbb{1}[\hat{v}(x)=1]$ if $\hat{q}_{\text{ERM}}\left(\{\hat{v}(x)=1\}\right)>0$ else $f_{\hat{q}}=f_{\hat{q}_{\text{ERM}}}$
7: end procedure

A natural algorithm in this setting is to “correct” the invalidity of the empirical risk minimizer – to restrict the empirical risk minimizer to parts of space which are valid with respect to an estimate $\hat{v}$ of the validity function. This is precisely the idea formalized in Algorithm 2.

To generate guarantees for such a strategy, one must determine the distribution with respect to which the estimate $\hat{v}$ should be accurate. In our case, we generate accuracy guarantees over both $P$ and the ERM model $\hat{q}_{\text{ERM}}$ by selecting an estimate $\hat{v}$ that has 0 empirical error on samples from both distributions. The query complexity of the algorithm comes from the fact that samples arising from $\hat{q}_{\text{ERM}}$ must be labeled by oracle calls to $v$. Noting that samples from $P$ can be automatically labeled as valid by the full-validity of $P$ saves a constant factor over naively labeling all examples acquired in the second half of the algorithm.

Accuracy under samples from $P$ allows one to control the loss of $\hat{q}$ by invoking the boundedness of the loss in the disagreement region of $\hat{v}$ and $v$, and guarantees with respect to $\hat{q}_{\text{ERM}}$ allow us to bound the invalidity of the restriction. Because $P$ is fully-valid, loss contributions from the agreement region of $v$ and $\hat{v}$ correspond to parts of space where $\hat{v}(x)=1$ – as the loss is non-increasing, placing more mass in such parts of space can never increase the loss contribution attributable to integrating over this region.

Algorithm 2 also requires a parameter $\gamma>0$. This parameter should be a validity lower bound on the models $q\in\mathcal{Q}$, providing a safeguard against the possibility of an “invalidity blowup” when restricting the ERM output to a certain region of space – one must normalize the restriction to output a probability distribution, which in this case means increasing mass in parts of space that are estimated to be valid. An a priori lower bound on the validity allows for precise enough estimation of $\hat{v}$ that increasing the mass in such regions is unlikely to lead to appreciable invalidity in the final model.

It’s possible that the restriction of the ERM to the estimated valid region is undefined – this happens if and only if the estimated valid region has zero mass under the ERM. Given validity lower bounds for models $q\in\mathcal{Q}$, this is a low probability event which can occur only when estimation of the validity function is very poor relative to the query complexity. As one might imagine, the handling of this case is immaterial for PAC-guarantees. We choose to arbitrarily define behavior in this case by outputting the ERM model.
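The steps of Algorithm 2 can be sketched in code on a toy problem. The following is only an illustration under our own assumptions: the domain is finite, models are probability vectors, the validity class is a set of threshold functions, and the sample sizes `n_erm`, `n_P`, `n_q` are passed in directly rather than set through the $\Omega(\cdot)$ expressions of the algorithm. All names are hypothetical.

```python
import numpy as np

def valid_restriction(models, hyps, loss, sample_P, oracle_v,
                      n_erm, n_P, n_q, rng):
    """Sketch of Algorithm 2 on a finite domain {0, ..., K-1}.

    models: list of length-K probability vectors (the class Q).
    hyps: list of length-K 0/1 arrays (the validity class V).
    loss: non-increasing bounded loss applied to model probabilities.
    sample_P: callable n -> integer array of n draws from P.
    oracle_v: callable x -> 0/1 validity labels (each entry is one query).
    """
    K = len(models[0])
    # ERM over Q on a sample from P.
    S = sample_P(n_erm)
    q_erm = models[int(np.argmin([loss(q[S]).sum() for q in models]))]
    # Samples from P are labelled valid for free (P is fully valid);
    # samples from the ERM model are labelled through validity queries.
    S_P = sample_P(n_P)
    S_q = rng.choice(K, size=n_q, p=q_erm)
    xs = np.concatenate([S_P, S_q])
    ys = np.concatenate([np.ones(len(S_P), dtype=int), oracle_v(S_q)])
    # Pick the hypothesis with the fewest empirical mistakes.
    v_hat = hyps[int(np.argmin([(h[xs] != ys).sum() for h in hyps]))]
    # Restrict the ERM to the estimated valid region and renormalize,
    # falling back to the ERM itself if that region has zero mass.
    restricted = q_erm * v_hat
    mass = restricted.sum()
    return restricted / mass if mass > 0 else q_erm


# Hypothetical toy instance: points 0..5 are valid, 6..9 are invalid.
rng = np.random.default_rng(0)
K = 10
v = (np.arange(K) < 6).astype(int)
hyps = [(np.arange(K) < t).astype(int) for t in range(K + 1)]
P = np.array([0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0])
Q = [np.full(K, 0.1), np.linspace(1, 10, K) / 55.0]   # P is not in Q

def capped_log_loss(p, M=5.0):
    # min(M, log(1/p)), with p floored to avoid log(0).
    return np.minimum(M, -np.log(np.maximum(p, 1e-300)))

q_hat = valid_restriction(Q, hyps, capped_log_loss,
                          lambda n: rng.choice(K, size=n, p=P),
                          lambda x: v[x], 500, 500, 500, rng)
invalidity = q_hat[v == 0].sum()
```

Here the ERM is the uniform model, which places mass 0.4 on invalid points; after restriction to the estimated valid region, that mass is moved onto points estimated to be valid. Only the `n_q` samples drawn from the ERM model consume validity queries.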

5.2 Guarantees

The restricted output $\hat{q}$ of Algorithm 2 admits the following guarantee over loss sub-optimality and invalidity.

Theorem 3.

Suppose $v\in\mathcal{V}$ with VC-dimension $VC(\mathcal{V})\leq D$, and that for each $q\in\mathcal{Q}$, the validity $V(q)\geq\gamma>0$. For all $0<\epsilon_1,\epsilon_2,\delta\leq 1$ and for all choices of non-increasing loss functions $l:\mathbb{R}^{\geq 0}\to[0,M]$, Algorithm 2 requires a number of samples

$$\leq O\left(\frac{M^2\left(\log(|\mathcal{Q}|)+\log(1/\delta)\right)}{\epsilon_1^2}+\frac{M\left(D\log(M/\epsilon_1)+\log(1/\delta)\right)}{\epsilon_1}\right),$$

and a number of validity queries

$$\leq O\left(\frac{D\log(1/\gamma\epsilon_2)+\log(1/\delta)}{\gamma\epsilon_2}\right),$$

to ensure that with probability $\geq 1-\delta$, its output enjoys

$$L_P(\hat{q};l)\leq L_P(q^*;l)+\epsilon_1\quad\text{and}\quad I(\hat{q})\leq\epsilon_2.$$

Thus, in regimes where e.g. $\gamma\geq\Omega(\epsilon_1)$, $D=\Theta(\log(|\mathcal{Q}|))$, this guarantee represents a reduced number of queries under the $\tilde{O}(M^2\log(|\mathcal{Q}|)/\epsilon_1^2\epsilon_2)$ bound of Hanneke et al. (2018). It also implies a “decoupling” of the query complexity from $\epsilon_1$.
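To get a rough feel for this comparison, one can evaluate the leading-order terms of the two query bounds numerically. The sketch below drops all constants and hidden logarithmic factors, reads $\log(1/\gamma\epsilon_2)$ as $\log(1/(\gamma\epsilon_2))$, and uses hypothetical parameter values, so the resulting numbers are indicative only.

```python
import math

def queries_theorem3(D, gamma, eps2, delta):
    """Leading-order query bound of Theorem 3, constants dropped."""
    return (D * math.log(1 / (gamma * eps2)) + math.log(1 / delta)) / (gamma * eps2)

def queries_prior(M, log_Q, eps1, eps2):
    """Leading-order M^2 log|Q| / (eps1^2 eps2), log factors dropped."""
    return M**2 * log_Q / (eps1**2 * eps2)

# Hypothetical regime: gamma = eps1 and D = log|Q|.
eps1, eps2, delta, M, log_Q = 0.01, 0.01, 0.05, 1.0, 20.0
ours = queries_theorem3(D=log_Q, gamma=eps1, eps2=eps2, delta=delta)
prior = queries_prior(M, log_Q, eps1, eps2)
ratio = prior / ours
```

With these illustrative values the leading term of the prior bound is roughly ten times larger, and the gap widens as $\epsilon_1$ shrinks with $\gamma$ held fixed, since $\epsilon_1$ does not appear in the query bound of Theorem 3.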

We note that the sample requirement from $P$ is increased in certain regimes over the $\tilde{O}(\log(|\mathcal{Q}|)/\epsilon_1^2)$ requirement of Hanneke et al. (2018). This is, however, not a concern in most settings where validity queries are expensive. If samples from a fully-valid $P$ are readily obtainable, the setting is analogous to that of active learning, where focus is directed to the number of labels requested in training.

Even if $P$ must be constructed by “accepting” valid samples from some unfiltered $P'$, a comparison between the query complexity of Theorem 3 and the query bound of Hanneke et al. (2018) is often still representative of the relative data costs of the schemes. Supposing $P'$ produces valid samples with constant probability, the total number of validity queries made by each scheme is proportional to the scheme’s sample requirements from $P$, plus the number of validity queries used in its execution. Essentially, to yield validity query speedups, our scheme requires a VC bound on $\mathcal{V}$ which does not dwarf $\log(|\mathcal{Q}|)$. Thus, in most cases of interest, querying the validity of $\tilde{O}\left(MD/\epsilon_1\right)$ extra samples is asymptotically inconsequential relative to both the $O(M^2\log(|\mathcal{Q}|)/\epsilon_1^2)$ samples and the $\tilde{O}\left(D/\gamma\epsilon_2\right)$ query budget required to execute the algorithm given access to a fully-valid $P$.

5.3 Better Query Complexity Bounds

5.3.1 Exploiting the Power of Active Learning

Theorem 3 presents a somewhat pessimistic view of the potential of such a “post-filtering” scheme.

Firstly, it ignores the potential of active learners to improve query complexities over passive sampling. Query complexities in active learning of classifiers are often expressed in terms of the “disagreement coefficient” Hanneke (2007), often denoted via $\theta$. In the realizable setting, query complexities of active learning look like $\tilde{O}(D\theta\log(1/\epsilon))$ Dasgupta (2011). Definitionally, it can be shown that $\theta\leq 1/\epsilon$. Thus, proving the gains of active learning algorithms usually relies on bounding the disagreement coefficient non-trivially, i.e. showing $\theta<o(1/\epsilon)$, or ideally, $\theta\leq O(1)$.

While this is challenging, as $\theta$ is both a class and distribution-dependent quantity, there is a literature that addresses this potential in various settings – see the references in Hanneke (2014). In principle, one could use such an analysis to show that the query complexity of an active learning modification of Algorithm 2 is on a lower order than the guarantee of Theorem 3 when conditions are favorable. To this end, it may be useful to note that a modification of Algorithm 2 wherein $\hat{v}$ is selected as the ERM on a dataset generated by a mixture of $P$ and $\hat{q}_{\text{ERM}}$ admits guarantees as well.

5.3.2 Only Low-Loss Models Need Appreciable Validity

Another source of potential looseness in the statement of Theorem 3 is that it phrases the query complexity in terms of the worst-case validity over models $q\in\mathcal{Q}$. This is unnecessary – with high probability, in the first step of the algorithm, one selects a model $q\in\mathcal{Q}$ with $O(\epsilon_1)$ true loss. Thus, what really matters for such a strategy is that models that have relatively small loss $l$ do not have invalidity nearing 1.

This is a realistic scenario in the case that the loss function $l$ – despite possibly not being proper – prioritizes models which in some sense resemble the data generating distribution $P$. Indeed, it’s somewhat difficult to envision a situation where a loss would be chosen that prioritizes models with no relation to the data generating distribution. To this end, we give the following corollary to Theorem 3.

Corollary 1.

Under the conditions of Theorem 3, if in addition it holds that all models $q\in\mathcal{Q}$ with loss sub-optimality $\epsilon_1$ have validity greater than some constant, then the query complexity of Algorithm 2 can be improved to

$$\leq O\left(\frac{D\log(1/\epsilon_2)+\log(1/\delta)}{\epsilon_2}\right).$$

5.3.3 Removing the Positive Validity Requirement

Using an idea found in the algorithm of Hanneke et al. (2018), one can show that if a learner has access to a single distribution $\mathcal{D}$ with a density and at least some non-zero constant validity, and the densities $f_q$ are bounded above, then Algorithm 2 can be modified so as to drop the requirement of positive validity over models when learning under the capped log-loss.

By mixing $\hat{q}_{\text{ERM}}$ with $\mathcal{D}$, giving mixture component $O(\epsilon_1)$ to $\mathcal{D}$, one can generate guarantees similar to those of Theorem 3. The modification can be found in the Appendix as Algorithm 3, and enjoys the following guarantee.

Theorem 4.

Suppose $v\in\mathcal{V}$ where $VC(\mathcal{V})\leq D$, and that for each $q\in\mathcal{Q}$, we have $f_q(x)\leq\beta$. Suppose further that there is some known $\mathcal{D}\in\mathcal{P}$ with density $f_{\mathcal{D}}$ which has $V(\mathcal{D})\geq c>0$ for some constant $c$. Then for all choices of $0<\epsilon_1,\epsilon_2,\delta<1/2$ and $M>0$, Algorithm 3 requires a number of samples

$$\leq\tilde{O}\left(\frac{M^2\left(\log(|\mathcal{Q}|)+\log(1/\delta)\right)}{\epsilon_1^2}+\frac{M\left(D\log(M/\epsilon_1)+\log(1/\delta)\right)}{\epsilon_1}\right),$$

and a number of validity queries

$$\leq O\left(\frac{D\log(1/\epsilon_1\epsilon_2)+\log(1/\delta)}{\epsilon_1\epsilon_2}\right),$$

to ensure that with probability $\geq 1-\delta$, its output enjoys

$$\mathbb{E}_{X\sim P}\bigg[\min\left(M,\ \log(1/f_{\hat{q}}(X))\right)\bigg]\leq\mathbb{E}_{X\sim P}\bigg[\min\left(M,\ \log(1/f_{q^*}(X))\right)\bigg]+\epsilon_1\quad\text{and}\quad I(\hat{q})\leq\epsilon_2.$$

Here, the $\tilde{O}$ notation again hides factors polylogarithmic in $1/\beta$.

Note that the $\tilde{O}$ now appears in the sample complexity. This simply reflects the fact that the $M$-capped log-loss can range between $\log(1/\beta)$ and $M$ when working with densities bounded above by $\beta\geq 1$. In the case that densities can be bounded above by $1$, as in the discrete setting of Hanneke et al. (2018), this dependence disappears.
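To make the range remark concrete, the following minimal sketch evaluates the $M$-capped log-loss $\min(M,\log(1/f))$ at a few density values. The function name `capped_log_loss` and the particular values of `M` and `beta` are illustrative assumptions and do not come from the paper.

```python
import math


def capped_log_loss(density: float, cap: float) -> float:
    """Return min(cap, log(1/density)), treating log(1/0) as +infinity."""
    if density < 0:
        raise ValueError("density must be non-negative")
    raw = math.inf if density == 0 else -math.log(density)
    return min(cap, raw)


# Hypothetical cap M and density upper bound beta >= 1.
M, beta = 5.0, 4.0
print(capped_log_loss(beta, M))  # smallest attainable loss, log(1/beta) <= 0
print(capped_log_loss(0.0, M))   # zero density is capped at M
```

When `beta` equals 1 the smallest attainable loss is 0, which matches the remark that the dependence on $\beta$ disappears in that case.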

6 Conclusion

This work is intended as a first look into settings closer to the common case, where ensuring validity may be relatively cheap.

A more thorough investigation of the log-loss, as well as capped variants, seems a very relevant line of further inquiry, given the widespread use of this family in practice and its useful information-theoretic properties. A natural extension to the first part of this work would be to consider learning in the agnostic case $P\notin\mathcal{Q}$ under the log-loss, where one would hope to be able to exploit these properties and the validity of training samples to keep the number of validity queries low.

Characterizing the sample and query demands of validity-constrained distribution learning is challenging in general, since proving lower bounds requires arguing against learners with two tools at their disposal: sampling and actively querying validity. Work in this direction will likely require some creative constructions.

References

  • Perez et al. [2022] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022.
  • Abid et al. [2021] Abubakar Abid, Maheen Farooqi, and James Zou. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 298–306, 2021.
  • Hanneke et al. [2018] Steve Hanneke, Adam Kalai, Gautam Kamath, and Christos Tzamos. Actively avoiding nonsense in generative models. CoRR, abs/1802.07229, 2018. URL http://arxiv.org/abs/1802.07229.
  • Kaneko and Harada [2021] Takuhiro Kaneko and Tatsuya Harada. Blur, noise, and compression robust generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13579–13589, 2021.
  • Odena et al. [2016] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 1(10):e3, 2016.
  • Valiant [1984] L. G. Valiant. A theory of the learnable. Communications of the ACM, pages 1134–1142, 1984.
  • Kearns et al. [1994] Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert E. Schapire, and Linda Sellie. On the learnability of discrete distributions. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, STOC ’94, pages 273–282, New York, NY, USA, 1994. Association for Computing Machinery. ISBN 0897916638. doi: 10.1145/195058.195155. URL https://doi.org/10.1145/195058.195155.
  • Haghtalab et al. [2019] Nika Haghtalab, Cameron Musco, and Bo Waggoner. Toward a characterization of loss functions for distribution learning. CoRR, abs/1906.02652, 2019. URL http://arxiv.org/abs/1906.02652.
  • Daskalakis et al. [2013] Constantinos Daskalakis, Ilias Diakonikolas, Ryan O’Donnell, Rocco A Servedio, and Li-Yang Tan. Learning sums of independent integer random variables. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 217–226. IEEE, 2013.
  • Kalai et al. [2010] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two gaussians. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 553–562, 2010.
  • Daskalakis et al. [2012] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A Servedio. Learning poisson binomial distributions. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 709–728, 2012.
  • Brier [1950] Glenn W. Brier. Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1), January 1950. doi: 10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
  • Good [1952] Irving John Good. Rational decisions. Journal of the Royal Statistical Society: Series B (Methodological), 14(1):107–114, 1952.
  • Gneiting and Raftery [2007] Tilmann Gneiting and Adrian Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102:359–378, 03 2007. doi: 10.1198/016214506000001437.
  • Frongillo et al. [2022] Rafael Frongillo, Dhamma Kimpara, and Bo Waggoner. Proper losses for discrete generative models, 2022. URL https://arxiv.org/abs/2211.03761.
  • Lehmann et al. [1986] Erich Leo Lehmann, Joseph P Romano, and George Casella. Testing statistical hypotheses, volume 3. Springer, 1986.
  • Neyman and Pearson [1933] Jerzy Neyman and Egon Sharpe Pearson. On the Problem of the Most Efficient Tests of Statistical Hypotheses. Phil. Trans. Roy. Soc. Lond. A, 231(694-706):289–337, 1933. doi: 10.1098/rsta.1933.0009.
  • Rebeschini [2021] Patrick Rebeschini. Minimax lower bounds and hypothesis testing, December 2021. URL https://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/22/material/lecture16.pdf.
  • Reiss [1981] R.D. Reiss. Approximation of product measures with an application to order statistics. The Annals of Probability, 9(2):335 – 341, 1981. doi: 10.1214/aop/1176994477.
  • Kusner et al. [2017] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In International conference on machine learning, pages 1945–1954. PMLR, 2017.
  • Janz et al. [2017] David Janz, Jos van der Westhuizen, Brooks Paige, Matt J Kusner, and José Miguel Hernández-Lobato. Learning a generative model for validity in complex discrete structures. arXiv preprint arXiv:1712.01664, 2017.
  • Isola et al. [2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
  • Aitken et al. [2017] Andrew Aitken, Christian Ledig, Lucas Theis, Jose Caballero, Zehan Wang, and Wenzhe Shi. Checkerboard artifact free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize. arXiv preprint arXiv:1707.02937, 2017.
  • Kong and Chaudhuri [2023] Zhifeng Kong and Kamalika Chaudhuri. Data redaction from pre-trained gans. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 638–677. IEEE, 2023.
  • Hanneke [2014] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7:131–309, 01 2014. doi: 10.1561/2200000037.
  • Hanneke [2007] Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, page 353–360, 2007.
  • Dasgupta [2011] Sanjoy Dasgupta. Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781, 2011. ISSN 0304-3975.
  • Driver [2010] Bruce Driver. Probability tools with examples, June 2010. URL https://mathweb.ucsd.edu/~tkemp/280A/Driver-Probability-Lecture-Notes.pdf.
  • Steerneman [1983] Ton Steerneman. On the total variation and hellinger distance between signed measures; an application to product measures. Proceedings of the American Mathematical Society, 88(4):684–688, 1983.
  • Zhao and Lai [2022] Puning Zhao and Lifeng Lai. Analysis of knn density estimation. IEEE Transactions on Information Theory, 2022. doi: 10.1109/TIT.2022.3195870.

7 Appendix

7.1 Probability Distributions and Measure Theoretic Formalism

We work over Euclidean space $\mathbb{R}^{d}$ and let $\mathcal{X}$ be a Lebesgue measurable set with Lebesgue measure $\lambda(\mathcal{X})<\infty$. By $\lambda$ and Lebesgue measurable set, we refer to the measure and $\sigma$-algebra $\mathcal{F}$ arising from the usual construction of Lebesgue measure on $\mathbb{R}^{d}$. By “distribution”, we mean a probability measure on the measurable space $(\mathcal{X},\mathcal{F}_{\mathcal{X}})$, where $\mathcal{F}_{\mathcal{X}}=\{E\cap\mathcal{X}:E\in\mathcal{F}\}$. Let $u$, the uniform distribution, be the measure given by $u(E)=\lambda(E)/\lambda(\mathcal{X})$ for $E\in\mathcal{F}_{\mathcal{X}}$. Let $\mathcal{P}$ denote the set of all probability measures on $(\mathcal{X},\mathcal{F}_{\mathcal{X}})$.

We assume that $P\in\mathcal{P}$ and $q\in\mathcal{Q}\subset\mathcal{P}$ have densities $f_{P}$ and $f_{q}$ with respect to the reference measure $\lambda$. At times, it will be useful to assume that densities are bounded away from zero in certain parts of space. By saying densities are bounded in their support by $\beta>\alpha>0$, we mean that for all $x\in supp(q):=\textnormal{cl}\{x:f_{q}(x)>0\}$, we have $\alpha\leq f_{q}(x)\leq\beta$, where the closure is defined through open balls in the Euclidean metric. Note that in this setting, we have $q(supp(q))=1$, as $q(supp(q)^{c})=\int\mathbbm{1}[x\in supp(q)^{c}]f_{q}(x)\,d\lambda(x)=\int 0\,d\lambda=0$.

Denote via $q^{n}=q\otimes\dots\otimes q$ the product measure over the measurable space $(\mathcal{X}^{\otimes n},\mathcal{F}_{\mathcal{X}}^{\otimes n})$. Such a measure corresponds to the process of taking $n$ i.i.d. samples $\sim q$. Denote the density of $q^{n}$ with respect to $\lambda^{n}$ via $f^{n}_{q}$.

We define $\log(1/0)=\infty$. Following the conventions in Driver [2010], we say that $\mathbbm{1}[x\in E]\cdot g(x)=0$ if $g(x)<\infty$ for $x\in E$ and $g(x)=\infty$ for some $x\in E^{c}$. This allows us to integrate over the finite part of functions and get a finite result.

To facilitate digestibility, we refrain from measure theoretic notation as much as possible. It is at times useful, particularly in dealing with total variation. We assume throughout that all functions we encounter in the Appendix, including the fixed validity function $v$ and functions in the validity class $\mathcal{V}$, are $(\mathcal{X},\mathcal{F}_{\mathcal{X}})$-measurable.

7.2 Estimates of Validity, Invalidity

We fix the validity function $v$ as an arbitrary function $v:\mathcal{X}\to\{0,1\}$ measurable with respect to each distribution $q$ arising in the Appendix. As discussed above, for a given model $q\in\mathcal{P}$, the “validity” of $q$ is the quantity $V(q)=\textnormal{Pr}_{X\sim q}\left(v(X)=1\right)$, and the “invalidity” is $I(q)=1-V(q)$. We will at times be interested in estimating the validity of a model $q$ using samples from $q$ along with validity queries. Given an i.i.d. sample $\{X_{i}\}_{i=1}^{n}$ from $q$, we let $\hat{V}(q)=\sum_{i=1}^{n}v(X_{i})/n$ be the natural estimate of the validity of $q$.

At times, we will be interested in the validity of a model under an estimate of the underlying validity function. To this end, given a model $q\in\mathcal{P}$ and a function $g:\mathcal{X}\to\{0,1\}$, we let $V_{g}(q)=\textnormal{Pr}_{X\sim q}\left(g(X)=1\right)$. Given a sample $\{X_{i}\}_{i=1}^{n}$, let $\hat{V}_{g}(q)=\sum_{i=1}^{n}g(X_{i})/n$. Note that in the language of this notation, we have $V(q)=V_{v}(q)$ and $\hat{V}(q)=\hat{V}_{v}(q)$. We extend this notation in the natural way to invalidity quantities.
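The estimators above amount to empirical frequencies, which the following minimal sketch illustrates. The choice of $q$ as the uniform distribution on $[0,1]$, the threshold validity function `v`, and the helper names are hypothetical and serve only as an example.

```python
import random
from typing import Callable, Sequence


def empirical_validity(samples: Sequence[float], g: Callable[[float], int]) -> float:
    """Fraction of samples that g labels valid, i.e. sum_i g(X_i) / n."""
    if len(samples) == 0:
        raise ValueError("need at least one sample")
    return sum(g(x) for x in samples) / len(samples)


def empirical_invalidity(samples: Sequence[float], g: Callable[[float], int]) -> float:
    """Natural estimate of invalidity, one minus the empirical validity."""
    return 1.0 - empirical_validity(samples, g)


# Hypothetical setup: q is uniform on [0, 1]; a point is valid iff x <= 0.8,
# so the true validity is V(q) = 0.8.
def v(x: float) -> int:
    return 1 if x <= 0.8 else 0


rng = random.Random(0)
draws = [rng.random() for _ in range(10_000)]
print(empirical_validity(draws, v))    # should land close to 0.8
print(empirical_invalidity(draws, v))  # should land close to 0.2
```

Each call to `v` on a drawn sample corresponds to one validity query, so the number of queries spent by this estimate equals the sample size.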

7.3 Analysis of Empirical Risk Minimization, Improper Algorithm in Realizable Setting

To begin our analysis of the realizable setting, we first observe that models with appreciable invalidity look very different from a fully-valid data generating distribution: because they must have mass in parts of space where $P$ does not, they are separated in total variation from $P$ by a margin. We formalize this idea via the following.

Lemma 1.

Fix $0<\epsilon<1$ arbitrarily. For any validity function $v$, if $q\in\mathcal{P}$ has $I(q)>\epsilon$, and $P\in\mathcal{P}$ has $I(P)=0$, then

$$d_{TV}(P,q)>\epsilon.$$
Proof.

Fix the validity function arbitrarily. Consider the event $E_{\neg v}=\{x\in\mathcal{X}:v(x)=0\}$. Then we have $d_{TV}(P,q)=\sup_{E\subseteq\mathcal{X}}|P(E)-q(E)|\geq q(E_{\neg v})-P(E_{\neg v})>\epsilon$, where we have used that $I(P)=0$ implies $P(E_{\neg v})=0$. ∎
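As a sanity check of Lemma 1, the following sketch compares total variation with invalidity on a three-point space. The finite setting replaces the densities with respect to $\lambda$ used in the paper, and the distributions `P` and `q` and the labels in `valid` are made up for illustration.

```python
def total_variation(p: dict, q: dict) -> float:
    """sup_E |p(E) - q(E)| on a finite space, which equals half the L1 distance."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)


def invalidity(dist: dict, valid: dict) -> float:
    """Mass that dist places on points labelled invalid."""
    return sum(mass for x, mass in dist.items() if not valid[x])


# Hypothetical example: "c" is the only invalid point.
valid = {"a": True, "b": True, "c": False}
P = {"a": 0.5, "b": 0.5, "c": 0.0}  # fully valid, I(P) = 0
q = {"a": 0.6, "b": 0.2, "c": 0.2}  # I(q) = 0.2

print(invalidity(q, valid), total_variation(P, q))
```

Here the total variation exceeds the invalidity of `q`, since `q` also differs from `P` on the valid points; the lemma only uses the mass on the invalid event.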

We next extend this observation to the associated product measures, which are the main target of analysis under i.i.d. sampling from $P$. The idea is to lower bound the total variation between product measures by the difference in probabilities on the event that at least one example from a sample of size $n$ is invalid. For each individual draw, invalidity happens with probability $0$ under $P$ and with probability at least $\epsilon$ under any model $q$ with at least $\epsilon$ invalidity. Thus, identically to how one shows realizable rates for classification tasks, we attain a large total variation gap between $P$ and any $q$ with appreciable invalidity; mistakes in classification are analogous to the generation of “invalid” samples in our setting.

Lemma 2.

Fix $0<\epsilon<1$ and $n\in\mathbb{N}\setminus\{0\}$ arbitrarily. For any validity function $v$, if $q\in\mathcal{P}$ has $I(q)>\epsilon$, and $P\in\mathcal{P}$ has $I(P)=0$, then

$$d_{TV}(P^{n},q^{n})>1-e^{-n\epsilon}.$$
Proof.

Fix the validity function arbitrarily. Consider lower bounding the total variation between the product measures via the magnitude of the difference of their measures on the event

$$E_{\geq 1}=\bigg\{(x_{1},\dots,x_{n})\in\mathcal{X}^{n}:\exists i\ \text{s.t.}\ v(x_{i})=0\bigg\}.$$

Because $P$ has perfect validity, any given draw from $P$ has probability $0$ of being invalid. Thus, $P^{n}(E_{\geq 1})=0$. On the other hand, the invalidity of $q$ states that each of the $n$ draws is invalid with probability $q(\{x\in\mathcal{X}:v(x)=0\})>\epsilon$. Let $E_{v}=\{x\in\mathcal{X}:v(x)=1\}$, and note that $q(E_{v})<1-\epsilon$. Then we have

$$\begin{aligned}
q^{n}(E_{\geq 1}) &= 1-q(E_{v})^{n}\\
&> 1-(1-\epsilon)^{n}\\
&\geq 1-e^{-n\epsilon},
\end{aligned}$$

where the final inequality follows from the fact that $(1+x/n)^{n}\leq e^{x}$ for $x\geq -n$, applied with $x=-n\epsilon$. ∎
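The chain of inequalities in this proof can be checked numerically. The sketch below assumes a model whose per-draw invalid mass is exactly `eps` (the boundary case of the lemma's hypothesis); the function names are illustrative.

```python
import math


def prob_at_least_one_invalid(invalid_mass: float, n: int) -> float:
    """q^n(E_{>=1}) = 1 - q(E_v)^n for n i.i.d. draws with the given invalid mass."""
    return 1.0 - (1.0 - invalid_mass) ** n


def lemma2_lower_bound(eps: float, n: int) -> float:
    """The bound 1 - exp(-n * eps) appearing in Lemma 2."""
    return 1.0 - math.exp(-n * eps)


eps = 0.05  # hypothetical invalidity level
for n in (1, 10, 100):
    print(n, prob_at_least_one_invalid(eps, n), lemma2_lower_bound(eps, n))
```

Both quantities approach 1 as `n` grows, which is the sense in which a handful of samples already separates $P^{n}$ from $q^{n}$.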

We can then borrow from classical analysis of hypothesis testing given by the Neyman-Pearson lemma to leverage this gap in total variation between product measures into a bound on the probability that after $n$ samples, a model with appreciable invalidity has a smaller loss than $P$.

Lemma 3.

Fix $0<\epsilon<1$ and $n\in\mathbb{N}\setminus\{0\}$ arbitrarily. For any validity function $v$, if $q\in\mathcal{P}$ has $I(q)>\epsilon$, $P\in\mathcal{P}$ has $I(P)=0$, and $q$ and $P$ have densities with respect to the reference measure $\lambda$, then

$$\textnormal{Pr}_{S\sim P^{n}}\bigg(q^{n}(S)\geq P^{n}(S)\bigg)\leq e^{-n\epsilon}.$$
Proof.

The proof follows that of the Neyman-Pearson lemma’s claim that the Likelihood Ratio Test achieves the lower bound on the sum of Type I and Type II errors Rebeschini [2021], combined with Lemma 2. We give the full argument for completeness.

Fix the validity function arbitrarily, and note the following string of relations:

$$\begin{aligned}
\textnormal{Pr}_{S\sim P^{n}}\bigg(q^{n}(S)\geq P^{n}(S)\bigg) &= \int\mathbbm{1}[f_{q}^{n}(x)\geq f_{P}^{n}(x)]f_{P}^{n}(x)\,d\lambda^{n}(x)\\
&\leq \int\min\left(f_{q}^{n}(x),f_{P}^{n}(x)\right)d\lambda^{n}(x)\\
&= 1-d_{TV}(P^{n},q^{n})\\
&\leq e^{-n\epsilon},
\end{aligned}$$

where the switch to total variation in the second to last line is the result of a classic characterization of total variation given by “Scheffé’s Theorem”, and the final line comes from Lemma 2. ∎
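The steps of this argument can be verified exactly on a small finite space, where sums replace the integrals against $\lambda^{n}$. The sketch below is only a discrete analogue under that assumption; the distributions `P` and `q` and all helper names are hypothetical.

```python
from itertools import product
from math import exp, prod


def product_dist(dist: dict, n: int) -> dict:
    """Distribution of n i.i.d. draws from a finite distribution."""
    return {xs: prod(dist[x] for x in xs) for xs in product(dist, repeat=n)}


def overlap(p: dict, q: dict) -> float:
    """Sum of min(p, q), which equals 1 - d_TV(p, q) by Scheffe's identity."""
    return sum(min(p[x], q[x]) for x in p)


def total_variation(p: dict, q: dict) -> float:
    """Half the L1 distance between two distributions on the same finite space."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)


def prob_q_likelihood_wins(p: dict, q: dict) -> float:
    """Pr_{S ~ p}( q(S) >= p(S) )."""
    return sum(p[x] for x in p if q[x] >= p[x])


# Hypothetical example: "c" is the invalid point, so I(P) = 0 and I(q) = 0.2.
P = {"a": 0.5, "b": 0.5, "c": 0.0}
q = {"a": 0.6, "b": 0.2, "c": 0.2}
for n in (1, 2, 4):
    Pn, qn = product_dist(P, n), product_dist(q, n)
    print(n, prob_q_likelihood_wins(Pn, qn), overlap(Pn, qn), exp(-0.2 * n))
```

For each `n`, the first printed probability is bounded by the overlap, and the overlap is in turn bounded by `exp(-0.2 * n)`, mirroring the displayed chain with $\epsilon$ taken up to the invalidity of `q`.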

To generate loss guarantees, we need to be able to reason about models which do not have appreciable invalidity. The next lemma is an analogue of the previous one: it is identical up to the replacement of the condition that $q$ have appreciable invalidity with a weaker one that the total variation between a model $q$ and the data distribution $P$ is appreciable. When $q$ does not have appreciable invalidity, we can no longer rely on the structure of the supports of $q$ and $P$ to generate large margins for the total variation separation of product measures, and have to fall back on more general estimates for total variation between product measures.

Lemma 4.

Fix $0<\epsilon,\delta<1$ arbitrarily, and let $P,q\in\mathcal{P}$ be distributions with densities with respect to $\lambda$. Then if $d_{TV}(q,P)\geq\epsilon$, and $S\sim P^{n}$ for $n\geq\Omega(\log(1/\delta)/\epsilon^{2})$, it holds with probability $\geq 1-\delta$ that

$$L_{S}(P)<L_{S}(q).$$
Proof.

When $q$ and $P$ both possess densities, we can relate the probability that the likelihood of $q$ is at least that of $P$ to their total variation, as in the Neyman-Pearson lemma:

$$\begin{aligned}
\textnormal{Pr}_{S\sim P^{n}}\bigg(P^{n}(S)\leq q^{n}(S)\bigg) &= \int\mathbbm{1}[f_{P}^{n}(x)\leq f_{q}^{n}(x)]f_{P}^{n}(x)\,d\lambda^{n}(x)\\
&\leq \int\min\left(f_{P}^{n}(x),f_{q}^{n}(x)\right)d\lambda^{n}(x)\\
&= 1-d_{TV}(P^{n},q^{n})\\
&\leq e^{-n\,d_{TV}(P,q)^{2}/2}\\
&\leq e^{-n\epsilon^{2}/2}.
\end{aligned}$$

Here, the second to last inequality is a consequence of a powerful result of Reiss [1981] (and later Steerneman [1983]), namely that for any two collections of probability measures $\{q_{i}\}_{i=1}^{n}$ and $\{p_{i}\}_{i=1}^{n}$ over measurable spaces $\{(\mathcal{X}_{i},\mathcal{F}_{i})\}_{i=1}^{n}$, the product measures $q^{n}$ and $p^{n}$ over the respective collections satisfy

$$1-\exp\left(-\frac{1}{2}\sum_{i=1}^{n}d_{TV}(q_{i},p_{i})^{2}\right)\leq d_{TV}(q^{n},p^{n}).$$

The final inequality follows from assumed gap in total variation between P𝑃Pitalic_P and q𝑞qitalic_q. ∎
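The product-measure bound can be sanity-checked numerically in a case where the total variation distance between product measures is exactly computable. The following is a minimal sketch, assuming two Bernoulli distributions with parameters `a` and `b`; this example and the function names are illustrative choices, not taken from the paper.

```python
from math import comb, exp


def tv_bernoulli_product(a: float, b: float, n: int) -> float:
    """Exact TV distance between Bernoulli(a)^n and Bernoulli(b)^n.

    The likelihood of a binary string depends only on its number of ones k,
    so the sum over all 2^n strings collapses to a sum over k = 0..n.
    """
    return 0.5 * sum(
        comb(n, k) * abs(a**k * (1 - a) ** (n - k) - b**k * (1 - b) ** (n - k))
        for k in range(n + 1)
    )


def reiss_lower_bound(a: float, b: float, n: int) -> float:
    """Lower bound 1 - exp(-n d^2 / 2), with d the single-sample TV distance."""
    d = abs(a - b)  # TV distance between Bernoulli(a) and Bernoulli(b)
    return 1.0 - exp(-0.5 * n * d * d)


if __name__ == "__main__":
    for n in (1, 5, 20, 50):
        print(n, reiss_lower_bound(0.5, 0.6, n), tv_bernoulli_product(0.5, 0.6, n))
```

In each printed row, the second column (the lower bound) should not exceed the third (the exact distance).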

The previous two results concern the testing of individual models against $P$. In the standard way, we now leverage the finite cardinality of $\mathcal{Q}$ to argue via a union bound that, given enough samples from $P$, it is unlikely that the ERM model has a large total variation distance from $P$.

Lemma 5.

Fix $0 < \delta, \epsilon_1, \epsilon_2 < 1$ arbitrarily, and suppose $P \in \mathcal{Q}$. If $P$ is fully-valid under $v$, and $S \sim P^n$ for $n \geq \Omega\left(\frac{\log(|\mathcal{Q}|) + \log(1/\delta)}{\min(\epsilon_1^2, \epsilon_2)}\right)$, then with probability $\geq 1 - \delta$ over $S \sim P^n$, the ERM solution

$$\hat{q} = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \sum_{x_i \in S} \log(1/q(x_i)),$$

satisfies both

$$d_{TV}(\hat{q}, P) \leq \epsilon_1 \quad \text{and} \quad I(\hat{q}) \leq \epsilon_2.$$
Proof.

Let $\mathcal{Q}_{d_{TV} > \epsilon_1} = \{q \in \mathcal{Q} : d_{TV}(q, P) > \epsilon_1\}$. By Lemma 4 and a union bound, it holds that

$$\begin{aligned}
\Pr_{S \sim P^n}\big(d_{TV}(\hat{q}, P) > \epsilon_1\big) &\leq \Pr_{S \sim P^n}\big(\exists\, q \in \mathcal{Q}_{d_{TV} > \epsilon_1} \ \text{s.t.}\ L_S(q) \leq L_S(P)\big) \\
&\leq |\mathcal{Q}_{d_{TV} > \epsilon_1}| \cdot \max_{q \in \mathcal{Q}_{d_{TV} > \epsilon_1}} \Pr_{S \sim P^n}\big(L_S(q) \leq L_S(P)\big) \\
&\leq |\mathcal{Q}|\, e^{-n\epsilon_1^2/2}.
\end{aligned}$$

In the same manner, let $v$ be an arbitrary validity function, and let $\mathcal{Q}_{I > \epsilon_2} = \{q \in \mathcal{Q} : I(q) > \epsilon_2\}$. Then we have

$$\begin{aligned}
\Pr_{S \sim P^n}\big(I(\hat{q}) > \epsilon_2\big) &\leq \Pr_{S \sim P^n}\big(\exists\, q \in \mathcal{Q}_{I > \epsilon_2} \ \text{s.t.}\ L_S(q) \leq L_S(P)\big) \\
&\leq |\mathcal{Q}_{I > \epsilon_2}| \cdot \max_{q \in \mathcal{Q}_{I > \epsilon_2}} \Pr_{S \sim P^n}\big(L_S(q) \leq L_S(P)\big) \\
&\leq |\mathcal{Q}|\, e^{-n\epsilon_2},
\end{aligned}$$

where the final inequality holds by Lemma 3. Then finally,

$$\Pr_{S \sim P^n}\big(d_{TV}(\hat{q}, P) > \epsilon_1 \ \vee\ I(\hat{q}) > \epsilon_2\big) \leq |\mathcal{Q}|\, e^{-n\epsilon_1^2/2} + |\mathcal{Q}|\, e^{-n\epsilon_2},$$

and so choosing $n \geq 2(\log(|\mathcal{Q}|) + \log(2/\delta))/\min(\epsilon_1^2, \epsilon_2)$ ensures that each of these final terms is $\leq \delta/2$, and hence that their sum is $\leq \delta$. ∎
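The sample-size choice at the end of this proof can be checked by direct computation. The sketch below evaluates the union-bound failure probability at the chosen $n$. It uses $\log(2/\delta)$ in the numerator so that each of the two terms is at most $\delta/2$; that constant, the function names, and the example parameters are choices made for this illustration.

```python
import math


def lemma5_sample_size(num_models: int, delta: float, eps1: float, eps2: float) -> int:
    """Smallest integer n with n >= 2 (log|Q| + log(2/delta)) / min(eps1^2, eps2)."""
    rate = min(eps1**2, eps2)
    return math.ceil(2.0 * (math.log(num_models) + math.log(2.0 / delta)) / rate)


def failure_bound(num_models: int, n: int, eps1: float, eps2: float) -> float:
    """Union-bound failure probability |Q| e^{-n eps1^2 / 2} + |Q| e^{-n eps2}."""
    tv_term = num_models * math.exp(-n * eps1**2 / 2.0)
    invalidity_term = num_models * math.exp(-n * eps2)
    return tv_term + invalidity_term


if __name__ == "__main__":
    n = lemma5_sample_size(num_models=1000, delta=0.05, eps1=0.1, eps2=0.01)
    print(n, failure_bound(1000, n, 0.1, 0.01))
```

Because the dependence is on $\min(\epsilon_1^2, \epsilon_2)$, tightening the validity requirement $\epsilon_2$ only changes the sample size once $\epsilon_2$ drops below $\epsilon_1^2$.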

Before we can prove Theorem 1, we need one final intermediate result, which we now give. It states that the value of the log-loss at any given $x$ for a mixture distribution constructed by heavily weighting one of two distributions is not much different than the value of the loss for the heavily weighted component. This follows from the fact that the natural log is well-approximated by a linear function near 1. We will use this result to argue that the loss of the output of Algorithm 1 is not significantly different than that of the ERM in the support of the ERM.

Lemma 4.

Fix $0 < \epsilon < 1$. For any $q \in \mathcal{P}$ having a density with respect to $\lambda$, the mixture $M = (1 - \epsilon/2)q + \epsilon u/2$ has density $f_M(x) = (1 - \epsilon/2) f_q(x) + \epsilon f_u(x)/2$, and this density satisfies

$$\log\big(1/f_M(x)\big) \leq \epsilon + \log\big(1/f_q(x)\big),$$

for all $x \in \mathcal{X}$.

Proof.

The existence claim on the densities is immediate given the definition of $u$ as the uniform distribution with respect to the reference measure $\lambda$. To see the inequality, fix $x \in \mathcal{X}$ arbitrarily, and note that $f_M(x) \geq (1 - \epsilon/2) f_q(x)$. Thus, we may write

$$\begin{aligned}
\log\big(1/f_M(x)\big) &\leq \log\left(\frac{1}{1 - \epsilon/2}\right) + \log\big(1/f_q(x)\big) \\
&\leq \left(\frac{1}{1 - \epsilon/2} - 1\right) + \log\big(1/f_q(x)\big) \\
&\leq \frac{\epsilon/2}{1 - \epsilon/2} + \log\big(1/f_q(x)\big) \\
&\leq \epsilon + \log\big(1/f_q(x)\big),
\end{aligned}$$

where the second inequality comes from the fact that $\log(z) \leq z - 1$, and the final one follows from the fact that $z/(1 - z) \leq 2z$ for $0 \leq z \leq 1/2$, applied with $z = \epsilon/2$. ∎
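The pointwise inequality of this lemma is easy to verify numerically. The following minimal sketch evaluates the increase in log-loss caused by mixing, at a point where the original density takes the value `f_q` and the uniform density takes the value `f_u`. Both function names and the grid of test values are hypothetical, chosen only for illustration.

```python
import math


def mixture_density(f_q: float, f_u: float, eps: float) -> float:
    """Density of M = (1 - eps/2) q + (eps/2) u at a point."""
    return (1.0 - eps / 2.0) * f_q + eps * f_u / 2.0


def log_loss_increase(f_q: float, f_u: float, eps: float) -> float:
    """log(1/f_M(x)) - log(1/f_q(x)) at a point where f_q(x) > 0."""
    return math.log(f_q) - math.log(mixture_density(f_q, f_u, eps))


if __name__ == "__main__":
    for eps in (0.01, 0.1, 0.5, 0.99):
        # f_u = 0 is the worst case: the mixture only removes mass from q.
        print(eps, log_loss_increase(f_q=2.0, f_u=0.0, eps=eps))
```

When `f_u` is large relative to `f_q`, the increase is negative, meaning that mixing lowers the loss at that point.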

To get guarantees for the log-loss, we make heavy use of the previous lemma. Being able to guarantee a small total variation distance from the ERM to $P$ and small invalidity means that the ERM output is already likely to be a faithful representation of $P$ with small invalidity. All that is then needed is to eliminate the possibility that the ERM has a large log-loss because of small mismatches in support with $P$.

To deal with this possibility, we mix the ERM model with the uniform distribution in accordance with Algorithm 1. Because the weight given to the uniform distribution is $O(\min(\epsilon_1^2, \epsilon_2))$, the invalidity is close to that of the ERM. To show that such a move does not increase the loss significantly, we split the loss of the output model into two contributions: that arising from $\operatorname{supp}(\hat{q}_{\textnormal{ERM}})$ and that arising from its complement. We first observe that the ERM always has an empirical risk which is a faithful estimator of the integral $\mathbb{E}_{X \sim P}\left[\mathbbm{1}[X \in \operatorname{supp}(\hat{q}_{\textnormal{ERM}})] \cdot \log(1/f_{\hat{q}_{\textnormal{ERM}}}(X))\right]$, allowing us to bound this integral above in terms of the loss of $P$.
The integral of the loss over $\operatorname{supp}(\hat{q}_{\textnormal{ERM}})^c$ can be shown to be small using the lower bound on the density of the output model $\hat{q}$ afforded by mixing with the uniform distribution, and the fact that $\operatorname{supp}(\hat{q}_{\textnormal{ERM}})^c$ must have a small measure under $P$ when the ERM is close to $P$ in total variation.

Theorem 1.

Fix $0 < \delta, \epsilon_1, \epsilon_2 < 1$ arbitrarily, and suppose that $P \in \mathcal{Q}$ and that $P$ is fully-valid under $v$. If for each $q \in \mathcal{Q}$ it holds that $\alpha \leq f_q(x) \leq \beta$ for all $x \in \operatorname{supp}(q)$, then there is an

$$N \leq \tilde{O}\left(\frac{\log^2\left(1/\min(\epsilon_1, \epsilon_2, \alpha)\right) \cdot \big(\log(|\mathcal{Q}|) + \log(1/\delta)\big)}{\min(\epsilon_1^2, \epsilon_2)}\right),$$

such that for all $n \geq N$, with probability $\geq 1 - \delta$, the output $\hat{q}$ of Algorithm 1 satisfies

$$L_P(\hat{q}) \leq L_P(q^*) + \epsilon_1 \quad \text{and} \quad I(\hat{q}) \leq \epsilon_2.$$
Proof.

Fix $v$ arbitrarily, and let $\epsilon_\wedge = \min(\epsilon_1, \epsilon_2)$. To see the claim on the validity, note that the asymptotic sample complexity of Lemma 5 yields the guarantee $I(\hat{q}_{\textnormal{ERM}}) \leq \epsilon_2/2$ with probability $\geq 1 - \delta/3$, in which case

$$\begin{aligned}
\Pr_{X \sim \hat{q}}\big(v(X) = 0\big) &= \Pr_{X \sim \hat{q}_{\textnormal{ERM}}}\big(v(X) = 0\big) \cdot \left(1 - \frac{\epsilon_\wedge}{8}\right) + \Pr_{X \sim U}\big(v(X) = 0\big) \cdot \frac{\epsilon_\wedge}{8} \\
&\leq \frac{\epsilon_2}{2} \cdot \left(1 - \frac{\epsilon_\wedge}{8}\right) + 1 \cdot \frac{\epsilon_\wedge}{8} \\
&< \epsilon_2.
\end{aligned}$$
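The arithmetic behind this display can be reproduced directly. The sketch below computes the invalidity of a mixture that places weight $\min(\epsilon_1, \epsilon_2)/8$ on the uniform distribution, and evaluates it in the worst case considered above (ERM invalidity $\epsilon_2/2$, uniform distribution entirely invalid). The function name and the example values are hypothetical.

```python
def mixture_invalidity(
    erm_invalidity: float, uniform_invalidity: float, eps1: float, eps2: float
) -> float:
    """Invalid mass of (1 - w) * q_ERM + w * U with w = min(eps1, eps2) / 8."""
    w = min(eps1, eps2) / 8.0
    return erm_invalidity * (1.0 - w) + uniform_invalidity * w


if __name__ == "__main__":
    eps1, eps2 = 0.1, 0.05
    worst_case = mixture_invalidity(eps2 / 2.0, 1.0, eps1, eps2)
    print(worst_case, "<", eps2)
```

Since $\epsilon_\wedge \leq \epsilon_2$, the worst case is at most $\epsilon_2/2 + \epsilon_2/8$, which leaves slack below $\epsilon_2$.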

The loss of the output model $\hat{q}$ can be decomposed into the contributions from the loss in $\operatorname{supp}(\hat{q}_{\textnormal{ERM}})$ and in its complement:

$$L_P(\hat{q}) = \mathbb{E}_{X \sim P}\Big[\mathbbm{1}[X \in \operatorname{supp}(\hat{q}_{\textnormal{ERM}})] \cdot \log(1/f_{\hat{q}}(X))\Big] + \mathbb{E}_{X \sim P}\Big[\mathbbm{1}[X \in \operatorname{supp}(\hat{q}_{\textnormal{ERM}})^c] \cdot \log(1/f_{\hat{q}}(X))\Big].$$

To bound the first term, for each $q \in \mathcal{Q}$, consider the function

$$B_q(x) = \begin{cases} \log\left(1/f_q(x)\right) & \text{if } x \in \operatorname{supp}(q), \\ 0 & \text{else.} \end{cases}$$

These functions are bounded above by $\log(1/\alpha)$ and below by $\log(1/\beta)$ (if $\beta < 1$, simply loosen the density upper bound), and thus for a sample $X \sim P$, they define bounded random variables $B_q(X)$. By Hoeffding's inequality and a union bound, a sample $S$ of size $n \geq \tilde{\Omega}\left(\log^2(1/\alpha)(\log(|\mathcal{Q}|) + \log(1/\delta))/\epsilon_1^2\right)$ from $P$ is large enough that, with probability $\geq 1 - \delta/3$, for each $q \in \mathcal{Q}$, it holds that

$$\left|\mathbb{E}_{X \sim P}[B_q(X)] - \frac{1}{n}\sum_{x_i \in S} B_q(x_i)\right| \leq \frac{\epsilon_1}{8}.$$
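To make the Hoeffding-plus-union-bound step concrete, the following sketch computes an explicit sample size for the uniform deviation guarantee $\epsilon_1/8$ at confidence $\delta/3$. It assumes the standard two-sided Hoeffding bound $2\exp(-2nt^2/R^2)$ for variables with range width $R$; the constants are one valid choice rather than the ones hidden in the $\tilde{\Omega}$ notation, and the function names are hypothetical. The range is taken to include 0, since $B_q$ vanishes off the support.

```python
import math


def range_width(alpha: float, beta: float) -> float:
    """Width of an interval containing 0, log(1/alpha) and log(1/beta)."""
    upper = max(0.0, math.log(1.0 / alpha))
    lower = min(0.0, math.log(1.0 / beta))
    return upper - lower


def hoeffding_union_sample_size(
    num_models: int, delta: float, eps1: float, alpha: float, beta: float
) -> int:
    """n such that all |Q| empirical means are within eps1/8 w.p. >= 1 - delta/3."""
    t = eps1 / 8.0
    width = range_width(alpha, beta)
    return math.ceil(width**2 * math.log(6.0 * num_models / delta) / (2.0 * t**2))


def hoeffding_union_failure(
    num_models: int, n: int, eps1: float, alpha: float, beta: float
) -> float:
    """Union bound |Q| * 2 exp(-2 n t^2 / R^2) on the failure probability."""
    t = eps1 / 8.0
    width = range_width(alpha, beta)
    if width == 0.0:
        return 0.0  # degenerate case: every B_q is identically zero
    return num_models * 2.0 * math.exp(-2.0 * n * t**2 / width**2)


if __name__ == "__main__":
    n = hoeffding_union_sample_size(1000, 0.05, 0.1, alpha=0.1, beta=5.0)
    print(n, hoeffding_union_failure(1000, n, 0.1, 0.1, 5.0))
```

The squared range width is where the $\log^2(1/\alpha)$ factor in the sample complexity comes from.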

Note that for each $q \in \mathcal{Q}$ with $L_S(q) < \infty$, it holds that $L_S(q)$ coincides with the empirical estimate of $\mathbb{E}_{X \sim P}[B_q(X)]$, namely

$$\frac{1}{n}\sum_{x_i \in S} B_q(x_i) = \frac{1}{n}\sum_{x_i \in S} \log\left(1/f_q(x_i)\right).$$

Furthermore, this coincidence takes place for $\hat{q}_{\textnormal{ERM}}$ with probability 1, as $P \in \mathcal{Q}$ implies that $L_S(\hat{q}_{\textnormal{ERM}}) \leq L_S(P) < \infty$ with probability 1. Note that $L_S(P)$ is a good estimator for $L_P(P)$ in the sense arising from an application of Hoeffding's inequality, as $\log(1/f_P(X))$ is bounded almost surely for $X \sim P$ given that $P$ has a density that is bounded on its support (as a member of $\mathcal{Q}$). Thus, we may write

$$\begin{aligned}
\mathbb{E}_{X \sim P}\Big[\mathbbm{1}[X \in \operatorname{supp}(\hat{q}_{\textnormal{ERM}})] \cdot \log(1/f_{\hat{q}}(X))\Big] &\leq \mathbb{E}_{X \sim P}\Big[\mathbbm{1}[X \in \operatorname{supp}(\hat{q}_{\textnormal{ERM}})] \cdot \log(1/f_{\hat{q}_{\textnormal{ERM}}}(X))\Big] + \frac{\epsilon_1}{4} \\
&= \mathbb{E}_{X \sim P}\left[B_{\hat{q}_{\textnormal{ERM}}}(X)\right] + \frac{\epsilon_1}{4} \\
&\leq L_S(\hat{q}_{\textnormal{ERM}}) + \frac{3\epsilon_1}{8} \\
&\leq L_P(q^*) + \frac{\epsilon_1}{2},
\end{aligned}$$

where we invoke Lemma 4 in the first step, and in the last step use that $q^* = P$ by the strict properness of the log-loss.

To bound the second term, note that by the argument in given in Lemma 5, when nΩ(log2(1/ϵ)(log(|𝒬|+log(1/δ))/ϵ12)n\geq\Omega\left(\log^{2}(1/\epsilon_{\wedge})\cdot\left(\log(|\mathcal{Q}|+% \log(1/\delta)\right)/\epsilon_{1}^{2}\right)italic_n ≥ roman_Ω ( roman_log start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( 1 / italic_ϵ start_POSTSUBSCRIPT ∧ end_POSTSUBSCRIPT ) ⋅ ( roman_log ( | caligraphic_Q | + roman_log ( 1 / italic_δ ) ) / italic_ϵ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ), it holds with probability 1δ/3absent1𝛿3\geq 1-\delta/3≥ 1 - italic_δ / 3 that dTV(q^ ERM,P)ϵ1/8log(8/ϵ)subscript𝑑𝑇𝑉subscript^𝑞 ERM𝑃subscriptitalic-ϵ188subscriptitalic-ϵd_{TV}(\hat{q}_{\textnormal{ ERM}},P)\leq\epsilon_{1}/8\log(8/\epsilon_{\wedge})italic_d start_POSTSUBSCRIPT italic_T italic_V end_POSTSUBSCRIPT ( over^ start_ARG italic_q end_ARG start_POSTSUBSCRIPT ERM end_POSTSUBSCRIPT , italic_P ) ≤ italic_ϵ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT / 8 roman_log ( 8 / italic_ϵ start_POSTSUBSCRIPT ∧ end_POSTSUBSCRIPT ). Because q^^𝑞\hat{q}over^ start_ARG italic_q end_ARG has density at least ϵ/8subscriptitalic-ϵ8\epsilon_{\wedge}/8italic_ϵ start_POSTSUBSCRIPT ∧ end_POSTSUBSCRIPT / 8 everywhere in 𝒳𝒳\mathcal{X}caligraphic_X,222When the uniform distribution has a density smaller than 1, there is an extra log factor to account for here. We can WLOG this away by adding the condition that λ(𝒳)=1𝜆𝒳1\lambda(\mathcal{X})=1italic_λ ( caligraphic_X ) = 1, e.g. arising from normalized data. we have

$$
\begin{aligned}
\mathbb{E}_{X\sim P}\Big[\mathbbm{1}[X\in\textnormal{supp}(\hat{q}_{\textnormal{ERM}})^{c}]\cdot\log(1/f_{\hat{q}}(X))\Big] &\leq \log(8/\epsilon_{\wedge})\cdot\mathbb{E}_{X\sim P}\Big[\mathbbm{1}[X\in\textnormal{supp}(\hat{q}_{\textnormal{ERM}})^{c}]\Big] \\
&\leq \log(8/\epsilon_{\wedge})\cdot d_{TV}(P,\hat{q}_{\textnormal{ERM}}) \\
&< \epsilon_{1}/2,
\end{aligned}
$$

where in the second line we use the fact that $\hat{q}_{\textnormal{ERM}}(\textnormal{supp}(\hat{q}_{\textnormal{ERM}}))=1$ to argue that $d_{TV}(P,\hat{q}_{\textnormal{ERM}})\geq P(\textnormal{supp}(\hat{q}_{\textnormal{ERM}})^{c})$. Combining the bounds on the two summands and union bounding the confidence yields the full guarantee.
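For completeness, the total variation inequality used in the second line can be spelled out as follows. This is a short sketch that assumes the standard definition $d_{TV}(P,q)=\sup_{A}|P(A)-q(A)|$ over measurable sets $A$, which is not restated in this part of the text.

```latex
\begin{aligned}
d_{TV}(P,\hat{q}_{\textnormal{ERM}})
  &= \sup_{A}\,\big|P(A)-\hat{q}_{\textnormal{ERM}}(A)\big| \\
  &\geq P\big(\textnormal{supp}(\hat{q}_{\textnormal{ERM}})^{c}\big)
      -\hat{q}_{\textnormal{ERM}}\big(\textnormal{supp}(\hat{q}_{\textnormal{ERM}})^{c}\big) \\
  &= P\big(\textnormal{supp}(\hat{q}_{\textnormal{ERM}})^{c}\big),
\end{aligned}
```

The last equality holds because $\hat{q}_{\textnormal{ERM}}$ places all of its mass on its support, so the complement of the support has mass zero under $\hat{q}_{\textnormal{ERM}}$.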

We note that this argument implies a slightly more precise sample complexity bound given by

$$\tilde{O}\left(\max\left(\frac{\log(|\mathcal{Q}|)+\log(1/\delta)}{\epsilon_{2}},\ \frac{\log^{2}\left(1/\min(\epsilon_{1},\epsilon_{2},\alpha)\right)\big(\log(|\mathcal{Q}|)+\log(1/\delta)\big)}{\epsilon_{1}^{2}}\right)\right),$$

which affords a minor improvement in certain regimes where $\epsilon_{2}<\epsilon_{1}^{2}$. ∎

7.4 A Zero-Query Lower Bound

Theorem 2.

For all $\epsilon_{2}<1/4$ and for all proper learners $L:S\to\mathcal{P}$, if the sample $S\sim P^{n}$ is of size $n\leq 1/(8\epsilon_{2})$, then there exists a triple $(P,\mathcal{Q},v)$ with $P\in\mathcal{Q}$ and $P$ fully-valid, on which $L(S)$ has invalidity $I(L(S))>\epsilon_{2}$ with probability $\geq 1/4$.

Proof.

We give an argument inspired by the proof of a lower bound in Theorem 2 of Zhao and Lai [2022].

Let $\mathcal{X}=[0,1]\subset\mathbb{R}$. Fix the proper learner $L$ and $\epsilon_{2}<1/4$ arbitrarily. Consider a model class defined by $\mathcal{Q}=\{P,\tilde{P}\}$, where $P$ and $\tilde{P}$ have densities with respect to the Lebesgue measure

$$f_{P}(x)=\mathbbm{1}[x\in[0,1-2\epsilon_{2}]]\frac{1}{1-2\epsilon_{2}},\qquad f_{\tilde{P}}(x)=\mathbbm{1}[x\in[2\epsilon_{2},1]]\frac{1}{1-2\epsilon_{2}}.$$

This $\mathcal{Q}$ gives rise to two realizable problem instances of interest. Under the first, $P$ is the data generating distribution and $v(x)=1$ everywhere except in $(1-2\epsilon_{2},1]$, where $v(x)=0$. Under the second, $\tilde{P}$ is the data generating distribution and $v(x)=1$ everywhere except in $[0,2\epsilon_{2})$, where $v(x)=0$. Assume for contradiction that for both problem instances, given a sample of size $n\leq 1/(8\epsilon_{2})$, we have that $I(L(S))\leq\epsilon_{2}$ with probability $>3/4$. In both cases, the data generating distribution is fully-valid by construction, while $I(q)=2\epsilon_{2}/(1-2\epsilon_{2})>\epsilon_{2}$ for the model $q$ which is not the data generating distribution. Thus, since a proper learner outputs a member of $\mathcal{Q}$, $L(S)$ is a model with $I(L(S))\leq\epsilon_{2}$ with probability $>3/4$ over both problem instances if and only if it identifies the data generating distribution with probability $>3/4$ over both problem instances.

Consider the simple hypothesis tester defined by $T_{L}(S)=L(S)$, which by the above outputs the correct data generating distribution with probability $>3/4$, given a choice of $P$ or $\tilde{P}$ and $n\leq 1/(8\epsilon_{2})$ samples from either $P$ or $\tilde{P}$. Note that the distributions $P$, $\tilde{P}$ have $d_{TV}(P,\tilde{P})=2\epsilon_{2}/(1-2\epsilon_{2})\leq 4\epsilon_{2}$, where in the inequality we use $\epsilon_{2}<1/4$ and the fact that $z/(1-z)\leq 2z$ for $z\leq 1/2$, applied with $z=2\epsilon_{2}$. By a classic upper bound on the total variation between product measures Reiss [1981], we have that $d_{TV}(P^{n},\tilde{P}^{n})\leq 4n\epsilon_{2}$. Then, by Le Cam’s method and this upper bound on $d_{TV}(P^{n},\tilde{P}^{n})$, we have

$$\inf_{T:S\to\{P,\tilde{P}\}}\ \max_{q\in\{P,\tilde{P}\}}\textnormal{Pr}_{S\sim q^{n}}\left(T(S)\neq q\right)\geq\frac{1}{2}-\frac{1}{2}\cdot d_{TV}(P^{n},\tilde{P}^{n})\geq\frac{1}{2}-2n\epsilon_{2}.$$

Thus, if $n\leq 1/(8\epsilon_{2})$, then since $T_{L}$ is one such tester, there is a choice of data generating distribution such that $T_{L}(S)$ incurs error probability $\geq 1/4$, which is a contradiction. ∎
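The quantities in this construction are easy to check numerically. The following is a minimal sketch, not part of the original argument: the function names are hypothetical, exact rational arithmetic is used to avoid rounding, and the product-measure bound is taken in the subadditive form $d_{TV}(P^{n},\tilde{P}^{n})\leq n\cdot d_{TV}(P,\tilde{P})$.

```python
from fractions import Fraction

def tv_distance(eps):
    """d_TV(P, P~) for the uniform densities on [0, 1-2*eps] and [2*eps, 1]."""
    if not 0 < eps < Fraction(1, 4):
        raise ValueError("the construction assumes 0 < eps < 1/4")
    height = 1 / (1 - 2 * eps)  # common density value on each support
    # The densities coincide on [2*eps, 1-2*eps] and differ by `height`
    # on [0, 2*eps) and (1-2*eps, 1], two intervals of length 2*eps each.
    return Fraction(1, 2) * (2 * eps * height + 2 * eps * height)

def invalidity_of_wrong_model(eps):
    """Mass the non-generating model places on the invalid interval of length 2*eps."""
    return 2 * eps / (1 - 2 * eps)

def le_cam_bound(n, eps):
    """Lower bound 1/2 - (1/2) * min(1, n * d_TV(P, P~)) on the worst-case test error."""
    tv_product = min(Fraction(1), n * tv_distance(eps))  # subadditivity over n draws
    return Fraction(1, 2) - tv_product / 2

eps = Fraction(1, 100)
n = int(1 / (8 * eps))  # largest integer n with n <= 1/(8*eps); here n = 12
print(tv_distance(eps), invalidity_of_wrong_model(eps), le_cam_bound(n, eps))
```

For this choice of `eps`, the model that is not the data generating distribution has invalidity above `eps`, and the lower bound on the error of any tester stays at or above 1/4, in line with the statement of the theorem.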

7.5 Strategies Arising from Estimation of the Validity Function

When the validity function $v$ is known to lie in a class of bounded complexity, it is learnable, and learned estimates may be utilized in the selection of a low-loss, high-validity model. An interesting feature of the problem setup here is that, unlike in most learning settings, the learner can actually choose which distributions it would like to estimate $v$ with respect to, i.e. decide under which marginal distributions over $\mathcal{X}$ the estimate $\hat{v}$ should have small disagreement with $v$.

We begin with a lemma arguing that whenever the ERM has positive probability of outputting an example with $\hat{v}(x)=1$, and the disagreement of $\hat{v}$ and $v$ under a proposal distribution is small enough, the restriction will indeed yield a low-invalidity model.

Lemma 5.

Fix $0<\epsilon<1$ and a distribution $q\in\mathcal{P}$ absolutely continuous with respect to $\lambda$ arbitrarily. Further, fix $\hat{V}(q)$ such that $0<\hat{V}(q)\leq V(q)$, and suppose that for some $\hat{v}:\mathcal{X}\to\{0,1\}$ we have

$$\textnormal{Pr}_{X\sim q}\left(\hat{v}(X)\neq v(X)\right)\leq\frac{\hat{V}(q)\epsilon}{2}.$$

Then whenever there is a distribution $\hat{q}$ corresponding to

$$f_{\hat{q}}(x)\propto f_{q}(x)\mathbbm{1}[\hat{v}(x)=1],$$

it has invalidity $I(\hat{q})\leq\epsilon$.

Proof.

Let $V_{\hat{v}}(q)=\textnormal{Pr}_{X\sim q}\left(\hat{v}(X)=1\right)$ denote the normalizing constant for the restriction to the estimated valid region. Note that the restriction corresponds to a probability distribution if and only if $V_{\hat{v}}(q)>0$, and that in this case

$$f_{\hat{q}}(x)=\frac{f_{q}(x)\mathbbm{1}[\hat{v}(x)=1]}{V_{\hat{v}}(q)}.$$

It holds further that $\hat{q}$ is absolutely continuous with respect to $\lambda$, and that we can write the following chain of relations:

$$
\begin{aligned}
I(\hat{q}) &= \int\mathbbm{1}[v(x)=0]\,f_{\hat{q}}(x)\ d\lambda(x) \\
&\leq \frac{1}{V_{\hat{v}}(q)}\int\mathbbm{1}[\hat{v}(x)\neq v(x)]\,f_{q}(x)\ d\lambda(x) \\
&\leq \frac{\hat{V}(q)}{V_{\hat{v}}(q)}\cdot\frac{\epsilon}{2} \\
&\leq \frac{V(q)}{V_{\hat{v}}(q)}\cdot\frac{\epsilon}{2},
\end{aligned}
$$

where the first inequality follows after inputting the definition of $f_{\hat{q}}$ and observing that $\mathbbm{1}[v(x)=0]\,\mathbbm{1}[\hat{v}(x)=1]\leq\mathbbm{1}[\hat{v}(x)\neq v(x)]$, and the final two inequalities come by assumption. It is further possible to show that the validity of $q$ can be approximated from above by a constant multiple of $V_{\hat{v}}(q)$, which can be conceptualized as the validity of $q$ if $\hat{v}$ were the true validity function. In particular,

$$
\begin{aligned}
V(q) &= \Big(V_{\hat{v}}(q)-V_{\hat{v}}(q)\Big)+V(q) \\
&= V_{\hat{v}}(q)+\int\Big(\mathbbm{1}[v(x)=1]-\mathbbm{1}[\hat{v}(x)=1]\Big)f_{q}(x)\ d\lambda(x) \\
&\leq V_{\hat{v}}(q)+\int\mathbbm{1}[v(x)\neq\hat{v}(x)]\,f_{q}(x)\ d\lambda(x) \\
&\leq V_{\hat{v}}(q)+\frac{\hat{V}(q)\epsilon}{2} \\
&\leq V_{\hat{v}}(q)+\frac{V(q)\epsilon}{2}.
\end{aligned}
$$

This implies that $V(q)\leq V_{\hat{v}}(q)/(1-\epsilon/2)$, which yields $V(q)\leq 2V_{\hat{v}}(q)$ as $\epsilon<1$. Utilizing this inequality in the last line of the first string of inequalities gives the guarantee. ∎
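The lemma can be sanity-checked on a toy instance. The sketch below is an illustration under stated assumptions rather than part of the proof: it replaces $\mathcal{X}$ with a finite domain and $\lambda$ with the counting measure, so that densities become probability mass functions, and the helper name `lemma5_quantities` is hypothetical.

```python
from fractions import Fraction

def lemma5_quantities(q, v, v_hat):
    """Compute the quantities appearing in Lemma 5 for a pmf q on a finite domain.

    Assumes the restriction is well defined, i.e. q puts positive mass on {v_hat = 1}.
    """
    V = sum(p for p, a in zip(q, v) if a == 1)               # V(q)
    V_vhat = sum(p for p, b in zip(q, v_hat) if b == 1)      # V_{v_hat}(q)
    dis = sum(p for p, a, b in zip(q, v, v_hat) if a != b)   # Pr_q[v_hat != v]
    # I(q_hat): mass that q_hat, the restriction of q to {v_hat = 1}, puts on {v = 0}.
    I_hat = sum(p for p, a, b in zip(q, v, v_hat) if b == 1 and a == 0) / V_vhat
    return V, V_vhat, dis, I_hat

# Ten equally likely points; the last two are invalid, and v_hat wrongly accepts point 8.
q = [Fraction(1, 10)] * 10
v = [1] * 8 + [0] * 2
v_hat = [1] * 9 + [0]

eps = Fraction(1, 4)
V, V_vhat, dis, I_hat = lemma5_quantities(q, v, v_hat)
V_lower = V  # any 0 < V_lower <= V(q) is an admissible lower estimate

hypothesis_holds = dis <= V_lower * eps / 2
print(hypothesis_holds, I_hat <= eps, V <= 2 * V_vhat)
```

In this instance the disagreement condition holds with equality, and both the conclusion $I(\hat{q})\leq\epsilon$ and the intermediate inequality $V(q)\leq 2V_{\hat{v}}(q)$ are satisfied.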

The need for a lower estimate on the validity when setting the precision of the estimate $\hat{v}$ can be understood as follows: when the proposal distribution has very small validity, the restriction to the estimated valid parts of space under $\hat{v}$ may create huge (or infinite) increases in mass over the proposal distribution. In the language of Lemma 5, $V_{\hat{v}}(q)$ may be very small for a proposal distribution $q$. If this is the case, the estimate $\hat{v}$ must be more precise, as small errors in the estimation of the validity function may lead to invalid parts of space having large mass under $\hat{q}$.
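A small numerical illustration of this effect, again on a finite domain with probability mass functions standing in for densities (an assumption made only for the example; `restricted_invalidity` is a hypothetical helper): the same estimate $\hat{v}$ with the same disagreement of 1/100 yields very different invalidity after restriction, depending on the validity of the proposal.

```python
from fractions import Fraction

def restricted_invalidity(q, v, v_hat):
    """Return (Pr_q[v_hat != v], I(q_hat)) for the restriction of pmf q to {v_hat = 1}."""
    disagreement = sum(p for p, a, b in zip(q, v, v_hat) if a != b)
    accepted = sum(p for p, b in zip(q, v_hat) if b == 1)  # V_{v_hat}(q)
    if accepted == 0:
        raise ValueError("restriction is not a probability distribution")
    invalid_accepted = sum(p for p, a, b in zip(q, v, v_hat) if b == 1 and a == 0)
    return disagreement, invalid_accepted / accepted

# Three points: valid, invalid but accepted by v_hat, invalid and rejected by v_hat.
v = [1, 0, 0]
v_hat = [1, 1, 0]

low_validity = [Fraction(1, 100), Fraction(1, 100), Fraction(98, 100)]
high_validity = [Fraction(98, 100), Fraction(1, 100), Fraction(1, 100)]

print(restricted_invalidity(low_validity, v, v_hat))
print(restricted_invalidity(high_validity, v, v_hat))
```

For the low-validity proposal, half of the mass of $\hat{q}$ ends up on invalid points, whereas for the high-validity proposal the invalidity is 1/99. This is why the disagreement tolerance in Lemma 5 scales with the lower estimate $\hat{V}(q)$.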

We introduce another lemma before proving Theorem 3. It argues that for a model $\hat{q}$ constructed by accepting samples from some "proposal distribution" $q$ that fall in the valid part of space under $\hat{v}$, the contribution to the loss from the part of space where $\hat{v}$ agrees with $v$ can never exceed the total loss of the proposal distribution $q$. Essentially, we are exploiting the full-validity of $P$ here: in the subregion of the agreement region $\{v(x)=\hat{v}(x)\}$ on which the loss is computed, we have that $\hat{v}(x)=1$ by the fact that $X\sim P$ is valid. This means that the density of $\hat{q}$ can only be larger than the density of $q$ in this region, which under a non-increasing loss cannot increase the loss over that incurred by $q$.

Lemma 6.

Fix $0<\epsilon,\delta<1$, a validity function estimate $\hat{v}:\mathcal{X}\to\{0,1\}$, and $q\in\mathcal{Q}$ arbitrarily. Suppose $l:\mathbb{R}^{\geq 0}\to[0,M]$ is a non-increasing loss function. Then whenever

$$f_{\hat{q}}(x)\propto f_{q}(x)\mathbbm{1}[\hat{v}(x)=1]$$

corresponds to a probability distribution, it enjoys

$$\mathbb{E}_{X\sim P}\big[l\left(f_{\hat{q}}(X)\right)\cdot\mathbbm{1}[\hat{v}(X)=v(X)]\big]\leq L_{P}(q;l).$$
Proof.

Let $V_{\hat{v}}(q)=\textnormal{Pr}_{X\sim q}\left(\hat{v}(X)=1\right)$ be the normalizing constant for $\hat{q}$, where we note that $V_{\hat{v}}(q)>0$ when $f_{q}(x)\mathbbm{1}[\hat{v}(x)=1]$ corresponds to a probability distribution.

Given that $P$ is fully-valid, we have $\textnormal{Pr}_{X\sim P}\left(v(X)=1\right)=1$. By the fact that integration is defined up to null sets, it holds that

$$\mathbb{E}_{X\sim P}\Big[l\left(f_{\hat{q}}(X)\right)\cdot\mathbbm{1}[\hat{v}(X)=v(X)]\Big]=\mathbb{E}_{X\sim P}\Big[l\left(f_{\hat{q}}(X)\right)\cdot\mathbbm{1}[\hat{v}(X)=v(X)\wedge v(X)=1]\Big].$$

Further, we may write

$$
\begin{aligned}
\mathbb{E}_{X\sim P}\Big[l\left(f_{\hat{q}}(X)\right)\cdot\mathbbm{1}[\hat{v}(X)=v(X)\wedge v(X)=1]\Big]
&\leq \mathbb{E}_{X\sim P}\bigg[l\left(\frac{f_{q}(X)\mathbbm{1}[\hat{v}(X)=1]}{V_{\hat{v}}(q)}\right)\cdot\mathbbm{1}[\hat{v}(X)=1]\bigg] \\
&\leq \mathbb{E}_{X\sim P}\bigg[l\left(\frac{f_{q}(X)}{V_{\hat{v}}(q)}\right)\bigg] \\
&\leq \mathbb{E}_{X\sim P}\Big[l\left(f_{q}(X)\right)\Big] \\
&= L_{P}(q;l).
\end{aligned}
$$

Here, the second to last inequality comes from the non-negativity of the loss along with the fact that whenever $\hat{v}(X)=0$, the integrand is zero; when $\hat{v}(X)=1$, the loss is just evaluated at the normalized density, and so the integrand introduced in this line is an upper bound for the previous integrand. The final inequality comes from the fact that the loss function is non-increasing along with the observation that $V_{\hat{v}}(q)\leq 1$: in removing the normalizing constant, we can only make the value at which the loss is evaluated smaller, which cannot decrease the value of the loss. ∎
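The inequality of Lemma 6 can likewise be checked on a toy instance. The sketch below makes several assumptions for illustration only: a finite domain with probability mass functions in place of densities, a clipped log-loss as one hypothetical example of a bounded non-increasing loss, and the helper names `clipped_log_loss` and `lemma6_sides`.

```python
import math

def clipped_log_loss(z, M=5.0):
    """A bounded, non-increasing loss on [0, inf) with values in [0, M] (hypothetical choice)."""
    if z <= 0.0:
        return M
    return min(M, max(0.0, -math.log(z)))

def lemma6_sides(P, q, v, v_hat, loss):
    """Return (E_P[l(f_qhat) * 1[v_hat = v]], L_P(q; l)) for pmfs on a finite domain."""
    V_vhat = sum(qx for qx, b in zip(q, v_hat) if b == 1)  # normalizing constant
    if V_vhat == 0:
        raise ValueError("restriction is not a probability distribution")
    q_hat = [qx / V_vhat if b == 1 else 0.0 for qx, b in zip(q, v_hat)]
    lhs = sum(px * loss(fx) for px, fx, a, b in zip(P, q_hat, v, v_hat) if a == b)
    rhs = sum(px * loss(qx) for px, qx in zip(P, q))
    return lhs, rhs

# Four points; P is fully valid because it only puts mass where v = 1.
v = [1, 1, 1, 0]
v_hat = [1, 0, 1, 1]          # wrong on points 1 and 3
P = [0.5, 0.5, 0.0, 0.0]
q = [0.25, 0.25, 0.25, 0.25]

lhs, rhs = lemma6_sides(P, q, v, v_hat, clipped_log_loss)
print(lhs <= rhs)
```

Only point 0 contributes to the left-hand side here, since point 1 lies in the disagreement region and points 2 and 3 have no mass under $P$. On that point the restricted model has larger mass than $q$, so the loss can only go down.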

We are now ready to prove the main result of the second half of the paper: the guarantee for Algorithm 2. It combines the previous lemmas, noting further that the number of samples in $S_{P}$ is sufficient to make the disagreement of $v$ and $\hat{v}$ small enough under $P$ such that the contribution to the loss in that part of space can be controlled by trivially applying the loss upper bound $M$.

Theorem 3.

Suppose $v\in\mathcal{V}$ with VC-dimension $VC(\mathcal{V})\leq D$, and that for each $q\in\mathcal{Q}$, the validity $V(q)\geq\gamma>0$. For all $0<\epsilon_{1},\epsilon_{2},\delta\leq 1$ and for all choices of non-increasing loss functions $l:\mathbb{R}^{\geq 0}\to[0,M]$, Algorithm 2 requires a number of samples

$$\leq O\left(\frac{M^{2}\left(\log(|\mathcal{Q}|)+\log(1/\delta)\right)}{\epsilon_{1}^{2}}+\frac{M\left(D\log(M/\epsilon_{1})+\log(1/\delta)\right)}{\epsilon_{1}}\right),$$

and a number of validity queries

$$\leq O\left(\frac{D\log(1/(\gamma\epsilon_{2}))+\log(1/\delta)}{\gamma\epsilon_{2}}\right),$$

to ensure that with probability $\geq 1-\delta$, its output enjoys

$$L_{P}(\hat{q};l)\leq L_{P}(q^{*};l)+\epsilon_{1}\quad\text{and}\quad I(\hat{q})\leq\epsilon_{2}.$$
Proof.

Given that the loss is bounded, Hoeffding’s inequality applied to the random variables $l\left(f_{q}(X)\right)$ for $X \sim P$, together with a union bound over $\mathcal{Q}$, implies that $S$ is large enough that with probability $\geq 1-\delta/3$, for all $q \in \mathcal{Q}$, the empirical loss estimates $L_{S}(q;l)$ are at most $\epsilon_{1}/4$ away from the true losses $L_{P}(q;l)$. For any choice of $\hat{q}_{\textnormal{ERM}}$, because $v \in \mathcal{V}$, any minimizer $\hat{v}$ must be consistent with the labeling under $v$ of both $S_{P}$ and $S_{\hat{q}_{\textnormal{ERM}}}$. The standard rates of convergence for an arbitrary consistent hypothesis thus imply that the sizes of $S_{P}$ and $S_{\hat{q}_{\textnormal{ERM}}}$ are large enough to guarantee that, with probability $\geq 1-2\delta/3$, we have

$$\textnormal{Pr}_{X\sim P}\bigg(\hat{v}(X)\neq v(X)\bigg)\leq\frac{\epsilon_{1}}{2M} \quad \wedge \quad \textnormal{Pr}_{X\sim\hat{q}_{\textnormal{ERM}}}\bigg(\hat{v}(X)\neq v(X)\bigg)\leq\frac{\gamma\epsilon_{2}}{2}.$$

By a union bound, with probability $\geq 1-\delta$, all of these estimation accuracy events take place. We condition on these favorable events going forward.

Note that conditioned on these favorable events, the normalizing constant $\hat{q}_{\textnormal{ERM}}\left(\{\hat{v}(x)=1\}\right)>0$, as for any ERM, we have

$$\begin{aligned}
\hat{q}_{\textnormal{ERM}}\left(\{\hat{v}(x)=1\}\right) &= \mathbb{E}_{X\sim\hat{q}_{\textnormal{ERM}}}\left[\mathbbm{1}[v(X)=1]\right]-\int\left(\mathbbm{1}[v(x)=1]-\mathbbm{1}[\hat{v}(x)=1]\right)d\hat{q}_{\textnormal{ERM}}(x) \\
&\geq \mathbb{E}_{X\sim\hat{q}_{\textnormal{ERM}}}\left[\mathbbm{1}[v(X)=1]\right]-\int\mathbbm{1}[v(x)\neq\hat{v}(x)]\,d\hat{q}_{\textnormal{ERM}}(x) \\
&\geq \gamma-\frac{\gamma\epsilon_{2}}{2} \\
&> 0.
\end{aligned}$$

Thus, the restriction of the ERM to the estimated validity region is a well-defined probability distribution, and is output by the algorithm as $\hat{q}$. For any estimate $\hat{v}$ of the validity function, we can decompose the loss of $\hat{q}$ as

$$L_{P}(\hat{q};l)=\mathbb{E}_{X\sim P}\bigg[l\left(f_{\hat{q}}(X)\right)\cdot\mathbbm{1}[\hat{v}(X)=v(X)]\bigg]+\mathbb{E}_{X\sim P}\bigg[l\left(f_{\hat{q}}(X)\right)\cdot\mathbbm{1}[\hat{v}(X)\neq v(X)]\bigg].$$

First using Lemma 6, and then using the uniform convergence of the loss estimates, we can bound the first term as

$$\begin{aligned}
\mathbb{E}_{X\sim P}\bigg[l\left(f_{\hat{q}}(X)\right)\cdot\mathbbm{1}[\hat{v}(X)=v(X)]\bigg] &\leq L_{P}(\hat{q}_{\textnormal{ERM}};l) \\
&\leq L_{P}(q^{*}_{l};l)+\frac{\epsilon_{1}}{2} \\
&\leq L_{P}(q^{*};l)+\frac{\epsilon_{1}}{2},
\end{aligned}$$

where $q^{*}_{l}=\operatorname*{arg\,min}_{q\in\mathcal{Q}}L_{P}(q;l)$ is the lowest-loss model in the class $\mathcal{Q}$. To upper bound the second term in the loss decomposition, we can use the fact that $\textnormal{Pr}_{X\sim P}\left(\hat{v}(X)\neq v(X)\right)\leq\epsilon_{1}/(2M)$ and the upper bound on the loss to write

$$\mathbb{E}_{X\sim P}\bigg[l\left(f_{\hat{q}}(X)\right)\cdot\mathbbm{1}[\hat{v}(X)\neq v(X)]\bigg]\leq M\cdot\mathbb{E}_{X\sim P}\bigg[\mathbbm{1}[\hat{v}(X)\neq v(X)]\bigg]\leq\frac{\epsilon_{1}}{2},$$

yielding the loss guarantee.

The validity guarantee follows directly from the fact that $\textnormal{Pr}_{X\sim\hat{q}_{\textnormal{ERM}}}\left(\hat{v}(X)\neq v(X)\right)\leq\gamma\epsilon_{2}/2$ and Lemma 5, where $\gamma$ furnishes the lower bound on the validity of the model $\hat{q}_{\textnormal{ERM}}$. ∎
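As a sanity check on the two facts the proof relies on, the following sketch works on a toy finite domain where distributions are plain probability vectors. It checks that the normalizing constant of the restriction is at least the validity of the model minus its disagreement mass, and that the invalid mass of the restricted model is at most the disagreement mass divided by the normalizing constant. The helper name `restrict`, the six-point domain, and all numbers are illustrative assumptions and do not come from the paper.

```python
def restrict(q, v_hat):
    """Renormalize the pmf q on the set {x : v_hat[x] == 1}.

    Returns the restricted pmf and its normalizing constant.
    """
    z = sum(p for p, keep in zip(q, v_hat) if keep == 1)
    if z <= 0.0:
        raise ValueError("estimated valid region has zero mass under q")
    return [p / z if keep == 1 else 0.0 for p, keep in zip(q, v_hat)], z


# Toy six-point domain (illustrative numbers only).
q_erm = [0.30, 0.25, 0.20, 0.15, 0.05, 0.05]
v = [1, 1, 1, 0, 0, 1]      # true validity function
v_hat = [1, 1, 1, 0, 1, 0]  # estimate, wrong on two low-mass points

# V(q_erm): mass q_erm places on the truly valid points.
validity = sum(p for p, ok in zip(q_erm, v) if ok == 1)
# Mass q_erm places on points where v_hat and v disagree.
disagreement = sum(p for p, a, b in zip(q_erm, v, v_hat) if a != b)

q_hat, z = restrict(q_erm, v_hat)
# Invalid mass of the restricted model: points kept by v_hat but invalid under v.
invalid_mass = sum(p for p, ok in zip(q_hat, v) if ok == 0)

print(z, validity - disagreement)      # normalizer vs. its lower bound
print(invalid_mass, disagreement / z)  # invalid mass vs. its upper bound
```

Plugging the proof's quantities into the second bound, a validity of at least $\gamma$ and a disagreement mass of at most $\gamma\epsilon_{2}/2$ give an invalid mass of at most $(\gamma\epsilon_{2}/2)/(\gamma-\gamma\epsilon_{2}/2)=\epsilon_{2}/(2-\epsilon_{2})\leq\epsilon_{2}$, which is the kind of conclusion Lemma 5 is invoked for.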

The corollary to Theorem 3 stating that only low-loss models need appreciable validity is straightforward. One can add a line to the proof of Theorem 3 arguing that, when the intersection of good estimation events takes place, the loss of the ERM distribution is within $O(\epsilon_{1})$ of the optimal loss across models in $\mathcal{Q}$, and hence, by the corollary's assumption, it has validity greater than some constant $c$. Thus, one can run Algorithm 2 with a sample $S_{\hat{q}_{\textnormal{ERM}}}$ only large enough to achieve an $O(\epsilon_{2})$ disagreement rate between $\hat{v}$ and $v$ under samples from $\hat{q}_{\textnormal{ERM}}$, lowering the label complexity.
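To get a feel for how the validity lower bound enters the budgets, the sketch below evaluates the expressions inside the $O(\cdot)$ bounds of Theorem 3. The function name `theorem3_budgets` and the leading constant `C` are assumptions: the theorem only fixes the bounds up to constants, so the outputs are useful for comparing settings rather than as literal sample sizes. The term $\log(1/\gamma\epsilon_{2})$ is read here as $\log(1/(\gamma\epsilon_{2}))$.

```python
import math


def theorem3_budgets(M, card_Q, D, gamma, eps1, eps2, delta, C=1.0):
    """Evaluate the Theorem 3 bounds with an assumed leading constant C.

    Returns (number of samples from P, number of validity queries).
    """
    log_inv_delta = math.log(1.0 / delta)
    # Samples used for the ERM over the finite class Q.
    erm_samples = C * M**2 * (math.log(card_Q) + log_inv_delta) / eps1**2
    # Samples from P used when fitting the validity estimate.
    validity_samples = C * M * (D * math.log(M / eps1) + log_inv_delta) / eps1
    # Validity queries on samples drawn from the ERM model.
    queries = C * (D * math.log(1.0 / (gamma * eps2)) + log_inv_delta) / (gamma * eps2)
    return math.ceil(erm_samples + validity_samples), math.ceil(queries)


# Same problem, two different validity lower bounds (illustrative values).
common = dict(M=5.0, card_Q=1000, D=10, eps1=0.1, eps2=0.05, delta=0.05)
samples_small, queries_small = theorem3_budgets(gamma=0.05, **common)
samples_big, queries_big = theorem3_budgets(gamma=0.5, **common)
print(samples_small, queries_small)
print(samples_big, queries_big)
```

The sample budget does not depend on `gamma`, while the query budget shrinks as `gamma` grows, which is the effect the corollary exploits by replacing a worst-case $\gamma$ with a constant $c$.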

Algorithm 3 Restriction to ERM under Log-Loss without Validity Assumption
1: procedure valid_restriction_log(Distribution Class $\mathcal{Q}$, Validity Class $\mathcal{V}$, $\mathcal{D}$, $\delta$, $\epsilon_{1}$, $\epsilon_{2}$)
2:     $S\leftarrow\Omega\left(\frac{M^{2}\left(\log(|\mathcal{Q}|)+\log(1/\delta)\right)}{\epsilon_{1}^{2}}\right)$ i.i.d. samples $\sim P$
3:     $\hat{q}_{\textnormal{ERM}}\leftarrow\operatorname*{arg\,min}_{q\in\mathcal{Q}}\sum_{x\in S}\min(M,-\log(f_{q}(x)))$
4:     $\tilde{q}_{\textnormal{ERM}}\leftarrow(1-\epsilon_{1}/8)\cdot\hat{q}_{\textnormal{ERM}}+\epsilon_{1}/8\cdot\mathcal{D}$    ▷ Mix with constant validity $\mathcal{D}$
5:     $S_{P}\leftarrow\Omega\left(\frac{M\left(D\log\left(M/\epsilon_{1}\right)+\log(1/\delta)\right)}{\epsilon_{1}}\right)$ i.i.d. samples $\sim P$,
       $S_{\tilde{q}_{\textnormal{ERM}}}\leftarrow\Omega\left(\frac{D\log\left(1/\epsilon_{1}\epsilon_{2}\right)+\log(1/\delta)}{\epsilon_{1}\epsilon_{2}}\right)$ i.i.d. samples $\sim\tilde{q}_{\textnormal{ERM}}$
6:     $\hat{v}\leftarrow\operatorname*{arg\,min}_{h\in\mathcal{V}}\sum_{x\in S_{P}\cup S_{\tilde{q}_{\textnormal{ERM}}}}\mathbbm{1}[h(x)\neq v(x)]$    ▷ Label $x\in S_{\tilde{q}_{\textnormal{ERM}}}$ via $v$
7:     return $f_{\hat{q}}\propto f_{\tilde{q}_{\textnormal{ERM}}}(x)\cdot\mathbbm{1}[\hat{v}(x)=1]$ if $\tilde{q}_{\textnormal{ERM}}\left(\{\hat{v}(x)=1\}\right)>0$ else $f_{\hat{q}}=f_{\hat{q}_{\textnormal{ERM}}}$
8: end procedure
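The steps of Algorithm 3 can be sketched on a toy finite domain as follows. This is a minimal illustration under several assumptions that are not in the paper: the domain is $\{0,\dots,K-1\}$ with pmfs as lists, the validity class is the set of thresholds $h_{t}(x)=\mathbbm{1}[x<t]$, sample sizes are passed in directly rather than derived from the $\Omega(\cdot)$ expressions, and samples from $P$ are treated as valid without a query, as the comment on line 6 suggests. All function and variable names are hypothetical.

```python
import math
import random


def capped_log_loss(p, M):
    """min(M, -log p), with the convention that p = 0 gives the cap M."""
    return M if p <= 0.0 else min(M, -math.log(p))


def draw(pmf, n, rng):
    """n i.i.d. draws from a pmf over {0, ..., K-1}."""
    return rng.choices(range(len(pmf)), weights=pmf, k=n)


def valid_restriction_log(Q, thresholds, D, P, v, eps1, M, n_erm, n_P, n_q, rng):
    # Lines 2-3: ERM over the finite class Q under the capped log-loss.
    S = draw(P, n_erm, rng)
    q_erm = min(Q, key=lambda q: sum(capped_log_loss(q[x], M) for x in S))
    # Line 4: mix with the known distribution D of constant validity.
    a = eps1 / 8.0
    q_mix = [(1.0 - a) * qe + a * d for qe, d in zip(q_erm, D)]
    # Line 5: fresh samples from P (assumed valid) and from q_mix (queried via v).
    S_P = draw(P, n_P, rng)
    S_q = draw(q_mix, n_q, rng)
    labelled = [(x, 1) for x in S_P] + [(x, v(x)) for x in S_q]
    # Line 6: ERM over thresholds h_t(x) = 1[x < t]; ties go to the smallest t.
    t_hat = min(
        thresholds,
        key=lambda t: sum((1 if x < t else 0) != y for x, y in labelled),
    )
    # Line 7: restrict q_mix to the estimated valid region, else fall back to q_erm.
    z = sum(p for x, p in enumerate(q_mix) if x < t_hat)
    if z <= 0.0:
        return q_erm, t_hat
    return [p / z if x < t_hat else 0.0 for x, p in enumerate(q_mix)], t_hat


# Toy instance: points 0..5 are valid, 6..9 are invalid.
K = 10
rng = random.Random(0)
v = lambda x: 1 if x < 6 else 0
P = [1.0 / 6] * 6 + [0.0] * 4
D = [1.0 / K] * K
Q = [[1.0 / K] * K, [0.15] * 6 + [0.025] * 4, [0.05] * 6 + [0.175] * 4]

q_hat, t_hat = valid_restriction_log(
    Q, range(K + 1), D, P, v, eps1=0.4, M=5.0, n_erm=200, n_P=200, n_q=200, rng=rng
)
invalid_mass = sum(q_hat[6:])
print(t_hat, invalid_mass)
```

In this realizable toy setting the smallest consistent threshold never exceeds the true one, so the output places no mass on invalid points; with a richer validity class the invalid mass would instead be controlled through the disagreement rate, as in the analysis.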

The proof of Theorem 4 is very similar to that of Theorem 3. The main difference is that when the loss is the capped log-loss, we can exploit a stability property under mixture similar to that introduced in Lemma 4. This allows us to mix $\hat{q}_{\textnormal{ERM}}$ with a distribution of constant validity to get a validity lower bound on the final proposal distribution $\tilde{q}_{\textnormal{ERM}}$ without increasing the loss by more than $O(\epsilon_{1})$. The validity lower bound can then be used as in Theorem 3.
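One standard way to see why such a mixture is cheap under the capped log-loss is the pointwise calculation below. It is a sketch and may differ in form from the statement of Lemma 4; the symbol $a=\epsilon_{1}/8$ is shorthand introduced here for the mixing weight on line 4 of Algorithm 3, and the third step uses $-\log(1-a)\leq 2a$ for $0\leq a\leq 1/2$.

```latex
% a = \epsilon_1 / 8 is the mixing weight; x is an arbitrary point of the domain.
\begin{align*}
\min\Big(M,\ \log\frac{1}{(1-a)\,f_{\hat{q}_{\mathrm{ERM}}}(x) + a\,f_{\mathcal{D}}(x)}\Big)
  &\leq \min\Big(M,\ \log\frac{1}{(1-a)\,f_{\hat{q}_{\mathrm{ERM}}}(x)}\Big) \\
  &\leq \min\Big(M,\ \log\frac{1}{f_{\hat{q}_{\mathrm{ERM}}}(x)}\Big) + \log\frac{1}{1-a} \\
  &\leq \min\Big(M,\ \log\frac{1}{f_{\hat{q}_{\mathrm{ERM}}}(x)}\Big) + \frac{\epsilon_1}{4}, \\[4pt]
V(\tilde{q}_{\mathrm{ERM}})
  &= (1-a)\,V(\hat{q}_{\mathrm{ERM}}) + a\,V(\mathcal{D})
  \ \geq\ a\,c \ =\ \frac{c\,\epsilon_1}{8}.
\end{align*}
```

The first display says the mixture costs at most an additive $\epsilon_{1}/4$ in capped log-loss at every point, and the second gives the validity lower bound $c\epsilon_{1}/8$ that appears in the proof of Theorem 4.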

Theorem 4.

Suppose $v \in \mathcal{V}$ where $VC(\mathcal{V}) \leq D$, and that for each $q \in \mathcal{Q}$, we have $f_{q}(x) \leq \beta$. Suppose further that there is some known $\mathcal{D} \in \mathcal{P}$ with density $f_{\mathcal{D}}$ which has $V(\mathcal{D}) \geq c > 0$ for some constant $c$. Then for all choices of $0 < \epsilon_{1}, \epsilon_{2}, \delta < 1/2$ and $M > 0$, Algorithm 3 requires a number of samples

$$\leq\tilde{O}\left(\frac{M^{2}\left(\log(|\mathcal{Q}|)+\log(1/\delta)\right)}{\epsilon_{1}^{2}}+\frac{M\left(D\log(M/\epsilon_{1})+\log(1/\delta)\right)}{\epsilon_{1}}\right),$$

and a number of validity queries

$$\leq O\left(\frac{D\log(1/\epsilon_{1}\epsilon_{2})+\log(1/\delta)}{\epsilon_{1}\epsilon_{2}}\right),$$

to ensure that with probability $\geq 1-\delta$, its output enjoys

$$\mathbb{E}_{X\sim P}\bigg[\min\left(M,\log(1/f_{\hat{q}}(X))\right)\bigg]\leq\mathbb{E}_{X\sim P}\bigg[\min\left(M,\log(1/f_{q^{*}}(X))\right)\bigg]+\epsilon_{1} \quad \text{and} \quad I(\hat{q})\leq\epsilon_{2}.$$
Proof.

Without loss of generality, assume $\beta \geq 1$, and consider learning over $\bar{l}(z)=\min(M,\log(1/z))-\log(1/\beta)$, a translation of the capped log-loss that is bounded below by $0$ on all values taken by the densities $f_{q}$, $q \in \mathcal{Q}$, and bounded above by $\bar{M}=M-\log(1/\beta)$.
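The boundedness claim for the translated loss can be checked numerically. The sketch below evaluates $\bar{l}$ on a grid of density values in $(0,\beta]$ and confirms that it stays in $[0,\bar{M}]$ and is non-increasing. The function name and the particular values of `M` and `beta` are illustrative assumptions.

```python
import math


def translated_capped_log_loss(z, M, beta):
    """l_bar(z) = min(M, log(1/z)) - log(1/beta), for density values 0 < z <= beta."""
    return min(M, math.log(1.0 / z)) - math.log(1.0 / beta)


M, beta = 5.0, 4.0                # illustrative values with beta >= 1 and M > 0
M_bar = M - math.log(1.0 / beta)  # claimed upper bound on l_bar

# Density values on a grid in (0, beta], in increasing order.
grid = [beta * (k / 1000.0) for k in range(1, 1001)]
values = [translated_capped_log_loss(z, M, beta) for z in grid]

print(min(values), max(values), M_bar)
```

Since $\log(1/z)\geq\log(1/\beta)$ whenever $z\leq\beta$, the translation by $\log(1/\beta)$ is exactly what makes the loss non-negative on the range of the densities, so that Hoeffding's inequality applies with range $[0,\bar{M}]$.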

Similar to the proof of Theorem 3, with probability $\geq 1-\delta/3$ over the sample $S \sim P^{n}$, it holds for all $q \in \mathcal{Q}$ that $\left|L_{S}(q;\bar{l})-L_{P}(q;\bar{l})\right|\leq\epsilon_{1}/8$; in this case, we use the fact that $f_{q}\leq\beta$ to ensure that the random variables $\bar{l}\left(f_{q}(X)\right)$ for $X \sim P$ are bounded, allowing for an application of Hoeffding’s inequality over empirical estimates of a loss unbounded below. As above, for any choice of $\tilde{q}_{\textnormal{ERM}}$, the sizes of $S_{P}$ and $S_{\tilde{q}_{\textnormal{ERM}}}$ are such that with probability $\geq 1-2\delta/3$,

$$\textnormal{Pr}_{X\sim P}\bigg(\hat{v}(X)\neq v(X)\bigg)\leq\frac{\epsilon_{1}}{2M} \quad \wedge \quad \textnormal{Pr}_{X\sim\tilde{q}_{\textnormal{ERM}}}\bigg(\hat{v}(X)\neq v(X)\bigg)\leq\frac{c\epsilon_{1}\epsilon_{2}}{16}.$$

As before, by a union bound, all of these bounds hold simultaneously with probability $\geq 1-\delta$. We condition on this intersection of favorable events going forward.

Note that when this intersection of events takes place, we have $\tilde{q}_{\textnormal{ERM}}\left(\{\hat{v}(x)=1\}\right)>0$. Indeed, $V(\tilde{q}_{\textnormal{ERM}})\geq\epsilon_{1}V(\mathcal{D})/8\geq\epsilon_{1}c/8$ as $V(\mathcal{D})\geq c$, and so, exactly as in the proof of Theorem 3, we may write

$$\begin{aligned}
\tilde{q}_{\textnormal{ERM}}\left(\{\hat{v}(x)=1\}\right) &\geq \mathbb{E}_{X\sim\tilde{q}_{\textnormal{ERM}}}\left[\mathbbm{1}[v(X)=1]\right]-\int\mathbbm{1}[\hat{v}(x)\neq v(x)]\,d\tilde{q}_{\textnormal{ERM}}(x) \\
&\geq \frac{c\epsilon_{1}}{8}-\frac{c\epsilon_{1}\epsilon_{2}}{16} \\
&> 0.
\end{aligned}$$

Thus, the restriction of $\tilde{q}_{\textnormal{ERM}}$ to the estimate of the valid region is well defined and is output by the algorithm as $\hat{q}$. To see that the loss guarantee then holds for such a $\hat{q}$, consider the loss decomposition used in the proof of Theorem 3:

$$L_{P}(\hat{q};\bar{l})=\mathbb{E}_{X\sim P}\bigg[\bar{l}\left(f_{\hat{q}}(X)\right)\mathbbm{1}[\hat{v}(X)=v(X)]\bigg]+\mathbb{E}_{X\sim P}\bigg[\bar{l}\left(f_{\hat{q}}(X)\right)\mathbbm{1}[\hat{v}(X)\neq v(X)]\bigg].$$

We upper bound the second term exactly as in Theorem 3. To upper bound the first term, consider an argument similar to that of the proof of Lemma 6. Let Vv^(q~ ERM)>0subscript𝑉^𝑣subscript~𝑞 ERM0V_{\hat{v}}(\tilde{q}_{\textnormal{ ERM}})>0italic_V start_POSTSUBSCRIPT over^ start_ARG italic_v end_ARG end_POSTSUBSCRIPT ( over~ start_ARG italic_q end_ARG start_POSTSUBSCRIPT ERM end_POSTSUBSCRIPT ) > 0 be the normalizing constant for q^^𝑞\hat{q}over^ start_ARG italic_q end_ARG. We can write

\begin{align*}
\mathbb{E}_{X\sim P}\Big[\bar{l}\big(f_{\hat{q}}(X)\big)\,\mathbbm{1}[\hat{v}(X)=v(X)]\Big]
&\leq \mathbb{E}_{X\sim P}\bigg[\bar{l}\bigg(\frac{f_{\tilde{q}_{\textnormal{ERM}}}(X)\,\mathbbm{1}[\hat{v}(X)=1]}{V_{\hat{v}}(\tilde{q}_{\textnormal{ERM}})}\bigg)\cdot\mathbbm{1}[\hat{v}(X)=1]\bigg]\\
&\leq \mathbb{E}_{X\sim P}\bigg[\bar{l}\bigg(\frac{f_{\tilde{q}_{\textnormal{ERM}}}(X)}{V_{\hat{v}}(\tilde{q}_{\textnormal{ERM}})}\bigg)\bigg]\\
&\leq \mathbb{E}_{X\sim P}\Big[\bar{l}\big(f_{\tilde{q}_{\textnormal{ERM}}}(X)\big)\Big].
\end{align*}
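As a sanity check on the last inequality in the display above, the following minimal sketch evaluates the loss before and after dividing the density by a normalizing constant $V\in(0,1]$. It assumes the form $\bar{l}(f)=\log(1/f)\wedge M-\log(1/\beta)$, which is read off from the identity used later in this proof; the function names and the numeric values of `M` and `beta` are hypothetical and chosen only for illustration.

```python
import math

def l_bar(f, M, beta):
    """Truncated, shifted log-loss: min(log(1/f), M) - log(1/beta).

    Assumed form, inferred from the identity
    l_bar(f) + log(1/beta) = min(log(1/f), M) used in the proof.
    """
    return min(math.log(1.0 / f), M) - math.log(1.0 / beta)

def renormalizing_does_not_increase_loss(f, V, M, beta, tol=1e-12):
    """Check l_bar(f / V) <= l_bar(f) for a normalizing constant V in (0, 1].

    Dividing by V <= 1 can only raise the density value, and l_bar is
    non-increasing in its argument, so the loss cannot go up.
    """
    if not (0.0 < V <= 1.0):
        raise ValueError("V must lie in (0, 1]")
    return l_bar(f / V, M, beta) <= l_bar(f, M, beta) + tol

# Hypothetical values: density 0.2, normalizer 0.5, truncation M = 5, beta = 0.1.
print(renormalizing_does_not_increase_loss(0.2, 0.5, M=5.0, beta=0.1))
```

When both density values are small enough that the truncation at $M$ is active, the two losses coincide, so the inequality holds with equality there.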

Here, the loss is still non-increasing in its argument, which gives the final step. Now fix an arbitrary $x\in\mathcal{X}$ and note the following, in the style of Lemma 4:

\begin{align*}
\bar{l}\big(f_{\tilde{q}_{\textnormal{ERM}}}(x)\big)+\log(1/\beta)
&=\log\Big(1/f_{\tilde{q}_{\textnormal{ERM}}}(x)\Big)\wedge M\\
&\leq\bigg(\log\Big(\frac{1}{1-\epsilon_{1}/8}\Big)+\log\Big(1/f_{\hat{q}_{\textnormal{ERM}}}(x)\Big)\bigg)\wedge M\\
&\leq\bigg(\epsilon_{1}/4+\log\Big(1/f_{\hat{q}_{\textnormal{ERM}}}(x)\Big)\bigg)\wedge M\\
&\leq\log\Big(1/f_{\hat{q}_{\textnormal{ERM}}}(x)\Big)\wedge M+\frac{\epsilon_{1}}{4}.
\end{align*}

Thus, it holds that $\bar{l}(f_{\tilde{q}_{\textnormal{ERM}}}(x))\leq\bar{l}(f_{\hat{q}_{\textnormal{ERM}}}(x))+\epsilon_{1}/4$, and so we may write
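The pointwise chain above uses two elementary facts in its last two steps: $\log\big(1/(1-\epsilon_{1}/8)\big)\leq\epsilon_{1}/4$, and $(a+b)\wedge M\leq b\wedge M+a$ for $a\geq 0$. The sketch below checks both numerically. It assumes $\epsilon_{1}\in(0,1]$ (the range is not restated in this part of the proof), and the function name and test values are hypothetical.

```python
import math

def check_pointwise_steps(eps1, f_hat, M, tol=1e-12):
    """Numerically check the two elementary steps of the pointwise chain.

    eps1  : accuracy parameter, assumed here to lie in (0, 1]
    f_hat : a positive density value standing in for f_{q_hat_ERM}(x)
    M     : truncation level of the log-loss
    Returns (log_step_ok, truncation_step_ok).
    """
    a = math.log(1.0 / (1.0 - eps1 / 8.0))  # log(1 / (1 - eps1/8))
    b = math.log(1.0 / f_hat)               # log(1 / f_hat)

    # Step 1: log(1/(1 - eps1/8)) <= eps1/4.
    log_step_ok = a <= eps1 / 4.0 + tol

    # Step 2: min(eps1/4 + b, M) <= min(b, M) + eps1/4,
    # i.e. a non-negative shift can be pulled outside the truncation.
    truncation_step_ok = min(eps1 / 4.0 + b, M) <= min(b, M) + eps1 / 4.0 + tol

    return log_step_ok, truncation_step_ok

print(check_pointwise_steps(eps1=0.5, f_hat=0.01, M=10.0))
```

The second fact follows by cases: if $b\leq M$ then $b\wedge M+a=a+b$, which is at least $(a+b)\wedge M$; if $b>M$ then $b\wedge M+a=M+a\geq M\geq(a+b)\wedge M$.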

\begin{align*}
\mathbb{E}_{X\sim P}\Big[\bar{l}\big(f_{\hat{q}}(X)\big)\Big]
&\leq\mathbb{E}_{X\sim P}\Big[\bar{l}\big(f_{\hat{q}_{\textnormal{ERM}}}(X)\big)\Big]+\frac{\epsilon_{1}}{4}\\
&=L_{P}(\hat{q}_{\textnormal{ERM}};\bar{l})+\frac{\epsilon_{1}}{4}.
\end{align*}

Thus, we have written the loss in terms of the ERM, and so we have, as in Theorem 3, that the first term of the loss decomposition can be bounded by $L_{P}(q^{*})+\epsilon_{1}/2$.

Given $\textnormal{Pr}_{X\sim\tilde{q}_{\textnormal{ERM}}}\left(\hat{v}(X)\neq v(X)\right)\leq c\epsilon_{1}\epsilon_{2}/16$, and the lower bound on the validity of $\tilde{q}_{\textnormal{ERM}}$ derived in the third paragraph above, we can again apply Lemma 5 to get the validity guarantee. ∎