Distribution Learning with Valid Outputs Beyond the Worst-Case
Abstract
Generative models at times produce “invalid” outputs, such as images with generation artifacts and unnatural sounds. Validity-constrained distribution learning attempts to address this problem by requiring that the learned distribution have a provably small fraction of its mass in invalid parts of space – something which standard loss minimization does not always ensure. To this end, a learner in this model can guide the learning via “validity queries”, which allow it to ascertain the validity of individual examples. Prior work on this problem takes a worst-case stance, showing that proper learning requires an exponential number of validity queries, and demonstrating an improper algorithm which – while generating guarantees in a wide range of settings – makes an atypical polynomial number of validity queries. In this work, we take a first step towards characterizing regimes where guaranteeing validity is easier than in the worst case. We show that when the data distribution lies in the model class and the log-loss is minimized, the number of samples required to ensure validity has a weak dependence on the validity requirement. Additionally, we show that when the validity region belongs to a VC-class, a limited number of validity queries are often sufficient.
1 Introduction
When sampling from a generative model, it is highly desirable that its outputs meet some basic criteria of quality. In the case of text, this may mean that generated sentences respect grammar rules, or avoid the use of biased or offensive language Perez et al. (2022); Abid et al. (2021). When generating code, a criterion may be that the generated code successfully compiles Hanneke et al. (2018). In image generation, we might wish to avoid blurry outputs, or those possessing generation artifacts which clearly distinguish them from natural images Kaneko and Harada (2021); Odena et al. (2016).
In this paper, we examine the statistical cost of ensuring that learned distributions produce such “valid” outputs. To do so, we consider an elegant formulation of the problem of learning such valid models due to Hanneke et al. (2018). In their work, training data are generated according to a probability distribution , and the binary “validity” of examples is determined by some unknown “validity function” . Given sample access to and query access to , a learner attempts to identify a probability distribution which outputs invalid examples with probability at most . At the same time, the distribution should have a loss which is at most worse than that of the minimum loss model in a class which outputs valid examples with probability 1. Here, query access to captures the idea that collecting samples is often cheap, but verifying validity is often less so, possibly requiring a human-in-the-loop.
The initial work of Hanneke et al. (2018) suggests that choosing such a low-loss, high-validity distribution may require a large number of validity queries. Under the assumption that is “fully-valid”, i.e. outputs a valid example with probability 1, they show that in the worst case, validity queries are required to choose such a model from the class . They follow this result with an improper learning algorithm for choosing which, while achieving polynomial bounds on the number of validity queries, uses a relatively large number of validity queries .
The somewhat pessimistic picture painted by these fascinating complexity-theoretic results can be traced to their generality. Firstly, it’s possible that and are significantly “mismatched”, i.e. the support of each model has only a small overlap with the support of the distribution , in which case the validity information contained in valid training samples is unhelpful to a proper learner. Secondly, their improper learning algorithm is largely loss-agnostic, in that it generates guarantees for a wide class of bounded loss functions. Finally, nothing is assumed about the form of the validity function , precluding provable estimation.
In this work, we offer a counterbalance to this picture, beginning an investigation into learning settings where guaranteeing validity is cheaper than such results might indicate. We first consider learning under complete elimination of model class mismatch, where is rich enough to contain the fully-valid data distribution , and the loss is the log-loss . It is intuitive that in this setting, loss minimization alone should guarantee validity. Somewhat less intuitively, we demonstrate an algorithm closely related to empirical risk minimization which uses just samples to guarantee its output meets loss and validity requirements – in other words, validity comes quickly under random sampling from in this setting.
Secondly, we consider learning under a different realizability assumption, namely that the validity region is a member of a VC-class of dimension . In this setting, we provide an analysis of the natural scheme of restricting the empirical risk minimizer to an estimate of the valid part of space. We show that when small-loss models have at least constant validity, this scheme uses validity queries, implying a query cost reduction over the general-purpose algorithm of Hanneke et al. (2018). We also show that learning under the capped log-loss can be used to relax the assumption of constant validity at the cost of an extra factor of .
Our results suggest the existence of a rich web of settings in which validity may be cheaper than in the general case. They also suggest that the choice of the loss plays an important role in guaranteeing valid outputs, compelling further investigation of the log-loss in particular.
2 Related Work
The framing of learning distributions in terms of PAC guarantees similar to Valiant (1984) dates back to Kearns et al. (1994), who consider the learnability of specific classes of discrete distributions under a realizability assumption. A significant body of work on distribution learning has been developed over time, often focusing on algorithms for learning over parametric families or under specific “structural” assumptions Haghtalab et al. (2019); Daskalakis et al. (2013); Kalai et al. (2010); Daskalakis et al. (2012). The only theoretical contribution to validity-constrained distribution learning under the formulation posed by Hanneke et al. (2018) that we are aware of is that work itself.
Loss functions for the evaluation of probabilistic models have often been studied through the lens of “scoring rules” in the forecasting literature Brier (1950); Good (1952); Gneiting and Raftery (2007). There are some notable recent contributions towards expanding the understanding of when loss functions for distribution learning display desirable properties, e.g. “properness”, which designates that the loss is minimized by the true data distribution Haghtalab et al. (2019); Frongillo et al. (2022).
The first half of this paper draws on intuition from hypothesis testing to evaluate the performance of empirical risk minimization. Hypothesis testing is a major focus of the classical statistics literature Lehmann et al. (1986). The bounds in the first half of the paper are due to analysis inspired by the Neyman-Pearson lemma Neyman and Pearson (1933); Rebeschini (2021), and rely on the approximation of total variational distance between product measures Reiss (1981).
The applied literature on generative modeling has consistently noted the problem of learned models producing “invalid” examples Kusner et al. (2017); Janz et al. (2017); Isola et al. (2017); Kaneko and Harada (2021). Various techniques have been proposed for mitigating invalidity, both generally and in domain-specific settings Aitken et al. (2017); Kusner et al. (2017); Kong and Chaudhuri (2023). While working under the assumption that the validity function lies in a VC-class, the strategy we introduce has some rough semblance to a “post-editing” procedure proposed by Kong and Chaudhuri (2023).
3 Preliminaries
3.1 Problem Setup
Let be a subset of Euclidean space with finite Lebesgue measure . Let denote the set of all probability distributions on the measurable space , where arises from Lebesgue measurable sets intersected with . Let be the data-generating distribution.
In the eyes of the learner, the function is a fixed and unknown “validity function”, measurable with respect to the relevant distributions. The validity function denotes whether or not an example is considered a valid output for a learned approximation of . The learner is given a model class of probability distributions on , each with density with respect to , and afforded the knowledge that at least one is “fully-valid”, i.e. that there is some with invalidity . We at times use the notion of “invalidity” of a model, by which we mean . Following the main exposition of Hanneke et al. (2018), we assume is of finite cardinality.
The goodness-of-fit of a model is governed by a decreasing “local” loss function . Such a loss function gives rise to loss of model via . Given an i.i.d sample from , we let the empirical estimate of the loss of a model be . We use the shorthand and to denote the true and empirical losses of under the log-loss , where denotes the natural logarithm. We take the log-loss to be infinite at points where .
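As a minimal sketch of these definitions, we can work on a finite domain, so that densities become probability mass functions; this is a simplification of the Lebesgue-density setting above. The function names and the toy distributions below are illustrative assumptions and do not come from the paper.

```python
import math

def log_loss(density_value):
    """Local log-loss ln(1/q(x)); taken to be infinite where the density is zero."""
    if density_value <= 0.0:
        return math.inf
    return -math.log(density_value)

def empirical_loss(model, sample, local_loss=log_loss):
    """Average local loss of `model` (dict: point -> mass) over an i.i.d. sample."""
    return sum(local_loss(model.get(x, 0.0)) for x in sample) / len(sample)

def true_loss(model, data_dist, local_loss=log_loss):
    """Expected local loss under `data_dist`, ignoring points of zero data mass."""
    return sum(p * local_loss(model.get(x, 0.0))
               for x, p in data_dist.items() if p > 0.0)

# Hypothetical toy instance on the domain {0, 1, 2}.
P = {0: 0.5, 1: 0.5}              # data-generating distribution
q = {0: 0.25, 1: 0.25, 2: 0.5}    # a model placing half its mass off the data support
sample = [0, 1, 1, 0]             # a fixed sample standing in for draws from P

loss_hat = empirical_loss(q, sample)   # average of ln(1/q(x)) over the sample
loss = true_loss(q, P)                 # expectation of ln(1/q(x)) under P
```

Because the local loss is decreasing in the density value, any model that gives a sampled point zero mass receives an infinite empirical log-loss.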
3.2 Goal of Learning
The goal of the learner is to choose some which has a loss similar to that of the lowest-loss fully-valid model in , while simultaneously maintaining near full-validity. Explicitly, consider the model
To describe the quality of an outputted model, we consider two learning parameters and , where is used to control the loss sub-optimality, and to control the invalidity. Formally then, the goal of the learner is to output satisfying and . To accomplish this goal, the learner has sample access to , and query access to , i.e. a learner can draw any finite number of samples from , and can request the value of the validity function at any finite number of inputs in .
At a minimum, we are interested in algorithms which require a number of samples from and a number of validity queries that are polynomial in , and . Ideally, we would like to minimize the number of validity queries given some polynomial number of samples from . The motivation for this goal is similar to the minimization of label queries in active learning for classification Hanneke (2014), where samples from the marginal over instances are often cheap, but labeling such examples is assumed expensive.
3.3 Full-Validity of
We assume that all samples from the data-generating distribution are valid, i.e. that . Under such an assumption, the query demand of a learning algorithm can be conceptualized as the overhead number of queries sufficient for choosing a good model under the standard procedure of removing invalid examples from the training set.
If the data distribution is not fully-valid, and valid samples are required by an algorithm, the question of minimizing the overall number of queries is dependent on the sample complexity of learning – if one assumes that has been constructed by “accepting” valid samples from some underlying distribution which outputs a valid sample with constant probability, then the overall query cost incurred by an algorithm is on the order of the larger of the number of samples and the number of “overhead” validity queries it uses.
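The "accepting" construction described above can be sketched as rejection sampling with a query counter. This is only an illustration under assumed names: `draw_unfiltered` and `validity_oracle` are hypothetical callables standing in for the underlying distribution and the validity function.

```python
import random

def filtered_sample(draw_unfiltered, validity_oracle, n_valid):
    """Draw until n_valid valid points are accepted; each draw costs one validity query."""
    accepted, queries = [], 0
    while len(accepted) < n_valid:
        x = draw_unfiltered()
        queries += 1
        if validity_oracle(x):
            accepted.append(x)
    return accepted, queries

# Hypothetical instance: uniform draws on {0, ..., 9}, of which {0, ..., 4} are valid,
# so the underlying distribution outputs a valid point with constant probability 1/2.
rng = random.Random(0)
sample, queries = filtered_sample(lambda: rng.randrange(10), lambda x: x < 5, 50)
```

When the acceptance probability is a constant, the expected number of queries spent on filtering is a constant multiple of the number of samples requested, which is why the total query cost is governed by the larger of the sample size and the overhead query count.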
In this paper, we are primarily interested in the “overhead” number of queries, which we refer to as the “number of validity queries” of a given scheme. In most cases, algorithm sample requirements are similar to , which allows for accurate loss estimation in many settings.
3.4 Summary of Previous Results
The learning problem above is due to Hanneke et al. (2018), who considered the possibility of specifying learning algorithms meeting the above bi-criteria objective for any choice of bounded, decreasing, local loss function.
This work gives some interesting insight into the difficulty of selecting such a low-loss, high-validity model. They begin by giving a negative result, namely that any proper learning algorithm outputting , must make validity queries in the worst case, regardless of the number of samples available from . This result arises from a specific problem instance wherein every has a significant amount of mass outside of the support of , in which case samples from a fully-valid do not give information about in parts of space relevant to the choice of .
On the other hand, they demonstrate an improper learning algorithm which achieves polynomial bounds on samples and validity queries for any choice of loss meeting the above criteria. Their algorithm harnesses a constrained ERM oracle, iteratively querying the validity of samples from the model which is the empirical loss minimizer putting no mass on points known to be invalid. In particular, their scheme uses samples and validity queries.
4 Learning Without Model Class Mismatch Under the Log-Loss
We first consider the problem of selecting a low-loss, high-validity model under a relaxation of two of the main sources of difficulty in the original problem formulation: the misalignment of the model class with the data distribution , and the lack of assumptions on the loss.
In particular, we consider the problem under a realizability assumption, namely that , further investigating the power of the log-loss. Such a setting is arguably more closely aligned with contemporary learning settings with rich model classes that appropriately capture features of the underlying data distribution, where the validity information contained in samples from can be exploited by convergence to the best information-theoretic representation of in .
The log-loss is by far the most widely-used loss in practice Haghtalab et al. (2019). It is a classic result of the proper scoring rule literature that the log-loss is the unique strictly-proper local loss, i.e. the only local loss under which for all distributions , it holds that . This highly desirable property – implying that convergence to the optimum over coincides with convergence to – makes the choice of an alternative outside of capped variants preferable only under specialized circumstances.
4.1 Towards Validity without Validity Queries
Given that samples are assumed to be valid, and the log-loss permits convergence to the data generating distribution, one would hope that simply selecting a model which is a sufficiently good representation of under the log-loss would yield validity guarantees in this setting. Simply utilizing empirical risk minimization (ERM) is the canonical approach to this end, and one which, given sufficient data from , uses exactly zero validity queries.
Note that any model with invalidity necessarily has . In this case, must have at least mass in the invalid part of space, where has none. Thus, if one can guarantee that has , the validity requirement is met. Recalling the Pinsker inequality relating total variational distance and KL-divergence, it follows that obtaining a model which is at most sub-optimal in log-loss yields a model meeting the validity requirement.
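The chain of implications in this argument can be written out as follows. The symbols are assumptions made for this sketch and may differ from the paper's notation: $P$ is the fully-valid data distribution with density $p$, $q$ is a model, $v$ is the validity function, $\mathrm{inv}(q) = q(\{x : v(x) = 0\})$, and $\epsilon$ is the validity parameter.

```latex
\begin{align*}
\mathrm{TV}(P, q)
  &= \sup_{A} \lvert P(A) - q(A) \rvert
   \;\ge\; q(\{x : v(x) = 0\}) - P(\{x : v(x) = 0\})
   \;=\; \mathrm{inv}(q), \\
\mathrm{TV}(P, q)
  &\le \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P \,\|\, q)}
   \qquad \text{(Pinsker's inequality)}, \\
\mathrm{KL}(P \,\|\, q)
  &= \mathbb{E}_{x \sim P}\!\left[\ln \tfrac{1}{q(x)}\right]
   - \mathbb{E}_{x \sim P}\!\left[\ln \tfrac{1}{p(x)}\right].
\end{align*}
```

Under realizability, $P$ itself minimizes the log-loss over the class, so the last line is exactly the log-loss sub-optimality of $q$. Combining the three lines, a sub-optimality of at most $2\epsilon^2$ gives $\mathrm{inv}(q) \le \mathrm{TV}(P, q) \le \epsilon$, which is a quadratic requirement in the validity parameter.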
While this illustrates useful intuition for the setting, it glosses over two main issues. Firstly, empirical estimates of the log-loss do not admit concentration guarantees – one can construct simple examples where is unbounded above, but with high probability, the empirical estimate is approximately that of Haghtalab et al. (2019). Thus, selecting low-empirical loss models can never yield loss guarantees. Secondly, this application of Pinsker’s inequality demands loss sub-optimality, suggesting that ensuring validity via the selection of a good model under the log-loss is even harder than guaranteeing a small loss.
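The first issue can be made concrete with a small hypothetical instance on a two-point domain; this is our own illustration and not necessarily the example of Haghtalab et al. (2019). Here the data distribution `P` puts a tiny mass `delta` on a point where the model `q` puts none, so the true log-loss of `q` is infinite, yet a typical sample never sees that point.

```python
import math

def empirical_log_loss(model, sample):
    """Average of ln(1/q(x)) over the sample; infinite if a sampled point has zero mass."""
    if any(model.get(x, 0.0) <= 0.0 for x in sample):
        return math.inf
    return sum(-math.log(model[x]) for x in sample) / len(sample)

def prob_sample_misses(point_mass, n):
    """Probability that n i.i.d. draws all avoid a point of the given mass."""
    return (1.0 - point_mass) ** n

# Hypothetical instance: P("b") = delta > 0 while q("b") = 0,
# so the expected log-loss of q under P is infinite.
delta, n = 1e-6, 1000
P = {"a": 1.0 - delta, "b": delta}
q = {"a": 1.0, "b": 0.0}

typical_sample = ["a"] * n                        # a sample in which "b" never appears
loss_q = empirical_log_loss(q, typical_sample)    # finite, despite the infinite true loss
loss_P = empirical_log_loss(P, typical_sample)    # roughly delta
p_miss = prob_sample_misses(delta, n)             # chance of drawing such a sample
```

With these values the sample misses the low-mass point with probability close to one, in which case the empirical losses of `q` and `P` are nearly identical even though their true losses differ by an unbounded amount.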
We would hope that, when zero-query learning is possible, guaranteeing validity arises somewhat coincidentally with convergence to , meaning that the sample complexity is not much worse given a validity requirement than without one. Thus, the path towards satisfaction of the learning objectives requires subtle handling, and compels particular attention to the sample complexity dependence on the validity parameter .
4.2 Analysis of Empirical Risk Minimization
As indicated above, it is not possible to guarantee that empirical risk minimization (ERM) outputs a model with small log-loss. It is, however, possible to guarantee that it outputs a model which closely resembles and inherits validity guarantees with a small number of samples.
In particular, it’s possible to show that given sufficient samples, ERM yields a model with small total variation to when . This is due to the following folklore theorem Haghtalab et al. (2019), which we prove under assumption of density existence in the Appendix.
Lemma 4.
Fix arbitrarily, and let be distributions with densities with respect to . Then if , and for , it holds with probability that
Thus, at the statistical cost of estimating a coin bias, any distribution with total variation from the data distribution will reveal itself to be empirically inferior when the log-loss is used. This can be easily leveraged to generate guarantees for ERM over in terms of total variation.
It is tempting to think that this is the entire story when it comes to guaranteeing validity. After all, we argued above that small total variation from is sufficient for invalidity. That said, simply looking at total variation ignores a particular structural feature of distributions with – in particular, such distributions have mass in parts of space in which does not.
This observation can be used to construct tight lower bounds on the total variational distance between product measures arising from and with . This leads to the following result, which states that ERM yields a faithful representation of the data generating distribution that is at most invalid given a number of samples with a modest dependence on the validity parameter .
Lemma 5.
Fix arbitrarily, and suppose . If is fully-valid under , and for , then with probability over , the ERM solution
satisfies both
Note that this guarantee is not redundant – having does not imply when .
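A minimal sketch of ERM under the log-loss over a finite model class is given below, again on a finite domain with probability mass functions in place of densities. The helper names, the three-model class, and the fixed sample are illustrative assumptions; the sketch only shows the selection rule and the two quantities the lemma controls, not the sample-size bound.

```python
import math

def empirical_log_loss(model, sample):
    """Average of ln(1/q(x)) over the sample; infinite if a sampled point has zero mass."""
    total = 0.0
    for x in sample:
        mass = model.get(x, 0.0)
        if mass <= 0.0:
            return math.inf
        total -= math.log(mass)
    return total / len(sample)

def erm(models, sample):
    """Return a model in the class minimizing the empirical log-loss."""
    return min(models, key=lambda q: empirical_log_loss(q, sample))

def total_variation(p, q):
    """Total variation distance between two mass functions given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def invalidity(model, is_valid):
    """Mass the model places on invalid points."""
    return sum(mass for x, mass in model.items() if not is_valid(x))

# Hypothetical realizable instance: the class contains the data distribution P.
P = {0: 0.5, 1: 0.5}
models = [P, {0: 0.5, 1: 0.3, 2: 0.2}, {0: 0.9, 1: 0.1}]
is_valid = lambda x: x in (0, 1)      # P is fully valid under this validity function
sample = [0, 1, 0, 1, 1, 0]           # a fixed sample standing in for draws from P

q_hat = erm(models, sample)
```

In this toy instance the second model has invalidity 0.2 and total variation 0.2 from `P`, matching the observation that invalid mass forces total variation of at least the same size.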
4.3 Attaining Log-Loss Guarantees
This result can be interpreted as a vote of confidence for the naive training of generative models under the log-loss. Nevertheless, from a learning-theoretic perspective, there is a question whether or not it is possible to guarantee low log-loss while maintaining validity with zero validity queries.
While ERM cannot possibly furnish log-loss guarantees, it turns out that it is possible to modify the output of ERM to generate log-loss guarantees at the cost of an extra polylogarithmic factor in the sample complexity, at least when the densities are bounded above and below in their support.
Footnote 1: This does not yield uniform convergence over given that the support of need not align with
The idea, formalized in Algorithm 1, is simply to mix the output of ERM with the uniform distribution. Giving the uniform distribution a mixture component on the order of can be shown to ensure that the validity guarantees of the ERM output are preserved, while also giving the outputted distribution support across the entire space. This leads to the following theorem.
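The mixing step can be sketched on a finite domain as follows. This is not a transcription of Algorithm 1: the mixture weight `lam` is a hypothetical free parameter here, whereas the paper ties it to the learning parameters, and `q_hat` is a hand-picked stand-in for an ERM output that misses a low-mass part of the data support.

```python
import math

def mix_with_uniform(model, domain, lam):
    """Return (1 - lam) * model + lam * Uniform(domain) as a mass function."""
    uniform_mass = 1.0 / len(domain)
    return {x: (1.0 - lam) * model.get(x, 0.0) + lam * uniform_mass for x in domain}

def invalidity(model, is_valid):
    """Mass the model places on invalid points."""
    return sum(mass for x, mass in model.items() if not is_valid(x))

def log_loss(model, data_dist):
    """Expected ln(1/q(x)) under data_dist; infinite if q misses any data mass."""
    total = 0.0
    for x, p in data_dist.items():
        if p == 0.0:
            continue
        mass = model.get(x, 0.0)
        if mass <= 0.0:
            return math.inf
        total += p * -math.log(mass)
    return total

# Hypothetical instance on the domain {0, 1, 2, 3}; points 0 and 1 are valid.
domain = [0, 1, 2, 3]
is_valid = lambda x: x in (0, 1)
P = {0: 0.99, 1: 0.01}
q_hat = {0: 1.0}          # stand-in ERM output: fully valid, but infinite true log-loss
lam = 0.1                 # assumed mixture weight on the uniform component

mixed = mix_with_uniform(q_hat, domain, lam)
```

The mixture's invalidity is at most the invalidity of `q_hat` plus `lam`, while every point of the domain now has mass at least `lam / len(domain)`, which caps the log-loss at `ln(len(domain) / lam)`.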
Theorem 1.
Fix arbitrarily, and suppose and that is fully-valid under . If it holds that for each that for all , then there is an
such that for all , with probability , the output of Algorithm 1 satisfies
Here the notation hides a polylogarithmic dependence on , which is insignificant in most regimes, and treats the density of the uniform distribution over as a constant, which would otherwise also enter polylogarithmically.
Theorem 1 shows that guarantees with respect to the unbounded log-loss are attainable improperly, i.e. when the learner can choose . It’s an interesting question whether the logarithmic dependence on can be removed with a more subtle strategy.
4.4 Discussion of Optimality
One might suspect that achieving a smaller dependence than on the validity parameter should be impossible. We confirm this is true at least for proper learners, showing that the analysis of ERM in Lemma 5 is tight in its dependence on . This lemma is used to generate the validity guarantee in Theorem 1.
Theorem 2.
For all and for all proper learners , if the sample is of size , then there exists a triple with and fully-valid, on which has invalidity with probability .
The intuition here is that while any invalid has at least total variation from , in the worst case, the total variation between and is upper bounded by as well. This makes distinguishing between and some invalid distribution hard enough to generate such a lower bound.
The sample requirement of , both in our guarantees and in previous work, is a standard offshoot of loss estimation, irrespective of the search for a valid model. In general, one cannot expect improvements to this end – this is the standard dependence one finds for estimating the means of bounded random variables. This suggests that the “realizable complexity” for this setting is – while non-zero losses should not be generally estimable using “realizable” techniques, guaranteeing small invalidity can be when .
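One illustrative instance of this intuition, which is not necessarily the construction used in the proof, takes an invalid model that agrees with the data distribution except for a small amount of displaced mass. The symbols are assumptions for this sketch: $P$ is the fully-valid data distribution, $R$ is any distribution supported on the invalid region, $\epsilon$ is the validity parameter, and $n$ is the sample size.

```latex
\begin{align*}
q &= (1 - \epsilon)\,P + \epsilon\,R,
  \qquad \mathrm{inv}(q) = \epsilon,
  \qquad \mathrm{TV}(P, q) = \epsilon, \\
\frac{dq^{n}}{dP^{n}}(x_1, \dots, x_n) &= (1 - \epsilon)^{n}
  \quad \text{for } P^{n}\text{-almost every } (x_1, \dots, x_n), \\
\mathrm{TV}(P^{n}, q^{n}) &= 1 - (1 - \epsilon)^{n} \;\le\; n\,\epsilon .
\end{align*}
```

The second line holds because $R$ places no mass on the support of $P$. Since the total variation between the product measures bounds the advantage of any test, a sample of size $n \ll 1/\epsilon$ cannot reliably separate $P$ from $q$, while $n$ on the order of $1/\epsilon$ makes $1 - (1 - \epsilon)^{n}$ a constant.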
5 Utilizing Estimates of the Validity Function in Training
In the general formulation of the problem, the learner is given an arbitrary bounded, decreasing loss, a model class which is mismatched with , and has no a priori information about the validity function . In such a setting, it is clear that validity queries are necessary.
In this section, we consider a setting where it is known to the learner that can be found in a hypothesis class of bounded complexity. Under such an assumption, we would hope to be able to lower the number of validity queries beyond the bounds of Hanneke et al. (2018).
5.1 Algorithm
A natural algorithm in this setting is to “correct” the invalidity of the empirical risk minimizer – to restrict the empirical risk minimizer to parts of space which are valid with respect to an estimate of the validity . This is precisely the idea formalized in Algorithm 2.
To generate guarantees for such a strategy, one must determine the distribution with respect to which the estimate should be accurate. In our case, we generate accuracy guarantees over both and the ERM model by selecting an estimate that has 0 empirical error over both distributions. The source of the query complexity of the algorithm comes from the fact that samples arising from must be labeled by oracle calls to . Noting that samples from can be automatically labeled as valid by the full-validity of saves a constant factor over naively labeling all examples acquired in the second half of the algorithm.
Accuracy under samples from allows one to control the loss of by invoking the boundedness of the loss in the disagreement region of and , and guarantees with respect to allow us to bound the invalidity of the restriction. Because is fully-valid, loss contributions from the agreement region of and correspond to parts of space where – as the loss is non-increasing, placing more mass in such parts of space can never increase the loss contribution attributable to integrating over this region.
Algorithm 2 also requires a parameter . This parameter should be a validity lower bound on the models , providing a safeguard on the possibility of an “invalidity blowup” when restricting the ERM output to a certain region of space – one must normalize the restriction to output a probability distribution, which in this case means increasing mass in parts of space that are estimated to be valid. An a priori lower bound on the validity allows for precise enough estimation of that increasing the mass in such regions is unlikely to lead to appreciable invalidity in the final model.
It’s possible that the restriction of the ERM to the estimated valid region is undefined – this happens if and only if the estimated valid region has zero mass under the ERM. Given validity lower bounds for models , this is a low probability event which can occur only when estimation of the validity function is very poor relative to the query complexity. As one might imagine, the handling of this case is immaterial for PAC-guarantees. We choose to arbitrarily define behavior in this case by outputting the ERM model.
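The post-filtering idea described in this subsection can be sketched as follows, under strong simplifying assumptions that are ours rather than the paper's: a finite domain of integers, a hypothetical class of threshold validity functions `x <= t`, and an ERM model supplied directly as input. The sketch does not model the validity lower-bound parameter or the sample sizes that Algorithm 2 prescribes.

```python
import random

def consistent_threshold(labeled, thresholds):
    """Return a threshold t whose rule (x <= t) matches every label, or None."""
    for t in thresholds:
        if all((x <= t) == bool(y) for x, y in labeled):
            return t
    return None

def restrict(model, region):
    """Renormalized restriction of the model to `region`; the model itself if undefined."""
    mass = sum(p for x, p in model.items() if region(x))
    if mass == 0.0:
        return dict(model)            # arbitrary fallback: output the ERM model
    return {x: (p / mass if region(x) else 0.0) for x, p in model.items()}

def post_filter(erm_model, data_sample, validity_oracle, thresholds, n_model_samples, rng):
    """Estimate the valid region with zero empirical error, then restrict the ERM model."""
    points = list(erm_model)
    weights = [erm_model[x] for x in points]
    model_sample = rng.choices(points, weights=weights, k=n_model_samples)
    labeled = [(x, 1) for x in data_sample]                       # free: data is fully valid
    labeled += [(x, validity_oracle(x)) for x in model_sample]    # one query per model sample
    t = consistent_threshold(labeled, thresholds)
    if t is None:
        return dict(erm_model), len(model_sample)
    return restrict(erm_model, lambda x: x <= t), len(model_sample)

# Hypothetical instance: points 0..5 are valid; the ERM model is uniform on 0..9.
oracle = lambda x: int(x <= 5)
erm_model = {x: 0.1 for x in range(10)}
data_sample = [0, 3, 5, 2, 5, 1]
output, n_queries = post_filter(erm_model, data_sample, oracle, range(10), 200, random.Random(0))
invalid_mass = sum(p for x, p in output.items() if not oracle(x))
```

Only the samples drawn from the ERM model consume validity queries; samples from the data distribution are labeled valid at no cost, which is the constant-factor saving noted above.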
5.2 Guarantees
The restricted output of Algorithm 2 admits the following guarantee over loss sub-optimality and invalidity.
Theorem 3.
Suppose with VC-dimension , and that for each , the validity . For all and for all choices of non-increasing loss functions , Algorithm 2 requires a number of samples
and a number of validity queries
to ensure that with probability , its output enjoys
Thus, in regimes where e.g. , , this guarantee represents a reduced number of queries under the bound of Hanneke et al. (2018). It also implies a “decoupling” of the query complexity from .
We note that the sample requirement from is increased in certain regimes over the requirement of Hanneke et al. (2018). This is, however, not a concern in most settings where validity queries are expensive. If samples from a fully-valid are readily obtainable, the setting is analogous to that of active learning, where focus is directed to the number of labels requested in training.
Even if must be constructed by “accepting” valid samples from some unfiltered , a comparison between the query complexity of Theorem 3 and the query bound of Hanneke et al. (2018) is often still representative of the relative data costs of the schemes. Supposing produces valid samples with constant probability, the total number of validity queries made by each scheme is proportional to the scheme’s sample requirements from , plus the number of validity queries used in its execution. Essentially, to yield validity query speedups, our scheme requires a VC bound on which does not dwarf . Thus, in most cases of interest, querying the validity of extra samples is asymptotically inconsequential relative to , and the query budget required to execute the algorithm given access to a fully-valid .
5.3 Better Query Complexity Bounds
5.3.1 Exploiting the Power of Active Learning
Theorem 3 presents a somewhat pessimistic view of the potential of such a “post-filtering” scheme.
Firstly, it ignores the potential of active learners to improve query complexities over passive sampling. Query complexities in active learning of classifiers are often expressed in terms of the “disagreement coefficient” Hanneke (2007), often denoted via . In the realizable setting, query complexities of active learning look like Dasgupta (2011). Definitionally, it can be shown that . Thus, proving the gains of active learning algorithms usually relies on bounding the disagreement coefficient non-trivially, i.e. showing , or ideally, .
While this is challenging, as is both a class and distribution-dependent quantity, there is a literature that addresses this potential in various settings – see the references in Hanneke (2014). In principle, one could use such an analysis to show that the query complexity of an active learning modification of Algorithm 2 is on a lower order than the guarantee of Theorem 3 when conditions are favorable. To this end, it may be useful to note that a modification of Algorithm 2 wherein is selected as the ERM on a dataset generated by a mixture of and admits guarantees as well.
5.3.2 Only Low-Loss Models Need Appreciable Validity
Another source of potential looseness in the statement of Theorem 3 is that it phrases the query complexity in terms of the worst-case validity over models . This is unnecessary – with high probability, in the first step of the algorithm, one selects a model with true loss. Thus, what really matters for such a strategy is that models that have relatively small loss do not have invalidity nearing 1.
This is a realistic scenario in the case that the loss function – despite possibly not being proper – prioritizes models which in some sense resemble the data generating distribution . Indeed, it’s somewhat difficult to envision a situation where a loss would be chosen that prioritizes models with no relation to the data generating distribution. To this end, we give the following corollary to Theorem 3.
5.3.3 Removing the Positive Validity Requirement
Using an idea found in the algorithm of Hanneke et al. (2018), one can show that if a learner has access to a single distribution with a density and at least some non-zero constant validity, and the densities are bounded above, then Algorithm 2 can be modified so as to drop the requirement of positive validity over models when learning under the capped log-loss.
By mixing the with , giving mixture component to , one can generate similar guarantees as those of Theorem 3. The modification can be found in the Appendix as Algorithm 3, and enjoys the following guarantee.
Theorem 4.
Suppose where , and that for each , we have . Suppose further that there is some known with density which has for some constant . Then for all choices of and , Algorithm 3 requires a number of samples
and a number of validity queries
to ensure that with probability , its output enjoys
Here, the notation again hides factors polylogarithmic in .
Note that the now appears in the sample complexity. This simply reflects the fact that the -capped log-loss can range between gap and when working with densities bounded above by . In the case that densities can be bounded above by , as in the discrete setting of Hanneke et al. (2018), this dependence disappears.
6 Conclusion
This work is intended as a first look into settings closer to the common case, where ensuring validity may be relatively cheap.
A more thorough investigation of the log-loss, as well as capped variants, seems a very relevant line of further inquiry, given the widespread use of this family in practice and its useful information-theoretic properties. A natural extension to the first part of this work would be to consider learning in the agnostic case under the log-loss, where one would hope to be able to exploit these properties and the validity of training samples to keep the number of validity queries low.
In general, characterizing the sample and query demands of validity-constrained distribution learning is challenging, given that proving lower bounds in general requires arguing against learners with two tools at their disposal – sampling and actively querying validity. Work in this direction will likely require some creative constructions.
References
- Perez et al. [2022] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022.
- Abid et al. [2021] Abubakar Abid, Maheen Farooqi, and James Zou. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 298–306, 2021.
- Hanneke et al. [2018] Steve Hanneke, Adam Kalai, Gautam Kamath, and Christos Tzamos. Actively avoiding nonsense in generative models. CoRR, abs/1802.07229, 2018. URL http://arxiv.org/abs/1802.07229.
- Kaneko and Harada [2021] Takuhiro Kaneko and Tatsuya Harada. Blur, noise, and compression robust generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13579–13589, 2021.
- Odena et al. [2016] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 1(10):e3, 2016.
- Valiant [1984] L. G. Valiant. A theory of the learnable. Communications of the ACM, pages 1134–1142, 1984.
- Kearns et al. [1994] Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert E. Schapire, and Linda Sellie. On the learnability of discrete distributions. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, STOC ’94, pages 273–282, New York, NY, USA, 1994. Association for Computing Machinery. ISBN 0897916638. doi: 10.1145/195058.195155. URL https://doi.org/10.1145/195058.195155.
- Haghtalab et al. [2019] Nika Haghtalab, Cameron Musco, and Bo Waggoner. Toward a characterization of loss functions for distribution learning. CoRR, abs/1906.02652, 2019. URL http://arxiv.org/abs/1906.02652.
- Daskalakis et al. [2013] Constantinos Daskalakis, Ilias Diakonikolas, Ryan O’Donnell, Rocco A Servedio, and Li-Yang Tan. Learning sums of independent integer random variables. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 217–226. IEEE, 2013.
- Kalai et al. [2010] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two gaussians. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 553–562, 2010.
- Daskalakis et al. [2012] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A Servedio. Learning poisson binomial distributions. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 709–728, 2012.
- Brier [1950] Glenn W. Brier. Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1), January 1950. doi: 10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
- Good [1952] Irving John Good. Rational decisions. Journal of the Royal Statistical Society: Series B (Methodological), 14(1):107–114, 1952.
- Gneiting and Raftery [2007] Tilmann Gneiting and Adrian Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102:359–378, 03 2007. doi: 10.1198/016214506000001437.
- Frongillo et al. [2022] Rafael Frongillo, Dhamma Kimpara, and Bo Waggoner. Proper losses for discrete generative models, 2022. URL https://arxiv.org/abs/2211.03761.
- Lehmann et al. [1986] Erich Leo Lehmann, Joseph P Romano, and George Casella. Testing statistical hypotheses, volume 3. Springer, 1986.
- Neyman and Pearson [1933] Jerzy Neyman and Egon Sharpe Pearson. On the Problem of the Most Efficient Tests of Statistical Hypotheses. Phil. Trans. Roy. Soc. Lond. A, 231(694-706):289–337, 1933. doi: 10.1098/rsta.1933.0009.
- Rebeschini [2021] Patrick Rebeschini. Minimax lower bounds and hypothesis testing, December 2021. URL https://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/22/material/lecture16.pdf.
- Reiss [1981] R.D. Reiss. Approximation of product measures with an application to order statistics. The Annals of Probability, 9(2):335 – 341, 1981. doi: 10.1214/aop/1176994477.
- Kusner et al. [2017] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In International conference on machine learning, pages 1945–1954. PMLR, 2017.
- Janz et al. [2017] David Janz, Jos van der Westhuizen, Brooks Paige, Matt J Kusner, and José Miguel Hernández-Lobato. Learning a generative model for validity in complex discrete structures. arXiv preprint arXiv:1712.01664, 2017.
- IsolaP et al. [2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, USA, pages 1125–1134, 2017.
- Aitken et al. [2017] Andrew Aitken, Christian Ledig, Lucas Theis, Jose Caballero, Zehan Wang, and Wenzhe Shi. Checkerboard artifact free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize. arXiv preprint arXiv:1707.02937, 2017.
- Kong and Chaudhuri [2023] Zhifeng Kong and Kamalika Chaudhuri. Data redaction from pre-trained GANs. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 638–677. IEEE, 2023.
- Hanneke [2014] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7:131–309, 01 2014. doi: 10.1561/2200000037.
- Hanneke [2007] Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, pages 353–360, 2007.
- Dasgupta [2011] Sanjoy Dasgupta. Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781, 2011. ISSN 0304-3975.
- Driver [2010] Bruce Driver. Probability tools with examples, June 2010. URL https://mathweb.ucsd.edu/~tkemp/280A/Driver-Probability-Lecture-Notes.pdf.
- Steerneman [1983] Ton Steerneman. On the total variation and Hellinger distance between signed measures; an application to product measures. Proceedings of the American Mathematical Society, 88(4):684–688, 1983.
- Zhao and Lai [2022] Puning Zhao and Lifeng Lai. Analysis of kNN density estimation. IEEE Transactions on Information Theory, 2022. doi: 10.1109/TIT.2022.3195870.
7 Appendix
7.1 Probability Distributions and Measure Theoretic Formalism
We work over Euclidean space and let be a Lebesgue measurable set with Lebesgue measure . By and Lebesgue measurable set, we refer to the measure and -algebra arising from the usual construction of Lebesgue measure on . By “distribution”, we mean a probability measure on the measurable space , where . Let the uniform distribution be the measure given by for . Let denote the set of all probability measures on .
We assume that and have densities and with respect to the reference measure . At times, it will be useful to assume that densities are bounded away from zero in certain parts of space. By saying densities are bounded in their support by , we mean that for all , we have , where the closure is defined through open balls in the Euclidean metric. Note that in this setting, we have , as .
Denote via the product measure over the measurable space . Such a measure corresponds to the process of taking i.i.d. samples . Denote the density of with respect to via .
We define . Following the conventions in Driver [2010], we say that if for and for some . This allows us to integrate over the finite part of functions and get a finite result.
To facilitate digestibility, we refrain from measure theoretic notation as much as possible. It is at times useful, particularly in dealing with total variation. We assume throughout that all functions we encounter in the Appendix – including the fixed validity function and functions in the validity class – are -measurable.
7.2 Estimates of Validity, Invalidity
We fix the validity function as an arbitrary function measurable with respect to each distribution arising in the Appendix. As discussed above, for a given model , the “validity” of is the quantity , and the “invalidity” . We will at times be interested in estimating the validity of a model using samples from along with validity queries. Given an i.i.d. sample from , we let be the natural estimate of the validity of .
At times, we will be interested in the validity of a model under an estimate of the underlying validity function. To this end, given a model and a function , we let . Given a sample , let . Note that in the language of this notation, we have and . We extend this notation in the natural way to invalidity quantities.
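As an illustrative sketch of the natural estimate described above (not code from the paper), the following Python snippet averages validity queries over i.i.d. draws from a model. The names `estimate_validity`, `sample_model`, and `validity_oracle`, along with the toy model and validity function, are hypothetical choices made for this example.

```python
import random

def estimate_validity(sample_model, validity_oracle, n, rng):
    """Fraction of n i.i.d. model draws that the validity oracle accepts."""
    draws = [sample_model(rng) for _ in range(n)]
    return sum(validity_oracle(x) for x in draws) / n

# Toy setup (assumed for illustration): the model is uniform on [0, 1]
# and an example is valid exactly when it is below 0.8.
def sample_model(rng):
    return rng.random()

def validity_oracle(x):
    return 1 if x < 0.8 else 0

rng = random.Random(0)
val_hat = estimate_validity(sample_model, validity_oracle, 20000, rng)
inv_hat = 1.0 - val_hat  # the matching invalidity estimate
```

Each call to `validity_oracle` stands for one validity query, so the estimate costs `n` queries. By Hoeffding's inequality, roughly ln(2/δ)/(2t²) draws suffice for the estimate to be within t of the true validity with probability at least 1 − δ.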
7.3 Analysis of Empirical Risk Minimization, Improper Algorithm in Realizable Setting
To begin our analysis of the realizable setting, we first observe that models with appreciable invalidity look very different from a fully-valid data generating distribution – because they must have mass in parts of space where does not, they are separated in total variation from by a margin. We formalize this idea via the following.
Lemma 1.
Fix arbitrarily. For any validity function , if has , and has , then
Proof.
Fix the validity function arbitrarily. Consider the event . Then we have , where we have used that implies . ∎
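Written out with assumed notation, the argument plausibly reads as follows. The symbols $P$ for the data distribution, $q$ for the model, $v$ for the validity function, and $\epsilon$ for the invalidity level are labels chosen here and may differ from the paper's.

```latex
% Assumed notation: A = \{x : v(x) = 0\} is the invalid region,
% P(A) = 0 by full validity of P, and q(A) \ge \epsilon by the invalidity of q.
\[
  \mathrm{TV}(P, q) \;=\; \sup_{E} \, |P(E) - q(E)|
  \;\ge\; |P(A) - q(A)| \;=\; q(A) \;\ge\; \epsilon .
\]
```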
We next extend this observation to the associated product measures, which are the main target of analysis under i.i.d. sampling from . The idea is to lower bound the total variation between product measures by the difference in probabilities on the event that at least one example from a sample of size is invalid. Of course, for each sample, this happens with probability under and with probability at least under any model with at least invalidity. Thus, just as when one shows realizable rates for classification tasks, we obtain a large total variation gap between and any with appreciable invalidity; mistakes in classification are analogous to the generation of “invalid” samples in our setting.
Lemma 2.
Fix and arbitrarily. For any validity function , if has , and has , then
Proof.
Fix the validity function arbitrarily. Consider lower bounding the total variation between the product measures via the magnitude of the difference of their measures on the event
Because has perfect validity, any given draw from has probability 0 of being invalid. Thus, . On the other hand, the invalidity of states that for any , we have . Let , and note that . Then we have
where the final inequality follows from the fact that for . ∎
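To make the size of this gap concrete, here is a small numerical sketch. It assumes the bound has the form 1 − (1 − ε)^n ≥ 1 − exp(−εn), which is what the event "at least one of n draws is invalid" and the inequality 1 − x ≤ exp(−x) give; the function names are hypothetical.

```python
import math

def product_tv_lower_bound(eps, n):
    """Gap on the event 'at least one of n draws is invalid'."""
    return 1.0 - (1.0 - eps) ** n

def samples_for_gap(eps, delta):
    """Smallest n with exp(-eps * n) <= delta, so the gap is >= 1 - delta."""
    return math.ceil(math.log(1.0 / delta) / eps)

eps, delta = 0.05, 0.01
n = samples_for_gap(eps, delta)            # 93 for these values
exact = product_tv_lower_bound(eps, n)     # 1 - (1 - eps)^n
relaxed = 1.0 - math.exp(-eps * n)         # weaker, but easier to invert
```

The sample size grows like ln(1/δ)/ε, the same rate one sees for realizable classification.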
We can then borrow from classical analysis of hypothesis testing given by the Neyman-Pearson lemma to leverage this gap in total variation between product measures into a bound on the probability that after samples, a model with appreciable invalidity has a smaller loss than .
Lemma 3.
Fix and arbitrarily. For any validity function , if has , has , and and have densities with respect to the reference measure , then
Proof.
The proof follows that of the Neyman-Pearson lemma’s claim that the Likelihood Ratio Test achieves the lower bound on the sum of Type I and Type II errors Rebeschini [2021], combined with Lemma 2. We give the full argument for completeness.
Fix the validity function arbitrarily, and note the following string of relations:
where the switch to total variation in the second to last line is the result of a classic characterization of total variation given by “Scheffé’s Theorem”, and the final line comes from Lemma 2. ∎
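The chain of relations can be checked exactly on a small discrete instance. The sketch below assumes the lemma bounds the probability that the model's likelihood is at least that of the data distribution by 1 − TV of the product measures, and hence by (1 − ε)^n; the three-point instance and all variable names are invented for illustration.

```python
from itertools import product
from math import prod

# Toy discrete instance (assumed): points 0 and 1 are valid, point 2 is invalid.
p = [0.7, 0.3, 0.0]   # fully valid data distribution
q = [0.3, 0.5, 0.2]   # model with invalidity eps = 0.2
eps, n = 0.2, 3

prob_q_wins = 0.0   # probability under p^n that q's likelihood is >= p's
overlap = 0.0       # sum of min(p^n, q^n), which equals 1 - TV(p^n, q^n)
for xs in product(range(3), repeat=n):
    pn = prod(p[x] for x in xs)
    qn = prod(q[x] for x in xs)
    overlap += min(pn, qn)
    if qn >= pn:
        prob_q_wins += pn
```

On this instance `prob_q_wins` is 0.216, `overlap` is 0.378, and (1 − ε)^n is 0.512, so both inequalities in the chain hold with room to spare.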
To generate loss guarantees, we need to be able to reason about models which do not have appreciable invalidity. The next lemma is an analogue of the previous one – it is identical up to the replacement of the condition that have appreciable invalidity with a weaker one that the total variation between a model and the data distribution is appreciable. When does not have appreciable invalidity, we can no longer rely on the structure of the supports of and to generate large margins for the total variation separation of product measures, and have to fall back on more general estimates for total variation between product measures.
Lemma 4.
Fix arbitrarily, and let be distributions with densities with respect to . Then if , and for , it holds with probability that
Proof.
When and both possess densities, we can relate the probability that the likelihood of is at least that of to their total variation, as in the Neyman-Pearson Lemma.
Here, the second to last inequality is a consequence of a powerful result of Reiss [1981] (and later Steerneman [1983]), namely that for any two collections of probability measures and over measurable spaces , the product measures over the respective collections and satisfy
The final inequality follows from the assumed gap in total variation between and . ∎
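The exact inequality taken from Reiss [1981] is not reproduced here. As a stand-in, the sketch below checks a standard bound of the same flavor, TV(P^n, Q^n) ≥ 1 − exp(−n·TV(P, Q)²/2), which follows from the Bhattacharyya coefficient and is not necessarily the statement used in the proof. The Bernoulli instance and function names are assumptions made for illustration.

```python
from math import comb, exp

def binom_pmf(n, k, theta):
    return comb(n, k) * theta**k * (1.0 - theta) ** (n - k)

def product_tv_bernoulli(a, b, n):
    """Exact TV between Bernoulli(a)^n and Bernoulli(b)^n.

    The likelihood of a sample depends only on its number of ones,
    so the product TV equals the TV between the two binomial counts.
    """
    return 0.5 * sum(abs(binom_pmf(n, k, a) - binom_pmf(n, k, b))
                     for k in range(n + 1))

a, b, n = 0.5, 0.6, 100
tv_single = abs(a - b)                          # TV between single draws
tv_product = product_tv_bernoulli(a, b, n)      # TV between n-fold products
lower = 1.0 - exp(-n * tv_single**2 / 2.0)      # stand-in lower bound
```

A single-draw gap of 0.1 already forces a much larger gap between the 100-fold products, which is the effect the proof relies on.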
The previous two results concern the testing of individual models against . In the standard way, we now leverage the finite cardinality to argue via a union bound that given enough samples from , it’s unlikely that the ERM model has a large total variation distance from .
Lemma 5.
Fix arbitrarily, and suppose . If is fully-valid under , and for , then with probability over , the ERM solution
satisfies both
Proof.
Before we can prove Theorem 1, we need one final intermediate result, which we now give. It states that the value of the log-loss at any given for a mixture distribution constructed by heavily weighting one of two distributions is not much different than the value of the loss for the heavily weighted component. This follows from the fact that the natural log is well-approximated by a linear function near 1. We will use this result to argue that the loss of the output of Algorithm 1 is not significantly different than that of the ERM in the support of the ERM.
Lemma 4.
Fix . For any having a density with respect to , the mixture has density and this density satisfies
for all .
Proof.
The existence claim on the densities is immediate given the definition of as the uniform distribution with respect to the reference measure . To see the inequality, fix arbitrarily, and note that . Thus, we may write
where the second equality comes from the fact that , and the final step follows from the fact that for . ∎
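A quick numerical check of this stability property, under two assumptions that are ours rather than the paper's: the uniform density equals 1 (a domain of unit measure), and the bound takes the form −ln(1 − ε) ≤ 2ε for ε ≤ 1/2. The function name is hypothetical.

```python
import math

def mixture_log_loss_gap(q_density, eps):
    """Log-loss of the mixture minus log-loss of q at a point with density q_density."""
    mix = (1.0 - eps) * q_density + eps * 1.0  # uniform density assumed equal to 1
    return -math.log(mix) + math.log(q_density)

eps = 0.1
gaps = [mixture_log_loss_gap(d, eps) for d in (0.01, 0.5, 1.0, 2.0, 50.0)]
worst = max(gaps)  # largest pointwise increase in log-loss caused by mixing
```

Where the original density is small, mixing lowers the loss (the gap is negative); where it is large, the increase never exceeds −ln(1 − ε), which is at most 2ε in this range.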
To get guarantees for the log-loss, we make heavy use of the previous lemma. Being able to guarantee a small total variation distance from the ERM to and small invalidity means that the ERM output is already likely to be a faithful representation of with small invalidity. All that is then needed is to eliminate the possibility that the ERM has a large log-loss because of small mismatches in support with .
To deal with this possibility, we mix the ERM model with the uniform distribution in accordance with Algorithm 1. Because the weight given to the uniform distribution is , the invalidity is close to that of the ERM. To show that such a move does not increase the loss significantly, we split the contribution to the loss of the outputted model into two parts: that arising from and that arising from its complement. We first observe that the ERM always has an empirical risk which is a faithful estimator of the integral , allowing us to bound this integral above in terms of the loss of . The integral of the loss over can be shown to be small using the lower bound on the density of the output model afforded by mixing with the uniform distribution, and the fact that must have a small measure under when the ERM is close to in total variation.
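The two-step recipe described above (log-loss ERM, then mixing with the uniform distribution) can be sketched on a finite domain as follows. This is a minimal illustration and not Algorithm 1 itself, which is not reproduced in this section: the function names, the toy model class, and the mixing weight of 0.02 are all assumptions.

```python
import math
import random

def log_loss(model, sample):
    """Average negative log-likelihood; infinite if the model misses a point."""
    total = 0.0
    for x in sample:
        if model[x] == 0.0:
            return math.inf
        total -= math.log(model[x])
    return total / len(sample)

def erm_then_mix(model_class, sample, mix_weight):
    """ERM under the log-loss, then mix with the uniform distribution."""
    erm = min(model_class, key=lambda m: log_loss(m, sample))
    k = len(erm)
    return [(1.0 - mix_weight) * pm + mix_weight / k for pm in erm]

# Toy realizable instance (assumed): point 3 is the only invalid point.
data_dist = [0.5, 0.3, 0.2, 0.0]
model_class = [data_dist, [0.25, 0.25, 0.25, 0.25], [0.1, 0.1, 0.1, 0.7]]
rng = random.Random(1)
sample = rng.choices(range(4), weights=data_dist, k=500)
output = erm_then_mix(model_class, sample, mix_weight=0.02)
invalidity = output[3]  # mass the output places on the invalid point
```

Mixing gives every point positive mass, so the output's log-loss is finite everywhere, while the invalidity it adds is at most the mixing weight.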
Theorem 1.
Fix arbitrarily, and suppose and that is fully-valid under . If it holds that for each that for all , then there is an
such that for all , with probability , the output of Algorithm 1 satisfies
Proof.
Fix arbitrarily, and let . To see the claim on the validity, note that the asymptotic sample complexity of Lemma 5 yields the guarantee with probability , in which case
The loss of the outputted model can be decomposed into the contributions from the loss in and its complement:
To bound the first term, for each , consider the function
These functions are bounded above by and below by (if , simply loosen the density upper bound), and thus for a sample , define bounded random variables . By Hoeffding’s inequality and a union bound, it holds that a sample of size from is large enough such that with probability , for each , it holds that
Note that for each with , it holds that coincides with the empirical estimates of , namely
Furthermore, this coincidence takes place for with probability 1, as implies that with probability 1. Note that is a good estimator for in the sense arising from an application of Hoeffding, as is bounded almost surely for given that has a density that is bounded in its support (as a member of ). Thus, we may write
where we invoke Lemma 4 in the first step, and in the last step use that by the strict properness of the log-loss.
To bound the second term, note that by the argument given in Lemma 5, when , it holds with probability that . (When the uniform distribution has a density smaller than 1, there is an extra log factor to account for here. This can be assumed away without loss of generality by adding the condition that , e.g. arising from normalized data.) Because has density at least everywhere in , we have
where in the second line we use the fact that to argue that . Combining the bounds on the two summands and union bounding the confidence yields the full guarantee.
We note that this argument implies a slightly more precise sample complexity bound given by
which affords a minor improvement in certain regimes where . ∎
7.4 A Zero-Query Lower Bound
Theorem 2.
For all and for all proper learners , if the sample is of size , then there exists a triple with and fully-valid, on which has invalidity with probability .
Proof.
We give an argument inspired by the proof of a lower bound in Theorem 2 of Zhao and Lai [2022].
Let . Fix the proper learner and arbitrarily. Consider a model class defined by , where and have densities with respect to the Lebesgue measure
This gives rise to two realizable problem instances of interest. Under the first, is the data generating distribution and everywhere except for in , where . Under the second, is the data generating distribution and everywhere except for in , where . Assume by contradiction that for both problem instances, given a sample of size , we have that with probability . In both cases, we have that for the model which is not the data generating distribution. Thus, is a model with with probability over both problem instances if and only if it identifies the data generating distribution with probability over both problem instances.
Consider the simple hypothesis tester defined by , which by the above, outputs the correct data generating distribution given a choice of or – and given samples from either or – with probability . Note that the distributions , have , where we use and for small enough in the inequality. By a classic upper bound on the total variation between product measures Reiss [1981], we have that . Then, by Le Cam’s method and this upper bound on , we have
Thus, if , then there is a choice of data generating distribution such that incurs error probability , which is a contradiction. ∎
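For reference, the two ingredients named in the proof combine as follows in their standard forms. The notation ($P_0$, $P_1$ for the two candidate data distributions and $\psi$ for a test based on $n$ i.i.d. samples) is assumed here, and the constants in the paper's own display may differ.

```latex
% Le Cam's two-point bound, followed by subadditivity of total variation
% over product measures.
\[
  \max_{j \in \{0,1\}} \Pr_{S \sim P_j^n}\bigl[\psi(S) \neq j\bigr]
  \;\ge\; \frac{1 - \mathrm{TV}(P_0^n, P_1^n)}{2}
  \;\ge\; \frac{1 - n\,\mathrm{TV}(P_0, P_1)}{2}.
\]
```

In particular, whenever $n\,\mathrm{TV}(P_0, P_1) \le 1 - 2\delta$, every test errs with probability at least $\delta$ on one of the two instances.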
7.5 Strategies Arising from Estimation of the Validity Function
When the validity function is known to lie in a class of bounded complexity, it is learnable, and learned estimates may be utilized in the selection of a low-loss, high-validity model. An interesting feature of the problem setup here is that, unlike in most learning settings, the learner can actually choose which distributions it would like to estimate with respect to, i.e. decide under which marginal distributions over the estimate should have small disagreement with .
We begin with a lemma arguing that whenever the ERM has positive probability of outputting an example with , and the disagreement of and under a proposal distribution is small enough, the restriction will indeed yield a low-invalidity model.
Lemma 5.
Fix and a distribution absolutely continuous with respect to arbitrarily. Further, fix such that , and suppose that for some we have
Then whenever there is a distribution corresponding to
it has invalidity .
Proof.
Let denote the normalizing constant for the restriction to the estimated valid region. Note that the restriction corresponds to a probability distribution if and only if , and that in this case
It holds further that is absolutely continuous with respect to , and that we can write the following chain of relations:
where the first inequality follows after inputting the definition of , and the final two inequalities come by assumption. It is further possible to show that the validity of can be approximated from above by a constant multiple of , which can be conceptualized as the validity of if were the true validity function. In particular,
This implies that , which yields as . Utilizing this inequality in the last line of the first string of inequalities gives the guarantee. ∎
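The mechanism behind the lemma can be seen on a finite domain: the invalidity of the restricted model equals the proposal's mass on points that are estimated valid but truly invalid, divided by the normalizing constant, and is therefore at most the disagreement mass divided by that constant. The sketch below illustrates this with a hypothetical five-point instance; the function and variable names are assumptions.

```python
def restrict(model, v_hat):
    """Restrict a pmf to the estimated valid region {x : v_hat[x] == 1}."""
    z = sum(pm for pm, keep in zip(model, v_hat) if keep)  # normalizing constant
    if z == 0.0:
        raise ValueError("restriction is not a probability distribution")
    return [pm / z if keep else 0.0 for pm, keep in zip(model, v_hat)], z

# Toy instance (assumed): five points, true validity v, estimate v_hat.
model = [0.4, 0.3, 0.1, 0.1, 0.1]
v     = [1, 1, 1, 0, 0]
v_hat = [1, 1, 0, 1, 0]   # disagrees with v on points 2 and 3

restricted, z = restrict(model, v_hat)
disagreement = sum(pm for pm, a, b in zip(model, v, v_hat) if a != b)
invalidity = sum(pm for pm, a in zip(restricted, v) if a == 0)
```

Here the normalizing constant is 0.8 and the restricted model has invalidity 0.125, below the bound of 0.2 / 0.8 = 0.25. A smaller normalizing constant inflates the bound, which is why a lower estimate on the proposal's validity is needed.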
The need for a lower estimate on the validity in the precision of the estimate for can be understood as follows: when the proposal distribution has very small validity, the restriction to the estimation of the valid parts of space under may create huge (or infinite) increases in mass over the proposal distribution – in the language of Lemma 5, may be very small for a proposal distribution . If this is the case, the estimate must be more precise, as small errors in the estimation of the validity function may lead to invalid parts of space having large mass under .
We introduce another lemma before proving Theorem 3. It argues that for a model constructed by accepting samples from some "proposal distribution" that fall in the valid part of space under , the contribution to the loss from the part of space where agrees with can never exceed the total loss of the proposal distribution . Essentially, we are exploiting the full-validity of here – in the subregion of the agreement region on which the loss is computed, we have that by the fact that is valid. This means that the density of can only be larger than the density of in this region, which under a non-increasing loss cannot increase the loss over that incurred by .
Lemma 6.
Fix , a validity function estimate , and arbitrarily. Suppose is a non-increasing loss function. Then whenever
corresponds to a probability distribution, it enjoys
Proof.
Let be the normalizing constant for , where we note that when corresponds to a probability distribution.
Given that is fully-valid, we have . By the fact that integration is defined up to null sets, it holds that
Further, we may write
Here, the second to last inequality comes from the non-negativity of the loss along with the fact that whenever , the integrand is zero – when , the loss is just evaluated at the normalized density, and so the integrand introduced in this line is an upper bound for the previous integrand. The final inequality comes from the fact that the loss function is non-increasing, along with the observation that – in removing the normalizing constant, we can only make the value at which the loss is evaluated smaller, which cannot decrease the value of the loss. ∎
We are now ready to prove the main result of the second half of the paper – the guarantee for Algorithm 2. It combines the previous lemmas, noting further that the number of samples in is sufficient to make the disagreement of and small enough under such that the contribution to the loss in that part of space can be controlled by trivially applying the loss upper bound .
Theorem 3.
Suppose with VC-dimension , and that for each , the validity . For all and for all choices of non-increasing loss functions , Algorithm 2 requires a number of samples
and a number of validity queries
to ensure that with probability , its output enjoys
Proof.
Given that the loss is bounded, Hoeffding’s inequality applied to the random variables for , and a union bound, imply that is large enough that with probability , we have that for all , the empirical loss estimates are at most away from true losses . For any choice of , because we have , it must hold that any minimizer is consistent with the labeling under of both and . The standard rates of convergence when choosing an arbitrary consistent hypothesis thus imply that the sizes of and are large enough to guarantee that, with probability , we have
By a union bound, with probability , all of these estimation accuracy events take place. We condition on these favorable events taking place going forward.
Note that conditioned on these favorable events, the normalizing constant , as for any ERM, we have
Thus, the restriction of the ERM to the estimated validity region is a viable probability distribution, and is outputted by the algorithm as . For any estimate of the validity function , we can decompose the loss of as
First using Lemma 6, and then using the uniform convergence of the loss estimates, we can bound the first term as
where is the lowest-loss model in the class . To upper bound the second term in the loss decomposition, we can use the fact that and the upper bound on the loss to write
yielding the loss guarantee.
The validity guarantee follows directly from the fact that and Lemma 5, where furnishes the lower estimate for the validity of the model . ∎
The corollary to Theorem 3 stating that only low-loss models need appreciable validity is straightforward. One can simply add an extra line to the proof of Theorem 3, arguing that when the intersection of good estimation events takes place, the loss of the ERM distribution is within of the optimal loss across models in , meaning that it has validity greater than some constant . Thus, one can run Algorithm 2 with an large enough to achieve disagreement rate between and under samples from , lowering the label complexity.
The proof of Theorem 4 is very similar to that of Theorem 3. The main difference is that when the loss is the capped log-loss, we can exploit a stability property under mixture similar to that introduced in Lemma 4. This allows us to mix with a distribution of constant validity to get a validity lower bound on the final proposal distribution without increasing the loss more than . The validity lower bound can then be used as in Theorem 3.
Theorem 4.
Suppose where , and that for each , we have . Suppose further that there is some known with density which has for some constant . Then for all choices of and , Algorithm 3 requires a number of samples
and a number of validity queries
to ensure that with probability , its output enjoys
Proof.
WLOG assume , and consider learning over , a translation of the capped log-loss bounded below by for all inputs to , and bounded above by .
Similar to the proof of Theorem 3, with probability over the sample , it holds that for all that ; in this case, we use the fact that to ensure that the random variables for are bounded, allowing for an application of Hoeffding’s inequality over empirical estimates of a loss unbounded below. As above, for any choice of , the sizes of and are such that with probability ,
As before, by a union bound, these bounds both hold simultaneously with probability . We condition on this intersection of favorable events going forward.
Note that when this intersection of events takes place, we have . In this case, we have that as , and so identically to our work in Theorem 3, we may write
Thus, the restriction of to the estimate of the valid region is defined and outputted by the algorithm as . To see that the loss guarantee then holds for such a , consider the loss decomposition used in the proof of Theorem 3:
We upper bound the second term exactly as in Theorem 3. To upper bound the first term, consider an argument similar to that of the proof of Lemma 6. Let be the normalizing constant for . We can write
Here, the non-increasingness of the loss still holds, leading to the final step. Now, fix some arbitrarily, and note the following, in the style of Lemma 4:
Thus, it holds that , and so we may write
Thus, we have written the loss in terms of the ERM, and so we have, as in Theorem 3, that the first term of the loss decomposition can be bounded by .
Given , and the lower bound on the validity of derived in the third paragraph above, we can again apply Lemma 5 to get the validity guarantee. ∎