Statistics Theory
Showing new listings for Friday, 30 May 2025
- [1] arXiv:2505.22807
Title: Distribution free M-estimation
Comments: 26 pages
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG)
The basic question of delineating those statistical problems that are solvable without making any assumptions on the underlying data distribution has long animated statistics and learning theory. This paper characterizes when a (univariate) convex M-estimation or stochastic optimization problem is solvable in such an assumption-free setting, providing a precise dividing line between solvable and unsolvable problems. The conditions we identify show, perhaps surprisingly, that Lipschitz continuity of the loss being minimized is not necessary for distribution free minimization, and they are also distinct from classical characterizations of learnability in machine learning.
- [2] arXiv:2505.23023
Title: Density Estimation on Rectifiable Sets
Subjects: Statistics Theory (math.ST); Classical Analysis and ODEs (math.CA)
Kernel density estimation is a popular method for estimating unseen probability distributions. However, the convergence of these classical estimators to the true density slows down in high dimensions. Moreover, they do not define meaningful probability distributions when the intrinsic dimension of data is much smaller than its ambient dimension. We build on previous work on density estimation on manifolds to show that a modified kernel density estimator converges to the true density on $d$-rectifiable sets. As a special case, we consider algebraic varieties and semi-algebraic sets and prove a convergence rate in this setting. We conclude the paper with a numerical experiment illustrating the convergence of this estimator on sparse data.
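For orientation, the classical estimator that this abstract starts from can be sketched in a few lines. This is only the textbook one-dimensional Gaussian kernel density estimator, not the modified estimator for rectifiable sets studied in the paper; the function name `gaussian_kde_1d` and the fixed `bandwidth` argument are illustrative assumptions.

```python
import numpy as np

def gaussian_kde_1d(data, grid, bandwidth):
    """Classical Gaussian KDE: f_hat(x) = (1 / (n h)) * sum_i K((x - X_i) / h).

    Hypothetical helper for illustration; not the paper's modified estimator.
    """
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized differences between every grid point and every sample.
    z = (grid[:, None] - data[None, :]) / bandwidth
    # Standard normal kernel evaluated at each standardized difference.
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average over samples and rescale by the bandwidth.
    return kernel.mean(axis=1) / bandwidth

# Usage: estimate a density from three points on a coarse grid.
density = gaussian_kde_1d([-1.0, 0.0, 1.0], np.linspace(-3, 3, 7), bandwidth=0.5)
```

In ambient dimension $D$ the same construction rescales by $h^D$, which is one way to see why it stops defining a sensible density when the data concentrate on a lower-dimensional set.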
- [3] arXiv:2505.23240
Title: Joint estimation of smooth graph signals from partial linear measurements
Comments: 29 pages, 2 figures
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
Given an undirected and connected graph $G$ on $T$ vertices, suppose each vertex $t$ has a latent signal $x_t \in \mathbb{R}^n$ associated with it. Given partial linear measurements of the signals, for a potentially small subset of the vertices, our goal is to estimate the $x_t$'s. Assuming that the signals are smooth with respect to $G$, in the sense that the quadratic variation of the signals over the graph is small, we obtain non-asymptotic bounds on the mean squared error for jointly recovering the $x_t$'s with the smoothness-penalized least squares estimator. In particular, this implies for certain choices of $G$ that this estimator is weakly consistent (as $T \rightarrow \infty$) under potentially very stringent sampling, where only one coordinate is measured per vertex for a vanishingly small fraction of the vertices. The results are extended to a "multi-layer" ranking problem where $x_t$ corresponds to the latent strengths of a collection of $n$ items, and noisy pairwise difference measurements are obtained at each "layer" $t$ via a measurement graph $G_t$. Weak consistency is established for certain choices of $G$ even when the individual $G_t$'s are very sparse and disconnected.
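A smoothness-penalized least squares estimator of this kind can be sketched as follows. The abstract does not state the exact objective, so the form used here is an assumption: minimize $\sum_{t \in S} \|y_t - A_t x_t\|^2 + \lambda \sum_{(s,t) \in E} \|x_s - x_t\|^2$, where $S$ is the set of measured vertices. The function name, the `measurements` dictionary layout, and the parameter `lam` are all hypothetical.

```python
import numpy as np

def smooth_graph_ls(edges, T, n, measurements, lam):
    """Solve the normal equations of an assumed Laplacian-penalized least squares.

    edges: list of (s, t) vertex pairs of the graph G.
    measurements: dict mapping a vertex t to (A_t, y_t) with y_t ~ A_t x_t.
    Returns a (T, n) array whose row t estimates x_t.
    """
    # Graph Laplacian L, so that x^T (L kron I_n) x = sum over edges ||x_s - x_t||^2.
    L = np.zeros((T, T))
    for s, t in edges:
        L[s, s] += 1.0
        L[t, t] += 1.0
        L[s, t] -= 1.0
        L[t, s] -= 1.0
    H = lam * np.kron(L, np.eye(n))
    b = np.zeros(T * n)
    # Add the data-fit terms A_t^T A_t and A_t^T y_t for each measured vertex.
    for t, (A, y) in measurements.items():
        A = np.atleast_2d(np.asarray(A, dtype=float))
        y = np.atleast_1d(np.asarray(y, dtype=float))
        blk = slice(t * n, (t + 1) * n)
        H[blk, blk] += A.T @ A
        b[blk] += A.T @ y
    # lstsq returns a minimum-norm solution if H happens to be singular.
    x, *_ = np.linalg.lstsq(H, b, rcond=None)
    return x.reshape(T, n)

# Usage: a path graph on 3 vertices where only vertex 0 is measured.
estimate = smooth_graph_ls([(0, 1), (1, 2)], T=3, n=1,
                           measurements={0: ([[1.0]], [2.0])}, lam=0.5)
```

In the usage example the penalty propagates the single measurement along the path, so unmeasured vertices inherit the value observed at vertex 0.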
- [4] arXiv:2505.23541
Title: Upper and lower bounds for local Lipschitz stability of Bayesian posteriors
Comments: 51 pages
Subjects: Statistics Theory (math.ST)
The work of Sprungk (Inverse Problems, 2020) established the local Lipschitz continuity of the misfit-to-posterior and prior-to-posterior maps with respect to the Kullback-Leibler divergence and the total variation, Hellinger, and 1-Wasserstein metrics, by proving certain upper bounds. The upper bounds were also used to show that if a posterior measure is more concentrated, then it can be more sensitive to perturbations in the misfit or prior. We prove upper bounds and lower bounds that emphasise the importance of the evidence. The lower bounds show that the sensitivity of posteriors to perturbations in the misfit or the prior not only can increase, but in general will increase as the posterior measure becomes more concentrated, i.e. as the evidence decreases to zero. Using the explicit dependence of our bounds on the evidence, we identify sufficient conditions for the misfit-to-posterior and prior-to-posterior maps to be locally bi-Lipschitz continuous.
- [5] arXiv:2505.23592
Title: A Modern Theory of Cross-Validation through the Lens of Stability
Comments: 120 pages, 1 figure
Subjects: Statistics Theory (math.ST)
Modern data analysis and statistical learning are marked by complex data structures and black-box algorithms. Data complexity stems from technologies like imaging, remote sensing, wearables, and genomic sequencing. Simultaneously, black-box models -- especially deep neural networks -- have achieved impressive results. This combination raises new challenges for uncertainty quantification and statistical inference, which we term "black-box inference."
Black-box inference is difficult due to the lack of traditional modeling assumptions and the opaque behavior of modern estimators. These make it hard to characterize the distribution of estimation errors. A popular solution is post-hoc randomization, which, under mild assumptions like exchangeability, can yield valid uncertainty quantification. Such methods range from classical techniques like permutation tests, jackknife, and bootstrap, to recent innovations like conformal inference. These approaches typically need little knowledge of data distributions or the internal working of estimators. Many rely on the idea that estimators behave similarly under small data changes -- a concept formalized as stability. Over time, stability has become a key principle in data science, influencing generalization error, privacy, and adaptive inference.
This article investigates cross-validation (CV) -- a widely used resampling method -- through the lens of stability. We first review recent theoretical results on CV for estimating generalization error and model selection under stability. We then examine uncertainty quantification for CV-based risk estimates. Together, these insights yield new theory and tools, which we apply to topics like model selection, selective inference, and conformal prediction.
New submissions (showing 5 of 5 entries)
- [6] arXiv:2505.22781 (cross-list from stat.ML)
Title: Finite-Sample Convergence Bounds for Trust Region Policy Optimization in Mean-Field Games
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We introduce Mean-Field Trust Region Policy Optimization (MF-TRPO), a novel algorithm designed to compute approximate Nash equilibria for ergodic Mean-Field Games (MFG) in finite state-action spaces. Building on the well-established performance of TRPO in the reinforcement learning (RL) setting, we extend its methodology to the MFG framework, leveraging its stability and robustness in policy optimization. Under standard assumptions in the MFG literature, we provide a rigorous analysis of MF-TRPO, establishing theoretical guarantees on its convergence. Our results cover both the exact formulation of the algorithm and its sample-based counterpart, where we derive high-probability guarantees and finite sample complexity. This work advances MFG optimization by bridging RL techniques with mean-field decision-making, offering a theoretically grounded approach to solving complex multi-agent problems.
- [7] arXiv:2505.23046 (cross-list from stat.ME)
Title: Revisit CP Tensor Decomposition: Statistical Optimality and Fast Convergence
Subjects: Methodology (stat.ME); Numerical Analysis (math.NA); Statistics Theory (math.ST); Machine Learning (stat.ML)
Canonical Polyadic (CP) tensor decomposition is a fundamental technique for analyzing high-dimensional tensor data. While the Alternating Least Squares (ALS) algorithm is widely used for computing CP decomposition due to its simplicity and empirical success, its theoretical foundation, particularly regarding statistical optimality and convergence behavior, remains underdeveloped, especially in noisy, non-orthogonal, and higher-rank settings.
In this work, we revisit CP tensor decomposition from a statistical perspective and provide a comprehensive theoretical analysis of ALS under a signal-plus-noise model. We establish non-asymptotic, minimax-optimal error bounds for tensors of general order, dimensions, and rank, assuming suitable initialization. To enable such initialization, we propose Tucker-based Approximation with Simultaneous Diagonalization (TASD), a robust method that improves stability and accuracy in noisy regimes. Combined with ALS, TASD yields a statistically consistent estimator. We further analyze the convergence dynamics of ALS, identifying a two-phase pattern: initial quadratic convergence followed by linear refinement. We also show that in the rank-one setting, ALS with an appropriately chosen initialization attains optimal error within just one or two iterations.
- [8] arXiv:2505.23113 (cross-list from stat.ME)
Title: Valid F-screening in linear regression
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP)
Suppose that a data analyst wishes to report the results of a least squares linear regression only if the overall null hypothesis, $H_0^{1:p}: \beta_1= \beta_2 = \ldots = \beta_p=0$, is rejected. This practice, which we refer to as F-screening (since the overall null hypothesis is typically tested using an $F$-statistic), is in fact common across a number of applied fields. Unfortunately, it poses a problem: standard guarantees for the inferential outputs of linear regression, such as Type 1 error control of hypothesis tests and nominal coverage of confidence intervals, hold unconditionally, but fail to hold conditional on rejection of the overall null hypothesis. In this paper, we develop an inferential toolbox for the coefficients in a least squares model that is valid conditional on rejection of the overall null hypothesis. We develop selective p-values that lead to tests that control the selective Type 1 error, i.e., the Type 1 error conditional on having rejected the overall null hypothesis. Furthermore, they can be computed without access to the raw data, i.e., using only the standard outputs of a least squares linear regression, and therefore are suitable for use in a retrospective analysis of a published study. We also develop confidence intervals that attain nominal selective coverage, and point estimates that account for having rejected the overall null hypothesis. We show empirically that our selective procedure is preferable to an alternative approach that relies on sample splitting, and we demonstrate its performance via re-analysis of two datasets from the biomedical literature.
- [9] arXiv:2505.23141 (cross-list from math.PR)
Title: Random Field Representations of Kernel Distances
Subjects: Probability (math.PR); Functional Analysis (math.FA); Statistics Theory (math.ST)
Positive semi-definite kernels are used to induce pseudo-metrics, or "distances", between measures. We write these as an expected quadratic variation of, or expected inner product between, a random field and the difference of measures. This alternate viewpoint offers important intuition and interesting connections to existing forms. Metric distances leading to convenient finite sample estimates are shown to be induced by fields with dense support, stationary increments, and scale invariance. The main example of this is energy distance. We show that the common generalization preserving continuity is induced by fractional Brownian motion. We induce an alternate generalization with the Gaussian free field, formally extending the Cramér-von Mises distance. Pathwise properties give intuition about practical aspects of each. This is demonstrated through signal to noise ratio studies.
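As a concrete instance of a "convenient finite sample estimate", the standard plug-in form of the squared energy distance, $2\,\mathbb{E}\|X-Y\| - \mathbb{E}\|X-X'\| - \mathbb{E}\|Y-Y'\|$, can be sketched as below. This is the usual V-statistic definition, not the random field representation developed in the paper, and the function name `energy_distance` is an illustrative assumption.

```python
import numpy as np

def _mean_pairwise_distance(a, b):
    # Mean Euclidean distance over all pairs (a_i, b_j).
    diffs = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diffs, axis=-1).mean()

def energy_distance(x, y):
    """V-statistic estimate of the squared energy distance between two samples.

    x, y: arrays of shape (n,) or (n, d); 1-D inputs are treated as n points in R.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    return (2.0 * _mean_pairwise_distance(x, y)
            - _mean_pairwise_distance(x, x)
            - _mean_pairwise_distance(y, y))

# Usage: two small samples on the real line.
value = energy_distance([0.0, 0.1, 0.2], [1.0, 1.1, 1.2])
```

The estimate is zero when the two samples coincide and grows as the samples separate, which matches the role of energy distance as a metric between measures.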
- [10] arXiv:2505.23445 (cross-list from stat.ML)
Title: The Strong, Weak and Benign Goodhart's law. An independence-free and paradigm-agnostic formalisation
Comments: 32 pages, 1 figure
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Goodhart's law is a famous adage in policy-making that states that "When a measure becomes a target, it ceases to be a good measure". As machine learning models and the optimisation capacity to train them grow, a growing body of empirical evidence has reinforced belief in the validity of this law, although without a formalisation of it. Recently, a few attempts were made to formalise Goodhart's law, either by categorising variants of it, or by looking at how optimising a proxy metric affects the optimisation of an intended goal. In this work, we relax the simplifying independence assumption made in previous works, and the assumption on the learning paradigm made in most of them, to study the effect of the coupling between the proxy metric and the intended goal on Goodhart's law. Our results show that in the case of a light-tailed goal and light-tailed discrepancy, dependence does not change the nature of Goodhart's effect. However, in the light-tailed goal and heavy-tailed discrepancy case, we exhibit an example where over-optimisation occurs at a rate inversely proportional to the heavy-tailedness of the discrepancy between the goal and the metric.
- [11] arXiv:2505.23599 (cross-list from cs.LG)
Title: On Transferring Transferability: Towards a Theory for Size Generalization
Comments: 69 pages, 8 figures
Subjects: Machine Learning (cs.LG); Representation Theory (math.RT); Statistics Theory (math.ST); Machine Learning (stat.ML)
Many modern learning tasks require models that can take inputs of varying sizes. Consequently, dimension-independent architectures have been proposed for domains where the inputs are graphs, sets, and point clouds. Recent work on graph neural networks has explored whether a model trained on low-dimensional data can transfer its performance to higher-dimensional inputs. We extend this body of work by introducing a general framework for transferability across dimensions. We show that transferability corresponds precisely to continuity in a limit space formed by identifying small problem instances with equivalent large ones. This identification is driven by the data and the learning task. We instantiate our framework on existing architectures, and implement the necessary changes to ensure their transferability. Finally, we provide design principles for designing new transferable models. Numerical experiments support our findings.
Cross submissions (showing 6 of 6 entries)
- [12] arXiv:2406.19061 (replaced)
Title: Entrywise dynamics and universality of general first order methods
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT)
General first order methods (GFOMs), including various gradient descent and AMP algorithms, constitute a broad class of iterative algorithms in modern statistical learning problems. Some GFOMs also serve as constructive proof devices, iteratively characterizing the empirical distributions of statistical estimators in the large system limits for any fixed number of iterations.
This paper develops a non-asymptotic, entrywise characterization for a general class of GFOMs. Our characterizations capture the precise entrywise behavior of the GFOMs, and hold universally across a broad class of heterogeneous random matrix models. As a corollary, we provide the first non-asymptotic description of the empirical distributions of the GFOMs beyond Gaussian ensembles.
We demonstrate the utility of these general results in two applications. In the first application, we prove entrywise universality for regularized least squares estimators in the linear model, by controlling the entrywise error relative to a suitably constructed GFOM. This algorithmic proof method also leads to systematically improved averaged universality results for regularized regression estimators in the linear model, and resolves the universality conjecture for (regularized) MLEs in logistic regression. In the second application, we obtain entrywise Gaussian approximations for a class of gradient descent algorithms. Our approach provides non-asymptotic state evolution for the bias and variance of the algorithm along the iteration path, applicable for non-convex loss functions.
The proof relies on a new recursive leave-k-out method that provides almost delocalization for the GFOMs and their derivatives. Crucially, our method ensures entrywise universality for up to poly-logarithmically many iterations, which facilitates effective $\ell_2/\ell_\infty$ control between certain GFOMs and statistical estimators in applications.
- [13] arXiv:2410.02835 (replaced)
Title: The Empirical Mean is Minimax Optimal for Local Glivenko-Cantelli
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME)
We revisit the recently introduced Local Glivenko-Cantelli setting, which studies distribution-dependent uniform convergence rates of the Empirical Mean Estimator (EME). In this work, we investigate generalizations of this setting where arbitrary estimators are allowed rather than just the EME. Can a strictly larger class of measures be learned? Can better risk decay rates be obtained? We provide exhaustive answers to these questions, which are both negative, provided the learner is barred from exploiting some infinite-dimensional pathologies. On the other hand, allowing such exploits does lead to a strictly larger class of learnable measures.
- [14] arXiv:2412.21125 (replaced)
Title: Optimal e-value testing for properly constrained hypotheses
Subjects: Statistics Theory (math.ST)
Hypothesis testing via e-variables can be framed as a sequential betting game, where in each round a player picks an e-variable. A good player's strategy results in an effective statistical test that rejects the null hypothesis as soon as sufficient evidence arises. Building on recent advances, we address the question of restricting the pool of e-variables to simplify strategy design without compromising effectiveness. We extend the results of Clerico (2024) by characterising optimal sets of e-variables for a broad class of non-parametric hypothesis tests, defined by finitely many regular constraints. As an application, we discuss this notion of optimality in algorithmic mean estimation, including for heavy-tailed random variables.
- [15] arXiv:2502.17671 (replaced)
Title: Optimal Recovery Meets Minimax Estimation
Subjects: Statistics Theory (math.ST); Numerical Analysis (math.NA); Machine Learning (stat.ML)
A fundamental problem in statistics and machine learning is to estimate a function $f$ from possibly noisy observations of its point samples. The goal is to design a numerical algorithm to construct an approximation $\hat f$ to $f$ in a prescribed norm that asymptotically achieves the best possible error (as a function of the number $m$ of observations and the variance $\sigma^2$ of the noise). This problem has received considerable attention in both nonparametric statistics (noisy observations) and optimal recovery (noiseless observations). Quantitative bounds require assumptions on $f$, known as model class assumptions. Classical results assume that $f$ is in the unit ball of a Besov space. In nonparametric statistics, the best possible performance of an algorithm for finding $\hat f$ is known as the minimax rate and has been studied in this setting under the assumption that the noise is Gaussian. In optimal recovery, the best possible performance of an algorithm is known as the optimal recovery rate and has also been determined in this setting. While one would expect that the minimax rate recovers the optimal recovery rate when the noise level $\sigma$ tends to zero, it turns out that the current results on minimax rates do not carefully determine the dependence on $\sigma$ and the limit cannot be taken. This paper handles this issue and determines the noise-level-aware (NLA) minimax rates for Besov classes when error is measured in an $L_q$-norm with matching upper and lower bounds. The end result is a reconciliation between minimax rates and optimal recovery rates. The NLA minimax rate continuously depends on the noise level and recovers the optimal recovery rate when $\sigma$ tends to zero.
- [16] arXiv:2307.02236 (replaced)
Title: D-optimal Subsampling Design for Massive Data Linear Regression
Comments: 28 pages, 11 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider multiple linear regression for an extraordinarily large number of observations, but only a few covariates. Subsampling aims at the selection of a given proportion of the existing original data. Under distributional assumptions on the covariates, we derive D-optimal subsampling designs and study their theoretical properties. We make use of fundamental concepts of optimal design theory and an equivalence theorem from constrained convex optimization. The thus obtained subsampling designs provide simple rules for whether to accept or reject a data point, allowing for an easy algorithmic implementation. In addition, we propose a simplified subsampling method with lower computational complexity that deviates from the D-optimal design. We present a simulation study, comparing both subsampling schemes with the IBOSS method in the case of a fixed size of the subsample.
- [17] arXiv:2410.14205 (replaced)
Title: Modelling 1/f Noise in TRNGs via Fractional Brownian Motion
Comments: formal proofs, including bounded leakage conjecture
Subjects: Cryptography and Security (cs.CR); Statistics Theory (math.ST)
The security of oscillatory true random number generators is still not fully understood, owing to an incomplete understanding of complex $1/f^\alpha$ phase noise. To bridge this gap, we introduce fractional Brownian motion as a comprehensive theoretical framework, capturing power-law spectral densities from white to flicker frequency noise.
Our key contributions provide closed-form tractable solutions: (1) a quasi-renewal property showing conditional variance grows with power-law time dependence, enabling tractable leakage analysis; (2) closed-form min-entropy expressions under Gaussian phase posteriors; and (3) asymptotically unbiased Allan variance parameter estimation.
This framework bridges physical modelling with cryptographic requirements, providing both theoretical foundations and practical calibration for oscillator-based TRNGs.
- [18] arXiv:2410.24163 (replaced)
Title: Improve the Precision of Area Under the Curve Estimation for Recurrent Events Through Covariate Adjustment
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP)
The area under the curve (AUC) of the mean cumulative function (MCF) has recently been introduced as a novel estimand for evaluating treatment effects in recurrent event settings, offering an alternative to the commonly used Lin-Wei-Yang-Ying (LWYY) model. The AUC of the MCF provides a clinically interpretable summary measure that captures the overall burden of disease progression, regardless of whether the proportionality assumption holds. To improve the precision of the AUC estimation while preserving its unconditional interpretability, we propose a nonparametric covariate adjustment approach. This approach guarantees an efficiency gain over unadjusted analysis, as demonstrated by theoretical asymptotic distributions, and is universally applicable to various randomization schemes, including both simple and covariate-adaptive designs. Extensive simulations across different scenarios further support its advantage in increasing statistical power. Our findings highlight the importance of covariate adjustment for the analysis of AUC in recurrent event settings, offering practical guidance for its application in randomized clinical trials.
- [19] arXiv:2502.06381 (replaced)
Title: Revisiting Optimal Allocations for Binary Responses: Insights from Considering Type-I Error Rate Control
Comments: 14 pages, 2 figures, 5 tables
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP)
This work revisits optimal response-adaptive designs from a type-I error rate perspective, highlighting when and how much these allocations exacerbate type-I error rate inflation, an issue previously undocumented. We explore a range of approaches from the literature that can be applied to reduce type-I error rate inflation. However, we found that all of these approaches fail to give a robust solution to the problem. To address this, we derive two optimal allocation proportions, incorporating the more robust score test (instead of the Wald test) with finite sample estimators (instead of the unknown true values) in the formulation of the optimization problem. One proportion optimizes statistical power and the other minimizes the total number of failures in a trial while maintaining a fixed variance level. Through simulations based on an early-phase and a confirmatory trial, we provide crucial practical insight into how these new optimal proportion designs can offer substantial advantages in patient outcomes while controlling the type-I error rate. While we focused on binary outcomes, the framework offers valuable insights that naturally extend to other outcome types, multi-armed trials and alternative measures of interest.
- [20] arXiv:2505.22371 (replaced)
Title: Adaptive tail index estimation: minimal assumptions and non-asymptotic guarantees
Subjects: Other Statistics (stat.OT); Statistics Theory (math.ST)
A notoriously difficult challenge in extreme value theory is the choice of the number $k\ll n$, where $n$ is the total sample size, of extreme data points to consider for inference of tail quantities. Existing theoretical guarantees for adaptive methods typically require second-order assumptions or von Mises assumptions that are difficult to verify and often come with tuning parameters that are challenging to calibrate. This paper revisits the problem of adaptive selection of $k$ for the Hill estimator. Our goal is not an `optimal' $k$ but one that is `good enough', in the sense that we strive for non-asymptotic guarantees that might be sub-optimal but are explicit and require minimal conditions. We propose a transparent adaptive rule that does not require preliminary calibration of constants, inspired by `adaptive validation' developed in high-dimensional statistics. A key feature of our approach is the consideration of a grid for $k$ of size $\ll n$, which aligns with common practice among practitioners but has remained unexplored in theoretical analysis. Our rule only involves an explicit expression of a variance-type term; in particular, it does not require controlling or estimating a bias term. Our theoretical analysis is valid for all heavy-tailed distributions, specifically for all regularly varying survival functions. Furthermore, when von Mises conditions hold, our method achieves `almost' minimax optimality with a rate of $\sqrt{\log \log n}~ n^{-|\rho|/(1+2|\rho|)}$ when the grid size is of order $\log n$, in contrast to the $(\log \log (n)/n)^{|\rho|/(1+2|\rho|)}$ rate in existing work. Our simulations show that our approach performs particularly well for ill-behaved distributions.
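For readers unfamiliar with the Hill estimator, a minimal sketch for a fixed $k$ is given below. It implements only the standard definition, $\hat\gamma_k = \frac{1}{k}\sum_{i=1}^{k} \log\big(X_{(n-i+1)} / X_{(n-k)}\big)$, computed from the order statistics; the adaptive rule for choosing $k$ proposed in the paper is not reproduced here, and the function name `hill_estimator` is an illustrative assumption.

```python
import numpy as np

def hill_estimator(sample, k):
    """Hill estimator of the tail index gamma from the k largest observations.

    Uses the (k+1)-th largest observation as the threshold, which must be positive.
    """
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = x[n - k - 1]   # X_(n-k), the (k+1)-th largest value
    if threshold <= 0:
        raise ValueError("threshold order statistic must be positive")
    top = x[n - k:]            # the k largest values
    return float(np.mean(np.log(top / threshold)))

# Usage: exact Pareto data with tail index gamma = 0.5, drawn by inverse transform.
rng = np.random.default_rng(0)
pareto_sample = (1.0 - rng.random(100_000)) ** (-0.5)
gamma_hat = hill_estimator(pareto_sample, k=2_000)
```

On exact Pareto data the estimate is stable across a wide range of $k$; the difficulty the abstract describes arises for distributions that are only regularly varying, where the bias grows with $k$ while the variance shrinks.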