Auto-Calibration Tests for Discrete Finite Regression Functions
Abstract
Auto-calibration is an important property of regression functions for actuarial applications. Comparatively little is known about statistical testing of auto-calibration. Denuit et al. (2024) recently published a test whose asymptotic distribution is not fully explicit, so that its evaluation requires non-parametric Monte Carlo sampling. In a simpler set-up, we present three test statistics with fully known and interpretable asymptotic distributions.
Keywords. Auto-calibration, concentration curve, Lorenz curve, area between the curves.
1 Introduction
Recent actuarial and financial literature acknowledges the importance of the statistical concept of auto-calibration; see, e.g., Krüger–Ziegel [6], Denuit et al. [1] and Wüthrich [7]. Select an integrable response variable $Y$ and covariates $\boldsymbol{X}$ with support $\mathcal{X}$.
Definition 1.1
A measurable regression function $\mu:\mathcal{X}\to\mathbb{R}$ is auto-calibrated for $(Y,\boldsymbol{X})$ if
$$\mu(\boldsymbol{X})=\mathbb{E}\left[Y\,\middle|\,\mu(\boldsymbol{X})\right],\qquad \mathbb{P}\text{-a.s.}$$
In an actuarial pricing context this means that every price cohort $\mu(\boldsymbol{X})$ is on average self-financing for the claims $Y$, or in other words, there is no systematic cross-financing within a pricing scheme designed by the regression function $\mu$.
Surprisingly, there is no mature literature on testing for auto-calibration. Most proposals only consider binary responses, e.g., Gneiting–Resin [5] discuss a bootstrap test and Dimitriadis et al. [4] study calibration bands. Recently, Denuit et al. [2] presented an auto-calibration test that studies the difference between the concentration curve (CC) and the Lorenz curve (LC). This test also requires simulations because the asymptotic distribution of the test statistic is not sufficiently explicit. We take one step back here and present simpler test statistics with fully known and interpretable asymptotic distributions, albeit in a simpler set-up.
One needs three ingredients for an auto-calibration test. (a) A regression function $\mu:\mathcal{X}\to\mathbb{R}$. This regression function can be fully general, i.e., we do not require that it is close (in some metric) to the conditional mean $\mathbb{E}[Y|\boldsymbol{X}]$, nor do we specify whether $\mu$ has been estimated from past data or whether it has been set by an expert. (b) A pair $(Y,\boldsymbol{X})$. For simplicity, we assume that the response $Y$ is positive and square integrable. The covariates $\boldsymbol{X}$ have support $\mathcal{X}$. (c) An i.i.d. sample $(Y_i,\boldsymbol{X}_i)_{i=1}^n$ for testing. This sample should have the same law as $(Y,\boldsymbol{X})$. These three ingredients (a)-(c) are sufficient for testing for auto-calibration of $\mu$ for $(Y,\boldsymbol{X})$; if $\mu$ has been estimated from past data, we generally assume that these past data, $(Y,\boldsymbol{X})$ and $(Y_i,\boldsymbol{X}_i)_{i=1}^n$ are independent, and all subsequent statements then need to be understood conditional on the past data.
2 Tests for auto-calibration
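Assume that the regression function $\mu$ takes only finitely many (ordered) values $\mu_1<\ldots<\mu_K$. This gives us a partition $(\mathcal{X}_k)_{k=1}^K$ of the covariate space $\mathcal{X}$ with
$$\mathcal{X}_k=\left\{\boldsymbol{x}\in\mathcal{X};\ \mu(\boldsymbol{x})=\mu_k\right\}\qquad\text{for } 1\le k\le K. \tag{2.1}$$
We assume probabilities $p_k=\mathbb{P}[\boldsymbol{X}\in\mathcal{X}_k]>0$, otherwise the corresponding part of the covariate space can be dropped. In this finite partition case (2.1), auto-calibration of $\mu$ for $(Y,\boldsymbol{X})$ is equivalent to
$$\mathbb{E}\left[Y\,\middle|\,\boldsymbol{X}\in\mathcal{X}_k\right]=\mu_k\qquad\text{for all } 1\le k\le K.$$
Using the tower property, auto-calibration of $\mu$ for $(Y,\boldsymbol{X})$ implies for all $1\le k\le K$
$$\mathbb{E}\left[\left(Y-\mu(\boldsymbol{X})\right)\mathbb{1}_{\{\mu(\boldsymbol{X})=\mu_k\}}\right]=0; \tag{2.2}$$
this statement is essentially the same as Wüthrich [7, Proposition 4.1]. For a given i.i.d. sample $(Y_i,\boldsymbol{X}_i)_{i=1}^n$, this motivates the statistics
$$\hat{U}_k=\frac{1}{n}\sum_{i=1}^n\left(Y_i-\mu(\boldsymbol{X}_i)\right)\mathbb{1}_{\{\mu(\boldsymbol{X}_i)=\mu_k\}}\qquad\text{for } 1\le k\le K.$$
Under auto-calibration of $\mu$ for $(Y,\boldsymbol{X})$, these empirical quantities $\hat{U}_k$, $1\le k\le K$, converge to zero, $\mathbb{P}$-a.s., as $n\to\infty$, and we have the following central limit theorem.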
Proposition 2.1
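Under auto-calibration of $\mu$ for $(Y,\boldsymbol{X})$
$$\sqrt{n}\left(\hat{U}_1,\ldots,\hat{U}_K\right)^\top\ \Rightarrow\ \mathcal{N}\left(\boldsymbol{0},\ \mathrm{diag}\left(p_1\sigma_1^2,\ldots,p_K\sigma_K^2\right)\right)\qquad\text{as } n\to\infty,$$
with conditional variances $\sigma_k^2=\mathrm{Var}\left(Y\,\middle|\,\mu(\boldsymbol{X})=\mu_k\right)$ for $1\le k\le K$.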
The proof of this proposition is standard and based on characteristic functions.
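The diagonal covariance structure of the limit can be checked by a direct moment computation. The following is only a sketch under the notation assumed here: $Z_k=(Y-\mu(\boldsymbol{X}))\mathbb{1}_{\{\mu(\boldsymbol{X})=\mu_k\}}$ is a symbol introduced for this illustration, $p_k=\mathbb{P}[\mu(\boldsymbol{X})=\mu_k]$ and $\sigma_k^2=\mathrm{Var}(Y\,|\,\mu(\boldsymbol{X})=\mu_k)$.

```latex
\begin{align*}
\mathbb{E}[Z_k]
  &= p_k\left(\mathbb{E}\left[Y \,\middle|\, \mu(\boldsymbol{X})=\mu_k\right]-\mu_k\right) = 0
  && \text{(auto-calibration)},\\
\mathrm{Var}(Z_k)
  &= \mathbb{E}\left[(Y-\mu_k)^2\,\mathbb{1}_{\{\mu(\boldsymbol{X})=\mu_k\}}\right]
   = p_k\,\sigma_k^2,\\
\mathrm{Cov}(Z_k,Z_l)
  &= \mathbb{E}[Z_k Z_l] = 0
  && \text{for } k\neq l \text{ (disjoint indicators)}.
\end{align*}
```

Since the level-$k$ statistic is the empirical mean of $n$ i.i.d. copies of $Z_k$, the multivariate central limit theorem applied to the vector $(Z_1,\ldots,Z_K)^\top$ yields a centred Gaussian limit with exactly this diagonal covariance matrix.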
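Test 1. Under the null hypothesis of $\mu$ being auto-calibrated for $(Y,\boldsymbol{X})$, (2.2) is a necessary condition for all $1\le k\le K$. We test this against the alternative that there exists a $k$ with $\mathbb{E}[(Y-\mu(\boldsymbol{X}))\mathbb{1}_{\{\mu(\boldsymbol{X})=\mu_k\}}]\neq 0$. Under the null hypothesis, Proposition 2.1 gives us for $t\ge 0$ and large $n$
$$\mathbb{P}\left[\max_{1\le k\le K}\sqrt{n}\,\big|\hat{U}_k\big|\le t\right]\ \approx\ \prod_{k=1}^K\left(2\,\Phi\left(\frac{t}{\sqrt{p_k\sigma_k^2}}\right)-1\right), \tag{2.3}$$
where $\Phi$ denotes the standard Gaussian distribution function.
Often, it is beneficial to test for the maximum of the normalized quantities $\sqrt{n}\,|\hat{U}_k|/\sqrt{p_k\sigma_k^2}$, to have all terms on the same scale. This provides the asymptotic limit $(2\Phi(t)-1)^K$.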
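The critical values of Test 1 can be computed without simulation. The sketch below assumes the limit is a vector of independent centred Gaussians and that the test is two-sided (maximum of absolute values); the function names and the probabilities `p` are hypothetical choices for illustration, and the variances are taken as $\mu_k/3$ for $\mu_k=10,\ldots,15$ as in the supplementary example.

```python
from statistics import NormalDist

PHI = NormalDist()  # standard Gaussian distribution

def max_abs_cdf(t, scales):
    """P[max_k |s_k * Z_k| <= t] for independent standard Gaussians Z_k."""
    prob = 1.0
    for s in scales:
        prob *= 2.0 * PHI.cdf(t / s) - 1.0
    return prob

def quantile_normalized(K, level=0.95):
    """Closed-form quantile when all K increments are scaled to unit variance."""
    return PHI.inv_cdf((1.0 + level ** (1.0 / K)) / 2.0)

def quantile_unscaled(scales, level=0.95, tol=1e-10):
    """Bisection root search for the quantile of max_k |s_k * Z_k|."""
    lo, hi = 0.0, 10.0 * max(scales)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if max_abs_cdf(mid, scales) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical inputs: these probabilities are illustrative, not the paper's.
p = [0.05, 0.15, 0.30, 0.30, 0.15, 0.05]
sigma2 = [m / 3.0 for m in range(10, 16)]
scales = [(pk * s2) ** 0.5 for pk, s2 in zip(p, sigma2)]

q_norm = quantile_normalized(K=6)   # normalized version of Test 1
q_raw = quantile_unscaled(scales)   # unscaled version, depends on p and sigma2
print(round(q_norm, 4), round(q_raw, 4))
```

For $K=6$ the normalized quantile evaluates to roughly 2.631, which agrees with the Test 1b entry of Table 2 in the supplement; the unscaled quantile depends on the assumed probabilities and will therefore differ from the Test 1a entry.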
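Denuit et al. [2, formula (2.4)] consider an aggregated version of $\hat{U}_k$. Namely, auto-calibration of $\mu$ for $(Y,\boldsymbol{X})$ implies for all $1\le k\le K$
$$\mathbb{E}\left[\left(Y-\mu(\boldsymbol{X})\right)\mathbb{1}_{\{\mu(\boldsymbol{X})\le\mu_k\}}\right]=0. \tag{2.4}$$
For a given i.i.d. sample $(Y_i,\boldsymbol{X}_i)_{i=1}^n$, this motivates the statistics
$$\hat{V}_k=\frac{1}{n}\sum_{i=1}^n\left(Y_i-\mu(\boldsymbol{X}_i)\right)\mathbb{1}_{\{\mu(\boldsymbol{X}_i)\le\mu_k\}}=\sum_{j=1}^k\hat{U}_j\qquad\text{for } 1\le k\le K.$$
The following corollary is an immediate consequence of Proposition 2.1.
Corollary 2.2
Under auto-calibration of $\mu$ for $(Y,\boldsymbol{X})$
$$\sqrt{n}\left(\hat{V}_1,\ldots,\hat{V}_K\right)^\top\ \Rightarrow\ \mathcal{N}\left(\boldsymbol{0},\Sigma\right)\quad\text{as } n\to\infty,\qquad \Sigma_{k,l}=\sum_{j=1}^{\min\{k,l\}}p_j\sigma_j^2.$$
Thus, the aggregated statistics $(\sqrt{n}\,\hat{V}_k)_{1\le k\le K}$ can asymptotically be described by a random walk
$$W_k=\sum_{j=1}^k\sqrt{p_j\sigma_j^2}\ \varepsilon_j\qquad\text{for } 1\le k\le K, \tag{2.5}$$
with i.i.d. standard Gaussian innovations $\varepsilon_j$ for $1\le j\le K$.
Test 2. Under the null hypothesis of $\mu$ being auto-calibrated for $(Y,\boldsymbol{X})$, (2.4) is a necessary condition for all $1\le k\le K$. We test this against the alternative that there exists a $k$ with $\mathbb{E}[(Y-\mu(\boldsymbol{X}))\mathbb{1}_{\{\mu(\boldsymbol{X})\le\mu_k\}}]\neq 0$. Under the null hypothesis, Corollary 2.2 gives us for $t\ge 0$ and large $n$
$$\mathbb{P}\left[\max_{1\le k\le K}\sqrt{n}\,\big|\hat{V}_k\big|\le t\right]\ \approx\ \mathbb{P}\left[\max_{1\le k\le K}\big|W_k\big|\le t\right]. \tag{2.6}$$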
Up to one point discussed below, the asymptotic approximation (2.6) gives an explicit description of the intractable limit in Denuit et al. [2, Proposition 3.1]. Namely, the asymptotic distribution of the test statistic in (2.6) corresponds to the maximum of the random walk (2.5), whose increments are fully determined by the probabilities $(p_k)_{1\le k\le K}$, given in (2.1), and the conditional variances $(\sigma_k^2)_{1\le k\le K}$, given in Proposition 2.1. These two parameter sets can be determined from past data that are independent of the i.i.d. sample $(Y_i,\boldsymbol{X}_i)_{i=1}^n$, see the discussion in Section 1. The rejection region is then obtained by (easy) random walk simulations involving only these two (estimated) parameter sets $(p_k)_{1\le k\le K}$ and $(\sigma_k^2)_{1\le k\le K}$. This seems simpler than the non-parametric Monte Carlo method used in Denuit et al. [2, Section 3.1].
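The random walk simulation mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the test is based on the maximum of the absolute random walk; the function name, the seed, the number of simulations and the probabilities `p` are hypothetical, and the variances are again taken as $\mu_k/3$ for $\mu_k=10,\ldots,15$.

```python
import random

def random_walk_abs_max_quantile(step_sd, level=0.95, n_sims=20000, seed=1):
    """Monte Carlo quantile of max_k |W_k| for a Gaussian random walk W
    whose k-th increment has standard deviation step_sd[k]."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sims):
        walk, running_max = 0.0, 0.0
        for sd in step_sd:
            walk += sd * rng.gauss(0.0, 1.0)
            running_max = max(running_max, abs(walk))
        maxima.append(running_max)
    maxima.sort()
    # empirical quantile via the order statistic at the requested level
    return maxima[min(n_sims - 1, round(level * n_sims))]

# Hypothetical parameter sets (in practice estimated from past data).
p = [0.05, 0.15, 0.30, 0.30, 0.15, 0.05]
sigma2 = [m / 3.0 for m in range(10, 16)]
step_sd = [(pk * s2) ** 0.5 for pk, s2 in zip(p, sigma2)]

q_walk = random_walk_abs_max_quantile(step_sd)
print(round(q_walk, 3))
```

The null hypothesis would be rejected at the 5% level if the observed maximum of the scaled aggregated statistics exceeds `q_walk`. Only the two parameter sets enter the simulation, which is what makes it cheap.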
3 Testing for the area between the curves
The consideration of is motivated by the difference of the CC and the LC. Denote by the left-continuous generalized inverse of the distribution function of . The difference between the CC and the LC at probability level is defined by
For a regression function with discrete finite range (2.1), only takes different values in the cumulative probabilities , and we set . Namely, we have
(3.1) |
Under unbiasedness , we have
These normalized differences motivate the study of under auto-calibration of for , which implies the above unbiasedness. Denuit et al. [2, Proposition 3.1] do not exploit an auto-calibration test for , but rather for . Unfortunately, the normalized quantities and are more involved. For a given i.i.d. sample , consider
with being the empirical mean of and the empirical mean of . Dealing with instead of is more cumbersome because of these normalizations. These normalizations are mainly motivated by the fact that they imply that both the CC and the LC are calibrated to 1 for . In statistical modeling, this then allows one to perform model selection by selecting the model that has the most convex CC, as a higher convexity implies better discrimination; see Wüthrich [7]. Similarly, in economics, a more convex LC indicates higher inequality in wealth distribution. However, for testing of auto-calibration this normalization seems not justified, and we give preference to the simpler unscaled quantity . Note that
(3.2) | |||||
Corollary 2.2 and Slutsky’s theorem give weak convergence of the first term in (3.2) to . For the second term in (3.2), one establishes weak convergence of , and the other terms are treated by Slutsky’s theorem. Finally, one needs to compute the covariance between the two terms in (3.2) to get the asymptotic variance of . This is doable, but cumbersome. Therefore, we prefer to study the non-normalized quantities .
Based on , Denuit et al. [3, formula (4.4)] introduced the area between the curves (ABC) as a model selection criterion. The ABC is defined by
Again, we prefer the unscaled version. Under the discrete finite regression function, we have
For a given i.i.d. sample $(Y_i,\boldsymbol{X}_i)_{i=1}^n$, this motivates an integrated random walk statistic
Under auto-calibration of $\mu$ for $(Y,\boldsymbol{X})$, this statistic converges to zero, $\mathbb{P}$-a.s. Slightly modifying the terms, we propose the following weighted $L^2$-norm statistic of the increments
(3.3) |
thus, the random walk increments with different signs cannot compensate each other.
Corollary 3.1
Under auto-calibration of for
where the random variables in the limit are i.i.d. $\chi^2$-distributed with one degree of freedom.
Test 3. Under the above assumptions, we can test for auto-calibration of $\mu$ for $(Y,\boldsymbol{X})$ by exploiting the limiting distribution of Corollary 3.1 numerically. As in Test 2, this limiting distribution only depends on the two parameter sets $(p_k)_{1\le k\le K}$ and $(\sigma_k^2)_{1\le k\le K}$.
Dropping the weighting in (3.3) and scaling the individual terms by their asymptotic variances $p_k\sigma_k^2$ gives a $\chi^2$-test with $K$ degrees of freedom.
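A limiting distribution given by a weighted sum of independent $\chi^2$ variables with one degree of freedom can be evaluated numerically by simple Monte Carlo. The sketch below takes the weights as an input, because the exact weights of Corollary 3.1 are not reproduced here; the function name, the seed and the number of simulations are hypothetical.

```python
import random

def weighted_chi2_quantile(weights, level=0.95, n_sims=40000, seed=1):
    """Monte Carlo quantile of sum_k w_k * Z_k^2 with i.i.d. standard
    Gaussians Z_k, i.e. a weighted sum of chi-square(1) variables."""
    rng = random.Random(seed)
    draws = sorted(
        sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        for _ in range(n_sims)
    )
    return draws[min(n_sims - 1, round(level * n_sims))]

# Unit weights give the chi-square case with K = 6 degrees of freedom.
q_unit = weighted_chi2_quantile([1.0] * 6)
print(round(q_unit, 2))
```

With unit weights the result should be close to the classical 95% quantile of a $\chi^2$-distribution with 6 degrees of freedom (12.5916, see Table 2 in the supplement), up to Monte Carlo error. Other weights, for instance the asymptotic variances of the increments, are passed in the same way.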
4 Conclusions
This letter considers statistical testing of auto-calibration. In the simplified set-up of a discrete finite regression function, we provide three different test statistics that have fully known asymptotic distributions under auto-calibration, see (2.3), (2.6) and Corollary 3.1. These three test statistics consider random walk increments, a random walk and an integrated random walk. The three test statistics can be used for statistical testing of auto-calibration in our simpler set-up; Test 2 is a modified version of Denuit et al. [2, Proposition 3.1].
In this letter, we did not study the power of these tests. The power will depend on the kind of violation of auto-calibration; in fact, we believe that it is beneficial to normalize all random walk increments to unit variance in any of the three presented tests. Another open problem is to generalize these tests to arbitrary regression functions; this seems feasible for Tests 2 and 3.
References
- [1] Denuit, M., Charpentier, A., Trufin, J. (2021). Autocalibration and Tweedie-dominance for insurance pricing in machine learning. Insurance: Mathematics and Economics 101/B, 485-497.
- [2] Denuit, M., Huyghe, J., Trufin, J., Verdebout, T. (2024). Testing for auto-calibration with Lorenz and concentration curves. Insurance: Mathematics and Economics 117, 130-139.
- [3] Denuit, M., Sznajder, D., Trufin, J. (2019). Model selection based on Lorenz and concentration curves, Gini indices and convex order. Insurance: Mathematics and Economics 89, 128-139.
- [4] Dimitriadis, T., Dümbgen, L., Henzi, A., Puke, M., Ziegel, J. (2023). Honest calibration assessment for binary outcome predictions. Biometrika 110/3, 663-680.
- [5] Gneiting, T., Resin, J. (2023). Regression diagnostics meets forecast evaluation: conditional calibration, reliability diagrams, and coefficient of determination. Electronic Journal of Statistics 17, 3226-3286.
- [6] Krüger, F., Ziegel, J.F. (2021). Generic conditions for forecast dominance. Journal of Business & Economic Statistics 39/4, 972-983.
- [7] Wüthrich, M.V. (2023). Model selection with Gini indices under auto-calibration. European Actuarial Journal 13/1, 469-477.
Supplementary
Proofs
Proof of Proposition 2.1. Set . For , consider the characteristic function
where in the second last step we use auto-calibration of for . This completes the proof.
Example
We study a gamma distribution example with $K=6$ expected response levels $\mu_1<\ldots<\mu_K$. Table 1 shows the selected parameters. Firstly, we choose the probabilities $p_k$ such that the boundary levels receive the smallest probabilities, and the levels in the middle get the highest probabilities. This is a quite common feature in real data. Secondly, the variance parameters $\sigma_k^2$ are increasing in the regression means $\mu_k$. This is also a rather common feature, e.g., a Poisson or a gamma generalized linear model (GLM) has this property. Based on these parameters, we first simulate the regression level $\mu_k$ using the probabilities $p_k$. Based on this level, we then simulate the response $Y$ from a gamma distribution whose shape and scale parameters give conditional mean $\mu_k$ and conditional variance $\sigma_k^2$, see Table 1. In particular, auto-calibration is fulfilled in this example because we simulate from the correct means.
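Level $k$ | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Conditional mean $\mu_k$ | 10 | 11 | 12 | 13 | 14 | 15 |
Conditional variance $\sigma_k^2$ | 10/3 | 11/3 | 12/3 | 13/3 | 14/3 | 15/3 |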



Based on the parameters given in Table 1, we simulate an i.i.d. sample of sample size . Figure 1 (lhs) shows the resulting boxplot of the responses classified w.r.t. their conditional means . Remark that there is auto-calibration in this example. Figure 1 (middle) plots the empirical level means
against their (true) conditional expectations ; this plot is sometimes also called lift plot. Under auto-calibration, the resulting scatter plot should lie fairly much on the diagonal, and their deviation from the diagonal is described (asymptotically) by Proposition 2.1. Figure 1 (rhs) shows the resulting statistics , for . These are obtained from the lift plot by using a different normalization
the ratio on the right-hand side is an empirical estimate of . The magnitude of fluctuations of these statistics around zero should be of order , see Proposition 2.1.
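The simulation set-up and the two normalizations (lift plot versus increment statistics) can be sketched as follows. The means and variances follow Table 1; the probabilities `P`, the seed and the sample size are hypothetical choices, and the gamma parameters (shape $3\mu_k$, scale $1/3$) are derived here from the requirement of mean $\mu_k$ and variance $\mu_k/3$.

```python
import random

MU = [10, 11, 12, 13, 14, 15]             # conditional means (Table 1)
P = [0.05, 0.15, 0.30, 0.30, 0.15, 0.05]  # hypothetical level probabilities

def simulate_sample(n, rng):
    """Draw levels with probabilities P, then gamma responses with
    mean mu_k and variance mu_k / 3 (shape 3 * mu_k, scale 1 / 3)."""
    levels = rng.choices(range(len(MU)), weights=P, k=n)
    y = [rng.gammavariate(3.0 * MU[k], 1.0 / 3.0) for k in levels]
    return levels, y

def increment_statistics(levels, y):
    """Return level counts, empirical level means and the increment
    statistics (1/n) * sum_i (Y_i - mu_k) * 1{level_i = k}."""
    n, K = len(y), len(MU)
    counts, sums = [0] * K, [0.0] * K
    for k, yi in zip(levels, y):
        counts[k] += 1
        sums[k] += yi
    level_means = [sums[k] / counts[k] if counts[k] else float("nan")
                   for k in range(K)]
    u_hat = [(sums[k] - counts[k] * MU[k]) / n for k in range(K)]
    return counts, level_means, u_hat

rng = random.Random(2024)
levels, y = simulate_sample(2000, rng)
counts, level_means, u_hat = increment_statistics(levels, y)
for k in range(len(MU)):
    print(MU[k], counts[k], round(level_means[k], 3), round(u_hat[k], 4))
```

Each increment statistic equals the deviation of the empirical level mean from $\mu_k$ multiplied by the empirical level frequency `counts[k] / n`, which is the different normalization referred to above.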
We repeat this simulation of an i.i.d. sample times to study the empirical distribution of the statistics . For large sample sizes , this empirical distribution should approximately look like the Gaussian limiting distribution given in Proposition 2.1. Our simulation has an empirical mean of magnitude , thus, close to zero. The empirical covariance matrix reads as
The off-diagonals are close to zero and the diagonal is close to true parameters
see asymptotic covariance matrix in Proposition 2.1. This confirms the limiting parameters in the weak convergence result of Proposition 2.1.
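A check of this kind can be reproduced with a small repeated simulation. The sketch below is not the paper's experiment: the probabilities `P`, the seed, the sample size (300) and the number of repetitions (400) are hypothetical and chosen to keep the run time short.

```python
import random

MU = [10, 11, 12, 13, 14, 15]
SIGMA2 = [m / 3.0 for m in MU]
P = [0.05, 0.15, 0.30, 0.30, 0.15, 0.05]  # hypothetical level probabilities

def scaled_increments(n, rng):
    """One sample of size n; returns sqrt(n) times the increment statistics."""
    K = len(MU)
    resid = [0.0] * K
    for k in rng.choices(range(K), weights=P, k=n):
        resid[k] += rng.gammavariate(3.0 * MU[k], 1.0 / 3.0) - MU[k]
    return [r / n ** 0.5 for r in resid]

def covariance(rows):
    """Unbiased empirical covariance matrix of a list of vectors."""
    m, K = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / m for k in range(K)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (m - 1)
             for b in range(K)] for a in range(K)]

rng = random.Random(2024)
rows = [scaled_increments(300, rng) for _ in range(400)]
cov = covariance(rows)
target = [pk * s2 for pk, s2 in zip(P, SIGMA2)]  # diagonal of the limit
for k in range(len(MU)):
    print(round(cov[k][k], 3), round(target[k], 3))
```

The printed pairs compare the empirical variances with the products of level probability and conditional variance; the off-diagonal entries of `cov` should be small relative to the diagonal.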


Figure 2 (lhs) shows the empirical densities of the normalized statistics $\sqrt{n}\,\hat{U}_k/\sqrt{p_k\sigma_k^2}$, for $1\le k\le K$. They are benchmarked against the standard Gaussian density in black color. We see a quite good alignment of these empirical densities, supporting the statement of Proposition 2.1. This justifies using the asymptotic approximation (2.3) for the auto-calibration Test 1 in this example. Since the components in the maximum in (2.3) may live on different scales, we also use an alternative test statistic that evaluates the normalized quantities
$$\mathbb{P}\left[\max_{1\le k\le K}\frac{\sqrt{n}\,\big|\hat{U}_k\big|}{\sqrt{p_k\sigma_k^2}}\le t\right]\ \approx\ \left(2\,\Phi(t)-1\right)^K. \tag{S.1}$$
This then directly relates to the (normalized) graphs in Figure 2 (lhs).
Next, we turn our attention to the second test, involving the random walk consideration (2.5). In this case we get the random walk type empirical covariance matrix
Since we work with a small sample size, there is still some noise involved, which means that the above empirical covariance matrix is not a perfect random walk covariance matrix. The random walk covariance matrix of Corollary 2.2 has diagonal entries $\sum_{j=1}^k p_j\sigma_j^2$. Figure 2 (rhs) plots the empirical densities of $\sqrt{n}\,\hat{V}_k$, for $1\le k\le K$, and these are benchmarked against the Gaussian random walk densities (2.5) of $W_k$, $1\le k\le K$. Again we see a rather good alignment, supporting the asymptotic approximation (2.6) for auto-calibration Test 2. Clearly, the last random walk components $\sqrt{n}\,\hat{V}_K$ and $W_K$, respectively, have the biggest variance, which implies that they will frequently determine the test statistic, see (2.6). Naturally, one could also reverse the index by studying the mirrored quantity, see also Wüthrich (2023) for mirroring,
(S.2) |
and its empirical counterpart
If the terms are increasing in , this latter option may give a test with a better power, because the random walk increments will have a decreasing standard deviation.

Finally, Figure 3 illustrates the asymptotic result of Corollary 3.1. The test statistic does not consider the maximum of the increments, $\max_{1\le k\le K}\sqrt{n}\,|\hat{U}_k|$, but a weighted $L^2$-norm of all increments. In (3.3) we study a weighted $L^2$-norm which has been motivated by the ABC. However, in general, it is not clear why this weighting should be justified. Alternatively, we could also consider an unweighted test statistic
$$n\sum_{k=1}^K\hat{U}_k^2\ \Rightarrow\ \sum_{k=1}^K p_k\sigma_k^2\,\xi_k\qquad\text{as } n\to\infty, \tag{S.3}$$
assuming $\mu$ is auto-calibrated for $(Y,\boldsymbol{X})$, where $\xi_1,\ldots,\xi_K$ are i.i.d. $\chi^2$-distributed with one degree of freedom. Equivalently, we could just consider a $\chi^2$-test
$$n\sum_{k=1}^K\frac{\hat{U}_k^2}{p_k\sigma_k^2}\ \Rightarrow\ \chi^2(K)\qquad\text{as } n\to\infty, \tag{S.4}$$
where the right-hand side is a $\chi^2$-distributed random variable with $K$ degrees of freedom. This is the same scaling as in (S.1); however, we do not consider maxima of increments, but rather aggregated squares of the normalized random walk increments.
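The $\chi^2$-type statistic described above is straightforward to compute from a sample. The following sketch assumes the statistic is $n$ times the sum of squared increment statistics, each divided by the product of level probability and conditional variance; the function name and the toy data are hypothetical.

```python
def chi2_statistic(levels, y, mu, p, sigma2):
    """n * sum_k U_k^2 / (p_k * sigma2_k), where
    U_k = (1/n) * sum_i (y_i - mu_k) * 1{level_i = k}."""
    n, K = len(y), len(mu)
    resid = [0.0] * K
    for k, yi in zip(levels, y):
        resid[k] += yi - mu[k]
    return sum(n * (resid[k] / n) ** 2 / (p[k] * sigma2[k]) for k in range(K))

# Toy data with two levels: level 0 is biased upwards, level 1 is exact.
levels = [0, 0, 1, 1]
y = [2.0, 2.0, 2.0, 2.0]
stat = chi2_statistic(levels, y, mu=[1.0, 2.0], p=[0.5, 0.5], sigma2=[1.0, 1.0])
print(stat)
```

In the toy data the level-0 residuals sum to 2, so the increment statistic is 0.5 and the test statistic is $4\cdot 0.25/0.5=2$. For $K=6$ levels the value would be compared with the 95% quantile 12.5916 reported for Test 3c in Table 2.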
Summarizing, we have seen seven different test statistics that we will exploit numerically:
- (1a) From Test 1, we can study the maximum of the increments, see (2.3).
- (1b) A differently scaled version of Test 1 is given in (S.1).
- (2a) From Test 2, we can study the maximum of a random walk, see (2.6).
- (2b) An index-reversed version of Test 2 is given in (S.2).
- (3a) From Test 3, we get a weighted $L^2$-norm of the random walk increments, see (3.3).
- (3b) An unweighted alternative of Test 3 is given in (S.3).
- (3c) Finally, we have the $\chi^2$-test given by (S.4).
Because we have a discrete regression function taking finitely many values, we obtain a natural partition of the covariate space, $(\mathcal{X}_k)_{k=1}^K$, and of the range of the regression function, $\{\mu_1,\ldots,\mu_K\}$. For continuous regression functions $\mu$, one can discretize the range of the regression function and then perform a $\chi^2$-test for auto-calibration. In the Bernoulli case this has been proposed by Hosmer–Lemeshow (1980), and the discretization is done with the help of the (empirical) quantiles of $\mu(\boldsymbol{X})$. Our proposal is a generalization to arbitrary responses, and we present test statistics that are different (and differently aggregated and normalized) from the classical $\chi^2$-test in the Bernoulli case.
Next, we aim at comparing the resulting powers of the seven tests in a simulation analysis. We therefore contaminate the above model. We simulate responses
(S.5) |
Thus, we introduce a global bias by shifting the means by a positive constant . This is a global shift as it affects equally all levels , .
 | Test 1a | Test 1b | Test 2a | Test 2b | Test 3a | Test 3b | Test 3c |
---|---|---|---|---|---|---|---|
95% quantiles | 2.3456 | 2.6310 | 4.2060 | 4.2263 | 5.4066 | 9.1198 | 12.5916 |
Table 2 gives the quantiles for significance level 5% for the different tests. The quantiles of Tests 1b and 3c are directly available in standard software, the quantile of Test 1a can be found by a root search algorithm, and quantiles of Tests 2a, 2b, 3a and 3b were computed empirically by a (simple) Monte Carlo simulation.




We simulate 10,000 times (with different seeds) i.i.d. samples , , and for a grid of contaminations , see (S.5). This gives us for every simulation and for every contamination level the seven test statistics. In the uncontaminated case roughly 5% of the 10,000 simulations should be above the quantiles of Table 2. This then verifies that the asymptotic results for the tests apply, i.e., that is a sufficiently large sample size for these tests.
For contaminations significantly more simulations should be above the quantiles of Table 2, and the more samples there are above the corresponding quantile the bigger the power of the test. Figure 4 (top-lhs) shows the results. We see that all curves start at the significance level of 5% for . Then, they increase to 1 for increasing contamination . The fastest increase is achieved by Tests 2a-2b (maximum of random walk), followed by Tests 3a-3c (squared sum of random walk increments), and the slowest increase is achieved by Tests 1a-1b (maximum of random walk increments). From this we conclude that the random walk tests (2.6) and (S.2) have the biggest power in case of a global shift, and they should be preferred to find global shifts. Intuitively this is clear, each random walk increment is shifted by the contamination , and in the random walk these shifts are aggregated across all increments. Thus, we have an impact of on the last random walk component . This is why Tests 2a-2b are the most sensitive ones to global shifts. In our example the order of aggregation is not very relevant, and Tests 2a-2b have almost equal power.
Global shifts are one potential cause of a violation of auto-calibration, but the violation can also only occur on individual levels , or on different levels with different signs. To test for this local failure of the auto-calibration property, we only contaminate individual levels of the regression function. Fix , and consider the local contamination
(S.6) |
this only contaminates the responses that have conditional expectation .
Based on this local contamination we repeat the above simulation experiment. Since violation of auto-calibration often happens at the boundary of the range of the regression function, we contaminate the model for the smallest and biggest conditional expectations , . These are also the least frequent levels in our example. Additionally we contaminate level , , being in the main body of the covariate distribution. The results are presented in Figure 4 (top-rhs and bottom). The picture now significantly changes compared to the global contamination. Tests 1b and 3c have the best behavior, both of these tests consider the normalized increments . From this we conclude that one should bring all random walk increments first to the same scale. This is especially true if the violation of auto-calibration takes place at rare boundary levels, and in our case. For contaminated middle levels, in our case, the Tests 1a-1b and 3a-3c are all almost equally good. On the other hand, one should not use the aggregated random walk versions of Tests 2a-2b, because through aggregation the impact of individual violations of auto-calibration gets diluted. Another observation is that if the violation of auto-calibration happens on the biggest level , it cannot be found by the ABC inspired test (3.3). This comes from the scaling which often is a small number. Therefore, we cannot generally recommend Test 3a.
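A reduced version of such a power experiment can be sketched for Tests 1b and 3c, whose 5% critical values are available from Table 2. Everything else below is an assumption made for illustration: the probabilities `P`, the sample size, the number of repetitions, the seed, and in particular the way the contamination enters (a constant added to the response on the affected levels).

```python
import random

MU = [10, 11, 12, 13, 14, 15]
SIGMA2 = [m / 3.0 for m in MU]
P = [0.05, 0.15, 0.30, 0.30, 0.15, 0.05]  # hypothetical level probabilities
Q_1B, Q_3C = 2.6310, 12.5916              # 95% quantiles from Table 2

def rejection_rates(shift, n=500, reps=200, seed=1):
    """Share of simulated samples rejected by Test 1b and Test 3c when
    shift[k] is added to every response on level k (assumed mechanism)."""
    rng = random.Random(seed)
    K = len(MU)
    rej_1b = rej_3c = 0
    for _ in range(reps):
        resid = [0.0] * K
        for k in rng.choices(range(K), weights=P, k=n):
            y = rng.gammavariate(3.0 * MU[k], 1.0 / 3.0) + shift[k]
            resid[k] += y - MU[k]
        # normalized increments: sqrt(n) * U_k / sqrt(p_k * sigma2_k)
        z = [resid[k] / (n * P[k] * SIGMA2[k]) ** 0.5 for k in range(K)]
        rej_1b += max(abs(v) for v in z) > Q_1B
        rej_3c += sum(v * v for v in z) > Q_3C
    return rej_1b / reps, rej_3c / reps

null_rates = rejection_rates([0.0] * 6)                       # no contamination
global_rates = rejection_rates([0.5] * 6)                     # global shift
local_rates = rejection_rates([3.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # lowest level only
print(null_rates, global_rates, local_rates)
```

Without contamination both rejection rates should be near the nominal 5%, and they should rise towards 1 as the shift grows. Passing a shift vector with a single non-zero entry mimics the local contamination of one level, while a constant vector mimics the global shift.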
We summarize our findings of the simulation example as follows:
- Global shifts can most effectively be found by the random walk Tests 2a-2b, but this requires that auto-calibration is violated in the same direction on the entire support of the regression function.
- Local violation of auto-calibration, especially in the tails of the regression function, can most effectively be found by Tests 1b and 3c. Both tests consider scaled random walk increments (with unit variance), i.e., it seems beneficial that all random walk increments live on the same scale.
- The ABC inspired Test 3a can generally not be recommended, because the ABC weighting seems to prefer the lower over the upper tail of the regression function, but there is no specific reason that justifies such a weighting, compare magenta dotted lines in Figures 4 (top-rhs) and (bottom-rhs).
References
- Hosmer, D.W., Lemeshow, S. (1980). Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods 9, 1043-1069.
- Wüthrich, M.V. (2023). Model selection with Gini indices under auto-calibration. European Actuarial Journal 13/1, 469-477.