FAT TAILS RESEARCH PROGRAM

A Short Note on P-Value Hacking


Nassim Nicholas Taleb
Tandon School of Engineering

arXiv:1603.07532v4 [stat.AP] 25 Jan 2018

Abstract—We present the expected values from p-value hacking as a choice of the minimum p-value among m independent tests, which can be considerably lower than the "true" p-value, even with a single trial, owing to the extreme skewness of the meta-distribution.
We first present an exact probability distribution (meta-distribution) for p-values across ensembles of statistically identical phenomena. We derive the distribution for small samples 2 < n ≤ n* ≈ 30 as well as the limiting one as the sample size n becomes large. We also look at the properties of the "power" of a test through the distribution of its inverse for a given p-value and parametrization.
The formulas allow the investigation of the stability of the reproduction of results, of "p-hacking", and of other aspects of meta-analysis.
P-values are shown to be extremely skewed and volatile, regardless of the sample size n, and to vary greatly across repetitions of exactly the same protocol under identical stochastic copies of the phenomenon; such volatility makes the minimum p-value diverge significantly from the "true" one. Setting the power is shown to offer little remedy unless the sample size is increased markedly or the p-value is lowered by at least one order of magnitude.

[Figure 1: expected minimum p-value against the number of trials m] Fig. 1. The "p-hacking" value across m trials for the "true" median p-value pM = .15 and expected "true" value ps = .22. We can observe how easily one can reach spurious values < .02 with a small number of trials.
P-VALUE hacking, just like an option or other members of the class of convex payoffs, is a function that benefits from the underlying variance and higher-moment variability. The researcher or group of researchers has an implicit "option" to pick the most favorable result in m trials, without disclosing the number of attempts, so we tend to get a rosier picture of the end result than reality. The distribution of the minimum p-value and the "optionality" can be made explicit, expressed in a parsimonious formula allowing for the understanding of biases in scientific studies, particularly under environments with high publication pressure.

[Figure 2: PDF of the p-value meta-distribution for n = 5, 10, 15, 20, 25]
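The "optionality" just described is easy to see by simulation: run m statistically identical experiments and report only the smallest p-value. The sketch below is illustrative only, not the paper's derivation; the Gaussian data, effect size, sample size, and function names are our assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def expected_min_p(m, n=20, effect=0.15, reps=4000):
    # m statistically identical experiments; only the best one is reported.
    x = rng.normal(loc=effect, scale=1.0, size=(reps, m, n))
    t = x.mean(axis=2) / (x.std(axis=2, ddof=1) / np.sqrt(n))
    p = stats.t.sf(t, df=n - 1)         # one-tailed p-value of each trial
    return float(p.min(axis=1).mean())  # expected reported (minimum) p-value

for m in (1, 5, 15):
    print(m, round(expected_min_p(m), 3))
```

The reported minimum drops quickly with m even though every experiment draws from the same phenomenon, which is the convex-payoff effect discussed above.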
Assume that we know the "true" p-value, ps: what would its realizations look like across various attempts on statistically identical copies of the phenomena? By true value ps, we mean its expected value by the law of large numbers across an m-ensemble of possible samples for the phenomenon under scrutiny, that is, (1/m) Σ_{i≤m} p_i →^P ps (where →^P denotes convergence in probability). A similar convergence argument can also be made for the corresponding "true median" pM. The distribution of n small samples can be made explicit (albeit with special inverse functions), as well as its parsimonious limiting one for n large, with no other parameter than the median value pM. We were unable to get an explicit form for ps, but we go around it with the use of the median.

Fig. 2. The different values for Equ. 1 showing convergence to the limiting distribution.

It turns out, as we can see in Fig. 3, that the distribution is extremely asymmetric (right-skewed), to the point where 75% of the realizations of a "true" p-value of .05 will be <.05 (a borderline situation is 3× as likely to pass than fail a given protocol) and, what is worse, 60% of the realizations of a true p-value of .12 will be below .05. This implies serious gaming and "p-hacking" by researchers, even under a moderate amount of repetition of experiments.
Although with compact support, the distribution exhibits the attributes of extreme fat-tailedness. For an observed p-value of, say, .02, the "true" p-value is likely to be >.1 (and very possibly close to .2), with a standard deviation >.2 (sic) and a mean deviation of around .35 (sic, sic). Because of the excessive skewness, measures of dispersion in L1 and L2 (and higher norms) vary hardly at all with ps, so the standard deviation is not proportional, meaning an in-sample .01 p-value has a significant probability of having a true value >.3.

Second version, January 2018. First version was March 2015.
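The volatility and right-skewness claimed above can be checked by brute-force replication of a single phenomenon: fix one true effect and sample size, resample many times, and compare the mean, the median, and the share of "significant" copies. A minimal sketch; the parameter values and variable names are ours, chosen only for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# One phenomenon, replicated: same true effect and sample size,
# resampled many times (values are illustrative, not from the paper).
n, effect, reps = 15, 0.44, 20000
x = rng.normal(loc=effect, scale=1.0, size=(reps, n))
t = x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(n))
p = stats.t.sf(t, df=n - 1)          # one-tailed p-value of each copy

print("mean p:   ", round(float(p.mean()), 3))
print("median p: ", round(float(np.median(p)), 3))   # well below the mean
print("share of copies with p < .05:", round(float((p < 0.05).mean()), 3))
```

The median sits well below the mean, so a sizable fraction of replications of the very same protocol clears the .05 bar by luck alone.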


So clearly we don't know what we are talking about when we talk about p-values.
Earlier attempts at an explicit meta-distribution in the literature are found in [1] and [2], though for situations of Gaussian subordination and less parsimonious parametrization. The severity of the problem of the so-called "statistically significant" has been discussed in [3], and a remedy via Bayesian methods was offered in [4], which in fact recommends the same tightening of standards, to p-values ≈ .01. But the gravity of the extreme skewness of the distribution of p-values is only apparent when one looks at the meta-distribution.
For notation, we use n for the sample size of a given study and m for the number of trials leading to a p-value.

I. DERIVATION OF THE METADISTRIBUTION OF P-VALUES

Proposition 1. Let P be a random variable ∈ [0, 1] corresponding to the sample-derived one-tailed p-value from the paired T-test statistic (unknown variance) with median value M(P) = p_M ∈ [0, 1] derived from a sample of size n. The distribution across the ensemble of statistically identical copies of the sample has for PDF

  \phi(p; p_M) = \begin{cases} \phi(p; p_M)_L & \text{for } p < \frac{1}{2} \\ \phi(p; p_M)_H & \text{for } p > \frac{1}{2} \end{cases}

  \phi(p; p_M)_L = \lambda_p^{\frac{1}{2}(-n-1)} \sqrt{-\frac{\lambda_p (\lambda_p - 1)\, \lambda_{p_M}}{(\lambda_p - 1)\lambda_{p_M} - 2\sqrt{(1-\lambda_p)\lambda_p}\sqrt{(1-\lambda_{p_M})\lambda_{p_M}} + 1}} \left( \frac{1}{\lambda_p} - \frac{2\sqrt{1-\lambda_p}\sqrt{\lambda_{p_M}}}{\sqrt{\lambda_p}\sqrt{1-\lambda_{p_M}}} + \frac{1}{1-\lambda_{p_M}} - 1 \right)^{n/2}

  \phi(p; p_M)_H = \left(1-\lambda'_p\right)^{\frac{1}{2}(-n-1)} \left( \frac{(\lambda'_p - 1)(\lambda_{p_M} - 1)}{\lambda'_p(-\lambda_{p_M}) + 2\sqrt{(1-\lambda'_p)\lambda'_p}\sqrt{(1-\lambda_{p_M})\lambda_{p_M}} + 1} \right)^{\frac{n+1}{2}}    (1)

where \lambda_p = I^{-1}_{2p}\left(\frac{n}{2}, \frac{1}{2}\right), \lambda_{p_M} = I^{-1}_{1-2p_M}\left(\frac{1}{2}, \frac{n}{2}\right), \lambda'_p = I^{-1}_{2p-1}\left(\frac{1}{2}, \frac{n}{2}\right), and I^{-1}_{(.)}(., .) is the inverse beta regularized function.

Remark 1. For p = 1/2 the distribution does not exist in theory, but does in practice, and we can work around it with the sequence p_{m_k} = 1/2 ± 1/k, as in the graph showing a convergence to the Uniform distribution on [0, 1] in Figure 4. Also note that what is called the "null" hypothesis is effectively a set of measure 0.

Proof. Let Z be a random normalized variable with realizations ζ, from a vector ~v of n realizations, with sample mean m_v and sample standard deviation s_v, ζ = (m_v − m_h)/(s_v/√n) (where m_h is the level it is tested against), hence assumed to follow a Student T with n degrees of freedom and, crucially, supposed to deliver a mean of ζ̄:

  f(\zeta; \bar{\zeta}) = \frac{\left(\frac{n}{(\bar{\zeta}-\zeta)^2 + n}\right)^{\frac{n+1}{2}}}{\sqrt{n}\, B\left(\frac{n}{2}, \frac{1}{2}\right)}

where B(., .) is the standard beta function. Let g(.) be the one-tailed survival function of the Student T distribution with zero mean and n degrees of freedom:

  g(\zeta) = \mathbb{P}(Z > \zeta) = \begin{cases} \frac{1}{2} I_{\frac{n}{\zeta^2+n}}\left(\frac{n}{2}, \frac{1}{2}\right) & \zeta \geq 0 \\ \frac{1}{2}\left(I_{\frac{\zeta^2}{\zeta^2+n}}\left(\frac{1}{2}, \frac{n}{2}\right) + 1\right) & \zeta < 0 \end{cases}

where I_{(.)}(., .) is the regularized incomplete beta function.
We now look for the distribution of g ◦ f(ζ). Given that g(.) is a legit Borel function, and naming p the probability as a random variable, we have by a standard result for the transformation:

  \phi(p, \bar{\zeta}) = \frac{f\left(g^{(-1)}(p)\right)}{\left|g'\left(g^{(-1)}(p)\right)\right|}

We can convert ζ̄ into the corresponding median survival probability because of the symmetry of Z. Since one half of the observations fall on either side of ζ̄, we can ascertain that the transformation is median preserving: g(ζ̄) = 1/2, hence φ(p_M, .) = 1/2. Hence we end up having {ζ̄ : (1/2) I_{n/(ζ̄²+n)}(n/2, 1/2) = p_M} (positive case) and {ζ̄ : (1/2)(I_{ζ̄²/(ζ̄²+n)}(1/2, n/2) + 1) = p_M} (negative case). Replacing, we get Eq. 1 and Proposition 1 is done.

We note that n does not increase significance, since p-values are computed from normalized variables (hence the universality of the meta-distribution); a high n corresponds to an increased convergence to the Gaussian. For large n, we can prove the following proposition:

Proposition 2. Under the same assumptions as above, the limiting distribution for φ(.) is:

  \lim_{n\to\infty} \phi(p; p_M) = e^{-\mathrm{erfc}^{-1}(2 p_M)\left(\mathrm{erfc}^{-1}(2 p_M) - 2\,\mathrm{erfc}^{-1}(2 p)\right)}    (2)

where erfc(.) is the complementary error function and erfc^{-1}(.) its inverse.
The limiting CDF Φ(.):

  \Phi(k; p_M) = \frac{1}{2}\,\mathrm{erfc}\left(\mathrm{erf}^{-1}(1 - 2k) - \mathrm{erf}^{-1}(1 - 2 p_M)\right)    (3)

Proof. For large n, the distribution of Z = m_v/(s_v/√n) becomes that of a Gaussian, and the one-tailed survival function becomes g(ζ) = (1/2) erfc(ζ/√2), with ζ(p) → √2 erfc^{-1}(2p).
This limiting distribution applies for paired tests with known or assumed sample variance, since the test becomes a Gaussian variable, equivalent to the convergence of the T-test (Student T) to the Gaussian when n is large.

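The limiting forms in Proposition 2 can be evaluated directly with scipy's erfc/erfcinv. The sketch below (function names ours) checks two properties implied by the derivation: the transformation is median preserving, so the CDF at p = p_M is exactly 1/2, and the PDF of Eq. 2 is consistent with the CDF of Eq. 3 over an interval.

```python
import numpy as np
from scipy import integrate, special

def phi(p, pM):
    # limiting meta-distribution PDF of the p-value (Eq. 2)
    u, v = special.erfcinv(2 * p), special.erfcinv(2 * pM)
    return np.exp(-v * (v - 2 * u))

def Phi(k, pM):
    # limiting CDF (Eq. 3); erf^-1(1 - 2k) is the same as erfc^-1(2k)
    return 0.5 * special.erfc(special.erfcinv(2 * k) - special.erfcinv(2 * pM))

pM = 0.15
print("CDF at the median:", Phi(pM, pM))   # median preserving: exactly 0.5

a, b = 0.01, 0.5
mass, _ = integrate.quad(phi, a, b, args=(pM,))
print(round(mass, 6), round(float(Phi(b, pM) - Phi(a, pM)), 6))  # should agree
```

The only parameter is the median p_M, which is the parsimony claimed in the text.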

[Figure] Fig. 3. The probability distribution of a one-tailed p-value with expected value .11 generated by Monte Carlo (histogram) as well as analytically with φ(.) (the solid line). We draw all possible subsamples from an ensemble with given properties; ∼53% of realizations fall below .05 and ∼25% below .01, with the 5% cut-point, the true mean, and the median marked. The excessive skewness of the distribution makes the average value considerably higher than most observations, hence causing illusions of "statistical significance".

[Figure] Fig. 4. The probability distribution of p at different values of pM (.025, .1, .15, 0.5). We observe how pM = 1/2 leads to a uniform distribution.

Remark 2. For values of p close to 0, φ in Equ. 2 can be usefully calculated as:

  \phi(p; p_M) = \sqrt{2\pi p_M \log\left(\frac{1}{2\pi p_M^2}\right)}\; e^{-\sqrt{\log\left(2\pi \log\frac{1}{2\pi p^2}\right) - 2\log(p)}\,\sqrt{\log\left(2\pi \log\frac{1}{2\pi p_M^2}\right) - 2\log(p_M)}} + O(p^2).    (4)

The approximation works more precisely for the band of relevant values 0 < p < 1/(2π).
From this we can get numerical results for convolutions of φ using the Fourier Transform or similar methods.

II. P-VALUE HACKING

We can now get the distribution of the minimum p-value per m trials across statistically identical situations, and thus get an idea of "p-hacking", defined as attempts by researchers to get the lowest p-value of many experiments, or to try until one of the tests produces statistical significance.

Proposition 3. The distribution of the minimum of m observations of statistically identical p-values becomes (under the limiting distribution of Proposition 2):

  \phi_m(p; p_M) = m\, e^{\mathrm{erfc}^{-1}(2 p_M)\left(2\,\mathrm{erfc}^{-1}(2p) - \mathrm{erfc}^{-1}(2 p_M)\right)} \left(1 - \frac{1}{2}\,\mathrm{erfc}\left(\mathrm{erfc}^{-1}(2p) - \mathrm{erfc}^{-1}(2 p_M)\right)\right)^{m-1}    (5)

Proof. P(p_1 > p, p_2 > p, \ldots, p_m > p) = \prod_{i=1}^{m} \bar{\Phi}(p_i) = \bar{\Phi}(p)^m. Taking the first derivative we get the result.

Outside the limiting distribution, we integrate numerically for different values of m, as shown in Figure 1. So, more precisely, for m trials, the expectation is calculated as:

  E(p_{\min}) = m \int_0^1 p\, \phi(p; p_M) \left(1 - \int_0^p \phi(u, .)\, du\right)^{m-1} dp

III. OTHER DERIVATIONS

Inverse Power of Test. Let β be the power of a test for a given p-value p, for random draws X from an unobserved parameter θ and a sample size of n. To gauge the reliability of β as a true measure of power, we perform an inverse problem:

  \beta \vdash X_{\theta, p, n}, \qquad \beta^{-1}(X)

Proposition 4. Let β_c be the projection of the power of the test from the realizations assumed to be Student T distributed and evaluated under the parameter θ. We have

  \Phi(\beta_c) = \begin{cases} \Phi(\beta_c)_L & \text{for } \beta_c < \frac{1}{2} \\ \Phi(\beta_c)_H & \text{for } \beta_c > \frac{1}{2} \end{cases}

where

  \Phi(\beta_c)_L = \sqrt{1-\gamma_1}\,\gamma_1^{-\frac{n}{2}} \left( \frac{-(\gamma_1-1)\gamma_1 - 2\sqrt{-(\gamma_1-1)\gamma_1} + \gamma_1\left(2\sqrt{\gamma_3 - 1} - \gamma_3 - 1\right)}{2\gamma_3 - 1} \right)^{\frac{n+1}{2}} \frac{1}{\sqrt{-(\gamma_1-1)\gamma_1}}    (6)

  \Phi(\beta_c)_H = \sqrt{\gamma_2}\,(1-\gamma_2)^{-\frac{n}{2}}\, B\left(\frac{1}{2}, \frac{n}{2}\right) \left( \frac{-2\left(\sqrt{-(\gamma_2-1)\gamma_2} + \gamma_2\right)\sqrt{\gamma_3 - 1} + 2\sqrt{\gamma_3 - 1} + 2\sqrt{-(\gamma_2-1)\gamma_2} - 1}{\gamma_2 - 1} + \gamma_3 \right)^{\frac{n+1}{2}} \frac{1}{\sqrt{-(\gamma_2-1)\gamma_2}\; B\left(\frac{n}{2}, \frac{1}{2}\right)}    (7)

where \gamma_1 = I^{-1}_{2\beta_c}\left(\frac{n}{2}, \frac{1}{2}\right), \gamma_2 = I^{-1}_{2\beta_c - 1}\left(\frac{1}{2}, \frac{n}{2}\right), and \gamma_3 = I^{-1}_{(1, 2p_s - 1)}\left(\frac{n}{2}, \frac{1}{2}\right).
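Proposition 3 can be cross-checked numerically: Eq. 5 is, by construction, m φ(p)(1 − Φ(p))^{m−1}, the density of the smallest of m draws, and the CDF of Eq. 3 can be inverted to sample p-values directly. A sketch under those limiting-distribution assumptions (function names ours):

```python
import numpy as np
from scipy import special

def phi(p, pM):
    # limiting meta-distribution PDF (Eq. 2)
    u, v = special.erfcinv(2 * p), special.erfcinv(2 * pM)
    return np.exp(-v * (v - 2 * u))

def Phi(k, pM):
    # limiting CDF (Eq. 3)
    return 0.5 * special.erfc(special.erfcinv(2 * k) - special.erfcinv(2 * pM))

def phi_min(p, pM, m):
    # closed form of Eq. 5: density of the minimum of m p-values
    u, v = special.erfcinv(2 * p), special.erfcinv(2 * pM)
    return m * np.exp(v * (2 * u - v)) * (1 - 0.5 * special.erfc(u - v)) ** (m - 1)

p_grid, pM, m = np.linspace(0.01, 0.4, 40), 0.15, 8
direct = m * phi(p_grid, pM) * (1 - Phi(p_grid, pM)) ** (m - 1)
print(np.allclose(phi_min(p_grid, pM, m), direct))  # the two forms agree

# Monte Carlo: invert Eq. 3 to sample p-values, then take the min of m
rng = np.random.default_rng(3)
U = rng.uniform(size=(100_000, m))
samples = 0.5 * special.erfc(special.erfcinv(2 * U) + special.erfcinv(2 * pM))
print("estimated E(p_min):", round(float(samples.min(axis=1).mean()), 4))
```

The estimated E(p_min) sits far below the "true" median p_M = .15, echoing Figure 1.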

IV. APPLICATION AND CONCLUSION

• One can safely see that, under such stochasticity for the realizations of p-values and the distribution of their minimum, to get what a scientist would expect from a 5% confidence level (and the inferences drawn from it), one needs a p-value at least one order of magnitude smaller.
• Attempts at replicating papers, such as the open science project [5], should consider a margin of error in their own procedure and a pronounced bias towards favorable results (Type-I error). There should be no surprise that a previously deemed significant test fails during replication; in fact it is the replication of results deemed significant at a close margin that should be surprising.
• The "power" of a test has the same problem unless one either lowers p-values or sets the test at higher levels, such as .99.

ACKNOWLEDGMENT
Marco Avellaneda, Pasquale Cirillo, Yaneer Bar-Yam, friendly people on twitter, less friendly verbagiastic psychologists on twitter, ...

REFERENCES
[1] H. J. Hung, R. T. O'Neill, P. Bauer, and K. Kohne, "The behavior of the p-value when the alternative hypothesis is true," Biometrics, pp. 11–22, 1997.
[2] H. Sackrowitz and E. Samuel-Cahn, "P values as random variables—expected p values," The American Statistician, vol. 53, no. 4, pp. 326–331, 1999.
[3] A. Gelman and H. Stern, "The difference between "significant" and "not significant" is not itself statistically significant," The American Statistician, vol. 60, no. 4, pp. 328–331, 2006.
[4] V. E. Johnson, "Revised standards for statistical evidence," Proceedings of the National Academy of Sciences, vol. 110, no. 48, pp. 19313–19317, 2013.
[5] Open Science Collaboration, "Estimating the reproducibility of psychological science," Science, vol. 349, no. 6251, p. aac4716, 2015.
