Review Article
The P Value and Statistical Significance:
Misunderstandings, Explanations, Challenges, and
Alternatives
Chittaranjan Andrade
ABSTRACT
The calculation of a P value in research and especially the use of a threshold to declare the statistical significance of the
P value have both been challenged in recent years. There are at least two important reasons for this challenge: research
data contain much more meaning than is summarized in a P value and its statistical significance, and these two concepts
are frequently misunderstood and consequently inappropriately interpreted. This article considers why 5% may be set
as a reasonable cut-off for statistical significance, explains the correct interpretation of P < 0.05 and other values of
P, examines arguments for and against the concept of statistical significance, and suggests other and better ways for
analyzing data and for presenting, interpreting, and discussing the results.
Key words: Compatibility interval, confidence interval, P value, statistical significance
In empirical research, statistical procedures are applied to the data to identify a signal through the noise and to draw inferences from the data collected. Statistical procedures, therefore, steer us toward a better understanding of the data and toward drawing conclusions from the data. It is therefore important to fully understand what statistical procedures and their results mean when these procedures are applied in research.

All inferential statistical tests end with a test statistic and the associated P value. This P value has been accorded such an elevated status that, now, everybody who performs or reads research is familiar with the expression “P < 0.05” as a cut-off that indicates “statistical significance.” In this context, most persons interpret P < 0.05 to mean that “the probability that chance is responsible for the finding is less than 5%” and that “the probability that the finding is a true finding is more than 95%.” Both these interpretations are incorrect; unfortunately, they are widely prevalent because they are an easy way to explain and understand a slightly tricky concept.
Department of Psychopharmacology, National Institute of Mental Health and Neurosciences, Bengaluru, Karnataka, India

Address for correspondence: Dr. Chittaranjan Andrade, Department of Psychopharmacology, National Institute of Mental Health and Neurosciences, Bengaluru - 560 029, Karnataka, India. E-mail: [email protected]

Received: 19th April, 2019; Accepted: 19th April, 2019
This article considers why 5% could be a reasonable cut-off for statistical significance, explains what P < 0.05 really means, discusses the concept of statistical significance and why it has been roundly criticized, and suggests other and perhaps better ways of interpreting the results of statistical testing.

WHY 5%?

Imagine that you toss a coin and it falls tails. Then you toss it again, and it falls tails again. Well, that can certainly happen. You toss it a third time, and it falls tails again. This, too, can sometimes happen; the same face shows thrice in a row. When you toss it a fourth time, and it falls tails, you sit up and take notice. And when you toss it a fifth time, and it falls tails yet again, you develop a strong suspicion that there is something wrong with the coin.[1] Why? Theoretically, if you toss an unbiased coin in runs of five for several dozen trials, a run of five identical faces can certainly happen by chance. However, you did not toss the coin in dozens of trials. You tossed it in just one trial. You found that the coin showed the same face on all five occasions in that one trial. In other words, something that should have been a rather rare occurrence happened the very first time. This suggests that, at least for that coin, it may not have been a rare occurrence, after all. In other words, you consider that your finding is significant. That is, you reject the null hypothesis that the coin is unbiased and accept an alternate hypothesis – that the coin is biased.

Simple mathematics tells us that the probability that a tossed coin will display the same face (heads or tails) five times in a row is 0.5 × 0.5 × 0.5 × 0.5, that is, 0.0625 (the first toss can fall either way; each of the remaining four tosses must then match it). This P value, 0.0625, is rather close to the value 0.05 that is by general convention set as the cut-off for “statistical significance.”
A slightly more scientific explanation for choosing 5% as the cut-off is that approximately 5% (4.5%, to be more precise) of the normal distribution comprises outlying or “significantly different” values, that is, values that are more than two standard deviations distant from the mean. Other explanations have also been offered.[1]
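As a quick numerical check (added here as an illustration; the computation is not part of the original article), both figures quoted above can be reproduced with a few lines of Python using only the standard library:

```python
# Check the two figures quoted in the text: the probability of five identical
# faces in a run of five tosses, and the proportion of a normal distribution
# lying more than 2 standard deviations from the mean.
import math

p_run_of_five = 0.5 ** 4                        # first toss can fall either way; the next four must match it
p_beyond_two_sd = math.erfc(2 / math.sqrt(2))   # two-tailed area beyond |z| = 2

print(f"P(same face five times in one run) = {p_run_of_five:.4f}")    # 0.0625
print(f"P(|z| > 2) under the normal curve  = {p_beyond_two_sd:.4f}")  # ~0.0455, i.e., about 4.5%
```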
WHAT DOES P < 0.05 REALLY MEAN?

Imagine that you conduct a randomized controlled trial (RCT) that compares a new antidepressant drug with placebo. At the 8-week study endpoint, you find that 60% of patients have responded to the drug and 40% have responded to placebo. The Chi-square test that you apply yields a P value of 0.04, a value that is less than 0.05. You conclude that significantly more patients responded to the antidepressant than to placebo. Your interpretation is that the new antidepressant drug truly has an antidepressant effect. The conclusion is correct but iffy because the 5% cut-off and even the concept of statistical significance are being challenged. The interpretation is wrong because a P value, even one that is statistically significant, does not determine truth.

So, what are the right conclusion and the right interpretation? This requires an understanding of what statistical testing means.[2] Imagine that the null hypothesis is true; that is, the new antidepressant is no different from placebo. Now, if you conduct a hundred RCTs that compare the drug with placebo, you would certainly not get an identical response rate for drug and placebo in each RCT. Rather, in some RCTs, the drug would outperform placebo, and in other RCTs, placebo would outperform the drug. Furthermore, the magnitude by which the drug and placebo outperformed each other would vary from trial to trial. In this context, what P = 0.04 (i.e., 4%) means is that if the null hypothesis is true and if you perform the study a large number of times and in exactly the same manner, drawing random samples from the population on each occasion, then, on 4% of occasions, you would get the same or greater difference between groups than what you obtained on this one occasion.

However, you did not perform the RCT a large number of times. You performed it just once. You found that on the single occasion that you performed the RCT, the result that you obtained was something that would be considered rare. So, perhaps the finding is not really rare. This is possible only if the null hypothesis is false. Therefore, just as you rejected the null hypothesis that the tossed coin was unbiased (see the previous section), you reject the null hypothesis that the drug is no different from placebo. Because this (correct) reasoning is rather complicated, many prefer to explain and understand the concept in simpler but incorrect ways, as stated in the introductory paragraph to this article. Other incorrect interpretations have also been described.[3]
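The repeated-sampling definition above can be made concrete with a small simulation sketch. This is an added illustration, not a reproduction of the trial described: the per-arm sample size and the common response rate under the null hypothesis are assumed values, since the article does not state them.

```python
# Illustrative simulation of what a P value means. Assumptions (not from the
# article): 50 patients per arm, and a common response rate of 50% under the
# null hypothesis that drug and placebo are identical.
import random

random.seed(1)
n_per_arm = 50
null_response_rate = 0.5
observed_difference = 0.20          # the 60% vs. 40% difference in the example
n_replications = 20_000

as_or_more_extreme = 0
for _ in range(n_replications):
    drug = sum(random.random() < null_response_rate for _ in range(n_per_arm))
    placebo = sum(random.random() < null_response_rate for _ in range(n_per_arm))
    if abs(drug - placebo) / n_per_arm >= observed_difference:
        as_or_more_extreme += 1

# The fraction printed below approximates the two-sided P value: the proportion
# of hypothetical identical studies, conducted with the null hypothesis true,
# that show a between-group difference at least as large as the one observed.
print(as_or_more_extreme / n_replications)
```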
INTERPRETATIONS FOR P < 0.05 AND P > 0.05

If the null hypothesis is rejected (P < 0.05), why cannot we conclude that just as the drug outperformed placebo in our study, the drug is truly superior to placebo in the population from which the sample was drawn? The answer is that the P value describes a probability, not a certainty. So, we can never be certain that the drug is truly superior to placebo in the population; we can merely be rather confident about it.

Next, imagine that instead of obtaining P = 0.04, you obtained P = 0.14 in the imaginary RCT described earlier. In this situation, we do not reject the null hypothesis, based on the 5% threshold. So, can we conclude that the drug is no different from placebo? Certainly not, and we definitely cannot conclude that the drug is similar to placebo, either. After all, we did find that there was a definite difference in the response rate between drug and placebo; it is just that this difference did not meet our arbitrary cut-off for statistical significance. So “not significantly different” does not mean “not different from” or “similar.”

WHY IT COULD BE NECESSARY TO STOP USING A THRESHOLD FOR STATISTICAL SIGNIFICANCE

From the previous section, it is quite clear that just as the P value lies along a continuum of 0 to 1, our interpretations should also lie along a continuum of differing levels of confidence (or diffidence) in the null hypothesis; we can never be certain, either way. This means that the P value should be reported as an exact value and should be regarded as a continuous variable. Consequently, it should be considered fallacious to insert an arbitrary threshold to define results as significant or nonsignificant, as though significant versus nonsignificant results are in some way categorically different, the way people who are dead versus alive are categorically different. Expressed otherwise, declaring statistical significance does not improve our understanding of the data over and above what is already explained by the value of P.[4] In fact, declaring significance may give us a false sense of confidence that a finding exists in the population, while rejecting significance may give us a false sense of confidence that the finding does not exist.

It follows, therefore, that it is fallacious to privilege significant results for journal publication or for media dissemination. Finally, the probability continuum is also the reason why a study which obtains a nonsignificant result does not contradict a study which obtains a significant result; both obtained findings that lie along a continuum, and the contradiction exists only because the findings lie on opposite sides of an arbitrary and imaginary fence, P < 0.05, that we insert into this continuum. Bayesian methods are no exception to these assertions.[5]

THE 95% CONFIDENCE INTERVAL

Imagine an RCT in which 10 of 20 patients responded to a new antidepressant drug and 11 of 22 patients responded to placebo. The response rate is exactly 50% in each group. The difference in response rates is 0%. Whatever statistical test is applied, the P value will be 1.00. Does this mean that we are 100% certain that there is no difference between drug and placebo? No! What P = 1.00 means is that if the null hypothesis is true and if we perform the study in an identical manner a large number of times, then on 100% of occasions we will obtain a difference between groups of 0% or greater! This is actually common sense. If the drug truly has no antidepressant effect, then on some occasions the drug will outperform placebo by some margin, on other occasions placebo will outperform the drug by some margin, and perhaps on some occasions the results will be identical in the two groups; that is, on all (100%) occasions we obtain a difference between groups of 0% or greater.

This brings us to a question: if everything boils down to repeating the study a large number of times and getting different answers each time, can we reduce the range of uncertainty to something that could actually be helpful? Here is where 95% confidence intervals (CIs) come into the picture. Means, differences between means, proportions, differences between proportions, relative risks (RRs), odds ratios, numbers needed to treat, numbers needed to harm, and other statistics that are obtained from a study are accurate only for that study. However, what we really want to know is what the values of these statistics are in the population, because we wish to generalize the results of our study to the population from which our sample was drawn. We cannot know for certain what the population values are because it is (usually) impossible to study the entire population. However, the 95% CI can help give us an idea. The 95% CI, like the P value, is itself frequently misunderstood; here is an explanation. If we repeat a study in an identical fashion a hundred times, then 95 of the 95% CIs that we estimate in these studies would be expected to contain the population mean. So, by inference, if we examine the 95% CI that we have obtained from a single study, the probability that this particular CI contains the population mean is 95%.[6]
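The repeated-sampling interpretation of the 95% CI can likewise be checked with a simulation sketch. This is an added illustration; the population response rate and the sample size below are arbitrary assumptions, not figures from the article.

```python
# Illustrative simulation of the repeated-sampling meaning of a 95% CI.
# Assumptions (not from the article): a true population response rate of 40%
# and a sample of 100 patients per study.
import random

random.seed(1)
true_rate = 0.40
n = 100
n_studies = 10_000

covered = 0
for _ in range(n_studies):
    responders = sum(random.random() < true_rate for _ in range(n))
    p_hat = responders / n
    se = (p_hat * (1 - p_hat) / n) ** 0.5
    lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se   # Wald 95% CI
    if lower <= true_rate <= upper:
        covered += 1

# About 95% of the intervals constructed in this way contain the true value.
print(covered / n_studies)
```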
In the RCT example cited earlier in this section, the response rate was 50% in each group; that is, there was no difference in the response rate between the drug and placebo. A little calculation will tell us that the RR for response is 1.00 and that the 95% CI is 0.55-1.83. That is, we are 95% confident that the population result for the response to drug versus placebo lies within the range of the drug being as much as 45% inferior to placebo to as much as 83% superior to placebo. Notice that there is no need whatsoever to bring statistical significance into the picture here. Also notice that the 95% CI provides a range of values that are possible for the population, which is far more informative than a dichotomous inference of significance versus nonsignificance.
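For readers who wish to reproduce these numbers, a sketch using the standard large-sample log-RR method is given below. The article does not state which method it used, but this approach returns the same RR of 1.00 and 95% CI of 0.55-1.83.

```python
# Relative risk and 95% CI for the 2 x 2 table described above
# (10/20 responders on drug vs. 11/22 responders on placebo).
import math

drug_responders, drug_n = 10, 20
placebo_responders, placebo_n = 11, 22

rr = (drug_responders / drug_n) / (placebo_responders / placebo_n)
se_log_rr = math.sqrt(
    1 / drug_responders - 1 / drug_n + 1 / placebo_responders - 1 / placebo_n
)
lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
upper = math.exp(math.log(rr) + 1.96 * se_log_rr)

print(f"RR = {rr:.2f}, 95% CI {lower:.2f}-{upper:.2f}")  # RR = 1.00, 95% CI 0.55-1.83
```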
UNCERTAINTY AND THE 95% COMPATIBILITY INTERVAL

Basing interpretations on a 0.05 or other threshold tends to provide an element of certainty to the interpretations. As already explained, this certainty is illusory because probability lies along a continuum. Furthermore, just as there are variations within a data set, there will be variations across replicatory studies, even across hypothetical replications. We can never be certain about which data set and which set of conclusions provide the best fit to the population. So, taking the discussion to its logical end, Amrhein et al.[5] and Wasserstein et al.[4] suggested that instead of drawing dichotomous conclusions that imply certainty, scientists should embrace uncertainty.

In this context, as one possible solution, Amrhein et al.[5] offered the suggestion of reconceptualizing 95% CIs as compatibility intervals. That is, all values within the 95% CI are compatible with the data recorded in the study; the point estimate (e.g., a mean or an RR), regardless of “statistical significance,” is the most compatible, and other values in the CI are progressively less compatible (but nevertheless still compatible) the greater their distance from the point estimate. Explained somewhat simplistically, this means that (provided the study was well-designed, well-conducted, and well-analyzed) the point estimate obtained in the study has the best chance of being the population value, and that all the other values in the 95% CI also have a chance of being the population value, with progressively decreasing likelihood the greater the distance from the point estimate.

Explained with the help of an example, consider the RCT in which we found that the RR for a response to the study drug (vs. placebo) was 1.00 (95% CI, 0.55-1.83). We should not interpret this finding as nonsignificant; rather, we should consider that the most likely interpretation is that the drug is no better or worse than placebo, and that lower efficacy (to the most extreme and least likely value of 45% worse) and higher efficacy (to the most extreme and least likely value of 83% better) possibilities are also compatible with the data recorded in the study. The reader is once again reminded that statistical significance does not enter the picture anywhere.
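The idea that values become progressively less compatible with the data as they move away from the point estimate can also be shown numerically. The sketch below is an added illustration, not part of the original article: for the same hypothetical RCT, it computes the two-sided P value for a range of hypothesized population RRs using a large-sample z test on the log-RR scale.

```python
# "Compatibility" of hypothesized population RRs with the observed 2 x 2 table
# (10/20 vs. 11/22). The point estimate (RR = 1.00) is maximally compatible
# (P = 1.00); values near the 95% CI limits have P close to 0.05; values well
# outside the CI are clearly incompatible.
import math

rr_obs = (10 / 20) / (11 / 22)                 # observed RR = 1.00
se = math.sqrt(1/10 - 1/20 + 1/11 - 1/22)      # standard error of log(RR)

for rr_null in (0.40, 0.55, 0.80, 1.00, 1.25, 1.83, 2.50):
    z = (math.log(rr_obs) - math.log(rr_null)) / se
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided P value
    print(f"hypothesized RR = {rr_null:.2f}  ->  P = {p:.2f}")
```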
If the 95% CI for an RR is 0.95–2.20, the traditional interpretation would have been “not significant,” but a better interpretation would be that the results are mostly compatible with an increase in risk. Similarly, if the 95% CI for an RR is 0.65–1.05, the traditional interpretation would again have been “not significant,” but the better interpretation is that the results are mostly compatible with a decrease in risk. In this regard, Amrhein et al.[5] remind readers that even a 95% CI describes probabilities; it does not exclude the possibility that the population value lies outside the compatibility range. It must also be remembered that the 95% CI is an estimate; it is not a definitive statement of where the population parameter probably lies.

NO TO P AND NO TO A THRESHOLD FOR STATISTICAL SIGNIFICANCE

P values and the concept of statistical significance have long been questioned.[7] In 2016, the American Statistical Association (ASA) released a statement on statistical significance and P values.[8] The statement asserted that P values were never intended to substitute for scientific reasoning. The statement highlighted six points: (1) P values can provide an indication of how compatible or incompatible the data are with a specified statistical model. (2) Taken alone, the P value is not a good test of a hypothesis or a good evaluation of a model. (3) P values do not estimate the probability that a hypothesis is true or the probability that chance is responsible for the findings. (4) P values, including those that meet arbitrary criteria for statistical significance, do not indicate an effect size or the importance of a result. (5) Scientific conclusions and decision-making should not be based only on whether or not the P value falls below an arbitrary threshold. (6) Drawing proper inferences requires complete reporting and transparency. The ASA added that other statistical estimates, such as CIs, need to be included; that Bayesian approaches need to be used; and that false discovery rates need to be considered. Some of these points have already been explained; the rest are beyond the scope of this article, and the reader is referred to the original statement.

Doing away with P and a threshold for statistical significance will, however, be hard. This is because estimating P and declaring statistical significance (or its absence) has become the cornerstone of empirical research, and if changes are to be made herein, textbooks, the education system, scientists, funding organizations, and scientific journals will all need to make a sea change. This could take years or decades, if indeed it ever happens. The motivation to effect the change will be small, because P values are easy to calculate and use, alternatives are not easy to either understand or use, and, besides, there is no consensus on what the alternatives must be.[4]

IN FAVOR OF RETAINING DICHOTOMOUS DISTINCTIONS
There is a small but definite role for the retention of the P < 0.05 threshold for statistical significance. Dichotomous interpretations of research findings need to be made when action is called for, such as whether or not to approve a drug for marketing.[9] Preset rules are required in such situations; uncertainty, as recommended by Amrhein et al.,[5] cannot be embraced because, then, no decision would be possible. In such circumstances, study findings will need to meet or exceed expectations, and so a threshold for statistical significance needs to be retained. However, to protect the integrity of science and reduce false-positive findings, there may be a case to set the bar higher, such as at P < 0.005.[10] In fact, in genetics research, reduction in the false-positive risk is achieved by setting the bar very high, such as at P < 0.00000001 or lower. If a threshold for significance were to be completely discarded, as many now demand, then there is a risk that study results will be interpreted in ways that suit the user’s interest; that is, bias will receive a free pass.[11] Setting a threshold for P is also necessary for sample size estimation and power calculations.
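One common way of quantifying this false-positive concern is to compute a false-positive risk: among studies that cross the significance threshold, the expected fraction in which the null hypothesis is actually true. The sketch below is an added illustration; the assumed prior probability of a true effect and the assumed statistical power are illustrative choices, not figures taken from the article.

```python
# Illustrative false-positive risk calculation. Assumptions: 10% of tested
# hypotheses are true, and studies have 80% power to detect a true effect.
def false_positive_risk(alpha: float, power: float, prior: float) -> float:
    true_positives = power * prior          # true effects that reach significance
    false_positives = alpha * (1 - prior)   # null effects that reach significance
    return false_positives / (false_positives + true_positives)

for alpha in (0.05, 0.005):
    print(f"alpha = {alpha}: false-positive risk = "
          f"{false_positive_risk(alpha, power=0.80, prior=0.10):.2f}")

# Under these assumptions, roughly a third of "significant" results at alpha = 0.05
# are false positives; lowering the threshold to 0.005 reduces this to about 5%.
```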
There are other circumstances, too, when a threshold for P may be required. An example is for industry quality control, or for risk tolerance. Consider a man who uses a parachute; he would like to be far more than 95% certain that the parachute will open.[1] Thresholds will also be required as a filter when choosing variables for further investigation, as in brain imaging or genome analyses.[4]

RECOMMENDATIONS

The P value should be interpreted as a continuous variable and not in a dichotomous way. So, we should not conclude that just because the P value is < 0.05 or some other predetermined threshold, the study hypothesis is true. Likewise, we should not say that just because P > 0.05 or some other predetermined threshold, the study hypothesis is false. These are, in any case, wrong interpretations of what the P value means.

Whereas a threshold for statistical significance could be useful to base decisions upon, its limitations should be recognized. It may be wise to set a threshold that is lower than 0.05 and to examine the false-positive rate associated with the study findings. It is also important to examine whether what has been accepted as statistically significant is clinically significant.

Examining a single estimate and the associated P value is insufficient. It is necessary to assess as much as possible about the estimate. Besides absolute values, 95% CIs should be examined as compatibility intervals, and the precision of these intervals should be considered. Measures of effect size, such as the standardized mean difference, RR, and numbers needed to treat, and the confidence (compatibility) intervals associated with these measures of effect size, should also be reported.

All findings should be interpreted in the context of the study design, including the nature of the sample, the sample size, the reliability and validity of the instruments used, and the rigor with which the study was conducted.

FURTHER READING

Readers who are enthusiastic may refer to a special supplement of the American Statistician, published in 2019, titled “Statistical Inference in the 21st Century: A World Beyond P < 0.05.” This issue contains 43 articles on the subject, some of which are technical but many of which are understandable to the average medical scientist. Whereas the concepts of P and statistical significance are not altogether rejected, and whereas there is no consensus on what the best alternative is, many proposals have been made. These include transforming P values into S-values, deriving second-generation P values, using an analysis of credibility, combining P values with a computed false-positive risk, combining sufficiently small P values with sufficiently large effect sizes, the use of a confidence index, the use of statistical decision theory, and, as already discussed, the use of compatibility intervals.
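As a small added illustration of the first of these proposals (not part of the original article), the S-value is simply the negative base-2 logarithm of the P value, s = -log2(P), which re-expresses the evidence against the null hypothesis in bits of information:

```python
# Convert P values to S-values (Shannon information). P = 0.05 corresponds to
# about 4.3 bits, roughly the surprise of four coin tosses all landing heads.
import math

for p in (0.50, 0.05, 0.005):
    s = -math.log2(p)
    print(f"P = {p:<6} ->  S = {s:.1f} bits")
```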
The articles in this special issue are arranged in five sections: Getting to a post “P < 0.05” era; interpreting and using P; supplementing or replacing P; adopting more holistic approaches; and reforming institutions: changing publication policies and statistical education. The editorial in the special issue[4] presents a useful summary of each article, provided by the authors of the articles.
Last but not least, readers are also strongly encouraged to consult the article by Goodman[3] which lists 12 misconceptions about the P value. These are as follows: if the P value is 0.05, the null hypothesis has a 5% chance of being true; a nonsignificant P value means that (for example) there is no difference between groups; a statistically significant finding (P is below a predetermined threshold) is clinically important; studies that yield P values on opposite sides of 0.05 describe conflicting results; analyses that yield the same P value provide identical evidence against the null hypothesis; a P value of 0.05 means that the observed data would be obtained only 5% of the time if the null hypothesis were true; a P value of 0.05 and a P value less than or equal to 0.05 have the same meaning; P values are better written as inequalities, such as P < 0.01 when P = 0.009; a P value of 0.05 means that if the null hypothesis is rejected, then there is only a 5% probability of a Type 1 error; when the threshold for statistical significance is set at 0.05, then the probability of a Type 1 error is 5%; a one-tail P value should be used when the researcher is uninterested in a result in one direction, or when a value in that direction is not possible; and scientific conclusions and treatment policies should be based on statistical significance.

Financial support and sponsorship
Nil.

Conflicts of interest
There are no conflicts of interest.

REFERENCES

1. Gauvreau K, Pagano M. Why 5%? Nutrition 1994;10:93-4.
2. Kyriacou DN. The enduring evolution of the P-value. JAMA 2016;315:1113-5.
3. Goodman S. A dirty dozen: Twelve P-value misconceptions. Semin Hematol 2008;45:135-40.
4. Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “p<0.05.” Am Stat 2019;73(Suppl. 1):1-19.
5. Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature 2019;567:305-7.
6. Andrade C. A primer on confidence intervals in psychopharmacology. J Clin Psychiatry 2015;76:e228-31.
7. Nuzzo R. Scientific method: Statistical errors. Nature 2014;506:150-2.
8. Wasserstein RL, Lazar NA. The ASA’s statement on P values: Context, process, and purpose. Am Stat 2016;70:129-33.
9. Ioannidis JPA. The importance of predefined rules and prespecified statistical analyses: Do not abandon significance. JAMA 2019; Apr 4. doi: 10.1001/jama.2019.4582. [Epub ahead of print].
10. Ioannidis JPA. The proposal to lower P-value thresholds to .005. JAMA 2018;319:1429-30.
11. Ioannidis JPA. Retiring statistical significance would give bias a free pass. Nature 2019;567:461.