

An Agenda for Purely Confirmatory Research

Eric-Jan Wagenmakers, Ruud Wetzels, Denny Borsboom, Han L. J. van der Maas, and Rogier A. Kievit
University of Amsterdam, The Netherlands

Perspectives on Psychological Science, 7(6), 632–638
© The Author(s) 2012
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/1745691612463078
http://pps.sagepub.com

Abstract
The veracity of substantive research claims hinges on the way experimental data are collected and analyzed. In this article, we discuss an uncomfortable fact that threatens the core of psychology's academic enterprise: almost without exception, psychologists do not commit themselves to a method of data analysis before they see the actual data. It then becomes tempting to fine tune the analysis to the data in order to obtain a desired result—a procedure that invalidates the interpretation of the common statistical tests. The extent of the fine tuning varies widely across experiments and experimenters but is almost impossible for reviewers and readers to gauge. To remedy the situation, we propose that researchers preregister their studies and indicate in advance the analyses they intend to conduct. Only these analyses deserve the label "confirmatory," and only for these analyses are the common statistical tests valid. Other analyses can be carried out but these should be labeled "exploratory." We illustrate our proposal with a confirmatory replication attempt of a study on extrasensory perception.

Keywords
confirmatory experiments, wonky statistics, ESP, Bayesian hypothesis test

You cannot find your starting hypothesis in your final results. It makes the stats go all wonky.

—Ben Goldacre (2009, p. 221)

Psychology is a challenging discipline. Empirical data are noisy, formal theory is scarce, and the processes of interest (e.g., attention, jealousy, loss aversion) cannot be observed directly. Nevertheless, psychologists have managed to generate many key insights about human cognition and behavior. For instance, research has shown that people tend to seek confirmation rather than disconfirmation of their beliefs—a phenomenon known as confirmation bias (Nickerson, 1998). Confirmation bias operates in at least three ways. First, ambiguous information is readily interpreted to be consistent with one's prior beliefs; second, people tend to search for information that confirms rather than disconfirms their preferred hypothesis; third, people more easily remember information that supports their position. We also know that people fall prey to hindsight bias, the tendency to judge an event as more predictable after it has occurred (Roese & Vohs, 2012).

In light of these and other biases1 it would be naive to believe that, without special protective measures, the scientific research process is somehow exempt from the systematic imperfections of the human mind. For example, one indication that bias influences the research process is that researchers seek to confirm, not falsify, their main hypothesis (Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995). The impact of bias is exacerbated in an environment that puts a premium on output quantity: When academic survival depends on how many papers one publishes, researchers are attracted to methods and procedures that maximize the probability of publication (Bakker, van Dijk, & Wicherts, 2012; John, Loewenstein, & Prelec, 2012; Neuroskeptic, 2012; Nosek, Spies, & Motyl, 2012). It should be noted that such behavior is ecologically rational in the sense that it maximizes the proximal goals of the researcher. However, when each researcher acts this way in an entirely understandable attempt at academic self-preservation, the cumulative effect on the field as a whole can be catastrophic. The primary concern is that many published results may simply be false, as they have been obtained partly by dubious or inappropriate methods of observation, analysis, and reporting (Jasny, Chin, Chong, & Vignieri, 2011; Sarewitz, 2012).

Corresponding Author:
Eric-Jan Wagenmakers, University of Amsterdam, Department of Psychological Methods, Weesperplein 4, 1018 XA Amsterdam, The Netherlands. E-mail: [email protected]




Several years ago, Ioannidis (2005) famously argued that "most published research findings are false." And indeed, recent results from biomedical and cancer research suggest that replication rates are lower than 50%, with some as low as 11% (Begley & Ellis, 2012; Osherovich, 2011; Prinz, Schlange, & Asadullah, 2011). If the above results carry over to psychology, our discipline is in serious trouble (Carpenter, 2012; Roediger, 2012; Yong, 2012). Research findings that do not replicate are worse than fairy tales; with fairy tales the reader is at least aware that the work is fictional.

In this article, we focus on what we believe to be the main "fairy-tale factor" in psychology today (and indeed in all of the empirical sciences): the fact that researchers do not commit themselves to a plan of analysis before they see the data. Consequently, researchers can fine tune their analyses to the data, a procedure that makes the data appear to be more compelling than they really are. This fairy-tale factor increases the probability that a presented finding is fictional and hence non-replicable. We propose a radical remedy—preregistration—to ensure scientific integrity and inoculate the research process against the inalienable biases of human reasoning. We conclude by illustrating the remedy of preregistration using a replication attempt of an extrasensory-perception (ESP) experiment reported by Bem (2011).

Bad Science: Exploratory Findings, Confirmatory Conclusions

Science can be bad in many ways. Flawed design, faulty logic, and limited scholarship engender no confidence or enthusiasm whatsoever.2 In this section, we discuss another important factor that reduces confidence and enthusiasm for a scientific finding: the fact that almost no psychological research is conducted in a purely confirmatory fashion3 (e.g., Kerr, 1998; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011; for a similar discussion in biology, see Anderson, Burnham, Gould, & Cherry, 2001). Only rarely do psychologists indicate, in advance of data collection, the specific analyses they intend to carry out. In the face of human biases and the vested interest of the experimenter, such freedom of analysis provides access to a Pandora's box of tricks that can be used to achieve any desired result (e.g., John et al., 2012; Simmons, Nelson, & Simonsohn, 2011; for what may happen to psychologists in the afterlife, see Neuroskeptic, 2012). For instance, researchers can engage in cherry picking: They can measure many variables (gender, personality characteristics, age, etc.) and only report those that yield the desired result, and they can include in their papers only those experiments that produced the desired outcome, even though these experiments were designed as pilot experiments that could be easily discarded had the results turned out less favorably. Researchers can also explore various transformations of the data, rely on one-sided p values, and construct post-hoc hypotheses that have been tailored to fit the observed data (MacCallum, Roznowski, & Necowitz, 1992). In the past decades, the development of statistical software has resulted in a situation in which the number of opportunities for massaging the data is virtually infinite.

True, researchers may not use these tricks with the explicit purpose to deceive—for instance, hindsight bias often makes exploratory findings appear perfectly sensible. Even researchers who advise their students to "torture the data until they confess"4 are hardly evil geniuses out to deceive the public or their peers. Instead, these researchers may genuinely believe that they are giving valuable advice that leads the student to analyze the data more thoroughly and increases the odds of publication along the way. How could such advice be wrong? In fact, the advice to torture the data until they confess is not wrong—just as long as this torture is clearly acknowledged in the research report. Academic deceit sets in when this does not happen and partly exploratory research is analyzed as if it had been completely confirmatory. At the heart of the problem lies the statistical law that, for the purpose of hypothesis testing, the data may be used only once. So when you turn your data set inside and out, looking for interesting patterns, you have used the data to help you formulate a specific hypothesis. Although the data may still serve many purposes after such fishing expeditions, there is one purpose for which the data are no longer appropriate—namely, for testing the hypothesis that they helped to suggest. Just as conspiracy theories are never falsified by the facts that they were designed to explain, a hypothesis that is developed on the basis of exploration of a data set is unlikely to be refuted by that same data. Thus, one always needs a fresh data set for testing one's hypothesis. This also means that the interpretation of common statistical tests in terms of Type I and Type II error rates is valid only if the data were used only once and if the statistical test was not chosen on the basis of suggestive patterns in the data. If you carry out a hypothesis test on the very data that inspired that test in the first place then the statistics are invalid (or "wonky," as Ben Goldacre put it). In neuroimaging, this has been referred to as "double dipping" (Kriegeskorte, Simmons, Bellgowan, & Baker, 2009; Vul, Harris, Winkielman, & Pashler, 2009). Whenever a researcher uses double-dipping strategies, Type I error rates will be inflated and p values can no longer be trusted.
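
To make the inflation concrete, the following is a minimal simulation sketch (ours, not the article's) of the cherry-picking strategy described above: ten dependent variables are measured, all of them pure noise, and only the most favorable test is reported. The sample size and the number of variables are arbitrary illustrative choices.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_sims, n_subjects, n_dvs = 10_000, 20, 10

    false_positives = 0
    for _ in range(n_sims):
        # Ten dependent variables of pure noise: every null hypothesis is true.
        data = rng.normal(size=(n_subjects, n_dvs))
        # One-sample t-test per variable (against a population mean of zero).
        pvals = stats.ttest_1samp(data, 0.0).pvalue
        # Cherry picking: report only the most favorable variable.
        false_positives += pvals.min() < 0.05

    print(false_positives / n_sims)  # roughly 0.40 rather than the nominal 0.05

With ten independent null tests, the probability that at least one reaches p < .05 is 1 - .95^10, roughly .40, eight times the nominal Type I error rate.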

As illustrated in Figure 1, psychological studies can be placed on a continuum from purely exploratory, where the hypothesis is found in the data, to purely confirmatory, where the entire analysis plan has been explicated before the first participant is tested. Every study in psychology falls somewhere along this continuum; the exact location may differ depending on the initial outcome (i.e., poor initial results may encourage exploration), the clarity of the research question (i.e., vague questions allow more exploration), the amount of data collected (i.e., more dependent variables encourage more exploration), the a priori beliefs of the researcher (i.e., strong belief in the presence of an effect encourages exploration when the initial result is ambiguous), and so on. Hence, the amount of exploration, data dredging, or data torture may differ widely from one study to the next; consequently, so does the reliability of the statistical results. It is important to stress again that we do not disapprove of exploratory research as long as its exploratory character is openly acknowledged. If fishing expeditions are sold as hypothesis tests, however, it becomes impossible to judge the strength of the evidence reported.

[Figure 1: a horizontal continuum running from Exploratory Research ("Wonky Stats") on the left to Confirmatory Research ("Sound Stats") on the right.]

Fig. 1. A continuum of experimental exploration and the corresponding continuum of statistical wonkiness. On the far left of the continuum, researchers find their hypothesis in the data by post-hoc theorizing, and the corresponding statistics are "wonky," dramatically overestimating the evidence for the hypothesis. On the far right of the continuum, researchers preregister their studies such that data collection and data analyses leave no room whatsoever for exploration, and the corresponding statistics are "sound" in the sense that they are used for their intended purpose. Much empirical research operates somewhere in between these two extremes, although for any specific study the exact location may be impossible to determine. In the grey area of exploration, data are tortured to some extent, and the corresponding statistics are somewhat wonky. Figure downloaded from Flickr, courtesy of Dirk-Jan Hoek.

Together with other fairy-tale factors, the pervasive confusion between exploratory and confirmatory research threatens to unravel the very fabric of our field. This special issue features several papers that propose remedies to right what is wrong, such as changes in incentive structures (Nosek et al., 2012) and an increased focus on replicability (Bakker et al., 2012; Frank & Saxe, 2012; Grahe et al., 2012). In the next section, we stress a radical remedy that holds great promise, not just for the state of the entire field but also for researchers individually.

Good Science: Confirmatory Conclusions Require Preregistration

Science can be good in many ways, but a key characteristic is that the researcher is honest. Unfortunately, an abstract call for more honesty is unlikely to change anything. Blinded by confirmation bias and hindsight bias, researchers may be convinced that they are honest even when they are not. We therefore focus on a more concrete objective: separating exploratory experiments from confirmatory experiments.

The articles by Simmons et al. (2011) and John et al. (2012) suggest to us that considerable care needs to be taken before researchers are allowed near their own data: They may well torture them until a confession is obtained, even if the data are perfectly innocent. More important, researchers may then proceed to analyze and report their data as if these had undergone a spa treatment rather than torture. Psychology is not the only discipline in which exploratory methods masquerade as confirmatory, thereby polluting the field and eroding public trust (Sarewitz, 2012). In his fascinating book Bad Science, Ben Goldacre discusses several fairy-tale factors in public health science and medicine, and concludes:

What's truly extraordinary is that almost all of these problems—the suppression of negative results, data dredging, hiding unhelpful data, and more—could largely be solved with one very simple intervention that would cost almost nothing: a clinical trial register, public, open, and properly enforced (…) Before you even start your study, you publish the 'protocol' for it, the methods section of the paper, somewhere public. This means that everyone can see what you're going to do in your trial, what you're going to measure, how, in how many people, and so on, before you start. The problems of publication bias, duplicate publication and hidden data on side-effects—which all cause unnecessary death and suffering—would be eradicated overnight, in one fell swoop. If you registered a trial, and conducted it, but it didn't appear in the literature, it would stick out like a sore thumb. (Goldacre, 2009, pp. 220–221)
We believe this idea has great potential for psychological science as well (see also Bakker et al., 2012; Nosek et al., 2012, and the Neuroskeptic blog).5 By preregistering the study design and the analysis plan, psychology's main fairy-tale factor (i.e., presenting and analyzing exploratory results as if they were confirmatory) is eliminated in its entirety. To some, preregistering an experiment may seem a draconian measure. To us, this response only highlights how exceptional it is for psychologists to commit to a specific method of analysis in advance of data collection. Also, we wish to emphasize that we have nothing against exploratory work per se. Exploration is an essential component of science and is key to new discoveries and scientific progress; without exploratory studies, the scientific landscape is sterile and uninspiring. However, we do believe that it is important to separate exploratory from confirmatory work, and we do not believe that researchers can be trusted to observe this distinction if they are not forced to.6 Hence, in the first stage of a research program, researchers should feel free to conduct exploratory studies and do whatever they please: turn the data inside out, discard participants and trials at will, and enjoy the fishing expedition. However, exploratory studies cannot be presented as strong evidence in favor of a particular claim; instead, the focus of exploratory work should be on describing interesting aspects of the data, on determining which tentative findings are of particular interest, and on proposing efficient ways in which future studies may confirm or disconfirm the initial exploratory results.

In the second stage of a research program, a purely confirmatory approach is desired. This requires the psychological science community to begin using online repositories such as the one that has recently been set up by the Open Science Framework at http://openscienceframework.org/.7 Before a single participant is tested, the researcher submits to the online repository a document that details what dependent variables will be collected and how the data will be analyzed (i.e., which hypotheses are of interest, which statistical tests will be used, and which outlier criteria or data transformations will be applied). When p values are used, the researcher also needs to indicate exactly how many participants will be tested. When researchers wish to claim that their studies are confirmatory, the online document then becomes part of the review process.
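
As a concrete illustration, such a record might look like the minimal sketch below. The field names are our hypothetical invention, not a format prescribed by the Open Science Framework; the entries echo the replication study described later in this article.

    import json

    # A minimal sketch of the kind of analysis-plan record such a repository
    # might store. Every field name here is illustrative, not prescribed.
    analysis_plan = {
        "hypotheses": ["performance for erotic pictures differs from chance (50%)"],
        "dependent_variables": ["proportion of correct curtain choices per session"],
        "statistical_tests": ["one-sample Bayesian t-test; monitor BF01 and BF02"],
        "outlier_criteria": "specified in advance of data collection",
        "data_transformations": "none",
        "planned_sample_size": 100,  # must be fixed in advance when p values are used
    }
    print(json.dumps(analysis_plan, indent=2))
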
An attractive implementation of this two-step procedure is to collect the data all at once and then split the data into an exploratory and a confirmatory subset.8 For example, researchers can decide to freely analyze only the even-numbered participants, exploring the data however they like. In the next stage, however, the favored hypothesis can be tested on the odd-numbered participants in a purely confirmatory fashion. To enforce academic self-discipline, the second stage still requires preregistration. Although it is always possible for researchers to cheat, the main advantage of preregistration is that it removes the effects of confirmation bias and hindsight bias. In addition, researchers who cheat with respect to preregistration of experiments are well aware that they have committed a serious academic offense.
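
A minimal sketch of this split, assuming trial-level data in a table with a numeric participant identifier (the file and column names are hypothetical):

    import pandas as pd

    # Hypothetical trial-level data: one row per trial, with a numeric
    # participant identifier. The file and column names are illustrative.
    trials = pd.read_csv("trials.csv")  # columns: participant, session, correct

    # Stage 1 (exploratory): even-numbered participants, analyzed freely.
    exploratory = trials[trials["participant"] % 2 == 0]

    # Stage 2 (confirmatory): odd-numbered participants, left untouched until
    # the preregistered analysis is run on them, exactly as specified.
    confirmatory = trials[trials["participant"] % 2 == 1]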

What we propose is a method to ensure academic honesty: there is nothing wrong with exploration as long as it is explicitly acknowledged as such. The only way to safeguard academics against fooling themselves, their readers, reviewers, and the general public, is to demand that confirmatory results are clearly separated from work that is exploratory. In a way, our proposal is merely a matter of common sense, and we have not met many colleagues who wish to argue against it; nevertheless, we know of almost no research in experimental psychology that follows this procedure.

Example: Precognitive Detection of Erotic Stimuli?

In 2011, Bem published an article in the Journal of Personality and Social Psychology, the flagship journal of social psychology, in which he claimed that people can look into the future (Bem, 2011; but see Galak, LeBoeuf, Nelson, & Simmons, in press; Ritchie, Wiseman, & French, 2012). In his first experiment, "precognitive detection of erotic stimuli," participants were instructed as follows: "(…) on each trial of the experiment, pictures of two curtains will appear on the screen side by side. One of them has a picture behind it; the other has a blank wall behind it. Your task is to click on the curtain that you feel has the picture behind it. The curtain will then open, permitting you to see if you selected the correct curtain." In the experiment, the location of the pictures was random and chance performance is therefore 50%. Nevertheless, Bem's participants scored 53.1%, significantly higher than chance; however, the effect was present only for erotic pictures, and not for neutral pictures, positive pictures, negative pictures, and romantic-but-not-erotic pictures. Bem also claimed that the psi effects were more pronounced for extraverts and that women showed psi for certain erotic pictures but men did not.

To illustrate our proposal we set out to replicate Bem's experiment in a purely confirmatory fashion. First, we detailed our method, design, and planned analyses in a document that we posted online before a single participant was tested.9 As outlined in the online document, our replication focused on Bem's key findings; therefore, we tested only women, used only neutral and erotic pictures, and included a standard extraversion questionnaire. We also tested each participant in two contiguous sessions. Each session featured the same pictures but presented in a different random order. The idea is that individual differences in psi—if these exist—would lead to a positive correlation between performance in Session 1 and Session 2. Performance is quantified by the proportion of times that the participant chooses the curtain that hides the picture. Each session featured 60 trials, with 45 neutral pictures and 15 erotic pictures.

A vital part of the online document concerns the a priori specification of our statistical analyses. We decided in advance not to compute p values, as their main drawbacks include the inability to quantify evidence in favor of the null hypothesis (e.g., Gallistel, 2009; Rouder, Speckman, Sun, Morey, & Iverson, 2009; Wetzels, Raaijmakers, Jakab, & Wagenmakers, 2009), the sensitivity to optional stopping (e.g., Dienes, 2011; Wagenmakers, 2007), and the tendency to overestimate the support in favor of the alternative hypothesis (e.g., Edwards, Lindman, & Savage, 1963; Sellke, Bayarri, & Berger, 2001; Wetzels et al., 2011). Instead, our main analysis tool is the Bayes factor (e.g., Hoijtink, Klugkist, & Boelen, 2008; Jeffreys, 1961; Kass & Raftery, 1995). The Bayes factor BF01 quantifies the evidence that the data provide for the null hypothesis (H0) vis-à-vis an alternative hypothesis (H1). For instance, when BF01 = 10, the observed data are 10 times more likely to have occurred under H0 than under H1. When BF01 = 1/5 = .20, the observed data are 5 times more likely to have occurred under H1 than under H0. An additional bonus of using the Bayes factor is that it eliminates the problem of optional stopping. As noted in the classic article by Edwards et al. (1963), "the rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience" (p. 193; see also Kerridge, 1963).

Hence, we outlined the details of our Bayes factor calculation in the online document:

"Data analysis proceeds by a series of Bayesian tests. For the Bayesian t-tests, the null hypothesis H0 is always specified as the absence of a difference. Alternative hypothesis 1, H1, assumes that effect size is distributed as Cauchy (0,1); this is the default prior proposed by Rouder et al. (2009). Alternative hypothesis 2, H2, assumes that effect size is distributed as a half-normal distribution with positive mass only and the 90th percentile at an effect size of 0.5; this is the 'knowledge-based prior' proposed by Bem et al. (submitted).10 We will compute the Bayes factor for H0 vs. H1 (BF01) and for H0 vs. H2 (BF02)."

The details of how the two alternative hypotheses were specified are not important here, save for the fact that these hypotheses were constructed a priori, based on general principles (the default prior) or substantive considerations (the knowledge-based prior).
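
To illustrate what such a calculation involves, here is a minimal sketch (ours, not the authors' actual code) of the Bayes factor for a one-sample t-test under the two priors above. It uses the standard result that, given effect size delta, the t statistic follows a noncentral t distribution with noncentrality delta * sqrt(n); the t value and sample size in the example are hypothetical.

    import numpy as np
    from scipy import integrate, stats

    def bf01(t, n, prior_pdf):
        # Likelihood of the observed t statistic: central t under H0, and a
        # noncentral t with noncentrality delta * sqrt(n) given effect size
        # delta; the marginal under the alternative integrates over the prior.
        df = n - 1
        like_h0 = stats.t.pdf(t, df)
        marg_h1, _ = integrate.quad(
            lambda d: stats.nct.pdf(t, df, d * np.sqrt(n)) * prior_pdf(d),
            -np.inf, np.inf)
        return like_h0 / marg_h1

    # H1: the default Cauchy(0, 1) prior on effect size (Rouder et al., 2009).
    def cauchy_prior(d):
        return stats.cauchy.pdf(d, loc=0.0, scale=1.0)

    # H2: half-normal prior with positive mass only and its 90th percentile
    # at an effect size of 0.5, so sigma = 0.5 / Phi^{-1}(.90), about 0.39.
    sigma = 0.5 / stats.norm.ppf(0.90)
    def halfnorm_prior(d):
        return stats.halfnorm.pdf(d, scale=sigma)

    # Hypothetical data: t = 0.8 from n = 100 participants.
    print(bf01(0.8, 100, cauchy_prior))    # BF01, H0 versus H1
    print(bf01(0.8, 100, halfnorm_prior))  # BF02, H0 versus H2
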
Next, we outlined a series of six hypotheses to test. For instance, the second analysis was specified as follows:

"(2) Based on the data of session 1 only: Does performance for erotic pictures differ from chance (in this study 50%)? To address this question we compute a one-sample t-test and monitor BF01 and BF02 as the data come in."

And the sixth analysis was specified as follows:

"(6) Same as (2), but now for the combined data from sessions 1 and 2."

Readers curious to know whether people can look into the future are invited to examine the results for all six hypotheses in the online appendix at http://pps.sagepub.com/supplemental.11 In this article, we only present the results from our sixth hypothesis. Figure 2 shows the development of the Bayes factor as the data accumulate. It is clear that the evidence in favor of H0 increases as more participants are tested and the number of sessions increases. With the default prior, the data are 16.6 times more likely under H0 than under H1; with the "knowledge-based prior" from Bem, Utts, and Johnson (2011), the data are 6.2 times more likely under H0 than under H2. Because our analysis uses the Bayes factor, we did not have to indicate in advance that we were going to test 100 participants. We calculated the Bayes factor two or three times as the experiment was running, and after 100 participants we inspected Figure 2 and decided that the results were sufficiently compelling for the present purposes. Also note how the Bayes factor can be used to quantify evidence in favor of the null hypothesis.

[Figure 2: log(BF01) as a function of the number of sessions (4 to 200), with reference lines at log(3), log(10), and log(30); the Default Prior curve ends at BF01 = 16.6 and the BUJ Prior curve at 6.2. Values above zero indicate evidence in favor of H0; values below, evidence in favor of H1.]

Fig. 2. Results from a purely confirmatory replication test for the presence of precognition. The intended analysis was specified online in advance of data collection. The evidence (i.e., the logarithm of the Bayes factor) supports H0 ("performance for erotic stimuli does not differ from chance"). Note that the evidence may be monitored as the data accumulate.
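
The monitoring itself can then be a simple loop. Below is a sketch that reuses bf01 and cauchy_prior from the previous example and feeds them synthetic session scores generated under H0; in the actual study, the inputs would be the observed per-session proportions correct.

    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.normal(loc=0.50, scale=0.10, size=200)  # synthetic, H0 true

    for n in range(4, len(scores) + 1, 4):
        x = scores[:n]
        t = (x.mean() - 0.50) / (x.std(ddof=1) / np.sqrt(n))
        # log BF01 so far; with the Bayes factor, checking at any point and
        # stopping whenever the evidence is compelling is legitimate.
        print(n, np.log(bf01(t, n, cauchy_prior)))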

The results reported here are purely confirmatory—absolutely everything that we have done here was decided before we saw the data. In this respect, these results are exceptional in experimental psychology, a state of affairs that we hope will change in the future.

Naturally, it is possible that our data might have shown something unexpected and interesting or that we could have forgotten to include an important analysis in our preregistration document. It is also possible that reviewers of this article could have asked for additional information (e.g., a credible interval for effect size). How should we deal with such alterations of the original data-analysis scheme? We suggest that, rather than walking the fine line of trying to decide which alterations are appropriate and which are not, all such findings and analyses should be mentioned in a separate section entitled "exploratory results." When such exploratory results are analyzed, it is important to realize that the data have been used more than once and that the inferential statistics may therefore to some extent be wonky.

Preregistration of our study was suboptimal. The key document was posted on Eric-Jan Wagenmakers's website and a purpose-made blog, and therefore the file would have been easy to alter, remove, or ignore.12 With the online resources of the current day, however, the field should find it easy to construct a professional repository to push academic honesty to greater heights. We believe that researchers who use preregistration will quickly realize how different this procedure is from what is now standard practice. The extra work involved in preregistering an experiment is a small price to pay for a large increase in evidentiary impact. Top journals could facilitate the transition to more confirmatory research by implementing a policy to reward empirical manuscripts that feature at least one confirmatory experiment; for instance, these manuscripts could be published in a separate section explicitly containing confirmatory research. We hope that our proposal will increase the transparency of the scientific process, diminish the proportion of false findings, and improve the status of psychology as a rigorous scientific discipline.

Acknowledgments
We thank Adam Sasiadek, Boris Pinksterboer, Esther Lietaert Peerbolte, and Rebecca Schild for their help with data collection.

Declaration of Conflicting Interests
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.

Funding
This research was supported by Vidi grants from the Dutch Organization for Scientific Research (NWO).

Notes
1. For an overview, see http://en.wikipedia.org/wiki/List_of_cognitive_biases.
2. We are indebted to an anonymous reviewer of a different paper for bringing this sentence to our attention.
3. Note the distinction between confirmation bias, which drives researchers to fine tune their analyses to the data, and confirmatory research, which prevents researchers from such fine tuning because the analysis steps have been specified in advance of data collection.
4. The expression is attributed to Ronald Coase. Earlier, Mackay (1852/1932) made a similar statement, one that is perhaps even more apt: "When men wish to construct or support a theory, how they torture facts into their service!" (p. 552).
5. See in particular http://neuroskeptic.blogspot.co.uk/2008/11/registration-not-just-for-clinical.html, http://neuroskeptic.blogspot.co.uk/2011/05/how-to-fix-science.html, and http://neuroskeptic.blogspot.co.uk/2012/04/fixing-science-systems-and-politics.html.
6. This should not be taken personally: We distrust ourselves as well. In his cargo cult address, Feynman (1974) famously argued that the first principle of scientific integrity is that "(…) you must not fool yourself—and you are the easiest person to fool" (p. 12).
7. The feasibility of this suggestion is evident from the fact that some other fields already use such registers—see, for instance, http://isrctn.org/ or http://clinicaltrials.gov/.
8. This procedure is conceptually similar to cross-validation.
9. See http://confrep.blogspot.nl/ and http://dl.dropbox.com/u/1018886/Advance_Information_on_Experiment_and_Analysis.pdf.
10. This paper has since been published (i.e., Bem, Utts, & Johnson, 2011).
11. Available from the first author's webpage or directly from https://dl.dropbox.com/u/1018886/Appendix_PoPS_WagenmakersEtAl.pdf.
12. Some protection against this is offered by automatic archiving programs such as the Wayback Machine at http://archive.org/web/web.php.

References
Anderson, D. R., Burnham, K. P., Gould, W. R., & Cherry, S. (2001). Concerns about finding effects that are actually spurious. Wildlife Society Bulletin, 29, 311–316.
Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543–554.
Begley, C. G., & Ellis, L. M. (2012). Raise standards for preclinical cancer research. Nature, 483, 531–533.
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425.
Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? Journal of Personality and Social Psychology, 101, 716–719.
Carpenter, S. (2012). Psychology's bold initiative. Science, 335, 1558–1560.
Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6, 274–290.




Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193–242.
Feynman, R. P. (1974). Cargo cult science. Engineering & Science, 37, 10–13.
Frank, M. C., & Saxe, R. (2012). Teaching replication to promote a culture of reliable science. Perspectives on Psychological Science, 7, 600–604.
Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (in press). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology.
Gallistel, C. R. (2009). The importance of proving the null. Psychological Review, 116, 439–453.
Goldacre, B. (2009). Bad science. London, England: Fourth Estate.
Grahe, J., Reifman, A., Herman, A., Walker, M., Oleson, K., Nario-Redmond, M., & Wiebe, R. (2012). Harnessing the undiscovered resource of student research projects. Perspectives on Psychological Science.
Hoijtink, H., Klugkist, I., & Boelen, P. (2008). Bayesian evaluation of informative hypotheses. New York, NY: Springer.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, 696–701.
Jasny, B. R., Chin, G., Chong, L., & Vignieri, S. (2011). Again, and again, and again. Science, 334, 1225.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford, England: Oxford University Press.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23, 524–532.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196–217.
Kerridge, D. (1963). Bounds for the frequency of misleading Bayes inferences. The Annals of Mathematical Statistics, 34, 1109–1110.
Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S. F., & Baker, C. I. (2009). Circular analysis in systems neuroscience: The dangers of double dipping. Nature Neuroscience, 12, 535–540.
MacCallum, R. C., Roznowski, M., & Necowitz, L. B. (1992). Model modifications in covariance structure analysis: The problem of capitalization on chance. Psychological Bulletin, 111, 490–504.
Mackay, C. (1932). Extraordinary popular delusions and the madness of crowds (2nd ed.). Boston, MA: Page. (Original work published 1852)
Neuroskeptic. (2012). The nine circles of scientific hell. Perspectives on Psychological Science, 7, 643–644.
Nickerson, R. S. (1998). Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2, 175–220.
Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615–631.
Osherovich, L. (2011). Hedging against academic risk. Science–Business eXchange, 4. doi:10.1038/scibx.2011.416
Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: How much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10, 712–713.
Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the future: Three unsuccessful attempts to replicate Bem's "retroactive facilitation of recall" effect. PLoS ONE, 7, e33423.
Roediger, H. L. (2012). Psychology's woes and a partial cure: The value of replication. APS Observer, 25(2), 9, 27–29.
Roese, N., & Vohs, K. (2012). Hindsight bias. Perspectives on Psychological Science, 7, 411–426.
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225–237.
Sarewitz, D. (2012). Beware the creeping cracks of bias. Nature, 485, 149.
Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55, 62–71.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—Or vice versa. Journal of the American Statistical Association, 54, 30–34.
Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4, 274–290.
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi. Journal of Personality and Social Psychology, 100, 426–432.
Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6, 291–298.
Wetzels, R., Raaijmakers, J. G. W., Jakab, E., & Wagenmakers, E.-J. (2009). How to quantify support for and against the null hypothesis: A flexible WinBUGS implementation of a default Bayesian t test. Psychonomic Bulletin & Review, 16, 752–760.
Yong, E. (2012). Bad copy. Nature, 485, 298–300.

