Discovery-oriented research
Discovery-oriented research used a random selection of an independent and a dependent variable for each of the 100 studies simulated during the first round of experimentation. Because selection was random with replacement, multiple identical studies could be conducted and the same effect discovered more than once. This mirrors scientific practice, which frequently gives rise to independent discoveries of the same phenomenon. Figure 3b provides an overview of the simulation procedure. Each study during the first round was classified as “significant” based either on its p value (\(p < .05\), two-tailed single-sample t-test) or its Bayes factor (\({{\rm{BF}}}_{10} > 3\), Bayesian single-sample t-test with a Jeffreys-Zellner-Siow prior, i.e., a Cauchy distribution on effect size; see ref. 28), irrespective of whether the null hypothesis was actually false. As is typical for discovery-oriented research, we were not concerned with the detection of null effects. Some or all of the studies thus identified were then selected for replication according to the applicable regime (Fig. 1).
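As a rough illustration of this procedure, the following Python sketch simulates one first round of 100 studies and applies the frequentist significance criterion described above. The sample size, effect size, and prevalence of real effects are illustrative placeholders rather than the values used in our simulations, and the independent-variable/dependent-variable pairing is collapsed into a single effect index for brevity.

```python
# Minimal sketch of one first round of discovery-oriented research,
# using the frequentist criterion (p < .05, two-tailed single-sample t-test).
# All numerical settings here are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def run_study(effect_is_real, n=30, effect_size=0.5):
    """Simulate one study; return True if it reaches p < .05 (two-tailed)."""
    mu = effect_size if effect_is_real else 0.0      # ground truth for this IV/DV pairing
    data = rng.normal(loc=mu, scale=1.0, size=n)     # standardized observations
    p = stats.ttest_1samp(data, popmean=0.0).pvalue  # two-tailed single-sample t-test
    return p < .05

# First round: 100 studies, each pairing an independent and a dependent
# variable drawn at random with replacement, so identical studies can recur.
n_effects = 20                                       # illustrative size of the effect landscape
true_effect = rng.random(n_effects) < 0.5            # illustrative prevalence of real effects
significant_studies = []
for _ in range(100):
    idx = rng.integers(n_effects)                    # random IV/DV pairing (with replacement)
    if run_study(true_effect[idx]):
        significant_studies.append(idx)              # candidate for replication under either regime
```

The list of significant first-round studies is what the two replication regimes then act upon: in the public regime only those deemed interesting are replicated, whereas in the private regime all of them must be replicated before publication.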
For frequentist analysis, we set statistical power either at 0.5 or 0.8. Figure 4 shows the results for the higher (Fig. 4a, b) and lower power (Fig. 4c, d). The figure reveals that, regardless of statistical power, the replication regime did not affect the success of scientific discovery (Fig. 4b, d). Under both regimes, the number of true and interesting discovered effects increased with temperature, reflecting the fact that with a more diffuse threshold of scientific interest, more studies were selected for replication in the public regime, or were deemed interesting after publication in the private regime. When power is low (Fig. 4d), fewer effects are discovered than when power is high (Fig. 4b). Note that nearly all replicated effects are also true: this is because the probability of two successive type I errors is small (\({\alpha }^{2}=.0025\)).
By contrast, the cost of generating knowledge differed strikingly between replication regimes (Fig. 4a, c), again irrespective of statistical power. The private replication regime incurred an additional cost of around ten studies compared to public replications. This difference represents ~10% of the total effort the scientific community expended on data collection. Publication of single studies whose replicability is unknown thus boosts the scientific community’s efficiency, whereas replicating studies before they are published carries a considerable opportunity cost. This cost is nearly unaffected by statistical power. Because variation in power has no impact on our principal conclusions, we keep it constant at 0.8 from here on. Moreover, as shown in Fig. 5, the opportunity cost arising from the private replication regime also persists when Bayesian statistics are used instead of conventional frequentist analyses.
The reasons for this result are not mysterious: Notwithstanding scientists’ best intentions and fervent hopes, much of their work is of limited interest to the community. Any effort to replicate such uninteresting work is thus wasted. To maximize scientific productivity overall, that effort should be spent elsewhere, for example on theory development and testing, or on replicating published results deemed interesting.
Theory-testing research
The basic premise of theory-testing research is that the search for effects is structured and guided by the theory. The quality or plausibility of a theory is reflected in how well the theory targets real effects to be tested. We instantiated those ideas by introducing structure into the landscape of true effects and into the experimental search (Methods section). Figure 6 illustrates the role of theory. Across panels, the correspondence between the location of true effects and the search space guided by the theory (parameter \(\rho\)) increases from 0.1 (poor theory) to 1 (perfect theory). A poor theory targets a part of the landscape that contains no real effects, whereas a highly plausible theory targets a segment that contains many real effects.
Not unexpectedly, the introduction of theory boosts performance considerably. Figure 7 shows results when all statistical tests focus on rejecting the null hypothesis, with power kept constant at 0.8. When experimentation is guided by a perfect theory (\(\rho =1\)), the number of true phenomena discovered under either replication regime with a diffuse decision threshold (high temperature) approaches or exceeds the actual number of existing effects. (Because the same phenomenon can be discovered in multiple experiments, the discovery count can exceed the true number of phenomena.) The cost associated with those discoveries, however, again differs strikingly between replication regimes. In the extreme case, with the most powerful theory, the private replication regime required nearly 40% additional experimental effort compared to the public regime. The cost associated with private replications is thus even greater in theory-testing research than in discovery-oriented research. This greater penalty is an ironic consequence of the greater accuracy of theory-testing research: the larger number of significant effects (many of them true) automatically entails a larger number of private replications and hence many additional experiments. As with discovery-oriented research, the cost advantage of the public regime persists irrespective of whether frequentist or Bayesian techniques are used to analyze the experiments.
There is nonetheless an important difference between the two classes of statistical techniques: unlike frequentist statistics, Bayesian techniques permit rigorous tests of the absence of effects. This raises the issue of whether such statistically well-supported null results are of interest to the community, and if so, whether that interest follows the same distribution as for non-null results. In the context of discovery-oriented research, we assumed that null results are of little or no interest, because a failure to find an effect that is not a necessary prediction of any theory is of no theoretical or practical value (ref. 19). The matter is very different in theory-testing research, where a convincing failure to find an effect counts against the theory that predicted it. We therefore performed a symmetrical Bayesian analysis for theory-testing research and assumed that the same process determined interest in a null result as in a non-null result. That is, whenever a Bayes factor provided evidence for the presence of an effect (\({{\rm{BF}}}_{10} > 3\)) or for its absence (\({{\rm{BF}}}_{10} < 1/3\), equivalently \({{\rm{BF}}}_{01} > 3\)), we considered it a notable candidate for replication. Figure 8 shows that when both the presence and absence of effects are considered, the cost of the private replication regime increases even further, to 50% or more. This is because there are now also well-supported null effects (\({{\rm{BF}}}_{01} > 3\)) that require replication, irrespective of whether they are deemed interesting by the community.
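The sketch below shows one way this symmetric decision rule can be realized, using the standard JZS (Cauchy-prior) Bayes factor for a single-sample t-test computed by numerical integration of Rouder et al.'s (2009) integral form. The Cauchy prior scale of 0.707 is an assumed default, as the scale used in our simulations is not stated in this section.

```python
# Symmetric Bayesian decision rule: a result is a candidate for replication if
# BF10 > 3 (evidence for an effect) or BF01 = 1/BF10 > 3 (evidence for its absence).
# The prior scale r = 0.707 is an assumption for illustration.
import numpy as np
from scipy import integrate, stats

def jzs_bf10(t, n, r=0.707):
    """JZS Bayes factor BF10 for a single-sample t-test with a Cauchy(0, r) prior."""
    nu = n - 1
    m0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)           # marginal likelihood of t under H0
    def integrand(g):                                  # marginal under H1, mixing over g
        a = 1 + n * g * r**2
        return (a**-0.5 * (1 + t**2 / (a * nu)) ** (-(nu + 1) / 2)
                * (2 * np.pi)**-0.5 * g**-1.5 * np.exp(-1 / (2 * g)))
    m1, _ = integrate.quad(integrand, 0, np.inf)
    return m1 / m0

def classify(data):
    """Return 'effect', 'null', or 'ambiguous' for one study's data."""
    t = stats.ttest_1samp(data, popmean=0.0).statistic
    bf10 = jzs_bf10(t, len(data))
    if bf10 > 3:
        return "effect"        # notable candidate for replication
    if bf10 < 1 / 3:
        return "null"          # well-supported null, also a candidate
    return "ambiguous"         # neither threshold reached

rng = np.random.default_rng(4)
print(classify(rng.normal(0.5, 1.0, 40)))    # likely "effect" with a real effect present
print(classify(rng.normal(0.0, 1.0, 200)))   # often "null" with a large sample and no effect
```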
Another aspect of Fig. 8 is that the value of \(\rho\) matters considerably less than when only non-null effects are considered. This is because a poor theory that is being consistently falsified (by the failure to find predicted effects) generates as many interesting (null) results as a perfect theory that is consistently confirmed. Because our focus here is on empirical facts (i.e., effects and null effects) rather than the welfare of particular theories, we are not concerned with the balance between confirmations and falsifications of a theory’s predictions.
Boundary conditions and limitations
We consider several conceptual and methodological boundary conditions of our model. One objection to our analysis might invoke doubts about the validity of citations as an indicator of scientific quality. This objection would be based on a misunderstanding of our reliance on citation rates. The core of our model is the assumption that the scientific community shows an uneven distribution of interest in phenomena. Any differentiation between findings, no matter how small, will render the public replication regime more efficient. It is only when there is complete uniformity and all effects are considered equally interesting that the cost advantage of the public replication regime is eliminated (this result trivially follows from the fact that the public replication regime then no longer differs from the private regime). It follows that our analysis does not hinge on whether or not citation rates are a valid indicator of scientific quality or merit. Even if citations were an error-prone measure of scientific merit (ref. 29), they indubitably are an indicator of attention or interest. An article that has never been cited simply cannot be as interesting to the community as one that has been cited thousands of times, whatever one’s personal judgment of its quality may be.
Another objection to our results might invoke the fact that we simulated an idealized scientific community that eschewed fraud or questionable research practices. We respond to this objection by showing that our model is robust to several perturbations of the idealized community. The first perturbation involves p-hacking. As noted at the outset, p-hacking may variously involve removal of outlying observations, switching of dependent measures, adding ad hoc covariates such as participants’ gender, and so on. A shared consequence of all those questionable research practices is an increased type I error rate: the actual \(\alpha\) can be vastly greater than the value set by the experimenter (e.g., the conventional .05). Figure 9a, b shows the consequences of p-hacking with frequentist analysis, operationalized by setting \(\alpha =0.2\) in a simulation of discovery-oriented research. The most notable effect of p-hacking is that a greater number of interesting replicated effects are not true (difference between dashed and solid lines in Fig. 9b). The opportunity cost associated with private replications, however, is unaffected.
Figure 9c, d explores the consequences of an optional stopping rule, another common variant of p-hacking. This practice involves repeated testing of additional participants if a desired effect has failed to reach significance with the initial sample. If this process is repeated sufficiently often, a significant outcome is guaranteed even if the null hypothesis is true (ref. 10). We instantiated the optional stopping rule by adding \({N}_{ph}\in \{1,5,10\}\) additional participants if an effect had not reached significance with the initial sample. This continued for a maximum of five additional batches or until significance had been reached. Optional stopping had little effect on the basic pattern of results, including the opportunity cost associated with the private replication regime, although persistent testing of additional participants, as expected, again increased the number of replicated results that did not represent true effects. Overall, Fig. 9 confirms that our principal conclusions hold even if the simulated scientific community engages in questionable research practices.
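For concreteness, the sketch below implements this optional-stopping rule: a study is re-tested after each additional batch of \({N}_{ph}\) participants, for at most five batches, and stops as soon as the result is significant. The initial sample size is an illustrative placeholder rather than the value used in our simulations.

```python
# Sketch of the optional-stopping rule: re-test after each batch of n_ph extra
# participants, up to five additional batches, stopping at the first p < alpha.
# The initial sample size of 30 is an illustrative assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def optional_stopping_study(mu=0.0, n_initial=30, n_ph=5, max_batches=5, alpha=.05):
    """Run one study with optional stopping; return (significant, final sample size)."""
    data = rng.normal(loc=mu, scale=1.0, size=n_initial)
    for batch in range(max_batches + 1):               # initial test plus up to five re-tests
        p = stats.ttest_1samp(data, popmean=0.0).pvalue
        if p < alpha:
            return True, len(data)                     # stop as soon as significance is reached
        if batch < max_batches:                        # otherwise add another batch and re-test
            data = np.append(data, rng.normal(loc=mu, scale=1.0, size=n_ph))
    return False, len(data)

# Even under a true null (mu = 0), repeated re-testing inflates the type I error rate above alpha.
hits = sum(optional_stopping_study(mu=0.0, n_ph=10)[0] for _ in range(1000))
print(f"False-positive rate with optional stopping: {hits / 1000:.3f}")
```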
We examined two further and even more extreme cases (both simulations are reported in the online supplement). First, we considered the effects of extreme fraud, where all effects during the first round are arbitrarily declared significant irrespective of the actual outcome (Supplementary Fig. 3), and only subsequent public replications are honest (the private replication regime makes little sense when simulating fraud, as a fraudster would presumably just report a second faked significance level). Fraud was found to have two adverse consequences compared to truthful research: (a) it incurs a greater cost in terms of experiments conducted by other investigators, because if everything is declared significant at the first round, more effects will be of interest and hence require replication; and (b) it engenders a greater number of falsely identified interesting effects, because all type I errors during the honest replications are assumed to represent successfully replicated findings. These results clarify that our public replication regime is not comparable to a scenario in which completely fictitious results are offered to the community for potential replication; such a scenario would merely mislead the community by generating numerous ostensibly replicated results that are actually type I errors.
Second, we considered the consequences of true effects being absent from the landscape of ground truths (\(P({{\rm{H}}}_{1})=0\)). This situation likely confronts research in parapsychology. In these circumstances, significant results from the first round can only reflect type I errors. In consequence, the overall cost of experimentation is lower than when true effects are present, but the cost advantage of the public regime persists (Supplementary Fig. 4).