When is a replication attempt successful?

Authors

Mariusz Maziarz

Affiliation: Jagiellonian University

Category: Philosophy

Keywords: replication crisis, philosophy of science, replicability

Schedule & Location

Date: Wednesday 3rd of September

Time: 17:00

Location: GSSR Plenary Hall (268)

Session: Methodology

Abstract

The replicability crisis, which started in psychology, is one of the most significant problems troubling contemporary research. However, even though researchers across the social and biomedical sciences agree that a considerable proportion of empirical findings do not replicate and hence are purportedly false, several conflicting concepts of replication and successful replication are used in the debate. We extend the existing literature on these questions by building on the Resampling Account of replication (Machery 2020) and Fletcher’s (2021; 2022) criticism of the measures of replication success used by the Open Science Collaboration (2015) to estimate the replicability of psychological experiments. We argue that the confidence intervals used to assess replication success should account not only for random error (ε) but also for the differences in experimental components between the original study and replication attempts (τ²). On this ground, we argue that successful replications are actually more common than is currently reported in the metascientific literature.

Fully-fledged abstract

The replication crisis denotes a problem faced by virtually all empirical disciplines, such as psychology, medicine, biology, and economics. Romero (2019) characterized the replication crisis in terms of failed replications: independent repetitions of an original experiment that report results differing from the original. Replication attempts differ in their degree of similarity to the original study. Reproducibility of results means “obtaining the same numerical results when repeating the analysis using the original data and the same computer code”. Reproducibility failures are usually a sign of a calculation or coding error committed by the researchers conducting either the original study or the replication attempt. Direct replications are repetitions of the original experiment that are identical with respect to the causally relevant factors. Their failures indicate that some unknown factors confound the outcomes, that the original study involved errors or questionable research practices, or that random differences between trial arms led to false-positive results. Finally, conceptual replications use modified research methods or outcome measures to test the generalizability of the original results to other contexts (Romero, 2019).

However, Machery (2020) recently argued against the distinction between direct and conceptual replications. According to his Resampling Account, the experimental components of the original study are treated as random factors, a resample of which is included in a replication attempt; this defuses the distinction between direct and conceptual replication. Similarly, Feest (2019) suggested that failed conceptual replications are not necessarily a sign that the original result emerged from mistakes or errors, but may instead result from the replication attempts differing from the original studies. She argued that these differences between studies account for the heterogeneity of reported results. The rate of failed replication attempts varies across disciplines, but even though researchers agree that a considerable proportion of empirical findings do not replicate, conflicting concepts of replication and successful replication continue to populate the debate.
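To make this resampling picture concrete, here is a minimal simulation sketch, assuming an additive model and made-up parameter values (MU, TAU, and SIGMA are hypothetical and do not come from any study discussed here). Each replication redraws the experimental components, so estimates scatter more widely than sampling error alone would predict.

```python
import random

random.seed(1)  # for a reproducible illustration

# Hypothetical parameters (illustrative only, not drawn from any cited study):
MU = 0.5      # common mean treatment effect
SIGMA = 0.10  # sd of within-study sampling error (epsilon)
TAU = 0.15    # sd of between-study deviations from MU (so tau^2 = TAU ** 2)

def run_study():
    """One replication attempt under a resampling-style additive model."""
    # Resampled experimental components (stimuli, measures, populations, ...)
    # shift the study-specific true effect away from the common mean...
    study_true_effect = random.gauss(MU, TAU)
    # ...and within-study sampling error is then added on top.
    return random.gauss(study_true_effect, SIGMA)

estimates = [run_study() for _ in range(5)]
print([round(e, 2) for e in estimates])
# The estimates scatter with total sd sqrt(TAU**2 + SIGMA**2) ~= 0.18,
# noticeably more than the 0.10 that sampling error alone would produce.
```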
Fletcher (2021) criticized the ways of assessing whether a replication attempt reports a result similar to the original study that were used by the Open Science Collaboration: subjective assessment, null hypothesis significance testing, comparing effect sizes, comparing the original effect size with the replication confidence interval, and meta-analysis. He later argued (Fletcher 2022) that replicability should be tackled with meta-analysis, which alone allows for correctly analyzing the reliability and confirmation of results. While we agree that meta-analytic estimates are (at least in principle) the best estimates of the actual treatment effects, Fletcher’s (2022) approach remains uninformative as to whether a replication attempt addressing the same research question reports an average treatment effect that agrees with the original study. To solve this problem and allow for a meaningful discussion of replication success, we develop this concept in a way inspired by the random-effects model of meta-analysis.

We start by developing the concept of successful replication for the ideal case of an exact replication. In such a case, any difference in results (assuming both studies are free from errors, questionable research practices, or outright fraud) emerges only from random error (ε). However, considering that the two studies report only estimates of the true average treatment effect (ATE), while the true ATE may differ from either estimate and falls within the reported confidence interval with a certain probability (e.g., the standard 95%), it suffices for a successful replication to report a confidence interval that overlaps with the confidence interval of the original study. In other words, only disjoint confidence intervals of the original study and the replication exclude (with 95% probability) the possibility that the intervention tested in the two studies has the same effect size.

The Resampling Account of replication (Machery 2020) treats various aspects of experimental design as random. Assuming an additive model of interactions between these components (a conservative assumption given our purpose and the plausibility of multiplicative interactions), we argue that the joint effect of different research designs and statistical analysis pipelines can be modeled as normally distributed with a null average. This assumption resembles the random-effects model of meta-analysis, where the heterogeneity of reported average treatment effects results not only from random error but also from differences among studies. We argue that accounting not only for sampling error (ε) but also for the differences in study design and statistical analysis (τ²) leads to (much) wider 95% confidence intervals, and it is these wider intervals that should be used to assess replication success correctly. On this basis, we conclude that replication success is actually more common than is currently believed. The crisis claims result from underappreciating the heterogeneity of reported average-treatment-effect estimates that arises from differences in design and statistical analysis between original studies and their replications.
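The sketch below illustrates the two criteria just described with hypothetical numbers: the effect estimates, standard errors, and the value of τ² are assumptions for illustration, and adding τ² to each study’s sampling variance is one plausible way to formalize the widened intervals, not the only one.

```python
import math

# Hypothetical summary statistics (illustrative only, not from any real study):
# each study reports an average-treatment-effect estimate and its standard error.
ate_orig, se_orig = 0.60, 0.10  # original study
ate_rep, se_rep = 0.15, 0.12    # replication attempt

Z95 = 1.96  # two-sided 95% critical value under a normal approximation

def ci95(est, se, tau2=0.0):
    """95% confidence interval; tau2 adds between-study heterogeneity variance."""
    half_width = Z95 * math.sqrt(se ** 2 + tau2)
    return est - half_width, est + half_width

def overlap(a, b):
    """True if two intervals (lo, hi) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# 1) Exact-replication idealization: only sampling error (epsilon) matters.
print(overlap(ci95(ate_orig, se_orig), ci95(ate_rep, se_rep)))
# -> False: the intervals are disjoint, so this counts as a failed replication.

# 2) Resampling-Account reading: designs and analysis pipelines differ, so each
#    study's true effect deviates from the shared mean with variance tau^2.
tau2 = 0.02  # assumed heterogeneity variance; in practice estimated from the data
print(overlap(ci95(ate_orig, se_orig, tau2), ci95(ate_rep, se_rep, tau2)))
# -> True: the widened intervals overlap, so the replication counts as successful.
```

On these illustrative numbers, the ε-only intervals are disjoint while the τ²-widened ones overlap, which is exactly the pattern behind the claim that replication success is underreported.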
References

Feest, U. (2019). Why replication is overrated. Philosophy of Science.
Fletcher, S. C. (2021). The role of replication in psychological science.
Fletcher, S. C. (2022). Replication is for meta-analysis.
Fletcher, S. C., Jones, G., & Rothman, A. (2021). Discussion: What is a replication?
Machery, E. (2020). What is a replication?
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science.
Romero, F. (2019). Philosophy of science and the replicability crisis.