The Replication Crisis: What Actually Happened

By Moonlit Social Labs · April 16, 2026 · 16 min read

Sometime in the early 2010s, psychology — and, soon after, large parts of biomedicine, economics, and neuroscience — began to notice that many of its most cited findings could not be reproduced. The arc of that noticing, and the institutional response that followed, is what we now call the replication crisis. More than a decade on, the contours of the story are clear enough to tell honestly: who did what, what broke, what survived, and what the field looks like today.

This is a long retrospective rather than a polemic. The goal is to lay out the events in roughly the order they happened, identify the structural causes that scholars converged on, and assess where reform efforts stand in 2026. Where the evidence is genuinely contested, we say so.

1. The Inciting Events (2011)

Three events in 2011 are usually credited with crystallizing the crisis, though the underlying problems were older: two journal articles and one fraud investigation.

Bem (2011): "Feeling the Future"

Daryl Bem, a respected social psychologist at Cornell, published nine experiments in the Journal of Personality and Social Psychology claiming evidence for precognition — the ability to be influenced by future events. Across studies on memory, attraction, and reaction time, Bem reported small but significant effects suggesting participants were detecting stimuli before they were presented (Bem, 2011).

The paper passed peer review at psychology's flagship journal using methods that were entirely standard for the field: t-tests, ANOVAs, modest sample sizes, and selective reporting of positive results across many measures. The implication was uncomfortable. Either precognition is real, or the standard methods of social psychology can produce convincing evidence for something that is almost certainly false. Most readers chose the second interpretation. As Eric-Jan Wagenmakers and colleagues argued in a Bayesian re-analysis, the methods themselves were the problem.

Bem's paper became a kind of natural experiment on the field's epistemic standards — and the field, by and large, failed it.

The Stapel Fraud

In late 2011, Tilburg University announced that the Dutch social psychologist Diederik Stapel had fabricated data across dozens of papers, some published in Science and Psychological Science. Stapel had reported eye-catching effects of environmental cues on attitudes and behavior — trash on the floor making people more likely to stereotype, for instance — and many had been celebrated.

What made the Stapel case more than a tabloid scandal was the diagnosis of his enabling conditions. The Levelt, Noort, and Drenth committees concluded that Stapel's results were too clean, too consistent, and too convenient, but that nobody — co-authors, reviewers, editors — had asked to see the data. A culture that treated raw data as the private property of the lead author had no immune system against fabrication.

Simmons, Nelson & Simonsohn (2011): "False-Positive Psychology"

The most consequential 2011 paper was a methods piece. Simmons, Nelson, and Simonsohn showed via simulation that ordinary, undeclared researcher flexibility — collecting more data after peeking at p-values, dropping conditions, choosing among covariates, switching dependent variables — could push the false-positive rate from a nominal 5% to over 60% (Simmons, Nelson, & Simonsohn, 2011). They demonstrated this by "proving" that listening to "When I'm Sixty-Four" made participants younger, an obviously absurd conclusion produced through entirely defensible analytic choices.

The argument was hard to dismiss because it didn't require anyone to be dishonest. It only required the everyday practices of the discipline.
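To make the mechanism concrete, here is a minimal simulation sketch of just one of those flexibilities: peeking at the p-value and adding participants when it is not yet significant. It is not a reproduction of the Simmons et al. simulations. The sample sizes, step size, and the use of a simple z-test with known variance are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random
from statistics import NormalDist


def p_value_two_sided(sample_a, sample_b):
    """Two-sided z-test p-value for a difference in means (SD assumed known and equal to 1)."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    se = (1 / n_a + 1 / n_b) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(diff / se)))


def run_study(rng, peek, start_n=20, step=10, max_n=50, alpha=0.05):
    """Simulate one two-group study with NO true effect; return True if it ends 'significant'."""
    a = [rng.gauss(0, 1) for _ in range(start_n)]
    b = [rng.gauss(0, 1) for _ in range(start_n)]
    while True:
        if p_value_two_sided(a, b) < alpha:
            return True  # stop and "publish" as soon as p < alpha
        if not peek or len(a) >= max_n:
            return False  # fixed-n design, or out of participants
        # Optional stopping: not significant yet, so collect more data and test again.
        a += [rng.gauss(0, 1) for _ in range(step)]
        b += [rng.gauss(0, 1) for _ in range(step)]


def false_positive_rate(peek, n_sims=5000, seed=1):
    """Share of null studies that reach significance under the given stopping rule."""
    rng = random.Random(seed)
    return sum(run_study(rng, peek) for _ in range(n_sims)) / n_sims


fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"fixed n: {fixed_rate:.3f}   optional stopping: {peeking_rate:.3f}")
```

With a fixed sample size the rate should sit near the nominal 5%; with repeated peeking it climbs well above that, even though every individual test is computed correctly. Stacking further flexibilities on top (extra outcome measures, optional covariates, droppable conditions) is what drives the rate toward the figures the paper reports.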

2. The Open Science Collaboration (2015)

In 2011, Brian Nosek and colleagues launched what became the Open Science Collaboration (OSC): a coordinated, large-scale effort to attempt direct replications of 100 studies drawn from three top psychology journals (Psychological Science, JPSP, and JEP: Learning, Memory, and Cognition). Each replication used the original materials where possible, recruited adequate samples, and was vetted by the original authors.

The 2015 results, published in Science, were the public turning point of the crisis (Open Science Collaboration, 2015). Of the 100 attempted replications:

Only 36% produced statistically significant results in the same direction as the original.
The mean effect size in replications was roughly half the size of the originals (r = .197 vs. .403).
Effects from cognitive psychology replicated at higher rates (~50%) than effects from social psychology (~25%).
Studies with p-values just below .05 in the original almost never replicated.

The 39% figure that circulated in news coverage referred to a slightly different criterion (subjective replication judgment); under the strict significance criterion the number was 36%. Either way, it was far below what most researchers expected, and far below the rate implied by the published literature's near-universal positive results.

3. Many Labs and the Coordinated Replication Era

The OSC project showed that replication failure was widespread. The Many Labs projects, led by Richard Klein and collaborators, asked a complementary question: when an effect does replicate, how stable is it across samples, contexts, and labs?

Many Labs 1 (Klein et al., 2014) attempted 13 classic findings across 36 sites. Some classics, like anchoring, replicated robustly almost everywhere. Others, notably the two social-priming effects in the set (flag priming and currency priming), failed almost everywhere.
Many Labs 2 (Klein et al., 2018) attempted 28 findings across 125 samples in 36 countries. About half replicated, somewhat above the OSC's 36% but in the same range, and crucially, effect heterogeneity across samples was small, suggesting that "hidden moderators" were not, in general, the explanation for failed replications.
Many Labs 3 tested whether time-of-semester moderated effects. It mostly didn't.

The Many Labs program was important because it ruled out a comforting story. When primes failed to replicate, it wasn't because Tuesday undergraduates in Dortmund were psychologically different from Friday undergraduates at NYU. The effects, in many cases, simply weren't there.

4. The Crisis Spreads: Cancer Biology and Beyond

Psychology was the visible face of the crisis, but the same problems were already known in biomedicine. Begley and Ellis (2012) had reported that of 53 "landmark" preclinical cancer studies, only 6 (11%) replicated at Amgen. Prinz and colleagues at Bayer (2011) reported similarly low rates: published findings matched in-house results in only about a quarter of 67 projects.

The Reproducibility Project: Cancer Biology, organized by the Center for Open Science and Science Exchange, was the systematic version of those reports. Between 2013 and 2021, the project set out to replicate experiments from 53 high-impact cancer papers. The final reports, summarized in eLife in 2021, were sobering: only about a quarter of effects replicated with the predicted direction and statistical significance, the median effect size was 85% smaller than in the originals, and many experiments could not be attempted at all because the original protocols and reagents were inadequately documented. In the end, replications were completed for 50 experiments from 23 of the papers.

Economics had its own moment with Camerer and colleagues' systematic replications of experimental economics papers in Science (2016) and social-science papers in Nature Human Behaviour (2018), with replication rates of about 60% and 62% respectively — better than psychology, worse than the field had assumed.

5. The Causes: A Convergent Diagnosis

By 2018 a consensus diagnosis had emerged. The crisis was not caused primarily by fraud, nor by stupidity, nor by any single methodological mistake. It was the predictable outcome of a system whose incentives, statistics, and publication norms quietly multiplied false positives.

Underpowered designs

Katherine Button and colleagues (2013) showed in Nature Reviews Neuroscience that the median statistical power in neuroscience was around 21%. That has two consequences: most true effects are missed, and most published significant results overestimate the true effect size (a phenomenon known as the "winner's curse" or Type M error). Low-powered studies don't just fail to detect real effects; when they do detect them, they tend to be wrong about how big they are.
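The exaggeration effect is easy to see in simulation. The sketch below assumes a true standardized effect of d = 0.3 and 30 participants per group, which gives power of roughly 20% under a normal approximation; those numbers are illustrative assumptions, not values taken from Button et al. For brevity it draws each study's effect estimate directly from its sampling distribution rather than simulating raw data.

```python
import random
from statistics import NormalDist, mean


def simulate_estimates(true_d=0.3, n_per_group=30, n_sims=20000, alpha=0.05, seed=7):
    """Return (all estimates, significant estimates) for many identical low-powered studies."""
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5          # standard error of d, unit-variance outcomes
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    estimates = [rng.gauss(true_d, se) for _ in range(n_sims)]
    significant = [d for d in estimates if abs(d) / se > z_crit]
    return estimates, significant


TRUE_D = 0.3
estimates, significant = simulate_estimates(true_d=TRUE_D)

power = len(significant) / len(estimates)
# Type M ("magnitude") error: how much the significant estimates overstate the truth.
exaggeration = mean(abs(d) for d in significant) / TRUE_D

print(f"empirical power:            {power:.2f}")
print(f"mean of all estimates:      {mean(estimates):.2f}")
print(f"mean |significant estimate|: {mean(abs(d) for d in significant):.2f}")
print(f"exaggeration ratio:         {exaggeration:.1f}x")
```

Averaged over every study, the estimate is unbiased. Averaged over only the studies that cleared p < .05, it overstates the true effect severalfold, because at this sample size an estimate has to be far larger than 0.3 to reach significance at all.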

P-hacking

Building on Simmons et al., Simonsohn, Nelson, and Simmons (2014) developed the p-curve as a diagnostic. The shape of the distribution of significant p-values in a literature indicates whether that literature contains evidential value: a true effect produces a right-skewed curve dominated by very small p-values, a null effect produces a flat one, and a pile-up just below .05 signals selective reporting and analytic flexibility. Many literatures showed that telltale pile-up.
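The logic behind the diagnostic can be sketched by simulating the two honest baselines: a literature studying a real effect and a literature studying a null effect, each reporting only significant results. This is a toy illustration of the shape argument, not the p-curve test itself; the effect size, sample size, and bin width are assumptions, and the function names are hypothetical.

```python
import random
from statistics import NormalDist


def significant_p_values(true_d, n_per_group=50, n_studies=20000, seed=3):
    """Simulate many two-group studies and keep only the p-values below .05."""
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5
    norm = NormalDist()
    kept = []
    for _ in range(n_studies):
        z = rng.gauss(true_d, se) / se
        p = 2 * (1 - norm.cdf(abs(z)))
        if p < 0.05:
            kept.append(p)
    return kept


def p_curve(p_values, n_bins=5):
    """Share of significant p-values in each .01-wide bin from 0 to .05."""
    counts = [0] * n_bins
    for p in p_values:
        counts[min(int(p / 0.01), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


real_effect_curve = p_curve(significant_p_values(true_d=0.5))
null_effect_curve = p_curve(significant_p_values(true_d=0.0))

labels = ["<.01", ".01-.02", ".02-.03", ".03-.04", ".04-.05"]
for label, real, null in zip(labels, real_effect_curve, null_effect_curve):
    print(f"{label:>8}  real effect: {real:.2f}   null effect: {null:.2f}")
```

Under a real effect most significant p-values fall in the lowest bin; under the null they spread evenly. Neither honest baseline produces an excess in the .04 to .05 bin, which is why a literature that shows one invites suspicion.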

The garden of forking paths

Gelman and Loken (2013) introduced a subtler concept. Even researchers who never explicitly try multiple analyses are still in trouble, because the analysis they would have run if the data had looked different counts as a multiple comparison. The "garden of forking paths" is the implicit multiverse of analyses contingent on the data — a problem that can arise without any conscious p-hacking. (We explore this in depth in our companion post on multiple comparisons.)

HARKing

Norbert Kerr (1998), well before the crisis broke, named one of its central practices: HARKing, or Hypothesizing After Results are Known. When researchers present a post hoc explanation as if it were an a priori prediction, the inferential math no longer works — what looks like a confirmed prediction is actually an unconstrained search across a large space of possible patterns.

Publication bias

Journals overwhelmingly publish positive, novel, statistically significant results. Null findings sit in file drawers. The published literature is therefore a non-random sample of the evidence, and meta-analyses built on it inherit the bias. Trim-and-fill, PET-PEESE, and selection-model corrections have become standard partly because the raw published record is known to be misleading.

Questionable Research Practices (QRPs)

John, Loewenstein, and Prelec (2012) anonymously surveyed psychologists on a menu of practices including selective dropping of conditions, optional stopping, and reporting only "successful" measures. Self-admission rates ranged from roughly 30% to over 60% depending on the practice. These were not rare lapses by bad actors; they were the working methods of a substantial fraction of researchers in good standing.

6. Field-Specific Casualties

Social priming

The most public casualty was social priming, especially the work emerging from John Bargh's lab and similar programs. The classic claim — that subtle environmental cues like words associated with elderly stereotypes could measurably slow people's walking speed — failed to replicate in well-powered direct attempts (Doyen et al., 2012). Other priming effects, including macho-related performance shifts and money primes, fared similarly. By the late 2010s most working social psychologists treated the larger social-priming literature as substantially overstated.

Ego depletion

The idea that self-control is a depletable resource, supported by hundreds of studies and a Roy Baumeister bestseller, was probed by a pre-registered, multi-lab Registered Replication Report (Hagger et al., 2016) involving over 2,000 participants across 23 labs. The pooled effect was indistinguishable from zero. Subsequent registered replications produced similar results. Ego depletion is now widely regarded as either much smaller than reported or non-existent under standard conditions.

Power posing

Carney, Cuddy, and Yap's 2010 finding that two minutes of expansive postures raised testosterone, lowered cortisol, and increased risk-taking became a TED-talk phenomenon. By 2016, after multiple failed replications, Dana Carney publicly disavowed the original effect. A subsequent p-curve analysis by Simmons and Simonsohn found the literature inconsistent with a real effect on hormones or behavior, though some downstream feelings-of-power effects survive in attenuated form.

What survived

Crucially, not everything fell. Direct replications of the Stroop effect, the Müller-Lyer illusion, anchoring, loss aversion (in its core form), basic conditioning paradigms, and many cognitive-psychological phenomena reproduced cleanly. Daniel Lakens has called this the "shaken not stirred" pattern: the foundational findings — the ones that survived because they were robust to begin with — remain. What broke were the small, narrative-friendly effects that depended on optimal stimuli, optimal samples, and optimal analyses.

7. The Reform Era

The response to the crisis has been, by the standards of academic reform, fast and serious.

Pre-registration and registered reports

Pre-registration (depositing one's hypothesis, design, and analysis plan before seeing the data) does not forbid exploratory analysis; it keeps the reporting honest by clearly separating confirmatory from exploratory work. Nosek and colleagues (2018) summarized the rationale and practical machinery in "The preregistration revolution," documenting rapid uptake across fields. Registered Reports go further: peer review of the design happens before data collection, and acceptance is independent of the results. Chris Chambers and colleagues showed that Registered Reports yield null results far more often than conventional papers (about 60% versus under 10%), suggesting they are doing what they were designed to do.

Multiverse and specification-curve analysis

Steegen and colleagues' multiverse analysis (2016) and Simonsohn, Simmons, and Nelson's specification curves (2020) operationalize transparency for analytical decisions. Rather than picking one defensible specification, you run all of them and show the distribution. Convergent Core Analysis, which we describe elsewhere on this blog, extends this idea by extracting the claim that survives across the multiverse.
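In its simplest form, a multiverse is a loop over every combination of defensible analytic choices. The sketch below uses a synthetic dataset and three made-up decisions (which outcome to analyse, a reaction-time exclusion cutoff, an age cutoff). The variable names, cutoffs, and the large-sample z-test are illustrative assumptions, not the procedure from Steegen et al. or from any particular tool.

```python
import itertools
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(11)

# Hypothetical dataset: 100 participants per group, two outcome measures,
# plus a reaction time and an age that could each justify an exclusion rule.
participants = [
    {
        "group": group,
        "outcome_a": rng.gauss(0.3 if group == "treatment" else 0.0, 1),
        "outcome_b": rng.gauss(0.3 if group == "treatment" else 0.0, 1),
        "rt_ms": rng.uniform(200, 2000),
        "age": rng.randint(18, 70),
    }
    for group in ("treatment", "control")
    for _ in range(100)
]

# Each analytic decision and its defensible options (None means "no exclusion").
choices = {
    "outcome": ["outcome_a", "outcome_b"],
    "max_rt_ms": [None, 1500, 1000],
    "max_age": [None, 40],
}


def analyse(data, outcome, max_rt_ms, max_age):
    """Group difference and two-sided p-value (normal approximation) for one specification."""
    kept = [
        p for p in data
        if (max_rt_ms is None or p["rt_ms"] <= max_rt_ms)
        and (max_age is None or p["age"] <= max_age)
    ]
    treated = [p[outcome] for p in kept if p["group"] == "treatment"]
    control = [p[outcome] for p in kept if p["group"] == "control"]
    diff = mean(treated) - mean(control)
    se = (stdev(treated) ** 2 / len(treated) + stdev(control) ** 2 / len(control)) ** 0.5
    return diff, 2 * (1 - NormalDist().cdf(abs(diff / se)))


results = []
for combo in itertools.product(*choices.values()):
    spec = dict(zip(choices.keys(), combo))
    diff, p_value = analyse(participants, **spec)
    results.append({**spec, "diff": diff, "p": p_value})

# Report every specification, sorted by effect estimate, not just the best one.
for r in sorted(results, key=lambda r: r["diff"]):
    print(f"{r['outcome']:>9}  rt<={r['max_rt_ms']}  age<={r['max_age']}  "
          f"diff={r['diff']:+.2f}  p={r['p']:.3f}")

share_significant = sum(r["p"] < 0.05 for r in results) / len(results)
print(f"{share_significant:.0%} of {len(results)} specifications reach p < .05")
```

The point of the exercise is the full sorted table: a reader can see at a glance whether the sign and rough size of the effect hold across specifications or whether the headline result depends on one particular path.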

Larger samples and effect-size reporting

The field has shifted toward larger samples driven by formal power analysis, and toward reporting effect sizes with confidence intervals rather than bare p-values. The American Statistical Association's 2016 statement on p-values was an early formal nudge; the 2019 follow-up editorial in The American Statistician, with its call to stop using "statistical significance" as a binary verdict, went further.

Open data, open materials, open code

The Center for Open Science's badge program and the rise of platforms like the Open Science Framework, OSF Preprints, and GitHub-hosted analysis code have made it dramatically easier to share data, materials, and analytic pipelines. The TOP Guidelines (Nosek et al., 2015) gave journals concrete levels of openness to adopt, and many leading journals now require at least Level 1 data-availability statements.

Coordinated infrastructure

Many Labs, the Psychological Science Accelerator, the Reproducibility Project: Cancer Biology, and the Collaborative Replications and Education Project gave the field large-scale, distributed machinery for doing replications well. These networks turned replication from a hobby into infrastructure.

8. Was It a "Crisis" or Healthy Self-Correction?

The honest answer is: both, and the framing matters.

On one hand, "crisis" can be overstated. Daniel Gilbert and colleagues (2016) published a methodological critique of the OSC project arguing that the replication studies often diverged from the originals in ways that could plausibly explain non-replication, and that the headline 36% figure should be interpreted cautiously. Subsequent analysis suggested Gilbert et al. overstated their case — the OSC effect-size attenuation was too systematic to be explained by protocol drift — but the underlying point is fair: judging "successful replication" is harder than the binary statistic suggests, and not every failed replication is a refutation.

It is also true that many disciplines have always known that some fraction of their published results would not hold up. Science is supposed to self-correct, and noticing that it must self-correct is itself a sign of health. As Vazire (2018) and others have argued, the visible turbulence of the 2010s was a sign that psychology had finally become rigorous enough to embarrass itself in public.

On the other hand, the structural problems were severe. A field in which most published p < .05 results from the leading journals fail to replicate, in which median power is around 20%, in which over half of researchers admit to QRPs, and in which celebrated effects later vanish on direct test, is a field with a real epistemic problem. Calling that "self-correction" without acknowledging the damage is a kind of complacency. Many real careers, real public-policy recommendations, and real undergraduate textbooks were built on findings that turned out not to exist.

The right framing is probably: a real crisis that has provoked a real, ongoing correction. Whether the correction will be deep enough to change the underlying incentives is the open question of the next decade.

9. The State of Replication in 2026

More than a decade on from the OSC paper, the landscape has changed in measurable ways.

Sample sizes are larger. Median sample sizes in social psychology have roughly doubled since 2011. Power calculations are now expected at submission to most major journals.
Pre-registration is mainstream. The OSF hosts well over 100,000 pre-registrations. Major journals offer or require registered reports for at least some submission tracks.
Replication studies are publishable. Journals like Royal Society Open Science, Collabra: Psychology, and Meta-Psychology have made direct replications a normal output rather than a career risk.
Effect sizes are smaller and uncertainty is larger. Newer literatures look less spectacular — precisely because they are more honestly reported.
Novel pathologies have emerged. "Pre-registration theatre," in which authors file vague pre-registrations that fail to constrain analysis, is now a known concern. Bakker and colleagues have documented that adherence to pre-registered analyses is far from universal.
Generative-AI-era worries — including AI-fabricated data, AI-generated literature reviews that miss replication failures, and automated p-hacking through analysis-search agents — are the next frontier of methodological concern.

Recent large-scale efforts continue to find that significant fractions of even high-profile literatures do not hold up under direct replication. Each new such report is now greeted with less shock and more institutional response — a sign of cultural change. Whether the underlying incentive structure of academic publishing has been reformed deeply enough to prevent the next crisis is less clear.

What working researchers can do

Plan your sample with a power analysis before you run anything. Pre-register your primary hypothesis and analysis. Report effect sizes with confidence intervals. Run a multiverse of defensible specifications and show the full distribution — not just the version that "worked." Share your data, materials, and code unless you have an explicit reason not to. None of these are heroic acts; they are the new baseline.
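As a rough starting point for the first item, the textbook normal-approximation formula for a two-group comparison is n per group = 2 × ((z for alpha + z for power) / d)², where d is the standardized effect size you want to detect. The sketch below implements only that approximation; the function name is hypothetical, and a t-test-based calculation from dedicated power software will give slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, power=0.80, alpha=0.05):
    """Approximate participants needed per group for a two-sided, two-group comparison."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    z_power = norm.inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


# Conventional "large", "medium", and "small" standardized effects.
for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} per group for 80% power")
```

The required sample grows with the inverse square of the effect size, so halving the expected effect roughly quadruples the sample. That is why the small effects typical of social psychology need hundreds of participants per group, not the few dozen that were once routine.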

Try It in PsyStat Nexus

PsyStat Nexus was built around the lessons of the replication crisis. The Pre-Registration module helps you lock in hypotheses, primary outcomes, exclusion rules, and analytic plans before you collect data — producing a timestamped, OSF-compatible document you can deposit immediately.

The Multiverse Analysis tool runs your model across the full grid of defensible specifications and shows you the distribution of results, not just one cherry-picked path. Convergent Core Analysis (CCA) goes one step further by extracting the robust claim that survives across virtually all specifications — so you can report not just "67% of analyses were significant," but the structured finding your data actually supports.

Get started free →

References

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407–425.
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376.
Gelman, A., & Loken, E. (2013). The garden of forking paths. Department of Statistics, Columbia University.
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196–217.
Klein, R. A., et al. (2014). Investigating variation in replicability: A "Many Labs" replication project. Social Psychology, 45(3), 142–152.
Klein, R. A., et al. (2018). Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490.
Nosek, B. A., et al. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115(11), 2600–2606.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712.
