The Missing Chapter

The Replication Crisis

When science tried to check its own homework, it didn't like what it found.

An extension of Jordan Ellenberg's "How Not to Be Wrong"

Chapter 98

The Day Science Looked in the Mirror

In 2015, a collective of 270 researchers did something no one had thought to do at scale: they tried to replicate 100 published psychology studies. The results were, to put it charitably, humbling.

The project was called the Open Science Collaboration, and it was led by Brian Nosek, a social psychologist at the University of Virginia who had the kind of dangerous idea that seems obvious only in retrospect: what if we just… checked? What if we took a hundred studies that had been published in top journals, studies that had survived peer review and been cited by other researchers, studies that had become part of the fabric of psychological knowledge — and ran them again?1

The answer was not reassuring. Of the 100 studies, only 36 produced statistically significant results in the replication attempt. The average effect size — the magnitude of the thing being measured — dropped by half. These weren't fringe results from predatory journals. These were studies from Psychological Science, the Journal of Personality and Social Psychology, the Journal of Experimental Psychology: Learning, Memory, and Cognition. The best journals in the field.

And here's the thing that makes this story worth telling in a math book: it wasn't fraud. It wasn't sloppy pipetting or fabricated data or any of the dramatic sins that make for good retractions. It was the perfectly natural, perfectly predictable consequence of doing statistics the way everyone had been taught to do statistics. The replication crisis, at its core, is a math problem. And once you understand the math, the surprise isn't that so many studies failed to replicate — it's that anyone expected them to.

· · ·

The Underpowered Machine

To understand why published research so often fails to replicate, you need to understand statistical power — and you need to understand that most published studies don't have nearly enough of it.

Statistical power is the probability that your study will detect a real effect if one actually exists. If you're testing whether a drug works, and it truly does work, power is the chance your experiment will correctly conclude "yes, it works." A study with 80% power has a 20% chance of missing a real effect. A study with 35% power has a 65% chance of missing it. It's like trying to photograph a rare bird with a camera that only works a third of the time.2

The median statistical power in psychology? About 35%.3 This means that even when there's a real phenomenon to find, most studies in the field are more likely to miss it than to catch it. You might think this is merely wasteful — scientists spending time and grant money on studies that usually come up empty. But it's much worse than that. Low power doesn't just mean you miss things. It means the things you do find are less likely to be real.

With only 35% power, a study misses the true effect nearly two-thirds of the time: only 35% of the true-effect distribution lies beyond the α = 0.05 significance threshold, leaving a 65% Type II error rate.
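To make that power figure concrete, here is a minimal Monte Carlo sketch (my own toy setup, not the chapter's): a two-group comparison with known σ = 1, a true effect of d = 0.3, and n = 25 per group, tested with a two-sided z-test at α = 0.05.

```python
import math
import random

def simulate_power(d=0.3, n=25, trials=20_000, seed=1):
    """Monte Carlo power of a two-sample z-test (known sigma = 1,
    two-sided alpha = 0.05, so the critical value is 1.96)."""
    random.seed(seed)
    hits = 0
    for _ in range(trials):
        control = [random.gauss(0.0, 1.0) for _ in range(n)]
        treated = [random.gauss(d, 1.0) for _ in range(n)]
        z = (sum(treated) / n - sum(control) / n) / math.sqrt(2.0 / n)
        if abs(z) > 1.96:
            hits += 1
    return hits / trials

print(simulate_power())  # roughly 0.18: the real effect is missed 4 times in 5
```

Under these illustrative numbers the study catches the real effect less than one time in five, worse even than the field's 35% median; to reach the conventional 80% power at d = 0.3 you would need roughly 175 participants per group.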

This is where the math gets truly unsettling. Let's think about it with Bayes' theorem — because, as we keep discovering in this book, Bayes' theorem is lurking behind every statistical surprise.

Positive Predictive Value: the probability that a "significant" finding reflects a real effect,

PPV = (Power × Prior) / (Power × Prior + α × (1 − Prior))

where Power is the probability of detecting a real effect (e.g., 0.35), Prior is the pre-study probability that the effect is real (e.g., 0.10), and α is the significance threshold (typically 0.05).

Suppose you're testing a hypothesis that has, let's say, a 10% prior probability of being true — which is probably generous for the average exploratory study. Your study has the typical 35% power. You set α at the traditional 0.05. Now plug in the numbers:

PPV = (0.35 × 0.10) / (0.35 × 0.10 + 0.05 × 0.90) = 0.035 / (0.035 + 0.045) = 0.035 / 0.08 = 43.75%

Let that sink in. You ran your study, you got p < 0.05, you celebrate, you publish. And there is a better than even chance that your result is wrong. Not because you cheated. Not because you made an error. Because the math of low power and low priors makes it so.
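The arithmetic is worth wrapping in a few lines so you can vary the inputs yourself. A direct transcription of the PPV formula (Python; the helper name is mine):

```python
def ppv(power, prior, alpha):
    """Positive Predictive Value: P(effect is real | p < alpha)."""
    true_pos = power * prior           # real effects that get detected
    false_pos = alpha * (1 - prior)    # null effects that cross the threshold
    return true_pos / (true_pos + false_pos)

print(round(ppv(power=0.35, prior=0.10, alpha=0.05), 4))  # 0.4375, as above
```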

This is the core insight of John Ioannidis's legendary 2005 paper, "Why Most Published Research Findings Are False" — a title so provocative that you'd expect the paper itself to be sensationalist polemic. It isn't. It's essentially a Bayesian accounting exercise. Ioannidis just ran the numbers honestly, and the numbers said: under typical research conditions, most "discoveries" aren't.4

· · ·

The Garden of Forking Paths

Low power alone would be bad enough. But it's not alone. It arrives at the party with a whole entourage of problems that make things worse.

There's p-hacking — the practice of running multiple analyses and reporting only the ones that produce significant results. We explored this in Chapter 92, where we saw how easy it is to find significance in pure noise if you're willing to try enough different ways of slicing the data. Remove an outlier here, add a control variable there, split the sample by gender or age or zodiac sign, and eventually something will cross the magical p < 0.05 threshold.

There's publication bias — the systematic preference of journals for positive results. Negative findings (the study found nothing) are boring. They don't make headlines. They don't get citations. So they go unpublished, creating what's called the "file drawer problem": for every published study that found an effect, there may be several unpublished studies that didn't. The published literature isn't a representative sample of all studies — it's a highlight reel.5
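The file drawer does more than hide null results; it inflates the effects that do get published. A toy simulation (the same illustrative two-sample z-test setup with known σ = 1; nothing here is from the chapter itself):

```python
import math
import random

def published_effect(d=0.3, n=25, trials=50_000, seed=4):
    """Average estimated effect among studies reaching p < .05,
    when the true two-sample effect is d (known sigma = 1)."""
    random.seed(seed)
    se = math.sqrt(2.0 / n)             # standard error of the mean difference
    significant = []
    for _ in range(trials):
        estimate = random.gauss(d, se)  # sampling distribution of the estimate
        if abs(estimate) / se > 1.96:   # crosses p < 0.05, so it gets published
            significant.append(estimate)
    return sum(significant) / len(significant)

print(round(published_effect(), 2))  # far above the true 0.3
```

With a true effect of 0.3, the average published estimate lands above 0.6, more than double the truth; the effect-size shrinkage the Open Science Collaboration saw on replication is this winner's curse unwinding.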

There's HARKing — Hypothesizing After the Results are Known. You run an exploratory study, find an unexpected pattern, and then write your paper as though you predicted it all along. The 20/20 hindsight gets laundered into a "confirmatory" test. What was really a data-dredging expedition gets presented as a hypothesis-driven investigation.

And there's what the statistician Andrew Gelman calls the garden of forking paths — a more subtle version of p-hacking that doesn't require any conscious intent to deceive. At every stage of data analysis, the researcher makes choices: how to code variables, which participants to exclude, which covariates to include, how to handle missing data. Each choice is defensible. But the choices aren't independent of the data. The researcher looks at the data, makes a reasonable choice, and that choice nudges things toward significance. No single step is dishonest. But the cumulative effect is a form of overfitting that inflates false positive rates far beyond the nominal 5%.6
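Gelman's point can be made quantitative with a toy simulation (my own construction, not his): generate pure-noise data, let the analyst try four defensible analyses (full sample, two subgroups, outliers trimmed), and count a false positive whenever any of them clears p < 0.05.

```python
import math
import random

def z_p(sample):
    """Two-sided p-value for 'mean = 0', naively assuming sigma = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def forked_fpr(trials=10_000, n=40, seed=2):
    """False positive rate when any of several reasonable analyses counts."""
    random.seed(seed)
    hits = 0
    for _ in range(trials):
        data = [random.gauss(0.0, 1.0) for _ in range(n)]  # no effect at all
        analyses = [
            data,                   # full sample
            data[: n // 2],         # "subgroup A"
            data[n // 2 :],         # "subgroup B"
            sorted(data)[2:-2],     # outliers trimmed
        ]
        if any(z_p(a) < 0.05 for a in analyses):
            hits += 1
    return hits / trials

print(forked_fpr())  # well above the nominal 0.05
```

Each analysis, taken alone, controls its error rate at 5%; the freedom to pick among them more than doubles it, with no single dishonest step anywhere.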

Ego depletion was one of psychology's most cited findings: the idea that willpower is a finite resource that gets "used up" like fuel in a tank. Hundreds of studies built on it. Textbooks taught it. TED talks popularized it. Then a massive pre-registered replication with 23 labs and over 2,000 participants found… nothing. The effect size was essentially zero. The original finding had been real in the same way that a desert mirage is real — it looked convincing, but it wasn't there when you walked up to it.

· · ·

The Replication Simulator

The best way to understand why underpowered studies fail to replicate is to watch it happen. The simulator below lets you be the scientific community. Set a true effect size, choose how many participants each study gets, and run 100 studies. See how many find significance. Then replicate only the significant ones — and watch the replication rate.

Replication Simulator (interactive). Run 100 studies at a chosen true effect size (default 0.30), sample size (default 25), and α (default 0.050), then replicate the "significant" ones, and watch what underpowered science looks like: each study ends up significant (p < α), replicated, or not significant.

Play with the sliders. Notice how with a small effect (d = 0.2) and a modest sample (n = 25), only a small fraction of your 100 studies come up "significant," and when you rerun just those, the same small fraction survives: a replication's chance of success is simply the power all over again, no matter how impressive the original p-value looked. That's the replication crisis in miniature. Now crank the sample size up to 150 and watch the replication rate soar. The cure for the replication crisis has always been known. It's just expensive: use more data.
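If you would rather script it than drag sliders, the whole simulator fits in a few lines (a sketch under the same toy z-test assumptions as before, not the page's actual code):

```python
import math
import random

def study_p(d, n):
    """p-value of one two-sample study (known sigma = 1, z-test)."""
    control = [random.gauss(0.0, 1.0) for _ in range(n)]
    treated = [random.gauss(d, 1.0) for _ in range(n)]
    z = (sum(treated) / n - sum(control) / n) / math.sqrt(2.0 / n)
    return math.erfc(abs(z) / math.sqrt(2))

def run_and_replicate(d=0.3, n=25, alpha=0.05, studies=100, seed=3):
    """Run studies, then replicate only the 'significant' ones."""
    random.seed(seed)
    significant = sum(1 for _ in range(studies) if study_p(d, n) < alpha)
    replicated = sum(1 for _ in range(significant) if study_p(d, n) < alpha)
    return significant, replicated

print(run_and_replicate())  # few hits, and few of those hits replicate
```

The replication rate among the significant originals is just the study's power again; the original's impressive p-value buys the replication nothing.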

· · ·

The Button-Pressing Problem

Part of what got us into this mess is a fundamental confusion about what p-values actually mean — a confusion so deep that it's baked into the way statistics is taught.

There are two traditions in statistical testing, and they were never supposed to be merged. Ronald Fisher, the great early-twentieth-century statistician, thought of the p-value as a measure of evidence against the null hypothesis. A small p-value meant the data were surprising under the null. It was a continuous measure — more surprising data gave smaller p-values — and it was meant to inform judgment, not replace it. Fisher never intended for "p < 0.05" to be a binary bright line separating truth from falsehood.

Jerzy Neyman and Egon Pearson had a different framework entirely. They thought in terms of decision rules. You set α in advance, you set β (the false negative rate), and you designed your experiment to have enough power to control both error rates. The p-value wasn't a measure of evidence; it was a cog in a decision machine designed to control long-run error rates across many experiments.

What actually happened in practice was a Frankenstein's monster: scientists took Fisher's p-value but wielded "p < 0.05" as a mechanical decision rule in the Neyman-Pearson style, while ignoring the power analysis that the Neyman-Pearson framework demanded. They got the worst of both worlds: a rigid threshold without the careful design that was supposed to make that threshold meaningful.7

Fisher treated the p-value as a continuous measure of the strength of evidence ("How surprising is this data?"), meant to inform judgment; Neyman-Pearson treated α and β as error-rate controls in a binary decision rule ("Which decision minimizes errors?") that requires power analysis. In practice, p < 0.05 became a bright line with no power analysis: two incompatible statistical philosophies merged into a Frankenstein procedure that neither Fisher nor Neyman-Pearson would have endorsed.

The result was what you might call the "button-pressing" model of science: run experiment, compute p-value, check if p < 0.05, publish if yes. No thinking about effect sizes. No thinking about priors. No thinking about power. Just: is the number below 0.05? It's a model that would horrify both Fisher and Neyman-Pearson, and it's the model that produced the replication crisis.

· · ·

The Damage Report

Psychology got most of the attention, but it wasn't alone. The replication crisis touched nearly every empirical field.

Field | Study | Replication rate
Psychology | Open Science Collaboration (2015) | 36%
Cancer biology | Amgen (Begley & Ellis, 2012) | 11% (6 of 53)
Economics | Camerer et al. (2016) | 61%
Social sciences in Nature/Science | Camerer et al. (2018) | 62%
Preclinical medicine | Bayer (Prinz et al., 2011) | ~25%

The cancer biology result is perhaps the most chilling. C. Glenn Begley, then head of global cancer research at Amgen, tried to replicate 53 "landmark" cancer studies — the kind of studies that launch entire research programs and clinical trials. He could replicate six of them. Six. That means the research foundation for a significant portion of experimental cancer treatment was, at best, uncertain.8

The most famous specific casualties, ego depletion among them, were each, at some point, considered well-established science. They appeared in textbooks. They influenced policy. They were the subject of bestselling books and TED talks watched by millions. The replication crisis didn't just embarrass scientists; it raised the question of how much of what we "know" we actually know.

· · ·

The Ioannidis Calculator

Try the math yourself. The calculator below implements the core of Ioannidis's argument. Set the statistical power, the prior probability that the effect is real, and the significance threshold, and see the Positive Predictive Value: the probability that a "significant" finding is actually true.

Power & Positive Predictive Value Calculator (interactive). What fraction of "significant" results reflect real effects? Adjust power, prior, and α to see how they interact. At the defaults (power 0.35, prior 0.10, α = 0.050), the Positive Predictive Value is 43.8%: per 1,000 tests, 35 true positives and 45 false positives add up to 80 "significant" findings, fewer than half of them true.

The defaults — 35% power, 10% prior, α = 0.05 — give a PPV of about 44%. Fewer than half of published significant findings are true. Now try cranking the power to 80% (the long-recommended standard that most studies fail to reach) and the PPV jumps to about 64%. Still not great! The prior probability matters enormously. If you're fishing for effects in a sea of hypotheses where only 1 in 100 might be real, even well-powered studies produce mostly false positives.
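The calculator's per-1,000 accounting is easy to reproduce (the function name is mine):

```python
def accounting(power, prior, alpha, tests=1000):
    """Expected outcome counts per `tests` hypotheses tested at threshold alpha."""
    real = prior * tests                  # hypotheses where the effect exists
    true_pos = power * real               # real effects, detected
    false_pos = alpha * (tests - real)    # null effects, "significant" anyway
    ppv = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, ppv

tp, fp, ppv = accounting(0.35, 0.10, 0.05)
print(round(tp), round(fp))  # 35 45: PPV just under 44%
tp, fp, ppv = accounting(0.80, 0.10, 0.05)
print(round(tp), round(fp))  # 80 45: PPV of 64%
```

Note that the false positive count never moves: raising power adds true positives, but only a stricter α or a better prior removes the 45 false ones.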

The Key Insight

Statistical significance was designed to control the false positive rate among null hypotheses. But what scientists actually want to know is the false positive rate among significant results. These are not the same thing, and the difference is exactly what Bayes' theorem quantifies.

· · ·

How to Fix Science

The good news — and there is good news — is that the replication crisis prompted exactly the kind of self-correction that science is supposed to be about. It just took longer than anyone would have liked.

Preregistration. Before running your study, you publicly declare your hypothesis, your sample size, your analysis plan, and your stopping rules. This makes it vastly harder to HARK or p-hack, because everyone can check your registered plan against your published paper. Platforms like the Open Science Framework (OSF) and AsPredicted make this easy.

Registered reports. An even stronger version: journals agree to publish your study based on the methods — before the results come in. This eliminates publication bias entirely. If the method is sound, the result gets published whether it's positive or negative.

Larger samples and multi-site replications. The ManyLabs projects run the same experiment across dozens of labs simultaneously, yielding enormous sample sizes and built-in replication. When ManyLabs finds an effect, you can trust it. When it doesn't, you can trust that too.

Bayesian methods. Instead of asking "is the p-value below 0.05?", Bayesian analysis asks "given the data, what's the probability of the hypothesis?" This naturally incorporates prior information and doesn't require an arbitrary significance threshold. It also lets you accumulate evidence for the null hypothesis, not just against it, something a standard significance test is simply not built to do.
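As a toy illustration of that framing, here is the Bayesian update for the simplest possible case: two point hypotheses, "no effect" versus "effect of size d," with invented numbers chosen to echo the PPV example earlier (all of this is my sketch, not a method from the chapter):

```python
import math

def posterior_prob(z, d, n, prior):
    """P(effect is real | observed one-sample z), comparing the point
    hypotheses 'effect = 0' and 'effect = d' with prior P(real) = prior."""
    def normpdf(x, mu):
        return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)
    mu_alt = d * math.sqrt(n)            # expected z if the effect is real
    like_alt = normpdf(z, mu_alt)
    like_null = normpdf(z, 0.0)
    return like_alt * prior / (like_alt * prior + like_null * (1 - prior))

# A "just significant" z = 2.0, with d = 0.3, n = 25, prior = 0.10:
print(round(posterior_prob(2.0, 0.3, 25, 0.10), 2))  # 0.42
```

A result that just clears p < 0.05 leaves the hypothesis more likely false than true, the same verdict the PPV arithmetic delivered.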

Better incentives. Slowly, painfully, the academic reward system is beginning to shift. Some journals now offer "badges" for open data and preregistration. Some hiring committees are starting to value methodological rigor over publication count. It's slow. But it's happening.

The replication crisis timeline, from diagnosis to (slow) treatment: Ioannidis's "most findings false" paper (2005), Simmons et al.'s "False-Positive Psychology" (2011), the Open Science Collaboration replications (2015), the ManyLabs 2 mass replications (2017), and Registered Reports becoming mainstream in the 2020s.
· · ·

The Self-Correcting Machine

There's a paradox at the heart of the replication crisis that's worth sitting with. The crisis itself is evidence that science works. The system that produced all those unreplicable findings also produced the tools and the will to discover that they were unreplicable. Scientists found the problem, diagnosed it, and started fixing it. That's the self-correcting machinery doing exactly what it's supposed to do.

But — and this is the part that should make us humble — the correction took decades. Ioannidis published his paper in 2005. The problems he identified had been festering since at least the 1960s, when Jacob Cohen first started warning about low statistical power in psychology. That's forty years of published research built on shaky foundations, forty years of findings entering textbooks and treatment protocols and public policy, forty years during which the self-correcting machine was running, but slowly.

Science is self-correcting, but on a timescale that should give us pause. The question isn't whether truth will out — it's how much damage the errors do while we wait.

The deeper lesson of the replication crisis isn't that science is broken. It's that science is a human institution, run by humans, subject to human incentives. When the incentives point toward flashy results, small samples, and rapid publication, that's what you get. When the incentives shift toward rigor, transparency, and replication, science gets better. The math was never the problem. The math was, in fact, the solution — it just took us a while to actually do it.

Ellenberg's mantra applies here as much as anywhere: don't just ask whether the result is significant. Ask how significant, with how much data, testing a hypothesis with what prior plausibility. The number is never the whole story. And when you forget that — when you reduce the rich, complicated, beautifully uncertain process of scientific inference to a single binary question — you get exactly what we got: a crisis.

But also, eventually: a correction.

Notes & References

  1. Open Science Collaboration, "Estimating the Reproducibility of Psychological Science," Science 349, no. 6251 (2015): aac4716. The project was coordinated through the Center for Open Science.
  2. Cohen, Jacob, "The Statistical Power of Abnormal-Social Psychological Research: A Review," Journal of Abnormal and Social Psychology 65, no. 3 (1962): 145–153. Cohen was sounding this alarm over sixty years ago.
  3. Bakker, Marjan, Annette van Dijk, and Jelte M. Wicherts, "The Rules of the Game Called Psychological Science," Perspectives on Psychological Science 7, no. 6 (2012): 543–554.
  4. Ioannidis, John P.A., "Why Most Published Research Findings Are False," PLOS Medicine 2, no. 8 (2005): e124. As of 2024, this is the most-accessed paper in the history of PLOS Medicine.
  5. Rosenthal, Robert, "The File Drawer Problem and Tolerance for Null Results," Psychological Bulletin 86, no. 3 (1979): 638–641.
  6. Gelman, Andrew, and Eric Loken, "The Statistical Crisis in Science," American Scientist 102, no. 6 (2014): 460–465.
  7. Gigerenzer, Gerd, "Mindless Statistics," The Journal of Socio-Economics 33, no. 5 (2004): 587–606. Gigerenzer calls the hybrid procedure "the null ritual."
  8. Begley, C. Glenn, and Lee M. Ellis, "Raise Standards for Preclinical Cancer Research," Nature 483 (2012): 531–533.