In 2011, the psychologist Daryl Bem published a paper in one of the most prestigious journals in his field. The paper claimed to prove that humans could see the future. Not metaphorically — literally. In nine experiments involving over a thousand students, Bem reported statistically significant evidence for precognition: people performing better than chance at predicting random events that hadn't happened yet. The p-values were small. The sample sizes were large. The methods looked, on paper, perfectly standard. The only problem was that the conclusion was almost certainly wrong.1
What Your Statistics Textbook Didn't Tell You
Here's what you learn in a statistics class: you collect your data, you compute a test statistic, you check whether your p-value falls below 0.05, and if it does, you declare the result "statistically significant." It's a clean procedure — almost mechanical. You'd think it would be hard to get wrong.
But the procedure has a hole in it the size of a freight train, and the hole is this: the p-value only means what you think it means if you decided exactly what to test before you looked at the data. If you peek at the data and then choose your test — or your subgroup, or your outcome variable, or your exclusion criteria, or when to stop collecting — the p-value is no longer what it claims to be. It's been corrupted. And the corruption is invisible, because the final paper looks exactly the same either way.
This is p-hacking: the practice of massaging data or analyses until a p-value crosses the magic threshold of 0.05. Sometimes it's deliberate fraud. But usually — and this is the insidious part — it's not. It's a researcher making perfectly reasonable-seeming choices at each step, choices that just happen to push the result toward significance. It's not lying. It's something worse: it's a way of being wrong that feels like being rigorous.2
The Green Jelly Bean Problem
The webcomic xkcd made this point unforgettably.3 Imagine a group of scientists investigating whether jelly beans cause acne. They test the hypothesis and find no link — p = 0.6. Fine. But then someone asks: what about green jelly beans specifically? So they test green jelly beans against acne and get p = 0.04. Headline: "GREEN JELLY BEANS LINKED TO ACNE (p < 0.05)." And if you read only the final paper, it looks entirely legitimate.
The catch, of course, is that they tested twenty colors. If each test has a 5% chance of a false positive (that's what α = 0.05 means), then across twenty independent tests, the probability of at least one false positive is:

1 − (1 − 0.05)^20 = 1 − 0.95^20 ≈ 0.64
Sixty-four percent. You're more likely than not to find a "discovery" that doesn't exist. And this is not a pathological case — twenty comparisons is, if anything, modest by the standards of modern research. In genomics, you might test millions of genetic variants. In neuroimaging, you test activity in tens of thousands of brain voxels. In economics, you can slice and dice the data by age, gender, income, region, time period, and a dozen other variables. The space of possible tests is enormous, and the researcher gets to wander through it, stopping wherever the landscape happens to dip below 0.05.
Try it yourself:
🫘 The Jelly Bean Experiment
We test whether each color of jelly bean causes acne. There's no real effect — any "significant" result is a false positive. Click to run the experiment.
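If you'd rather see the arithmetic than click buttons, here is a minimal sketch of the jelly bean experiment in Python. It leans on one standard fact: when the null hypothesis is true, a well-calibrated p-value is uniformly distributed on [0, 1], so each color's test can be modeled as a single uniform draw. (The constants and variable names are illustrative, not from any published study.)

```python
import random

random.seed(42)

ALPHA = 0.05
N_COLORS = 20          # jelly bean colors tested
N_REPLICATIONS = 100_000

# Under the null hypothesis, a calibrated p-value is uniform on [0, 1],
# so each color's test is modeled as one uniform draw.
false_alarms = 0
for _ in range(N_REPLICATIONS):
    p_values = [random.random() for _ in range(N_COLORS)]
    if min(p_values) < ALPHA:       # at least one "significant" color
        false_alarms += 1

empirical_fwer = false_alarms / N_REPLICATIONS
theoretical_fwer = 1 - (1 - ALPHA) ** N_COLORS
print(f"empirical FWER:   {empirical_fwer:.3f}")
print(f"theoretical FWER: {theoretical_fwer:.3f}")
```

Run it and both numbers land near 0.64: most replications of the twenty-color experiment "discover" a color that causes acne, even though no effect exists.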
The Garden of Forking Paths
The jelly bean problem is the simple version: you test many hypotheses and cherry-pick the winner. But the statistician Andrew Gelman has pointed out that p-hacking doesn't require such blunt tactics. It can happen through what he calls the "garden of forking paths" — the branching tree of seemingly innocuous analytical decisions that every researcher faces.4
Should you include that outlier or drop it? Should you control for age, or does that muddy the picture? Should the dependent variable be the raw score or the change from baseline? Should you analyze men and women together or separately? Should you use a parametric test or a nonparametric one? Should you stop collecting data at 50 participants or push for 80?
Each of these is a fork in the road. At each fork, the researcher has two or three or four reasonable options. And at each fork, the researcher can see which option gives the more favorable result. No single choice feels like cheating — each is defensible on its own. But the cumulative effect of many small, data-dependent choices is enormous. If you have ten binary analysis decisions, that's 2^10 = 1,024 different analyses you could run. Even if only a small fraction of those yield p < 0.05 by chance, you'll almost certainly find at least one. And the one you find is the one you'll write up.
Simmons, Nelson, and Simonsohn (2011) demonstrated this spectacularly. They set out to "prove" that listening to the Beatles song "When I'm Sixty-Four" actually made people younger. Using standard psychological methods and entirely real data, they were able to produce a statistically significant result (p = 0.04) by making a series of defensible analytical choices: controlling for the right variables, choosing the right dependent variable, recruiting just enough participants. The result was statistically significant, methodologically defensible, and completely absurd.5
This is why Gelman calls it the garden of forking paths rather than "fishing" or "data dredging." Those terms imply the researcher is consciously trying thousands of tests. The garden metaphor is more accurate: the researcher takes what feels like a single, natural walk through the data — but the walk was guided, at every branching point, by the data itself. The researcher didn't try a thousand analyses. They tried one. But it was the one the data told them to try.
Try It Yourself: The P-Hacking Playground
Here's a dataset with absolutely no real effect. Zero. The "treatment" and "control" groups were generated from the same distribution. But you have a research assistant's toolkit of defensible analytical choices. How many do you need to toggle before you get p < 0.05?
🔬 P-Hacking Playground
There is no real effect in this data. Toggle analytical choices to see how they change the p-value. Can you hack your way to "significance"?
The Scope of the Disaster
You might think p-hacking is an academic curiosity — a worry for journal editors, not for the rest of us. You'd be wrong. P-hacking — and the broader problem of analytical flexibility — has corrupted vast swathes of the published scientific literature, with real consequences for medicine, policy, and human welfare.
In 2015, the Open Science Collaboration attempted to replicate 100 psychology studies published in top journals. The results were devastating: only 36% of the replications found a significant effect, compared to 97% of the originals. Effect sizes were, on average, half as large.6 This wasn't just psychology. In cancer biology, a team at Amgen tried to replicate 53 "landmark" studies and succeeded with only 6 — an 11% replication rate.7 In economics, about 60% of 18 high-profile studies replicated.8
The pattern is consistent: the published literature is far too optimistic. Effects are inflated. Many "discoveries" are likely false positives, statistical phantoms conjured by the garden of forking paths. And p-hacking is a major — perhaps the major — reason why.
Consider what this means in medicine. A pharmaceutical company runs a clinical trial. The drug doesn't work for the overall population, but it does seem to work for patients over 60, or for women, or for patients with a particular genetic marker. This subgroup analysis wasn't planned in advance — but the result is significant, so it gets published, and doctors start prescribing accordingly. Patients take drugs that don't work for them because of a statistical artifact. This is not hypothetical; it has happened many times.9
Or consider nutrition science, which has given us a new dietary recommendation approximately every week for the past three decades. Eat eggs; don't eat eggs. Red wine prevents heart disease; red wine causes cancer. Much of this churn comes from the same dataset — large observational studies that collect information on hundreds of dietary and health variables — being sliced in every conceivable way until something significant emerges. The result is a literature where almost everything you eat has been both positively and negatively associated with some health outcome.10
The Math of Multiple Comparisons
Let's be precise about why this happens. Suppose there are m independent hypotheses being tested, and all of them are false — there really is no effect. At a significance level of α = 0.05, the probability that any single test falsely rejects is 0.05. The probability that all tests correctly fail to reject is (1 − α)^m. So the probability that at least one test falsely rejects — the family-wise error rate — is:

FWER = 1 − (1 − α)^m
This grows fast. At 10 tests, FWER is 40%. At 20 tests, it's 64%. At 50 tests, it's 92%. At 100, it's 99.4%. If you test enough things, finding something "significant" is not the exception — it's the expectation.
📊 Family-Wise Error Rate Calculator
See how the chance of at least one false positive grows with the number of tests.
The standard correction for this is the Bonferroni correction: if you're running m tests, use α/m as your significance threshold instead of α. Testing 20 jelly bean colors? Your threshold per test should be 0.05/20 = 0.0025, not 0.05. This is conservative — it controls the family-wise error rate at 0.05 — but at least it's honest.11
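Both the uncorrected error rate and the Bonferroni fix are a one-line formula, so they're easy to check directly. A minimal sketch:

```python
ALPHA = 0.05

def fwer(m, alpha=ALPHA):
    """Probability of at least one false positive across m independent null tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 10, 20, 50, 100):
    plain = fwer(m)
    corrected = fwer(m, alpha=ALPHA / m)   # Bonferroni: per-test threshold α/m
    print(f"m = {m:3d}: uncorrected FWER = {plain:.3f}, "
          f"Bonferroni FWER = {corrected:.3f}")
```

The uncorrected column reproduces the numbers above (40% at 10 tests, 64% at 20, 99.4% at 100), while the Bonferroni column stays pinned at or just below 0.05 no matter how many tests you run — the price being much less power to detect real effects.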
The deeper issue, though, is that Bonferroni only works when you know how many tests you ran. And in the garden of forking paths, you often don't. Every conditional decision — "I'll control for age because the groups look different in age" — is an implicit test that never gets counted. The number of effective comparisons is invisible, even to the researcher.
How Many Results Are False?
In 2005, the physician-statistician John Ioannidis published what became the most downloaded paper in the history of PLOS Medicine: "Why Most Published Research Findings Are False."12 The title was not clickbait. Ioannidis showed, through a straightforward Bayesian argument, that when the prior probability of a hypothesis being true is low, even a p < 0.05 result is more likely to be a false positive than a true discovery.
Here's the intuition. Suppose you test 1,000 hypotheses, of which 100 are genuinely true and 900 are false. At 80% power and α = 0.05:
True hypotheses (100): Your test correctly detects 80% of them → 80 true positives.
False hypotheses (900): Your test falsely flags 5% of them → 45 false positives.
Total "significant" results: 125. Of these, 45/125 = 36% are false.
And that's with a 10% prior — relatively generous. If only 1% of your hypotheses are true (10 out of 1,000), you get 8 true positives and ~50 false positives. Now 86% of your "discoveries" are wrong.
Add p-hacking to this picture and it gets worse. If researchers can inflate their false positive rate from 5% to, say, 20% through flexible analysis, then with a 10% prior you'd get 80 true positives and 180 false positives — meaning 69% of significant results are false. The scientific literature becomes more fiction than fact.
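The whole argument is bookkeeping on four numbers — how many hypotheses, what fraction are true, the power, and the (possibly inflated) false positive rate — so it fits in a short function. A sketch reproducing the three scenarios above:

```python
def false_discovery_share(n_hypotheses, prior_true, power, alpha):
    """Fraction of 'significant' results that are false positives."""
    n_true = n_hypotheses * prior_true
    n_false = n_hypotheses - n_true
    true_positives = n_true * power          # real effects detected
    false_positives = n_false * alpha        # null effects falsely flagged
    return false_positives / (true_positives + false_positives)

# 10% of hypotheses true, 80% power, honest α = 0.05 → 36% false
print(false_discovery_share(1000, 0.10, 0.80, 0.05))
# Only 1% of hypotheses true → ~86% false
print(false_discovery_share(1000, 0.01, 0.80, 0.05))
# 10% prior, but p-hacking inflates α to 0.20 → ~69% false
print(false_discovery_share(1000, 0.10, 0.80, 0.20))
```

Note what the function makes explicit: nothing here depends on any individual study being badly run. The false discovery rate is driven by the prior and by the effective α, and p-hacking silently inflates the latter.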
What Do We Do About It?
The good news is that the scientific community, having diagnosed the disease, has begun developing treatments. None of them is a cure, but together they make p-hacking harder and honesty easier.
Pre-registration
The single most powerful reform is pre-registration: before collecting data, the researcher writes down exactly what hypothesis they plan to test, how they'll analyze it, and what would count as a positive or negative result. This plan is filed with a public registry (like the Open Science Framework or ClinicalTrials.gov) and timestamped. Now the garden has only one path — the one the researcher committed to before seeing the data. Exploratory analyses are still allowed, but they must be labeled as such.13
Registered Reports
Even better than pre-registration is the registered report: a journal agrees (or declines) to publish a study before the data are collected, based on the research question and methodology alone. This eliminates the incentive to p-hack entirely, because publication no longer depends on the result. Studies with registered reports have significantly lower rates of positive findings — about 44%, compared to over 90% for standard papers — which is exactly what we'd expect if the inflated rate was largely an artifact of p-hacking.14
Better Statistics
Some researchers advocate replacing the p < 0.05 threshold entirely. One proposal: lower it to 0.005, which would reduce false positive rates dramatically.15 Others argue for moving to Bayesian statistics, which quantify the evidence for and against a hypothesis as a continuous ratio (the Bayes factor) rather than a binary significant/not-significant verdict. Still others push for focusing on effect sizes and confidence intervals rather than p-values at all. No consensus has emerged, but the common thread is clear: we need to stop treating the p-value as a bright line between truth and noise.
P-hacking isn't a bug in statistics — it's a feature of human cognition. We are pattern-seeking creatures working with noisy data, and we have strong incentives (career, funding, fame) to find patterns. The solution isn't just better statistical methods; it's building institutions that reward honesty over novelty. Pre-registration, data sharing, replication — these are not just technical fixes. They're commitments to knowing the truth even when the truth is boring.
The Monte Carlo Truth Machine
Don't take my word for how bad this is. Let's simulate it. The tool below runs thousands of fake experiments where there is no true effect. It then applies various levels of p-hacking (modeled as trying multiple analyses and reporting the best) and shows you what the published literature would look like.
🎲 P-Hacking Simulation
Each "study" draws from two identical distributions (no real effect). P-hacking means trying multiple analyses and reporting the smallest p-value.
Play with the sliders. Watch what happens as you increase the number of analyses per study. With just one analysis, about 5% of studies cross the threshold — exactly as expected. With ten analyses, it jumps to around 40%. With fifty, it's over 90%. The p-value, which is supposed to be a guard against false positives, has been rendered completely meaningless.
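The slider numbers are easy to reproduce offline. The sketch below models each study's k candidate analyses as independent draws (under the null, each p-value is uniform on [0, 1]) and keeps the smallest — an idealization, since analyses on the same dataset are correlated, but it matches the widget's behavior:

```python
import random

random.seed(0)

def share_significant(analyses_per_study, n_studies=50_000, alpha=0.05):
    """Fraction of null studies that report p < alpha after trying several
    analyses and publishing the smallest p-value (independent-test model)."""
    hits = 0
    for _ in range(n_studies):
        best_p = min(random.random() for _ in range(analyses_per_study))
        if best_p < alpha:
            hits += 1
    return hits / n_studies

for k in (1, 10, 50):
    print(f"{k:2d} analyses per study → {share_significant(k):.1%} 'significant'")
```

One analysis gives the advertised 5%; ten give roughly 40%; fifty push past 90%. Because real analyses of one dataset are correlated, the true inflation sits somewhere below this independent-draw ceiling — but as the Simmons, Nelson, and Simonsohn demonstration showed, it doesn't need to reach the ceiling to be ruinous.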
This is the central lesson. The p-value is not a fact about the world. It's a fact about a procedure. If you follow the procedure honestly — formulate your hypothesis, collect your data, run your test — then the p-value protects you, at least somewhat, from being fooled by randomness. But if you deviate from the procedure — even in small, well-intentioned ways — the protection evaporates. And you won't feel it disappear. That's the most dangerous thing about p-hacking: it feels exactly like doing science.
Daryl Bem's precognition paper, the one we started with, passed peer review. It used standard methods. Its p-values were small. And yet it was wrong — or at the very least, its evidence was far weaker than the p-values suggested, because the analyses involved many decisions made after seeing the data. When other researchers tried to replicate Bem's work using pre-registered protocols, they found nothing.16
Bem's paper wasn't an embarrassment for Bem alone. It was an embarrassment for the entire system of scientific inference that let it through. The methods that "proved" precognition were the same methods being used to evaluate drugs, educational interventions, and economic policies. If those methods could prove that humans see the future, what else might they be getting wrong?
The answer, we now know, is: quite a lot. The replication crisis that followed Bem's paper — and the broader recognition of p-hacking as a systemic problem — has been one of the most important developments in the history of science. Not because it revealed that scientists are fraudulent (most aren't), but because it revealed something more troubling: that honest scientists, using standard methods, can reliably produce false results. The tools themselves were broken. And the first step toward fixing them was understanding exactly how.