
The Missing Chapter

The Art of Torturing Data

If you test enough hypotheses, the noise starts to look like signal.

An extension of Jordan Ellenberg's "How Not to Be Wrong"

In 2011, the psychologist Daryl Bem published a paper in one of the most prestigious journals in his field. The paper claimed to prove that humans could see the future. Not metaphorically — literally. In nine experiments involving over a thousand students, Bem reported statistically significant evidence for precognition: people performing better than chance at predicting random events that hadn't happened yet. The p-values were small. The sample sizes were large. The methods looked, on paper, perfectly standard. The only problem was that the conclusion was almost certainly wrong.1

· · ·
Chapter 1

What Your Statistics Textbook Didn't Tell You

Here's what you learn in a statistics class: you collect your data, you compute a test statistic, you check whether your p-value falls below 0.05, and if it does, you declare the result "statistically significant." It's a clean procedure — almost mechanical. You'd think it would be hard to get wrong.

But the procedure has a hole in it the size of a freight train, and the hole is this: the p-value only means what you think it means if you decided exactly what to test before you looked at the data. If you peek at the data and then choose your test — or your subgroup, or your outcome variable, or your exclusion criteria, or when to stop collecting — the p-value is no longer what it claims to be. It's been corrupted. And the corruption is invisible, because the final paper looks exactly the same either way.

This is p-hacking: the practice of massaging data or analyses until a p-value crosses the magic threshold of 0.05. Sometimes it's deliberate fraud. But usually — and this is the insidious part — it's not. It's a researcher making perfectly reasonable-seeming choices at each step, choices that just happen to push the result toward significance. It's not lying. It's something worse: it's a way of being wrong that feels like being rigorous.2

[Figure: the same dataset run three ways. Exclude outliers: p = 0.032 ✓. Control for age: p = 0.23. Log-transform the DV: p = 0.048 ✓. The researcher reports whichever analysis "worked."]
The same dataset can tell different stories depending on which analytical path the researcher takes. Only the "successful" path makes it into the paper.
Chapter 2

The Green Jelly Bean Problem

The webcomic xkcd made this point unforgettably.3 Imagine a group of scientists investigating whether jelly beans cause acne. They test the hypothesis and find no link — p = 0.6. Fine. But then someone asks: what about green jelly beans specifically? So they test green jelly beans against acne and get p = 0.04. Headline: "GREEN JELLY BEANS LINKED TO ACNE (p < 0.05)." And if you read only the final paper, it looks entirely legitimate.

The catch, of course, is that they tested twenty colors. If each test has a 5% chance of a false positive (that's what α = 0.05 means), then across twenty independent tests, the probability of at least one false positive is:

Probability of at least one false positive
P(≥ 1 false positive) = 1 − (1 − 0.05)^20 ≈ 64%
With 20 tests at α = 0.05, you have a nearly two-in-three chance of finding something "significant" by pure luck.

Sixty-four percent. You're more likely than not to find a "discovery" that doesn't exist. And this is not a pathological case — twenty comparisons is, if anything, modest by the standards of modern research. In genomics, you might test millions of genetic variants. In neuroimaging, you test activity in tens of thousands of brain voxels. In economics, you can slice and dice the data by age, gender, income, region, time period, and a dozen other variables. The space of possible tests is enormous, and the researcher gets to wander through it, stopping wherever the landscape happens to dip below 0.05.

Try it yourself:

🫘 The Jelly Bean Experiment

We test whether each color of jelly bean causes acne. There's no real effect — any "significant" result is a false positive. Click to run the experiment.

Click "Test All 20 Colors" to run the experiment
Chapter 3

The Garden of Forking Paths

The jelly bean problem is the simple version: you test many hypotheses and cherry-pick the winner. But the statistician Andrew Gelman has pointed out that p-hacking doesn't require such blunt tactics. It can happen through what he calls the "garden of forking paths" — the branching tree of seemingly innocuous analytical decisions that every researcher faces.4

Should you include that outlier or drop it? Should you control for age, or does that muddy the picture? Should the dependent variable be the raw score or the change from baseline? Should you analyze men and women together or separately? Should you use a parametric test or a nonparametric one? Should you stop collecting data at 50 participants or push for 80?

Each of these is a fork in the road. At each fork, the researcher has two or three or four reasonable options. And at each fork, the researcher can see which option gives the more favorable result. No single choice feels like cheating — each is defensible on its own. But the cumulative effect of many small, data-dependent choices is enormous. If you have ten binary analysis decisions, that's 2^10 = 1,024 different analyses you could run. Even if only a small fraction of those yield p < 0.05 by chance, you'll almost certainly find at least one. And the one you find is the one you'll write up.

Simmons, Nelson, and Simonsohn (2011) demonstrated this spectacularly. They set out to "prove" that listening to the Beatles song "When I'm Sixty-Four" actually made people younger. Using standard psychological methods and entirely real data, they were able to produce a statistically significant result (p = 0.04) by making a series of defensible analytical choices: controlling for the right variables, choosing the right dependent variable, recruiting just enough participants. The result was statistically significant, methodologically defensible, and completely absurd.5

This is why Gelman calls it the garden of forking paths rather than "fishing" or "data dredging." Those terms imply the researcher is consciously trying thousands of tests. The garden metaphor is more accurate: the researcher takes what feels like a single, natural walk through the data — but the walk was guided, at every branching point, by the data itself. The researcher didn't try a thousand analyses. They tried one. But it was the one the data told them to try.

[Figure: a branching tree of analysis choices (keep vs. drop outliers, log-transform the DV, and so on) ending in p-values of .41, .19, .72, .11, .55, .08, .13, and .03. One path out of many leads to "significance" — and that's the one that gets published.]
The garden of forking paths: each fork represents a reasonable analytical choice. The red path leads to p < 0.05 — and that's the path the paper reports.
Chapter 4

Try It Yourself: The P-Hacking Playground

Here's a dataset with absolutely no real effect. Zero. The "treatment" and "control" groups were generated from the same distribution. But you have a research assistant's toolkit of defensible analytical choices. How many do you need to toggle before you get p < 0.05?

🔬 P-Hacking Playground

There is no real effect in this data. Toggle analytical choices to see how they change the p-value. Can you hack your way to "significance"?

With no toggles applied, the honest analysis gives p = 0.42 — not significant.
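For readers without the interactive version, here is a rough equivalent in Python. It is a sketch under simple assumptions: the three toggles (log-transform, drop outliers, stop collecting early) are illustrative analysis choices rather than the playground's actual ones, and the code simply enumerates every combination and reports the best-looking p-value.

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 60

# No real effect: treatment and control are draws from the same distribution.
treatment = rng.normal(size=n)
control = rng.normal(size=n)

def analyze(log_transform, drop_outliers, stop_early):
    t, c = treatment.copy(), control.copy()
    if log_transform:
        shift = 1 - min(t.min(), c.min())        # make everything positive first
        t, c = np.log(t + shift), np.log(c + shift)
    if drop_outliers:                            # drop points beyond 2 SD
        t = t[np.abs(t - t.mean()) < 2 * t.std()]
        c = c[np.abs(c - c.mean()) < 2 * c.std()]
    if stop_early:                               # pretend we stopped at 40 per group
        t, c = t[:40], c[:40]
    return stats.ttest_ind(t, c).pvalue

results = {choices: analyze(*choices)
           for choices in itertools.product([False, True], repeat=3)}
best = min(results, key=results.get)
print(f"plain analysis:   p = {results[(False, False, False)]:.3f}")
print(f"best of 8 forks:  p = {results[best]:.3f}  (choices: {best})")
```

Three binary forks already give eight candidate analyses; ten give the 1,024 mentioned in the previous chapter. Rerun with different seeds and watch how often the best of eight dips below 0.05 while the plain analysis does not.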
Chapter 5

The Scope of the Disaster

You might think p-hacking is an academic curiosity — a worry for journal editors, not for the rest of us. You'd be wrong. P-hacking — and the broader problem of analytical flexibility — has corrupted vast swathes of the published scientific literature, with real consequences for medicine, policy, and human welfare.

In 2015, the Open Science Collaboration attempted to replicate 100 psychology studies published in top journals. The results were devastating: only 36% of the replications found a significant effect, compared to 97% of the originals. Effect sizes were, on average, half as large.6 This wasn't just psychology. In cancer biology, a team at Amgen tried to replicate 53 "landmark" studies and succeeded with only 6 — an 11% replication rate.7 In economics, about 60% of 18 high-profile studies replicated.8

The pattern is consistent: the published literature is far too optimistic. Effects are inflated. Many "discoveries" are likely false positives, statistical phantoms conjured by the garden of forking paths. And p-hacking is a major — perhaps the major — reason why.

The first principle is that you must not fool yourself — and you are the easiest person to fool. — Richard Feynman

Consider what this means in medicine. A pharmaceutical company runs a clinical trial. The drug doesn't work for the overall population, but it does seem to work for patients over 60, or for women, or for patients with a particular genetic marker. This subgroup analysis wasn't planned in advance — but the result is significant, so it gets published, and doctors start prescribing accordingly. Patients take drugs that don't work for them because of a statistical artifact. This is not hypothetical; it has happened many times.9

Or consider nutrition science, which has given us a new dietary recommendation approximately every week for the past three decades. Eat eggs; don't eat eggs. Red wine prevents heart disease; red wine causes cancer. Much of this churn comes from the same dataset — large observational studies that collect information on hundreds of dietary and health variables — being sliced in every conceivable way until something significant emerges. The result is a literature where almost everything you eat has been both positively and negatively associated with some health outcome.10

Chapter 6

The Math of Multiple Comparisons

Let's be precise about why this happens. Suppose there are m independent hypotheses being tested, and all of them are false — there really is no effect. At a significance level of α = 0.05, the probability that any single test falsely rejects is 0.05. The probability that all tests correctly fail to reject is (1 − α)m. So the probability that at least one test falsely rejects — the family-wise error rate — is:

Family-Wise Error Rate (FWER)
FWER = 1 − (1 − α)^m

This grows fast. At 10 tests, FWER is 40%. At 20 tests, it's 64%. At 50 tests, it's 92%. At 100, it's 99.4%. If you test enough things, finding something "significant" is not the exception — it's the expectation.

📊 Family-Wise Error Rate Calculator

See how the chance of at least one false positive grows with the number of tests.

At 20 tests and α = 0.05: family-wise error rate = 64.2% (the chance of at least one false positive), expected false positives = 1.0, Bonferroni-corrected α = 0.0025.

The standard correction for this is the Bonferroni correction: if you're running m tests, use α/m as your significance threshold instead of α. Testing 20 jelly bean colors? Your threshold per test should be 0.05/20 = 0.0025, not 0.05. This is conservative — it controls the family-wise error rate at 0.05 — but at least it's honest.11
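Both the error rate and the correction are a couple of lines of code. A minimal sketch in Python, using nothing beyond the formula above:

```python
alpha = 0.05

for m in (1, 10, 20, 50, 100):
    fwer = 1 - (1 - alpha) ** m      # chance of at least one false positive
    bonferroni = alpha / m           # per-test threshold that caps the FWER at alpha
    expected_fp = alpha * m          # expected number of false positives
    print(f"m = {m:>3}: FWER = {fwer:6.1%}, "
          f"Bonferroni threshold = {bonferroni:.4f}, "
          f"expected false positives = {expected_fp:.2f}")
```

At m = 20 this reproduces the calculator's numbers: a 64.2% family-wise error rate, one expected false positive, and a corrected threshold of 0.0025.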

The deeper issue, though, is that Bonferroni only works when you know how many tests you ran. And in the garden of forking paths, you often don't. Every conditional decision — "I'll control for age because the groups look different in age" — is an implicit test that never gets counted. The number of effective comparisons is invisible, even to the researcher.

Chapter 7

How Many Results Are False?

In 2005, the physician-statistician John Ioannidis published what became the most downloaded paper in the history of PLOS Medicine: "Why Most Published Research Findings Are False."12 The title was not clickbait. Ioannidis showed, through a straightforward Bayesian argument, that when the prior probability of a hypothesis being true is low, even a p < 0.05 result is more likely to be a false positive than a true discovery.

Here's the intuition. Suppose you test 1,000 hypotheses, of which 100 are genuinely true and 900 are false. At 80% power and α = 0.05:

True hypotheses (100): Your test correctly detects 80% of them → 80 true positives.

False hypotheses (900): Your test falsely flags 5% of them → 45 false positives.

Total "significant" results: 125. Of these, 45/125 = 36% are false.

And that's with a 10% prior — relatively generous. If only 1% of your hypotheses are true (10 out of 1,000), you get 8 true positives and ~50 false positives. Now 86% of your "discoveries" are wrong.

Add p-hacking to this picture and it gets worse. If researchers can inflate their false positive rate from 5% to, say, 20% through flexible analysis, then with a 10% prior you'd get 80 true positives and 180 false positives — meaning 69% of significant results are false. The scientific literature becomes more fiction than fact.
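The arithmetic behind these percentages fits in one small function. A sketch, using the same illustrative numbers as above (the 20% "hacked" false positive rate is an assumption for the sake of the example, not an empirical estimate):

```python
def false_discovery_share(n_hypotheses, prior_true, power, false_pos_rate):
    """Fraction of 'significant' results that are actually false positives."""
    n_true = n_hypotheses * prior_true
    n_false = n_hypotheses - n_true
    true_pos = power * n_true
    false_pos = false_pos_rate * n_false
    return false_pos / (true_pos + false_pos)

print(false_discovery_share(1000, 0.10, 0.80, 0.05))  # ~0.36  (honest testing)
print(false_discovery_share(1000, 0.01, 0.80, 0.05))  # ~0.86  (rarer true effects)
print(false_discovery_share(1000, 0.10, 0.80, 0.20))  # ~0.69  (with p-hacking)
```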

Published "Significant" Results: What's Real? Without p-hacking 64% true 36% false 125 results With p-hacking 31% true 69% false 260 results True positives False positives
When p-hacking inflates the false positive rate, the majority of "discoveries" in the literature may be wrong. Assumes 10% base rate of true hypotheses.
Chapter 8

What Do We Do About It?

The good news is that the scientific community, having diagnosed the disease, has begun developing treatments. None of them is a cure, but together they make p-hacking harder and honesty easier.

Pre-registration

The single most powerful reform is pre-registration: before collecting data, the researcher writes down exactly what hypothesis they plan to test, how they'll analyze it, and what would count as a positive or negative result. This plan is filed with a public registry (like the Open Science Framework or ClinicalTrials.gov) and timestamped. Now the garden has only one path — the one the researcher committed to before seeing the data. Exploratory analyses are still allowed, but they must be labeled as such.13

Registered Reports

Even better than pre-registration is the registered report: a journal agrees (or declines) to publish a study before the data are collected, based on the research question and methodology alone. This eliminates the incentive to p-hack entirely, because publication no longer depends on the result. Studies with registered reports have significantly lower rates of positive findings — about 44%, compared to over 90% for standard papers — which is exactly what we'd expect if the inflated rate was largely an artifact of p-hacking.14

Better Statistics

Some researchers advocate replacing the p < 0.05 threshold entirely. One proposal: lower it to 0.005, which would reduce false positive rates dramatically.15 Others argue for moving to Bayesian statistics, which quantify the evidence for and against a hypothesis as a continuous ratio (the Bayes factor) rather than a binary significant/not-significant verdict. Still others push for focusing on effect sizes and confidence intervals rather than p-values at all. No consensus has emerged, but the common thread is clear: we need to stop treating the p-value as a bright line between truth and noise.

The Deeper Lesson

P-hacking isn't a bug in statistics — it's a feature of human cognition. We are pattern-seeking creatures working with noisy data, and we have strong incentives (career, funding, fame) to find patterns. The solution isn't just better statistical methods; it's building institutions that reward honesty over novelty. Pre-registration, data sharing, replication — these are not just technical fixes. They're commitments to knowing the truth even when the truth is boring.

Chapter 9

The Monte Carlo Truth Machine

Don't take my word for how bad this is. Let's simulate it. The tool below runs thousands of fake experiments where there is no true effect. It then applies various levels of p-hacking (modeled as trying multiple analyses and reporting the best) and shows you what the published literature would look like.

🎲 P-Hacking Simulation

Each "study" draws from two identical distributions (no real effect). P-hacking means trying multiple analyses and reporting the smallest p-value.

The simulation tracks the false positive rate, the median reported p-value, and the number of studies reaching p < 0.05, classifying each study as "significant" (p < 0.05) or not.

Play with the sliders. Watch what happens as you increase the number of analyses per study. With just one analysis, about 5% of studies cross the threshold — exactly as expected. With ten analyses, it jumps to around 40%. With fifty, it's over 90%. The p-value, which is supposed to be a guard against false positives, has been rendered completely meaningless.
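If you'd rather not take the sliders' word for it, here is a stripped-down version of the same Monte Carlo in Python. It assumes the simplest possible model of p-hacking, the one described above: each study runs k independent analyses on null data and reports the smallest p-value (real analyses are correlated, so this is closer to a worst case).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2011)

def fraction_significant(n_studies=1000, n_per_group=30, analyses_per_study=1):
    """Fraction of null studies that end up reporting p < 0.05."""
    hits = 0
    for _ in range(n_studies):
        best_p = 1.0
        for _ in range(analyses_per_study):
            a = rng.normal(size=n_per_group)   # identical distributions:
            b = rng.normal(size=n_per_group)   # there is no true effect
            best_p = min(best_p, stats.ttest_ind(a, b).pvalue)
        hits += best_p < 0.05
    return hits / n_studies

for k in (1, 10, 50):
    print(f"{k:>2} analyses per study: "
          f"{fraction_significant(analyses_per_study=k):.0%} 'significant'")
```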

This is the central lesson. The p-value is not a fact about the world. It's a fact about a procedure. If you follow the procedure honestly — formulate your hypothesis, collect your data, run your test — then the p-value protects you, at least somewhat, from being fooled by randomness. But if you deviate from the procedure — even in small, well-intentioned ways — the protection evaporates. And you won't feel it disappear. That's the most dangerous thing about p-hacking: it feels exactly like doing science.

· · ·

Daryl Bem's precognition paper, the one we started with, passed peer review. It used standard methods. Its p-values were small. And yet it was wrong — or at the very least, its evidence was far weaker than the p-values suggested, because the analyses involved many decisions made after seeing the data. When other researchers tried to replicate Bem's work using pre-registered protocols, they found nothing.16

Bem's paper wasn't an embarrassment for Bem alone. It was an embarrassment for the entire system of scientific inference that let it through. The methods that "proved" precognition were the same methods being used to evaluate drugs, educational interventions, and economic policies. If those methods could prove that humans see the future, what else might they be getting wrong?

The answer, we now know, is: quite a lot. The replication crisis that followed Bem's paper — and the broader recognition of p-hacking as a systemic problem — has been one of the most important developments in the history of science. Not because it revealed that scientists are fraudulent (most aren't), but because it revealed something more troubling: that honest scientists, using standard methods, can reliably produce false results. The tools themselves were broken. And the first step toward fixing them was understanding exactly how.

Notes & References

  1. Bem, D.J. (2011). "Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect." Journal of Personality and Social Psychology, 100(3), 407–425.
  2. Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). "False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant." Psychological Science, 22(11), 1359–1366.
  3. Munroe, R. (2011). "Significant." xkcd, #882. https://xkcd.com/882/
  4. Gelman, A. & Loken, E. (2013). "The garden of forking paths: Why multiple comparisons can be a problem, even when there is no 'fishing expedition' or 'p-hacking' and the research hypothesis was posited ahead of time." Unpublished manuscript, Columbia University.
  5. Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). "False-positive psychology." Psychological Science, 22(11), 1359–1366. The "Sixty-Four" example is from Section 2 of this paper.
  6. Open Science Collaboration (2015). "Estimating the reproducibility of psychological science." Science, 349(6251), aac4716.
  7. Begley, C.G. & Ellis, L.M. (2012). "Raise standards for preclinical cancer research." Nature, 483, 531–533.
  8. Camerer, C.F., et al. (2016). "Evaluating replicability of laboratory experiments in economics." Science, 351(6280), 1433–1436.
  9. Wallach, J.D., et al. (2017). "Evaluation of evidence of statistical support and corroboration of subgroup claims in randomized clinical trials." JAMA Internal Medicine, 177(4), 554–560.
  10. Schoenfeld, J.D. & Ioannidis, J.P.A. (2013). "Is everything we eat associated with cancer? A systematic cookbook review." American Journal of Clinical Nutrition, 97(1), 127–134.
  11. Dunn, O.J. (1961). "Multiple comparisons among means." Journal of the American Statistical Association, 56(293), 52–64.
  12. Ioannidis, J.P.A. (2005). "Why most published research findings are false." PLOS Medicine, 2(8), e124.
  13. Nosek, B.A., et al. (2018). "The preregistration revolution." Proceedings of the National Academy of Sciences, 115(11), 2600–2606.
  14. Allen, C. & Mehler, D.M.A. (2019). "Open science challenges, benefits and tips in early career and beyond." PLOS Biology, 17(5), e3000246. See also Scheel, A.M., et al. (2021). "An excess of positive results: Comparing the standard psychology literature with Registered Reports." Advances in Methods and Practices in Psychological Science, 4(2).
  15. Benjamin, D.J., et al. (2018). "Redefine statistical significance." Nature Human Behaviour, 2, 6–10.
  16. Ritchie, S.J., Wiseman, R., & French, C.C. (2012). "Failing the future: Three unsuccessful attempts to replicate Bem's 'retroactive facilitation of recall' effect." PLOS ONE, 7(3), e33423.