
The Missing Chapter

The Invisible Evidence

Why the studies you never read might matter more than the ones you do

An extension of Jordan Ellenberg's "How Not to Be Wrong"

Chapter 92

Twenty Labs, One Headline

Imagine twenty laboratories around the world independently testing whether jelly beans cause acne. Each lab recruits subjects, randomly assigns them to eat jelly beans or not, counts pimples, and runs a statistical test. Their significance threshold is the standard α = 0.05. Here's the thing: jelly beans don't actually cause acne. There is no real effect. Zero. Zilch.

But probability doesn't care about truth. If you test a false hypothesis at the 5% level, there's a 5% chance you'll get a "significant" result anyway—that's literally what α = 0.05 means. With twenty independent labs, the expected number of false positives is 20 × 0.05 = 1. One lab, through nothing but the ordinary wobble of random sampling, will find p < 0.05.
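That expected-value arithmetic is easy to check with a quick simulation. Here is a minimal sketch using only the standard library; it relies on the fact that, under a true null hypothesis, each lab's p-value is uniform on [0, 1], so each lab independently has an α chance of a false positive:

```python
import random

random.seed(42)  # for reproducibility

def significant_labs(n_labs=20, alpha=0.05, n_worlds=10_000):
    """Simulate many 'worlds', each with n_labs labs testing a hypothesis
    that is false. Under the null, a lab crosses p < alpha with
    probability alpha, so we just flip an alpha-weighted coin per lab."""
    total = 0
    for _ in range(n_worlds):
        total += sum(random.random() < alpha for _ in range(n_labs))
    return total / n_worlds  # average false positives per world

print(f"Average 'significant' labs out of 20: {significant_labs():.2f}")
# close to the expected value 20 * 0.05 = 1
```

Run it and the average hovers right around one lab per twenty, exactly as the expected-value calculation predicts.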

Now comes the dark part. That one lab writes up their result. They submit it to a journal. "Jelly beans linked to acne (p < 0.05)!" The journal publishes it because positive results are exciting and publishable. The other nineteen labs? They found nothing. Nothing isn't a story. Nothing doesn't get you tenure. Nothing goes in the file drawer.1

The published scientific literature now contains exactly one study on jelly beans and acne, and it says there's a statistically significant link. If you're a reporter, a doctor, or a curious person Googling the question, you'd conclude the science says jelly beans cause acne. You'd be wrong—but you'd have no way of knowing it.

This is the file drawer problem, named by the psychologist Robert Rosenthal in 1979.2 It's one of the most important ideas in the philosophy of science that almost nobody outside academia has heard of, and it's been quietly corroding the reliability of published research for decades.

· · ·

The Math of Invisible Evidence

Let's make this precise. Suppose 100 labs test a completely false hypothesis at α = 0.05. The expected number of false positives is simply:

Expected false positives
E[FP] = n × α = 100 × 0.05 = 5
On average, five labs will find "significant" results for a hypothesis that is completely false.

If positive results are three times more likely to be published than negative ones—a conservative estimate based on actual studies of publication rates3—then those 5 false positives get published while most of the 95 null results stay in drawers. The published record now leans, confidently, toward a conclusion that is entirely wrong.
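To see what that filter does to the visible record, here is the same expected-value arithmetic with a publication step added. The 0.9 and 0.3 publication probabilities are illustrative assumptions chosen only to match the 3:1 ratio above:

```python
def published_positive_share(n=100, alpha=0.05, pub_pos=0.9, pub_neg=0.3):
    """Expected share of the published literature claiming an effect,
    when every hypothesis tested is actually false. pub_pos and pub_neg
    are assumed publication probabilities in the 3:1 ratio from the text."""
    exp_pos = n * alpha              # expected false positives: 5
    exp_neg = n * (1 - alpha)       # expected null results: 95
    pub_pos_n = exp_pos * pub_pos   # 4.5 published "findings"
    pub_neg_n = exp_neg * pub_neg   # 28.5 published nulls
    return pub_pos_n / (pub_pos_n + pub_neg_n)

print(f"{published_positive_share():.0%} of published studies report an "
      "effect that does not exist")  # ~14%, nearly triple the true 5% rate
```

Under these assumptions the positive share of the published record nearly triples relative to the 5% of studies that actually found anything, and every one of those positives is wrong.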

The problem gets worse the more you think about it. It's not just that we're missing some studies. It's that the studies we're missing are systematically different from the ones we see. We don't have a random sample of all research ever conducted. We have a biased sample—biased toward findings that look interesting, that confirm hypotheses, that reach the magical p < 0.05 threshold. The error isn't random. It's structural.

🗄️ The File Drawer Simulator

Twenty labs test whether jelly beans cause acne (they don't). Click "Run Studies" to see which get lucky. Only "significant" results publish. Repeat to watch a false literature accumulate.

· · ·

Reading the Shape of Silence

There's a beautiful tool for detecting publication bias, and it works by looking for shapes that should be there but aren't. It's called a funnel plot.

The idea is simple. Plot each study as a dot: effect size on the x-axis, sample size (or precision) on the y-axis. Big studies cluster tightly near the true effect—they're precise. Small studies scatter widely—they're noisy. If there's no bias, the dots form a symmetric funnel shape, spreading out evenly on both sides as you move toward smaller studies.

[Figure: two funnel plots of effect size (x-axis) against precision (y-axis). Left, "No Bias": a symmetric funnel. Right, "Publication Bias": an asymmetric funnel with the missing small negative studies marked.]
A symmetric funnel (left) suggests no bias. When small negative studies vanish (right), the funnel tilts—the telltale fingerprint of publication bias.

But when publication bias is at work, the small negative studies disappear. The bottom-left of the funnel gets scooped out. You're left with an asymmetric shape—a lopsided funnel leaning toward positive results. The trim-and-fill method, developed by Duval and Tweedie, formalizes this intuition: it "fills in" the missing studies to make the funnel symmetric again, then recalculates the overall effect.4 Often, the corrected effect is much smaller than the published one. Sometimes it vanishes entirely.

It's a bit like a detective noticing that a crowd photo has been cropped. You can't see what was cut, but the asymmetry of what remains tells you something was removed—and roughly what it must have looked like.

· · ·

The Garden of Forking Paths

Publication bias is bad enough when researchers honestly test one hypothesis and the filter happens at the journal level. But there's a much more insidious version that happens inside the lab itself. It's called p-hacking.

Here's how it works. You collect data. You run your analysis. The p-value comes back at 0.12. Not significant. But you really want this paper to work out. So you think: what if I remove that one weird outlier? p = 0.08. Closer. What if I control for age? p = 0.06. So close. What if I split the sample by gender and just look at women? p = 0.03. There it is.

You haven't done anything that feels like cheating. Each individual decision seems reasonable—outliers are often removed, controlling for covariates is standard practice, subgroup analyses are legitimate. But each decision is also a fork in the road, and you kept taking whichever fork led toward significance. The statistician Andrew Gelman calls this the "garden of forking paths."5

In 2011, Joseph Simmons, Leif Nelson, and Uri Simonsohn published a landmark paper with the dryly devastating title "False-Positive Psychology." They showed that using standard, seemingly innocent "researcher degrees of freedom"—choosing when to stop collecting data, which variables to control for, which comparisons to report—you could find a statistically significant result for essentially any hypothesis, including the claim that listening to "When I'm Sixty-Four" by the Beatles literally makes people younger.6
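The trick is easy to reproduce in simulation. The sketch below generates pure-noise "drug" and "placebo" data, lets the analyst take three extra forks (an arbitrary half standing in for the women-only split, plus dropping the most extreme point from either arm), and keeps the best p-value. The test is a two-sample z-test via the normal approximation, which is reasonable at these sample sizes:

```python
import math
import random

random.seed(0)

def p_value(a, b):
    """Two-sided p-value for a two-sample z-test (normal approximation,
    reasonable for samples of 30 or more)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    z = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return math.erfc(abs(z) / math.sqrt(2))

def forked_study(n=60):
    """One null study, analyzed four ways; the analyst reports the best p.
    The 'subgroup' here is an arbitrary half, a stand-in for any
    post-hoc split like 'just look at women'."""
    drug = [random.gauss(0, 1) for _ in range(n)]
    placebo = [random.gauss(0, 1) for _ in range(n)]
    ps = [
        p_value(drug, placebo),                      # planned analysis
        p_value(drug[: n // 2], placebo[: n // 2]),  # post-hoc subgroup
        p_value(sorted(drug)[:-1], placebo),         # drop a drug "outlier"
        p_value(drug, sorted(placebo)[:-1]),         # drop a placebo "outlier"
    ]
    return min(ps)

trials = 2000
rate = sum(forked_study() < 0.05 for _ in range(trials)) / trials
print(f"False-positive rate with forking: {rate:.1%}")  # well above 5%
```

Even with only three modest forks on data containing no effect at all, the false-positive rate roughly doubles the nominal 5%.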

🔬 The p-Hacking Playground

Below is a dataset where "Wonderdrug" has NO real effect on health scores. But with enough analytical flexibility, you can almost always find p < 0.05. Toggle options and watch the p-value dance.

The Core Insight

A p-value only means what it says if you chose your analysis before looking at the data. The moment you try multiple analyses and report the best one, your effective α isn't 0.05 anymore—it could be 0.5 or higher. The significance is an illusion created by selection, not evidence.
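The inflation is easy to quantify in the best case for the analyst, where the candidate analyses are statistically independent (real forks share data and are correlated, so the true inflation is smaller, but still severe):

```python
# Chance of at least one p < 0.05 across k independent analyses of
# data containing no real effect: 1 - (1 - alpha)^k.
alpha = 0.05
for k in (1, 5, 10, 20):
    effective_alpha = 1 - (1 - alpha) ** k
    print(f"{k:2d} analyses -> effective alpha = {effective_alpha:.2f}")
#  1 analysis  -> 0.05
# 20 analyses -> 0.64: more likely than not to "find" something
```

At twenty independent tries, finding a spurious "significant" result is more likely than not.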

· · ·

The Crisis

In 2015, a group called the Open Science Collaboration tried to replicate 100 published psychology studies. These were real studies from top journals—the kind that get cited in textbooks and newspaper articles. They followed the original methods as closely as possible. The result was devastating: only 36% of the replications found a statistically significant effect in the same direction as the original.7

[Figure: bar chart of replication rates by field. Psychology: 36%. Cancer biology: 50%. Economics: 55%.]
The bars show the percentage of studies in each field that successfully replicated.

This wasn't just psychology. Cancer biology fared similarly—a 2021 project found that only about half of high-profile cancer studies replicated. Economics did somewhat better, but still poorly by any reasonable standard. The whole edifice of "statistically significant findings" was shakier than anyone had wanted to admit.

The replication crisis is what happens when publication bias and p-hacking compound over years. Each individual study might represent an honest effort. But the system—the journals, the incentives, the culture of "publish or perish"—acts as a giant filter that amplifies noise and discards signal. We built a machine for producing confident, published, citable claims that turn out to be wrong.

"The first principle is that you must not fool yourself—and you are the easiest person to fool."
— Richard Feynman
· · ·

Fixing the Machine

The encouraging part of this story is that people are actually doing something about it. The fixes are elegant, and they all share a common logic: commit to your analysis before you see the results.

Pre-registration

The simplest fix is pre-registration. Before you collect data, you write down exactly what hypothesis you're testing, what analysis you'll run, what your sample size will be, and what would count as a positive result. You post this plan on a public registry. Now you can't p-hack, because your forking paths have been sealed off in advance. You committed to one path through the garden before you entered it.

Registered Reports

Even better: registered reports. In this model, you submit your study design to a journal before collecting data. The journal reviews your methods and, if they're sound, commits to publishing the results regardless of what they find. Null results are just as publishable as positive ones. The file drawer slams shut—or rather, it never opens in the first place.8

This is profound. It decouples the question "Is this a well-designed study?" from "Did it find something exciting?" And it turns out that registered reports have dramatically lower rates of positive results than traditional publications—around 44% positive, versus 96% positive in traditional papers. That gap is the fingerprint of all the bias that registered reports eliminate.

Open Data and Open Code

The third piece is transparency. When researchers share their raw data and analysis code, others can check for p-hacking, try alternative analyses, and catch errors. It's harder to quietly exclude outliers when everyone can see which data points you dropped and why.

[Figure: two publishing pipelines. Traditional: design, collect, analyze, review, with publication filtered by p < 0.05. Registered report: design, then review and acceptance before any data exist, then collect, analyze, and publish regardless of result.]
Traditional publishing filters by results. Registered reports filter by methods—eliminating publication bias by design.
· · ·

The Lesson

Publication bias is a story about how a perfectly rational system can produce perfectly irrational outcomes. Every individual actor is doing something sensible. Researchers study what interests them. Journals publish what's newsworthy. Readers pay attention to surprising findings. But the aggregate effect of all this sensible behavior is a literature that systematically overstates effects, undercounts failures, and presents a distorted picture of reality.

Ellenberg would point out that this is a mathematical problem, not a moral one. You don't need to posit fraud or bad faith. You just need to understand what happens when you apply a biased filter to random noise. The signal that comes through isn't the signal that went in. It's been shaped by the filter itself.

The lesson extends far beyond science. Whenever you see a collection of stories, successes, results, or data points that have been filtered by some selection process, ask yourself: what's in the file drawer? The companies you read about on TechCrunch (what about the ones that failed?). The investment strategies touted in bestsellers (what about the ones that lost money?). The diets that "worked" for your friends (what about the times they didn't?).

Absence of evidence isn't evidence of absence—but systematic absence of evidence is evidence of a filter. Whenever you see only successes, ask what mechanism is hiding the failures. The file drawer is everywhere. Learning to see it is one of the most important mathematical habits of mind you can develop.

Notes & References

  1. This scenario is adapted from a famous xkcd comic (#882, "Significant") which depicts exactly this process with jelly beans and acne, testing different colors until one yields p < 0.05.
  2. Rosenthal, R. (1979). "The file drawer problem and tolerance for null results." Psychological Bulletin, 86(3), 638–641. Rosenthal estimated the number of unpublished null studies needed to overturn a significant finding—the "fail-safe N."
  3. Fanelli, D. (2012). "Negative results are disappearing from most disciplines and countries." Scientometrics, 90(3), 891–904. Found that positive results increased from 70% in 1990 to 86% in 2007 across all disciplines.
  4. Duval, S. & Tweedie, R. (2000). "Trim and fill: A simple funnel-plot–based method of testing and adjusting for publication bias in meta-analysis." Biometrics, 56(2), 455–463.
  5. Gelman, A. & Loken, E. (2013). "The garden of forking paths: Why multiple comparisons can be a problem, even when there is no 'fishing expedition' or 'p-hacking.'" Unpublished manuscript, Columbia University.
  6. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). "False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant." Psychological Science, 22(11), 1359–1366.
  7. Open Science Collaboration (2015). "Estimating the reproducibility of psychological science." Science, 349(6251), aac4716. The mean effect size of replications was half that of the originals.
  8. Chambers, C. D. (2013). "Registered Reports: A new publishing initiative at Cortex." Cortex, 49(3), 609–610. As of 2023, over 300 journals accept registered reports.