The Missing Chapter

The Texas Sharpshooter Fallacy

Why painting the bullseye after you shoot is the most dangerous move in science

An extension of Jordan Ellenberg's "How Not to Be Wrong"

Chapter 32

The Best Shot in the County

There's a joke about a Texan who fires his rifle at the broad side of a barn — just empties the whole clip, bang bang bang, holes everywhere. Then he walks over with a bucket of red paint, finds the tightest cluster of bullet holes, and carefully paints a bullseye around it. Steps back, admires his work. Calls over the neighbors. "Best shot in the county," they say.

This is funny because it's obviously cheating. You're supposed to pick the target first, then shoot. Not the other way around. The order matters.

And yet this is exactly what happens, every day, in science, medicine, finance, and public policy — often by smart, well-meaning people who don't realize they're holding the paintbrush.

The Texas Sharpshooter Fallacy is the name for a specific failure of reasoning: you look at a pile of data, find a pattern, and then act as if that pattern was what you were looking for all along. You confuse a post-hoc observation with a pre-registered hypothesis. The technical name is data dredging, or the more evocative term: p-hacking.1

It's not a minor issue. It's arguably the single biggest reason that published scientific findings fail to replicate, that pharmaceutical companies waste billions on drugs that looked promising in early trials, and that your uncle is convinced that his astrological sign predicts his stock market performance.

• • •

The Arithmetic of Fishing

Here's the math, and it's devastating in its simplicity.

The standard threshold for "statistical significance" in most sciences is p < 0.05. That means: if there's truly no effect, there's only a 5% chance you'd see data this extreme by random luck. One in twenty. Sounds pretty safe, right?

Now: what if you test twenty hypotheses?

Probability of at least one false positive in n tests:

P(at least one false positive) = 1 − (1 − α)^n

With α = 0.05 and n = 20: 1 − 0.95^20 ≈ 0.64, a 64% chance.

Let that sink in. If you test twenty things and none of them are real, you have roughly a two-in-three chance of finding at least one "significant" result anyway. Not because you got unlucky — because the math guarantees it.

And here's the kicker: nobody ever publishes the paper titled "We Tested 20 Things and 19 Came Up Empty." They publish the one that crossed the threshold. The bullseye gets painted. The barn wall gets famous.2

Picture the barn wall before and after the paint: same bullet holes, same barn. The only thing that moved was the target.

Try It Yourself

Don't take my word for it. Generate purely random data (no real effects, no signal, just noise) and run twenty hypothesis tests against it. "Discoveries" materialize from nothing: since there is no real effect, any "significant" result is guaranteed to be a false positive.
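The experiment fits in a few lines of code. Here's a minimal stand-in for the simulator, using only the standard library; the function names are mine, and the Fisher z-approximation stands in for an exact correlation test. It tests 20 random predictors against a random outcome, repeats the whole fishing expedition 200 times, and counts how often at least one false positive turns up.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def fishing_expedition(n_subjects=100, n_tests=20, alpha=0.05, seed=0):
    """Test n_tests purely random predictors against a purely random
    outcome; return how many come back 'significant'. Every hit is,
    by construction, a false positive."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]
    hits = 0
    for _ in range(n_tests):
        predictor = [rng.gauss(0, 1) for _ in range(n_subjects)]
        r = pearson_r(predictor, outcome)
        # Under the null, atanh(r) * sqrt(n - 3) is approximately N(0, 1).
        z = math.atanh(r) * math.sqrt(n_subjects - 3)
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
        if p < alpha:
            hits += 1
    return hits

# Repeat the expedition many times: roughly 64% of runs should land
# at least one "discovery", matching 1 - 0.95**20.
runs = [fishing_expedition(seed=s) for s in range(200)]
frac = sum(h > 0 for h in runs) / len(runs)
print(f"Runs with at least one false positive: {frac:.0%}")
```

Run it and the fraction hovers around two-thirds, exactly what the arithmetic above predicts, despite there being nothing to discover.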

• • •

The Bible Code and Moby Dick

In 1997, journalist Michael Drosnin published The Bible Code, claiming that the Hebrew text of the Torah contained hidden predictions of future events — the assassination of Yitzhak Rabin, the Gulf War, the Oklahoma City bombing — all encoded as "equidistant letter sequences."3 The method: pick a starting letter, skip every nth letter, and see if words emerge. It was a global bestseller. People were convinced.

Mathematician Brendan McKay and colleagues did the obvious thing nobody had thought to do: they applied the same method to Moby Dick.

Using Drosnin's exact technique on Melville's novel, McKay found equally stunning "predictions": the assassination of Indira Gandhi, the death of Princess Diana, the assassination of Martin Luther King Jr., and the death of John F. Kennedy — all "encoded" in a 19th-century novel about a whale.

The point wasn't that Moby Dick is prophetic. The point is that any sufficiently long text, searched with enough freedom, will produce whatever patterns you want. The bullseye was painted after the shots were fired.4

This is the sharpshooter's method stripped bare. The Torah has 304,805 letters. If you're free to choose your starting letter, your skip distance, your reading direction, and the language of the target word — the number of possible "tests" you're implicitly running is astronomical. Finding a few hits isn't a miracle. It's a mathematical certainty.
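McKay's debunking is easy to reproduce in miniature. The sketch below is my own illustrative code, not the original authors', but it implements the equidistant-letter-sequence method as described: strip the text down to its letters, then try every starting position and every skip distance.

```python
def els_search(text, word, max_skip=50):
    """Find `word` as an equidistant letter sequence in `text`:
    letters at positions start, start + skip, start + 2*skip, ...
    Returns every (start, skip) pair where the word appears."""
    letters = [c for c in text.lower() if c.isalpha()]
    w = word.lower()
    hits = []
    for skip in range(1, max_skip + 1):
        span = (len(w) - 1) * skip  # distance covered by the sequence
        for start in range(len(letters) - span):
            if all(letters[start + i * skip] == w[i] for i in range(len(w))):
                hits.append((start, skip))
    return hits

# Even a seven-letter text hides "ace" at skip 2:
print(els_search("abcdefg", "ace"))  # [(0, 2)]
```

Every combination of start and skip is another implicit test; run this over a few hundred thousand letters and short "prophetic" words become unavoidable.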

• • •

Clusters in the Void

Picture a map of a county, dotted with points that represent, say, houses where someone was diagnosed with a rare cancer. You look at the map and your eye immediately finds a cluster: a suspicious grouping of cases near, let's say, the old chemical plant on the east side.

But here's the thing about randomness that human brains are spectacularly bad at understanding: random points always form clusters.

Think of it this way. If I asked you to place 100 dots "randomly" on a piece of paper, you'd probably space them out more or less evenly. But that's not random — that's regular. Truly random placement means some dots will land near each other just by chance. Some areas will be dense, others sparse. The result looks clustered even when the process is perfectly uniform.5

This is the birthday problem in geographic space. In a room of 23 people, there's a 50% chance two share a birthday — not because of some cosmic conspiracy, but because there are 253 possible pairs to check. On a map, every small sub-region is a possible "cluster" to notice. Your visual cortex is a cluster-finding machine. It can't help itself.

Can You Spot the Difference?

Imagine two maps of dots: one purely random, the other with genuine clusters baked in. Most people guess wrong about which is which, and the reason is worth spelling out. Regular spacing isn't random. Random spacing produces apparent clusters. Actual clusters are tighter and more extreme than the eye expects.
• • •

Green Jelly Beans Cause Acne

In 2011, Randall Munroe published an XKCD comic that became the single best illustration of multiple testing I've ever seen.6 The setup: scientists test whether jelly beans cause acne. They find no link. "But what about a specific color?" So they test all 20 colors, one by one. Purple: no link. Brown: no link. Pink: no link. And so on, through 19 colors. Then they test green — and find p < 0.05. Headlines the next day: "Green Jelly Beans Linked to Acne! (p < 0.05)"

The comic is funny because it's exactly what happens. The 19 negative results vanish. The one positive result — the one you'd expect to get by chance alone — is the one that gets published. This is the file drawer problem: negative results go in the file drawer; positive results go in the journal.

If you torture the data long enough, it will confess to anything.
• • •

The Replication Crisis

In 2015, the Open Science Collaboration tried to replicate 100 published psychology studies. The results were grim: only 36% replicated.7 Not 36% of studies done by bad researchers, or 36% of studies in predatory journals — 36% of studies published in top-tier psychology journals. The same crisis has since been documented in medicine, economics, and cancer biology.

The Texas Sharpshooter is a major reason why. Not the only one — there are also small sample sizes, flexible analysis pipelines, and outright fraud — but the pattern is consistent. Researchers collect rich datasets with many variables, test numerous relationships, and report the ones that clear the significance bar. They're not lying. They often don't even realize they're doing it. The human brain is wired to find patterns, and the scientific incentive structure rewards discoveries over null results.

The Garden of Forking Paths

Statistician Andrew Gelman calls this the "garden of forking paths" — at every stage of analysis, researchers make choices (how to code variables, which outliers to exclude, which subgroups to examine) that multiply the number of implicit tests being run, even if only one final test is reported. You don't need to consciously p-hack; the forking paths lead you there naturally.

• • •

How to Shoot Straight

The good news: we know how to fix this. The fixes are not complicated. They just require discipline.

1. Pre-registration

Before you collect data, publicly declare what you're going to test and how. Write down your hypothesis, your analysis plan, your significance threshold. Post it on a registry where it gets timestamped. Now nobody — including you — can paint the bullseye after. This is standard practice in clinical trials and increasingly common across the sciences.

2. The Bonferroni Correction

If you're testing n hypotheses, use α/n as your threshold instead of α. Testing 20 things? Your bar for significance is p < 0.0025, not 0.05. It's conservative — some real effects will be missed — but it's honest.

Bonferroni-corrected threshold:

α_corrected = α / n = 0.05 / 20 = 0.0025
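To see what the correction buys you, compare the family-wise error rate (the chance of at least one false positive) before and after dividing α by n. A quick stdlib sketch; the function name is mine:

```python
def family_wise_error(alpha, n_tests):
    """Probability of at least one false positive across
    n independent tests, each run at threshold alpha."""
    return 1 - (1 - alpha) ** n_tests

print(family_wise_error(0.05, 20))       # ~0.64: nearly two-in-three
print(family_wise_error(0.05 / 20, 20))  # ~0.049: back under 0.05
```

One division takes the family-wise error rate from 64% back below the 5% you thought you were paying for.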

3. False Discovery Rate (Benjamini-Hochberg)

A more sophisticated approach: instead of controlling the probability of any false positive, control the proportion of false positives among your discoveries. The Benjamini-Hochberg procedure ranks your p-values from smallest to largest and applies an escalating threshold. It's less conservative than Bonferroni but still controls the rate of false discoveries.8
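Here is a minimal implementation of the step-up procedure just described (my own sketch, not code from the original paper): sort the p-values, find the largest rank k whose p-value sits at or below (k/m)·q, and reject every hypothesis ranked at or below k.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure. Returns a boolean list,
    True where the hypothesis is rejected (a 'discovery'), controlling
    the expected false discovery rate at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k with p_(k) <= (k / m) * q ...
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= (rank / m) * q:
            k_max = rank
    # ... then reject every hypothesis ranked at or below k.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_max
    return reject

# With q = 0.05, three of these four p-values survive; Bonferroni
# (threshold 0.05 / 4 = 0.0125) would have kept only the first.
print(benjamini_hochberg([0.01, 0.02, 0.03, 0.2]))  # [True, True, True, False]
```

Note the "step-up" character: a p-value that fails its own rank's threshold can still be rejected if a larger-ranked p-value passes, which is what makes the procedure less conservative than Bonferroni.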

4. Exploratory vs. Confirmatory

Perhaps the most important fix is cultural. There's nothing wrong with exploring data and finding unexpected patterns — that's how science generates hypotheses. But exploratory findings must be labeled as such. They need to be confirmed in a new dataset, with a pre-registered analysis, before anyone should believe them. The Texas Sharpshooter doesn't cheat by noticing the cluster. He cheats by claiming he was aiming there all along.

Exploratory: look at data → find pattern. "Hmm, that's interesting..." ⚠ Hypothesis-generating only.
Confirmatory: pre-register → collect new data. "Let's test that prediction." ✓ Evidence for real effects.

The same pattern means very different things depending on whether you predicted it or discovered it.
• • •

The Barn Is Everywhere

The Texas Sharpshooter is not an obscure fallacy for methodologists to fret about. It's an everyday hazard of thinking in a data-rich world. When you read that a study found that people born in October are more likely to be CEOs, or that living near power lines causes leukemia, or that some obscure dietary supplement reduces inflammation by 12% — ask yourself: was the target painted before or after?

How many months did they check before they found October? How many environmental exposures were tested before power lines came up? How many outcomes were measured before inflammation was the one that cleared the bar?

Ellenberg puts it beautifully: the question isn't whether a result is surprising. The question is whether it would be surprising if nothing were going on. And in a world of big data and small p-values, the answer is usually no. The sharpshooter never misses because he never aims. Don't be the neighbor who's impressed by the bullseye without checking who painted it.

Notes & References

  1. The term "p-hacking" was popularized by Simmons, Nelson, and Simonsohn in "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis," Psychological Science 22, no. 11 (2011): 1359–66.
  2. This is related to publication bias, extensively documented by Ioannidis in "Why Most Published Research Findings Are False," PLOS Medicine 2, no. 8 (2005): e124.
  3. Michael Drosnin, The Bible Code (New York: Simon & Schuster, 1997). The initial statistical claim was based on Witztum, Rips, and Rosenberg, "Equidistant Letter Sequences in the Book of Genesis," Statistical Science 9, no. 3 (1994): 429–38.
  4. Brendan McKay, Dror Bar-Natan, Maya Bar-Hillel, and Gil Kalai, "Solving the Bible Code Puzzle," Statistical Science 14, no. 2 (1999): 150–73.
  5. For a thorough treatment of spatial randomness and clustering, see Brian Ripley, Spatial Statistics (Hoboken: Wiley, 1981).
  6. Randall Munroe, "Significant," XKCD #882, xkcd.com/882.
  7. Open Science Collaboration, "Estimating the Reproducibility of Psychological Science," Science 349, no. 6251 (2015): aac4716.
  8. Yoav Benjamini and Yosef Hochberg, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Series B 57, no. 1 (1995): 289–300.