The Missing Chapter

The Statistics War That Won't End

Same data. Different philosophies. Opposite conclusions.

An extension of Jordan Ellenberg's "How Not to Be Wrong"

Chapter 65

A Coin Walks Into a Lab

You flip a coin ten times. It comes up heads seven times. Is the coin biased?

This sounds like a question with an answer. It's a coin. You flipped it. You counted. Math should be able to handle this, right?

Here's the thing: math can handle this. It just handles it in two completely different ways, and those two ways have been at war with each other for about a century. The combatants have names—"frequentists" and "Bayesians"—and if you think those sound like rival factions in a particularly nerdy dystopian novel, you're not far off. People have lost jobs over this. Careers have been derailed. Textbooks have been written with the explicit purpose of proving the other side wrong.1

Let's start with what the frequentist says about your coin.

The frequentist sets up a null hypothesis: the coin is fair (probability of heads = 0.5). Then they ask: if the coin really were fair, how surprising would it be to see 7 or more heads in 10 flips? They compute a p-value. For this particular experiment, the p-value turns out to be about 0.17—meaning that a fair coin would produce a result at least this extreme about 17% of the time.2

By the conventional threshold of α = 0.05 (the magic number that Fisher pulled roughly out of the ether in 1925), 0.17 is not "statistically significant." The frequentist verdict: fail to reject the null hypothesis. There's not enough evidence to declare the coin unfair.

Note the careful language. The frequentist doesn't say "the coin is fair." They say they can't prove it's unfair. This is like a jury saying "not guilty"—it doesn't mean innocent, just not proven. If this strikes you as lawyerly and unsatisfying, congratulations: you've identified why Bayesians exist.
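That 0.17 is nothing more than a binomial tail sum, and you can check it in a few lines (a quick sketch; the one-tailed/two-tailed distinction is spelled out in note 2):

```python
from math import comb

def tail_prob(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more heads in n flips."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

one_tailed = tail_prob(10, 7)   # P(7 or more heads in 10 fair flips) = 0.171875
two_tailed = 2 * one_tailed     # by symmetry of the fair coin: 0.34375
```

Whether you report 0.17 or 0.34 depends on whether you ask about "7 or more heads" or "a result at least this lopsided in either direction"—another choice the two camps argue about.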

· · ·

Now Let the Bayesian Speak

The Bayesian approaches the same problem from the opposite direction. Instead of asking "how surprising is this data if the coin is fair?", they ask: "given this data, what should I believe about the coin?"

But here's the key move—and the one that makes frequentists break out in hives—the Bayesian starts with a prior belief. Before seeing any data, what did you think about this coin?

If you picked this coin from your pocket—a mass-produced quarter from the U.S. Mint—you probably had a pretty strong prior that it's fair. You've flipped thousands of coins in your life, and they've all been basically fair. This one's probably fair too. Seven heads out of ten? Meh. Coins do weird things sometimes. Your posterior belief (prior + data) is: still probably fair.

But what if I handed you the coin and said "I got this from a magic shop"? Now your prior is different. You're open to the possibility that it's weighted. Seven heads out of ten? That's starting to look like confirmation. Your posterior belief shifts: probably biased toward heads.
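The pocket-coin versus magic-shop split is easy to make concrete with a conjugate Beta prior (a minimal sketch; the particular prior strengths are illustrative choices, not anything canonical):

```python
def posterior_mean(heads, tails, a, b):
    """Beta(a, b) prior + binomial data -> Beta(a+heads, b+tails) posterior; return its mean."""
    return (a + heads) / (a + heads + b + tails)

# Same data: 7 heads, 3 tails. Two different priors:
magic_shop = posterior_mean(7, 3, 1, 1)     # uniform prior: mean = 8/12, about 0.667
pocket_coin = posterior_mean(7, 3, 50, 50)  # strong "fair" prior: mean = 57/110, about 0.518
```

With the uniform prior the posterior tracks the data; with the strong prior, ten flips barely move it off 0.5.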

Same coin. Same flips. Same data. Different conclusions.

And here, in a nutshell, is the war.

FREQUENTIST: Probability is long-run frequency. "The coin has a fixed, unknown bias. We test hypotheses about it." p-value = 0.17 → not significant → fail to reject H₀.

BAYESIAN: Probability is degree of belief. "My uncertainty about the bias. I update beliefs with data." The posterior depends on your prior: strong prior → still fair; weak prior → probably biased.

Same data: 7 heads / 10 flips.
Two frameworks, one dataset, a century of argument.
· · ·

The Philosophical Chasm

The disagreement isn't about math. Both sides can do calculus. The disagreement is about what the word "probability" means.

For a frequentist, probability is a physical property of the world. When we say "this coin has a 50% chance of landing heads," we mean: if you flipped it infinitely many times, half the results would be heads. Probability is out there, in the coin, waiting to be measured. You don't get to have an opinion about it any more than you get to have an opinion about the mass of an electron.3

For a Bayesian, probability lives in your head. It's a measure of uncertainty—of how much you know or don't know. Saying "I think there's a 70% chance it rains tomorrow" doesn't mean that 70% of identical tomorrows will have rain. It means: given everything I know right now, my confidence in rain is 70%. It's a statement about you, not about the sky.4

This might sound like philosophy-department hairsplitting. It is not. These two interpretations lead to genuinely different methods, and those methods can give genuinely different answers. The coin problem was a gentle example. In medicine, criminal justice, and climate science, the stakes get higher and the disagreements get uglier.

A new drug shows a statistically significant benefit in a trial of 50 patients (p = 0.03). Should you prescribe it? The frequentist says: significant result, effect is real. The Bayesian says: wait—what's the prior probability this drug works? Most candidate drugs fail. If only 10% of drugs at this stage work, a p = 0.03 result still leaves roughly a 50% chance the result is a false positive. Same data. Same p-value. Wildly different conclusions about whether to give patients this drug.
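The arithmetic behind that "roughly 50%" is worth seeing. It hinges on an assumed statistical power, which a 50-patient trial keeps low; the 30% figure below is an illustrative guess, not a number from the text:

```python
prior = 0.10   # fraction of candidate drugs at this stage that actually work (from the text)
power = 0.30   # ASSUMED: chance a working drug produces a significant result in a small trial
alpha = 0.03   # significance threshold matched to the observed p-value

true_positives = prior * power           # working drugs that test significant: 0.030
false_positives = (1 - prior) * alpha    # failed drugs that test significant anyway: 0.027
false_discovery = false_positives / (true_positives + false_positives)
print(round(false_discovery, 2))         # -> 0.47: nearly a coin flip
```

Nudge the assumed power up or down and the false-discovery rate moves, but with a 10% prior it stays uncomfortably large either way.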

· · ·

A Brief History of Hostilities

The irony of the frequentist-Bayesian war is that Bayes came first. Thomas Bayes, a Presbyterian minister, worked out his famous theorem in the 1740s. Pierre-Simon Laplace independently developed it further and used it for everything from astronomical measurements to courtroom probability. For a century, Bayesian reasoning was just how you did statistics.

Then came the frequentist revolution. In the early 20th century, Ronald Fisher—brilliant, combative, and spectacularly petty—developed the framework of significance testing, p-values, and maximum likelihood that would come to dominate science. Fisher didn't just disagree with Bayesian methods; he found them morally offensive. The idea that scientists should inject subjective prior beliefs into their analysis struck him as corruption of the scientific method.5

Meanwhile, Jerzy Neyman and Egon Pearson developed a competing frequentist framework—hypothesis testing with Type I and Type II errors—that Fisher also hated, despite it being, you know, frequentist. Fisher thought Neyman-Pearson testing was too rigid, too industrial, too focused on decision-making rather than scientific inference. The frequentist camp couldn't even agree with itself.

What actually gets taught in Stats 101 today is a weird Frankenstein hybrid of Fisher and Neyman-Pearson that neither would have endorsed. You compute Fisher's p-value and then compare it to Neyman-Pearson's α threshold, which is like combining the worst features of two different legal systems and calling it justice.6

Timeline: 1763 Bayes' theorem · 1812 Laplace · 1925 Fisher's p-values · 1933 Neyman-Pearson · 1939 Jeffreys' paradox · 1990s MCMC revolution. (Bayesian milestones and frequentist milestones interleaved.)
Bayes was winning until Fisher showed up with a chip on his shoulder.

For most of the 20th century, Bayesian methods were essentially banished from mainstream statistics. Using a prior was seen as unscientific—you might as well read tea leaves. The few Bayesian holdouts, like Harold Jeffreys and Dennis Lindley, were treated as eccentrics at best and cranks at worst.

Then computers happened. Bayesian methods require solving integrals that are often mathematically intractable—you can write down the formula, but you can't compute the answer by hand. The development of Markov Chain Monte Carlo (MCMC) methods in the 1990s changed everything. Suddenly, you could compute Bayesian posteriors for complex models just by letting a computer run for a while. The Bayesians came roaring back.7
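"Letting a computer run for a while" is less mysterious than it sounds. Here is a toy Metropolis sampler for the coin's posterior under a uniform prior—a sketch of the idea, not production MCMC:

```python
import math
import random

def log_posterior(theta, heads=7, tails=3):
    """Log of the unnormalized posterior: uniform prior times binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

random.seed(42)
theta, samples = 0.5, []
for _ in range(60_000):
    proposal = theta + random.gauss(0.0, 0.1)              # random-walk proposal
    accept = log_posterior(proposal) - log_posterior(theta)
    if accept >= 0 or random.random() < math.exp(accept):  # Metropolis accept/reject
        theta = proposal
    samples.append(theta)

kept = samples[10_000:]              # discard burn-in
est = sum(kept) / len(kept)          # close to 2/3, the exact Beta(8, 4) posterior mean
```

For a one-parameter coin this is overkill—the posterior has a closed form—but the same loop, unchanged in spirit, is what cracked open models with thousands of parameters.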

· · ·

Try It Yourself: Same Data, Two Answers

The best way to feel the difference between these two worldviews is to play with them. Below, you can set up a coin-flipping experiment and see what each framework concludes. The key thing to watch: when you change the prior, the Bayesian answer moves. The frequentist answer doesn't budge. It doesn't care what you believed before you flipped.

The Same Data, Two Answers

Set the observed coin flips and your prior belief. Watch how Bayesian and frequentist conclusions diverge.

[Interactive demo. A prior-belief slider runs from "No opinion (uniform)" to "Very strong prior: fair." At the moderate setting (β = 10), the readouts are: Frequentist, one-tailed p-value 0.172; Bayesian, posterior mean for P(heads) 0.583, with 95% credible interval [0.39, 0.76]. Blue curve: prior. Red curve: posterior. Dashed line: θ = 0.5 (fair coin).]

Play with the prior slider. At the far left (uniform prior, β = 1), the Bayesian just follows the data: 7/10 heads means the posterior peaks around 0.7. But slide it right—give yourself a strong prior that the coin is fair—and the posterior stubbornly clusters near 0.5, barely budged by a mere 10 flips. Meanwhile, the frequentist p-value sits there, motionless, completely indifferent to your beliefs.
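The slider is just conjugate updating again. Here is the sweep it performs, assuming a symmetric Beta(β, β) prior (the β values are illustrative):

```python
def mean_with_prior(beta, heads=7, tails=3):
    """Posterior mean for P(heads) under a symmetric Beta(beta, beta) prior."""
    return (beta + heads) / (2 * beta + heads + tails)

sweep = {b: mean_with_prior(b) for b in (1, 10, 100)}
# beta=1   -> 8/12  (about 0.667): follows the data
# beta=10  -> 17/30 (about 0.567): splits the difference
# beta=100 -> 107/210 (about 0.510): ten flips can't beat a strong prior
```

As β grows, the posterior mean slides from the data's 0.7 toward the prior's 0.5, which is exactly the stubbornness you see on screen.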

This is the heart of the disagreement. The frequentist says: "Your prior beliefs are your problem, not mine." The Bayesian says: "Ignoring prior knowledge isn't objectivity—it's willful ignorance."

· · ·

The Paradox That Proves Everyone's Point

If you want to see the frequentist-Bayesian divide at its most dramatic, you need the Jeffreys-Lindley paradox. It's one of the most important results in the philosophy of statistics, and it goes like this:

Suppose you're testing whether a coin is exactly fair (θ = 0.5) versus the alternative that it's not. You flip the coin many, many times, and you observe a very slight excess of heads—say, 51% heads.

With a small sample (100 flips, 51 heads), neither framework gets excited. The frequentist p-value is large (not significant), and the Bayesian posterior still favors the null.

But now increase the sample size. With 10,000 flips and 5,100 heads (still 51%), the frequentist p-value drops below 0.05. Significant! The frequentist verdict: this coin is not fair.

The Bayesian, meanwhile, barely flinches. A reasonable Bayesian who assigns some prior probability to the coin being exactly fair (say, 50%) finds that at 10,000 flips, right when the frequentist crosses into significance, the posterior probability of fairness is still above 90%. The data is showing that the bias, if any, is tiny, and a tiny bias spread across a huge sample is explained almost as well by a fair coin as by anything else. In the classic form of the paradox, where the p-value is pinned at exactly 0.05 while the sample grows, the posterior probability of fairness doesn't just survive: it climbs toward 1.8

More data. Frequentist rejection. Bayesian acceptance. Same data, opposite conclusions, and with the p-value held at the significance boundary, the gap widens as the evidence piles up.

"With enough data, frequentist and Bayesian methods don't just disagree—they disagree more."

The Jeffreys-Lindley Paradox

Keep the proportion of heads fixed at 51%. Increase the sample size and watch the two frameworks diverge—then fly apart.

[Interactive demo. A sample-size slider (shown at n = 1,000), with two readouts: the frequentist two-tailed p-value and the Bayesian P(coin is fair | data). Blue: −log₁₀(p-value), higher = more significant. Red: P(fair | data). They diverge.]

The paradox cuts to the core of the philosophical divide. The frequentist is answering: "Could pure chance produce this pattern?" As your sample grows, even tiny deviations from perfect fairness become statistically detectable, so the answer becomes a resounding "no." But the Bayesian is answering: "Is the 'exactly fair' hypothesis a good explanation of the data?" And a tiny deviation spread across a huge sample is, in fact, very well explained by a fair coin: better explained than by the alternative hypothesis once you average over all the biases that alternative allows.

Neither side is making a mathematical error. They're answering different questions. And that, when you strip away the century of rhetoric, is what the whole war is really about.

· · ·

Machine Learning's Dirty Secret

Here's an irony that would make Fisher spin in his grave: the most powerful prediction machines ever built—deep neural networks, random forests, the algorithms that drive self-driving cars and generate AI art—are quietly, pervasively Bayesian.

When a machine learning engineer adds "regularization" to a model (and they always do), they're imposing a prior. L2 regularization—the most common kind—is mathematically identical to putting a Gaussian prior on the model's parameters. When they do "MAP estimation" (maximum a posteriori), it's right there in the name: a posteriori. When they use "dropout" during training, they're approximately doing Bayesian inference over an ensemble of models.
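That L2-equals-Gaussian-prior equivalence is checkable in a few lines. For a one-parameter model y ≈ θx, the ridge estimate and the MAP estimate under a Gaussian prior coincide exactly when the penalty is λ = σ²/τ² (the data and variances below are made-up toy numbers):

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.3, 2.8, 4.4]        # made-up observations for y = theta * x + noise
sigma2 = 1.0                     # assumed noise variance
tau2 = 0.5                       # variance of the N(0, tau2) prior on theta
lam = sigma2 / tau2              # the equivalent L2 penalty

sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

ridge = sxy / (sxx + lam)                              # argmin of sum((y - theta*x)^2) + lam*theta^2
map_est = (sxy / sigma2) / (sxx / sigma2 + 1 / tau2)   # posterior mode under the Gaussian prior
assert abs(ridge - map_est) < 1e-12                    # same number, two vocabularies
```

The "regularization strength" the engineer tunes by cross-validation is, in Bayesian dress, a statement about how tightly they believe the parameters cluster around zero.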

The whole field is Bayesian in practice while being frequentist in its vocabulary. People talk about "test set accuracy" (a frequency) while building models that are drenched in priors. It's like someone who insists they don't believe in ghosts while carrying around a bag of salt and holy water. The methods work too well to abandon, but the language hasn't caught up.

The Pragmatist's View

Modern statisticians increasingly reject the tribal warfare. The pragmatic position: use frequentist methods when you have lots of data and weak prior knowledge (nobody trusts your priors anyway). Use Bayesian methods when you have small samples and genuine prior knowledge to incorporate (clinical trials building on previous studies, for instance). The question isn't "which philosophy is true?"—it's "which tool works here?"

· · ·

What Ellenberg Would Say

The lesson, as always with math, is that the question matters as much as the answer. "Is this coin biased?" sounds like one question. It's actually two:

Frequentist version: "Is there enough evidence to rule out fairness?" This is a question about the data.

Bayesian version: "Given everything I know, what should I believe about this coin?" This is a question about me.

Both questions are legitimate. Both have rigorous mathematical frameworks for answering them. The mistake—the way to be wrong—is to think you're asking one when you're actually asking the other. Or worse: to not realize there are two questions at all.

The coin doesn't care about your philosophical commitments. It just lands.

[Cartoon: a coin hangs in midair, its bias labeled θ = ???. The frequentist's caption reads "p = 0.17"; the Bayesian's reads "Depends on your prior!"]
The coin doesn't pick a side.

So the next time someone shows you a p-value and declares victory—or a posterior distribution and declares certainty—ask them: what question were you answering? What assumptions went in? And would the other side agree?

If the answer is no, you haven't learned the truth about the data. You've learned something about the person who analyzed it. Which, come to think of it, might be more useful.

Notes & References

  1. The most famous feud in statistics history is probably Fisher vs. Neyman, who went from colleagues to bitter enemies. Fisher once described Neyman-Pearson theory as "worse than useless" and suitable only for "Russians." Neyman, for his part, called Fisher's fiducial inference "a beautiful error." See Lehmann, E.L., Fisher, Neyman, and the Creation of Classical Statistics (2011).
  2. Technically, the two-tailed p-value for 7 heads in 10 flips under H₀: θ = 0.5 is P(X ≥ 7) + P(X ≤ 3) where X ~ Binomial(10, 0.5). This equals 2 × 0.1719 = 0.3438 for the symmetric two-tailed test, or about 0.17 for the one-tailed version. The exact value depends on whether you use a one-tailed or two-tailed test—another source of endless arguments.
  3. This is the "frequentist" interpretation, formalized by Richard von Mises in his 1928 book Probability, Statistics and Truth. Von Mises defined probability as the limit of relative frequency in an infinite sequence of trials.
  4. The subjective Bayesian interpretation was most rigorously developed by Bruno de Finetti (Theory of Probability, 1974) and L.J. Savage (The Foundations of Statistics, 1954). De Finetti's famous declaration: "Probability does not exist"—meaning objective probability doesn't exist; only subjective degrees of belief.
  5. Fisher, R.A., "Statistical Methods and Scientific Inference" (1956). Fisher was particularly incensed by the use of "uninformative priors," which he saw as smuggling in assumptions while pretending to be objective.
  6. Gigerenzer, G., "Mindless Statistics," Journal of Socio-Economics 33 (2004): 587–606. Gigerenzer calls the hybrid "a mishmash" and documents how statistics textbooks routinely conflate Fisher's and Neyman-Pearson's incompatible frameworks.
  7. The key breakthrough was the Metropolis-Hastings algorithm (originally Metropolis et al., 1953, generalized by Hastings, 1970) and its application to Bayesian statistics via Gibbs sampling (Geman & Geman, 1984; Gelfand & Smith, 1990). The software package BUGS (Bayesian inference Using Gibbs Sampling), released in 1989, made these methods accessible to non-specialists.
  8. Lindley, D.V., "A Statistical Paradox," Biometrika 44 (1957): 187–192. Robert, C.P., "The Jeffreys-Lindley Paradox Revisited" (2014). The paradox requires the Bayesian to assign positive prior probability to the point null θ = 0.5, which not all Bayesians accept—but it's the standard setup for hypothesis testing.