The Missing Chapter

The Confidence Trap

What confidence intervals are, what they aren't, and why 97% of scientists get them wrong

An extension of Jordan Ellenberg's "How Not to Be Wrong"

Chapter 68

A Poll, a President, and a Problem

Here's a sentence you've read a thousand times: "The president's approval rating is 52% ± 3%, with 95% confidence." You probably think you know what it means. There's a 95% chance the true approval is between 49% and 55%. That's what it sounds like it means. That's what your gut says. That's what most newspaper readers believe, what most college students learn, and what—brace yourself—97% of academic researchers think.1

It's wrong.

A 95% confidence interval does not mean there's a 95% probability that the true value lies within it. The true value is either in there or it isn't. There is no probability about it.

If that makes your head hurt, good. You're paying attention. The confidence interval is one of the most widely used—and widely misunderstood—tools in all of science. It adorns nearly every published study, every poll, every clinical trial. And almost nobody, including the people who compute them, can correctly explain what they are.

So let's explain.

The Machine, Not the Output

Imagine a factory that produces rulers. Each ruler is supposed to be 30 centimeters long, but the manufacturing process has some variability. Most rulers are close to 30cm, but not exactly 30cm. Now imagine a quality inspector who takes one ruler off the assembly line and measures it: 30.2cm.

Here's the question that feels natural: "What's the probability that the true ruler length is between 29.8 and 30.6 centimeters?" But here's the thing—the true intended length is 30 centimeters. It just is. There's no probability about it. It's a fixed number, like the number of moons orbiting Jupiter. The randomness isn't in the parameter. It's in the measurement.

And that—right there—is the whole point.

A confidence interval is a statement about the procedure, not about any particular result. When we say "95% confidence interval," we mean: if we ran this entire experiment over and over—new sample, new data, new interval—then 95% of the intervals we'd generate would contain the true value.2 The confidence is in the machine, not the particular ruler it spit out.

"Confidence" is a property of the recipe, not the cake.

This is Jerzy Neyman's great insight, published in 1937,3 and it's one of those ideas that's simultaneously brilliant and maddening. Neyman, a Polish mathematician working in London, realized you could construct interval estimates with a guaranteed long-run coverage rate without ever having to say what you "believe" about the parameter. No priors, no subjective probability, no Bayesian machinery. Just a promise: use this procedure and you'll be right 95% of the time.

The trouble is that what people want is the other thing. They want to say: given this particular data, what can I conclude about the truth? They want to reason from the specific to the general. And a confidence interval, strictly speaking, refuses to do that.

Eight sample intervals plotted against the true value μ: each sample produces a different interval. Most capture the true value (green). Some miss (red). "95% confidence" means about 95 out of 100 will capture it.
···

See It for Yourself

The best way to understand a confidence interval is to watch the procedure work. Below, we have a population with a known true mean (the dashed red line). We'll draw 100 random samples, compute a 95% confidence interval for each, and stack them up. Green intervals captured the truth. Red ones missed. If the math is right—and it is—about 95 should be green and about 5 should be red.

Hit the button. Watch the intervals pile up. That's what "95% confidence" means.

CI Coverage Simulator

Each horizontal line is a 95% CI from a new random sample. Green = contains the true mean. Red = miss. Counters track the number captured, the number missed, and the running coverage rate.
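The simulator's logic fits in a few lines of Python. This is a minimal sketch, not the page's actual code; the function name `coverage_experiment` and all the parameter values are invented for illustration, and it uses the normal critical value 1.96 rather than the exact t quantile:

```python
import numpy as np

def coverage_experiment(n_intervals=1000, n=50, true_mean=100.0,
                        sigma=15.0, z=1.96, seed=0):
    """Draw many samples, build a 95% CI from each, and report
    how often those intervals actually capture the true mean."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_intervals):
        sample = rng.normal(true_mean, sigma, size=n)
        se = sample.std(ddof=1) / np.sqrt(n)  # standard error of the mean
        lo, hi = sample.mean() - z * se, sample.mean() + z * se
        hits += lo <= true_mean <= hi         # did this interval capture mu?
    return hits / n_intervals

print(coverage_experiment())  # hovers close to 0.95 — the machine's track record
```

No single run of the loop "knows" whether its interval captured the truth; only the long-run tally does. That is the entire content of the word "confidence."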

Notice something? The coverage rate hovers around whatever confidence level you chose. That's not a coincidence—it's the definition. Change the confidence level to 80% and suddenly about 20 intervals miss. Crank it to 99% and the intervals get wider (more cautious) but almost all of them capture the truth.

This is the profound boringness of confidence intervals: they're a calibrated procedure. Nothing more, nothing less. The magic number—95%, or whatever you pick—describes the machine's track record, not your certainty about any single output.

···

Why the Confusion Matters

You might be thinking: who cares? If 95% of intervals contain the truth, then any particular interval probably contains the truth too, right? In casual usage, sure, that's roughly fine. But the distinction matters enormously when it collides with other ideas—especially in the courtroom, in medicine, and in the replication crisis.

Imagine a forensic scientist testifies: "We computed a 95% confidence interval for the defendant's blood alcohol level, and 0.08 is not in it." The jury hears: "There's a 95% probability the defendant wasn't over the limit." But that's not what the interval says. It says: "If we tested the defendant's blood many times with this method, 95% of the resulting intervals would contain the true BAC." Whether this particular interval contains it? The confidence interval framework literally cannot answer that question.

In 2014, Rink Hoekstra and colleagues put this to the test.1 They gave six statements about confidence intervals to 120 researchers, 442 first-year students, and 34 master's students. Every statement was false. Yet a staggering 97% of researchers endorsed at least one of them—most popularly, "There is a 95% probability that the true mean lies between the bounds." The very people who compute these things professionally got them wrong.

This isn't a failure of intelligence. It's a failure of design. The confidence interval framework gives you something that looks like a probability statement about a parameter but isn't. It's the statistical equivalent of a decoy.

Width ≠ Accuracy

Another common trap: people think a narrow confidence interval means your estimate is close to the truth. But width tells you about precision—how tightly clustered your estimates would be if you repeated the experiment—not accuracy—how close you are to the true value. You can have a beautifully narrow interval that's nowhere near the right answer if your measurement process is biased.4
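To watch width and accuracy come apart, simulate a very precise but miscalibrated instrument. A sketch—the true value, bias, and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 30.0   # the quantity we are trying to measure
bias = 0.5          # systematic miscalibration of the instrument
# Tiny noise (sd = 0.05), so every reading is nearly identical — but shifted.
measurements = rng.normal(true_value + bias, 0.05, size=200)

mean = measurements.mean()
se = measurements.std(ddof=1) / np.sqrt(len(measurements))
lo, hi = mean - 1.96 * se, mean + 1.96 * se

print(f"95% CI: ({lo:.3f}, {hi:.3f})")            # beautifully narrow...
print("contains truth?", lo <= true_value <= hi)  # ...and nowhere near 30.0
```

The interval's width reflects only the scatter of the readings. It carries no information about the bias, so the tight cluster lands confidently in the wrong place.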

Left: precise but not accurate (narrow CI, biased). Right: accurate but not precise (wide CI, unbiased). A narrow confidence interval (high precision) doesn't guarantee accuracy—systematic bias can make a tight cluster land far from the bullseye.

The p-Value Connection

Here's a fact that ties two confusing concepts together: a 95% confidence interval for a parameter excludes exactly those values that would have a p-value less than 0.05.5 If zero isn't in your 95% CI for a treatment effect, then the p-value for "no effect" is less than 0.05. They're two views of the same underlying math. So when people misunderstand confidence intervals, they typically misunderstand p-values too—and vice versa. It's a buy-one-get-one-free deal on statistical confusion.
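The duality is easy to verify numerically. Here's a sketch using the normal (z-test) approximation; the function name and the inputs are invented for illustration, and the equivalence is exact only up to rounding 1.96 versus the true 97.5th-percentile quantile:

```python
import math

def z_interval_and_p(xbar, se):
    """For a z-test of 'effect = 0': the 95% CI and the two-sided
    p-value are two views of the same computation."""
    lo, hi = xbar - 1.96 * se, xbar + 1.96 * se
    z = xbar / se
    # Two-sided p-value via the normal CDF, Phi(z) = 0.5*(1 + erf(z/sqrt(2)))
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return (lo, hi), p

(lo, hi), p = z_interval_and_p(xbar=2.1, se=1.0)
print((lo, hi), p)
# Zero outside the 95% CI  <=>  p < 0.05
print((lo > 0 or hi < 0) == (p < 0.05))
```

Shrink `xbar` toward zero and the CI widens to swallow zero at exactly the moment p crosses 0.05.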

···

What People Actually Want: The Bayesian Alternative

Here's the good news. The thing people think a confidence interval is? It exists. It's called a credible interval, and it's the Bayesian version of the same idea.6

A 95% Bayesian credible interval really does mean: "Given the data and my prior beliefs, there's a 95% probability the true parameter is in this range." That's it. No tortured frequentist philosophy, no "imagine repeating the experiment infinitely many times." Just a direct probability statement about where the parameter is.

The catch? You need a prior—an explicit statement of what you believed before seeing the data. And different priors give different answers. This is what makes frequentists nervous: subjectivity in the machinery. But it's also what makes credible intervals actually answer the question people are asking.

The beautiful thing is: with enough data, it doesn't matter. As the sample grows large, the credible interval and the confidence interval converge. The prior gets overwhelmed by the evidence. But with small samples—which is where most interesting science happens—they can diverge dramatically.
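You can watch the convergence happen in the simplest tractable case: normal data with known variance and a conjugate normal prior, where the posterior has a closed form. A minimal sketch—the prior, the true mean, and the use of 1.96 for both interval types are illustrative assumptions:

```python
import numpy as np

def both_intervals(n, prior_mean=0.0, prior_sd=1.0, true_mean=5.0,
                   sigma=2.0, seed=0):
    """Known-variance normal data, conjugate normal prior: compare the
    frequentist 95% CI with the Bayesian 95% credible interval."""
    rng = np.random.default_rng(seed)
    xbar = rng.normal(true_mean, sigma, size=n).mean()

    # Frequentist CI: centered on xbar, ignores the prior entirely.
    se = sigma / np.sqrt(n)
    ci = (xbar - 1.96 * se, xbar + 1.96 * se)

    # Conjugate posterior: a precision-weighted average of prior and data.
    prior_prec, data_prec = 1 / prior_sd**2, n / sigma**2
    post_mean = (prior_prec * prior_mean + data_prec * xbar) / (prior_prec + data_prec)
    post_sd = (prior_prec + data_prec) ** -0.5
    cred = (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)
    return ci, cred

def center_gap(ci, cred):
    return abs((ci[0] + ci[1]) / 2 - (cred[0] + cred[1]) / 2)

for n in (5, 30, 500):
    print(f"n = {n:3d}  gap between interval centers: "
          f"{center_gap(*both_intervals(n)):.3f}")
```

With a prior centered at 0 and a truth near 5, the credible interval is pulled noticeably toward 0 at n = 5; by n = 500 the data's precision dwarfs the prior's and the two intervals nearly coincide.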

At n = 5, a big gap between the frequentist CI and the Bayesian credible interval; at n = 30, a small gap; by n = 500, they are nearly the same. With small samples the two can differ substantially. As n grows, they converge—the prior is drowned out by data.

Try it yourself. The simulator below shows both intervals for the same data. Move the prior and watch what happens.

CI vs. Credible Interval

Compare the frequentist confidence interval (blue) with the Bayesian credible interval (gold). Adjust the prior mean and sample size to see how they diverge and converge.


Notice what happens when the sample is small and the prior is far from the truth: the credible interval gets pulled toward the prior. Is that a bug or a feature? Bayesians say feature—you're incorporating prior knowledge, and if your prior is bad, that's your fault, not the method's. Frequentists say bug—the procedure doesn't have guaranteed coverage. Both are right, in their own frameworks.

···

The Replication Crisis, by Design

Here's a thought that should keep you up at night. If every published study reports a 95% confidence interval, then by construction, 5% of those intervals miss the true value. That's not a bug. That's the definition. One in twenty intervals is a liar, and you have no way of knowing which ones.7

Now layer on publication bias—studies that find "significant" results are more likely to get published—and you've got a recipe for systematic overconfidence. The intervals that miss tend to miss in an interesting direction (they exclude zero, suggesting an effect exists when it might not). The boring intervals—wide, containing zero, saying "we can't tell"—end up in the file drawer.

This is part of why the replication crisis hit so hard. For decades, scientists were publishing confidence intervals and p-values that they—and their readers—misinterpreted. They thought each interval was a 95% probability cage around the truth. It wasn't. It was one output of a procedure that's wrong 5% of the time, filtered through a system that preferentially publishes the wrong ones.
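The filtering effect is stark in the extreme case where the true effect is exactly zero. Then "the interval excludes zero" and "the interval misses the truth" are the same event—so a publish-only-significant filter selects precisely the wrong intervals. A sketch, with all parameters invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
true_effect = 0.0        # the treatment does nothing
n_studies, n = 10_000, 40

published = misses = 0
for _ in range(n_studies):
    data = rng.normal(true_effect, 1.0, size=n)
    se = data.std(ddof=1) / np.sqrt(n)
    lo, hi = data.mean() - 1.96 * se, data.mean() + 1.96 * se
    significant = lo > 0 or hi < 0            # "an effect exists!"
    if significant:                           # only exciting results get published
        published += 1
        misses += not (lo <= true_effect <= hi)

print(f"published: {published} of {n_studies}")
print(f"published intervals that miss the truth: {misses / published:.0%}")  # 100%
```

About 5% of the studies clear the significance filter, and every single one of those intervals misses the true value. Real literatures aren't this pure—some true effects exist—but the direction of the distortion is exactly this.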

The Meta-Lesson

Statistical tools are only as good as our understanding of them. A confidence interval, correctly understood, is a beautiful thing—a calibrated, assumption-light guarantee about long-run performance. Misunderstood, it becomes a false sense of security, a license to be overconfident about uncertain things. The difference between good science and bad science often isn't better data or fancier methods. It's knowing what your methods actually claim.

So What Should We Do?

First: teach the distinction. Every statistics course should start with the Hoekstra quiz and let students discover they're wrong. Nothing cements understanding like confident wrongness followed by correction.

Second: consider Bayesian methods when you want Bayesian answers. If you want to say "there's a 95% probability the drug effect is between X and Y," you need a credible interval, not a confidence interval. And that's fine. Just be honest about your prior.

Third: report the interval, but think about the procedure. Any individual study might be one of the 5%. That's okay—as long as we remember to replicate, to aggregate, to not treat any single result as gospel.8

Neyman gave us something remarkable in 1937: a way to quantify uncertainty without subjective beliefs. But he also gave us something dangerous—a tool so easy to misinterpret that nearly a century later, the experts still get it wrong. The confidence interval doesn't need to be abandoned. It needs to be understood.

And now, maybe, you do.

Notes & References

  1. Hoekstra, R., Morey, R.D., Rouder, J.N., & Wagenmakers, E.-J. (2014). "Robust misinterpretation of confidence intervals." Psychonomic Bulletin & Review, 21(5), 1157–1164. The study found that 97% of researchers endorsed at least one wrong interpretation of CIs.
  2. This is sometimes called the "repeated sampling" or "frequentist" interpretation. The key idea is that probability attaches to the procedure, not to the parameter or any particular interval.
  3. Neyman, J. (1937). "Outline of a theory of statistical estimation based on the classical theory of probability." Philosophical Transactions of the Royal Society A, 236(767), 333–380.
  4. A classic example: measuring the speed of light with a miscalibrated instrument. You'll get very precise (narrow CI) but wildly inaccurate estimates. The CI tells you about the spread of your measurements, not whether they're centered on the truth.
  5. This duality between CIs and hypothesis tests is sometimes called the "test inversion" construction. A 95% CI contains exactly those null values you'd fail to reject at α = 0.05.
  6. The Bayesian credible interval uses Bayes' theorem to compute a posterior distribution over the parameter, then finds the interval containing 95% of that posterior probability. Unlike the CI, it really does say "given the data, I believe with 95% probability that…"
  7. This is related to Ioannidis, J.P.A. (2005). "Why most published research findings are false." PLoS Medicine, 2(8), e124. The combination of low prior probability, moderate power, and multiple testing means the false discovery rate is much higher than 5%.
  8. The movement toward pre-registration, replication studies, and meta-analysis can be seen as the field collectively remembering what "95% confidence" actually means: that 5% of the time, you're wrong.