
The Missing Chapter

The Mathematics of Suspicion

Why every alarm system is wrong — and why that's a feature, not a bug

An extension of Jordan Ellenberg's "How Not to Be Wrong"

Chapter 88

A Blip on the Screen

It's 1943, and you're a radar operator on the English coast. Your cathode-ray tube glows green. A blip appears. Is it a Luftwaffe bomber that will kill people in London in twenty minutes, or is it a flock of starlings heading home to roost?

You have to decide now. There is no "let me get back to you on this." If you scramble fighters against every blip, you'll exhaust your pilots chasing birds. If you ignore too many blips, bombers get through. Every single decision you make is a bet — and every bet has exactly four possible outcomes.

If there really is a plane and you call it, that's a hit. If there's a plane and you miss it, that's a miss. If it's birds and you call it a plane, that's a false alarm. And if it's birds and you correctly ignore it, that's a correct rejection. Mathematicians, being mathematicians, organized these into a tidy two-by-two table and called it signal detection theory.1

The thing that makes this interesting — the thing that makes it mathematics rather than just common sense — is that you can't make the hits go up without also making the false alarms go up. There is no free lunch at the radar station.

                Reality: Signal           Reality: Noise
Say "Signal"    HIT (true positive)       FALSE ALARM (false positive)
Say "Noise"     MISS (false negative)     CORRECT REJECTION (true negative)

Every detection decision lands in exactly one of these four boxes. Lowering the threshold increases both hits and false alarms.
The four outcomes of any detection decision. You can't improve one column without affecting the other.

Think of it this way. You have an internal dial — call it your threshold. Turn it down (be more suspicious, call more blips as planes) and you catch more real planes. Great! But you also chase more birds. Turn it up (be more relaxed, ignore ambiguous blips) and you stop wasting fighters on birds. Also great! But now some bombers sneak through.

This isn't a flaw in your radar. It isn't a flaw in you. It's a mathematical fact about the structure of the problem. The signal (planes) and the noise (birds) produce overlapping distributions of evidence. Sometimes birds make big blips. Sometimes planes make faint ones. Wherever you put your threshold, you're slicing through a region where signal and noise are intermingled, and some of each will land on the wrong side of the cut.2
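The no-free-lunch claim is easy to check numerically. This sketch draws hypothetical blip strengths from two overlapping Gaussians (the 1.5-standard-deviation separation is an illustrative choice, not a measured radar figure) and slides the threshold down: hits and false alarms rise together.

```python
import random

random.seed(0)

# Evidence under the equal-variance Gaussian model. The separation of
# 1.5 standard deviations (d' = 1.5) is illustrative, not a real radar figure.
noise = [random.gauss(0.0, 1.0) for _ in range(50_000)]   # birds only
signal = [random.gauss(1.5, 1.0) for _ in range(50_000)]  # plane present

def rate_above(xs, threshold):
    """Fraction of observations called 'signal' at this threshold."""
    return sum(x > threshold for x in xs) / len(xs)

# Slide the threshold down: hits AND false alarms both rise.
for threshold in (2.0, 1.0, 0.0):
    print(f"threshold={threshold:+.1f}  "
          f"hit rate={rate_above(signal, threshold):.2f}  "
          f"false alarm rate={rate_above(noise, threshold):.2f}")
```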

· · ·

The Dial and the Curve

The British radar operators of World War II didn't have the mathematics for this yet — that would come later, from psychophysicists and statisticians at the University of Michigan in the 1950s and from earlier work by radar engineers at MIT.3 But the key insight can be stated simply.

Imagine two bell curves. One represents the distribution of evidence you get when noise is present (birds, weather, electrical glitches). The other represents the distribution when signal is present (actual aircraft). The signal distribution is shifted to the right — on average, planes produce stronger blips than birds. But the two curves overlap. In that overlapping region, a particular blip strength is ambiguous: it could be either.

Your threshold is a vertical line you draw through these overlapping curves. Everything to the right of the line, you call "signal." Everything to the left, you call "noise." The hit rate is the proportion of the signal distribution that falls to the right of your line. The false alarm rate is the proportion of the noise distribution that falls to the right.

Now here's the beautiful part. As you slide that threshold from right to left — from conservative to liberal — you trace out a curve in hit-rate-vs-false-alarm-rate space. This curve is called the Receiver Operating Characteristic, or ROC curve, and it is one of the most important objects in all of applied mathematics.4
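Under the equal-variance Gaussian model, tracing the curve takes only a few lines: slide the criterion and record each (false alarm rate, hit rate) pair. The d′ value here is illustrative.

```python
from statistics import NormalDist

# Equal-variance Gaussian model: noise ~ N(0, 1), signal ~ N(d', 1).
d_prime = 1.5  # illustrative discriminability
Phi = NormalDist().cdf

def roc_point(criterion):
    """Hit rate and false alarm rate for one threshold setting."""
    false_alarm = 1 - Phi(criterion)             # noise mass right of the line
    hit = 1 - Phi(criterion - d_prime)           # signal mass right of the line
    return false_alarm, hit

# Slide the criterion from conservative (right) to liberal (left):
# the pairs trace the ROC curve from (0, 0) toward (1, 1).
curve = [roc_point(c / 10) for c in range(40, -41, -1)]
for fa, hit in curve[::20]:
    print(f"false alarm={fa:.3f}  hit={hit:.3f}")
```

For any positive d′, every point sits above the diagonal: at the same false alarm rate, the detector beats a coin flip.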

ROC Curve Builder

Two overlapping Gaussian distributions — signal and noise. Drag the threshold to see how hit rate and false alarm rate change. The ROC curve traces every possible tradeoff, with the resulting d′ and AUC displayed alongside.

The shape of the ROC curve tells you something profound: it tells you how good your detector is, completely independently of where you chose to put the threshold. A perfect detector — one where signal and noise don't overlap at all — gives an ROC curve that goes straight up the left edge and across the top. A useless detector — one that can't tell signal from noise at all — gives a diagonal line from (0,0) to (1,1). You might as well flip a coin.

The measure of this is called d′ (d-prime): the distance between the peaks of the signal and noise distributions, measured in standard deviations. A d′ of zero means the distributions are on top of each other — you're guessing. A d′ of 1 gives moderate discrimination. A d′ of 3 or more and you're rarely wrong. The area under the ROC curve (AUC) captures the same information in a single number between 0.5 (chance) and 1.0 (perfect).
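In the equal-variance Gaussian model, the two summaries are linked by a closed form: AUC = Φ(d′/√2), where Φ is the standard normal CDF. A quick sketch:

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF

def auc_from_d_prime(d_prime):
    """AUC for the equal-variance Gaussian model: Phi(d' / sqrt(2))."""
    return Phi(d_prime / sqrt(2))

# d' = 0 is chance (AUC = 0.5); by d' = 3 you're rarely wrong.
for d in (0.0, 1.0, 2.0, 3.0):
    print(f"d'={d:.0f}  AUC={auc_from_d_prime(d):.3f}")
```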

The Key Insight

d′ measures how discriminable signal is from noise. Your threshold determines how you trade off between the two types of errors. These are completely independent: d′ is a property of the detector, while the threshold is a choice you make. You can have a great detector and still set a terrible threshold — or a mediocre detector with a perfectly calibrated one.

· · ·

The Base Rate Strikes Back

Now here's where it gets really interesting — and where signal detection theory meets our old friend Bayes' theorem from Chapter 23.

Suppose you've built a magnificent radar system. It detects 95% of enemy planes (hit rate = 0.95, which doctors call "sensitivity"). And when there are no planes, it correctly stays quiet 95% of the time (correct rejection rate = 0.95, or "specificity"). This sounds amazing. Ninety-five percent accurate in both directions! Surely almost every alarm is real.

Not so fast. What if enemy planes are rare? Suppose only 1% of blips correspond to actual aircraft. Now watch what happens when you process 10,000 blips:

100 blips are actual planes. Your detector catches 95 of them (hits) and misses 5.

9,900 blips are noise. Your detector correctly ignores 9,405 of them — but it raises a false alarm for 495.

Total alarms: 95 + 495 = 590.

Real planes among alarms: 95 / 590 = 16.1%.

Your "95% accurate" detector is wrong five out of six times it goes off.

This is not a paradox, though it feels like one. It's pure Bayes. The prior probability of a plane was 1%. Even a good test doesn't move you as far from the prior as your intuition suggests, because the vast majority of the population is noise, and 5% of a vast number is still a lot of false alarms.5
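The arithmetic above fits in a few lines; the numbers are exactly those from the text.

```python
# The radar example, straight through Bayes' theorem.
sensitivity = 0.95   # P(alarm | plane)
specificity = 0.95   # P(quiet | no plane)
prevalence = 0.01    # 1% of blips are real planes
blips = 10_000

planes = prevalence * blips                           # 100 real planes
hits = sensitivity * planes                           # 95 caught
false_alarms = (1 - specificity) * (blips - planes)   # 5% of 9,900 = 495

ppv = hits / (hits + false_alarms)                    # P(plane | alarm)
print(f"{hits + false_alarms:.0f} alarms, of which {ppv:.1%} are real")
```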

10,000 Blips (1% prevalence, 95% sensitivity & specificity)
9,405 correct rejections · 495 false alarms · 95 hits · 5 misses
495 false alarms vs. 95 hits → 84% of alarms are wrong!
When the signal is rare, even an excellent test produces mostly false alarms. The blocks are roughly proportional.

This has enormous consequences. It means you can't evaluate a detection system by its sensitivity and specificity alone. You must also know the base rate — how common the thing you're looking for actually is.

· · ·

Mammograms, Spam Filters, and Reasonable Doubt

Signal detection theory was born in wartime radar, but its empire now spans everything from medicine to criminal law.

Medical screening. Mammograms for breast cancer have a sensitivity of roughly 85% and specificity around 90%.6 Among women aged 40-49, about 1.5% have breast cancer at any given time. Run the Bayesian numbers, and you find that most positive mammograms in this age group are false alarms. This is not a reason to abandon mammograms — the cancers they catch save lives. But it is a reason to expect callbacks, follow-up tests, and anxious weeks that end with "everything's fine." The system is set to a low threshold on purpose, because missing a cancer (a false negative) is far more costly than an unnecessary biopsy (a false positive).
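A quick check of that claim, using the chapter's approximate figures (85% sensitivity, 90% specificity, 1.5% prevalence):

```python
# Approximate mammography figures from the text; these vary by study and age group.
sensitivity, specificity, prevalence = 0.85, 0.90, 0.015

true_pos = sensitivity * prevalence            # sick and flagged
false_pos = (1 - specificity) * (1 - prevalence)  # healthy but flagged
ppv = true_pos / (true_pos + false_pos)
print(f"P(cancer | positive mammogram) ≈ {ppv:.1%}")
```

The positive predictive value comes out near 11%, so roughly nine in ten positives in this age group are false alarms, just as the text says.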

Spam filters. Your email spam filter faces exactly the same tradeoff. If it's too aggressive (low threshold), it catches all the spam — but some real emails vanish into the junk folder. Too lenient (high threshold) and spam floods your inbox. Gmail's filter has an extraordinarily high d′ — it's very good at telling spam from legitimate email — but it still must choose a threshold, and it errs on the side of letting some spam through rather than eating your important messages.7

Criminal justice. "Beyond a reasonable doubt" is a threshold. Set it high (demanding near-certainty) and fewer innocent people go to prison — but more guilty people walk free. Set it lower ("preponderance of the evidence," the civil standard) and you catch more wrongdoers — but convict more innocents. The choice isn't about the quality of the evidence; it's about the costs we assign to each type of error. A society that considers convicting an innocent to be a special horror sets a high threshold. Blackstone's ratio — "better that ten guilty persons escape than that one innocent suffer" — is literally a statement about where to set the decision criterion.8

COVID tests. Rapid antigen tests had moderate sensitivity (~80%) but high specificity (~99%). When COVID was surging (high prevalence), a positive rapid test was very likely real. When prevalence dropped, the same positive test became much less reliable. The test didn't change. The math did.

Airport security. The TSA screens millions of passengers to find a vanishingly tiny number of actual threats. The base rate of terrorism among airline passengers is something like one in tens of millions. At that prevalence, no technology on Earth can avoid a tsunami of false positives. Every bag searched, every passenger patted down — almost all false alarms. The question is whether the deterrent value justifies the cost, and that's a question about utilities, not about test accuracy.

You can't evaluate a test without knowing what you're looking for, how common it is, and what happens when you're wrong.

· · ·

Try It: The Screening Machine

There's nothing like watching the numbers yourself. Below is a simulator: set the disease prevalence, your test's sensitivity and specificity, and send 10,000 patients through. Watch what happens to the confusion matrix — and especially to the positive predictive value, the probability that a positive test actually means disease.

Screening Test Simulator

Set the parameters and run 10,000 patients through your test. Watch how base rate dominates the story. The confusion matrix tallies the 10,000 patients as actually sick or actually healthy against test-positive or test-negative, and reports the positive predictive value: the chance a positive test means actual disease.

Play with the prevalence slider. When the disease is common (say 30%), most positive tests are correct. When it's rare (1% or less), most positives are false — even with an excellent test. This is probably the most important statistical lesson that most people never learn: the predictive value of a test depends on how common the condition is.
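If you'd rather see the numbers than drag a slider, here is a minimal sketch of the same calculation. The 90%/90% sensitivity and specificity defaults are illustrative, not figures for any real test.

```python
def positive_predictive_value(prevalence, sensitivity=0.90, specificity=0.90):
    """Fraction of positive tests that are true positives.
    Defaults are illustrative, not figures for any real test."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same test, four different worlds: only the base rate changes.
for prevalence in (0.30, 0.10, 0.01, 0.001):
    print(f"prevalence {prevalence:6.1%} -> "
          f"PPV {positive_predictive_value(prevalence):5.1%}")
```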

· · ·

The Cost of Being Wrong

So where should you set the threshold? Signal detection theory gives a precise answer: it depends on the costs.

Optimal threshold criterion
β* = [P(noise) × C_FA] / [P(signal) × C_miss]
The optimal criterion balances the cost of false alarms against the cost of misses, weighted by how common each really is.

If misses are expensive relative to false alarms — as in cancer screening, where missing a tumor can be fatal but a biopsy is merely unpleasant — you push the threshold down. Accept more false alarms to catch more true positives. If false alarms are expensive — as in criminal justice, where a false conviction destroys a life — you push the threshold up. Accept more misses to reduce false convictions.
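As a sketch, here is the criterion formula applied to both cases. The cost weights are illustrative assumptions, not measured costs: call "signal" whenever the evidence's likelihood ratio exceeds β*, so a small β* is a liberal criterion and a large one is conservative.

```python
def optimal_beta(p_signal, cost_false_alarm, cost_miss):
    """Likelihood-ratio criterion beta* = [P(noise) * C_FA] / [P(signal) * C_miss].
    Cost weights here are illustrative, not measured figures."""
    p_noise = 1.0 - p_signal
    return (p_noise * cost_false_alarm) / (p_signal * cost_miss)

# Screening: misses cost far more than false alarms -> small beta (liberal).
beta_screening = optimal_beta(p_signal=0.015, cost_false_alarm=1, cost_miss=100)
# Trial under Blackstone's ratio: false alarms cost 10x -> large beta (conservative).
beta_trial = optimal_beta(p_signal=0.5, cost_false_alarm=10, cost_miss=1)
print(f"screening beta* = {beta_screening:.2f}, trial beta* = {beta_trial:.1f}")
```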

This is why different detection systems operate at different points on the ROC curve, even when they have the same underlying d′. A smoke detector is set to an absurdly low threshold: it goes off when you burn toast, because the cost of missing a real fire is so asymmetric that we tolerate constant false alarms. A nuclear launch detection system, by contrast, is set to a very high threshold — because the false alarm cost is civilization-ending.

The Threshold Spectrum
LOW THRESHOLD (more hits, more false alarms) → HIGH THRESHOLD (fewer false alarms, more misses)
Smoke Detector · Cancer Screening · Spam Filter · Criminal Conviction · Nuclear Launch
Where you set the threshold depends on what scares you more: missing a real signal, or crying wolf?
Different systems operate at different points on the sensitivity-specificity tradeoff, depending on the relative cost of errors.

The lesson is not that detection is hopeless. The lesson is that detection is a choice. There's no objectively correct threshold — there's only the threshold that's right given your values. Mathematics can tell you the tradeoff. It can show you the ROC curve and calculate the optimal criterion for any set of costs. What it cannot tell you is how much a false alarm costs versus a miss. That's a human decision. That's politics, ethics, philosophy.

And maybe that's the deepest lesson of signal detection theory: that the line between mathematics and morality is thinner than we think. Every alarm system embeds a value judgment. Every screening program is a bet about what kind of errors a society is willing to live with. The math doesn't make these choices for us. But it makes them visible — and that, in a world full of hidden tradeoffs, is no small thing.

Notes & References

  1. Signal detection theory was formalized by W. P. Tanner and J. A. Swets in "A decision-making theory of visual detection," Psychological Review 61(6), 1954, 401–409. The framework drew heavily on earlier statistical decision theory by Wald and Neyman-Pearson.
  2. The key assumption is that both signal and noise produce evidence that follows a probability distribution (typically Gaussian), and these distributions overlap. This is called the "equal-variance Gaussian model," the simplest version of SDT.
  3. The wartime radar work was mostly done at MIT's Radiation Laboratory (1940–1945). The statistical formalization came from Peterson, Birdsall, and Fox at the University of Michigan's Electronic Defense Group in the early 1950s. See D. M. Green and J. A. Swets, Signal Detection Theory and Psychophysics (Wiley, 1966).
  4. The term "Receiver Operating Characteristic" literally comes from radar receiver engineering. The ROC curve was developed during WWII to evaluate radar operators' ability to distinguish signal from noise at various threshold settings.
  5. This is a direct application of Bayes' theorem: P(plane|alarm) = P(alarm|plane)×P(plane) / [P(alarm|plane)×P(plane) + P(alarm|no plane)×P(no plane)] = 0.95 × 0.01 / (0.95 × 0.01 + 0.05 × 0.99) = 0.161. See also Chapter 23 of Ellenberg's How Not to Be Wrong for a fuller treatment of Bayesian reasoning.
  6. Sensitivity and specificity of mammography vary by study, age group, and breast density. The figures here are approximate. See Lehman et al., "Diagnostic Accuracy of Digital Screening Mammography," JAMA 311(22), 2014, 2311–2320.
  7. Google reports that Gmail's spam filter catches 99.9% of spam with a false positive rate under 0.05%. This corresponds to an extremely high d′, achievable because email spam has many distinguishing features (sender reputation, content patterns, link analysis) that human-to-human variation rarely mimics.
  8. William Blackstone, Commentaries on the Laws of England (1765), Book IV, Chapter 27. The "10:1 ratio" is sometimes called "Blackstone's ratio." In signal detection terms, it asserts that the cost of a false positive (convicting an innocent) is at least ten times the cost of a false negative (acquitting a guilty person).