
The Missing Chapters

Simpson's Paradox

When the data lies to your face — and you believe every word of it.

Jordan Ellenberg didn't write this — but he would have

Chapter 1

The University That Hated Women (Except It Didn't)

In the fall of 1973, the University of California at Berkeley had a problem. A big, embarrassing, front-page problem. Of the 8,442 men who applied to its graduate programs, about 44% were admitted. Of the 4,321 women who applied? Only 35%.¹ The gap was so large that it looked, statistically speaking, like the kind of thing that doesn't happen by accident.

So naturally, people sued. The accusation: Berkeley was systematically discriminating against women in graduate admissions. And if you looked at the numbers — just the numbers — it was hard to argue otherwise. Forty-four percent versus thirty-five percent. That's not a rounding error. That's a policy.

Except it wasn't.

When statisticians P.J. Bickel, E.A. Hammel, and J.W. O'Connell cracked open the data department by department, they found something bizarre. In four of the six largest departments, women were admitted at a higher rate than men. In the remaining two, the rates were close — with men slightly ahead.

The university wasn't biased against women. If anything, it had a slight bias in their favor.

Read that again. Women were favored in most departments, roughly equal in the rest, and yet overall, they were admitted at a dramatically lower rate. If this doesn't make your brain hurt, you're not paying attention.

A trend that appears in every group of data can reverse when the groups are combined.

[Figure: The Reversal (each subgroup vs. the combined total)]

The classic reversal: A beats B in every subgroup, but unequal sample sizes (circle sizes) flip the aggregate. The weights are where the paradox hides.

This is Simpson's Paradox, and it is, in my considered opinion, one of the most important ideas in all of mathematics — not because it's hard (it isn't, once you see it), but because it's easy to miss, and the consequences of missing it range from embarrassing to catastrophic.

◆ ◆ ◆
Chapter 2

How Berkeley Actually Worked

Here's the trick, and it's not really a trick at all — it's just arithmetic being sneaky. Women at Berkeley disproportionately applied to competitive departments: English, art history, the humanities programs that admitted maybe 20–30% of applicants. Men, meanwhile, flooded into engineering and physical science programs that admitted 60% or more.

So women were playing the game on hard mode and men were playing on easy mode, and then someone looked at the overall scoreboard and said, "Hey, the women aren't doing as well!" Well, no kidding.

The Berkeley Paradox Machine
Toggle between the aggregated view and the department-level view. Watch the conclusion flip.

Gender | Applied | Admitted | Rate
Men | 8,442 | 3,714 | 44%
Women | 4,321 | 1,512 | 35%

⚠️ Men admitted at significantly higher rate. Bias against women?

The culprit wasn't sexist admissions committees. It was confounding — specifically, the confounding variable of which department people applied to. Women applied to harder-to-enter programs. When you naively aggregate across departments, that difficulty differential gets laundered into what looks like discrimination.

◆ ◆ ◆
Chapter 3

Kidney Stones and the Surgeon's Dilemma

You'd think this was just an academic curiosity — something that happens in admissions data but nowhere else. You'd be wrong. Simpson's Paradox shows up in medicine, and when it does, it can kill people.

In 1986, a study compared two treatments for kidney stones: Treatment A (open surgery) and Treatment B (percutaneous nephrolithotomy, a less invasive puncture procedure).² Here were the results:

Overall: Treatment A succeeded 78% of the time (273/350). Treatment B succeeded 83% (289/350). Looks like B wins, right?

Small stones: Treatment A: 93% (81/87). Treatment B: 87% (234/270). A wins.

Large stones: Treatment A: 73% (192/263). Treatment B: 69% (55/80). A wins again.

Treatment A was better for small stones and better for large stones. Yet Treatment A was worse overall.

The lurking variable? Severity. Doctors tended to assign the more invasive Treatment A to the harder cases (large stones), while easier cases got the gentle Treatment B. So A was fighting with one hand tied behind its back in the aggregate numbers — it had 263 large-stone cases versus B's mere 80 — even though it was winning every individual fight.

Now imagine you're a patient. You Google "kidney stone treatment success rates." You see 78% versus 83%. You pick B. You just made a decision based on a number that is technically correct and completely misleading. This is not a thought experiment. This is how medical misinformation actually works.
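If you'd rather not take those percentages on faith, a few lines of Python reproduce every number in this chapter from the raw counts quoted above. The dictionary layout and labels are mine, a minimal sketch rather than anything from the study itself:

```python
# Reproduce the kidney-stone numbers from the counts in Charig et al. (1986).
# Each entry is (successes, patients) for one treatment within one stone size.
data = {
    "A (open surgery)": {"small": (81, 87), "large": (192, 263)},
    "B (percutaneous)": {"small": (234, 270), "large": (55, 80)},
}

for treatment, groups in data.items():
    for size, (s, n) in groups.items():
        print(f"Treatment {treatment}, {size} stones: {s}/{n} = {s / n:.1%}")
    total_s = sum(s for s, _ in groups.values())
    total_n = sum(n for _, n in groups.values())
    print(f"Treatment {treatment}, overall: {total_s}/{total_n} = {total_s / total_n:.1%}")
```

Treatment A wins the small-stone line, wins the large-stone line, and still loses the overall one.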

Key Insight

The paradox isn't about the math being wrong. The math is perfect. It's about the question being wrong. "Which treatment has a higher success rate?" is the wrong question when treatments aren't assigned randomly. The right question is: "Which treatment works better for patients like me?"

◆ ◆ ◆
Chapter 4

The Math Behind the Magic

Enough stories. Let's understand why this happens, because once you see the mechanism, you'll start spotting it everywhere. And the mechanism is, honestly, disappointingly simple.

Imagine two groups. In Group 1, Treatment A works 90% of the time (9 out of 10 patients) and Treatment B works 85% (85 out of 100). In Group 2, Treatment A works 30% (30 out of 100) and Treatment B works 20% (2 out of 10). A is better in both groups. Now combine them:

The Aggregation Trap
Group 1: A = 9/10 (90%) > B = 85/100 (85%)
Group 2: A = 30/100 (30%) > B = 2/10 (20%)

Combined A: 39/110 = 35.5%
Combined B: 87/110 = 79.1%
A wins in both subgroups but loses the aggregate. The secret: wildly unequal sample sizes per group.

See what happened? Treatment A had most of its patients (100 of 110) in Group 2, where success rates are low. Treatment B had most of its patients (100 of 110) in Group 1, where success rates are high. The combined rate is a weighted average, and the weights — the sample sizes — are exactly where the confounding variable hides.

Simpson's Paradox — Formally
a₁/b₁ > c₁/d₁ and a₂/b₂ > c₂/d₂
does NOT imply
(a₁+a₂)/(b₁+b₂) > (c₁+c₂)/(d₁+d₂)
The inequality can reverse when the denominators (sample sizes) are distributed unevenly across groups.

This is the deep reason: weighted averages don't combine linearly. When the weights differ across the things being compared — and they almost always do when a confounding variable is at play — the aggregate can tell a completely different story than the parts. It's not a paradox in the logical sense. It's an arithmetic ambush.
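If it helps to see the weights do their work, here is a minimal sketch using the toy numbers from this chapter; the variable names and layout are mine:

```python
# Each combined rate is a weighted average of the subgroup rates, where the
# weights are the share of that treatment's patients who landed in each group.
sizes_A, rates_A = {"g1": 10, "g2": 100}, {"g1": 0.90, "g2": 0.30}
sizes_B, rates_B = {"g1": 100, "g2": 10}, {"g1": 0.85, "g2": 0.20}

def combined(sizes, rates):
    total = sum(sizes.values())
    return sum(sizes[g] / total * rates[g] for g in sizes)

print(f"A combined: {combined(sizes_A, rates_A):.1%}")  # 35.5%, dragged down by the hard group
print(f"B combined: {combined(sizes_B, rates_B):.1%}")  # 79.1%, buoyed by the easy group
```

Give both treatments the same mix of easy and hard patients and the two combined rates fall back in line with the subgroup story.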

You can win every battle and still lose the war — not because you fought badly, but because you fought in the wrong places.

◆ ◆ ◆
Chapter 5

Build Your Own Paradox

The best way to understand Simpson's Paradox is to create one yourself. The interactive below lets you set the success rates and group sizes for two treatments across two groups. Adjust the sliders until the paradox appears — you'll feel the moment it clicks.

The Paradox Laboratory
Adjust success rates and sample sizes. Can you make Treatment A win both subgroups but lose overall?
Group 1 (Easy): A success rate 90% with 10 patients; B success rate 85% with 100 patients
Group 2 (Hard): A success rate 30% with 100 patients; B success rate 20% with 10 patients
Treatment A overall: 35.5%
Treatment B overall: 79.1%
🔄 PARADOX! A wins both subgroups but loses overall

Notice the pattern: the paradox appears when Treatment A's patients are concentrated in the hard group while Treatment B's patients are concentrated in the easy group. The aggregate is dominated by each treatment's largest sample — which comes from different difficulty levels. It's not magic. It's bookkeeping.
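If sliders aren't your thing, the same laboratory fits in a dozen lines of Python. This is a sketch, not the code behind the interactive; the function name simpson_reversal is mine:

```python
# Report whether treatment A wins every subgroup yet loses the aggregate.
# Each argument is a list of (successes, patients) pairs, one pair per subgroup.
def simpson_reversal(a_groups, b_groups):
    a_wins_each = all(sa / na > sb / nb
                      for (sa, na), (sb, nb) in zip(a_groups, b_groups))
    sa, na = map(sum, zip(*a_groups))
    sb, nb = map(sum, zip(*b_groups))
    return a_wins_each and sa / na < sb / nb

# The laboratory's default setting: a reversal.
print(simpson_reversal([(9, 10), (30, 100)], [(85, 100), (2, 10)]))       # True
# Give both treatments the same sample size in every group: no reversal.
print(simpson_reversal([(90, 100), (30, 100)], [(85, 100), (20, 100)]))   # False
```

The second call equalizes the sample sizes, and the reversal vanishes, which is exactly the lesson of the sliders.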

◆ ◆ ◆
Chapter 6

David Justice vs. Derek Jeter

Baseball, America's most statistically obsessive sport, gives us perhaps the cleanest example of all. In both 1995 and 1996, David Justice had a higher batting average than Derek Jeter. In both years. And yet, when you combined the two seasons, Jeter's average was higher.³

1995: Justice .253 (104/411) > Jeter .250 (12/48)

1996: Justice .321 (45/140) > Jeter .314 (183/582)

Combined: Justice .270 (149/551) < Jeter .310 (195/630)

Justice won both seasons but lost the two-year combined average.

The confounding variable is playing time. Jeter's 1995 was a cup of coffee — 48 at-bats at a mediocre .250. His 1996 was a full season at a strong .314. Justice played a full season in 1995 (at .253) but only 140 at-bats in 1996. Each player's combined average is dominated by their full season — for Jeter, that was his better year; for Justice, his worse one.

It's Berkeley all over again. The "departments" are the seasons, the "acceptance rates" are the batting averages, and the "application counts" are the at-bats. Same paradox, different costume.
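The combined averages are a two-line check if you want to see the division yourself; the hits and at-bats are the ones listed above:

```python
# Seasons as (hits, at_bats): 1995 first, then 1996, from the lines above.
justice = [(104, 411), (45, 140)]
jeter = [(12, 48), (183, 582)]

for name, seasons in (("Justice", justice), ("Jeter", jeter)):
    hits, at_bats = map(sum, zip(*seasons))
    print(f"{name} combined: {hits}/{at_bats} = {hits / at_bats:.3f}")
# Justice combined: 149/551 = 0.270; Jeter combined: 195/630 = 0.310
```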

◆ ◆ ◆
Chapter 7

Enter Judea Pearl: Why "Just Look at the Data" Isn't Enough

Here's where Simpson's Paradox stops being a fun math puzzle and becomes something genuinely profound. Because the paradox poses a question that pure statistics cannot answer: which number should you believe — the aggregated one or the disaggregated one?

In the Berkeley case, the disaggregated data was clearly right. In a drug trial, disaggregating by severity makes sense. But is that always true? Should you always break the data into subgroups?

The computer scientist and philosopher Judea Pearl — whose work on causal inference won him the Turing Award — argued that the answer is: it depends on the causal structure.⁴ You can't resolve Simpson's Paradox with statistics alone. You need a causal model.

Suppose you're testing a drug and you notice that blood pressure is a potential confounding variable. Should you stratify by blood pressure?

If blood pressure is a common cause of both the treatment assignment and the outcome — yes, stratify. It's a confounder. Grouping by it removes its distorting effect.

If blood pressure is a mediator — meaning the drug works by lowering blood pressure — then no! Stratifying by blood pressure would remove the very effect you're trying to measure.

Same data. Same subgroups. Completely different correct answers. Only the causal arrows distinguish them.

This is why Pearl insisted on drawing causal diagrams — directed acyclic graphs, or DAGs — and why he argued that you cannot do data science without them. The data alone cannot tell you which way the arrows point. For that, you need to think.
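To make the two scenarios concrete, here is a back-of-the-envelope sketch. Every probability in it is invented for illustration (none of this comes from a real trial), and the variable names are mine:

```python
# Two invented scenarios. The surface question is the same ("should we stratify?")
# but the correct answers are opposite. All probabilities are made up for illustration.

# Scenario 1: severity is a CONFOUNDER. Severe cases are both more likely to get
# treatment A and less likely to succeed, so the raw aggregate comparison is distorted.
p_severe = 0.5
p_get_A = {False: 0.2, True: 0.8}                    # P(assigned treatment A | severity)
p_success = {("A", False): 0.7, ("B", False): 0.6,   # A beats B within each stratum
             ("A", True): 0.3, ("B", True): 0.2}

def observed_rate(treatment):
    """P(success | treatment) in the observed, non-randomized data."""
    num = den = 0.0
    for severe in (False, True):
        p_sev = p_severe if severe else 1 - p_severe
        p_assign = p_get_A[severe] if treatment == "A" else 1 - p_get_A[severe]
        num += p_success[(treatment, severe)] * p_assign * p_sev
        den += p_assign * p_sev
    return num / den

print(f"{observed_rate('A'):.2f} vs {observed_rate('B'):.2f}")  # 0.38 vs 0.52: the aggregate
# misleads; the stratified comparison (0.7 > 0.6 and 0.3 > 0.2) is the causal one.

# Scenario 2: blood pressure is a MEDIATOR. The drug is randomized and works only by
# lowering blood pressure; success depends on blood pressure alone.
p_high_bp = {"drug": 0.2, "placebo": 0.8}
p_success_bp = {True: 0.3, False: 0.8}               # identical for both arms

for arm in ("drug", "placebo"):
    p_hi = p_high_bp[arm]
    print(f"{arm}: {p_hi * p_success_bp[True] + (1 - p_hi) * p_success_bp[False]:.2f}")
# drug 0.70, placebo 0.40: here the aggregate IS the causal effect. Stratify by blood
# pressure and the drug looks useless in every stratum, because you have adjusted away
# the very pathway that makes it work.
```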

The Deep Lesson

Statistics tells you what happened. Causal inference tells you why. Simpson's Paradox is what happens when you try to answer a "why" question with a "what" tool. The numbers aren't lying — you're asking them the wrong question.

◆ ◆ ◆
[Figure: Pearl's Insight (same data, different answers)]

Pearl's key distinction: when the lurking variable is a confounder (left), stratify to remove bias. When it's a mediator (right), stratifying destroys the very effect you're measuring.

Chapter 8

So What Do You Do About It?

Now I know what you're thinking: "Great, so I can never trust any statistic ever again." That's not quite right, but it's closer to right than you'd like. Here's what Simpson's Paradox actually teaches us:

First: Always ask what's being aggregated. Whenever someone shows you an overall number — a success rate, an average, a percentage — your first question should be: "What subgroups are hiding inside this number, and could they tell a different story?" Most of the time, they won't. But when they do, the difference is everything.

Second: Look for confounders. If two groups being compared weren't randomly assigned — and outside of controlled experiments, they almost never are — then there's probably a lurking variable distorting the comparison. Find it. In Berkeley, it was department choice. In kidney stones, it was severity. In baseball, it was playing time. The confounder is always there. Your job is to name it.

Third: Think causally. Don't just ask "what does the data show?" Ask "what is the data-generating process?" Draw the causal diagram, even if it's just on a napkin. Which variables cause which? Where are the arrows? This is Pearl's great contribution: the realization that no amount of data can substitute for a model of how the data came to be.

Fourth: Be humble. The world is full of confounders you haven't thought of. Every observational study is a potential Simpson's Paradox waiting to be revealed. This doesn't mean we should ignore data — it means we should treat it with the respect it deserves, which includes respecting its limitations.

The numbers never lie. But they never tell the whole truth, either.

Simpson's Paradox isn't really about paradoxes. It's about the gap between seeing and understanding. The data is a map, and like all maps, it leaves things out. The paradox is what happens when you forget that the map isn't the territory — when you trust the aggregate so completely that you forget to ask what it's made of.

Edward Hugh Simpson published his paper on this phenomenon in 1951.⁵ Udny Yule noticed it even earlier, in 1903.⁶ Karl Pearson and others observed related reversals around the same era. We've known about it for over a century. And yet it keeps catching us off guard — in medical research, in hiring data, in educational testing, in criminal justice statistics, in pandemic mortality rates. Every time we aggregate without thinking, we risk seeing a pattern that isn't there, or missing one that is.

The cure isn't to stop looking at data. The cure is to look harder.

Notes & References

  1. Bickel, P.J., Hammel, E.A., & O'Connell, J.W. (1975). "Sex Bias in Graduate Admissions: Data from Berkeley." Science, 187(4175), 398–404. The six departments are anonymized as A–F in the original paper. The aggregate gap (44% vs 35%) was statistically significant at conventional levels.
  2. Charig, C.R., Webb, D.R., Payne, S.R., & Wickham, J.E. (1986). "Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy." BMJ, 292(6524), 879–882. The Simpson's Paradox in this data was highlighted by Steven Julious and Mark Mullee in 1994.
  3. Ross, Ken (2004). A Mathematician at the Ballpark: Odds and Probabilities for Baseball Fans. The Jeter/Justice example is widely cited in statistics pedagogy. The at-bat numbers: Justice 411 AB in 1995, 140 in 1996; Jeter 48 AB in 1995, 582 in 1996.
  4. Pearl, Judea (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press. See especially Chapter 6, where Pearl argues that Simpson's Paradox cannot be resolved without causal assumptions — a claim that remains controversial among some frequentist statisticians.
  5. Simpson, E.H. (1951). "The Interpretation of Interaction in Contingency Tables." Journal of the Royal Statistical Society, Series B, 13(2), 238–241.
  6. Yule, G.U. (1903). "Notes on the Theory of Association of Attributes in Statistics." Biometrika, 2(2), 121–134. Yule's observation predates Simpson's by nearly fifty years, leading some to call it the Yule–Simpson effect.