The Missing Chapter

The Model That Knew Too Much

Why the explanation that accounts for everything predicts nothing — and what conspiracy theories, stock pickers, and polynomial curves have in common.

An extension of Jordan Ellenberg's "How Not to Be Wrong"

Chapter 1

The Portfolio That Beat the Market

In the early 2000s, a financial newsletter began arriving in the mailboxes of 10,000 strangers. The first issue made a bold prediction: a certain stock would go up the following week. Half the recipients got a version predicting it would go up; the other half, that it would go down. The following week, the stock moved. One group of 5,000 had received the "correct" prediction. The other 5,000 never heard from the newsletter again.

The next week, the surviving 5,000 were split again — half got "up," half got "down." After the result, 2,500 had now seen two correct predictions in a row. Then 1,250. Then 625. Then about 312. After six rounds, about 156 people had received six consecutive correct stock predictions. To those 156 people, this newsletter looked like it was run by a genius.
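The arithmetic behind the trick is nothing more than repeated halving. A minimal sketch in Python (the recipient counts are the chapter's illustration, not data from any real scheme):

```python
# Each round, only the half that happened to receive the correct call keeps reading.
recipients = 10_000
for round_number in range(1, 7):
    recipients //= 2  # the other half never hears from the newsletter again
    print(f"after round {round_number}: {recipients} recipients with a perfect record")
# After round 6: 156 people have seen six consecutive "correct" predictions.
```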

The newsletter then offered its premium subscription: $1,000 a year. A bargain, surely, for access to someone who had called the market correctly six times running. Except the newsletter had no insight at all. It had simply overfit to the past — constructing, after the fact, a track record that looked prophetic but contained zero predictive power.

This scam is real — the SEC has prosecuted versions of it.1 But the deeper lesson isn't about fraud. It's about a mistake so fundamental, so pervasive, that it infects everything from Wall Street quantitative models to medical research to your personal beliefs about what makes you happy. The mistake is called overfitting, and it is the dark twin of pattern recognition — our greatest cognitive gift turned against us.

Overfitting

A model that explains the past perfectly by fitting to noise rather than signal. The better it matches historical data, the worse it predicts the future. Overfitting is the statistical equivalent of a conspiracy theory: an explanation so comprehensive it must be true — except it isn't.

· · ·
Chapter 2

The Polynomial That Ate the Data

Here is the simplest way to see overfitting in action. Suppose you have seven data points — say, the relationship between hours of study and exam scores. You want to draw a curve through them to understand the relationship and, crucially, to predict what would happen for eight hours, or ten, or three and a half.

A straight line won't pass through all seven points. It'll miss some, maybe most. But it captures the trend: more study, higher scores, roughly linearly. A second-degree polynomial — a parabola — might fit a little better, curving to catch a few points the line missed. A third-degree? Better still.

Now try a sixth-degree polynomial. With six "knobs" to turn (plus the intercept), a degree-six polynomial can pass through all seven points exactly. Training error: zero. Perfect fit. Victory?

Not even close. Because that sixth-degree polynomial, in its desperate quest to thread every needle, does something insane between and beyond the data points. It swoops and dives. It predicts that studying for 4.5 hours will produce a worse score than studying for 3 hours. It predicts that 10 hours of study will yield a negative score. The curve that explains the data perfectly is the curve that understands nothing.
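Here is a minimal sketch of that experiment in NumPy; the seven (hours, score) pairs are invented for illustration, and only the qualitative behavior matters:

```python
import numpy as np

# Seven hypothetical (hours studied, exam score) observations.
hours  = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
scores = np.array([52.0, 61.0, 70.0, 68.0, 79.0, 83.0, 88.0])

for degree in (1, 2, 6):
    fit = np.poly1d(np.polyfit(hours, scores, degree))   # least-squares polynomial fit
    train_mse = np.mean((fit(hours) - scores) ** 2)       # error on the points we fit
    print(f"degree {degree}: training MSE = {train_mse:9.4f}, "
          f"predicted score for 10 hours = {fit(10):9.1f}")
# The degree-6 curve passes through all seven points (training error ~0)
# and then extrapolates absurdly beyond them.
```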

Try it yourself:

[Interactive: Polynomial Fitting Explorer. Drag the slider to increase the polynomial degree and watch the training error drop to zero while the curve becomes increasingly absurd. The display tracks training error, test error, and parameter count as the degree rises from 1 (linear).]

The pattern is unmistakable. As complexity rises, training error falls monotonically — it has to, because you're giving the model more freedom to contort itself around the data. But test error — how well the model predicts new data it hasn't seen — follows a U-shape. It drops at first (as the model captures genuine structure), then rises (as it starts memorizing noise). The bottom of that U is where wisdom lives.
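The U-shape is easy to reproduce. Fit polynomials of increasing degree to one small noisy sample from a known relationship, then score them on fresh points the fit never saw. A sketch under assumed conditions (a quadratic "truth" with Gaussian noise, both invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n noisy points from an assumed quadratic relationship."""
    x = rng.uniform(0, 8, n)
    y = 50 + 8 * x - 0.5 * x**2 + rng.normal(0, 3, n)
    return x, y

x_train, y_train = sample(15)     # the data the model gets to see
x_test,  y_test  = sample(200)    # fresh data it has never seen

for degree in range(1, 10):
    fit = np.poly1d(np.polyfit(x_train, y_train, degree))
    train_mse = np.mean((fit(x_train) - y_train) ** 2)
    test_mse  = np.mean((fit(x_test)  - y_test)  ** 2)
    print(f"degree {degree}: train MSE {train_mse:7.2f}   test MSE {test_mse:10.2f}")
# Training error falls with every added degree; test error typically bottoms
# out near the true complexity (degree 2 here) and then climbs.
```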

· · ·
Chapter 3

The Bias-Variance Tango

Statisticians have a beautiful way of decomposing what's happening here. Every model's prediction error can be split into three parts: bias, variance, and irreducible noise.

The Bias-Variance Decomposition
Error = Bias² + Variance + Noise
The fundamental tradeoff at the heart of all statistical learning.
Bias
How far off the model is on average — the systematic error from simplifying assumptions. A straight line fit to curved data has high bias. It's consistently wrong in the same direction.
Variance
How much the model changes when you give it different training data. A high-degree polynomial has high variance — train it on a slightly different sample and it produces a wildly different curve.
Noise
The irreducible randomness in the data itself. No model can eliminate this. It's the universe's contribution to unpredictability.

Bias and variance are locked in a tug-of-war. Make your model simpler (fewer parameters, stronger assumptions) and bias goes up but variance goes down. Make it more complex and bias drops but variance explodes. The art of modeling — of thinking, really — is finding the sweet spot where their sum is minimized.
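The decomposition can be estimated by simulation: redraw the training set many times, refit, and watch how the predictions at one fixed point scatter around the truth. A sketch using the same invented quadratic setup as above:

```python
import numpy as np

rng = np.random.default_rng(1)

def true_fn(x):
    """The assumed 'true' relationship we are trying to learn."""
    return 50 + 8 * x - 0.5 * x**2

x0 = 8.0  # the point where we compare predictions (edge of the observed range)

def prediction_at_x0(degree, n=15):
    """Fit one polynomial to a fresh noisy training set; return its prediction at x0."""
    x = rng.uniform(0, 8, n)
    y = true_fn(x) + rng.normal(0, 3, n)
    return np.poly1d(np.polyfit(x, y, degree))(x0)

for degree in (1, 2, 8):
    preds = np.array([prediction_at_x0(degree) for _ in range(2000)])
    bias_sq  = (preds.mean() - true_fn(x0)) ** 2   # squared systematic error
    variance = preds.var()                         # spread across training sets
    print(f"degree {degree}: bias^2 = {bias_sq:8.2f}   variance = {variance:10.2f}")
# The rigid model misses the truth in the same direction every time (bias);
# the flexible one is closer on average but swings wildly from sample to sample (variance).
```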

[Figure: the bias-variance tradeoff. As model complexity increases, bias² falls and variance rises; total prediction error is U-shaped, with a sweet spot between underfitting and overfitting.]

Ellenberg would appreciate the analogy to everyday life. Consider how you form opinions about restaurants. If you go once and have a bad meal, you could conclude "that place is terrible" — a high-variance model based on a single data point. Or you could apply a prior: "most restaurants that stay in business are decent, so maybe I had bad luck" — adding bias (an assumption) but reducing variance. Neither extreme is right. One visit isn't enough data for a confident model, but refusing to update your beliefs at all isn't wisdom either. The best thinkers are the ones who calibrate their complexity to their evidence.

· · ·
Chapter 4

Conspiracy Theories and Curve Fitting

Here is where overfitting escapes the statistics classroom and starts eating the world.

A conspiracy theory is, at its heart, an overfit model. Consider the person who believes the moon landing was faked. They have an explanation for every piece of evidence. The flag appears to wave? Wires. The shadows look wrong? Studio lighting. Thousands of NASA engineers never talked? They were all in on it, or threatened into silence. Multiple independent tracking stations confirmed the signals? Also in on it. The Van Allen radiation belts should have killed the astronauts? NASA lied about the shielding.

The conspiracy theory fits every data point perfectly. Its training error is zero. And that's exactly the problem — a theory flexible enough to accommodate any possible evidence is a theory that predicts nothing. It's a degree-n polynomial where n equals the number of data points minus one.

Karl Popper understood this in the 1930s when he articulated his criterion of falsifiability.2 A theory that can explain any outcome explains nothing. The value of a theory lies not in what it accounts for, but in what it rules out. Newton's theory of gravity is powerful precisely because it says the apple must fall down, never sideways, never up. A conspiracy theory that can absorb any disconfirming evidence — "that's just what they want you to think" — has zero predictive power, regardless of how satisfying its narrative.

The overfitting lens reframes this beautifully. Falsifiability isn't just a philosophical principle. It's a statistical one. A falsifiable theory is one with constrained complexity — it has made commitments that could be wrong. An unfalsifiable theory is one with unlimited parameters, free to contort itself around any data. The first can generalize; the second cannot.

[Figure: a simple theory that misses a data point remains predictive; a conspiracy theory that hits every point predicts nothing. A theory that explains everything is just a polynomial with too many degrees of freedom.]

· · ·
Chapter 5

The Replication Crisis, or: Science's Overfitting Problem

In 2011, the psychologist Joseph Simmons and his colleagues published a landmark paper with an unforgettable title: "False-Positive Psychology."3 Using entirely standard and accepted research practices, they proved — with a p-value below 0.05 — that listening to "When I'm Sixty-Four" by the Beatles literally makes you younger.

They didn't fake data. They didn't hack computers. They just used the many "degrees of freedom" available to a researcher: Should we control for gender? Should we exclude outliers? Should we log-transform the dependent variable? Should we stop collecting data now, or run ten more participants? Each decision is a knob. And with enough knobs, you can fit any noise.

This is overfitting in the wild. A researcher, hunting for a publishable result, tries specification after specification until one yields a p-value below 0.05 — the magic threshold for publication. They aren't lying. They may not even realize they're doing it. But the effect is the same as our stock-picking newsletter: by selecting the specification that fits the data best, they've overfit to the sample. The result looks real but won't replicate.
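The mechanism is easy to simulate. Generate pure noise, allow yourself a handful of "specifications," and report whichever one clears the threshold. A hedged sketch using SciPy; the four specifications below are made-up stand-ins for the real researcher degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 2000
false_positives = 0

for _ in range(n_experiments):
    # Two groups drawn from the same distribution: there is no real effect.
    a, b = rng.normal(size=(2, 20))
    p_values = [
        stats.ttest_ind(a, b).pvalue,                                # 1: the plain comparison
        stats.ttest_ind(a[:15], b[:15]).pvalue,                      # 2: "stop collecting early"
        stats.ttest_ind(a[np.abs(a) < 2], b[np.abs(b) < 2]).pvalue,  # 3: "exclude outliers"
        stats.mannwhitneyu(a, b).pvalue,                             # 4: switch to a different test
    ]
    if min(p_values) < 0.05:     # report whichever specification "worked"
        false_positives += 1

print(f"nominally significant results on pure noise: {false_positives / n_experiments:.1%}")
# Four correlated chances at p < .05 push the false-positive rate well above 5%.
```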

And it didn't replicate. In 2015, the Open Science Collaboration attempted to reproduce 100 psychology studies published in top journals. Only 36% replicated.4 The replication crisis, as it came to be called, was overfitting writ large — an entire discipline accidentally fitting to noise.

With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

— John von Neumann

Von Neumann's quip captures the essence. A model with enough parameters can fit anything — elephants, stock returns, the therapeutic benefit of listening to Beatles songs. The question is never whether your model fits the data. It's whether it would fit different data drawn from the same process. That's the only question that matters, and it's the one overfitting systematically obscures.

· · ·
Chapter 6

Can You Spot the Overfit?

Test your intuition. In each scenario, decide whether the evidence is genuine signal or overfit noise.

[Interactive quiz: Overfit or Insight? Six scenarios; for each, decide whether the evidence is genuine signal or noise dressed up as pattern.]

· · ·
Chapter 7

The Razor's Edge

So how do we fight overfitting? Centuries of mathematical thinking converge on a single principle: prefer the simpler explanation. William of Ockham said it in the 14th century ("entities must not be multiplied beyond necessity"). Statisticians formalized it in the 20th.

The most elegant formalization may be the Akaike Information Criterion (AIC), developed by Hirotugu Akaike in 1973.5 The AIC scores a model by weighing its goodness of fit against a penalty proportional to the number of parameters; the lower the score, the better. Add a parameter and the fit improves — but so does the penalty. The AIC asks: did the improvement in fit justify the added complexity? Only if the new parameter captures genuine structure, not noise, will the AIC improve.

Akaike Information Criterion
AIC = 2k − 2 ln(L̂)
Where k = number of parameters and L̂ = maximum likelihood. Lower is better.
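For a least-squares fit with Gaussian errors, −2 ln(L̂) reduces, up to an additive constant, to n·ln(RSS/n), so the AIC can be computed straight from the residuals. A sketch on invented data with a quadratic truth:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 8, 30)
y = 50 + 8 * x - 0.5 * x**2 + rng.normal(0, 3, 30)   # assumed quadratic truth plus noise

def aic(degree):
    """AIC for a least-squares polynomial fit with Gaussian errors.
    Up to an additive constant, -2 ln(L-hat) = n * ln(RSS / n)."""
    fit = np.poly1d(np.polyfit(x, y, degree))
    rss = np.sum((fit(x) - y) ** 2)
    n, k = len(x), degree + 2    # degree coefficients, plus intercept, plus noise variance
    return n * np.log(rss / n) + 2 * k

for degree in range(1, 8):
    print(f"degree {degree}: AIC = {aic(degree):8.1f}")
# Fit keeps improving with every degree, but AIC typically bottoms out near
# the true complexity (degree 2 here) and then rises as the penalty takes over.
```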

Cross-validation is the empirical cousin of the same idea. Instead of penalizing complexity mathematically, you test your model on data it hasn't seen. Hold out 20% of your data, train on the remaining 80%, then measure performance on the held-out set. An overfit model will perform beautifully on the training data and terribly on the test data. A well-calibrated model will perform similarly on both.
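A minimal five-fold version in NumPy (same invented quadratic data as before; in practice a library such as scikit-learn would handle the fold bookkeeping):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 8, 50)
y = 50 + 8 * x - 0.5 * x**2 + rng.normal(0, 3, 50)

# Split the indices once into five folds of ten points each.
folds = np.array_split(rng.permutation(len(x)), 5)

def cv_mse(degree):
    """Average squared error over the five held-out folds."""
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)   # everything not held out
        fit = np.poly1d(np.polyfit(x[train], y[train], degree))
        errors.append(np.mean((fit(x[held_out]) - y[held_out]) ** 2))
    return np.mean(errors)

for degree in (1, 2, 5, 9):
    print(f"degree {degree}: cross-validated MSE = {cv_mse(degree):7.2f}")
# The high-degree fits look best on the data they memorized and worst on
# the folds they never saw.
```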

But the deepest defense against overfitting isn't a formula. It's a disposition — the willingness to be suspicious of your own beautiful explanations. The stock picker who backtested 200 strategies and found one that "worked" should be suspicious. The researcher whose hypothesis was confirmed only after the third statistical test should be suspicious. The political commentator who can explain every election result after the fact — with a different theory each time — should be suspicious.

[Figure: the dartboard analogy. High bias, low variance: shots cluster tightly but off-center (consistent but wrong). Low bias, low variance: clustered on the bullseye (the goal). Low bias, high variance: right on average, wild in practice (the overfit model). Bias is how far off-center your shots cluster; variance is how spread out they are. Overfitting trades the first for the second.]

Overfitting is not just a statistical concept. It is a way of being wrong that feels like being right. The overfit model has an answer for everything. The overfit political theory explains every election. The overfit medical protocol accounts for every symptom. They feel powerful and comprehensive. But they are brittle — the first new data point shatters them, and then their adherents scramble to add another epicycle, another parameter, another "well actually."

The great irony of overfitting is that it's driven by our best instinct: the desire to explain, to leave nothing to chance, to account for every anomaly. But the world is noisy. Some things happen for no reason. Some data points are just luck. The hardest skill in mathematics — and in life — is knowing when to stop explaining and start accepting that the remaining variation is just noise.

Or, as the statistician George Box put it: "All models are wrong, but some are useful."6 The useful ones are wrong in simple, predictable ways. The useless ones are wrong in complicated ways that look like being right.

Choose the simple wrongness. It's the only kind that helps.

Notes & References

  1. This stock-picking scam (sometimes called the "binary prediction" scheme) has been documented by the SEC and featured in Derren Brown's 2008 Channel 4 special The System, where he demonstrated it live with horse racing.
  2. Karl Popper, The Logic of Scientific Discovery (1934; English edition 1959). Popper's falsifiability criterion remains foundational to the philosophy of science, despite refinements by Lakatos and others.
  3. Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn, "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant," Psychological Science 22, no. 11 (2011): 1359–1366.
  4. Open Science Collaboration, "Estimating the Reproducibility of Psychological Science," Science 349, no. 6251 (2015): aac4716.
  5. Hirotugu Akaike, "Information Theory and an Extension of the Maximum Likelihood Principle," in Proceedings of the 2nd International Symposium on Information Theory (1973): 267–281.
  6. George E. P. Box, "Robustness in the Strategy of Scientific Model Building," in Robustness in Statistics, ed. Robert L. Launer and Graham N. Wilkinson (Academic Press, 1979), 201–236.