The Missing Chapter

Maximum Likelihood

The art of asking: what world would have made this data most probable?

An extension of Jordan Ellenberg's "How Not to Be Wrong"

Chapter 67

The Coin That Wouldn't Confess

You find a coin on the ground. You flip it seven times. Five heads, two tails. Is this a fair coin?

That's a perfectly reasonable question. It's also, in a very precise sense, the wrong question — or at least not the most useful one. A better question is: what kind of coin would make five-out-of-seven heads the most likely thing to happen?

A fair coin — one that comes up heads exactly half the time — would produce five heads in seven flips about 16.4% of the time. Not impossible, but not the coin's best party trick either. A coin biased 5/7 toward heads, on the other hand, would produce exactly this outcome more often than any other coin you could dream up. So if you had to bet on a single number for the coin's bias, 5/7 is your answer.
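Both numbers are quick to check with nothing beyond the standard library (`math.comb`; the helper name is mine):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips of a coin with bias p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

fair = binom_pmf(5, 7, 0.5)      # the fair coin's chance of 5-of-7
biased = binom_pmf(5, 7, 5/7)    # the 5/7-biased coin's chance
print(f"fair: {fair:.3f}   biased 5/7: {biased:.3f}")   # fair ≈ 0.164
```

The biased coin produces the observed outcome nearly twice as often as the fair one.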

Congratulations. You've just performed maximum likelihood estimation, the single most important idea in statistical inference for the last hundred years, and you did it without a single equation. The equations are coming — they're elegant, I promise — but the core insight is already in your hands: find the parameter value that makes your observed data most probable.

This chapter is about why that simple idea is so powerful, where it came from, how it works, and the surprising places it shows up — including inside every neural network you've ever used.

· · ·
Chapter 67.1

The Likelihood Function

Here's where we need to be careful with language, because English is going to betray us. When I say "the likelihood of the coin having bias p," I don't mean "the probability that the coin's bias is p." Those sound like the same thing. They are emphatically not.

The probability of data given a parameter tells you: "If the coin's bias really is p, how probable is the data I saw?" The likelihood of a parameter given data tells you the same number — literally the exact same number — but viewed from the other direction. You're holding the data fixed and sliding the parameter around.1

The Likelihood Function
L(θ | data) = P(data | θ)
Same number, profoundly different interpretation. The left side is a function of θ; the right side is a function of data.

This distinction — same formula, different variable — is one of those things that seems pedantic until you realize the entire edifice of modern statistics rests on it. Probability says: given a model of the world, how surprising is this data? Likelihood says: given this data, which model of the world fits best?

For our coin, with k = 5 heads in n = 7 flips, the likelihood function is:

L(p) = p⁵(1 − p)²
We drop the binomial coefficient C(7,5) = 21 because it doesn't depend on p — it's the same for every candidate coin.

Try every value of p from 0 to 1. At p = 0, the likelihood is zero (an all-tails coin can't produce any heads). At p = 1, it's also zero (an all-heads coin can't produce tails). Somewhere in between there's a peak. Calculus tells us it's at p = 5/7. The interactive below lets you see this for yourself.
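If you'd rather skip the calculus, a brute-force grid search (a sketch, not how statisticians do it in practice) lands on the same peak:

```python
# Scan L(p) = p^5 (1 - p)^2 over a fine grid of candidate biases.

def likelihood(p):
    return p**5 * (1 - p)**2

grid = [i / 10000 for i in range(10001)]
mle = max(grid, key=likelihood)
print(f"grid MLE = {mle:.4f}, calculus says 5/7 = {5/7:.4f}")
```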

Interactive: The Likelihood Landscape

Flip coins and watch the likelihood function respond. Drag the slider to find the MLE visually. Add more data and watch the curve sharpen.


Notice something? With just 7 flips, the likelihood curve is broad and lazy — lots of values of p give nearly-as-good likelihoods. But flip 70 times and the curve becomes a needle. This is the law of large numbers whispering in the language of likelihood: more data means more certainty.2

· · ·
Chapter 67.2

Fisher's Revolution

The idea of maximum likelihood was formalized by R.A. Fisher in 1921, in a paper that reads like a man who knows he's rewriting the rules.3 Fisher was 31, already famous for his work on genetics, and absolutely insufferable in the way that only a person who is right about nearly everything can be.

Before Fisher, statisticians mostly used the method of moments: match the sample mean to the theoretical mean, match the sample variance to the theoretical variance, and solve for the unknown parameters. It's simple, it's intuitive, and it works — but it's also oddly arbitrary. Why the mean? Why the variance? Why not the third moment, or the seventh?

Fisher's answer was: stop matching moments. Instead, ask which parameter value makes the observed data most probable. This has a beautiful inevitability to it. You're not choosing which summary statistic to match. You're using all the data, in the most natural way possible.

In practice, nobody maximizes the likelihood directly. They maximize the log-likelihood, because logarithms turn products into sums:

ℓ(θ) = log L(θ) = Σᵢ log P(xᵢ | θ)

Products of tiny numbers underflow computers. Sums of logs don't. And since log is a strictly increasing function, whatever maximizes log L also maximizes L. It's a free lunch — mathematically identical, computationally superior.
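A minimal illustration of why, using a made-up dataset of 2000 observations:

```python
import math

# A made-up dataset: 2000 observations, each with probability 0.4 or 0.6.
probs = [0.6 if i % 5 else 0.4 for i in range(2000)]

direct = 1.0
for q in probs:
    direct *= q           # the running product of 2000 small numbers...
print(direct)             # → 0.0 (underflows double precision)

log_lik = sum(math.log(q) for q in probs)   # ...while the sum of logs is fine
print(log_lik)
```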

To find the maximum, you take the derivative of the log-likelihood and set it to zero. This gives you the score equation:

The Score Equation
dℓ/dθ = 0
The slope of the log-likelihood is zero at the peak. Solve for θ to get the MLE.

For our coin, the log-likelihood is ℓ(p) = 5 log(p) + 2 log(1 − p). Take the derivative, set it to zero, and you get p̂ = 5/7. The hat on the p is statistician code for "this is our best estimate." It's not the truth. It's the most likely explanation of what we saw.
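The same answer falls out numerically. A sketch, assuming only that the score is monotone on (0, 1), which it is here: solve dℓ/dp = 5/p − 2/(1 − p) = 0 by bisection.

```python
# The score 5/p - 2/(1 - p) decreases monotonically on (0, 1), so plain
# bisection finds its unique root, which is the MLE.

def score(p):
    return 5 / p - 2 / (1 - p)

lo, hi = 1e-9, 1 - 1e-9
for _ in range(60):
    mid = (lo + hi) / 2
    if score(mid) > 0:
        lo = mid          # still climbing: the peak is to the right
    else:
        hi = mid          # past the peak: move left
p_hat = (lo + hi) / 2
print(f"p_hat = {p_hat:.6f}, 5/7 = {5/7:.6f}")
```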

Probability flows from the model (θ known) to predicted data; likelihood flows from observed data back to the model (θ unknown). Same formula, opposite direction.
· · ·
Chapter 67.3

Fisher Information: How Sharp Is the Peak?

Not all data is created equal. Flip a coin 10 times and you learn something. Flip it 10,000 times and you learn a lot more. But how much more? Fisher answered this too, with a quantity now called, inevitably, Fisher information.4

The Fisher information measures the curvature of the log-likelihood at its peak. A sharp peak means high information — the data strongly points at one particular parameter value. A broad, flat peak means low information — many parameter values are roughly equally consistent with the data.

Fisher Information
I(θ) = −E[d²ℓ / dθ²]
The expected negative second derivative of the log-likelihood. More curvature = more information.

The beautiful payoff: for large samples, the MLE is approximately normally distributed with variance 1/I(θ). The more information in your data, the tighter your estimate. And the Cramér-Rao bound says that no unbiased estimator can have variance below 1/I(θ), a floor the MLE reaches asymptotically. The MLE is, in a very real sense, the best you can possibly do.5
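For a Bernoulli flip, the Fisher information works out to I(p) = 1/(p(1 − p)), so the MLE's variance should shrink toward p(1 − p)/n. A small simulation (the parameter values are arbitrary choices) bears this out:

```python
import random
random.seed(0)

# For one Bernoulli(p) flip, I(p) = 1/(p(1 - p)); for n flips, the MLE's
# variance should approach 1/(n I(p)) = p(1 - p)/n.
p_true, n, trials = 0.6, 400, 5000
estimates = []
for _ in range(trials):
    heads = sum(random.random() < p_true for _ in range(n))
    estimates.append(heads / n)        # the MLE for this experiment

mean = sum(estimates) / trials
var = sum((e - mean) ** 2 for e in estimates) / trials
predicted = p_true * (1 - p_true) / n
print(f"empirical var = {var:.6f}, predicted 1/(nI) = {predicted:.6f}")
```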

More data → sharper likelihood peak → more precise estimates: n = 10 gives low information (a broad peak), n = 100 high information (a sharp one). The width of the peak is inversely related to Fisher information.
· · ·
Chapter 67.4

The Invariance Property (A Free Theorem)

Here's one of the most delightful properties of the MLE, and it's completely free. Suppose you've estimated the coin's bias to be p̂ = 5/7. Now someone asks: "What's the MLE of the odds of heads?" The odds are p/(1 − p). Do you need to re-derive everything from scratch?

No. You just plug in. The MLE of the odds is p̂/(1 − p̂) = (5/7)/(2/7) = 5/2 = 2.5. This is the invariance property: if θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any function g.6

This sounds like it should be obvious. It is not. The method of moments doesn't have this property. Neither do many Bayesian estimates. The MLE carries its optimality through any transformation you throw at it, like a mathematical frequent-flyer status that works on every airline.
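You can verify the invariance claim by brute force: rewrite the coin likelihood in terms of the odds, maximize over the odds directly, and compare with the plug-in shortcut. A sketch:

```python
# Reparameterize by the odds w = p/(1 - p), i.e. p = w/(1 + w), and maximize
# the likelihood directly over w; then compare with the plug-in answer g(p̂).

def likelihood_odds(w):
    p = w / (1 + w)
    return p**5 * (1 - p)**2

grid = [i / 1000 for i in range(1, 10001)]   # odds from 0.001 to 10.000
omega_hat = max(grid, key=likelihood_odds)   # direct maximization
plug_in = (5/7) / (2/7)                      # the invariance shortcut
print(omega_hat, plug_in)                    # both ≈ 2.5
```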

· · ·
Chapter 67.5

MLE in the Wild

If you've ever fit a regression, you've done MLE. If the errors are normally distributed, then maximizing the likelihood is exactly the same as minimizing the sum of squared residuals. Least squares — the workhorse of all science since Gauss — is MLE wearing a different hat.7
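Here's a numerical sketch of that equivalence, on synthetic data with a slope-only model and the noise scale assumed known; both criteria pick out (essentially) the same slope:

```python
import math, random
random.seed(1)

# Synthetic data from y = 2x + Gaussian noise (all values made up).
xs = [i / 10 for i in range(50)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

def sse(b):
    """Sum of squared residuals for the slope-only model y = b*x."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys))

def loglik(b):
    """Gaussian log-likelihood of the same model, noise sd assumed known (= 1)."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - b * x) ** 2
               for x, y in zip(xs, ys))

grid = [i / 1000 for i in range(1000, 3001)]   # candidate slopes 1.000 .. 3.000
b_ls = min(grid, key=sse)                      # least squares
b_ml = max(grid, key=loglik)                   # maximum likelihood
print(b_ls, b_ml)
```

The log-likelihood is a constant minus half the sum of squares, so minimizing one is maximizing the other.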

But the MLE's greatest modern cameo is inside neural networks. When you train a classifier with cross-entropy loss, you are minimizing negative log-likelihood. Every time a neural network learns to distinguish cats from dogs, it's doing MLE. Every large language model predicting the next word? MLE. The entire deep learning revolution runs on Fisher's 1921 idea, scaled up by a factor of a billion.8
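The correspondence is small enough to check by hand. With a one-hot label, the cross-entropy sum collapses to the negative log of the probability assigned to the observed class (the toy logits are arbitrary):

```python
import math

# Toy 3-class softmax output.  With a one-hot label, cross-entropy is
# exactly the negative log-likelihood of the observed class.
logits = [2.0, 0.5, -1.0]
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]        # softmax probabilities

label = 0                                # the observed class
one_hot = [1.0, 0.0, 0.0]
cross_entropy = -sum(t * math.log(q) for t, q in zip(one_hot, probs))
neg_log_lik = -math.log(probs[label])
print(cross_entropy, neg_log_lik)        # the same number
```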

"When a neural network learns, it's asking Fisher's question at incomprehensible scale: what parameters make this data most probable?"

MLE also appears in phylogenetic trees (what evolutionary tree makes this DNA data most likely?), hidden Markov models (what sequence of hidden states best explains this chain of observations?), signal processing, epidemiology, econometrics, and essentially anywhere someone has data and a model with unknown parameters.

· · ·
Chapter 67.6

Where MLE Goes Wrong

No method is perfect, and MLE's failure modes are instructive.

Overfitting. If you flip a coin twice and get two heads, the MLE says p̂ = 1. The coin is a guaranteed-heads coin! This is technically correct — p = 1 really does maximize p² — and also obviously insane. With tiny datasets, the MLE can be absurdly confident.

The fix? Regularization — which, delightfully, turns out to be equivalent to adding a Bayesian prior. When you add an L2 penalty to a neural network's loss function, you're secretly saying "I believe the parameters are probably small" — a Gaussian prior centered at zero. MLE and Bayesian inference aren't rivals. They're two ends of the same spectrum.
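For the two-heads coin, the simplest version of this fix is a Beta(2, 2) prior, whose MAP estimate is (k + 1)/(n + 2): add one imaginary head and one imaginary tail. A sketch:

```python
# Two flips, two heads: the raw MLE is 1.0.  A Beta(2, 2) prior (a mild pull
# toward fairness) gives the MAP estimate (k + 1)/(n + 2), i.e. one imaginary
# head and one imaginary tail added to the tally.
k, n = 2, 2
mle = k / n
map_estimate = (k + 1) / (n + 2)
print(mle, map_estimate)   # → 1.0 0.75
```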

Non-existence. Sometimes the MLE doesn't exist. In logistic regression with perfectly separable data, the MLE shoots off to infinity — the model wants to be infinitely confident. No maximum exists because the likelihood keeps climbing.
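You can watch the blow-up in a toy one-dimensional model (data and learning rate are made up for the demonstration). Gradient ascent never settles; every step, a larger weight still improves the likelihood:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Perfectly separable data: every negative x has label 0, every positive x label 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def loglik(w):
    """Log-likelihood of the bias-free model P(y=1|x) = sigmoid(w*x)."""
    return sum(y * math.log(sigmoid(w * x)) + (1 - y) * math.log(1 - sigmoid(w * x))
               for x, y in zip(xs, ys))

w, lr = 0.0, 0.5
snapshots = {}
for step in range(1, 20001):
    grad = sum((y - sigmoid(w * x)) * x for x, y in zip(xs, ys))
    w += lr * grad                      # gradient ascent on the log-likelihood
    if step in (100, 1000, 20000):
        snapshots[step] = w
print(snapshots)   # w keeps growing; there is no finite maximum
```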

Local maxima. For complicated models with many parameters, the likelihood surface can have multiple peaks. Gradient ascent finds a peak, not necessarily the peak. This is the daily struggle of anyone training deep neural networks.

Two heads in two flips gives p̂ = 1.0: "this coin always lands heads!" With only 2 data points, the MLE can be technically correct and wildly misleading. That's the MLE's Achilles' heel: with too little data, the most likely explanation can be an extreme one.
· · ·
Chapter 67.7

Convergence: Watching the Estimate Settle

One of the MLE's most reassuring properties is consistency: as you gather more and more data, the MLE converges to the true parameter value. Not sometimes, not usually — always (under mild regularity conditions that almost every model you'll encounter satisfies).

The interactive below lets you see this convergence in action across different distributions. Pick a distribution, choose the true parameter, and watch as the MLE homes in on the truth with increasing sample size. For comparison, we also show the method of moments estimator — Fisher's rival, doing its best.

Interactive: MLE Convergence

Choose a distribution and its true parameter. Draw samples and watch the MLE (and method of moments) converge to the truth.

Both estimators converge — that's the law of large numbers at work. But notice that the MLE tends to converge faster, especially for the exponential distribution. This isn't a coincidence. Fisher proved that the MLE is asymptotically efficient: among all consistent estimators, it has the smallest possible variance. It squeezes every drop of information out of your data.
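The interactive itself isn't reproduced here, but the flavor of the race can be sketched with the uniform distribution on (0, θ) (my choice for illustration): the MLE is the sample maximum, while matching the mean gives the method-of-moments estimate 2·mean.

```python
import random
random.seed(42)

# Uniform(0, theta): the MLE of theta is the sample maximum; matching the
# first moment (mean = theta/2) gives the method-of-moments estimator 2*mean.
theta = 3.0
results = {}
for n in (10, 100, 10000):
    sample = [random.uniform(0, theta) for _ in range(n)]
    mle = max(sample)                  # can only approach theta from below
    mom = 2 * sum(sample) / n
    results[n] = (mle, mom)
    print(f"n={n:6d}  MLE={mle:.4f}  MoM={mom:.4f}")
```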

· · ·
Chapter 67.8

The Deepest Lesson

Here's what I find most beautiful about maximum likelihood. It teaches you to think backwards. Instead of starting with a theory and predicting what you'll see, you start with what you've seen and ask: what theory would have made this the least surprising outcome?

This is, when you think about it, how all of science works. We observe the cosmic microwave background, and we ask: what model of the early universe makes these patterns most probable? We sequence a genome, and we ask: what evolutionary history makes this sequence of base pairs most likely? We look at election results, and we ask: what model of voter behavior best explains these numbers?

Every time, the logic is the same. We didn't see the world being made. We only see the data it left behind. And maximum likelihood is our best tool for reading those traces backwards to the truth — or at least, to the most probable version of it.

The Core Insight

Maximum likelihood doesn't tell you what's true. It tells you what's most consistent with the evidence. In a world of uncertainty, that's the best any method can offer — and Fisher proved that no method offers more.

Fisher gave us a framework that's simultaneously profound and practical. It works for coins and it works for neural networks with billions of parameters. It's mathematically optimal and computationally tractable. It unifies least squares, cross-entropy, logistic regression, and a hundred other techniques under a single principle.

And at its heart, it's just one question: what makes the data sing?

Notes & References

  1. This is the fundamental difference between likelihood and probability. The likelihood function L(θ|x) = P(x|θ) is numerically identical but conceptually distinct: it is a function of θ for fixed x, not a probability distribution over θ. See Edwards, A.W.F., Likelihood (Cambridge University Press, 1972).
  2. More precisely, the width of the likelihood function scales as 1/√n, where n is the sample size. This is a consequence of the central limit theorem applied to the score function.
  3. Fisher, R.A., "On the Mathematical Foundations of Theoretical Statistics," Philosophical Transactions of the Royal Society A, 222 (1922), 309–368. The paper was actually read in 1921 and published in 1922.
  4. Fisher information is related to the variance of the score function: I(θ) = Var[∂ log f/∂θ]. The Cramér-Rao inequality states that Var(θ̂) ≥ 1/(nI(θ)) for any unbiased estimator.
  5. "Asymptotically" is doing heavy lifting here. For finite samples, Bayesian estimators with good priors can outperform MLE. But as n → ∞, the MLE achieves the Cramér-Rao lower bound. See Lehmann, E.L., and Casella, G., Theory of Point Estimation (Springer, 1998).
  6. The invariance property was proven by Zehna (1966), though Fisher assumed it from early on. It requires g to be a well-defined function, not a relation. Zehna, P.W., "Invariance of Maximum Likelihood Estimators," Annals of Mathematical Statistics, 37 (1966), 744.
  7. This equivalence holds specifically when errors are i.i.d. normal. For other error distributions, MLE gives different estimators — robust regression, quantile regression, etc. — each optimal for its assumed model.
  8. The connection is exact: for a categorical output with softmax probabilities, minimizing cross-entropy loss H(p, q) = −Σ pi log qi over parameters of q is equivalent to maximizing the log-likelihood of the observed labels under the model. See Goodfellow, Bengio, and Courville, Deep Learning (MIT Press, 2016), Chapter 5.