The Coin That Wouldn't Confess
You find a coin on the ground. You flip it seven times. Five heads, two tails. Is this a fair coin?
That's a perfectly reasonable question. It's also, in a very precise sense, the wrong question — or at least not the most useful one. A better question is: what kind of coin would make five-out-of-seven heads the most likely thing to happen?
A fair coin — one that comes up heads exactly half the time — would produce five heads in seven flips about 16.4% of the time. Not impossible, but not the coin's best party trick either. A coin biased 5/7 toward heads, on the other hand, would produce exactly this outcome more often than any other coin you could dream up. So if you had to bet on a single number for the coin's bias, 5/7 is your answer.
Congratulations. You've just performed maximum likelihood estimation, the single most important idea in statistical inference for the last hundred years, and you did it without a single equation. The equations are coming — they're elegant, I promise — but the core insight is already in your hands: find the parameter value that makes your observed data most probable.
This chapter is about why that simple idea is so powerful, where it came from, how it works, and the surprising places it shows up — including inside every neural network you've ever used.
The Likelihood Function
Here's where we need to be careful with language, because English is going to betray us. When I say "the likelihood of the coin having bias p," I don't mean "the probability that the coin's bias is p." Those sound like the same thing. They are emphatically not.
The probability of data given a parameter tells you: "If the coin's bias really is p, how probable is the data I saw?" The likelihood of a parameter given data tells you the same number — literally the exact same number — but viewed from the other direction. You're holding the data fixed and sliding the parameter around.1
This distinction — same formula, different variable — is one of those things that seems pedantic until you realize the entire edifice of modern statistics rests on it. Probability says: given a model of the world, how surprising is this data? Likelihood says: given this data, which model of the world fits best?
For our coin, with k = 5 heads in n = 7 flips, the likelihood function is:

L(p) = (7 choose 5) · p⁵ · (1 − p)² = 21 · p⁵ · (1 − p)²
Try every value of p from 0 to 1. At p = 0, the likelihood is zero (an all-tails coin can't produce any heads). At p = 1, it's also zero (an all-heads coin can't produce tails). Somewhere in between there's a peak. Calculus tells us it's at p = 5/7. The interactive below lets you see this for yourself.
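Both the 16.4% figure and the location of the peak are easy to check numerically; here is a minimal sketch using only Python's standard library:

```python
from math import comb

def likelihood(p, k=5, n=7):
    """Binomial probability of k heads in n flips for a coin with bias p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(likelihood(0.5))    # ~0.164: the fair coin's 16.4%
print(likelihood(5 / 7))  # ~0.319: the 5/7-biased coin does about twice as well

# Brute-force scan over [0, 1] to find the peak numerically.
grid = [i / 10000 for i in range(10001)]
p_hat = max(grid, key=likelihood)
print(p_hat)              # ~0.7143, i.e. 5/7
```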
Interactive: The Likelihood Landscape
Flip coins and watch the likelihood function respond. Drag the slider to find the MLE visually. Add more data and watch the curve sharpen.
Notice something? With just 7 flips, the likelihood curve is broad and lazy — lots of values of p give nearly-as-good likelihoods. But flip 70 times and the curve becomes a needle. This is the law of large numbers whispering in the language of likelihood: more data means more certainty.2
Fisher's Revolution
The idea of maximum likelihood was formalized by R.A. Fisher in 1921, in a paper that reads like a man who knows he's rewriting the rules.3 Fisher was 31, already famous for his work on genetics, and absolutely insufferable in the way that only a person who is right about nearly everything can be.
Before Fisher, statisticians mostly used the method of moments: match the sample mean to the theoretical mean, match the sample variance to the theoretical variance, and solve for the unknown parameters. It's simple, it's intuitive, and it works — but it's also oddly arbitrary. Why the mean? Why the variance? Why not the third moment, or the seventh?
Fisher's answer was: stop matching moments. Instead, ask which parameter value makes the observed data most probable. This has a beautiful inevitability to it. You're not choosing which summary statistic to match. You're using all the data, in the most natural way possible.
In practice, nobody maximizes the likelihood directly. They maximize the log-likelihood, because logarithms turn products into sums:

ℓ(θ) = log L(θ) = log ∏ᵢ f(xᵢ; θ) = Σᵢ log f(xᵢ; θ)
Products of tiny numbers underflow computers. Sums of logs don't. And since log is a strictly increasing function, whatever maximizes log L also maximizes L. It's a free lunch — mathematically identical, computationally superior.
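The underflow point is easy to demonstrate; a sketch with a thousand made-up per-observation probabilities:

```python
from math import log

probs = [0.01] * 1000   # a thousand tiny per-observation probabilities

# The raw likelihood (a product) underflows to exactly 0.0 in double precision.
product = 1.0
for q in probs:
    product *= q
print(product)       # 0.0

# The log-likelihood (a sum) is perfectly representable.
log_like = sum(log(q) for q in probs)
print(log_like)      # ~ -4605.17
```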
To find the maximum, you take the derivative of the log-likelihood and set it to zero. This gives you the score equation:

dℓ(θ)/dθ = 0
For our coin, the log-likelihood is ℓ(p) = 5 log(p) + 2 log(1 − p). Take the derivative (the score is 5/p − 2/(1 − p)), set it to zero, and you get p̂ = 5/7. The hat on the p is statistician code for "this is our best estimate." It's not the truth. It's the most likely explanation of what we saw.
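You can watch the score change sign as p crosses the MLE; a quick numerical check:

```python
def score(p, k=5, n=7):
    """Derivative of the log-likelihood: k/p - (n-k)/(1-p)."""
    return k / p - (n - k) / (1 - p)

# The score is positive below 5/7, negative above it, and zero at the MLE.
print(score(0.5))     # positive: the peak lies to the right
print(score(5 / 7))   # ~0: this is the maximum
print(score(0.9))     # negative: the peak lies to the left
```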
Fisher Information: How Sharp Is the Peak?
Not all data is created equal. Flip a coin 10 times and you learn something. Flip it 10,000 times and you learn a lot more. But how much more? Fisher answered this too, with a quantity now called, inevitably, Fisher information.4
The Fisher information measures the curvature of the log-likelihood at its peak. A sharp peak means high information — the data strongly points at one particular parameter value. A broad, flat peak means low information — many parameter values are roughly equally consistent with the data.
The beautiful payoff: for large samples, the MLE is approximately normally distributed with variance 1/I(θ). The more information in your data, the tighter your estimate. That floor is the Cramér-Rao bound: no unbiased estimator can have variance smaller than 1/I(θ), and the MLE attains that bound asymptotically. The MLE is, in a very real sense, the best you can possibly do.5
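For the coin, the Fisher information has a closed form, I(p) = n / (p(1 − p)), and a quick simulation (toy numbers assumed here) shows the MLE's variance matching 1/I(p):

```python
import random
from statistics import variance

random.seed(0)
p_true, n = 0.3, 500

# Fisher information for n flips of a coin with bias p: I(p) = n / (p (1 - p)).
fisher_info = n / (p_true * (1 - p_true))

# Simulate many datasets; the variance of p_hat = k/n should be close to 1/I.
estimates = []
for _ in range(2000):
    k = sum(random.random() < p_true for _ in range(n))
    estimates.append(k / n)

print(1 / fisher_info)      # predicted variance of the MLE
print(variance(estimates))  # observed variance across simulations, very close
```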
The Invariance Property (A Free Theorem)
Here's one of the most delightful properties of the MLE, and it's completely free. Suppose you've estimated the coin's bias to be p̂ = 5/7. Now someone asks: "What's the MLE of the odds of heads?" The odds are p/(1 − p). Do you need to re-derive everything from scratch?
No. You just plug in. The MLE of the odds is p̂/(1 − p̂) = (5/7)/(2/7) = 5/2 = 2.5. This is the invariance property: if θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any function g.6
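You can verify the invariance property numerically by maximizing the likelihood over the odds directly, rather than plugging in; a sketch:

```python
from math import comb

def likelihood_in_odds(w, k=5, n=7):
    """Binomial likelihood reparameterized by the odds w = p / (1 - p)."""
    p = w / (1 + w)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Maximize over the odds on a grid: the peak lands at p_hat/(1 - p_hat) = 2.5.
grid = [i / 1000 for i in range(1, 10001)]   # odds from 0.001 to 10
w_hat = max(grid, key=likelihood_in_odds)
print(w_hat)   # 2.5
```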
This sounds like it should be obvious. It is not. The method of moments doesn't have this property. Neither do many Bayesian estimates. The MLE carries its optimality through any transformation you throw at it, like a mathematical frequent-flyer status that works on every airline.
MLE in the Wild
If you've ever fit a regression, you've done MLE. If the errors are normally distributed, then maximizing the likelihood is exactly the same as minimizing the sum of squared residuals. Least squares — the workhorse of all science since Gauss — is MLE wearing a different hat.7
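The equivalence is easy to see in code. A sketch with toy data and a slope-only model: the Gaussian negative log-likelihood is the sum of squared residuals divided by 2σ², plus a constant, so minimizing one minimizes the other:

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]   # toy data, roughly y = 2.3x plus noise

def neg_log_likelihood(b, sigma=1.0):
    """Gaussian NLL for y = b*x, dropping constants: SSE / (2 sigma^2)."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma**2)

# Maximum likelihood by grid search over the slope.
grid = [i / 10000 for i in range(50001)]
b_mle = min(grid, key=neg_log_likelihood)

# Closed-form least-squares slope for the no-intercept model.
b_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(b_mle, b_ls)   # the two estimates agree
```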
But the MLE's greatest modern cameo is inside neural networks. When you train a classifier with cross-entropy loss, you are minimizing negative log-likelihood. Every time a neural network learns to distinguish cats from dogs, it's doing MLE. Every large language model predicting the next word? MLE. The entire deep learning revolution runs on Fisher's 1921 idea, scaled up by a factor of a billion.8
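Concretely, with made-up predicted probabilities for a tiny cat/dog batch, cross-entropy is exactly the mean negative log-likelihood of the true labels:

```python
from math import log

# Hypothetical classifier outputs for three examples, plus the true labels.
predicted = [
    {"cat": 0.8, "dog": 0.2},
    {"cat": 0.3, "dog": 0.7},
    {"cat": 0.9, "dog": 0.1},
]
labels = ["cat", "dog", "cat"]

# Cross-entropy loss: average negative log-probability of the correct label.
cross_entropy = -sum(log(p[y]) for p, y in zip(predicted, labels)) / len(labels)

# Likelihood of the labels under the model: the product of those same
# probabilities. Maximizing it is minimizing cross-entropy.
lik = 1.0
for p, y in zip(predicted, labels):
    lik *= p[y]

print(cross_entropy)             # the loss the network minimizes
print(-log(lik) / len(labels))   # identical: mean negative log-likelihood
```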
MLE also appears in phylogenetic trees (what evolutionary tree makes this DNA data most likely?), hidden Markov models (what sequence of hidden states best explains this chain of observations?), signal processing, epidemiology, econometrics, and essentially anywhere someone has data and a model with unknown parameters.
Where MLE Goes Wrong
No method is perfect, and MLE's failure modes are instructive.
Overfitting. If you flip a coin twice and get two heads, the MLE says p̂ = 1. The coin is a guaranteed-heads coin! This is technically correct — p = 1 really does maximize p² — and also obviously insane. With tiny datasets, the MLE can be absurdly confident.
The fix? Regularization — which, delightfully, turns out to be equivalent to adding a Bayesian prior. When you add an L2 penalty to a neural network's loss function, you're secretly saying "I believe the parameters are probably small" — a Gaussian prior centered at zero. MLE and Bayesian inference aren't rivals. They're two ends of the same spectrum.
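Here's a minimal sketch of that idea, assuming a Beta(2, 2) prior on the bias (a mild belief that p sits near 1/2); the posterior mode, the MAP estimate, has a closed form:

```python
def map_estimate(k, n, a=2.0, b=2.0):
    """Posterior mode of p under a Beta(a, b) prior: (k + a - 1)/(n + a + b - 2)."""
    return (k + a - 1) / (n + a + b - 2)

# Two flips, two heads: the raw MLE says p = 1, a guaranteed-heads coin.
print(2 / 2)               # 1.0: the MLE's absurd confidence
print(map_estimate(2, 2))  # 0.75: the prior pulls the estimate toward 1/2
```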
Non-existence. Sometimes the MLE doesn't exist. In logistic regression with perfectly separable data, the MLE shoots off to infinity — the model wants to be infinitely confident. No maximum exists because the likelihood keeps climbing.
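A sketch with four hand-picked, perfectly separated points makes the runaway visible: the log-likelihood keeps rising as the weight grows, so no finite weight maximizes it:

```python
from math import exp, log

# Perfectly separable 1-D data: every negative example sits left of zero,
# every positive example sits right of it.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def log_likelihood(w):
    """Logistic log-likelihood for P(y = 1 | x) = 1 / (1 + exp(-w * x))."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + exp(-w * x))
        total += log(p) if y == 1 else log(1 - p)
    return total

for w in [1, 10, 100]:
    print(w, log_likelihood(w))   # climbs toward 0; no finite maximum
```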
Local maxima. For complicated models with many parameters, the likelihood surface can have multiple peaks. Gradient ascent finds a peak, not necessarily the peak. This is the daily struggle of anyone training deep neural networks.
Convergence: Watching the Estimate Settle
One of the MLE's most reassuring properties is consistency: as you gather more and more data, the MLE converges to the true parameter value. Not sometimes, not usually — always (under mild regularity conditions that almost every model you'll encounter satisfies).
The interactive below lets you see this convergence in action across different distributions. Pick a distribution, choose the true parameter, and watch as the MLE homes in on the truth with increasing sample size. For comparison, we also show the method of moments estimator — Fisher's rival, doing its best.
Interactive: MLE Convergence
Choose a distribution and its true parameter. Draw samples and watch the MLE (and method of moments) converge to the truth.
Both estimators converge — that's the law of large numbers at work. But notice that the MLE tends to converge faster whenever the two estimators genuinely differ. This isn't a coincidence. Fisher proved that the MLE is asymptotically efficient: among all consistent estimators, it has the smallest possible variance. It squeezes every drop of information out of your data.
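Consistency is easy to watch in miniature. A sketch using Uniform(0, θ), a case where the two estimators differ (the MLE is the sample maximum; the method of moments is twice the sample mean, since E[X] = θ/2); the true θ = 4.0 is assumed for the demo:

```python
import random

random.seed(1)
theta_true = 4.0   # assumed true parameter for the demo

for n in [10, 100, 1000, 10000]:
    xs = [random.uniform(0, theta_true) for _ in range(n)]
    mle = max(xs)             # MLE of theta for Uniform(0, theta)
    mom = 2 * sum(xs) / n     # method of moments: match E[X] = theta / 2
    print(n, round(mle, 3), round(mom, 3))   # both close in on 4.0
```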
The Deepest Lesson
Here's what I find most beautiful about maximum likelihood. It teaches you to think backwards. Instead of starting with a theory and predicting what you'll see, you start with what you've seen and ask: what theory would have made this the least surprising outcome?
This is, when you think about it, how all of science works. We observe the cosmic microwave background, and we ask: what model of the early universe makes these patterns most probable? We sequence a genome, and we ask: what evolutionary history makes this sequence of base pairs most likely? We look at election results, and we ask: what model of voter behavior best explains these numbers?
Every time, the logic is the same. We didn't see the world being made. We only see the data it left behind. And maximum likelihood is our best tool for reading those traces backwards to the truth — or at least, to the most probable version of it.
Maximum likelihood doesn't tell you what's true. It tells you what's most consistent with the evidence. In a world of uncertainty, that's the best any method can offer — and Fisher proved that no method offers more.
Fisher gave us a framework that's simultaneously profound and practical. It works for coins and it works for neural networks with billions of parameters. It's mathematically optimal and computationally tractable. It unifies least squares, cross-entropy, logistic regression, and a hundred other techniques under a single principle.
And at its heart, it's just one question: what makes the data sing?