
The Missing Chapter

The Bootstrap

Pulling yourself up by your statistical straps

An extension of Jordan Ellenberg's "How Not to Be Wrong"

Chapter 66

The Problem with the Median

You have twenty numbers. You want to know how uncertain you should be about their median. And the entire apparatus of classical statistics looks at you, shrugs, and says: "Sorry, that's hard."

Let me be concrete. Suppose you're a biologist studying how long it takes a certain species of beetle to right itself after being flipped on its back. You flip twenty beetles. You time them. Your data looks something like this: 1.2, 0.8, 3.1, 0.5, 14.7, 1.1, 0.9, 2.3, 1.7, 0.6, 1.4, 5.2, 0.7, 1.0, 2.8, 1.3, 0.4, 1.5, 8.9, 1.1 seconds.1

The mean of those numbers is about 2.6 seconds. And you know exactly how to build a confidence interval for the mean — it's the first thing you learn in any statistics class. Take the standard deviation, divide by the square root of n, multiply by 1.96. Done. You can even tell people the formula is justified by the Central Limit Theorem, and they'll nod approvingly.

But look at that data again. There's a beetle that took 14.7 seconds and another at 8.9. The distribution is wildly skewed — most beetles right themselves quickly, but a few stragglers are really struggling. The mean of 2.6 doesn't describe any beetle in your sample very well. The median — 1.25 seconds — is much more representative of the typical beetle's experience.

So you want a confidence interval for the median. You pull out your textbook. You flip to the chapter on confidence intervals. And you discover that while there are formulas for the distribution of sample medians, they depend on knowing the density of the population distribution at the median — which is exactly the thing you don't know and are trying to estimate.2 It's a snake eating its own tail.

Before 1979, this is where you were stuck. You could use the mean and pretend your skewed data was symmetric (wrong). You could try nonparametric methods based on order statistics (complicated, conservative, limited). Or you could call a mathematical statistician and wait three months for a custom derivation.

Then Bradley Efron had an idea so simple it seemed like cheating.

• • •

Baron Munchausen's Statistical Method

The idea was this: you already have the data. Use it again.

Here's what Efron proposed. Take your twenty beetle-righting times. Now create a new "fake" dataset of twenty numbers by drawing randomly from your original data — with replacement. That means the same beetle's time might appear twice, or three times, or not at all. Some beetles get cloned; some get erased. Your resample might look like: 1.2, 0.8, 0.8, 0.5, 1.1, 1.1, 0.9, 2.3, 1.7, 0.6, 1.4, 0.7, 0.7, 1.0, 2.8, 1.3, 0.4, 1.5, 1.5, 1.1.

Compute the median of this resampled dataset. It'll be a little different from the original median — maybe 1.1 instead of 1.25. Now do it again. And again. A thousand times. Ten thousand times. Each resample gives you a slightly different median.

The spread of those resampled medians IS your uncertainty estimate.

If most of the resampled medians cluster between 0.9 and 1.6, that's your confidence interval. You didn't need any formula. You didn't need to know the population distribution. You didn't even need to think very hard. You just let the data talk to itself.
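The whole procedure fits in a few lines. Here's a minimal sketch in plain Python, bootstrapping the beetle data from above (the 10,000-resample count and the fixed seed are arbitrary choices made for reproducibility, not part of the method):

```python
import random
import statistics

# The twenty beetle righting times from the chapter (seconds).
data = [1.2, 0.8, 3.1, 0.5, 14.7, 1.1, 0.9, 2.3, 1.7, 0.6,
        1.4, 5.2, 0.7, 1.0, 2.8, 1.3, 0.4, 1.5, 8.9, 1.1]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = len(data)

# Resample with replacement, compute the median, repeat.
medians = sorted(
    statistics.median(rng.choices(data, k=n)) for _ in range(10_000)
)

# Percentile interval: the middle 95% of the bootstrap medians.
lo, hi = medians[250], medians[9_750]
print(f"sample median = {statistics.median(data)}, 95% CI ~ ({lo}, {hi})")
```

That's the entire machine: one loop, one sort, two index lookups.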

Efron called this the bootstrap, after the tall tales of Baron Munchausen, who supposedly pulled himself out of a swamp by tugging on his own bootstraps.3 The statistical version is almost as miraculous: you pull uncertainty estimates out of the data itself, with no external leverage required.

[Diagram: Original data (n=20): 1.2 0.8 3.1 0.5 14.7 1.1 0.9 2.3 1.7 0.6 1.4 5.2 0.7 1.0 … → resample → Resample*: 1.2 0.8 0.8 0.5 1.1 1.1 0.9 2.3 1.7 0.6 0.7 0.7 1.0 2.8 … → compute median*: 1.10 → repeat 1,000×, collecting all medians* → distribution of 1,000 medians* = your uncertainty]

The bootstrap loop: resample with replacement, compute your statistic, repeat.

Why On Earth Does This Work?

The bootstrap feels like a magic trick. You're using the data to learn about the data. You're asking "how variable is my estimate?" by creating fake datasets that are just reshuffled versions of the real one. How can this possibly give you reliable answers?

The key insight is what statisticians call the plug-in principle. Here's the logic:

Step one: if you had access to the entire population of beetles — every beetle in the world — you could just sample from that population over and over and see how much the sample median varies. That would give you the true sampling distribution.

Step two: you don't have the entire population. But your sample of twenty beetles is the best estimate you have of what that population looks like. Not perfect, but the best available.

Step three: so resampling from your sample approximates resampling from the population. The bootstrap distribution of the median approximates the true sampling distribution of the median.

Your sample is a miniature portrait of the population. Resampling from the sample is like sampling from the population, viewed through a slightly blurry lens. As your sample gets larger, the lens gets sharper. In the limit, they're identical.

This might sound hand-wavy, but it has been made rigorous.4 Under quite general conditions — the statistic needs to be "smooth" in a certain technical sense, and the population can't be too pathological — the bootstrap distribution converges to the true sampling distribution as the sample size grows. The proof uses empirical process theory, a beautiful branch of probability that studies how well the sample distribution function approximates the population distribution function.

The mathematical name for this convergence is bootstrap consistency, and establishing it for various statistics and data structures has kept theoretical statisticians busy for four decades.
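You can watch consistency happen numerically. The sketch below uses a made-up setup (an exponential population with mean 2, samples of size 50): it computes the true sampling distribution of the median by cheating — drawing fresh samples straight from the population — and then the bootstrap distribution built from a single sample. Their spreads come out close:

```python
import random
import statistics

rng = random.Random(42)
n = 50

# The "cheat": peek at the population (a stand-in exponential with
# mean 2) and draw fresh size-50 samples over and over. The spread of
# their medians is the TRUE sampling distribution of the median.
true_medians = [
    statistics.median(rng.expovariate(0.5) for _ in range(n))
    for _ in range(2_000)
]

# Real life: one sample. Resample from it with replacement instead.
sample = [rng.expovariate(0.5) for _ in range(n)]
boot_medians = [
    statistics.median(rng.choices(sample, k=n)) for _ in range(2_000)
]

# The two standard errors should come out close.
print(statistics.stdev(true_medians), statistics.stdev(boot_medians))
```

The bootstrap standard error won't match the true one exactly — the sample is only a blurry portrait of the population — but it lands in the right neighbourhood, and the match improves as n grows.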

• • •

Try It Yourself

Nothing makes the bootstrap click like watching it happen. Enter some data below — or use one of the presets — and watch as the bootstrap machine resamples your data over and over, building up a histogram of medians (or whatever statistic you choose). The confidence interval will emerge before your eyes.

The Bootstrap Machine

Enter data points (comma-separated) or choose a preset. Then hit Bootstrap to watch 1,000 resamples build up.

Watch what happens. The histogram fills in, bell-shaped or not, and the 95% confidence interval (the middle 95% of all those bootstrap statistics) stabilises quickly. This is the bootstrap in action: no formulas, no assumptions about the shape of the population, no tears.

• • •

The Revolution Nobody Saw Coming

To appreciate what Efron did, you have to understand what statistics was like before the bootstrap. Every time you wanted a confidence interval for some quantity — a trimmed mean, a ratio of variances, a regression coefficient in a nonlinear model — you had to derive the sampling distribution by hand. This typically involved Taylor expansions, moment calculations, or appeals to asymptotic normality that might or might not apply to your particular sample size.

This was artisanal statistics. Each problem required its own bespoke formula, like a tailor making a suit from scratch for every customer. And the customers — scientists, economists, public health researchers — were piling up in the waiting room.

The bootstrap was the statistical equivalent of ready-to-wear clothing.5 One method, applicable to almost any estimator, requiring nothing more than the ability to simulate. Suddenly the scientist didn't need the mathematical statistician. A simple loop of resample-compute-repeat gave you what used to take months of derivation.

Before Bootstrap (artisanal):
Mean CI → CLT formula ✓
Median CI → density estimation??
Ratio CI → delta method
Correlation CI → Fisher z-transform
Custom stat → call a theorist
Novel estimator → PhD thesis

After Bootstrap (universal):
Mean CI → resample ✓
Median CI → resample ✓
Ratio CI → resample ✓
Correlation CI → resample ✓
Custom stat → resample ✓
Novel estimator → resample ✓

Before the bootstrap, each statistic needed its own theory. After, one method fits all.

The Bootstrap Agrees with the Classics (When the Classics Work)

Here's the beautiful thing. For the cases where classical formulas do exist — like the confidence interval for a mean — the bootstrap gives you essentially the same answer. This isn't a coincidence; it's a consequence of the theory. The bootstrap is consistent for the mean, which means its confidence intervals converge to the same ones you'd get from the Central Limit Theorem.
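You can check that agreement numerically. The sketch below uses simulated data (a normal with mean 10 and standard deviation 3, chosen arbitrarily for illustration) and computes both intervals for the mean — the CLT formula and the bootstrap percentile interval land almost on top of each other:

```python
import random
import statistics

rng = random.Random(7)
sample = [rng.gauss(10, 3) for _ in range(100)]  # simulated data for the sketch
n = len(sample)
xbar = statistics.mean(sample)

# Classical 95% interval from the Central Limit Theorem.
half = 1.96 * statistics.stdev(sample) / n ** 0.5
classical = (xbar - half, xbar + half)

# Bootstrap percentile interval for the same statistic.
boot = sorted(statistics.mean(rng.choices(sample, k=n)) for _ in range(5_000))
percentile = (boot[125], boot[4_875])

print(classical)
print(percentile)  # nearly the same endpoints
```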

But for the cases where classical formulas don't exist cleanly — the median, the trimmed mean, the 90th percentile, the ratio of two medians, whatever weird quantity your scientific question demands — the bootstrap still works. It fills in the gaps that classical theory left behind.

The interactive below lets you see this directly. Generate random data, and compare the bootstrap confidence interval to the classical one. For the mean, they'll agree. For the median, the bootstrap stands alone.

Bootstrap vs. Classical

Generate random samples and compare confidence intervals. For the mean, both methods agree. For the median, the bootstrap is the only game in town.

• • •

The Varieties of Bootstrap Experience

What I've described so far is the nonparametric bootstrap — the original, the greatest hit. You resample your data with replacement, making no assumptions about where the data came from. But the bootstrap idea is more general than any single recipe.

The parametric bootstrap is for when you actually have a model. Suppose you've fit a normal distribution to your data (mean = 5, standard deviation = 2). Instead of resampling the raw data, you generate new datasets by drawing from that fitted normal. This can be more efficient than the nonparametric version when the model is right — and catastrophically wrong when it's not.6
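In code, the only change from the nonparametric version is where the fake datasets come from. A sketch, assuming a fitted normal model (the sample itself is simulated here for illustration):

```python
import random
import statistics

rng = random.Random(1)
sample = [rng.gauss(5, 2) for _ in range(40)]  # simulated data for the sketch

# Fit the model: a normal with estimated mean and standard deviation.
mu_hat = statistics.mean(sample)
sigma_hat = statistics.stdev(sample)

# Parametric bootstrap: fake datasets are drawn from the FITTED normal,
# not by resampling the raw data.
boot_medians = sorted(
    statistics.median(rng.gauss(mu_hat, sigma_hat) for _ in range(len(sample)))
    for _ in range(2_000)
)

print(boot_medians[50], boot_medians[1_949])  # ~95% interval for the median
```

If the normality assumption is wrong, this interval inherits the error — which is exactly the trade-off described above.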

Then there's the wild bootstrap, designed for regression with heteroskedasticity — the annoyingly common situation where the variance of your errors changes across observations. The classic bootstrap doesn't handle this well because resampling residuals with replacement can destroy the heteroskedastic structure. The wild bootstrap preserves it by multiplying each residual by a random sign (or a cleverly chosen random variable), keeping the magnitude structure intact while still generating variability.7
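A sketch of the wild bootstrap on a toy regression whose noise grows with x (the data-generating setup is invented for illustration). Each residual stays glued to its own observation; only its sign is randomised:

```python
import random

rng = random.Random(3)

# Toy regression whose noise standard deviation grows with x
# (heteroskedastic by construction; setup invented for this sketch).
x = [i / 10 for i in range(1, 101)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 0.2 * xi) for xi in x]

def ols(x, y):
    """Ordinary least squares: return (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

a, b = ols(x, y)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Wild bootstrap: each residual keeps its own x; only its sign flips
# at random (Rademacher weights), so the pattern of unequal variances
# survives the resampling.
slopes = sorted(
    ols(x, [a + b * xi + rng.choice((-1, 1)) * ri
            for xi, ri in zip(x, resid)])[1]
    for _ in range(1_000)
)

print(slopes[25], slopes[974])  # ~95% interval for the slope
```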

And for time series data, where observations aren't independent, there's the block bootstrap. You can't just resample individual data points from a time series because you'd destroy the temporal dependence — the fact that today's stock price is related to yesterday's. Instead, you resample blocks of consecutive observations, preserving local dependence while still generating the variation you need.
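A sketch of the block bootstrap on a toy autocorrelated series (an AR(1) process, invented for illustration). Note that whole blocks of consecutive values are copied, never individual points:

```python
import random

rng = random.Random(9)

# A toy AR(1) series: each value leans on the previous one
# (invented for this sketch).
series = [0.0]
for _ in range(199):
    series.append(0.8 * series[-1] + rng.gauss(0, 1))

def block_resample(series, block_len, rng):
    """Glue together randomly chosen blocks of consecutive observations."""
    n = len(series)
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)
        out.extend(series[start:start + block_len])
    return out[:n]

# Block-bootstrap the mean: dependence WITHIN each block is preserved.
means = sorted(
    sum(block_resample(series, 20, rng)) / len(series) for _ in range(1_000)
)

print(means[25], means[974])  # ~95% interval for the series mean
```

The block length (20 here) is a tuning choice: long enough to capture the dependence, short enough to leave plenty of distinct blocks to shuffle.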

The Bootstrap Family

Nonparametric: Resample raw data. No assumptions. Works almost everywhere.

Parametric: Simulate from a fitted model. More powerful when the model is right.

Wild: Perturb residuals with random signs. Handles unequal variances in regression.

Block: Resample chunks of consecutive observations. Preserves time-series dependence.

When the Bootstrap Fails

No method is magic, and the bootstrap has its failure modes. Understanding them is at least as important as knowing how to use it.

Small samples. The plug-in principle says your sample approximates the population. But if you have six observations, your sample is a pretty lousy portrait of anything. The bootstrap can be unreliable below about n = 20, giving confidence intervals that are too narrow — overconfident, which is the worst kind of wrong in statistics.8

Extreme statistics. If you're trying to bootstrap the maximum of your sample, you're in trouble. The maximum is extremely sensitive to which observations show up in the resample. The bootstrap distribution of the maximum does not converge to the true sampling distribution — this is one of the known theoretical failures, and it's not fixable with more resamples.
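You can see the problem directly. In a resample of size n, the original sample maximum reappears with probability 1 − (1 − 1/n)^n, which approaches 1 − 1/e ≈ 63% rather than zero, so the bootstrap distribution of the maximum piles a permanent point mass on one value instead of spreading out the way the true sampling distribution does. A sketch with uniform data (chosen arbitrarily for illustration):

```python
import random

rng = random.Random(5)
n = 100
sample = [rng.random() for _ in range(n)]  # uniforms, just for illustration
sample_max = max(sample)

# How often does a resample's maximum equal the sample's own maximum?
hits = sum(
    max(rng.choices(sample, k=n)) == sample_max for _ in range(5_000)
)
print(hits / 5_000)  # ≈ 1 - (1 - 1/n)**n ≈ 0.63, a point mass that never melts away
```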

Heavy tails. When the population has very heavy tails (infinite variance), the bootstrap for the mean can fail because the Central Limit Theorem itself fails. Your resampled means will be all over the place, but in a way that doesn't correctly capture the actual variability. You need specialised methods — the m-out-of-n bootstrap, which resamples fewer observations than you started with, can sometimes rescue you.

Dependent data. As mentioned, naïvely bootstrapping time series data destroys the dependence structure. You need the block bootstrap or one of its variants (stationary bootstrap, moving block bootstrap, etc.).

Bootstrap Works Great:
✓ n ≥ 20
✓ "Smooth" statistics (mean, median, quantiles)
✓ Independent observations
✓ Finite variance
✓ Continuous distributions

Bootstrap Needs Care:
⚠ n < 20
⚠ Extremes (max, min)
⚠ Time series (need blocks)
⚠ Infinite variance
⚠ Lattice distributions
⚠ Boundary parameters

The bootstrap is powerful but not infallible. Know the terrain before you hike.

• • •

The Deeper Lesson

The bootstrap is sometimes dismissed as "just simulation" — as if it were a brute-force trick that real mathematicians would find beneath them. This gets the story exactly backwards. The bootstrap was invented by a real mathematician (Efron is one of the most decorated statisticians alive), and its theoretical justification draws on some of the most beautiful mathematics in probability theory.

But the lesson of the bootstrap goes beyond statistics. It's about the power of simulation as a mode of understanding.

Before computers, mathematical proof was the only way to know things about random processes. You had to derive the distribution of your statistic, in closed form, on paper. This limited what questions you could ask to the questions whose answers were analytically tractable — which is a tiny, weirdly shaped subset of all the questions worth asking.

The bootstrap said: forget closed-form solutions. Simulate. Let the computer be your proof engine. This was philosophically radical even though it was computationally trivial. And it opened the door to an entire style of statistical reasoning — computational inference — that now includes permutation tests, cross-validation, Markov chain Monte Carlo, and much more.

The bootstrap democratised uncertainty. Before Efron, quantifying uncertainty was a craft practiced by specialists. After Efron, it was a recipe anyone could follow.

Think about what that means in practice. A marine biologist studying coral growth doesn't need to master the delta method. An epidemiologist tracking disease clusters doesn't need to derive the asymptotic distribution of her custom spatial statistic. A machine learning engineer comparing two models' performance metrics doesn't need to invoke arcane theorems. They all just bootstrap.

This is what it looks like when a mathematical idea is truly good. It doesn't just solve the problem it was designed for. It dissolves an entire category of problems, replaces artisanal craft with democratic method, and then quietly becomes so ubiquitous that people forget there was ever a problem in the first place.

The next time you see a confidence interval in a scientific paper — for a median, a ratio, a difference in survival curves, a machine learning metric — there's a decent chance the bootstrap computed it. And there's a decent chance the scientist who used it has no idea that before 1979, that interval would have been impossible to obtain.

Baron Munchausen pulled himself out of the swamp by his own bootstraps. It was a lie, of course — physics doesn't work that way. But statistics? Statistics actually does.

Notes & References

  1. I made up these particular beetles, but the study of insect righting behaviour is real and scientifically important. See Brackenbury, J. (1990). "Wing movements in the bush-cricket Tettigonia viridissima and the mantis Ameles spallanziana during falling with and without flight." Journal of Experimental Biology, 49(2), 235–245.
  2. The asymptotic variance of the sample median is 1/(4nf(m)²), where f(m) is the population density at the median. This formula requires you to estimate f(m), which is a nonparametric density estimation problem — generally harder than the confidence interval problem you were trying to solve in the first place.
  3. Efron, B. (1979). "Bootstrap methods: Another look at the jackknife." The Annals of Statistics, 7(1), 1–26. The paper that launched a thousand resamples. The Baron Munchausen reference is to Rudolf Erich Raspe's 1785 tales.
  4. The key theoretical result is that the bootstrap is consistent when the statistic of interest is a smooth functional of the empirical distribution. See Bickel, P.J. and Freedman, D.A. (1981). "Some asymptotic theory for the bootstrap." The Annals of Statistics, 9(6), 1196–1217.
  5. This analogy is imperfect — the bootstrap does still require some expertise to apply well, as the "when it fails" section makes clear. But the barrier to entry dropped by orders of magnitude.
  6. The parametric bootstrap is essentially model-based simulation. If you believe Y ~ Normal(μ, σ²) and you've estimated μ̂ and σ̂, you generate new datasets from Normal(μ̂, σ̂²). This is more efficient than the nonparametric version (lower variance CIs) when the normality assumption holds, but useless when it doesn't.
  7. Wu, C.F.J. (1986). "Jackknife, bootstrap and other resampling methods in regression analysis." The Annals of Statistics, 14(4), 1261–1295. The wild bootstrap multiplies residuals by random variables with mean 0 and variance 1 (commonly a Rademacher distribution: +1 or −1 with equal probability).
  8. Bickel and Freedman's 1981 consistency results are asymptotic — they guarantee the bootstrap works as n → ∞. For finite samples, particularly small ones, the bootstrap percentile interval can have coverage well below the nominal 95%. The BCa (bias-corrected and accelerated) bootstrap of Efron (1987) partially addresses this.