The Casino of Life
You walk into a casino. There are five slot machines in front of you. Each one has a different payout rate — but nobody's going to tell you what it is. You have 100 pulls. How do you make the most money?
The obvious first move is to try each machine once. Now you have a little information. Machine 3 paid out $4, Machine 1 paid $2, and the rest gave you nothing. So you go back to Machine 3. Makes sense, right? It's the best one you've found.
But here's the thing: Machine 5 might be the real jackpot. You only tried it once, and it happened to come up empty. That's not a lot of evidence. Maybe you should try it a few more times? But every pull you spend on Machine 5 is a pull you're not spending on Machine 3, which you already know is decent.
This is the explore-exploit tradeoff, and it is, quietly, one of the most fundamental problems in all of mathematics. Computer scientists call it the multi-armed bandit problem — named after the old nickname for slot machines ("one-armed bandits"), except now you're facing several of them at once.1
And it's not really about casinos at all.
It's about whether you should go to your favorite restaurant tonight or try the new Thai place that just opened. It's about whether a 22-year-old should stay in their comfortable accounting job or take a flyer on a startup. It's about whether Netflix should keep showing you crime dramas (you clicked on three in a row) or surprise you with a Korean romance. It's about whether a doctor running a clinical trial should give more patients the drug that seems to be working, even though the trial isn't over yet.
Every single one of these is the same problem wearing a different costume.
The Oldest Algorithm You've Never Heard Of
The first person to crack this problem mathematically was William R. Thompson, and he did it in 1933.2 That's before computers. Before information theory. Before anyone had even coined the term "algorithm" in its modern sense. Thompson was working on clinical trials — the decidedly non-trivial question of how to test a new drug without giving too many patients the worse treatment.
His solution, now called Thompson sampling, is elegant in the way that great mathematical ideas often are: it's basically a formalization of optimism tempered by experience.
Here's how it works. For each slot machine, you maintain a belief about how good it might be — not a single number, but a whole distribution of possibilities. At the start, you know nothing, so every machine could be anything. Each time you pull an arm and see the result, you update your belief about that machine using Bayes' theorem. Then — and this is the clever part — you randomly sample from each machine's distribution and pull whichever one gave the highest sample.
Why does random sampling help? Because machines you're uncertain about will occasionally produce high samples (since their distribution is wide), which means you'll try them. Machines you know a lot about will produce samples close to their true value. The algorithm automatically balances exploration and exploitation — it explores where it's uncertain and exploits where it's confident. No tuning parameters. No arbitrary decisions. Just Bayes' theorem doing what Bayes' theorem does.
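If you'd like to see the whole loop in code, here's a minimal sketch of Thompson sampling for win/lose machines, with Beta distributions playing the role of beliefs. The five payout rates are invented for illustration, not taken from any real casino.

```python
import random

def thompson_pull(successes, failures, rng):
    """Sample one plausible payout rate per machine from its Beta belief,
    then pull the machine whose sample came out highest."""
    samples = [rng.betavariate(s + 1, f + 1)  # Beta(1,1) prior: "could be anything"
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

def run_thompson(true_rates, pulls=1000, seed=0):
    """Play `pulls` rounds of Thompson sampling against hidden payout rates."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes, failures = [0] * k, [0] * k
    total = 0
    for _ in range(pulls):
        arm = thompson_pull(successes, failures, rng)
        reward = 1 if rng.random() < true_rates[arm] else 0
        total += reward
        if reward:
            successes[arm] += 1  # Bayesian update: one more observed win
        else:
            failures[arm] += 1   # ...or one more observed loss
    return total, successes, failures

# Five machines with hidden payout rates (made up for this example).
total, wins, losses = run_thompson([0.10, 0.25, 0.45, 0.30, 0.05])
```

Notice there's nothing to tune: the uncertainty in each Beta distribution is what drives the exploration.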
The Competition
Thompson had the first solution, but he wasn't the last. Over the following decades, mathematicians and computer scientists developed several competing strategies, each with its own philosophy:
Epsilon-greedy (ε-greedy) is the simplest approach, and also the dumbest one that still works surprisingly well. Pick a small number ε — say, 10%. Then, 90% of the time, pull the arm that has the best average payout so far. The other 10% of the time, pull a completely random arm. That's it. No Bayesian updating, no probability distributions. Just "mostly exploit, sometimes explore at random."3
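In code, the whole strategy fits in a few lines. This is a minimal sketch with invented payout rates; the guard that forces one pull of any untouched machine is a small practical addition so the averages are always defined.

```python
import random

def epsilon_greedy(true_rates, pulls=1000, epsilon=0.1, seed=0):
    """Mostly pull the best-looking arm; with probability epsilon, pull at random."""
    rng = random.Random(seed)
    k = len(true_rates)
    counts, totals = [0] * k, [0.0] * k
    earned = 0.0
    for _ in range(pulls):
        if rng.random() < epsilon or 0 in counts:
            arm = rng.randrange(k)  # explore: a completely random arm
        else:
            # exploit: the arm with the best average payout so far
            arm = max(range(k), key=lambda i: totals[i] / counts[i])
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        earned += reward
    return earned, counts

earned, counts = epsilon_greedy([0.10, 0.25, 0.45, 0.30, 0.05])
```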
UCB (Upper Confidence Bound) takes a more principled approach. For each machine, compute an "optimistic estimate" that equals its observed average payout plus a bonus for uncertainty. Machines you haven't tried much get a big bonus; machines you know well get a small one. Then always pick the machine with the highest optimistic estimate. The motto is: be optimistic in the face of uncertainty.4
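Here's a sketch of UCB1, the classic version of this idea. The square-root bonus is the standard UCB1 formula; the payout rates are, again, made up for illustration.

```python
import math
import random

def ucb1(true_rates, pulls=1000, seed=0):
    """UCB1: always pull the arm with the highest 'average + uncertainty bonus'."""
    rng = random.Random(seed)
    k = len(true_rates)
    counts, totals = [0] * k, [0.0] * k
    earned = 0.0
    for t in range(1, pulls + 1):
        if t <= k:
            arm = t - 1  # try every arm once to initialize the averages
        else:
            # The bonus shrinks as an arm is pulled more, so well-known arms
            # lean on their average while under-explored arms get a boost.
            arm = max(range(k), key=lambda i: totals[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        earned += reward
    return earned, counts

earned_ucb, counts_ucb = ucb1([0.10, 0.25, 0.45, 0.30, 0.05])
```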
Try it yourself. Below are five slot machines with hidden payout rates. You have 100 pulls. See if you can beat the algorithms.
🎰 The Multi-Armed Bandit
Click a machine to pull its arm. Watch your total earnings compared to three classic algorithms playing alongside you.
The Provably Optimal Solution That Nobody Uses
In 1979, the British mathematician John C. Gittins proved something remarkable: the multi-armed bandit problem has an exactly optimal solution.5 Not approximately optimal. Not "works pretty well in practice." Optimal, in the fullest mathematical sense of the word.
The Gittins index assigns a single number to each arm based on its entire history of rewards. You simply pull whichever arm has the highest index. The proof is beautiful: Gittins showed that the multi-armed bandit — which seems to require you to plan over all possible futures — can actually be decomposed into independent problems, one per arm. Each arm's index captures exactly how valuable it is to pull that arm, accounting for both its expected immediate reward and the value of what you'd learn.
So why doesn't everybody just use it?
Because computing the Gittins index is hard. You need to solve a separate optimization problem for every arm at every step. For the idealized version of the problem — binary outcomes, known discount factor — you can precompute lookup tables. But the moment you add real-world complications (non-stationary rewards, contextual information, continuous action spaces), the Gittins index becomes intractable. It's a bit like knowing that, in theory, chess has a perfect strategy — but no computer in the universe can compute it.
So in practice, everyone uses the approximations: Thompson sampling, UCB, epsilon-greedy. They're suboptimal, but they're computable. And they're very close to optimal in most realistic scenarios. This is a recurring theme in mathematics: sometimes the best is the enemy of the good.
What the Algorithms Are Really Doing at Netflix
When Netflix recommends a movie, it faces a bandit problem millions of times a day. Each user is a different game. Each movie is an arm. The "payout" is whether you click, watch, and enjoy. Netflix can't just show everyone The Office — even though that would have the highest average payout — because then it would never discover which users secretly love Romanian art films.6
Google's ad system is the same problem with billions of dollars riding on it. Which ad should it show you? The one with the highest historical click-through rate? Or a new ad that might perform even better? Every time Google shows you a "worse" ad to learn about it, that's exploration — and it costs real money.
But perhaps the most morally fraught version of the problem is in clinical trials. Suppose you're testing a new cancer drug against the standard treatment. After 50 patients, Drug A seems to be working better than Drug B. Should you keep randomly assigning patients to Drug B? On one hand, the trial isn't over — maybe Drug A just got lucky. On the other hand, you're asking real patients to take a treatment you increasingly believe is worse. Thompson sampling was literally invented for this problem, and modern adaptive clinical trials use bandit algorithms to gradually shift patients toward the treatment that appears to be working — exploring enough to be statistically rigorous, but exploiting enough that fewer patients get the short end of the stick.7
The Age of Exploration
Here's where things get personal.
The explore-exploit tradeoff depends on one critical variable: how much time you have left.
If you have 1,000 pulls remaining, exploring a new machine is cheap — even if it's a dud, you have plenty of pulls left to recover on the good one. But if you have 3 pulls remaining, you'd be insane to waste one on an unknown machine. You should stick with your best option.
Now replace "pulls" with "years of life."
If you're 22, the math says: try the new city, the unfamiliar career, the weird hobby. You have decades to exploit whatever turns out to be great. The cost of a bad year is low because you have many years left to make up for it. But if you're 75, the math says: go to your favorite restaurant. Reread your favorite book. Call the friend you already know you love. You've done your exploring; now is the time to enjoy the best of what you've found.
This isn't just folk wisdom dressed up in equations. The mathematics actually quantifies it. The optimal exploration rate is roughly proportional to the fraction of your time horizon remaining. At the start of a 100-pull game, you should be exploring maybe 30-40% of the time. By pull 90, you should be exploring almost never.
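One crude way to put that in code is an exploration rate that decays in proportion to the fraction of the horizon still remaining. The starting rate of 40% below is an illustrative choice matching the rough figure above, not a derived optimum.

```python
def exploration_rate(pulls_used, horizon, scale=0.4):
    """Exploration probability proportional to the fraction of the
    horizon remaining (scale=0.4 is an illustrative choice)."""
    remaining = (horizon - pulls_used) / horizon
    return scale * remaining

# Early in a 100-pull game you explore often; near the end, almost never.
early = exploration_rate(0, 100)    # 0.4
late = exploration_rate(90, 100)    # ≈ 0.04
```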
The computer scientist Brian Christian and the cognitive scientist Tom Griffiths, in their wonderful book Algorithms to Live By, put it this way: "In the explore/exploit tradeoff, the interval makes the strategy." A short interval means exploit. A long interval means explore. And a lifetime means do both, but in the right order.8
This is closely related to the optimal stopping problem we saw in Chapter 6 — the famous "secretary problem" where you interview candidates and must decide when to stop looking. Both problems ask: when do you stop gathering information and commit to a choice? The secretary problem says: explore the first 37% of options, then commit to the next one that beats all previous candidates. The bandit problem is more flexible — you can revisit options — but the underlying tension is the same. More time = more exploration. Less time = commit to what you know.
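You can check the secretary problem's famous number by simulation. With the 37% rule, the chance of landing the single best candidate hovers near 1/e, about 37%. (This sketch takes the last candidate if forced, a common convention; it doesn't change the success rate.)

```python
import random

def secretary_trial(n, rng):
    """One run of the 37% rule: watch the first 37% of candidates without
    committing, then take the first one who beats everything seen so far."""
    candidates = [rng.random() for _ in range(n)]
    cutoff = int(n * 0.37)
    best_seen = max(candidates[:cutoff]) if cutoff else float("-inf")
    for score in candidates[cutoff:]:
        if score > best_seen:
            return score == max(candidates)  # did we pick the overall best?
    return candidates[-1] == max(candidates)  # forced to take the last one

def success_rate(n=100, trials=20000, seed=0):
    """Fraction of simulated hiring seasons where the rule finds the best."""
    rng = random.Random(seed)
    return sum(secretary_trial(n, rng) for _ in range(trials)) / trials

rate = success_rate()  # hovers near 1/e ≈ 0.37
```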
🧭 Life Strategy Calculator
How should you balance trying new things vs. enjoying what you know? Enter your age and see how the math shifts.
The Restaurant Problem
Let me bring this down to earth with what I'll call the Restaurant Problem, since it's one we all face weekly.
You live in a city with 100 restaurants. You eat out once a week. After a few years, you've found a place you really love — let's call it an 8.5 out of 10. Should you keep going there, or try somewhere new?
If the new place turns out to be a 6, you've wasted an evening. If it turns out to be a 9.5, you've upgraded your life. The expected value of exploring depends on your prior — how likely is it that a random untried restaurant beats an 8.5? If you've already tried 60 of the 100 restaurants and your current favorite is the best of those 60, the chance that one of the remaining 40 beats it is... well, it depends on the distribution, but it's not great.
On the other hand, if you just moved to a new city and you've only tried 5 restaurants, your current "favorite" is just the best of a tiny sample. Almost certainly there are much better options out there. The math screams: explore!
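A quick sanity check on those odds: if restaurant qualities are effectively a random draw and you've tried a random k of the city's n restaurants, the chance that your favorite is also the citywide best is simply k/n. A short simulation with made-up random qualities confirms it.

```python
import random

def favorite_is_citywide_best(n_restaurants, n_tried, trials=20000, seed=0):
    """Fraction of simulated cities where the best of a random sample of
    n_tried restaurants is also the best of all n_restaurants."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        qualities = [rng.random() for _ in range(n_restaurants)]
        tried = rng.sample(qualities, n_tried)
        hits += max(tried) == max(qualities)
    return hits / trials

p_veteran = favorite_is_citywide_best(100, 60)   # ≈ 0.60
p_newcomer = favorite_is_citywide_best(100, 5)   # ≈ 0.05
```

Sixty restaurants in: your favorite is probably as good as it gets. Five restaurants in: your favorite almost certainly isn't.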
This is why people who've recently moved to a new city tend to discover great restaurants faster than longtime residents. They're forced into exploration by their ignorance, and ignorance — in bandit problems — is a kind of gift.
The Deepest Lesson
What I find most beautiful about the explore-exploit tradeoff is that it gives mathematical backing to something we feel intuitively but struggle to articulate: that the right life strategy changes over time, and that's not a failure of consistency — it's optimal.
The 20-year-old who bounces between majors isn't flaky. She's exploring. The 60-year-old who goes to the same vacation spot every year isn't stuck in a rut. He's exploiting. And the 40-year-old who feels torn between adventure and stability? She's right at the crossover point, where the math itself is ambiguous.
The algorithms teach us that there is no single "right" amount of exploration. The right amount depends on context — on your time horizon, on how much you already know, on how different the options really are. But they also teach us that some exploration is always optimal, as long as you have time left. The epsilon-greedy algorithm never stops exploring entirely. Neither should you.
Thompson, working in 1933, couldn't have known that his little trick for clinical trials would one day run the recommendation engines of billion-dollar companies, or that his mathematical insight would apply equally well to career advice and restaurant reviews. But that's the nature of deep mathematical ideas — they're solutions looking for problems, and they never stop finding new ones.
So the next time you're standing on the sidewalk, your favorite restaurant on the left and the unknown new place on the right, remember: you're not just choosing dinner. You're solving one of the oldest problems in mathematics. And the answer — as always — depends on how much time you think you have left.